Projects and Publications
My comprehensive collection of research, development projects, and educational resources that Iโve created, focusing on Large Language Models (LLMs), Natural Language Processing, and AI applications. This repository showcases my practical implementations, theoretical foundations, and production-ready solutions for modern AI systems that Iโve developed over time.
๐ Core LLM Foundations
Prerequisites
๐ป Interactive Notebooks:
- ๐ Linear Algebra Fundamentals for LLMs (Colab) - This notebook will guide you through the essential linear algebra concepts required for understanding Large Language Models (LLMs). Weโll cover vectors, matrices, and basic operations using NumPy, with a focus on their application within the attention mechanism.
- ๐ Probability and Statistics for LLMs (Colab) - This notebook provides an in-depth exploration of probability concepts foundational to Large Language Models (LLMs), combining theoretical explanations with real-world examples and code implementations in PyTorch.
- ๐ GPU Essentials for LLMs (Colab) - This Jupyter Notebook tutorial explores the crucial role of GPUs (Graphics Processing Units) in powering Large Language Models (LLMs). Youโll learn why GPUs are essential, how they accelerate AI workloads, and the latest advancements in GPU technology.
1. Tokenization
Text preprocessing, BPE, WordPiece, SentencePiece, multilingual tokenization
๐ Publications:
- Introduction to Tokenization: A Theoretical Perspective
- Understanding BPE Tokenization
- Fast Tokenizers: How Rust is Turbocharging NLP
๐ป Interactive Notebooks:
- ๐ Tokenization Techniques (Interactive Colab)
- ๐ GPT Tokenizer Implementation (Colab)
- ๐ Tokenizer Comparison (Colab)
- ๐ Hugging Face Tokenizers (Colab)
- ๐ Build and Push a Tokenizer (Colab)
- ๐ New Tokenizer Training (Colab)
- ๐ Compare Tokenizers Performance (Colab)
- ๐ Tokenization BPE (Colab)
- ๐ Tokenizer Training (Colab)
- ๐ Tokenizing with Different Methods (Colab)
- ๐ Persian BPE Tokenizer (Colab)
- ๐ Persian Gemma Tokenizer (Colab)
- ๐ Train Llama Tokenizer (Colab)
๐ค Persian Tokenizers:
- PersianBPETokenizer - BPE tokenizer optimized for Persian text
- PersianGemmaTokenizerFast - Fast tokenizer for Persian Gemma models
- PersianWordPieceTokenizer - WordPiece tokenizer for Persian language
- PersianUnigramTokenizer - Unigram-based tokenizer for Persian
- PersianLlamaTokenizerFast - Fast tokenizer for Persian Llama models
2. Embeddings
Word2Vec, GloVe, BERT, contextual embeddings, semantic search, multimodal embeddings
๐ Publications:
๐ป Interactive Notebooks:
- ๐ Interactive Word2Vec Posts
- ๐ Embedding Techniques (Colab)
- ๐ Pre-trained Embeddings (Colab)
- ๐ Traditional Word Embedding (Colab)
- ๐ Train a Word2Vec Model (Colab)
- ๐ Word Embeddings (Colab)
3. Neural Networks
Backpropagation, activation functions, optimization, regularization, mixed precision training
4. Traditional Language Models
N-gram models, RNNs, LSTMs, GRUs, sequence-to-sequence models, attention mechanisms
๐ Publications:
๐ป Interactive Notebooks:
5. Transformers
Self-attention, multi-head attention, positional encodings, decoder-only architecture
6. Data Preparation
Data collection, web scraping, cleaning, deduplication, quality assessment, synthetic data generation
๐ป Interactive Notebooks:
- ๐ Dataset Merge (Colab)
- ๐ Dataset Merger Simple (Colab)
- ๐ Dataset Merger Speech (Colab)
- ๐ Noise Reduction Test (Colab)
- ๐ EEG Artifact Detection (Colab)
๐ Open Source Projects:
- AdvancedWebScraper - Comprehensive web scraping tool with versatile data extraction capabilities
- Prompt-Scraper - Effortlessly collect and transform Midjourney prompts into LM datasets
- Youtube2Book - Extract transcripts from YouTube videos and structure with AI
- Word-Frequency-Analyzer - Analyze word frequency in monthly news data
- pytsetmc-api - Python client for Tehran Stock Exchange Market Center data retrieval
- langchain_crawler - Web crawling implementation using LangChain
๐งช Model Training & Fine-Tuning
7. Pre-Training
Unsupervised pre-training, causal language modeling, distributed training, scaling laws
8. Post-Training Datasets
Instruction datasets, chat templates, conversation formatting, synthetic data generation
๐ Publications:
๐ป Interactive Notebooks:
9. Supervised Fine-Tuning
LoRA, QLoRA, PEFT, instruction tuning, domain adaptation, model merging
๐ Publications:
๐ป Interactive Notebooks:
- ๐ Gemma SFT (Colab)
- ๐ Gemma3 4B (Colab)
- ๐ Gemma3 4B Persian (Colab)
- ๐ Gemma3 4B Persian v2 (Colab)
- ๐ Persian Gemma3 4B (Colab)
- ๐ SFT (Supervised Fine-Tuning) (Colab)
10. Preference Alignment
RLHF, DPO, reward modeling, Constitutional AI, safety evaluation
11. Model Architectures
Mixture of Experts, state space models, Mamba, RWKV, long context architectures
12. Reasoning
Chain-of-Thought, tree-of-thoughts, process reward models, test-time compute scaling
๐ Publications:
13. Evaluation
Benchmarking, MMLU, GSM8K, HumanEval, human evaluation, bias testing
๐ Production & Deployment
14. Quantization
Post-training quantization, quantization-aware training, GGUF, INT4/INT8 quantization
15. Inference Optimization
Flash Attention, KV cache, speculative decoding, high-throughput inference
๐ Publications:
๐ Open Source Projects:
- vram-calculator - Calculate VRAM requirements for LLMs and recommend suitable GPUs
16. Model Enhancement
Context window extension, model merging, knowledge distillation, continual learning
17. Security & Responsible AI
OWASP LLM Top 10, prompt injection, jailbreaking, bias detection, privacy protection
18. Running LLMs
API integration, local deployment, production servers, streaming responses
๐ Publications:
๐ Open Source Projects:
- ollama-desktop - Powerful desktop application for interacting with local AI models
- ollama_gui - User-friendly Qt desktop application for Ollama backend
- SubTrans-Ollama - Simple tool for translating movie subtitles (.srt) files
- ChatGPT-Desktop-App - Interactive desktop app with document uploads and conversation management
- OpenRouterChatApp - Simple chat application using OpenRouter API
- GPT-Translator - Streamlit translation app using advanced language models
- Pdf-Finder-Telegram-bot - Search for book PDFs in Telegram bot
- healthcare-assistant - Healthcare chat interface for emotional support and stress analysis
๐ค Applications & Systems
19. RAG
Retrieval Augmented Generation, vector databases, Graph RAG, conversational RAG
๐ Open Source Projects:
- ollama_rag - Fully local RAG system using Ollama and FAISS
- open-notebook - AI-powered knowledge management and question-answering system
- TalkWithWeb - Customizable AI chatbot with personalized knowledge base
- DataSpeakGPT - Read files and images and retrieve data for LLM
- Cortex - Advanced AI Deep Scholar Researcher Agent with RAG and Milvus integration
- RAG-Agent - RAG implementation with LangChain and LangGraph libraries
- RAG_CAG_SFT - Educational overview of RAG, Cache-Augmented Generation, and SFT techniques
20. Agents
Function calling, tool usage, multi-agent systems, autonomous task execution
๐ Open Source Projects:
- ReActMCP - Reactive MCP client for real-time web search insights (141โญ)
- EasyMCP - Beginner-friendly client for Model Context Protocol
- Groogle - Groq + Google integration for enhanced search capabilities
- GoogleGPT - Combine Google search with ChatGPT capabilities
- simple_function_calling - Beginner tutorial on connecting LLMs to external tools
- SuperAgent - Advanced agent implementation
- SuperNova-Desktop - Desktop agent application
21. Multimodal
Vision-language models, text-to-image generation, audio processing, document understanding
๐ป Interactive Notebooks:
- ๐ Whisper Turbo (Colab)
- ๐ Whisper Turbo FP32 Async (Colab)
- ๐ Maestro Qwen2.5 VL JSON Extraction (Colab)
- ๐ XTTS Test on Long Text (Colab)
๐ Open Source Projects:
- Text2Prompt2Image - Flask app using Mixtral-8x7B & Playground-v2 for text-to-image generation
- flux_local - Lightweight toolkit for running FLUX.1-schnell text-to-image models locally
22. LLMOps
Model versioning, CI/CD pipelines, monitoring, deployment strategies, cost optimization
๐ Datasets & Resources
๐ฎ๐ท Persian Language Resources
๐ Persian Language Datasets:
- PersianCorpus_merged - Massive Persian corpus with 14.7M records (38 downloads)
- PersianTelegramChannels - Persian Telegram channels dataset with 12.1k records (74 downloads)
- persian-document-corpus - Persian document corpus with 13.1k records (31 downloads)
- persian_blogs - Persian blog posts dataset with 27.4k records (28 downloads)
- persian-tweets-2024 - Persian tweets from 2024 with 900 records (23 downloads)
- persian-search-queries - Persian search queries dataset with 1.31k records (27 downloads)
๐ Persian Instruction Datasets:
- Persian_sft - Persian supervised fine-tuning dataset with 681k records (59 downloads)
- Persian_sft_jsonl - Persian SFT in JSONL format with 681k records (26 downloads)
- Persian_sft_QA - Persian SFT Q&A format with 681k records (33 downloads)
- merged_persian_alpaca - Persian Alpaca-style instruction dataset with 527k records (21 downloads)
- merged_persian_sharegpt - Persian ShareGPT format dataset with 527k records (17 downloads)
- Persian_lmsys_QA - Persian LMSYS Q&A dataset with 5.43k records (59 downloads)
- alpaca_persian_telegram - Persian Alpaca with Telegram data, 1k records (17 downloads)
๐ Persian Evaluation Datasets:
- multiple-choice-persian-eval - Persian multiple-choice evaluation dataset with 364 records (20 downloads)
- SCED - Specialized evaluation dataset with 32 records (18 downloads)
๐ต Persian Audio Datasets:
- persian_tts_merged - Persian TTS dataset with 82.2k records (60 downloads)
- farsi_asr_merged - Persian ASR dataset (Private)
๐ Multi-Language & Specialized Datasets
๐ Multi-Language Instruction Datasets:
- Dolly_Alpaca_Lmsys - Merged instruction dataset with 1.07M records (26 downloads)
- merged_mental_health_dataset - Mental health support dataset with 868k records (26 downloads)
๐ Educational Datasets:
- ielts-practice-sentences - IELTS practice sentences with 45.7k records (24 downloads)
๐จ Creative Datasets:
- Midjourney-Art-Prompts - Curated collection of Midjourney art prompts (3 records)
๐ค Persian LLM Models
๐ฅ Featured Models:
- gemma-3-4b-persian-v0 - Persian fine-tuned Gemma-3 4B model (1.74K downloads)
- gemma-2-2b-fa-v2 - Persian Gemma-2 2B model v2 (21 downloads)
- Gemma-2-2b-fa - Persian Gemma-2 2B model (13 downloads)
- persian_phi-3 - Persian fine-tuned Phi-3 model (Private)
๐ฏ Specialized Variants:
- gemma-3-4b-persian-lora-adaptors - LoRA adapters for Persian Gemma-3 (9 downloads)
- gemma-3-4b-persian-v0-abliterated - Abliterated version for uncensored responses (6 downloads)
- gemma-3-4b-persian-v0-abliterated-Q8_0-GGUF - Quantized GGUF format (11 downloads)
๐ Community & Learning Resources
๐ Learning Platforms
๐ฏ Curated Collections:
- Awesome-AI - Best AI resources, tools, samples, and demos (124โญ)
- Awesome-Prompts - Ready-to-use prompts for productivity and creativity
๐ Educational Resources:
- LLMs-Journey - Progress tracking with code, projects, and notes
- Python-Course - Teaching materials from Kazerun University course
๐ Connect
- ๐ง Email: shojaei.dev@gmail.com
- ๐ผ LinkedIn: mshojaei77
- ๐ GitHub: mshojaei77
- ๐ค Hugging Face: mshojaei77