Projects and Publications

My comprehensive collection of research, development projects, and educational resources that I’ve created, focusing on Large Language Models (LLMs), Natural Language Processing, and AI applications. This repository showcases my practical implementations, theoretical foundations, and production-ready solutions for modern AI systems that I’ve developed over time.

📚 Core LLM Foundations

Prerequisites

💻 Interactive Notebooks:

🟠 Linear Algebra Fundamentals for LLMs (Colab) - This notebook will guide you through the essential linear algebra concepts required for understanding Large Language Models (LLMs). We’ll cover vectors, matrices, and basic operations using NumPy, with a focus on their application within the attention mechanism.
🟠 Probability and Statistics for LLMs (Colab) - This notebook provides an in-depth exploration of probability concepts foundational to Large Language Models (LLMs), combining theoretical explanations with real-world examples and code implementations in PyTorch.
🟠 GPU Essentials for LLMs (Colab) - This Jupyter Notebook tutorial explores the crucial role of GPUs (Graphics Processing Units) in powering Large Language Models (LLMs). You’ll learn why GPUs are essential, how they accelerate AI workloads, and the latest advancements in GPU technology.

1. Tokenization

Text preprocessing, BPE, WordPiece, SentencePiece, multilingual tokenization

📄 Publications:

💻 Interactive Notebooks:

🤗 Persian Tokenizers:

PersianBPETokenizer - BPE tokenizer optimized for Persian text
PersianGemmaTokenizerFast - Fast tokenizer for Persian Gemma models
PersianWordPieceTokenizer - WordPiece tokenizer for Persian language
PersianUnigramTokenizer - Unigram-based tokenizer for Persian
PersianLlamaTokenizerFast - Fast tokenizer for Persian Llama models

2. Embeddings

Word2Vec, GloVe, BERT, contextual embeddings, semantic search, multimodal embeddings

📄 Publications:

💻 Interactive Notebooks:

3. Neural Networks

Backpropagation, activation functions, optimization, regularization, mixed precision training

4. Traditional Language Models

N-gram models, RNNs, LSTMs, GRUs, sequence-to-sequence models, attention mechanisms

📄 Publications:

Understanding Language Models

💻 Interactive Notebooks:

5. Transformers

Self-attention, multi-head attention, positional encodings, decoder-only architecture

6. Data Preparation

Data collection, web scraping, cleaning, deduplication, quality assessment, synthetic data generation

💻 Interactive Notebooks:

🚀 Open Source Projects:

AdvancedWebScraper - Comprehensive web scraping tool with versatile data extraction capabilities
Prompt-Scraper - Effortlessly collect and transform Midjourney prompts into LM datasets
Youtube2Book - Extract transcripts from YouTube videos and structure with AI
Word-Frequency-Analyzer - Analyze word frequency in monthly news data
pytsetmc-api - Python client for Tehran Stock Exchange Market Center data retrieval
langchain_crawler - Web crawling implementation using LangChain

🧪 Model Training & Fine-Tuning

7. Pre-Training

Unsupervised pre-training, causal language modeling, distributed training, scaling laws

8. Post-Training Datasets

Instruction datasets, chat templates, conversation formatting, synthetic data generation

📄 Publications:

RAG vs. CAG vs. Fine-Tuning: Which Brain Boost Does Your LLM Actually Need?

💻 Interactive Notebooks:

9. Supervised Fine-Tuning

LoRA, QLoRA, PEFT, instruction tuning, domain adaptation, model merging

📄 Publications:

The LoRA Cookbook: Fine-Tuning Large Language Models for Everyone

💻 Interactive Notebooks:

🟠 Gemma SFT (Colab)
🟠 Gemma3 4B (Colab)
🟠 Gemma3 4B Persian (Colab)
🟠 Gemma3 4B Persian v2 (Colab)
🟠 Persian Gemma3 4B (Colab)
🟠 SFT (Supervised Fine-Tuning) (Colab)

10. Preference Alignment

RLHF, DPO, reward modeling, Constitutional AI, safety evaluation

11. Model Architectures

Mixture of Experts, state space models, Mamba, RWKV, long context architectures

12. Reasoning

Chain-of-Thought, tree-of-thoughts, process reward models, test-time compute scaling

📄 Publications:

How AI Learns to Fix Own Mistakes

13. Evaluation

Benchmarking, MMLU, GSM8K, HumanEval, human evaluation, bias testing

🚀 Production & Deployment

14. Quantization

Post-training quantization, quantization-aware training, GGUF, INT4/INT8 quantization

15. Inference Optimization

Flash Attention, KV cache, speculative decoding, high-throughput inference

📄 Publications:

Understanding the Differences Between CPU, GPU, TPU, and LPU

🚀 Open Source Projects:

vram-calculator - Calculate VRAM requirements for LLMs and recommend suitable GPUs

16. Model Enhancement

Context window extension, model merging, knowledge distillation, continual learning

17. Security & Responsible AI

OWASP LLM Top 10, prompt injection, jailbreaking, bias detection, privacy protection

18. Running LLMs

API integration, local deployment, production servers, streaming responses

📄 Publications:

Guide to Deploying Qwen 3 with vLLM on RunPod

🚀 Open Source Projects:

ollama-desktop - Powerful desktop application for interacting with local AI models
ollama_gui - User-friendly Qt desktop application for Ollama backend
SubTrans-Ollama - Simple tool for translating movie subtitles (.srt) files
ChatGPT-Desktop-App - Interactive desktop app with document uploads and conversation management
OpenRouterChatApp - Simple chat application using OpenRouter API
GPT-Translator - Streamlit translation app using advanced language models
Pdf-Finder-Telegram-bot - Search for book PDFs in Telegram bot
healthcare-assistant - Healthcare chat interface for emotional support and stress analysis

🤖 Applications & Systems

19. RAG

Retrieval Augmented Generation, vector databases, Graph RAG, conversational RAG

🚀 Open Source Projects:

ollama_rag - Fully local RAG system using Ollama and FAISS
open-notebook - AI-powered knowledge management and question-answering system
TalkWithWeb - Customizable AI chatbot with personalized knowledge base
DataSpeakGPT - Read files and images and retrieve data for LLM
Cortex - Advanced AI Deep Scholar Researcher Agent with RAG and Milvus integration
RAG-Agent - RAG implementation with LangChain and LangGraph libraries
RAG_CAG_SFT - Educational overview of RAG, Cache-Augmented Generation, and SFT techniques

20. Agents

Function calling, tool usage, multi-agent systems, autonomous task execution

🚀 Open Source Projects:

ReActMCP - Reactive MCP client for real-time web search insights (141⭐)
EasyMCP - Beginner-friendly client for Model Context Protocol
Groogle - Groq + Google integration for enhanced search capabilities
GoogleGPT - Combine Google search with ChatGPT capabilities
simple_function_calling - Beginner tutorial on connecting LLMs to external tools
SuperAgent - Advanced agent implementation
SuperNova-Desktop - Desktop agent application

21. Multimodal

Vision-language models, text-to-image generation, audio processing, document understanding

💻 Interactive Notebooks:

🚀 Open Source Projects:

Text2Prompt2Image - Flask app using Mixtral-8x7B & Playground-v2 for text-to-image generation
flux_local - Lightweight toolkit for running FLUX.1-schnell text-to-image models locally

22. LLMOps

Model versioning, CI/CD pipelines, monitoring, deployment strategies, cost optimization

📊 Datasets & Resources

🇮🇷 Persian Language Resources

📚 Persian Language Datasets:

PersianCorpus_merged - Massive Persian corpus with 14.7M records (38 downloads)
PersianTelegramChannels - Persian Telegram channels dataset with 12.1k records (74 downloads)
persian-document-corpus - Persian document corpus with 13.1k records (31 downloads)
persian_blogs - Persian blog posts dataset with 27.4k records (28 downloads)
persian-tweets-2024 - Persian tweets from 2024 with 900 records (23 downloads)
persian-search-queries - Persian search queries dataset with 1.31k records (27 downloads)

📊 Persian Instruction Datasets:

Persian_sft - Persian supervised fine-tuning dataset with 681k records (59 downloads)
Persian_sft_jsonl - Persian SFT in JSONL format with 681k records (26 downloads)
Persian_sft_QA - Persian SFT Q&A format with 681k records (33 downloads)
merged_persian_alpaca - Persian Alpaca-style instruction dataset with 527k records (21 downloads)
merged_persian_sharegpt - Persian ShareGPT format dataset with 527k records (17 downloads)
Persian_lmsys_QA - Persian LMSYS Q&A dataset with 5.43k records (59 downloads)
alpaca_persian_telegram - Persian Alpaca with Telegram data, 1k records (17 downloads)

📊 Persian Evaluation Datasets:

multiple-choice-persian-eval - Persian multiple-choice evaluation dataset with 364 records (20 downloads)
SCED - Specialized evaluation dataset with 32 records (18 downloads)

🎵 Persian Audio Datasets:

persian_tts_merged - Persian TTS dataset with 82.2k records (60 downloads)
farsi_asr_merged - Persian ASR dataset (Private)

🌐 Multi-Language & Specialized Datasets

🌐 Multi-Language Instruction Datasets:

Dolly_Alpaca_Lmsys - Merged instruction dataset with 1.07M records (26 downloads)
merged_mental_health_dataset - Mental health support dataset with 868k records (26 downloads)

📚 Educational Datasets:

ielts-practice-sentences - IELTS practice sentences with 45.7k records (24 downloads)

🎨 Creative Datasets:

Midjourney-Art-Prompts - Curated collection of Midjourney art prompts (3 records)

🤖 Persian LLM Models

🔥 Featured Models:

gemma-3-4b-persian-v0 - Persian fine-tuned Gemma-3 4B model (1.74K downloads)
gemma-2-2b-fa-v2 - Persian Gemma-2 2B model v2 (21 downloads)
Gemma-2-2b-fa - Persian Gemma-2 2B model (13 downloads)
persian_phi-3 - Persian fine-tuned Phi-3 model (Private)

🎯 Specialized Variants:

gemma-3-4b-persian-lora-adaptors - LoRA adapters for Persian Gemma-3 (9 downloads)
gemma-3-4b-persian-v0-abliterated - Abliterated version for uncensored responses (6 downloads)
gemma-3-4b-persian-v0-abliterated-Q8_0-GGUF - Quantized GGUF format (11 downloads)

Projects and Publications

📚 Core LLM Foundations

Prerequisites

💻 Interactive Notebooks:

1. Tokenization

📄 Publications:

💻 Interactive Notebooks:

🤗 Persian Tokenizers:

2. Embeddings

📄 Publications:

💻 Interactive Notebooks:

3. Neural Networks

4. Traditional Language Models

📄 Publications:

💻 Interactive Notebooks:

5. Transformers

6. Data Preparation

💻 Interactive Notebooks:

🚀 Open Source Projects:

🧪 Model Training & Fine-Tuning

7. Pre-Training

8. Post-Training Datasets

📄 Publications:

💻 Interactive Notebooks:

9. Supervised Fine-Tuning

📄 Publications:

💻 Interactive Notebooks:

10. Preference Alignment

11. Model Architectures

12. Reasoning

📄 Publications:

13. Evaluation

🚀 Production & Deployment

14. Quantization

15. Inference Optimization

📄 Publications:

🚀 Open Source Projects:

16. Model Enhancement

17. Security & Responsible AI

18. Running LLMs

📄 Publications:

🚀 Open Source Projects:

🤖 Applications & Systems

19. RAG

🚀 Open Source Projects:

20. Agents

🚀 Open Source Projects:

21. Multimodal

💻 Interactive Notebooks:

🚀 Open Source Projects:

22. LLMOps

📊 Datasets & Resources

🇮🇷 Persian Language Resources

📚 Persian Language Datasets:

📊 Persian Instruction Datasets:

📊 Persian Evaluation Datasets:

🎵 Persian Audio Datasets:

🌐 Multi-Language & Specialized Datasets

🌐 Multi-Language Instruction Datasets:

📚 Educational Datasets:

🎨 Creative Datasets:

🤖 Persian LLM Models

🔥 Featured Models:

🎯 Specialized Variants:

🌟 Community & Learning Resources

📚 Learning Platforms

🎯 Curated Collections:

📚 Educational Resources:

🔗 Connect