Projects and Publications

A collection of the research, development projects, and educational resources I've created, focusing on Large Language Models (LLMs), Natural Language Processing, and AI applications. This repository showcases the practical implementations, theoretical foundations, and production-ready solutions for modern AI systems that I've developed over time.

📚 Core LLM Foundations

Prerequisites

💻 Interactive Notebooks:

  • 🟠 Linear Algebra Fundamentals for LLMs (Colab) - This notebook guides you through the essential linear algebra concepts required for understanding Large Language Models (LLMs). We'll cover vectors, matrices, and basic operations using NumPy, with a focus on their application within the attention mechanism.
  • 🟠 Probability and Statistics for LLMs (Colab) - This notebook provides an in-depth exploration of probability concepts foundational to Large Language Models (LLMs), combining theoretical explanations with real-world examples and code implementations in PyTorch.
  • 🟠 GPU Essentials for LLMs (Colab) - This Jupyter Notebook tutorial explores the crucial role of GPUs (Graphics Processing Units) in powering Large Language Models (LLMs). You'll learn why GPUs are essential, how they accelerate AI workloads, and the latest advancements in GPU technology.
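As a taste of how vectors and matrices show up in the attention mechanism, here is a minimal sketch of scaled dot-product attention. It is not taken from the notebooks: the helper names (`softmax`, `matmul`, `transpose`, `attention`) and the toy `Q`, `K`, `V` values are assumptions made for illustration, and it uses plain Python lists so it runs without dependencies, whereas the linear algebra notebook works with NumPy.

```python
import math

def softmax(row):
    # Subtract the row maximum before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(m):
    # Swap rows and columns of a matrix stored as a list of lists.
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    # Multiply an (n x k) matrix by a (k x m) matrix.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def attention(q, k, v):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]
    return matmul(weights, v), weights

# Toy inputs (hypothetical): two tokens with 2-dimensional queries, keys, values.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]

output, weights = attention(Q, K, V)
```

Each row of `weights` sums to 1, so every output row is a weighted average of the rows of `V`; a query attends most strongly to the key it is most aligned with.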

1. Tokenization

Text preprocessing, BPE, WordPiece, SentencePiece, multilingual tokenization

📄 Publications:

💻 Interactive Notebooks:

🤗 Persian Tokenizers:

2. Embeddings

Word2Vec, GloVe, BERT, contextual embeddings, semantic search, multimodal embeddings

📄 Publications:

💻 Interactive Notebooks:

3. Neural Networks

Backpropagation, activation functions, optimization, regularization, mixed precision training

4. Traditional Language Models

N-gram models, RNNs, LSTMs, GRUs, sequence-to-sequence models, attention mechanisms

📄 Publications:

💻 Interactive Notebooks:

5. Transformers

Self-attention, multi-head attention, positional encodings, decoder-only architecture

6. Data Preparation

Data collection, web scraping, cleaning, deduplication, quality assessment, synthetic data generation

💻 Interactive Notebooks:

🚀 Open Source Projects:

  • AdvancedWebScraper - Comprehensive web scraping tool with versatile data extraction capabilities
  • Prompt-Scraper - Effortlessly collect and transform Midjourney prompts into LM datasets
  • Youtube2Book - Extract transcripts from YouTube videos and structure with AI
  • Word-Frequency-Analyzer - Analyze word frequency in monthly news data
  • pytsetmc-api - Python client for Tehran Stock Exchange Market Center data retrieval
  • langchain_crawler - Web crawling implementation using LangChain

🧪 Model Training & Fine-Tuning

7. Pre-Training

Unsupervised pre-training, causal language modeling, distributed training, scaling laws

8. Post-Training Datasets

Instruction datasets, chat templates, conversation formatting, synthetic data generation

📄 Publications:

💻 Interactive Notebooks:

9. Supervised Fine-Tuning

LoRA, QLoRA, PEFT, instruction tuning, domain adaptation, model merging

📄 Publications:

💻 Interactive Notebooks:

10. Preference Alignment

RLHF, DPO, reward modeling, Constitutional AI, safety evaluation

11. Model Architectures

Mixture of Experts, state space models, Mamba, RWKV, long context architectures

12. Reasoning

Chain-of-Thought, tree-of-thoughts, process reward models, test-time compute scaling

📄 Publications:

13. Evaluation

Benchmarking, MMLU, GSM8K, HumanEval, human evaluation, bias testing

🚀 Production & Deployment

14. Quantization

Post-training quantization, quantization-aware training, GGUF, INT4/INT8 quantization

15. Inference Optimization

Flash Attention, KV cache, speculative decoding, high-throughput inference

📄 Publications:

🚀 Open Source Projects:

  • vram-calculator - Calculate VRAM requirements for LLMs and recommend suitable GPUs
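To make the idea of a VRAM estimate concrete, the sketch below uses a common rule of thumb: weight memory is the parameter count times bytes per weight, plus a fixed overhead for activations and the KV cache. This is not the method used by vram-calculator itself; the function names, the 20% overhead factor, and the short GPU list are all assumptions chosen for illustration.

```python
def estimate_vram_gb(params_billion, bits_per_weight=16, overhead=0.2):
    # Memory for the weights alone, in bytes.
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    # Add a flat overhead (assumed 20%) for activations, KV cache, and buffers.
    return weight_bytes * (1 + overhead) / 1e9  # decimal gigabytes

def suitable_gpus(required_gb, gpus):
    # Return GPUs with enough memory, smallest first.
    ranked = sorted(gpus.items(), key=lambda item: item[1])
    return [name for name, memory_gb in ranked if memory_gb >= required_gb]

# Illustrative GPU memory sizes in GB (a hypothetical shortlist).
GPUS = {"RTX 3060": 12, "RTX 4090": 24, "A100 80GB": 80}

fp16_gb = estimate_vram_gb(7)                     # 7B model at 16-bit
int4_gb = estimate_vram_gb(7, bits_per_weight=4)  # same model at 4-bit
fp16_candidates = suitable_gpus(fp16_gb, GPUS)
int4_candidates = suitable_gpus(int4_gb, GPUS)
```

Under these assumptions, a 7B model needs roughly 16.8 GB at 16-bit and about 4.2 GB at 4-bit, which is why quantization can move a model from a data-center GPU to a consumer card. Real requirements also depend on context length and batch size, which this sketch folds into the single overhead factor.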

16. Model Enhancement

Context window extension, model merging, knowledge distillation, continual learning

17. Security & Responsible AI

OWASP LLM Top 10, prompt injection, jailbreaking, bias detection, privacy protection

18. Running LLMs

API integration, local deployment, production servers, streaming responses

📄 Publications:

🚀 Open Source Projects:

🤖 Applications & Systems

19. RAG

Retrieval Augmented Generation, vector databases, Graph RAG, conversational RAG

🚀 Open Source Projects:

  • ollama_rag - Fully local RAG system using Ollama and FAISS
  • open-notebook - AI-powered knowledge management and question-answering system
  • TalkWithWeb - Customizable AI chatbot with personalized knowledge base
  • DataSpeakGPT - Reads files and images and retrieves their data for use by an LLM
  • Cortex - Advanced AI Deep Scholar Researcher Agent with RAG and Milvus integration
  • RAG-Agent - RAG implementation with LangChain and LangGraph libraries
  • RAG_CAG_SFT - Educational overview of RAG, Cache-Augmented Generation, and SFT techniques

20. Agents

Function calling, tool usage, multi-agent systems, autonomous task execution

🚀 Open Source Projects:

  • ReActMCP - Reactive MCP client for real-time web search insights (141⭐)
  • EasyMCP - Beginner-friendly client for Model Context Protocol
  • Groogle - Groq + Google integration for enhanced search capabilities
  • GoogleGPT - Combine Google search with ChatGPT capabilities
  • simple_function_calling - Beginner tutorial on connecting LLMs to external tools
  • SuperAgent - Advanced agent implementation
  • SuperNova-Desktop - Desktop agent application

21. Multimodal

Vision-language models, text-to-image generation, audio processing, document understanding

💻 Interactive Notebooks:

🚀 Open Source Projects:

  • Text2Prompt2Image - Flask app using Mixtral-8x7B & Playground-v2 for text-to-image generation
  • flux_local - Lightweight toolkit for running FLUX.1-schnell text-to-image models locally

22. LLMOps

Model versioning, CI/CD pipelines, monitoring, deployment strategies, cost optimization


📊 Datasets & Resources

🇮🇷 Persian Language Resources

📚 Persian Language Datasets:

📊 Persian Instruction Datasets:

📊 Persian Evaluation Datasets:

  • multiple-choice-persian-eval - Persian multiple-choice evaluation dataset with 364 records (20 downloads)
  • SCED - Specialized evaluation dataset with 32 records (18 downloads)

🎵 Persian Audio Datasets:

๐ŸŒ Multi-Language & Specialized Datasets

๐ŸŒ Multi-Language Instruction Datasets:

📚 Educational Datasets:

🎨 Creative Datasets:


🤖 Persian LLM Models

🎯 Specialized Variants:


🌟 Community & Learning Resources

📚 Learning Platforms

🎯 Curated Collections:

  • Awesome-AI - Best AI resources, tools, samples, and demos (124⭐)
  • Awesome-Prompts - Ready-to-use prompts for productivity and creativity

📚 Educational Resources:

  • LLMs-Journey - Progress tracking with code, projects, and notes
  • Python-Course - Teaching materials from Kazerun University course

🔗 Connect



Copyright © 2025 Mohammad Shojaei. All rights reserved. You may copy and distribute this work, but please note that it may contain other authors' works which must be properly cited. Any redistribution must maintain appropriate attributions and citations.