LLM Engineering in Action

This book teaches the production skills that appear repeatedly in 2026 LLM engineering job postings and strong real-world resumes: RAG, agentic workflows, fine-tuning, LLMOps, evaluation, security, cloud deployment, and portfolio-ready product work.

Market-Aligned Outcomes

By the end of the book, readers should be able to build and explain:

  • A document-grounded RAG assistant with citations and retrieval tests
  • A hybrid search system with reranking and measurable retrieval quality
  • A permission-aware enterprise assistant connected to documents and SQL data
  • A tool-connected LLM app with approval gates and audit logs
  • A stateful agent workflow with checkpoints, retries, and human review
  • A fine-tuned open-source model with before/after evaluation
  • A production LLM API with Docker, CI/CD, monitoring, and cost controls
  • An evaluation and security harness for RAG and agent systems
  • A portfolio capstone with architecture notes, metrics, and trade-off analysis

Table of Contents

Introduction: Modern LLM Engineering

  • The LLM Engineer Role
  • The Production AI Mindset
  • The LLM Application Stack
  • Business-to-System Translation
  • Reliability, Cost, and Risk
  • How to Use This Book

Part 1: Foundations and First Applications

Chapter 1: LLM Text Generation

  • Generation Loop
  • Tokenization and Subwords
  • Decoder-Only Transformers
  • Context Windows and Attention
  • Logits and Sampling
  • Decoding Controls
  • Generation Failure Scenarios
  • Chat Playgrounds
  • Hands-On Exercise

Chapter 2: Production Model Selection

  • Model Selection Criteria
  • Closed and Open Models
  • API, Hosted, and Local Access
  • Capability and Task Fit
  • Benchmarks and Product Tests
  • LLM Leaderboards
  • Cost, Latency, and Reliability
  • Model Decision Records
  • Hands-On Exercise

Chapter 3: Streaming Chatbot Applications

  • Chat API Structure
  • Roles and Message History
  • Environment and API Keys
  • Streaming Response Handling
  • Command-Line Chatbot Design
  • Context Growth Management
  • Runtime Error Handling
  • Part 1 Capstone Project

Part 2: Context, Retrieval, and Enterprise Grounding

Chapter 4: Prompting and Structured Outputs

  • System and User Prompts
  • Production Prompt Anatomy
  • System Prompt Design
  • Instruction Hierarchy
  • Prompt Security Boundaries
  • Prompt Patterns
  • Prompt Chaining and Review Loops
  • Few-Shot Examples
  • Context Assembly
  • Prompting for Cost and Latency
  • Prompt Versioning and Debugging
  • Prompt Anti-Patterns
  • JSON Schema Outputs
  • Pydantic Validation Contracts
  • Hands-On Exercise
  • What is an Embedding?
  • Practical Chunking Strategies
  • Choosing Embedding Models
  • Embedding Generation Best Practices
  • Vector Databases and Storage Options
  • Basic Semantic Search Implementation
  • Evaluation Metrics for Retrieval
  • Common Pitfalls and Failure Scenarios
  • Hands-On Exercise

Chapter 6: Retrieval-Augmented Generation

  • Search Is Not an Answer
  • The Basic RAG Loop
  • Context Construction and Token Budgeting
  • Grounded Prompting and โ€œI Donโ€™t Knowโ€
  • Source Citations as a Product Contract
  • FastAPI and Streamlit Interface
  • Debugging Retrieval and Context
  • Minimal Smoke Tests
  • Hands-On Exercise: PDF Q&A Bot with Citations

Chapter 7: Hybrid Retrieval, Reranking, and RAG Evaluation

  • Why Basic Vector Search Fails
  • BM25 and Sparse Retrieval
  • Metadata Filters and Exact-Match Needs
  • Hybrid Dense+Sparse Search
  • Query Rewriting
  • Reciprocal Rank Fusion
  • Cross-Encoder Reranking
  • Golden Query Sets
  • Retrieval Metrics: Recall@k, Precision@k, MRR, NDCG
  • RAGAS, DeepEval, and Custom RAG Evaluation
  • Hands-On Exercise: Support Search System with Measured Retrieval Lift

Chapter 8: Enterprise Data Integration and Permission-Aware RAG

  • Document Ingestion Pipelines
  • Parser Selection and Metadata Contracts
  • Incremental Indexing and Data Freshness
  • SQL and Postgres Grounding
  • Permission-Aware Retrieval
  • Tenant Isolation and Row-Level Security
  • PII Handling and Audit Logs
  • Knowledge Graphs and Structured Context
  • Access-Control Test Suites
  • Hands-On Exercise: Internal Assistant with Documents, SQL Data, and Role Filters

Part 3: Tools, State, and Agents

Chapter 9: Tool-Connected LLM Apps and MCP-Style Interfaces

  • Function Calling and Tool Schemas
  • Argument Validation
  • Tool-Result Injection
  • Read Tools vs Write Tools
  • API Safety and Idempotency
  • Human Approval for Risky Actions
  • MCP-Style Tool Boundaries
  • Tool-Call Logging and Audit Trails
  • Hands-On Exercise: Customer-Support Copilot with Docs, Order Lookup, and Approval Gates

Chapter 10: Stateful Agent Workflows

  • State Machines and Workflow Graphs
  • Planning Loops
  • Conversation State and User Preferences
  • Checkpoints and Resumable Execution
  • Retries, Interruptions, and Failure Recovery
  • Human-in-the-Loop Review
  • Basic Agent Observability
  • Hands-On Exercise: Resumable Research Assistant Workflow

Chapter 11: Multi-Agent Systems and Agent Evaluation

  • When Multi-Agent Systems Help
  • Supervisor and Router Patterns
  • Specialist Agents and Handoffs
  • Shared State
  • Parallel Research
  • Conflict Resolution
  • Cost Explosion and Latency Risks
  • Agent Tracing and Evaluation
  • Hands-On Exercise: Multi-Agent Proposal Builder

Part 4: Open Models and Customization

Chapter 12: Local LLM Inference

  • Open-Weight Model Families
  • Licensing and Deployment Constraints
  • Tokenizers and Context Length
  • VRAM/RAM Limits
  • Quantization and GGUF
  • Ollama, llama.cpp, and Hugging Face Loading
  • CPU vs GPU Trade-Offs
  • Privacy, Cost, and Offline Use Cases
  • Hands-On Exercise: Local Private Document Assistant

Chapter 13: LLM Fine-Tuning with LoRA, QLoRA, and PEFT

  • When Fine-Tuning Helps
  • When RAG or Prompting Is Better
  • Dataset Formatting and Quality
  • Train/Validation Splits
  • Supervised Fine-Tuning
  • LoRA, QLoRA, and PEFT
  • Hyperparameters and Overfitting
  • Before/After Evaluation
  • Model Cards and Dataset Cards
  • Hands-On Exercise: Fine-Tuned Support-Response Model

Chapter 14: Preference Tuning, Distillation, and Training Judgment

  • DPO, RLHF, and RLAIF Concepts
  • Synthetic Data Generation
  • Teacher-Student Distillation
  • Rationale Transfer
  • Continued Pre-Training Awareness
  • Tokenizer and Data Quality Decisions
  • Compute Budgeting
  • Build vs Buy vs Fine-Tune
  • Hands-On Exercise: Distilled Classifier or Mini Assistant

Part 5: Production, Reliability, and Governance

Chapter 15: Production LLM Serving and LLMOps

  • API Design and Authentication
  • Queues, Rate Limits, and Caching
  • Streaming at Scale
  • Docker, CI/CD, Secrets, and Rollbacks
  • Cloud Deployment Basics
  • vLLM, TGI, TensorRT-LLM, and Model Servers
  • Continuous Batching and KV Cache
  • Autoscaling and Health Checks
  • Hands-On Exercise: Productionized RAG/Agent API

Chapter 16: Observability, Cost, and Reliability Engineering

  • Structured Logs and Traces
  • Prompt and Model Versioning
  • Latency Metrics and p95 Debugging
  • Token Usage and Cost Tracking
  • Model Routing and Fallbacks
  • Cache Hit Rate and Queue Depth
  • Budget Alerts and Incident Review
  • Load Tests and Reliability Reports
  • Hands-On Exercise: LLMOps Dashboard

Chapter 17: Evaluation, AI Security, and Governance

  • Golden Datasets and Regression Tests
  • LLM-as-Judge and Human Review
  • RAG Quality and Citation Faithfulness
  • Hallucination Checks
  • Prompt Injection
  • Data Leakage
  • Insecure Tool Use and Excessive Agency
  • Guardrails, Red Teaming, and Incident Response
  • Governance and AI Risk Registers
  • Hands-On Exercise: Evaluation and Security Harness

Part 6: Multimodal, Voice, and Domain Products

Chapter 18: Vision-Language and Document Intelligence Applications

  • Image Inputs and Document Screenshots
  • OCR vs Vision-Language Models
  • Chart and Table Understanding
  • Visual Question Answering
  • Image-Grounded Extraction
  • Multimodal Prompting and RAG
  • Visual Evidence Checking
  • Hands-On Exercise: Invoice or Form Intelligence Assistant

Chapter 19: Speech and Voice Agents

  • Speech-to-Text and Text-to-Speech
  • Real-Time Audio Streaming
  • Voice Activity Detection
  • Turn-Taking and Interruptions
  • Latency Budgets
  • Voice UX and Failure Recovery
  • Tool Use During Calls
  • Transcript Storage and Consent
  • Hands-On Exercise: Voice Appointment Assistant

Chapter 20: Domain LLM Product and Portfolio Capstone

  • Domain Discovery and Workflow Mapping
  • User Roles and Data Access
  • RAG vs Fine-Tuning Decisions
  • Tool Integration and Human Review
  • Compliance, Governance, and Rollout
  • Product Metrics and Maintenance
  • Architecture Documents
  • Evaluation Reports
  • Cost and Latency Reports
  • Resume Bullets and Technical Walkthroughs
  • Capstone Project: Complete Domain Copilot

After the Book: Career and Portfolio

Portfolio Development

  • Project Selection
  • Architecture Documents
  • Public Demo Applications
  • Technical Walkthroughs
  • Evaluation Reports
  • Security Notes
  • Cost and Latency Reports
  • Trade-Off Explanations
  • Resume Bullets with Measurable Outcomes

LLM Engineering Careers

  • RAG and Knowledge Systems Engineer
  • Agentic AI Engineer
  • LLM Fine-Tuning Specialist
  • LLMOps or AI Platform Engineer
  • Applied AI / Full-Stack LLM Engineer
  • LLM Evaluation Engineer
  • LLM Solutions Engineer
  • Forward-Deployed AI Engineer
  • Enterprise AI Architect

Back to top

Copyright © 2025 Mohammad Shojaei. All rights reserved. You may copy and distribute this work, but please note that it may contain other authors' works which must be properly cited. Any redistribution must maintain appropriate attributions and citations.

This site uses Just the Docs, a documentation theme for Jekyll.