LLMs: From Foundation to Production

Chapter 1
What LLM Engineers Actually Build

LLM engineers build production systems that reliably turn raw model capabilities into user-facing products. Their work centers on four core responsibilities: model use (selecting the right base or fine-tuned model, crafting effective prompts, and routing requests intelligently), orchestration (chaining together retrieval, tools, memory, and agents using frameworks such as LangChain or LlamaIndex), evaluation (designing offline metrics, LLM-as-a-judge pipelines, synthetic test sets, and human rubrics that measure faithfulness, relevance, and safety), and deployment (serving models efficiently with tools like vLLM or TGI, optimizing latency, scaling under load, monitoring in production, controlling costs, running A/B tests, and enforcing guardrails).

This is fundamentally different from pure research. Researchers invent new architectures, discover scaling laws, design novel pre-training objectives, or pioneer post-training techniques such as instruction tuning, RLHF/RLAIF, and preference optimization. Engineers, by contrast, treat the model largely as a powerful but opaque component. They focus on integrating it into systems that are observable, testable, and reliable enough for real users.

Text-based product surfaces are the most straightforward: discrete tokens flow through a prompt-and-completion loop, context is comparatively easy to manage, batching is simple, and evaluation can rely on text similarity metrics or LLM judges. More advanced surfaces, such as real-time or multimodal interfaces, introduce streaming constraints, voice activity detection, modality conversions, prosody handling, and much tighter latency budgets. In practice, even sophisticated multimodal pipelines often rely on well-understood cascaded designs (for example, speech-to-text, then an LLM, then text-to-speech) because they are easier to make reliable and to integrate with tools.

The foundational ideas that guide this work come directly from established references. Speech and Language Processing by Jurafsky and Martin explains transformers (the role of attention and self-attention in sequence modeling), post-training (how instruction tuning teaches models to follow directives), and retrieval-augmented generation (RAG), which grounds model outputs in external data to reduce hallucinations. The Hugging Face Open-Source AI Cookbook translates these concepts into immediately usable, notebook-first patterns: turning documents into embeddings for vector database retrieval, building advanced RAG pipelines with reranking and query rewriting, and evaluating everything with LLM judges.

To make the engineering mindset concrete, consider three representative systems. Each map highlights the same engineering questions—inputs, outputs, failure modes, and acceptance criteria—applied to different product surfaces.

System Map: Document QA Bot (Text Surface, RAG-Heavy)
Architecture overview: An ingestion pipeline (chunking, embedding, vector database indexing) paired with a query-time flow of rewrite → hybrid retrieval (vector + keyword) → reranking/filtering → context assembly → grounded LLM generation.

Model Inputs: User query plus retrieved chunks (within context window) and optional conversation history; a separate embedding model for retrieval.
Model Outputs: Grounded answer with explicit citations to sources.
Failure Modes: Irrelevant retrieval (poor chunking, weak embeddings, or query mismatch); hallucination on missing information; context overload (“lost in the middle”); stale index; permission leaks; ambiguous queries without clarification.
Acceptance Criteria: Faithfulness above 95% (the share of answers with no unsupported claims, each claim verifiable against its citations); direct relevance to the query; end-to-end latency under 2 seconds for most requests; strong coverage of domain documents; robust evaluation scores (LLM-as-judge or human review) on both synthetic and real query sets.
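
The query-time flow in this map can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a production design: a bag-of-words cosine similarity stands in for the embedding model, distinct-term overlap stands in for keyword search, and the two rankings are merged with reciprocal rank fusion (one common fusion choice, not something the architecture prescribes). The corpus and the names CHUNKS, hybrid_retrieve, and build_prompt are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus: chunk id -> chunk text.
CHUNKS = {
    "handbook#1": "Employees accrue 20 vacation days per year.",
    "handbook#2": "Expense reports are due within 30 days of purchase.",
    "handbook#3": "Remote work requires manager approval.",
}

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    # Bag-of-words cosine similarity; an assumed stand-in for a real embedding model.
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def keyword_score(query_tokens, chunk_tokens):
    # Number of distinct query terms that also appear in the chunk.
    return len(set(query_tokens) & set(chunk_tokens))

def hybrid_retrieve(query, chunks, k=2, rrf_k=60):
    q_tokens = tokenize(query)
    q_vec = Counter(q_tokens)
    tokens = {cid: tokenize(text) for cid, text in chunks.items()}
    vector_rank = sorted(chunks, key=lambda c: cosine(q_vec, Counter(tokens[c])), reverse=True)
    keyword_rank = sorted(chunks, key=lambda c: keyword_score(q_tokens, tokens[c]), reverse=True)
    # Reciprocal rank fusion: each ranking contributes 1 / (rrf_k + rank) per chunk.
    fused = Counter()
    for ranking in (vector_rank, keyword_rank):
        for rank, cid in enumerate(ranking, start=1):
            fused[cid] += 1.0 / (rrf_k + rank)
    return [cid for cid, _ in fused.most_common(k)]

def build_prompt(query, chunk_ids, chunks):
    # Context assembly: label every chunk with its id so the answer can cite it.
    context = "\n".join(f"[{cid}] {chunks[cid]}" for cid in chunk_ids)
    return (
        "Answer using only the sources below and cite them by id. "
        "If the answer is not in the sources, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}"
    )

question = "How many vacation days do employees get?"
top = hybrid_retrieve(question, CHUNKS)
prompt = build_prompt(question, top, CHUNKS)
```

The resulting prompt would then be sent to the generation model. Query rewriting, reranking, and permission filtering are omitted here; each would slot in as another function between retrieval and context assembly.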

System Map: Meeting Transcriber (Audio-to-Text Surface, Pipeline with Post-Processing)
Architecture overview: Audio capture (streaming or batch) → speech-to-text with diarization → transcript cleanup → LLM summarization and structured analysis (key points, action items, decisions). Orchestration includes speaker labeling, timestamp alignment, and optional RAG for company-specific terminology.

Model Inputs: Raw audio waveform (or pre-transcribed segments) plus meeting metadata such as participants and agenda.
Model Outputs: Timestamped transcript with speaker labels, plus structured summary (bullets, action items, themes) and fully searchable text.
Failure Modes: Audio quality issues (accents, overlap, noise leading to transcription errors); diarization mistakes (speaker confusion); summarization drift or fabrication on long meetings; domain jargon mishandled; privacy risks in storage or processing.
Acceptance Criteria: Word error rate (WER) below a target in the 10–15% range on clean speech; diarization accuracy above 90%; summary completeness verified by human review; latency acceptable for the mode in use (minutes for an hour-long meeting in batch, near-real-time for streaming); structured output that always parses as valid JSON; evaluation via ROUGE/BERTScore or LLM judges for fidelity to the original transcript.
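
Word error rate is the word-level edit distance between the reference and the hypothesis (substitutions, deletions, and insertions) divided by the number of reference words. A minimal sketch follows, assuming whitespace tokenization and lowercasing as the only normalization; real evaluations usually also normalize punctuation and numerals. The function name word_error_rate and the sample sentences are illustrative.

```python
def word_error_rate(reference, hypothesis):
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)

reference = "we will ship the release on friday"
hypothesis = "we will ship release on a friday"
# One deletion ("the") and one insertion ("a") over 7 reference words.
wer = word_error_rate(reference, hypothesis)
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference, which is worth remembering when setting alert thresholds.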

System Map: Realtime Voice Agent (Low-Latency Surface, Cascaded or Native)
Architecture overview: Voice input layer (voice activity detection → streaming speech-to-text or native audio) → LLM agent (reasoning, tool calling, memory) → streaming text-to-speech or native audio output. Orchestration handles interruptions, context management, and function calling for external tools (calendar, search). Pipelining keeps every stage streaming to minimize turn gaps; newer native speech-to-speech models reduce hops but may constrain tool use.

Model Inputs: Streaming audio chunks (or partial transcripts), conversation history, and tool results.
Model Outputs: Streaming audio response with natural prosody and support for interruptions.
Failure Modes: Excessive latency (anything over 1–2 seconds feels unnatural); cascading errors from early-stage mistakes; poor interruption handling; context loss in long sessions; guardrail bypass in voice mode; tone or emotion mismatch; tool failures under streaming pressure.
Acceptance Criteria: Time-to-first-audio under 1 second (P50); conversational latency low enough for natural back-and-forth; high intelligibility and naturalness (mean opinion score, MOS, or user preference); task completion rate above 85% with reliable tool use; robustness to barge-ins and noisy environments; safety checks that hold in voice mode; production monitoring for cost and latency drift.
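
In a cascaded design, time-to-first-audio is roughly the sum of the per-stage delays before the first audio chunk plays, so the P50 criterion can be checked against a simple latency budget. The sketch below assumes four hypothetical stage names and made-up timings in milliseconds; they are illustrative values, not measurements from any real system.

```python
import statistics

# Hypothetical per-turn stage latencies in milliseconds (illustrative, not measured).
TURNS = [
    {"vad_endpoint": 200, "stt_final": 150, "llm_first_token": 300, "tts_first_chunk": 120},
    {"vad_endpoint": 220, "stt_final": 180, "llm_first_token": 450, "tts_first_chunk": 130},
    {"vad_endpoint": 210, "stt_final": 160, "llm_first_token": 900, "tts_first_chunk": 125},
]

def time_to_first_audio(turn):
    # Stages run back to back before the first audio chunk, so their delays add up.
    return sum(turn.values())

def p50_within_budget(turns, budget_ms=1000):
    totals = [time_to_first_audio(t) for t in turns]
    p50 = statistics.median(totals)
    return p50, p50 < budget_ms

def slowest_stage(turns):
    # The stage with the highest median latency is the first place to optimize.
    stages = turns[0].keys()
    return max(stages, key=lambda s: statistics.median(t[s] for t in turns))

p50, ok = p50_within_budget(TURNS)
bottleneck = slowest_stage(TURNS)
```

Tracking the per-stage breakdown alongside the total makes latency drift easier to attribute in production, since a regression in one stage can otherwise hide behind an unchanged median in the others.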

These system maps are not exhaustive, but they illustrate the consistent engineering lens: every product is defined by observable inputs and outputs, explicit failure modes, and measurable acceptance criteria. Engineers iterate through evaluation harnesses, synthetic test sets, A/B experiments, and live monitoring, turning powerful research outputs (transformers, post-trained models, retrieval techniques) into systems that users can trust at scale. Text products emphasize accuracy and grounding; real-time or multimodal surfaces add constraints around latency, streaming, and modality. In every case, the goal is the same: production systems that are reliable, observable, and easy to iterate on.
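
One way to keep a system map actionable is to store it as data and check measured metrics against its acceptance criteria automatically. The sketch below is a hypothetical structure, not an established framework: the class names Criterion and SystemMap are invented for illustration, the two thresholds are taken from the Document QA Bot map, and the measured values are made up.

```python
from dataclasses import dataclass, field

@dataclass
class Criterion:
    metric: str
    threshold: float
    higher_is_better: bool = True

    def passes(self, value):
        # "Above" and "under" in the system maps are read as strict comparisons.
        return value > self.threshold if self.higher_is_better else value < self.threshold

@dataclass
class SystemMap:
    name: str
    inputs: list
    outputs: list
    failure_modes: list
    criteria: list = field(default_factory=list)

    def evaluate(self, measured):
        # A metric with no measurement counts as a failure rather than a silent pass.
        return {
            c.metric: c.metric in measured and c.passes(measured[c.metric])
            for c in self.criteria
        }

doc_qa = SystemMap(
    name="Document QA Bot",
    inputs=["user query", "retrieved chunks", "conversation history"],
    outputs=["grounded answer with citations"],
    failure_modes=["irrelevant retrieval", "hallucination", "stale index"],
    criteria=[
        Criterion("faithfulness", 0.95),
        Criterion("latency_seconds", 2.0, higher_is_better=False),
    ],
)

# Made-up measurements: faithfulness meets its bar, latency does not.
report = doc_qa.evaluate({"faithfulness": 0.97, "latency_seconds": 2.4})
```

A report like this can gate a release or feed a dashboard, which keeps the acceptance criteria tied to running measurements instead of a design document.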