LLMs: From Foundation to Production

Chapter 2: Choosing Models in a Moving Market

2.1 Why Model Choice is a Moving Target

Large-Language Model (LLM) engineering is a logistics problem as much as it is a research problem. A few years ago, choosing a model was simple: you bought access to the smartest proprietary API you could afford, and you built your app around it. Today, that approach is a recipe for legacy debt.

Each generation of models brings better accuracy, longer context windows, new modalities, and entirely different pricing structures. A model that looked state-of-the-art last quarter may be deprecated next quarter. For instance, OpenAI’s deprecations page explicitly notes that “legacy” models are removed when improved versions become available, forcing engineers to migrate.

To work in this environment, an engineer must learn to evaluate models continuously rather than picking a winner once. Two forces drive this churn:

Capabilities Change: Models evolve in size and architecture. For example, Google’s Gemini leaped from 8k/32k-token context windows to accepting over one million tokens, enabling entire knowledgebase ingestions that previously required complex retrieval systems.
Economics Change: API pricing, quotas, and discounts shift constantly. Providers routinely retire older tiers, add “reasoning” surcharges, or introduce batch discounts.

The Engineering Takeaway: Model selection is part of the build-measure-learn cycle. You will choose a starting model, monitor its performance and cost, and be ready to switch.

2.2 Roles, Not Names

A common beginner mistake is assuming an AI application is powered by a single massive intelligence. Production systems rarely rely on a single model. Instead, they orchestrate multiple models performing different roles.

This reflects a general architectural pattern: use the right model for the job.

Role	Purpose	Typical Model Category	Example
Generator	Produce the final text response.	Chat/Instruct models	Claude Opus 4.7 Thinking
Embedder	Convert text into vector embeddings for search.	Embedding models	Gemini Embedding 001, Qwen3-Embedding-8B
Reranker	Score and reorder retrieved text for relevance.	Rerank models	Qwen3-Reranker-4B, Qwen3-Reranker-8B
Classifier/Router	Route requests or filter toxicity.	Small, fast classifiers	Granite Guardian 4.1, Qwen3.5 0.8B
Reasoning Model	Handle complex logic with internal planning.	“Pro” or “Thinking” variants	GPT-5.5 xhigh
Vision Model	Read images, PDFs, or charts (VLMs).	Multimodal models	Gemini 3.1 Pro
Audio Model	Convert speech to/from text (STT/TTS).	Speech-to-text and text-to-speech models	ElevenLabs Scribe v2, Realtime TTS 1.5 Max
Fallback Model	Handle traffic when the primary model is down.	Cheaper or alternate API provider	Gemini 3 Flash, DeepSeek V4 Pro

By splitting tasks across specialized models, you optimize both latency and cost.

2.3 Buy, Rent, or Host: The Access Strategy Decision

Before choosing which specific model to use, decide how you will access intelligence. Three strategies dominate the enterprise landscape:

Buy through a proprietary API: Providers like OpenAI, Anthropic, Google, and AWS package models with infrastructure, safety filtering, billing, and support. You never manage the weights or servers.
Rent via Serverless Inference: Companies like Together AI, Fireworks AI, and Groq host open-weight models and expose OpenAI-compatible endpoints. You still call an API, but you have more control over the model weights, often at a lower cost.
Host the model yourself: Download the weights and run inference on your hardware or cloud GPUs using engines like vLLM, SGLang, or llama.cpp. This maximizes control and privacy but requires serious MLOps resources.

The Strategy Decision Matrix:

Strategy	Best when…	Avoid when…
Buy	Speed to market, proven quality, enterprise support.	You need full control over data and weights.
Rent	You want open models without managing GPUs.	You require strict on-premise infrastructure.
Host	You need absolute privacy, or offline deployment.	You lack MLOps expertise or cloud budget.

(Note: We will discuss the mechanics of local self-hosting in Chapter 11 and enterprise serving optimization in Chapter 14).

2.4 Proprietary APIs: Managed Intelligence

Using a proprietary API means you are buying a managed intelligence product. Providers bundle the model with features that fundamentally shape your backend architecture.

When evaluating providers, check for these critical capabilities:

Structured Outputs: Many providers allow you to specify JSON schemas that the model is mathematically forced to follow. This reduces the need for fragile string parsing. (Note: We will dive deep into forcing LLMs to return strict JSON using Pydantic schemas in Chapter 4).
Prompt Caching: Caching reduces latency and cost when the beginning of a prompt (or system prompt) remains the same. OpenAI notes that repeated prefixes (stored for up to 24 hours) can reduce latency by 80% and input costs by 90%. Anthropic offers similar caching with lifetimes of 5 to 60 minutes.
Batch APIs: For asynchronous processing, providers offer Batch endpoints. Submitting a .jsonl file to OpenAI’s batch API provides a 50% discount and higher rate limits, with jobs processed within 24 hours.
Data Controls (ZDR): Enterprise customers must know if prompts are stored. OpenAI retains content for up to 30 days for abuse monitoring unless Zero Data Retention (ZDR) is approved. Anthropic offers similar ZDR policies where data is not stored after the response is generated.

2.5 Fast Models vs. Reasoning Models

Not all models are built for the same speed-quality trade-off. Today, the most crucial distinction is between Fast models (throughput-optimized) and Reasoning models (deep deliberation).

OpenAI’s reasoning documentation explains that these models use internal “reasoning tokens” to plan, inspect alternatives, and recover from ambiguity. The reasoning.effort parameter controls how many of these tokens are used. High effort yields better answers but costs more.

Warning: These invisible reasoning tokens count toward your context window and are billed as output tokens.

The economic decision is simple: use the cheapest model that meets your quality requirement.

Based on current freelance listings, the most common LLM projects are RAG chatbots, customer-support agents, document Q\&A/extraction, workflow automation, proposal automation, legal/regulatory assistants, voice agents, and business-data agents across Upwork, Fiverr, and Freelancer.

Real-World LLM Project / Use Case	Use Fast Model?	Use Reasoning Model?
Customer-support RAG chatbot for company docs or website FAQs	Yes	Sometimes
Internal knowledge-base assistant for PDFs, SOPs, and policies	Yes	Sometimes
Invoice, receipt, contract, or form data extraction pipeline	Yes	Sometimes
AI email reply assistant for sales, support, or CRM workflows	Yes	Sometimes
Lead qualification chatbot connected to CRM or booking tools	Yes	No
Voice agent for call-center intake, routing, or appointment booking	Yes	Sometimes
Multi-agent proposal automation for RFPs and government contracts	Sometimes	Yes + Human Review
Legal document assistant for OCR, metadata extraction, and Q\&A	Sometimes	Yes + Human Review
Regulatory document monitoring and compliance-risk analysis system	No	Yes + Human Review
Supplier-vetting agent for import/export documents and risk flags	Sometimes	Yes
AI code debugging, refactoring, or test-generation assistant	Sometimes	Yes
Automated content/SEO/social-media production workflow	Yes	No

Tiny architecture note: “fast model” is best for repeatable, high-volume steps; “reasoning model” earns its keep when the project needs multi-step judgment, compliance logic, risk analysis, or tool-using agents. (Note: We will build RAG systems in Chapters 6 & 7, and agentic workflows in Chapters 9 & 10).

2.6 Hugging Face and Open-Weight Discovery

When an engineer steps away from proprietary APIs, their first stop is the Hugging Face Hub. Engineers visit the Hub not just to download models, but to perform legal and technical due diligence using the Model Card.

The Model Card Inspection Checklist:

License: Can we use this commercially?
Intended Use: Was it built for summarization, code, or chat?
Training Data: Were there restrictions or potential PII?
Safety: Does it note known harmful behaviors?
Maintenance: Is the model actively updated or abandoned?

Safe Formats and Supply Chain

Model files are part of your security posture.

Safetensors: A format designed to safely store tensors without the risk of arbitrary code execution (faster and significantly safer than Python’s older pickle format).
GGUF: Packages both tensors and metadata, designed for highly efficient local inference with tools like llama.cpp. (Note: We will explore running models locally with GGUF and llama.cpp in Chapter 11).

Open Source vs. Open Weight vs. Open-Washing

Terminology in the AI market is legally treacherous. The Open Source Initiative (OSI) definition 1.0 states that true Open Source AI requires access to the training data descriptions, training code, and model parameters under OSI-approved terms.

If a model shares its final weights but hides the training code or dataset, it is Open Weight, not Open Source.

For example, **OLMo 3.1 32B **is a strong example of a genuinely open model: Ai2 releases the model, data, code, checkpoints, and training details. By contrast, high-performing models like Qwen3.6, DeepSeek V4 and Kimi K2.6 are better described as Open Weight: their weights are available and often permissively licensed, but they do not provide the full training-data and training-code transparency required by the OSI definition.

Finally, vendors often use “open” marketing while imposing strong restrictions. Meta Llama 4, for example, is widely discussed as open-weight rather than truly open source because its license requires companies with more than 700 million monthly active users to request a separate license from Meta. Always read the model license and Acceptable Use Policy before building on any “open” model.

Gated Models

Some models require you to explicitly request access from the author, sharing your contact information. For enterprise work, gating means you cannot assume your CI/CD pipelines or all team members can download the weights anonymously.

2.7 Serverless Open-Weight Inference

Serverless inference allows you to run open-weight models on-demand without managing GPUs. Providers like Together AI, Fireworks AI, and Groq expose endpoints compatible with OpenAI’s SDK.

Together AI, for example, emphasizes high performance (claiming up to 2.75x faster throughput than competitors) and discounts for cached input tokens.

Advantages:

No GPU Management: Avoid purchasing or renting cloud GPUs.
Fast Experimentation: Try hundreds of models with a single API key.
Scalability: Demand spikes are handled by the provider.

Risks:

Provider Lock-in: You still depend on a third party’s uptime.
Changing Catalog: The provider may remove a model from their servers.
Privacy: Data still passes through the provider’s servers.

2.8 API Gateways and Model Routers

When building multi-model systems, hardcoding a single provider is dangerous. You need a routing layer using libraries like LiteLLM or services like OpenRouter.

A model router acts as an insurance policy. It can:

Expose a single API key while internally routing to different vendors.
Apply fallback logic when a provider returns a 429 Too Many Requests error.
Route by cost or latency (e.g., send sensitive prompts to a local private model, and trivial prompts to a cheap API).
Enforce budget limits across your engineering team.

2.9 Rate Limits and Reliability

A model is only as good as the service hosting it. Hitting API limits results in immediate application failure. Evaluate these factors before choosing a provider:

Reliability Factor	Selection Question
RPM (Requests Per Min)	Can the API handle our peak concurrent users?
TPM (Tokens Per Min)	Can it handle the massive prompts generated by our RAG system?
Concurrency	How many requests can run simultaneously?
Region	Is the model available in our specific cloud region?
Fallback	What happens when the provider is down?

2.10 Tokenomics and Cost per Successful Task

LLM billing is incredibly granular. Providers separate input, output, cached, reasoning, and batch tokens. Rather than memorizing numbers, compute costs by this high-level formula:

Monthly Cost = number_of_requests × (
    (avg_input_tokens × input_price) + 
    (avg_output_tokens × output_price) + 
    (avg_reasoning_tokens × reasoning_price) - 
    cached_token_savings - 
    batch_discount_savings + 
    retry_costs
)

The Golden Rule of AI Economics: A cheap model that frequently fails, times out, or hallucinates is not cheap. You must measure the Cost per Successful Task (total_model_cost / successful_tasks). An expensive model that succeeds on the first try is often cheaper than a budget model that requires three retries to produce valid JSON.

2.11 Privacy, Data Retention, and Residency

Privacy compliance dictates provider choice for enterprises. You must know where your data goes. Ask these questions:

Data Flow: Does the provider use your prompts to train their future models? (Most enterprise tiers opt out by default, but always verify).
Retention: How long are logs stored? (e.g., 30 days vs. Zero Data Retention).
Region / Residency: Can you specify where data is processed? (For instance, Google Vertex AI allows region selection for generative models, which is critical for EU GDPR laws).
Support Access: Can provider support engineers read your prompts when debugging?

2.12 Benchmark Literacy: Maps, Not Territory

Evaluating models requires an understanding of benchmarks, but benchmarks are not a leaderboard game. They measure specific tasks and may not reflect your workload.

Holistic Evaluation: The HELM project evaluates models across accuracy, calibration, robustness, fairness, and efficiency. The key insight is that one number cannot capture everything.
Static Benchmarks: Fixed tests like MMLU (academic knowledge), GPQA (graduate-level reasoning), and SWE-bench (resolving real GitHub issues). (Note: SWE-bench is so difficult that even state-of-the-art models solve only around 2% to 15% of tasks).
Warning: Static benchmarks are vulnerable to contamination—if a model saw the test during training, its score is artificially inflated.
Live Benchmarks: Platforms like LiveBench and LiveCodeBench release new questions regularly and delay open-sourcing them to prevent contamination.
Human Preference Arenas: Platforms like Chatbot Arena use blind pairwise tests (Elo ratings) based on crowdsourced human votes.
Warning: These arenas reveal which models humans prefer in casual chat, but humans suffer from “verbosity bias” voting for longer, confident answers even if they are factually flawed.
LLM-as-Judge: Using Larger LLM to grade smaller models is highly scalable, but LLM judges exhibit position bias, verbosity bias, and self-preference. Always calibrate them with human review. (Note: We will build automated “LLM-as-a-judge” evaluation pipelines in Chapter 15).

2.13 Artificial Analysis: A Practical Leaderboard for Choosing Models

One useful resource for comparing modern AI models is Artificial Analysis. It is an independent benchmarking website that tracks AI models, API providers, inference speed, pricing, context windows, latency, and model quality across language, image, video, speech, agents, hardware, and open-weight models.

The site is especially useful because it does not only ask, “Which model is smartest?” It also helps answer more practical production questions:

Which model gives the best quality for the price?
Which provider is fastest for the same open-weight model?
Which model has the lowest latency?
Which model performs best on coding, long-context reasoning, agentic tasks, image generation, speech-to-text, or text-to-speech?
Which open-weight models are actually competitive with proprietary frontier models?

Artificial Analysis is popular among builders because it combines intelligence, cost, speed, latency, and provider comparison in one place. Instead of choosing a model based on hype, developers can compare tradeoffs: a top reasoning model may score highest on intelligence, while a cheaper or faster model may be better for customer support, summarization, routing, or high-volume automation.

A good way to use the site is not to blindly pick the model at the top of the leaderboard. Use it as a decision dashboard. For example, if you are building a legal review agent, intelligence and long-context reasoning matter more. If you are building a live chatbot, latency and output speed may matter more. If you are running millions of classifications per month, price and throughput may matter more than frontier-level reasoning.

Public feedback around Artificial Analysis is generally positive among AI builders, especially for model comparison and methodology. Some users describe it as one of the best AI model comparison sites, while others prefer it specifically when they care about transparent methodology. However, social discussions also point out an important caveat: rankings can change significantly when the benchmark mix changes. That is not necessarily a flaw, but it means readers should treat any leaderboard as a snapshot, not a permanent truth.

The best practice is to use Artificial Analysis alongside your own evaluation set. Leaderboards tell you which models are strong in general; your own tests tell you which model is best for your product.

2.14 Public Benchmarks vs. Private Product Evaluations

Public benchmarks help you shortlist models; private evaluations decide deployment.

While leaderboards are great for narrowing down your options, they do not know your business logic. Real products have domain-specific edge cases, internal jargon, and strict formatting requirements (like specific JSON schemas) that public benchmarks completely ignore. Furthermore, public benchmarks can suffer from data contamination (the model might have memorized the test data during its training).

To confidently deploy a model, you must build a Golden Dataset, a curated collection of real user tasks, edge cases, and expected outputs. You do not need thousands of examples to start; 50 to 100 high-quality, diverse examples are often enough to catch regressions. (Note: We will implement private evaluation suites and CI/CD pipelines for models in Chapter 15).

Anatomy of a Golden Dataset:

Private Eval Element	What it is & Why it matters	Example
Real User Task	Actual queries pulled from your logs. Captures true user intent and messiness.	A messy Zendesk support ticket with typos.
Golden Answer	The ideal, human-approved output. Used as the ground truth for scoring.	A polite, accurate resolution text.
Grading Rubric	Specific criteria for success. Helps automated judges score objectively.	“Fail if the output is not valid JSON or tone is robotic.”
Adversarial Case	Malicious inputs designed to break your system. Tests safety boundaries.	A prompt injection attempt (“Ignore previous instructions…”).
Regression Case	A previously fixed bug. Ensures swapping models doesn’t re-introduce errors.	A specific query the previous model hallucinated on.

The Private Evaluation Workflow:

Sample: Extract representative interactions from your production logs (or beta testers).
Annotate: Have domain experts write or approve the “Golden Answers” for these interactions.
Run Candidate: Pass the tasks through the new model you want to test.
Score: Use deterministic scripts (e.g., JSON validation) or an LLM-as-a-Judge to compare the candidate’s output against the Golden Answers.

2.15 The Model Selection Decision Matrix

When you architect your system, use this decision matrix to force your team to articulate their assumptions:

Criterion	The Engineering Question
Task Fit	What exact job must the model do? (Generator, Router, Embedder?)
Quality	Is it accurate enough? What is the acceptable failure rate?
Latency & Cost	Is it fast enough for the UX? What is the cost per successful task?
Context	Do we need a 1M context window, or should we build a RAG pipeline?
Output Reliability	Can it reliably produce strict JSON?
Privacy / License	Can this data leave our infrastructure? Is it commercially licensed?
Migration	How easily can we swap this model out if prices rise?

2.16 Model Migration and Drift

Model selection is never final. Providers deprecate models, and behaviors “drift” silently over time (e.g., safety filters tighten, outputs become terser).

Plan for migration on day one by:

Using Versioned Endpoints: Specify exact model versions (e.g., model-v1-0314) instead of generic aliases (model-latest) to prevent sudden backend breakages.
Implementing Regression Tests: Never switch models without running your private evaluation suite.
Designing Fallbacks: Use your model router to seamlessly switch providers during an outage.

Now that we know how to discover, evaluate, and navigate the economics of the model ecosystem, it is time to start building. In Chapter 3, we will leave the theory behind and write our first production-grade code to integrate these models into a reliable backend application.