Open Source AI

What Does Open Source AI Even Mean?

Collaborative Open-Source AI Network

Mohammad Shojaei, Applied AI Engineer

11 Sep 2025

Deconstructing an AI Model

The Complete AI Lifecycle: From Training to Model Weights

Prerequisites

Training Data
Raw knowledge the model learns from
Architecture
Network blueprint
Training Code
Recipe for the learning process

Training Process

Learning algorithms optimize parameters

Model Weights

Crystallized knowledge as numbers

Training transforms raw ingredients into learned knowledge
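To make this concrete, here is a minimal sketch of a training loop, using a toy PyTorch model and random data purely for illustration: the learning algorithm repeatedly nudges parameters, and the saved state dict is the resulting model weights.

```python
# Minimal illustration: gradient descent turns training data into model weights.
# The model, data, and hyperparameters here are toy placeholders, not a real recipe.
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                       # "architecture": a tiny network blueprint
data = [(torch.randn(10), torch.randn(1)) for _ in range(100)]  # "training data"
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for epoch in range(3):                         # "training process"
    for x, y in data:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()                        # the learning algorithm computes gradients
        optimizer.step()                       # parameters move toward lower loss

torch.save(model.state_dict(), "weights.pt")   # "model weights": knowledge crystallized as numbers
```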

The Four Freedoms Applied to AI

The Free Software Foundation's Four Freedoms provide a robust framework for understanding AI openness

Freedom 1: The Freedom to Run

Run AI systems for any purpose without restriction

Unrestricted access to model weights and inference code
Clear licensing that permits commercial and non-commercial use
Minimal technical barriers to deployment
Transparent system requirements and dependencies
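In practice, this freedom means anyone can pull openly licensed weights and run them locally with off-the-shelf tooling. A minimal sketch with Hugging Face Transformers; the checkpoint named here is just one example of a permissively licensed model:

```python
# Running an openly licensed model locally, no API key required.
from transformers import pipeline

generator = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-1.7B-Instruct")
print(generator("Open source AI matters because", max_new_tokens=50)[0]["generated_text"])
```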

Freedom 2: The Freedom to Study

Study how AI systems work and adapt them to your needs

Complete access to training code and methodologies
Comprehensive documentation of model architecture and design decisions
Access to training data or detailed descriptions when data cannot be shared
Transparent evaluation metrics and benchmarking procedures
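For an open model, studying can begin with inspecting the architecture configuration that ships alongside the weights. A small sketch, using a Pythia checkpoint as the example:

```python
# Inspecting the published architecture of an open model: the config travels with the weights.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("EleutherAI/pythia-1b")
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)
```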

Freedom 3: The Freedom to Redistribute

Redistribute copies of AI systems to help others

Permissive licensing for model weights and associated code
Clear guidelines for attribution and redistribution
Technical formats that support easy sharing and deployment
Community infrastructure for distribution and discovery

Freedom 4: The Freedom to Distribute Modified Versions

Distribute modified versions to benefit the community

Access to training infrastructure and methodologies
Permissive licensing for derivative works
Community standards for documenting modifications
Technical and legal frameworks supporting innovation

These freedoms ensure that AI systems remain accessible, transparent, shareable, and improvable for everyone

The Spectrum: From Locked Down to Actually Open

Understanding the four levels of AI transparency and what you actually get

🔒 Locked Down → ⚠️ Openwashing Zone → 🌟 Actually Open

Closed / API-Only

Examples:
GPT-5, Claude 4.1
Gemini 2.5, Midjourney
What You Get:
None of the five components: Training Data, Architecture, Training Code, Training Process, or Model Weights

Open-Weight

Examples:
Llama 3/4, DeepSeek-R1
Falcon, BLOOM, Whisper
What You Get:
Model Weights only (missing Training Data, Architecture, Training Code, Training Process)

Open-Source AI

Examples:
Mistral, DBRX
Pythia, Phi-3
What You Get:
Architecture, Training Code, Model Weights (Training Data often limited)

Radical Openness

Examples:
SmolLM, OLMo (AI2)
OpenThinker-7B
What You Get:
All components: Training Data, Architecture, Training Code, Training Process, Model Weights

The spectrum reveals a harsh reality: most "open" AI is actually openwashing

True openness requires complete transparency, permissive licensing, and reproducible methodology — not just model weights

The Gold Standard

Exemplars of True Openness

Pythia (EleutherAI)

70M–12B • Apache-2.0
Training Data: Public data, seen in the same order
Architecture: 16 model variants
Training Code: Exact dataloader reconstruction
Training Process: 154 intermediate checkpoints
Model Weights: Gold-standard reproducibility
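Those intermediate checkpoints are published as Hub revisions (named step0, step1000, and so on), so any point in the training run can be reloaded and compared. A brief sketch:

```python
# Loading one of Pythia's intermediate training checkpoints by revision name.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m", revision="step3000")
```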

OLMo (AI2)

1B–32B • Apache-2.0
Training Data: Dolma multitrillion-token corpus
Architecture: Full weights + data mixtures
Training Code: Training/eval code public
Training Process: Intermediate checkpoints
Model Weights: Artifact collections public

SmolLM (Hugging Face)

135M/360M/1.7B • Permissive
Training Data: SmolLM-Corpus (FineWeb-Edu)
Architecture: Small LLMs from scratch
Training Code: Training details documented
Training Process: 11T-token recipe (SmolLM2)
Model Weights: Eval harness setup

TinyLlama

1.1B • Open weights/code
Training Data: ~1T tokens from scratch
Architecture: Compact efficient design
Training Code: Public training code/recipe
Training Process: Leverages open tooling (Lit-GPT)
Model Weights: Paper and checkpoints public

Big Tech's Response to OSS Pressure

How Open Source Communities Forced Strategic Shifts

Company Responses: The Open-Weight Convergence

OpenAI

gpt-oss-20b/120b

Model Weights, Apache-2.0

Google

Gemma 1-3

Model Weights, Gemma Terms of Use

xAI

Grok 1-2

Model Weights, Architecture, Apache-2.0

Meta

Llama 1-4

Model Weights, Llama Community License

Microsoft

Phi 3/3.5/4

Model Weights, MIT License

Apple

OpenELM

Model Weights, Training Code, Apple License

NVIDIA

Nemotron/Minitron

Architecture, Training Code, Training Process, Model Weights, NVIDIA Open Model License

Alibaba

Qwen 2/2.5/3

Model Weights, Apache-2.0

Open source communities successfully pressured Big Tech to converge on open-weight releases, fundamentally shifting the AI landscape from closed APIs to permissionless innovation ecosystems.

The Open Ecosystem

Core Open Tools & Frameworks by Stage

Distribution & Training

PyTorch
TensorFlow
Megatron
Unsloth
Hugging Face Transformers
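On the distribution side, any of these frameworks can consume model repositories pulled from the Hugging Face Hub. A minimal sketch; the repo id is illustrative:

```python
# Downloading a full model repository (weights, config, tokenizer) from the Hugging Face Hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="HuggingFaceTB/SmolLM2-360M")
print(local_dir)  # path to the cached files, usable by any training or inference framework
```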

Local Inference

llama.cpp
Ollama
LM Studio
MLX
ComfyUI
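As an example of local inference, Ollama serves pulled models through a simple HTTP API on localhost; a minimal sketch, assuming a model such as llama3.2 has already been pulled:

```python
# Querying a locally served open model through Ollama's REST API (default port 11434).
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.2", "prompt": "Explain open weights in one sentence.", "stream": False},
)
print(resp.json()["response"])
```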

Production Inference

vLLM
SGLang
TGI
Diffusers
ONNX
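On the production side, vLLM runs open weights with batched, high-throughput generation; a minimal offline sketch with an illustrative model id:

```python
# Offline batched generation with vLLM; the same engine backs its OpenAI-compatible server.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["What does open-source AI mean?"], params)
print(outputs[0].outputs[0].text)
```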

Application Dev

LangChain
LlamaIndex
OpenAI Agents SDK
Haystack
Agno
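At the application layer, OpenAI-compatible clients can point at a locally served open model instead of a closed API; a minimal sketch, assuming an Ollama server is running on its default port:

```python
# Any OpenAI-compatible client can target a locally served open model instead of a closed API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key is ignored by the local server
reply = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Summarize the four freedoms of open source AI."}],
)
print(reply.choices[0].message.content)
```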
Open source tools democratize AI development, breaking down barriers from research to production

Who Released the Most Open Models?

China · Europe · U.S. · Others

China's Leading Open Models

DeepSeek-R1/V3

MIT License

Reasoning models with downloadable weights and distilled variants

Qwen3

Apache-2.0

Alibaba's permissive foundation suite (text + coder + VL)

Kimi K2

Open-Weight

Moonshot's trillion-param MoE on Hugging Face

GLM-4.5

MIT License

Zhipu's agentic/coding focus with thinking modes

Export controls drove China's pivot to open source AI, enabling global reach

Multilingual AI

Open Source Democratizes Language Technology

Community-driven collaboration enables developers worldwide to freely access, modify, and contribute to models supporting underrepresented languages through shared datasets and fine-tuning pipelines.

Adaptation Techniques

Vocabulary Expansion

Adding language-specific tokens to base models

Continual Pre-training

Training on language-specific corpora

Instruction Fine-tuning

Task-specific adaptation with cultural context

LoRA Adaptation

Low-rank efficient fine-tuning on consumer hardware
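A minimal sketch of the last two techniques combined, extending the tokenizer with new language-specific tokens and attaching LoRA adapters via Hugging Face PEFT; the base model and token names are illustrative:

```python
# Vocabulary expansion + LoRA: extend the tokenizer, resize embeddings, train only small adapters.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "HuggingFaceTB/SmolLM2-1.7B"            # any open-weight base model would do
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Vocabulary expansion: add language-specific tokens, then grow the embedding matrix to match.
tokenizer.add_tokens(["<fa_token_1>", "<fa_token_2>"])   # placeholder tokens for illustration
model.resize_token_embeddings(len(tokenizer))

# LoRA: wrap the attention projections with low-rank adapters; only these small matrices are trained.
lora = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()             # typically well under 1% of total parameters
```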

Open source helps prevent the digital extinction of languages: community-led development closes performance gaps by 40-50% for low-resource languages, ensuring linguistic diversity thrives in an AI-driven world.