Open Source AI

What Does Open Source AI Even Mean?

Collaborative Open-Source AI Network

Mohammad Shojaei, Applied AI Engineer

11 Sep 2025

Deconstructing an AI Model

The Complete AI Lifecycle: From Training to Model Weights

Prerequisites

Training Data
Raw knowledge the model learns from
Architecture
Network blueprint
Training Code
Recipe for the learning process

Training Process

Learning algorithms optimize parameters

Model Weights

Crystallized knowledge as numbers

Training transforms raw ingredients into learned knowledge
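To make this concrete, here is a minimal sketch of a training loop, using a toy PyTorch model and random data purely for illustration: the learning algorithm repeatedly nudges parameters, and the saved state dict is the resulting model weights.

```python
# Minimal illustration: gradient descent turns training data into model weights.
# The model, data, and hyperparameters here are toy placeholders, not a real recipe.
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                       # "architecture": a tiny network blueprint
data = [(torch.randn(10), torch.randn(1)) for _ in range(100)]  # "training data"
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for epoch in range(3):                         # "training process"
    for x, y in data:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()                        # the learning algorithm computes gradients
        optimizer.step()                       # parameters move toward lower loss

torch.save(model.state_dict(), "weights.pt")   # "model weights": knowledge crystallized as numbers
```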

The Four Freedoms Applied to AI

The Free Software Foundation's Four Freedoms provide a robust framework for understanding AI openness

Freedom 1: The Freedom to Run

Run AI systems for any purpose without restriction

Unrestricted access to model weights and inference code
Clear licensing that permits commercial and non-commercial use
Minimal technical barriers to deployment
Transparent system requirements and dependencies
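In practice, this freedom means anyone can pull openly licensed weights and run them locally with off-the-shelf tooling. A minimal sketch with Hugging Face Transformers; the checkpoint named here is just one example of a permissively licensed model:

```python
# Running an openly licensed model locally, no API key required.
from transformers import pipeline

generator = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-1.7B-Instruct")
print(generator("Open source AI matters because", max_new_tokens=50)[0]["generated_text"])
```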

Freedom 2: The Freedom to Study

Study how AI systems work and adapt them to your needs

Complete access to training code and methodologies
Comprehensive documentation of model architecture and design decisions
Access to training data or detailed descriptions when data cannot be shared
Transparent evaluation metrics and benchmarking procedures
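For an open model, studying can begin with inspecting the architecture configuration that ships alongside the weights. A small sketch, using a Pythia checkpoint as the example:

```python
# Inspecting the published architecture of an open model: the config travels with the weights.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("EleutherAI/pythia-1b")
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)
```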

Freedom 3: The Freedom to Redistribute

Redistribute copies of AI systems to help others

Permissive licensing for model weights and associated code
Clear guidelines for attribution and redistribution
Technical formats that support easy sharing and deployment
Community infrastructure for distribution and discovery

Freedom 4: The Freedom to Distribute Modified Versions

Distribute modified versions to benefit the community

Access to training infrastructure and methodologies
Permissive licensing for derivative works
Community standards for documenting modifications
Technical and legal frameworks supporting innovation

These freedoms ensure that AI systems remain accessible, transparent, shareable, and improvable for everyone

The Spectrum: From Locked Down to Actually Open

Understanding the four levels of AI transparency and what you actually get

🔒 Locked Down → ⚠️ Openwashing Zone → 🌟 Actually Open

Closed / API-Only

Examples:
GPT-5, Claude 4.1
Gemini 2.5, Midjourney
What You Get:
None of the five components: Training Data, Architecture, Training Code, Training Process, or Model Weights

Open-Weight

Examples:
Llama 3/4, DeepSeek-R1
Falcon, BLOOM, Whisper
What You Get:
Model Weights only (missing Training Data, Architecture, Training Code, Training Process)

Open-Source AI

Examples:
Mistral, DBRX
Pythia, Phi-3
What You Get:
Architecture, Training Code, Model Weights (Training Data often limited)

Radical Openness

Examples:
SmolLM, OLMo (AI2)
OpenThinker-7B
What You Get:
All components: Training Data, Architecture, Training Code, Training Process, Model Weights

The spectrum reveals a harsh reality: most "open" AI is actually openwashing

True openness requires complete transparency, permissive licensing, and reproducible methodology — not just model weights

The Gold Standard

Exemplars of True Openness

Pythia (EleutherAI)

70M–12B • Apache-2.0
Training Data: Public data, seen in the same order
Architecture: 16 model variants
Training Code: Exact dataloader reconstruction
Training Process: 154 intermediate checkpoints
Model Weights: Gold-standard reproducibility
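Those intermediate checkpoints are published as Hub revisions (named step0, step1000, and so on), so any point in the training run can be reloaded and compared. A brief sketch:

```python
# Loading one of Pythia's intermediate training checkpoints by revision name.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m", revision="step3000")
```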

OLMo (AI2)

1B–32B • Apache-2.0
Training Data: Dolma multitrillion-token corpus
Architecture: Full weights + data mixtures
Training Code: Training/eval code public
Training Process: Intermediate checkpoints
Model Weights: Artifact collections public

SmolLM (Hugging Face)

135M/360M/1.7B • Permissive
Training Data: SmolLM-Corpus (FineWeb-Edu)
Architecture: Small LLMs from scratch
Training Code: Training details documented
Training Process: 11T-token recipe (SmolLM2)
Model Weights: Eval harness setup

TinyLlama

1.1B • Open weights/code
Training Data: ~1T tokens from scratch
Architecture: Compact efficient design
Training Code: Public training code/recipe
Training Process: Leverages open tooling (Lit-GPT)
Model Weights: Paper and checkpoints public

Big Tech's Response to OSS Pressure

How Open Source Communities Forced Strategic Shifts

Company Responses: The Open-Weight Convergence

OpenAI

gpt-oss-20b/120b

Model Weights, Apache-2.0

Google

Gemma 1-3

Model Weights, Gemma Terms of Use

xAI

Grok 1-2

Model Weights, Architecture, Apache-2.0

Meta

Llama 1-4

Model Weights, Llama Community License

Microsoft

Phi 3/3.5/4

Model Weights, MIT License

Apple

OpenELM

Model Weights, Training Code, Apple License

NVIDIA

Nemotron/Minitron

Architecture, Training Code, Training Process, Model Weights, NVIDIA Open Model License

Alibaba

Qwen 2/2.5/3

Model Weights, Apache-2.0

Open source communities successfully pressured Big Tech to converge on open-weight releases, fundamentally shifting the AI landscape from closed APIs to permissionless innovation ecosystems.

The Open Ecosystem

Core Open Tools & Frameworks by Stage

Distribution & Training

PyTorch
TensorFlow
Megatron
Unsloth
Hugging Face Transformers
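On the distribution side, any of these frameworks can consume model repositories pulled from the Hugging Face Hub. A minimal sketch; the repo id is illustrative:

```python
# Downloading a full model repository (weights, config, tokenizer) from the Hugging Face Hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="HuggingFaceTB/SmolLM2-360M")
print(local_dir)  # path to the cached files, usable by any training or inference framework
```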

Local Inference

llama.cpp
Ollama
LM Studio
MLX
ComfyUI
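As an example of local inference, Ollama serves pulled models through a simple HTTP API on localhost; a minimal sketch, assuming a model such as llama3.2 has already been pulled:

```python
# Querying a locally served open model through Ollama's REST API (default port 11434).
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.2", "prompt": "Explain open weights in one sentence.", "stream": False},
)
print(resp.json()["response"])
```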

Production Inference

vLLM
SGLang
TGI
Diffusers
ONNX
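On the production side, vLLM runs open weights with batched, high-throughput generation; a minimal offline sketch with an illustrative model id:

```python
# Offline batched generation with vLLM; the same engine backs its OpenAI-compatible server.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["What does open-source AI mean?"], params)
print(outputs[0].outputs[0].text)
```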

Application Dev

LangChain
LlamaIndex
OpenAI Agents SDK
Haystack
Agno
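At the application layer, OpenAI-compatible clients can point at a locally served open model instead of a closed API; a minimal sketch, assuming an Ollama server is running on its default port:

```python
# Any OpenAI-compatible client can target a locally served open model instead of a closed API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key is ignored by the local server
reply = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Summarize the four freedoms of open source AI."}],
)
print(reply.choices[0].message.content)
```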
Open source tools democratize AI development, breaking down barriers from research to production

Who Released the Most Open Models?

China · Europe · U.S. · Others

China's Leading Open Models

DeepSeek-R1/V3

MIT License

Reasoning models with downloadable weights and distilled variants

Qwen3

Apache-2.0

Alibaba's permissive foundation suite (text + coder + VL)

Kimi K2

Open-Weight

Moonshot's trillion-param MoE on Hugging Face

GLM-4.5

MIT License

Zhipu's agentic/coding focus with thinking modes

Export controls drove China's pivot to open source AI, enabling global reach

Multilingual AI

Open Source Democratizes Language Technology

Community-driven collaboration enables developers worldwide to freely access, modify, and contribute to models supporting underrepresented languages through shared datasets and fine-tuning pipelines.

Adaptation Techniques

Vocabulary Expansion

Adding language-specific tokens to base models

Continual Pre-training

Training on language-specific corpora

Instruction Fine-tuning

Task-specific adaptation with cultural context

LoRA Adaptation

Low-rank efficient fine-tuning on consumer hardware
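A minimal sketch of the last two techniques combined, extending the tokenizer with new language-specific tokens and attaching LoRA adapters via Hugging Face PEFT; the base model and token names are illustrative:

```python
# Vocabulary expansion + LoRA: extend the tokenizer, resize embeddings, train only small adapters.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "HuggingFaceTB/SmolLM2-1.7B"            # any open-weight base model would do
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Vocabulary expansion: add language-specific tokens, then grow the embedding matrix to match.
tokenizer.add_tokens(["<fa_token_1>", "<fa_token_2>"])   # placeholder tokens for illustration
model.resize_token_embeddings(len(tokenizer))

# LoRA: wrap the attention projections with low-rank adapters; only these small matrices are trained.
lora = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()             # typically well under 1% of total parameters
```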

Open source helps prevent the digital extinction of languages: community-led development closes performance gaps by 40-50% for low-resource languages, ensuring linguistic diversity thrives in an AI-driven world.