OSAIM
Open Source AI Models

Glossary

Plain-English definitions of the benchmarks, architecture terms and quantization formats used across the directory. If a term on a model page is unfamiliar, it probably lives here.

MMLU (Massive Multitask Language Understanding)
Multiple-choice exam spanning 57 academic subjects. The standard headline benchmark for general knowledge, usually reported with 5-shot prompting. A score around 70 is roughly GPT-3.5 territory; 80+ is frontier-class.
MMLU-Pro
Harder, deduplicated MMLU with 10 answer options per question (vs 4 in original MMLU). Designed to discriminate among frontier models.
HumanEval
OpenAI's 164-problem Python coding benchmark. Scored as pass@1: the percentage of problems solved correctly on the first try. Frontier models now score 90+, so the benchmark is saturating and newer code benchmarks (LiveCodeBench, BigCodeBench) discriminate better.
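As a rough illustration of how a pass@1 score can be computed, the sketch below uses the unbiased pass@k estimator commonly paired with HumanEval. The function name and the per-problem sample counts are made up for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem: n samples generated, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical results: (samples generated, samples that passed the unit tests)
results = [(10, 10), (10, 3), (10, 0), (10, 5)]

# Benchmark score = mean of the per-problem estimates, as a percentage
score = 100 * sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
print(f"pass@1 = {score:.1f}%")
```

With k = 1 the estimator reduces to the fraction of correct samples per problem, which is why pass@1 reads as "solved on the first try".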
MATH
Hendrycks competition-mathematics benchmark. 12,500 problems drawn from competitions such as AMC 10, AMC 12 and AIME. Exact-match grading on the final answer. Pre-reasoning-model frontier was ~70; reasoning models like DeepSeek R1 push 95+.
GSM8K
Grade-school math word problems. 8,500 multi-step problems. Largely saturated by modern frontier models; included here for legacy comparisons.
GPQA (Graduate-Level Google-Proof Q&A)
Physics, chemistry and biology questions written by domain experts and designed to be unsolvable by Google search. 448 questions in the main set; the commonly reported Diamond subset has 198. Strong test of genuine reasoning.
Parameters (B)
Number of learned weights in the model, in billions. Drives memory footprint and compute cost. Doesn't perfectly predict quality — a well-trained 14B (Phi-4) often beats a less well-trained 70B.
Active parameters
In a mixture-of-experts (MoE) model, only a subset of the total parameters is used per token. A '141B / 39B active' model has 141B total parameters but routes each token through only 39B.
Context length
Maximum number of tokens the model can consider at once (prompt + generation). 128K = ~96,000 English words; 256K = ~192,000. Effective context (where the model actually pays attention to all of it) is often shorter than advertised.
Token
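Subword unit the model operates on. English averages ~0.75 words (roughly 4 characters) per token; code and non-English text often tokenize less efficiently, so the same amount of text costs more tokens. Pricing is usually quoted per million tokens, and rate limits are counted in tokens.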
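The rules of thumb above can be turned into quick back-of-envelope estimates. This is a minimal sketch: the helper names and the $2.00 price are hypothetical, and the 0.75 ratio only applies to English prose.

```python
WORDS_PER_TOKEN = 0.75  # glossary rule of thumb, English text only

def tokens_to_words(tokens: int) -> int:
    """Approximate English word count that fits in a given token budget."""
    return round(tokens * WORDS_PER_TOKEN)

def cost_usd(tokens: int, price_per_million: float) -> float:
    """Cost of a request when pricing is quoted per million tokens."""
    return tokens / 1_000_000 * price_per_million

print(tokens_to_words(128_000))   # words in a 128K context
print(tokens_to_words(256_000))   # words in a 256K context
print(cost_usd(250_000, 2.00))    # hypothetical price of $2.00 per million tokens
```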
KV cache
Memory the model uses to store the keys and values from previous tokens in the context. Grows linearly with sequence length and often dominates GPU memory at long context: a Llama-3-style 70B model with grouped-query attention needs roughly 10 GB of fp16 KV cache for a 32K context, and several times more without it.
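The actual KV cache size varies a lot with architecture, so it helps to compute it directly. The sketch below assumes a Llama-3-70B-style shape (80 layers, head dimension 128, 8 KV heads with grouped-query attention versus 64 without) and fp16 cache entries; the function name is illustrative.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2, batch: int = 1) -> int:
    """Size of the KV cache: one key and one value vector per layer, head and token."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value * batch

GIB = 2 ** 30

# Assumed shapes, not read from any model file
gqa = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=32_768)
mha = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, seq_len=32_768)

print(f"with GQA (8 KV heads):     {gqa / GIB:.0f} GiB")
print(f"without GQA (64 KV heads): {mha / GIB:.0f} GiB")
```

Doubling `seq_len` or `batch` doubles the result, which is the linear growth described above.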
MoE (Mixture of Experts)
Architecture where a router picks a small subset of 'expert' subnetworks per token. Yields large parameter counts (good for quality) with low per-token compute (good for speed). Mixtral and DeepSeek V3 are the headline open-weights MoEs.
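To make the routing idea concrete, here is a toy top-k router in plain Python. It is a sketch under simplifying assumptions: the 'experts' are scalar functions, the logits are hard-coded rather than produced by a learned router, and renormalising with a softmax over the selected experts is one common choice rather than the only one.

```python
import math

def top_k_route(router_logits, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Softmax over the selected experts only, so their weights sum to 1
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_layer(x, experts, router_logits, k=2):
    """Weighted sum of the chosen experts' outputs; unchosen experts never run."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))

# Toy setup: 8 hypothetical experts, expert s simply multiplies its input by s
experts = [lambda x, s=s: s * x for s in range(1, 9)]
logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]

routes = top_k_route(logits)
y = moe_layer(3.0, experts, logits)
print(routes, y)
```

Only 2 of the 8 experts execute for this token, which is the gap between total and active parameters.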
fp16 (half precision)
16-bit floating-point weights. The default 'full precision' for inference. Each parameter takes 2 bytes; a 70B model needs ~140 GB of VRAM at fp16.
fp8
8-bit floating-point. Halves VRAM vs fp16 with minimal quality loss on modern hardware (H100, MI300). Increasingly common for production inference.
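The VRAM figures for different precisions follow from simple arithmetic: parameters times bytes per parameter. A minimal sketch, counting weights only (no KV cache, activations or framework overhead) and using a hypothetical helper name:

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8

for name, bits in [("fp16", 16), ("fp8", 8), ("4-bit (nominal)", 4)]:
    print(f"70B at {name}: ~{weight_memory_gb(70, bits):.0f} GB")
```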
Quantization
Storing weights at lower precision to fit larger models on smaller hardware. Tradeoff: smaller memory footprint and faster inference in exchange for a slight quality loss. Common levels: Q8 (negligible loss), Q5_K_M (good balance), Q4_K_M (roughly 3-4× smaller than fp16, ~1-3% quality drop), Q3 (visible degradation).
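The memory-versus-quality tradeoff can be seen in a toy round trip. The sketch below uses plain symmetric absmax quantization on one small block of made-up weights; it is not the scheme llama.cpp's k-quants use, only the simplest member of the same family.

```python
def quantize_block(weights, bits=4):
    """Symmetric absmax quantization of one block of floats to signed integers."""
    qmax = 2 ** (bits - 1) - 1                       # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid a zero scale
    return [round(w / scale) for w in weights], scale

def dequantize_block(q, scale):
    """Recover approximate floats from the stored integers and the block scale."""
    return [v * scale for v in q]

block = [0.12, -0.50, 0.33, 0.07, -0.21, 0.44, -0.02, 0.29]  # made-up weights

errors = {}
for bits in (8, 4, 3):
    q, scale = quantize_block(block, bits)
    restored = dequantize_block(q, scale)
    errors[bits] = max(abs(a - b) for a, b in zip(block, restored))
    print(f"{bits}-bit: max abs error {errors[bits]:.4f}")
```

Fewer bits mean a coarser grid between the block's minimum and maximum, so the worst-case rounding error grows as precision drops.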
Q4_K_M / Q5_K_M / Q8_0
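GGUF quantization formats from llama.cpp. The number is the nominal bits per weight (the effective figure is slightly higher once per-block scales are counted); K_M means 'k-quant medium' (mixed precision per weight group), while Q8_0 is a simpler non-k format. Q4_K_M is the modern default for local inference.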
GGUF
Single-file model format used by llama.cpp, Ollama, LM Studio and most local-inference tooling. Successor to GGML. Files are typically named like 'model-Q4_K_M.gguf'.
Distillation
Training a small model to mimic a larger 'teacher' model's outputs. DeepSeek R1 Distill Llama 70B is R1's reasoning distilled into a Llama 3.3 70B body — much cheaper to run than full R1.
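In its classic form, "mimicking the teacher" means matching the teacher's output distribution. The sketch below shows that logit-matching loss with made-up logits and an assumed temperature of 2.0. Many LLM distillations, reportedly including the R1 Distill models, instead fine-tune the student directly on teacher-generated text, so treat this as one variant rather than the method used for any particular model.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax; higher temperature flattens the distribution."""
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) between the softened next-token distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [4.0, 1.0, 0.5, -2.0]      # hypothetical logits over a 4-token vocabulary
far_student = [0.0, 3.0, 0.0, 0.0]
near_student = [3.5, 1.2, 0.4, -1.5]

print(distillation_loss(teacher, far_student))
print(distillation_loss(teacher, near_student))
```

Training drives this loss toward zero, at which point the student's distribution matches the teacher's.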
Instruction-tuned (Instruct)
Fine-tuned with supervised examples of following instructions, often plus RLHF. The 'Instruct' suffix on a model name means it's ready to chat; the bare base model is for further fine-tuning.
RLHF (Reinforcement Learning from Human Feedback)
Training method where humans rank model outputs and the model learns to prefer higher-ranked behaviour. Largely responsible for ChatGPT-style helpfulness.
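One common way to turn human rankings into a training signal is a pairwise (Bradley-Terry style) loss on a reward model. The sketch below shows only that loss, with made-up reward scores; the reinforcement-learning step that then optimises the model against the reward is not shown.

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)): small when the preferred output scores higher."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical reward-model scores for a (preferred, rejected) pair of outputs
print(preference_loss(2.0, 0.0))   # reward model agrees with the human ranking
print(preference_loss(0.0, 0.0))   # reward model is indifferent
print(preference_loss(0.0, 2.0))   # reward model disagrees, so the loss is larger
```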
LoRA (Low-Rank Adaptation)
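Fine-tuning method that freezes the original weights and trains only small low-rank adapter matrices. Combined with 4-bit quantization of the base model (QLoRA), it brings fine-tuning a 65-70B model within reach of a single 48 GB GPU; the resulting LoRA is a few hundred MB that can be merged or loaded at runtime.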
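The size savings come from replacing a full weight update with two thin matrices. The sketch below shows the usual LoRA forward pass, y = Wx + (alpha / r)·B(Ax), in plain Python, plus a parameter count for one layer. The 8192 × 8192 layer shape, rank 16 and the helper names are assumptions for illustration.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(x, W, A, B, alpha):
    """y = W x + (alpha / r) * B (A x); W stays frozen, only A and B are trained."""
    r = len(A)                          # A is r x d_in, B is d_out x r
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + (alpha / r) * d for b, d in zip(base, delta)]

def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters in one adapter: A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)

# Tiny worked example with rank 1
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B = [[1.0], [2.0]]
y = lora_forward([1.0, 2.0], W, A, B, alpha=2.0)
print(y)

# Assumed layer shape: one 8192 x 8192 projection with a rank-16 adapter
full = 8192 * 8192
adapter = lora_params(8192, 8192, r=16)
print(f"full: {full:,}  adapter: {adapter:,}  ratio: {full // adapter}x")
```

Merging the adapter means adding (alpha / r)·BA into W once, after which inference costs the same as the base model.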
Open source vs open weights
OSI 'open source' requires no field-of-use restrictions. Llama, Gemma and Qwen 72B don't qualify — they're 'open weights' or 'source-available'. Apache 2.0 and MIT models (Mistral, DeepSeek R1, Phi-4) are genuinely open source.
Reasoning model
Model trained to generate long internal chains-of-thought before answering. Trades latency for accuracy. Open examples: DeepSeek R1, QwQ. Closed examples: OpenAI o1, o3.