Glossary

Plain-English definitions of the benchmarks, architecture terms and quantization formats used across the directory. If a term on a model page is unfamiliar, it probably lives here.

MMLU (Massive Multitask Language Understanding): Multiple-choice exam covering 57 academic subjects. Standard headline benchmark for general knowledge, usually reported with 5-shot prompting. A score around 70 is roughly GPT-3.5 territory; 80+ is frontier-class.
MMLU-Pro: Harder, deduplicated MMLU with 10 answer options per question (vs 4 in the original MMLU). Designed to discriminate among frontier models.
HumanEval: OpenAI's 164-problem Python coding benchmark. Scored as pass@1: the percentage of problems solved correctly on the first try. Frontier models now score 90+, so the benchmark is close to saturated and newer code benchmarks (LiveCodeBench, BigCodeBench) discriminate better.
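As a rough illustration of how a pass@1 figure can be computed when several samples are drawn per problem, the sketch below uses the unbiased pass@k estimator described in the original HumanEval paper. The function name and the per-problem counts are hypothetical and not taken from any particular evaluation harness.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed the tests."""
    if n - c < k:
        return 1.0  # any k samples drawn must include at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical results: (samples drawn, samples that passed) for four problems.
results = [(10, 10), (10, 5), (10, 0), (10, 2)]

# The benchmark score is the mean over problems, expressed as a percentage.
score = 100 * sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
print(f"pass@1 = {score:.1f}%")
```

With k = 1 the estimator reduces to the fraction of samples that passed, averaged over problems.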
MATH: Competition-mathematics benchmark from Hendrycks et al. 12,500 problems drawn from competitions such as AMC 10, AMC 12 and AIME. Exact-match grading on the final answer. Pre-reasoning-model frontier was ~70; reasoning models like DeepSeek R1 push 95+ (usually reported on the MATH-500 subset).
GSM8K: Grade-school math word problems. About 8,500 multi-step problems. Largely saturated by modern frontier models; included here for legacy comparisons.
GPQA (Graduate-Level Google-Proof Q&A): Physics, chemistry and biology questions written by domain experts and designed to be unsolvable with a Google search. 448 questions in the main set; the commonly reported Diamond subset has 198. Strong test of genuine reasoning.
Parameters (B): Number of learned weights in the model, in billions. Drives memory footprint and compute cost. Doesn't perfectly predict quality — a well-trained 14B (Phi-4) often beats a less well-trained 70B.
Active parameters: In a mixture-of-experts (MoE) model, only a subset of the total parameters is used per token. A '141B / 39B active' model has 141B total parameters but routes each token through only 39B.
Context length: Maximum number of tokens the model can consider at once (prompt + generation). 128K = ~96,000 English words; 256K = ~192,000. Effective context (where the model actually pays attention to all of it) is often shorter than advertised.
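Token: Subword unit the model operates on. Roughly 0.75 words (about 4 characters) per token for English; code and non-English text usually tokenize less efficiently, so the same amount of text costs more tokens. API pricing is usually quoted per million tokens, and rate limits are typically counted in tokens as well.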
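The word-per-token rule of thumb can be turned into a quick estimator. This is only a sketch: the 0.75 ratio is a heuristic for English prose, real counts depend on the tokenizer, and the price used below is a made-up figure for illustration.

```python
WORDS_PER_TOKEN = 0.75  # heuristic for English prose; not exact for any tokenizer

def tokens_to_words(tokens: int) -> int:
    """Approximate English word count that fits in a given token budget."""
    return round(tokens * WORDS_PER_TOKEN)

def words_to_tokens(words: int) -> int:
    """Approximate token count for a given number of English words."""
    return round(words / WORDS_PER_TOKEN)

def cost_usd(tokens: int, price_per_million: float) -> float:
    """Cost of a token count at a per-million-token price."""
    return tokens / 1_000_000 * price_per_million

context_words = tokens_to_words(128_000)   # words that fit in a 128K context
essay_tokens = words_to_tokens(3_000)      # tokens for a 3,000-word document
essay_cost = cost_usd(essay_tokens, 2.0)   # hypothetical price: $2 per million tokens
print(context_words, essay_tokens, essay_cost)
```

For exact counts, run the text through the tokenizer of the model in question rather than relying on the ratio.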
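KV cache: Memory the model uses to store the keys and values of previous tokens in the context. Grows linearly with sequence length and batch size, and can dominate GPU memory at long context. Size depends on the architecture: a Llama-style 70B model with grouped-query attention (80 layers, 8 KV heads, head dimension 128) needs about 10 GB of fp16 KV cache for a single 32K-token sequence.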
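KV cache size depends heavily on the attention layout, so quoted figures vary from model to model. A minimal sketch of the usual back-of-envelope formula follows; the layer count, head counts and head dimension are assumptions modelled on a Llama-style 70B configuration, and the function name is hypothetical.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2, batch=1):
    """Bytes needed to cache keys and values for `batch` sequences of `seq_len` tokens."""
    # Leading factor 2: one key tensor plus one value tensor per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * seq_len * batch

GIB = 2**30

# Assumed config: 80 layers, head_dim 128, fp16 cache (2 bytes per value), 32K tokens.
gqa = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=32_768)   # grouped-query attention
mha = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, seq_len=32_768)  # full multi-head attention

print(f"GQA, 8 KV heads:  {gqa / GIB:.0f} GiB")
print(f"MHA, 64 KV heads: {mha / GIB:.0f} GiB")
```

The eightfold gap between the two configurations is why most recent large models use grouped-query attention or a similar scheme.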
MoE (Mixture of Experts): Architecture where a router picks a small subset of 'expert' subnetworks per token. Yields large parameter counts (good for quality) with low per-token compute (good for speed). Mixtral and DeepSeek V3 are the headline open-weights MoEs.
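The routing step can be illustrated with a toy top-k router. This is a simplified sketch, not any specific model's implementation: the experts are stand-in scalar functions, the router logits are passed in directly (a real model computes them from the token with a learned linear layer), and the top-2 choice with softmax over the selected logits is one common scheme among several.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(router_logits, top_k=2):
    """Return [(expert_index, weight)] for the top_k experts; weights sum to 1."""
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))

def moe_layer(x, experts, router_logits, top_k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    return sum(w * experts[i](x) for i, w in route(router_logits, top_k))

# Four toy experts; only two of them are evaluated for this token.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
logits = [0.1, 2.0, -1.0, 1.0]  # hypothetical router scores for one token

picked = route(logits)
output = moe_layer(3.0, experts, logits)
print(picked, output)
```

Because the unselected experts are never called, per-token compute scales with top_k rather than with the total number of experts.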
fp16 (half precision): 16-bit floating-point weights. The default 'full precision' for inference. Each parameter takes 2 bytes; a 70B model needs ~140 GB of VRAM at fp16 for the weights alone.
fp8: 8-bit floating-point. Halves VRAM vs fp16 with minimal quality loss on modern hardware (H100, MI300). Increasingly common for production inference.
Quantization: Storing weights at lower precision to fit larger models on smaller hardware. Tradeoff: lower memory use and usually faster inference, at the cost of some quality loss. Common levels: Q8 (negligible loss), Q5_K_M (good balance), Q4_K_M (roughly 3× smaller than fp16, ~1-3% quality drop), Q3 (visible degradation).
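Q4_K_M / Q5_K_M / Q8_0: GGUF quantization formats from llama.cpp. The number is the nominal bits per weight (the effective figure is slightly higher once per-block scales are counted); K_M means 'k-quant, medium' (a mix that keeps some tensors at higher precision). Q4_K_M is the modern default for local inference.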
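Weight memory follows directly from parameter count and bits per weight. The sketch below is a rough estimator only: the fp16 and fp8 figures are exact by definition, while the GGUF values are approximate effective bits per weight assumed for illustration (they sit slightly above the number in the format name and vary by model). It ignores KV cache and runtime overhead.

```python
# Bits per weight. The GGUF entries are approximations assumed for this sketch.
BITS_PER_WEIGHT = {
    "fp16": 16.0,
    "fp8": 8.0,
    "Q8_0": 8.5,     # 8-bit values plus a per-block scale
    "Q5_K_M": 5.7,   # approximate
    "Q4_K_M": 4.85,  # approximate
}

def weight_gb(params_billion: float, fmt: str) -> float:
    """Approximate size of the weights in GB (10^9 bytes) for a given format."""
    return params_billion * BITS_PER_WEIGHT[fmt] / 8

for fmt in BITS_PER_WEIGHT:
    size = weight_gb(70, fmt)
    ratio = weight_gb(70, "fp16") / size
    print(f"70B at {fmt:7s}: {size:6.1f} GB  ({ratio:.1f}x smaller than fp16)")
```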
GGUF: Single-file model format used by llama.cpp, Ollama, LM Studio and most local-inference tooling. Successor to GGML. Files are typically named like 'model-Q4_K_M.gguf'.
Distillation: Training a small model to mimic a larger 'teacher' model's outputs. DeepSeek R1 Distill Llama 70B is R1's reasoning distilled into a Llama 3.3 70B body — much cheaper to run than full R1.
Instruction-tuned (Instruct): Fine-tuned with supervised examples of following instructions, often plus RLHF. The 'Instruct' suffix on a model name means it's ready to chat; the bare base model is for further fine-tuning.
RLHF (Reinforcement Learning from Human Feedback): Training method where humans rank model outputs and the model learns to prefer higher-ranked behaviour. Largely responsible for ChatGPT-style helpfulness.
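LoRA (Low-Rank Adaptation): Fine-tuning method that trains only small low-rank adapter matrices instead of all weights. Combined with 4-bit quantization of the frozen base model (QLoRA), it brings fine-tuning a 70B-class model within reach of a single 48 GB GPU, and smaller models within reach of a 4090; the resulting LoRA is typically tens to a few hundred MB and can be merged or loaded at runtime.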
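The size of a LoRA adapter can be estimated from the rank and the shapes of the adapted matrices. The numbers below are assumptions chosen for illustration: a hypothetical model with hidden size 8192 and 80 layers, rank 16, and adapters on four square attention projections per layer (a simplification that ignores the smaller key/value projections used with grouped-query attention).

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for one adapted matrix: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

# Hypothetical configuration, not tied to a specific released model.
hidden, layers, rank, matrices_per_layer = 8192, 80, 16, 4

per_matrix = lora_params(hidden, hidden, rank)
total = per_matrix * matrices_per_layer * layers
full_matrix = hidden * hidden  # parameters in one full weight matrix

print(f"adapter params per matrix: {per_matrix:,} (full matrix: {full_matrix:,})")
print(f"total adapter params: {total:,}")
print(f"adapter size at fp16: {total * 2 / 1e6:.0f} MB")
```

Under these assumptions each adapter is 256 times smaller than the matrix it modifies, which is where the small file sizes come from.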
Open source vs open weights: The OSI definition of 'open source' forbids field-of-use restrictions. Llama, Gemma and Qwen 72B don't qualify; they're 'open weights' or 'source-available'. Models released under Apache 2.0 or MIT (Mistral 7B, Mixtral, DeepSeek R1, Phi-4) carry genuinely open-source licences.
Reasoning model: Model trained to generate a long internal chain of thought before answering. Trades latency for accuracy. Open examples: DeepSeek R1, QwQ. Closed examples: OpenAI o1, o3.