Leaderboards
Self-reported benchmark scores compiled from model cards and papers. Higher is better. Treat the numbers as guidance, not gospel: labs use slightly different evaluation harnesses, so small gaps between models may not be meaningful. See methodology for sources.
MMLU
39 models
Massive Multitask Language Understanding: 57 academic subjects, 5-shot. Saturating among frontier models; included for legacy comparison.
| # | Model | Params | MMLU | Per B |
|---|---|---|---|---|
| 1 | DeepSeek R1· deepseek | 671B | 90.80 | 0.14 |
| 2 | DeepSeek V3· deepseek | 671B | 88.50 | 0.13 |
| 3 | Qwen2.5 72B Instruct· qwen | 72B | 86.10 | 1.20 |
| 4 | Llama 3.3 70B Instruct· llama | 70B | 86.00 | 1.23 |
| 5 | Llama 3.2 90B Vision· llama | 90B | 86.00 | 0.96 |
| 6 | DeepSeek R1 Distill Llama 70B· deepseek | 70B | 86.00 | 1.23 |
| 7 | Llama 3.1 Nemotron 70B Instruct· nemotron | 70B | 85.00 | 1.21 |
| 8 | Phi-4 14B· phi | 14B | 84.80 | 6.06 |
| 9 | Qwen2.5 32B Instruct· qwen | 32B | 83.30 | 2.60 |
| 10 | Jamba 1.5 Large· jamba | 398B | 81.20 | 0.20 |
| 11 | Nemotron-4 340B Instruct· nemotron | 340B | 81.10 | 0.24 |
| 12 | Mistral Small 3· mistral | 24B | 81.00 | 3.38 |
| 13 | Qwen2.5 14B Instruct· qwen | 14B | 79.70 | 5.69 |
| 14 | DeepSeek Coder V2· deepseek | 236B | 79.20 | 0.34 |
| 15 | Phi-3 Medium 14B· phi | 14B | 78.00 | 5.57 |
| 16 | Mixtral 8×22B Instruct· mistral | 141B | 77.75 | 0.55 |
| 17 | Yi 1.5 34B Chat· yi | 34B | 76.80 | 2.26 |
| 18 | Command R+· command | 104B | 75.70 | 0.73 |
| 19 | Gemma 2 27B· gemma | 27B | 75.20 | 2.79 |
| 20 | Qwen2.5 Coder 32B· qwen | 32B | 75.10 | 2.35 |
| 21 | QwQ 32B Preview· qwen | 32B | 75.00 | 2.34 |
| 22 | Qwen2.5 7B Instruct· qwen | 7B | 74.20 | 10.60 |
| 23 | Llama 3.2 11B Vision· llama | 11B | 73.00 | 6.64 |
| 24 | Grok 1· grok | 314B | 73.00 | 0.23 |
| 25 | Gemma 2 9B· gemma | 9B | 71.30 | 7.92 |
| 26 | Mixtral 8×7B Instruct· mistral | 47B | 70.60 | 1.51 |
| 27 | Llama 3.1 8B Instruct· llama | 8B | 69.40 | 8.68 |
| 28 | Phi-3 Mini 4K Instruct· phi | 4B | 68.80 | 18.11 |
| 29 | Falcon 3 7B Instruct· falcon | 7B | 68.50 | 9.79 |
| 30 | Command R· command | 35B | 68.20 | 1.95 |
| 31 | Mistral Nemo 12B· mistral | 12B | 68.00 | 5.67 |
| 32 | OLMo 2 13B· olmo | 13B | 67.50 | 5.19 |
| 33 | OLMo 2 7B· olmo | 7B | 63.70 | 9.10 |
| 34 | Llama 3.2 3B· llama | 3B | 63.40 | 21.13 |
| 35 | Falcon Mamba 7B· falcon | 7B | 62.00 | 8.86 |
| 36 | Stable LM 2 12B· stablelm | 12B | 61.00 | 5.08 |
| 37 | Mistral 7B v0.3· mistral | 7B | 60.10 | 8.59 |
| 38 | Gemma 2 2B· gemma | 3B | 51.30 | 19.73 |
| 39 | Llama 3.2 1B· llama | 1B | 49.30 | 49.30 |
HumanEval
35 models
OpenAI's Python coding benchmark: pass@1 on function completion. Saturating around 90+ for frontier models.
| # | Model | Params | HumanEval | Per B |
|---|---|---|---|---|
| 1 | Qwen2.5 Coder 32B· qwen | 32B | 92.70 | 2.90 |
| 2 | DeepSeek Coder V2· deepseek | 236B | 90.20 | 0.38 |
| 3 | Qwen2.5 32B Instruct· qwen | 32B | 88.40 | 2.76 |
| 4 | Llama 3.3 70B Instruct· llama | 70B | 88.40 | 1.26 |
| 5 | Qwen2.5 72B Instruct· qwen | 72B | 86.60 | 1.20 |
| 6 | DeepSeek R1 Distill Llama 70B· deepseek | 70B | 86.00 | 1.23 |
| 7 | Mistral Small 3· mistral | 24B | 84.80 | 3.53 |
| 8 | Qwen2.5 7B Instruct· qwen | 7B | 84.80 | 12.11 |
| 9 | Llama 3.1 Nemotron 70B Instruct· nemotron | 70B | 84.00 | 1.20 |
| 10 | Qwen2.5 14B Instruct· qwen | 14B | 83.50 | 5.96 |
| 11 | DeepSeek V3· deepseek | 671B | 82.60 | 0.12 |
| 12 | Phi-4 14B· phi | 14B | 82.60 | 5.90 |
| 13 | Mixtral 8×22B Instruct· mistral | 141B | 76.00 | 0.54 |
| 14 | Yi 1.5 34B Chat· yi | 34B | 75.20 | 2.21 |
| 15 | Nemotron-4 340B Instruct· nemotron | 340B | 73.20 | 0.22 |
| 16 | Llama 3.1 8B Instruct· llama | 8B | 72.60 | 9.07 |
| 17 | Jamba 1.5 Large· jamba | 398B | 71.30 | 0.18 |
| 18 | Command R+· command | 104B | 70.70 | 0.68 |
| 19 | Mistral Nemo 12B· mistral | 12B | 64.40 | 5.37 |
| 20 | Grok 1· grok | 314B | 63.20 | 0.20 |
| 21 | Phi-3 Medium 14B· phi | 14B | 62.20 | 4.44 |
| 22 | Phi-3 Mini 4K Instruct· phi | 4B | 59.10 | 15.55 |
| 23 | Falcon 3 7B Instruct· falcon | 7B | 56.70 | 8.10 |
| 24 | Command R· command | 35B | 53.70 | 1.53 |
| 25 | Gemma 2 27B· gemma | 27B | 51.80 | 1.92 |
| 26 | Llama 3.2 3B· llama | 3B | 51.50 | 17.17 |
| 27 | Gemma 2 9B· gemma | 9B | 40.20 | 4.47 |
| 28 | Mixtral 8×7B Instruct· mistral | 47B | 40.20 | 0.86 |
| 29 | Llama 3.2 1B· llama | 1B | 37.20 | 37.20 |
| 30 | Mistral 7B v0.3· mistral | 7B | 30.50 | 4.36 |
| 31 | Falcon Mamba 7B· falcon | 7B | 29.90 | 4.27 |
| 32 | OLMo 2 13B· olmo | 13B | 28.70 | 2.21 |
| 33 | Stable LM 2 12B· stablelm | 12B | 27.40 | 2.28 |
| 34 | OLMo 2 7B· olmo | 7B | 22.60 | 3.23 |
| 35 | Gemma 2 2B· gemma | 3B | 17.70 | 6.81 |
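For readers unfamiliar with pass@1, the following is a minimal sketch of how such a score is commonly computed, using the standard unbiased pass@k estimator. The function names and the toy results are hypothetical, and this page does not say how each lab sampled completions, so this is an illustration rather than a description of any lab's harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over all problems, reported as a percentage."""
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical results: (samples drawn, samples passing all unit tests) per problem.
toy_results = [(1, 1), (1, 0), (1, 1), (1, 1)]
print(benchmark_score(toy_results, k=1))  # → 75.0 (3 of 4 problems solved)
```

With one sample per problem, pass@1 reduces to the fraction of problems whose single completion passes every test.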
MATH
32 models
Hendrycks competition mathematics. Exact-match grading. Reasoning models score above 90 here, with DeepSeek R1 at 97.30; the strongest non-reasoning models sit around 70–85.
| # | Model | Params | MATH | Per B |
|---|---|---|---|---|
| 1 | DeepSeek R1· deepseek | 671B | 97.30 | 0.15 |
| 2 | DeepSeek R1 Distill Llama 70B· deepseek | 70B | 94.50 | 1.35 |
| 3 | QwQ 32B Preview· qwen | 32B | 90.60 | 2.83 |
| 4 | DeepSeek V3· deepseek | 671B | 84.00 | 0.13 |
| 5 | Qwen2.5 32B Instruct· qwen | 32B | 83.10 | 2.60 |
| 6 | Qwen2.5 72B Instruct· qwen | 72B | 83.10 | 1.15 |
| 7 | Phi-4 14B· phi | 14B | 80.40 | 5.74 |
| 8 | Qwen2.5 14B Instruct· qwen | 14B | 80.00 | 5.71 |
| 9 | Llama 3.3 70B Instruct· llama | 70B | 77.00 | 1.10 |
| 10 | DeepSeek Coder V2· deepseek | 236B | 75.70 | 0.32 |
| 11 | Qwen2.5 7B Instruct· qwen | 7B | 75.50 | 10.79 |
| 12 | Mistral Small 3· mistral | 24B | 70.60 | 2.94 |
| 13 | Llama 3.1 Nemotron 70B Instruct· nemotron | 70B | 67.40 | 0.96 |
| 14 | Nemotron-4 340B Instruct· nemotron | 340B | 65.50 | 0.19 |
| 15 | Qwen2.5 Coder 32B· qwen | 32B | 65.00 | 2.03 |
| 16 | Mistral Nemo 12B· mistral | 12B | 55.10 | 4.59 |
| 17 | Llama 3.1 8B Instruct· llama | 8B | 51.90 | 6.49 |
| 18 | Yi 1.5 34B Chat· yi | 34B | 50.10 | 1.47 |
| 19 | Llama 3.2 3B· llama | 3B | 48.00 | 16.00 |
| 20 | Gemma 2 27B· gemma | 27B | 42.30 | 1.57 |
| 21 | Phi-3 Medium 14B· phi | 14B | 41.80 | 2.99 |
| 22 | Mixtral 8×22B Instruct· mistral | 141B | 41.80 | 0.30 |
| 23 | Falcon 3 7B Instruct· falcon | 7B | 39.30 | 5.61 |
| 24 | Command R+· command | 104B | 38.60 | 0.37 |
| 25 | Gemma 2 9B· gemma | 9B | 36.60 | 4.07 |
| 26 | Llama 3.2 1B· llama | 1B | 30.60 | 30.60 |
| 27 | Mixtral 8×7B Instruct· mistral | 47B | 28.40 | 0.61 |
| 28 | Phi-3 Mini 4K Instruct· phi | 4B | 28.00 | 7.37 |
| 29 | Command R· command | 35B | 26.60 | 0.76 |
| 30 | Grok 1· grok | 314B | 23.90 | 0.08 |
| 31 | Mistral 7B v0.3· mistral | 7B | 13.10 | 1.87 |
| 32 | Gemma 2 2B· gemma | 3B | 11.80 | 4.54 |
What does "Per B" mean?
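Score divided by parameter count in billions, a rough efficiency metric. Models that punch above their weight on a benchmark (Phi-4 on reasoning, Qwen2.5 Coder 32B on code) rank higher on it than similarly sized peers. It is not a perfect measure: training data quality matters more than headline parameter count, and dividing by size mechanically favors the smallest models (Llama 3.2 1B has the highest Per B in all three tables). It is still useful for spotting capable small models.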
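The calculation can be sketched in a few lines of Python. The `Entry` class and function names are hypothetical; the sample rows are copied from the MMLU table. One caveat, inferred from the numbers rather than stated on the page: the Params column appears to be rounded to whole billions while Per B seems to use the unrounded count (Gemma 2 2B is listed as 3B, but its MMLU Per B of 19.73 corresponds to roughly 2.6B), so recomputing from the displayed Params will not always reproduce the column exactly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    model: str
    params_b: float  # parameter count in billions
    score: float     # benchmark score on a 0-100 scale


def per_b(entry: Entry) -> float:
    """Score divided by parameters in billions, rounded to two decimals."""
    if entry.params_b <= 0:
        raise ValueError("parameter count must be positive")
    return round(entry.score / entry.params_b, 2)


def rank_by_per_b(entries: list[Entry]) -> list[tuple[str, float]]:
    """Return (model, Per B) pairs, highest Per B first."""
    pairs = [(e.model, per_b(e)) for e in entries]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# A few MMLU rows copied from the table on this page.
mmlu_sample = [
    Entry("DeepSeek R1", 671, 90.80),
    Entry("Qwen2.5 72B Instruct", 72, 86.10),
    Entry("Phi-4 14B", 14, 84.80),
    Entry("Llama 3.2 1B", 1, 49.30),
]

for model, value in rank_by_per_b(mmlu_sample):
    print(f"{model:<22} {value:>6.2f}")
```

Sorting this sample by Per B reverses the score ordering: Llama 3.2 1B comes first and DeepSeek R1 last, which shows how strongly the metric rewards small parameter counts.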