OSAIM
Open Source AI Models

Leaderboards

Self-reported benchmark scores compiled from model cards and papers. Higher is better. Numbers should be treated as guidance, not gospel — labs use slightly different evaluation harnesses. See methodology for sources.

MMLU

39 models

Massive Multitask Language Understanding — 57 academic subjects, 5-shot. Saturating among frontier models; included for legacy comparison.

| # | Model | Params | MMLU | Per B |
|---|-------|--------|------|-------|
| 1 | DeepSeek R1 · deepseek | 671B | 90.80 | 0.14 |
| 2 | DeepSeek V3 · deepseek | 671B | 88.50 | 0.13 |
| 3 | Qwen2.5 72B Instruct · qwen | 72B | 86.10 | 1.20 |
| 4 | Llama 3.3 70B Instruct · llama | 70B | 86.00 | 1.23 |
| 5 | Llama 3.2 90B Vision · llama | 90B | 86.00 | 0.96 |
| 6 | DeepSeek R1 Distill Llama 70B · deepseek | 70B | 86.00 | 1.23 |
| 7 | Llama 3.1 Nemotron 70B Instruct · nemotron | 70B | 85.00 | 1.21 |
| 8 | Phi-4 14B · phi | 14B | 84.80 | 6.06 |
| 9 | Qwen2.5 32B Instruct · qwen | 32B | 83.30 | 2.60 |
| 10 | Jamba 1.5 Large · jamba | 398B | 81.20 | 0.20 |
| 11 | Nemotron-4 340B Instruct · nemotron | 340B | 81.10 | 0.24 |
| 12 | Mistral Small 3 · mistral | 24B | 81.00 | 3.38 |
| 13 | Qwen2.5 14B Instruct · qwen | 14B | 79.70 | 5.69 |
| 14 | DeepSeek Coder V2 · deepseek | 236B | 79.20 | 0.34 |
| 15 | Phi-3 Medium 14B · phi | 14B | 78.00 | 5.57 |
| 16 | Mixtral 8×22B Instruct · mistral | 141B | 77.75 | 0.55 |
| 17 | Yi 1.5 34B Chat · yi | 34B | 76.80 | 2.26 |
| 18 | Command R+ · command | 104B | 75.70 | 0.73 |
| 19 | Gemma 2 27B · gemma | 27B | 75.20 | 2.79 |
| 20 | Qwen2.5 Coder 32B · qwen | 32B | 75.10 | 2.35 |
| 21 | QwQ 32B Preview · qwen | 32B | 75.00 | 2.34 |
| 22 | Qwen2.5 7B Instruct · qwen | 7B | 74.20 | 10.60 |
| 23 | Llama 3.2 11B Vision · llama | 11B | 73.00 | 6.64 |
| 24 | Grok 1 · grok | 314B | 73.00 | 0.23 |
| 25 | Gemma 2 9B · gemma | 9B | 71.30 | 7.92 |
| 26 | Mixtral 8×7B Instruct · mistral | 47B | 70.60 | 1.51 |
| 27 | Llama 3.1 8B Instruct · llama | 8B | 69.40 | 8.68 |
| 28 | Phi-3 Mini 4K Instruct · phi | 4B | 68.80 | 18.11 |
| 29 | Falcon 3 7B Instruct · falcon | 7B | 68.50 | 9.79 |
| 30 | Command R · command | 35B | 68.20 | 1.95 |
| 31 | Mistral Nemo 12B · mistral | 12B | 68.00 | 5.67 |
| 32 | OLMo 2 13B · olmo | 13B | 67.50 | 5.19 |
| 33 | OLMo 2 7B · olmo | 7B | 63.70 | 9.10 |
| 34 | Llama 3.2 3B · llama | 3B | 63.40 | 21.13 |
| 35 | Falcon Mamba 7B · falcon | 7B | 62.00 | 8.86 |
| 36 | Stable LM 2 12B · stablelm | 12B | 61.00 | 5.08 |
| 37 | Mistral 7B v0.3 · mistral | 7B | 60.10 | 8.59 |
| 38 | Gemma 2 2B · gemma | 3B | 51.30 | 19.73 |
| 39 | Llama 3.2 1B · llama | 1B | 49.30 | 49.30 |

HumanEval

35 models

OpenAI's Python coding benchmark — pass@1 on function completion. Saturating around 90+ for frontier models.
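For readers unfamiliar with the metric, here is a minimal sketch of how a pass@k score can be computed, using the unbiased estimator from the original HumanEval paper (Chen et al., 2021). With k = 1 it reduces to the fraction of generated samples that pass the unit tests. The function names and the sample counts below are illustrative assumptions; the labs reporting the scores in this table may sample and score differently (for example, a single greedy completion per problem).

```python
from math import comb
from statistics import mean


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: completions sampled, c: completions that passed the tests,
    k: attempt budget being scored.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over problems, on the 0-100 scale used in the table.

    `results` holds one hypothetical (n, c) pair per problem.
    """
    return 100.0 * mean(pass_at_k(n, c, k) for n, c in results)


# Hypothetical run: 3 problems, 10 samples each, with 10, 5 and 0 passes.
example = [(10, 10), (10, 5), (10, 0)]
score = benchmark_score(example, k=1)
print(f"pass@1 = {score:.1f}")
```

Averaging per-problem estimates, rather than pooling all samples, keeps each problem equally weighted regardless of how many completions were drawn for it.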

| # | Model | Params | HumanEval | Per B |
|---|-------|--------|-----------|-------|
| 1 | Qwen2.5 Coder 32B · qwen | 32B | 92.70 | 2.90 |
| 2 | DeepSeek Coder V2 · deepseek | 236B | 90.20 | 0.38 |
| 3 | Qwen2.5 32B Instruct · qwen | 32B | 88.40 | 2.76 |
| 4 | Llama 3.3 70B Instruct · llama | 70B | 88.40 | 1.26 |
| 5 | Qwen2.5 72B Instruct · qwen | 72B | 86.60 | 1.20 |
| 6 | DeepSeek R1 Distill Llama 70B · deepseek | 70B | 86.00 | 1.23 |
| 7 | Mistral Small 3 · mistral | 24B | 84.80 | 3.53 |
| 8 | Qwen2.5 7B Instruct · qwen | 7B | 84.80 | 12.11 |
| 9 | Llama 3.1 Nemotron 70B Instruct · nemotron | 70B | 84.00 | 1.20 |
| 10 | Qwen2.5 14B Instruct · qwen | 14B | 83.50 | 5.96 |
| 11 | DeepSeek V3 · deepseek | 671B | 82.60 | 0.12 |
| 12 | Phi-4 14B · phi | 14B | 82.60 | 5.90 |
| 13 | Mixtral 8×22B Instruct · mistral | 141B | 76.00 | 0.54 |
| 14 | Yi 1.5 34B Chat · yi | 34B | 75.20 | 2.21 |
| 15 | Nemotron-4 340B Instruct · nemotron | 340B | 73.20 | 0.22 |
| 16 | Llama 3.1 8B Instruct · llama | 8B | 72.60 | 9.07 |
| 17 | Jamba 1.5 Large · jamba | 398B | 71.30 | 0.18 |
| 18 | Command R+ · command | 104B | 70.70 | 0.68 |
| 19 | Mistral Nemo 12B · mistral | 12B | 64.40 | 5.37 |
| 20 | Grok 1 · grok | 314B | 63.20 | 0.20 |
| 21 | Phi-3 Medium 14B · phi | 14B | 62.20 | 4.44 |
| 22 | Phi-3 Mini 4K Instruct · phi | 4B | 59.10 | 15.55 |
| 23 | Falcon 3 7B Instruct · falcon | 7B | 56.70 | 8.10 |
| 24 | Command R · command | 35B | 53.70 | 1.53 |
| 25 | Gemma 2 27B · gemma | 27B | 51.80 | 1.92 |
| 26 | Llama 3.2 3B · llama | 3B | 51.50 | 17.17 |
| 27 | Gemma 2 9B · gemma | 9B | 40.20 | 4.47 |
| 28 | Mixtral 8×7B Instruct · mistral | 47B | 40.20 | 0.86 |
| 29 | Llama 3.2 1B · llama | 1B | 37.20 | 37.20 |
| 30 | Mistral 7B v0.3 · mistral | 7B | 30.50 | 4.36 |
| 31 | Falcon Mamba 7B · falcon | 7B | 29.90 | 4.27 |
| 32 | OLMo 2 13B · olmo | 13B | 28.70 | 2.21 |
| 33 | Stable LM 2 12B · stablelm | 12B | 27.40 | 2.28 |
| 34 | OLMo 2 7B · olmo | 7B | 22.60 | 3.23 |
| 35 | Gemma 2 2B · gemma | 3B | 17.70 | 6.81 |

MATH

32 models

Hendrycks competition mathematics. Exact-match grading. Reasoning models like DeepSeek R1 push 95+; non-reasoning frontier sits around 70–85.

| # | Model | Params | MATH | Per B |
|---|-------|--------|------|-------|
| 1 | DeepSeek R1 · deepseek | 671B | 97.30 | 0.15 |
| 2 | DeepSeek R1 Distill Llama 70B · deepseek | 70B | 94.50 | 1.35 |
| 3 | QwQ 32B Preview · qwen | 32B | 90.60 | 2.83 |
| 4 | DeepSeek V3 · deepseek | 671B | 84.00 | 0.13 |
| 5 | Qwen2.5 32B Instruct · qwen | 32B | 83.10 | 2.60 |
| 6 | Qwen2.5 72B Instruct · qwen | 72B | 83.10 | 1.15 |
| 7 | Phi-4 14B · phi | 14B | 80.40 | 5.74 |
| 8 | Qwen2.5 14B Instruct · qwen | 14B | 80.00 | 5.71 |
| 9 | Llama 3.3 70B Instruct · llama | 70B | 77.00 | 1.10 |
| 10 | DeepSeek Coder V2 · deepseek | 236B | 75.70 | 0.32 |
| 11 | Qwen2.5 7B Instruct · qwen | 7B | 75.50 | 10.79 |
| 12 | Mistral Small 3 · mistral | 24B | 70.60 | 2.94 |
| 13 | Llama 3.1 Nemotron 70B Instruct · nemotron | 70B | 67.40 | 0.96 |
| 14 | Nemotron-4 340B Instruct · nemotron | 340B | 65.50 | 0.19 |
| 15 | Qwen2.5 Coder 32B · qwen | 32B | 65.00 | 2.03 |
| 16 | Mistral Nemo 12B · mistral | 12B | 55.10 | 4.59 |
| 17 | Llama 3.1 8B Instruct · llama | 8B | 51.90 | 6.49 |
| 18 | Yi 1.5 34B Chat · yi | 34B | 50.10 | 1.47 |
| 19 | Llama 3.2 3B · llama | 3B | 48.00 | 16.00 |
| 20 | Gemma 2 27B · gemma | 27B | 42.30 | 1.57 |
| 21 | Phi-3 Medium 14B · phi | 14B | 41.80 | 2.99 |
| 22 | Mixtral 8×22B Instruct · mistral | 141B | 41.80 | 0.30 |
| 23 | Falcon 3 7B Instruct · falcon | 7B | 39.30 | 5.61 |
| 24 | Command R+ · command | 104B | 38.60 | 0.37 |
| 25 | Gemma 2 9B · gemma | 9B | 36.60 | 4.07 |
| 26 | Llama 3.2 1B · llama | 1B | 30.60 | 30.60 |
| 27 | Mixtral 8×7B Instruct · mistral | 47B | 28.40 | 0.61 |
| 28 | Phi-3 Mini 4K Instruct · phi | 4B | 28.00 | 7.37 |
| 29 | Command R · command | 35B | 26.60 | 0.76 |
| 30 | Grok 1 · grok | 314B | 23.90 | 0.08 |
| 31 | Mistral 7B v0.3 · mistral | 7B | 13.10 | 1.87 |
| 32 | Gemma 2 2B · gemma | 3B | 11.80 | 4.54 |

What does "Per B" mean?

Score divided by parameter count in billions, a rough efficiency metric. Models that punch above their weight on a benchmark (Phi-4 on reasoning, Qwen2.5 Coder 32B on code) climb this ranking. It is not a perfect measure (training data quality matters more than headline parameter count), but it is useful for spotting capable small models.
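As a minimal sketch of the arithmetic, the snippet below recomputes Per B for a few MMLU rows and re-ranks them by that value. The `per_b` helper and the row tuples are illustrative, not the site's actual code. One assumption is labeled in the comments: the Params column appears to be rounded for display, since Phi-3 Mini's listed 18.11 matches 68.80 / 3.8 rather than 68.80 / 4.

```python
def per_b(score: float, params_b: float) -> float:
    """Benchmark score divided by parameter count in billions (2 decimals)."""
    if params_b <= 0:
        raise ValueError("params_b must be positive")
    return round(score / params_b, 2)


# (model, parameters in billions, MMLU score) taken from the table above.
# Assumption: Phi-3 Mini uses 3.8B, the unrounded count that reproduces
# its listed Per B of 18.11; the table displays it as 4B.
rows = [
    ("DeepSeek R1", 671, 90.80),
    ("Phi-4 14B", 14, 84.80),
    ("Phi-3 Mini 4K Instruct", 3.8, 68.80),
    ("Llama 3.2 1B", 1, 49.30),
]

# Re-rank by efficiency instead of raw score.
ranked = sorted(rows, key=lambda r: per_b(r[2], r[1]), reverse=True)

for name, params_b, score in ranked:
    print(f"{name}: {per_b(score, params_b):.2f}")
# Llama 3.2 1B: 49.30
# Phi-3 Mini 4K Instruct: 18.11
# Phi-4 14B: 6.06
# DeepSeek R1: 0.14
```

Because the divisor is the parameter count, the ordering nearly inverts the raw-score ranking: Llama 3.2 1B has the highest Per B in all three tables. The column is most informative when comparing models of similar size.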