OSAIM
Open Source AI Models


Best open-source reasoning models

Reasoning models trade latency for accuracy: they generate hundreds to thousands of tokens of internal 'thinking' before producing a final answer. The open-source frontier here is led by DeepSeek R1 and its distilled variants, plus Qwen's QwQ line.

What we optimise for

We're optimising for MATH, GPQA and HumanEval scores at the cost of higher per-call latency. These picks suit workloads where a slow correct answer beats a fast wrong one.

Why it matters

Reasoning models routinely outperform their non-reasoning parents on math and complex multi-step benchmarks by 20 or more points. The price is latency: responses are often 2–10× slower.

Our picks

  1. DeepSeek R1: The frontier of open reasoning. MIT licensed; rivals o1 on MATH.

  2. DeepSeek R1 Distill Llama 70B: Most accessible way to run R1-class reasoning locally; fits on a single 80 GB H100 at Q4.

  3. QwQ 32B Preview: Qwen's reasoning specialist; Apache 2.0; fits on a 4090 at Q4.

  4. Phi-4 14B: Not a 'thinking' model per se, but the best 14B on MATH and GPQA.

Things to watch out for

  • Reasoning model outputs include lengthy <think>...</think> blocks. Make sure your downstream tooling strips them before showing users.
  • Cost: a reasoning model that generates 5,000 internal tokens for a 200-token answer emits 5,200 output tokens, about 26× the output-token cost of a non-reasoning baseline that produces the same 200-token answer.
  • Streaming UX: hide the thinking until done, or surface it as a collapsible 'reasoning' panel.
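
As a rough illustration of the first two points, here is a minimal Python sketch. It assumes the model wraps its reasoning in literal <think>...</think> tags and that the response has finished streaming, so every tag is closed. The function names split_reasoning and output_token_multiplier are hypothetical helpers, not part of any library.

```python
import re

# Assumption: reasoning is delimited by literal <think>...</think> tags.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(raw: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a complete raw model response."""
    # Collect the text of every thinking block, in order.
    reasoning = "\n".join(m.strip() for m in THINK_BLOCK.findall(raw))
    # Whatever remains after removing the blocks is the user-facing answer.
    answer = THINK_BLOCK.sub("", raw).strip()
    return reasoning, answer


def output_token_multiplier(thinking_tokens: int, answer_tokens: int) -> float:
    """Total output tokens (thinking + answer) relative to the answer alone."""
    if answer_tokens <= 0:
        raise ValueError("answer_tokens must be positive")
    return (thinking_tokens + answer_tokens) / answer_tokens


raw = "<think>2 + 2 is 4, so double it.</think>\nThe answer is 8."
reasoning, answer = split_reasoning(raw)
print(answer)                              # The answer is 8.
print(output_token_multiplier(5000, 200))  # 26.0
```

Returning the reasoning separately, rather than discarding it, leaves room for the collapsible 'reasoning' panel: show the answer by default and render the reasoning only on demand. The multiplier counts output tokens only; input tokens and per-provider pricing are left out.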

All picks at a glance

DeepSeek R1
671B

Reasoning model trained with reinforcement learning on top of DeepSeek V3-Base. MIT licensed, weights included, which makes R1 one of the most permissively licensed frontier reasoning models. Generates long internal chains-of-thought before answering, trading latency for accuracy on math, code, and reasoning benchmarks. Distilled variants (e.g. R1 Distill Llama 70B) recover most of the quality at much smaller scales.

Context: 128K
License: MIT
VRAM Q4: 402.6 GB

DeepSeek R1 Distill Llama 70B
70B

R1 reasoning capabilities distilled into a Llama 3.3 70B base. The most accessible way to run R1-class reasoning locally: at Q4 the weights take about 42 GB, which fits on a single 80 GB H100 but not on a 24 GB 4090, and fp16 weights (roughly 140 GB) need more than one GPU. Inherits Llama 3's community licence (commercial use permitted below 700M monthly active users). A good pick for production reasoning workloads where the full R1 is too expensive to host but o1/R1-style quality is required.

Context: 128K
License: Llama 3 Community License
VRAM Q4: 42 GB

QwQ 32B Preview
32B

Qwen's reasoning-focused 'thinking' model. Generates long chains-of-thought before answering, similar to OpenAI's o1 and the DeepSeek R1 lineage. Optimised for math and competition-style problem solving. The Preview tag means Qwen is iterating quickly; later versions may supersede this one. Useful today for math-heavy workloads where a slow, careful answer is preferred to a fast wrong one.

Context: 33K
License: Apache 2.0
VRAM Q4: 19.2 GB

Phi-4 14B
14B

14B model trained primarily on synthetic data. Punches above its weight on reasoning, especially MATH and GPQA. MIT licensed. A standout choice when you want strong reasoning quality without paying 70B-tier hardware costs. Phi-4 demonstrated that careful synthetic-data curation can extract frontier-class reasoning from a relatively small dense model.

Context: 16K
License: MIT
VRAM Q4: 8.4 GB

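The four VRAM Q4 figures above all work out to about 0.6 GB per billion parameters (for example, 671 × 0.6 = 402.6). The sketch below reproduces them under that assumption. The ratio is inferred from the listed numbers, not an official formula, and the helper names are hypothetical. It covers weights only, so KV cache and activations need extra headroom.

```python
# Assumption: ~0.6 GB of weights per billion parameters at Q4,
# inferred from the figures listed on this page.
GB_PER_BILLION_PARAMS_Q4 = 0.6


def estimate_q4_vram_gb(params_billion: float) -> float:
    """Estimate Q4 weight memory in GB from parameter count in billions."""
    return round(params_billion * GB_PER_BILLION_PARAMS_Q4, 1)


def fits_in_vram(params_billion: float, gpu_vram_gb: float) -> bool:
    """Weights-only check; real deployments need headroom beyond this."""
    return estimate_q4_vram_gb(params_billion) <= gpu_vram_gb


picks = {
    "DeepSeek R1": 671,
    "DeepSeek R1 Distill Llama 70B": 70,
    "QwQ 32B Preview": 32,
    "Phi-4 14B": 14,
}
for name, size in picks.items():
    print(f"{name}: {estimate_q4_vram_gb(size)} GB")
```

For instance, fits_in_vram(32, 24) returns True for QwQ 32B Preview on a 24 GB card, with roughly 5 GB left over for context.
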
Last reviewed 2026-06-08. We refresh these picks as new models ship. See the full directory at /models.