Best for Reasoning
Best open-source reasoning models
Reasoning models trade latency for accuracy: they generate hundreds to thousands of tokens of internal 'thinking' before producing a final answer. The open-source frontier here is led by DeepSeek R1 and its distilled variants, plus Qwen's QwQ line.
These picks optimise for MATH, GPQA and HumanEval scores at the cost of higher per-call latency. They suit workloads where a slow correct answer beats a fast wrong one.
Reasoning models routinely outperform their non-reasoning parents on math and complex multi-step tasks, often by 20 or more benchmark points. The price is latency: responses are often 2–10× slower.
Our picks
DeepSeek R1: The frontier of open reasoning. MIT licensed; rivals o1 on MATH.
DeepSeek R1 Distill Llama 70B: Most accessible way to run R1-class reasoning locally; fits on a single 80 GB H100 at Q4.
QwQ 32B Preview: Qwen's reasoning specialist; Apache 2.0; fits on a 4090 at Q4.
Phi-4: Not a 'thinking' model per se, but the best 14B on MATH and GPQA.
Things to watch out for
- Reasoning model outputs include lengthy <think>...</think> blocks. Make sure your downstream tooling strips them before showing users.
- Cost: a reasoning model that generates 5,000 internal tokens for a 200-token answer produces 5,200 output tokens, about 26× the output cost per query of a non-reasoning baseline that emits only the 200-token answer.
- Streaming UX: hide the thinking until done, or surface it as a collapsible 'reasoning' panel.
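The first two cautions above can be sketched in a few lines of Python. This is a minimal illustration, not any model vendor's reference code: it assumes the reasoning is wrapped in literal, well-formed `<think>...</think>` tags, and the helper names `split_reasoning` and `output_token_multiplier` are hypothetical.

```python
import re

# Assumption: the model wraps its reasoning in literal <think>...</think> tags.
_THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(raw: str) -> tuple[str, str]:
    """Return (reasoning, answer) extracted from a raw completion."""
    # Collect the text of every closed think block.
    reasoning = "\n".join(part.strip() for part in _THINK_BLOCK.findall(raw))
    # Remove the closed blocks to leave the user-facing answer.
    answer = _THINK_BLOCK.sub("", raw)
    # A dangling <think> (e.g. a truncated generation) means the answer has
    # not started yet, so drop everything from the unclosed tag onward.
    dangling = answer.find("<think>")
    if dangling != -1:
        answer = answer[:dangling]
    return reasoning, answer.strip()


def output_token_multiplier(thinking_tokens: int, answer_tokens: int) -> float:
    """Output tokens of a reasoning call relative to an answer-only call."""
    return (thinking_tokens + answer_tokens) / answer_tokens


reasoning, answer = split_reasoning("<think>2+2 is 4</think>\nThe answer is 4.")
print(answer)                              # text that is safe to show users
print(output_token_multiplier(5000, 200))  # the 5,000 / 200 token example
```

Returning the reasoning separately rather than discarding it leaves room for the collapsible 'reasoning' panel mentioned in the streaming bullet.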
All picks at a glance
DeepSeek R1 is a reasoning model trained with reinforcement learning on top of DeepSeek V3-Base. The MIT licence covers the weights as well as the code, making R1 one of the most permissively licensed frontier reasoning models. It generates long internal chains-of-thought before answering, trading latency for accuracy on math, code, and reasoning benchmarks. Distilled variants (e.g. R1 Distill Llama 70B) recover most of the quality at much smaller scales.
- Context: 128K
- Licence: MIT
- VRAM (Q4): 402.6 GB
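DeepSeek R1 Distill Llama 70B distils R1's reasoning capabilities into a Llama 3.3 70B base. It is the most accessible way to run R1-class reasoning locally: at Q4 the weights need about 42 GB, so the model fits on a single 80 GB H100. In fp16 the weights alone are about 140 GB and need several GPUs, and a single 24 GB 4090 is too small even at Q4. It inherits the Llama community licence of its base model (commercial use permitted below 700M monthly active users). A great pick for production reasoning workloads where the full R1 is too expensive to host but o1/R1-style quality is required.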
- Context: 128K
- Licence: Llama 3 community licence
- VRAM (Q4): 42 GB
QwQ 32B Preview is Qwen's reasoning-focused 'thinking' model. It generates long chains-of-thought before answering, in the same lineage as OpenAI's o1 and DeepSeek R1, and is optimised for math and competition-style problem solving. The Preview tag means Qwen is iterating quickly; later versions may supersede this one. Useful today for math-heavy workloads where a slow, careful answer is preferred to a fast wrong one.
- Context: 33K
- Licence: Apache 2.0
- VRAM (Q4): 19.2 GB
Phi-4 is a 14B model trained primarily on synthetic data. It punches above its weight on reasoning, especially MATH and GPQA. MIT licensed. A standout choice when you want strong reasoning quality without paying 70B-tier hardware costs. Phi-4 demonstrated that careful synthetic-data curation can extract frontier-class reasoning from a relatively small dense model.
- Context: 16K
- Licence: MIT
- VRAM (Q4): 8.4 GB
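The Q4 VRAM figures in the cards above are consistent with a rule of thumb of roughly 0.6 bytes per parameter. The sketch below reproduces them and checks which single GPU each model fits on. The 0.6 factor is inferred from the listed numbers rather than stated by the source, the 671B and 32B parameter counts are the publicly stated model sizes rather than values from the cards, and the estimate covers weights only (KV cache and activations need extra headroom).

```python
# Assumption: ~0.6 bytes per parameter at Q4, inferred from the listed figures.
BYTES_PER_PARAM_Q4 = 0.6

# Parameter counts in billions (671 and 32 are assumptions, not from the cards).
MODELS = {
    "DeepSeek R1": 671,
    "R1 Distill Llama 70B": 70,
    "QwQ 32B Preview": 32,
    "Phi-4": 14,
}

# Memory of the two GPUs the article mentions, in GB.
GPUS = {"RTX 4090": 24, "H100": 80}


def q4_weight_gb(params_billion: float) -> float:
    """Estimate Q4 weight memory in GB (weights only, no KV cache)."""
    return round(params_billion * BYTES_PER_PARAM_Q4, 1)


def fits(params_billion: float, gpu_gb: float) -> bool:
    """True if the estimated Q4 weights fit in the given GPU memory."""
    return q4_weight_gb(params_billion) <= gpu_gb


for name, size in MODELS.items():
    usable = [gpu for gpu, mem in GPUS.items() if fits(size, mem)]
    print(f"{name}: {q4_weight_gb(size)} GB at Q4, fits on: {usable or 'multi-GPU only'}")
```

Under these assumptions only QwQ 32B Preview and Phi-4 fit on a 24 GB card, the 70B distill needs an 80 GB GPU, and the full R1 needs a multi-GPU node.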