Best open-source reasoning models

Reasoning models trade latency for accuracy: they generate hundreds to thousands of tokens of internal 'thinking' before producing a final answer. The open-source frontier here is led by DeepSeek R1 and its distilled variants, plus Qwen's QwQ line.

What we optimise for

We optimise for MATH, GPQA and HumanEval scores and accept higher per-call latency in return. These picks suit workloads where a slow correct answer beats a fast wrong one.

Why it matters

Reasoning models routinely outperform their non-reasoning parents by 20+ percentage points on math and complex multi-step benchmarks. The price is latency: responses are often 2–10× slower.

Our picks

#1 DeepSeek R1
The frontier of open reasoning. MIT licensed; rivals o1 on MATH.
#2 DeepSeek R1 Distill Llama 70B
Most accessible way to run R1-class reasoning locally; fits on a single 80 GB H100 at Q4.
#3 QwQ 32B Preview
Qwen's reasoning specialist; Apache 2.0; fits on a 4090 at Q4.
#4 Phi-4 14B
Not a 'thinking' model per se, but the best 14B on MATH and GPQA.

Things to watch out for

Reasoning model outputs include lengthy <think>...</think> blocks. Make sure your downstream tooling strips them before showing users.
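Cost: a reasoning model that generates 5,000 internal tokens for a 200-token answer emits 5,200 output tokens in total, so it costs about 26× as much per query in output tokens as a non-reasoning baseline that emits only the 200-token answer.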
Streaming UX: hide the thinking until done, or surface it as a collapsible 'reasoning' panel.
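
The three caveats above can be sketched in Python. This is a minimal illustration under stated assumptions, not a reference implementation: it assumes the model wraps its reasoning in literal <think>...</think> tags as described above, the helper names (strip_think, split_stream, output_token_ratio) are hypothetical, and cost is approximated by output token count alone, ignoring input tokens and per-model pricing.

```python
import re
from typing import Iterable, Iterator, Tuple

# Assumption: reasoning is delimited by literal <think>...</think> tags.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def strip_think(text: str) -> str:
    """Remove complete <think>...</think> blocks from a finished response."""
    return THINK_BLOCK.sub("", text).strip()


def split_stream(chunks: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Label streamed text as ("reasoning", text) or ("answer", text).

    A tag can be split across chunk boundaries, so any tail that could be
    the start of the next tag is held back until more text arrives.
    """
    open_tag, close_tag = "<think>", "</think>"
    buffer, thinking = "", False
    for chunk in chunks:
        buffer += chunk
        while True:
            tag = close_tag if thinking else open_tag
            label = "reasoning" if thinking else "answer"
            idx = buffer.find(tag)
            if idx != -1:
                # Emit everything before the tag, then flip state.
                if idx > 0:
                    yield label, buffer[:idx]
                buffer = buffer[idx + len(tag):]
                thinking = not thinking
                continue
            # No full tag yet: keep back a possible partial tag.
            keep = 0
            for n in range(min(len(tag) - 1, len(buffer)), 0, -1):
                if buffer.endswith(tag[:n]):
                    keep = n
                    break
            emit = buffer[:len(buffer) - keep]
            if emit:
                yield label, emit
            buffer = buffer[len(buffer) - keep:]
            break
    if buffer:
        yield ("reasoning" if thinking else "answer"), buffer


def output_token_ratio(thinking_tokens: int, answer_tokens: int) -> float:
    """Output tokens relative to a baseline that emits only the answer."""
    return (thinking_tokens + answer_tokens) / answer_tokens
```

A UI could route the "reasoning" pieces from split_stream into a collapsible panel and the "answer" pieces into the main view, while strip_think suits non-streaming responses. With the figures above, output_token_ratio(5000, 200) counts both the thinking and the answer tokens against the 200-token baseline.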

All picks at a glance

DeepSeek R1

671B

Reasoning model trained with reinforcement learning on top of DeepSeek V3-Base. The MIT licence covers the weights as well as the code, making R1 the most permissively licensed frontier reasoning model. Generates long internal chains-of-thought before answering, trading latency for accuracy on math, code, and reasoning benchmarks. Distilled variants (e.g. R1 Distill Llama 70B) recover most of the quality at much smaller scales.

Context: 128K
License: mit
VRAM Q4: 402.6 GB

DeepSeek R1 Distill Llama 70B

70B

R1 reasoning capabilities distilled into a Llama 3.3 70B base. The most accessible way to run R1-class reasoning locally: the Q4 weights (about 42 GB) fit on a single 80 GB H100, whereas fp16 weights (roughly 140 GB) need multiple GPUs and the Q4 footprint exceeds a 24 GB RTX 4090. Inherits Llama 3's community licence (commercial use under 700M MAU). A strong pick for production reasoning workloads where the full R1 is too expensive to host but o1/R1-style quality is required.

Context: 128K
License: llama-3
VRAM Q4: 42 GB

QwQ 32B Preview

32B

Qwen's reasoning-focused 'thinking' model. Generates long chains-of-thought before answering, similar to OpenAI's o1 and the DeepSeek R1 lineage. Optimised for math and competition-style problem solving. The Preview tag means Qwen is iterating quickly; later versions may make this one obsolete. Useful today for math-heavy workloads where a slow, careful answer is preferred to a fast wrong one.

Context: 32K
License: apache-2-0
VRAM Q4: 19.2 GB

Phi-4 14B

14B

14B model trained primarily on synthetic data. Punches above its weight on reasoning, especially MATH and GPQA. MIT licensed. A standout choice when you want strong reasoning quality without paying 70B-tier hardware costs. Phi-4 demonstrated that careful synthetic-data curation can extract frontier-class reasoning from a relatively small dense model.

Context: 16K
License: mit
VRAM Q4: 8.4 GB
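
The "VRAM Q4" figures on the cards above are all consistent with a simple rule of about 0.6 GB per billion parameters (671 × 0.6 = 402.6, 70 × 0.6 = 42, and so on). The sketch below reproduces that arithmetic and checks a figure against a GPU's memory. The 0.6 factor is inferred from the listed numbers rather than stated anywhere on this page, the function names are hypothetical, and the estimate covers weights only: KV cache and activations need extra headroom on top.

```python
# Assumption: inferred from the listed figures (e.g. 70B -> 42 GB), not a spec.
Q4_GB_PER_BILLION_PARAMS = 0.6

# Parameter counts in billions, as listed on the cards.
MODELS = {
    "DeepSeek R1": 671,
    "DeepSeek R1 Distill Llama 70B": 70,
    "QwQ 32B Preview": 32,
    "Phi-4 14B": 14,
}


def q4_vram_gb(params_billions: float) -> float:
    """Estimate Q4 weight memory in GB (weights only, no KV cache)."""
    return round(params_billions * Q4_GB_PER_BILLION_PARAMS, 1)


def fits(params_billions: float, gpu_memory_gb: float) -> bool:
    """True if the estimated Q4 weights fit within the given GPU memory."""
    return q4_vram_gb(params_billions) <= gpu_memory_gb


if __name__ == "__main__":
    for name, size in MODELS.items():
        # 24 GB matches an RTX 4090; 80 GB matches an H100.
        print(f"{name}: {q4_vram_gb(size)} GB, "
              f"24 GB card: {fits(size, 24)}, 80 GB card: {fits(size, 80)}")
```

By this estimate, QwQ 32B Preview and Phi-4 14B fit on a 24 GB card at Q4, the 70B distill needs a larger card such as an 80 GB H100, and the full R1 needs a multi-GPU node.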

Last reviewed 2026-07-05. We refresh these picks as new models ship. See the full directory at /models.