OSAIM
Open Source AI Models

Best open-source AI models for coding

Code is where open-source AI models have closed the gap with the proprietary frontier fastest. The current crop of code-specialised open-weight models rivals GPT-4-class quality on HumanEval and BigCodeBench at parameter counts that fit on a single high-end GPU.

What we optimise for

We're optimising for HumanEval and MATH scores, broad language coverage (not just Python), licence permissiveness for commercial deployments, and tractable hardware footprints.

Why it matters

Code assistants run in tight feedback loops with the developer. Latency matters as much as raw quality, which is why a fast 32B coder often beats a slower 70B generalist in practice.

Our picks

  1. Qwen2.5 Coder 32B: GPT-4o-class HumanEval at 32B, Apache 2.0, fits on a single H100.

  2. DeepSeek Coder V2: code-focused MoE with 21B active params; supports 338 languages.

  3. DeepSeek V3: generalist that outperforms most code specialists on HumanEval at frontier scale.

  4. Qwen2.5 32B Instruct: strongest non-coder-specialised model at this size; runs on a 4090 at Q4.

  5. Phi-4 14B: 14B reasoning model that punches above its weight on code+math benchmarks.

Things to watch out for

  • Most coding assistants chain multiple LLM calls. Fast streaming output from the serving stack (for example vLLM or TGI) often matters more than the absolute HumanEval score.
  • Code workloads are sensitive to context length on large monorepos. Prefer 128K+ context models if you plan to feed in surrounding files.
  • Apache 2.0 simplifies downstream redistribution if you're building a commercial product on top.
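
As a rough illustration of the context-length point above, the sketch below checks which surrounding files fit into a 128K-token window before a request is sent. The helper names, the 4-characters-per-token heuristic, and the 8,000-token reserve for the prompt and the reply are all assumptions for illustration; a real setup would count tokens with the model's own tokenizer.

```python
from typing import Iterable, List, Tuple


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate. Assumes ~4 characters per token."""
    return int(len(text) / chars_per_token) + 1


def select_files_for_context(
    files: Iterable[Tuple[str, str]],
    context_tokens: int = 128_000,   # assumed size of a "128K" window
    reserved_tokens: int = 8_000,    # assumed reserve for instructions and output
) -> Tuple[List[str], int]:
    """Keep files, in priority order, while they fit in the remaining budget.

    Files that would overflow the budget are skipped so that smaller,
    lower-priority files can still be included.
    """
    budget = context_tokens - reserved_tokens
    selected: List[str] = []
    used = 0
    for path, text in files:
        cost = estimate_tokens(text)
        if used + cost > budget:
            continue  # too large for what is left; try the next file
        selected.append(path)
        used += cost
    return selected, used


if __name__ == "__main__":
    # Hypothetical file contents, sized only to exercise the budget logic.
    repo = [
        ("core/engine.py", "x" * 400_000),
        ("core/legacy.py", "y" * 200_000),
        ("core/utils.py", "z" * 40_000),
    ]
    chosen, tokens = select_files_for_context(repo)
    print(chosen, tokens)
```

Ordering the input by relevance (for example, the file being edited first, then its imports) matters more than the exact heuristic, since the loop is greedy.
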

All picks at a glance

Qwen2.5 Coder 32B (32B parameters)

Coding-specialised Qwen2.5 32B fine-tune. GPT-4o-class on HumanEval and BigCodeBench at the time of release. Trained on additional code-heavy data with extended pre-training. Apache 2.0. Natural pick for self-hosted coding assistants, code-review automation, and any agent loop that primarily writes code.

Context: 128K | License: Apache 2.0 | VRAM (Q4): 19.2 GB

DeepSeek Coder V2 (236B parameters)

Coding-focused MoE model with 21B active parameters out of 236B total. Supports 338 programming languages with strong performance across mainstream stacks (Python, TypeScript, Go, Rust, Java, C++) and competent results on niche languages where most open models falter. The DeepSeek licence applies: commercial use is permitted with some application restrictions.

Context: 128K | License: DeepSeek | VRAM (Q4): 141.6 GB

DeepSeek V3 (671B parameters)

671B-parameter MoE model with 37B active per token. Trained for roughly $5.6M of compute, a landmark in cost-efficient frontier training. Frontier-class quality at a fraction of the cost of the proprietary frontier. The DeepSeek licence permits commercial use with limited restrictions on military and unlawful applications. Running V3 yourself requires serious hardware: the fp8 weights alone come to roughly 671 GB, more than the 640 GB available on a single 8× H100 (80 GB) node, so most teams will use it via the DeepSeek API or providers like Together.

Context: 128K | License: DeepSeek | VRAM (Q4): 402.6 GB

Qwen2.5 32B Instruct (32B parameters)

32B sweet-spot model: strong reasoning, fits on one H100 in fp16 and on a 4090 at Q4. The 32B size in particular hits a quality/cost knee: quality scales with parameters faster than cost up to ~32B, and slower afterwards. Favoured for production chat where 7B isn't sharp enough and where 70B+ would over-spec the hardware budget. Apache 2.0 licence.

Context: 128K | License: Apache 2.0 | VRAM (Q4): 19.2 GB

Phi-4 14B (14B parameters)

14B model trained primarily on synthetic data. Punches above its weight on reasoning, especially MATH and GPQA. MIT licensed. A standout choice when you want strong reasoning quality without paying 70B-tier hardware costs. Phi-4 in particular demonstrated that careful synthetic-data curation can extract frontier-class reasoning from a relatively small dense model.

Context: 16K | License: MIT | VRAM (Q4): 8.4 GB

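
The VRAM Q4 figures listed for the picks are all consistent with roughly 0.6 GB per billion parameters. One way to arrive at that ratio, which is an assumption here rather than a documented formula, is 4-bit weights (0.5 bytes per parameter) plus about 20% overhead. The sketch below reproduces the listed numbers under that assumption; it covers weights only, so KV cache and activations for long contexts come on top.

```python
# Assumptions (not stated on this page): 4-bit weights plus ~20% overhead,
# with 1 GB taken as 1e9 bytes.
BYTES_PER_PARAM_Q4 = 0.5
OVERHEAD_FACTOR = 1.2


def estimate_q4_vram_gb(params_billions: float) -> float:
    """Rough Q4 weight footprint in GB for a model of the given size."""
    return round(params_billions * BYTES_PER_PARAM_Q4 * OVERHEAD_FACTOR, 1)


def fits_on_gpu(params_billions: float, gpu_vram_gb: float) -> bool:
    """True if the estimated Q4 weights fit in the given VRAM (weights only)."""
    return estimate_q4_vram_gb(params_billions) <= gpu_vram_gb


# Parameter counts (in billions) as listed for each pick.
PICKS = {
    "Qwen2.5 Coder 32B": 32,
    "DeepSeek Coder V2": 236,
    "DeepSeek V3": 671,
    "Qwen2.5 32B Instruct": 32,
    "Phi-4 14B": 14,
}

if __name__ == "__main__":
    for name, size in PICKS.items():
        print(f"{name}: ~{estimate_q4_vram_gb(size)} GB at Q4")
    # A 24 GB card such as the RTX 4090 holds the 32B weights at Q4.
    print("32B on 24 GB:", fits_on_gpu(32, 24))
```

For the MoE models the estimate uses total parameters, not active ones, because all experts have to be resident in memory even though only a subset runs per token.
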
Last reviewed 2026-06-08. We refresh these picks as new models ship. See the full directory at /models.