DeepSeek
Hangzhou-based lab known for highly efficient MoE training. DeepSeek V3 and R1 raised the bar for open-weight reasoning and coding models.
History & context
DeepSeek's releases through 2024 and into 2025 changed the economics of frontier AI. The Hangzhou-based lab specialises in mixture-of-experts (MoE) training: its architecture and training pipeline let it train frontier-class models at a fraction of the compute cost of labs that train dense models.
DeepSeek V3 (December 2024), a 671B-parameter MoE with 37B parameters active per token, was reportedly trained for around $5.6M of compute, a figure that covers the final training run rather than prior research and experiments. Its quality on academic benchmarks rivals the closed proprietary frontier. DeepSeek R1 (January 2025) followed with reasoning capability trained through reinforcement learning, and its weights are MIT-licensed, which is unusually permissive for a model of this class.
R1 also shipped a family of distilled variants (R1 Distill Qwen 1.5B / 7B / 14B / 32B, R1 Distill Llama 8B / 70B) that recover much of R1's reasoning quality at much smaller scales. The 70B distill is the most practical way to run R1-class reasoning on a single H100, provided it is quantised: at Q4 the weights take roughly 42 GB, well within the card's 80 GB.
Flagship model
4 models in this family
DeepSeek V3
671B-parameter MoE model with 37B parameters active per token. Reportedly trained for roughly $5.6M of compute, a landmark in cost-efficient frontier training, with quality that rivals the closed proprietary frontier. The DeepSeek licence permits commercial use, with restrictions on military and unlawful applications. Running V3 yourself requires serious hardware: the fp8 weights alone come to roughly 671 GB, more than the 640 GB available on a single 8× H100 (80 GB) node, so plan for a multi-node or higher-memory setup. Most teams will use it via the DeepSeek API or providers like Together.
- Context: 128K
- License: deepseek
- VRAM Q4: 402.6 GB
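To make the "37B active out of 671B" figure concrete, here is a minimal sketch of the arithmetic. The cost function uses the common rule of thumb of about 2 FLOPs per active parameter per token for a forward pass; that rule is an assumption added here for illustration, not a figure from DeepSeek, and the function names are hypothetical.

```python
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Share of a mixture-of-experts model's parameters used for each token."""
    return active_params_b / total_params_b


def forward_flops_per_token(active_params_b: float) -> float:
    """Rough forward-pass cost per token.

    Assumption: about 2 FLOPs per active parameter (a rule of thumb that
    ignores attention over the context and other overheads).
    """
    return 2 * active_params_b * 1e9


v3_fraction = active_fraction(total_params_b=671, active_params_b=37)
moe_cost = forward_flops_per_token(37)      # only the routed experts run
dense_cost = forward_flops_per_token(671)   # a dense model of the same size

print(f"Active share per token: {v3_fraction:.1%}")
print(f"Dense/MoE compute ratio per token: {dense_cost / moe_cost:.1f}x")
```

All 671B parameters must still be held in memory, so the saving applies to compute per token, not to the hardware needed to host the weights.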
DeepSeek R1
Reasoning model trained with reinforcement learning on top of DeepSeek V3-Base. Released under the MIT licence, which covers the weights as well as the code, making R1 one of the most permissively licensed frontier reasoning models. Generates a long internal chain of thought before answering, trading latency for accuracy on math, code, and reasoning benchmarks. Distilled variants (e.g. R1 Distill Llama 70B) recover much of the quality at much smaller scales.
- Context: 128K
- License: mit
- VRAM Q4: 402.6 GB
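Because R1 writes out its chain of thought before the final answer, applications usually want to separate the two. The sketch below assumes the raw completion wraps the reasoning in `<think>...</think>` tags, as the open-weight R1 checkpoints typically do when run locally; hosted APIs may expose the reasoning in a separate field instead. The helper name `split_reasoning` is hypothetical.

```python
import re

# Assumption: reasoning is delimited by <think>...</think> in the raw text.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a raw completion.

    If no <think> block is found, the reasoning is empty and the whole
    text is treated as the answer.
    """
    match = THINK_BLOCK.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the first <think> block is the visible answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


sample = "<think>2 + 2 is 4, then double it.</think>\nThe answer is 8."
reasoning, answer = split_reasoning(sample)
print("reasoning:", reasoning)
print("answer:", answer)
```

Logging the reasoning separately and showing users only the answer keeps the latency cost of long chains of thought visible without cluttering the output.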
DeepSeek Coder V2
Coding-focused MoE model with 21B active parameters out of 236B total. Supports 338 programming languages, with strong performance across mainstream stacks (Python, TypeScript, Go, Rust, Java, C++) and competent results on niche languages where most open models falter. The DeepSeek licence applies: commercial use is permitted, with some application restrictions.
- Context: 128K
- License: deepseek
- VRAM Q4: 141.6 GB
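R1 Distill Llama 70B
R1 reasoning capabilities distilled into a Llama 3.3 70B base. The most accessible way to run R1-class reasoning locally: at Q4 the weights take about 42 GB, which fits on a single 80 GB H100. In fp16 the weights take about 140 GB and need at least two such GPUs, and a 24 GB RTX 4090 cannot hold the Q4 weights without CPU offloading or a second card. Inherits the Llama community licence (commercial use permitted below 700M monthly active users). A good pick for production reasoning workloads where the full R1 is too expensive to host but quality close to o1 or R1 is required.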
- Context: 128K
- License: llama-3
- VRAM Q4: 42 GB
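The VRAM Q4 figures listed on this page are consistent with roughly 0.6 GB per billion parameters (about 4.8 bits per weight). That ratio is inferred from the listed numbers, not an official formula, and it covers the weights only: KV cache and activations need additional memory. Under that assumption, a minimal sketch for checking what fits on which GPU might look like this (function and variable names are hypothetical):

```python
import math

# Assumption: ~0.6 GB per billion parameters at Q4, inferred from the
# figures listed on this page (671B -> 402.6 GB, 70B -> 42 GB).
Q4_GB_PER_BILLION = 0.6
FP16_GB_PER_BILLION = 2.0  # 2 bytes per parameter


def weights_gb(params_billion: float, gb_per_billion: float) -> float:
    """Approximate memory for the weights alone (no KV cache or activations)."""
    return params_billion * gb_per_billion


def gpus_needed(required_gb: float, gpu_memory_gb: float = 80.0) -> int:
    """Minimum number of equal GPUs whose combined memory holds the weights."""
    return math.ceil(required_gb / gpu_memory_gb)


models = {"671B MoE": 671, "236B coding MoE": 236, "70B distill": 70}
for name, size in models.items():
    q4 = weights_gb(size, Q4_GB_PER_BILLION)
    fp16 = weights_gb(size, FP16_GB_PER_BILLION)
    print(
        f"{name}: Q4 ~{q4:.1f} GB ({gpus_needed(q4)}x 80 GB GPU), "
        f"fp16 ~{fp16:.0f} GB ({gpus_needed(fp16)}x 80 GB GPU)"
    )
```

Treat the GPU counts as lower bounds: real deployments need headroom for the KV cache, which grows with context length and batch size.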