OSAIM
Open Source AI Models

DeepSeek R1 Distill Llama 70B

R1 reasoning capabilities distilled into a Llama 3.3 70B base. The most accessible way to run R1-class reasoning locally — fits on a single H100 in fp16 or on a 4090 at Q4. Inherits Llama 3's community licence (commercial use under 700M MAU).

Great pick for production reasoning workloads where the full R1 is too expensive to host but o1/R1-style quality is required.

Parameters
70B
Context length
128K
Modality
text
Released
2025-01-20

Memory & hardware

VRAM (fp16)
140 GB
VRAM (Q4)
42 GB
Recommended
1× H100 80GB or RTX 4090 (Q4)
Quantizations
fp16, q8_0, q5_k_m, q4_k_m

License: Llama 3 Community License

SPDX
Commercial use
Yes
Modification
Yes
Redistribution
Yes

Benchmarks

MATH
94.5
MMLU
86.0
HumanEval
86.0
Benchmarks last verified 2026-05-18.

Hosted inference pricing

USD per million tokens.

ProviderInputOutput
groqCheapest$0.75$0.99
Pricing last verified 2026-05-18. Providers update rates frequently; confirm before integrating.

Run it yourself

Drop-in commands for the three most common open-source inference paths. The Ollama tag is a best-effort match against the registry; verify the size variant before pulling.

Run DeepSeek R1 Distill Llama 70B locally
Ollama (easiest)
ollama run deepseek-r1:70b
Single-line install + run; uses the official Ollama registry tag for this family.
vLLM (production)
vllm serve deepseek-ai/DeepSeek-R1-Distill-Llama-70B
High-throughput hosted inference; one command to expose an OpenAI-compatible HTTP server.
Transformers (Python)
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Llama-70B")
model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Llama-70B", device_map="auto", torch_dtype="auto"
)
Direct PyTorch usage. Pin a torch / cuda version that matches your GPU.
Hugging Face ID: deepseek-ai/DeepSeek-R1-Distill-Llama-70B

Related models

Same family or similar size — useful when shopping around.

DeepSeek Coder V2
236B

Coding-focused MoE model with 21B active parameters out of 236B total. Supports 338 programming languages with strong performance across mainstream stacks (Python, TypeScript, Go, Rust, Java, C++) and competent results on niche languages where most open models falter. The DeepSeek licence applies — commercial use permitted with some application restrictions.

Context
128K
License
deepseek
VRAM Q4
141.6 GB
DeepSeek V3
671B

671B-parameter MoE model with 37B active per token. Trained for roughly $5.6M of compute — a landmark in cost-efficient frontier training. Frontier-class quality at a fraction of the cost of the closed proprietary frontier. The DeepSeek licence permits commercial use with limited restrictions on military and unlawful applications. Running V3 yourself requires serious hardware (8× H100 at fp8); most teams will use it via the DeepSeek API or providers like Together.

Context
128K
License
deepseek
VRAM Q4
402.6 GB
DeepSeek R1
671B

Reasoning model trained with reinforcement learning on top of DeepSeek V3-Base. MIT licence — even the weights are unrestricted, making R1 the most permissively-licensed frontier reasoning model. Generates long internal chains-of-thought before answering, trading latency for accuracy on math, code, and reasoning benchmarks. Distilled variants (e.g. R1 Distill Llama 70B) recover most of the quality at much smaller scales.

Context
128K
License
mit
VRAM Q4
402.6 GB
Llama 3.1 Nemotron 70B Instruct
70B

NVIDIA's RLHF-tuned Llama 3.1 70B. Tops several Arena-style human-preference leaderboards and shipped with NVIDIA's reward-model research. Inherits the Llama 3 community licence.

Context
128K
License
llama-3
VRAM Q4
42 GB
Llama 3.3 70B Instruct
70B

Meta's December 2024 refresh of Llama 3 70B that closes most of the gap with Llama 3.1 405B for chat workloads while remaining tractable on a single H100. Strong instruction following, robust tool-use behaviour, and a 128K context window make it the default choice for production chat at 70B scale. The 3.3 release was trained on a refreshed instruction-tuning data mix and benefits from Meta's most recent alignment work. It outperforms the much larger 3.1 405B on several reasoning benchmarks at a fraction of inference cost. The licence is the Llama 3 Community License, which permits commercial use unless your service exceeds 700M monthly active users. Good pick for: production chat at scale, RAG over long documents, agentic workflows where tool use matters, and any 70B-tier replacement for closed proprietary models.

Context
128K
License
llama-3
VRAM Q4
42 GB
Qwen2.5 72B Instruct
72B

The flagship Qwen 2.5 release. Competes with Llama 3.1 405B on many benchmarks at one-fifth the parameter count. Note the 72B specifically uses the Qwen License (commercial use up to 100M MAU) — the smaller Qwen2.5 sizes are Apache 2.0.

Context
128K
License
qwen
VRAM Q4
43.2 GB