Llama 4 Scout 17B (16E)
Meta's April 2025 mixture-of-experts release. 17B active parameters across 16 experts (109B total). Natively multimodal with an unprecedented 10M-token context window — a leap far beyond Llama 3's 128K. Scout was designed to run on a single GPU at Q4 while beating Llama 3.3 70B on reasoning and multilingual benchmarks. The Llama 4 licence tightened acceptable-use provisions vs Llama 3.
- Parameters
- 17B
- Context length
- 10.0M
- Modality
- text, vision
- Released
- 2025-04-05
Memory & hardware
- VRAM (fp16)
- 34 GB
- VRAM (Q4)
- 10.2 GB
- Recommended
- 1× H100 80GB at fp8; RTX 4090 possible at Q4 with reduced context
- Quantizations
- fp16, fp8, q8_0, q5_k_m, q4_k_m
License: Llama 4 Community License
- SPDX
- —
- Commercial use
- Yes
- Modification
- Yes
- Redistribution
- Yes
Benchmarks
Hosted inference pricing
USD per million tokens.
| Provider | Input | Output | |
|---|---|---|---|
| together | $0.18 | $0.59 | |
| groqCheapest | $0.11 | $0.34 |
Run it yourself
Drop-in commands for the three most common open-source inference paths. The Ollama tag is a best-effort match against the registry; verify the size variant before pulling.
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-4-Scout-17B-16E-Instruct")
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-4-Scout-17B-16E-Instruct", device_map="auto", torch_dtype="auto"
)meta-llama/Llama-4-Scout-17B-16E-Instruct Related models
Same family or similar size — useful when shopping around.
Larger Llama 4 sibling of Scout — 17B active across 128 experts (400B total). 1M-token native context. Positioned as GPT-4o-class on chat and reasoning while remaining tractable on a single high-end host at fp8. Multimodal from the ground up; instruction-tuned by Meta with a heavier synthetic-data pipeline than Llama 3.
- Context
- 1.0M
- License
- llama-4
- VRAM Q4
- 10.2 GB
Mid-size Llama 2 chat model. Deprecated in most 2025 workloads by Llama 3.1 8B, but remains the baseline against which many post-2023 fine-tunes report.
- Context
- 4K
- License
- llama-2
- VRAM Q4
- 7.8 GB
Phi-3's mid-tier model with extended 128K context. MIT licence. Strong reasoning relative to its parameter count thanks to Microsoft's heavy investment in synthetic training data.
- Context
- 128K
- License
- mit
- VRAM Q4
- 8.4 GB
14B model trained primarily on synthetic data. Punches above its weight on reasoning, especially MATH and GPQA. MIT licensed. A standout choice when you want strong reasoning quality without paying 70B-tier hardware costs. Phi-4 in particular demonstrated that careful synthetic-data curation can extract frontier-class reasoning from a relatively small dense model.
- Context
- 16K
- License
- mit
- VRAM Q4
- 8.4 GB
Mid-size Qwen2.5 with broad task coverage. The sweet spot for users who want noticeably better quality than 7B but can't justify the hardware footprint of 32B or 72B.
- Context
- 128K
- License
- apache-2-0
- VRAM Q4
- 8.4 GB
Larger OLMo 2 release. Same fully-open philosophy as the 7B variant. The 13B size makes it more competitive with mainstream production-grade chat models.
- Context
- 4K
- License
- apache-2-0
- VRAM Q4
- 7.8 GB