OSAIM
Open Source AI Models

Llama 4 Scout 17B (16E)

Meta's April 2025 mixture-of-experts release. 17B active parameters across 16 experts (109B total). Natively multimodal with an unprecedented 10M-token context window — a leap far beyond Llama 3's 128K. Scout was designed to run on a single GPU at Q4 while beating Llama 3.3 70B on reasoning and multilingual benchmarks. The Llama 4 licence tightened acceptable-use provisions vs Llama 3.

Parameters
17B
Context length
10.0M
Modality
text, vision
Released
2025-04-05

Memory & hardware

VRAM (fp16)
34 GB
VRAM (Q4)
10.2 GB
Recommended
1× H100 80GB at fp8; RTX 4090 possible at Q4 with reduced context
Quantizations
fp16, fp8, q8_0, q5_k_m, q4_k_m

License: Llama 4 Community License

SPDX
Commercial use
Yes
Modification
Yes
Redistribution
Yes

Benchmarks

ArenaHard
88.5
IFEval
87.4
HumanEval
79.9
MMLU
79.6
MMLU-Pro
74.3
MMMU
69.4
MATH
50.3
SWE-bench Verified
22.4
Benchmarks last verified 2026-07-02.

Hosted inference pricing

USD per million tokens.

ProviderInputOutput
together$0.18$0.59
groqCheapest$0.11$0.34
Pricing last verified 2026-07-02. Providers update rates frequently; confirm before integrating.

Run it yourself

Drop-in commands for the three most common open-source inference paths. The Ollama tag is a best-effort match against the registry; verify the size variant before pulling.

Run Llama 4 Scout 17B (16E) locally
No official Ollama registry tag for this model — use transformers or vLLM below.
vLLM (production)
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct
High-throughput hosted inference; one command to expose an OpenAI-compatible HTTP server.
Transformers (Python)
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-4-Scout-17B-16E-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-4-Scout-17B-16E-Instruct", device_map="auto", torch_dtype="auto"
)
Direct PyTorch usage. Pin a torch / cuda version that matches your GPU.
Hugging Face ID: meta-llama/Llama-4-Scout-17B-16E-Instruct

Related models

Same family or similar size — useful when shopping around.

Llama 4 Maverick 17B (128E)
17B

Larger Llama 4 sibling of Scout — 17B active across 128 experts (400B total). 1M-token native context. Positioned as GPT-4o-class on chat and reasoning while remaining tractable on a single high-end host at fp8. Multimodal from the ground up; instruction-tuned by Meta with a heavier synthetic-data pipeline than Llama 3.

Context
1.0M
License
llama-4
VRAM Q4
10.2 GB
Llama 2 13B Chat
13B

Mid-size Llama 2 chat model. Deprecated in most 2025 workloads by Llama 3.1 8B, but remains the baseline against which many post-2023 fine-tunes report.

Context
4K
License
llama-2
VRAM Q4
7.8 GB
Phi-3 Medium 14B
14B

Phi-3's mid-tier model with extended 128K context. MIT licence. Strong reasoning relative to its parameter count thanks to Microsoft's heavy investment in synthetic training data.

Context
128K
License
mit
VRAM Q4
8.4 GB
Phi-4 14B
14B

14B model trained primarily on synthetic data. Punches above its weight on reasoning, especially MATH and GPQA. MIT licensed. A standout choice when you want strong reasoning quality without paying 70B-tier hardware costs. Phi-4 in particular demonstrated that careful synthetic-data curation can extract frontier-class reasoning from a relatively small dense model.

Context
16K
License
mit
VRAM Q4
8.4 GB
Qwen2.5 14B Instruct
14B

Mid-size Qwen2.5 with broad task coverage. The sweet spot for users who want noticeably better quality than 7B but can't justify the hardware footprint of 32B or 72B.

Context
128K
License
apache-2-0
VRAM Q4
8.4 GB
OLMo 2 13B
13B

Larger OLMo 2 release. Same fully-open philosophy as the 7B variant. The 13B size makes it more competitive with mainstream production-grade chat models.

Context
4K
License
apache-2-0
VRAM Q4
7.8 GB