Llama

Meta's open-weights LLM family. Llama 3 introduced 128K context and strong instruction tuning across 1B–405B parameter scales.

Visit homepage ↗

History & context

Meta's Llama family is the centre of gravity for open-weights AI. The original Llama (February 2023) was research-only; Llama 2 (July 2023) introduced the community licence that has shaped every release since. Llama 3 in April 2024 was the inflection point — it caught up with closed proprietary frontiers in many academic benchmarks and made open-weights deployment a serious option for production workloads.

Llama 3.1 (July 2024) brought the 405B flagship plus a 128K context refresh across 8B and 70B. Llama 3.2 (September 2024) added vision (11B and 90B) and on-device sizes (1B, 3B). Llama 3.3 (December 2024) refreshed the 70B to nearly match 405B quality at a fraction of the cost — and is the default choice today for production chat at 70B scale.

The licence remains the Llama Community License: free commercial use unless your service has 700M+ monthly active users (in which case you need a separate licence from Meta). Acceptable-use restrictions apply to certain categories. The licence is widely accepted in industry but is not OSI-approved open source — strictly speaking, Llama is source-available.

Flagship model

Llama 3.1 405B Instruct

405B

Meta's July 2024 flagship — the first open-weights model at 405B parameters. Trained on 15T tokens with 128K context. Rivals GPT-4o on many academic benchmarks and set the ceiling for open-weights quality for most of 2024. Running it self-hosted requires serious hardware (8× H100 at fp8 or multi-node at fp16); most users will run it via a hosted provider (Together, Groq, Fireworks). Llama 3.3 70B closed most of the practical gap at a fraction of the cost, so 405B is now most useful when 70B specifically hits its ceiling.

Context: 128K
License: llama-3
VRAM Q4: 243 GB

13 models in this family

Llama 3.1 405B Instruct

405B

Context: 128K
License: llama-3
VRAM Q4: 243 GB

Llama 3.2 90B Vision

90B

Larger vision-language Llama variant, competitive with the proprietary multimodal frontier on standard image-understanding benchmarks. Drops in as a vision upgrade where 11B isn't sharp enough. Requires substantial GPU memory in fp16; most teams will run it quantized or on multi-GPU. A natural pairing with retrieval pipelines that fetch image-rich chunks alongside text.

Context: 128K
License: llama-3
VRAM Q4: 54 GB

Llama 2 70B Chat

70B

Flagship Llama 2 release. Fundamentally superseded by Llama 3 70B on every benchmark, but relevant historically: the model that made 'open-weights chat model at frontier scale' credible for enterprise workloads.

Context: 4K
License: llama-2
VRAM Q4: 42 GB

Llama 3.3 70B Instruct

70B

Meta's December 2024 refresh of Llama 3 70B that closes most of the gap with Llama 3.1 405B for chat workloads while remaining tractable on a single H100. Strong instruction following, robust tool-use behaviour, and a 128K context window make it the default choice for production chat at 70B scale. The 3.3 release was trained on a refreshed instruction-tuning data mix and benefits from Meta's most recent alignment work. It outperforms the much larger 3.1 405B on several reasoning benchmarks at a fraction of inference cost. The licence is the Llama 3 Community License, which permits commercial use unless your service exceeds 700M monthly active users. Good pick for: production chat at scale, RAG over long documents, agentic workflows where tool use matters, and any 70B-tier replacement for closed proprietary models.

Context: 128K
License: llama-3
VRAM Q4: 42 GB

Llama 3.1 70B Instruct

70B

The pre-3.3 70B workhorse. Same base architecture as Llama 3.3 70B but the earlier instruction-tuning recipe. Still widely referenced as a baseline in papers and provider docs, and still the default 70B on some hosted providers.

Context: 128K
License: llama-3
VRAM Q4: 42 GB

Llama 4 Scout 17B (16E)

17B

Meta's April 2025 mixture-of-experts release. 17B active parameters across 16 experts (109B total). Natively multimodal with an unprecedented 10M-token context window — a leap far beyond Llama 3's 128K. Scout was designed to run on a single GPU at Q4 while beating Llama 3.3 70B on reasoning and multilingual benchmarks. The Llama 4 licence tightened acceptable-use provisions vs Llama 3.

Context: 10.0M
License: llama-4
VRAM Q4: 10.2 GB

Llama 4 Maverick 17B (128E)

17B

Larger Llama 4 sibling of Scout — 17B active across 128 experts (400B total). 1M-token native context. Positioned as GPT-4o-class on chat and reasoning while remaining tractable on a single high-end host at fp8. Multimodal from the ground up; instruction-tuned by Meta with a heavier synthetic-data pipeline than Llama 3.

Context: 1.0M
License: llama-4
VRAM Q4: 10.2 GB

Llama 2 13B Chat

13B

Mid-size Llama 2 chat model. Deprecated in most 2025 workloads by Llama 3.1 8B, but remains the baseline against which many post-2023 fine-tunes report.

Context: 4K
License: llama-2
VRAM Q4: 7.8 GB

Llama 3.2 11B Vision

11B

Llama 3's first vision-language model. Image understanding via a separately-trained ViT adapter bolted onto Llama 3 weights. Useful for OCR-adjacent workloads, document understanding, and image captioning at a permissive licence. The 11B size makes it cheap to host. Combined with the 128K text context, it handles long PDF-with-images workflows comfortably on a single 4090.

Context: 128K
License: llama-3
VRAM Q4: 6.6 GB

Llama 3.1 8B Instruct

The workhorse 8B instruction-tuned model. Excellent quality-to-cost ratio and the broadest ecosystem support of any open-weights model — every major inference engine, fine-tuning library, and quantization toolchain has a 3.1 8B preset. Fits in 24 GB of VRAM at fp16, ~6 GB at Q4. Strong default for production chat where 70B is overkill, for fine-tuning on a specialist task, and for any workload where you want a known-good baseline.

Context: 128K
License: llama-3
VRAM Q4: 4.8 GB

Llama 2 7B Chat

The original 7B RLHF chat model. Historically important — the first widely-adopted commercially-usable open-weights chat model. Still cited as a baseline in most 2024–25 papers.

Context: 4K
License: llama-2
VRAM Q4: 4.2 GB

Llama 3.2 3B

Pocket-sized Llama 3 variant for edge deployment. Surprising chat quality after instruction tuning makes it competitive with much larger models from a previous generation. At Q4 it fits in ~2 GB of VRAM and runs on consumer GPUs and recent Apple Silicon. A strong default for on-device chat, summarisation, and structured extraction tasks where the workload doesn't need frontier reasoning quality.

Context: 128K
License: llama-3
VRAM Q4: 1.8 GB

Llama 3.2 1B

The smallest Llama 3 release, designed for on-device inference on phones and laptops. The 1B model runs comfortably in <2 GB of RAM at Q4 quantization and is fast enough for real-time chat on a modern smartphone. Useful for edge inference, on-device assistants where round-tripping to a server is undesirable, and as a draft model for speculative decoding in front of a larger Llama 3 variant.

Context: 128K
License: llama-3
VRAM Q4: 0.6 GB

Comparing Llama against another family? Try the side-by-side comparator or browse all leaderboards.