OSAIM
Open Source AI Models

Meta

Llama

Meta's open-weights LLM family. The Llama 3 generation pairs strong instruction tuning with parameter scales from 1B to 405B, and has offered a 128K context window since Llama 3.1.

History & context

Meta's Llama family is the centre of gravity for open-weights AI. The original Llama (February 2023) was research-only; Llama 2 (July 2023) introduced the community licence that has shaped every release since. Llama 3 in April 2024 was the inflection point: it caught up with the closed proprietary frontier on many academic benchmarks and made open-weights deployment a serious option for production workloads.

Llama 3.1 (July 2024) brought the 405B flagship plus a 128K context refresh across 8B and 70B. Llama 3.2 (September 2024) added vision (11B and 90B) and on-device sizes (1B, 3B). Llama 3.3 (December 2024) refreshed the 70B to nearly match 405B quality at a fraction of the cost — and is the default choice today for production chat at 70B scale.

The licence remains the Llama Community License: free commercial use unless your service has 700M+ monthly active users (in which case you need a separate licence from Meta). Acceptable-use restrictions apply to certain categories. The licence is widely accepted in industry but is not OSI-approved open source — strictly speaking, Llama is source-available.

Flagship model

6 models in this family

Llama 3.2 90B Vision
90B

Larger vision-language Llama variant, competitive with the proprietary multimodal frontier on standard image-understanding benchmarks. Drops in as a vision upgrade where 11B isn't sharp enough. Requires substantial GPU memory in fp16; most teams will run it quantized or on multi-GPU. A natural pairing with retrieval pipelines that fetch image-rich chunks alongside text.

Context: 128K
License: llama-3
VRAM Q4: 54 GB

Llama 3.3 70B Instruct
70B

Meta's December 2024 refresh of Llama 3 70B that closes most of the gap with Llama 3.1 405B for chat workloads while remaining tractable on a single H100 when quantized. Strong instruction following, robust tool-use behaviour, and a 128K context window make it the default choice for production chat at 70B scale. The 3.3 release was trained on a refreshed instruction-tuning data mix and benefits from Meta's most recent alignment work. It outperforms the much larger 3.1 405B on several reasoning benchmarks at a fraction of the inference cost. The licence is the Llama 3 Community License, which permits commercial use unless your service exceeds 700M monthly active users. Good pick for: production chat at scale, RAG over long documents, agentic workflows where tool use matters, and any 70B-tier replacement for closed proprietary models.

Context: 128K
License: llama-3
VRAM Q4: 42 GB

Llama 3.2 11B Vision
11B

The first vision-language model in the Llama 3 line. Image understanding comes from a separately trained ViT-based adapter attached to the Llama 3 text weights. Useful for OCR-adjacent workloads, document understanding, and image captioning under a comparatively permissive licence. The 11B size makes it cheap to host. Combined with the 128K text context, it handles long PDF-with-images workflows comfortably on a single 4090 when quantized.

Context: 128K
License: llama-3
VRAM Q4: 6.6 GB

Llama 3.1 8B Instruct
8B

The workhorse 8B instruction-tuned model. Excellent quality-to-cost ratio and the broadest ecosystem support of any open-weights model: every major inference engine, fine-tuning library, and quantization toolchain has a 3.1 8B preset. At fp16 the weights take about 16 GB and fit on a 24 GB card; at Q4 they take about 5 GB. Strong default for production chat where 70B is overkill, for fine-tuning on a specialist task, and for any workload where you want a known-good baseline.

Context: 128K
License: llama-3
VRAM Q4: 4.8 GB

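
The fp16 and Q4 memory figures quoted for these models can be approximated with simple arithmetic: parameter count times bytes per weight. The sketch below is a rough estimate under stated assumptions, not a documented formula. The 4.8 effective bits per weight for Q4 is inferred from the VRAM Q4 values listed on this page (about 0.6 GB per billion parameters), and the function name `weight_memory_gb` is hypothetical. It counts weights only; KV cache, activations, and runtime overhead come on top.

```python
# Rough weight-memory estimate: parameters x bytes per weight.
# Assumption: 1 GB = 1e9 bytes, so billions of parameters map directly to GB.

FP16_BITS = 16.0
# Assumption inferred from this page's "VRAM Q4" column (~0.6 GB per billion
# parameters): 4-bit weights plus roughly 20% overhead for quantization scales.
Q4_EFFECTIVE_BITS = 4.8


def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the weights alone, in GB."""
    bytes_per_weight = bits_per_weight / 8.0
    return params_billion * bytes_per_weight


# Parameter counts (in billions) as listed on this page.
MODELS = {
    "Llama 3.2 1B": 1,
    "Llama 3.2 3B": 3,
    "Llama 3.1 8B Instruct": 8,
    "Llama 3.2 11B Vision": 11,
    "Llama 3.3 70B Instruct": 70,
    "Llama 3.2 90B Vision": 90,
}

if __name__ == "__main__":
    for name, size in MODELS.items():
        fp16 = weight_memory_gb(size, FP16_BITS)
        q4 = weight_memory_gb(size, Q4_EFFECTIVE_BITS)
        print(f"{name:<24} fp16 ~{fp16:6.1f} GB   Q4 ~{q4:5.1f} GB")
```

Under these assumptions the 8B model needs about 16 GB for fp16 weights, which is why it fits on a 24 GB card with room left for the KV cache, while the 70B model at fp16 exceeds a single 80 GB accelerator and needs quantization or multiple GPUs.
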
Llama 3.2 3B
3B

Pocket-sized Llama 3 variant for edge deployment. Surprising chat quality after instruction tuning makes it competitive with much larger models from a previous generation. At Q4 it fits in ~2 GB of VRAM and runs on consumer GPUs and recent Apple Silicon. A strong default for on-device chat, summarisation, and structured extraction tasks where the workload doesn't need frontier reasoning quality.

Context: 128K
License: llama-3
VRAM Q4: 1.8 GB

Llama 3.2 1B
1B

The smallest Llama 3 release, designed for on-device inference on phones and laptops. The 1B model runs comfortably in <2 GB of RAM at Q4 quantization and is fast enough for real-time chat on a modern smartphone. Useful for edge inference, on-device assistants where round-tripping to a server is undesirable, and as a draft model for speculative decoding in front of a larger Llama 3 variant.

Context: 128K
License: llama-3
VRAM Q4: 0.6 GB

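
The draft-model role mentioned for the 1B variant can be illustrated with a toy version of greedy speculative decoding. This is a minimal sketch, not how any particular inference engine implements it: the "models" are hypothetical stand-in functions over integer tokens, the target checks proposals one at a time (a real engine verifies them in a single batched forward pass), and the usual bonus token after a fully accepted proposal is omitted for brevity.

```python
from typing import Callable, List

Token = int
# A "model" here is just a greedy next-token function over the context.
NextToken = Callable[[List[Token]], Token]


def greedy_decode(model: NextToken, prompt: List[Token], max_new_tokens: int) -> List[Token]:
    """Plain autoregressive greedy decoding, used as the reference."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[Token],
    max_new_tokens: int,
    k: int = 4,
) -> List[Token]:
    """Greedy speculative decoding: the draft proposes, the target verifies."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1. The small draft model proposes up to k tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))

        # 2. The large target model checks each proposed position in order.
        accepted: List[Token] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            if tok == expected:
                accepted.append(tok)       # draft guessed right: keep it
            else:
                accepted.append(expected)  # first mismatch: take the target's token
                break                      # and discard the rest of the proposal
        tokens.extend(accepted)
    return tokens


# Hypothetical toy models: the target counts upward modulo 10; the draft
# agrees except that it guesses wrong whenever the last token is 4.
def target_model(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def draft_model(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_decode(target_model, draft_model, [0], max_new_tokens=12))
```

Because every accepted token is one the target itself would have chosen, the output matches plain greedy decoding with the target alone. Any speed-up comes from how often the cheap draft guesses correctly.
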