Comparison
Llama 3.2 90B Vision vs Llama 3.2 11B Vision
Side-by-side specs, benchmarks and hosted-inference pricing.
Llama 3.2 90B Vision: the larger vision-language Llama variant, competitive with proprietary multimodal frontier models on standard image-understanding benchmarks. It drops in as a vision upgrade where 11B isn't sharp enough. The fp16 weights alone need about 180 GB of GPU memory, so most teams will run it quantized or across multiple GPUs. A natural pairing with retrieval pipelines that fetch image-rich chunks alongside text.
Llama 3.2 11B Vision: the smaller model in Llama's first vision-language release. Image understanding comes from a separately trained ViT adapter bolted onto Llama 3 weights. Useful for OCR-adjacent workloads, document understanding, and image captioning under a permissive licence. The 11B size makes it cheap to host. Combined with the 128K text context, it can handle long PDF-with-images workflows on a single 24 GB RTX 4090, although the roughly 22 GB fp16 footprint leaves little headroom there, so a quantized build is the safer choice at long context.
Specs
| Spec | Llama 3.2 90B Vision | Llama 3.2 11B Vision |
| --- | --- | --- |
| Parameters | 90B | 11B |
| Context length | 128K | 128K |
| Modality | text, vision | text, vision |
| Released | 2024-09-25 | 2024-09-25 |
| License | Llama 3.2 Community License | Llama 3.2 Community License |
| Commercial use | Yes | Yes |
| VRAM fp16 | 180 GB | 22 GB |
| VRAM Q4 | 54 GB | 6.6 GB |
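The VRAM rows are consistent with a simple weights-only estimate: parameter count multiplied by bytes per weight. The sketch below reproduces the four figures under that assumption. The 2 bytes per weight for fp16 is standard; the 0.6 bytes per weight for Q4 (about 4.8 bits, allowing for quantization metadata) is inferred from the table rather than stated by the source, and the function name `estimate_vram_gb` is illustrative.

```python
# Weights-only VRAM estimate. It ignores the KV cache, activations, and
# runtime overhead, so real usage will be higher, especially at long context.
# The bytes-per-parameter values are assumptions inferred from the spec table.
BYTES_PER_PARAM = {"fp16": 2.0, "q4": 0.6}


def estimate_vram_gb(params_billions: float, precision: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]


if __name__ == "__main__":
    for name, size in [("90B", 90), ("11B", 11)]:
        for precision in ("fp16", "q4"):
            gb = estimate_vram_gb(size, precision)
            print(f"Llama 3.2 {name} Vision, {precision}: {gb:.1f} GB")
```

Because the estimate covers weights only, treat it as a lower bound when sizing a GPU: a long prompt with many images adds KV-cache and activation memory on top.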
Benchmarks
| Benchmark | Llama 3.2 90B Vision | Llama 3.2 11B Vision |
| --- | --- | --- |
| MMLU | 86.0 | 73.0 |