Comparison

Llama 3.2 90B Vision vs Llama 3.2 11B Vision

Side-by-side specs, benchmarks and hosted-inference pricing.

Side A

Llama 3.2 90B Vision

Meta · Llama

The larger Llama 3.2 vision-language variant, competitive with the proprietary multimodal frontier on standard image-understanding benchmarks. Drops in as a vision upgrade where 11B isn't sharp enough. Needs about 180 GB of GPU memory in fp16, so most teams will run it quantized or across multiple GPUs. A natural pairing with retrieval pipelines that fetch image-rich chunks alongside text.

Side B

Llama 3.2 11B Vision

Meta · Llama

The smaller of the first Llama vision-language models. Image understanding comes from a separately trained vision encoder and adapter attached to Llama 3.1 text weights. Useful for OCR-adjacent workloads, document understanding, and image captioning under a permissive licence. The 11B size makes it cheap to host. Quantized to Q4 (about 6.6 GB), it runs on a single 24 GB RTX 4090 with room left for the KV cache; combined with the 128K text context, that suits long PDF-with-images workflows.

Specs

Spec	Llama 3.2 90B Vision	Llama 3.2 11B Vision
Parameters	90B	11B
Context length	128K	128K
Modality	text, vision	text, vision
Released	2024-09-25	2024-09-25
License	Llama 3.2 Community License	Llama 3.2 Community License
Commercial use	Yes	Yes
VRAM fp16	180 GB	22 GB
VRAM Q4	54 GB	6.6 GB
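
The VRAM rows are consistent with a weights-only estimate: parameter count multiplied by bytes per parameter. The sketch below reproduces them in Python, assuming 2 bytes per parameter for fp16 and about 0.6 bytes per parameter for Q4. The 0.6 figure is an assumption chosen because it matches the table; real 4-bit formats vary. KV cache and activations are not included, and the helper names and the 80 GB GPU size are illustrative only.

```python
import math

# Bytes per parameter. FP16 is exact; the Q4 value is an assumption that
# happens to reproduce the table (4-bit weights plus quantization overhead).
FP16_BYTES = 2.0
Q4_BYTES = 0.6


def weight_vram_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weights-only memory in GB (1 GB = 1e9 bytes); excludes KV cache and activations."""
    return params_billion * bytes_per_param


def min_gpus(vram_gb: float, gpu_gb: float) -> int:
    """Smallest number of equal-sized GPUs whose combined memory holds the weights."""
    return math.ceil(vram_gb / gpu_gb)


if __name__ == "__main__":
    for name, params in [("Llama 3.2 90B Vision", 90), ("Llama 3.2 11B Vision", 11)]:
        fp16 = weight_vram_gb(params, FP16_BYTES)
        q4 = weight_vram_gb(params, Q4_BYTES)
        # 80 GB is a hypothetical per-GPU capacity used only for illustration.
        print(f"{name}: fp16 ~{fp16:.1f} GB, Q4 ~{q4:.1f} GB, "
              f"fp16 needs at least {min_gpus(fp16, 80.0)} x 80 GB GPUs")
```

Treat these numbers as a floor: a real deployment also needs memory for the KV cache, which grows with context length, and for the image encoder's activations.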

Benchmarks

Benchmark	Llama 3.2 90B Vision	Llama 3.2 11B Vision
MMLU	86.0	73.0
MMMU	60.3	50.7

Cheapest hosted pricing

Llama 3.2 90B Vision

Groq: $0.90 in / $0.90 out per 1M tokens

Llama 3.2 11B Vision

Groq: $0.18 in / $0.18 out per 1M tokens
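
To turn the per-token prices into a budget, multiply token counts by the listed rates. The following is a minimal sketch, assuming the prices above are current and that every token is billed at the listed rate. How image inputs are converted to billable tokens is provider-specific and is not modelled here. The dictionary keys are labels for this example, not provider model IDs, and the workload figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    usd_in_per_m: float   # USD per 1M input tokens
    usd_out_per_m: float  # USD per 1M output tokens


# Prices copied from the listing above; keys are illustrative labels only.
PRICES = {
    "llama-3.2-90b-vision": Price(0.90, 0.90),
    "llama-3.2-11b-vision": Price(0.18, 0.18),
}


def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Total cost in USD for the given token counts at the listed rates."""
    p = PRICES[model]
    return (input_tokens * p.usd_in_per_m + output_tokens * p.usd_out_per_m) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 10,000 requests, 2,000 input and 300 output tokens each.
    requests, tokens_in, tokens_out = 10_000, 2_000, 300
    for model in PRICES:
        total = cost_usd(model, requests * tokens_in, requests * tokens_out)
        print(f"{model}: ${total:.2f}")
```

For this hypothetical workload of 23M tokens, the 90B model comes to $20.70 and the 11B model to $4.14, a 5x difference that follows directly from the listed rates.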

In each row, the better value is the higher score, the larger context, or the lower VRAM.