OSAIM
Open Source AI Models

Llama 3.2 90B Vision vs Llama 3.2 11B Vision

Side-by-side specs, benchmarks and hosted-inference pricing.

Side A
Llama 3.2 90B Vision
Meta · Llama

The larger vision-language Llama 3.2 variant, competitive with proprietary multimodal models on standard image-understanding benchmarks. A drop-in vision upgrade where the 11B model isn't sharp enough. The weights need about 180 GB of GPU memory in fp16 (about 54 GB at Q4), so most teams will run it quantized or across multiple GPUs. A natural pairing with retrieval pipelines that fetch image-rich chunks alongside text.

Side B
Llama 3.2 11B Vision
Meta · Llama

The smaller of the first vision-language models in the Llama family. Image understanding comes from a separately trained vision adapter attached to the Llama 3.1 text weights. Useful for OCR-adjacent workloads, document understanding, and image captioning under a permissive licence. The 11B size makes it cheap to host: the weights take about 22 GB in fp16 and about 6.6 GB at Q4, so it fits on a single 24 GB RTX 4090, most comfortably when quantized. Combined with the 128K text context, it suits long PDF-with-images workflows.

Specs

Spec              Llama 3.2 90B Vision          Llama 3.2 11B Vision
Parameters        90B                           11B
Context length    128K                          128K
Modality          text, vision                  text, vision
Released          2024-09-25                    2024-09-25
License           Llama 3 Community License     Llama 3 Community License
Commercial use    Yes                           Yes
VRAM fp16         180 GB                        22 GB
VRAM Q4           54 GB                         6.6 GB
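
The VRAM figures above are consistent with a weights-only estimate: parameter count times bytes per parameter. The sketch below reproduces them under two assumptions that are not stated on this page: fp16 uses 16 bits per parameter, and Q4 is taken as an effective 4.8 bits per parameter, a value back-derived from the 54 GB figure to allow for quantization overhead. The function name and the 24 GB card size are illustrative. KV cache and activation memory are not modelled, so real usage will be higher, especially at long context lengths.

```python
def weight_vram_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate memory for model weights only, in GB (10^9 bytes)."""
    return params_billion * bits_per_param / 8


FP16_BITS = 16.0
Q4_EFFECTIVE_BITS = 4.8  # assumption: back-derived from 54 GB / 90B parameters
GPU_GB = 24.0            # illustrative: a single 24 GB card such as an RTX 4090

MODELS = {
    "Llama 3.2 90B Vision": 90.0,
    "Llama 3.2 11B Vision": 11.0,
}

for name, params_billion in MODELS.items():
    fp16_gb = weight_vram_gb(params_billion, FP16_BITS)
    q4_gb = weight_vram_gb(params_billion, Q4_EFFECTIVE_BITS)
    # Weights-only check; leaves no allowance for KV cache or activations.
    fits = "yes" if q4_gb <= GPU_GB else "no"
    print(f"{name}: fp16 ~{fp16_gb:.1f} GB, Q4 ~{q4_gb:.1f} GB, "
          f"Q4 weights fit in {GPU_GB:.0f} GB: {fits}")
```

Under these assumptions the estimate matches the table: 180 GB and 22 GB in fp16, and roughly 54 GB and 6.6 GB at Q4.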

Benchmarks

Benchmark         Llama 3.2 90B Vision          Llama 3.2 11B Vision
MMLU              86.0                          73.0

Cheapest hosted pricing

Llama 3.2 90B Vision
groq: $0.90 in / $0.90 out per 1M tokens
Llama 3.2 11B Vision
groq: $0.18 in / $0.18 out per 1M tokens
Highlighted cells indicate the better value for that row (higher score, larger context, lower VRAM).
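
To turn the per-token prices above into a budget, multiply token volumes by the listed rates. The sketch below is a rough estimate only: the workload numbers are hypothetical, the function name is illustrative, and how a provider counts image inputs toward tokens is not covered on this page and is not modelled here.

```python
# (input price, output price) in USD per 1M tokens, as listed above.
PRICES_PER_1M = {
    "Llama 3.2 90B Vision": (0.90, 0.90),
    "Llama 3.2 11B Vision": (0.18, 0.18),
}


def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in USD for the given token volumes."""
    price_in, price_out = PRICES_PER_1M[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical monthly workload: 50M input tokens, 5M output tokens.
INPUT_TOKENS = 50_000_000
OUTPUT_TOKENS = 5_000_000

for model in PRICES_PER_1M:
    monthly = cost_usd(model, INPUT_TOKENS, OUTPUT_TOKENS)
    print(f"{model}: ${monthly:.2f} per month")
```

At the listed rates the 90B model costs five times as much per token as the 11B model, so the choice comes down to whether the quality gain justifies that multiple for a given workload.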