OSAIM
Open Source AI Models

Llama 3.2 90B Vision

Larger vision-language Llama variant, competitive with the proprietary multimodal frontier on standard image-understanding benchmarks. Drops in as a vision upgrade where 11B isn't sharp enough.

Requires substantial GPU memory in fp16; most teams will run it quantized or on multi-GPU. A natural pairing with retrieval pipelines that fetch image-rich chunks alongside text.

Parameters
90B
Context length
128K
Modality
text, vision
Released
2024-09-25

Memory & hardware

VRAM (fp16)
180 GB
VRAM (Q4)
54 GB
Recommended
2× A100 80GB or 1× H200
Quantizations
fp16, q8_0, q5_k_m, q4_k_m

License: Llama 3 Community License

SPDX
Commercial use
Yes
Modification
Yes
Redistribution
Yes

Benchmarks

MMLU
86.0
Benchmarks last verified 2026-05-18.

Hosted inference pricing

USD per million tokens.

ProviderInputOutput
groqCheapest$0.90$0.90
Pricing last verified 2026-05-18. Providers update rates frequently; confirm before integrating.

Run it yourself

Drop-in commands for the three most common open-source inference paths. The Ollama tag is a best-effort match against the registry; verify the size variant before pulling.

Run Llama 3.2 90B Vision locally
Ollama (easiest)
ollama run llama3.2:90b
Single-line install + run; uses the official Ollama registry tag for this family.
vLLM (production)
vllm serve meta-llama/Llama-3.2-90B-Vision-Instruct
High-throughput hosted inference; one command to expose an OpenAI-compatible HTTP server.
Transformers (Python)
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-90B-Vision-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-90B-Vision-Instruct", device_map="auto", torch_dtype="auto"
)
Direct PyTorch usage. Pin a torch / cuda version that matches your GPU.
Hugging Face ID: meta-llama/Llama-3.2-90B-Vision-Instruct

Related models

Same family or similar size — useful when shopping around.

Llama 3.3 70B Instruct
70B

Meta's December 2024 refresh of Llama 3 70B that closes most of the gap with Llama 3.1 405B for chat workloads while remaining tractable on a single H100. Strong instruction following, robust tool-use behaviour, and a 128K context window make it the default choice for production chat at 70B scale. The 3.3 release was trained on a refreshed instruction-tuning data mix and benefits from Meta's most recent alignment work. It outperforms the much larger 3.1 405B on several reasoning benchmarks at a fraction of inference cost. The licence is the Llama 3 Community License, which permits commercial use unless your service exceeds 700M monthly active users. Good pick for: production chat at scale, RAG over long documents, agentic workflows where tool use matters, and any 70B-tier replacement for closed proprietary models.

Context
128K
License
llama-3
VRAM Q4
42 GB
Command R+
104B

Cohere's flagship 104B model. RAG-focused with native multilingual support across ~10 high-resource languages. CC-BY-NC weights; commercial use via Cohere's hosted API.

Context
128K
License
mrl
VRAM Q4
62.4 GB
Qwen2.5 72B Instruct
72B

The flagship Qwen 2.5 release. Competes with Llama 3.1 405B on many benchmarks at one-fifth the parameter count. Note the 72B specifically uses the Qwen License (commercial use up to 100M MAU) — the smaller Qwen2.5 sizes are Apache 2.0.

Context
128K
License
qwen
VRAM Q4
43.2 GB
DeepSeek R1 Distill Llama 70B
70B

R1 reasoning capabilities distilled into a Llama 3.3 70B base. The most accessible way to run R1-class reasoning locally — fits on a single H100 in fp16 or on a 4090 at Q4. Inherits Llama 3's community licence (commercial use under 700M MAU). Great pick for production reasoning workloads where the full R1 is too expensive to host but o1/R1-style quality is required.

Context
128K
License
llama-3
VRAM Q4
42 GB
Llama 3.1 Nemotron 70B Instruct
70B

NVIDIA's RLHF-tuned Llama 3.1 70B. Tops several Arena-style human-preference leaderboards and shipped with NVIDIA's reward-model research. Inherits the Llama 3 community licence.

Context
128K
License
llama-3
VRAM Q4
42 GB
Mixtral 8×22B Instruct
141B

Scaled-up Mixtral with 22B-parameter experts. ~39B active parameters out of 141B total. Strong long-context performance and competitive coding scores. Apache 2.0 makes it attractive for self-hosting where the licence terms of Llama 3 are a non-starter.

Context
66K
License
apache-2-0
VRAM Q4
84.6 GB