OSAIM
Open Source AI Models

Llama 3.2 11B Vision

Llama 3's first vision-language model. Image understanding via a separately-trained ViT adapter bolted onto Llama 3 weights. Useful for OCR-adjacent workloads, document understanding, and image captioning at a permissive licence.

The 11B size makes it cheap to host. Combined with the 128K text context, it handles long PDF-with-images workflows comfortably on a single 4090.

Parameters
11B
Context length
128K
Modality
text, vision
Released
2024-09-25

Memory & hardware

VRAM (fp16)
22 GB
VRAM (Q4)
6.6 GB
Recommended
RTX 4090 24GB (fp16)
Quantizations
fp16, q8_0, q5_k_m, q4_k_m

License: Llama 3 Community License

SPDX
Commercial use
Yes
Modification
Yes
Redistribution
Yes

Benchmarks

MMLU
73.0
Benchmarks last verified 2026-05-18.

Hosted inference pricing

USD per million tokens.

ProviderInputOutput
groqCheapest$0.18$0.18
Pricing last verified 2026-05-18. Providers update rates frequently; confirm before integrating.

Run it yourself

Drop-in commands for the three most common open-source inference paths. The Ollama tag is a best-effort match against the registry; verify the size variant before pulling.

Run Llama 3.2 11B Vision locally
Ollama (easiest)
ollama run llama3.2:11b
Single-line install + run; uses the official Ollama registry tag for this family.
vLLM (production)
vllm serve meta-llama/Llama-3.2-11B-Vision-Instruct
High-throughput hosted inference; one command to expose an OpenAI-compatible HTTP server.
Transformers (Python)
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-11B-Vision-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-11B-Vision-Instruct", device_map="auto", torch_dtype="auto"
)
Direct PyTorch usage. Pin a torch / cuda version that matches your GPU.
Hugging Face ID: meta-llama/Llama-3.2-11B-Vision-Instruct

Related models

Same family or similar size — useful when shopping around.

Llama 3.1 8B Instruct
8B

The workhorse 8B instruction-tuned model. Excellent quality-to-cost ratio and the broadest ecosystem support of any open-weights model — every major inference engine, fine-tuning library, and quantization toolchain has a 3.1 8B preset. Fits in 24 GB of VRAM at fp16, ~6 GB at Q4. Strong default for production chat where 70B is overkill, for fine-tuning on a specialist task, and for any workload where you want a known-good baseline.

Context
128K
License
llama-3
VRAM Q4
4.8 GB
Llama 3.2 3B
3B

Pocket-sized Llama 3 variant for edge deployment. Surprising chat quality after instruction tuning makes it competitive with much larger models from a previous generation. At Q4 it fits in ~2 GB of VRAM and runs on consumer GPUs and recent Apple Silicon. A strong default for on-device chat, summarisation, and structured extraction tasks where the workload doesn't need frontier reasoning quality.

Context
128K
License
llama-3
VRAM Q4
1.8 GB
Stable LM 2 12B
12B

Stability AI's general-purpose 12B model. Apache 2.0. Useful default when you need a permissively-licensed 12B-scale model.

Context
4K
License
apache-2-0
VRAM Q4
7.2 GB
Mistral Nemo 12B
12B

Joint Mistral × NVIDIA model with 128K context, designed as a drop-in upgrade to Mistral 7B. Trained with NVIDIA's Megatron stack and released under Apache 2.0. Strong multilingual coverage thanks to the Tekken tokenizer.

Context
128K
License
apache-2-0
VRAM Q4
7.2 GB
Gemma 2 9B
9B

Mid-tier Gemma. Strong general-purpose chat model at small scale. The Gemma Terms of Use permit commercial use subject to Google's prohibited-use policy.

Context
8K
License
gemma
VRAM Q4
5.4 GB
Mistral 7B v0.3
7B

The original Mistral 7B refresh with 32K context and extended vocabulary. Permissive Apache 2.0 weights and the first widely-deployed sliding-window-attention model. Still useful in 2026 for very-low-cost inference and as a baseline for fine-tuning experiments.

Context
33K
License
apache-2-0
VRAM Q4
4.2 GB