Best for Vision

Best open-source vision-language models

Vision-language models are the multimodal frontier of open-source AI. The gap to closed, proprietary models is wider here than it is for text-only models, but the leading open options are now usable for document understanding, OCR-adjacent extraction, and image captioning.

What we optimise for

We're optimising for image-text-to-text quality on benchmarks like MMMU and DocVQA, licence permissiveness, and tractable inference cost.

Why it matters

Closed APIs charge per image and may rate-limit at high volume. Open vision models let you batch through millions of images without per-call billing.

Our picks

#1 Llama 3.2 90B Vision
Strongest open vision-language model at 90B; competitive with the proprietary frontier on standard image-understanding benchmarks.
#2 Llama 3.2 11B Vision
11B variant: cheap to host, and a 4-bit quantized build fits on a single RTX 4090.
#3 Yi VL 34B
Best open option for Chinese-language text in images.

Things to watch out for

Image preprocessing matters: most vision models expect specific input resolutions, and mis-sized images can cut accuracy by 10–20%.
Token budget: each image consumes hundreds to thousands of tokens of context, which competes with text input.
For OCR specifically, dedicated tools (Tesseract, purpose-built document-parsing models) often outperform general VLMs on dense text pages.
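The first two points above lend themselves to a quick back-of-the-envelope check before you send anything to a model. The sketch below is illustrative only: the 448-pixel target, the 14-pixel patch size, and the one-token-per-patch rule are assumptions chosen for the arithmetic, not the documented behaviour of any model on this page, and the function names are hypothetical. Check the model card for the real input resolution and tokenization scheme.

```python
import math


def letterbox_size(width, height, target):
    """Fit (width, height) inside a target x target square, keeping aspect ratio.

    Returns (new_width, new_height, pad_width, pad_height), where the pad
    values are the pixels of padding still needed to fill the square.
    """
    if width <= 0 or height <= 0 or target <= 0:
        raise ValueError("dimensions must be positive")
    longest = max(width, height)
    new_w = max(1, round(width * target / longest))
    new_h = max(1, round(height * target / longest))
    return new_w, new_h, target - new_w, target - new_h


def image_tokens(width, height, patch=14):
    """Rough token count, ASSUMING one token per patch x patch tile."""
    return math.ceil(width / patch) * math.ceil(height / patch)


def text_budget(context_tokens, n_images, tokens_per_image):
    """Context tokens left for text once the images are accounted for."""
    return max(0, context_tokens - n_images * tokens_per_image)


TARGET = 448  # hypothetical input resolution, not taken from any model card

new_w, new_h, pad_w, pad_h = letterbox_size(1600, 1200, TARGET)
per_image = image_tokens(TARGET, TARGET)

print("resize to:", new_w, "x", new_h, "padding:", pad_w, pad_h)
print("tokens per image:", per_image)
print("text tokens left in a 4096-token window with 2 images:",
      text_budget(4096, 2, per_image))
```

Under these placeholder numbers, each padded image costs 1,024 tokens, so two images would take half of a 4,096-token window. That is why short-context models need a tighter image count than the 128K models.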

All picks at a glance

Llama 3.2 90B Vision

90B

The larger vision-language Llama variant, competitive with the proprietary multimodal frontier on standard image-understanding benchmarks. It drops in as a vision upgrade where 11B isn't sharp enough. The fp16 weights alone need roughly 180 GB (90B parameters at 2 bytes each), so most teams will run it quantized or across multiple GPUs. A natural pairing with retrieval pipelines that fetch image-rich chunks alongside text.

Context: 128K
License: llama-3
VRAM Q4: 54 GB

Llama 3.2 11B Vision

11B

Llama 3's first vision-language model. Image understanding comes from a separately trained ViT adapter bolted onto Llama 3 weights. Useful for OCR-adjacent workloads, document understanding, and image captioning under the Llama community licence. The 11B size makes it cheap to host: quantized to 4 bits it fits on a single RTX 4090, and the 128K text context leaves room for long PDF-with-images workflows.

Context: 128K
License: llama-3
VRAM Q4: 6.6 GB

Yi VL 34B

34B

Vision-language variant of Yi 34B. Image-text reasoning comes from an MLP adapter on a CLIP encoder. Useful for bilingual English/Chinese multimodal workloads, where the major Western vision-language models underperform on Chinese text in images. The 4K context is tight once image tokens are counted.

Context: 4K
License: apache-2-0
VRAM Q4: 20.4 GB
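
The three VRAM Q4 figures above are consistent with a simple rule of thumb of about 0.6 GB per billion parameters (90 × 0.6 = 54, 11 × 0.6 = 6.6, 34 × 0.6 = 20.4). The sketch below reproduces that arithmetic and checks it against a GPU's memory. The 0.6 factor is inferred from the numbers on this page rather than measured, the 2 GB headroom is a made-up placeholder, and the result covers weights only: KV cache, image activations, and runtime overhead all add to it.

```python
# Inferred from the figures on this page; an assumption, not a measurement.
GB_PER_BILLION_PARAMS_Q4 = 0.6


def vram_q4_gb(params_billion):
    """Estimated weight memory in GB at 4-bit quantization (weights only)."""
    return round(params_billion * GB_PER_BILLION_PARAMS_Q4, 1)


def fits(params_billion, gpu_gb, headroom_gb=2.0):
    """True if the estimated weights plus a hypothetical headroom fit on the GPU."""
    return vram_q4_gb(params_billion) + headroom_gb <= gpu_gb


MODELS = {
    "Llama 3.2 90B Vision": 90,
    "Llama 3.2 11B Vision": 11,
    "Yi VL 34B": 34,
}

for name, size in MODELS.items():
    # 24 GB is the memory of a single RTX 4090.
    print(name, vram_q4_gb(size), "GB", "fits on 24 GB:", fits(size, 24))
```

Treat a positive result as a first filter only. Long contexts and many images per request can push real memory use well past the weight estimate.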

Last reviewed 2026-07-05. We refresh these picks as new models ship. See the full directory at /models.