Best for Vision
Best open-source vision-language models
Vision-language models are the multimodal frontier of open-source AI. The gap to closed, proprietary models is wider here than for text-only models, but the leading open options are now usable for document understanding, OCR-adjacent extraction, and image captioning.
We're optimising for image-text-to-text quality on benchmarks like MMMU and DocVQA, licence permissiveness, and tractable inference cost.
Closed APIs charge per image and may rate-limit at high volume. Open vision models let you batch through millions of images without per-call billing, paying only for the hardware you run them on.
Our picks
Strongest open vision-language model at 90B; competitive with the proprietary frontier on standard image-understanding benchmarks.
11B variant: cheap to host, runs on a single RTX 4090.
Best open option for Chinese-language image text.
Things to watch out for
- Image preprocessing matters: most vision models expect specific input resolutions. Mis-sized images can reduce accuracy by 10–20%.
- Token budget: each image consumes hundreds to thousands of tokens of context, which competes with text input.
- For OCR specifically, dedicated OCR tools (Tesseract, modern vision-LM document parsers) often outperform general VLMs on dense text pages.
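The first two points above can be made concrete with a little arithmetic. The sketch below is a rough estimate, not any model's actual preprocessing: it assumes a square target resolution, one token per image patch, and no extra special tokens, and the function names are hypothetical. The values 336 px and patch size 14 are illustrative only; check the model card for the real numbers, since many models tile or pool patches differently.

```python
import math


def fit_within(width: int, height: int, target: int) -> tuple[int, int]:
    """Scale (width, height) to fit inside a target x target square,
    preserving aspect ratio. Assumes the model wants a square input."""
    scale = target / max(width, height)
    return max(1, round(width * scale)), max(1, round(height * scale))


def patch_tokens(width: int, height: int, patch: int) -> int:
    """Estimate image tokens as the number of patch x patch tiles needed
    to cover the image (assumption: one token per patch)."""
    return math.ceil(width / patch) * math.ceil(height / patch)


def images_per_context(context_tokens: int, tokens_per_image: int,
                       reserved_text_tokens: int) -> int:
    """How many images fit once some of the context is reserved for text."""
    available = context_tokens - reserved_text_tokens
    return max(0, available // tokens_per_image)


# A 4032x3024 photo scaled to fit a 336 px square (illustrative target).
resized = fit_within(4032, 3024, 336)          # (336, 252)

# A full 336x336 input with 14 px patches: 24 * 24 patches.
tokens_full = patch_tokens(336, 336, 14)       # 576

# With a 4096-token context and 1024 tokens reserved for the prompt,
# only a handful of such images fit.
fit = images_per_context(4096, tokens_full, 1024)  # 5
```

Under these assumptions a short-context model runs out of room after a few images, while a 128K context leaves far more headroom for multi-page documents.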
All picks at a glance
Larger vision-language Llama variant, competitive with the proprietary multimodal frontier on standard image-understanding benchmarks. Drops in as a vision upgrade where the 11B variant isn't sharp enough. Requires substantial GPU memory in fp16 (roughly 180 GB for the weights alone at 2 bytes per parameter); most teams will run it quantized or across multiple GPUs. A natural pairing with retrieval pipelines that fetch image-rich chunks alongside text.
- Context: 128K
- License: llama-3
- VRAM Q4: 54 GB
Llama 3's first vision-language model. Image understanding comes from a separately trained ViT adapter bolted onto Llama 3 weights. Useful for OCR-adjacent workloads, document understanding, and image captioning under a permissive licence. The 11B size makes it cheap to host. Combined with the 128K text context, it handles long PDF-with-images workflows comfortably on a single RTX 4090.
- Context: 128K
- License: llama-3
- VRAM Q4: 6.6 GB
Vision-language variant of Yi 34B. Image-text reasoning via an MLP adapter on a CLIP encoder. Useful for bilingual EN/中 multimodal workloads, where the major Western vision-language models underperform on Chinese text in images.
- Context: 4K
- License: apache-2-0
- VRAM Q4: 20.4 GB
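The VRAM Q4 figures listed for the three picks all work out to about 0.6 GB per billion parameters (54 / 90, 6.6 / 11, and 20.4 / 34). The sketch below applies that ratio as a rule of thumb for sizing other models. The constant is inferred from the numbers on this page rather than taken from any quantization library, the function names and the 2 GB headroom default are hypothetical, and the estimate covers weights only: KV cache, image tokens, and activations need memory on top.

```python
# Ratio inferred from the listed figures: 54 GB / 90B = 6.6 GB / 11B = 0.6.
GB_PER_BILLION_PARAMS_Q4 = 0.6


def q4_weight_vram_gb(params_billion: float) -> float:
    """Rough Q4 weight footprint in GB for a model of the given size."""
    return round(params_billion * GB_PER_BILLION_PARAMS_Q4, 1)


def fits_on_gpu(params_billion: float, gpu_vram_gb: float,
                headroom_gb: float = 2.0) -> bool:
    """True if the Q4 weights plus some headroom fit in the given VRAM.
    headroom_gb is an assumed allowance for cache and activations."""
    return q4_weight_vram_gb(params_billion) + headroom_gb <= gpu_vram_gb


# The three model sizes on this page against a 24 GB card such as an RTX 4090.
for size in (90, 11, 34):
    print(size, q4_weight_vram_gb(size), fits_on_gpu(size, 24.0))
```

By this estimate the 11B and 34B models fit on a single 24 GB card at Q4, while the 90B model needs multiple GPUs or a larger accelerator. Long contexts or many images per request eat into the headroom quickly, so treat a narrow fit with caution.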