Best open-source AI models for edge / on-device inference
Edge inference is where the model meets the user: no network round-trip, no server bills, no data leaving the device. The current crop of sub-4B models is good enough for many real workloads.
We're optimising for three things: under 4 GB of VRAM at Q4 (4-bit) quantization, CPU performance, and broad ecosystem support (Ollama, llama.cpp, MLX, NPU-accelerated runtimes).
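As a rough illustration of the memory criterion, the weight footprint of a quantized model can be estimated from its parameter count. This is a back-of-envelope sketch, not a measurement: the function names, the default of 4.5 effective bits per weight, and the 0.5 GB overhead allowance are assumptions chosen for illustration. Real Q4 formats vary, and the KV cache and runtime buffers add more on top.

```python
def estimate_weight_memory_gb(n_params: float, bits_per_weight: float = 4.5) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes).

    Counts weights only. bits_per_weight=4.5 is an assumed effective
    rate for a 4-bit format that also stores per-block scales.
    """
    return n_params * bits_per_weight / 8 / 1e9


def fits_budget(n_params: float, budget_gb: float = 4.0,
                overhead_gb: float = 0.5, bits_per_weight: float = 4.5) -> bool:
    """Check weights plus an assumed fixed overhead against a memory budget."""
    return estimate_weight_memory_gb(n_params, bits_per_weight) + overhead_gb <= budget_gb


if __name__ == "__main__":
    for n_params in (1e9, 3e9, 7e9):
        size = estimate_weight_memory_gb(n_params)
        print(f"{n_params / 1e9:.0f}B params: ~{size:.2f} GB weights, "
              f"fits 4 GB budget: {fits_budget(n_params)}")
```

The estimate is most useful for ruling models out early; measure the actual footprint in your target runtime before committing to one.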
On-device inference enables fundamentally different products: offline assistants, latency-sensitive UX, private inference with zero cloud dependency.
Our picks
Runs on a modern smartphone. The de facto on-device default.
Step up in quality; still fits comfortably on a laptop or recent phone.
GPT-3.5-class on academic benchmarks at <4B params.
Compact Gemma. Strong default if the Llama community licence doesn't fit.
Things to watch out for
- Smaller models hallucinate more. Pair them with strict structured-output formats (JSON schema) when correctness matters.
- LoRA makes on-device fine-tuning practical at this scale. Consider personalising a 1–3B model on your user's own data.
- Battery and thermal budget matter as much as raw model quality on mobile.
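The first point above, pairing a small model with a strict output format, can be sketched as a validation gate between the model and the rest of the application. This is a minimal example using only the standard library; the field names and the `parse_structured_output` helper are hypothetical, and a production system would more likely use a full JSON Schema validator or a runtime's constrained-decoding feature.

```python
import json
from typing import Any, Dict, Optional

# Hypothetical schema for illustration: field name -> required Python type.
REQUIRED_FIELDS: Dict[str, type] = {"title": str, "priority": int, "tags": list}


def parse_structured_output(raw: str) -> Optional[Dict[str, Any]]:
    """Return the parsed object if it matches the schema exactly, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # the model produced something that is not JSON
    if not isinstance(data, dict):
        return None
    if set(data) != set(REQUIRED_FIELDS):
        return None  # missing or unexpected (possibly hallucinated) fields
    for name, expected_type in REQUIRED_FIELDS.items():
        value = data[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            return None
    return data


if __name__ == "__main__":
    good = '{"title": "Fix login", "priority": 2, "tags": ["auth"]}'
    bad = '{"title": "Fix login", "priority": "high"}'
    print(parse_structured_output(good))
    print(parse_structured_output(bad))
```

When the gate returns `None`, the caller can retry the generation or fall back to a safe default rather than pass unchecked output downstream.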
All picks at a glance
The smallest Llama 3 release, designed for on-device inference on phones and laptops. The 1B model runs comfortably in <2 GB of RAM at Q4 quantization and is fast enough for real-time chat on a modern smartphone. Useful for edge inference, on-device assistants where round-tripping to a server is undesirable, and as a draft model for speculative decoding in front of a larger Llama 3 variant.
- Context: 128K
- License: llama-3
- VRAM Q4: 0.6 GB
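The draft-model use mentioned in the description above can be illustrated with a toy version of speculative decoding. This sketch shows only the greedy variant, where a drafted token is accepted when it equals the target's own greedy choice; it is not the sampling-based acceptance rule used by production runtimes. The "models" are hypothetical stand-in functions, and the target is called once per position for clarity, whereas a real runtime scores all drafted positions in a single batched forward pass.

```python
from typing import Callable, List

# A "model" here is any function mapping a token sequence to its greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Plain autoregressive greedy decoding, used as the reference."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: the draft proposes k tokens, the target verifies."""
    out = list(prompt)
    limit = len(prompt) + max_new
    while len(out) < limit:
        # 1. The cheap draft model proposes k tokens autoregressively.
        ctx = list(out)
        proposal = []
        for _ in range(k):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. The target checks each position. Its own token is always the one kept,
        #    so a mismatch simply ends the accepted run with the corrected token.
        for tok in proposal:
            expected = target(out)
            out.append(expected)
            if expected != tok:
                break
        else:
            # Every drafted token matched; the target adds one more token.
            out.append(target(out))
    return out[:limit]


# Hypothetical toy models: the draft agrees with the target most of the time.
def toy_target(tokens: List[int]) -> int:
    return (tokens[-1] * 3 + 1) % 11


def toy_draft(tokens: List[int]) -> int:
    guess = (tokens[-1] * 3 + 1) % 11
    return guess if len(tokens) % 3 else (guess + 1) % 11


if __name__ == "__main__":
    fast = speculative_decode(toy_target, toy_draft, [2], max_new=10)
    slow = greedy_decode(toy_target, [2], max_new=10)
    print(fast == slow)
```

The output always matches what the target would have produced on its own; the draft model only affects how many target steps can be verified together, which is where the speed-up comes from in a real runtime.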
Pocket-sized Llama 3 variant for edge deployment. Surprising chat quality after instruction tuning makes it competitive with much larger models from a previous generation. At Q4 it fits in ~2 GB of VRAM and runs on consumer GPUs and recent Apple Silicon. A strong default for on-device chat, summarisation, and structured extraction tasks where the workload doesn't need frontier reasoning quality.
- Context: 128K
- License: llama-3
- VRAM Q4: 1.8 GB
Microsoft's flagship small-model demonstration: GPT-3.5-class on academic benchmarks at <4B parameters. The 4K context-window variant is the lightest; a 128K variant ships separately. MIT licensed, well-suited to on-device assistants and structured-extraction workloads where compactness matters more than absolute quality.
- Context: 4K
- License: mit
- VRAM Q4: 2.3 GB
Compact Gemma variant designed for on-device inference. Trained with knowledge distillation from larger Gemma 2 teachers. Runs comfortably on a phone at Q4.
- Context: 8K
- License: gemma
- VRAM Q4: 1.6 GB