Redefining Efficiency for Visual AI
FastVLM is a Vision Language Model (VLM) from Apple, engineered for efficient inference. It runs directly on your devices, such as iPhone and Mac, which keeps data local and supports real-time responsiveness. The core innovation is FastViTHD, a hybrid vision encoder that outputs fewer tokens and significantly reduces encoding time for high-resolution images.
Core Advantages of FastVLM
Extreme Speed
FastVLM-0.5B produces its first output token 85x faster than LLaVA-OneVision-0.5B, fast enough for real-time interaction.
Compact & Efficient
With a vision encoder 3.4x smaller than the one in LLaVA-OneVision-0.5B, the smallest FastVLM variant is sized for on-device deployment.
On-Device Intelligence
By processing data locally, FastVLM eliminates cloud dependency, enhances user privacy, and delivers instant results for edge AI applications.
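The first-token figure quoted above is a time-to-first-token (TTFT) measurement: the delay between submitting a request and receiving the first generated token. The sketch below shows one way such a delay could be timed for any streaming generator. `fake_stream` is a hypothetical stand-in for a model, not part of FastVLM, and the delay it simulates is arbitrary.

```python
import time
from typing import Iterable, Iterator, Tuple


def time_to_first_token(stream: Iterable[str]) -> Tuple[float, str]:
    """Return (seconds until the first token arrives, the first token)."""
    start = time.perf_counter()
    first = next(iter(stream))  # blocks until the stream yields once
    return time.perf_counter() - start, first


def fake_stream(delay_s: float = 0.05) -> Iterator[str]:
    """Hypothetical model: sleeps to imitate image encoding plus prefill."""
    time.sleep(delay_s)  # runs lazily, on the first next() call
    yield "A"
    yield " cat"


if __name__ == "__main__":
    ttft, token = time_to_first_token(fake_stream())
    print(f"first token {token!r} after {ttft * 1000:.1f} ms")
```

For a vision language model, this interval covers both image encoding and the LLM's processing of the prompt, which is why a faster encoder that emits fewer tokens affects it directly.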
How FastVLM Works
Understand Image Content
The innovative FastViTHD encoder efficiently converts high-resolution images into compact visual tokens.
Generate Textual Output
These tokens are processed by a Large Language Model (LLM) to produce accurate descriptions, answers, or analyses.
Optimize for Performance
By minimizing token count and latency, FastVLM achieves its remarkable speed without sacrificing the quality of its output.
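To make the link between resolution and token count concrete, here is a minimal sketch of the arithmetic. It assumes an encoder whose output grid is the input resolution divided by a fixed stride along each axis. The stride values used are illustrative assumptions, not figures taken from this page.

```python
def visual_token_count(height: int, width: int, stride: int) -> int:
    """Tokens produced when the output grid is (height/stride) x (width/stride)."""
    if height % stride or width % stride:
        raise ValueError("resolution must be divisible by the stride")
    return (height // stride) * (width // stride)


if __name__ == "__main__":
    # Illustrative strides only (assumptions): 14 imitates a patch-14 ViT,
    # 64 imitates an encoder that downsamples more aggressively.
    for side, stride in [(336, 14), (1024, 64)]:
        tokens = visual_token_count(side, side, stride)
        print(f"{side}x{side} at stride {stride}: {tokens} tokens")
```

Under these assumed strides, the larger image produces fewer tokens than the smaller one, so the LLM has less to process before it can emit its first token.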
Real-World Applications
Image Captioning
Automatically generate vivid and accurate text descriptions for any image.
Visual Question Answering
Ask questions about an image's content and receive instant, intelligent answers.
Intelligent Analysis
Recognize and analyze objects, text, and data within images for powerful insights.
Get Started with FastVLM
Download the models to begin building with FastVLM. Checkpoints are available for PyTorch, along with pre-converted models for Apple Silicon devices.
PyTorch Checkpoints
| Model | Stage | Download Link |
|---|---|---|
| FastVLM-0.5B | 2 | fastvlm_0.5b_stage2 |
| FastVLM-0.5B | 3 | fastvlm_0.5b_stage3 |
| FastVLM-1.5B | 2 | fastvlm_1.5b_stage2 |
| FastVLM-1.5B | 3 | fastvlm_1.5b_stage3 |
| FastVLM-7B | 2 | fastvlm_7b_stage2 |
| FastVLM-7B | 3 | fastvlm_7b_stage3 |
Apple Silicon Compatible Models
| Model | Download |
|---|---|
| FastVLM-0.5B (Stage 3, fp16) | Download |
| FastVLM-1.5B (Stage 3, int8) | Download |
| FastVLM-7B (Stage 3, int4) | Download |
Getting Started for Developers
Setup
To train or fine-tune your own FastVLM variants, please follow the instructions in the LLaVA codebase. The following commands set up the environment for running inference; run them from the root of the cloned repository.
conda create -n fastvlm python=3.10
conda activate fastvlm
pip install -e .
Download Models
To download all the pretrained checkpoints, run the shell script below. The files will be downloaded to a checkpoints directory.
bash get_models.sh
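After the script finishes, a quick check can confirm which checkpoints are present. The sketch below assumes each checkpoint ends up in the checkpoints directory under the name listed in the PyTorch checkpoint table; that layout is an assumption, so adjust the names if your directory looks different. `missing_checkpoints` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path

# Names copied from the PyTorch checkpoint table; the on-disk layout is assumed.
EXPECTED = [
    f"fastvlm_{size}_stage{stage}"
    for size in ("0.5b", "1.5b", "7b")
    for stage in (2, 3)
]


def missing_checkpoints(root="checkpoints"):
    """Return the expected checkpoint names that are not found under `root`."""
    root_path = Path(root)
    return [name for name in EXPECTED if not (root_path / name).exists()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All expected checkpoints found.")
```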
Usage Example
Run inference with a PyTorch checkpoint using the following command:
python predict.py --model-path /path/to/checkpoint-dir \
    --image-file /path/to/image.png \
    --prompt "Describe the image."
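To caption several images, the same command can be driven from Python. This is a minimal sketch that uses only the three flags shown above and assumes it is run from the directory containing predict.py. `build_command` and `caption_folder` are hypothetical helper names, not part of the repository.

```python
import subprocess
import sys
from pathlib import Path


def build_command(model_path, image_file, prompt):
    """Build the predict.py invocation as an argument list (no shell quoting needed)."""
    return [
        sys.executable, "predict.py",
        "--model-path", model_path,
        "--image-file", image_file,
        "--prompt", prompt,
    ]


def caption_folder(model_path, image_dir, prompt="Describe the image."):
    """Run predict.py once per PNG in `image_dir`, stopping on the first failure."""
    for image in sorted(Path(image_dir).glob("*.png")):
        subprocess.run(build_command(model_path, str(image), prompt), check=True)
```

Each call starts a new process and reloads the model, so this approach suits small batches. For larger ones, loading the model once inside a single Python process would avoid the repeated startup cost.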
Inference on Apple Silicon
To run inference on Apple Silicon, PyTorch checkpoints must be exported to a suitable format. Detailed instructions can be found in the model_export subfolder in the official repository. Pre-converted models are also available for convenience in the download section above.
Citation
If you find FastVLM useful in your research, please consider citing the paper:
@InProceedings{fastvlm2025,
author = {Pavan Kumar Anasosalu Vasu and Fartash Faghri and Chun-Liang Li and Cem Koc and Nate True and Albert Antony and Gokul Santhanam and James Gabriel and Peter Grasch and Oncel Tuzel and Hadi Pouransari},
title = {FastVLM: Efficient Vision Encoding for Vision Language Models},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2025},
}