FastVLM

Apple's Vision Language Model, built for fast on-device inference.

Redefining Efficiency for Visual AI

FastVLM is a Vision Language Model (VLM) developed by Apple for efficient inference. It runs directly on devices such as iPhone and Mac, which keeps data local and supports real-time responsiveness. Its core component is the FastViTHD vision encoder, which emits fewer visual tokens and reduces encoding latency, especially for high-resolution images.

Core Advantages of FastVLM

Extreme Speed

FastVLM-0.5B produces its first output token (time-to-first-token, TTFT) 85x faster than LLaVA-OneVision-0.5B, enabling real-time interaction.

Compact & Efficient

With a vision encoder 3.4x smaller than that of LLaVA-OneVision-0.5B, FastVLM is suited to on-device deployment.

On-Device Intelligence

By processing data locally, FastVLM removes the dependency on a cloud service, keeps user data on the device, and delivers low-latency results for edge AI applications.
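The first-token latency cited under Extreme Speed can be measured for any streaming model with a small timing helper. The sketch below is illustrative only: `fake_model_stream` is a hypothetical stand-in that sleeps to mimic prefill and is not part of FastVLM.

```python
import time
from typing import Iterable, Iterator, Tuple


def time_to_first_token(stream: Iterable[str]) -> Tuple[str, float]:
    """Return the first token from `stream` and the seconds spent waiting for it."""
    start = time.perf_counter()
    iterator = iter(stream)
    first = next(iterator)  # blocks until the first token is produced
    return first, time.perf_counter() - start


def fake_model_stream(prefill_seconds: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming VLM: waits, then yields tokens."""
    time.sleep(prefill_seconds)  # mimics image encoding plus LLM prefill
    for token in ["A", " cat", " on", " a", " sofa", "."]:
        yield token


if __name__ == "__main__":
    token, seconds = time_to_first_token(fake_model_stream(0.05))
    print(f"first token {token!r} after {seconds * 1000:.1f} ms")
```

Because the generator body only starts running on the first `next()` call, the simulated prefill time is included in the measurement.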

How FastVLM Works

1. Understand Image Content

The innovative FastViTHD encoder efficiently converts high-resolution images into compact visual tokens.

2. Generate Textual Output

These tokens are processed by a Large Language Model (LLM) to produce accurate descriptions, answers, or analyses.

3. Optimize for Performance

By minimizing token count and latency, FastVLM achieves its remarkable speed without sacrificing the quality of its output.
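To make the link between token count and resolution concrete, here is a minimal sketch that counts visual tokens for an encoder emitting one token per `stride x stride` pixel cell. The function name is hypothetical, and the stride of 64 in the second call is an illustrative value, not a figure taken from this page.

```python
def visual_token_count(height: int, width: int, stride: int) -> int:
    """Tokens produced when the encoder emits one token per stride x stride cell."""
    if stride <= 0:
        raise ValueError("stride must be positive")
    return (height // stride) * (width // stride)


if __name__ == "__main__":
    # A ViT with 14-pixel patches on a 336x336 input: 24 * 24 tokens.
    print(visual_token_count(336, 336, 14))    # 576
    # Illustrative larger stride on a 1024x1024 input: 16 * 16 tokens.
    print(visual_token_count(1024, 1024, 64))  # 256
```

The LLM has to process every visual token before it can emit text, so an encoder with a larger overall stride can accept a higher-resolution image while still handing the LLM a shorter sequence.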

Real-World Applications

Image Captioning

Automatically generate vivid and accurate text descriptions for any image.

Visual Question Answering

Ask questions about an image's content and receive instant, intelligent answers.

Intelligent Analysis

Recognize and analyze objects, text, and data within images for powerful insights.

Get Started with FastVLM

Download the models to begin building with FastVLM. Checkpoints are available for PyTorch, along with pre-converted models for Apple Silicon devices.

PyTorch Checkpoints

Model          Stage   Download Link
FastVLM-0.5B   2       fastvlm_0.5b_stage2
FastVLM-0.5B   3       fastvlm_0.5b_stage3
FastVLM-1.5B   2       fastvlm_1.5b_stage2
FastVLM-1.5B   3       fastvlm_1.5b_stage3
FastVLM-7B     2       fastvlm_7b_stage2
FastVLM-7B     3       fastvlm_7b_stage3

Apple Silicon Compatible Models

Model                           Download
FastVLM-0.5B (Stage 3, fp16)    Download
FastVLM-1.5B (Stage 3, int8)    Download
FastVLM-7B (Stage 3, int4)      Download

Getting Started for Developers

Setup

To train or fine-tune your own FastVLM variants, please follow the instructions in the LLaVA codebase. The following commands will help you set up the environment for running inference.

conda create -n fastvlm python=3.10
conda activate fastvlm
pip install -e .

Download Models

To download all the pretrained checkpoints, run the shell script below. The files will be downloaded to a checkpoints directory.

bash get_models.sh
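After the script finishes, a quick check can confirm that the download is complete. The sketch below assumes each checkpoint ends up as a subdirectory of `checkpoints` named after the entries listed under PyTorch Checkpoints. That layout is an assumption and not something this page states, so adjust the names if your directory looks different. `missing_checkpoints` is a hypothetical helper.

```python
from pathlib import Path
from typing import List, Sequence

# Names copied from the PyTorch Checkpoints list; the on-disk layout is assumed.
EXPECTED_CHECKPOINTS = (
    "fastvlm_0.5b_stage2",
    "fastvlm_0.5b_stage3",
    "fastvlm_1.5b_stage2",
    "fastvlm_1.5b_stage3",
    "fastvlm_7b_stage2",
    "fastvlm_7b_stage3",
)


def missing_checkpoints(root: Path,
                        expected: Sequence[str] = EXPECTED_CHECKPOINTS) -> List[str]:
    """Return the expected checkpoint names that are not directories under root."""
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_checkpoints(Path("checkpoints"))
    print("all checkpoints present" if not missing else f"missing: {missing}")
```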

Usage Example

Run inference with a PyTorch checkpoint using the following command:

python predict.py --model-path /path/to/checkpoint-dir \
                  --image-file /path/to/image.png \
                  --prompt "Describe the image."
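To caption a whole folder of images, the command above can be wrapped in a small script. This is a sketch under stated assumptions: `build_predict_command` and `caption_directory` are hypothetical helpers, only the three flags shown above are used, and the script is expected to run from the directory that contains `predict.py`.

```python
import subprocess
import sys
from pathlib import Path
from typing import List


def build_predict_command(model_path: str, image_file: str, prompt: str) -> List[str]:
    """Assemble the predict.py invocation using only the flags shown above."""
    return [
        sys.executable, "predict.py",
        "--model-path", model_path,
        "--image-file", image_file,
        "--prompt", prompt,
    ]


def caption_directory(model_path: str, image_dir: str,
                      prompt: str = "Describe the image.") -> None:
    """Run predict.py once for every PNG file in image_dir, in sorted order."""
    for image in sorted(Path(image_dir).glob("*.png")):
        command = build_predict_command(model_path, str(image), prompt)
        subprocess.run(command, check=True)  # raises CalledProcessError on failure


if __name__ == "__main__":
    caption_directory("/path/to/checkpoint-dir", "/path/to/images")
```

Each call starts a new process and reloads the model, which keeps the wrapper simple but is slow for large batches. Loading the model once inside a single Python process would be faster, at the cost of depending on the repository's internal code.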

Inference on Apple Silicon

To run inference on Apple Silicon, PyTorch checkpoints must be exported to a suitable format. Detailed instructions can be found in the model_export subfolder in the official repository. Pre-converted models are also available for convenience in the download section above.

Citation

If you find FastVLM useful in your research, please consider citing the paper:

@InProceedings{fastvlm2025,
  author = {Pavan Kumar Anasosalu Vasu and Fartash Faghri and Chun-Liang Li and Cem Koc and Nate True and Albert Antony and Gokul Santhanam and James Gabriel and Peter Grasch and Oncel Tuzel and Hadi Pouransari},
  title = {FastVLM: Efficient Vision Encoding for Vision Language Models},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2025},
}