FastVLM: Accelerating Vision-Language Models with Efficient Visual Encoders

2025-09-16

Apple’s FastVLM (CVPR 2025) tackles a key challenge in Vision-Language Models (VLMs): latency.
The work focuses on two main bottlenecks:

  1. Visual encoder inference time,
  2. The number of visual tokens passed to the LLM, which directly impacts prefill latency (see the sketch below).
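
As a back-of-the-envelope model (my own sketch, not a formula from the paper), TTFT is roughly the vision-encoder forward pass plus the LLM prefill over the full input sequence, so shrinking either term pays off directly:

    # Rough TTFT decomposition: encoder latency + prefill over all tokens.
    # All numbers below are hypothetical placeholders, not measurements.
    def estimate_ttft_ms(encoder_ms: float,
                         num_visual_tokens: int,
                         num_text_tokens: int,
                         prefill_ms_per_token: float) -> float:
        prefill_ms = (num_visual_tokens + num_text_tokens) * prefill_ms_per_token
        return encoder_ms + prefill_ms

    # Fewer visual tokens and a lighter encoder both cut TTFT.
    print(estimate_ttft_ms(120, num_visual_tokens=576, num_text_tokens=64,
                           prefill_ms_per_token=0.5))  # 440.0
    print(estimate_ttft_ms(60, num_visual_tokens=144, num_text_tokens=64,
                           prefill_ms_per_token=0.5))  # 164.0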


The Problem: High Resolution, High Latency

Text-rich images (documents, tables, charts) require higher resolutions for accuracy. However, traditional ViT-based encoders suffer from:

  • Exploding token counts at higher input resolutions,
  • Longer LLM prefill phases,
  • Significant increases in TTFT (Time-To-First-Token).

This creates a tradeoff: higher resolution improves accuracy, but the added delay hurts responsiveness.
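
To make the token-count explosion concrete, here is a quick calculation for a plain ViT (my own illustration; patch size 14 matches the common ViT-L/14 configuration and is not a number taken from the paper):

    # Patch tokens for a square input grow quadratically with resolution.
    def vit_token_count(resolution: int, patch_size: int = 14) -> int:
        return (resolution // patch_size) ** 2

    for res in (336, 672, 1024):
        print(res, vit_token_count(res))  # 576, 2304, 5329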



The Solution: FastViTHD Encoder

At the heart of FastVLM lies FastViTHD, a hierarchical hybrid encoder combining convolutional and transformer layers:

  • Early convolutional stages provide natural downsampling,
  • Later transformer blocks refine high-level features,
  • Multi-scale feature fusion improves text-rich image understanding,
  • Only input resolution scaling is needed to reach new accuracy–latency points.

Unlike token pruning or complex resamplers, FastViTHD gets its efficiency from the encoder design itself; trading accuracy for latency is just a matter of changing the input resolution.
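
To illustrate the hierarchical hybrid idea, here is a minimal PyTorch sketch (a toy model with made-up stage widths and depths; it omits FastViTHD specifics such as multi-scale feature fusion and is not the actual architecture). The point is structural: stride-2 convolutional stages shrink the spatial grid before any attention runs, so the transformer at the end sees far fewer tokens.

    import torch
    import torch.nn as nn

    class HybridEncoderSketch(nn.Module):
        """Toy conv-then-transformer encoder (illustrative only)."""
        def __init__(self, widths=(64, 128, 256, 512, 512), depth=2):
            super().__init__()
            stages, in_ch = [], 3
            for out_ch in widths:
                stages.append(nn.Sequential(
                    nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
                    nn.BatchNorm2d(out_ch),
                    nn.GELU(),
                ))
                in_ch = out_ch
            # Five stride-2 stages downsample the input by 32x overall.
            self.conv_stages = nn.Sequential(*stages)
            layer = nn.TransformerEncoderLayer(d_model=widths[-1], nhead=8,
                                               batch_first=True)
            self.transformer = nn.TransformerEncoder(layer, num_layers=depth)

        def forward(self, x):
            x = self.conv_stages(x)                # (B, C, H/32, W/32)
            tokens = x.flatten(2).transpose(1, 2)  # (B, N, C) token sequence
            return self.transformer(tokens)

    enc = HybridEncoderSketch()
    out = enc(torch.randn(1, 3, 1024, 1024))
    print(out.shape)  # torch.Size([1, 1024, 512]): a 32x32 grid of tokens

For comparison, a plain ViT-L/14 at the same 1024² input would emit over 5,000 patch tokens before any merging; aggressive downsampling inside the encoder is what keeps the token budget small without pruning.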



Results

  • ~3.2× faster TTFT than prior work under the LLaVA-1.5 training recipe.
  • Paired with a 0.5B LLM and compared against LLaVA-OneVision at 1152×1152:
    • Comparable or better accuracy,
    • ~85× faster TTFT,
    • ~3.4× smaller visual encoder.
  • On larger setups (e.g., Qwen2-7B), FastVLM remains competitive with models like Cambrian-1, while keeping encoder speed advantages.


Training Setup

  • Stage 1: Projector alignment (LLaVA-558K, 1 epoch).
  • Stage 2: Visual instruction tuning (LLaVA-665K, 1 epoch).
  • Optional Stage 1.5: Caption-heavy datasets (CC3M/CC12M) for resolution scaling.
  • Hardware: Trained on 8× H100-80GB (e.g., Stage 1.5 at 1024² ≈ 77 hours).
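
Written out as a config, the recipe looks roughly like this (a sketch only: datasets, epochs, and stage order come from the summary above, while the trainable-module lists are assumptions following common LLaVA-style practice and should be checked against the official release):

    # Staged training recipe sketch; "trainable" fields are assumptions.
    TRAINING_STAGES = [
        {"name": "stage1_projector_alignment",
         "data": "LLaVA-558K", "epochs": 1,
         "trainable": ["projector"]},                    # assumption
        {"name": "stage1.5_resolution_scaling",          # optional
         "data": ["CC3M", "CC12M"], "epochs": 1,         # epochs: assumption
         "trainable": ["projector", "vision_encoder"]},  # assumption
        {"name": "stage2_visual_instruction_tuning",
         "data": "LLaVA-665K", "epochs": 1,
         "trainable": ["projector", "llm"]},             # assumption
    ]

    for stage in TRAINING_STAGES:
        print(stage["name"], "->", stage["data"])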


Practical Takeaways

  • For on-device multimodal AI, static resolution scaling is often more efficient than dynamic tiling (see the comparison sketch after this list).
  • Avoid pairing tiny LLMs with ultra-high resolutions; the extra visual tokens are largely wasted and prefill latency dominates.
  • Instead of pruning tokens, train with hierarchical encoders at lower resolutions to produce fewer but more informative tokens.
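
The first takeaway is easy to quantify (again an illustration with assumed patch and downsampling factors, not numbers from the paper): dynamic tiling splits a high-resolution image into several tiles, often plus a thumbnail, multiplying the token budget by the tile count, while static scaling through a heavily downsampling hierarchical encoder keeps a single small grid.

    # Visual token budget: dynamic tiling vs. static resolution scaling.
    # Patch size, tile count, and downsampling factor are illustrative.
    def tiled_tokens(tile_res=336, patch=14, tiles=4, thumbnail=True):
        per_tile = (tile_res // patch) ** 2
        return per_tile * (tiles + (1 if thumbnail else 0))

    def static_tokens(res=1024, downsample=64):
        return (res // downsample) ** 2

    print("dynamic tiling :", tiled_tokens())   # 2880 tokens (4 tiles + thumbnail)
    print("static scaling :", static_tokens())  # 256 tokens at 1024x1024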


Why It Matters

FastVLM redefines efficiency for VLMs:

  • Fewer tokens → faster responses,
  • Smaller encoders → cheaper deployment,
  • Resolution-only scaling → simpler design.

This makes it a strong candidate for document understanding, RAG pipelines, and mobile multimodal applications where latency is critical.


