FastVLM: Accelerating Vision-Language Models with Efficient Visual Encoders
2025-09-16
Apple’s FastVLM (CVPR 2025) tackles a key challenge in Vision-Language Models (VLMs): latency.
The work focuses on two main bottlenecks:
- Visual encoder inference time,
- The number of visual tokens passed to the LLM, which directly drives prefill latency (a rough latency model is sketched below).
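
A back-of-the-envelope way to see why these two bottlenecks matter is to model TTFT as visual-encoder latency plus LLM prefill time over all input tokens. The sketch below uses made-up timings purely for illustration; none of the numbers come from the paper.

```python
# Illustrative back-of-the-envelope model of Time-To-First-Token (TTFT).
# All timings here are hypothetical placeholders, not measurements from the paper.

def estimate_ttft(encoder_ms: float, num_visual_tokens: int, num_text_tokens: int,
                  prefill_ms_per_token: float) -> float:
    """TTFT ~ visual encoder latency + LLM prefill over all input tokens."""
    prefill_ms = (num_visual_tokens + num_text_tokens) * prefill_ms_per_token
    return encoder_ms + prefill_ms

# Same prompt, few vs. many visual tokens (illustrative values only).
print(estimate_ttft(encoder_ms=50, num_visual_tokens=144, num_text_tokens=64,
                    prefill_ms_per_token=0.5))   # few visual tokens -> fast first token
print(estimate_ttft(encoder_ms=120, num_visual_tokens=2880, num_text_tokens=64,
                    prefill_ms_per_token=0.5))   # many visual tokens -> prefill dominates
```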
 
The Problem: High Resolution, High Latency
Text-rich images (documents, tables, charts) require higher resolutions for accuracy. However, traditional ViT-based encoders suffer from:
- Exploding token counts at higher input resolutions,
- Longer LLM prefill phases,
- Significant increases in TTFT (Time-To-First-Token).
 
This creates a tradeoff: higher resolution improves accuracy, but the added delay hurts usability.
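
To make the token explosion concrete, here is a small calculation for a plain patch-based ViT. Patch size 14 is a common CLIP-style choice, used here only as an assumption for illustration, not FastVLM's configuration; the point is that the token count grows quadratically with the input side length.

```python
# Visual token count for a plain patch-based ViT: (resolution / patch_size)^2.
# Patch size 14 is a common CLIP-style choice, used here purely for illustration.

def vit_token_count(resolution: int, patch_size: int = 14) -> int:
    return (resolution // patch_size) ** 2

for res in (336, 672, 1024, 1152):
    print(f"{res:>4} px -> {vit_token_count(res):>5} visual tokens")
# 336 px -> 576, 672 px -> 2304, 1024 px -> 5329, 1152 px -> 6724
```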
The Solution: FastViTHD Encoder
At the heart of FastVLM lies FastViTHD, a hierarchical hybrid encoder combining convolution and transformer layers:
- Early convolutional stages provide natural downsampling,
- Later transformer blocks refine high-level features,
- Multi-scale feature fusion improves text-rich image understanding,
- Only input resolution scaling is needed to reach new accuracy–latency points.
 
Unlike approaches that rely on token pruning or complex resamplers, FastViTHD achieves its efficiency through resolution scaling alone.
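
The PyTorch sketch below shows the general pattern the bullets describe: early convolutional stages downsample aggressively, and transformer blocks then operate on the smaller resulting grid of tokens. This is a minimal illustration of the idea, not Apple's FastViTHD implementation; every depth and dimension here is invented for clarity.

```python
# Minimal sketch of a hierarchical hybrid encoder: conv stages downsample early,
# transformer blocks refine the remaining tokens. NOT the actual FastViTHD
# architecture; all depths and dimensions are invented for illustration.
import torch
import torch.nn as nn

class ConvStage(nn.Module):
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),  # 2x downsample
            nn.BatchNorm2d(c_out),
            nn.GELU(),
        )

    def forward(self, x):
        return self.block(x)

class HybridEncoder(nn.Module):
    def __init__(self, dim: int = 512, depth: int = 4):
        super().__init__()
        # Four conv stages -> 16x spatial downsampling before any attention.
        self.conv_stages = nn.Sequential(
            ConvStage(3, 64), ConvStage(64, 128), ConvStage(128, 256), ConvStage(256, dim)
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.conv_stages(images)            # (B, dim, H/16, W/16)
        tokens = feats.flatten(2).transpose(1, 2)   # (B, N, dim) with N = (H/16)*(W/16)
        return self.transformer(tokens)             # fewer, higher-level visual tokens

tokens = HybridEncoder()(torch.randn(1, 3, 256, 256))
print(tokens.shape)  # torch.Size([1, 256, 512])
```

A deeper hierarchy can downsample further before attention, which is what lets this kind of encoder emit far fewer tokens than a patch-based ViT at the same input resolution.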
Results
- ~3.2× faster TTFT under the LLaVA-1.5 training recipe.
- With a 0.5B LLM vs. LLaVA-OneVision@1152²:
  - Comparable or better accuracy,
  - ~85× faster TTFT,
  - ~3.4× smaller visual encoder.
- On larger setups (e.g., Qwen2-7B), FastVLM remains competitive with models like Cambrian-1, while keeping encoder speed advantages.
 
Training Setup
- Stage 1: Projector alignment (LLaVA-558K, 1 epoch).
- Stage 2: Visual instruction tuning (LLaVA-665K, 1 epoch).
- Optional Stage 1.5: Caption-heavy datasets (CC3M/CC12M) for resolution scaling.
- Hardware: Trained on 8× H100-80GB (e.g., Stage 1.5 at 1024² ≈ 77 hours).
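
The same schedule can be restated as a compact config sketch. The field names below are my own, chosen for readability; they do not come from the released FastVLM code.

```python
# Staged training schedule restated as a config dict. Field names are invented
# for readability and do not come from the released FastVLM code.
# Hardware noted above: 8x H100-80GB; Stage 1.5 at 1024^2 took about 77 hours.
TRAINING_STAGES = {
    "stage1": {
        "goal": "projector alignment",
        "data": "LLaVA-558K",
        "epochs": 1,
    },
    "stage1_5": {  # optional, used when scaling input resolution
        "goal": "caption-heavy pretraining for resolution scaling",
        "data": ["CC3M", "CC12M"],
    },
    "stage2": {
        "goal": "visual instruction tuning",
        "data": "LLaVA-665K",
        "epochs": 1,
    },
}
```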
 
Practical Takeaways
- For on-device multimodal AI, static resolution scaling is often more efficient than dynamic tiling.
- Avoid pairing tiny LLMs with ultra-high resolutions: the extra tokens are largely wasted and prefill latency dominates.
- Instead of pruning tokens, train with hierarchical encoders at lower resolutions to produce fewer but more informative tokens.
 
Why It Matters
FastVLM redefines efficiency for VLMs:
- Fewer tokens → faster responses,
- Smaller encoders → cheaper deployment,
- Resolution-only scaling → simpler design.
 
This makes it a strong candidate for document understanding, RAG pipelines, and mobile multimodal applications where latency is critical.