FastVLM: Accelerating Vision-Language Models with Efficient Visual Encoders
2025-09-16
Apple’s FastVLM (CVPR 2025) tackles a key challenge in Vision-Language Models (VLMs): latency.
The work focuses on two main bottlenecks:
- Visual encoder inference time,
- The number of visual tokens passed to the LLM (which directly impacts prefill latency).
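A rough mental model (not from the paper, with purely hypothetical numbers) is that time-to-first-token (TTFT) is the visual encoder's latency plus the LLM prefill time over the visual and text tokens:

```python
def estimated_ttft(encoder_latency_s: float,
                   num_visual_tokens: int,
                   num_text_tokens: int,
                   prefill_tokens_per_s: float) -> float:
    """Back-of-the-envelope TTFT model: encode the image, then prefill the LLM
    over visual + text tokens. All inputs are hypothetical, for intuition only."""
    prefill_s = (num_visual_tokens + num_text_tokens) / prefill_tokens_per_s
    return encoder_latency_s + prefill_s

# Illustrative numbers only: a slow high-resolution ViT vs. a fast hybrid encoder.
print(estimated_ttft(0.60, 5329, 100, 8000))  # ~1.28 s
print(estimated_ttft(0.05, 256, 100, 8000))   # ~0.09 s
```

Shrinking either term helps, but the visual-token count feeds directly into the prefill term, which is why FastVLM attacks both bottlenecks at once.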
The Problem: High Resolution, High Latency
Text-rich images (documents, tables, charts) require higher resolutions for accuracy. However, traditional ViT-based encoders suffer from:
- Exploding token counts at higher input resolutions,
- Longer LLM prefill phases,
- Significant increases in TTFT (Time-To-First-Token).
This creates a tradeoff: higher resolution improves accuracy, but the added delay hurts usability.
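To see why token counts explode, note that a plain ViT emits one token per image patch, so the count grows quadratically with resolution. A quick sketch (patch size 14, as in common CLIP-style encoders; the numbers are illustrative, not measurements from the paper):

```python
def vit_token_count(resolution: int, patch_size: int = 14) -> int:
    """Number of patch tokens a plain ViT emits for a square image."""
    return (resolution // patch_size) ** 2

for res in (336, 672, 1024, 1536):
    print(res, vit_token_count(res))
# 336  ->   576 tokens
# 672  ->  2304 tokens
# 1024 ->  5329 tokens
# 1536 -> 11881 tokens  (quadratic growth, and LLM prefill cost grows with it)
```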
The Solution: FastViTHD Encoder
At the heart of FastVLM lies FastViTHD, a hierarchical hybrid encoder combining convolutional and transformer layers:
- Early convolutional stages provide natural downsampling,
- Later transformer blocks refine high-level features,
- Multi-scale feature fusion improves text-rich image understanding,
- Only input resolution scaling is needed to reach new accuracy–latency points.
Unlike approaches that rely on token pruning or complex resamplers, FastViTHD achieves its efficiency through input resolution scaling alone.
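The paper describes the design at a high level; the toy PyTorch sketch below only illustrates the general pattern (convolutional downsampling stages feeding a small transformer, plus a simple multi-scale fusion), with made-up layer counts and widths. It is not FastViTHD.

```python
import torch
import torch.nn as nn

class ConvStage(nn.Module):
    """Convolutional stage that halves spatial resolution (stride-2 downsampling)."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.GELU(),
        )

    def forward(self, x):
        return self.block(x)

class ToyHybridEncoder(nn.Module):
    """Toy hierarchical hybrid encoder: conv stages downsample aggressively,
    transformer blocks refine the coarse grid, and a mid-level feature map is
    fused in for fine detail. Illustrative only, not FastViTHD."""
    def __init__(self, dim: int = 256, depth: int = 4):
        super().__init__()
        self.stem = ConvStage(3, 64)        # /2
        self.stage1 = ConvStage(64, 128)    # /4
        self.stage2 = ConvStage(128, dim)   # /8
        self.stage3 = ConvStage(dim, dim)   # /16
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=8, dim_feedforward=4 * dim, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        self.fuse = nn.Linear(2 * dim, dim)  # fuse coarse + pooled mid-level features

    def forward(self, images):                                # (B, 3, H, W)
        mid = self.stage2(self.stage1(self.stem(images)))     # (B, dim, H/8, W/8)
        coarse = self.stage3(mid)                             # (B, dim, H/16, W/16)
        tokens = coarse.flatten(2).transpose(1, 2)            # (B, N, dim)
        tokens = self.transformer(tokens)
        # Multi-scale fusion: pool mid-level features onto the coarse grid.
        pooled = nn.functional.adaptive_avg_pool2d(mid, coarse.shape[-2:])
        pooled = pooled.flatten(2).transpose(1, 2)
        return self.fuse(torch.cat([tokens, pooled], dim=-1))  # visual tokens for the LLM

tokens = ToyHybridEncoder()(torch.randn(1, 3, 512, 512))
print(tokens.shape)  # torch.Size([1, 1024, 256]) -> 512/16 = 32, 32*32 = 1024 tokens
```

The key property is that the convolutional stages shrink the spatial grid before any attention is computed, so the transformer (and later the LLM) sees far fewer tokens than a plain ViT would at the same input resolution.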
Results
- ~3.2× faster TTFT under the LLaVA-1.5 training recipe.
- With a 0.5B LLM vs. LLaVA-OneVision@1152²:
  - Comparable or better accuracy,
  - ~85× faster TTFT,
  - ~3.4× smaller visual encoder.
- On larger setups (e.g., Qwen2-7B), FastVLM remains competitive with models like Cambrian-1, while keeping encoder speed advantages.
Training Setup
- Stage 1: Projector alignment (LLaVA-558K, 1 epoch).
- Stage 2: Visual instruction tuning (LLaVA-665K, 1 epoch).
- Optional Stage 1.5: Caption-heavy datasets (CC3M/CC12M) for resolution scaling.
- Hardware: Trained on 8× H100-80GB (e.g., Stage 1.5 at 1024² ≈ 77 hours).
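Summarized as a minimal config sketch: the dataset names and epoch counts come from the list above, while the per-stage trainable modules are assumptions based on the usual LLaVA-style recipe, not the paper's exact settings.

```python
# Sketch of the three-stage recipe described above. The "trainable" fields are
# assumptions (standard LLaVA-style practice), not the paper's exact configuration.
TRAINING_STAGES = {
    "stage_1": {
        "purpose": "projector alignment",
        "dataset": "LLaVA-558K",
        "epochs": 1,
        "trainable": ["projector"],                    # assumed: encoder + LLM frozen
    },
    "stage_1_5": {
        "purpose": "caption-heavy pretraining for resolution scaling (optional)",
        "dataset": "CC3M/CC12M",
        "epochs": 1,                                   # assumed
        "trainable": ["projector", "vision_encoder"],  # assumed
    },
    "stage_2": {
        "purpose": "visual instruction tuning",
        "dataset": "LLaVA-665K",
        "epochs": 1,
        "trainable": ["projector", "llm"],             # assumed: encoder frozen
    },
}

for name, cfg in TRAINING_STAGES.items():
    print(f"{name}: {cfg['purpose']} on {cfg['dataset']} ({cfg['epochs']} epoch)")
```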
Practical Takeaways
- For on-device multimodal AI, static resolution scaling is often more efficient than dynamic tiling (see the token-count sketch after this list).
- Avoid pairing tiny LLMs with ultra-high resolution — the tokens are wasted and latency dominates.
- Instead of pruning tokens, train with hierarchical encoders at lower resolutions to produce fewer but more informative tokens.
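A back-of-the-envelope comparison of the first and last points; the tile count, tile resolution, and the hierarchical encoder's downsampling factor are illustrative assumptions, not the paper's numbers.

```python
def tiled_vit_tokens(num_tiles: int, tile_res: int = 336, patch: int = 14) -> int:
    """Dynamic tiling: each tile goes through a plain ViT, and the tokens add up."""
    return num_tiles * (tile_res // patch) ** 2

def hierarchical_tokens(resolution: int, downsample: int = 64) -> int:
    """Static resolution scaling with an aggressively downsampling hierarchical
    encoder (the downsample factor here is an assumption for illustration)."""
    return (resolution // downsample) ** 2

print(tiled_vit_tokens(num_tiles=5))         # 2880 tokens (e.g., 4 tiles + 1 global view)
print(hierarchical_tokens(resolution=1024))  #  256 tokens for a single 1024x1024 pass
```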
Why It Matters
FastVLM redefines efficiency for VLMs:
- Fewer tokens → faster responses,
- Smaller encoders → cheaper deployment,
- Resolution-only scaling → simpler design.
This makes it a strong candidate for document understanding, RAG pipelines, and mobile multimodal applications where latency is critical.