FastVLM: Accelerating Vision-Language Models with Efficient Visual Encoders

2025-09-16

Apple’s FastVLM (CVPR 2025) tackles a key challenge in Vision-Language Models (VLMs): latency.
The work focuses on two main bottlenecks:

  1. Visual encoder inference time,
  2. The number of visual tokens passed to the LLM, which directly impacts prefill latency (see the sketch below).
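
As a back-of-the-envelope model (my own sketch, not a formula from the paper), TTFT is roughly the vision-encoder forward pass plus the LLM prefill over the full input sequence, so shrinking either term pays off directly:

    # Rough TTFT decomposition: encoder latency + prefill over all tokens.
    # All numbers below are hypothetical placeholders, not measurements.
    def estimate_ttft_ms(encoder_ms: float,
                         num_visual_tokens: int,
                         num_text_tokens: int,
                         prefill_ms_per_token: float) -> float:
        prefill_ms = (num_visual_tokens + num_text_tokens) * prefill_ms_per_token
        return encoder_ms + prefill_ms

    # Fewer visual tokens and a lighter encoder both cut TTFT.
    print(estimate_ttft_ms(120, num_visual_tokens=576, num_text_tokens=64,
                           prefill_ms_per_token=0.5))  # 440.0
    print(estimate_ttft_ms(60, num_visual_tokens=144, num_text_tokens=64,
                           prefill_ms_per_token=0.5))  # 164.0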


The Problem: High Resolution, High Latency

Text-rich images (documents, tables, charts) require higher resolutions for accuracy. However, traditional ViT-based encoders suffer from:

  • Exploding token counts at higher input resolutions,
  • Longer LLM prefill phases,
  • Significant increases in TTFT (Time-To-First-Token).

This creates a tradeoff: higher resolution improves accuracy, but the added delay hurts responsiveness.
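
To make the token-count explosion concrete, here is a quick calculation for a plain ViT (my own illustration; patch size 14 matches the common ViT-L/14 configuration and is not a number taken from the paper):

    # Patch tokens for a square input grow quadratically with resolution.
    def vit_token_count(resolution: int, patch_size: int = 14) -> int:
        return (resolution // patch_size) ** 2

    for res in (336, 672, 1024):
        print(res, vit_token_count(res))  # 576, 2304, 5329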



The Solution: FastViTHD Encoder

At the heart of FastVLM lies FastViTHD, a hierarchical hybrid encoder combining convolutional and transformer layers:

  • Early convolutional stages provide natural downsampling,
  • Later transformer blocks refine high-level features,
  • Multi-scale feature fusion improves text-rich image understanding,
  • Only input resolution scaling is needed to reach new accuracy–latency points.

Unlike token pruning or complex resamplers, FastViTHD gets its efficiency from the encoder design itself; trading accuracy for latency is just a matter of changing the input resolution.
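
To illustrate the hierarchical hybrid idea, here is a minimal PyTorch sketch (a toy model with made-up stage widths and depths; it omits FastViTHD specifics such as multi-scale feature fusion and is not the actual architecture). The point is structural: stride-2 convolutional stages shrink the spatial grid before any attention runs, so the transformer at the end sees far fewer tokens.

    import torch
    import torch.nn as nn

    class HybridEncoderSketch(nn.Module):
        """Toy conv-then-transformer encoder (illustrative only)."""
        def __init__(self, widths=(64, 128, 256, 512, 512), depth=2):
            super().__init__()
            stages, in_ch = [], 3
            for out_ch in widths:
                stages.append(nn.Sequential(
                    nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
                    nn.BatchNorm2d(out_ch),
                    nn.GELU(),
                ))
                in_ch = out_ch
            # Five stride-2 stages downsample the input by 32x overall.
            self.conv_stages = nn.Sequential(*stages)
            layer = nn.TransformerEncoderLayer(d_model=widths[-1], nhead=8,
                                               batch_first=True)
            self.transformer = nn.TransformerEncoder(layer, num_layers=depth)

        def forward(self, x):
            x = self.conv_stages(x)                # (B, C, H/32, W/32)
            tokens = x.flatten(2).transpose(1, 2)  # (B, N, C) token sequence
            return self.transformer(tokens)

    enc = HybridEncoderSketch()
    out = enc(torch.randn(1, 3, 1024, 1024))
    print(out.shape)  # torch.Size([1, 1024, 512]): a 32x32 grid of tokens

For comparison, a plain ViT-L/14 at the same 1024² input would emit over 5,000 patch tokens before any merging; aggressive downsampling inside the encoder is what keeps the token budget small without pruning.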



Results

  • ~3.2× faster TTFT than prior work under the LLaVA-1.5 training recipe.
  • Paired with a 0.5B LLM and compared against LLaVA-OneVision at 1152×1152:
    • Comparable or better accuracy,
    • ~85× faster TTFT,
    • ~3.4× smaller visual encoder.
  • On larger setups (e.g., Qwen2-7B), FastVLM remains competitive with models like Cambrian-1, while keeping encoder speed advantages.


Training Setup

  • Stage 1: Projector alignment (LLaVA-558K, 1 epoch).
  • Stage 2: Visual instruction tuning (LLaVA-665K, 1 epoch).
  • Optional Stage 1.5: Caption-heavy datasets (CC3M/CC12M) for resolution scaling.
  • Hardware: Trained on 8× H100-80GB (e.g., Stage 1.5 at 1024² ≈ 77 hours).
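
Written out as a config, the recipe looks roughly like this (a sketch only: datasets, epochs, and stage order come from the summary above, while the trainable-module lists are assumptions following common LLaVA-style practice and should be checked against the official release):

    # Staged training recipe sketch; "trainable" fields are assumptions.
    TRAINING_STAGES = [
        {"name": "stage1_projector_alignment",
         "data": "LLaVA-558K", "epochs": 1,
         "trainable": ["projector"]},                    # assumption
        {"name": "stage1.5_resolution_scaling",          # optional
         "data": ["CC3M", "CC12M"], "epochs": 1,         # epochs: assumption
         "trainable": ["projector", "vision_encoder"]},  # assumption
        {"name": "stage2_visual_instruction_tuning",
         "data": "LLaVA-665K", "epochs": 1,
         "trainable": ["projector", "llm"]},             # assumption
    ]

    for stage in TRAINING_STAGES:
        print(stage["name"], "->", stage["data"])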


Practical Takeaways

  • For on-device multimodal AI, static resolution scaling is often more efficient than dynamic tiling (see the comparison sketch after this list).
  • Avoid pairing tiny LLMs with ultra-high resolutions; the extra visual tokens are largely wasted and prefill latency dominates.
  • Instead of pruning tokens, train with hierarchical encoders at lower resolutions to produce fewer but more informative tokens.
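
The first takeaway is easy to quantify (again an illustration with assumed patch and downsampling factors, not numbers from the paper): dynamic tiling splits a high-resolution image into several tiles, often plus a thumbnail, multiplying the token budget by the tile count, while static scaling through a heavily downsampling hierarchical encoder keeps a single small grid.

    # Visual token budget: dynamic tiling vs. static resolution scaling.
    # Patch size, tile count, and downsampling factor are illustrative.
    def tiled_tokens(tile_res=336, patch=14, tiles=4, thumbnail=True):
        per_tile = (tile_res // patch) ** 2
        return per_tile * (tiles + (1 if thumbnail else 0))

    def static_tokens(res=1024, downsample=64):
        return (res // downsample) ** 2

    print("dynamic tiling :", tiled_tokens())   # 2880 tokens (4 tiles + thumbnail)
    print("static scaling :", static_tokens())  # 256 tokens at 1024x1024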


Why It Matters

FastVLM redefines efficiency for VLMs:

  • Fewer tokens → faster responses,
  • Smaller encoders → cheaper deployment,
  • Resolution-only scaling → simpler design.

This makes it a strong candidate for document understanding, RAG pipelines, and mobile multimodal applications where latency is critical.


