POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs

CVPR 2026
Haicheng Wang1,2*, Yuan Liu2*†, Yikun Liu1,2*, Zhemeng Yu1, Zhongyin Zhao2, Yangxiu You2, Zilin Yu2, Le Tian2, Xiao Zhou2, Jie Zhou2, Weidi Xie1, Yanfeng Wang1
1School of Artificial Intelligence, Shanghai Jiao Tong University, China
2WeChat AI, Tencent, China
*Core contributor. †Corresponding author.
Teaser

POINTS-Long bridges the gap between human perception and MLLM scalability. Inspired by the human visual system, POINTS-Long introduces two complementary modes—high-fidelity Focus and efficient Standby—to dynamically trade off accuracy and cost for long-form and streaming visual understanding.

Abstract

Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable capabilities in cross-modal understanding and generation. However, the rapid growth of visual token sequences—especially in long-video and streaming scenarios—poses a major challenge to scalability and real-world deployment. We introduce POINTS-Long, a native dual-mode MLLM featuring dynamic visual token scaling inspired by the human visual system. The model supports two complementary perception modes: focus mode and standby mode, enabling users to dynamically trade off efficiency and accuracy during inference. On fine-grained visual tasks, the focus mode retains optimal performance, while on long-form general visual understanding, the standby mode retains 97.7–99.7% of the original accuracy using only 1/40–1/10 of the visual tokens. Moreover, POINTS-Long natively supports streaming visual understanding via a dynamically detachable KV-cache design, allowing efficient maintenance of ultra-long visual memory. Our work provides new insights into the design of future MLLMs and lays the foundation for adaptive and efficient long-form visual understanding.

Key Contributions

  • Native dual-mode perception: a single MLLM that supports Focus (high-fidelity) and Standby (high-compression) visual understanding, enabling an explicit efficiency–accuracy choice at inference.
  • Two-stage post-training: (i) visual distillation/alignment by training only new modules; (ii) LLM mode adaptation with a 2-forward objective to preserve Focus performance while learning Standby.
  • Deployment-friendly design: asymmetric attention masking compatible with modern inference kernels (e.g., FlashAttention) and practical serving frameworks (e.g., SGLang).
  • Streaming-ready memory: detachable KV-cache strategy that retains a high-fidelity local window while migrating compact standby-cache to a long-term memory bank.

At a Glance

  • 1/40 – 1/10: visual tokens (Standby vs. original)
  • 97.7% – 99.7%: performance retained on long-form video benchmarks
  • Up to 6.2×: generation throughput (serving-side)

Numbers are summarized from the paper tables shown below (e.g., OpenCompass video benchmark and real-world speed-up analysis).
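
To make the token ratios concrete, the short sketch below turns the 1/40 and 1/10 figures into absolute token counts. The workload (256 frames at 400 visual tokens per frame) and the helper name `standby_token_budget` are hypothetical values chosen for illustration; they do not come from the paper.

```python
from fractions import Fraction


def standby_token_budget(focus_tokens: int, ratio: Fraction) -> int:
    """Visual tokens kept in Standby mode for a given compression ratio."""
    return max(1, round(focus_tokens * ratio))


# Hypothetical workload (assumption, not from the paper):
# 256 frames, 400 visual tokens per frame in Focus mode.
frames, tokens_per_frame = 256, 400
focus_tokens = frames * tokens_per_frame

for ratio in (Fraction(1, 10), Fraction(1, 40)):
    kept = standby_token_budget(focus_tokens, ratio)
    print(f"ratio {ratio}: {kept} of {focus_tokens} visual tokens ({float(ratio):.1%})")
```

Under these assumed numbers, the Standby budget falls between 2.5% and 10% of the Focus budget, which matches the range quoted for the OpenCompass video benchmark.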

Method

Architecture Overview

POINTS-Long architecture. A dual-path ViT introduces a small set of learnable standby tokens while keeping the original focus pathway unchanged. A two-stage post-training pipeline (visual distillation/alignment, then LLM mode adaptation) equips the model with an efficient Standby mode without harming the Focus mode.
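
The dual-path idea can be illustrated with a toy encoder that returns either every patch feature (Focus) or a handful of summary tokens (Standby). This is a minimal sketch under stated assumptions: the page does not describe how the learnable standby tokens interact with the patch features, so the cross-attention pooling used here is a stand-in, and `cross_attention_pool`, `encode`, and all sizes are hypothetical.

```python
import math
import random


def cross_attention_pool(queries, patches):
    """Each query attends over all patches and returns a softmax-weighted average."""
    dim = len(patches[0])
    pooled = []
    for q in queries:
        scores = [sum(qi * pi for qi, pi in zip(q, p)) / math.sqrt(dim) for p in patches]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        pooled.append(
            [sum(w * p[d] for w, p in zip(weights, patches)) / total for d in range(dim)]
        )
    return pooled


def encode(patches, standby_queries, mode):
    """Focus: pass patch features through unchanged. Standby: K pooled tokens."""
    if mode == "focus":
        return patches
    if mode == "standby":
        return cross_attention_pool(standby_queries, patches)
    raise ValueError(f"unknown mode: {mode!r}")


# Hypothetical sizes: 160 patch features, 4 standby queries, feature dim 8.
rng = random.Random(0)
patches = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(160)]
standby_queries = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]

focus_out = encode(patches, standby_queries, "focus")
standby_out = encode(patches, standby_queries, "standby")
print(len(focus_out), len(standby_out))
```

With these sizes the Standby output is 1/40 the length of the Focus output, and the Focus path is untouched by the added queries, mirroring the claim that the original pathway stays unchanged.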

Streaming Memory

Streaming Inference

Streaming inference with detachable KV cache. POINTS-Long maintains a high-fidelity local window (Focus) while migrating compact standby-cache into a long-term memory bank, enabling ultra-long streaming visual understanding without expensive re-prefills.
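
The bookkeeping behind this scheme can be sketched as a sliding window plus an append-only bank. The code below is a toy illustration, not the released implementation: `FrameCache`, `StreamingMemory`, the window size, and the per-frame entry counts are all hypothetical, and plain Python lists stand in for real KV tensors.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class FrameCache:
    frame_id: int
    focus_kv: list    # stand-in for high-fidelity KV entries
    standby_kv: list  # stand-in for compact KV entries


@dataclass
class StreamingMemory:
    window_size: int
    local_window: deque = field(default_factory=deque)
    memory_bank: list = field(default_factory=list)

    def append(self, frame: FrameCache) -> None:
        """Add a frame; evicted frames keep only their compact standby cache."""
        self.local_window.append(frame)
        while len(self.local_window) > self.window_size:
            old = self.local_window.popleft()
            # Detach the focus cache and migrate the standby cache to the bank.
            self.memory_bank.append((old.frame_id, old.standby_kv))

    def context(self) -> list:
        """Long-term standby entries first, then the recent focus entries."""
        ctx = []
        for _, standby_kv in self.memory_bank:
            ctx.extend(standby_kv)
        for frame in self.local_window:
            ctx.extend(frame.focus_kv)
        return ctx


# Hypothetical stream: 10 frames, 40 focus entries and 1 standby entry each.
memory = StreamingMemory(window_size=3)
for i in range(10):
    memory.append(
        FrameCache(
            frame_id=i,
            focus_kv=[f"frame{i}_focus{j}" for j in range(40)],
            standby_kv=[f"frame{i}_standby"],
        )
    )

print(len(memory.memory_bank), len(memory.local_window), len(memory.context()))
```

In this toy setup the context holds 7 standby entries plus 3 × 40 focus entries, 127 in total, instead of the 400 entries needed to keep every frame at full fidelity. Because both caches are stored per frame when the frame arrives, eviction only drops the focus entries and needs no recomputation.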

Main Results

OpenCompass Video Benchmark

OpenCompass video benchmark. Standby mode retains 97.7–99.7% performance using only 2.5–10% of the original tokens, while focus mode preserves (and slightly improves) the original performance.

More Video Benchmarks
Scalability of Frames
Image Benchmarks
Streaming Understanding
Ablation Study

BibTeX

% not yet available
@inproceedings{wang2026points_long,
        title={POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs},
        author={Wang, Haicheng and Liu, Yuan and Liu, Yikun and Yu, Zhemeng and Zhao, Zhongyin and You, Yangxiu and Yu, Zilin and Tian, Le and Zhou, Xiao and Zhou, Jie and Xie, Weidi and Wang, Yanfeng},
        booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
        year={2026}
}
      

Contact

If you have any questions, feel free to reach out: