POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs

CVPR 2026

Haicheng Wang¹,²,*, Yuan Liu²,*,†, Yikun Liu¹,²,*, Zhemeng Yu¹, Zhongyin Zhao², Yangxiu You², Zilin Yu², Le Tian², Xiao Zhou², Jie Zhou², Weidi Xie¹, Yanfeng Wang¹,†

¹School of Artificial Intelligence, Shanghai Jiao Tong University, China
²WeChat AI, Tencent, China
*Core contributor. †Corresponding author.

Abstract

Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable capabilities in cross-modal understanding and generation. However, the rapid growth of visual token sequences—especially in long-video and streaming scenarios—poses a major challenge to scalability and real-world deployment. We introduce POINTS-Long, a native dual-mode MLLM featuring dynamic visual token scaling inspired by the human visual system. The model supports two complementary perception modes: focus mode and standby mode, enabling users to dynamically trade off efficiency and accuracy during inference. On fine-grained visual tasks, the focus mode retains optimal performance, while on long-form general visual understanding, the standby mode retains 97.7–99.7% of the original accuracy using only 1/40–1/10 of the visual tokens. Moreover, POINTS-Long natively supports streaming visual understanding via a dynamically detachable KV-cache design, allowing efficient maintenance of ultra-long visual memory. Our work provides new insights into the design of future MLLMs and lays the foundation for adaptive and efficient long-form visual understanding.

Key Contributions

Native dual-mode perception: a single MLLM that supports Focus (high-fidelity) and Standby (high-compression) visual understanding, enabling an explicit efficiency–accuracy choice at inference.
Two-stage post-training: (i) visual distillation/alignment that trains only the newly added modules; (ii) LLM mode adaptation with a two-forward objective that preserves Focus performance while learning Standby.
Deployment-friendly design: asymmetric attention masking compatible with modern inference kernels (e.g., FlashAttention) and practical serving frameworks (e.g., SGLang).
Streaming-ready memory: a detachable KV-cache strategy that retains a high-fidelity local window while migrating the compact standby cache to a long-term memory bank.
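The two-forward objective in the second training stage can be illustrated with a toy sketch. This is one plausible reading, not the paper's formulation: it assumes the same sample is passed through the model once per mode and the two losses are summed. The `model(sample, mode)` callable, the `standby_weight` parameter, and the equal default weighting are all hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def two_forward_loss(model, sample, standby_weight=1.0):
    """Run one sample through both modes and combine the two losses.

    `model(sample, mode)` is a hypothetical callable returning logits.
    The weighting scheme is an assumption, not a value from the paper.
    """
    focus_loss = cross_entropy(model(sample, "focus"), sample["target"])
    standby_loss = cross_entropy(model(sample, "standby"), sample["target"])
    total = focus_loss + standby_weight * standby_loss
    return total, focus_loss, standby_loss


def toy_model(sample, mode):
    """Stand-in model: slightly less confident when it sees fewer tokens."""
    return [2.0, 0.0] if mode == "focus" else [1.0, 0.0]


total, focus_loss, standby_loss = two_forward_loss(toy_model, {"target": 0})
print(f"focus={focus_loss:.4f} standby={standby_loss:.4f} total={total:.4f}")
```

Keeping the Focus term in the objective is what would stop Standby training from eroding high-fidelity performance; a real implementation would compute both terms with the same LLM weights and backpropagate through their sum.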

At a Glance

1/40 – 1/10: visual tokens (Standby vs. original)
97.7% – 99.7%: performance retained on long-form video benchmarks
Up to 6.2×: generation throughput (serving-side)

Numbers are summarized from the paper tables shown below (e.g., OpenCompass video benchmark and real-world speed-up analysis).
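To make the token ratios concrete, the short sketch below converts the stated 1/40 to 1/10 range into absolute visual token counts. The frame count and tokens-per-frame figures are hypothetical inputs chosen for illustration; only the two ratios come from this page.

```python
import math
from fractions import Fraction

# Compression range stated on this page (Standby vs. original tokens).
STANDBY_RATIOS = (Fraction(1, 40), Fraction(1, 10))


def standby_tokens(focus_tokens: int, ratio: Fraction) -> int:
    """Visual tokens remaining after compressing by `ratio` (rounded up)."""
    return math.ceil(focus_tokens * ratio)


def token_budget(num_frames: int, tokens_per_frame: int) -> dict:
    """Compare the Focus token count with the Standby range."""
    focus = num_frames * tokens_per_frame
    low, high = (standby_tokens(focus, r) for r in STANDBY_RATIOS)
    return {"focus": focus, "standby_min": low, "standby_max": high}


# Hypothetical video: 256 frames at 400 visual tokens per frame.
print(token_budget(256, 400))
# Ratios as percentages: 1/40 = 2.5%, 1/10 = 10%.
print([float(r * 100) for r in STANDBY_RATIOS])
```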

BibTeX

Provisional entry (official citation not yet available):

@inproceedings{wang2026points_long,
  title={POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs},
  author={Wang, Haicheng and Liu, Yuan and Liu, Yikun and Yu, Zhemeng and Zhao, Zhongyin and You, Yangxiu and Yu, Zilin and Tian, Le and Zhou, Xiao and Zhou, Jie and Xie, Weidi and Wang, Yanfeng},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}

Contact

POINTS-Long bridges the gap between human perception and MLLM scalability. Inspired by the human visual system, POINTS-Long introduces two complementary modes—high-fidelity Focus and efficient Standby—to dynamically trade off accuracy and cost for long-form and streaming visual understanding.

Method

Streaming Memory

Streaming inference with a detachable KV cache. POINTS-Long maintains a high-fidelity local window (Focus) while migrating the compact standby cache into a long-term memory bank, enabling ultra-long streaming visual understanding without expensive re-prefills.
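The local-window and memory-bank bookkeeping described above can be sketched as a small data structure. This is a minimal illustration, not the released implementation: it assumes each incoming chunk already carries both a Focus and a Standby cache, and it uses plain lists as stand-ins for KV tensors. `ChunkCache`, `StreamingMemory`, and `make_chunk` are hypothetical names.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ChunkCache:
    chunk_id: int
    focus_kv: list    # stand-in for high-fidelity KV entries
    standby_kv: list  # stand-in for compact KV entries


@dataclass
class StreamingMemory:
    window_size: int  # number of recent chunks kept at Focus fidelity
    local_window: deque = field(default_factory=deque)
    memory_bank: list = field(default_factory=list)

    def add_chunk(self, chunk: ChunkCache) -> None:
        """Append a chunk; evict old chunks to the bank as Standby cache."""
        self.local_window.append(chunk)
        while len(self.local_window) > self.window_size:
            old = self.local_window.popleft()
            # Detach the Focus cache and keep only the compact Standby one.
            self.memory_bank.append((old.chunk_id, old.standby_kv))

    def context(self) -> list:
        """KV entries visible to the next decoding step, oldest first."""
        kv = []
        for _, standby_kv in self.memory_bank:
            kv.extend(standby_kv)
        for chunk in self.local_window:
            kv.extend(chunk.focus_kv)
        return kv


def make_chunk(chunk_id: int, focus_len: int = 40, ratio: int = 10) -> ChunkCache:
    """Build a toy chunk whose Standby cache is `ratio` times smaller."""
    focus = [(chunk_id, "focus", t) for t in range(focus_len)]
    standby = [(chunk_id, "standby", t) for t in range(focus_len // ratio)]
    return ChunkCache(chunk_id, focus, standby)


memory = StreamingMemory(window_size=2)
for i in range(5):
    memory.add_chunk(make_chunk(i))
print(len(memory.memory_bank), len(memory.local_window), len(memory.context()))
```

Because the Standby cache exists before eviction, moving a chunk out of the window in this sketch is a bookkeeping step rather than a new prefill, which is the property the description above attributes to the detachable design.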

Main Results

OpenCompass video benchmark. Standby mode retains 97.7–99.7% of the original performance using only 2.5–10% (1/40–1/10) of the original visual tokens, while Focus mode preserves, and slightly improves on, the original performance.
