CVPR 2026 ViSCALE Workshop · Denver, CO

Rethinking Dense Optical Flow
without Test-Time Scaling


A framework that estimates dense optical flow in a single forward pass by reusing pretrained visual semantic and geometric priors.

Authors: Praroop Chanda, Suryansh Kumar

Visual and Spatial AI Lab · VCCM Section · College of Performance, Visualization & Fine Arts · Department of Electrical & Computer Engineering · Department of Computer Science & Engineering · Texas A&M University

[Figure] Qualitative comparison on Sintel Final. Our method preserves sharp motion boundaries without iterative refinement.

Core Idea

Is scaling test-time computation the only way to improve dense optical flow accuracy?

Answer

No. We show that strong foundation priors can substitute for expensive iterative refinement in dense optical flow.

Approach

Fuse frozen DINO-v2 visual features with monocular depth foundation cues, then perform global matching without recurrent updates.

Recent progress in dense optical flow has been driven by increasingly complex architectures and multi-step refinement for test-time scaling. While these approaches achieve strong benchmark performance, they also require substantial computation during inference. This raises a fundamental question: Is scaling test-time computation the only way to improve dense optical flow accuracy? We argue that it is not. Instead, the powerful visual semantic and geometric priors encoded in modern foundation models can reduce, if not eliminate, the need for computationally expensive iterative refinement at test time. Our method extracts visual semantic features from a frozen DINO-v2 backbone and combines them with geometric cues from a monocular depth foundation model. We fuse these complementary priors into a unified representation and apply a global matching formulation to estimate dense correspondences without recurrent updates or test-time optimization. Despite avoiding iterative refinement, our approach achieves strong cross-dataset generalization across challenging benchmarks.

Method Overview

Our method rethinks dense optical flow as a representation-driven correspondence problem rather than a refinement-heavy prediction task. Given two consecutive frames, we first extract semantic visual features from a frozen DINO-v2 backbone and geometric cues from a monocular depth foundation model. These two sources of information are complementary: DINO-v2 provides robust appearance and semantic consistency, while the depth prior contributes boundary-aware structural cues around occlusions, motion discontinuities, and thin objects. We project and fuse these features into a unified representation, which is then used for global correspondence matching. Instead of repeatedly refining flow estimates at test time, our framework predicts dense optical flow in a single forward pass, showing that strong foundation priors can reduce the need for computationally expensive test-time scaling. In summary, we replace task-specific feature learning with frozen foundation representations and a lightweight fusion pipeline, sketched below.
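A minimal PyTorch sketch of the feature-extraction and fusion stage, to make the pipeline concrete. The PriorFusion module, its dimensions, and the dino_features helper are illustrative assumptions rather than the paper's exact architecture; the geometric features f_geo are assumed to come from a Depth Anything V2 encoder, whose loading is elided here.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Frozen visual prior: DINO-v2 ViT-B/14 loaded via torch.hub.
dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()
for p in dino.parameters():
    p.requires_grad = False

def dino_features(img):
    """img: (B, 3, H, W), H and W divisible by 14. Returns (B, C, H/14, W/14)."""
    B, _, H, W = img.shape
    tokens = dino.forward_features(img)["x_norm_patchtokens"]  # (B, N, C)
    return tokens.transpose(1, 2).reshape(B, -1, H // 14, W // 14)

class PriorFusion(nn.Module):
    """Hypothetical fusion head: project semantic (DINO-v2) and geometric
    (depth-encoder) features to a shared width, then fuse with light convs."""
    def __init__(self, dino_dim=768, depth_dim=256, out_dim=256):
        super().__init__()
        self.proj_sem = nn.Conv2d(dino_dim, out_dim, 1)
        self.proj_geo = nn.Conv2d(depth_dim, out_dim, 1)
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * out_dim, out_dim, 3, padding=1), nn.GELU(),
            nn.Conv2d(out_dim, out_dim, 3, padding=1),
        )

    def forward(self, f_sem, f_geo):
        # Resample the depth features to the DINO-v2 grid before fusing.
        f_geo = F.interpolate(f_geo, size=f_sem.shape[-2:],
                              mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([self.proj_sem(f_sem),
                                    self.proj_geo(f_geo)], dim=1))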

1. Visual Priors

DINO-v2 provides semantically rich and spatially coherent visual features learned without motion supervision.

2. Geometric Priors

Depth Anything V2 provides boundary-aware scene structure, useful near occlusions, motion discontinuities, and thin structures.

3. Global Matching

Fused representations are matched globally to estimate dense correspondences without recurrent refinement; a minimal sketch of this step follows below.
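To make step 3 concrete, here is one way such a global matching step can be written, following the softmax-matching formulation popularized by GMFlow. This is a sketch of the general technique, not the paper's exact implementation.

import torch

def global_matching_flow(feat0, feat1):
    """feat0, feat1: (B, C, H, W) fused features of the two frames.
    Returns flow at feature resolution, (B, 2, H, W)."""
    B, C, H, W = feat0.shape
    f0 = feat0.flatten(2).transpose(1, 2)                    # (B, HW, C)
    f1 = feat1.flatten(2).transpose(1, 2)                    # (B, HW, C)
    corr = torch.matmul(f0, f1.transpose(1, 2)) / C ** 0.5   # (B, HW, HW)
    prob = corr.softmax(dim=-1)                              # match distribution

    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack([xs, ys], -1).reshape(-1, 2).to(feat0)  # (HW, 2) as (x, y)

    # Flow = expected matched coordinate minus the pixel's own coordinate.
    matched = torch.matmul(prob, grid)                       # (B, HW, 2)
    return (matched - grid).transpose(1, 2).reshape(B, 2, H, W)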


[Figure] Method overview.

Results

The model achieves strong cross-dataset generalization while avoiding iterative test-time refinement.

Sintel Final EPE: 2.81 (without refinement)
Refinement steps: 0 (single forward pass)
Things Val EPE: 3.02 (cross-dataset setting)
Method                # Refine  Things Val EPE  Sintel Clean EPE  Sintel Final EPE
RAFT                  32        4.25            1.43              2.71
GMFlow                0         3.48            1.50              2.96
SEA-RAFT (S)          4         –               1.27              4.32
FlowSeek (T)          4         3.94            1.16              2.48
Ours (no refinement)  0         3.02            1.46              2.81

Cross-dataset generalization after training on Chairs and Things. Lower is better.
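For reference, the end-point error (EPE) reported above is the mean Euclidean distance between predicted and ground-truth flow vectors; a minimal implementation:

import torch

def epe(flow_pred, flow_gt):
    """flow_pred, flow_gt: (B, 2, H, W). Returns mean end-point error."""
    return torch.norm(flow_pred - flow_gt, dim=1).mean()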

Citation

If you find this work useful, please cite the paper.

@inproceedings{chanda2026rethinking,
  title     = {Rethinking Dense Optical Flow without Test-Time Scaling},
  author    = {Chanda, Praroop and Kumar, Suryansh},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  year      = {2026},
  note      = {ViSCALE Workshop, Denver},
  url       = {https://arxiv.org/abs/2605.08000}
}