Test-Time Domain Adaptation for Interactive Video Generation

Ishaan Singh Rawal    Suryansh Kumar
Visual and Spatial AI Lab
College of PVFA, Department of ECEN, and Department of CSCE
Texas A&M University, College Station, Texas USA
IEEE/CVF CVPR 2026, VGBE Workshop
Teaser image

Abstract

Text-conditioned diffusion models have emerged as powerful tools for video synthesis, yet enabling Interactive Video Generation (IVG), where users explicitly control object trajectories, remains challenging. While recent training-free approaches use attention masking for guidance, they often trade perceptual quality for control. In this work, we identify the root causes of this degradation as two distinct domain shifts: (1) internal covariate shift induced by applying masks to pretrained models, and (2) an initialization gap, where random noise is not aligned with the trajectory conditions. We propose a test-time domain adaptation framework to resolve these shifts. To this end, we first introduce Mask Normalization, a pre-normalization layer that mitigates (1), the covariate shift, via feature-distribution alignment. Next, to bridge (2), the initialization gap, we introduce a Temporal Intrinsic Prior that enforces spatio-temporal consistency during denoising. Extensive evaluations on popular datasets demonstrate that our approach outperforms state-of-the-art IVG methods in both perceptual quality and trajectory adherence.
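The two ideas in the abstract can be illustrated with a minimal, self-contained sketch. Note this is an assumption-laden illustration, not the paper's implementation: the function names (`mask_normalize`, `temporally_correlated_noise`), the NumPy setting, and the simple mixing rule for the prior are all hypothetical; the actual layers operate inside a pretrained diffusion model's attention blocks.

```python
import numpy as np

def mask_normalize(features, mask, eps=1e-6):
    """Hypothetical sketch of mask normalization: re-align the statistics of
    the masked (kept) tokens to those of the full feature map, so masking
    does not shift the distribution the pretrained model expects.
    features: (N, C) token features; mask: (N,) binary array, 1 = kept."""
    kept = mask.astype(bool)
    # statistics of the full (unmasked) feature distribution
    mu_full, sd_full = features.mean(0), features.std(0) + eps
    # statistics of the masked subset, which exhibit covariate shift
    mu_mask = features[kept].mean(0)
    sd_mask = features[kept].std(0) + eps
    out = features.copy()
    # standardize masked tokens, then map them onto the full statistics
    out[kept] = (features[kept] - mu_mask) / sd_mask * sd_full + mu_full
    return out

def temporally_correlated_noise(num_frames, shape, alpha=0.5, seed=None):
    """Hypothetical sketch of a temporally consistent noise initialization:
    each frame's latent mixes a shared base noise with an independent
    per-frame residual, so initial latents are correlated across time
    instead of fully independent. alpha is the shared fraction; the mixture
    keeps unit variance for any alpha in [0, 1]."""
    rng = np.random.default_rng(seed)
    base = rng.standard_normal(shape)                      # shared across frames
    per_frame = rng.standard_normal((num_frames, *shape))  # per-frame residual
    return np.sqrt(alpha) * base + np.sqrt(1.0 - alpha) * per_frame
```

In this toy version, the masked tokens exactly recover the full-map mean and (approximately) its scale after normalization, and adjacent frames of the initial noise have correlation roughly `alpha` rather than zero.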

Project Overview

Comparison Results

Spider

Text Prompt: A spider descending on its web from a branch
Bounding-box trajectory:
motion mask
Peekaboo
Trailblazer
Ours

Citation

@inproceedings{rawal2026testtime,
  title={Test-Time Domain Adaptation for Interactive Video Generation},
  author={Rawal, Ishaan Singh and Kumar, Suryansh},
  booktitle={Proceedings of the IEEE/CVF CVPR Workshops},
  year={2026}
}

Contact

For questions or collaborations, please contact Ishaan Singh Rawal or Suryansh Kumar.