Text-conditioned diffusion models have emerged as powerful tools for video synthesis, yet enabling Interactive Video Generation (IVG), where users explicitly control object trajectories, remains challenging. While recent training-free approaches utilize attention masking for guidance, they often trade off perceptual quality for control. In this work, we identify the root causes of this degradation as two distinct domain shifts: (1) an internal covariate shift induced by applying masks to pretrained models, and (2) an initialization gap, where random noise lacks alignment with trajectory conditions. We propose a test-time domain adaptation framework to resolve both shifts. To this end, we first introduce Mask Normalization, a pre-normalization layer that mitigates the covariate shift (1) via feature distribution alignment. Next, we introduce a Temporal Intrinsic Prior that enforces spatio-temporal consistency during denoising, bridging the initialization gap and thus addressing (2). Extensive evaluations on popular benchmarks demonstrate that our approach outperforms state-of-the-art IVG methods in both perceptual quality and trajectory adherence.
For questions or collaborations, please contact Ishaan Singh Rawal or Suryansh Kumar.