Text-conditioned diffusion models have emerged as powerful tools for high-quality video generation. However, enabling Interactive Video Generation (IVG), where users control motion elements such as object trajectory, remains challenging. Recent training-free approaches introduce attention masking to guide trajectories, but this often degrades perceptual quality. We identify two key failure modes in these methods, both of which we interpret as domain shift problems, and propose solutions inspired by domain adaptation. First, we attribute the perceptual degradation to internal covariate shift induced by attention masking, as pretrained models are not trained to handle masked attention. To address this, we propose mask normalization, a pre-normalization layer designed to mitigate this shift via distribution matching. Second, we address the initialization gap, where the randomly sampled initial noise does not align with the IVG conditioning, by introducing a temporal intrinsic diffusion prior that enforces spatio-temporal consistency at each denoising step. Extensive qualitative and quantitative evaluations demonstrate that mask normalization and the temporal intrinsic diffusion prior improve both perceptual quality and trajectory control over existing state-of-the-art IVG techniques.
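To make the first idea concrete, here is a minimal sketch of one plausible reading of mask normalization: masked attention changes the statistics of the attention output relative to what the pretrained model saw during training, so the masked output is renormalized to match the unmasked output's per-token statistics. The function name and the exact matching rule (mean/std over the feature dimension) are our assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def masked_attention_with_mask_norm(q, k, v, mask, eps=1e-6):
    """Hypothetical sketch of mask normalization for trajectory-guided attention.

    q, k, v: (batch, heads, tokens, dim) query/key/value tensors.
    mask:    additive attention bias, broadcastable to (batch, heads, tokens,
             tokens); large negative values suppress attention outside the
             user-specified trajectory region.
    """
    scale = q.shape[-1] ** -0.5
    logits = torch.matmul(q, k.transpose(-2, -1)) * scale

    # Reference statistics from the *unmasked* attention output, i.e. the
    # distribution the pretrained model was actually trained on. (Computing
    # this doubles the attention cost; it is kept explicit for clarity.)
    ref = torch.matmul(F.softmax(logits, dim=-1), v)
    mu_ref = ref.mean(dim=-1, keepdim=True)
    sigma_ref = ref.std(dim=-1, keepdim=True)

    # Masked attention shifts the output distribution -- the internal
    # covariate shift the abstract attributes the quality drop to.
    out = torch.matmul(F.softmax(logits + mask, dim=-1), v)
    mu = out.mean(dim=-1, keepdim=True)
    sigma = out.std(dim=-1, keepdim=True)

    # Distribution matching: renormalize the masked output to the
    # reference statistics before it enters the rest of the network.
    return (out - mu) / (sigma + eps) * sigma_ref + mu_ref
```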
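For the second idea, one way to picture the initialization gap is that fully independent per-frame Gaussian noise carries no spatio-temporal structure for the conditioning to latch onto. The sketch below mixes a shared base noise with per-frame residuals at initialization, and softly pulls each frame's latent toward its temporal neighbors during denoising. Both helpers, their names, and the mixing weights are illustrative assumptions; the paper's temporal intrinsic diffusion prior may be formulated quite differently.

```python
import torch

def temporally_correlated_noise(frames, channels, h, w, alpha=0.5, generator=None):
    """Hypothetical initialization: blend a base noise shared across frames
    with per-frame residuals so the initial latent is temporally coherent."""
    base = torch.randn(1, channels, h, w, generator=generator)        # shared component
    resid = torch.randn(frames, channels, h, w, generator=generator)  # per-frame component
    # Mixing the standard deviations this way keeps the result unit-variance,
    # so it remains a valid starting point for the diffusion sampler.
    return alpha ** 0.5 * base + (1.0 - alpha) ** 0.5 * resid

def enforce_temporal_consistency(latents, beta=0.1):
    """Illustrative per-step regularizer: nudge each frame latent toward the
    average of its neighbors. `latents` has shape (frames, C, H, W); the
    circular roll at the clip boundary is a simplification for brevity."""
    smoothed = 0.5 * (latents.roll(1, dims=0) + latents.roll(-1, dims=0))
    return (1.0 - beta) * latents + beta * smoothed
```

In this reading, the first helper would replace the i.i.d. noise draw at the start of sampling, and the second would be applied to the intermediate latents at each denoising step, matching the abstract's "spatio-temporal consistency at each denoising step" at a sketch level.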
For questions or collaborations, please contact Ishaan Rawal or Suryansh Kumar.