TL;DR: FoundationMotion is an automated data-curation pipeline for constructing large-scale motion-understanding video datasets. Models trained on these datasets show strong improvements in motion understanding.
Our dataset contains raw videos, processed videos, captions, and question-answer pairs.
Question
Which hand is the robot using to flip the toast?
✓ Answer: Right hand
FoundationMotion Model: NVILA-15B fine-tuned on the dataset produced by our proposed FoundationMotion automated data-collection pipeline.
Question
What is the primary driving behavior demonstrated by the ego vehicle in the video?
✓ Answer: The ego vehicle changes lanes at night to avoid a car with hazard lights ahead
FoundationMotion Model: NVILA-15B fine-tuned on the dataset produced by our proposed FoundationMotion automated data-collection pipeline.
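The examples above each pair a video clip with a motion-focused question and answer. As a rough illustration of how such records might be organized for fine-tuning, here is a minimal sketch; the field names, file paths, and structure are assumptions for illustration, not the dataset's actual schema:

```python
from dataclasses import dataclass

@dataclass
class MotionQARecord:
    # Hypothetical schema; the released dataset's actual format may differ.
    video_path: str  # processed video clip
    caption: str     # motion-focused caption from the auto-labeling pipeline
    question: str    # question about spatial movement in the clip
    answer: str      # ground-truth answer

# Example record mirroring the first demo above (path is illustrative).
record = MotionQARecord(
    video_path="clips/robot_toast.mp4",
    caption="A robot flips a slice of toast with its right hand.",
    question="Which hand is the robot using to flip the toast?",
    answer="Right hand",
)
print(record.answer)  # → Right hand
```

A flat record like this maps directly onto the instruction-tuning format most video-language models (including NVILA) consume: the video plus question form the prompt, and the answer is the supervision target.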
@misc{gan2025foundationmotion,
  title={FoundationMotion: Auto-Labeling and Reasoning about Spatial Movement in Videos},
  author={Yulu Gan and Ligeng Zhu and Dandan Shan and Baifeng Shi and Hongxu Yin and Boris Ivanovic and Song Han and Trevor Darrell and Jitendra Malik and Marco Pavone and Boyi Li},
  year={2025},
  eprint={2512.10927},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.10927},
}