TL;DR: FoundationMotion is an automated data-curation pipeline for constructing large-scale motion-understanding video datasets. Models trained on these datasets show strong improvements in motion understanding.
Our dataset contains raw videos, processed videos, captions, and question-answer pairs.
Question
Which hand is the robot using to flip the toast?
✓ Answer: Right hand
FoundationMotion Model: NVILA-15B fine-tuned on the dataset produced by our proposed FoundationMotion automated data-collection pipeline.
Question
What is the primary driving behavior demonstrated by the ego vehicle in the video?
✓ Answer: The ego vehicle changes lanes at night to avoid a car with hazard lights ahead
FoundationMotion Model: NVILA-15B fine-tuned on the dataset produced by our proposed FoundationMotion automated data-collection pipeline.
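The examples above each pair a video clip with a motion-focused question and answer. As a rough illustration of how such records might be organized for fine-tuning, here is a minimal sketch; the field names, file paths, and structure are assumptions for illustration, not the dataset's actual schema:

```python
from dataclasses import dataclass

@dataclass
class MotionQARecord:
    # Hypothetical schema; the released dataset's actual format may differ.
    video_path: str  # processed video clip
    caption: str     # motion-focused caption from the auto-labeling pipeline
    question: str    # question about spatial movement in the clip
    answer: str      # ground-truth answer

# Example record mirroring the first demo above (path is illustrative).
record = MotionQARecord(
    video_path="clips/robot_toast.mp4",
    caption="A robot flips a slice of toast with its right hand.",
    question="Which hand is the robot using to flip the toast?",
    answer="Right hand",
)
print(record.answer)  # → Right hand
```

A flat record like this maps directly onto the instruction-tuning format most video-language models (including NVILA) consume: the video plus question form the prompt, and the answer is the supervision target.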
@misc{gan2025foundationmotion,
  title={FoundationMotion: Auto-Labeling and Reasoning about Spatial Movement in Videos},
  author={Yulu Gan and Ligeng Zhu and Dandan Shan and Baifeng Shi and Hongxu Yin and Boris Ivanovic and Song Han and Trevor Darrell and Jitendra Malik and Marco Pavone and Boyi Li},
  year={2025},
  eprint={2512.10927},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.10927},
}