Ego4D-HCap Dataset

Ego4D-HCap is a hierarchical video captioning dataset containing captions for long-range videos across three temporal granularities: short-range clip captions focusing on specific human actions, medium-length segment descriptions focusing on specific human actions, and long-range video summaries depicting the overall intent and goals of the video. The dataset is built on Ego4D videos.

Ego4D videos

Download the Ego4D videos using the following steps.

Get License Agreement and download the Ego4D videos.
Use crop_and_resize.sh to crop and chunk the videos to the smaller side of 288 pixels and chunk length of 5 minutes.
(Note: This step is required for faster I/O. You can also evaluate the pretrained models without this step.)

Ego4D-HCap captions

You can download the Ego4D-HCap captions from this link. The following are the descriptions of the caption annotations.

1. Clip Caption

clips_train.pkl and clips_val.pkl are the training and validation files for short-range clip captions. Each file contains a list of tuples of (vid, start_sec, end_sec, clip_caption).

vid :   video_id
start_sec:  start second of the clip
end_sec:    end second of the clip
clip_caption:  ground-tuth clip caption

2. Segment Description

segments_train.pkl and segments_val.pkl are the training and the validation files for medium-range segment descriptions. segments_train_pseudo.pkl is the LLM-generated pseudo-annotations for segments.
Each file contains a list of dictionaries containing following fields:

'vid' :   video_id
'start_sec' :  start second of the segment
'end_sec' :    end second of the segment
'captions_pred' :  predicted clip captions by Video ReCap for the particular segment
'captions_gt' :    ground-truth captions of the particular segment
'segment_description' :    ground-truth segment description

3. Video Summary

videos_train.pkl and videos_val.pkl are the training and the validation files for long-range video summaries. videos_train_pseudo.pkl is the LLM-generated pseudo-annotations for segments.
Each file contains a list of dictionaries containing following fields:

'vid' :   video_id
'start_sec' :  start second of the segment
'end_sec' :    end second of the segment
'segment_descriptions' :  ground-truth segment descriptions of the particular video
'segment_descriptions_pred' :    predicted segment descriptions by Video ReCap for the particular video
'video_summary' :    ground-truth video_summary

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Files

datasets.md

datasets.md

Ego4D-HCap Dataset

Ego4D videos

Ego4D-HCap captions

1. Clip Caption

2. Segment Description

3. Video Summary

Files

datasets.md

Latest commit

History

datasets.md

File metadata and controls

Ego4D-HCap Dataset

Ego4D videos

Ego4D-HCap captions

1. Clip Caption

2. Segment Description

3. Video Summary