
X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization, CVPR 2024

paper, video


Datasets

Epic-Kitchens (we used RGB frames)

Ego4D (we used the FHO subset)


Hand crops (this step can be skipped to run the basic version):

Epic:

  • Download hand crops for Epic-Kitchens from the following repo

  • We preprocess the provided crops by taking the union of all visible hands and the objects in contact with the hands. We keep the default parameters of the respective library.

  • Put the crops in a pickle with the format: dict[segment_id][frame_idx] = (left, top, right, bottom), as shown in the sketch below.
    (Otherwise, the library takes too long if used without pre-extraction and preprocessing.)

  • Save the file under the name: hand_thd0.8_obj_thd0.01_ONLY_inter_obj_with_HANDS_v2

Download hand-crop detections for Ego4D here and apply similar preprocessing: https://github.com/Chuhanxx/helping_hand_for_egocentric_videos
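
A minimal sketch of building such a pickle (the segment id, frame indices, and box values are hypothetical; only the dict[segment_id][frame_idx] = (left, top, right, bottom) layout is prescribed above):

import pickle

# crops[segment_id][frame_idx] = (left, top, right, bottom) in pixel coordinates,
# where each box is the union of all visible hands and the objects in contact with hands
crops = {
    "P01_101_0": {              # hypothetical segment id
        0: (12, 34, 256, 198),
        1: (10, 30, 250, 200),
    },
}

with open("hand_thd0.8_obj_thd0.01_ONLY_inter_obj_with_HANDS_v2", "wb") as f:
    pickle.dump(crops, f)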


Splits:

All splits of shared and unique (novel) noun and verb classes are in the anno/ folder.


Prerequisites

  • Follow CoOp to install the prerequisites. However, skip the installation of Dassl, as its modified version is already integrated into the framework and the requirements will be installed in the next step.
  • Go to the Dassl folder and run:
cd x-mic/Dassl.pytorch

# Install dependencies
pip install -r requirements.txt

# Install this library (no need to re-build if the source code is modified)
python setup.py develop
  • [In case of no internet connection during training] In general, the CLIP model will be downloaded automatically. However, if you do not have an internet connection during training, download CLIP ViT-B/16 manually and set its path as the default value of the "root" parameter of the _download function in 'x-mic/clip/clip' (a minimal sketch follows).
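
A minimal sketch of that edit, assuming _download keeps the signature of the original CLIP codebase (the path below is a placeholder):

# in x-mic/clip/clip: point the default download location to your local copy
def _download(url: str, root: str = "/path/to/manually/downloaded/clip"):
    ...  # the body of the function stays unchanged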

Extract features for faster training and evaluation

This step can also be skipped.

Full frames

Epic config: extract_EPIC_clip_vitb16_segments.yaml

To change:

DATASET.ROOT - where your dataset is located with the structure DATASET.ROOT/annotations, DATASET.ROOT/epic_kitchens_videos_256ss

and OUTPUT_DIR
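
For example, the fields to edit in extract_EPIC_clip_vitb16_segments.yaml might look like this (all paths are placeholders):

DATASET:
  ROOT: /data/epic            # expects /data/epic/annotations and /data/epic/epic_kitchens_videos_256ss
OUTPUT_DIR: /data/epic/features/clip_vitb16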

Ego config: extract_EGO4D_clip_vitb16.yaml

To change:

DATA.PATH_TO_DATA_DIR - path to annotations

DATA.PATH_PREFIX - path to videos

DATASET.ROOT - path to videos (same as DATA.PATH_PREFIX)

and OUTPUT_DIR
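
Analogously, a sketch of the fields to edit in extract_EGO4D_clip_vitb16.yaml (all paths are placeholders):

DATA:
  PATH_TO_DATA_DIR: /data/ego4d/annotations
  PATH_PREFIX: /data/ego4d/videos
DATASET:
  ROOT: /data/ego4d/videos    # same as DATA.PATH_PREFIX
OUTPUT_DIR: /data/ego4d/features/clip_vitb16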


Hand Crops:

Epic config: extract_EPIC_clip_vitb16_segments_handcrops.yaml

Same as for full frames, plus:

DATASET.DETECTION_ROOT - path to hand crop annotations

Ego4d config: extract_EGO4D_clip_vitb16_handcrops.yaml
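
In both cases, the only addition over the corresponding full-frame config is the detection path (placeholder shown):

DATASET:
  DETECTION_ROOT: /data/hand_crops   # folder with the preprocessed hand-crop annotations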


Run the scripts:

To run the script on a subset distributed over 8 GPUs:

export OMP_NUM_THREADS=64; export NCCL_ASYNC_ERROR_HANDLING=1; torchrun --standalone --nproc_per_node=8 --nnodes 1 feat_extractor_segments_distributed.py --config_name XX --split YY --distributed --seed 42

To run the script on a subset on a single GPU:

python feat_extractor_segments.py --config_name XX --split YY --div 0

XX - config name without the ".yaml" extension and folder

YY - train or validation
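
For example, to extract Epic-Kitchens training features on a single GPU with the Epic config above:

python feat_extractor_segments.py --config_name extract_EPIC_clip_vitb16_segments --split train --div 0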

Similarly, features can be extracted with DINO and LaViLa models.


Run Training and Eval

Config params:

DATA.PATH_TO_DATA_DIR - Ego4D dataset annotations location

DATA.PATH_PREFIX - Ego4D features that will be classified with the adapted classifier - best results with hand-cropped frames

DATA.PATH_PREFIX_DINO - Ego4D features that will be adapted - best results with hand-cropped frames

DATA.PATH_PREFIX_DINO2 - Ego4D features that will be adapted. These and the previous features will be combined in the adaptation module - best results with full frames

DATALOADER.FEATURES_NAME - Epic features that will be classified with the adapted classifier - best results with hand-cropped frames

DATALOADER.FEATURES_NAME_DINO - Epic features that will be adapted - best results with hand-cropped frames

DATALOADER.FEATURES_NAME_DINO2 - Epic features that will be adapted. These and the previous features will be combined in the adaptation module - best results with full frames

Note that all these features can be the same. If you use the model without hand crops, set DATALOADER.USE_DINO_FEATURES2 = False.

Set the dimensionality of the conditioning features in DATALOADER.DINO_DIM if it differs from 512.

If only one dataset is available, disable cross-dataset evaluation by setting TEST.CROSS_DATASET.EVAL = False.
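
Putting the options above together, a sketch of the feature-path block of a training config (all paths are placeholders; the directory names are hypothetical and only indicate which features go where):

DATA:
  PATH_TO_DATA_DIR: /data/ego4d/annotations
  PATH_PREFIX: /data/ego4d/features/handcrops        # Ego4D features to classify
  PATH_PREFIX_DINO: /data/ego4d/features/handcrops   # Ego4D features to adapt
  PATH_PREFIX_DINO2: /data/ego4d/features/fullframes # combined with the previous ones in the adaptation module
DATALOADER:
  FEATURES_NAME: /data/epic/features/handcrops
  FEATURES_NAME_DINO: /data/epic/features/handcrops
  FEATURES_NAME_DINO2: /data/epic/features/fullframes
  USE_DINO_FEATURES2: True   # set to False to run without hand crops
  DINO_DIM: 512              # dimensionality of the conditioning features
TEST:
  CROSS_DATASET:
    EVAL: True               # set to False if only one dataset is available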

Run the scripts

To train X-MIC, use the config XMIC_vitb16.yaml.

Set up the data or feature paths for one or two datasets.

XX - name of the config file located in the scripts/configs folder

With a single GPU:

Epic nouns:

sh scripts/baselines/epic_gpu1.sh noun XX

Epic verbs:

sh scripts/baselines/epic_gpu1.sh verb XX

Ego4d nouns:

sh scripts/baselines/ego_gpu1.sh noun XX

Ego4d verbs:

sh scripts/baselines/ego_gpu1.sh verb XX

With 8 GPUs:

Epic nouns:

sh scripts/baselines/epic_gpu8.sh noun XX

Epic verbs:

sh scripts/baselines/epic_gpu8.sh verb XX

Ego4d nouns:

sh scripts/baselines/ego_gpu8.sh noun XX

Ego4d verbs:

sh scripts/baselines/ego_gpu8.sh verb XX
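
For example, to train and evaluate on Epic nouns with a single GPU using the provided X-MIC config (assuming, as for feature extraction, that the config name is passed without the ".yaml" extension):

sh scripts/baselines/epic_gpu1.sh noun XMIC_vitb16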


Important Note

Unfortunately, after my internship, all models and data were deleted due to internal refactoring. As a result, I lost all the pretrained models and parts of the code, and I could not perform a final verification of the code.

Feel free to connect with me via email in case of any questions.

I sincerely apologise for any inconvenience this may cause.


Citation

If you use our work, please consider citing:


@inproceedings{kukleva2024xmic,
  title={X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization},
  author={Kukleva, Anna and Sener, Fadime and Remelli, Edoardo and Tekin, Bugra and Sauser, Eric and Schiele, Bernt and Ma, Shugao},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}


Acknowledgements

The code is based on the CoOp and MaPLe repos.
