This repo collects research on video analysis, especially multimodal learning for video analysis. I summarize some papers and categorize them myself. Pull requests are welcome!
I focus on work related to multimodal learning; topics such as action recognition are outside the main scope of this repo.
- Tutorial
- Dataset
- Tool
- Video Classification
- Multimodal for Video Analysis
- Video Moment Localization
- Video Retrieval
- Video Advertisement
- Visual Commonsense Reasoning
- Video Highlight
- Object Tracking
- Audio-Visual Dialog
- Audio-visual paper list [GitHub]
- CVPR 2019: Multi-Modal Learning from Videos [Project Page]
- awesome-multimodal-ml: Reading list for research topics in multimodal machine learning [GitHub]
- A Comprehensive Study of Deep Video Action Recognition [Paper]
I found a very interesting website:
A sortable and searchable compilation of video datasets [Video Dataset Overview]
- AVA dataset: AVA is a project that provides audiovisual annotations of video for improving our understanding of human activity. [Project]
- PyVideoResearch: A repository of common methods, datasets, and tasks for video research [GitHub]
- How2 Dataset: How2: A Large-scale Dataset for Multimodal Language Understanding [Paper] [GitHub]
- Moments in Time Dataset: A large-scale dataset for recognizing and understanding action in videos [Dataset] [Pretrained Model]
- Pretrained image and video models for PyTorch [GitHub]
- YouTube-8M, new segment task! [Blog]
- X-Temporal is an open source video understanding codebase from the SenseTime X-Lab group that provides state-of-the-art video classification models [GitHub]
- facebookresearch/ClassyVision: An end-to-end PyTorch framework for image and video classification [GitHub]
- MediaPipe is a cross-platform framework for building multimodal applied machine learning pipelines [GitHub]
- A collection of utilities created for Detection and Classification of Acoustic Scenes and Events (DCASE) [GitHub]
- Easy to use video deep features extractor [GitHub]
- Video Platform for Action Recognition and Object Detection in PyTorch [GitHub]
- FAIR Self-Supervised Learning Integrated Multi-modal Environment (SSLIME) [GitHub]
- Learnable pooling with Context Gating for video classification [Paper] [GitHub]
- TSM: Temporal Shift Module for Efficient Video Understanding [Paper] [GitHub]
- Long-Term Feature Banks for Detailed Video Understanding (CVPR2019) [Paper][GitHub]
- Deep Learning for Video Classification and Captioning [Paper]
- Large-scale Video Classification with Convolutional Neural Networks [Paper]
- Learning Spatiotemporal Features with 3D Convolutional Networks [Paper]
- Two-Stream Convolutional Networks for Action Recognition in Videos [Paper]
- Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors [Paper]
- Non-local Neural Networks [Paper] [GitHub]
- Wang, Xiaolong, Ross Girshick, Abhinav Gupta, and Kaiming He. (CVPR 2018)
- Summary:
- Learning Correspondence from the Cycle-consistency of Time [Paper] [GitHub]
- Xiaolong Wang, Allan Jabri, and Alexei A. Efros (CVPR 2019)
- Summary:
- 3D ConvNets in PyTorch [GitHub]
- Awesome list for multimodal learning [GitHub]
- VideoBERT: A Joint Model for Video and Language Representation Learning [Paper]
- AENet: Learning Deep Audio Features for Video Analysis [Paper] [GitHub]
- Look, Listen and Learn [Paper]
- Objects that Sound [Paper]
- Learning to Separate Object Sounds by Watching Unlabeled Video [Paper]
- Gao, Ruohan, Rogerio Feris, and Kristen Grauman. arXiv preprint arXiv:1804.01665 (2018)
- Ambient Sound Provides Supervision for Visual Learning [Paper]
- Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. ECCV 2016
- Summary: unsupervised learning
- Learning Cross-Modal Temporal Representations from Unlabeled Videos [Google Blog]
- Use What You Have: Video retrieval using representations from collaborative experts [GitHub]
- HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips [Project Website]
- Miech, Antoine, et al. arXiv:1906.03327 (2019)
- Learning a Text-Video Embedding from Incomplete and Heterogeneous Data [Paper] [GitHub]
- Miech, Antoine, Ivan Laptev, and Josef Sivic. ECCV 2018
- Summary: combines information from multiple modalities, computes a similarity for each one, and weights the resulting similarities.
- Cross-Modal and Hierarchical Modeling of Video and Text [Paper]
- B. Zhang*, H. Hu*, F. Sha. ECCV 2018
- Summary: learns the intrinsic hierarchical structures of both videos and texts (pulls video and text closer together, videos closer to each other, and texts closer to each other).
- A Dataset for Movie Description [Paper]
- Rohrbach, Anna, Marcus Rohrbach, Niket Tandon, and Bernt Schiele. CVPR 2015
- Summary: dataset paper
- Web-scale Multimedia Search for Internet Video Content [Thesis]
- Lu Jiang
- Summary: amazing thesis
- Automatic understanding of image and video advertisements [Paper] [Project]
- Hussain, Zaeem, Mingda Zhang, Xiaozhong Zhang, Keren Ye, Christopher Thomas, Zuha Agha, Nathan Ong, and Adriana Kovashka. CVPR 2017
- Summary: Image and video advertisement datasets and baselines.
- Multimodal Representation of Advertisements Using Segment-level Autoencoders [Paper] [GitHub]
- Somandepalli, Krishna, Victor Martinez, Naveen Kumar, and Shrikanth Narayanan. ICMI 2018
- Summary: uses video and audio features to predict whether a video is funny.
- Story Understanding in Video Advertisements [Paper] [GitHub]
- Keren Ye, Kyle Buettner, Adriana Kovashka BMVC 2018
- Summary: combines multiple features, including climax and audio, to analyze video ads.
- ADVISE: Symbolism and External Knowledge for Decoding Advertisements [Paper] [GitHub]
- Keren Ye and Adriana Kovashka. (ECCV 2018)
- Summary: action-reason statements for advertisements. Several pre-trained models (SSD, DenseCap, and GloVe) are used as prior knowledge.
- From Recognition to Cognition: Visual Commonsense Reasoning [Paper] [Project Website]
- Rowan Zellers, Yonatan Bisk, Ali Farhadi, Yejin Choi (CVPR2019)
- Summary: first dataset paper. Uses BERT and fastrcnn as the baseline.
- Video Highlight Prediction Using Audience Chat Reactions
- Fu, Cheng-Yang, Joon Lee, Mohit Bansal, and Alexander C. Berg. (EMNLP 2017)
- SenseTime's research platform for single object tracking research, implementing algorithms like SiamRPN and SiamMask. [GitHub]
- Audio-Visual Scene-Aware Dialog [GitHub]
- Alamri, Huda, Vincent Cartillier, Abhishek Das, Jue Wang, Stefan Lee, Peter Anderson, Irfan Essa et al.
- arXiv preprint arXiv:1901.09107 (2019)