Awesome Deep Learning for Video Analysis

This repo collects research on video analysis, especially multimodal learning for video analysis. I summarize papers and categorize them myself. Pull requests are welcome!

I focus on work related to multimodal learning; topics such as action recognition are not the main scope of this repo.

Tutorial

Audio-visual paper list [GitHub]
CVPR 2019: Multi-Modal Learning from Videos [Project Page]
awesome-multimodal-ml: Reading list for research topics in multimodal machine learning [GitHub]
A Comprehensive Study of Deep Video Action Recognition [Paper]

Dataset:

I found a very interesting website:

Sortable and searchable compilation of video datasets [Video Dataset Overview]

AVA dataset: AVA is a project that provides audiovisual annotations of video for improving our understanding of human activity. [Project]
PyVideoResearch: A repository of common methods, datasets, and tasks for video research [GitHub]
How2 Dataset: How2: A Large-scale Dataset for Multimodal Language Understanding [Paper] [GitHub]
Moments in Time Dataset: A large-scale dataset for recognizing and understanding action in videos [Dataset] [Pretrained Model]
Pretrained image and video models for PyTorch [GitHub]
YouTube-8M, new segment task! [Blog]

Tool

X-Temporal is an open-source video understanding codebase from the SenseTime X-Lab group that provides state-of-the-art video classification models [GitHub]
facebookresearch/ClassyVision: An end-to-end PyTorch framework for image and video classification [GitHub]
MediaPipe is a cross-platform framework for building multimodal applied machine learning pipelines [GitHub]
A collection of utilities created for Detection and Classification of Acoustic Scenes and Events (DCASE) [GitHub]
Easy-to-use deep feature extractor for video [GitHub]
Video Platform for Action Recognition and Object Detection in PyTorch [GitHub]
FAIR Self-Supervised Learning Integrated Multi-modal Environment (SSLIME) [GitHub]

Paper:

Video Classification (Spatiotemporal Features)

Learnable pooling with Context Gating for video classification [Paper] [GitHub]
TSM: Temporal Shift Module for Efficient Video Understanding [Paper] [GitHub]
Long-Term Feature Banks for Detailed Video Understanding (CVPR2019) [Paper][GitHub]
Deep Learning for Video Classification and Captioning [Paper]
Large-scale Video Classification with Convolutional Neural Networks [Paper]
Learning Spatiotemporal Features with 3D Convolutional Networks [Paper]
Two-Stream Convolutional Networks for Action Recognition in Videos [Paper]
Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors [Paper]
Non-local neural networks [Paper] [GitHub]
- Wang, Xiaolong, Ross Girshick, Abhinav Gupta, and Kaiming He. (CVPR 2018)
- Summary:
Learning Correspondence from the Cycle-consistency of Time [Paper] [GitHub]
- Xiaolong Wang and Allan Jabri and Alexei A. Efros (CVPR2019)
- Summary:
3D ConvNets in PyTorch [GitHub]

Multimodal for Video Analysis

Awesome list for multimodal learning [GitHub]
VideoBERT: A Joint Model for Video and Language Representation Learning [Paper]
AENet: Learning Deep Audio Features for Video Analysis [Paper] [GitHub]
Look, Listen and Learn [Paper]
Objects that Sound [Paper]
Learning to Separate Object Sounds by Watching Unlabeled Video [Paper]
- Gao, Ruohan, Rogerio Feris, and Kristen Grauman. arXiv preprint arXiv:1804.01665 2018
Ambient Sound Provides Supervision for Visual Learning [Paper]
- Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. ECCV 2016
- Summary: unsupervised learning
Learning Cross-Modal Temporal Representations from Unlabeled Videos [Google Blog]

Video Moment Localization

Localizing Moments in Video with Natural Language [Paper][GitHub]

Video Retrieval

Use What You Have: Video retrieval using representations from collaborative experts [GitHub]
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips [Project Website]
- Miech, Antoine, et al. (arXiv:1906.03327 (2019))
Learning a Text-Video Embedding from Incomplete and Heterogeneous Data [Paper][GitHub]
- Miech, Antoine, Ivan Laptev, and Josef Sivic. ECCV 2018
- Summary: combines information from multiple modalities by computing a similarity for each one and weighting the different similarities
Cross-Modal and Hierarchical Modeling of Video and Text [Paper]
- B. Zhang*, H. Hu*, F. Sha. ECCV 2018
- Summary: learns the intrinsic hierarchical structures of both videos and texts (brings video and text closer to each other, videos closer to videos, and text closer to text)
A dataset for movie description. [Paper]
- Rohrbach, Anna, Marcus Rohrbach, Niket Tandon, and Bernt Schiele. CVPR 2015
- Summary: dataset paper
Web-scale Multimedia Search for Internet Video Content. [Thesis]
- Lu Jiang
- Summary: amazing thesis

Video Advertisement (also includes some image advertisement papers)

Automatic understanding of image and video advertisements [Paper] [Project]
- Hussain, Zaeem, Mingda Zhang, Xiaozhong Zhang, Keren Ye, Christopher Thomas, Zuha Agha, Nathan Ong, and Adriana Kovashka. CVPR 2017
- Summary: Image and video advertisement datasets and baselines.
Multimodal Representation of Advertisements Using Segment-level Autoencoders [Paper] [GitHub]
- Somandepalli, Krishna, Victor Martinez, Naveen Kumar, and Shrikanth Narayanan. ICMI 2018
- Summary: uses video and audio features to predict whether a video is funny or not.
Story Understanding in Video Advertisements. [Paper] [GitHub]
- Keren Ye, Kyle Buettner, Adriana Kovashka. BMVC 2018
- Summary: combines multiple features, including climax and audio, to analyze video ads.
ADVISE: Symbolism and External Knowledge for Decoding Advertisements. [Paper] [GitHub]
- Keren Ye and Adriana Kovashka. (ECCV2018)
- Summary: action-reason statements for advertisements. Several pre-trained models (SSD, DenseCap, and GloVe) serve as prior knowledge.

Visual Commonsense Reasoning

From Recognition to Cognition: Visual Commonsense Reasoning [Paper] [Project Website]
- Rowan Zellers, Yonatan Bisk, Ali Farhadi, Yejin Choi (CVPR2019)
- Summary: First dataset paper. Use BERT and fastrcnn as the baseline

Video Highlight Prediction

Video highlight prediction using audience chat reactions
- Fu, Cheng-Yang, Joon Lee, Mohit Bansal, and Alexander C. Berg. (EMNLP 2017)

Object Tracking

SenseTime's research platform for single object tracking research, implementing algorithms like SiamRPN and SiamMask. [GitHub]

Audio-Visual Dialog

Audio-Visual Scene-Aware Dialog [GitHub]
- Alamri, Huda, Vincent Cartillier, Abhishek Das, Jue Wang, Stefan Lee, Peter Anderson, Irfan Essa et al.
- arXiv preprint arXiv:1901.09107 (2019)

ashishpatel26/Awsome-Deep-Learning-for-Video-Analysis
