Awesome Deep Learning for Video Analysis

This repo collects research on video analysis, especially multimodal learning for video analysis. I summarize papers and categorize them myself. You are kindly invited to open pull requests!

I focus mainly on multimodal learning; research areas such as action recognition are not the main scope of this repo.

Contents

Video

Tutorial

  • Audio-visual paper list [GitHub]
  • CVPR 2019: Multi-Modal Learning from Videos [Project Page]
  • awesome-multimodal-ml: Reading list for research topics in multimodal machine learning [GitHub]
  • A Comprehensive Study of Deep Video Action Recognition [Paper]

Dataset

A very interesting website: a sortable and searchable compilation of video datasets [Video Dataset Overview]

  • AVA dataset: AVA is a project that provides audiovisual annotations of video for improving our understanding of human activity. [Project]
  • PyVideoResearch: A repository of common methods, datasets, and tasks for video research [GitHub]
  • How2 Dataset: How2: A Large-scale Dataset for Multimodal Language Understanding [Paper] [GitHub]
  • Moments in Time Dataset: A large-scale dataset for recognizing and understanding action in videos [Dataset] [Pretrained Model]
  • Pretrained image and video models for Pytorch [GitHub]
  • YouTube-8M, with a new segment task! [Blog]

Tool

  • X-Temporal: an open-source video understanding codebase from the SenseTime X-Lab group that provides state-of-the-art video classification models [GitHub]
  • facebookresearch/ClassyVision: An end-to-end PyTorch framework for image and video classification [GitHub]
  • MediaPipe is a cross-platform framework for building multimodal applied machine learning pipelines [GitHub]
  • A collection of utilities for the Detection and Classification of Acoustic Scenes and Events (DCASE) [GitHub]
  • Easy-to-use video deep feature extractor [GitHub] (see the frame-feature sketch after this list)
  • Video Platform for Action Recognition and Object Detection in Pytorch [GitHub]
  • FAIR Self-Supervised Learning Integrated Multi-modal Environment (SSLIME) [GitHub]
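
For quick experiments in the spirit of the feature-extractor tools above, a common baseline is to run a pretrained 2D CNN over sampled frames. A minimal sketch assuming torchvision; this is not the API of any tool linked above, and frame loading is left out:

```python
import torch
from torchvision import models, transforms

# Pretrained ResNet-50 with the classification head removed -> 2048-d per-frame features.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_frame_features(frames):
    """frames: a list of PIL images sampled from a video.
    Returns a (num_frames, 2048) tensor of per-frame features."""
    batch = torch.stack([preprocess(f) for f in frames])
    return backbone(batch)
```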

Papers

Video Classification (Spatiotemporal Features)

  • Learnable pooling with Context Gating for video classification [Paper] [GitHub]
  • TSM: Temporal Shift Module for Efficient Video Understanding [Paper] [GitHub] (see the shift sketch after this list)
  • Long-Term Feature Banks for Detailed Video Understanding (CVPR 2019) [Paper] [GitHub]
  • Deep Learning for Video Classification and Captioning [Paper]
  • Large-scale Video Classification with Convolutional Neural Networks [Paper]
  • Learning Spatiotemporal Features with 3D Convolutional Networks [Paper]
  • Two-Stream Convolutional Networks for Action Recognition in Videos [Paper]
  • Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors [Paper]
  • Non-local neural networks [Paper] [GitHub]
    • Wang, Xiaolong, Ross Girshick, Abhinav Gupta, and Kaiming He. (CVPR 2018)
    • Summary: self-attention-style non-local blocks that capture long-range dependencies by aggregating features over all spatiotemporal positions
  • Learning Correspondence from the Cycle-consistency of Time [Paper] [GitHub]
    • Xiaolong Wang, Allan Jabri, and Alexei A. Efros (CVPR 2019)
    • Summary: self-supervised correspondence learning that tracks patches backward and then forward in time and penalizes cycle inconsistency
  • 3D ConvNets in Pytorch [GitHub]
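
The core of TSM (linked above) is a parameter-free shift of a fraction of the channels along the time axis. A minimal sketch of that operation, assuming a (batch, time, channel, height, width) tensor; the linked repo is the reference implementation:

```python
import torch

def temporal_shift(x, shift_div=8):
    """Shift 1/shift_div of the channels one step forward in time,
    another 1/shift_div one step backward, and keep the rest in place."""
    b, t, c, h, w = x.shape
    fold = c // shift_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                   # forward shift in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]   # backward shift in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # remaining channels unshifted
    return out
```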

Multimodal Learning for Video Analysis

  • Awesome list for multimodal learning [GitHub]
  • VideoBERT: A Joint Model for Video and Language Representation Learning [Paper]
  • AENet: Learning Deep Audio Features for Video Analysis [Paper] [GitHub]
  • Look, Listen and Learn [Paper] (see the audio-visual correspondence sketch after this list)
  • Objects that Sound [Paper]
  • Learning to Separate Object Sounds by Watching Unlabeled Video [Paper]
    • Gao, Ruohan, Rogerio Feris, and Kristen Grauman. arXiv preprint arXiv:1804.01665, 2018
  • Ambient Sound Provides Supervision for Visual Learning [Paper]
    • Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. ECCV 2016
    • Summary: uses ambient sound as a supervisory signal for unsupervised visual learning
  • Learning Cross-Modal Temporal Representations from Unlabeled Videos [Google Blog]
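
Several of the papers above ("Look, Listen and Learn", "Objects that Sound") train on audio-visual correspondence: predicting whether a frame and an audio clip come from the same video. A minimal sketch of that setup; the tiny encoders and dimensions here are placeholder assumptions, not the papers' architectures:

```python
import torch
import torch.nn as nn

class AVCNet(nn.Module):
    """Binary audio-visual correspondence classifier (sketch)."""
    def __init__(self, dim=128):
        super().__init__()
        self.vision = nn.Sequential(  # toy image encoder
            nn.Conv2d(3, 64, kernel_size=7, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))
        self.audio = nn.Sequential(   # toy spectrogram encoder
            nn.Conv2d(1, 64, kernel_size=7, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))
        self.head = nn.Linear(2 * dim, 2)  # correspond / do not correspond

    def forward(self, frame, spectrogram):
        v = self.vision(frame)        # (B, dim)
        a = self.audio(spectrogram)   # (B, dim)
        return self.head(torch.cat([v, a], dim=1))

# Positives: frame and audio from the same video; negatives: mismatched pairs.
```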

Video Moment Localization

Video Retrieval

  • Use What You Have: Video retrieval using representations from collaborative experts [GitHub]
  • HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips [Project Website]
    • Miech, Antoine, et al. arXiv:1906.03327 (2019)
  • Learning a Text-Video Embedding from Incomplete and Heterogeneous Data [Paper] [GitHub]
    • Miech, Antoine, Ivan Laptev, and Josef Sivic. ECCV 2018
    • Summary: combines multiple modalities, computes a per-modality similarity, and learns to weight the similarities (see the sketch after this list)
  • Cross-Modal and Hierarchical Modeling of Video and Text [Paper]
    • B. Zhang*, H. Hu*, F. Sha. ECCV 2018
    • Summary: learns the intrinsic hierarchical structures of both videos and texts (pulls matched videos and texts closer, and also similar videos and similar texts closer to each other)
  • A Dataset for Movie Description [Paper]
    • Rohrbach, Anna, Marcus Rohrbach, Niket Tandon, and Bernt Schiele. CVPR 2015
    • Summary: dataset paper
  • Web-scale Multimedia Search for Internet Video Content. [Thesis]
    • Lu Jiang
    • Summary: an excellent, comprehensive thesis on web-scale video search
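
The weighted multi-modal similarity idea in Miech et al. (above) can be sketched as one similarity per modality "expert", combined with text-conditioned weights. The projection modules here are assumptions for illustration, not the paper's exact architecture:

```python
import torch
import torch.nn.functional as F

def weighted_similarity(text_emb, video_experts, text_proj, weight_proj):
    """text_emb: (B, D) text features.
    video_experts: dict of per-modality video features, e.g. {"rgb": (B, D_m), ...}.
    text_proj / weight_proj: dicts of nn.Linear modules, one per modality (assumed)."""
    sims, logits = [], []
    for name, feat in video_experts.items():
        t = F.normalize(text_proj[name](text_emb), dim=-1)  # text embedding for this expert
        v = F.normalize(feat, dim=-1)
        sims.append((t * v).sum(dim=-1))                    # cosine similarity per expert
        logits.append(weight_proj[name](text_emb).squeeze(-1))
    sims = torch.stack(sims, dim=-1)                        # (B, num_experts)
    weights = torch.softmax(torch.stack(logits, dim=-1), dim=-1)
    return (weights * sims).sum(dim=-1)                     # final weighted similarity
```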

Video Advertisement (also includes some image advertisement papers)

  • Automatic understanding of image and video advertisements [Paper] [Project]
    • Hussain, Zaeem, Mingda Zhang, Xiaozhong Zhang, Keren Ye, Christopher Thomas, Zuha Agha, Nathan Ong, and Adriana Kovashka. CVPR 2017
    • Summary: Image and video advertisement datasets and baselines.
  • Multimodal Representation of Advertisements Using Segment-level Autoencoders [Paper] [GitHub]
    • Somandepalli, Krishna, Victor Martinez, Naveen Kumar, and Shrikanth Narayanan. ICMI 2018
    • Summary: uses video and audio features to predict whether an ad is funny
  • Story Understanding in Video Advertisements. [Paper] [GitHub]
    • Keren Ye, Kyle Buettner, Adriana Kovashka BMVC 2018
    • Summary: combines multiple cues, including climax and audio, to analyze video ads
  • ADVISE: Symbolism and External Knowledge for Decoding Advertisements. [Paper] [GitHub]
    • Keren Ye and Adriana Kovashka. ECCV 2018
    • Summary: predicts action-reason statements for advertisements, using several pre-trained models (SSD, DenseCap, GloVe) as prior knowledge

Visual Commonsense Reasoning

  • From Recognition to Cognition: Visual Commonsense Reasoning [Paper] [Project Website]
    • Rowan Zellers, Yonatan Bisk, Ali Farhadi, Yejin Choi (CVPR 2019)
    • Summary: the first dataset paper for this task; uses BERT and Fast R-CNN as baselines

Video Highlight Prediction

  • Video highlight prediction using audience chat reactions
    • Fu, Cheng-Yang, Joon Lee, Mohit Bansal, and Alexander C. Berg. (EMNLP 2017)

Object Tracking

  • SenseTime's research platform for single object tracking research, implementing algorithms like SiamRPN and SiamMask. [GitHub]

Audio-Visual Dialog

  • Audio-Visual Scene-Aware Dialog [GitHub]
    • Alamri, Huda, Vincent Cartillier, Abhishek Das, Jue Wang, Stefan Lee, Peter Anderson, Irfan Essa, et al. arXiv preprint arXiv:1901.09107 (2019)
