Zero-to-Hero: ViT🚀

I have tried to cover all the bases for understanding and implementing Vision Transformers (ViT) and their evolution into Video Vision Transformers (ViViT). The main focus is on modeling spatio-temporal relations with vision transformers.


1. Vision Transformer (ViT) Fundamentals:

Surveys and Overviews:

Key Papers:

  • An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: Paper | Code
  • Training data-efficient image transformers & distillation through attention (DeiT): Paper | Code
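The core move in the ViT paper above is to split an image into fixed-size patches and linearly embed each one, exactly as the title "An Image is Worth 16x16 Words" suggests. Below is a minimal PyTorch sketch of that patch embedding; the class and parameter names are illustrative, not taken from this repo's code.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into patches and linearly embed each patch.

    A Conv2d with kernel_size == stride == patch_size is equivalent to
    slicing the image into non-overlapping patches and applying one
    shared linear projection to each.
    """
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        # Learnable [CLS] token and position embeddings, as in the ViT paper.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.proj(x)                     # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)     # (B, 196, 768) -- one token per patch
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)       # prepend [CLS] -> (B, 197, 768)
        return x + self.pos_embed

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 768])
```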

Concepts and Tutorials:

  • "Attention Is All You Need": Paper
  • "The Illustrated Transformers": Blog Post
  • "Vision Transformer Explained" Blog Post

2. Convolutional ViT and Hybrid Models:

  • CvT: Introducing Convolutions to Vision Transformers: Paper | Code
  • CoAtNet: Marrying Convolution and Attention for All Data Sizes: Paper
  • ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases: Paper | Code
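A common thread in these hybrids is reintroducing convolutional inductive bias into tokenization. CvT, for example, replaces ViT's non-overlapping patch projection with an overlapping strided convolution. A hedged sketch of that idea follows; the kernel/stride/dimension choices are illustrative, not CvT's exact configuration.

```python
import torch
import torch.nn as nn

class ConvTokenEmbedding(nn.Module):
    """Overlapping convolutional tokenization in the spirit of CvT.

    Unlike ViT's stride == kernel_size projection, kernel 7 / stride 4 /
    padding 2 makes neighboring tokens share pixels, adding a local
    (convolutional) inductive bias before any attention is applied.
    """
    def __init__(self, in_chans=3, embed_dim=64):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=7, stride=4, padding=2)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                 # x: (B, 3, H, W)
        x = self.proj(x)                  # (B, 64, H/4, W/4), overlapping patches
        B, C, H, W = x.shape
        x = x.flatten(2).transpose(1, 2)  # (B, H*W/16, 64) token sequence
        return self.norm(x), (H, W)       # keep (H, W) to fold tokens back later

tokens, (h, w) = ConvTokenEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape, h, w)  # torch.Size([2, 3136, 64]) 56 56
```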

3. Efficient Transformers and Swin Transformer:

  • Swin Transformer: Hierarchical Vision Transformer using Shifted Windows: Paper | Code
  • Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions: Paper | Code
  • Efficient Transformers: A Survey: Paper
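Swin's efficiency comes from computing self-attention inside small non-overlapping windows (and shifting the window grid between layers so information crosses window borders) instead of attending over all tokens at once. A minimal sketch of the window partitioning step; the attention masking Swin applies after shifting is omitted here.

```python
import torch

def window_partition(x, window_size):
    """Split a (B, H, W, C) feature map into (num_windows*B, ws*ws, C)
    groups of tokens; attention is then computed within each window only."""
    B, H, W, C = x.shape
    ws = window_size
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return x.view(-1, ws * ws, C)

# A shifted window layer is just a roll of the feature map before partitioning
# (Swin's post-shift attention mask is omitted in this sketch).
x = torch.randn(2, 56, 56, 96)                         # Swin-T stage-1 shape
shifted = torch.roll(x, shifts=(-3, -3), dims=(1, 2))  # shift by ws // 2
windows = window_partition(shifted, window_size=7)
print(windows.shape)  # torch.Size([128, 49, 96]) -- 64 windows x 2 images
```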

4. Space-Time Attention and Video Transformers:

  • TimeSformer: Is Space-Time Attention All You Need for Video Understanding? Paper | Code
  • Space-Time Mixing Attention for Video Transformer: Paper
  • MViT: Multiscale Vision Transformers: Paper | Code
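TimeSformer's "divided" space-time attention factorizes joint spatio-temporal attention into a temporal pass (each patch attends to the same patch position across frames) followed by a spatial pass (each patch attends within its own frame), cutting cost from O((N·T)²) to O(N·T² + T·N²). A sketch of that factorization using nn.MultiheadAttention; the CLS-token handling and layer norms from the paper are omitted, and all names are illustrative.

```python
import torch
import torch.nn as nn

class DividedSpaceTimeAttention(nn.Module):
    """Divided space-time attention in the spirit of TimeSformer:
    attend over time first, then over space, instead of jointly."""
    def __init__(self, dim=192, heads=3):
        super().__init__()
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, N, D) -- T frames, N patch tokens per frame
        B, T, N, D = x.shape

        # Temporal attention: each spatial position attends across frames.
        t = x.permute(0, 2, 1, 3).reshape(B * N, T, D)
        t, _ = self.temporal(t, t, t)
        x = x + t.reshape(B, N, T, D).permute(0, 2, 1, 3)

        # Spatial attention: each frame's tokens attend within that frame.
        s = x.reshape(B * T, N, D)
        s, _ = self.spatial(s, s, s)
        return x + s.reshape(B, T, N, D)

out = DividedSpaceTimeAttention()(torch.randn(2, 8, 196, 192))
print(out.shape)  # torch.Size([2, 8, 196, 192])
```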

5. Video Vision Transformer (ViViT):

  • ViViT: A Video Vision Transformer: Paper | Code
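ViViT extends ViT's 2D patch embedding to video by extracting spatio-temporal "tubelets" with a 3D convolution, so each token summarizes a small clip volume rather than a single-frame patch. A minimal sketch; the 2x16x16 tubelet size is illustrative.

```python
import torch
import torch.nn as nn

class TubeletEmbedding(nn.Module):
    """ViViT-style tubelet embedding: a 3D convolution whose kernel and
    stride equal the tubelet size maps each t x p x p video volume to
    one token -- the 3D analogue of ViT's patch embedding."""
    def __init__(self, in_chans=3, embed_dim=768, tubelet=(2, 16, 16)):
        super().__init__()
        self.proj = nn.Conv3d(in_chans, embed_dim,
                              kernel_size=tubelet, stride=tubelet)

    def forward(self, x):                    # x: (B, 3, T, H, W)
        x = self.proj(x)                     # (B, 768, T/2, H/16, W/16)
        return x.flatten(2).transpose(1, 2)  # (B, num_tubelets, 768)

# A 16-frame 224x224 clip yields 8 * 14 * 14 = 1568 tokens.
tokens = TubeletEmbedding()(torch.randn(2, 3, 16, 224, 224))
print(tokens.shape)  # torch.Size([2, 1568, 768])
```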

How to use this Repo?

  • Start by reading the survey papers to get a broad understanding of the field.
  • For each key paper, read the abstract and introduction, then skim through the methodology and results sections.
  • Implement key concepts using the provided GitHub repositories or your own code.
  • Experiment with different architectures and datasets to solidify your understanding.
  • Use the additional resources to dive deeper into specific topics or applications.
