Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, 2017, arXiv. DOI: 10.48550/arXiv.1706.03762 - Introduces the Transformer architecture and the self-attention mechanism, forming the basis for Vision Transformers.
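The self-attention mechanism this entry refers to can be sketched in a few lines (a minimal NumPy illustration of scaled dot-product attention, softmax(QK^T / sqrt(d_k))V; the shapes and random inputs are illustrative, not from the paper, and multi-head projection is omitted):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention over one sequence of tokens."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # query-key similarity, scaled by sqrt(d_k)
    # Row-wise softmax turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # each output token is a weighted sum of values

# Illustrative toy input: 4 tokens with dimension d_k = 8.
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out = attention(Q, K, V)
print(out.shape)  # (4, 8): one output vector per input token
```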
An Image Is Worth 16x16 Words: Transformers For Image Recognition At Scale, Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby, 2020, ICLR. DOI: 10.48550/arXiv.2010.11929 - Presents the Vision Transformer (ViT), directly applying Transformer encoders to image classification by processing images as sequences of patches.
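The patch-based tokenization this entry describes can be sketched as follows (a minimal NumPy illustration using the paper's default 16x16 patches on a 224x224 image; the learned linear projection and position embeddings that follow in ViT are omitted):

```python
import numpy as np

# Dummy 224x224 RGB image; ViT splits it into non-overlapping 16x16 patches
# and flattens each patch into one "word" of the input sequence.
image = np.zeros((224, 224, 3), dtype=np.float32)
patch = 16

h, w, c = image.shape
n_h, n_w = h // patch, w // patch  # 14 x 14 grid -> 196 patches

# Reshape to (num_patches, patch * patch * channels).
patches = (
    image.reshape(n_h, patch, n_w, patch, c)
         .transpose(0, 2, 1, 3, 4)
         .reshape(n_h * n_w, patch * patch * c)
)
print(patches.shape)  # (196, 768): a sequence of 196 flattened patches
```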