Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, 2017, arXiv. DOI: 10.48550/arXiv.1706.03762 - Introduces the Transformer architecture and the self-attention mechanism, forming the basis for Vision Transformers.
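The self-attention mechanism this entry refers to can be sketched in a few lines (a minimal NumPy illustration of scaled dot-product attention, softmax(QK^T / sqrt(d_k))V; the shapes and random inputs are illustrative, not from the paper, and multi-head projection is omitted):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention over one sequence of tokens."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # query-key similarity, scaled by sqrt(d_k)
    # Row-wise softmax turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # each output token is a weighted sum of values

# Illustrative toy input: 4 tokens with dimension d_k = 8.
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out = attention(Q, K, V)
print(out.shape)  # (4, 8): one output vector per input token
```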
An Image Is Worth 16x16 Words: Transformers For Image Recognition At Scale, Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby, 2020, ICLR. DOI: 10.48550/arXiv.2010.11929 - Presents the Vision Transformer (ViT), directly applying Transformer encoders to image classification by processing images as sequences of patches.
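The patch-based tokenization this entry describes can be sketched as follows (a minimal NumPy illustration using the paper's default 16x16 patches on a 224x224 image; the learned linear projection and position embeddings that follow in ViT are omitted):

```python
import numpy as np

# Dummy 224x224 RGB image; ViT splits it into non-overlapping 16x16 patches
# and flattens each patch into one "word" of the input sequence.
image = np.zeros((224, 224, 3), dtype=np.float32)
patch = 16

h, w, c = image.shape
n_h, n_w = h // patch, w // patch  # 14 x 14 grid -> 196 patches

# Reshape to (num_patches, patch * patch * channels).
patches = (
    image.reshape(n_h, patch, n_w, patch, c)
         .transpose(0, 2, 1, 3, 4)
         .reshape(n_h * n_w, patch * patch * c)
)
print(patches.shape)  # (196, 768): a sequence of 196 flattened patches
```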