An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby, International Conference on Learning Representations (ICLR), 2021. DOI: 10.48550/arXiv.2010.11929 - Introduces the Vision Transformer (ViT), highlighting both its capabilities and its substantial data requirements, which together motivate hybrid model development.
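To make the paper's core idea concrete (an image becomes a sequence of "16x16 word" tokens), here is a minimal, dependency-light sketch; the function name `patchify` and the shapes are illustrative assumptions, not the authors' code, and the learned linear projection and class token that ViT adds afterward are omitted.

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into a sequence of flattened patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C);
    ViT would then linearly project each row to the model dimension.
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    # Reshape into a grid of non-overlapping patches, then flatten each patch.
    grid = image.reshape(h // patch_size, patch_size,
                         w // patch_size, patch_size, c)
    grid = grid.transpose(0, 2, 1, 3, 4)
    return grid.reshape(-1, patch_size * patch_size * c)

# Example: a 224x224 RGB image yields 14 * 14 = 196 tokens of dimension 768.
tokens = patchify(np.zeros((224, 224, 3)), patch_size=16)
assert tokens.shape == (196, 768)
```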
CoAtNet: Marrying Convolution and Attention for All Data Scales, Zihang Dai, Hanxiao Liu, Quoc V. Le, Mingxing Tan, Advances in Neural Information Processing Systems (NeurIPS), Vol. 34, 2021. DOI: 10.48550/arXiv.2106.04803 - Proposes a general architecture that unifies convolution and self-attention, demonstrating their complementary strengths across data scales and model sizes.
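The conv-then-attention layering that CoAtNet advocates can be sketched as below; this is only a toy illustration under stated assumptions (hypothetical names `local_conv` and `global_attention`, an averaging filter and identity Q/K/V projections standing in for learned weights), not the paper's actual MBConv or relative-attention blocks.

```python
import numpy as np

def local_conv(x: np.ndarray) -> np.ndarray:
    """3x3 averaging filter per channel on an (H, W, C) feature map: a toy
    stand-in for the learned convolutional stages in CoAtNet's early layers."""
    pad = np.pad(x, ((1, 1), (1, 1), (0, 0)), mode="edge")
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += pad[dy:dy + x.shape[0], dx:dx + x.shape[1]]
    return out / 9.0

def global_attention(tokens: np.ndarray) -> np.ndarray:
    """Single-head self-attention over an (n, d) token sequence, with identity
    Q/K/V projections (learned in a real model): a toy global-mixing stage."""
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ tokens

# Hybrid pipeline: convolution for local features, attention for global context.
feat = local_conv(np.random.rand(16, 16, 8))  # convolutional stage (local)
out = global_attention(feat.reshape(-1, 8))   # attention stage over 256 tokens
assert out.shape == (256, 8)
```

The design point this illustrates is the one the paper argues for: convolutions handle local structure cheaply at high resolution, and self-attention is applied once the spatial grid is small enough for global token mixing to be affordable.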