Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin, 2017, Advances in Neural Information Processing Systems (NeurIPS), DOI: 10.5555/3295222.3295349 - Defines the Transformer architecture, including the input embedding and positional encoding components.
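The paper defines its fixed positional encodings in closed form: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal NumPy sketch of that formula follows; the function name and sizes are illustrative, and d_model is assumed even:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    # Rows are positions, columns are embedding dimensions.
    positions = np.arange(max_len)[:, None]        # shape (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # even dimension indices 2i
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)                   # PE(pos, 2i + 1)
    return pe

# The resulting matrix is added to the token embeddings before the first layer.
pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
print(pe.shape)  # (50, 512)
```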
Efficient Estimation of Word Representations in Vector Space, Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, 2013, International Conference on Learning Representations (ICLR) Workshop Proceedings, DOI: 10.48550/arXiv.1301.3781 - Introduces Word2Vec, a foundational method for learning dense word embeddings that capture semantic relationships, a core concept underlying the Transformer's input embeddings.
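As a toy illustration of the embeddings Word2Vec learns, here is a sketch using the third-party gensim library (not part of the original paper; assumes gensim >= 4.0). A corpus this small yields meaningless neighbours; the sketch only shows the workflow:

```python
from gensim.models import Word2Vec  # third-party library, assumed installed

# Toy corpus; the paper trains on corpora with billions of tokens.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["a", "man", "and", "a", "woman", "walk"],
]

# sg=1 selects the skip-gram objective; sg=0 would select CBOW,
# the other architecture the paper introduces.
model = Word2Vec(sentences, vector_size=32, window=2, min_count=1, sg=1, epochs=50)

vec = model.wv["king"]                         # a dense 32-dimensional vector
print(vec.shape)                               # (32,)
print(model.wv.most_similar("king", topn=2))   # nearest words in embedding space
```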
Preprocessing with a tokenizer, 2024 (Hugging Face) - Explains how tokenization converts raw text into the input IDs a model consumes, and the role of embedding layers in practical Transformer implementations.
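A short sketch of that pipeline with the Hugging Face transformers library; the checkpoint name is just an example, and any pretrained model works:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# The tokenizer turns raw text into integer input IDs,
# including special tokens such as [CLS] and [SEP].
enc = tokenizer("Transformers turn text into vectors.", return_tensors="pt")
print(enc["input_ids"])

# The model's embedding layer maps each ID to a dense vector.
with torch.no_grad():
    embeddings = model.get_input_embeddings()(enc["input_ids"])
print(embeddings.shape)  # (batch, seq_len, hidden_size), e.g. (1, 9, 768)
```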
Deep Learning, Ian Goodfellow, Yoshua Bengio, Aaron Courville, 2016 (MIT Press) - Provides fundamental knowledge on distributed representations and embedding layers in neural networks, setting the stage for understanding their use in Transformer models.
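In code, such an embedding layer is simply a learnable lookup table, as in this minimal PyTorch sketch (sizes are illustrative):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 10_000, 512              # illustrative sizes
embedding = nn.Embedding(vocab_size, d_model)  # a learnable lookup table

token_ids = torch.tensor([[5, 42, 7]])  # one sequence of three token IDs
vectors = embedding(token_ids)          # row i of the table for each ID i
print(vectors.shape)                    # torch.Size([1, 3, 512])

# During training, gradients flow into the table rows, so each token's
# distributed representation is learned jointly with the rest of the network.
```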