Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin, 2017 (Advances in Neural Information Processing Systems), DOI: 10.48550/arXiv.1706.03762 - Describes the Transformer architecture, including the attention mechanism and the use of masks in both the encoder and decoder to manage sequence information.
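As a quick illustration of the decoder-side (causal) masking the paper describes, here is a minimal PyTorch sketch; it assumes raw attention scores are already computed and is illustrative, not the paper's reference implementation:

```python
import torch

seq_len = 5
# True above the main diagonal marks "future" positions a decoder must not attend to.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

scores = torch.randn(seq_len, seq_len)                   # stand-in for QK^T / sqrt(d_k)
scores = scores.masked_fill(causal_mask, float("-inf"))  # block attention to the future
weights = torch.softmax(scores, dim=-1)                  # each row sums to 1 over visible positions
```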
Preprocessing data, Hugging Face, 2024 - Provides practical guidance on tokenization, padding, and attention mask creation for Transformer models using the Hugging Face transformers library.
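A minimal sketch of the tokenization-and-padding workflow that guide covers, assuming the transformers library is installed and using bert-base-uncased purely as an illustrative checkpoint:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(
    ["A short sentence.", "A somewhat longer sentence in the same batch."],
    padding=True,          # pad to the longest sequence in the batch
    return_tensors="pt",
)
print(batch["input_ids"].shape)   # (2, max_len in batch)
print(batch["attention_mask"])    # 1 for real tokens, 0 for padding
```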
Speech and Language Processing (3rd ed. draft), Daniel Jurafsky and James H. Martin, 2023 (Stanford University) - A comprehensive textbook covering core NLP concepts, including data preparation techniques for sequence models and details of Transformer architectures, whose batched implementations depend on padding and masking even where the text does not treat them explicitly.
Writing Custom Datasets, DataLoaders and Transforms, PyTorch Team, 2024 - Explains the fundamental concepts of Dataset and DataLoader in PyTorch, which are essential for batching and custom padding logic in data preparation pipelines for deep learning models, including Transformers.
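A minimal sketch of the Dataset/DataLoader pattern that tutorial introduces, paired with a hypothetical collate function that pads variable-length token sequences at batch time (the toy data and pad id 0 are assumptions for illustration):

```python
import torch
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence

class ToyTokenDataset(Dataset):
    """Wraps variable-length token-id sequences as tensors."""
    def __init__(self, sequences):
        self.sequences = [torch.tensor(s) for s in sequences]
    def __len__(self):
        return len(self.sequences)
    def __getitem__(self, idx):
        return self.sequences[idx]

def pad_collate(batch):
    # Pad every sequence to the longest in the batch (pad id 0, an assumption)
    # and derive the matching attention mask.
    padded = pad_sequence(batch, batch_first=True, padding_value=0)
    mask = (padded != 0).long()
    return padded, mask

loader = DataLoader(ToyTokenDataset([[5, 6, 7], [8, 9], [10]]),
                    batch_size=2, collate_fn=pad_collate)
for input_ids, attention_mask in loader:
    print(input_ids.shape, attention_mask.shape)
```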