Neural Machine Translation of Rare Words with Subword Units, Rico Sennrich, Barry Haddow, Alexandra Birch, 2016. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics. DOI: 10.18653/v1/P16-1162 - Introduces Byte Pair Encoding (BPE) for NLP, detailing its effectiveness in handling rare and out-of-vocabulary words by segmenting them into subword units, which bears directly on discussions of sequence length and vocabulary size.
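For concreteness, here is a minimal sketch of the BPE learning loop in the spirit of the example given in the paper (toy vocabulary and merge count chosen purely for illustration):

```python
import re
from collections import Counter

def get_stats(vocab):
    """Count frequencies of adjacent symbol pairs across the current vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Replace every occurrence of the chosen pair with its concatenation."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Toy word-frequency dictionary; '</w>' marks the end of a word.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

for _ in range(10):  # the number of merges is the main hyperparameter
    pairs = get_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print(best)  # learned merge rules, most frequent pair first
```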
Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, 2017. Advances in Neural Information Processing Systems 30 (NIPS 2017). DOI: 10.48550/arXiv.1706.03762 - Introduces the Transformer architecture, which underpins modern LLMs. Explains the self-attention mechanism, whose quadratic computational cost with respect to sequence length ($O(L^2)$) is a primary factor in vocabulary size trade-offs.
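The quadratic cost follows directly from the paper's scaled dot-product attention: the score matrix $QK^{\top}$ has one entry per pair of positions, so for sequence length $L$,

\[
  \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,
  \qquad Q, K \in \mathbb{R}^{L \times d_k},\; V \in \mathbb{R}^{L \times d_v},
\]
\[
  \text{cost of forming } QK^{\top} = O(L^2 d_k), \qquad
  \text{cost of the softmax-weighted sum} = O(L^2 d_v).
\]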
RoBERTa: A Robustly Optimized BERT Pretraining Approach, Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov, 2019. arXiv preprint arXiv:1907.11692. DOI: 10.48550/arXiv.1907.11692 - Explores optimized pre-training strategies for BERT-like models, including the switch to a larger byte-level BPE vocabulary (50K tokens) in place of BERT's 30K WordPiece vocabulary, demonstrating improvements and the practical implications of vocabulary size choices.
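One quick way to see the practical difference between the two vocabularies (not part of the paper itself; assumes the Hugging Face `transformers` package is installed and the `bert-base-uncased` and `roberta-base` checkpoints can be downloaded):

```python
# Illustration only: compare BERT's WordPiece tokenizer with RoBERTa's
# byte-level BPE tokenizer on the same sentence.
from transformers import AutoTokenizer

bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # ~30K WordPiece vocabulary
roberta_tok = AutoTokenizer.from_pretrained("roberta-base")    # ~50K byte-level BPE vocabulary

sentence = "Tokenization granularity affects sequence length."
print(len(bert_tok), len(roberta_tok))      # vocabulary sizes
print(bert_tok.tokenize(sentence))          # WordPiece segmentation
print(roberta_tok.tokenize(sentence))       # byte-level BPE segmentation ('Ġ' marks a leading space)
```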