BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, 2019. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (Association for Computational Linguistics). DOI: 10.18653/v1/N19-1423 - Introduces the BERT model and its pre-training objectives (Masked Language Modeling and Next Sentence Prediction), explaining the purpose and usage of the [CLS], [SEP], and [MASK] tokens.
Tokenizers - Hugging Face documentation, Hugging Face, 2024 - Provides comprehensive documentation on how tokenizers are built, configured, and used, including the explicit handling and management of special tokens within the Hugging Face ecosystem (a short usage sketch follows this list).
Neural Machine Translation of Rare Words with Subword Units, Rico Sennrich, Barry Haddow, and Alexandra Birch, 2016. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Association for Computational Linguistics). DOI: 10.18653/v1/P16-1162 - Introduces Byte Pair Encoding (BPE) for subword tokenization, a foundational algorithm that reduces out-of-vocabulary (OOV) rates and keeps vocabulary size manageable, setting the stage for special tokens that carry structural information (a minimal implementation sketch follows this list).
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean, 2016. arXiv preprint arXiv:1609.08144. DOI: 10.48550/arXiv.1609.08144 - Introduces WordPiece tokenization, an alternative subword tokenization method used in models like BERT, which complements BPE in managing vocabulary and rare words.
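To make the role of special tokens concrete, here is a minimal sketch using the Hugging Face `transformers` library; the `bert-base-uncased` checkpoint is an illustrative choice, not one mandated by the references above. It shows how [CLS] and [SEP] are inserted automatically when encoding a sentence pair, and that special tokens are ordinary vocabulary entries with reserved ids.

```python
# A minimal sketch, assuming the `transformers` library is installed
# and the "bert-base-uncased" checkpoint is available.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Encoding a sentence pair: the tokenizer prepends [CLS] and places
# [SEP] between and after the two segments automatically.
encoding = tokenizer("How are tokenizers built?", "They use subword units.")
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
# e.g. ['[CLS]', 'how', 'are', ..., '[SEP]', 'they', ..., '[SEP]']

# Special tokens are regular vocabulary entries with reserved ids.
print(tokenizer.cls_token, tokenizer.cls_token_id)    # [CLS] 101
print(tokenizer.mask_token, tokenizer.mask_token_id)  # [MASK] 103
```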
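And here is a minimal sketch of the BPE merge-learning loop described by Sennrich et al. (2016); the toy word-frequency vocabulary mirrors the paper's running example, while the number of merges is an arbitrary assumption for demonstration.

```python
# A minimal sketch of BPE merge learning (Sennrich et al., 2016).
# The toy vocabulary and merge count are illustrative assumptions.
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count frequencies of adjacent symbol pairs across all words."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with its concatenation."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq
            for word, freq in vocab.items()}

# Words are space-separated characters plus an end-of-word marker </w>.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for _ in range(10):  # learn 10 merges
    stats = get_pair_stats(vocab)
    if not stats:
        break
    best = max(stats, key=stats.get)
    vocab = merge_pair(best, vocab)
    print(best)  # e.g. ('e', 's'), ('es', 't'), ('est', '</w>'), ...
```

Each learned merge becomes a vocabulary entry, so frequent character sequences are represented as single subword units while rare words decompose into smaller pieces, which is what keeps OOV rates low.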