Neural Machine Translation of Rare Words with Subword Units, Rico Sennrich, Barry Haddow, Alexandra Birch, 2016. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). DOI: 10.18653/v1/P16-1162 - This foundational paper introduced Byte Pair Encoding (BPE) to natural language processing as a way to handle rare and out-of-vocabulary words in neural machine translation, establishing it as a standard subword tokenization method.
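As a concrete illustration, the paper itself includes a toy Python implementation of the BPE merge-learning loop; the sketch below closely follows it, learning merge rules from a small hypothetical word-frequency vocabulary (the `</w>` symbol marks a word boundary):

```python
import re
import collections

def get_stats(vocab):
    """Count the frequency of each adjacent symbol pair in the vocabulary."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, v_in):
    """Merge every occurrence of the given symbol pair into a single symbol."""
    v_out = {}
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        v_out[p.sub(''.join(pair), word)] = v_in[word]
    return v_out

# Toy corpus as word frequencies; '</w>' marks the end of each word.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
for _ in range(10):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)  # most frequent pair becomes a merge rule
    vocab = merge_vocab(best, vocab)
    print(best)
```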
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, 2019. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). DOI: 10.18653/v1/N19-1423 - This seminal paper introduced the BERT model, which prominently uses WordPiece tokenization, and demonstrates the practical importance of subword tokenization in large-scale transformer models for language understanding.
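To see WordPiece in action, here is a minimal sketch using the Hugging Face transformers library (an assumption here, not part of the original BERT release) that shows how uncommon words are split into subword pieces:

```python
# Requires: pip install transformers (vocab/weights download on first use).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# WordPiece splits words missing from the vocabulary into known subword
# units; pieces that continue a word carry the '##' prefix.
print(tokenizer.tokenize("tokenization"))  # e.g. ['token', '##ization']
print(tokenizer.tokenize("unaffable"))     # split into smaller known pieces
```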
tokenizers: Fast State-of-the-Art Tokenizers, Hugging Face, 2023 (Hugging Face) - The official documentation for the Hugging Face tokenizers library, which provides optimized implementations of BPE, WordPiece, and other subword tokenization algorithms used in transformer models.
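For quick orientation, a minimal training sketch in the style of the library's quicktour (with "corpus.txt" standing in as a hypothetical local text file):

```python
# Requires: pip install tokenizers; "corpus.txt" is a hypothetical corpus file.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# A BPE model with an explicit unknown-token fallback.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Learn merges from the corpus, reserving common special tokens.
trainer = BpeTrainer(
    vocab_size=5000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)

encoding = tokenizer.encode("Subword units handle rare words.")
print(encoding.tokens)  # BPE subword tokens for the sentence
```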
Natural Language Processing with Transformers, Lewis Tunstall, Leandro von Werra, Thomas Wolf, 2022 (O'Reilly Media) - A practical guide that includes detailed explanations and code examples for various tokenization methods, special tokens, and their application within the Hugging Face Transformers ecosystem.
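In the same spirit as the book's examples, a short sketch (assuming the transformers library and a BERT checkpoint) shows where special tokens such as [CLS] and [SEP] enter an encoded input:

```python
# Requires: pip install transformers.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("Hello world!")

# BERT-style models expect [CLS] at the start and [SEP] at the end of
# each sequence; the tokenizer inserts them automatically.
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# e.g. ['[CLS]', 'hello', 'world', '!', '[SEP]']
```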