Neural Machine Translation of Rare Words with Subword Units, Rico Sennrich, Barry Haddow, Alexandra Birch, 2016. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics. DOI: 10.18653/v1/P16-1162 - Introduces Byte Pair Encoding (BPE) for NLP, detailing its effectiveness in handling rare and out-of-vocabulary words by segmenting them into subword units, which bears directly on discussions of sequence length and vocabulary size.
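For concreteness, here is a minimal sketch of the BPE learning loop in the spirit of the example given in the paper (toy vocabulary and merge count chosen purely for illustration):

```python
import re
from collections import Counter

def get_stats(vocab):
    """Count frequencies of adjacent symbol pairs across the current vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Replace every occurrence of the chosen pair with its concatenation."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Toy word-frequency dictionary; '</w>' marks the end of a word.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

for _ in range(10):  # the number of merges is the main hyperparameter
    pairs = get_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print(best)  # learned merge rules, most frequent pair first
```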
Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, 2017. Advances in Neural Information Processing Systems 30 (NIPS 2017). DOI: 10.48550/arXiv.1706.03762 - Introduces the Transformer architecture, which underpins modern LLMs. Explains the self-attention mechanism, whose quadratic computational cost with respect to sequence length ($O(L^2)$) is a primary factor in vocabulary size trade-offs.
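The quadratic cost follows directly from the paper's scaled dot-product attention: the score matrix $QK^{\top}$ has one entry per pair of positions, so for sequence length $L$,

\[
  \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,
  \qquad Q, K \in \mathbb{R}^{L \times d_k},\; V \in \mathbb{R}^{L \times d_v},
\]
\[
  \text{cost of forming } QK^{\top} = O(L^2 d_k), \qquad
  \text{cost of the softmax-weighted sum} = O(L^2 d_v).
\]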
RoBERTa: A Robustly Optimized BERT Pretraining Approach, Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov, 2019. arXiv preprint arXiv:1907.11692. DOI: 10.48550/arXiv.1907.11692 - Explores optimized pre-training strategies for BERT-like models, including the switch to a larger byte-level BPE vocabulary (50K tokens) in place of BERT's 30K WordPiece vocabulary, demonstrating improvements and the practical implications of vocabulary size choices.
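One quick way to see the practical difference between the two vocabularies (not part of the paper itself; assumes the Hugging Face `transformers` package is installed and the `bert-base-uncased` and `roberta-base` checkpoints can be downloaded):

```python
# Illustration only: compare BERT's WordPiece tokenizer with RoBERTa's
# byte-level BPE tokenizer on the same sentence.
from transformers import AutoTokenizer

bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # ~30K WordPiece vocabulary
roberta_tok = AutoTokenizer.from_pretrained("roberta-base")    # ~50K byte-level BPE vocabulary

sentence = "Tokenization granularity affects sequence length."
print(len(bert_tok), len(roberta_tok))      # vocabulary sizes
print(bert_tok.tokenize(sentence))          # WordPiece segmentation
print(roberta_tok.tokenize(sentence))       # byte-level BPE segmentation ('Ġ' marks a leading space)
```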