LangChain Text Splitters, LangChain Development Team, 2024 (LangChain) - Official documentation detailing various text splitting strategies within LangChain, including RecursiveCharacterTextSplitter and TokenTextSplitter, essential for practical implementation.
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela, 2020 (NeurIPS 2020), DOI: 10.48550/arXiv.2005.11401 - This paper introduces the Retrieval-Augmented Generation (RAG) framework, providing the theoretical context for why breaking down documents into manageable chunks is a fundamental preprocessing step for efficient information retrieval.
Lost in the Middle: How Language Models Use Long Contexts, Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, Percy Liang, 2023 (Transactions of the Association for Computational Linguistics, TACL), DOI: 10.48550/arXiv.2307.03172 - Investigates the performance of language models when dealing with long input contexts, identifying the 'lost in the middle' phenomenon that chunking strategies aim to overcome by providing focused text segments.