Scalable Document Chunking and Preprocessing Strategies
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela, 2020, NeurIPS. DOI: 10.48550/arXiv.2005.11401 - Presents the foundational concept of Retrieval-Augmented Generation (RAG) and the hybrid approach of combining parametric and non-parametric memory for NLP tasks.
Learning Spark: Lightning-Fast Data Analytics (2nd Edition), Jules S. Damji, Brooke Wenig, Tathagata Das, Denny Lee, 2020, O'Reilly Media - Provides a detailed guide to Apache Spark, covering its architecture, programming models, and methods for large-scale data processing and analytics.
Lost in the Middle: How Language Models Use Long Contexts, Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, Percy Liang, 2023, Transactions of the Association for Computational Linguistics (TACL). DOI: 10.48550/arXiv.2307.03172 - Investigates how large language models utilize long input contexts and identifies the "lost in the middle" problem, which informs strategies for optimal context window management and chunking.