Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela, 2020. Advances in Neural Information Processing Systems (NeurIPS) 2020. DOI: 10.48550/arXiv.2005.11401 - Introduces the original RAG architecture, essential for understanding the system's components and potential latency points.
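The RAG architecture this entry refers to is a retrieve-then-generate pipeline: a retriever selects the top-k passages for a query, and a seq2seq generator conditions on the query plus those passages. A minimal sketch of that loop, using a toy term-overlap retriever as a stand-in for DPR's dense retrieval and a stub in place of the BART generator (both are illustrative assumptions, not the paper's implementation):

```python
# Sketch of the RAG retrieve-then-generate loop. The retriever is a toy
# term-overlap scorer standing in for dense (DPR) retrieval; the "generator"
# is a stub where a real system would run a seq2seq model such as BART.

def tokens(text):
    """Lowercase, punctuation-stripped token set (toy tokenizer)."""
    return set(text.lower().replace(".", " ").replace("?", " ").split())

def retrieve(query, corpus, k=2):
    """Rank passages by term overlap with the query and keep the top k."""
    q_terms = tokens(query)
    scored = sorted(corpus, key=lambda doc: len(q_terms & tokens(doc)),
                    reverse=True)
    return scored[:k]

def generate(query, passages):
    """Stub generator: real RAG conditions a seq2seq model on
    the query concatenated with the retrieved passages."""
    return f"Answer to '{query}' grounded in: {' '.join(passages)}"

corpus = [
    "Paris is the capital of France.",
    "The Eiffel Tower is in Paris.",
    "Berlin is the capital of Germany.",
]
query = "What is the capital of France?"
passages = retrieve(query, corpus)
print(generate(query, passages))
```

Each stage of this pipeline (retrieval, context assembly, generation) is a distinct latency contributor, which is why the paper is a useful map of where serving time goes.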
Efficient Memory Management for Large Language Model Serving with PagedAttention, Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, 2023. arXiv preprint arXiv:2309.06180. DOI: 10.48550/arXiv.2309.06180 - Presents PagedAttention, a technique that significantly improves LLM serving throughput and reduces latency for optimized inference.
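PagedAttention's core idea is to split the KV cache into fixed-size blocks allocated on demand from a shared pool, so each sequence wastes at most one partially filled block, much like virtual-memory paging. A toy sketch of that block-table bookkeeping (class and method names are invented for illustration; this is not the vLLM API):

```python
# Toy illustration of PagedAttention-style KV-cache management: sequences
# map logical token positions to physical blocks via a per-sequence block
# table, and blocks come from a shared free pool. Names are hypothetical.

BLOCK_SIZE = 4  # tokens per KV-cache block (illustrative value)

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # shared physical pool
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of cached tokens

    def append_token(self, seq_id):
        """Reserve cache space for one new token, allocating a fresh
        block only when the sequence crosses a block boundary."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:  # last block full, or none allocated yet
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8)
for _ in range(6):  # a 6-token sequence needs ceil(6/4) = 2 blocks
    cache.append_token("seq0")
print(cache.block_tables["seq0"])  # two physical block ids
```

Because unused blocks stay in the pool for other requests, fragmentation stays bounded and many more sequences can be batched, which is where the throughput and latency gains come from.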