Dense Passage Retrieval for Open-Domain Question Answering, Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih, 2020. EMNLP 2020. DOI: 10.48550/arXiv.2004.04906 - Details Dense Passage Retrieval (DPR), a dual-encoder system that learns dense embeddings for questions and passages and retrieves passages from large text corpora by embedding similarity, addressing how to build effective retrieval systems for RAG at scale.
Efficient Memory Management for Large Language Model Serving with PagedAttention, Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, 2023. Proceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP). DOI: 10.48550/arXiv.2309.06180 - Presents vLLM, a high-throughput LLM serving system that incorporates PagedAttention and continuous batching to optimize inference performance, directly relevant to the generation layer's efficiency.
LoRA: Low-Rank Adaptation of Large Language Models, Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, 2021. arXiv preprint. DOI: 10.48550/arXiv.2106.09685 - Introduces LoRA, a parameter-efficient fine-tuning technique that freezes the pretrained weights and trains small low-rank update matrices, adapting large language models with far fewer trainable parameters than full fine-tuning and addressing efficiency in LLM specialization.