Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, S. Riedel, D. Kiela, 2020, Advances in Neural Information Processing Systems, Vol. 33 (Curran Associates, Inc.), DOI: 10.48550/arXiv.2005.11401 - This paper introduced the RAG paradigm, providing the core framework for understanding how retrieval and generation components interact. It is foundational for evaluating RAG systems.
A Survey of Hallucination in Large Language Models: Principles, Taxonomy, and Challenges, Ziwei Ji, Nayeon Lee, Rita Singh, Eric P. Xing, 2023, ACM Computing Surveys, Vol. 56 (Association for Computing Machinery), DOI: 10.1145/3618497 - A comprehensive survey that categorizes and discusses methods for detecting and mitigating hallucinations in LLMs. This is directly relevant to measuring faithfulness and hallucination rates in RAG system outputs.
Site Reliability Engineering: How Google Runs Production Systems, Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphy, 2016 (O'Reilly Media) - This book provides fundamental principles and practices for operating large-scale distributed systems, including detailed discussions on latency, throughput, error rates, and other operational metrics essential for RAG systems at production scale.
Benchmarking Large Language Models for Retrieval-Augmented Generation, Junzhang Shi, Kaiyu Huang, Shibo Hao, Ziyuan Zeng, Xiaofei Sun, Wenge Rong, Jianxin Li, Yexin Li, 2023, arXiv preprint - This paper presents a benchmarking framework for RAG, addressing various aspects of evaluation including retrieval effectiveness, generation quality, and efficiency, offering insights directly applicable to defining evaluation metrics.
Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications, Chip Huyen, 2022 (O'Reilly Media) - This book covers the engineering principles for building and operating production-ready ML systems. It discusses cost efficiency, resource utilization, scalability, and MLOps, which are vital for large-scale RAG system performance and operational metrics.