Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica, 2023NeurIPS 2023 Datasets and Benchmarks TrackDOI: 10.48550/arXiv.2306.05685 - Examines the efficacy and methodology of using large language models as evaluators for other LLMs, relevant for reference-free generation metrics.
Introduction to Information Retrieval, Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, 2008 (Cambridge University Press) - A classic textbook providing foundational explanations for information retrieval metrics such as MRR and nDCG, used in RAG systems.