Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica, 2023, NeurIPS 2023 Datasets and Benchmarks Track, DOI: 10.48550/arXiv.2306.05685 - Investigates the reliability and capabilities of using large language models as evaluators, a technique discussed for measuring metrics like faithfulness and relevance.
RAGAS Documentation, RAGAS Contributors, 2024 - Provides practical guidance and detailed explanations for implementing and using the RAGAS framework and its advanced metrics for RAG evaluation.