Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica. NeurIPS 2023 Datasets and Benchmarks Track, 2023. DOI: 10.48550/arXiv.2306.05685 - Presents MT-Bench for evaluating multi-turn instruction following and conversational ability, and rigorously analyzes the effectiveness and limitations of using large language models as evaluators of other LLMs.