MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models
Guijin Son, Dongkeun Yoon, Juyoung Suk, Javier Aula-Blasco, Mano Aslan, Vu Trong Kim, Shayekh Bin Islam, Jaume Prats-Cristià, Lucía Tormo-Bañuelos, Seungone Kim
Abstract
As Large Language Models (LLMs) can now produce fluent and coherent content in languages other than English, it has become imperative to evaluate these non-English outputs precisely. However, when assessing the outputs of multilingual LLMs, prior work has often employed LLM-based evaluators that excel at assessing English outputs, without thoroughly examining whether these evaluators can also assess non-English text effectively.
Moreover, existing benchmarks for testing evaluator LLMs (referred to as "meta-evaluation benchmarks") are mostly English-centric. To bridge this gap and examine the multilingual proficiency of evaluator LLMs, we introduce MM-Eval, a multilingual meta-evaluation benchmark comprising six subsets that cover 18 languages.