MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models

Guijin Son; Dongkeun Yoon; Juyoung Suk; Javier Aula-Blasco; Mano Aslan; Vu Trong Kim; Shayekh Bin Islam; Jaume Prats-Cristià; Lucía Tormo-Bañuelos; Seungone Kim

Preprint·2024

Abstract

As Large Language Models (LLMs) can now produce fluent and coherent content in languages other than English, it has become imperative to evaluate these non-English outputs precisely. However, when assessing the outputs of multilingual LLMs, prior work has often employed LLM-based evaluators that excel at assessing English outputs, without thoroughly examining whether these evaluators can effectively assess non-English text as well.

Moreover, existing benchmarks for testing evaluator LLMs (referred to as "meta-evaluation benchmarks") are mostly English-centric. To bridge this gap and examine the multilingual proficiency of evaluator LLMs, we introduce MM-Eval, a multilingual meta-evaluation benchmark comprising six subsets that cover 18 languages.

Cite

@article{son2024mmeval,
  title   = {MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models},
  author  = {Son, Guijin and Yoon, Dongkeun and Suk, Juyoung and Aula-Blasco, Javier and Aslan, Mano and Kim, Vu Trong and Islam, Shayekh Bin and Prats-Cristià, Jaume and Tormo-Bañuelos, Lucía and Kim, Seungone},
  journal = {arXiv preprint arXiv:2410.17578},
  year    = {2024},
  url     = {https://arxiv.org/abs/2410.17578}
}