Choosing the Right Evaluation for Machine Translation: an Examination of Annotator and Automatic Metric Performance on Human Judgment Tasks
Michael J. Denkowski, A. Lavie
2010 · DBLP: conf/amta/DenkowskiL10
Conference of the Association for Machine Translation in the Americas · 54 Citations
TLDR
This paper examines the motivation, design, and practical results of several types of human evaluation tasks for machine translation. It also explores the practicality of tuning automatic evaluation metrics to each judgment type, in a comprehensive experiment using the METEOR-NEXT metric.
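The paper's core idea of tuning a metric to a particular human judgment type can be illustrated with a minimal sketch: search over a metric parameter and keep the setting whose segment-level scores correlate best with human scores for that judgment type. This is not the paper's METEOR-NEXT tuning code; the alpha parameter mimics METEOR's precision/recall weight, and all data values below are hypothetical.

```python
# Illustrative sketch only, not the authors' actual tuning procedure.
# Tunes a single METEOR-style parameter (alpha, the precision/recall
# weight in the harmonic mean) to maximize Pearson correlation with
# hypothetical human adequacy judgments.

def f_mean(precision: float, recall: float, alpha: float) -> float:
    """METEOR-style parameterized harmonic mean of precision and recall."""
    if precision == 0 or recall == 0:
        return 0.0
    return (precision * recall) / (alpha * precision + (1 - alpha) * recall)

def pearson(xs: list[float], ys: list[float]) -> float:
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-segment (precision, recall) pairs from a metric's
# matcher, paired with human judgment scores for the same segments.
segments = [(0.8, 0.6), (0.5, 0.7), (0.9, 0.9), (0.4, 0.3), (0.7, 0.5)]
human = [3.2, 3.0, 4.5, 1.8, 2.9]

# Grid search over alpha: keep the setting whose metric scores
# correlate best with the human judgments for this judgment type.
best = max(
    (a / 100 for a in range(1, 100)),
    key=lambda a: pearson([f_mean(p, r, a) for p, r in segments], human),
)
print(f"best alpha = {best:.2f}")
```

In practice, each human judgment type (adequacy, ranking, post-editing effort, and so on) would yield its own tuned parameter set, which is the comparison the paper carries out with METEOR-NEXT.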
