The elephant is in the room, or MT is still far from human parity

In the middle of the May holidays, on May 6, Slator, the industry's largest news portal, published an overview post about a report from a Google research team whose significance the industry has yet to fully grasp. Despite its somewhat misleading title, the report will certainly draw a great deal of attention, so we'll try to clarify a few points from a practical perspective, in the context of our industry.

The article is complex, and the problem of measuring translation quality, as the article's own opening sentence makes clear, is a difficult one. So we'll be concise.

WMT is a machine-translation conference that discusses the results of the "shared" tasks that the organizing committee puts before the participating researchers each year. One such task is the development of automatic metrics for assessing translation quality. Automatic metrics are needed to judge the quality of machine translation while NMT models are being developed, particularly when the MT output must be scored programmatically and instantly, and because human quality assessment would be far too expensive and slow.
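
To make the idea of an automatic metric concrete, here is a minimal sketch of how such a score is typically computed in practice, using BLEU via the sacrebleu package. The choice of BLEU and the file names are our own illustrative assumptions; the WMT metrics task covers many different metrics.

```python
# Minimal sketch: scoring MT output against human references with an automatic
# metric (BLEU via sacrebleu). Assumes `pip install sacrebleu`; the file names
# below are hypothetical.
import sacrebleu

# One segment per line, MT output aligned line-by-line with the human reference.
with open("mt_output.txt", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("reference.txt", encoding="utf-8") as f:
    references = [line.strip() for line in f]

# corpus_bleu takes a list of hypothesis strings and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.1f}")  # instant and cheap, but only a proxy for quality
```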

The metric shared task rests on a basic premise: given machine translations, human translations and "human" quality assessments of them, researchers try to build an automatic metric that correlates as closely as possible with this so-called human judgment.
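
As a sketch of what "correlates with human judgment" means in practice: metric developers compute a correlation coefficient between the metric's scores and the human scores for the same segments or systems. The numbers below are invented purely for illustration, and scipy is assumed to be available.

```python
# Toy illustration: how well does an automatic metric track human judgment?
# The scores below are made up for demonstration only.
from scipy.stats import kendalltau, pearsonr

human_scores  = [78, 92, 55, 60, 88, 71]                # e.g. 0-100 ratings of six segments
metric_scores = [0.61, 0.83, 0.42, 0.50, 0.79, 0.58]    # automatic metric output

tau, _ = kendalltau(human_scores, metric_scores)   # rank agreement
r, _   = pearsonr(human_scores, metric_scores)     # linear agreement
print(f"Kendall tau = {tau:.2f}, Pearson r = {r:.2f}")
# The WMT metrics task ranks submitted metrics by exactly this kind of correlation.
```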

It certainly sounds reasonable, but as always, the devil is in the details. For example, when I realized how these "human judgments" were actually produced, I was shocked. The fact is that all these years the "human" quality assessment was produced by a crowd of random "workers" from Mechanical Turk. In other words, anyone (not even amateur translators, but literally just anyone) was given translations over the Internet and assigned them a score from zero to 100 points.

Any specialist in the field of professional translation would laugh at such an idea. After all, what can such a method be compared to? It's like trying to treat a disease by asking the first person you meet for advice. It's like trying to get legal advice at the village bazaar.

Amazingly, years ago it was often a struggle to convince large clients that accurate translation assessment requires carefully developed metrics and trained linguists, and that even a working translator cannot give a reliable quality score without at least minimal training in quality assessment. Yet the notion that anyone can judge translation quality proved remarkably tenacious, and for many years the entire industry had to live with judgments produced by a fundamentally flawed approach.

The Google Research article under discussion, titled “Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation,” is the first to write what translation experts have been saying over and over again in recent years: “There is increasing evidence that inadequate evaluation procedures can lead to erroneous conclusions.”

So there is an elephant in the room after all. We're glad you finally noticed it!

What did the Google Research team do, and what did they discover?

They did what should have been done a long time ago: they took what the translation and localization industry has developed in recent years, namely the MQM (Multidimensional Quality Metrics) typology, and built a machine-translation quality assessment metric on top of it, to be used as a "platinum standard" of quality assessment.

That is, Google Research specialists finally took the MQM typology, selected a subset of error categories, and had professional linguists annotate translations according to those categories and their severity levels.

From these annotations they obtained a "benchmark" assessment of translation quality.
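
To make the procedure concrete, here is a minimal sketch of how an MQM-style score can be computed from such annotations: each error span is labeled with a category and a severity, and a weighted penalty is aggregated per segment. The categories, weights and example annotations below are illustrative assumptions, not the exact values used in the paper.

```python
# Illustrative MQM-style scoring: annotators mark error spans with a category
# and a severity; a weighted penalty is then summed per segment.
# The weights and example annotations are assumptions for demonstration only.

SEVERITY_WEIGHTS = {"minor": 1.0, "major": 5.0}  # illustrative weights

# Each annotation: (error category, severity)
segment_annotations = [
    ("Accuracy/Mistranslation", "major"),
    ("Fluency/Grammar", "minor"),
    ("Style/Awkward", "minor"),
]

def mqm_penalty(annotations):
    """Sum of severity weights over all annotated errors in a segment."""
    return sum(SEVERITY_WEIGHTS[severity] for _, severity in annotations)

print(f"MQM penalty = {mqm_penalty(segment_annotations)}")  # lower is better; 0 = no errors found
```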

After that, they compared this benchmark with ratings of the same translations from two other groups, randomly recruited (crowdsourced) raters and professional linguists, collected with simpler evaluation methods.

The results obtained demonstrated that:

  1. The translation-quality evaluations done previously (crowd workers on Mechanical Turk) correlate very weakly with the MQM assessment (simply put, they aren't suitable at all).
  2. Even automated metrics based on embeddings give better results than the crowd ratings (a sketch of such a metric follows this list).
  3. This is because crowdsourced ratings don't adequately assess translation quality: "Our results support the assumption that crowd-workers are biased to prefer literal, easy-to-rate translations and rank Human-P low." (Simply put, to distinguish a good translation from a bad one, you need a certain level of language proficiency and an understanding of what the text is actually about.)
  4. The quality gap between MT and human translations is still large. “The gap between human translations and MT is even more visible when looking at the MQM ratings which sets the human translations first by a large margin, demonstrating that the quality difference between MT and human translation is still great.” “Unlike ratings acquired by crowd-worker and ratings acquired by professional translators on simpler human evaluation methodologies, MQM labels acquired with professional translators show a large gap between the quality of human and machine generated translations. This demonstrates that MT is still far from human parity.”
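
As promised above, here is a rough sketch of what an embedding-based metric does: it scores an MT hypothesis by the similarity of its vector representation to that of a human reference, rather than by literal word overlap. The `embed()` function below is a placeholder of our own for any pretrained sentence encoder; real metrics of this kind (BERTScore and similar) are more sophisticated, but the idea is the same.

```python
# Rough sketch of an embedding-based metric: cosine similarity between the
# sentence embeddings of an MT hypothesis and a human reference.
# `embed` is a stand-in for a real sentence-embedding model of your choice.
import numpy as np

def embed(sentence: str) -> np.ndarray:
    """Placeholder: return a fixed-size vector for a sentence.
    In practice this would call a pretrained encoder (e.g. a BERT-family model)."""
    rng = np.random.default_rng(abs(hash(sentence)) % (2**32))
    return rng.standard_normal(768)  # fake 768-dim embedding for the sketch

def embedding_score(hypothesis: str, reference: str) -> float:
    h, r = embed(hypothesis), embed(reference)
    return float(np.dot(h, r) / (np.linalg.norm(h) * np.linalg.norm(r)))

print(embedding_score("The cat sat on the mat.", "A cat was sitting on the mat."))
```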

Consider it this way: when machine translation quality was poor, even evaluation by Mechanical Turk "workers" was useful. But as MT quality has improved, that kind of evaluation increasingly leads to inaccurate conclusions (quote): "due to expedience, human evaluation of MT is often carried out on isolated sentences by inexperienced raters with the aim of assigning a single score or ranking. When MT quality is poor, this can provide a useful signal; but as quality improves, there is a risk that the signal will become lost in rater noise or bias. Recent papers have argued that poor human evaluation practices have led to misleading results."

New, more accurate quality measurement methods are needed now.

Among these inaccurate conclusions are the rankings of certain machine-translation models: some received much better scores than they deserved, while a few others were rated somewhat lower than their actual quality warranted. Simply put, the naive literal approach failed to reflect the real quality of these MT models, in most cases inflating the result.

We have also seen claims that "our blind evaluators preferred machine translation to human translation" and that "machine translation delivers human-level quality." These are clearly the product of the flawed quality assessments discussed above, amplified by hype, and they contradict the industry's own understanding; yet the voices of professionals were dismissed as "resistance to change."

So, our thanks to the people at Google Research for their study and article, which finally acknowledge and demonstrate that the entire translation industry has been forced to listen to hype, a chorus so deafening that it was hard to counter.

I’ll repeat a few points:

  1. Human parity is still a long way off: machine translation is very different from professional human translation, even though translations of some texts built on general vocabulary can be strikingly good. The reasons are discussed in my article "What Is Hiding the 'AI': It's Around Us, but Does It Exist at All?"
  2. An automated translation-quality assessment metric has yet to be developed.
  3. Professional translation is a pie that isn’t so easily sliced.

Enjoy studying the above materials and understanding the “newly discovered” facts!

11 MAY 2021