Why you should not base your workflow process decisions on any segment-level score (including Phrase’s new QPS)


As I watched the recent video presentation of the Quality Performance Score (QPS) from Phrase with great interest, several pertinent questions came to mind that I felt compelled to share.

I did like the fact that MQM (Multidimensional Quality Metrics) was mentioned 52 times during this presentation. After all, my entire research in the field of translation quality evaluation has been centered around MQM ever since my participation in the QTLaunchpad project in 2012. It’s great to see that the long-time, continuous effort of outstanding volunteer experts is now recognized as an invaluable tool by Google Research and other key industry players, including Phrase. I am also happy that the vision I formulated in 2020, that translation quality evaluation would become more, not less, important with the advent of AI, is indeed becoming a reality.

But two things nagged at me: the fact that QPS is only very indirectly built on MQM, and, of course, my understanding that such segment-level evaluations cannot be accurate, stable, or reliable. Let’s have a look at why.

From the presentation, it appears that QPS is a segment-level direct estimate of quality on a scale of 0 to 100 (made by some kind of AI model trained on some kind of data). In NLP, this scoring method is called Direct Assessment (DA), and it has been around for a while. In Direct Assessment, humans rate each segment of the output from an MT system with an absolute score or label. The method has been in use since the WMT16 challenge in 2016.
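To make the mechanics concrete, here is a minimal sketch of how WMT-style DA scores are typically aggregated: each rater’s raw 0–100 scores are standardized to remove individual scoring bias and then averaged per segment. The data and names are purely illustrative and have nothing to do with Phrase’s implementation.

```python
# Minimal sketch of WMT-style Direct Assessment (DA) aggregation.
# Hypothetical data: each rater gives each segment a raw 0-100 score;
# scores are z-normalized per rater, then averaged per segment.
from collections import defaultdict
from statistics import mean, stdev

# rater -> {segment_id: raw 0-100 score}  (illustrative ratings)
raw = {
    "rater_A": {1: 85, 2: 40, 3: 70},
    "rater_B": {1: 95, 2: 60, 3: 90},
}

def z_normalize(scores):
    """Standardize one rater's scores to remove that rater's individual bias."""
    values = list(scores.values())
    mu, sigma = mean(values), stdev(values)
    return {seg: (s - mu) / sigma for seg, s in scores.items()}

per_segment = defaultdict(list)
for rater, scores in raw.items():
    for seg, z in z_normalize(scores).items():
        per_segment[seg].append(z)

# Final DA score per segment = mean of the raters' standardized scores.
da_scores = {seg: mean(zs) for seg, zs in per_segment.items()}
print(da_scores)
```

Note that the final number per segment still rests entirely on each rater’s holistic impression of that one segment, which is exactly the weakness discussed below.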

It bears noting that initially, the humans for DA in the WMT challenge were sourced from Amazon’s Mechanical Turk crowdsourcing platform. I remember being both amused and horrified when I learned that NLP researchers based their “human parity” claims on DA ratings from Mechanical Turk.

It is common knowledge in any human activity that the quality of an evaluation depends on the reviewer’s qualifications. If you are not a mechanical engineer, you cannot evaluate the quality of another mechanical engineer’s work. If you are not a lawyer, you cannot judge legal work. If you are not a medical professional, you are unable to assess the quality of professional medical services or advice. I could never understand how people fail to see that the same applies to language and translation. Clearly, the basic premise of proper translation quality evaluation is that it should be conducted by a qualified linguist.

But there’s much more to it. Holistic DA at the segment level is problematic in and of itself.

For example, you may be asked to rate a movie on a scale of 0 to 5 based on how much you like it overall. Naturally, in this type of assessment you are not presented with the whole universe of criteria that critics operate with – you just provide your overall impression.

The strength of such “holistic” assessments is that they look simple and uniform. The holistic approach can be quite powerful when applied to large and complex samples. But when it’s applied to a particular translation segment and not the entire sample, its weaknesses prevail:

•  The holistic approach loses a lot of details that could be very important for the evaluation results.

•  It is much more random than analytical assessments.

•  It is much less stable and consistent.

•  It does not take into account the context of the previous and following sentences.

•  It is therefore much less precise than analytical segment-level assessments.

Let me explain here what analytical segment-level assessment is.

In 2021, Markus Freitag of Google et al. published a paper entitled “Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation” [1]. In this paper, an MQM-based segment-level assessment called SQM was pioneered. It was later used for the WMT2020 metrics task.

The method works as follows: annotators go through the segments and annotate errors using the MQM error typology. Afterwards, a segment-level score similar to DA is derived using a scoring formula explained in the paper.

Such an SQM metric can, with some stretch, be called an “MQM-based” metric, because the error annotation, at least, was done in accordance with the MQM typology.

If you do an analytic annotation of errors first, and THEN calculate the segment score with some sort of scoring formula based on sentence penalty points, this can be called an MQM-based metric.
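To illustrate, here is a minimal sketch of such an “annotate first, score second” calculation. The penalty weights loosely follow the severity-weighting scheme described in [1] and are illustrative only; this is not Phrase’s formula.

```python
# Sketch of an "MQM-based" segment score: analytical error annotation first,
# then a penalty formula. Weights are illustrative, loosely in the spirit of [1].

PENALTIES = {
    ("non-translation", "critical"): 25.0,
    ("any", "major"): 5.0,
    ("fluency/punctuation", "minor"): 0.1,
    ("any", "minor"): 1.0,
}

def error_penalty(category: str, severity: str) -> float:
    """Look up the penalty for one annotated error, falling back to the generic severity weight."""
    return PENALTIES.get((category, severity), PENALTIES.get(("any", severity), 1.0))

def segment_score(errors) -> float:
    """Sum the penalties for all errors annotated in one segment (lower is better)."""
    return sum(error_penalty(cat, sev) for cat, sev in errors)

# Hypothetical annotation of one segment: one major accuracy error,
# one minor punctuation issue.
errors = [("accuracy/mistranslation", "major"), ("fluency/punctuation", "minor")]
print(segment_score(errors))  # 5.1
```

The crucial point is that the score is only as good as the underlying error annotation of that single segment, which leads directly to the reliability problem discussed next.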

However, it is not a full-fledged MQM-based metric, because MQM is typically about assigning MQM scores to a sample, not to an individual translation unit! Individual segment-level scores make little statistical sense as a matter of principle due to their low reliability, which is caused by significant variance in annotator judgments [5].

There are two very important reasons for that:

(a) Sentence-level scoring is not precise or reliable in principle, due to the statistical nature of errors and the variance of annotator judgments; it has been shown that human error annotations may vary significantly for a plethora of reasons [7]. That is why, statistically, information about errors in a sample of fewer than 1,000 words is not reliable [6], as the sketch below illustrates. Naturally, a model trained on such data is even less reliable as far as segment-level scoring is concerned. This is fundamental and cannot be improved by using a larger model.

(b) Sentence-level scoring completely misses even the most immediate context, and many sentences can have very different translations depending on their context.
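A quick illustration of point (a): with only a handful of observations for a single segment, the confidence interval around the score is so wide that the point estimate carries little information, whereas the same spread over a large sample yields a tight interval. The numbers below are made up; the calculation simply applies a Student’s-t interval (requires scipy) in the spirit of [6] and [7].

```python
# Illustrative sketch of the sample-size problem, in the spirit of [6] and [7].
# All numbers are made up.
import math
from statistics import mean, stdev
from scipy.stats import t  # Student's t for small samples

def confidence_interval(observations, confidence=0.95):
    """Mean penalty with a Student's-t confidence interval."""
    n = len(observations)
    m, s = mean(observations), stdev(observations)
    half_width = t.ppf((1 + confidence) / 2, df=n - 1) * s / math.sqrt(n)
    return m - half_width, m + half_width

# Penalty points assigned by several annotators to the SAME short segment
# (hypothetical): the interval is so wide the point estimate is nearly useless.
segment_penalties = [0.0, 1.0, 5.0, 1.0]
print(confidence_interval(segment_penalties))   # roughly (-1.8, 5.3)

# The same spread of judgments over a large sample narrows the interval,
# which is why MQM scores are meaningful for samples, not single segments.
sample_penalties = segment_penalties * 50       # 200 observations
print(confidence_interval(sample_penalties))    # much tighter around the mean
```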

BUT if you don’t even do analytical error annotation and simply assign a score to the segment, it is not an MQM-based metric, it is not a reliable score, and it won’t be accurate regardless of how you obtained it: by human evaluation, by AI, or from another type of language model.

Phrase hinted that the score is obtained from some non-GenAI language model that was pre-trained on human evaluations from historical data.

If that data was “MQM segment-level scores,” then the training data is not reliable in the first place for the reasons explained above.

Second, unlike human evaluators, an AI model cannot capture all errors. Humans see more errors than any automatic AI metric, and therefore such a metric will inevitably inflate quality scores compared with even human segment-level evaluation, as we demonstrated clearly in our extensive work [3].

Third, the accuracy, reliability, and stability of such predictions are evidently low as a matter of principle. That said, it is probably more reliable and accurate than zero-shot direct assessment by GenAI (such as GEMBA-SQM, implemented in our Perfectionist TQE tool [4]), but this has yet to be measured.

And here comes the final point: even though anything related to AI, in its many forms, typically becomes part of the media hype cycle, AI is first and foremost about research. And considering the multiple possible implications of anything AI-related for the world as we know it today, we want the results of that research to be reliable, responsible, and substantiated.

AI and NLP desperately need proper benchmarks and verifiable transparency, not the unsubstantiated claims that the media love, or process decisions made on the basis of unreliable scoring.

The language industry needs research and implementations that are transparent and have rigorous science and mathematics behind them. It needs published research that discloses the language model used and lets us examine the training dataset and samples of the data. Only this will enable us to reproduce the results, test their veracity, accuracy, and reliability, and be confident about them.

The method that Phrase is using is similar to COMET [2]. But while we know that COMET has a great idea behind it, it requires real implementation rigor to be truly trustworthy and reliable for concrete applications, and it remains an indirect automatic metric that is not fully equivalent to human judgment [3].
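For context, this is roughly what segment-level scoring with COMET looks like in practice, assuming the publicly documented unbabel-comet package; the model name and data below are examples only, and the output is a model prediction, not a human judgment.

```python
# Sketch of scoring segments with COMET [2], assuming the publicly documented
# `unbabel-comet` package (pip install unbabel-comet). Model name and data
# are examples; the scores are model predictions, not human judgments.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

data = [
    {
        "src": "Der Patient wurde gestern entlassen.",
        "mt": "The patient was discharged yesterday.",
        "ref": "The patient was released from hospital yesterday.",
    }
]

# Returns segment-level scores and a corpus-level aggregate.
output = model.predict(data, batch_size=8, gpus=0)
print(output.scores)        # per-segment scores
print(output.system_score)  # aggregate score
```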

To conclude:

•  Segment-level direct assessment prediction is not accurate, reliable, or stable.

•  Segment-level DA prediction misses even immediate context.

•  MQM is not just a typology; it is a typology plus a sample-based (not segment-based) scoring model, and its statistical reliability comes from the size of the sample.

•  Predicting scores with AI models will lead to the same problem as with other automated metrics: they inflate the quality score as compared to human judgment.

•  Segment-level SQM scores based on MQM annotation are not a reliable dataset for AI training.

Therefore, I would argue that it is premature, process-wise, to base project management decisions on such scores, except in a very narrow set of cases. The applicability scope is currently unclear.

For better or for worse, sample-based human evaluation that is truly based on an analytical approach remains the only reliable gold standard for translation quality evaluation, and it should be part of the equation.

References

[1] Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation, Markus Freitag et al., 29 April 2021, https://arxiv.org/pdf/2104.14478.pdf

[2] The Inside Story: Towards Better Understanding of Machine Translation Neural Evaluation Metrics, Ricardo Rei, Nuno M. Guerreiro, Marcos Treviso, Alon Lavie, Luisa Coheur, André F. T. Martins, 19 May 2023, https://arxiv.org/pdf/2305.11806.pdf

[3] Neural Machine Translation of Clinical Text: An Empirical Investigation into Multilingual Pre-Trained Language Models and Transfer-Learning, Lifeng Han, Serge Gladkoff, Gleb Erofeev, Irina Sorokina, Betty Galiano, Goran Nenadic, https://www.frontiersin.org/journals/digital-health/articles/10.3389/fdgth.2024.1211564/abstract

[4] GEMBA-SQM translation quality evaluation is easy to implement as zero-shot LLM prompt … and totally useless, Serge Gladkoff, https://ai-lab.logrusglobal.com/gemba-sqm-translation-quality-evaluation-is-easy-to-implement-as-zero-shot-llm-prompt-and-totally-useless/

[5] Assessing Inter-Annotator Agreement for Translation Error Annotation, Arle Lommel, Maja Popović, Aljoscha Burchardt, DFKI, https://www.dfki.de/fileadmin/user_upload/import/7445_LREC-Lommel-Burchardt-Popovic.pdf

[6] Measuring Uncertainty in Translation Quality Evaluation (TQE), Serge Gladkoff, Irina Sorokina, Lifeng Han, and Alexandra Alekseeva, https://arxiv.org/abs/2111.07699

[7] Student’s t-Distribution: On Measuring the Inter-Rater Reliability When the Observations are Scarce, Serge Gladkoff, Lifeng Han, and Goran Nenadic, https://arxiv.org/abs/2303.04526

By Serge Gladkoff, Logrus Global AI Lab, rev.2.5.2024