GEMBA-SQM translation quality evaluation is easy to implement as zero-shot LLM prompt … and totally useless


The hype ignores AI hallucination, because the hype is caused by people hallucinating on AI.

One of the popular methods of human translation quality evaluation (TQE) is to present the user with a translation and ask them to rate it on a scale of 0 to 100.

This method is fairly open to criticism from a professional linguistic standpoint, as it produces an opaque quality score: it is hard to discern the nature and severity of the errors identified, let alone understand their underlying causes.

But it has undeniably important advantages, such as simplicity and an almost instantaneous quality rating with an easily understood metric. (Which is, of course, the same reason that the drunkard gave to the police officer who asked him why he was looking under the streetlight if he lost his keys in the park: “That’s where the light is.”)

Interestingly enough, this method also faces criticism from the proponents of automatic evaluation: the latter claim that human evaluations are very subjective and such ratings are not stable or accurate.

Surely such an evaluation task would be trivial for the “omnipotent” LLMs, which should simply apply their mighty “reasoning capabilities”? To find out, we ran a small experiment.

Here’s the ChatGPT prompt that requests a direct assessment:

Score the following translation from English to German with respect to the human reference on a continuous scale from 0 to 100 that starts with “No meaning preserved,” goes through “Some meaning preserved,” then “Most meaning preserved and few grammar mistakes,” up to “Perfect meaning and grammar.” English source: “The Big Bang theory is how scientists believe the Earth began.” German translation: “Die Urknalltheorie ist, wie Wissenschaftler glauben, dass die Erde begann.”
Score (0-100): ?
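For readers who want to try this themselves, here is a minimal sketch of sending that zero-shot prompt programmatically. It assumes the openai Python SDK (v1.x), an OPENAI_API_KEY in the environment, and “gpt-4” as the model name; none of these details come from the article itself.

# A minimal sketch of sending the GEMBA-SQM-style zero-shot prompt to an LLM API.
# Assumptions: openai Python SDK v1.x, OPENAI_API_KEY set in the environment, "gpt-4" as model name.
from openai import OpenAI

client = OpenAI()

prompt = (
    'Score the following translation from English to German with respect to the human '
    'reference on a continuous scale from 0 to 100 that starts with "No meaning preserved," '
    'goes through "Some meaning preserved," then "Most meaning preserved and few grammar '
    'mistakes," up to "Perfect meaning and grammar."\n'
    'English source: "The Big Bang theory is how scientists believe the Earth began."\n'
    'German translation: "Die Urknalltheorie ist, wie Wissenschaftler glauben, dass die Erde begann."\n'
    'Score (0-100): '
)

response = client.chat.completions.create(
    model="gpt-4",  # assumed model name
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # even at temperature 0, answers are not guaranteed to be identical across sessions
)

print(response.choices[0].message.content)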

The beauty of this prompt is that it’s exactly the question that you would ask a human. Unfortunately, the beauty ends there, because everything else in this interaction was nowhere near as good.

GenAI is trained to please its conversation partners. To achieve that, the Transformer is augmented with Reinforcement Learning (RL), which trains it to generate answers humans like. This optimization goal does not go hand in hand with stability and accuracy. And since GenAI doesn’t think, it fails to properly analyze the meaning or apply logic to the evaluation task.

For all these reasons, although the GPT4 model works much better than previous iterations, the results are anything but consistent. Moreover, when presented with human-style tests, the bigger, higher-quality models give answers with even greater variance than humans do.

In other words, the result of such a conversation will differ from session to session, and its underlying “substantiation” will vary depending on which aspects of the translation the neural network happens to capture in each specific case.
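One rough way to see this instability for yourself is to send the identical prompt several times and look at the spread of the returned scores. The sketch below reuses the client and prompt from the snippet above; the helper names are hypothetical and the score parsing assumes the model answers with a bare number.

# Rough sketch: quantify session-to-session score variance by repeating the identical request.
# Reuses `client` and `prompt` from the sketch above; helper names are hypothetical.
import re
import statistics

def score_once(prompt: str, model: str = "gpt-4") -> float | None:
    # Request one 0-100 rating and parse the first number found in the reply.
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    match = re.search(r"\d+(?:\.\d+)?", reply)
    return float(match.group()) if match else None

def score_spread(prompt: str, runs: int = 10) -> tuple[float, float]:
    # Repeat the request and report the mean and standard deviation of the parsed scores.
    scores = [s for s in (score_once(prompt) for _ in range(runs)) if s is not None]
    return statistics.mean(scores), statistics.stdev(scores)

mean_score, score_std = score_spread(prompt)
print(f"mean = {mean_score:.1f}, std = {score_std:.1f}")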

This is far worse than human performance: a trained professional does not miss significant factors, and despite some variance caused by personal perception, practice shows that trained reviewers tend to agree with each other rather than disagree. Not to mention that over the course of a dialogue ChatGPT makes many errors that a human operator has to correct, which is quite annoying because a human would never have made these errors in the first place.

Consider the example cited in this article: many things went wrong in this evaluation.

· First, GPT4 missed a major factual error in the source (the Big Bang theory describes the origin of the universe, not of the Earth), and only “noticed” it after a hint.

· Second, it failed to properly apply the concept of “faithfulness to the source” in its answers, even though this is a basic skill for a human linguist and a cornerstone of translation studies and standard industry practice.

· Third, even setting this error aside, GPT4 rated the translation too low, citing only minor stylistic issues.

· Fourth, in the verbose explanatory dialogue that followed, it mentioned facts and principles that it clearly had not applied.

· Fifth, when given a Spanish translation with the same factual error, it ignored the error again, even though in the previous turn it had been instructed to note such things (so much for “in-context learning” capabilities; see the sketch after this list).

· Sixth, it again gave too low a score, citing minor stylistic issues.

· Finally, for the French translation it gave a score of 84 immediately after giving a score of 82, with no change in the task and no comments.
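For completeness, the multi-turn, in-context setup mentioned in the fifth point boils down to keeping the correction in the message history and scoring the next translation within the same session. Here is a sketch under the same assumptions as above; the correction wording is illustrative and the Spanish prompt is a placeholder, not a quote from the actual chat.

# Sketch of the multi-turn, in-context setup: the correction stays in the message history,
# so the model should, in principle, apply it when scoring the next translation.
# Reuses `client` and `prompt` from the first sketch; the texts below are placeholders.
history = [{"role": "user", "content": prompt}]

first = client.chat.completions.create(model="gpt-4", messages=history)
history.append({"role": "assistant", "content": first.choices[0].message.content})

spanish_prompt = "..."  # same scoring template with a Spanish translation carrying the same factual error
history.append({"role": "user", "content": (
    "Note that the source itself contains a factual error: the Big Bang theory describes "
    "the origin of the universe, not of the Earth. Please flag such errors in your evaluations.\n\n"
    + spanish_prompt
)})

second = client.chat.completions.create(model="gpt-4", messages=history)
print(second.choices[0].message.content)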

Implementing zero-shot direct assessment of the type called GEMBA-SQM on GPT4 is easy; there is no argument about that. But the ratings are arbitrary and unsubstantiated, their variability exceeds that of a human linguist, and they depend on nothing more than the alignment of electrons in the stochastic process of neural generation, combined with a fundamental lack of reasoning and of understanding of the task at hand. All of this makes the implementation quite useless.

Follow us to learn more about how LLMs work and to understand what they can and cannot do.

For illustration purposes, please review the link below: GEMBA-SQM