Data scientists seem to believe that a magic Genie in the form of AI exists, and that all you need to do is rub the Genie’s lamp (prompt it properly) and it will produce a true translation quality score at the sentence level.
I am sorry to disappoint, but there is no Genie that can present you with the “right answer.” Not only because the current Genie version does not possess any intelligence, but because of a more fundamental problem: there is not enough information in a sentence for that.
Many data scientists believe that the text is the information.
This is NOT true.
The text is NOT an encoded message. It is not even a coherent unit of information.
Three weeks ago, I heard a true story of a meeting with the client’s Language Specialist who marked the translation as containing an error because “the author wanted to say specifically that other thing.” The linguist argued that she had a different understanding. Neither side could prove their point, because both their statements were opinions.
Those familiar with the legal sphere are well aware that two lawyers can argue their case using the same piece of legislation but interpret it differently. The law is ambiguous because no matter how carefully I write something down, you can interpret it in a different way. That is fundamental to the language itself and the fact of the matter is that the original thought itself is never in the text. Not quite, at least.
Any text that we read is just an expression of what its author meant to say. It can be rather precise – or not so much, but even in the best-case scenario we will never know this author’s mind for sure. The text is a tangible expression of something that itself is totally and completely outside of it. That something was conceptualized in the author’s mind, but it never really fully “makes it” into the text.
The true root cause of the linguists’ differing opinions on what constitutes a “correct” translation can be found in this assumption that the author meant to say one particular thing – and not something else.
But any text is ambiguous by its very nature. It sometimes carries double and triple meanings, which may vary in different scenarios or over time or when interpreted by diverse readers with distinct perspectives.
No matter how clear the text is, you cannot be certain that it is an accurate interpretation of the intended meaning.
Two hundred years ago, the Russian poet Fyodor Tyutchev wrote a poem called “Silentium,” and its English translation by Vladimir Nabokov is a masterpiece in itself. There is also a very good translation by Robert Chandler.
This gives us at least two interpretations of what the author said in Russian:
Silentium!
by Fyodor Tyutchev (1829)
Translation by Vladimir Nabokov [1] (opening line of each stanza):
Speak not, lie hidden, and conceal
How can a heart expression find?
Live in your inner self alone

Translation by Robert Chandler [2] (opening line of each stanza):
Be silent, hide away and let
What heart can ever speak its mind?
Live in yourself. There is a whole
So, there you have it: “a thought, once spoken, is a lie” expresses the futility of any attempt to trust the language with our thoughts and ideas.
Language is a powerful tool, yet there’s no way to be 100% certain of any specific intended meaning. This leaves each individual sentence very much open to interpretation and opinion as to what the author meant. The broader context of a larger sample makes such interpretation much more certain, which is why translation quality evaluation makes much more sense for samples of sufficient size. Conversely, the sheer uncertainty of error detection and annotation at the sentence level makes a sentence-level score meaningless, unless we are talking about pure spelling errors. But even spelling may be a function of style or dialect, and if we judge the translation without the necessary context, we risk giving it a low mark and writing off work that was actually done well.
Numbers, of course, have their own magic. When we see a number, we feel like we have obtained some footing. This is a purely psychological phenomenon, which explains the following real-life dialogue between a Quality Manager and a Data Scientist, which recently took place at one very large organization:
— Do you know that your segment level quality scores do not correlate with human evaluation?
— Yes, we know that, but we think it’s still useful.
— How is it useful, can you explain?
— Well, we can’t explain it, but you need some number to start with.
— But this number does not make any sense.
— Yes, we understand that, but we think that having some number is still useful.
(Curtain falls.)
But surely the data scientists must know that there is a field in mathematics called statistics? And in statistics, every measurement has its confidence interval and its confidence level, which are especially important for small samples. In our paper “Measuring Uncertainty in Translation Quality Evaluation,” we demonstrated that for a quality measurement to have an acceptable confidence interval, the sample size must be at least a hundred sentences. When you decrease the sample size below that number, the confidence interval widens rapidly, and for one sentence it shoots through the roof.
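To make the sample-size point tangible, here is a minimal sketch. It is not the method from the paper: it assumes a deliberately simplified model in which each sentence is judged either error-free or not, and it uses the standard Wilson score interval for a proportion. The function names and the 90% pass rate are hypothetical, chosen only for illustration.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 is roughly 95% confidence)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def interval_width(pass_rate: float, n: int) -> float:
    """Width of the interval when about `pass_rate` of n sentences are error-free."""
    successes = round(pass_rate * n)  # assumption: observed count matches the rate
    low, high = wilson_interval(successes, n)
    return high - low


# Hypothetical scenario: about 90% of sentences are error-free.
for n in (1, 10, 100, 1000):
    print(f"sample of {n:>4} sentences: interval width = {interval_width(0.9, n):.3f}")
```

Under these assumptions, the interval for a single sentence spans most of the possible range, so the "measurement" says almost nothing, while the interval for a sample of a hundred or more sentences is narrow enough to be usable.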
That’s why a totally different and very complex mathematical apparatus, called Statistical Quality Control (SQC), is applied to small samples. It does not predict the quality itself; it predicts the risks for the producer and the consumer.
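The article does not say which SQC technique is meant, so the following is only an assumed illustration of the "risks, not quality" idea, using a classic single-sampling acceptance plan: inspect n sentences and accept the batch if at most c of them contain errors. The function names, the plan (n = 20, c = 1), and the two error rates are hypothetical.

```python
from math import comb


def acceptance_probability(n: int, c: int, defect_rate: float) -> float:
    """Probability that a sample of n items contains at most c defective ones
    (binomial model, items assumed independent)."""
    return sum(
        comb(n, k) * defect_rate**k * (1 - defect_rate) ** (n - k)
        for k in range(c + 1)
    )


def plan_risks(n: int, c: int, acceptable_rate: float, rejectable_rate: float) -> tuple[float, float]:
    """Return (producer_risk, consumer_risk) for the plan 'accept if defects <= c'.

    producer_risk: chance that a good batch (acceptable_rate) is rejected.
    consumer_risk: chance that a bad batch (rejectable_rate) is accepted.
    """
    producer_risk = 1 - acceptance_probability(n, c, acceptable_rate)
    consumer_risk = acceptance_probability(n, c, rejectable_rate)
    return producer_risk, consumer_risk


# Hypothetical plan: inspect 20 sentences, accept if at most 1 contains an error.
producer_risk, consumer_risk = plan_risks(20, 1, acceptable_rate=0.02, rejectable_rate=0.20)
print(f"producer's risk: {producer_risk:.3f}, consumer's risk: {consumer_risk:.3f}")
```

Note what the plan outputs: not a quality score for the sample, but two probabilities of making the wrong decision, one borne by the translation producer and one by the client.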
To reiterate: it is fundamentally impossible, both from the mathematical/statistical standpoint and from the philosophical one, to make a reliable judgment about the “quality score” of one sentence. I don’t even know which is more responsible for the magical Genie’s inability to create a sentence-level quality score: the nature of cognition or statistics. Whichever it is, the conclusion is the same: please, don’t ask the Genie to do the impossible.
The most that a smart Genie can do, if it ever emerges from the lamp, is to provide just another varying opinion in a chorus of all other possible opinions about the intended meaning of the sentence and to join the debate about the errors made in translating that meaning, their class, severity, and so on.
So, don’t tell me that a certain LLM or AI has measured the quality of a sentence. Doing that requires understanding the sentence first, and we all know that’s not how the Genie works. And even if by some miracle we reach a stage where the said Genie does understand the sentence, well, it can then join the club, where its evaluation will be yet another opinion (one of many) on the intended meaning of the text.
References:
[1] Translation by Vladimir Nabokov. https://web.stanford.edu/class/slavic272/materials/tiutchev_silentium.pdf
[2] Translation from The Penguin Book of Russian Poetry, edited by Robert Chandler, Boris Dralyuk and Irina Mashinski. https://www.theguardian.com/books/2015/feb/21/saturday-poem-silentium-by-fyodor-tyutchev-robert-chandler
[4] Measuring Uncertainty in Translation Quality Evaluation. https://arxiv.org/abs/2111.07699