Why the BLEU score is usually inflated

During model training, a standard practice is to divide the dataset into 90% for training and 10% for testing, so that the model can be trained on the 90% and evaluated on the 10%. When we’re dealing with pictures of cats and dogs, it’s simple enough to use the standard split function from the scikit-learn library.
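As a minimal sketch, this is what that standard split looks like with scikit-learn’s train_test_split; the sentences and variable names are illustrative only:

```python
from sklearn.model_selection import train_test_split

# Hypothetical parallel corpus: source sentences and their human translations.
source_sentences = ["Hello world", "How are you?", "Good morning"] * 100
target_sentences = ["Hallo Welt", "Wie geht es dir?", "Guten Morgen"] * 100

# Mechanical 90/10 split: nothing here checks whether test sentences
# resemble training sentences.
src_train, src_test, tgt_train, tgt_test = train_test_split(
    source_sentences,
    target_sentences,
    test_size=0.10,   # 10% held out for testing
    random_state=42,  # reproducible split
)
```

For images this is usually all you need, because individual photos rarely duplicate one another.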

Text, however, presents a very different scenario. When a model is trained on a translation memory (TM) and is then given a sentence from that same TM, one it has already seen, it behaves just like a translation memory: the output will be the human translation the model has already seen and “remembered” very well. Text is trickier still: it often consists of many similar sentences. So, if the dataset is mechanically divided into a training set and a testing set, the test corpus will include sentences that are very close to data the model has already seen, and those sentences will be translated very well.

Unfortunately, this type of testing isn’t a true representation of the quality of the model and its training, because quality is defined by good translations of sentences the model hasn’t previously seen. To obtain an honest BLEU score, we need to find the sentences that are present in the training corpus and exclude them from the test corpus. We also need to remove sentences that are close to them, that is, sentences that would count as high fuzzy matches in TM terms. This is necessary because the translation quality of more dissimilar sentences will be much lower, but they indicate the performance of the trained model far more clearly. The task is relatively difficult and can’t be done with CAT tools, which aren’t designed for it. The correct approach is to take the high fuzzies out of the test corpus. You might object, “Why bother? It’s complicated, it takes too long, and the BLEU score will be LOWER.”
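A rough sketch of what such filtering might look like is below. It uses Python’s standard difflib similarity ratio as a stand-in for a TM fuzzy score; the 0.85 threshold and the sample sentences are assumptions for illustration, and a naive pairwise scan like this is slow on real corpora, which is part of why the task takes so long.

```python
from difflib import SequenceMatcher

def is_high_fuzzy(sentence: str, train_corpus: list[str], threshold: float = 0.85) -> bool:
    """Return True if `sentence` is identical or very similar to any training sentence."""
    return any(
        SequenceMatcher(None, sentence.lower(), seen.lower()).ratio() >= threshold
        for seen in train_corpus
    )

# Illustrative data only.
train_sources = ["The cat sat on the mat.", "Please restart the server."]
test_pairs = [
    ("The cat sat on the mat.", "Die Katze saß auf der Matte."),    # exact match: drop
    ("The cat sat on a mat.", "Die Katze saß auf einer Matte."),    # high fuzzy: drop
    ("Billing runs every night.", "Die Abrechnung läuft jede Nacht."),  # unseen: keep
]

# Keep only test pairs whose source is genuinely new to the model.
clean_test = [(src, tgt) for src, tgt in test_pairs if not is_high_fuzzy(src, train_sources)]
print(len(clean_test))  # 1: only the unseen sentence remains for BLEU scoring
```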

So, don’t be surprised when someone reports suspiciously good training results. In that situation, I’d ask the person, “How did you prepare the testing dataset?”

Aren’t we simply looking at a super-fancy neural network-based translation memory as opposed to a well-trained neural model? 

To find out a bit more, see this presentation.

You can ask us to prepare the dataset correctly, but be warned that it is a challenging task. To be honest, it isn’t easy to do.