Why the BLEU score is usually inflated

During the model training process, a standard practice is to divide the data set into 90% for training and 10% for testing so that one can train the model on the 90% and test it on the 10%.

We have implemented hLEPOR metric as a public Python library, for the first time ever

It was always a mystery to us why BLEU is the most widespread metric, given that hLEPOR is a more advanced solution.