During the model training process, a standard practice is to divide the data set into 90% for training and 10% for testing so that one can train the model on the 90% and test it on the 10%.
It was always a mystery to us why BLEU is the most widespread metric, given that hLEPOR is a more advanced solution.