Smaller models are still better than LLMs if trained well

We live in a world of hype. It is rarely completely harmless, but it is especially detrimental when it is accepted without scientific verification and used as the basis for long-term growth plans. The prevalence of ever-larger language models, touted by the media and industry startups as nothing less than the singularity, is just such an example.

Three years ago we saw the emergence of transformers, which was rightfully hailed as a breakthrough. The problem was that the claims went too far, asserting that the models had reached parity with humans in translation. Transformers were indeed a major improvement, but “human parity” was pure hype. It took a great deal of experimentation to prove that, but the proof was conclusive.

As soon as the false claims of parity were debunked, the world was introduced to LLMs (large language models), and a new cycle of hype began.

We knew that neural translation based on a training corpus of texts may look like magic, but it’s important to understand that there is still no intelligence behind it. The translations are driven by the statistical frequency with which specific words occur in each other’s neighborhood, not by any machine “understanding” of the text it’s translating.
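To make that point concrete, here is a deliberately naive sketch of what “frequency of occurrence in the neighborhood” means. It is a toy bigram counter in plain Python, not how any production NMT system is implemented (real models learn continuous representations), but the underlying principle is the same: statistics, not semantics.

```python
# Toy illustration: a "language model" as neighborhood word statistics.
# Nothing here understands the text; it only counts which words tend
# to follow which.
from collections import Counter, defaultdict

corpus = "the patient was admitted . the patient was discharged".split()

# Count how often each word follows each other word (bigram counts).
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

# P(next | prev) is just a relative frequency; no "meaning" is involved.
def next_word_probs(prev):
    counts = following[prev]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

print(next_word_probs("patient"))  # {'was': 1.0}
print(next_word_probs("was"))      # {'admitted': 0.5, 'discharged': 0.5}
```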

In the summer of 2022, our AI Lab, together with researchers from the University of Manchester, took part in the ClinSpEn biomedical machine translation challenge at WMT2022. For the challenge, we fine-tuned three pre-trained language models:

• Marian Helsinki – a model with 1.7 million parameters
• WMT21 – a hypertransformer model with 4.7 billion parameters
• NLLB – another hypertransformer, with 54 billion parameters

To fine-tune these models, we assembled a training corpus of 250,000 aligned pairs of good-quality clinical case data in English and Spanish, cleaned it thoroughly, and fine-tuned each model on it. We expected the LLMs to be better than the simple Marian Helsinki model, and wanted to get an idea of just how much better.
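For readers curious about the mechanics, a minimal sketch of this kind of fine-tuning with the Hugging Face transformers library might look like the following. The public Helsinki-NLP/opus-mt-en-es checkpoint, the file name, and the hyperparameters here are illustrative stand-ins, not our exact competition setup.

```python
# Minimal sketch: fine-tuning a Marian en-es checkpoint on parallel data.
# Checkpoint, file paths, and hyperparameters are illustrative only.
from datasets import load_dataset
from transformers import (
    DataCollatorForSeq2Seq,
    MarianMTModel,
    MarianTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

checkpoint = "Helsinki-NLP/opus-mt-en-es"  # public Marian model
tokenizer = MarianTokenizer.from_pretrained(checkpoint)
model = MarianMTModel.from_pretrained(checkpoint)

# Expects a CSV of aligned segment pairs with "en" and "es" columns
# (hypothetical file name).
data = load_dataset("csv", data_files={"train": "clinical_pairs.csv"})

def preprocess(batch):
    # Tokenize source and target sides of each segment pair.
    model_inputs = tokenizer(batch["en"], truncation=True, max_length=256)
    labels = tokenizer(text_target=batch["es"], truncation=True, max_length=256)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = data["train"].map(preprocess, batched=True, remove_columns=["en", "es"])

args = Seq2SeqTrainingArguments(
    output_dir="marian-clinical-en-es",
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    num_train_epochs=3,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),  # pads batches
    tokenizer=tokenizer,
)
trainer.train()
```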

The amazing result of the experiment was that the rather small Marian Helsinki model, trained on clean data, translated the clinical case texts better than the billion-parameter LLMs. It even took first place in the challenge’s Clinical Cases category (we later moved to second place due to a late resubmission by the winner).

The difficulty with quality measurement is that automated metrics do not always work, or do not yield significant results. We had to apply a special metric and carry out extensive human evaluation to make sure that the smaller model indeed works better; the hypothesis was confirmed with our LOGIPEM/HOPE translation quality metric.
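For context, the generic automated scoring that proved too coarse on its own looks roughly like this: corpus-level BLEU and chrF computed with the sacrebleu library. This is a minimal sketch with hypothetical file names, not our evaluation pipeline, which additionally relied on human review.

```python
# Standard automatic MT evaluation with sacrebleu (BLEU and chrF).
# Corpus-level scores like these can be too coarse for clinical text,
# which is why human evaluation was still needed. File names are
# hypothetical.
import sacrebleu

with open("marian_output.es") as f:
    hypotheses = [line.strip() for line in f]
with open("reference.es") as f:
    references = [line.strip() for line in f]

# sacrebleu expects a list of reference streams (one per reference set).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])

print(f"BLEU: {bleu.score:.2f}")
print(f"chrF: {chrf.score:.2f}")
```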

We spent significant time reviewing the results and running further experiments, and concluded that the cleanliness of the data and the extent of training matter more for a model’s accuracy than its size. In other words, for specialized domains, good data is more important than a large model.

Unfortunately, the completion of our work coincided with the release of ChatGPT and a new wave of hype. Once again we are seeing talk of this “AI” replacing human intelligence and the labor of translators.

But after some months of discussion the industry has arrived at a consensus (one we voiced from the start): despite some remarkable examples, ChatGPT translates worse, on average, than previous-generation models, due to the higher “temperature” needed to give its output variety.
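The “temperature” effect itself is easy to see in isolation: it rescales the model’s token scores before sampling, and higher values flatten the distribution, trading precision for variety. A minimal numpy sketch with made-up logits:

```python
# How sampling temperature changes output variance: higher temperature
# flattens the next-token distribution, so less likely (and more often
# wrong) tokens get sampled more frequently. Logits are made up.
import numpy as np

def softmax_with_temperature(logits, temperature):
    scaled = np.asarray(logits) / temperature
    exp = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    return exp / exp.sum()

logits = [4.0, 2.0, 1.0, 0.5]  # hypothetical scores for 4 candidate tokens

for t in (0.2, 1.0, 1.5):
    print(f"T={t}: {np.round(softmax_with_temperature(logits, t), 3)}")
# T=0.2 puts almost all probability on the top token (precise, repetitive);
# T=1.5 spreads it out (varied, but more error-prone).
```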

Not to mention that even for well-trained language models the error rate is never zero, and improving them further is costly in both money and effort.

So what do we have at the moment?

  1. For specialized domains, a small custom model still does better. Data is the key, and it must be clean for training.
  2. There is currently no reduction in the effort required to finalize translations compared with older, well-trained custom models.
    (If anything, the effort increases, because the bad news is that mistakes made by ChatGPT usually go unnoticed even by professional users, thanks to its increased fluency.)

The realization that verified accuracy costs money is slowly spreading throughout the industry. For our part, we continue our NLP research, especially in translation quality management and in training language models to be factual and objective; we verify claims with hard data in the field, and we are happy to share these facts and conclusions with you.