AI Lab

Our services

Our machine translation (MT) solutions for corporate clients range from implementing a turnkey MT solution on the customer's infrastructure, including training for the customer's specialists, to integrating the company's translation department with third-party MT services, to fully outsourcing MT to service providers.

Selecting MT suppliers

We help you choose an MT supplier based on the source and target languages, the type of content to be translated (technical documentation, software, contracts, correspondence, etc.), the required translation quality and the many other parameters that MT quality depends on.

MT training

The difference in quality between publicly available and specially trained MT systems is enormous. However, proper MT training is neither trivial nor cheap. Our training platform lets you customize the training process and evaluate the results at minimal cost before the system is launched.

Training Corpora

MT training requires "training corpora" on a given subject: sets of bilingual texts in which the sentences in the source and target languages are aligned. While corpora of general lexis, such as news, are freely available, narrow-domain corpora of the required size are a scarce and expensive resource.

Translation memory: Enterprise translation departments usually have large stores of translation memories that can be used as training corpora after some "refining". A translation memory, which is used in any modern translation automation system (CAT system), can’t be "fed" to the MT system for training as is. First, you need to find and delete incorrect translations, untranslated fragments, hypertext markup, etc. Cleaning a translation memory is a separate, non-trivial task.
Text alignment: If there is no translation memory or it’s too small, we use technology that matches arbitrary texts on a given subject in different languages to create a translation memory from materials provided by the customer or gathered from the Internet. Unlike the slow, cumbersome text-matching (alignment) technologies of past generations, our solution, based on recent advances in AI, works fully automatically, very quickly and with almost 100% accuracy.
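
To give a feel for what translation memory cleaning involves, here is a minimal Python sketch of a few typical filters: stripping markup, dropping empty or untranslated segments, discarding pairs with an implausible length mismatch and removing exact duplicates. The function name, the regular expression and the length-ratio threshold are illustrative assumptions, not a description of our production pipeline, and real cleaning usually needs language-specific and customer-specific rules on top of this.

```python
import re

# Crude pattern for inline markup such as <b>, </b> or <ph id="1"/> (assumption:
# real TM exports may need a proper parser for their tag format).
TAG_RE = re.compile(r"<[^>]+>")


def clean_tm(pairs, max_len_ratio=3.0):
    """Return cleaned (source, target) pairs; the threshold is illustrative."""
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        # Remove hypertext markup and collapse runs of whitespace.
        src = " ".join(TAG_RE.sub("", src).split())
        tgt = " ".join(TAG_RE.sub("", tgt).split())
        if not src or not tgt:
            continue  # empty segment on either side
        if src.casefold() == tgt.casefold():
            continue  # target equals source: probably untranslated
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio > max_len_ratio:
            continue  # implausible length mismatch: probably a wrong translation
        if (src, tgt) in seen:
            continue  # exact duplicate
        seen.add((src, tgt))
        cleaned.append((src, tgt))
    return cleaned


raw_tm = [
    ("Press the <b>Start</b> button.", "Premere il pulsante <b>Start</b>."),
    ("Press the Start button.", "Press the Start button."),         # untranslated
    ("OK", ""),                                                     # empty target
    ("Save", "Salvare il file sul disco locale prima di uscire."),  # length mismatch
]
print(clean_tm(raw_tm))
```

Only the first pair survives in this toy example. Simple heuristics like these catch the obvious noise; detecting subtly incorrect translations is the harder part of the task.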

MT Training

There are three scenarios for training an MT system and using it in practice. Choosing the best one requires consulting specialists and studying the customer's specific task.

Standard ("stock") MT system: It is already trained on general lexis and can be rented in the data center of an MT service provider. The translation quality is comparable to that of an average freelance translator and is often clearly unacceptable for certain types of content.
Domain-adapted general-purpose system: This is a stock system that is bought or rented from an MT supplier and deployed on the customer's side. Such a system is initially trained on a large volume of general lexis. It’s then further trained on a narrow-domain corpus, usually provided by the customer. The size of this second training corpus is critically important, which is why we follow the cutting edge of scientific research.
Extended narrow-domain system: Unlike a domain-adapted general-purpose system, it’s initially trained on a relatively small narrow-domain corpus. At that stage, the system is good at "remembering" domain-specific terms, but its output is ungrammatical. To make the output more fluent, the system is further trained on an extended corpus that contains more general lexis. This scenario provides high quality, but it is the most complicated and delicate one and requires the participation of specialists.

Assessment of Training Results

To assess the effectiveness of training and the daily operation of the MT system, machine-translated texts are compared with a reference translation using specific metrics. The reference may be an independent human translation of the same text or a post-edited machine translation of the original text.
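
To make the comparison with a reference concrete, the following Python sketch computes a simplified corpus-level BLEU score. It assumes one reference per segment, whitespace tokenization and no smoothing; these choices, as well as the function names, are ours for illustration, and production tools differ in exactly such details. The `max_n` parameter (the longest n-gram considered) shows how a single setting changes the score for the same translation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU on a 0-100 scale, one reference per segment."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped matches: an n-gram is credited at most as many times
            # as it occurs in the reference.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing in this sketch
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


mt_output = ["the cat sat on a mat"]
reference = ["the cat sat on the mat"]
print(round(corpus_bleu(mt_output, reference, max_n=4), 1))  # 53.7
print(round(corpus_bleu(mt_output, reference, max_n=2), 1))  # 70.7
```

The same one-word difference yields 53.7 or 70.7 depending only on `max_n`, which is why scores from different tools or configurations should not be compared directly.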

MT quality metrics: The most common metrics are BLEU, hLEPOR and TER. Calculating such metrics requires special tools, and interpreting the results requires experience in applying these metrics in practice.
What do the metrics show? For example, it’s known that BLEU = 20 is bad, while BLEU = 70 is an almost unattainable ideal. But which values matter in real life: 40, 50 or 60? The answer is complicated by the many different implementations of BLEU, because each supplier or researcher can customize the parameters of this metric. As for the hLEPOR metric, the initially published version of this algorithm contained errors and omissions.
Requirements for proper assessment of training results: Assessing the quality of the MT system after training requires a special-purpose test corpus. The general practice is to mechanically split the available corpus into training and test parts in a 9:1 ratio. However, with mechanical or random selection, the quality assessment tends to be inflated. We use a special splitting algorithm based on the latest developments in AI, which provides a more objective assessment of MT quality.
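
One plausible reason a random 9:1 split inflates scores is leakage: repeated or near-identical segments land in both parts, so the system is tested on sentences it has effectively already seen. The Python sketch below is not our splitting algorithm; it is a deliberately simple baseline, under the assumption that leakage is the main concern, which groups segments by a normalized form of the source text and assigns whole groups to one side. The normalization rules and function names are illustrative.

```python
import random
import re
from collections import defaultdict


def normalize(text):
    """Lowercase, mask numbers and drop punctuation so near-identical segments collide."""
    text = re.sub(r"\d+", "0", text.casefold())
    return " ".join(re.sub(r"[^\w\s]", " ", text).split())


def leakage_aware_split(pairs, test_fraction=0.1, seed=0):
    """Split (source, target) pairs so that no normalized source appears on both sides."""
    groups = defaultdict(list)
    for src, tgt in pairs:
        groups[normalize(src)].append((src, tgt))
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)  # reproducible shuffle of the groups
    target_size = round(len(pairs) * test_fraction)
    train, test = [], []
    for key in keys:
        # Whole groups go to one side; the test part may slightly exceed its target size.
        bucket = test if len(test) < target_size else train
        bucket.extend(groups[key])
    return train, test


# Toy corpus: 20 sentence templates, each repeated with three different numbers.
actions = ["Replace", "Clean", "Inspect", "Remove"]
parts = ["pump", "valve", "filter", "sensor", "motor"]
corpus = [
    (f"{a} the {p} after {n} hours.", f"<translation of: {a} {p} {n}>")
    for a in actions for p in parts for n in (100, 200, 500)
]
train, test = leakage_aware_split(corpus)
print(len(train), len(test))
```

A split like this only removes surface-level overlap. Segments that are paraphrases of each other would still leak, which is one reason more elaborate splitting methods exist.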

Advantages

Translating content that can’t be translated manually

MT lets you translate types of corporate content for which manual translation was previously out of the question because of time and cost constraints: user-generated content, technical support chats, market research, internal correspondence in a global company, etc.

Reducing the cost of localizing goods and services

MT lets you reduce the cost of common localization tasks for technical documentation, software and, to some extent, marketing materials. With the previous generation of MT (statistical MT, or SMT), it was possible to reduce translation costs by an average of 30%. For example, if the cost of translation is taken as 1, MT post-editing costs 0.7 or perhaps 0.6. The transition to neural MT systems brings the cost down to 0.5 or even lower. However, achieving this cost efficiency requires careful preparation and proper tuning of the MT system, as well as adaptation of translation workflows.
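
The arithmetic behind these relative costs can be sketched in a few lines of Python. The project volume and the per-word rate are hypothetical placeholders, and the calculation deliberately ignores the cost of preparing, training and maintaining the MT system, which has to be budgeted separately.

```python
def post_editing_cost(word_count, rate_per_word, cost_factor):
    """Cost of a job where cost_factor is relative to full manual translation (1.0)."""
    return word_count * rate_per_word * cost_factor


WORDS = 100_000  # hypothetical project volume
RATE = 0.10      # hypothetical manual translation rate per word

scenarios = {
    "manual translation": 1.0,
    "SMT + post-editing": 0.7,
    "neural MT + post-editing": 0.5,
}
costs = {name: post_editing_cost(WORDS, RATE, factor) for name, factor in scenarios.items()}
for name, cost in costs.items():
    print(f"{name}: {cost:,.0f}")
```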

Quicker, better translation

The obvious advantage of MT is a significant acceleration of translation processes. Depending on various factors, a translator's productivity can increase up to threefold. It all depends on the language pair, the type of content, the way the system is trained, and other circumstances that only specialists experienced in working with MT can take into account.

Deployment

Deploying an MT system at an enterprise involves deciding where to host the system and training employees to work with and administer it. It also includes connecting the system to the enterprise IT environment for information exchange, integrating it with translators' CAT systems and preparing data for MT training.

MT system hosting

Typically, the MT server is hosted either on the customer's enterprise network or in the supplier's data center. The second option may require special measures to ensure the confidentiality of the information being processed.

MT system administration

Administering a deployed MT system requires training the responsible employees. Alternatively, administration can be outsourced and performed remotely.

MT and integration with the enterprise IT infrastructure

To reduce the costs associated with daily use of the MT system, it’s advisable to link it with the enterprise's CMS. Because MT vendors provide different APIs that are specific to their systems, you may need the skills of a specialist.
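
One common way to keep the CMS independent of any particular vendor API is a thin adapter layer. The Python sketch below illustrates the idea only: `MTClient`, `EchoVendorClient` and `translate_cms_page` are hypothetical names invented for this example, and the stand-in client does not call any real MT service.

```python
from typing import Dict, Protocol


class MTClient(Protocol):
    """Hypothetical common interface that each vendor adapter would implement."""

    def translate(self, text: str, source_lang: str, target_lang: str) -> str:
        ...


class EchoVendorClient:
    """Stand-in for a real vendor adapter; it only tags the text instead of translating."""

    def translate(self, text: str, source_lang: str, target_lang: str) -> str:
        return f"[{source_lang}->{target_lang}] {text}"


def translate_cms_page(client: MTClient, fields: Dict[str, str],
                       source_lang: str, target_lang: str) -> Dict[str, str]:
    """Translate every text field of a CMS page through whichever client is plugged in."""
    return {name: client.translate(value, source_lang, target_lang)
            for name, value in fields.items()}


page = {"title": "Installation guide", "summary": "How to install the device."}
print(translate_cms_page(EchoVendorClient(), page, "en", "de"))
```

With this design, switching MT vendors means writing one new adapter class rather than changing the CMS integration itself.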

Integrating the MT system with the translator's work environment

CAT systems differ in how they integrate with third-party MT systems, so specialist advice may be required. We offer Memose, a specialized solution for optimizing the MT post-editing process. It complements existing CAT systems and implements a new concept for working with MT. It significantly increases the quality of the final translation and improves the translator's experience of working with MT.

Preparing data for MT training

We provide services for the thorough cleaning of accumulated translation memories (TMs) for MT system training. Additionally, we offer migration of existing TMs to our Memose translation-memory server. The server is based on a new translation-memory storage concept and an ultra-fast database, so it's optimized for translation processes that use MT.

Generating data for MT training

MT training requires a corpus of bilingual texts containing at least 50,000 sentence pairs, depending on the MT version. If the existing translation memories do not provide the required amount of data, we offer our Paralela solution for the automatic alignment (matching) of texts in different languages, based on the latest AI technologies. To obtain aligned bilingual data similar to translation memories, it’s possible to use multilingual documents available at the enterprise, such as the enterprise website content and other available texts, as well as third-party websites. The alignment is automatic and highly accurate.

Maintenance

After training and launch, the MT system requires periodic maintenance, including additional training, quality control and assessment of post-editing performance. It’s also important to periodically compare the chosen system with others. After all, progress never stands still: MT systems improve at different rates, and sometimes they may even regress.

Testing on previous corpora so as not to "overtrain" the MT system

After additional training of the system on a new domain, it’s necessary to perform not one, but two quality checks.

New-domain quality check: To make sure the system has become better at translating the new domain, it’s advisable to test it on a new test corpus for that domain.
Primary-domain quality check: The system is also tested on a test corpus for the main domain to make sure that it hasn’t become worse at translating the main topic.
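
The two checks can be automated as a simple before-and-after comparison. In the Python sketch below, the scores are made-up numbers standing in for a metric such as BLEU, and the tolerance below which a change is treated as noise is an arbitrary assumption that should be tuned to the metric and the size of the test corpus.

```python
def check_retraining(before, after, tolerance=0.5):
    """Compare metric scores per test corpus before and after additional training."""
    report = {}
    for domain, old_score in before.items():
        delta = after[domain] - old_score
        if delta > tolerance:
            verdict = "improved"
        elif delta < -tolerance:
            verdict = "degraded"
        else:
            verdict = "unchanged"
        report[domain] = (round(delta, 2), verdict)
    return report


# Hypothetical scores measured on the two test corpora.
before = {"new_domain": 28.4, "primary_domain": 45.1}
after = {"new_domain": 39.0, "primary_domain": 41.7}

for domain, (delta, verdict) in check_retraining(before, after).items():
    print(f"{domain}: {delta:+} ({verdict})")
```

In this made-up case, the new domain improves while the primary domain degrades, which is exactly the situation the second check is meant to catch.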

Monitoring translators' performance

To assess the efficiency and cost of MT, you can measure the time post-editors spend editing the raw MT output. Ideally, the CAT system itself measures this time. An alternative is to measure post-editors' performance on a sample of their work. To obtain a reliable estimate, the sample must meet a number of conditions known from mathematical statistics. The measured average output of post-editors also helps in drafting translation schedules.
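
As a rough illustration, the Python sketch below turns sampled post-editing sessions into an average throughput with an approximate 95% confidence interval and then into a schedule estimate. The sample data, the function names and the six productive hours per day are invented for the example. The normal approximation used here assumes a reasonably large number of independent, representative samples, so with only a handful of sessions the interval is indicative at best.

```python
import math
import statistics


def throughput_estimate(samples, z=1.96):
    """samples: (words_edited, minutes_spent) per session.

    Returns the mean words per hour and an approximate 95% confidence interval.
    Requires at least two samples.
    """
    rates = [words / minutes * 60 for words, minutes in samples]
    mean = statistics.mean(rates)
    half_width = z * statistics.stdev(rates) / math.sqrt(len(rates))
    return mean, (mean - half_width, mean + half_width)


def estimate_days(volume_words, words_per_hour, hours_per_day=6):
    """Working days one post-editor needs for a given volume (hours_per_day is an assumption)."""
    return volume_words / (words_per_hour * hours_per_day)


# Hypothetical measurements from five post-editing sessions.
sessions = [(500, 30), (420, 28), (610, 35), (380, 25), (550, 33)]
mean_rate, (low, high) = throughput_estimate(sessions)
print(f"{mean_rate:.0f} words/hour (approx. 95% CI: {low:.0f} to {high:.0f})")
print(f"{estimate_days(50_000, mean_rate):.1f} working days for 50,000 words")
```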

Tips and tricks

The deployment and subsequent operation of MT systems abound in pitfalls. Listed below are typical questions that inevitably arise along the way.

Why are there so many different MT systems on the market? How can we choose the "best" one, and what criteria provide the basis for that choice?
Is it possible to assemble an MT system from freely available components on your own? If so, what sort of knowledge is required?
What's the ideal volume for the training corpus?
What should I do if there aren’t enough texts from the specific domain to fill the training corpus?
What operations does the "cleaning" of a training corpus include, and why is there no universal "one size fits all" data-cleaning method?
What are the risks of machine translation of confidential materials, and why is it so difficult to make data anonymous?
Why is so much manual work involved in matching original and translated documents, even though the corresponding alignment programs have been developed over many years?
Is it possible to endlessly retrain an MT system?
How can you tell whether training the system is useful or not? What difficulties occur when measuring the quality of MT?
How can you make the MT follow the corporate glossary?
Why is so much manual work involved in extracting terminology to create glossaries, and what solutions have been introduced?
Why can't we choose an individually tuned MT system once and for all? What do scientists and developers say?
Is it true that MT has reached parity with human translations? Does neural MT really produce smooth, almost human text? What's the catch?
MT is cheap, but the regular training is expensive. Can I avoid the risk of going broke because of maintenance requirements?

We’re ready to help you get answers to these questions and successfully implement MT in your enterprise. Request a free demonstration and consultation on MT implementation!

Learn more about hotspots in AI research and application in our blog.