Jost Zetzsche featured the Paralela aligner in a recent issue of the Tool Box Journal. The article reads:
Paralela
In the last few months, Logrus Global has sponsored the Tool Box Journal a couple of times by highlighting their Paralela tool. I have to admit I had actually never looked closely at the tool, but having done so now, I’ve discovered that what Paralela — a tool used to align texts — offers really doesn’t have much in common with what you and I probably think of when we think about alignment. We usually have in mind something with parallel translated texts, where we use the align feature in our translation environment tool to match them up sentence by sentence so they can end up in our translation memory. We can influence the alignment by adding or modifying rules about segment delimiters (usually punctuation marks), but overall it’s a rather manual process with a lot of correction necessary. Some tools now use dictionary data to improve their alignment, but the developers behind Paralela decided to go a completely different route.
Paralela deploys a combination of the Google-developed Language-Agnostic BERT Sentence Embedding (LaBSE) technology and the Facebook AI Similarity Search (Faiss) tool to approach text pairs non-consecutively. It sorts out which segment matches what, completely independently of their order. I ran a few tests this week, and it really is very interesting to see, especially because this happens against the current of everything we assume about alignment. And it works!
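To make that idea concrete, here is a minimal sketch of order-independent matching with the publicly available LaBSE model (via the sentence-transformers library) and a Faiss inner-product index. It illustrates the general technique the article describes, not Paralela's actual code; the example sentences and the simple top-1 matching without any threshold are my own assumptions.

```python
# Minimal sketch: order-independent sentence matching with LaBSE + Faiss.
# This illustrates the general technique, not Paralela's implementation.
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

source = ["The cat sat on the mat.", "It rained all day."]
target = ["Es regnete den ganzen Tag.", "Die Katze saß auf der Matte."]  # deliberately shuffled

# Embed both sides into the same language-agnostic vector space.
src_vecs = model.encode(source, normalize_embeddings=True).astype("float32")
tgt_vecs = model.encode(target, normalize_embeddings=True).astype("float32")

# Index the target side and query with each source sentence. Inner product
# on normalized vectors is cosine similarity, which serves as the match score.
index = faiss.IndexFlatIP(tgt_vecs.shape[1])
index.add(tgt_vecs)
scores, ids = index.search(src_vecs, 1)

for i in range(len(source)):
    print(f"{source[i]!r} -> {target[ids[i, 0]]!r}  (score {scores[i, 0]:.2f})")
```

Because the target sentences are retrieved by vector similarity rather than by position, the shuffled order in the example makes no difference to the result.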
The tool is essentially broken down into two processes. The first, called Grabber, does two things: after the user enters a website's top-level domain, it crawls the site recursively (i.e., through all available directories). It then strips all non-textual items out of the HTML-based files so that only usable text remains. Alternatively, users can select files (PDF, Word, PowerPoint, text, or OpenDocument files) and have them cleaned in preparation for alignment. Either way, the resulting text is presented so the user can make adjustments if necessary, or it can be downloaded as a spreadsheet.
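The crawling and cleanup logic is Paralela's own, but the HTML-stripping step described above is roughly the kind of operation sketched below. The use of requests and BeautifulSoup here is my choice for illustration, not something the tool is known to use.

```python
# Rough sketch of the HTML-stripping step: fetch a page and keep only its
# visible text, discarding scripts, styles, and other non-textual markup.
# Crawling a whole domain and Paralela's own cleanup rules are not shown.
import requests
from bs4 import BeautifulSoup

def extract_text(url: str) -> list[str]:
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()  # drop non-textual / boilerplate elements
    # Keep non-empty lines of visible text as candidate segments.
    return [line.strip() for line in soup.get_text("\n").splitlines() if line.strip()]
```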
Users can also choose to skip all of this and do it as part of the Align process, but then they won't be able to spot potential problems first (such as data downloaded in the wrong language).
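The article later notes that automatic language detection would be a welcome addition to Grabber; in the meantime, a quick sanity check like the one below can flag wrong-language segments before alignment. The langdetect package used here is purely an illustrative assumption, not something Paralela ships.

```python
# Flag grabbed segments whose detected language differs from the expected one.
# langdetect is used here only as an illustration of such a sanity check.
from langdetect import DetectorFactory, detect
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # make langdetect's results deterministic

def suspicious_segments(segments, expected_lang="en"):
    flagged = []
    for seg in segments:
        try:
            if detect(seg) != expected_lang:
                flagged.append(seg)
        except LangDetectException:
            flagged.append(seg)  # too short or ambiguous to detect; review manually
    return flagged
```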
After going through the Grabber process, users must select “Select grabber tasks as source [target]” and start the alignment process. Given the complex and resource-heavy processing described above, the results appear in a shockingly short amount of time. By default, they are sorted not by the order in which they appeared but by a confidence score displayed to the right. A number of filtering options let users deselect translation units so they are not included in the translation memory: whether source and target are identical, the confidence score, the difference in length, and so on. Users can also search the data with regular expressions and modify the text in this interface. Once saved, the data can be downloaded as a TMX file, either to build or import into a translation memory or to use as training data for a machine translation engine.
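As a rough picture of what that filtering-plus-export step amounts to, the sketch below keeps only pairs above a score threshold and within a length ratio, then writes a minimal TMX 1.4 file. The thresholds, attribute values, and function names are illustrative assumptions, not Paralela's actual settings.

```python
# Illustrative post-processing: keep confidently aligned, length-compatible
# pairs and export them as a minimal TMX 1.4 file. Thresholds and header
# attributes are placeholders, not Paralela's.
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"  # serializes as xml:lang

def write_tmx(pairs, path, src_lang="en", tgt_lang="de",
              min_score=0.8, max_len_ratio=2.0):
    """pairs: iterable of (source_text, target_text, score) tuples."""
    root = ET.Element("tmx", version="1.4")
    ET.SubElement(root, "header", {
        "srclang": src_lang, "adminlang": "en", "datatype": "plaintext",
        "segtype": "sentence", "o-tmf": "plain",
        "creationtool": "sketch", "creationtoolversion": "0.1",
    })
    body = ET.SubElement(root, "body")
    for src, tgt, score in pairs:
        ratio = max(len(src), len(tgt)) / max(1, min(len(src), len(tgt)))
        if score < min_score or ratio > max_len_ratio:
            continue  # filtered out: low confidence or large length mismatch
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)
```

The resulting file can then be imported into a translation memory or fed into an MT training pipeline, which mirrors the two uses the article mentions.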
It’s a cool tool, and while there could still be improvements in the download module of the Grabber process — such as automatic language detection — it’s an example of what an otherwise tedious part of many translation workflows (or prepping data for MT training) can look like if approached differently.