Translation Quality Checking in LanguageTool
Marcin Miłkowski
IFiS PAN / IPI PAN
Outline of the talk
● QA in translation jobs
● Automated QA tools
● Translation QA features in LanguageTool
● Future work
QA in translation jobs
● In the real translation world, translators have to follow different style guides
● For example, Microsoft wants “web page” to be translated as “strona sieci web”, while others prefer “strona internetowa”
● Sometimes the requirements are quite specific and conflicting
● And that makes QA even more difficult
QA in translation jobs
● Another problem stems from translating technical text in a tagged form
● Translators have to be careful to preserve the formatting (and other) tags from the original in the translation
● For example, format strings in software localization (such as “%s”)
● Numbers in general should remain the same
● But: unit conversions, currency conversions...
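The tag and number checks described on this slide can be sketched in a few lines of Python. This is only an illustration, not how any of the tools mentioned here work: the placeholder pattern is a simplified assumption that covers printf-style strings such as “%s” or “%1$s”, and the function names are hypothetical.

```python
import re
from collections import Counter

# Assumed, simplified pattern for printf-style placeholders (%s, %d, %1$s ...).
PLACEHOLDER = re.compile(r"%(?:\d+\$)?[sdif]")
# Runs of digits, optionally joined by decimal or thousands separators.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def missing_items(pattern, source, target):
    """Return items that occur in the source more often than in the target."""
    src = Counter(pattern.findall(source))
    tgt = Counter(pattern.findall(target))
    return sorted((src - tgt).elements())


def check_segment(source, target):
    """Collect human-readable warnings for one source/target pair."""
    warnings = []
    for item in missing_items(PLACEHOLDER, source, target):
        warnings.append(f"placeholder {item} missing in translation")
    for item in missing_items(NUMBER, source, target):
        warnings.append(f"number {item} missing in translation")
    return warnings
```

A purely textual comparison like this one reports “2.5” vs. “2,5” or a converted unit as an error, which is exactly the limitation the last bullet points at.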
Automated QA tools
● So it would be great to automate the chores
● But the tools are usually limited
● For example, QA Distiller has the following features:
  – detect untranslated segments
  – detect inconsistencies
  – detect formatting problems (spaces, commas...)
  – detect language-dependent formatting problems (date formats etc.)
  – detect terminology errors
  – use regular expressions for customized rules
Automated QA Tools
● Free QA tools are beginning to appear
● Apsic Xbench has almost the same set of features as QA Distiller (it lacks only some of the formatting checks)
● CheckMate (from Okapi Tools) offers even more, for many translation (bilingual) formats
CheckMate
Automated QA Tools
● But most of these tools do not offer real linguistic technology
● No POS tagging
● No stemming
● Custom rules rely only on regular expressions
● Words are specified only as linear sequences
LanguageTool
● LanguageTool: an open-source proofreading tool
● Used e.g. in OpenOffice.org
● Supports 21 languages, to varying degrees
● www.languagetool.org
LanguageTool
● The proofreading tool is based on surface text processing, without deep parsing, let alone semantic analysis
● Yet it manages to get significantly better results (for some languages) than commercially available products (see Miłkowski 2010 for a comparison for Polish)
● Some rules are built semi-automatically from corpora
LanguageTool in OmegaT
● In OmegaT (free CAT software), the target text is already checked with LanguageTool, which runs as a plugin
LanguageTool in CheckMate
● In CheckMate, you can also use LanguageTool (started beforehand in server mode) to check the target text
Bilingual mode in LanguageTool
● The existing tools do not (yet) leverage the bitext mode available in LanguageTool
● In bitext mode, LT uses:
  – false friend rules: matched only when both the source and the target contain the false friend terms (to avoid false alarms)
  – rules for the target language
  – generic bitext rules (in Java)
  – XML bilingual rules for the target language (if any)
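The difference between monolingual and bitext matching of false friends can be sketched as follows. This is a minimal illustration under stated assumptions, not LanguageTool's implementation: the two-entry table and the function names are hypothetical, and words are compared as exact forms, whereas a real rule would work on lemmas.

```python
import re

# Hypothetical false-friend table: English word -> misleading Polish lookalike.
FALSE_FRIENDS = {
    "actually": "aktualnie",      # "aktualnie" means "currently"
    "eventually": "ewentualnie",  # "ewentualnie" means "possibly"
}


def words(text):
    """Return the set of lower-cased word tokens in the text."""
    return set(re.findall(r"\w+", text.lower()))


def monolingual_hits(target):
    """Report every lookalike found in the target text alone."""
    tgt = words(target)
    return sorted(pl for pl in FALSE_FRIENDS.values() if pl in tgt)


def bitext_hits(source, target):
    """Report a pair only if the English word is in the source
    AND its lookalike is in the target."""
    src, tgt = words(source), words(target)
    return sorted((en, pl) for en, pl in FALSE_FRIENDS.items()
                  if en in src and pl in tgt)
```

For the pair “He may possibly agree.” / “Ewentualnie się zgodzi.”, the monolingual check still raises an alarm on “ewentualnie”, while the bitext check stays silent because the source does not contain “eventually”.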
False friend rules
● There are many rules that detect possible false friends in translation (for different language pairs, mostly involving English, German, Polish, Italian, and French)
● In bilingual mode, they are triggered less often (only in real cases), so there are fewer false alarms
Rules for target language
Language     Rules  Maintainer
Belarusian       7  Alex Buloichik
Catalan        213  Ricard Roca
Danish          22  Esben Aaberg
Dutch          311  Ruud Baars
English        479  Marcin Miłkowski, Daniel Naber
Esperanto       80  Dominique Pellé
French        1810  Agnes Souque, Hugo Voisard (2006-7)
Galician       166  Susana Sotelo Docío
German         139  Daniel Naber
Icelandic       39  Anton Karl Ingason
Italian         86  Paolo Bianchini
Lithuanian       4  Mantas Kriaučiūnas
Malayalam       18  Jithesh.V.S
Polish        1028  Marcin Miłkowski
Romanian       418  Ionuț Păduraru
Russian        124  Yakov Reztsov
Slovak          55  Zdenko Podobný
Slovenian       58  Martin Srebotnjak
Spanish         70  Juan Martorell
Swedish         26  Niklas Johansson
Ukrainian        8  Andriy Rysin
Generic bitext rules
● As of now, there are only two:
  – Check whether the translation length is roughly the same as that of the original
  – Check whether the translation is identical to the original
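The two generic checks are language-independent and easy to sketch. The version below is an illustration only: LanguageTool implements its generic bitext rules in Java, and the function names and the `max_ratio` threshold here are assumptions, not values taken from LanguageTool.

```python
def same_as_source(source, target):
    """Flag a non-empty translation that is identical to the source
    (the segment was probably left untranslated)."""
    src, tgt = source.strip(), target.strip()
    return src != "" and src == tgt


def length_differs(source, target, max_ratio=2.0):
    """Flag a translation whose length is far from the source length.

    max_ratio is an assumed threshold chosen for this sketch.
    """
    src_len, tgt_len = len(source.strip()), len(target.strip())
    if src_len == 0 or tgt_len == 0:
        # One side empty and the other not is always suspicious.
        return src_len != tgt_len
    ratio = max(src_len, tgt_len) / min(src_len, tgt_len)
    return ratio > max_ratio
```

A character-length ratio is a crude measure; a suitable threshold is likely to depend on the language pair and on how short the segments are.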
XML Bilingual Rules
● A single file for many source languages and one target language
● May specify language-aware checks and corrections, including, but not limited to:
  – Terminology, even in complex phrases (and with rich morphology)
  – Incorrect syntax patterns (copying the grammatical structure of the original into a target language with a different syntax)
  – Inconsistency in terminology
  – Dates, currencies, number formats...
Special uses of bitext checking
● Bitext checking in LanguageTool may be used for automatic post-editing of statistical machine translation
  – For example, to make sure that financial data are translated consistently (e.g. a currency is rendered the same way throughout)
  – To fix missing negation, common in Google Translate output
● For bilingual dictionary quality checks
Some examples
● Let's take a real-world mistake: a translator uses Google Translate to speed up the job and mistranslates “similar in kind” as “podobny w naturze”
● And, frankly, every proofreader knows that human beings can be more stupid than machine translation...
The rule
● The pattern in an XML rule for Polish would be:

<pattern>
  <source lang="en">
    <token>similar</token>
    <token>in</token>
    <token>kind</token>
  </source>
  <target>
    <token inflected="yes">podobny</token>
    <token>w</token>
    <token>naturze</token>
  </target>
</pattern>
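To make the matching semantics of this pattern concrete, here is a minimal Python sketch that applies it to a sentence pair. It is not LanguageTool's matcher: `LEMMAS` is a hypothetical stand-in for a morphological dictionary, the function names are illustrative, and `inflected="yes"` is assumed to mean "match any inflected form of this lemma".

```python
import re
import xml.etree.ElementTree as ET

RULE_XML = """
<pattern>
  <source lang="en">
    <token>similar</token><token>in</token><token>kind</token>
  </source>
  <target>
    <token inflected="yes">podobny</token><token>w</token><token>naturze</token>
  </target>
</pattern>
"""

# Hypothetical stand-in for a morphological dictionary: word form -> lemma.
LEMMAS = {"podobny": "podobny", "podobna": "podobny",
          "podobne": "podobny", "podobnego": "podobny"}


def tokenize(text):
    """Split text into lower-cased word tokens."""
    return re.findall(r"\w+", text.lower())


def token_matches(token_el, word):
    """Compare one <token> element with one word of the text."""
    expected = token_el.text.lower()
    if token_el.get("inflected") == "yes":
        # Assumed semantics: compare lemmas instead of surface forms.
        return LEMMAS.get(word, word) == expected
    return word == expected


def side_matches(side_el, text):
    """True if the token sequence occurs contiguously in the text."""
    tokens = side_el.findall("token")
    text_words = tokenize(text)
    for start in range(len(text_words) - len(tokens) + 1):
        window = text_words[start:start + len(tokens)]
        if all(token_matches(t, w) for t, w in zip(tokens, window)):
            return True
    return False


def rule_matches(source, target, rule_xml=RULE_XML):
    """The rule fires only if both the source and the target side match."""
    pattern = ET.fromstring(rule_xml.strip())
    return (side_matches(pattern.find("source"), source)
            and side_matches(pattern.find("target"), target))
```

Because the first target token is matched by lemma, the sketch also catches “podobne w naturze” and other inflected variants, which a plain regular expression over surface forms would have to enumerate by hand.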
Making style guides formalized
● Many style guides contain numerous checks that can easily be formalized in LanguageTool notation
● Using LT, the proofreader can focus on the important mistakes, not just the mechanical and stupid ones
The future
● Several rules will be implemented:
  – A smart number matching rule that handles figures written out as words (“one” translated as “1”, or “1” translated as “jeden”); standard QA tools cannot do that
  – And, partly based on that, currency and date conversion checks
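The idea behind the planned number matching rule can be illustrated with a small sketch. It is not the planned LanguageTool rule: the three-word lexicons and the function names are hypothetical, and a real rule would need full number grammars and would have to deal with ambiguity (English “one” is also a pronoun).

```python
import re

# Tiny hypothetical lexicons, just enough for the illustration.
NUMBER_WORDS = {
    "en": {"one": 1, "two": 2, "three": 3},
    "pl": {"jeden": 1, "dwa": 2, "trzy": 3},
}


def numeric_values(text, lang):
    """Return sorted numeric values found as digits or as number words."""
    values = []
    for word in re.findall(r"\w+", text.lower()):
        if word.isdecimal():
            values.append(int(word))
        elif word in NUMBER_WORDS[lang]:
            values.append(NUMBER_WORDS[lang][word])
    return sorted(values)


def numbers_agree(source, target, src_lang="en", tgt_lang="pl"):
    """True if both sides express the same multiset of numeric values."""
    return numeric_values(source, src_lang) == numeric_values(target, tgt_lang)
```

Comparing values rather than character strings is what lets “one” and “1”, or “3” and “trzy”, count as the same number.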
Interoperability
● Currently, LT can be used in bilingual mode only from the command line and directly via its API
● A new parameter will be added to enable HTTP requests (for CheckMate)
● This will allow checking multiple translation formats, such as Trados RTF, XLIFF, TMX, TTX, PO...
● More native filters for standard formats on the command line
Google Summer of Code 2011
● LT participates in GSoC 2011, so we hope to get a lot of new things done for many languages
● Make the UI easier to use
● Enable writing rules without even touching the XML source
● Convert simple vocabularies (tabbed text and TBX) into terminology checks
● … and attract more developers!
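As a rough illustration of what turning a tabbed vocabulary into terminology checks could look like, the sketch below reads "source term, TAB, preferred target term" lines and flags segments where the source term is present but the preferred translation is not. The file layout and the function names are assumptions made for this example, not a description of the planned converter.

```python
import csv
import io


def load_glossary(tsv_text):
    """Parse 'source term<TAB>target term' lines into lower-cased pairs."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return [(row[0].strip().lower(), row[1].strip().lower())
            for row in reader if len(row) >= 2]


def terminology_warnings(glossary, source, target):
    """Warn when a source term occurs but its preferred translation does not.

    Plain substring matching is used here for brevity.
    """
    src, tgt = source.lower(), target.lower()
    return [f'"{s}" should be translated as "{t}"'
            for s, t in glossary if s in src and t not in tgt]
```

With plain substring matching, a correct but inflected form such as “stronę internetową” would be reported as an error, which is why checks generated this way need lemma-aware matching for languages with rich morphology.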
Conclusion
● Translation QA should be done in a language-aware fashion, not just via pure text search (even with regular expressions)
● And there are free tools that enable checks going beyond what most commercial packages can do!
Thank you!
● Miłkowski, M. (2008). Automated Building of Error Corpora of Polish, in: B. Lewandowska-Tomaszczyk (ed.), Corpus Linguistics, Computer Tools, and Applications – State of the Art. PALC 2007, Peter Lang, Internationaler Verlag der Wissenschaften: 631-639.
● Miłkowski, M. (2010). Developing an open-source, rule-based proofreading tool, Software – Practice and Experience, 40 (7): 543-566. DOI: 10.1002/spe.971
● Miłkowski, M. (in press). Automatic Rule Generation for Grammar Checkers, in: S. Goźdź-Roszkowski, Proceedings of PALC 2009.
● For more information on LanguageTool, see www.languagetool.org and languagetool.wikidot.com