Sunday, September 28, 2008

google

As you probably already know, Google Translate has added 11 more languages, including Slovak, to its already impressive portfolio. While testing the English-Slovak service, I was pleasantly surprised at the MT engine's ability to handle syntax such as noun phrases containing adjectives, although I noted a number of problems with translating English idiomatic structures, such as those involving the verbs "give" and "take" or multiword expressions. Overall, about 60% of the translations of work-related documents I put in produced comprehensible and usable texts, so color me impressed. More testing will be required to see if the Slovak translators who work for me should start to worry about their jobs (and trust me, I do have a shit list), but I'm pleased to inform you that we already have a candidate for the mistranslation of the year. Consider the headline of this report on the first US presidential debate from CNN.com and then have a look at the translation, especially the items in red:



English: Analysis: A few jabs, but no knockout in first debate
Slovak: Analýza: Za pár popíchnutí, ale žiadna kokot nedved v prvom diskusie

OK, my praise of syntax handling now sounds premature, since in "v prvom diskusie", neither the noun nor the ordinal numeral is declined properly (it should be "v prvej diskusii"), but that's not the interesting bit. That rests with the translation of the word "knockout": "kokot nedved". "Kokot" = "dick, prick" is of course the basic Slovak insult for a man, for more information see here. It is also a very vulgar term, rarely seen in print or heard on the airwaves, so its appearance here will not only elicit a chuckle for its own sake, but also raise the question of just what corpus Google was using to train the MT engine. The web, sure, but I can't think of any sufficiently large bilingual corpus where that word would crop up. And that question is even more justified with the second part of the translation: what the hell is a "nedved"? The only word that even comes close is the Czech surname Nedvěd, which is a form of "medvěd" = "bear". There are a few people with that name who have a significant enough presence on the web to be included in a web corpus, like the football player Pavel Nedvěd, the hockey player Petr Nedvěd and the folk singers Jan (Honza) and František Nedvěd. But how did their name get into the translation for "knockout"? "Knockout" ("knokaut" in Slovak) is a sports term, but I know of no boxer by the name of Nedvěd. Then again, football and hockey players as fellow athletes could probably fit the bill. That still leaves the question of how this Czech word got into a translation into Slovak. And it's not the only one - if you look at the screenshot, you will see at least three more words clearly identifiable as Czech (highlighted in green):

- "štípnout" for "tweak" - Slovak: "upraviť, vyladiť"; "štípnout" = "pinch, sting", slang: "steal", although one of my dictionaries gives "štípnout" for "tweak" without any further context or explanation.
- "slíbený" for "vowed" - Slovak: "sľúbený". Note that this is a past participle while the original has "vowed" as a past tense verb.
- "poldové" for "cops" - Slovak: "policajti", slang: "fízli". Note the context mismatch: both "poldové" and "fízli" are stylistically marked and not very likely to appear in a newspaper save perhaps for direct quotes.

Once again, I assume that web corpora were to some extent used to train the MT engine. As my own feeble attempts at corpus research have shown, the country code cz or sk in the domain name in no way guarantees that you will find only Czech or Slovak text there. The actual ratio is hard to determine, but it is definitely nice to see that one of the better aspects of Czechoslovakia - its almost fully bilingual citizens - survives to this day.
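For the curious, here is a rough sketch of how one might spot Czech text hiding in a nominally Slovak corpus (or vice versa). This is purely my own illustration and has nothing to do with whatever Google actually does: it relies only on the fact that "ě", "ř" and "ů" occur in the Czech alphabet but not the Slovak one, while "ä", "ĺ", "ľ", "ô" and "ŕ" are Slovak-only. The function name and the return labels are made up for the example.

```python
import unicodedata

# Letters found in only one of the two alphabets.
# This is a crude heuristic for illustration, not a real language identifier.
CZECH_ONLY = set("ěřů")
SLOVAK_ONLY = set("äĺľôŕ")


def guess_language(text: str) -> str:
    """Return 'cs', 'sk', 'mixed' or 'unknown' based on distinctive letters."""
    # Normalize so that a letter plus a combining diacritic becomes one character.
    letters = set(unicodedata.normalize("NFC", text).lower())
    has_czech = bool(letters & CZECH_ONLY)
    has_slovak = bool(letters & SLOVAK_ONLY)
    if has_czech and has_slovak:
        return "mixed"
    if has_czech:
        return "cs"
    if has_slovak:
        return "sk"
    return "unknown"


for word in ["Nedvěd", "sľúbený", "poldové"]:
    print(word, guess_language(word))
```

Note that "poldové" comes back as "unknown": plenty of Czech words contain none of the telltale letters, which is one reason why the actual ratio is so hard to determine with simple means.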

And one last interesting bit from this small test: Barack Obama's full name is translated as it should be. But whenever his last name shows up on its own, Google translates it as "osobách" = "person-LOC.PL" (highlighted in light blue). Buggered if I know why...

(h/t: filer)