Two a day keeps stagnation away
University of Helsinki language technology professor Jörg Tiedemann has released a dataset with over 500 million translated sentences in 188 languages

The release consists of automatically translated data sets that can be used for data augmentation. The translations were produced with models trained on the Tatoeba MT challenge data. They cover Wikipedia, WikiSource, WikiBooks, WikiNews and WikiQuote (where available for the source language being translated from). Translations are done on shuffled, de-duplicated data sets, and they come in blocks of at most one million sentences per file.
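To make the "shuffled, de-duplicated, blocks of at most one million sentences" description concrete, here is a minimal Python sketch of that kind of preparation step. It is not the project's actual pipeline: the function name `dedupe_shuffle_blocks` and its parameters are hypothetical, and details such as whitespace stripping and the fixed seed are assumptions made for illustration.

```python
import random
from typing import Iterable, List


def dedupe_shuffle_blocks(
    sentences: Iterable[str],
    block_size: int = 1_000_000,  # assumed cap, matching "at most one million"
    seed: int = 0,                # assumed; fixed only to make runs repeatable
) -> List[List[str]]:
    """Drop exact duplicates, shuffle, and split into fixed-size blocks."""
    # Strip whitespace and skip empty lines; dict.fromkeys keeps one copy
    # of each distinct sentence.
    unique = list(dict.fromkeys(s.strip() for s in sentences if s.strip()))
    # Shuffle in place with a local generator so global state is untouched.
    random.Random(seed).shuffle(unique)
    # Slice into consecutive blocks; the last block may be shorter.
    return [unique[i:i + block_size] for i in range(0, len(unique), block_size)]


if __name__ == "__main__":
    demo = ["Hello.", "Hei.", "Hello.", "Moi.", "  ", "Terve."]
    for n, block in enumerate(dedupe_shuffle_blocks(demo, block_size=2)):
        print(n, block)
```

Each returned block could then be written to its own file and handed to a translation model; for a corpus too large to hold in memory, a streaming or hash-based approach would be needed instead.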

Read more: https://github.com/Helsinki-NLP/Tatoeba-Challenge/blob/master/Backtranslations.md
