Do Not Discard – Extracting Useful Fragments from Low-Quality Parallel Data to Improve Machine Translation

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

When parallel corpora are preprocessed for machine translation (MT) training, part of the data is commonly deemed non-parallel and discarded because of an unusual length ratio, overlapping text in the source and target sentences, or failure of some other semantic equivalence test. For language pairs with limited parallel resources this can be costly, since even modest amounts of acceptable data may help build MT systems that generate higher-quality translations. In this paper, we refine parallel corpora for two language pairs, English–Bengali and English–Icelandic, by extracting sub-sentence fragments from sentence pairs that would otherwise have been discarded, in order to increase recall when compiling training data. We find that including the fragments significantly improves the translation quality of NMT systems trained on the data when translating from English to Bengali and from English to Icelandic.
Original language: English
Title of host publication: Proceedings of the Second Workshop on Corpus Generation and Corpus Augmentation for Machine Translation
Publisher: Asia-Pacific Association for Machine Translation
Pages: 1-13
Number of pages: 13
Publication status: Published - 2023
