Filtering Matters: Experiments in Filtering Training Sets for Machine Translation

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

Abstract

We explore different approaches for filtering parallel data for MT training, whether the same filtering approaches suit different datasets, and if separate filters should be applied to a dataset depending on the translation direction. We evaluate the results of different approaches, both manually and on a downstream NMT task. We find that, first, it is beneficial to inspect how well different filtering approaches suit different datasets and, second, that while MT systems trained on data prepared using different filters do not differ substantially in quality, there is indeed a statistically significant difference. Finally, we find that the same training sets do not seem to suit different translation directions.
Original language: English
Title of host publication: Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
Place of Publication: Tórshavn, Faroe Islands
Publisher: University of Tartu Library
Pages: 588-600
Number of pages: 13
Publication status: Published - 1 May 2023
