SentAlign: Accurate and scalable sentence alignment

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

We present SentAlign, an accurate sentence alignment tool designed to handle very large parallel document pairs. Given user-defined parameters, the alignment algorithm evaluates all possible alignment paths in fairly large documents of thousands of sentences and uses a divide-and-conquer approach to align documents containing tens of thousands of sentences. The scoring function is based on LaBSE bilingual sentence representations. SentAlign outperforms five other sentence alignment tools when evaluated on two different evaluation sets, German-French and English-Icelandic, and on a downstream machine translation task.
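
To make the idea of scoring alignment paths concrete, here is a minimal sketch of a dynamic-programming aligner over sentence vectors. It is not SentAlign's implementation: the toy vectors stand in for LaBSE embeddings, the function names and the `skip_penalty` value are hypothetical, only one-to-one links and skipped sentences are modelled, and the divide-and-conquer step for very large documents is left out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align(src_vecs, tgt_vecs, skip_penalty=0.3):
    """Return the best-scoring monotonic list of (src_index, tgt_index) links.

    Each step either links one source sentence to one target sentence
    (scored by cosine similarity) or skips a sentence on one side
    (charged skip_penalty, an illustrative value).
    """
    n, m = len(src_vecs), len(tgt_vecs)
    score = [[float("-inf")] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0.0

    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:  # link source i-1 with target j-1
                sim = cosine(src_vecs[i - 1], tgt_vecs[j - 1])
                candidates.append((score[i - 1][j - 1] + sim, (i - 1, j - 1)))
            if i > 0:  # leave source i-1 unaligned
                candidates.append((score[i - 1][j] - skip_penalty, (i - 1, j)))
            if j > 0:  # leave target j-1 unaligned
                candidates.append((score[i][j - 1] - skip_penalty, (i, j - 1)))
            score[i][j], back[i][j] = max(candidates)

    # Walk the back-pointers from the end and keep only the diagonal steps.
    links = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        if pi == i - 1 and pj == j - 1:
            links.append((pi, pj))
        i, j = pi, pj
    links.reverse()
    return links


# Toy example: the second source sentence has no counterpart in the target.
src = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
tgt = [[1, 0, 0], [0, 0, 1]]
links = align(src, tgt)
```

The table has one cell per pair of sentence positions, so time and memory grow with the product of the two document lengths. That cost is the usual reason an exhaustive search over paths is limited to documents of a few thousand sentences and larger inputs are split into chunks first.
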
Original language: English
Title of host publication: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Pages: 256-263
Number of pages: 8
Publication status: Published - 2023
