Automated Sentence Alignment in Ukrainian-German Parallel Texts

Authors

  • M.I. Korotiuk National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”
  • N.A. Rybachok National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute” https://orcid.org/0000-0002-8133-1148

DOI:

https://doi.org/10.15407/intechsys.2025.01.050

Keywords:

sentence alignment, parallel texts, machine translation, BLEU metric, dictionaries

Abstract

Introduction. Sentence alignment in Ukrainian-German parallel texts is a relevant task. It yields the parallel data sets needed for many computational linguistics tasks, such as parallel corpus construction and machine translation. The article outlines the main tasks of sentence alignment, reviews existing methods, and analyzes their ideas. Based on this analysis, a new method is proposed. It builds on the Bleualign approach, which uses machine translation systems and the BLEU metric to assess sentence similarity, including the alignment of parts of complex sentences. It differs from Bleualign in the steps of the alignment process and in the use of additional marker dictionaries for domain-specific terms and conjunctions, including their synonyms.

The purpose of the work is to develop a method and software for automated sentence alignment in Ukrainian-German parallel texts.

Methods. The developed method is based on the Bleualign method and the BLEU metric. It is improved by the use of dictionaries of industry terms and conjunctions, and it focuses on a single language pair, Ukrainian-German. The method consists of six stages that together align sentences in Ukrainian-German parallel texts. It is implemented in software in the Python programming language.
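The abstract does not list the six stages or say how the dictionaries enter the scoring, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that a machine-translated Ukrainian sentence is compared with each German candidate using a simplified, smoothed sentence-level BLEU, and that a marker dictionary (Ukrainian conjunction or term mapped to German equivalents and synonyms) adds a weighted bonus. The function names, the example dictionary, the weight of 0.3, and the hard-coded translation are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU with add-one smoothing (illustrative only)."""
    cand, ref = tokenize(candidate), tokenize(reference)
    if not cand or not ref:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
        overlap = sum((cand_ngrams & ref_ngrams).values())  # clipped matches
        total = sum(cand_ngrams.values())
        log_precision += math.log((overlap + 1) / (total + 1)) / max_n
    brevity = min(1.0, math.exp(1 - len(ref) / len(cand)))  # brevity penalty
    return brevity * math.exp(log_precision)


def marker_overlap(source, target, marker_dict):
    """Share of source-side markers with a listed equivalent in the target."""
    src_tokens, tgt_tokens = set(tokenize(source)), set(tokenize(target))
    expected = [eq for marker, eq in marker_dict.items() if marker in src_tokens]
    if not expected:
        return 0.0
    return sum(1 for eq in expected if eq & tgt_tokens) / len(expected)


def similarity(source, translated_source, target, marker_dict, weight=0.3):
    """Assumed combination: weighted sum of BLEU and marker overlap."""
    bleu = sentence_bleu(translated_source, target)
    return (1 - weight) * bleu + weight * marker_overlap(source, target, marker_dict)


# Hypothetical marker dictionary: Ukrainian conjunction -> German equivalents.
MARKERS = {"але": {"aber", "jedoch", "doch"}, "тому": {"deshalb", "daher"}}

source_uk = "Система працює швидко, але потребує багато пам'яті."
# Stand-in for the output of a machine translation system (hard-coded here).
translated_de = "Das System arbeitet schnell, aber benötigt viel Speicher."
candidate_a = "Das System arbeitet schnell, benötigt jedoch viel Speicher."
candidate_b = "Die Ergebnisse werden im nächsten Abschnitt gezeigt."

score_a = similarity(source_uk, translated_de, candidate_a, MARKERS)
score_b = similarity(source_uk, translated_de, candidate_b, MARKERS)
print(f"candidate A: {score_a:.3f}, candidate B: {score_b:.3f}")
```

In this sketch the synonym "jedoch" in candidate A still counts as a match for the marker "але", even though the machine translation chose "aber". That is the kind of case where a synonym-aware dictionary could help beyond plain n-gram overlap.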

Results. A new method for aligning sentences in Ukrainian-German parallel texts has been developed and implemented in software. The proposed method is based on statistical approaches and does not require significant computing resources. Unlike the Bleualign method, it uses dictionaries of industry terms and conjunctions for more accurate sentence alignment.

Conclusions. Further research will include experiments and comparison of the alignment results obtained using the proposed method with the results of the Bleualign method.

Published

2025-06-30

How to Cite

Korotiuk, M., & Rybachok, N. (2025). Automated Sentence Alignment in Ukrainian-German Parallel Texts. Information Technologies and Systems, 1(1), 50–58. https://doi.org/10.15407/intechsys.2025.01.050

Section

Intellectual Information Technologies