XNLIeu: a dataset for cross-lingual NLI in Basque

Maite Heredia; Julen Etxaniz; Muitze Zulaika; Xabier Saralegi; Jeremy Barnes; Aitor Soroa

doi:10.18653/v1/2024.naacl-long.234

Abstract

XNLI is a popular Natural Language Inference (NLI) benchmark widely used to evaluate cross-lingual Natural Language Understanding (NLU) capabilities across languages. In this paper, we expand XNLI to include Basque, a low-resource language that can greatly benefit from transfer-learning approaches. The new dataset, dubbed XNLIeu, has been developed by first machine-translating the English XNLI corpus into Basque, followed by a manual post-edition step. We have conducted a series of experiments using mono- and multilingual LLMs to assess a) the effect of professional post-edition on the MT system; b) the best cross-lingual strategy for NLI in Basque; and c) whether the choice of the best cross-lingual strategy is influenced by the fact that the dataset is built by translation. The results show that post-edition is necessary and that the translate-train cross-lingual strategy obtains better results overall, although the gain is lower when tested in a dataset that has been built natively from scratch. Our code and datasets are publicly available under open licenses.

Anthology ID: 2024.naacl-long.234
Volume: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month: June
Year: 2024
Address: Mexico City, Mexico
Editors: Kevin Duh, Helena Gomez, Steven Bethard
Venue: NAACL
Publisher: Association for Computational Linguistics
Pages: 4177–4188
URL: https://aclanthology.org/2024.naacl-long.234/
DOI: 10.18653/v1/2024.naacl-long.234
Cite (ACL): Maite Heredia, Julen Etxaniz, Muitze Zulaika, Xabier Saralegi, Jeremy Barnes, and Aitor Soroa. 2024. XNLIeu: a dataset for cross-lingual NLI in Basque. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 4177–4188, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal): XNLIeu: a dataset for cross-lingual NLI in Basque (Heredia et al., NAACL 2024)
PDF: https://aclanthology.org/2024.naacl-long.234.pdf
