Maite Heredia
2025
HiTZ at VarDial 2025 NorSID: Overcoming Data Scarcity with Language Transfer and Automatic Data Annotation
Jaione Bengoetxea
|
Mikel Zubillaga
|
Ekhi Azurmendi
|
Maite Heredia
|
Julen Etxaniz
|
Markel Ferro
|
Jeremy Barnes
Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects
In this paper we present our submission for the NorSID Shared Task as part of the 2025 VarDial Workshop, consisting of three tasks: Intent Detection, Slot Filling and Dialect Identification, evaluated using data in different dialects of the Norwegian language. For Intent Detection and Slot Filling, we have fine-tuned a multitask model in a cross-lingual setting, to leverage the xSID dataset available in 17 languages. In the case of Dialect Identification, our final submission consists of a model fine-tuned on the provided development set, which has obtained the highest scores within our experiments. Our final results on the test set show that our models do not drop in performance compared to the development set, likely due to the domain-specificity of the dataset and the similar distribution of both subsets. Finally, we also report an in-depth analysis of the provided datasets and their artifacts, as well as other sets of experiments that have been carried out but did not yield the best results. Additionally, we present an analysis on the reasons why some methods have been more successful than others; mainly the impact of the combination of languages and domain-specificity of the training data on the results.
2024
XNLIeu: a dataset for cross-lingual NLI in Basque
Maite Heredia
|
Julen Etxaniz
|
Muitze Zulaika
|
Xabier Saralegi
|
Jeremy Barnes
|
Aitor Soroa
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
XNLI is a popular Natural Language Inference (NLI) benchmark widely used to evaluate cross-lingual Natural Language Understanding (NLU) capabilities across languages. In this paper, we expand XNLI to include Basque, a low-resource language that can greatly benefit from transfer-learning approaches. The new dataset, dubbed XNLIeu, has been developed by first machine-translating the English XNLI corpus into Basque, followed by a manual post-edition step. We have conducted a series of experiments using mono- and multilingual LLMs to assess a) the effect of professional post-edition on the MT system; b) the best cross-lingual strategy for NLI in Basque; and c) whether the choice of the best cross-lingual strategy is influenced by the fact that the dataset is built by translation. The results show that post-edition is necessary and that the translate-train cross-lingual strategy obtains better results overall, although the gain is lower when tested in a dataset that has been built natively from scratch. Our code and datasets are publicly available under open licenses.
Search
Fix data
Co-authors
- Jeremy Barnes 2
- Julen Etxaniz 2
- Ekhi Azurmendi 1
- Jaione Bengoetxea 1
- Markel Ferro 1
- show all...