Ahmad Shallouf


2025

CompUGE-Bench: Comparative Understanding and Generation Evaluation Benchmark for Comparative Question Answering
Ahmad Shallouf | Irina Nikishina | Chris Biemann
Proceedings of the 31st International Conference on Computational Linguistics: System Demonstrations

This paper presents CompUGE, a comprehensive benchmark designed to evaluate Comparative Question Answering (CompQA) systems. The benchmark is structured around four core tasks: Comparative Question Identification, Object and Aspect Identification, Stance Classification, and Answer Generation. It unifies multiple datasets and provides a robust evaluation platform for comparing models across these sub-tasks. We also create additional, all-encompassing CompUGE datasets by filtering and merging the existing ones. The benchmark is available as a web application on HuggingFace Spaces: https://huggingface.co/spaces/uhhlt/CompUGE-Bench
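
As a toy illustration of the first sub-task, Comparative Question Identification, the sketch below frames it as a binary decision over a question string. The regex cues and the `is_comparative` helper are purely illustrative stand-ins, not part of the benchmark, which evaluates trained classifiers rather than rules like these.

```python
# Toy stand-in for the Comparative Question Identification sub-task:
# decide whether a question compares two objects. Real submissions
# use fine-tuned classifiers; this heuristic only illustrates the
# task's input/output contract.
import re

# Hypothetical surface cues that often signal a comparison.
COMPARATIVE_CUES = re.compile(
    r"\b(vs\.?|versus|better|worse|faster|slower|cheaper"
    r"|difference between|compared? (?:to|with))\b",
    re.IGNORECASE,
)

def is_comparative(question: str) -> bool:
    """Return True if the question looks comparative."""
    return bool(COMPARATIVE_CUES.search(question))

print(is_comparative("Which is faster, Python or C++?"))  # True
print(is_comparative("When was Python first released?"))  # False
```

A real system would replace the regex with a sequence classifier and be scored against the benchmark's leaderboard.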

2024

CAM 2.0: End-to-End Open Domain Comparative Question Answering System
Ahmad Shallouf | Hanna Herasimchyk | Mikhail Salnikov | Rudy Alexandro Garrido Veliz | Natia Mestvirishvili | Alexander Panchenko | Chris Biemann | Irina Nikishina
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Comparative Question Answering (CompQA) is a Natural Language Processing task that combines Question Answering and Argument Mining approaches to answer subjective comparative questions in an efficient, argumentative manner. In this paper, we present CAM 2.0, an end-to-end (full-pipeline) system for answering comparative questions, as well as CompUGE, a public leaderboard that unifies the existing datasets under a single, easy-to-use evaluation suite. Compared to previous web-form-based CompQA systems, CAM 2.0 features question identification, object and aspect labeling, stance classification, and summarization with up-to-date models. We select the most time- and memory-efficient pipeline by comparing separately fine-tuned Transformer encoder models, which achieve state-of-the-art performance on the sub-tasks, with generative LLMs in few-shot and LoRA setups. We also conduct a user study for a whole-system evaluation.
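
As a rough sketch of the LoRA setup mentioned in the abstract, the snippet below attaches low-rank adapters to a generative LLM with the `peft` library. The base checkpoint (`gpt2`), rank, and other hyperparameters are placeholders for illustration, not the paper's actual configuration.

```python
# Minimal LoRA sketch with HuggingFace peft: only the small low-rank
# adapter matrices are trained while the base LLM weights stay frozen.
# "gpt2" and all hyperparameters below are illustrative placeholders.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder base LLM

lora_cfg = LoraConfig(
    r=8,                        # rank of the adapter matrices
    lora_alpha=16,              # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # prints the small trainable fraction
```

Because only the adapters are updated, such a setup is much cheaper to fine-tune per sub-task than full fine-tuning, which relates to the time- and memory-effectiveness comparison described above.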