Daiki Shiono
2025
Evaluating Model Alignment with Human Perception: A Study on Shitsukan in LLMs and LVLMs
Daiki Shiono | Ana Brassard | Yukiko Ishizuki | Jun Suzuki
Proceedings of the 31st International Conference on Computational Linguistics
We evaluate the alignment of large language models (LLMs) and large vision-language models (LVLMs) with human perception, focusing on the Japanese concept of *shitsukan*, which reflects the sensory experience of perceiving objects. We created a dataset of *shitsukan* terms elicited from individuals in response to object images. With it, we designed benchmark tasks for three dimensions of understanding *shitsukan*: (1) accurate perception in object images, (2) commonsense knowledge of typical *shitsukan* terms for objects, and (3) distinction of valid *shitsukan* terms. Models demonstrated mixed accuracy across benchmark tasks, with limited overlap between model- and human-generated terms. However, manual evaluations revealed that the model-generated terms were still natural to humans. This work identifies gaps in culture-specific understanding and contributes to aligning models with human sensory perception. We publicly release the dataset to encourage further research in this area.
2024
Detecting Response Generation Not Requiring Factual Judgment
Ryohei Kamei | Daiki Shiono | Reina Akama | Jun Suzuki
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)
With the remarkable development of large language models (LLMs), ensuring the factuality of output has become a challenge. However, grounding every part of a response in given knowledge or facts is not necessarily desirable in dialogue. This study aims to achieve both attractiveness and factuality in dialogue responses; to that end, we set a task of predicting sentences that do not require factual correctness judgment, such as agreement or personal opinions and feelings. We created a dataset for this task via crowdsourcing, the dialogue dataset annotated with fact-check-needed label (DDFC), and ran classification experiments with several models on it. The best-performing model achieved a classification accuracy of about 88%.