
Computers and Education: Artificial Intelligence 6 (2024) 100252

Automatic question-answer pairs generation using pre-trained large language models in higher education

Jintao Ling, Muhammad Afzaal *

Department of Computer and Systems Sciences, Stockholm University, Sweden

ARTICLE INFO

Keywords:
Pre-trained language model
Question-answer pairs generation
Higher education
Automatic evaluation
Real-educational evaluation

ABSTRACT

The process of manually generating question and answer (QA) pairs for assessments is known to be a time-consuming and energy-intensive task for teachers, specifically in higher education. Several studies have proposed various methods utilising pre-trained large language models for the generation of QA pairs. However, it is worth noting that these methods have primarily been evaluated on datasets that are not specifically educational in nature. Furthermore, the evaluation metrics and strategies employed in these studies differ significantly from those typically used in educational contexts. The present discourse fails to present a compelling case regarding the efficacy and practicality of the stated methods within the context of higher education. This study aimed to examine multiple QA pairs generation approaches in relation to their performance, efficacy, and constraints within the context of higher education. The approaches encompassed in this study comprise the pipeline, joint, and multi-task approaches. The performance of the approaches under consideration was assessed on three datasets related to distinct courses. The evaluation integrates three components, automated methods, teacher assessments, and real-world educational evaluations, to provide a comprehensive analysis. The comparison of the various approaches was conducted by directly assessing their performance using the average scores of different automatic metrics on the three datasets. The results of the teacher and real-educational evaluations indicate that the generated assessments were beneficial in enhancing students' understanding of concepts and overall performance. The implications of the findings from this study hold significant importance in enhancing the efficacy of QA pair generation tools within the context of higher education.

1. Introduction

In the context of higher education, teachers face the significant challenge of creating assessments that are both effective and efficient in evaluating student knowledge (Gholami & Morady Moghaddam, 2013; Jia et al., 2020). Traditional methods of manually crafting quizzes and exams are not only time-consuming but also detract from the valuable time that could be spent on teaching and engaging with students (Gholami & Morady Moghaddam, 2013; Klein & Nabi, 2019; Rodriguez-Torrealba et al., 2022b). This situation highlights a pressing need for innovative solutions that can streamline the assessment creation process without sacrificing the quality of education.

This necessitates exploring the promising field of automated question-answer (QA) pair generation. Leveraging advancements in natural language processing and machine learning, particularly with pre-trained large language models (LLMs), offers a promising solution to streamline assessment development (Alberti et al., 2019, pp. 6168-6173; Chan & Fan, 2019, pp. 154-162). Automated QA pair generation involves creating relevant questions and their corresponding answers from unstructured texts, aiming to streamline assessment development (Alberti et al., 2019, pp. 6168-6173; Chan & Fan, 2019, pp. 154-162). The significance of this research lies in its potential to revolutionize how assessments are developed, making the process more efficient and opening the door to more personalized and adaptive learning experiences. By evaluating different QA pair generation methodologies, namely the pipeline, joint, and multi-task learning approaches, this study aims to identify the most effective strategies for integrating this technology into higher education. This is not merely an academic exercise; it has the potential to significantly impact educational practices, enhancing the way teachers design assessments and ultimately improving student learning outcomes.

* Corresponding author.
E-mail address: [email protected] (M. Afzaal).

https://doi.org/10.1016/j.caeai.2024.100252
Received 29 November 2023; Received in revised form 24 April 2024; Accepted 8 June 2024
Available online 10 June 2024
2666-920X/© 2024 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

However, the aforementioned methodologies were assessed and validated using separate datasets and evaluation metrics. When considering the application of these methodologies within the field of education, it becomes challenging to offer a definitive recommendation for approach selection due to the difficulty in directly comparing their efficacy. While certain literature sources acknowledge the potential use of their approaches in an educational context, there is a lack of empirical evaluation of these approaches in real-world educational settings. Additionally, the datasets used to generate pairs (e.g., the extensively utilised SQuAD dataset, which has over 100,000 question-answer pairs derived from a collection of Wikipedia articles) were specifically designed for reading comprehension (Rajpurkar et al., 2016). Using datasets like SQuAD for educational efficacy assessment is therefore inappropriate, as they were not specifically designed for educational purposes (Lelkes et al., 2021). As a response, diverse methodologies have been developed for QA pair generation, each holding potential for application in higher education.

In this context, we pose the following research questions.

• Question 1: How can we develop a robust framework to evaluate the effectiveness of different QA pairs generation approaches in the higher education context, ensuring that these technologies meet the specific needs of educational stakeholders?
• Question 2: What impact do automatically generated QA pairs have on students' academic performance and engagement in real educational environments?

The first research question underscores the necessity of a tailored evaluation that transcends mere technical performance, integrating pedagogical considerations to truly assess the value of these technologies in educational settings. The second research question is critical as it seeks to directly link the use of innovative assessment technologies with measurable outcomes in student learning and engagement, thus providing a compelling argument for the integration of such tools in educational practice.

By proposing a methodology that evaluates the effectiveness of different QA pair generation approaches (pipeline, joint, and multi-task approaches) using pre-trained large language models (LLMs) across three benchmark datasets derived from real educational materials, this paper aims to fill the aforementioned gaps. These datasets reflect different courses that are taught at university and were generated specifically for evaluating the different methodologies. The evaluation process involves automatic, teacher, and real-educational evaluations within the context of higher education. Our findings reveal that the multi-task learning approach, particularly when utilising the T5 LLM, outperforms its counterparts, demonstrating a notable positive impact on student academic performance. Teachers expressed high satisfaction with question accuracy and relevance, acknowledging the technical competence and educational utility of the QA pairs, while noting the need for improvements in understandability and consistency across different courses. Moreover, our investigation shows that the use of generated QA pairs positively influences students' academic performance. Specifically, it was observed that students engaging more frequently with QA pair-based assessments tended to perform better in their final examinations, indicating the significant benefits of integrating these technologies into educational assessments.

2. Background

The utilisation of pre-trained language models has yielded remarkable achievements in the domain of natural language processing (NLP). Consequently, numerous methodologies have been put forth to address the task of generating question-answer pairs, leveraging the capabilities of these pre-trained language models. These methodologies are not merely technical achievements; they have practical implications for educational practices, particularly in crafting personalized and adaptive learning experiences (Yi et al., 2021). These approaches have demonstrated substantial enhancements in performance, as evidenced by empirical evaluations (Rodriguez-Torrealba et al., 2022b). The various approaches can be broadly classified into three distinct categories based on their architectural designs (Qu et al., 2021; Rodriguez-Torrealba et al., 2022b): the pipeline approach, the joint learning approach, and the multi-task learning approach.

2.1. Pipeline approach

The pipeline approach is a straightforward methodology in which the processes of question generation and answer generation are executed sequentially. In a recent study (Rodriguez-Torrealba et al., 2022b), the authors introduced a novel processing pipeline that leverages the T5 language model to generate question-answer pairs. This method's superior performance in generating coherent and contextually relevant QA pairs suggests its potential to enhance learning by providing more engaging and challenging materials in higher education settings (Johnson et al., 2024). The model, known as the QAP Model (Question/Answer Pairs Model), leverages the inherent duality between question generation (QG) and QA to achieve its objectives. The T5 model is utilised for generating both questions and answers, with a subsequent model producing distractors to provide incorrect answer options, which enhances the learning challenge.

In another study (Alberti et al., 2019, pp. 6168-6173), the authors introduced a roundtrip consistency mechanism to further enhance the coherence of the generated QA pairs. This mechanism involves a detailed comparison between initial and subsequent responses to ensure coherence, excluding pairs that do not meet the criteria. In the context of education, it is crucial to tailor the generation of student questions to meet educational needs rather than solely assessing comprehension (Yao et al., 2022). The FairytaleQA dataset, annotated by domain experts, is utilised to refine the rules for answer extraction, although limitations exist in covering the broad spectrum of educational subjects.

Researchers also propose integrating a text summarization module within the pipeline approach to streamline the extraction of crucial information and the formulation of questions and answers (Gabajiwala et al., 2022). This optimized approach utilizes advanced tools like ConceptNet, WordNet, and Sense2vec to ensure relevance and accuracy in the generated educational content.

The straightforward pipeline approach using LLMs like T5 in higher education allows for ease of implementation, which is beneficial for teachers without deep technical expertise. It facilitates the customization and scalability of educational content, making it adaptable to various class sizes and curriculum needs (Kurdi et al., 2020). Moreover, the modular nature of the pipeline permits iterative enhancements based on feedback, contributing to continuous improvement in teaching methodologies and student learning outcomes. This methodological simplicity thus not only democratizes access to advanced AI technologies in educational settings but also enhances the overall educational experience by providing reliable and adaptable tools for both teaching and assessment (H. C. Wang et al., 2023).
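The two-step structure of this approach can be sketched with the Hugging Face transformers library: one T5 model fine-tuned for question generation produces a question from a passage, and a second T5 model fine-tuned for question answering produces the answer. This is a minimal illustration under stated assumptions: the checkpoint paths are hypothetical placeholders, and the prompt formats are assumed conventions rather than the exact configuration of the QAP Model.

```python
# Minimal sketch of a two-step (pipeline) QA-pair generator with T5.
# Checkpoint paths are hypothetical; substitute real fine-tuned models.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

QG_CKPT = "path/to/t5-finetuned-question-generation"  # hypothetical
QA_CKPT = "path/to/t5-finetuned-question-answering"   # hypothetical

def load(ckpt):
    return AutoTokenizer.from_pretrained(ckpt), AutoModelForSeq2SeqLM.from_pretrained(ckpt)

def generate(tok, model, prompt, max_new_tokens=64):
    inputs = tok(prompt, return_tensors="pt", truncation=True)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, num_beams=4)
    return tok.decode(out[0], skip_special_tokens=True)

def pipeline_qa_pair(context: str):
    qg_tok, qg_model = load(QG_CKPT)
    qa_tok, qa_model = load(QA_CKPT)
    # Step 1: generate a question from the passage.
    question = generate(qg_tok, qg_model, f"generate question: {context}")
    # Step 2: answer the generated question against the same passage.
    answer = generate(qa_tok, qa_model, f"question: {question} context: {context}")
    return question, answer

if __name__ == "__main__":
    passage = "A relational database stores data in tables made of rows and columns."
    print(pipeline_qa_pair(passage))
```

Because the two steps are independent, a roundtrip filter of the kind described above can be added simply by discarding pairs whose second-step answer does not match the intended span.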


2.2. Joint learning approach

In the context of a joint model, the processes of generating the question and the answer occur iteratively. The interdependence between question generation and question answering presents a promising opportunity for enhancing performance and reducing the reliance on annotated data (Klein & Nabi, 2019). The pipeline approach exhibits suboptimal performance in the task of extracting the most suitable QA pairs from textual data. This is primarily due to the disregard for the interdependence between question generation and answer extraction, resulting in the potential generation of incompatible QA pairs (Cui et al., 2021).

A novel approach was introduced (Klein & Nabi, 2019) that integrates the Transformer decoder GPT-2 with the Transformer encoder BERT, with the aim of facilitating collaborative learning in the context of question answering and question generation. The training of the model is conducted through an end-to-end schema, wherein the dual tasks of question answering and question generation are considered concurrently, rather than being approached as sequential tasks. A ground-truth answer and the accompanying context text are used to generate a question through GPT-2. The subsequent step involves providing the pre-trained BERT model with the SQuAD text, which encompasses the question that was formulated, in order to facilitate the extraction of the answer span. In the event that BERT fails to accurately predict the correct answer as indicated by the annotations, the resulting loss is propagated backwards to GPT-2.

In a similar vein, a novel unified framework for the generation of QA pairs was proposed (Qu et al., 2021). Based on the available evidence, it can be inferred that the process of sequential generation can be effectively simplified by transforming it into the task of generating joint question-answer pairs. Within the confines of this particular framework, the tasks of question generation and keyphrase extraction have been intricately intertwined, forming a dual task. The optimisation process involves leveraging the advantages of each task in a mutually reinforcing manner, with iterative refinement being a key component. The process of generating answers is facilitated by utilising the extracted keyphrase as a guiding factor.

The joint model approach with BERT or GPT is transformative, integrating question and answer generation to reflect the interconnected nature of learning. This method enhances educational practices by providing immediate, context-aware feedback and generating precisely aligned question-answer pairs, leading to more accurate student assessments (Rodriguez-Torrealba et al., 2022a). It also allows for the efficient creation of customized educational content, significantly reducing the workload on educators and enabling them to focus on more personalized teaching. Moreover, its ability to automate content generation makes it ideal for scalable applications such as online learning platforms and MOOCs, supporting large-scale educational initiatives with consistent, high-quality content (Kurdi et al., 2020).
2.3. Multi-task learning approach

The multi-task model utilizes a shared encoder to process inputs for both question and answer generation, enabling mutual learning and synergy between these tasks. This integration blurs the lines between generating questions and answers, highlighting their interdependence and enhancing their effectiveness (Cui et al., 2021). One of the primary challenges encountered in the pipeline approach is the generation of incompatible QA pairs in an independent manner, without considering the underlying relationships between them. The OneStop model has been proposed as a solution to address these issues. It leverages a sequence-to-sequence transformer architecture, featuring a bidirectional encoder and an autoregressive decoder, to generate question-answer pairs using a multi-task learning approach (Cui et al., 2021). Documents are fed into the encoder, and the decoder, employing cross-attention over the encoder's outputs, generates questions and predicts answer spans. This interplay between the document and the generated content is crucial for producing accurate and compatible responses, thereby achieving optimal results in the question and answer generation processes.

In contrast to sharing the encoding model across the question generation and answer generation tasks, the multi-task-based T5 model takes a more direct approach to modelling these tasks (Zhong et al., 2022). In the multi-task setting, the T5 model undergoes fine-tuning on three distinct tasks: question answering, question generation, and answer extraction. For the question answering task, the model takes a context and question pair as input. In the question generation task, the model utilizes answer-highlighted context as input. Lastly, for the answer extraction task, the model employs sentence-highlighted context as input. The performance of the multi-task-based model is observed to be superior when its results are compared with those of a T5 model fine-tuned in a single-task setting (Akyon et al., 2022).

In higher education, the use of QA pair-based assessments built on LLMs like BERT and T5, employing approaches such as pipeline, joint learning, and multi-task learning, significantly enhances both teaching and learning experiences. For students, these models facilitate personalized learning through adaptive assessments and provide immediate, detailed feedback, which is crucial for effective learning (Zhong et al., 2022). They also help in creating fairer and more accurate evaluations by minimizing human biases and ensuring consistency across assessments. The direct impact on student learning is profound, as these assessments help in reinforcing knowledge through repeated and tailored practice, leading to improved understanding and retention (Mazidi & Nielsen, 2014). For teachers, LLMs offer considerable benefits by automating the generation of test materials, thus saving time and effort that can be redirected towards interactive teaching. Additionally, the data-driven insights generated by these models support informed pedagogical decisions and curriculum development, enhancing educational outcomes. The superior performance of LLMs such as the T5 model in higher education settings means that educational practices can become more adaptive, personalized, and inclusive, enhancing both the quality and accessibility of education (Kurdi et al., 2020).
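A small sketch may help to clarify how the three tasks are distinguished purely through input formatting. The task prefixes and the <hl> highlight marker below are assumptions modelled on common open-source T5 question-generation implementations, not necessarily the exact conventions used in the cited studies.

```python
# Sketch of the three input formats used in multi-task QG/QA fine-tuning.
# The prefixes and the <hl> highlight token are illustrative assumptions.
HL = "<hl>"

def qa_input(context: str, question: str) -> str:
    # Question answering: context and question pair.
    return f"question: {question} context: {context}"

def qg_input(context: str, answer: str) -> str:
    # Question generation: the answer span is highlighted inside the context.
    highlighted = context.replace(answer, f"{HL} {answer} {HL}", 1)
    return f"generate question: {highlighted}"

def ae_input(context: str, sentence: str) -> str:
    # Answer extraction: the sentence of interest is highlighted.
    highlighted = context.replace(sentence, f"{HL} {sentence} {HL}", 1)
    return f"extract answers: {highlighted}"

if __name__ == "__main__":
    ctx = "NoSQL databases scale horizontally. They relax strict ACID guarantees."
    print(qg_input(ctx, "horizontally"))
    print(ae_input(ctx, "NoSQL databases scale horizontally."))
    print(qa_input(ctx, "How do NoSQL databases scale?"))
```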
2.4. Research gap

In conclusion, it is worth noting that the existing methodologies for generating QA pairs using pre-trained language models can be classified into three distinct categories: the pipeline method, the joint model, and the multi-task learning model. The approaches that have been reviewed are organised and presented in a structured manner in Table 1. The utilisation of multiple datasets for the generation of QA pairs is evident in the aforementioned methods. Furthermore, it is important to note that the metrics employed for automatically evaluating text generation were found to be inconsistent, lacking a standardised set of metrics for comprehensive evaluation.

The current landscape of QA pairs generation methodologies has witnessed the emergence of various proposed approaches. However, a notable gap in the existing literature relates to the absence of comprehensive performance evaluations conducted on real-world education datasets. The performance of these approaches has been assessed through the utilisation of various datasets and evaluation metrics, thereby posing a challenge in terms of comparing their respective outcomes. Furthermore, it should be noted that the datasets listed in Table 1 lack a clear source from real-world educational settings. Consequently, the extent to which these datasets can be applied to the field of education remains uncertain. Hence, it is imperative to conduct a comprehensive assessment of these methodologies using a universally accepted dataset and standardised evaluation techniques in order to ascertain their efficacy in the realm of education. Furthermore, it is imperative to incorporate real-world evaluations that involve educators with substantial teaching experience in the field of higher education. These evaluations are crucial in order to comprehensively assess the influence and efficacy of these approaches.

Moreover, there is a notable absence of empirical studies directly investigating the impact of automatically generated QA pairs on students' academic performance and engagement. This gap highlights the need for targeted research that evaluates how these technologies affect learning outcomes in actual classroom settings. Such studies are essential to understand whether and how the use of automated QA pairs can enhance educational practices, particularly in terms of improving student engagement and academic achievement.


3. Methodology

This section outlines a methodology for assessing the efficacy of various approaches to generating QA pairs and different LLMs within the context of higher education. Another objective is to evaluate the influence of these approaches and LLMs on students' academic performance. The methodology consists of three main phases. The initial phase of data collection entails gathering fine-tuning datasets for the purpose of fine-tuning LLMs; subsequently, benchmark datasets are generated to facilitate the evaluation process. In the second phase, the experiment, three distinct approaches for generating QA pairs are selected. These approaches are combined with LLMs that are fine-tuned using the collected fine-tuning datasets. In the evaluation phase, the selected approaches and LLMs are assessed on the benchmark datasets using various accuracy measures. The goal is to determine the most suitable group, consisting of both an approach and an LLM, for the real-educational evaluation. The selected group is subsequently employed to generate MCQ assessments, which are used in real-world educational evaluations. These evaluations assess the influence of the assessments on students' academic achievements. The subsequent sections provide a comprehensive analysis of the various stages involved in the proposed methodology.

3.1. Data collection

This section introduces and describes two distinct categories of datasets that were gathered for the purpose of this research. First, fine-tuning datasets are used in the process of refining pre-trained language models; these datasets are specifically designed to improve the performance of the models on domain-specific tasks. Second, benchmark datasets, derived from educational materials through manual curation, are predominantly used for evaluating the performance of the fine-tuned models.

3.1.1. Fine-tuning datasets

The approaches for QA pair and distractor generation (DG) were fine-tuned using the SQuAD and DG-RACE datasets, respectively. Multiple datasets were used in fine-tuning the chosen methods for generating QA pairs, as indicated in Table 1. The selected datasets for this study exhibit distinct characteristics and are applicable in various scenarios.

The Stanford Question Answering Dataset (SQuAD) comprises passages from Wikipedia articles and corresponding question-answer pairs, highlighting its utility for evaluating reading comprehension in LLMs. Each entry features a passage and a set of questions formulated from the text, where answers are exact text spans within the passage (as shown in Fig. 1). This structure aims to challenge models across a spectrum of topics, enforcing a rigorous test of their natural language processing capabilities. As documented by Rajpurkar et al. (2016), the dataset emphasizes creating high-quality question-answer pairs per passage, ensuring precise model evaluation. SQuAD's design, leveraging a wide subject range from history to science, serves as a comprehensive benchmark for advancing question-answering systems and enhancing machine learning techniques.

The DG-RACE dataset, pivotal for enhancing distractor generation (DG) techniques, serves as the foundation for fine-tuning pre-trained language models aimed at creating distractors for multiple-choice questions (MCQs) in reading comprehension exams (Gao et al., 2018; Rodriguez-Torrealba et al., 2022b). It is designed not just to offer incorrect alternatives but to ensure these distractors are convincingly plausible, thereby enriching the realism and educational value of practice exams (as shown in Fig. 2). The dataset leverages a wide array of texts, drawing from subjects commonly found in academic assessments to ensure broad applicability. By focusing on the generation of high-quality, contextually relevant distractors, DG-RACE aims to simulate a genuine exam experience, testing students' comprehension skills effectively and preparing LLM-based systems to support nuanced educational needs.

Fig. 1. SQuAD dataset example.
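For readers who want to inspect the structure described above, the public SQuAD release can be loaded with the Hugging Face datasets library; the field names below follow that release.

```python
# Sketch: load SQuAD and print one passage/question/answer-span entry.
from datasets import load_dataset

squad = load_dataset("squad", split="train[:100]")  # small slice for inspection
example = squad[0]
print(example["context"][:200])               # the Wikipedia passage
print(example["question"])                    # a crowd-sourced question
print(example["answers"]["text"][0])          # answer as an exact span of the passage
print(example["answers"]["answer_start"][0])  # character offset of that span
```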

Table 1
The summarization of the reviewed methods.

Model | Dataset | Question generation | Answer generation | Distractor generation | Pre-trained language model | Method type | Automatic evaluation | Expert evaluation | Evaluation in a real-world setting
QAP Model (Rodriguez-Torrealba et al., 2022b) | DG-RACE | ✔ | ✔ | ✔ | T5 | Pipeline | cosine similarity | × | ×
QACG Model (Alberti et al., 2019, pp. 6168-6173) | SQuAD, Natural Questions (NQ) | ✔ | ✔ | × | BERT | Pipeline | EM, F1 | ✔ | ×
QAG system (Yao et al., 2022) | FairytaleQA | ✔ | ✔ | × | BART | Pipeline | ROUGE | ✔ | ×
Question Generation and Answering (Klein & Nabi, 2019) | SQuAD | ✔ | × | × | GPT-2, BERT | Joint learning | BLEU, ROUGE | × | ×
QAG framework (Qu et al., 2021) | SQuAD, RACE | ✔ | ✔ | × | ProphetNet | Joint learning | BLEU, ROUGE, METEOR | ✔ | ×
OneStop Model (Cui et al., 2021) | SQuAD, NewsQA, DuReader | ✔ | ✔ | × | BART | Multi-task learning | BLEU, ROUGE | ✔ | ×
Multi-task mT5 model (Akyon et al., 2022) | TQuADv2 | ✔ | ✔ | × | mT5 | Multi-task learning | BLEU, METEOR, ROUGE | × | ×


Fig. 2. DG-RACE dataset example.

3.1.2. Benchmark datasets
To ensure uniformity and high educational standards, we recognized the need for a benchmark dataset tailored explicitly for educational purposes, distinguishing it from datasets like RACE and SQuAD, which are not primarily designed with educational objectives in mind. This dedicated dataset adheres to strict criteria, emphasizing the creation of questions that include a passage/context, one correct answer, and three carefully crafted distractors to simulate real-world examination conditions effectively (as shown in Fig. 3). Inspired by the methodologies proposed by Torres et al. (2009), our approach to building these benchmark datasets involved detailed guidelines, as documented in Appendix A.

We selected three courses for this endeavor: a bachelor-level course in Programming 1 and two master-level courses, Big Data with NoSQL and Enterprise Computing and ERP Systems. These courses were chosen to develop benchmark datasets, ensuring a wide range of coverage that reflects critical areas of contemporary computer science education. The annotation process was meticulously carried out with the collaboration of six annotators, comprising teaching assistants, to ensure a robust evaluation across the three distinct courses. For each course, we carefully selected two teaching assistants, prioritizing their expertise and familiarity with the course content. These assistants were tasked with selecting and refining material directly from academic textbooks, with a specific focus on ensuring the relevance and challenge of each question and its set of distractors. To further enhance the quality and applicability of the dataset, we adopted a consensus-based approach: only those questions on which both the course instructor and the teaching assistant reached agreement were included. This rigorous selection criterion ensured that every question in the dataset not only aligns closely with academic standards but also effectively tests students' knowledge and critical thinking skills in a manner that is both challenging and educationally valuable.

Fig. 3. Benchmark dataset example.
3.2. Experiment

Based on our analysis of the existing literature, it can be inferred that the approaches employed for QA pair generation can be categorised into three distinct types: the pipeline, joint, and multi-task learning approaches. However, a number of specific approaches fall within these types. The experimental design therefore involved selecting approaches for this study based on four conditions. First, the approach must satisfy the specific requirements of the educational context, which include the generation of questions and corresponding answers. Second, approaches able to generate distractors are considered preferentially. Third, the approach must have stated applicability in the education context. Lastly, the approach must be designed specifically for the English language. Based on these conditions, three approaches were selected (Cui et al., 2021; Qu et al., 2021; Rodriguez-Torrealba et al., 2022b).

However, the selection of the pre-trained language models is a crucial factor that can significantly impact the performance of these approaches, as different pre-trained language models have been shown to exhibit bias based on the training data used and the model architectures. To demonstrate the performance of the proposed approaches explicitly, different pre-trained language models were selected and assessed respectively, to exploit the potential of each approach and to minimize the potential bias of different pre-trained language models. As a result, a combination of pre-trained language models with architectures comparable to those employed in the original research was used in conjunction with the selected approaches to generate QA pairs.

In this study, we meticulously selected pre-trained language models such as T5, ProphetNet, and BART based on specific criteria to ensure alignment with our research goals on QA pairs generation (Liu et al., 2019; Qi et al., 2020; Raffel et al., 2020; H. Wang et al., 2022). Key considerations included each model's architectural innovation, like T5's text-to-text framework, ProphetNet's n-gram prediction, and BART's hybrid approach, ensuring advanced comprehension and generation capabilities. Performance on benchmark datasets was crucial, guiding us to models with demonstrated excellence in NLP tasks. Flexibility for fine-tuning allowed us to adapt the models to our unique dataset, enhancing their effectiveness. The selection was further influenced by the level of community support and the availability of resources, ensuring ease of implementation and potential for collaboration. Lastly, engagement with models at the forefront of NLP research promised that our work would remain cutting-edge. This thoughtful selection process guaranteed that the models we chose were not just tools but integral to pushing the boundaries of QA pairs generation research.


Thus, these three pre-trained language models can serve as substitutes for each other in the selected approaches. In summary, Table 2 shows the combinations of the QA pairs generation approaches and pre-trained LMs. The experiments were carried out using the SQuAD dataset.

Table 2
The groups of QA pairs generation approaches and pre-trained LMs.

LLM | Pipeline approach | Joint learning approach | Multi-task learning approach
T5 | Pipeline + T5 | Joint + T5 | Multi-task + T5
BART | Pipeline + BART | Joint + BART | Multi-task + BART
ProphetNet | Pipeline + ProphetNet | Joint + ProphetNet | Multi-task + ProphetNet

Additionally, distractors were an essential part of each MCQ, alongside the generated question and correct answer. A T5-based distractor generation (DG) model, which takes the question-answer pair and the context paragraph as input, was implemented to generate distractors for the QA pairs produced by all selected approaches, forming complete MCQs (Rodriguez-Torrealba et al., 2022b); this corresponds to the distractor component of the pipeline approach.
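The distractor-generation step can be sketched in the same style as the earlier pipeline example: a T5 model receives the question, the correct answer, and the context, and decodes several candidate distractors via beam search. The checkpoint path and the input format are assumptions for illustration, not the released QAP distractor model.

```python
# Sketch of T5-based distractor generation: the model receives the question,
# the correct answer, and the context, and decodes candidate distractors.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

DG_CKPT = "path/to/t5-finetuned-distractor-generation"  # hypothetical checkpoint

def generate_distractors(question: str, answer: str, context: str, k: int = 3):
    tok = AutoTokenizer.from_pretrained(DG_CKPT)
    model = AutoModelForSeq2SeqLM.from_pretrained(DG_CKPT)
    prompt = f"question: {question} answer: {answer} context: {context}"
    inputs = tok(prompt, return_tensors="pt", truncation=True)
    outputs = model.generate(
        **inputs,
        num_beams=max(4, k),
        num_return_sequences=k,   # one decoded sequence per distractor
        max_new_tokens=32,
    )
    return [tok.decode(o, skip_special_tokens=True) for o in outputs]
```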

3.2.1. Fine-tuning

The fine-tuning stage started with data preprocessing. The processed data was then passed to the QA generation models shown in Table 2; in total, nine fine-tuned models were obtained. The data preprocessing method varies across the different QA pairs generation approaches; for each approach, we adhered to the original data processing strategy used in the source material. After the preparation of the data, the text is given to the tokenizer of each pre-trained language model. Tokenization means breaking the natural language into chunks of tokens considered as discrete elements (Liu et al., 2019). The subsequent step involves representing the text as a vector based on the frequency of tokens in the documents, thereby converting the text into a numerical data structure that can be used as input for the models.

In the process of fine-tuning our models, we conducted all our experiments using an NVIDIA V100 GPU on the Google Cloud platform. We started with an initial learning rate of 10e-5 and implemented a warm-up strategy to keep the learning process stable. The batch size was set to 8. Our main task was predicting the next token, which is a classic challenge in building models that generate text. To optimize the models' parameters, we used the cross-entropy loss function along with the AdamW optimizer.
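A minimal sketch of this optimisation setup, assuming a T5 checkpoint and a dataloader of already-tokenized source/target batches, is shown below. The warm-up and total step counts are placeholders; only the optimizer, scheduler type, batch size and loss follow the description above.

```python
# Sketch of the fine-tuning setup: AdamW, a warm-up schedule, batch size 8,
# and the cross-entropy (next-token) loss returned by the model when labels
# are supplied. Step counts are assumed placeholders.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForSeq2SeqLM, get_linear_schedule_with_warmup

model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # 10e-5 in the paper's notation
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=10_000)  # assumed values

def train_epoch(dataloader: DataLoader, device: str = "cuda") -> None:
    model.to(device).train()
    for batch in dataloader:  # batches of 8 tokenized source/target pairs
        batch = {k: v.to(device) for k, v in batch.items()}
        loss = model(**batch).loss  # cross-entropy over next-token predictions
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```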
3.3. Evaluation

During the evaluation stage, the benchmark dataset is utilised as input for the fine-tuned models in order to generate QA pairs and distractors. Subsequently, the performance of each approach is assessed through a comprehensive evaluation process involving automated, teacher, and real-world evaluations. The evaluation assesses the efficacy of the QA pairs generated by the different approaches, with the objective of determining their value in the context of educational applications.
3.3.1. Automatic evaluation
In the context of automatic evaluation, it was necessary to establish nine distinct groups (as presented in Table 2) for generating QA pairs, since we use three approaches along with three large language models. For example, in the context of the pipeline approach, three distinct groups were formulated, namely Pipeline + T5, Pipeline + BART, and Pipeline + ProphetNet.

The evaluation of these combinations involved established evaluation metrics, namely BLEU, METEOR, and ROUGE, to quantitatively assess their performance and derive corresponding scores. The average score is subsequently computed over every QA pair generated by each group. Based on the findings obtained from the evaluation results, the three groups (one group for each course) that exhibited the highest level of performance across all approaches were selected for the real-world educational evaluation, since evaluating all nine groups would entail a significant investment of time and effort. While automatic evaluation metrics have proven to be valuable in objectively assessing the quality of generated text against reference text, there remains a gap in their ability to evaluate the syntactic and semantic aspects of generated QA pairs (Qu et al., 2021).
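The scoring procedure can be sketched with the Hugging Face evaluate package as follows. Note that this sketch reports the aggregate BLEU score for brevity, whereas Tables 3-5 list BLEU-1 to BLEU-4 separately; treat it as an illustration of the averaging step rather than a reproduction of the exact evaluation configuration.

```python
# Sketch: score generated questions against references with BLEU, METEOR and
# ROUGE, then average the metrics per group. Requires: pip install evaluate nltk rouge_score
import evaluate

bleu = evaluate.load("bleu")
meteor = evaluate.load("meteor")
rouge = evaluate.load("rouge")

def score_group(predictions, references):
    scores = {}
    scores["bleu"] = bleu.compute(predictions=predictions,
                                  references=[[r] for r in references])["bleu"]
    scores["meteor"] = meteor.compute(predictions=predictions,
                                      references=references)["meteor"]
    r = rouge.compute(predictions=predictions, references=references)
    scores.update({k: r[k] for k in ("rouge1", "rouge2", "rougeL", "rougeLsum")})
    # Average of all computed metrics for the group, scaled to 0-100.
    scores["average"] = 100 * sum(scores.values()) / len(scores)
    return scores

print(score_group(["what is a database"], ["what is a relational database"]))
```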
3.3.2. Teachers' evaluation
To conduct a thorough assessment of the quality and educational utility of the generated QA pairs, it is crucial to include feedback from teachers. This approach aligns with the methodology proposed by Qu et al. (2021), emphasizing the importance of teacher insights in evaluating QA pairs' effectiveness in real-world educational contexts. To this end, three teachers, each representing a distinct course of study, were carefully selected to participate in the creation of assessments and to provide detailed feedback through in-depth interviews. These interviews were designed to critically assess the quality and utility of the QA pairs.

The teachers were then tasked with organizing the QA pairs and associated distractors into a series of multiple-choice questions (MCQs). This step was undertaken to facilitate the creation of assessments or exercises tailored to the educational content. A total of 48 assessments were crafted, with 19 assessments for the Programming 1 course, 16 for the Big Data course, and 13 for the Enterprise Computing and ERP Systems course, reflecting the specific content and learning objectives of each course. Subsequent to the assessment creation, in-person semi-structured interviews were conducted with each teacher to gauge the assessments' quality. These interviews were meticulously recorded, and significant insights were documented by the interviewers. The recordings and key observations were subsequently compiled and subjected to a thematic analysis, aiming to distill and interpret the essence of the interview discussions, thereby providing a nuanced understanding of the educational value and potential areas for improvement of the QA pairs.

3.3.3. Real-world educational evaluation
In the real-world educational evaluation stage, the objective was to assess the influence of the generated assessments on the academic performance of students across a range of courses. Therefore, half of the students in each course were randomly selected to receive the assessments, while the remainder were not allowed to interact with the assessments. At the end of the courses, a comprehensive data collection and analysis process was implemented in order to assess the influence of the delivered assessments on the academic performance of the students.

In terms of data collection, data related to the academic performance of students, as well as their efforts in completing the assessments, were gathered for the purpose of this study.


This encompassed the final examination scores of each student, as well as the number of assessment attempts made by each student throughout the duration of the course. In terms of data analysis, a pair of statistical tests was conducted in order to assess the influence of the assessments on the academic performance of the students. To determine the difference in academic performance between students who engaged in the assessments and those who did not, an independent samples t-test was conducted. To investigate the potential association between assessment attempts and students' academic achievement, this research employed Pearson and Spearman's correlation analyses.
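The two analyses map directly onto standard SciPy routines; the sketch below uses dummy score arrays purely to show the calls involved.

```python
# Sketch of the statistical analysis: independent-samples t-test between the
# assessment and non-assessment groups, and Pearson/Spearman correlations
# between assessment attempts and final exam scores. Data values are dummies.
from scipy import stats

group1_scores = [231, 250, 212, 198, 275]   # students who used the assessments
group2_scores = [205, 190, 220, 188, 201]   # students who did not

t_stat, p_value = stats.ttest_ind(group1_scores, group2_scores)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

attempts = [3, 10, 7, 1, 12]
final_scores = [210, 270, 240, 190, 280]
pearson_r, pearson_p = stats.pearsonr(attempts, final_scores)
spearman_r, spearman_p = stats.spearmanr(attempts, final_scores)
print(f"Pearson r = {pearson_r:.2f} (p = {pearson_p:.3f})")
print(f"Spearman rho = {spearman_r:.2f} (p = {spearman_p:.3f})")
```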
4. Results

In this section, we present the results of the experiments conducted to evaluate the effectiveness of the approaches. These approaches were evaluated in terms of their ability to correctly generate QA pairs and how they affected students' academic success.

4.1. Results of automatic evaluation

The results of the automatic evaluation are presented in Table 3, Table 4 and Table 5 for the benchmark datasets of each group (approach + model).

These tables show the results of various automatic evaluation metrics, including BLEU-1, BLEU-2, BLEU-3, BLEU-4, ROUGE-1, ROUGE-2, ROUGE-L, ROUGE-Lsum, and METEOR, for three different pipeline groups (Pipeline + T5, Pipeline + BART, Pipeline + ProphetNet), three different joint groups (Joint + T5, Joint + BART, Joint + ProphetNet), and three different multi-task learning groups (Multi-task + T5, Multi-task + BART, Multi-task + ProphetNet). The values in the tables represent the average score of each metric across all the QA pairs in the benchmark datasets. The higher the score, the better the performance of the group in generating high-quality answers. For example, the BLEU-1 score for the Pipeline + T5 group is 36.15, which means that, on average, this group generates answers that match 36.15% of the unigrams in the reference answers.

In terms of approaches, it can be observed that the multi-task approach outperforms both the pipeline and joint approaches across all benchmark datasets representing the three distinct courses. Notably, in the context of Programming 1, the pipeline method exhibits performance levels comparable to those of the multi-task method, while the joint approach lags behind both. In the case of the Enterprise Modelling and Big Data courses, the multi-task approach displayed the highest performance, while the joint and pipeline methods demonstrated comparable performance levels, with the joint model marginally outperforming the pipeline method.

In the context of LLMs, T5 demonstrates enhanced performance in comparison to the other LLMs across the range of methodologies, including the pipeline, joint, and multi-task approaches. Furthermore, the findings of this study indicate that across all three courses, T5 consistently demonstrated its efficacy. However, it is worth noting that within the specific context of the Programming 1 course, BART exhibited superior performance when utilising the multi-task approach compared to the other LLMs. Nevertheless, BART's performance, even in this scenario, was still inferior to that of T5 with the pipeline approach.

Based on the findings presented in the tabulated data (Tables 3-5) and following the established procedures for the real-educational evaluation (Section 3.3.3), it is considered appropriate to exclusively utilise the most favourable outcome obtained from each respective group for the purpose of generating QA pairs. Henceforth, the QA pairs generated by the Pipeline + T5 group were selected for the Programming 1 course. Similarly, for the Big Data and Enterprise Modelling courses, QA pairs obtained from the Multi-task + T5 group were chosen.

4.2. Results of teachers' evaluation

The thematic analysis of interview data on the evaluation of QA pairs identified five themes: correctness, understandability, difficulty level, knowledge impact, and utility impact (as shown in Fig. 4).

4.2.1. Theme definitions
Correctness assesses the factual accuracy of questions, answers, and distractors, ensuring they are relevant and appropriate. Understandability evaluates the clarity and accessibility of language and presentation, making sure the content is easy to grasp. The difficulty level theme examines the complexity of the QA pairs, ensuring they are suitably challenging to effectively test student knowledge. Knowledge impact looks at the educational value of the QA pairs, from basic to advanced knowledge, assessing their scope in enhancing learning. Lastly, utility impact considers the overall usefulness of the QA pairs, evaluating how well they support learning outcomes, teaching effectiveness, and course quality. Together, these themes provide a comprehensive overview of the quality and educational significance of the QA pairs.

4.2.2. Results
The evaluation of the generated QA pairs across the three courses was conducted using a precise numerical scale, enabling a nuanced assessment of each dimension from low to high performance (as shown in Fig. 5). The evaluation results revealed high levels of teacher satisfaction with the questions' correctness, underscoring the QA pairs' accuracy, relevance, and appropriateness across different subjects. However, while the answers in the Programming 1 and Big Data courses were highly praised for correctness, the Enterprise Modelling course's results indicated a medium level of satisfaction.

Table 3
Results of automatic evaluation metrics for Programming 1.

Metric | Pipeline + T5 | Pipeline + BART | Pipeline + ProphetNet | Joint + T5 | Joint + BART | Joint + ProphetNet | Multi-task + T5 | Multi-task + BART | Multi-task + ProphetNet
BLEU-1 | 36.15 | 33.64 | 23.40 | 30.33 | 27.37 | 26.60 | 29.75 | 30.98 | 22.64
BLEU-2 | 22.36 | 20.90 | 11.70 | 20.08 | 16.96 | 16.65 | 19.22 | 21.30 | 12.57
BLEU-3 | 15.79 | 15.02 | 7.14 | 15.30 | 12.49 | 12.37 | 13.90 | 16.39 | 8.10
BLEU-4 | 12.02 | 11.82 | 5.03 | 12.39 | 10.13 | 10.07 | 10.60 | 13.22 | 5.76
ROUGE-1 | 32.02 | 30.63 | 21.12 | 20.24 | 17.00 | 15.73 | 31.28 | 31.19 | 23.80
ROUGE-2 | 13.00 | 12.76 | 5.40 | 9.14 | 6.93 | 6.64 | 14.39 | 15.80 | 6.86
ROUGE-L | 26.36 | 23.96 | 15.96 | 18.87 | 15.88 | 14.34 | 25.05 | 26.87 | 18.81
ROUGE-Lsum | 26.37 | 24.08 | 15.98 | 18.88 | 15.92 | 14.36 | 25.02 | 26.86 | 18.75
METEOR | 31.71 | 29.48 | 25.37 | 20.03 | 17.76 | 19.42 | 35.49 | 34.20 | 32.69
Average score | 23.98 | 22.48 | 14.57 | 18.36 | 15.60 | 15.13 | 22.74 | 24.09 | 16.66

(Values rounded to two decimal places.)


Table 4
Results of automatic evaluation metrics for Big Data.

Metric | Pipeline + T5 | Pipeline + BART | Pipeline + ProphetNet | Joint + T5 | Joint + BART | Joint + ProphetNet | Multi-task + T5 | Multi-task + BART | Multi-task + ProphetNet
BLEU-1 | 23.28 | 18.22 | 20.79 | 19.91 | 18.94 | 16.43 | 35.14 | 32.38 | 22.77
BLEU-2 | 16.51 | 10.56 | 11.87 | 13.27 | 10.82 | 8.69 | 26.97 | 24.94 | 15.10
BLEU-3 | 13.02 | 6.74 | 8.34 | 9.40 | 7.01 | 5.78 | 22.47 | 21.15 | 11.44
BLEU-4 | 10.90 | 4.46 | 6.37 | 6.91 | 4.90 | 4.26 | 19.44 | 18.85 | 9.32
ROUGE-1 | 27.28 | 21.05 | 16.73 | 24.29 | 19.15 | 14.60 | 29.46 | 28.65 | 17.93
ROUGE-2 | 13.84 | 8.53 | 5.93 | 12.09 | 7.22 | 4.71 | 16.90 | 17.50 | 7.78
ROUGE-L | 26.94 | 20.31 | 15.20 | 23.80 | 18.40 | 13.93 | 28.24 | 27.96 | 17.19
ROUGE-Lsum | 26.89 | 20.34 | 15.16 | 23.73 | 18.43 | 13.98 | 28.30 | 27.81 | 17.25
METEOR | 29.24 | 23.04 | 22.96 | 27.69 | 22.48 | 23.30 | 35.13 | 35.43 | 32.19
Average score | 20.88 | 14.80 | 13.70 | 17.90 | 14.15 | 11.74 | 26.89 | 26.08 | 16.77

(Values rounded to two decimal places.)

Table 5
Results of automatic evaluation metrics for Enterprise Modelling.

Metric | Pipeline + T5 | Pipeline + BART | Pipeline + ProphetNet | Joint + T5 | Joint + BART | Joint + ProphetNet | Multi-task + T5 | Multi-task + BART | Multi-task + ProphetNet
BLEU-1 | 24.90 | 25.03 | 20.67 | 24.49 | 23.16 | 21.11 | 25.87 | 25.32 | 16.21
BLEU-2 | 16.28 | 15.90 | 7.97 | 16.46 | 14.18 | 12.36 | 17.63 | 17.63 | 9.61
BLEU-3 | 11.61 | 11.39 | 3.95 | 12.02 | 9.65 | 8.27 | 12.94 | 13.02 | 6.20
BLEU-4 | 8.47 | 8.59 | 2.17 | 8.82 | 6.97 | 6.09 | 9.58 | 9.64 | 4.45
ROUGE-1 | 23.42 | 21.28 | 16.06 | 24.79 | 19.02 | 17.49 | 27.90 | 26.65 | 19.23
ROUGE-2 | 10.30 | 8.61 | 2.62 | 11.81 | 7.60 | 6.52 | 14.85 | 14.19 | 7.09
ROUGE-L | 22.07 | 19.83 | 13.82 | 23.70 | 18.05 | 16.49 | 26.28 | 25.00 | 17.45
ROUGE-Lsum | 22.03 | 19.74 | 13.86 | 23.68 | 18.08 | 16.50 | 26.22 | 25.05 | 17.47
METEOR | 22.66 | 21.83 | 18.16 | 23.98 | 19.90 | 20.84 | 30.27 | 29.60 | 25.11
Average score | 17.97 | 16.91 | 11.03 | 18.86 | 15.18 | 13.97 | 21.28 | 20.68 | 13.65

(Values rounded to two decimal places.)

In terms of understandability, the clarity and accessibility of the QA pairs were largely accepted, particularly the questions themselves, which were universally deemed highly understandable by all teachers. Nevertheless, the understandability of answers and distractors did not achieve unanimous high ratings, especially from the Enterprise Modelling and Programming 1 teachers, suggesting a need for improvement in how answers and distractors are articulated to ensure they are easily comprehensible.

The feedback on the difficulty level presented a mixed picture, indicating significant room for enhancement to better meet educational goals. While the Programming 1 course's teacher found the questions to be of a high difficulty level, aligning with desired standards, the responses regarding the difficulty levels of answers and distractors varied widely. This diversity in feedback highlights an imperative to recalibrate the challenge presented by the QA pairs, ensuring they are sufficiently demanding to stimulate student engagement and learning.

Teachers praised the QA pairs for their positive impact on imparting basic and concept-wise knowledge, with high evaluations across these areas. Yet, the advanced knowledge impact received lower ratings from the Programming 1 and Enterprise Modelling teachers, suggesting the necessity to delve deeper into complex topics and elevate the academic rigor of the QA pairs to foster a more comprehensive understanding among students.

Regarding utility impact, the QA pairs were recognized for their beneficial support towards students, teachers, and the overall course objectives, particularly in facilitating student learning and aiding teachers in assessing student comprehension. However, feedback indicated that improvements could be made in enhancing course support, as suggested by the medium level of satisfaction from the Enterprise Modelling teacher. This aspect underscores the potential to further refine the QA pairs to better align with and support the overarching goals of the courses.

In conclusion, while the method for generating quiz-style MCQs was deemed effective for educational purposes, especially in terms of correctness and understandability, the detailed feedback from teachers suggests several avenues for improvement. Enhancing the difficulty level, broadening the coverage of advanced knowledge, and bolstering course support emerged as key areas for future development.

4.3. Results of real-educational evaluation

In the real-educational evaluation, the created QA pairs were organized into a series of MCQs to form assessments, which were delivered to students during the courses (Section 3.3.3). At the end of the courses, statistical analyses were carried out to investigate the effects of the assessments on students' academic performance.

In this evaluation, the participants were divided into two distinct groups: Group 1, comprising individuals who were granted the opportunity to engage in assessments throughout the duration of the course, and Group 2, consisting of individuals who were not provided access to the assessments. Following the completion of the aforementioned tasks, a series of statistical analyses were conducted. The initial analysis aimed to determine any differences in academic performance between students who engaged in the assessments and those who did not. To achieve this, an independent samples t-test was employed.


Fig. 4. Thematic analysis.

To investigate the potential association between assessment attempts and students' academic achievement, this research employed Pearson and Spearman's correlation analyses.

The results obtained from the t-test (t and p-values in Table 7) showed a statistically significant difference between the groups in all three courses. The observed difference between groups is more pronounced in the context of the Big Data course when compared to the other two courses. Table 6 shows statistics for both groups in all three courses. As presented in Table 6, Group 1's mean in all three courses was greater than Group 2's. The greater mean for Group 1 demonstrates that participants who performed assessments during the courses performed better in the final exam than non-participants (Group 2). The maximum scores for the final exams in the Programming 1, Big Data, and Enterprise Modelling courses are 300, 50, and 50, respectively.

The findings of the correlation analysis, as displayed in Table 8, revealed a favourable correlation between assessment attempts and students' academic performance. According to the findings shown in Table 8, the highest correlation coefficient observed for both the Pearson and Spearman techniques was 0.67, specifically in regard to the Programming 1 course. The Enterprise Modelling course exhibited the lowest correlation of 0.59 when analysed using both techniques. In the Big Data course, the correlation showed a modest improvement compared to the Enterprise Modelling course, but it declined in comparison to the Programming 1 course. Based on the findings of this analysis, it can be inferred that there exists a positive correlation between the number of assessments attempted by students throughout the course and their performance on the final exam.

5. Discussion

This study implements three distinct approaches and LLMs for generating QA pairs, which were fine-tuned using the SQuAD and DG-RACE datasets. To assess their performance in educational settings, we developed benchmark datasets for three separate courses. These approaches were then evaluated on these benchmarks through a combination of automatic methods, teacher assessments, and real-world educational evaluations.

5.1. Findings

We began by addressing the first research question of how to develop a robust framework to evaluate the effectiveness of different QA pairs generation approaches in higher education, ensuring that these technologies meet the specific needs of educational stakeholders. Our results from automatic evaluations and teacher assessments provide a comprehensive view of the performance of these technologies across various educational contexts.

Our findings highlight that the multi-task approach generally outperforms the pipeline and joint methods in generating QA pairs, as evidenced by the high scores across multiple automatic evaluation metrics, particularly for the Programming 1 and Big Data courses. This indicates a robust capability of the multi-task approach to produce quality QA pairs that align closely with reference answers in these courses. Additionally, the T5 model showed superior performance across all methods and was consistently effective, further supporting its use in our recommended framework for generating QA pairs.

However, when considering the application of these approaches in a real educational setting, the feedback from teachers provides essential insights that should influence the evaluation framework. Teachers reported high satisfaction with the accuracy and relevance of the questions generated, particularly praising the QA pairs for their correctness in the Programming 1 and Big Data courses. This suggests that the QA pairs generated are not only technically competent but also applicable and useful in an educational context. Despite these positives, the feedback also pointed out areas needing improvement. The medium satisfaction levels in the Enterprise Modelling course and the mixed feedback on the understandability of answers and distractors across courses highlight the need for adjustments in how these QA pairs are formulated.

9
J. Ling and M. Afzaal Computers and Education: Artificial Intelligence 6 (2024) 100252

Fig. 5. Heatmap of thematic analysis of teachers evaluation.

variance underscores the importance of context and subject specificity in


Table 6
the development and application of QA generation tools.
Statistics on groups of students.
Additionally, the teacher evaluations revealed that while the ques­
Course Group N (users/ Mean Std. tions were effective at covering basic and conceptual knowledge, there
students) deviation
was a noted need for deeper coverage of advanced topics, particularly in
Programming 1 Group 1 120 231.51 59.68 the Programming 1 and Enterprise Modelling courses. This feedback is
(Assessments users)
crucial for refining the QA pairs to ensure they not only test but also
Group 2 (Non- 120 205.35 50.06
Assessments users)
enhance student understanding of more complex subject matter.
Big Data Group 1 70 44.66 20.08 These findings lead us to the second research question, in which we
(Assessments users) aimed to examine the effects of the generated QA pairs on students’
Group 2 (Non- 70 32.37 8.27 academic performance. To measure the effects of QA pairs we delivered
Assessments users)
MCQs assessments to students during the courses and at the end we
Enterprise Group 1 40 40.33 16.05
Modelling (Assessments users) performed two statistical analyses on final exam scores and students’
Group 2 (Non- 40 33.53 9.34 assessment attempts. The results in Section 4.2 raised two important
Assessments users) findings. First, the generated QA pairs positively impacted students’
academic performance because students who were allowed to attempt
assessments during the course achieved higher final exam scores than
Table 7 those who were not. Second, a positive correlation was found between
Results of the independent samples t-test. the number of assessment attempts that students made during the course
Course Sig. t df Sig.(Two-Sided p) and their final exam scores. Based on these findings, the answer to the
second question is that the generated assessments positively affected
Progamming 1 < 0.001 3.67 238 < 0.001
Big Data < 0.001 4.73 138 < 0.001
students’ academic performance, and students who attempted more
Enterprise Modelling < 0.001 2.31 78 0.023 assessments had a higher tendency to achieve better scores in the final
exams of the course than those who attempted fewer ones.
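For transparency about the statistical procedures behind these findings, the snippet below sketches how an independent samples t-test and the Pearson and Spearman correlations could be computed with SciPy. The simulated arrays are placeholders only and do not reproduce the study data.

# Sketch of the two analyses reported above, using SciPy.
# The simulated scores and attempt counts are illustrative placeholders,
# not the actual student data from this study.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Final exam scores for assessment users (Group 1) and non-users (Group 2).
group1 = rng.normal(loc=231.5, scale=59.7, size=120)
group2 = rng.normal(loc=205.4, scale=50.1, size=120)

# Independent samples t-test, as reported in Table 7.
t_stat, p_value = stats.ttest_ind(group1, group2)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# Assessment attempts vs. final exam score, as reported in Table 8.
attempts = rng.integers(low=0, high=30, size=120)
scores = group1  # placeholder pairing, for illustration only

pearson_r, pearson_p = stats.pearsonr(attempts, scores)
spearman_r, spearman_p = stats.spearmanr(attempts, scores)
print(f"Pearson r = {pearson_r:.2f} (p = {pearson_p:.4f})")
print(f"Spearman rho = {spearman_r:.2f} (p = {spearman_p:.4f})")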


Table 8
Correlations between assessment attempts and final exam score.

Course                 Correlation technique   N     Variables                                  r      Sig. (2-tailed)
Programming 1          Pearson                 120   Assessment attempts vs. final exam score   0.67   < 0.001
                       Spearman                120   Assessment attempts vs. final exam score   0.67   < 0.001
Big Data               Pearson                 70    Assessment attempts vs. final exam score   0.65   < 0.001
                       Spearman                70    Assessment attempts vs. final exam score   0.64   < 0.001
Enterprise Modelling   Pearson                 40    Assessment attempts vs. final exam score   0.59   < 0.001
                       Spearman                40    Assessment attempts vs. final exam score   0.59   < 0.001
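For reference, the coefficients reported in Table 8 are the standard Pearson and Spearman statistics. For attempt counts x_i, exam scores y_i, and (for Spearman, assuming no tied ranks) rank differences d_i:

r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^{2}}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^{2}}},
\qquad
\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^{2}}{n(n^{2} - 1)}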

5.2. Implications

The findings of this research have significant implications for the development of QA generation tools, especially in the higher educational context. Firstly, the study provides valuable insights into the strengths and limitations of different approaches to generating QA pairs, which have proven useful in this context. The integration of these tools into diverse curricula across disciplines introduces a transformative potential for teaching and learning practices. Teachers must be supported through professional development programs to incorporate these technologies adeptly into their strategies.

Secondly, automated QA pairs can enhance student learning outcomes and progress in courses. These tools can revolutionize assessment practices by enabling more formative assessments that provide immediate feedback and support personalized learning paths. Moreover, they encourage higher student engagement and motivation by promoting self-directed learning and critical thinking. Thirdly, the use of automatically generated QA pairs can alleviate the burden on teachers of manually creating multiple-choice questions (MCQs) for their students. This shift can lead to a more efficient allocation of teacher effort towards more impactful educational activities, such as personalized student guidance and curriculum development.

The ethical and societal implications also demand careful consideration. Ensuring equitable access to these tools across different student populations is critical. This involves addressing not only the digital divide but also designing these tools to be inclusive and accessible to students with disabilities. Moreover, the quality of the generated QA pairs is crucial; biases or limitations in the automated generation process must be addressed to ensure that the quizzes are unbiased, relevant, and aligned with educational objectives. Institutional strategies and policies will also need to evolve to support the ethical use of these technologies. This includes developing guidelines that address data privacy, combat academic dishonesty, and ensure that these tools align with broader educational goals and standards.

Finally, there is a need for ongoing research into the long-term impacts of these tools on educational outcomes, their scalability in diverse educational settings, and the user experience of both students and educators. By expanding the scope of these discussions, the adaptation of automated QA pair generation can contribute positively to the educational context, supporting more efficient and effective learning and assessment practices.

5.3. Limitations and future research

The benchmark datasets utilised in this study were derived exclusively from course materials in computer science-related subjects. This strategy may limit the generalizability of the findings and could introduce bias into the evaluation outcomes. Furthermore, our study focused exclusively on the evaluation of three prominent language models; we did not investigate alternative models or the potential synergistic effects of combining different models, which could lead to enhanced performance. These limitations underscore the necessity for additional investigation employing larger and more diverse samples, incorporating multiple stakeholders, employing more varied and realistic datasets, and exploring a broader array of models and methodologies. In addition, it is imperative to generate a diverse range of benchmark datasets across various subject areas in order to thoroughly evaluate the efficacy of the chosen methodologies.

6. Conclusion

In the context of higher education, it is the prerogative of the teacher to develop assessments that serve as valuable tools in aiding students' understanding of course concepts. To conduct thorough assessments, teachers require a collection of QA pairs that are directly relevant to the course being taught. In the current study, three distinct methodologies built on pre-trained large language models were employed to generate QA pairs automatically, with the primary objective of enhancing educational practice. To assess the efficacy of these approaches and compare their performance for educational purposes, a comprehensive evaluation was carried out using both automatic evaluation techniques and real-world educational evaluation. The evaluation was conducted on benchmark datasets related to three distinct courses, namely Programming 1, Big Data, and Enterprise Modelling. The findings indicate that the multi-task approach exhibited superior performance compared to the other two approaches across all benchmark datasets during the automatic evaluation. The generated QA pairs demonstrated a favourable utility impact, effectively providing support to students, although there is room for improvement in how well they facilitate course support. To enhance the comprehensiveness and representativeness of the findings, future investigations should consider expanding the pool of participants and incorporating a wider range of courses.

CRediT authorship contribution statement

Jintao Ling: Writing – original draft, Software, Methodology, Formal analysis, Conceptualization. Muhammad Afzaal: Writing – review & editing, Validation, Supervision, Resources, Methodology.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix A. Standards of creating multiple-choice questions

The question creation criteria.

1. Clearly identify the specific point being tested by the item
2. Include the main idea and phrasing in the stem
3. Avoid including irrelevant clues that may lead to the correct answer
4. Eliminate unnecessary or extraneous information in the question stem


5. Use negative words in the stem sparingly, and if used, underline or capitalize them to avoid confusion
6. Ensure that the question is clear and students understand what is expected of them

The options creation criteria.

1. Make sure each option is mutually exclusive and does not overlap with others
2. Ensure that the options are as similar as possible in terms of structure and phrasing
3. Keep the grammar of each option consistent with that of the question stem
4. Ensure that there is only one correct or best answer among the options
5. Use plausible and realistic distractors to challenge the students
6. Incorporate common errors or misconceptions made by students in the distractors
7. Keep the structure of the options parallel and consistent
8. Ensure that the length of the options is similar to avoid giving clues to the correct answer
9. Avoid using the options "all of the above" and "none of the above" unless necessary
10. Avoid using absolute determiners such as "never" and "always"
11. Randomize the position of the correct answer among the options
12. Use letters instead of numbers to label the options
13. Replace any distractors that are not chosen by any examinees
14. Avoid using humor in the question or options.
14. Avoid using humor in the question or options. in Natural Language Processing (EMNLP).
Rodriguez-Torrealba, R., Garcia-Lopez, E., & Garcia-Cabot, A. (2022a). End-to-end
generation of multiple-choice questions using text-to-text transfer transformer
models. Expert Systems with Applications, 208, Article 118258.
References Rodriguez-Torrealba, R., Garcia-Lopez, E., & Garcia-Cabot, A. (2022b). End-to-End
generation of Multiple-Choice questions using Text-to-Text transfer Transformer
Akyon, F. C., Cavusoglu, D., Cengiz, C., Altinuc, S. O., & Temizel, A. (2022). Automated models. Expert Systems with Applications, 208, Article 118258. https://doi.org/
question generation and question answering from Turkish texts. https://doi.org/10. 10.1016/j.eswa.2022.118258
48550/arXiv.2111.06476. Comment: 14 pages, 1 figure, 13 tables. Torres, C., Lopes, A. P., Azevedo, J. M., & Babo, L. (2009). Developing multiple choice
Alberti, C., Andor, D., Pitler, E., Devlin, J., & Collins, M. (2019). Synthetic QA corpora questions in mathematics [Accepted: 2012-07-30T11:05:13Z]. Retrieved February
generation with roundtrip consistency. Proceedings of the 57th annual meeting of the 19, 2023, from https://recipp.ipp.pt/handle/10400.22/586.
association for computational linguistics. https://doi.org/10.18653/v1/P19-1620 Wang, H., Li, J., Wu, H., Hovy, E., & Sun, Y. (2022). pre-trained language models and
Chan, Y.-H., & Fan, Y.-C. (2019). A recurrent BERT-based model for question generation. their applications. Engineering. https://doi.org/10.1016/j.eng.2022.04.024
Proceedings of the 2nd workshop on machine reading for question answering. https://doi. Wang, H. C., Maslim, M., & Kan, C. H. (2023). A question–answer generation system for
org/10.18653/v1/D19-5821 an asynchronous distance learning platform. Education and Information Technologies,
Cui, S., Bao, X., Zu, X., Guo, Y., Zhao, Z., Zhang, J., & Chen, H. (2021). OneStop 28(9), 12059–12088.
QAMaker: Extract question-answer pairs from text in a one-stop approach [arXiv: Yao, B., Wang, D., Wu, T., Zhang, Z., Li, T. J.-J., Yu, M., & Xu, Y. (2022). It is AI’s turn to
2102.12128 [cs]] https://doi.org/10.48550/arXiv.2102.12128. ask humans a question: Question-answer pair generation for children’s story books.
Gabajiwala, E., Mehta, P., Singh, R., & Koshy, R. (2022). Quiz maker: Automatic quiz https://doi.org/10.48550/arXiv.2109.03423. Comment: Accepted to ACL 2022.
generation from text using NLP. In Futuristic trends in networks and computing Yi, C., Zhu, R., & Wang, Q. (2021). Exploring the interplay between questionanswering
technologies (pp. 523–533). Springer Nature. https://doi.org/10.1007/978981-19- systems and communication with instructors in facilitating learning. Internet
5037-737. Research, 32(7), 32–55.
Gao, Y., Bing, L., Li, P., King, I., & Lyu, M. R. (2018). Generating distractors for reading Zhong, W., Gao, Y., Ding, N., Qin, Y., Liu, Z., Zhou, M., Wang, J., Yin, J., & Duan, N.
comprehension questions from real examinations. https://doi.org/10.48550/arXiv.1 (2022). ProQA: Structural prompt-based pre-training for unified question answering
809.02768. Comment: AAAI2019. [arXiv:2205.04040 [cs]] https://doi.org/10.48550/arXiv.
