Automatic Question-Answer Pairs Generation Using Pre-Trained Large Language Models in Higher Education
A R T I C L E I N F O

Keywords:
Pre-trained language model
Question-answer pairs generation
Higher education
Automatic evaluation
Real-educational evaluation

A B S T R A C T

The process of manually generating question and answer (QA) pairs for assessments is known to be a time-consuming and energy-intensive task for teachers, specifically in higher education. Several studies have proposed various methods utilising pre-trained large language models for the generation of QA pairs. However, these methods have primarily been evaluated on datasets that are not specifically educational in nature. Furthermore, the evaluation metrics and strategies employed in these studies differ significantly from those typically used in educational contexts. As a result, the existing literature does not present a compelling case regarding the efficacy and practicality of the stated methods within the context of higher education. This study aimed to examine multiple QA pairs generation approaches in relation to their performance, efficacy and constraints within the context of higher education. The approaches encompassed in this study comprise the pipeline, joint and multi-task approaches. The performance of the approaches under consideration was assessed on three datasets related to distinct courses. The evaluation integrates three automated methods, teacher assessments, and real-world educational evaluations to provide a comprehensive analysis. The approaches were compared by directly assessing their performance using the average scores of different automatic metrics on the three datasets. The results of the teacher and real educational evaluations indicate that the generated assessments were beneficial in enhancing students' understanding of concepts and their overall performance. The findings of this study hold significant importance in enhancing the efficacy of QA pair generation tools within the context of higher education.
considering the application of these methodologies within the field of education, it becomes challenging to offer a definitive recommendation for approach selection due to the difficulty in directly comparing their efficacy. While certain literature sources acknowledge the potential use of their approaches in an educational context, there is a lack of empirical evaluation of these approaches in real-world educational settings. Additionally, the datasets used to generate such pairs (e.g., the extensively utilised SQuAD dataset, which contains over 100,000 question-answer pairs derived from a collection of Wikipedia articles) were designed for reading comprehension (Rajpurkar et al., 2016). However, using datasets like SQuAD for educational efficacy assessment is inappropriate, as they were not specifically designed for educational purposes (Lelkes et al., 2021). As a response, diverse methodologies have been developed for QA pair generation, each holding potential for application in higher education.

In this context, we pose the following research questions.

• Question 1: How can we develop a robust framework to evaluate the effectiveness of different QA pairs generation approaches in the higher education context, ensuring that these technologies meet the specific needs of educational stakeholders?
• Question 2: What impact do automatically generated QA pairs have on students' academic performance and engagement in real educational environments?

The first research question underscores the necessity of a tailored evaluation that transcends mere technical performance, integrating pedagogical considerations to truly assess the value of these technologies in educational settings. The second research question is critical as it seeks to directly link the use of innovative assessment technologies with measurable outcomes in student learning and engagement, thus providing a compelling argument for the integration of such tools in educational practice.

By proposing a methodology that evaluates the effectiveness of different QA pair generation approaches (the pipeline, joint and multi-task approaches) using pre-trained large language models (LLMs) across three benchmark datasets derived from real educational materials, this paper aims to fill the aforementioned gaps. These datasets reflect different courses taught at the university and were generated specifically for evaluating the different methodologies. The evaluation process involves automatic evaluation, teacher evaluation and evaluation in real educational settings within the context of higher education. Our findings reveal that the multi-task learning approach, particularly when utilising the T5 LLM, outperforms its counterparts, demonstrating a notable positive impact on student academic performance. Teachers expressed high satisfaction with the question accuracy and relevance, acknowledging the technical competence and educational utility of the QA pairs, while noting the need for improvements in understandability and consistency across different courses. Moreover, our investigation shows that the use of generated QA pairs positively influences students' academic performance. Specifically, it was observed that students engaging more frequently with QA pair-based assessments tended to perform better in their final examinations, indicating the significant benefits of integrating these technologies into educational assessments.

2. Background

The utilisation of pre-trained language models has yielded remarkable achievements in the domain of natural language processing (NLP). Consequently, numerous methodologies have been put forth to address the task of generating question-answer pairs, leveraging the capabilities of these pre-trained language models. These methodologies are not merely technical achievements; they have practical implications for educational practices, particularly in crafting personalized and adaptive learning experiences (Yi et al., 2021).

These approaches have demonstrated substantial enhancements in performance, as evidenced by empirical evaluations (Rodriguez-Torrealba et al., 2022b). The various approaches can be broadly classified into three distinct categories based on their architectural designs (Qu et al., 2021; Rodriguez-Torrealba et al., 2022b). These categories include the pipeline approach, joint learning approach, and multi-task learning approach.

2.1. Pipeline approach

The pipeline approach is a straightforward methodology in which the processes of question generation and answer generation are executed sequentially. In a recent study (Rodriguez-Torrealba et al., 2022b), the authors introduced a novel processing pipeline that leverages the T5 language model to generate question-answer pairs. This method's superior performance in generating coherent and contextually relevant QA pairs suggests its potential to enhance learning by providing more engaging and challenging materials in higher education settings (Johnson et al., 2024). The model, known as the QAP Model (Question/Answer Pairs Model), leverages the inherent duality between question generation (QG) and QA to achieve its objectives. The T5 model is utilised for generating both questions and answers, with a subsequent model generating distractors to provide incorrect answer options, which enhances the learning challenge.

In another study (Alberti et al., 2019, pp. 6168–6173), the authors introduced a roundtrip consistency mechanism to further enhance the coherence of the generated QA pairs. This mechanism involves a detailed comparison between initial and subsequent responses to ensure coherence, excluding pairs that do not meet the criteria. In the context of education, it is crucial to tailor the generation of student questions to meet educational needs rather than solely assessing comprehension (Yao et al., 2022). The FairytaleQA dataset, annotated by domain experts, is utilised to refine the rules for answer extraction, although limitations exist in covering the broad spectrum of educational subjects.

Researchers also propose integrating a text summarization module within the pipeline approach to streamline the extraction of crucial information and the formulation of questions and answers (Gabajiwala et al., 2022). This optimized approach utilizes advanced tools like ConceptNet, WordNet, and Sense2vec to ensure relevance and accuracy in the generated educational content.

The straightforward pipeline approach using LLMs like T5 in higher education allows for ease of implementation, which is beneficial for teachers without deep technical expertise. It facilitates the customization and scalability of educational content, making it adaptable to various class sizes and curriculum needs (Kurdi et al., 2020). Moreover, the modular nature of the pipeline permits iterative enhancements based on feedback, contributing to continuous improvement in teaching methodologies and student learning outcomes. This methodological simplicity thus not only democratizes access to advanced AI technologies in educational settings but also enhances the overall educational experience by providing reliable and adaptable tools for both teaching and assessment (H. C. Wang et al., 2023).
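To make the sequential structure of this approach concrete, the following minimal sketch chains two seq2seq generation steps with the Hugging Face transformers library. The checkpoint names and the "generate question:" / "question: ... context: ..." prefixes are illustrative assumptions rather than the exact models or prompts used in the studies above.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder checkpoints: in practice these would be T5 models fine-tuned for
# question generation (QG) and question answering (QA), respectively.
QG_CKPT = "t5-base"
QA_CKPT = "t5-base"

tok = AutoTokenizer.from_pretrained(QG_CKPT)
qg_model = AutoModelForSeq2SeqLM.from_pretrained(QG_CKPT)
qa_model = AutoModelForSeq2SeqLM.from_pretrained(QA_CKPT)

context = ("A binary search tree keeps its keys in sorted order, so lookup, "
           "insertion and deletion take O(log n) time on average.")

# Step 1: generate a question from the course material.
qg_ids = tok("generate question: " + context, return_tensors="pt").input_ids
question = tok.decode(qg_model.generate(qg_ids, max_new_tokens=48)[0],
                      skip_special_tokens=True)

# Step 2: answer the generated question against the same context.
qa_ids = tok(f"question: {question} context: {context}",
             return_tensors="pt").input_ids
answer = tok.decode(qa_model.generate(qa_ids, max_new_tokens=16)[0],
                    skip_special_tokens=True)
print(question, "->", answer)
```

Because the two steps are independent, either stage can be swapped out (for example, adding a distractor-generation or roundtrip-consistency filter after step 2) without retraining the rest of the pipeline.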
2.2. Joint learning approach

In the context of a joint model, the process of generating both the question and answer occurs iteratively. The interdependence between question generation and question answering presents a promising opportunity for enhancing performance and reducing the reliance on annotated data (Klein & Nabi, 2019). The pipeline approach exhibits suboptimal performance in the task of extracting the most suitable QA pairs from textual data. This is primarily due to the disregard for the interdependence between question generation and answer extraction, resulting in the potential generation of incompatible QA pairs (Cui et al., 2021).
A novel approach was introduced (Klein & Nabi, 2019) that integrates the Transformers decoder GPT-2 model with the Transformers encoder BERT, with the aim of facilitating collaborative learning in the context of question-answering and question generation. The training of the model is conducted through the utilisation of an end-to-end schema, wherein the dual tasks of question-answering and question generation are considered concurrently, rather than being approached as sequential tasks. The utilisation of a ground-truth answer and accompanying context text is employed as a means to generate a question through the utilisation of GPT-2. The subsequent step involves providing the pre-trained BERT model with the SQuAD text, which encompasses the question that was formulated, in order to facilitate the extraction of the answer span. In the event that BERT fails to accurately predict the correct answer as indicated by the annotations, the resulting loss of information will be propagated backwards to GPT-2.
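The joint objective can be pictured as two task losses computed on the same example and optimised together. The sketch below is a simplified illustration rather than the authors' implementation: it pairs a GPT-2 question-generation loss with a BERT answer-span loss, whereas the fully end-to-end coupling described above, where the QA loss flows back through the generated question, would additionally require a differentiable decoding step that is omitted here. The training example is hand-written for illustration.

```python
import torch
from transformers import (BertForQuestionAnswering, BertTokenizerFast,
                          GPT2LMHeadModel, GPT2TokenizerFast)

qg_tok = GPT2TokenizerFast.from_pretrained("gpt2")
qg_model = GPT2LMHeadModel.from_pretrained("gpt2")
qa_tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
qa_model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")

# One hand-written training example (assumed data, not drawn from SQuAD).
context = "A hash table resolves key collisions with chaining or open addressing."
answer = "chaining or open addressing"
question = "How does a hash table resolve key collisions?"

# Question-generation loss: GPT-2 models p(question | answer, context).
qg_inputs = qg_tok(f"answer: {answer} context: {context} question: {question}",
                   return_tensors="pt")
qg_loss = qg_model(**qg_inputs, labels=qg_inputs["input_ids"]).loss

# Question-answering loss: BERT is trained to point at the gold answer span.
qa_inputs = qa_tok(question, context, return_tensors="pt")
start_char = context.index(answer)
start = torch.tensor([qa_inputs.char_to_token(start_char, sequence_index=1)])
end = torch.tensor([qa_inputs.char_to_token(start_char + len(answer) - 1,
                                            sequence_index=1)])
qa_loss = qa_model(**qa_inputs, start_positions=start, end_positions=end).loss

# Simplified joint step: both models receive gradients from the summed loss.
(qg_loss + qa_loss).backward()
print(float(qg_loss), float(qa_loss))
```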
In a similar vein, a novel unified framework for the generation of QA pairs was proposed (Qu et al., 2021). Based on the available evidence, it can be inferred that the process of sequential generation can be effectively simplified by transforming it into the task of generating joint question-answer pairs. Within the confines of this particular framework, the tasks of question generation and keyphrase extraction have been intricately intertwined, forming a dual task. The optimisation process involves leveraging the advantages of each task in a mutually reinforcing manner, with iterative refinement being a key component. The process of generating answers is facilitated by utilising the extracted keyphrase as a guiding factor.

The joint model approach with BERT or GPT is transformative, integrating question and answer generation to reflect the interconnected nature of learning. This method enhances educational practices by providing immediate, context-aware feedback and generating precisely aligned question-answer pairs, leading to more accurate student assessments (Rodriguez-Torrealba et al., 2022a). It also allows for the efficient creation of customized educational content, significantly reducing the workload on educators and enabling them to focus on more personalized teaching. Moreover, its ability to automate content generation makes it ideal for scalable applications such as online learning platforms and MOOCs, supporting large-scale educational initiatives with consistent, high-quality content (Kurdi et al., 2020).

2.3. Multi-task learning approach

The multi-task model utilizes a shared encoder to process inputs for both question and answer generation, enabling mutual learning and synergy between these tasks. This integration blurs the lines between generating questions and answers, highlighting their interdependence and enhancing their effectiveness (Cui et al., 2021). One of the primary challenges encountered in the pipeline approach is the generation of incompatible QA pairs in an independent manner, without considering the underlying relationships between them. The OneStop model is proposed as a solution to address these issues. It leverages a sequence-to-sequence transformer architecture, featuring a bidirectional encoder and an autoregressive decoder, to generate question-answer pairs using a multi-task learning approach (Cui et al., 2021). Documents are fed into the encoder, and the decoder, employing cross-attention with the encoder's outputs, generates questions and predicts answer spans. This interplay between the document and the generated content is crucial for producing accurate and compatible responses, thereby achieving optimal results in the question and answer generation processes.

In contrast to sharing the encoding model across the question generation and answer generation tasks, the multi-task-based T5 model demonstrates a more direct approach to modelling these tasks (Zhong et al., 2022). In the multi-task setting, the T5 model undergoes fine-tuning on three distinct tasks: question answering, question generation, and answer extraction. For the question answering task, the model takes in a context and question pair as input. In the question generation task, the model utilizes answer-highlighted context as input. Lastly, for the answer extraction task, the model employs sentence-highlighted context as input. The performance of the multi-task-based model is observed to be superior when evaluating its results in comparison to the T5 model that has been fine-tuned under a single-task setting, as reported in reference (Akyon et al., 2022).
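The three tasks differ only in how the input text is formatted for the shared T5 model. The snippet below illustrates one plausible set of input formats; the task prefixes and the <hl> highlight markers are assumptions about the fine-tuning setup rather than a documented specification.

```python
# Illustrative multi-task input formats for a single fine-tuned T5 model.
context = ("Dijkstra's algorithm computes shortest paths in graphs whose edge "
           "weights are non-negative.")
answer = "Dijkstra's algorithm"
question = "Which algorithm computes shortest paths with non-negative edge weights?"

highlighted_answer = context.replace(answer, f"<hl> {answer} <hl>")

multitask_inputs = {
    # Question answering: context and question in, answer text out.
    "qa": f"question: {question} context: {context}",
    # Question generation: the target answer is highlighted inside the context.
    "qg": f"generate question: {highlighted_answer}",
    # Answer extraction: the sentence to mine for answer candidates is highlighted.
    "ae": f"extract answers: <hl> {context} <hl>",
}

for task, text in multitask_inputs.items():
    print(f"[{task}] {text}")
```

Because all three tasks share one set of model weights, improvements learned on one task (for example, locating good answer spans) can transfer directly to the others.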
In higher education, the use of QA pairs based assessments using LLMs like BERT and T5, employing approaches such as pipeline, joint learning, and multi-task learning, significantly enhances both teaching and learning experiences. For students, these models facilitate personalized learning through adaptive assessments and provide immediate, detailed feedback, which is crucial for effective learning (Zhong et al., 2022). They also help in creating fairer and more accurate evaluations by minimizing human biases and ensuring consistency across assessments. The direct impact on student learning is profound, as these assessments help in reinforcing knowledge through repeated and tailored practice, leading to improved understanding and retention (Mazidi & Nielsen, 2014). For teachers, LLMs offer considerable benefits by automating the generation of test materials, thus saving time and effort that can be redirected towards interactive teaching. Additionally, the data-driven insights generated by these models support informed pedagogical decisions and curriculum development, enhancing educational outcomes. The superior performance of LLMs such as the T5 model in higher education settings means that educational practices can become more adaptive, personalized, and inclusive, enhancing both the quality and accessibility of education (Kurdi et al., 2020).

2.4. Research gap

In conclusion, it is worth noting that the existing methodologies for generating QA pairs using pre-trained language models can be classified into three distinct categories: the pipeline method, the joint model, and the multi-task learning model. The approaches that have been reviewed are organised and presented in a structured manner in Table 1. The utilisation of multiple datasets for the generation of QA pairs is evident in the aforementioned methods. Furthermore, it is important to note that the metrics employed for the purpose of automatically evaluating text generation were found to be inconsistent, lacking a standardised set of metrics for comprehensive evaluation.

The current landscape of QA pairs generation methodologies has witnessed the emergence of various proposed approaches. However, a notable gap in the existing literature relates to the absence of comprehensive performance evaluations conducted on real-world education datasets. The performance of these approaches has been assessed through the utilisation of various datasets and evaluation metrics, thereby posing a challenge in terms of comparing their respective outcomes. Furthermore, it should be noted that the datasets listed in Table 1 lack a clear source from real-world educational settings. Consequently, the extent to which these datasets can be applied to the field of education remains uncertain. Hence, it is imperative to conduct a comprehensive assessment of these methodologies using a universally accepted dataset and standardised evaluation techniques in order to ascertain their efficacy in the realm of education. Furthermore, it is imperative to incorporate real-world evaluations that involve educators with substantial teaching experience in the field of higher education. These evaluations are crucial in order to comprehensively assess the influence and efficacy of these approaches.

Moreover, there is a notable absence of empirical studies directly investigating the impact of automatically generated QA pairs on students' academic performance and engagement. This gap highlights the need for targeted research that evaluates how these technologies affect learning outcomes in actual classroom settings. Such studies are essential to understand whether and how the use of automated QA pairs can enhance educational practices, particularly in terms of improving student engagement and academic achievement.
Table 1
The summarization of the proposed method. (Column headings: Model; Dataset; Question generation; Answer generation; Purpose; Distractor generation; Pre-trained language model; Method type; Automatic evaluation; Expert evaluation; Evaluation in a real-world setting.)
Fig. 2. DG-RACE dataset example.

texts, drawing from subjects commonly found in academic assessments to ensure a broad applicability. By focusing on the generation of high-quality, contextually relevant distractors, DG-RACE aims to simulate a genuine exam experience, testing students' comprehension skills effectively and preparing LLM-based systems to support nuanced educational needs.

3.1.2. Benchmark datasets

To ensure uniformity and high educational standards, we recognized the need for a benchmark dataset tailored explicitly for educational purposes, distinguishing it from datasets like RACE and SQuAD, which are not primarily designed with educational objectives in mind. This dedicated dataset adheres to strict criteria, emphasizing the creation of questions that include a passage/context, one correct answer, and three carefully crafted distractors to simulate real-world examination conditions effectively (as shown in Fig. 3). Inspired by the methodologies proposed by Torres (Torres et al., 2009), our approach to building these benchmark datasets involved detailed guidelines, as documented in Appendix A.

We selected three courses for this endeavor: a bachelor-level course in Programming 1 and two master-level courses in Big Data with NoSQL and Enterprise Computing and ERP Systems. These courses were chosen to develop benchmark datasets, ensuring a wide range of coverage that reflects the critical areas of contemporary computer science education. The annotation process was meticulously carried out with the collaboration of six annotators, comprising teaching assistants, to ensure a robust evaluation across three distinct courses. For each course, we

3.2. Experiment

Based on our analysis of the existing literature, it can be inferred that the approaches employed for QA pairs generation can be categorised into three distinct types: the pipeline, joint, and multi-task learning approaches. However, a number of specific approaches fall into these types. The experimental design therefore involved selecting approaches for this study based on four conditions. First, the approach must satisfy the specific requirements of the educational context, which include the generation of questions and corresponding answers. Second, approaches able to generate distractors were considered preferentially. Third, the approach must have stated its applicability in the education context. Lastly, the approach must be designed specifically for the English language. Based on these conditions, three approaches were selected (Cui et al., 2021; Qu et al., 2021; Rodriguez-Torrealba et al., 2022b).

However, the selection of the pre-trained language models is a crucial factor that can significantly impact the performance of these approaches, since different pre-trained language models have been shown to exhibit bias based on the training data used and their model architectures. To demonstrate the performance of the proposed approaches explicitly, different pre-trained language models were selected and assessed respectively to exploit the potential of each approach and minimize the potential bias of different pre-trained language models. As a result, a combination of pre-trained language models with comparable architectures to those employed in the original research were used in conjunction with the selected approaches to generate QA pairs.
In this study, we meticulously selected pre-trained language models such as T5, ProphetNet, and BART based on specific criteria to ensure alignment with our research goals on QA pairs generation (Liu et al., 2019; Qi et al., 2020; Raffel et al., 2020; H. Wang et al., 2022). Key considerations included each model's architectural innovation, like T5's text-to-text framework, ProphetNet's n-gram prediction, and BART's hybrid approach, ensuring advanced comprehension and generation capabilities. Performance on benchmark datasets was crucial, guiding us to models with demonstrated excellence in NLP tasks. Flexibility for fine-tuning allowed us to adapt models to our unique dataset, enhancing their effectiveness. The selection was further influenced by the level of community support and the availability of resources, ensuring ease of implementation and potential for collaboration. Lastly, engagement with models at the forefront of NLP research promised that our work would remain cutting-edge. This thoughtful selection process guaranteed that the models we chose were not just tools but integral to pushing the boundaries of QA pairs generation research.

Thus, these three pre-trained language models can serve as substitutes for each other in the selected approaches. In summary, Table 2 shows the combinations of the QA pairs generation approaches and pre-trained LMs. The experiments were carried out using the SQuAD dataset.

Additionally, the distractors were an essential part of the MCQ, in addition to the correct question and answer. A T5-based distractor generation (DG) model, taking the question-answer pair and the context paragraph as input, was implemented to generate distractors for the QA pairs produced by all selected approaches, forming a complete MCQ (Rodriguez-Torrealba et al., 2022b); this corresponds to the distractor-generation stage of the pipeline approach.
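As a rough illustration, the DG model conditions on the question, its correct answer, and the source paragraph, and decodes several distractors in one pass. The field prefixes and the <sep> separator below are assumptions about the fine-tuning format, not the exact specification of the model used in this study.

```python
# Assumed input/target layout for a T5-based distractor generation (DG) model.
question = "Which data structure offers O(1) average-time lookup by key?"
answer = "a hash table"
context = ("A hash table stores key-value pairs and offers O(1) average-time "
           "lookup, whereas balanced search trees guarantee O(log n) operations.")

dg_input = f"question: {question} answer: {answer} context: {context}"
dg_target = "a balanced search tree <sep> a linked list <sep> a sorted array"

print(dg_input)
print(dg_target)  # three distractors complete the four-option MCQ
```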
assess the efficacy of the generated QA pairs derived from diverse approaches. The objective of this evaluation is to assess the efficacy of the generated QA pairs in the context of educational applications.

3.3.1. Automatic evaluation

In the context of automatic evaluation, it was necessary to establish nine distinct groups (as presented in Table 2) for the purpose of generating QA pairs, since we are using three approaches along with three large language models. For example, in the context of the pipeline approach, three distinct groups were formulated, namely pipeline + T5, pipeline + BART, and pipeline + ProphetNet.

The evaluation of these combinations involved the utilisation of established evaluation metrics, namely BLEU, METEOR, and ROUGE, to quantitatively assess their performance and derive corresponding scores. The computation of the average score is subsequently performed for every QA pair generated by each group. Based on the findings obtained from the evaluation results, it was deemed necessary to select the three groups (one group for each course) that exhibited the highest level of performance within all approaches for the purpose of conducting real-world educational evaluation. This decision was made in light of the fact that evaluating all nine groups would entail a significant investment of time and effort. While automatic evaluation metrics have proven to be valuable in objectively assessing the quality of generated text against reference text, there remains a gap in their ability to evaluate the syntactic and semantic aspects of generated QA pairs (Qu et al., 2021).
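The scoring step can be reproduced with the Hugging Face evaluate library, as sketched below on a toy prediction/reference pair; the exact preprocessing and aggregation used in the study may differ, and the scores in Tables 3–5 aggregate over all QA pairs generated by each group.

```python
import evaluate  # pip install evaluate rouge_score nltk

preds = ["What does the pipeline approach generate sequentially?"]
refs = [["What is generated sequentially in the pipeline approach?"]]

bleu = evaluate.load("bleu")
meteor = evaluate.load("meteor")
rouge = evaluate.load("rouge")

scores = {}
for n in range(1, 5):  # BLEU-1 .. BLEU-4, as reported in Tables 3-5
    scores[f"bleu-{n}"] = bleu.compute(predictions=preds, references=refs,
                                       max_order=n)["bleu"]
flat_refs = [r[0] for r in refs]
scores["meteor"] = meteor.compute(predictions=preds, references=flat_refs)["meteor"]
scores.update(rouge.compute(predictions=preds, references=flat_refs))

# Rank each approach + LLM group by the mean of all metrics (scaled to 0-100).
average_score = 100 * sum(scores.values()) / len(scores)
print({k: round(100 * v, 2) for k, v in scores.items()})
print("average score:", round(average_score, 2))
```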
examination scores of each student, as well as the quantity of assessment attempts made by students throughout the duration of the course. In terms of data analysis, a pair of statistical tests were conducted in order to assess the influence of the assessments on the academic performance of the students. In order to determine the difference in academic performance between students who engaged in the assessments and those who did not, a statistical analysis employing an independent samples t-test was conducted. In order to investigate the potential association between assessment attempts and students' academic achievement, this research employed Pearson and Spearman's correlation analyses.
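Both analyses are standard and can be expressed in a few lines of SciPy; the snippet below uses synthetic scores purely to illustrate the calls, not the study's actual data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group1 = rng.normal(70, 10, 60)   # synthetic final-exam scores, assessment users
group2 = rng.normal(62, 10, 60)   # synthetic final-exam scores, non-users
attempts = rng.integers(0, 15, 60)                   # synthetic attempt counts
finals = 55 + 1.5 * attempts + rng.normal(0, 8, 60)  # synthetic exam scores

t_stat, t_p = stats.ttest_ind(group1, group2)        # independent samples t-test
pearson_r, pearson_p = stats.pearsonr(attempts, finals)
spearman_r, spearman_p = stats.spearmanr(attempts, finals)

print(f"t = {t_stat:.2f} (p = {t_p:.3g}); "
      f"Pearson r = {pearson_r:.2f}; Spearman rho = {spearman_r:.2f}")
```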
4. Results

In this section, we present the results of the experiments conducted to evaluate the effectiveness of the approaches. These approaches were evaluated in terms of their ability to correctly generate QA pairs and how they affected students' academic success.

enhanced performance in comparison to other LLMs across a range of methodologies, including pipeline, joint, and multi-task approaches. Furthermore, the findings of this study indicate that across all three courses, T5 consistently demonstrated its efficacy. However, it is worth noting that within the specific context of the Programming 1 course, BART exhibited superior performance when utilising a multi-task approach compared to other LLMs. Nevertheless, it is important to highlight that BART's performance, even in this scenario, was still inferior to that of T5 with a pipeline approach.

Based on the findings presented in the tabulated data (Tables 3, 4 and 5) and following the established procedures for real educational evaluation (section 3.3.2), it is considered appropriate to exclusively utilise the most favourable outcome obtained from each respective group for the purpose of generating QA pairs. Henceforth, it has been determined that the QA pairs generated by the Pipeline + T5 group will be selected for the Programming 1 course. Similarly, for the Big Data and Enterprise Modelling courses, QA pairs obtained from the Multi-task + T5 group will be chosen.
Table 3
Results of automatic evaluation metrics for Programming 1.
Metric Pipeline+T5 Pipeline+BART Pipeline+ProphetNet Joint+T5 Joint+BART Joint+ProphetNet Multitask+T5 Multitask+BART Multitask+ProphetNet
BLEU-1 36.14964029 33.64153452 23.39791356 30.33136522 27.36548502 26.60121754 29.74910394 30.98478783 22.63743051
BLEU-2 22.35608167 20.90095886 11.69539345 20.07960227 16.95921112 16.65103447 19.2227172 21.30069651 12.57124711
BLEU-3 15.79375148 15.02378866 7.138231206 15.2961294 12.49040462 12.36952088 13.89763955 16.39482792 8.097514051
BLEU-4 12.02456886 11.82414574 5.032866744 12.38900336 10.12981764 10.06914535 10.60066968 13.21698548 5.763605075
ROUGE-1 32.02495387 30.63471727 21.11995874 20.23716852 16.99843471 15.73133455 31.27625675 31.18783253 23.80227394
ROUGE-2 12.99585724 12.75571406 5.401256265 9.135552858 6.925865541 6.635624063 14.38539526 15.79856505 6.857651749
ROUGE-L 26.35652675 23.9642481 15.96005056 18.87403179 15.88092056 14.33875987 25.05205253 26.87081364 18.81232257
ROUGE-Lsum 26.36512806 24.0792488 15.97624424 18.87955415 15.91985963 14.35597131 25.02281517 26.85631877 18.74756191
METEOR 31.70870176 29.47697555 25.37397826 20.02663572 17.75808498 19.41754084 35.48587518 34.20076926 32.69324558
Average score 23.97502333 22.47792573 14.566210336 18.36100481 15.603120425 15.130016541 22.74361392 24.09017744 16.664761388
Table 4
Results of automatic evaluation metrics for Big Data.
Metric Pipeline+T5 Pipeline+BART Pipeline+ProphetNet Joint+T5 Joint+BART Joint+ProphetNet Multitask+T5 Multitask+BART Multitask+ProphetNet
BLEU-1 23.28076522 18.2197417 20.79261672 19.90801444 18.93799357 16.43486243 35.13891896 32.38045738 22.77432712
BLEU-2 16.50661274 10.56219257 11.8668281 13.27365872 10.82287578 8.685232733 26.96877127 24.94276192 15.09960323
BLEU-3 13.02351867 6.741328941 8.339360611 9.400936677 7.010160883 5.780093828 22.46610602 21.15422776 11.43974895
BLEU-4 10.90174311 4.455817899 6.36770316 6.913700666 4.897609407 4.25681217 19.43932548 18.85497636 9.322526971
ROUGE-1 27.28057174 21.05013342 16.72537488 24.29114063 19.14611678 14.60227056 29.45955894 28.6514385 17.92825687
ROUGE-2 13.83632432 8.526407772 5.92788386 12.08894383 7.222503895 4.713001085 16.89802293 17.49711636 7.781688575
ROUGE-L 26.93642493 20.30977193 15.19678172 23.79932539 18.39886388 13.92933252 28.2352558 27.96082317 17.18851397
ROUGE-Lsum 26.89281126 20.34191829 15.16479933 23.73454746 18.42582878 13.97819757 28.30308393 27.80589989 17.25035657
METEOR 29.23718962 23.03642065 22.9554569 27.68654117 22.48453957 23.30335489 35.1346726 35.42909757 32.18946831
Average score 20.87732907 14.804859241 13.704089476 17.899645443 14.149610283 11.742573087 26.89374621 26.07519988 16.774943396
Table 5
Results of automatic evaluation metrics for Enterprise Modelling.
Metric Pipeline+T5 Pipeline+BART Pipeline+ProphetNet Joint+T5 Joint+BART Joint+ProphetNet Multitask+T5 Multitask+BART Multitask+ProphetNet
BLEU-1 24.89688773 25.03339031 20.66971081 24.48834059 23.1617379 21.11226446 25.8684405 25.3164557 16.21037464
BLEU-2 16.27869483 15.89980441 7.966343666 16.46266515 14.18031072 12.3615537 17.63112649 17.63351073 9.614484625
BLEU-3 11.61481621 11.38762279 3.953459766 12.02050845 9.649366618 8.273964464 12.93639526 13.01625815 6.195580785
BLEU-4 8.468833478 8.593874495 2.167852295 8.815473961 6.968471916 6.089207626 9.58311597 9.638056692 4.454534055
ROUGE-1 23.41647287 21.27903254 16.05618224 24.7873407 19.01818552 17.4928996 27.8988838 26.64783878 19.22624383
ROUGE-2 10.30302322 8.612521911 2.615157343 11.81028214 7.597897023 6.520264528 14.84577498 14.19347913 7.091385424
ROUGE-L 22.07447606 19.82906176 13.81788872 23.70111652 18.04581239 16.49337089 26.27754562 24.9953504 17.44916216
ROUGE-Lsum 22.03421358 19.73847621 13.8621216 23.68067221 18.07887164 16.50290914 26.2163233 25.05321598 17.46847245
METEOR 22.65856622 21.83261886 18.16281993 23.98163752 19.90253358 20.84400664 30.26681373 29.59724752 25.10922375
Average score 17.971776022 16.911822587 11.030170708 18.860893027 15.178131923 13.965604561 21.28049107 20.676823676 13.646606858
indicated a medium level of satisfaction.

In terms of understandability, the clarity and accessibility of the QA pairs were largely accepted, particularly the questions themselves, which were universally deemed highly understandable by all teachers. Nevertheless, the understandability of answers and distractors did not achieve unanimous high ratings, especially from the Enterprise Modelling and Programming 1 teachers, suggesting a need for improvement in how answers and distractors are articulated to ensure they are easily comprehensible.

The feedback on the difficulty level presented a mixed picture, indicating significant room for enhancement to better meet educational goals. While the Programming 1 course's teacher found the questions to be of a high difficulty level, aligning with desired standards, the responses regarding the difficulty levels of answers and distractors varied widely. This diversity in feedback highlights an imperative to recalibrate the challenge presented by the QA pairs, ensuring they are sufficiently demanding to stimulate student engagement and learning.

Teachers praised the QA pairs for their positive impact on imparting basic and concept-wise knowledge, with high evaluations across these areas. Yet, the advanced knowledge impact received lower ratings from the Programming 1 and Enterprise Modelling teachers, suggesting the necessity to delve deeper into complex topics and elevate the academic rigor of the QA pairs to foster a more comprehensive understanding among students.

Regarding utility impact, the QA pairs were recognized for their beneficial support towards students, teachers, and the overall course objectives, particularly in facilitating student learning and aiding teachers in assessing student comprehension. However, feedback indicated that improvements could be made in enhancing course support, as suggested by the medium level of satisfaction from the Enterprise Modelling teacher. This aspect underscores the potential to further refine the QA pairs to better align with and support the overarching goals of the courses.

In conclusion, while the method for generating quiz-style MCQs was deemed effective for educational purposes, especially in terms of correctness and understandability, the detailed feedback from teachers suggests several avenues for improvement. Enhancing the difficulty level, broadening the coverage of advanced knowledge, and bolstering course support emerged as key areas for future development.

4.3. Results of real-educational evaluation

In the real-educational evaluation, the created QA pairs were organized into a series of MCQs in order to create assessments, which were delivered to students during the courses (section 3.3.2). At the end of the courses, statistical analyses were carried out to investigate the effects of the assessments on students' academic performance.

In this evaluation, the participants were divided into two distinct groups: Group 1, comprising individuals who were granted the opportunity to engage in assessments throughout the duration of the course, and Group 2, consisting of individuals who were not provided access to the assessments. Following the completion of the aforementioned tasks, a series of statistical analyses were conducted. The initial analysis aimed to determine any differences in academic performance between students who engaged in the assessments and those who did not. To achieve this, an independent samples t-test was employed.
In order to investigate the potential association between assessment attempts and students' academic achievement, this research employed Pearson and Spearman's correlation analyses.

The results obtained from the t-test for the t and p-values (Table 7) showed a statistically significant difference between the groups in all three courses. The observed difference between groups is more pronounced in the context of the Big Data course when compared to the other two courses. Table 6 shows the statistics of both groups in all three courses. As presented in Table 6, Group 1's mean in all three courses was greater than Group 2's. The greater mean for Group 1 demonstrates that participants who performed assessments during the courses performed better in the final exam than non-participants (Group 2). The maximum scores for the final exams in the Programming 1, Big Data, and Enterprise Modelling courses are 300, 50, and 50, respectively.

The findings of the correlation analysis, as displayed in Table 8, revealed a favourable correlation between assessment attempts and student academic performance. According to the findings shown in Table 8, the highest correlation coefficient observed for both the Pearson and Spearman techniques was 0.67, specifically in regard to the Programming 1 course. The Enterprise Modelling course exhibited the lowest correlation of 0.59 when analysed using both techniques. In the Big Data course, the correlation showed a modest improvement compared to the Enterprise Modelling course, but it declined in comparison to the Programming 1 course. Based on the findings of this analysis, it can be inferred that there exists a positive correlation between the number of assessments attempted by students throughout the course and their performance on the final exam.

5. Discussion

This study implements three distinct approaches and LLMs for generating QA pairs, which were fine-tuned using the SQuAD and RACE datasets. To assess their performance in educational settings, we developed benchmark datasets for three separate courses. These approaches were then evaluated on these benchmarks through a combination of automatic methods, teacher assessments, and real-world educational evaluations.

5.1. Findings

We began by addressing the first research question of how to develop a robust framework to evaluate the effectiveness of different QA pairs generation approaches in higher education, ensuring that these technologies meet the specific needs of educational stakeholders. Our results from automatic evaluations and teacher assessments provide a comprehensive view of the performance of these technologies across various educational contexts.

Our findings highlight that the multi-task approach generally outperforms the pipeline and joint methods in generating QA pairs, as evidenced by the high scores across multiple automatic evaluation metrics, particularly for the Programming 1 and Big Data courses. This indicates a robust capability of the multi-task approach to produce quality QA pairs that align closely with reference answers in these courses. Additionally, the T5 model showed superior performance across all methods and was consistently effective, further supporting its use in our recommended framework for generating QA pairs.

However, when considering the application of these approaches in a real educational setting, the feedback from teachers provides essential insights that should influence the evaluation framework. Teachers reported high satisfaction with the accuracy and relevance of the questions generated, particularly praising the QA pairs for their correctness in the Programming 1 and Big Data courses. This suggests that the QA pairs generated are not only technically competent but also applicable and useful in an educational context. Despite these positives, the feedback also pointed out areas needing improvement. The medium satisfaction levels in the Enterprise Modelling course and the mixed feedback on the understandability of answers and distractors across courses highlight the need for adjustments in how these QA pairs are formulated. This
Table 8
Correlations between assessment attempts and final exam score.
Course Correlation technique N Variable by Variable r Sig.(2-tailed)
Programming 1 Pearson Correlation 120 Assessments Attempts Final exam score 0.67 < 0.001
Spearman Correlation 120 Assessments Attempts Final exam score 0.67 < 0.001
Big Data Pearson Correlation 70 Assessments Attempts Final exam score 0.65 < 0.001
Spearman Correlation 70 Assessments Attempts Final exam score 0.64 < 0.001
Enterprise Modelling Pearson Correlation 40 Assessments Attempts Final exam score 0.59 < 0.001
Spearman Correlation 40 Assessments Attempts Final exam score 0.59 < 0.001
5.3. Limitations and future research

The benchmark datasets utilised in this study were derived exclusively from course materials for computer science-related subjects. It is important to acknowledge that this strategy may introduce certain limitations in terms of the generalizability of the findings and could potentially introduce bias into the evaluation outcomes. Furthermore, it is important to note that our study focused exclusively on the evaluation of three prominent language models. We did not delve into the investigation of alternative models or the potential synergistic effects of combining different models, which could potentially lead to enhanced performance. The aforementioned limitations underscore the necessity for additional investigation employing larger and more diverse

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix A. Standards of creating multiple-choice questions

The question creation criteria.

1. Clearly identify the specific point being tested by the item
2. Include the main idea and phrasing in the stem
3. Avoid including irrelevant clues that may lead to the correct answer
4. Eliminate unnecessary or extraneous information in the question stem
5. Use negative words in the stem sparingly, and if used, underline or capitalize them to avoid confusion
6. Ensure that the question is clear and students understand what is expected of them

The options creation criteria.

1. Make sure each option is mutually exclusive and does not overlap with others
2. Ensure that the options are as similar as possible in terms of structure and phrasing
3. Keep the grammar of each option consistent with that of the question stem
4. Ensure that there is only one correct or best answer among the options
5. Use plausible and realistic distractors to challenge the students
6. Incorporate common errors or misconceptions made by students in the distractors
7. Keep the structure of the options parallel and consistent
8. Ensure that the length of the options is similar to avoid giving clues to the correct answer
9. Avoid using the options "all of the above" and "none of the above" unless necessary
10. Avoid using absolute determiners such as "never" and "always"
11. Randomize the position of the correct answer among the options
12. Use letters instead of numbers to label the options
13. Replace any distractors that are not chosen by any examinees
14. Avoid using humor in the question or options.

References

Akyon, F. C., Cavusoglu, D., Cengiz, C., Altinuc, S. O., & Temizel, A. (2022). Automated question generation and question answering from Turkish texts. https://doi.org/10.48550/arXiv.2111.06476
Alberti, C., Andor, D., Pitler, E., Devlin, J., & Collins, M. (2019). Synthetic QA corpora generation with roundtrip consistency. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1620
Chan, Y.-H., & Fan, Y.-C. (2019). A recurrent BERT-based model for question generation. Proceedings of the 2nd Workshop on Machine Reading for Question Answering. https://doi.org/10.18653/v1/D19-5821
Cui, S., Bao, X., Zu, X., Guo, Y., Zhao, Z., Zhang, J., & Chen, H. (2021). OneStop QAMaker: Extract question-answer pairs from text in a one-stop approach. https://doi.org/10.48550/arXiv.2102.12128
Gabajiwala, E., Mehta, P., Singh, R., & Koshy, R. (2022). Quiz maker: Automatic quiz generation from text using NLP. In Futuristic trends in networks and computing technologies (pp. 523–533). Springer Nature. https://doi.org/10.1007/978-981-19-5037-7_37
Gao, Y., Bing, L., Li, P., King, I., & Lyu, M. R. (2018). Generating distractors for reading comprehension questions from real examinations. https://doi.org/10.48550/arXiv.1809.02768
Gholami, V., & Morady Moghaddam, M. (2013). The effect of weekly quizzes on students' final achievement score. International Journal of Modern Education and Computer Science, 5, 36–41. https://doi.org/10.5815/ijmecs.2013.01.05
Jia, X., Zhou, W., Sun, X., & Wu, Y. (2020). EQG-RACE: Examination-type question generation. https://doi.org/10.48550/arXiv.2012.06106
Klein, T., & Nabi, M. (2019). Learning to answer by learning to ask: Getting the best of GPT-2 and BERT worlds. https://doi.org/10.48550/arXiv.1911.02365
Kurdi, G., Leo, J., Parsia, B., Sattler, U., & Al-Emari, S. (2020). A systematic review of automatic question generation for educational purposes. International Journal of Artificial Intelligence in Education, 30, 121–204.
Lelkes, A. D., Tran, V. Q., & Yu, C. (2021). Quiz-style question generation for news stories. https://doi.org/10.48550/arXiv.2102.09094
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. https://doi.org/10.48550/arXiv.1907.11692
Mazidi, K., & Nielsen, R. (2014). Linguistic considerations in automatic question generation. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 2, 321–326.
Qi, W., Yan, Y., Gong, Y., Liu, D., Duan, N., Chen, J., Zhang, R., & Zhou, M. (2020). ProphetNet: Predicting future n-gram for sequence-to-sequence pre-training. https://doi.org/10.48550/arXiv.2001.04063
Qu, F., Jia, X., & Wu, Y. (2021). Asking questions like educational experts: Automatically generating question-answer pairs on real-world examination data. https://doi.org/10.48550/arXiv.2109.05179
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. https://doi.org/10.48550/arXiv.1910.10683
Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016). SQuAD: 100,000+ questions for machine comprehension of text. https://doi.org/10.48550/arXiv.1606.05250
Rodriguez-Torrealba, R., Garcia-Lopez, E., & Garcia-Cabot, A. (2022a). End-to-end generation of multiple-choice questions using text-to-text transfer transformer models. Expert Systems with Applications, 208, Article 118258.
Rodriguez-Torrealba, R., Garcia-Lopez, E., & Garcia-Cabot, A. (2022b). End-to-end generation of multiple-choice questions using text-to-text transfer transformer models. Expert Systems with Applications, 208, Article 118258. https://doi.org/10.1016/j.eswa.2022.118258
Torres, C., Lopes, A. P., Azevedo, J. M., & Babo, L. (2009). Developing multiple choice questions in mathematics. Retrieved February 19, 2023, from https://recipp.ipp.pt/handle/10400.22/586
Wang, H., Li, J., Wu, H., Hovy, E., & Sun, Y. (2022). Pre-trained language models and their applications. Engineering. https://doi.org/10.1016/j.eng.2022.04.024
Wang, H. C., Maslim, M., & Kan, C. H. (2023). A question–answer generation system for an asynchronous distance learning platform. Education and Information Technologies, 28(9), 12059–12088.
Yao, B., Wang, D., Wu, T., Zhang, Z., Li, T. J.-J., Yu, M., & Xu, Y. (2022). It is AI's turn to ask humans a question: Question-answer pair generation for children's story books. https://doi.org/10.48550/arXiv.2109.03423
Yi, C., Zhu, R., & Wang, Q. (2021). Exploring the interplay between question-answering systems and communication with instructors in facilitating learning. Internet Research, 32(7), 32–55.
Zhong, W., Gao, Y., Ding, N., Qin, Y., Liu, Z., Zhou, M., Wang, J., Yin, J., & Duan, N. (2022). ProQA: Structural prompt-based pre-training for unified question answering. https://doi.org/10.48550/arXiv.2205.04040