Word Problem Solving
Doctor of Philosophy
in
Computer Science and Engineering
by
Pruthwik Mishra
201307577
[email protected]
CERTIFICATE
It is certified that the work contained in this thesis, titled “Word Problem Solving” by Pruthwik
Mishra, has been carried out under my supervision and is not submitted elsewhere for a degree.
This has been a long, tiring, but fulfilling journey that is now coming to an end. I would like to thank everyone who supported me throughout this journey. It would not have been possible without your help.
First of all, I would like to thank my advisor Prof. Dipti Misra Sharma, who has been very supportive of and patient with me all these years. She has been an amazing guide and person who taught me many things, from natural language processing concepts to defining life lessons. I cannot remember a single time when I wanted to discuss something with her and she did not spare the time. Thank you, madam, for this. I thank all the LTRC faculty
Dr. Radhika, Dr. Manish, Dr. Parameswari, Prof. Sangal, Dr. Anil, Dr. Chiranjeevi,
Prof. Vasudev Varma, Dr. Rajakrishnan, Dr. Rahul for their support in different research
areas which I have worked on. I thank all our office and admin staff Praveen, Dhanalaxmi,
Laxminarayan Sir, Namratha, Sammaiah Sir, Rambabu Sir, Mahendra sir, Murthy sir, Satish
Gatla Sir, Pushpalatha madam, Prathima madam, Kumaraswamy Sir, Srikanth Sir, Prabhakar
Sir, Krishna Kishore Sir, Nadeem Sir, Srinivas Rao Sir, Saidulu Sir, and all the house-keeping,
mess, and security staff.
I would like to thank all my friends and research colleagues who have given their invaluable suggestions for my research. This is going to be a long list. I thank my friends Kunal, Chandan,
Vinitha, Anwesha, Ravi, Sai Krishna, Divya Sai, Yashwanth, Ratish, Sushant Bhai, Kishori,
Shastri, Nirmal, Sai Ganesh, Mounika. I am grateful that I got the opportunity to work with
a group of language experts who have contributed to my progress. I thank all of them: Al-
pana madam, Anita madam, Preeti madam, Nandini madam, Mithu madam, Kaberi madam,
Krithika madam, Sameena madam, Sarita madam, Younus ji, Noman ji, Vaibhavi, Srivani
madam, Sarala madam, Shailaja madam, and Avinash sir. I thank Litton for introducing me to
the world of word problems and being a part of my initial days in researching this field. I would
like to especially thank Arpit, Litton, Maaz, Krishnakant, Vighnesh, Nikhilesh, Ashok, Vandan,
Saumitra, Prathyusha, Pranav, Arafat, Ganesh for sharing a great bond of friendship which
made this journey more memorable. I thank my fellow students Harshita, Ayush, Sankalp, Anvishka, Kriti, and Ketan, whom I got the opportunity to mentor and to carry out research with. I want to express my deep gratitude to Vandan, with whom I share a great camaraderie culminating
in many research outcomes in terms of tools and research papers. I used tools developed by him in many of the works presented in this thesis.
Lastly, I thank the most crucial part of my life, my family. I dedicate this thesis to my
mother who has been a pillar of strength for me in difficult times. During all these years, she
has always given me courage and taught me perseverance to keep going. It would not have
been possible without you, Bou. The second person who has always kept a constant watch on
my progress is my elder sister. Although she is a hard taskmaster, she always egged me on to
reach my goals. I thank my wife for providing unwavering support to me. I was not present in
many of the important phases of her life, but she never complained. I convey my deep gratitude
to all these three powerful individuals.
I thank my thesis review panel, Prof. Amba Kulkarni, Prof. Sivaji Bandyopadhyay, and Dr.
Manish Shrivastava, for their insightful comments and guidance.
Abstract
Education plays a vital role in shaping one's life. Education in early childhood includes learning from different activities where counting and other concepts of mathematics act as major building blocks. Mathematics is not just a subject to be taught in schools; it also has myriad applications in our daily lives, such as counting the total number of stationery items, calculating their individual prices, and adding them up for a purchase made in a grocery shop. Such real-world situations are posed in mathematical word problems, which are an essential part of a child's learning and require natural language understanding (NLU) as well as knowledge of mathematical operations. Mathematical word problem solvers can assist both students and teachers. It is a challenging field, and this thesis attempts to provide NLP solutions for math word problems in English and Indian languages.
Our approach for developing word problem solvers follows a pipeline. For a word problem,
first, we identify the relevant operands and required operations. In the second step, we form an
equation from these identified components. The first two stages can be combined to generate
equations at once using neural network based approaches. At the last stage, the equation is
solved by a mathematical solver to get the final solution. We focus primarily on the first two
stages of this pipeline. We developed solvers using three kinds of approaches: frame based,
composition of classifiers based, and end-to-end neural based.
We also shed light on the limitations of the current automatic solvers with respect to the
data. We designed different data augmentation techniques to overcome the data scarcity prob-
lem. As a part of resource building, we developed word problem datasets and solvers for Indian
Languages. In addition to this, we compared different models related to our developed ap-
proaches. We empirically show the difference in generation of various equation notation types.
For this study, we present the results of equations in infix, postfix, and prefix notations. We
also show two natural language processing tasks where components of word problem solvers
can be utilized. In the first task, we studied the impact of simple number based pre-processing
on the performance of machine translation systems. In the second task, we analyze speech transcript texts to extract equation spans. Such text is present mainly in transcripts from the mathematical domain. For this, we develop an equation identifier and converter that handles mathematical notation in transcripts. This makes transcripts that use mathematical terms heavily much easier to read.
Contents
1 Introduction
1.1 Motivation
1.2 Definition
1.3 Approaches for Solving Word Problems
1.4 Our Contribution
1.5 Organization of Thesis
Bibliography
Chapter 1
Introduction
Natural Language Understanding (NLU) is the most challenging aspect of Natural Language
Processing (NLP). NLU systems are essential for performing a gamut of tasks ranging from
simpler tasks, such as designing chatbots based on short and simple instructions, to relatively
complex tasks, such as reading comprehension and question answering involving different forms
of natural language (NL) inputs. Word Problem Solving (WPS) falls in the category of complex
NLU tasks. Solving a word problem involves three subtasks:
• Identifying the relevant operands and operations
• Forming equations
• Solving the equations using a mathematical solver
Our research mainly focuses on the first two subtasks. Word problems come in different flavors that require varying levels of NLU. Some word problems are very easy to solve because the cues for the operands and the operation to be performed are explicitly given, such as “Find the result when 5 is added to 3.” Here, “added” refers to the addition operation and the operands are directly mentioned. But explicit cues are often missing in word problems, and it then becomes very challenging to interpret the meaning of the sentences. In addition to this, it is difficult to figure out which parts of the word problem text are relevant and which parts are to be ignored.
Let us look at a few examples below and analyze these issues which make the NLU task much
harder.
Example 1:
• Problem Text: “At a function held in Bhubaneswar’s Kalinga Stadium, Naveen Patnaik
gave cheques of Rs 2.5 crore each to the Indian men’s hockey team vice-captain Birendra
Lakra and defender Amit Rohidas and Rs 50 lakh each to Deep Grace Ekka and Namita
Toppo of the women’s hockey team.” 1
1. https://www.hindustantimes.com/cities/others/odisha-cm-gives-cash-awards-to-olympic-hockey-players-101628683292019.html
• Question: “How much money was awarded to the hockey players in total?”
1. NLU:
– Identification of two sets of hockey players: a. Men b. Women
– Identification of explicit operands: 2.5 crore and 50 lakh
– Identification of implicit operands: total number of players in each set is missing;
only the names of the players are mentioned in the problem text
– Identification of operations: each for multiplication, total for addition
2. Equation Generation:
– Equation 1: Total Money Awarded to Players from Men's Team = Total Players from Men's Team * Money Awarded to Each Man
– Equation 2: Total Money Awarded to Players from Women's Team = Total Players from Women's Team * Money Awarded to Each Woman
– Equation 3: Total Prize Money = Total Money Awarded to Players from Men's Team + Total Money Awarded to Players from Women's Team
3. Solving the Equations: Instantiation of variables and solving using a mathematical
solver
Total Players from Men's Team = 2, Money Awarded to Each Man = 2.5 crore,
Total Players from Women's Team = 2, Money Awarded to Each Woman = 50 lakh
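Substituting these values and solving the three equations gives the final answer: Total Prize Money = 2 * 2.5 crore + 2 * 50 lakh = 5 crore + 1 crore = Rs 6 crore.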
Example 2:
• Problem Text: “Each medal winner at the Tokyo Olympics will be awarded a cash prize
of Rs 1 Cr by the Union government.”
• Question: “What is the total prize money given to the Olympic medal winners?”
1. NLU:
– Identifying how many medals Indian players won at the Tokyo Olympics: re-
quires world knowledge which is missing from the problem text
– Identification of the explicit operand: 1 Cr
– Identification of operations: each for multiplication
2. Equation Generation:
– Equation 1: Total Money Awarded to Players = Total Number of Olympic Medal Winners * Money Awarded to Each Winner
3. Solving the Equations: Instantiation of variables and solving using a mathematical
solver
Total Number of Olympic Medal Winners = 7 (from world knowledge), Money Awarded to Each Winner = 1 Cr
Example 3:
• Problem Text: “GST collections recover to ₹1.16 lakh crore in July. The July 2021
collections were 33% higher than a year ago, with GST collected on the import of goods
rising 36% and domestic transactions (including import of services) growing by 32%.” 2
1. NLU:
– Identification of GST collections in 2021: ₹1.16 lakh crore
– Percentage of growth: 33%
– Identification of irrelevant quantities
∗ 2021 in July 2021
∗ 36 in GST collected on the import of goods rising 36%
∗ 32 in domestic transactions (including import of services) growing by 32%
– Introduction of Implicit Constants: 0.01 is introduced when dealing with per-
centages
– Identification of operations: higher for addition, % for multiplication with 0.01
2. Equation Generation:
– Equation 1: Growth in rupees in 2021 compared to 2020
= GST collections in 2020 ∗ growth_percentage ∗ 0.01
– Equation 2: GST collections in 2021 =
GST collections in 2020 + Growth in rupees in 2021 compared to 2020
3. Solving the Equations: Instantiation of variables, conversion of numbers in words
into numeric equivalents and solving using a mathematical solver
GST collections in 2021 = 1.16 lakh crore = 1160000000000,
growth_percentage = 33
2. https://www.thehindu.com/business/Economy/gst-collections-recover-to-116-lakh-crore-in-july/article35668958.ece
These examples shed light on various challenges arising from the varying degrees of complexity in NLU and equation generation while solving arithmetic word problems.
1.1 Motivation
Humans employ domain knowledge and quantitative reasoning for understanding word prob-
lems. Imparting this kind of world knowledge is difficult for any automatic system. Along with
all these, scarcity of resources is a major bottleneck while designing robust word problem solvers.
Since word problems involve NLU, they can be used in several NLP applications, for example,
machine translation and equation identification in English transcripts. Word problem solvers
can also act as teaching assistants. Frame based solvers can empower the teachers to explain
arithmetic word problem solving to students with the help of frames and easily understandable
frame slots. A solver should be capable of solving diverse word problems while remaining robust. In this work, we attempt to address these challenges by developing techniques using frame based, machine learning, and neural models.
1.2 Definition
We will first define a word problem. A word problem consists of a real world narrative comprising two parts, the problem text and the question text, as shown in the example below.
• Word Problem: Ramesh had 10 pencils. He gave 3 to Suresh. How many pencils does
Ramesh have now?
• Equation: x = 10 − 3
• Solution: 7
The problem text consists of 2-3 sentences and describes a real world scenario involving different
entities, quantities, and their associated units. These sentences include different operands
and operations to be carried out on them. The question text queries an unknown quantity. So, identification of the relevant operands and operations becomes essential for solving
word problems. In this thesis, we also develop different representation schemes and show their
efficiency in problem solving.
1.3 Approaches for Solving Word Problems
Earlier systems for solving word problems could only operate on a limited set of inputs. Most of the techniques were schema based and handled only addition-subtraction type problems. Statistical models using verb categorization, expression trees, and log-linear models using various features of different word problem types were also explored. In recent years, there has been a reinvigorated interest in solving word problems of different kinds - arithmetic questions, probability questions, SAT questions, science questions, questions from other domains, as well as word problems with visual cues - using neural networks. Most of the current approaches are neural, and they treat word problem solving as a sequence to sequence learning task. The whole problem text is considered as the input to an encoder, and the corresponding equation to solve the problem is the desired output to be generated by a decoder. Lately, a graph transformer based encoder paired with a tree based decoder has been reported as the state-of-the-art technique.
In this thesis, our primary focus is on solving arithmetic word problems using neural approaches. Our contributions are as follows.
1. We propose a frame identification based word problem solver for English where each frame
is represented in terms of pre-defined slots.
2. We develop a GRU based operator identifier and similarity based operand detection model
and generate equations that are composed of these identified components for English.
3. We develop an end-to-end equation generation model using a memory network based encoder and an LSTM based decoder for English.
4. We develop different neural based end-to-end generation models including attention based
sequence-to-sequence models, transformer based models from scratch, and fine tuning
pre-trained transformer models. All these models are designed to solve word problems in
English.
5. We develop word problem solvers for Hindi and Telugu using current state-of-the-art
pre-trained multilingual transformer models.
6. We create and release benchmark word problem datasets consisting of more than 1000
word problems for English, Hindi, and Telugu.
7. We also introduce an equation identifier and converter for English transcripts.
8. We develop conversion tools for converting numbers written in words into their numeric
equivalents. These tools are developed for English, Hindi, Telugu, Gujarati, and Odia.
Our approaches for word problem solving can be broadly summarized as shown in Figure 1.1.
• Chapter 2 - Related Work: In this chapter, we briefly outline the previous approaches
followed for word problem solving with focus on the recent neural network based ap-
proaches.
• Chapter 3 - Datasets and Evaluation: In this chapter, we describe the benchmark word problem datasets and the evaluation strategies used for measuring the performance of word problem solving approaches.
• Chapter 4 - Word Problem Solving Using Frame Identification: This chapter
details our initial efforts to build a word problem solver by introducing the concept of
frames. This work is inspired by earlier works on schemas and verb categorization.
Chapter 2
Related Work
Attempts to automatically solve arithmetic word problems started as early as the 1960s.
Over the years, several rule-based and statistical word problem solvers, built with carefully crafted rules or hand-curated features, have been the go-to models for the task. Recently, with the advancement of deep learning, systems have been moving towards neural models with better text representations across longer texts, achieving improved performance on several benchmark
datasets. In this chapter, we present an overview of different strategies followed for word
problem solving.
Although these early systems covered only a limited set of problem types, they opened up a direction of research for the cognitive modeling of problem solving procedures using schemata. ARITHPRO [3] was such an attempt to increase the coverage of
procedures using schemata. ARITHPRO [3] was such an attempt to increase the coverage of
problem types and number of rules to include better solving strategies.
Bakman [4] proposed a basic arithmetic word problem solver called ROBUST that, unlike its predecessors, could additionally solve multiplication and division problems and operate on variations in natural language input. ROBUST also introduced the notion of relevancy of quantities appearing in a problem text. ROBUST used 8 different change schemata for differentiating between situations involving changes in place, ownership, creation of new objects, and termination of existing objects. ROBUST introduced the concept of a “formula”, which represented a generalized description of a situation related to a schema; for example, the Transfer-In-Place schema has an associated change formula describing how the quantity at a place is updated after a transfer.
Sundaram et al. (2015) [5] used the schemata proposed by Bakman to solve problems
available in a benchmark dataset AI2 [6] and achieved improved performance. The major
limitation of all these systems was their inability to deal with problems that required world
knowledge and common sense.
Kushman et al. [7] (KAZB) learned to map word problems to a set of equation templates, aligning the template slots with spans of the problem text. Most of the features used for modeling were linguistic, such as dependency trees, lemmas, phrases, unigrams, and bigrams; others dealt with the type of the expected answer, i.e., whether the solution
unigrams, bigrams, and others dealt with type of the expected answer i.e whether the solution
was positive or an integer. Computing the answer required summing over all templates and
all possible alignments which made the search space exponential. So during inference, this was
approximated employing a beam search. The authors showed that the system’s performance
was directly proportional to the increase in system templates’ frequency. The authors also
acknowledged that external knowledge bases may be required to understand the concepts of
decrease, profit, loss, and other semantic concepts. For equation template frequencies higher
than 20, the system was able to solve 87% of the word problems. Zhou et al. [8] proposed an
improved version of KAZB. The model only mapped the numbers in the word problem into the
number slots which significantly reduced the search space and made training easier in terms of
both time and space complexities. The proposed system registered an improvement of 10% over Kushman's system by solving 79.7% of the total problems. A robust log-linear model was
designed which maximized the margin between correct and incorrect assignments of slots inside
the templates. This resulted in a quadratic programming problem. Upadhyay et al. [9] showed
that many systems lacked the reasoning ability on how the equations were constructed. The
authors shared a dataset containing derivations of 2200 arithmetic word problems in addition
to the equations and solutions. Each derivation was composed of an equation template and
its corresponding alignments of slots with the problem text. They showed improvement in
performance when derivations were included in training the model.
ARIS [6] relied on the concept of verb categorization to form equations and solve word prob-
lems. They categorized each verb appearing in a sentence into one of 7 predefined classes. They
represented every sentence as a set of containers, entities, quantities, attributes, and relations.
Each word problem represented a partial state of the world. Quantities got updated or created
within the containers with the progression of state. The question queried about a quantity
in a particular state. ARIS performed the grounding of the required information mentioned
above using different tools from the Stanford CoreNLP suite [16]. The verbs were categorized using an SVM classifier whose features included WordNet [17] features, dependency-level relations between a verb and other words in a sentence, and similarity scores between a verb and a set of seed words. The updates of quantities between states were carried out by matching the containers and entities between two successive states. Irrelevant quantities could be easily found by looking at the non-matching containers. They released a public benchmark dataset for
arithmetic word problems named AI2. This dataset consisted of only addition and subtraction
problems. Most of the problems had 2 operands and 1 operation. The system could solve 77.7%
of the problems. The errors were attributed to mistakes made by external tools, such as dependency parsers and coreference resolvers, as well as to irrelevant information, lack of world knowledge, and missing explicit entities.
Wang et al. [18] proposed a hybrid model combining an RNN based seq2seq model and a similarity-based retrieval model to record an overall
improvement in solving word problems. They were the first to explore the effects of seq2seq
models on automatic word problem solving. Their two proposed models outperformed all the previous statistical models designed in this field.
Most of the word problem solvers perform poorly on datasets consisting of diverse word
problems like the Dolphin18K [19] dataset. Huang et al. [19] showed that a simple similarity
based retrieval model outperformed its sophisticated statistical counterparts on large datasets.
The Significant Number Identification (SNI) module is critical for identifying relevant numbers
inside a word problem. The SNI module is an LSTM-based binary classification model. Each
training sample for this model consists of a number and its context. The length of the context
window for each sample is 3. If a number is significant, it is replaced with a number symbol such as n1, n2, n3, and so on. The seq2seq model was 5 layers deep, with a word embedding layer, a 2-layer GRU as the encoder, and a 2-layer LSTM as the decoder. The retrieval model computed the lexical similarity between a test problem and all the training problems, where each word problem was represented as a TF-IDF vector.
The accuracy of the retrieval model was positively correlated with the maximal similarity score between the target problem and the problems in the training data. The authors observed that the seq2seq and retrieval models complemented each other, as shown in Figure 2.1.
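A minimal sketch of such a TF-IDF based retrieval model is shown below; the toy problems are illustrative placeholders and are not taken from any benchmark dataset.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy training problems (illustrative only).
train_problems = [
    "John has 3 apples. He buys 2 more. How many apples does he have?",
    "A shop sold 10 pens and 4 pencils. How many items were sold in total?",
]
test_problem = "Mary has 5 apples. She buys 3 more. How many apples does she have?"

# Represent every problem as a TF-IDF vector over the training vocabulary.
vectorizer = TfidfVectorizer()
train_vectors = vectorizer.fit_transform(train_problems)
test_vector = vectorizer.transform([test_problem])

# Retrieve the lexically most similar training problem.
similarities = cosine_similarity(test_vector, train_vectors)[0]
best = similarities.argmax()
print(train_problems[best], similarities[best])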
Xie et al. [20] modeled word problem solving in a goal driven manner. The proposed solver
generates an expression tree for a given word problem using a tree structured neural network. It
identifies the goal to achieve and then decomposes it into multiple subgoals recursively based on
the operators involved. For encoding the input problem text, the authors used gated recurrent
units on word embeddings and a recursive neural network as a decoder. They also showed
that the inclusion of subtree embeddings during the expression tree generation improved the
performance of the solver. The solver performed better than the sequence to sequence model
based solvers. Using this goal driven approach, they were able to eliminate the generation
of invalid mathematical expressions and spurious numbers that did not appear in the input
problem.
Ling et al. [21] introduced the concept of an answer rationale for solving word problems. Each
answer rationale consists of a sequence of steps required to solve a problem. For this, they use
a latent sequence of instructions where the solver first learns to convert an input word problem
into an instruction sequence and then learns to generate a solution based on the predicted
instructions. Similar to Ling et al. [21], Huang et al. (2018) [22] presented an intermediate representation, motivated by semantic parsing, for generating equations. Another
neural approach [23] used tree LSTM decoders for generating equations required to solve a
problem. Griffith et al. (2019) [24], Griffith et al. (2021) [25], Wang et al. [26] used transformer
[27] based models and experimentally studied the difficulty level in generating equations with
different notations (prefix, infix, postfix). Patel et al. [28] empirically showed that the current
solvers are only learning shallow features to solve word problems in the benchmark datasets. The
authors also pointed out a severe limitation of the solvers where they predicted the equations
without looking at the questions. Other studies, such as Sundaram et al. (2022) [29], suggest that deep learning based approaches are inadequate for learning the mapping between the linguistic features of word problems and the underlying mathematical concepts.
As a part of this work, we have implemented a frame based approach and neural approaches
for word problem solving. For training our models, we have developed datasets. We also highlight the need for data augmentation techniques and show their effectiveness in building better and more robust models. We tested our systems' performance on benchmark datasets and provide a comparative analysis against current systems. Below, we briefly discuss the deep learning concepts mainly used in this thesis.
The input for word problem solving is a sequence of words and the output is an equation
conditioned on the input which is again a sequence of symbols. The symbols denote the operands
and the operations. In most word problems, the operands are explicit and present in the problem
text whereas the operations are implicit and are to be inferred from the input problem. We
modeled word problem solving as a sequence transduction task. For this, we focused mainly on
neural models and utilized two kinds of architectures for modeling our experiments.
13
• Sequence-to-Sequence Architecture
• Transformer Architecture
Transformer [27] based models are the current best performing models in several NLP tasks.
These models have superior computational efficiency compared to recurrent models due to their parallelizability. This enables transformer models to be trained in significantly less time. The architecture
is shown in Figure 2.2.
Figure 2.2: Transformer Architecture (the figure is from the original paper)
The architecture is similar to a Seq2Seq [30] encoder-decoder model. The major difference lies in the processing of the input vectors. The vanilla encoder-decoder model [30] processes the inputs in a linear fashion, whereas the transformer can operate on the inputs in a parallelized manner. To preserve the order of the input sequence, positional encodings are added to provide information about the positions of the tokens. Self-attention helps the model capture contextual information effectively by looking at all the words in a sequence while processing a word. Self-attention is applied multiple times in parallel, which allows transformers to model the relationships between words from multiple perspectives. Each attention layer in the encoder and decoder is followed by a feed-forward network to learn complex patterns. Layer normalization is applied after each layer during training, along with residual connections.
Transformers use a scaled dot-product attention mechanism in which three primary matrices are associated with the input sequence: queries (Q), keys (K), and values (V). dk denotes the dimension of Q and K, and dv is the dimension of V. The attention output is computed as

Attention(Q, K, V) = softmax(QK^T / sqrt(dk)) V

This attention is not computed only once. Instead, it is computed multiple times in parallel with different linear projections, generating multiple output values. These output values are concatenated and projected to yield the final values. This is called multi-head attention and is useful for learning better representations of the input sequence.
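The following NumPy sketch illustrates the scaled dot-product attention computation described above; the sequence length, dimensions, and random inputs are illustrative. Multi-head attention applies this computation in parallel over several linear projections of Q, K, and V and concatenates the results.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (seq_len, d_k); V: (seq_len, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # pairwise attention scores
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V                    # weighted sum of the values

# Toy example: a sequence of 4 tokens with d_k = d_v = 8.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)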
Chapter 3
Datasets and Evaluation
Chapter 2 gave an overview of the approaches followed for word problem solving. To ascertain which approach is more accurate, evaluation becomes an essential step. The quality of the benchmark data and the metrics for evaluating models are important for presenting an accurate picture of which model is better at solving a wider range of word problems. In this chapter, we shed light on the nature of the benchmark datasets and on the evaluation strategies used for measuring the performance of Word Problem Solving (WPS) approaches.
3.1 Evaluation Metrics
Consider the following example:
• Question: John has 2 pens and 4 pencils. Robert has 4 pens. How many pens are there?
• Solution: 6
• Equation: X = 2 + 4
3.1.1 Solution Accuracy
Solution accuracy is calculated as the fraction of correctly solved problems out of the total number of word problems. Most systems are evaluated in terms of solution accuracy. This metric does not take into account how the answer is arrived at: if a solver predicts the answer to the above question as 6, it will be counted as correct.
3.1.2 Equation Accuracy
Equation accuracy is the fraction of correctly identified equations out of the total number of word problems. For the above question, a solver just needs to identify the equation as X = 2 + 4. The question contains two mentions of the number ‘4’; a solver evaluated in terms of equation accuracy does not need to align the exact mention of the number. Most systems are evaluated in terms of equation accuracy. However, exact-match equation accuracy labels an equation incorrect even if the predicted equation is equivalent to the given one: a predicted equation X = 4 + 2 for the given example would be deemed incorrect.
Equation accuracy as used in many previous works thus ignored the equivalence property of equations and expressions illustrated above. This becomes very crucial while evaluating the performance of solvers that can handle multi-step arithmetic word problems. We leveraged this concept and proposed a new evaluation method [33] for equation accuracy.
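The details of our evaluation method are presented later in the thesis. As a minimal illustration of the equivalence property itself (not the exact method of [33]), two expressions can be checked for equivalence symbolically, for example with sympy:

from sympy import simplify, sympify

def expressions_equivalent(expr1, expr2):
    # Two expressions are equivalent if their difference simplifies to zero.
    return simplify(sympify(expr1) - sympify(expr2)) == 0

print(expressions_equivalent("2 + 4", "4 + 2"))   # True
print(expressions_equivalent("x + 2*x", "3*x"))   # True
print(expressions_equivalent("2 + 4", "2 - 4"))   # False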
version of such problems with a very limited vocabulary. [34] showed that many benchmark datasets were composed of problems with high lexical overlap and equation template overlap, which we study in detail in the following sections. [34] proposed a greedy search technique to eliminate such frailties of datasets and showed a 20-30% reduction in the size of all the datasets.
They created a repository called “MAWPS” collecting all word problem datasets released by
different authors. [35] showed that many datasets had vocabulary biases: in the AI2 dataset, the verb ‘give’ was only associated with the subtraction operation, completely ignoring the addition sense of the verb. ALGES514 [7] contains 514 problems constructed from only 28 equation
templates. Similarly, [35] empirically showed that current solvers perform poorly on an unbiased dataset and suffer from a lack of generalization. They released a dataset comprising 1492
problems by eliminating the existing biases. Dolphin18K [19] is one of the biggest datasets in
terms of size and variety of problems. It has 18460 word problems and 5871 equation templates.
They showed that a simple similarity match between a problem and all the training problems
yielded better results than other models built using more sophisticated features on this dataset.
AQuA [21] is the biggest dataset for word problem solving. It contains 101449 triples of algebraic
questions, answers, and rationales. All these datasets are available in English. Math23k [18] and
Ape210k [36] are two large datasets available in Chinese. Both the datasets contain elementary
word problems needing only one unknown variable to solve them. However, the Ape210k dataset is more diverse in terms of questions and equation templates. ASDiv [37] is the most diverse English
dataset for single variable word problems covering a large number of text patterns. It also
contains annotations of problem type and grade level.
Lexical Overlap
As mentioned earlier, many benchmark datasets consist of similar types of word problems
with variation of only a few words. The following example demonstrates the nature of lexical
overlap. As the number of equation templates for single variable word problems containing one
or two operators is small, we do not discuss the template overlap here.
1. Joan went to 4 football games this year . She went to 9 games last year . How many
football games did Joan go to in all ?
2. John went to 5 football games this year . He went to 6 games last year . How many
football games did John go to in all ?
3. Joan went to 3 baseball games this year . She went to 8 games last year . How many
baseball games did Joan go to in all ?
Questions 1 and 2 differ only in terms of the named entities, specifically the person names and the pronouns associated with them. Questions 1 and 3 differ only in terms of the entities associated with the quantities: the entity or unit in question 1 is ‘football games’, whereas ‘baseball games’ is the entity or unit in question 3. Changing the subjects or persons' names and the associated entities is the most widely adopted technique for creating new word problems in benchmark datasets.
Ungrammaticality
As many of the benchmark datasets are created using crowdsourcing, grammatical errors can creep in. This was first shown by [34]. They used the ERG parser [38] to spot ungrammatical questions. We show an example of such a question present in one of the benchmark datasets.
• Incorrect Number Marker in Nouns - Joan found 70 seashells on the beach . she gave Sam
some of her seashells . She has 27 seashell . How many seashells did she give to Sam ?
In order to tackle the problem of lexical overlap, we devise a strategy, inspired by earlier works [34, 39], to remove highly overlapping word problems from a dataset, resulting in maximum diversity. Let T(p) denote the set of unigrams appearing in a word problem p after removing all numeric quantities and punctuation. We measure the lexical overlap between two word problems in terms of Jaccard similarity. Let LexSim(p, q) denote the lexical similarity between the word problems p and q, computed as LexSim(p, q) = |T(p) ∩ T(q)| / |T(p) ∪ T(q)|.
Let D denote a dataset containing n word problems. As a first step, we set a threshold value th for lexical similarity. For any word problem pi ∈ D, we calculate the lexical similarity of pi with the other problems pj, for all j > i, and remove from D every problem pj satisfying LexSim(pi, pj) >= th.
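A minimal sketch of this filtering procedure is shown below; the regex-based tokenization is a simple illustrative choice for removing numbers and punctuation.

import re

def tokens(problem):
    # T(p): unigrams after removing numeric quantities and punctuation.
    return set(re.findall(r"[a-z]+", problem.lower()))

def lex_sim(p, q):
    # Jaccard similarity between the unigram sets of two problems.
    tp, tq = tokens(p), tokens(q)
    union = tp | tq
    return len(tp & tq) / len(union) if union else 1.0

def deduplicate(dataset, th=0.8):
    # Keep a problem only if it is not too similar to any problem kept earlier.
    kept = []
    for p in dataset:
        if all(lex_sim(p, q) < th for q in kept):
            kept.append(p)
    return kept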
We use this technique on the benchmark datasets in English and present the results for
different thresholds in Table 3.1.
Dataset          Size    0.5    0.6    0.7    0.8    0.9    1.0
AI2 [6]           395    185    228    275    333    380    389
ASDiv [40]       2305   1948   2131   2227   2274   2298   2298
IL [10]           562    269    333    394    444    481    483
Single-Eq [34]    508    353    386    416    449    483    496
MAWPS [34]       2373    894   1035   1179   1316   1450   1802
CC [10]           600    114    116    117    118    121    364
Unbiased [35]    1492    856   1035   1197   1327   1431   1473
Table 3.1: Reduction of dataset sizes after removal of similar problems. Columns 0.5-1.0 give the reduced size at each lexical-similarity threshold.

We can observe that the highest overlap is between problems in the CC (Common Core) dataset, which contains word problems that need multiple arithmetic operations to solve. ASDiv is the most diverse of all the datasets containing a single unknown. MAWPS also follows a similar trend, as it is an amalgamation of different datasets including AI2, IL, Single-Eq, and CC. This method can be used as a preprocessing tool on any dataset before designing a solver, to leverage the property of diversity among problems.
We have used all the above English benchmark datasets except Unbiased for our approaches.
A portion of the Hindi word problems used in our study includes Hindi translations [33] of word
problems in the Unbiased dataset.
In this chapter, we have shed light on the nature of different benchmark datasets. If a dataset has significantly high overlap between its constituent problems, any evaluation done on it will overestimate the performance of the proposed systems. In chapters 7 and 8, we will explore how diversity in a dataset contributes to the efficiency of a word problem solver.
Chapter 4
Word Problem Solving Using Frame Identification
Although current word problem solving approaches are mostly neural, these systems lack explainability. They directly output equations or solutions without showing the intermediate steps of the solving process. We initially explored schema based techniques, which deduce the equation and answer through a systematic interaction of schemas. Such solvers can act as teaching aids for school children, as they are not only capable of answering quantity-related questions but can also answer queries related to the different associated entities. They can help students understand language comprehension as well as mathematical concepts. In addition, they can serve as a guide for understanding sentences containing quantities and their associations with other terms. Inspired by this, we present here a novel approach for automatic arithmetic word problem solving where frame identification acts as the main fulcrum for WPS.
4.1 Introduction
Early approaches [1, 3, 2, 4, 5] relied on the concept of schemas for solving word problems. Most of these could only solve word problems with a single addition or subtraction operation. Three major schema types were proposed: change, part-whole, and compare. These techniques provided a cognitive framework for better explainability. However, these systems had severe limitations in their ability to solve a wide range of problems. Bakman [4] extended the schema representation for better coverage and multi-step arithmetic and introduced the notion
of extraneous or irrelevant information. However, the approach only dealt with addition and
subtraction operations. So, there was a need to develop frames for other arithmetic operations
too. In this study, we developed frames encompassing all the operations. The frames are
inspired by the concept of verb semantics.
4.2 Definition
A frame is a basic computational unit consisting of relevant information for solving a word
problem. Instead of directly defining frames at a single level, we abstract them at two levels.
The first level decides the role of a frame in a word problem: a frame is either a State Frame or an Action Frame. At the second level, the frames are categorized based on the operations they evoke, which is detailed in the next section. State frames and action frames are identified by the verbs and other words in context. State frames are created based on stative verbs such as own, possess, contain, or phrases such as “there are”, “there exists”, etc. State frames act as the entity holders or containers for an entity. No operation is associated with these frames. Action frames correspond to verbs other than the stative verbs. Action frames act on state frames, either causing a change in the quantities of the state frames or creating new state frames as a result.
Every frame is unique and is identified by its slots. The slots are filled using the dependency
parsed output of a sentence. The slots include entity holder, entity, quantity of the entity,
recipient, and additional information such as place and time. The slots and frames help to
identify the type of question asked and the entities referred to. The frames are then used to build a graph where any change in quantities can be propagated to the neighboring nodes. Most current solvers can only answer questions related to a quantity, while our system can answer different kinds of questions such as ‘who’ and ‘what’, in addition to the quantity-related ‘how many’ questions, due to the presence of different kinds of slots indicative of their roles.
Frame Types
To decide the number of frame types, a thorough study was done using the English FrameNet
[41]. After analyzing different verbs and their corresponding frames, we came up with a list
of 40 frames. However, many frames from this list had overlapping properties in terms of
participants and conceptual roles. After eliminating the classes that were causing a high degree
of ambiguity, we arrived at a concise list of 22 frame types, as shown in Appendix A. Some of
them are mentioned here. The action frames are given with their associated operation inside
parentheses. Action frames require one or two entity holders to perform actions. Transfer frames
require two entity holders, evoking an addition operation for one entity holder while subtraction is performed for the other. Other action frames act upon a single entity holder.
• State Frame
– Possess
– Contain
– State Fact
– Existence
• Action Frame
– Gather (+)
– Transfer Money (+, -)
– Transfer Goods (+, -)
– Use Resource (-)
– Duplication (*)
– Separate Entities (/)
– Create (+)
– Getting (+)
In this chapter, we will discuss three major aspects of word problem solving using frame identification: annotating word problems with frames, automatically identifying frames, and solving word problems through interactions among the identified frames.
The questions for annotation are selected from the worksheets available at https://www.math-aids.com/Word_Problems/.
As frames are triggered by verbs, we created a list of frames and a list of words or verbs corresponding to each frame. Two annotators were involved in the frame annotation task. The inter-annotator agreement for frame annotation was 0.834 in terms of Fleiss' Kappa (https://en.wikipedia.org/wiki/Fleiss's_kappa), which is considered almost perfect agreement.
We created a command line tool using the Python programming language to facilitate annotation. The input for this tool is a question. First, the input question is split into its constituent sentences using the Spacy [42] NLP toolkit. A verb-to-frame mapping is used to facilitate the annotation. The tool automatically identifies the frames if it finds any verb that matches a frame. If no matching verb is found in a sentence, an annotator has to annotate the required frame ID or name for the frame. Finally, the equation required to solve the question is annotated.
For every question, the list of frames, the equation, and question-related information are stored in an XML format as shown below. All text is converted to lowercase while saving the frames.
<question>
<questionid>1</questionid>
<questionstring>Jonathan starts with 36 cards. He gives 35 to Barbara.
How many cards does Jonathan end with ?</questionstring>
<framestring>jonathan starts with 36 cards.</framestring>
<frame>possess</frame>
<framestring>he gives 35 to barbara.</framestring>
<frame>transfer_goods</frame>
<framestring>how many cards does jonathan end with ?</framestring>
<frame>possess</frame>
</question>
• Identification of Frames
• Parse Sentences
The classifiers were implemented using the sklearn [43] machine learning library. The classifiers used were Support Vector Machines [44] and Random Forests [45]. Each input text was represented as a TF-IDF [46] vector. TF-IDF (TF: Term Frequency, IDF: Inverse Document Frequency) assigns weights to words (or n-grams) based on their frequency in a document and their frequencies across documents, to quantify how important a word is to a document in a corpus. In this case, a document is a sentence. TF-IDF was calculated for word unigrams (uni) and bigrams (uni-bi) appearing in the text. We also experimented with character n-grams in different ranges (2-to-6 and 3-to-6). We did not use any additional lexical or linguistic features such as parts-of-speech tags, morph features, WordNet [17] features, or dependency labels for frame identification. The TF-IDF scores were computed at the sentence level. After trying out TF-IDF vectors at the word and character level separately, we concatenated the two vectors and retrained the models.
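A minimal sklearn sketch of the uni+char[3-6] configuration is shown below; the variable names sentences and frames are placeholders for the annotated data.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

# Concatenate word-unigram and character 3-to-6-gram TF-IDF vectors.
features = FeatureUnion([
    ("word_uni", TfidfVectorizer(analyzer="word", ngram_range=(1, 1))),
    ("char_3_6", TfidfVectorizer(analyzer="char", ngram_range=(3, 6))),
])
classifier = Pipeline([("tfidf", features), ("svm", LinearSVC())])

# sentences: list of sentence strings; frames: their gold frame labels.
# classifier.fit(sentences, frames)
# predicted = classifier.predict(test_sentences)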
Pre-trained language models trained on large corpora have proven useful in many NLP tasks. BERT [47] and its variants [48, 49] are multi-layered, bidirectional encoder representations from transformer [27] models which are trained on huge corpora and can be fine-tuned on any target downstream task to achieve performance improvements. For learning better contextual representations, these models are pre-trained on two tasks: masked token prediction and next sentence prediction. These models are easy to fine-tune in any supervised setting with just the addition of a single output layer with softmax activation. We used the Huggingface [50] transformer framework to fine-tune the available pre-trained models for frame identification. Two variants were used for the experiments: RoBERTa-base and DistilRoBERTa-base.
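A minimal sketch of this fine-tuning setup with the Huggingface framework is shown below; the output directory name and hyperparameters are illustrative, and the dataset objects (train_data, dev_data) are assumed to be prepared elsewhere.

from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "roberta-base"  # or "distilroberta-base"
NUM_FRAME_TYPES = 22

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# A linear classification head is added on top of the pre-trained encoder;
# softmax is applied implicitly inside the cross-entropy loss.
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=NUM_FRAME_TYPES)

args = TrainingArguments(output_dir="frame-classifier",
                         num_train_epochs=3,           # illustrative values
                         per_device_train_batch_size=16)

# train_data / dev_data: tokenized sentences paired with gold frame labels.
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_data, eval_dataset=dev_data)
# trainer.train()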
Model           Features         F1-Score
Linear-SVM      uni              0.87
Linear-SVM      uni-bi           0.86
Linear-SVM      char[2,6]        0.85
Linear-SVM      char[3,6]        0.85
Random-Forest   uni              0.84
Random-Forest   uni-bi           0.81
Random-Forest   char[2,6]        0.84
Random-Forest   char[3,6]        0.85
Linear-SVM      uni+char[3-6]    0.88
Random-Forest   uni+char[3-6]    0.86
Table 4.2: Frame Classification with TF-IDF Features
Model F1-Score
Distil-RoBERTa-base 0.94
RoBERTa-base 0.95
Table 4.3: Frame Classification with Transformer Based Models
Dependency Label    Frame Slot
Subject             Entity Holder
dobj                Entity
amod                Attribute of Entity
iobj                Beneficiary
nummod              Quantity
nmod:case           Additional Info
Table 4.4: Dependency Labels to Frame Slot Mapping

Consider the arithmetic word problem: “John had 5 books. John gave Robert 2 books. How many books does John have now?” The equation for this question is x = 5 − 2 and the solution is x = 3. We will derive the equation and the solution through the frames.
Figure 4.2 shows the initial frames created after parsing the first sentence. In many word problems, questions are asked about information that is not explicitly present in the problem text. Such questions can only be answered through proper inference. If the question in the above example is changed to “How many books are there?”, a solver needs to infer that “Somebody has some books” means “There are some books.” So, for every ‘possess’
frame, an existential frame is created and vice versa. Thus, in Figure 4.2, there are two frames connected to each other instead of one. Both of these are state frames. The second sentence gives information about a transfer operation carried out between two entity holders. The order and type of operation can be found by matching the entity holders and entities. In this case, the transfer_goods frame triggers a subtraction operation in one possess frame, with an update in its quantity slot: 5 − 2 = 3. It also creates another possess frame, triggering an addition operation with the quantity computed as 0 + 2 = 2. Once an update happens in any ‘possess’ frame, it automatically propagates to its neighboring ‘existence’ frame. Similarly, this kind of update is repeated for all frames that are attached to ‘possess’ and ‘existence’ frames. The question sentence is also parsed to find the frame type and the type of question asked. A “who” question seeks an answer from the entity holder slots of the frames; similarly, “what” maps to entity slots, and “how many” questions query the quantities involved in the frames. In the current system, the relations between these question types and frame slots are predefined. Each action frame is associated with a pre-defined operation.
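As an illustration of the slot filling in Table 4.4, the following sketch maps a few dependency labels to frame slots using spaCy; the label set and mapping shown here are simplified, since label inventories differ slightly across parsers.

import spacy

nlp = spacy.load("en_core_web_sm")

# Simplified mapping from dependency labels to frame slots (cf. Table 4.4).
LABEL_TO_SLOT = {
    "nsubj": "entity holder",
    "dobj": "entity",
    "amod": "attribute of entity",
    "nummod": "quantity",
}

def fill_slots(sentence):
    # Fill frame slots from the dependency parse of one sentence.
    slots = {}
    for token in nlp(sentence):
        slot = LABEL_TO_SLOT.get(token.dep_)
        if slot:
            slots[slot] = token.text
    return slots

print(fill_slots("John had 5 books."))
# expected, e.g.: {'entity holder': 'John', 'quantity': '5', 'entity': 'books'}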
Table 4.5 shows the comparison of our system with ARIS. Our system with SVM based verb categorization was able to solve 115 questions out of 302 questions of the AI2 dataset containing single addition and subtraction operations. This result improved when we integrated the RoBERTa-base verb categorization with our solver. The data and models can be found at https://github.com/Pruthwik/Frame-Identification-Models.

System                                                   Accuracy
ARIS                                                     77.7%
Our System + SVM Verb Categorization                     37.8%
Our System + RoBERTa-base Verb Categorization            43.2%
Table 4.5: Comparison of System Accuracy with ARIS
By analyzing the errors made by the solver, we found six major sources of errors. We present the categories of errors in Table 4.6.
The frame based solver expects the state of the world to be linear; if the order is scrambled, the solver incorrectly predicts the answer. In the example under parsing errors, the dependency label of “now” is “dobj”, which misleads the solver into considering it the entity. Here, the solver fails to find any matching frame. In another example, the word “rest” refers to a subtraction operation, which comes from world knowledge. Incorrect frame identification is another major source of error. In many cases, the coreferences are not resolved accurately. We used the CoreNLP suite [16] for coreference resolution. In the above example, the pronoun “they” was not resolved correctly, which caused the solver to output an incorrect solution.
Between the machine learning (ML) models, Support Vector Machines performed better than Random Forests. Unigram TF-IDF vectors were the most salient features. After a grid search to determine the best possible features at the word and character levels, we chose word unigrams and 3-to-6 character grams. When these two TF-IDF vectors were concatenated, performance improved for both ML classifiers. But BERT based transformer models outperform the ML models by a significant margin of 7% in terms of F1-score. This validates our hypothesis about fine-tuning pre-trained language models. RoBERTa, an optimized BERT model, was evaluated to be the best model. Distilled versions of the transformer model, obtained using knowledge distillation, are also comparable to the best performing model. The major ambiguities for transformer based models lay in frames that evoke similar types of operations. We show one such ambiguity pair below.
• Create, Getting
Other kinds of errors appeared when a sentence had multiple verbs: the model predicted the frame type for the latter verb, whereas the gold annotation referred to the type of the first verb, as shown in the third example. This is a case of wrong annotation. Another interesting case was observed while solving a word problem using frame types. If the frame in a sentence is predicted incorrectly but belongs to the same frame type or evokes a similar operation, then the solver is still able to find the correct answer. This scenario is explained below through a worked-out example.
• Question: A village has a population of 30000. 5000 people immigrated to the village last year. What is the population of the village?
– Gold Frame Type: Contain (we assign the possess frame to persons, not locations)
– Predicted Frame Type: Possess
Almost all arithmetic problem solvers output only the answer or the equation, but our system outputs a step-wise explanation along with the answer and the equation.
Even though the motivation for the design of the frames came from FrameNet [41], the output of our system is not similar to FrameNet's. If a verb had different meanings, we did not create different frames for the different semantics. We focused more on the computational aspect of the involved frames.
4.6 Conclusion
In this chapter, we presented an easily understandable framework for solving arithmetic word problems. We hope that it can assist teachers in explaining arithmetic operations with the help of frames and the slots appearing in them. In our approach, the arithmetic operation performed by each action frame is predefined; which action frame performs which operation could instead be learned. As we rely on external tools to solve word problems, cascading errors are introduced. So, in the coming chapters, we will attempt to minimize these errors by designing end-to-end systems for WPS.
Chapter 5
Word Problem Solving Using Composition of Classifiers
Frame based solvers, as shown in the previous chapter, are highly explainable and can very well act as teaching assistants. However, they are often not robust and require a large number of frame types to solve a variety of word problems. As neural solvers can address these limitations of frame based solvers, we move towards neural approaches. In this work, we
developed two kinds of approaches. In the first approach, we decompose word problem solving
into two different tasks: a. Operation Prediction and b. Relevant Operand Identification. Then,
the final equation is composed of these constituents. The second approach attempts to develop
a neural end-to-end arithmetic word problem solver that generates the equation components
at once. The first approach is detailed in this chapter, whereas the next chapter describes the
second approach.
We designed different systems for the first approach where the final equation is composed of
the predictions from different classification tasks.
5.1 DILTON
Instead of taking the word problem as a single string and learning the representation of the whole string, we split it into a problem text and a question sentence. In this attempt, we develop a solver, DILTON 1. It first predicts the basic arithmetic operation (‘-’, ‘+’, ‘*’, ‘/’) through a deep neural network based model, extracts the relevant operands, and then uses them to generate the equation and the answer. Its architecture is inspired by memory networks [52]: separate representations are learned for the supporting sentences and the question sentence and are then combined. This separation is shown in Figure 5.1.
1. Dilton was considered the smartest teenager in his school. We wanted to design an efficient system, so this name was chosen (https://en.wikipedia.org/wiki/Dilton_Doiley).
Figure 5.1: World and Query States of a Word Problem
5.1.1 Architecture
Our system is a pipeline consisting of 3 different modules that are detailed below. The
workflow is shown in Figure 5.2.
The input to our system is the problem text P, which has two relevant quantities num1 and num2. As a first step, P is split into two parts:
1. Query - the final question sentence of the word problem.
2. World State - the word problem without the final query, which contains the information required to answer the query.
We used word2vec [53] to convert each word in the world state and the query into its vector representation. We then used sequence encoders with Gated Recurrent Units (GRU) [54] to encode the world state and the query separately. We merge these two representations by an element-wise sum.
To find the operands in a word problem, we need to first filter out irrelevant quantities.
• Question: John has 3 pens and 2 pencils. Jane gave John 5 more pens. How many pens does John have now?
In this question, the quantity 2 is irrelevant, which can be easily found out by a similarity match between the context of the quantity and the question asked. The quantity is associated with the entity ‘pencil’, whereas the queried entity is ‘pen’. We experimented with different context window lengths around quantities and report the results in Table 5.2.
The last layer in our architecture is a softmax layer fully connected to the encoder's output, as shown in Figure 5.3. The output layer consists of 4 nodes corresponding to the basic arithmetic operators, and the node with the highest probability is selected as the operator. An equation is then composed of the relevant operands and the predicted operator. Finally, we execute the equation to get the final answer of the word problem.
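A minimal Keras sketch of this operator prediction network is shown below. The vocabulary size and hidden dimensions are illustrative, and for brevity a single learned embedding layer stands in for the combination of pre-trained word2vec and learned embeddings described below.

from tensorflow.keras import layers, models

VOCAB, EMB, HID, OPS = 5000, 64, 64, 4  # illustrative sizes

# Two separate GRU encoders over a shared embedding layer.
world_in = layers.Input(shape=(None,), name="world_state")
query_in = layers.Input(shape=(None,), name="query")
embed = layers.Embedding(VOCAB, EMB)
world_enc = layers.GRU(HID)(embed(world_in))
query_enc = layers.GRU(HID)(embed(query_in))

# Merge the two representations by an element-wise sum.
merged = layers.Add()([world_enc, query_enc])
merged = layers.Dropout(0.3)(merged)   # 30% dropout, as in the thesis

# Softmax over the four basic arithmetic operators (+, -, *, /).
op_out = layers.Dense(OPS, activation="softmax")(merged)

model = models.Model([world_in, query_in], op_out)
model.compile(optimizer="adam", loss="categorical_crossentropy")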
5.1.1.4 Dataset
We have used the publicly available MAWPS [34] dataset. This dataset consists of word
problems containing a single addition or subtraction. The dataset contains 1751 single step
arithmetic word problems.
We train the whole network end to end using categorical cross-entropy loss and the Adam [55] optimizer. The dropout rate [56] is set to 30% for regularization, to prevent overfitting. We used 300-dimensional pre-trained word2vec [53] embeddings and 64-dimensional network-learned embeddings, with a GRU [54] to encode both the query and the world state. The network was trained for 40 epochs.
We evaluated our system on two datasets: one being the original dataset (Section 5.1.1.4) and the
other being a subset of AI2 [6]. We evaluated DILTON in a 5-fold cross-validation setting on
the original dataset. For the cross-validation, we had 3 kinds of configurations. In the first
configuration rel, only the relevant quantities in a word problem are substituted with special
number tokens, e.g. {num1, num2, ..}. These relevant quantities are extracted by using the
gold equations. In the second configuration all, all the numeric values in the word problem are
replaced with the special number tokens. In the third configuration all+hidden, an additional
layer of 100 nodes is inserted before the output layer. The results are shown in Table 5.1. The
AI2 dataset consists of 186 questions that require multiple operations to be solved. Currently,
DILTON cannot solve these questions. We compared our system against the 209 problems
with single operations in the AI2 dataset, where DILTON shows a significant improvement over
previously reported accuracies. Table 5.2 shows the equation accuracy for different sizes of the
context window around a quantity.
5.1.1.7 Error Analysis
We can observe from the results that the operation prediction accuracy in the configurations rel
and all in 5-fold cross-validation is very similar. 75% of the incorrect operation predictions occur
between operators of the same precedence, i.e. the pairs (addition, subtraction) and
(multiplication, division). We can also see that smaller context windows around numeric
quantities perform better than larger context windows at finding relevant quantities. However,
relevant quantity prediction is significantly impacted as the number of irrelevant quantities
increases, which is evident from the superior performance of the rel configuration over the all
configuration. Including a hidden layer before the output layer does not improve the operation
prediction accuracy; therefore, the overall equation accuracy drops.
RoBERTa
RoBERTa [48] is a robustly optimized version of the original BERT model. The original
BERT was pre-trained using two unsupervised tasks, namely masked language modeling (MLM)
and next sentence prediction (NSP). RoBERTa uses dynamic masking, unlike the fixed masking in
BERT, and is trained only on complete sentences, removing the NSP loss from the training
phase. This model is trained on corpora 10 times larger than BERT's and uses larger batch sizes
than the initial BERT model. Analogous to BERT's CLS and SEP tokens, RoBERTa and its
distilled version use <s> and </s> to mark the start and end of a sentence or sequence. It also
utilizes byte-level byte pair encoding (BPE) [57] instead of the widely followed character-level
BPE to learn better universal representations for the input sentences. For our experiments, we
use RoBERTa-base with 12 encoder layers and 12 attention heads. The size of the hidden layer
is 768. The number of parameters in this model is 125 million.
Distil RoBERTa
Distil RoBERTa is a distilled version of the RoBERTa model. We have used the distilled
version of the RoBERTa-base model. It also uses byte-level BPE for tokenizing the input text
segments. This model has around 35% fewer parameters (82 million) than RoBERTa-base.
It has 6 encoder layers and 12 attention heads. Its major advantage is its speed: it performs
comparably to the original model on most NLU tasks.
Previous neural methods utilized this prediction task, but no dataset was publicly available.
Hence, we built a dataset for the relevant operand identification task. We annotated 3718 such
samples from 1751 word problems.
As the pre-trained models are trained on subwords, multi-digit numbers get split into
multiple subwords. In order to avoid this kind of undesired splitting, we substitute each number
with a single letter such as p, q, r, and so on. Earlier arithmetic solvers also followed this
convention to mitigate the limitations of sparse representations for different numbers. For this
task, we created two types of samples: (a) the context window around a quantity alone, and (b)
the context window concatenated with the question sentence.
The size of the context window is 7 for this task, i.e. {w_{i-3}, w_{i-2}, w_{i-1}, w_i, w_{i+1}, w_{i+2}, w_{i+3}}.
If a token is not present at one of these indices, which usually happens towards the start and
end of a sentence, we leave that index out. This setting was first suggested in Wang et al. [18],
and most of the neural approaches use the same setting for identifying the relevance of quantities
or operands. For creating data samples, we followed some pre-processing steps which are detailed
in the sections below.
Tokenization
As an initial step, every word problem is tokenized using white spaces and punctuations.
Question Sentence Identification
We utilized the spaCy parser [42] to parse a word problem and identify sentence
boundaries. Often, the last sentence is presented as the question to solve. We apply an additional
heuristic when the last sentence contains an ‘if’ clause to find the exact question span.
• Word Problem: Marvin has p eggs. Jacqueline has q eggs. If Jacqueline gives all of her
eggs to Marvin, how many eggs will Marvin have ?
• Last Sentence: If Jacqueline gives all of her eggs to Marvin, how many eggs will Marvin
have ?
While creating data samples with the context window and question, we concatenate them only
when there is no lexical overlap between them. Otherwise, we take only the non-overlapping
text.
The following example presents the approaches for data creation.
• Question: Joshua has 12 Skittles and 7 eggs. If he shares the Skittles among 4 friends,
how many Skittles does each friend get?
• Tokenization and Replacement of Numbers: Joshua has p Skittles and q eggs . If he shares
the Skittles among r friends , how many Skittles does each friend get ?
• Context Window Around Quantities with Question Sentence and Relevance Score
– Joshua has p Skittles and q how many Skittles does each friend get ? - 1
– p Skittles and q eggs . If how many Skittles does each friend get ? - 0
– the Skittles among r friends , how many Skittles does each friend get ? - 1
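A minimal sketch of this sample-creation step is given below; the function name and the position arguments are illustrative, and the overlap trimming described earlier is omitted for brevity.

```python
def make_samples(tokens, question_tokens, quantity_positions, relevant_positions, k=3):
    """One (context window + question, label) sample per quantity token."""
    samples = []
    for i in quantity_positions:
        # take w_{i-3} .. w_{i+3}; indices outside the sentence are simply left out
        window = tokens[max(0, i - k): i + k + 1]
        label = 1 if i in relevant_positions else 0
        samples.append((" ".join(window + question_tokens), label))
    return samples

tokens = ("Joshua has p Skittles and q eggs . If he shares the Skittles "
          "among r friends ,").split()
question = "how many Skittles does each friend get ?".split()
# p (index 2) and r (index 14) are the relevant quantities for this problem
print(make_samples(tokens, question, quantity_positions=[2, 5, 14],
                   relevant_positions={2, 14}))
```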
5.2.1.1 Experiments and Results
For this task, we created a training-free baseline. We computed the cosine similarity scores
between the contexts of quantities and the question sentence and selected the two quantities with
the highest scores. For representing the texts in the baseline, we utilized sentence-BERT
embeddings [58]. The other models were implemented using different variants of RoBERTa [48],
a robustly optimized version of the original implementation of BERT. All these models were
finetuned using the Huggingface [50] Transformer framework.
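A hedged sketch of this baseline using the sentence-transformers package is shown below; the specific model name is an assumption, and the context strings are taken from the example above.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence-BERT variant

def top_two_quantities(contexts, question):
    """Return indices of the two quantity contexts most similar to the question."""
    ctx_emb = model.encode(contexts, convert_to_tensor=True)
    q_emb = model.encode(question, convert_to_tensor=True)
    scores = util.cos_sim(ctx_emb, q_emb).squeeze(-1)   # cosine similarity per context
    return scores.topk(k=min(2, len(contexts))).indices.tolist()

contexts = ["Joshua has p Skittles and q",        # window around p
            "p Skittles and q eggs . If",          # window around q
            "the Skittles among r friends ,"]      # window around r
print(top_two_quantities(contexts, "how many Skittles does each friend get ?"))
```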
Both models were implemented using the Huggingface [50] transformer library. All
the experiments were conducted on a single NVIDIA GeForce GTX 1080 Ti GPU
with 11 GB of GDDR5X memory. All the models for both relevant operand prediction
and operation prediction were trained for 20 epochs. Each epoch takes around 2 minutes to
complete.
5.2.1.3 Discussion
Data for the quantity relevance prediction or significant number identification (SNI) task
is mostly unavailable, so it is difficult to compare the efficacy of our approach with earlier
approaches. Most of the previous systems reported only the accuracy. Benchmark datasets
are highly skewed where irrelevant numbers in word problems are concerned: at most 10% of
the problems contain numbers that are insignificant. Hence, we added macro F1 and macro
accuracy scores as evaluation metrics for the models, as the datasets have high class imbalance.
We observed a similar pattern in the results shown in Table 5.3, where there is a wide gap
between the F1 and accuracy scores. When the question sentence of a word problem is added
to the context window of a quantity, the model predictions become more accurate. This verifies
our hypothesis that bigger contexts for quantity prediction using BERT-like models are indeed
highly beneficial. The RoBERTa model performed the best, but the distilled version of the same
model gave comparable results without much degradation.
For the operation identification, the input to the model is the complete question and the
output is the required operation. We will take the same example as mentioned above.
• Question: Joshua has 12 Skittles and 7 eggs. If he shares the Skittles among 4 friends,
how many Skittles does each friend get?
• Tokenization and Replacement of Numbers: Joshua has p Skittles and q eggs . If he shares
the Skittles among r friends , how many Skittles does each friend get ?
• Operation: / or Division
Model                Micro F1
Distil-RoBERTa-base  94.06
RoBERTa-base         95.89
5.2.3 Discussion
The equation accuracy was computed by composing the predictions from the two models as
shown in Table 5.5. The configurations denote two different settings for operand predictions.
RoBERTa-base model performed the best in both the configurations. Operand predictions get
a boost when the question sentence is added to the context of the operands, thus improving
the overall equation accuracy. A similar pattern is also observed in operation identification
where the results of the distil RoBERTa were comparable to the original model. In the case of
operation identification, the major ambiguity lies between multiplication and division. One such
example is shown below:
• Word Problem 1: I have p cents to buy candy . If each gumdrop costs q cents , how many
gumdrops can I buy ?
• Target Operation: /
• Predicted Operation: ∗
As pointed out by Patel et al. [28], we also observe that deep learning networks focus primarily
on shallow features of word problems, attending only to certain words while deciding
either operands or operations.
5.3 Conclusion
In this chapter, we designed different classification tasks to predict the various components, such
as operands and operations, needed to solve a word problem. An equation is formed from these
constituents and solved to get the intended solution. We implemented different techniques for
these classification tasks and showed their efficacy. This kind of modeling only works when
the number of classification tasks is limited; in this case, the number of operands is 2 with a
single operation. As the number of operands and operators increases, the number of
classification tasks grows exponentially and becomes intractable. The next chapter
discusses generating the equation end to end at once instead of depending on the outputs of
separate tasks.
Chapter 6
The previous chapter generated the required equation for a word problem by composing the
results of the operand and operator prediction modules. The overall performance of such a solver
is impacted by the error propagation from the different modules. To overcome this limitation,
we develop approaches to generate equations for word problems at once.
We experimented with multiple approaches to develop end-to-end equation generators which
are detailed in the following sections.
6.1 EquGener
We introduce a novel method where we first learn a dense representation of the problem
description conditioned on the question. We leverage this representation to generate the operands
and operators in the appropriate order to form the equation. This approach is unlike several
sequence-to-sequence learners where the complete input word problem is fed to the encoder at
once, which places an additional burden on the decoder as it receives no explicit signal about
what is being asked in the question.
Our solver is an end-to-end memory network with an equation decoder. Our system handles
problems involving a single arithmetic operation and can be extended to multiple arithmetic
operations. We call our system EquGener.
An attention-based encoder-decoder [32] has been used as a baseline for our equation generation
system, as shown in Figure 6.1. Both the encoder and the decoder employ Long Short-Term
Memory (LSTM) networks to represent the input and target sequences respectively. In this
architecture, the input sequence is encoded as a sequence of word vectors, and the decoder has
access to all these vectors instead of a single vector. Each word vector is the concatenation of a
pre-trained GloVe [59] embedding and an embedding learned by the network from the training
corpus.
Figure 6.1: Architecture Diagram of Base Model
Equation generation for a word problem requires identifying words that indicate the presence
of operands and operators. The j-th hidden state h_j of the encoder is computed as in Equation
6.1 using an LSTM, as a function of the previous hidden state h_{j-1} and the current input s_j:
h_j = f(h_{j-1}, s_j)    (6.1)
The decoder is initialized with the hidden and cell states obtained at the last time-step of the
encoder. Each hidden state of the input word sequence and the hidden state of the equation
are compared to arrive at the alignment. The attention a_t at time step t is computed as per
Equation 6.2:

a_t(s) = align(h_t, \bar{h}_s) = exp(h_t^T \bar{h}_s) / \sum_{s'} exp(h_t^T \bar{h}_{s'})    (6.2)
The context vector c_t is computed as the weighted combination of the hidden states of the input
word sequence:

c_t = \sum_s a_t(s) \bar{h}_s    (6.3)
The attentional hidden state on the decoder side is obtained by concatenating the context vector
c_t from the input word sequence and the equation hidden state h_t, with a learned weight
matrix W_c:

\tilde{h}_t = tanh(W_c [c_t; h_t])    (6.4)
6.1.2 Memory Network Based Encoder
End-to-end memory networks [52] succeed in representing sentences as well as capturing
the salience or intent of the question in Question Answering systems. We used a variant
of memory networks, EquGener, to solve arithmetic word problems. The word representations
in the supporting sentences act as memories, and these are weighted as per the question. The
relevant memories are assigned higher weights than the irrelevant ones. This weighted combination
of memory vectors is then learned by the encoder to obtain a hidden representation of
the word sequence appearing in the supporting sentences conditioned on the question words.
The decoder then generates the equation conditioned on the encoded hidden representation.
Two kinds of memory network settings are generally followed: (1) explicit identification of
the supporting sentences where the answer components lie, and (2) no information regarding
supporting sentences. We used the latter configuration for our system, which requires less
supervision than the former. Considering our input to be a sequence of words w_1, w_2, w_3, ..., w_n,
these words are embedded into a lower-dimensional space of size d via an embedding
matrix A of size d × V, where V is the vocabulary size. These embeddings are learnt
during training. Each word is represented as a vector concatenating its GloVe embedding [59]
and its learned embedding. EquGener learned which words had to be attended to based only
on the question text. This strategy helped us in handling words which were not available in
pre-trained word vectors, like num_i in the problem text, where num_i represents the quantities
or operands.
m_i = [w_p; w_e]    (6.5)
where w_p and w_e denote the pre-trained and learned word embeddings respectively, and m_i
denotes the memory at position i. The question sentence in a word problem is also embedded
using another matrix B of the same dimension as A. A is used to embed the words appearing in
the supporting sentences, while B is used for the words in the question; this yields the internal
state u for the question. The match between the internal state and the memory elements
helps in predicting the components involved in an equation. The match is found using a dot
product between u and m_i followed by a softmax function. For a single-hop
memory network, u is computed by passing all the question words through an embedding layer.
If the embedding dimension is d and the maximum question length is denoted as q_max, then u
will be a matrix of dimension q_max × d. For a multi-hop memory network, u is updated
iteratively as shown in Equation 6.10.
p_i = softmax(u^T m_i)    (6.6)

softmax(x_i) = e^{x_i} / \sum_{j=1}^{n} e^{x_j}
where n refers to the total number of words in the supporting sentences. There is a probability
score for each word appearing in the memories. Each memory m_i has a corresponding output
vector c_i, which is obtained through another embedding matrix C. The output vector is a
weighted sum of the p_i s and c_i s.
We used a slight modification to the formulation of the original memory network [52], which is
given below in Equation 6.7. We describe the formulations for single-layer and multi-layer
memory networks.
Single Layer
The output memory representation o is a dense representation of the words in the supporting
sentences conditioned on the question:

o = \sum_{i=1}^{n} (p_i + c_i)    (6.7)
This output vector is passed through a fully connected dense layer to obtain a vector of the same
dimension as the word representations. The encoder input is the sum of the query embedding
and this vector, and this sequence is fed to an LSTM encoder:

d = Dense(o)    (6.8)
E = u + d    (6.9)
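The following NumPy sketch illustrates the single-layer memory addressing of Equations 6.6-6.9; for simplicity the internal state u is taken as a single d-dimensional vector and the dense layer is a plain weight matrix, and all sizes are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

n, d = 20, 64                      # number of memory words, embedding size
m = np.random.randn(n, d)          # memories m_i (embedding A)
c = np.random.randn(n, d)          # output vectors c_i (embedding C)
u = np.random.randn(d)             # internal question state (embedding B)

p = softmax(m @ u)                 # p_i = softmax(u^T m_i), Eq. 6.6
o = (p[:, None] + c).sum(axis=0)   # o = sum_i (p_i + c_i), Eq. 6.7
W = np.random.randn(d, d)          # weights of the fully connected layer
dvec = W @ o                       # d = Dense(o), Eq. 6.8
E = u + dvec                       # encoder input representation, Eq. 6.9
print(E.shape)                     # (64,)
```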
Multi-Layer
In the case of a multi-layer memory network, the hidden state gets updated in each hop with
the discovery of new attention points in the memories according to the question. Here, the same
input and output embeddings are used across the layers. For a k-hop memory network where
the memory layers are stacked on top of each other, the internal state is updated as follows:

u^{k+1} = u^k + d^k    (6.10)

The computation of d^k is the same as in the single-layer case. The hidden state of the encoder
is computed as per the standard LSTM equations 6.11-6.17. 1
6.1.3 Decoder
The decoder takes the encoder representation learned by the memory network and predicts
the sequence of operands and operators. The decoder is initialized with the last hidden state
and cell state from the encoder. The decoder also uses an LSTM to predict the next output
and is trained with teacher forcing [60]. The hidden and cell
states are computed according to the LSTM equations referenced in Section 6.1.2. In Figure 6.2,
the output tokens are referred to as Op1, Op2, and Opr, which stand for the operands and the
operator of the equation.
Operation Frequency
+ 472
− 445
∗ 226
/ 171
Total 1314
Table 6.1: Frequency Analysis of Operations in Training Data
1
http://colah.github.io/posts/2015-08-Understanding-LSTMs
6.1.4.1 Data
We used 1314 arithmetic problems with a single operation present in the MAWPS [34]
dataset as our training set. The operations include all basic mathematical operations: addition
(+), subtraction (-), multiplication (*), and division (/). The three benchmark datasets for
evaluation are MA1, MA2, and IXL [61], which are subsets of the AI2 dataset [6]
released as a part of Project Euclid 2 . We chose problems with only a single operation from
these datasets: 103 from MA1, 118 from MA2, and 81 from IXL.
6.1.4.2 Preprocessing
Every number appearing in a word problem is replaced by num_i in a random fashion, where
i ∈ [1, 6]. This is done to minimize the sparsity of the different kinds of numbers appearing in the
problem text. It also assisted in learning better representations of numbers in word problems.
NLTK [62] was used to tokenize the sentences in a word problem. The last sentence in the list
of tokenized sentences was considered the question sentence and the rest as supporting
sentences. The equations were labeled in postfix notation, e.g. num1 num3 +.
6.1.4.3 Setting
The development set was fixed to 5% of the training data. The embedding weights were
uniformly initialized. A dropout [56] rate of 0.2 was used while learning the embeddings
A, B, and C. The embedding and output dimensions for the LSTM were set to 64.
2
http://allenai.org/euclid.html
The concatenated vector representation for each word is of size 364. The maximum numbers of
words appearing in the supporting sentences and question sentences were 56 and 34 respectively.
No dropout rates were specified for the LSTMs. The recurrent weights were initialized as
a random orthogonal matrix. The input weights were drawn from a Glorot [63] uniform
distribution, with the biases initialized to zeros. By labeling the equations with the generic
num_i symbols, we were able to reduce the output vocabulary size. The encoder and decoder
hidden and cell states were also fixed to dimension 64. The Keras [64] deep learning library was
used to build the network. The Adam [55] optimizer was used for the optimization of the
parameters. The system was trained for 50 epochs with the validation set. In the case of
multiple hops, the same embeddings, A and C, were used in the different layers.
A_1 = A_2 = ... = A_K    (6.18)
C_1 = C_2 = ... = C_K    (6.19)
u^{k+1} = u^k + o^k    (6.20)
We compared our system with other systems on MA1, MA2, and IXL datasets. Table 6.3
shows the accuracy of the systems in solving word problems in terms of the percentage of
problems solved.
EquGener outperforms KAZB [7] significantly. KAZB uses a joint log-linear model over a
full system of predefined equations and alignments between the text and equation
templates. As the alignment space is exponential, beam search
is employed to find an approximate solution. KAZB uses surface-level features for the words and
does not employ any semantic representation of them, so it performs poorly on IXL,
where there are information gaps and irrelevant quantities. EquGener makes use of a dense
semantic representation and can identify irrelevant quantities easily. The system of Mitra et
al. [61] was the previous state of the art. EquGener performs better than it on the IXL dataset,
which has more information gaps. Mitra et al. [61] classify each addition or subtraction problem
into 3 concepts, and each concept is associated with a formula. Different features are defined for
each formula. Multiple formulas can be applied to a word problem, and a log-linear model scores
each formula based on its features. The biggest limitation of this system is its reliance on
external tools like WordNet, a dependency parser, and ConceptNet. The system can only solve
addition and subtraction problems. EquGener does not need the computation of extra features
to solve word problems and can solve problems with any arithmetic operation. However,
EquGener's performance dips on the MA2 dataset, which contains a high percentage of irrelevant
information; there it even performs worse than the attention-based encoder-decoder. This is
evident from its operand predictions, which are wrong for 58% of the problems on average, while
the operator is correctly identified in 91% of the problems in MA2.
Figures 6.3 and 6.4: Operand and operator prediction accuracies (%) of the Baseline,
EquGener-Hop1, and EquGener-Hop2 systems (bar charts).
The operand prediction accuracies improve with the number of hops. EquGener requires
several hops to identify the exact operands correctly. The major error in the operand prediction
Figure 6.5: Operand and operator prediction accuracies (%) of the Baseline, EquGener-Hop1,
and EquGener-Hop2 systems (bar chart).
in our system resulted from predictions that were out of order. The accuracy of the relevant
quantity or operand prediction in [65] was 89.1%. EquGener improved upon this accuracy by 3%
on 2 of the datasets shown in Figures 6.3, 6.4, and 6.5.
Some of the erroneous outputs produced by EquGener are shown in Table 6.4. In the
first example, the system identified the operator accurately. However, the system could not
identify the direction of transfer for the verb ‘borrow’, which resulted in an erroneous prediction
of the order of the equation components. We observe that this is a problem due to data sparsity
and can be overcome by adding more such examples to the training data.
In the 2nd example, the system predicted an operand num5, which is not present in the problem
description. Though we expect the numerical values mentioned in the problem description
not to be attached to any context, repeated occurrences may violate this assumption in certain
cases. An architectural improvement in handling the numerical values could improve the results.
Similarly, in the 3rd example, the system incorrectly identifies num3 as a relevant operand.
This error can be resolved by adopting an approach similar to Wang et al. [18], which modeled
relevant operand identification as a classification task.
TestSet: MA1
Question: Joan picked num3 apples from the orchard . Melanie borrowed num1 apples from
her . How many apples does Joan have now ?
Predicted: num1 num3 -    Actual: num3 num1 -

TestSet: IXL
Question: Tom went to num1 hockey games this year , but missed num4 . He went to num3
games last year . How many hockey games did Tom go to in all ?
Predicted: num1 num5 +    Actual: num1 num3 +

TestSet: MA2
Question: In num3 week , Mitch 's family drank num4 carton of regular milk and num2 carton
of soy milk . How much milk did they drink in all ?
Predicted: num3 num2 +    Actual: num4 num2 +

Table 6.4: Predicted and Actual Equations from Different Test Sets
6.1.5.4 Discussion
Figure 6.6 below shows the relative attention on the words in the supporting sentences
conditioned on the question words. The words in the question are shown on the Y-axis, and the
words on the X-axis constitute the supporting sentences. EquGener is able to figure out that
num4 is an irrelevant quantity, as it is associated with the entity crayons whereas the question
asks about the rulers. The verb “place” appears in the context of the rulers, so it also receives
higher weights.
6.2 Equation Generation using other Neural Approaches
We experimented with two sequence-to-sequence architectures:
• BiLSTM with Global Attention
• Transformers [27]
For the above-mentioned architectures, pre-trained subword embeddings [67] were used. Each
word problem was tokenized using a subword tokenizer. The special number symbols, e.g.
‘num1’, ‘num2’, etc., which were used in the earlier approaches, get split into two tokens by the
subword tokenization. So, we use the single letters ‘p’, ‘q’, ‘r’ for ‘num1’, ‘num2’, ‘num3’
respectively. One example of subword tokenization is given below.
• Original Word Problem: A ship is filled with 10 tons of cargo . It stops in the Bahamas ,
where sailors load 15 tons of cargo onboard . How many tons of cargo does the ship hold
now ?
• Word Problem after replacing special number token: A ship is filled with p tons of cargo
. It stops in the Bahamas , where sailors load r tons of cargo onboard . How many tons
of cargo does the ship hold now ?
• Subword Tokenization: _a _ship _is _filled _with _p _tons _of _cargo _. _it _stops
_in _the _bahamas _, _where _sailors _load _r _tons _of _cargo _onboard _. _how
_many _tons _of _cargo _does _the _ship _hold _now _?
The configurations used for the two neural networks are given in Tables 6.5 and 6.6.
6.2.3 Results
During training, the models were evaluated every 100 steps on the validation set. We tested the
systems on the same 3 test sets as the earlier systems. The results are shown in Table 6.8.
parameter value
Word Embedding Size 300
Subword Embedding Size 300
Encoder Layers 2
Decoder Layers 2
Input Sequence Length 200
Output Sequence Length 200
Dropout Rate 0.3
Batch Size 64
Optimizer Adam
Table 6.5: Configuration of BiLSTM model with Global Attention for English
parameter value
Subword Embedding Size 300
Encoder Layers 4
Decoder Layers 4
Heads 4
Input Sequence Length 200
Output Sequence Length 200
Nodes in Feed forward layer 2048
Dropout Rate 0.1
Attention Dropout Rate 0.1
Batch Size 64
Optimizer Adam
Table 6.6: Configuration of Transformer Model for English
Table 6.9: Comparison of proposed system with other systems for English. Numerical values
represent Equation Accuracy
Name #Word_Problems Property
MA1 103 Most contain only relevant info
MA2 118 Most contain irrelevant info
IXL 81 Problems contain information gap
Table 6.7: Test set Statistics
6.2.5 Observation
The BiLSTM with attention performs better than the transformer network. The
transformer-based model reaches peak performance in a small number of training
steps, but then degrades drastically. In contrast, the performance of the BiLSTM
network with global attention is steady and improves as training steps increase. When compared
to other systems, this model surpasses all the previous state-of-the-art scores. EquGener used
word embeddings, while the attention-based BiLSTM model uses subword embeddings.
6.3.2 Discussion
We can observe that the T5-large model is superior to the T5-small and T5-base models.
The T5-small model's performance degrades when there is missing or irrelevant information in
the word problems, which is evident from its equation accuracies on the MA2 and IXL datasets.
Its operator identification accuracy is also low, which cascades into lower equation accuracies.
T5-base is comparable to the large model on MA1 and IXL, but the T5-large model performs
better in the presence of irrelevant information as well. The T5-base and T5-large models
outperform all the other models described in this chapter.
Model             Data  Train_Steps  Equ_Acc
BiLSTM-Attention  MA1   500          17.476
                        1000         74.757
                        1500         94.175
                        2000         94.175
                  MA2   500          16.949
                        1000         77.119
                        1500         87.288
                        2000         87.288
                  IXL   500          22.222
                        1000         83.951
                        1500         88.889
                        2000         90.123
Transformer       MA1   200          74.748
                        300          67.961
                        500          82.524
                        1000         51.456
                  MA2   200          53.39
                        400          55.932
                        500          47.458
                        1000         44.915
                  IXL   200          65.432
                        300          61.728
                        500          71.605
                        1000         44.444
Table 6.8: Test Results for BiLSTM Attention and Transformer Networks for English
We calculated the average Jaccard similarity of the evaluation datasets with the training
dataset to see whether the overlap between them impacts performance. We computed this
similarity by first calculating the maximum similarity of each question in the test set with the
training questions. As the next step, we averaged these maximum similarities to arrive at a
score that denotes the overall similarity of the test set. The operands or numeric quantities
were ignored while calculating the Jaccard similarity. The Jaccard similarity is defined as

Jac_Sim(A, B) = |A ∩ B| / |A ∪ B|
where Jac_Sim(A, B) is the Jaccard similarity between two sets A and B. In our case, one set
contains the words in a test question and the other one is a collection of words in a question
in the training dataset. This process is expressed mathematically in Equations 6.21 and 6.22:

max_jac_sim_test_j = max_{train_i ∈ Train} Jac_Sim(test_j, train_i)    (6.21)

avg_jac_sim = (1/|Test|) \sum_{j ∈ Test} max_jac_sim_test_j    (6.22)

Model     Enc-Dec Layers  Attention Heads  Parameters (in millions)
T5-Small  6               8                60
T5-Base   12              12               220
T5-Large  24              16               770
Table 6.10: Details of the used T5 Models
Test Set  Avg. Jaccard Similarity
MA1       0.967
MA2       0.923
IXL       0.94
Table 6.12: Average Jaccard Similarity Scores of Test Sets with Training Set
where Train and Test denote the training dataset and test dataset respectively. We computed
the average Jaccard similarities for all three test sets MA1, MA2, and IXL. This is shown in
Table 6.12.
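The computation of Equations 6.21 and 6.22 can be sketched as below; the operand tokens to ignore are illustrative.

```python
def jaccard(a, b):
    return len(a & b) / len(a | b)

def avg_max_jaccard(test_qs, train_qs,
                    operands=("num1", "num2", "num3", "num4", "num5", "num6")):
    """Equations 6.21-6.22: average, over the test questions, of each question's
    maximum Jaccard similarity with any training question."""
    def words(q):
        return {w for w in q.lower().split() if w not in operands}
    train_sets = [words(q) for q in train_qs]
    sims = [max(jaccard(words(t), tr) for tr in train_sets) for t in test_qs]
    return sum(sims) / len(sims)

print(avg_max_jaccard(
    ["Joan has num1 apples . How many apples ?"],
    ["Joan has num1 apples . How many apples does she have ?"]))
```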
The evaluation accuracies are overestimated, as there is very high similarity (a score of 1 would
mean all the test examples appear in the training set) between the evaluation and training
datasets used in this chapter. This has also been shown in earlier works on the MAWPS dataset,
where the lexical overlap between the word problems is very high.
6.5 Conclusion
In this chapter, we showed the effectiveness of deep neural networks in solving single-variable
word problems. The system has better accuracy than frame-based or machine-learning-based
solvers, but trades off explainability. We also showed that the solver achieves its highest
performance when the word problems in an evaluation dataset are lexically very similar to the
word problems in the training set. In the next chapters, we show that diversity in word problems
impacts the performance of the solvers.
Chapter 7
Any deep learning based technique requires a reasonable amount of training data to perform
well. Chapter 3 shows that the datasets available for word problems are resource-poor
in terms of the number and type of problems. These datasets contain word problems with very
high lexical and template overlap. This severely limits the ability of solvers trained on them to
generalize and solve unseen problems. Therefore, one needs to increase the data size. Data
augmentation techniques serve the purpose of ramping up the number of training samples. In
this chapter, we explore different data augmentation techniques to tackle this problem.
Easy Data Augmentation (EDA) techniques [69] are simple data manipulation techniques
for boosting the performance of text classification tasks. Table 7.1 shows the application of
different EDA operations on a sentence “John bought 10 apples.”. Although these techniques
operation sentence
Random Insertion John sad bought 10 apples.
Random Swap John bought apples 10.
Random Deletion John apples 10.
Synonym Replacement John purchased 10 berries.
Table 7.1: Example of EDA Operations
can easily be adapted to generate new word problems, many of the generated word problems
are either incomplete or completely ungrammatical. So, we have only made use of the synonym
replacement operation for generation in our proposed technique.
We consider the following content word classes for replacement:
1. Adjective
2. Adverb
3. Noun
4. Verb
We used the spaCy toolkit [42] to part-of-speech (POS) tag the input sentences. Only the tokens
belonging to the above classes are selected for replacement. We used a pre-trained word2vec
[73] model to obtain the embeddings of tokens. This word2vec model is trained on roughly
100 billion words from a Google News dataset and has a vocabulary of 3 million words. For
any content word in the input sentence, the closest matches in terms of cosine similarity are
searched in the pool of words corresponding to its POS class from the paraphrase table.
If a sentence has ‘m’ content words and the top ‘n’ similar words for each content word are
selected, then there will be a total of m × n possible single-word replacements. As any word
problem consists of 2-3 sentences, this method can generate an exponential number of word
problems.
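A hedged sketch of the candidate-retrieval step is given below, assuming gensim's KeyedVectors for the Google News word2vec model; the model path is illustrative, and the restriction to the POS-specific paraphrase-table pools is omitted for brevity.

```python
from gensim.models import KeyedVectors
import spacy

nlp = spacy.load("en_core_web_sm")
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)  # illustrative path

CONTENT_POS = {"ADJ", "ADV", "NOUN", "VERB"}

def replacement_candidates(sentence, n=3):
    """Top-n distributionally similar words for each content word."""
    cands = {}
    for tok in nlp(sentence):
        if tok.pos_ in CONTENT_POS and tok.text in w2v:
            cands[tok.text] = [w for w, _ in w2v.most_similar(tok.text, topn=n)]
    return cands

print(replacement_candidates("John bought 10 apples ."))
```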
Let us look into some generated word problems.
• Original Word Problem: For Halloween Debby and her sister combined the candy they
received . Debby had 32 pieces of candy while her sister had 42 . If they ate 35 pieces the
first night , how many pieces do they have left ?
1. For christmas Debby and her daughter combined the lollipop they received . Debby
had 32 artifacts of lollipop while her daughter had 42 . If they ate 35 artifacts the
first evening , how many artifacts do they have left ?
2. For birthday Debby and her grandmother combined the lollipop they received . Debby
had 32 artifacts of lollipop while her grandmother had 42 . If they ate 35 artifacts
the first hours , how many artifacts do they have left ?
The first example is generated when top 3 similar words are chosen for each content word.
Similarly, the generation of the second example is done with the top 5 similar words for each
of the content words.
We can see qualitative differences between the generations. An increase in the value of ‘n’
creates word problems with jarring word substitutions, such as ‘hours’ in place of ‘night’. In both
examples, the substitution of ‘artifacts’ for ‘pieces’ is not apt. Although all the other substitutions
are acceptable in terms of grammaticality and synonymy, some substitutions degrade the overall
word problem. So, we add lexical constraints to improve the quality of the generated
problems.
Substituting words based on distributional similarity alone generates noisy sentences rather
than grammatical ones. So, we include additional linguistic constraints in the technique
described in Section 7.1.2 for generating word problems, to ensure quality and grammatical
correctness. Before replacing a word with a similar word determined by their closeness, we
evaluate the synonym similarity [74] between the two words. This synonym similarity is
measured as Wu-Palmer (‘wup’) similarity.
‘wup similarity’ calculates how related two synsets are in terms of their depth in the WordNet
taxonomy, along with the depth of their Least Common Subsumer (LCS). It is calculated
based on the distance between the two synsets relative to each other in the hypernym tree.
If the ‘wup similarity’ is less than a threshold, we consider the replacement infeasible. For our
generation task, we fixed the threshold at 0.5.
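A minimal sketch of this check using NLTK's WordNet interface follows; taking the best-scoring synset pair for the two words is one reasonable reading of the procedure.

```python
from nltk.corpus import wordnet as wn

def wup_ok(word, candidate, threshold=0.5):
    """Accept the substitution only if some synset pair clears the threshold."""
    syns_w, syns_c = wn.synsets(word), wn.synsets(candidate)
    if not syns_w or not syns_c:
        return False
    best = max((s1.wup_similarity(s2) or 0.0)   # wup_similarity may return None
               for s1 in syns_w for s2 in syns_c)
    return best >= threshold

print(wup_ok("night", "evening"))   # likely accepted
print(wup_ok("night", "hours"))     # likely rejected at threshold 0.5
```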
According to J.R. Firth, “a word is characterized by the company it keeps”. So, a word
cannot be blindly substituted by another word without considering its context; doing so may
create jarring sentences. We employ a language model to validate whether a word
replacement is compatible with its context. We use a pre-trained KenLM [75, 76] language
model for this, which was trained on the Wall Street Journal corpus. The language model assigns
a log probability score to an input text utilizing the saved probabilities of up to 5-grams.
Let w_{i-2}, w_{i-1}, w_i, w_{i+1}, w_{i+2} be a 5-gram and s_i be the word to be substituted in
place of w_i. Let ‘score’ denote a function that assigns a score to an n-gram; here, we chose
n = 5. Let ‘th’ represent a threshold. If

score(w_{i-2}, w_{i-1}, w_i, w_{i+1}, w_{i+2}) − score(w_{i-2}, w_{i-1}, s_i, w_{i+1}, w_{i+2}) ≤ th,

then the w_i → s_i substitution is deemed compatible. For our generation task, we fixed th = 2.
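A hedged sketch of this compatibility check with the kenlm Python bindings is given below; the model path is illustrative.

```python
import kenlm

lm = kenlm.Model("wsj_5gram.arpa")   # 5-gram model trained on WSJ (illustrative path)

def substitution_compatible(left, word, sub, right, th=2.0):
    """Accept w_i -> s_i if the LM log-probability drops by at most th."""
    orig = lm.score(" ".join(left + [word] + right), bos=False, eos=False)
    repl = lm.score(" ".join(left + [sub] + right), bos=False, eos=False)
    return (orig - repl) <= th

# context window w_{i-2} .. w_{i+2} around the candidate position
print(substitution_compatible(["ate", "35"], "pieces", "artifacts", ["the", "first"]))
```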
Different gazetteer lists were also prepared for replacing different types of named entities,
such as male names, female names, and currencies. There are other types of content words
that are not present in the paraphrase tables, and the similar words retrieved by the word2vec
model for them are noisy. For the word “green”, the three most similar words by the word2vec
model are “wearin_o”, “greener”, and “workers_differently_Corenthal”. In order to mitigate
this problem, we also created a gazetteer list of color names. This summarizes our proposed
augmentation technique utilizing paraphrase tables of content words.
We compared three MT systems for translating Chinese word problems into English:
• An in-house MT system
• Opus MT
• Google Translate API
The translations were adjudged ‘acceptable’ if the output generated by an MT system was
natural and fluent English. The in-house and Opus MT systems generated disfluent or
incomplete outputs for 30% of the word problems, whereas most of the Chinese-to-English
translations made by the Google Translate API were acceptable. Hence, we finalized the Google
Translate API for this task. We translated 123430 word problems from the Ape210K dataset.
The source side was annotated with word problems and their corresponding equations; we
copied the corresponding equations to the target side as well.
However, word problems generated by using MT systems have noise. We discuss the noise
removal techniques used for our study.
1. Matching of Named Entities: For a Chinese word problem and its English translation,
we first match the number of named entities in the Chinese text with that in its
corresponding English translation. If they match, we consider it a valid translation;
otherwise we reject the translation. For this task, we used the Spacy [42] Named Entity
Recognizers (NER) available for Chinese and English (a sketch of this check follows the
list below). This matching reduced the number of translations by roughly two-thirds,
from 123430 to 47113.
2. Non-ASCII Characters: A few word problems were ‘fill in the blanks’ and ‘symbol
manipulation’ types of questions using various non-ASCII symbols; 178 of these were
present in the translations. We filtered them out from our dataset.
3. Erroneous Text: Some translations contained erroneous text consisting of stray
parentheses. One such example is given below:
• A rope is 21 meters long, use (1/5), and leave the full length ((())/(())), if you use
6 meters, use the full length ((() )/(())), the full length ((())/(())) is left, and how
many meters are left.
4. Sampled Validation: After removing noise using the above methods, we sampled
500 word problems from the translations and manually validated them. If a translated
word problem can be fully understood and solved using the given source equation, then
we consider the word problem valid and solvable. We found only 3 problems unsolvable.
Hence, we conclude that the dataset created using MT consists of valid word problems.
5. Matching of Quantities: After applying all these noise removal techniques, we match
the numeric quantities in a translation against its corresponding equation required to
solve the Chinese word problem. We consider a problem valid if all the quantities in the
equation are present in the problem text. This approach fails for implicit word problems,
which involve unit conversion or domain knowledge.
For example: The radius of a circle is 5 cm. What is its area? The equation to solve
this word problem is x = 3.14 ∗ 5 ∗ 5. The value 3.14 (pi or π) is not present in the
problem text. This kind of domain knowledge is needed for solving geometry problems.
We do not want to remove valid translations for non-matching numeric quantities.
For our experiments, we split the translated dataset into two parts. The first part contains
explicit word problems, where all the numbers in the equations come directly from the
word problems. The other type poses a different challenge for any solver: these are called
implicit word problems, where the required equation involves a numeric quantity that is
not present in the problem text.
Out of the 44K translated word problems, 10965 are explicit in nature; the rest are
implicit. As reported in Sharma et al. [33], solvers struggle to map words in a problem
to an implicit number. They replaced implicit numbers with specific number symbols
in the equation. Let us look at an example similar to the ones in Sharma et al. [33]:
• Question: Harsh buys 48 apples from the market. How many dozens of apples does
he have?
• Equation With Numbers: 48/12
• Question With Meta Symbols: Harsh buys p apples from the market. How many
dozens of apples does he have?
• Equation With Meta Symbols: p/b
where b refers to the number 12 or a dozen. Sharma et al. [33] show that this kind of
implicit mapping is difficult to learn from word problems. For handling these types of
cases, we introduce a method of embedding knowledge from a pre-defined knowledge
base that can act as a cue for solving implicit word problems, which will be explained in
the next chapter.
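A minimal sketch of the named-entity matching filter from step 1 above is shown below, assuming spaCy pipelines for Chinese and English are installed.

```python
import spacy

nlp_zh = spacy.load("zh_core_web_sm")
nlp_en = spacy.load("en_core_web_sm")

def entity_counts_match(zh_text, en_text):
    """Keep a translation only if both sides contain the same number of entities."""
    return len(nlp_zh(zh_text).ents) == len(nlp_en(en_text).ents)
```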
#Operations  #Word_Problems
1            2955
2            4669
3            1476
4            548
Total        9648
Table 7.2: Operation-wise Distribution of the Explicit Dataset
A few word problems in the explicit dataset require more than 10 operations to solve. We
restrict ourselves to word problems that require at most 4 operations, which cover around
88% of the explicit dataset. The details of the explicit dataset are given in Table 7.2.
However, the remaining 30000 implicit word problems still need to be verified by conducting
experiments.
We split each dataset in an 80:20 ratio after shuffling to create training and validation sets.
As seen in the previous chapters, solvers achieved their highest performance when predicting
equations written in postfix notation. The solver was a BiLSTM network with global attention
built with the OpenNMT deep learning framework, as explained in the previous chapter.
Pre-trained subword embeddings were used for representing word problems and equations.
We used equations in postfix notation for all the experiments. Table 7.3 shows the details
of the data used for this experiment.
We tested the solvers on a completely unseen dataset, IL (Illinois) [10], to study
the impact of data augmentation. This test dataset consists of 321 word problems.
Table 7.4 shows that the model trained on the augmented data improves the performance by
8%, demonstrating the effectiveness of the proposed data augmentation technique.
As shown in chapter 3, ASDiv [40] is the most diverse corpus available for single variable
word problems in English. In this section, we want to investigate how big a factor diversity is
for word problem solving.
ASDiv Dataset
No_of_Operation #Word_Problems
1 985
2 338
Total 1323
Table 7.5: Distribution of Chosen ASDiv Word Problems
The ASDiv dataset (Academia Sinica Diverse MWP Dataset) [40] was released as a diverse
dataset containing English word problems of different grade levels and problem types from
elementary mathematics curricula. We selected word problems with one or two operations;
the dataset details are given in Table 7.5. We augment this dataset in 3 different ways and
report the results in the next section.
Lexical Augmentation Using Paraphrase Tables
We augmented the ASDiv dataset with our proposed approach using the distributional
similarity of content words from the paraphrase tables. We also applied all the constraints
mentioned above for the generation of word problems. We imposed another constraint on the
number of valid lexical substitutions to increase the diversity of the generation. We also limited
the number of generated word problems per original word problem to 5, in order to avoid
over-generation of a single type of question. We did not use sentence-level paraphrasing, as the
available paraphrasing models often changed the numbers in a word problem, and as an effect
of that, the equation would also change.
No_of_Operation #Word_Problems
1 1993
2 930
Total 2923
Table 7.6: Distribution of Augmented Word Problems with Original
The details of the dataset created after this augmentation are shown in Table 7.6. We notice
that the addition of constraints limits the number of generated word problems: the total
number of generated problems is 1600, and the average number of generations per word problem
is 1.21.
No_of_Operation  #Word_Problems
1                4952
2                5613
3                1476
4                530
Total            12571
Table 7.7: Distribution of the Fully Augmented Dataset
The final type of augmentation combined the word problems generated by both augmentation
techniques with the original dataset. In this experiment, we used the full dataset of
explicit word problems as shown in Table 7.2.
In the original dataset, the ratio of the number of questions with one operation to the number
of questions with two operations is 2.91:1. In the fully augmented data in Table 7.7, there are
word problems with 3 and 4 operations, while the number of word problems with 1
and 2 operations remains high. This dataset has 1098 unique equation templates. If the
operands are replaced by a single meta tag, the number of unique equation templates is 309.
Experimental Setup
We conducted 5-fold stratified cross-validation on the original and paraphrase-table-augmented
datasets. We experimented with all three equation notations, infix, postfix, and prefix,
to study the impact of data augmentation on equation types. For the fully
augmented data, we report the average of 3 runs. For all the experiments, the operation
distribution was taken into account for splitting the 5 folds in a stratified setting. The equations
in the Ape210K dataset are in infix notation; we converted them into postfix and prefix
notations. We used an encoder-decoder model with pre-trained BERT-style embeddings
fine-tuned on our datasets. RoBERTa-base embeddings were utilized for the first two
experiments. For the fully augmented dataset, we used XLM-RoBERTa-base. There are 123
million parameters in the RoBERTa-base model, having 12 encoder and 12 decoder layers with
12 attention heads. XLM-RoBERTa-base has 270 million parameters with the same encoder,
decoder, and attention head configuration as RoBERTa-base. We froze all the encoder layers so
that the gradients do not get updated for these layers during backpropagation. This reduced
the number of trainable parameters by half and also minimized the risk of overfitting. Most of
the models were trained for 30 epochs. However, some of the models overfitted when trained for
more epochs, hence an early stopping method was adopted. All the models were developed
using the Huggingface deep learning library. We used a single NVIDIA GeForce GTX 1080 Ti
GPU with 11GB of GDDR5X memory and 6 CPU cores for the experiments.
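One way to realize this frozen-encoder setup with the Huggingface library is sketched below; the warm-starting call is an assumption about the exact model construction.

```python
from transformers import EncoderDecoderModel

model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "roberta-base", "roberta-base")   # warm-start both sides from RoBERTa-base

for param in model.encoder.parameters():
    param.requires_grad = False       # gradients are not updated for the encoder

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable / 1e6:.0f}M")
```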
Preprocessing
Earlier chapters shed light on the conversion of numeric quantities into generic symbolic
notations to avoid sparse representations. In a similar vein, we converted the numbers into 2
types of notations. This conversion is only applied to the relevant numbers in the problem text:
• Single letters: p, q, r, ..
• A string with the occurrence count appended: number0, number1, number2, .., numberi
• Original Question: The cafeteria had 51 apples . For lunch they handed out 41 to students
and decided to use the rest to make pies . If each pie takes 5 apples , how many pies could
they make ?
• Converted Text With Single Letters: The cafeteria had p apples . For lunch they handed
out q to students and decided to use the rest to make pies . If each pie takes r apples ,
how many pies could they make ?
1. Infix: ( ( p - q ) / r )
2. Postfix: p q - r /
3. Prefix: / - p q r
• Converted Text With String and Occurrence Count: The cafeteria had number0 apples .
For lunch they handed out number1 to students and decided to use the rest to make pies
. If each pie takes number2 apples , how many pies could they make ?
From Table 7.8, it is evident that the conversion of numbers using ‘numberi’ (e.g. number1,
number2, number3, etc.) is superior to single-letter representations (e.g. p, q, r). This may
be because single letters occur as highly frequent subwords within words during unsupervised
pre-training. Hence, we used this representation in our subsequent experiments. After analyzing
the predicted outputs, we observe that there is a marked difference in equation accuracy
between the different notations when the solver is trained on a small dataset. In this setting,
the generation of equations in infix notation becomes challenging, as there is very little data
from which to learn the balancing of parentheses; 30% of the errors in infix equations are due
to this imbalanced parenthesis generation. With the increase in dataset size, as is seen in the
augmented datasets, this issue gets resolved, and any significant difference between the equation
notations ceases to exist.
The results on the augmented datasets using different equation notations are detailed in Table
7.9.
Config                        Notation  Equation_Accuracy
Lexical Aug+Orig              Infix     50.3
                              Postfix   50.7
                              Prefix    50.6
Lexical Aug+Translation+Orig  Infix     50.0
                              Postfix   50.3
                              Prefix    47.7
Lexical Aug+Translation+Orig  Infix     56.7
(1 and 2 operations)          Postfix   58.4
                              Prefix    56.0
Table 7.10: Equation Accuracies with Equivalence in the Full Augmentation Dataset
Discussion
We observe from Table 7.9 that augmentation improves the performance of the solver, but
only marginally. With the fully augmented dataset, we see a drop in the equation accuracy.
This is due to the fact that the dataset with full translations contains 3- and 4-operation
problems, whereas the ASDiv and lexically augmented datasets contain only 1 and 2 operations.
The drop in the performance of the solver is caused by its inability to generate equations with
a high number of operations, and the full translation dataset has few samples for these word
problems. When we report the equation accuracies only for word problems with 1 and 2
operations, there is a marked improvement of 5%, and the best performance is achieved with
full augmentation. As a next step, we experimented with a sample from the full translations to
study what quantity of data can be augmented to achieve optimal performance. In all these
experiments, we report only equation accuracies, which rely on exact matching between two
strings. We observe that equation-equivalence-based evaluation judges the model's performance
more accurately, as shown in Table 7.10; otherwise, the performance is underreported by the
exact equation accuracy. As the number of operations increases, this kind of evaluation becomes
even more important.
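The thesis does not pin down the exact equivalence procedure, but a minimal sketch using sympy illustrates the idea: two predicted expressions count as equivalent if their difference simplifies to zero.

```python
import sympy

def equations_equivalent(expr_a, expr_b):
    """e.g. 'p + q' and 'q + p' match even though the strings differ."""
    a, b = sympy.sympify(expr_a), sympy.sympify(expr_b)
    return sympy.simplify(a - b) == 0

print(equations_equivalent("p + q", "q + p"))   # True
print(equations_equivalent("p - q", "q - p"))   # False
```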
We randomly sampled 1600 translations from the full set of translations, keeping the same
distribution of operations as given in Table 7.6. We kept the size the same for both augmented
datasets to study their individual effects. The distribution of the new dataset is shown in Table
7.11.
No_of_Operation #Word_Problems
1 3001
2 1522
Total 4523
Table 7.11: Distribution of Sampled Augmented Dataset
Experimental Setup
We conducted the same experiments with the same parameters and model in an identical
experimental setup.
Results
The results of the experiments with the sample of Ape210K and the other augmentations are
shown in Table 7.12. We do not report accuracies for one and two operations separately in this
table, as the sample we chose from the translations included only one- and two-operation
problems.
Config                        Notation  Equation_Accuracy
Lexical Aug+Orig              Infix     49.12
                              Postfix   49.88
                              Prefix    47.6
Translation+Orig              Infix     45.5
                              Postfix   48.29
                              Prefix    49.96
Lexical Aug+Translation+Orig  Infix     55.61
                              Postfix   53.97
                              Prefix    54.86
Table 7.12: 5-Fold Results with Sampled Augmented Datasets with Exact Equation Accuracy
Discussion
From Tables 7.9 and 7.12, we can observe that there is no significant difference between
them. Although the sampled dataset is 3 times smaller than the full one, the performance is
very similar; an increase in data size does not always ensure an increase in scores. We can also
see that when the original dataset is augmented with the translations alone, the equation
accuracy does not improve significantly; it is even lower than the results with the lexical
augmentation. We will discuss the role of diversity and relate it to performance in the next
section.
The text in each word problem is normalized, where the names of persons and quantities are
replaced by the meta symbols “PERSON” and “NUMBER”. The example given below
demonstrates the normalization process.
• Original Text: Ellen has 6 more balls than Marin. Marin has 9 balls. How many balls
does Ellen have?
• Normalized Text: PERSON has NUMBER more balls than PERSON. PERSON has
NUMBER balls. How many balls does PERSON have?
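A minimal sketch of this normalization, assuming spaCy NER for person names and a regex for quantities, is given below.

```python
import re
import spacy

nlp = spacy.load("en_core_web_sm")

def normalize(text):
    doc = nlp(text)
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            text = text.replace(ent.text, "PERSON")   # replace person names
    return re.sub(r"\d+", "NUMBER", text)             # replace quantities

print(normalize("Ellen has 6 more balls than Marin. Marin has 9 balls."))
```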
Dataset CLD
ASDiv 0.71
ASDiv + Lexical Aug 0.32
ASDiv + Translation 0.63
ASDiv + Lexical Aug + Sampled Translation 0.41
ASDiv + Lexical Aug + Full Translation 0.44
Table 7.13: CLD Scores of Different Datasets
As diversity increases in a training dataset, the equation accuracy of a solver trained on
that dataset decreases. The MAWPS dataset has a low CLD score of 0.254; hence, the equation
accuracy of any solver trained on it is very high, with a state-of-the-art accuracy of 88.7
[80]. On the contrary, ASDiv is richly diverse, with a CLD score of 0.71. This impacts the
equation accuracy, as shown in Table 7.8; the maximum accuracy obtained on this dataset is
49.96. We can also observe that data augmentation positively impacts the performance of a
solver. Lexical augmentation using the paraphrase table and additional constraints generates
less diverse word problems, as shown in Table 7.13, and records a huge drop in CLD score. The
augmentation with translations adds different types of problems to the original dataset and does
not result in a heavy drop in CLD. Each of the augmentations improves the solver. Training on
a dataset that combines the problems generated by both augmentation methods results in the
best model for word problem solving.
7.5 Conclusion
In this chapter, we discussed the need for augmentation techniques and proposed two
techniques to augment any word problem dataset. We also experimentally showed that the
augmentation methods help in the prediction of equations on an unseen dataset. We showed
empirically that there is hardly any difference in the generation of equations in different
notations. We created a dataset by combining the outputs from the two methods to create an
augmented dataset with a significantly higher number of word problems and varied equation
templates, which improved the overall performance. Apart from lexical augmentation,
translation can be used as an effective method of data augmentation in low-resource settings.
We will discuss this in detail in the next chapter.
Chapter 8
Although the field of word problem solving (WPS) has gained traction recently, all the
attention is centered only around English. Other than English, there has been a surge in
developing solvers for Chinese word problems [18, 36]. Efforts in this direction are far less for
other languages.
There have been very few attempts [33, 81] in ILs for arithmetic word problem solving. As
Indian languages are linguistically divergent from English, how existing technologies handle
word problems in ILs is another motivation for this work. We hope that our efforts towards
developing automatic solvers will spur research in this area.
The development of recent deep neural models [27], [47] has created new benchmarks in machine translation, eclipsing the previous state-of-the-art (SOTA) performance for many language pairs. However, these models require a significant amount of parallel data to train, and many Indian languages (ILs) are resource-poor in this regard. In the last decade, multiple initiatives [82], [83], [84] were undertaken to change this landscape of parallel corpora for ILs. Many SOTA NMT systems [85], [84], [86] have been developed across English-IL pairs and different IL pairs. We leverage one such system [87] to translate word problems from English into Hindi. This NMT system achieves a BLEU score [79] of 36.7 on the general domain. We then apply the already defined WPS approaches to the translated Hindi problems and evaluate the performance. Initial experiments were done by translating the English Common Core dataset into Hindi.
8.2.1 Issues of Translations
1. English: George was working as a sacker at a grocery store where he made 5 dollars an
hour. On Monday he worked 7 hours and on Tuesday he worked 2 hours. How much
money did George make in those two days?
Hindi: जॉज एक िकराने क दक ु ान म काम करता था जहाँ उसने पाँच डॉलर त घंटा कमाए । सोमवार को
उ ह ने 7 घंटे काम िकया और मंगलवार को उ ह ने 2 घंटे काम िकया । उन दो िदन म जॉज ने िकतना पैसा
कमाया ?
Gloss in Roman: jorj ek kiraane kee dukaan mein kaam karata tha jahaan usane paanch
dolar prati ghanta kamae . somavaar ko unhonne 7 ghante kaam kiya aur mangalavaar
ko unhonne 2 ghante kaam kiya . un do dinon mein jorj ne kitana paisa kamaaya ?
2. English: Sarah baked 38 cupcakes for her school’s bake sale. If her brother, Todd, ate 14
of them how many packages could she make if she put 8 cupcake in each package?
Hindi: सारा ने अपने कूल क बेक सेल के लए 38 केक बेक िकए । यिद उसका भाई , टॉड , उनम से १४
को खा ले , तो वह िकतने पैकेट बना सकती थी यिद वह हर पैकेट म ८ कप केक डाले ?
Gloss in Roman: saara ne apane skoolon kee bek sel ke lie 38 kek bek kie . yadi usaka
bhaee , tod , unamen se 14 ko kha le , to vah kitane paiket bana sakatee thee yadi vah har
paiket mein 8 kap kek daale ?
3. English: For Halloween Emily received 54 pieces of candy. She ate 33 pieces then placed
the rest into piles with 7 in each pile. How many piles could she make?
Hindi: हेलोवीन के लए एिमली को कडी के 54 टु कड़े िमले । उसने 33 टु कड़े खाए और िफर येक ढेर म 7
के साथ ढेर म रखा । वह िकतने ढेर बना सकती थी ?
Gloss in Roman: heloveen ke lie emilee ko kaindee ke 54 tukade mile . usane 33 tukade
khae aur phir pratyek dher mein 7 ke saath dher mein rakha . vah kitane dher bana
sakatee thee ?
The first two examples show inconsistency in the translation of numbers. In the first example, the first quantity was translated into its word equivalent while the numeric values of the other quantities remained intact. In the second example, the numeric value of the first quantity was transferred correctly, but the values of the other quantities underwent a change of script when translated into Hindi. To mitigate this inconsistency, we designed a post-processing tool (presented in chapter 9) which converts any quantity in word format or native script into its corresponding numeric value in ASCII.
The third translation is jarring due to the ambiguity arising from the sense of the word ‘pile’. In the last example, the underlined phrase was not translated into the target. In order to solve this issue, we need to train the MT systems with sentences that contain words with multiple senses.
For removing noisy translations, we incorporated various filtering techniques. These steps include checking correct named entity transfer, checking correct number/constant transfer, and sampled validation by humans.
We implemented two model families for solving the translated problems:
• BiLSTM with global attention
• Transformers [27]
All these models were implemented using the OpenNMT toolkit [66].
8.3.1 Preprocessing
For preprocessing, we used an in-house tokenizer 1 for splitting a word problem into sentences and each sentence into words. We used two kinds of pre-trained embeddings for Hindi WPS:
1. word [88]
2. subword [67]
After the first round of word tokenization, we used the ‘bpemb’ Python library 2 for subword tokenization [67]. The subword tokens are created using byte pair encoding on Hindi wiki articles. For Hindi subword tokenization, the vocabulary size is 200,000.
Preprocessing of Numbers
As seen in the earlier chapters, the numbers in word problems are substituted by special number tokens for better representation learning. In [89] and [90], we substituted any number appearing in the problem text with one special token from the set {num1, num2, num3}. However, a token such as num1 is split into the subwords num and 0, where 0 represents any single-digit number. In order to stop this splitting during subword tokenization, the numbers are instead substituted with single-letter characters like p, q, and r, and the mapping of this replacement is preserved.
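A minimal sketch of this substitution, keeping the number-to-letter mapping for later recovery; the letter order and the regular expression are illustrative choices (the examples in this chapter do not fix a particular assignment order):

import re

LETTERS = ["p", "q", "r", "s"]  # single-letter placeholders; order is an assumption

def mask_numbers(problem: str):
    """Replace each number with a single-letter token that subword
    tokenizers will not split further; keep the mapping for recovery."""
    mapping = {}
    def repl(match):
        letter = LETTERS[len(mapping)]
        mapping[letter] = match.group(0)
        return letter
    masked = re.sub(r"\d+(?:\.\d+)?", repl, problem)
    return masked, mapping

masked, mapping = mask_numbers("Dave had 29 short sleeve shirts and 11 long sleeve shirts.")
print(masked)   # Dave had p short sleeve shirts and q long sleeve shirts.
print(mapping)  # {'p': '29', 'q': '11'}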
Type of Split #Word_Problems
Train 400
Validation 100
Test 100
Total 600
Table 8.1: Corpus Details of Hindi Word Problems
Equations can be represented in 3 different notations. Different notations for any binary
operation consisting of two operands a, b, and one operator ‘op’ are shown below.
1. Infix - a op b
2. Prefix - op a b
3. Postfix - a b op
Griffith et al. [24, 25] showed that, when generating equations for English word problems, learning prefix and postfix notations is easier than learning the infix notation. We designed settings for all three notation types to verify this claim. One example is shown below, followed by a small conversion sketch.
• Original Question: डेव को कूल से पहले 29 छोटी लीव शट और 11 लंबी लीव शट धोनी थी । कूल
शु होने तक अगर वह उनम से 35 को ही धो देता तो िकतने नह धोते ?
Gloss in Roman: dev ko skool se pahale 29 chhotee sleev shart aur 11 lambee sleev shart
dhonee thee . skool shuroo hone tak agar vah unamen se 35 ko hee dho deta to kitane
nahin dhote ?
• Question After Special Number Token Replacement: डेव को कूल से पहले q छोटी लीव शट
और p लंबी लीव शट धोनी थी । कूल शु होने तक अगर वह उनम से r को ही धो देता तो िकतने नह धोते ?
Gloss in Roman: dev ko skool se pahale q chhotee sleev shart aur p lambee sleev shart
dhonee thee . skool shuroo hone tak agar vah unamen se r ko hee dho deta to kitane nahin
dhote ?
• Infix Equation: X = ( q + p ) − r
• Prefix Equation: X = − + q p r
• Postfix Equation: X = q p + r −
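As an illustration of how the notations relate, a small sketch that converts a tokenized infix equation into postfix using the shunting-yard algorithm (the operator set and whitespace tokenization are simplified assumptions):

PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}

def infix_to_postfix(tokens):
    """Shunting-yard conversion of an infix token list to postfix order."""
    output, stack = [], []
    for tok in tokens:
        if tok in PRECEDENCE:
            # Pop operators of higher or equal precedence before pushing.
            while stack and stack[-1] != "(" and PRECEDENCE[stack[-1]] >= PRECEDENCE[tok]:
                output.append(stack.pop())
            stack.append(tok)
        elif tok == "(":
            stack.append(tok)
        elif tok == ")":
            while stack[-1] != "(":
                output.append(stack.pop())
            stack.pop()  # discard the matching "("
        else:
            output.append(tok)  # operand
    return output + stack[::-1]

print(infix_to_postfix("( q + p ) - r".split()))  # ['q', 'p', '+', 'r', '-']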
parameter value
Word Embedding Size 300
Subword Embedding Size 300
Encoder Layers 2
Decoder Layers 2
Input Sequence Length 200
Output Sequence Length 200
Dropout Rate 0.3
Batch Size 64
Optimizer Adam
Table 8.2: Configuration of BiLSTM model with Global Attention for Hindi
parameter value
Subword Embedding Size 300
Encoder Layers 4
Decoder Layers 4
Heads 4
Input Sequence Length 200
Output Sequence Length 200
Nodes in Feed forward layer 2048
Dropout Rate 0.1
Attention Dropout Rate 0.1
Batch Size 512
Optimizer Adam
Table 8.3: Configuration of Transformer Model for Hindi
We used the same configurations as were used for the English experiments. We implemented two models, which are detailed in Tables 8.2 and 8.3. The word and subword vectors are of 300 dimensions.
8.4 Evaluation
Initial experiments were done using both word and subword embeddings with the BiLSTM models. As the results with subword tokenization shown in Table 8.5 were superior, further experiments with different equation notations used subword embeddings only; these are shown in Table 8.4. The validation steps were set to 100.
Model Notation Train Steps Equation Accuracy
BiLSTM Infix 1500 73
2000 77
2500 78
Postfix 1500 69
2000 69
2500 69
Prefix 1500 76
2000 77
2500 79
Transformer Infix 100 46
400 100
500 32
1500 0
2000 24
2500 23
Postfix 200 99
400 100
1500 99
2000 100
2500 100
Prefix 400 99
1000 99
1500 99
2000 99
2500 99
Table 8.4: Hindi Word Problem Solver Accuracies for Different Equation Notations
8.5 Observations
We observed that the transformer model comprehensively outperforms the BiLSTM model, as shown in Table 8.4. The BiLSTM-based solver has similar accuracies for the infix and prefix notations, with the postfix one performing the worst. The transformer model reaches its optimal accuracy quickly, in a smaller number of training steps, and with an increase in training steps the performance does not degrade for the prefix and postfix notations. In the case of the infix notation, however, the transformer model overfits very early: with more training steps, the model starts making errors with respect to parenthesis balancing. This phenomenon was also reported by Griffith et al. [24, 25]. As the prefix and postfix notations are devoid of parentheses, such errors do not arise there. The training data contains word problems with only relevant quantities (3 numbers in each problem), so any question containing an irrelevant quantity cannot be solved by the solver, as it has only learnt representations in the presence of exactly 3 quantities. We also observed the dependence of solvers on shallow heuristics, as shown by Patel et al. [28], where the solvers generated the equation without looking at the question text. These results were also overestimated because of the high overlap between the training and the evaluation data.
Model Emb Type Train Steps Validation Steps Equation Accuracy
BiLSTM word 1000 100 54
BiLSTM subword 1000 100 63
Table 8.5: Hindi Word Problem Solver Accuracies with word and subword embeddings
Similar to the augmentation approach followed for English, we augment an existing Hindi word problem dataset [33] by substituting content words; only nouns and verbs are replaced to generate new word problems. We utilized an in-house tokenizer 3 for tokenizing the sentences and words in the Hindi word problems, and for POS tagging and other shallow parsing tasks we developed a shallow parser [91]. For any noun or verb in a sentence, we first find the three most similar words to it. These words are obtained using a Hindi word2vec model [73]; we used pre-trained word2vec embeddings [92] trained in-house on Hindi news articles consisting of 5 million sentences. We also put an additional constraint that the similarity between the original word and its similar word should be greater than or equal to a threshold, set to 0.6. Any word that does not fulfill this criterion is discarded.
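A minimal sketch of this lookup, assuming a gensim KeyedVectors model saved from the in-house Hindi word2vec training (the file name and the use of gensim are assumptions):

from gensim.models import KeyedVectors

model = KeyedVectors.load("hindi_word2vec.kv")  # placeholder path
THRESHOLD = 0.6

def candidate_substitutes(word, topn=3):
    """Return up to `topn` distributionally similar words whose cosine
    similarity with `word` meets the threshold."""
    if word not in model.key_to_index:
        return []
    return [(w, sim) for w, sim in model.most_similar(word, topn=topn)
            if sim >= THRESHOLD]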
In order to maximize lexical diversity, we imposed a constraint that different morphological forms of the same root be filtered out: if a word and one of its similar words share a root, the similar word is deemed invalid for substitution. This is a useful feature, as Indian languages are morphologically rich and a single root can have many morphological forms.
3 https://github.com/Pruthwik/Tokenizer_for_Indian_Languages
Another constraint matches the gender of the candidate word with that of the original. This restriction restrains the generation of ungrammatical sentences due to gender mismatch. All these linguistic properties are extracted using our in-house shallow parser [91].
Before replacing a word with a distributionally similar word, we also verify its compatibility with its neighborhood in the sentence. We utilize a pre-trained language model for this, namely a KenLM language model for Hindi 4 with a context window of size 5. We use the same threshold as defined in Section 7.2.2.2 for checking the validity of the 5-gram after substituting a word with its similar word.
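A minimal sketch of the 5-gram check, assuming the kenlm Python bindings and a Hindi model file; the file name and the form of the acceptance threshold are placeholders:

import kenlm

lm = kenlm.Model("hindi_5gram.bin")  # placeholder model path

def window_is_plausible(tokens, idx, substitute, threshold):
    """Score the 5-token window centred on position `idx` after substituting
    the candidate word; accept the substitution if the score clears the threshold."""
    new_tokens = tokens[:idx] + [substitute] + tokens[idx + 1:]
    left, right = max(0, idx - 2), min(len(new_tokens), idx + 3)
    window = " ".join(new_tokens[left:right])
    return lm.score(window, bos=False, eos=False) >= threshold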
Two gazetteer lists were also prepared to replace different types of named entities, such as male and female person names, in Hindi. As person names share similar contexts irrespective of gender, the distributionally similar words of a male name can include female names; these lists were developed to avoid such swaps.
Instead of generating word problems with single-word substitutions, we fixed a minimum number of substitutions for a generated problem to count as valid; this threshold is set to 3 for Hindi. To control the generation, at most 5 word problems are generated per original problem after satisfying all these constraints.
The suggested augmentation method generates many valid word problems. For each ques-
tion, we present two generated examples here.
4 https://github.com/Open-Speech-EkStep/vakyansh-models
Valid Generation
• Original: शखा ने 1 दुकान से 65 रु. का सामान खरीदा । उसने दुकानदार को 100 रु. का नोट िदया । बताओ , उसे िकतने पये वािपस िमले ?
Gloss in Roman: shikha ne 1 dukaan se 65 ru . ka saamaan khareeda . usane dukaanadaar
ko 100 ru . ka not diya . batao , use kitane rupaye vaapis mile ?
Generated 1: सलमा ने 1 दूकान से 65 रुपए . का बैग ख़रीदा । उसने यापारी को 100 रुपए . का नोट िदया । बताओ , उसे िकतने पये वािपस िमले ?
Gloss in Roman: salama ne 1 dookaan se 65 roope . ka baig khareeda . usane vyaapaaree
ko 100 roope . ka not diya . batao , use kitane rupaye vaapis mile ?
Generated 2: वडी ने 1 दूकान से 65 रुपए . का माल मंगाया । उसने यापारी को 100 रुपए . का नोट िदया । बताओ , उसे िकतने पये वािपस िमले ?
Gloss in Roman: vendee ne 1 dookaan se 65 roope . ka maal mangaaya . usane vyaapaaree
ko 100 roope . ka not diya . batao , use kitane rupaye vaapis mile ?
• Original: टॉम के पास पहले से ही 2 गेम थे लेिकन उसने ₹13.6 का बैटमैन गेम और ₹5.06 का सुपरमैन गेम
खरीदा । सुपरमैन गेम क तुलना म बैटमैन गेम िकतना महंगा था ?
Gloss in Roman: tom ke paas pahale se hee 2 gem the lekin usane ₹13.6 ka baitamain gem
aur ₹5.06 ka suparamain gem khareeda . suparamain gem kee tulana mein baitamain gem
kitana mahanga tha ?
Generated 1: ओंकार के पास पहले से ही 2 गेम थे लेिकन उसने ₹13.6 का पैरो गेम और ₹5.06 का सुपरमैन
गेम ख़रीदा । सुपरमैन गेम क तुलना म पैरो गेम िकतना महंगा था ?
Gloss in Roman: onkaar ke paas pahale se hee 2 gem the lekin usane ₹13.6 ka spairo gem aur
₹5.06 ka suparamain gem khareeda . suparamain gem kee tulana mein spairo gem kitana
mahanga tha ?
Generated 2: क लन के पास पहले से ही 2 गे स थे लेिकन उसने ₹13.6 का पाइडरमैन गे स और ₹5.06
का सुपरमैन गे स मंगाया । सुपरमैन गे स क तुलना म पाइडरमैन गे स िकतना महंगा था ?
Gloss in Roman: phrainkalin ke paas pahale se hee 2 gems the lekin usane ₹13.6 ka spaidara-
main gems aur ₹5.06 ka suparamain gems mangaaya . suparamain gems kee tulana mein
spaidaramain gems kitana mahanga tha ?
Noisy Generations
Even in the presence of constraints, some substitutions using word embeddings are jarring
and unnatural.
• Original: माथा के पास 76 काड थे । उसने एिमली को 3 िदए । माथा के पास िकतने काड बचे ह ?
Gloss in Roman: maartha ke paas 76 kaard the . usane emilee ko 3 die . maartha ke paas
kitane kaard bache hain ?
Generated: कडा के पास 76 काड थे । उसने लीला को 3 िकए । कडा के पास िकतने काड बचे ह ?
Gloss in Roman: kendra ke paas 76 kaard the . usane leela ko 3 kie . kendra ke paas kitane
kaard bache hain ?
• Original: 1 यि के पास 63 भेड़ है । उसने अपने 3 ब म उनका बँटवारा िकया । पहले बेटे को 22 भेड़ िमल और दूसरे बेटे को 21 भेड़ िमल , तो तीसरे बेटे को िकतनी भेड़ िमल ?
Gloss in Roman: 1 vyakti ke paas 63 bheden hai . usane apane 3 bachchon mein unaka
bantavaara kiya . pahale bete ko 22 bheden mileen aur doosare bete ko 21 bheden mileen
, to teesare bete ko kitanee bheden mileen ?
Generated: 1 यि के पास 63 भेड़ है । उसने अपने 4 देखभाल म उनका बटवारा िकया । पहले बेटे को 22 भेड़ िमल और दूसरे बेटे को 21 भेड़ िमल , तो तीसरे बेटे को िकतनी भेड़ िमल ?
Gloss in Roman: 1 vyakti ke paas 63 bheden hai . usane apane 4 dekhabhaal mein unaka
batavaara kiya . pahale bete ko 22 bheden mileen aur doosare bete ko 21 bheden mileen ,
to teesare bete ko kitanee bheden mileen ?
The underlined words are jarring substitutions produced by the word embeddings. For English, such jarring substitutions were avoided using synonym similarity from WordNet; as IndoWordNet [93] does not offer this facility, we could not include this constraint in our generation.
heads. We froze the encoder layers so that their gradients are not updated during backpropagation; this reduced the number of trainable parameters by half and also minimized the risk of overfitting. The underlying model is pre-trained on 100 languages, which include the three languages of this study: English, Hindi, and Telugu. Most of the models were trained for 50 epochs, and some achieved better performance in fewer epochs. We used one NVIDIA GeForce GTX 1080 Ti GPU with 11 GB of memory and 6 CPUs for the experiments.
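A minimal sketch of the layer freezing, assuming the Hugging Face transformers library and the XLM-RoBERTa base encoder used later in this chapter; the model name and the choice to freeze every encoder layer are illustrative:

from transformers import AutoModel

encoder = AutoModel.from_pretrained("xlm-roberta-base")

# Freeze the embeddings and all encoder layers: their gradients will not
# be updated during backpropagation.
for param in encoder.embeddings.parameters():
    param.requires_grad = False
for layer in encoder.encoder.layer:
    for param in layer.parameters():
        param.requires_grad = False

trainable = sum(p.numel() for p in encoder.parameters() if p.requires_grad)
total = sum(p.numel() for p in encoder.parameters())
print(f"trainable parameters: {trainable} / {total}")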
Preprocessing
The relevant numbers in a word problem were replaced by meta tokens such as number0, number1, number2, .., numberi. One example depicts this process.
• Text Replaced by Numeric Meta Symbols: पस बा केटबॉल टीम म number0 खलाड़ी ह। येक
खलाड़ी के पास number1 बा केटबॉल ह। उन सब के पास कुल िकतनी बा केटबॉल ह?
• English Gloss: Spurs basketball team has number0 players. Each player has number1 bas-
ketballs. How many basketballs are there with them?
• Equation Notations:
Dataset
The dataset distribution is shown in table 8.6. The original dataset consists of 2336 word problems, where the number of single-operation word problems is thrice the number of problems that require two operations. To balance this unevenness, we undersampled the single-operation problems to reduce the ratio to 2:1 (a small sketch of this step follows). In the non-augmented dataset, the ratio of single-operation to double-operation word problems is 2:1, whereas it reduces to 1.57:1 in the augmented dataset.
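A minimal sketch of the undersampling step; the record format and the random seed are illustrative assumptions:

import random
random.seed(42)

def undersample(problems, target_ratio=2.0):
    """Downsample single-operation problems so that the ratio of
    1-op to 2-op problems is at most `target_ratio`."""
    one_op = [p for p in problems if p["n_ops"] == 1]
    two_op = [p for p in problems if p["n_ops"] == 2]
    keep = min(len(one_op), int(target_ratio * len(two_op)))
    return random.sample(one_op, keep) + two_op

data = ([{"id": i, "n_ops": 1} for i in range(6)]
        + [{"id": 6, "n_ops": 2}, {"id": 7, "n_ops": 2}])
balanced = undersample(data)
print(sum(p["n_ops"] == 1 for p in balanced), ":",
      sum(p["n_ops"] == 2 for p in balanced))  # prints 4 : 2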
Config No_of_Operations #Word_Problems
No Augmentation 1 860
No Augmentation 2 430
No Augmentation Total 1290
With Augmentation 1 3318
With Augmentation 2 2108
With Augmentation Total 5426
Table 8.6: Dataset Distribution of Hindi Word Problems by Number of Operations
A similar trend is also seen in Hindi: as CLD decreases, equation accuracy goes up. With the increase in dataset size, there is little or no difference in the generation of different kinds of equation notations.
Most of the current solvers replace the numeric quantities appearing in a word problem with generic meta symbols to learn better representations for the numbers. This indeed improves the performance of a solver when generating equations in the decoding phase of an end-to-end neural system, and it is a very efficient pre-processing step. However, many word problems require the numeric values of the operands to derive the solution. Let us look at an example.
• Original Question: There are two kinds of tickets for a circus. One is priced at 48 rupees
and the other is priced at 80 rupees. There are 180 tickets for 48 rupees and 120 tickets
for 80 rupees. A total of 260 tickets were sold. What is the maximum possible revenue
that can be earned?
• Question With Meta Symbol Replacement: There are two kinds of tickets for a circus.
One is priced at number0 rupees and the other is priced at number1 rupees. There are
number2 tickets for number0 rupees and number3 tickets for number1 rupees. A total of
number4 tickets were sold. What is the maximum possible revenue that can be earned?
In this example, the information that number1 > number0 and number4 > number3 is crucial to solving the problem, but no explicit clues about these facts are present in the question. This makes such problems challenging.
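For reference, the ordering information is exactly what the intended solution uses: sell all 120 of the 80-rupee tickets first and the remaining 260 − 120 = 140 tickets at 48 rupees. Our own worked computation (not part of the dataset) is:
x = 120 × 80 + (260 − 120) × 48 = 9600 + 6720 = 16320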
in English, Hindi, and Telugu. This is the first attempt to create a multilingual benchmark parallel corpus in this domain that includes Indian languages.
The creation of quality benchmark datasets with diverse as well as adequate samples is crucial for testing the robustness of developed models. We studied word problem datasets from different resources such as school books, educational websites, blogs, and forums available for various languages. Taking inspiration from those, we manually developed benchmark datasets for English and a few Indian languages, namely Hindi and Telugu. These benchmark datasets contain word problems that adhere to the properties laid out by Patel et al. [28] and Miao et al. [40]. In this work, we release a manually created 3-way parallel word problem dataset in English, Hindi, and Telugu. 4.25% (2000 problems) of the English translations did not have equations; they directly contained the answer without the intermediate operations, and we discarded those. As described in chapter 7, this dataset is composed of both explicit and implicit word problems; for this study, we restrict ourselves to explicit problems.
The corpus is created following a two-step process. First, we translated the validation set of the Ape210K dataset, consisting of 5000 word problems, into English using the Google Translate API. We picked 1127 word problems and manually transcreated them after applying the noise removal techniques explained in chapter 7. If the equations were erroneous, they were rectified; if a word problem was unintelligible, we formed a new word problem taking the equation as a basis. In the second stage, the verified English word problems were translated into Hindi and Telugu manually. For each language, two experts were selected. All the experts had completed post-graduate education and had domain knowledge in mathematics. The Hindi and Telugu translators have more than five years of experience in translation and have previously completed translation assignments in diverse domains. For all these tasks, we used an in-house AI-enabled post-editing tool [95]; a snapshot of the tool is shown in figures 8.1 and 8.2.
The experts were given guidelines to create English word problems that are grammatically correct and natural, with solvable equations. The guidelines below were followed, and different examples were provided for a better understanding of the task.
Figure 8.1: English Word Problem Editing
Correction of Determiners
Many translations used the determiners a and an incorrectly, and the article the was heavily overused.
• Original: The fruit store brought 125 boxes of apples, 20 kilograms each, and sold 1,600
kilograms. How many kilograms are left?
• Corrected: A fruit store brought 125 boxes of apples, 20 kilograms each, and sold 1,600
kilograms. How many kilograms are left?
Simplify Long Sentences
If there are very long sentences, simplify the text by splitting them into smaller sentences.
• Original: The grain store warehouse has 305 bags of flour, and the number of bags of rice
is 12 times the number of bags of flour, how many bags of rice are there?.
• Simplified: The grain store warehouse has 305 bags of flour. The number of bags of rice
is 12 times the number of bags of flour. How many bags of rice are there?
Naturalness
If the text does not sound natural in English, add phrases or words to make the text more
natural and better fit in the narrative. Similar guidelines were also followed in [33]. When
fractions are used in word problems, the English translations go awry with many instances of
missing words. In this example, the equation also acts as a support for the choice of words or
a chunk of words.
• Original: There are 32 poplar trees in the orchard, less than pear trees (1/9), how many
pear trees are there?
• Equation: 32/(1-(1/9))
• Corrected: There are 32 poplar trees in an orchard. The number of poplar trees is (1/9)
times less than that of the number of pear trees. How many pear trees are there?
Localisation
As the word problems in the benchmark corpora are developed in the Indian context, localisation in terms of currencies, names of persons, locations, organizations, and other named entities is incorporated. To make the word problems more realistic, the equations were also modified to reflect the changed numeric values.
• Original: In the donation activity for the disaster area , Class 61 donated 245 yuan , and
Class 62 donated 45 yuan less than twice that of Class 1 . How much did the two classes
donate in total ?
• Localised: In a donation activity for the disaster areas in Assam , a school collected some
money from students and teachers . Class 6 donated 2450 rupees and class 5 donated 500
rupees less than twice that of class 6 . How much did the two classes donate in total ?
We also followed a peer review process between the experts to ensure consistency and quality in the created word problems and equations.
After the English word problems were finalized, we developed the corresponding translations in Hindi and Telugu. As in the previous sections, this manual task was carried out using the same post-editing tool [95]. The translators were directed to focus on both adequacy and fluency while translating into the target languages, following the guidelines below.
• Adequate: The translated word problems should convey the same meaning as the English
equivalent.
• Fluent and Natural: The translation should be natural and fluent in the target language. The target audience for the word problem dataset is students in the respective vernaculars; they should be able to understand and solve the problems.
• Transliteration: As we are dealing with arithmetic and geometry word problems, many English mathematical terms need to be transliterated into the target language. We took this decision based on the arithmetic books available in Hindi and Telugu. Words such as cylinder, rhombus, cone, etc. are transliterated; if a term has an established equivalent in the target language, the equivalent takes precedence over the transliteration.
A similar peer review process between the experts was followed in translation.
8.11.1 Dataset
We combine the lexically augmented data and the translated data with the publicly available HAWP [33]. As there are no datasets available for Telugu, we use only the Telugu translations of the English explicit word problems (section 7.2).
Preprocessing
As we use the direct outputs of MT systems, we added noise removal steps for the English-Hindi and English-Telugu translations. After translation, we match the numeric quantities in the source and the target; if they do not tally, we remove the translation. A sketch of this filter follows.
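A minimal sketch of the quantity-matching filter, assuming quantities appear as ASCII digits on both sides after number normalization:

import re

def extract_numbers(text: str):
    """Collect the numeric tokens (integers or decimals) in sorted order."""
    return sorted(re.findall(r"\d+(?:\.\d+)?", text))

def keep_translation(source: str, target: str) -> bool:
    """Keep a translated problem only if every quantity in the source
    reappears unchanged in the target."""
    return extract_numbers(source) == extract_numbers(target)

src = "On Monday he worked 7 hours and on Tuesday he worked 2 hours."
tgt = "सोमवार को उसने 7 घंटे काम किया और मंगलवार को उसने 2 घंटे काम किया ।"
print(keep_translation(src, tgt))  # True: both sides contain 2 and 7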
Ape210K contains many problems that require the simple calculation of a formulaic expression. While translating these into Hindi or Telugu, either numbers or operations go missing; without filtering, this missing information poses an additional challenge to the solver. For this task, we perform a two-step translation using MT: the first step translates a Chinese word problem into English, and the next step translates the English problem into Hindi or Telugu. This cascades errors across the steps. Let us look at a few examples.
• Telugu Example
– English Translation: The number A is 500, and the number B is 780 less than 5
times it. What is the number B?
– Gloss in Roman: E sankhya 500, mariyu bi sankhya 5 retlu takkuva. Sankhya b ante emiti?
– Number Missing: 780
• Hindi Example
– English Translation: There are 125 apple trees in an orchard, and the number of
pear trees is 20 less than 4 times that of apple trees. How many trees are there in
this orchard?
– Hindi Translation: एक बाग म 125 सेब के पेड़ ह, और नाशपाती के पेड़ क सं या सेब के पेड़ क तुलना म 4 गुना कम है । इस बाग म िकतने पेड़ ह ?
– Gloss in Roman: ek baag mein 125 seb ke ped hain, aur naashapaatee ke ped kee
sankhya seb ke pedon kee tulana mein 4 guna kam hai.is baag mein kitane ped hain ?
– Number Missing: 20
After this step, we had 9288 Hindi translated word problems and 9311 translated word problems
in Telugu.
Table 8.8: Dataset Details for Full Augmentation in Hindi and Telugu
The dataset details are presented in table 8.8. Similar to the English dataset, 90% of the problems in both languages require 1 or 2 operations to solve. The CLD scores of the benchmark datasets in English, Hindi, and Telugu are 0.706, 0.764, and 0.777 respectively. For calculating the CLD, we used Indic NER [96]. Out of the 1127 problems in the benchmark datasets, only 432 are explicit in nature. We report accuracies on this explicit subset.
We use the multilingual XLM-RoBERTa base model for our experiments. As the English solver built in chapter 7 was based on XLM-RoBERTa, the initial experiments are designed to test the zero-shot capability of this multilingual model, which supports 100 languages including Hindi and Telugu. All results are reported in two settings: Full, which reports accuracy on the whole dataset with one, two, and more than two operations, and 1+2op, which reports accuracies on questions with 1 and 2 operations only. The results are shown in Tables 8.9, 8.11, and 8.10.
Table 8.9: Zero Shot Experiment Results for Hindi and Telugu. Lang stands for language
and Type stands for type of the evaluation data. Full represents the accuracies on datasets
containing 1, 2, and >2 operations while 1+2op represents the accuracies on datasets containing
1 and 2 operations.
The zero-shot learners fare rather poorly on both datasets, although the accuracy of predicting the equation correctly for a single operation is 40% on average; the model's performance degrades gradually as the number of operations increases. As a next step, we fine-tuned the English solver on Hindi and Telugu word problems for 20 epochs with the same parameters and settings as described above. We also fine-tuned on the Hindi and Telugu data directly, but the results were inferior to fine-tuning the English solver.
Lang Type Infix Postfix Prefix
Full 1+2op Full 1+2op Full 1+2op
Table 8.10: Exact Equation Accuracies on English Fine Tuned Model for Hindi and Telugu.
Lang stands for language and Type stands for type of the evaluation data. Full represents the
accuracies on datasets containing 1, 2, and >2 operations while 1+2op represents the accuracies
on datasets containing 1 and 2 operations.
Table 8.11: Equivalent Equation Accuracies on English Fine Tuned Model for Hindi and Telugu.
Lang stands for language and Type stands for type of the evaluation data. Full represents the
accuracies on datasets containing 1, 2, and >2 operations while 1+2op represents the accuracies
on datasets containing 1 and 2 operations.
8.11.3 Discussion
Fine-tuning a pre-trained model leverages transfer learning to achieve optimal performance. Our experiments show that multilingual models can be beneficial when resources are scarce. We also observed that the Hindi solver performs better than the Telugu one, as its training data is augmented with a human-created dataset along with the lexical augmentation; the noise in the translations can hinder learning, which is seen in Telugu. The model does not learn to generate bigger equations with a larger number of operations, which may be attributed to the smaller number of examples for 3- and 4-operation word problems. These results are very similar to the results reported by Patel et al. [28] on the manually created SVAMP dataset. Word problem solvers, although trained on a large set of word problems, fail to solve single-variable word problems and perform poorly on manually crafted datasets with diversity at both the lexical and template levels. We also see a similar significance of equation equivalence for Hindi and Telugu: exact matching penalizes equivalent but differently written equations and under-reports the performance.
For English, out of the 10 samples, ChatGPT was able to solve 8 of them correctly. The
following two questions were answered incorrectly by ChatGPT.
Linguistic Ambiguity
• Question: For the first main dish, they were asked to cook steak. If the third and second
team cooked 240 plates of steak and the first team cooked 75 plates less than what the
second and third team made, how many steaks did they cook altogether?
The model accurately identifies the subtraction operation related to the word less, but it incorrectly assumes that the second and third teams individually make 240 plates.
The second incorrectly answered question asks for the value of a long arithmetic sum. While answering it, ChatGPT groups the numbers to ease the calculation, as written below, but it miscalculates one of the groups, so the final solution is evaluated incorrectly.
• ChatGPT's grouping: (1+2−3)+(1000−1001)+(2009−2007)+9002+(1+2−3)+(1000−1001)+(2009−2007)+9002
• Answer: 18002
• Question: राहुल, लू टीम का क ान मैच से पहले घबरा रहा है। उनक पहली त ं ी लाल टीम ने उ ह हरा
िदया। यिद वे 13 रन से हारे और लाल टीम ने 61 रन बनाए, तो उनक टीम ने िकतने रन बनाए थे?
• Gloss in Roman: raahul, bloo teem ka kaptaan maich se pahale ghabara raha hai. unakee
pahalee pratidvandvee laal teem ne unhen hara diya. yadi ve 13 ran se haare aur laal teem
ne 61 ran banae, to unakee teem ne kitane ran banae the?
• Predicted Equation: x = 61 + 13
• Correct Equation: x = 61 - 13
In this example, the word lost is associated with subtraction, but ChatGPT takes it to evoke an addition operation. It uses a wrong formula to arrive at the answer: Runs scored by the Blue team = Runs scored by the Red team + Difference in runs between the teams.
• Question: आकड म खेलते हुए क ने 33 िटकट ' हेक ए मोल' और 9 िटकट ' के बॉल' खेलकर जीत । यिद
वह टाॅिफ़याँ खरीदने क को शश कर रहा था, जनक क मत 6 िटकट थी, तो वह िकतनी खरीद सकता था?
• Gloss in Roman: aarked mein khelate hue phraink ne 33 tikaten vhek e mol aur 9 tikaten
ske bol khelakar jeeteen. yadi vah taaaifiyaan khareedane kee koshish kar raha tha, jinakee
keemat 6 tikaten thee, to vah kitanee khareed sakata tha?
• Predicted Equation: x = (33 + 9) * 6
• Correct Equation: x = (33 + 9) / 6
This question is a multi-step arithmetic problem. The first operation and its operands are correctly identified. In the second operation, the model misinterprets the unit of the first operation's result as toffees rather than tickets; hence, the second predicted operation is a multiplication instead of a division.
• Question: ఒక జత కర 135 మ 85 న ఒక సం ఖ ం మ పర 250 న ం . వ ఆ ఎంత ం ?
• Gloss in Roman: Millet oka jata snikarlaku 135 yuvanlu mariyu 85 yuvanlanu oka phutbaal kosam kharcu cestundi mariyu sels parsan 250 yuvanlanu cellistundi. Sels vyakti amenu enta tirigi pondali?
It incorrectly identifies that the question is about total money spent and adds up all the
numbers.
Missing Operands
• Question: ట 38 వ స ఆ , ఒ క వ స 25 ట , ప సగ న 60 ల ఆ ల ం . ఈ ఆర తం ఎ లఆ ల ందవ ?
• Gloss in Roman: Totalo 38 varusalu apil cetlu, okkokka varusalo 25 cetlato, prati cettuku sagatuna 60 kilola apil labhistundi. i arcard mottam enni kilola apillanu pondavaccu?
• Predicted Equation: x = 38 * 60
• Correct Equation: x = 25 * 38 * 60
ChatGPT ignores the number of apple trees per row (25 cetlu per varusa) and directly multiplies the number of rows by the per-tree yield, so the predicted equation misses one operand.
8.13 Conclusion
In this chapter, we discussed the development of data and approaches for WPS in ILs. We developed benchmark datasets for English, Hindi, and Telugu. We designed a solver for Telugu by augmenting its training data with translations of English word problems, which provides a baseline for resource-poor languages. We also showed that the availability of a human-created dataset improves the solver more than a noisy augmentation technique such as machine translation does. We reported the frailties of word problem solvers in solving diverse arithmetic word problems. Finally, we tested some examples in different languages with ChatGPT and showed some of its limitations; it performs poorly on the Indian languages.
Chapter 9
The earlier chapters showed how an arithmetic word problem can be solved using frame-based and neural approaches. Arithmetic word problems have a fixed structure and require natural language understanding (NLU) to solve. As WPS involves this NLU step, its techniques can be used in other scenarios, too; we explore two such scenarios.
Recent advances in the field of machine translation have made content in different low-resource languages accessible in other languages. Although the quality of translation outputs across different language pairs has improved significantly with the advent of subword units [97], the translation of numbers is still challenging. If numbers are represented as their word equivalents, identifying and translating them into the target language becomes a daunting task. To tackle this, researchers apply different preprocessing techniques for handling numbers, such as replacing them with special token identifiers like UNK or NUM. In the case of subwords, numbers are represented as 0+ (a regular expression of one or more 0s), where 0 denotes any single-digit number, 00 any two-digit number, and so on. If a sentence is tokenized at the subword level in this way, it is impossible to recover the original numbers appearing in the text. We propose a conversion technique for identifying numbers in word form for English and Indian languages; it can be plugged into any NMT system as a preprocessing task.
The translation of numbers varies with the quality of the NMT system. We used Google Translate for this study on two language pairs, English-Hindi and English-Odia, and show the variance in translation performance in the following sections.
9.1.1.1 Number Conversion
The example shown in Figure 9.1 illustrates an erroneous conversion of an English number into Hindi and Odia. The numeric value of the number ‘2 million three thousand two hundred’ mentioned in English is 2003200, but the translated quantity in both Hindi and Odia has the value 203200, roughly a tenth of the English equivalent.
We convert the numbers that are expressed in words into their corresponding numeric values
for the source sentence. This, in turn, eliminates the problems arising due to the conversion,
and the efficiency of the NMT systems increases, as shown in Figure 9.2. The developed system
is available at https://github.com/Pruthwik/word2number-convertor.
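A minimal sketch of such a word-to-number conversion for English, in the spirit of the released tool; this simplified routine is our illustration of the idea, not the tool's actual code, and the multiplier table is deliberately small:

UNITS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
         "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
         "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
         "nineteen": 19, "twenty": 20, "thirty": 30, "forty": 40,
         "fifty": 50, "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}
MULTIPLIERS = {"thousand": 10**3, "lakh": 10**5, "million": 10**6,
               "crore": 10**7, "billion": 10**9}

def words_to_number(phrase: str) -> int:
    """Convert a number phrase such as '2 million three thousand two
    hundred' into its numeric value (2003200)."""
    total, current = 0, 0
    for w in phrase.lower().split():
        if w.isdigit():
            current += int(w)
        elif w in UNITS:
            current += UNITS[w]
        elif w == "hundred":
            current *= 100
        elif w in MULTIPLIERS:
            # Close the current group at this scale and start a new one.
            total += current * MULTIPLIERS[w]
            current = 0
    return total + current

print(words_to_number("2 million three thousand two hundred"))  # 2003200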
languages. In India, the SWAYAM and NPTEL platforms are translating educational video lectures from English into multiple Indian languages. As a huge number of lecture videos are created regularly, producing manual translations in multiple languages becomes an impossible task, so efforts are underway to develop supporting tools for Automatic Speech Recognition (ASR), Machine Translation (MT), and Text-to-Speech (TTS) in a human-in-the-loop paradigm. Our proposed solution is a post-processing step applied after ASR, before the text is supplied to the MT systems.
9.2.1 Motivation
MOOC lectures are usually subtitled. A subtitle file consists of plain text and the timelines of that text, created through transcription: the process of converting the speech in an audio recording into written text. In simple terms, a transcript (the outcome of transcription) is the textual form of what is spoken. Often, the transcript contains only plain text, with numbers and equations written out as words, e.g. ‘A line can be represented by an equation y equals to m into x plus c’. Such text is cumbersome to read, and it is even harder to follow for a course that requires mathematical symbols and equations, as the example shows.
If such a course gets translated, it will have the same properties of an unintelligible source tran-
script and will be difficult for a reader or listener in the target language. When mathematical
notations are used in the source text, the translations tend to preserve the same notations.
We present an approach to identify equation spans and convert them into their mathematical
notations.
1. B-EQ - denotes the beginning of an equation
2. E-EQ - denotes the continuation of an equation
As no pre-existing dataset for this task was available, we created an annotated dataset ourselves, choosing a probability course offered on NPTEL 1 . Firstly, we collected the available English transcripts from the NPTEL website. Instead of annotating everything manually, we followed a semi-automatic approach: we designed an annotator program using regular expressions to find the common mathematical patterns, and after the automatic annotation an expert validated the annotated equations. For the task, we measured inter-annotator agreement between two annotators and obtained a kappa score of 0.847, which indicates almost perfect agreement. The annotators disagreed on long equations, where one annotator chose the whole span as a single equation while the other split it into two smaller equations.
Regular expressions used to extract the equations are represented in Figure 9.4. Notations
are described below:
1 https://nptel.ac.in/
Figure 9.4: Regular Expressions to extract equations
• left_str → single operators appearing to the left of an operand in a sentence, e.g. inverse of x, factorial of y, minus xyz, mod of p
• right_str → single operators appearing to the right of an operand in a sentence, e.g. y factorial, x square, x cube, z naught
• assign_str → assignment phrases, e.g. is like, gives us, is given by the formula, will become
Due to inconsistencies in transcription, there are many ambiguities in identifying mathematical variables. Sometimes multiple letters are written without spaces between them, which risks misinterpreting the actual semantics of the equations, so we inserted spaces between letters when they occurred together. The biggest ambiguity arose between ‘a’ and ‘i’: each can be used either as an ordinary English word or as a variable. We used part-of-speech information from the contextual words for disambiguation.
We annotated 1255 sentences containing 2868 equations from the probability domain.
The initial models were developed using conditional random fields (CRF) [102]. Word-based contextual features are incorporated for different context window sizes. We used the CRF++ toolkit [103] for the CRF implementation; an equivalent feature setup is sketched below.
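A minimal sketch of such word-based contextual features, written with the sklearn-crfsuite package rather than CRF++ (an assumption made to keep the example in Python); the tiny training pair, window size, and B-EQ/E-EQ tag names follow the scheme described above and are illustrative:

import sklearn_crfsuite

def features(sent, i, window=2):
    """Word features from a symmetric context window around position i."""
    feats = {"bias": 1.0}
    for off in range(-window, window + 1):
        j = i + off
        if 0 <= j < len(sent):
            feats[f"w[{off}]"] = sent[j].lower()
    return feats

train_sents = [["y", "equals", "to", "m", "into", "x", "plus", "c"]]
train_tags = [["B-EQ", "E-EQ", "E-EQ", "E-EQ", "E-EQ", "E-EQ", "E-EQ", "E-EQ"]]

X = [[features(s, i) for i in range(len(s))] for s in train_sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit(X, train_tags)
print(crf.predict(X)[0])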
9.2.3.2 Neural Networks Based Models
Different neural network models were developed for this task. The details of the features used and the training parameters are given below.
3. BiLSTM-CRF [106]
9.3 Results
The results for CRF are presented in Table 9.1. The increase in F1 score is marginal as the context window grows, but bigger context windows capture more equation terms. This served as the baseline for the task.
We performed 5-fold cross-validation for all our models. The results in Table 9.2 show that the neural network based equation identifiers outperform the CRF-based models by a significant margin, but all the models suffer from low precision: the actual spans of the equations are not identified properly. In most cases the start of a span is identified correctly while the models mislabel its end, and the predicted equation span usually exceeds the gold span.
The grammar used for the conversion into notation is shown in Figure 9.5. S, Bin, Binop, L, R, X are all non-terminals, and var, sinop_l, sinop_r, binop are preterminal symbols. The terminal symbols are the actual mathematical symbols like ‘+, -, *, /, <’ etc.
Figure 9.5: Grammar for Conversion of the Equation Terms
Binop represents the
strings for the binary operators. sinop_l and sinop_r denote the string appearing to the left and
right of operands respectively. var represents the variable or operand strings. These notations
are similar to those already shown in Section 9.2.2.1.1. An Abstract Syntax Tree (AST) imposes a precedence order on the nodes, which in turn is helpful for building the expression tree. We output all possible parses for any input string; one AST and expression tree are shown in Figure 9.6, and another possible expression tree for ‘x plus y into z’ is x + (y ∗ z).
Figure 9.6: Abstract Syntax Tree and Expression Tree for x plus y into z
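To make the rule-based conversion concrete, a toy sketch of the term-to-symbol mapping; the table below is a small illustrative subset of any real ruleset, and the naive longest-phrase-first replacement deliberately ignores precedence and word-boundary issues:

BINOP = {"equals to": "=", "equal to": "=", "less than": "<",
         "plus": "+", "minus": "-", "into": "*", "times": "*"}

def to_notation(equation_span: str) -> str:
    """Map spoken operator terms in an identified equation span to symbols."""
    text = equation_span.lower()
    # Replace multiword operators first so 'equals to' is not split apart.
    for phrase in sorted(BINOP, key=len, reverse=True):
        text = text.replace(phrase, BINOP[phrase])
    return text

print(to_notation("y equals to m into x plus c"))  # y = m * x + c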
With the pre-defined ruleset, we correctly convert 43% of the identified equations. We do not
consider the symbol conversions or equations with a single word for calculating the accuracy.
We will explore the idea of modeling this as a sequence labeling task where the input is the
identified equation terms and the output is the corresponding mathematical symbols.
Chapter 10
In this chapter, we summarize the main contributions of this thesis and discuss some future directions stemming from this research.
10.1 Conclusion
In this work, we discussed different word problem solving approaches along with their strengths and weaknesses. We started with frame-based techniques and refined some existing approaches. We then showed the efficacy of deep learning models as well as their limitations, and thoroughly experimented with generating equations in different notations. We also discussed the characteristics of the available word problem datasets in depth. The development of effective and robust data augmentation techniques is the need of the hour: we proposed one generic augmentation technique to enrich any word problem dataset and showed the impact of the diversity of word problem datasets on the performance of automatic solvers. The data augmentation technique was tested on two languages, English and Hindi. We also showed that translation can be used as a data augmentation technique when resources are really scarce, which can boost the field of word problem solving for resource-poor languages such as Indian languages.
Word problem solving techniques can also assist various NLP tasks. Equation span identification and conversion is one of them, for which initial models have been developed; this can easily be integrated into any speech-to-speech MT system and can improve the performance of MT systems in translating mathematical equations and formulaic expressions. With the development of different models for word problem solving, there is a need for benchmark datasets that account for diversity in terms of both lexicon and templates, so we manually created 3 benchmark datasets in English, Hindi, and Telugu. As one of the initial teams to work on word problem solving in Indian languages, we developed solvers in English, Hindi, and Telugu, and empirically showed how transfer learning helps when fine-tuning multilingual models on a new task.
10.2 Future Directions
Although we presented some effective modeling strategies and data augmentation techniques in the domain of word problem solving, there remain some promising future directions.
• Graph-based transformers are state-of-the-art models for English but are not yet explored for Indian languages. This would be an interesting area, requiring additional toolkits such as dependency parsers and named entity recognizers.
• In this study, we could explore only two Indian languages. It will be an interesting area of research to exploit the relatedness between languages to solve word problems.
• Speech transcripts from the mathematics and technology domains present a different challenge where the input is in spoken form. Equation identification and conversion is a starting step in this direction; equation or mathematical notation conversion can open up many possibilities for word problem solving over spoken text.
Appendix
A List of Frames
Related Publications
1. Purvanshi Mehta, Pruthwik Mishra, Vinayak Athavale, Manish Shrivastava, and Dipti
Misra Sharma, “Deep Neural Network based system for solving Arithmetic Word prob-
lems”, Proceedings of the IJCNLP 2017, System Demonstrations, Tapei, Taiwan, pages
65–68
2. Pruthwik Mishra, Litton J Kurisinkel, and Dipti Misra Sharma, “Arithmetic Word Prob-
lem Solver using Frame Identification”, CICLing-2018
3. Pruthwik Mishra, Litton J Kurisinkel, Dipti Misra Sharma, and Vasudeva Varma, “EquGener:
A Reasoning Network for Word Problem Solving by Generating Arithmetic Equations”,
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Compu-
tation, Hong Kong, pages 456-465
4. Vandan Mujadia, Pruthwik Mishra, and Dipti Misra Sharma, “Deep Contextual Punctu-
ator for NLG Text”, SEPP-NLG 2021 Shared Task
5. Harshita Sharma, Pruthwik Mishra, and Dipti Misra Sharma, “HAWP: a Dataset for
Hindi Arithmetic Word Problem Solving”, Proceedings of the Thirteenth Language Re-
sources and Evaluation Conference, Marseille, France, pages 3479–3490
6. Pruthwik Mishra, Litton J Kurisinkel, and Dipti Misra Sharma, “Arithmetic Word Prob-
lem Solver using Frame Identification”, Extended Version, POLIBITS Journal, Accepted,
Not Published
7. Pruthwik Mishra, Vandan Mujadia, and Dipti Misra Sharma, “Multi Task Learning Based
Shallow Parsing for Indian Languages”, submitted to TALLIP Journal, Received 1st Re-
view
8. Harshita Sharma, Pruthwik Mishra, and Dipti Misra Sharma, “Verb Categorisation for
Hindi Word Problem Solving”, Accepted in ICON-2023
Bibliography
[1] D. G. Bobrow, “Natural language input for a computer problem solving system,” Ph. D.
Thesis, Department of Mathematics, MIT, 1964.
[2] C. R. Fletcher, “Understanding and solving arithmetic word problems: A computer sim-
ulation,” Behavior Research Methods, Instruments, & Computers, vol. 17, no. 5, pp.
565–571, 1985.
[5] S. S. Sundaram and D. Khemani, “Natural language processing for solving simple word
problems,” in Proceedings of the 12th International Conference on Natural Language
Processing, 2015, pp. 394–402.
[8] L. Zhou, S. Dai, and L. Chen, “Learn to solve algebra word problems using quadratic
programming,” in Proceedings of the 2015 Conference on Empirical Methods in Natural
Language Processing, 2015, pp. 817–822.
[9] S. Upadhyay and M.-W. Chang, “Annotating derivations: A new evaluation strategy and
dataset for algebra word problems,” arXiv preprint arXiv:1609.07197, 2016.
[10] S. Roy and D. Roth, “Solving general arithmetic word problems,” in Proceedings of the
2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 1743–
1752.
[11] ——, “Unit dependency graph and its application to arithmetic word problem solving,”
in Proceedings of the AAAI conference on artificial intelligence, vol. 31, no. 1, 2017.
[Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/10959
[12] S. Roy, S. Upadhyay, and D. Roth, “Equation parsing: Mapping sentences to grounded
equations,” arXiv preprint arXiv:1609.08824, 2016.
[14] Y.-C. Lin, C.-C. Liang, K.-Y. Hsu, C.-T. Huang, S.-Y. Miao, W.-Y. Ma, L.-W. Ku,
C.-J. Liau, and K.-Y. Su, “Designing a tag-based statistical math word problem solver
with reasoning and explanation,” in International Journal of Computational Linguistics
& Chinese Language Processing, Volume 20, Number 2, December 2015-Special Issue on
Selected Papers from ROCLING XXVII, 2015.
[15] D. C. Liu and J. Nocedal, “On the limited memory bfgs method for large scale optimiza-
tion,” Mathematical programming, vol. 45, no. 1-3, pp. 503–528, 1989.
[17] G. A. Miller, “Wordnet: a lexical database for english,” Communications of the ACM,
vol. 38, no. 11, pp. 39–41, 1995.
[18] Y. Wang, X. Liu, and S. Shi, “Deep neural solver for math word problems,” in Proceedings
of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp.
845–854.
[19] D. Huang, S. Shi, C.-Y. Lin, J. Yin, and W.-Y. Ma, “How well do computers solve math
word problems? large-scale dataset construction and evaluation,” in Proceedings of the
54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers), 2016, pp. 887–896.
[20] Z. Xie and S. Sun, “A goal-driven tree-structured neural model for math word problems.”
in IJCAI, 2019, pp. 5299–5305.
[21] W. Ling, D. Yogatama, C. Dyer, and P. Blunsom, “Program induction by rationale gen-
eration: Learning to solve and explain algebraic word problems,” in Proceedings of the
55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers), 2017, pp. 158–167.
[22] D. Huang, J.-G. Yao, C.-Y. Lin, Q. Zhou, and J. Yin, “Using intermediate representa-
tions to solve math word problems,” in Proceedings of the 56th Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 419–428.
[23] Q. Liu, W. Guan, S. Li, and D. Kawahara, “Tree-structured decoding for solving math
word problems,” in Proceedings of the 2019 Conference on Empirical Methods in Natural
Language Processing and the 9th International Joint Conference on Natural Language
Processing (EMNLP-IJCNLP), 2019, pp. 2370–2379.
[24] K. Griffith and J. Kalita, “Solving arithmetic word problems automatically using trans-
former and unambiguous representations,” in 2019 International Conference on Compu-
tational Science and Computational Intelligence (CSCI). IEEE, 2019, pp. 526–532.
[25] ——, “Solving arithmetic word problems with transformers and preprocessing of problem
text,” arXiv preprint arXiv:2106.00893, 2021.
[26] L. Wang, Y. Wang, D. Cai, D. Zhang, and X. Liu, “Translating a math word problem to
an expression tree,” arXiv preprint arXiv:1811.05632, 2018.
[28] A. Patel, S. Bhattamishra, and N. Goyal, “Are nlp models really able to solve simple
math word problems?” arXiv preprint arXiv:2103.07191, 2021.
[29] S. S. Sundaram, S. Gurajada, M. Fisichella, S. S. Abraham, et al., “Why are nlp models
fumbling at elementary math? a survey of deep learning based word problem solvers,”
arXiv preprint arXiv:2205.15683, 2022.
[30] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural net-
works,” in Advances in neural information processing systems, 2014, pp. 3104–3112.
[31] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to
align and translate,” arXiv preprint arXiv:1409.0473, 2014.
[33] H. Sharma, P. Mishra, and D. M. Sharma, “Hawp: a dataset for hindi arithmetic word
problem solving,” in Proceedings of the Thirteenth Language Resources and Evaluation
Conference, 2022, pp. 3479–3490.
[35] S. Roy and D. Roth, “Mapping to declarative knowledge for word problem solving,”
Transactions of the Association for Computational Linguistics, vol. 6, pp. 159–172, 2018.
[36] W. Zhao, M. Shang, Y. Liu, L. Wang, and J. Liu, “Ape210k: A large-scale and template-
rich dataset of math word problems,” arXiv preprint arXiv:2009.11506, 2020.
[37] S.-y. Miao, C.-C. Liang, and K.-Y. Su, “A diverse corpus for evaluating and developing
english math word problem solvers,” in Proceedings of the 58th Annual Meeting of the
Association for Computational Linguistics, 2020, pp. 975–984.
[38] D. Flickinger, “On building a more efficient grammar by exploiting types,” Natural Language Engineering, vol. 6, no. 1, pp. 15–28, 2000.
[39] B. Birnbaum and K. J. Goldman, “An improved analysis for a greedy remote-clique
algorithm using factor-revealing lps,” Algorithmica, vol. 55, no. 1, pp. 42–59, 2009.
[40] S.-Y. Miao, C.-C. Liang, and K.-Y. Su, “A diverse corpus for evaluating and developing
english math word problem solvers,” arXiv preprint arXiv:2106.15772, 2021.
[41] C. F. Baker, C. J. Fillmore, and J. B. Lowe, “The berkeley framenet project,” in Proceed-
ings of the 36th Annual Meeting of the Association for Computational Linguistics and
17th International Conference on Computational Linguistics-Volume 1. Association for
Computational Linguistics, 1998, pp. 86–90.
[42] M. Honnibal and I. Montani, “spaCy 2: Natural language understanding with Bloom
embeddings, convolutional neural networks and incremental parsing,” 2017, to appear.
[44] C. Cortes and V. Vapnik, “Support-vector networks,” Machine learning, vol. 20, no. 3, pp. 273–297, 1995.
[45] L. Breiman, “Random forests,” Machine learning, vol. 45, no. 1, pp. 5–32, 2001.
[46] K. Sparck Jones, “A statistical interpretation of term specificity and its application in
retrieval,” Journal of documentation, vol. 28, no. 1, pp. 11–21, 1972.
[47] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirec-
tional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[48] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer,
and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv
preprint arXiv:1907.11692, 2019.
[49] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “Distilbert, a distilled version of bert:
smaller, faster, cheaper and lighter,” arXiv preprint arXiv:1910.01108, 2019.
[51] D. Chen and C. Manning, “A fast and accurate dependency parser using neural networks,”
in Proceedings of the 2014 conference on empirical methods in natural language processing
(EMNLP), 2014, pp. 740–750.
[54] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent
neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.
[55] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint
arXiv:1412.6980, 2014.
[57] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., “Language models
are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019.
[59] J. Pennington, R. Socher, and C. Manning, “GloVe: Global vectors for word representation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
[61] A. Mitra and C. Baral, “Learning to use formulas to solve simple arithmetic problems,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016.
[62] S. Bird and E. Loper, “NLTK: The natural language toolkit,” in Proceedings of the ACL 2004 Interactive Poster and Demonstration Sessions. Association for Computational Linguistics, 2004, p. 31.
[63] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward
neural networks,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.
[65] S. Roy and D. Roth, “Solving general arithmetic word problems,” arXiv preprint
arXiv:1608.01413, 2016.
[66] G. Klein, Y. Kim, Y. Deng, J. Senellart, and A. M. Rush, “OpenNMT: Open-source toolkit
for neural machine translation,” arXiv preprint arXiv:1701.02810, 2017.
[68] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J.
Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,”
The Journal of Machine Learning Research, vol. 21, no. 1, pp. 5485–5551, 2020.
[69] J. Wei and K. Zou, “EDA: Easy data augmentation techniques for boosting performance
on text classification tasks,” in Proceedings of the 2019 Conference on Empirical Methods
in Natural Language Processing and the 9th International Joint Conference on Natural
Language Processing (EMNLP-IJCNLP), 2019, pp. 6382–6388.
[70] M. Maimaiti, Y. Liu, H. Luan, Z. Pan, and M. Sun, “Improving data augmentation for low-resource NMT guided by POS-tagging and paraphrase embedding,” ACM Transactions on Asian and Low-Resource Language Information Processing, vol. 20, no. 6, pp. 1–21, 2021.
[71] E. Loper and S. Bird, “NLTK: The natural language toolkit,” arXiv preprint cs/0205028,
2002.
[73] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
[74] Z. Wu and M. Palmer, “Verb semantics and lexical selection,” arXiv preprint cmp-
lg/9406033, 1994.
[75] K. Heafield, “KenLM: Faster and smaller language model queries,” in Proceedings of the Sixth Workshop on Statistical Machine Translation, 2011, pp. 187–197.
[78] J. Tiedemann and S. Thottingal, “OPUS-MT – Building open translation services for the
world,” in Proceedings of the 22nd Annual Conference of the European Association for
Machine Translation. European Association for Machine Translation, 2020.
[79] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: A method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting of the Association
for Computational Linguistics, 2002, pp. 311–318.
[80] V. Kumar, R. Maheshwary, and V. Pudi, “Practice makes a solver perfect: Data aug-
mentation for math word problem solvers,” arXiv preprint arXiv:2205.00177, 2022.
[82] G. N. Jha, “The TDIL program and the Indian Language Corpora Initiative (ILCI),” in LREC, 2010.
[83] G. Ramesh et al., “Samanantar: The largest publicly available parallel corpora collection for 11 Indic languages,” arXiv preprint arXiv:2104.05596, 2021.
[86] V. Goyal, S. Kumar, and D. M. Sharma, “Efficient neural machine translation for low-
resource languages via exploiting related languages,” in Proceedings of the 58th Annual
Meeting of the Association for Computational Linguistics: Student Research Workshop,
2020, pp. 162–168.
[87] V. Mujadia and D. M. Sharma, “Boosting MT performance for Indian languages,” 2022,
to appear.
[90] P. Mehta, P. Mishra, V. Athavale, M. Shrivastava, and D. Sharma, “Deep neural network
based system for solving arithmetic word problems,” in Proceedings of the IJCNLP 2017,
System Demonstrations, 2017, pp. 65–68.
[91] P. Mishra, V. Mujadia, and D. M. Sharma, “Multi-task learning based shallow parsing for Indian languages,” 2023, to appear.
[92] G. Katrapati, “Developing a word2vec model for Hindi from news articles,” 2017, to appear.
[95] V. Mujadia and D. M. Sharma, “Post Edit Me: An AI-enabled post-editing tool for speech-to-speech machine translation,” 2021, to appear.
[96] A. Mhaske, H. Kedia, S. Doddapaneni, M. M. Khapra, P. Kumar, R. Murthy V, and
A. Kunchukuttan, “Naamapadam: A large-scale named entity annotated data for Indic
languages,” arXiv preprint arXiv:2212.10168, 2022.
[97] R. Sennrich, B. Haddow, and A. Birch, “Neural machine translation of rare words with subword units,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 1715–1725.
[98] J. Chauhan, “An overview of MOOC in India,” International Journal of Computer Trends
and Technology, vol. 49, no. 2, pp. 111–120, 2017.
[103] T. Kudo, “CRF++: Yet another CRF toolkit,” 2005, available under LGPL from http://crfpp.sourceforge.net.
[104] A. Graves and J. Schmidhuber, “Framewise phoneme classification with bidirectional LSTM and other neural network architectures,” Neural Networks, vol. 18, no. 5-6, pp. 602–610,
2005.
[105] G. Hinton, N. Srivastava, and K. Swersky, “Neural networks for machine learning, lecture 6a: Overview of mini-batch gradient descent,” Coursera lecture notes, 2012.
[106] Z. Huang, W. Xu, and K. Yu, “Bidirectional LSTM-CRF models for sequence tagging,” arXiv
preprint arXiv:1508.01991, 2015.
[107] L. Bottou, “Stochastic gradient descent tricks,” in Neural Networks: Tricks of the Trade:
Second Edition. Springer, 2012, pp. 421–436.
[108] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint
arXiv:1711.05101, 2017.
[109] ——, “Fixing weight decay regularization in Adam,” 2018.