MasterThesisYH FinalDraft
Supervisor:
Prof. dr. ir. Rommert Dekker
Second assessor:
Prof. dr. Martijn G. de Jong
Final version:
October 30, 2023
The views stated in this thesis are those of the author and not necessarily those of the
supervisor, second assessor, Erasmus School of Economics or Erasmus University
Rotterdam.
Abstract
Comparing spare parts demand forecasting methods is an important part of the spare parts
demand forecasting field, even more so when newer methods are introduced. In this paper, new
methods are compared to older, widely used methods. The methods compared in this paper are
Croston’s method, the Syntetos-Boylan approximation (SBA), DLP, LightGBM, Long Short-Term
Memory (LSTM), the Multi-Layer Perceptron (MLP), Random Forest (RF), Willemain’s method
and quantile regression. Every method is applied to eight different data sets, which are grouped
into simulated data sets and industrial data sets. The performance of the methods is measured
through forecasting accuracy measures and inventory performance measures. In terms of
forecasting accuracy, quantile regression was superior overall, followed by MLP. Willemain’s
method was the overall best in terms of inventory performance. However, for lumpy demand,
LSTM outperforms Willemain in terms of inventory performance, and for erratic demand MLP
outperforms Willemain. Whereas MLP was the second-best performer in terms of forecasting
accuracy, LSTM did not stand out in that respect. We then compared the results to the reviewed
recent literature and found them to be comparable. Several findings stand out from this research:
the performance measure used and the data set category have an influence on the results, data
cleaning plays a crucial role, and hyper parameter tuning takes time and requires prior knowledge.
Contents
1 Introduction 1
2 Literature Review 3
2.1 Time-series forecasting methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.1 Parametric approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.2 Non-parametric approaches . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.3 Machine Learning methods . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.4 Comparison between parametric, non-parametric and ML methods . . . . 7
2.1.5 Conclusion time series forecasting methods . . . . . . . . . . . . . . . . . 7
2.2 Performance measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.1 Forecasting accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.2 Inventory control performance . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Comparative studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
4 Data 29
4.1 Industrial data sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 Simulated data sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.3 Classification of the data sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
7 References 47
7.1 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
1 Introduction
Spare parts producing companies have the responsibility of replacing obsolete or defective
parts with new ones when needed. Some spare parts can take quite some time to manufacture.
Furthermore, when the demand for such a spare part is sudden and large, original equipment
manufacturers (OEMs) might be caught off guard, and spare parts production takes time to get
going or, if a larger quantity is needed, to scale up. The waiting time for a spare part can be
costly, as it can cause downtime of production or of a service (Haan, 2021). The obvious solution
to the disruption of production caused by a lack of spare parts would be to keep spare parts
in stock at all times and in all places. However, keeping inventory is also costly, especially if
the spare parts in the inventory are expensive and take up a considerable amount of space.
Silver (1981), as described in Willemain et al. (2004), states that the demand for spare parts
can be intermittent and vary between no demand at all for multiple periods and very high
demand. In other words, the demand can be infrequent and the demand quantity can vary
greatly. For this reason, Willemain et al. (2004) and Syntetos et al. (2015) explain that demand
forecasting is not only difficult but also highly important so that the inventory can be managed
correctly. Durlinger and Paul (2015) and Callioni et al. (2005) found that, in general, companies’
yearly inventory holding cost can make up between 5 and 45 per cent of the cost of the inventory.
Knowing the spare parts demand in advance is therefore beneficial not only for the company’s
stock management but also for its finances. This is why spare parts forecasting is such an important
topic for a firm’s sales: it determines how much needs to be ordered. Supplying spare
parts can be a vital sales advantage, as the spare parts business is of high importance with
high margins. Suomala et al. (2002) elaborate that the spare parts business is economically
significant in many industries and can often even be considered the most profitable function of
a corporation. Some industries consider product sales as a positioning opportunity, so that the
customers depend on the services and pull-through sales of the product company. For example,
Epson and Hewlett-Packard mainly profit from the sale of toner cartridges and not from the
initial printer sale (Dennis and Kambil, 2003).
There are numerous forecasting methods in general. However, methods for spare parts
forecasting should be able to account for high intermittency and high variability. Furthermore,
as the machine learning (ML) domain is rising, ML methods for spare parts demand forecasting
have been developed. As for statistical methods, some ML spare parts forecasting methods
perform better than others depending on the situation (Haan, 2021). This research is important
as Pinçe et al. (2021) show that in the last five years, nearly no comparative study papers
were produced in the spare parts forecasting field. Furthermore, the ML domain is
continuously advancing. So, what can new comparative papers add to the field of spare parts
forecasting?
In this comparative paper, different spare parts forecasting methods will be compared to
each other with the goal of deducing which method performs best for which type of demand.
To measure demand forecasting accuracy and inventory performance, different measures for both
will be used. This also allows us to find out whether different measures provide different results.
The data sets used for the comparison are four industrial data sets and four simulated data sets,
representing demand that first needs to be classified into one of the following demand classes:
erratic, lumpy, smooth and intermittent. More details about the data sets and the classification
process are given in Section 4. This leads us to the main research question:
”Which methods perform best for which kind of demand and for which data set?”.
To answer this question, two sub-questions arise: ”Is the performance of certain methods due
to the measure used?” and ”Do ML methods perform better in general than statistical
methods?”.
In this empirical research, we first review the existing spare parts demand forecasting literature
in Section 2. Section 3 then presents the research design of the paper, the methods used for
forecasting and the measures of forecasting accuracy and inventory control performance. The
methods are applied to eight different data sets, which are presented in Section 4. After applying
the different methods to the data sets, the numerical results are interpreted and analysed in
Section 5; this section answers the research question and its sub-questions. Last but not least,
Section 6 concludes and discusses the findings and future research opportunities regarding our
paper.
2 Literature Review
The structure of the literature review is inspired by Pinçe et al. (2021)’s review. In their work,
they explain that the spare parts demand forecasting literature consists of three major categories
with several subcategories. The three major categories are time-series methods, contextual
methods and comparative studies such as this paper.
2.1 Time-series forecasting methods
The first literature category contains time-series forecasting method papers, which give detailed
explanations of time-series methods. Haan (2021) explains that time-series methods are built on
historical data, from which they try to provide a forecast of future data. Pinçe et al. (2021) divide
the time-series literature into three sub-categories: parametric, non-parametric and forecast
improvement strategies. The latter is divided into two branches, demand classification and
demand aggregation. The four demand classes are erratic, lumpy, smooth and intermittent. The
categorization is based on multiple demand characteristics, which are explained in Section 4.1.
In this paper, the demand is classified accordingly and the best performing forecasting method
per demand class is identified. Demand aggregation methods aim to reduce the variability of the
demand (Pinçe et al., 2021).
Syntetos et al. (2012) explain that a parametric approach assumes that the lead-time demand
follows a certain known distribution, whereas non-parametric approaches, as explained by
Pinçe, Turrini, and Meissner, derive their lead-time demand distribution from the data. Both
categories can be further sub-categorised. The parametric branch can be subdivided according
to whether a modification of Croston’s method is proposed, whether demand obsolescence is
incorporated and whether statistical bootstrapping is used. The non-parametric category is divided
into three sub-categories: bootstrapping, neural networks and empirical methods. These
categorizations of time-series methods are relevant for this research, as they will also be used in
this paper. Furthermore, the grouping allows us to give a general conclusion for each category
and its performance in the comparison.
It is noteworthy that many forecasting methods exist and more methods are being developed,
as it is a challenging scientific topic. In the next subsections, the existing literature on the
methods used in this paper is reviewed. Some existing methods are only briefly mentioned or not
mentioned at all in our literature review, as covering them would be too extensive for a master’s
thesis.
2.1.1 Parametric approaches
The first parametric spare parts forecasting method is Croston’s method. Croston’s method was
developed to address the inaccuracies of traditional forecasting methods such as Simple Exponential
Smoothing (SES), which were caused by periods of no demand or very low demand
(Croston, 1972). Croston (1972) solves this by splitting the demand estimate into a demand size
part and an inter-demand interval part. The two parts are predicted individually with SES.
Being the first spare parts forecasting method, Croston’s method is used as a benchmark for the
performance comparison with other methods.
Later, Syntetos and Boylan (2005) introduced a new method named the Syntetos-Boylan
approximation (SBA), which, like Croston’s method, is also based on SES. As mentioned in Pinçe
et al. (2021), Syntetos and Boylan (2001) explain that Croston’s method is biased; SBA corrects
for this bias. A formula for SBA and an explanation of how it differs from Croston’s method
can be found in Section 3.1, where the methods used in this paper are elaborated. Pinçe et al.
(2021) conclude that in terms of accuracy measures, SBA outperforms Croston’s method for
industrial spare parts data sets. After SBA, other parametric approaches, like Teunter-Syntetos-
Babai (TSB), have been introduced. However, they are not used in this comparative study; the
reason for this is given in Section 3.1.
Last but not least, a more recent parametric approach named DLP was introduced by Pennings
et al. (2017), who presented a dynamic intermittent demand forecasting method. DLP anticipates
the incoming demand of spare parts by including the positive cross-correlation between demand
sizes and interarrival times (Haan, 2021; Pinçe et al., 2021). In Pennings et al. (2017), the
performance of the method depends on the data set and forecast accuracy measure used. Five
different data sets are used (Electro, ElecInd, Raf, Auto and Navy) and two forecast accuracy
measures (MASE and GMAE). Most of the time, SBA performs best. SBA outperforms the other
methods in terms of MASE and GMAE when applied to the ElecInd and Auto data sets.
Furthermore, SBA outperforms the other methods in terms of MASE for the Raf and Navy data
sets. DLP is the best performer in terms of GMAE for the Electro, Raf and Navy data sets. TSB
achieved the lowest (and thus best) MASE for the Electro data set.
2.1.2 Non-parametric approaches
Haan (2021) explains that bootstrapping is used to simulate the distribution of missing data
by resampling existing data. By doing this, more data is available to model. Bootstrapping was
first introduced by Efron (1979). An often-used non-parametric bootstrapping approach, which is
also used in our comparative study, is the one by Willemain et al. (2004), abbreviated WSS. As
explained in Pinçe et al. (2021), Willemain et al. (2004) modify the existing bootstrapping method.
The new method takes into account three features of intermittent demand that were neglected by
the classical bootstrapping method: autocorrelation, frequently repeated values and relatively
short time series. Haan (2021) sums up the detailed explanation given by Willemain et al.
(2004). WSS uses a Markov model to first forecast a sequence of zero and non-zero values over
the lead time periods based on past demand. After this, all the non-zero forecasts are assigned
specific numerical values, obtained from a random sample of past non-zero values. Lastly, the
jittering process starts. The jittering process, explained in detail in Section 3.1.3, allows new
demand sizes to be obtained and smoothens the demand distribution.
Other non-parametric bootstrapping methods are those by Zhou and Viswanathan (2011)
and Porras and Dekker (2008). Zhou and Viswanathan (2011) is seen as an improvement of WSS;
the difference from WSS is that Zhou and Viswanathan (2011) generate the non-zero lead-time
demand by bootstrapping the past distribution of the inter-demand intervals. Pinçe et al.
(2021) and Haan (2021) explain that Porras and Dekker (2008) is referred to as an empirical
method and that it is simpler than bootstrapping.
The last non-parametric method used in this comparative study is quantile regression. Quantile
regression has not been applied much as a spare parts demand forecasting method, which is
also why it is difficult to find research papers on it. Trapero et al. (2019) use a quantile
combination scheme: first, they obtain the quantiles of the lead time forecast density function,
and then they determine the safety stock. The researchers explain that, based on Boylan,
Syntetos, et al. (2006), the whole forecast distribution is not needed; Boylan, Syntetos, et al.
(2006) suggest that only the upper quantiles should be taken into account.
2.1.3 Machine Learning methods
ML methods typically split the data into two sets, using one set (the training set) to fit a model
that is then used to predict the outcome of the other set of the data (the test set) (Learned-Miller,
2014). Haan (2021) also correctly points out that ML methods, such as neural networks, are often
difficult to interpret. In the ML field, researchers speak of a black box, because the input and the
output are known, but the way the result is obtained is unknown (Rudin, 2019). However, in
spare parts demand forecasting, the result is more important (high accuracy and good inventory
performance) (Haan, 2021). Another important aspect of ML methods is hyper parameter
tuning. Makridakis et al. (2022) refer to Makridakis et al. (2018), who state that there
are many possible adjustments of the hyper parameters and that finding the best ones takes
time.
One of the first ML methods used in the field of spare parts demand forecasting is the
Multi-Layer Perceptron (MLP) of Gutierrez et al. (2008) (Haan, 2021; Spiliotis et al., 2020). MLP
is a form of neural network. Gutierrez et al. (2008) find that neural network models generally
perform better than the traditional methods (single exponential smoothing, Croston’s method
and the Syntetos-Boylan approximation) in forecasting lumpy demand. This finding is backed by
three different performance measures: the Mean Absolute Percentage Error (MAPE), the
Percentage Best (PB) and the Relative Geometric Root-Mean-Square Error (RGRMSE). After
Gutierrez et al. (2008)’s neural network method, other neural network approaches were introduced.
Kourentzes (2013) proposes a neural network method that can handle intermittent time series.
In terms of forecasting accuracy, his neural networks perform worse than the best-performing
variation of Croston’s method, whereas in terms of service level they achieve better results. One
less studied neural network method is Long Short-Term Memory (LSTM). LSTM was introduced
by Hochreiter and Schmidhuber (1997) and has not been extensively studied in the field of spare
parts demand forecasting. Chandriah and Naraganahalli (2021)’s paper is one of the few that use
the LSTM method, with a modified Adam optimizer, for automobile spare parts demand
forecasting. Their method is superior to SES, TSB, SBA, Croston and modified SBA in terms of
forecasting accuracy and inventory management.
Other non-neural-network ML methods that are used in the spare parts demand forecasting
field are LightGBM and Random Forest (RF). The former was the base of many of the best
performing models in the M5 competition analyzed by Makridakis et al. (2022) (Haan, 2021).
LightGBM is a gradient boosting algorithm. Haan (2021) uses LightGBM in his thesis, where it is
the worst performer based on the Percentage Better comparison. However, LightGBM performs
well, but not as well
as MLP for extremely intermittent data, based on inventory control performance. Overall,
SBA seems to be the best performer in Haan (2021)’s thesis. RF is used by Spiliotis
et al. (2020), where it is the best performing method in general, ahead of other ML methods such
as Gradient Boosting Trees, MLP, Bayesian Neural Networks, K-Nearest Neighbor Regression,
Support Vector Regression and Gaussian Processes. It is important to note that the ranking
changes slightly depending on the type of data. However, in general, Gradient Boosting Trees
and RF perform best with respect to inventory performance and prediction accuracy. Another
paper that applies RF to forecast spare parts demand is Choi and Suh (2020). They show that
RF is superior to Support Vector Regression, Linear Regression and a Neural Network based on
the Mean Absolute Error (MAE) and the Root Mean Squared Error (RMSE) when applied to
South Korean aircraft data.
2.1.4 Comparison between parametric, non-parametric and ML methods
Since spare parts demand forecasting is a challenging scientific topic, many new methods
are being introduced, and researchers and field experts compare the methods with each other.
Based on the reviewed literature, one can say that different outcomes have been found. Willemain
et al. (2004) show that, in comparison to SES and Croston’s method, their method yields a better
forecast accuracy of the demand distribution over a fixed lead time; in that paper, non-parametric
methods are thus superior to parametric approaches. In Spiliotis et al. (2020), ML methods
outperform statistical methods, except for SBA, which ranks 7th out of 18 methods. The top three
performing methods, based on the Root Mean Squared Scaled Error (RMSSE) and the Absolute
Mean Scaled Error (AMSE) for all four types of data, are Support Vector Regression, Gradient
Boosting Trees and RF (the order changing depending on data type and measure). Pinçe et al.
(2021) conclude that, in general, the non-parametric methods outperform the parametric methods.
In Pennings et al. (2017), the parametric methods perform much better than the non-parametric
methods for the same data, whereas Lolli et al. (2017) conclude the opposite. It is noteworthy
that both papers use different methods for the parametric and non-parametric approaches, which
could explain the contradictory results.
2.1.5 Conclusion time series forecasting methods
In summary, one can say that many different time series forecasting methods exist. We presented
the three major categories (parametric, non-parametric and ML methods). Parametric methods,
being the first developed, are most often used as benchmark methods, namely Croston’s method
and SBA, with Croston’s method being the first method developed for spare parts demand
forecasting (Croston, 1972). Non-parametric and ML methods have been developed more recently.
ML methods,
especially, are being studied extensively as they yield promising results in other supply chain
management contexts (Pinçe et al., 2021). Next to delivering high forecasting accuracy and solid
inventory performance, these methods also need to take into account the type of data. In fact,
spare parts data can show extremely high intermittency. Not every method performs well for
every forecasting accuracy measure, every inventory performance measure and every data
set. This literature review suggests that the different strengths of the different methods should
be combined, such that the forecasting accuracy stays high and the improvement in inventory
management cuts the costs of stock keeping.
2.2 Performance measures
In the spare parts demand forecasting field, two categories of performance measures are mainly
used to measure the performance of the methods: forecasting accuracy and inventory
performance.
2.2.1 Forecasting accuracy
Forecast accuracy measures quantify the performance of the predictions made by a model by
comparing forecasts to actual values. In the case of spare parts demand forecasting, the demand
for a spare part is predicted and then compared to the actual value from the test set.
Pinçe et al. (2021) and Haan (2021) explain that there are two types of forecasting accuracy
measures: relative accuracy measures and absolute accuracy measures. The former quantify
the performance of different forecasting methods relative to each other, while the latter give an
indication of the forecasting error (Haan, 2021; Syntetos & Boylan, 2005). Pinçe et al.
(2021) show that 72.6% of the papers that they reviewed use an absolute accuracy measure.
Pinçe et al. (2021) provide the table below, which contains commonly used absolute accuracy
measures.
Table 1: Common absolute accuracy measures

Root mean squared error: $\mathrm{RMSE}_t = \sqrt{\frac{1}{t}\sum_{s=1}^{t} e_s^2}$
Mean absolute percentage error: $\mathrm{MAPE}_t = \frac{\sum_{s=1}^{t}|e_s|}{\sum_{s=1}^{t} Y_s}$
Mean absolute scaled error: $\mathrm{MASE}_t = \frac{\frac{1}{t}\sum_{s=1}^{t}|e_s|}{\frac{1}{t-1}\sum_{i=2}^{t}|Y_i - Y_{i-1}|}$
Geometric mean absolute error: $\mathrm{GMAE}_t = \left(\prod_{s=1}^{t}|e_s|\right)^{1/t}$
Geometric root mean squared error: $\mathrm{GRMSE}_t = \left(\prod_{s=1}^{t} e_s^2\right)^{1/(2t)}$
2.2.2 Inventory control performance
A high forecast accuracy alone does not necessarily mean that the inventory is well managed
(Pinçe et al., 2021; Syntetos & Boylan, 2006; Syntetos et al., 2010; Teunter & Duncan, 2009).
Therefore, inventory performance measures are also needed, as the cost implications of holding
stock are high and even higher for not having stock at all. In contrast to forecasting accuracy
measures, inventory control measures do not compare forecasts to actual values; they measure
the effectiveness of the stock management in terms of achieved cycle service level, trade-off curve,
total cost, stock volume or shortage volume. Therefore, a distribution of the demand needs to be
assumed. In Section 3.3, the choice of the assumed distribution for our paper is explained. Pinçe
et al. (2021) provide a visualisation of the inventory performance measures used in spare parts
demand forecasting, where service level and trade-off curve are the two most used measures,
ahead of total cost, other measures (average total cost, average on-hand inventory or stock-out
volumes), stock volume and shortage volume. Including inventory control measures allows us to
see the financial implications of inventory management.
2.3 Comparative studies
Finally, the third major category of the spare parts forecasting literature is comparative studies.
In such studies, different spare parts forecasting methods are benchmarked and compared to each
other after being applied to different data sets. The methods used for this comparison study will
be presented in Section 3.1. The results of each method applied to each data set are then
quantified to be able to compare their performances (Pinçe et al., 2021). The performance
measures used are forecasting accuracy measures or inventory performance measures; both types
of measures will be used in this paper. Pinçe, Turrini, and Meissner explain that most studies use
forecast accuracy measures, as there seems to be no general convention on which methods to use
as a
benchmark. However, inventory performance is described as providing more realistic benchmarks
in Teunter and Duncan (2009). A recent paper that takes both into account is Haan (2021).
This step is important for the field of spare parts demand forecasting, as it allows the findings of
researchers to be constantly challenged and compared with each other. Furthermore, in the
last five years, there have not been many new comparative studies (Pinçe et al., 2021).
More recent comparative studies are the master thesis of Haan (2021) and the paper of
Aktepe et al. (2021). To our knowledge, the newest papers that compare spare parts demand
forecasting methods are İfraz et al. (2023) and Theodorou et al. (2023).
Haan (2021) compares seven methods with each other: five conventional methods, namely
Simple Exponential Smoothing (SES), Croston’s method, the Syntetos-Boylan approximation
(SBA), Teunter-Syntetos-Babai (TSB) and Willemain’s method, and two ML methods, namely
the Multi-Layer Perceptron (MLP) and LightGBM. The methods are applied to the same eight
data sets (four industrial data sets and four simulated data sets) that are used for our paper. De
Haan concludes that, based on the Percentage Better (PB) comparison, SBA performs best
overall and LightGBM performs worst. This relative measure allows the superior methods to be
determined. When comparing the performances based on inventory control performance,
Willemain’s method is the best performing method. This is only true for data that is not
categorized as extremely intermittent; for such demand, Haan (2021) concludes that MLP and
LightGBM are the best performers. Two critiques of this thesis are that De Haan includes the
TSB method although obsolescence is not identified, and that the LightGBM model is not tuned
for the hyper parameters. In fact, for the latter, De Haan relies on the parameter values of Kailex
(2020). Obsolescence means that a spare part is no longer needed, so that the demand for that
item goes towards zero (Van Jaarsveld & Dekker, 2011). TSB was introduced by Teunter et al.
(2011) as an improvement to Croston’s method, as the latter yields poor performance under
obsolescence. Nonetheless, obsolescence can be implicit and unidentified.
İfraz et al. (2023) compare four different types of methods for spare parts demand forecasting.
The types of methods used are: regression-based methods (multivariate linear regression (MLR),
multivariate nonlinear regression (MNR), Gaussian process regression (GPR), additive regression
(AR), regression by discretion (RbD), support vector regression (SVR)), rule-based methods
(decision table,
M5Rule), tree-based methods (Random Forest (RF), M5P, Random tree, Reduced Error Pruning
Tree) and artificial neural networks (ANN). This paper’s contribution is important to the spare
parts demand forecasting field, as it uses more ML methods than previous comparison studies.
The researchers apply the methods to a data set of an urban transport bus fleet of a metropolitan
municipality. The inventory is classified using the Always Better Control (ABC) method.
The ABC method follows two rules: the first rule states that items of low value should be amply
kept in stock; the second rule dictates that the quantity of high-value items should be
sparse, but should be checked more frequently. Although the ABC classification method is not
used in our paper, İfraz et al. (2023) include multiple ML methods, which provides guidance for
our paper, as we apply some of the ML methods to our data sets.
In Aktepe et al. (2021), four methods (linear regression, nonlinear regression, ANN
and SVR) are used to predict the sales of a construction machinery company, whose business
consists of the sale of spare parts it produces for other companies. The researchers explain in
their conclusion and discussion that the ML methods perform better than the linear
and nonlinear regression models in terms of forecasting accuracy. This is another reason to analyze
the performance of ML methods in our paper, as ML methods look promising in the field of spare
parts demand forecasting. Nonetheless, Aktepe et al. (2021) do not provide inventory performance
measures, which would allow one to check whether the findings remain consistent. This is why, in
our paper, inventory control measures are used next to the forecast accuracy measures to test for
a difference in the outcome.
Most recently, Theodorou et al. (2023) conducted a study on the connection between forecasting
accuracy and inventory performance, applied to the M5 competition data set from Makridakis
et al. (2022). The inventory performance measures used are trade-off curves and monetary cost
estimates (lost sales and holding inventory), as a cost variable is available in the data set. Twelve
forecasting methods are used in their paper:
• Naive and seasonal Naive (sNaive)
• Moving Average (MA)
• Simple Exponential Smoothing (SES)
• Croston’s method
• Syntetos-Boylan approximation (SBA)
• Teunter-Syntetos-Babai (TSB)
• Exponential Smoothing (ES)
• Automated selection of ARIMA models (AutoRegressive Integrated Moving Average)
• Aggregate-Disaggregate Intermittent Demand Approach (ADIDA) & intermittent Multiple
Aggregation Prediction Algorithm (iMAPA)
• LightGBM
To measure the accuracy of the forecasts, the Root Mean Squared Scaled Error (RMSSE) is
used. The performance of the models is related to the length of the review period. The ranking
of the performances of the methods is provided in Table 2. The researchers conclude that the
optimal choice of forecasting method may vary depending on the assumed costs. Furthermore,
the choice of forecasting method should be connected to the target, as more accurate methods
do not necessarily show lower costs. Only one forecasting accuracy measure, namely RMSSE,
is used in their paper, which does not allow the performances of the methods to be compared
across forecasting accuracy measures. This is why our paper uses MSE and MASE next to
RMSSE. Regarding the inventory performance, the researchers assume a normal distribution,
which differs from our case, as we assume a gamma distribution. Nguyen (2023) compared the
normal and gamma distributions for our data sets and concluded that the gamma distribution
performs better.
The research question and the sub-questions of our paper are also of great importance in
Pinçe et al. (2021). For the main research question, there is no simple answer: Pinçe et al. (2021)
explain that the performances of the methods vary from one industrial data set to another. For
the sub-question about measures, the inventory performance measure and accuracy measure used
play a big role in the results, as they can yield different outcomes. Furthermore, the joint use of
inventory performance measures and accuracy measures is advised, as both do not necessarily
show the same performance results. Regarding the sub-question about the performance of ML
methods compared to statistical methods, Pinçe et al. (2021) explain that, as mentioned in
Baryannis et al. (2019) and later by Kraus et al. (2020), ML methods are of good use in other
supply chain management contexts, which is why they could also work well in spare parts demand
forecasting.
From this literature review, we conclude that, as there have not been many comparative
studies in the spare parts forecasting field lately, this paper can contribute to the field. Furthermore,
we can conclude that ML methods have a lot of potential in spare parts demand forecasting,
as in two out of the four reviewed comparative papers they perform better than traditional
methods. Nonetheless, this finding should not be trusted blindly. In fact, of the four comparative
papers, only Haan (2021) uses multiple data sets; the other papers apply their methods to one
single data set. Furthermore, when comparing the performances of the methods, a lot of
variability regarding the superior method is observed. In fact, the findings of the papers that
use some of the same methods are not consistent with each other. Our research question thus
appears to be important in other papers too. The literature review provides guidance on how
other researchers approached the research questions, which is taken into consideration for this
paper, although we also focus on different forecasting methods.
Thus, the following methods are the most promising for our research: Croston, SBA, MLP,
Willemain, RF and LightGBM. Furthermore, as performance measures, we decide on MSE,
MASE, RMSSE and trade-off curves. We also decided to use other methods and measures that
have not been used in the comparative papers from Table 2: DLP, quantile regression and LSTM
for forecasting and GMAE to measure the accuracy. The reasons why we use these methods and
measures are given in Section 3.1.
Tables 2 and 3 on the next two pages provide an overview of the reviewed recent comparative
studies in the spare parts demand forecasting field and their key findings.
Table 2: This table gives an overview of the data, the methods, and the performance of the
comparative papers.
Theodorou et M5 com-
Naive& sNaive RMSSE Trade-off
al. (2023) petition by
MA curves
Makridakis
SES (Normal
et al. (2022)
Croston distribution
SBA assumed for
TSB the target
ES service
ARIMA level)
ADIDA Monetary
iMAPA cost
LightGBM
14
Table 3: This table summarizes the key findings of the reviewed comparative studies.
3 Research design and methodology
This research is structured in two steps. The first step consists of setting up the experimental
design based on previous comparative studies such as Haan (2021). The second step consists of
selecting several spare parts forecasting methods and describing the technique behind each
method. Furthermore, the measures used for forecasting accuracy and inventory performance are
elaborated. By applying different performance measures to the different methods, we aim to
answer the research questions and analyse which methods perform best for which performance
measures.
In the experimental design, the four industrial data sets and the four simulated data sets are
explored and classified into one of the four demand classes. This allows us to investigate whether
certain methods perform better for certain types of demand.
The data sets have already been cleaned by de Haan (2021) and improved by Nguyen (2023)
for efficiency. It is also noteworthy that outliers have already been removed by these two authors.
Next, the data is split into a training and a test set. Then, the chosen methods are applied to the
data sets, which allows us to compare the results in terms of forecasting accuracy and inventory
performance and to examine the differences in the results due to the different data sets. Why
these data sets were picked and what their characteristics are is explained in Section 4.
3.1 Methods for the comparison
For each method, we explain the reason for using it in this comparative study and how the
method works. We start by presenting the parametric (statistical) methods.
Croston
One method that is commonly used as a benchmark is Croston’s method, which is also the
first method developed for spare parts demand forecasting. Croston’s method is elaborated in
detail in Croston (1972) and is built on the Simple Exponential Smoothing (SES) method.
Kourentzes (2013) explains that Croston’s method focuses on two separate components: $z_t$,
the non-zero demand size, and $x_t$, the inter-demand interval. SES uses a smoothing parameter
which puts more weight on recent data (demand). However, for intermittent demand, where
zero-demand periods can occur, SES would take the zero-demand periods into account, and these
extreme values have an impact on the prediction. This is why, in Croston’s method, $z_t$ has to
be non-zero, as the estimates are only updated when demand occurs. The prediction of Croston’s
method is given by
$$\hat{Y}_t = \frac{\hat{z}_t}{\hat{x}_t}.$$
Croston’s method is included in this paper, as it allows us to compare the newer methods to a
method which is used as a benchmark in most spare parts demand forecasting papers.
Syntetos-Boylan approximation
Another statistical method serving as a benchmark is Syntetos and Boylan (2005)’s method,
abbreviated as SBA. The decision to include SBA is due to the fact that SBA was developed to
correct the bias of Croston’s method. SBA proposes the following estimator:
$$Y_t' = \left(1 - \frac{\alpha}{2}\right)\frac{z_t'}{x_t'},$$
where $\left(1 - \frac{\alpha}{2}\right)$ is the bias correction coefficient and $\alpha$ represents the smoothing
constant used to update the inter-demand intervals. Both Croston’s method and SBA can be
applied through the ”tsintermittent” R package by Kourentzes (2014).
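To illustrate how the package is typically called, the following is a minimal sketch; the demand
vector, the forecast horizon and the function arguments are hypothetical rather than the settings
used for the data sets in this thesis.

# Minimal sketch of Croston's method and SBA with the 'tsintermittent' package;
# the demand series and horizon are illustrative.
library(tsintermittent)

demand <- c(0, 0, 5, 0, 0, 3, 0, 7, 0, 0, 4, 0)        # intermittent demand history

croston_fc <- crost(demand, h = 4, type = "croston")   # Croston's method
sba_fc     <- crost(demand, h = 4, type = "sba")       # Syntetos-Boylan approximation

croston_fc$frc.out   # out-of-sample forecasts for the next 4 periods
sba_fc$frc.out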
DLP
The last statistical method is presented by Pennings et al. (2017). The DLP method is an
intermittent demand forecasting method that assumes a dependence between the interarrival time
(elapsed time) and the demand size to anticipate incoming demand, which is not the case in
methods like Croston’s (Pinçe et al., 2021).
$$D_{L,t} = \mu\left[L + \left(\tau_0 - \frac{1-p}{p}\right)\left(1 - (1-p)^L\right)\right]$$
The left-hand side of the equation represents the expected total demand $D_{L,t}$ at time period $t$ for
an SKU over a lead time of $L$. The expected total demand is calculated proportionally to the
inter-arrival time ($\tau_0$) with respect to the probability of non-zero demand ($p$). In other words,
the DLP method exploits the elapsed time ($\tau_0$) to anticipate incoming demand. The part of the
equation $\left(\tau_0 - \frac{1-p}{p}\right)\left(1 - (1-p)^L\right)$ takes this elapsed time into account and adjusts the expected
demand by the probability of non-zero demand ($p$); $1 - (1-p)^L$ represents the probability of
at least one demand occurring over the lead time period.
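To make the roles of the parameters concrete, the sketch below simply evaluates the expected
lead-time demand formula; the parameter values are hypothetical and chosen for illustration only.

# Sketch evaluating the DLP expected lead-time demand formula above;
# all parameter values are illustrative.
dlp_expected_demand <- function(mu, p, tau0, L) {
  mu * (L + (tau0 - (1 - p) / p) * (1 - (1 - p)^L))
}

# Mean demand size 5, 20% probability of non-zero demand per period,
# 3 periods elapsed since the last demand, lead time of 4 periods.
dlp_expected_demand(mu = 5, p = 0.2, tau0 = 3, L = 4)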
The reason for including this method is that Pennings et al. (2017) obtain encouraging
results. In fact, the researchers state that they are able to reduce unnecessary inventory investment
by 14% for SKUs that exhibit cross-correlation, compared to Croston’s method. As no
package exists for the DLP method, some code has been provided by Dr Jan Van Dalen (one
of the three researchers who introduced the method). The code is run in RStudio and will be
provided on a GitHub page dedicated to this master thesis, linked in subsection 6.6.
LightGBM
The first ML method is LightGBM. As described in Haan (2021), LightGBM was the base of
many of the top methods in the M5 competition analyzed by Makridakis et al. (2022). Gradient
boosting methods are also commonly used in Kaggle competitions, as they perform quite well and
are easy to use. This is also the reason why LightGBM is used in this paper. For this method, the
code from de Haan (2021) is used. However, we tune the hyper parameters differently,
to see if the findings improve. We set the learning rate to 0.01 (previously 0.075), increased
the number of rounds to 15000 (previously 12000) and dropped the sub feature and sub row
arguments. De Haan adapted the code from Kailex (2020) for the same data sets that are used in
this paper and did not try other values for the hyper parameters. The functioning of
LightGBM is documented in the open-source documentation of Microsoft (2021), and the R
package used in this paper in Shi et al. (2022). Furthermore, the hyper parameters and their
respective roles can be found in Table 4. The descriptions of the roles of the hyper parameters
have been obtained from the Microsoft (2021) page.
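As a rough illustration of this setup, the following is a minimal sketch using the lightgbm R
package; the feature matrices train_x and test_x and the target train_y are hypothetical
placeholders, and the call is a simplified sketch rather than de Haan’s adapted code.

# Minimal sketch of the LightGBM setup described above ('lightgbm' R package);
# train_x, train_y and test_x are hypothetical placeholders.
library(lightgbm)

dtrain <- lgb.Dataset(data = as.matrix(train_x), label = train_y)

params <- list(
  objective     = "regression",   # demand forecasting treated as a regression task
  metric        = "rmse",
  learning_rate = 0.01            # tuned value described above (previously 0.075)
)

model <- lgb.train(
  params  = params,
  data    = dtrain,
  nrounds = 15000                 # number of boosting rounds described above (previously 12000)
)

preds <- predict(model, as.matrix(test_x))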
Long short-term memory

The next method is Long Short-Term Memory (LSTM), which is a type of Recurrent Neural
Network (RNN). LSTM has been applied by Chandriah and Naraganahalli (2021) to forecast
automobile spare parts demand. As explained in Chandriah and Naraganahalli (2021), the
difference between an RNN and a feed-forward neural network is that the RNN uses a feedback
connection to remember the prior time steps: it functions by remembering the output of the
previous data point and re-using it for the next one (memory). Over long sequences this process
becomes problematic, which is where LSTM comes in handy, as it is able to resolve the vanishing
gradient problem of RNNs. The vanishing gradient problem means that with every parameter
update the gradient becomes smaller. However, the gradient carries the information, so a smaller
gradient also provides less information. For long data sequences this becomes a problem, as the
parameter updates are no longer significant; in other words, no learning happens anymore. In this
paper, LSTM is used in combination with the Adam optimizer (adaptive moment estimation), as
in Chandriah and Naraganahalli (2021). The Adam algorithm optimizes the weights at each level.
The technicalities of RNNs and LSTM are explained in depth in Sherstinsky (2020) and in the
seminal paper of Hochreiter and Schmidhuber (1997). We decided to include this method in the
comparison because, in the paper of Chandriah and Naraganahalli (2021), the researchers state
that the modified Adam optimizer performs well for their data set. Furthermore, one of the data
sets used in our study is also an automotive data set. However, Chandriah and Naraganahalli
(2021)’s paper states that ”The Croston method forecasts the demand by separating the time
intervals and demand size. This method is better compared to conventional Simple Exponential
Smoothing (SES), Syntetos-Boylan-Approximation (SBA), Croston, Teunter-Syntetos-Babai (TSB)
and Modified SBA. However, the performance of these methods is poor for intermittent demand.”.
This is not in accordance with findings from other papers such as Pinçe et al. (2021), Teunter
et al. (2011) and other renowned papers. In fact, the TSB method was introduced to adjust for
the lagging update of the variation of the new demand levels. Thus, TSB should perform better
than the previous methods (Croston and Croston’s modifications) for intermittent spare parts
demand forecasting in most cases (Pinçe et al., 2021). Furthermore, although Chandriah and
Naraganahalli (2021) categorize their paper in the spare parts demand forecasting field, their data
consists of new cars and not spare parts. This differs from spare parts demand forecasting and is
not directly helpful for our paper; however, the paper guides us on how to apply the LSTM
method.
To run the model, the ”keras” and ”tensorflow” packages in R are used and the optimizer is
set to ”optimizer adam”. keras-team (2021) provides insights into the implementation of the
Adam optimizer. Tunable hyper parameters specific to the Adam optimizer are the learning rate,
$\beta_1$, $\beta_2$ and epsilon. The learning rate controls the step size of the weight updates. $\beta_1$ and $\beta_2$
represent the exponential decay rates for the first and second moment estimates, respectively;
simply put, these hyper parameters control how much the optimizer ”remembers” its previous
moments. Epsilon is a small constant for numerical stability. In addition to these hyper parameters,
there are also hyper parameters of the LSTM model itself, such as the number of layers, the
number of units in each layer, the batch size and the number of epochs. Table 4 shows the
parameters that can be tuned in LSTM and their roles. The descriptions of the roles of the hyper
parameters have been obtained from the dedicated GitHub page of SciKit-Learn (2015). Further
details of the Adam optimizer can be found in the seminal paper of Kingma and Ba (2014).
Furthermore, the LSTM method needs some data pre-processing, i.e. setting ”lag”, ”delay”
and ”n” (next steps) to obtain the input sequences (X) and output sequences (Y). The ”lag”
sets the number of previous time steps used as input variables per sequence to predict the next
time period, ”delay” sets how far into the future the model predicts, and ”n” sets how many time
steps ahead the model predicts. The downside of the LSTM method is that, due to the need to
create sequences, the data becomes scarcer, as the time series data is combined into smaller
chunks. This causes an issue in our case: when the data is split into a training and test set, the
test set has fewer time steps from which to create sequences, and hence even fewer predictions are
generated from those test data sequences. This is why, when computing the accuracy measures,
the test data input is shortened such that its length matches the length of the predictions.
Regarding the hyper parameter tuning, we decided to keep the model simple, i.e. two layers with
50 units each, a dropout layer and 10 epochs, as training a complex model on little data risks
overfitting.
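The sketch below illustrates such a simple LSTM in the keras R interface; the arrays x_train
(shaped as samples x lag x 1), y_train and x_test, the lag length and the Adam settings shown
are hypothetical placeholders rather than the exact configuration used in this thesis.

# Minimal sketch of a two-layer LSTM with dropout and the Adam optimizer
# ('keras' R interface); all data objects and values are illustrative.
library(keras)

model <- keras_model_sequential() %>%
  layer_lstm(units = 50, return_sequences = TRUE,
             input_shape = c(lag, 1)) %>%    # 'lag' previous time steps per input sequence
  layer_lstm(units = 50) %>%
  layer_dropout(rate = 0.2) %>%              # dropout layer to limit overfitting
  layer_dense(units = 1)                     # one-step-ahead demand forecast

model %>% compile(
  loss      = "mse",
  optimizer = optimizer_adam(learning_rate = 0.001,  # tunable: learning rate, beta_1, beta_2, epsilon
                             beta_1 = 0.9, beta_2 = 0.999)
)

model %>% fit(x_train, y_train, epochs = 10, batch_size = 16)
preds <- model %>% predict(x_test)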
Multi-layer perceptron
Another ML method is the feed-forward neural network, which is based on the methodology
of Spiliotis et al. (2020) and can also be referred to as a Multi-Layer Perceptron (MLP). This
neural network consists of a single hidden layer. As Haan (2021) mentions, following Smyl (2020),
all the ML methods are trained the same way, namely using constant-size, rolling input and
output windows. Haan (2021) and Spiliotis et al. (2020) cite Zhang et al. (1998), who state
that, because ML algorithms use nonlinear activation functions, the data should be scaled to the
range of 0 to 1 before training. By scaling the data, not only does the learning speed improve,
but computational problems are also avoided. The data is linearly transformed to the interval
between 0 and 1 following
$$y' = \frac{y_t - y_{\min}}{y_{\max} - y_{\min}}.$$
The transformation is reversed after obtaining the forecasts, to recover the final prediction and
compute the forecasting accuracy. This method is included in this comparative study, as it is easy
to run, yet performs well in other papers. Furthermore, it allows us to observe the performance
of simple ML methods compared to statistical spare parts forecasting methods. To run this
method, the RSNNS package in R is used and the hyper parameters are tuned until the optimal
parameters are found for training the model. Table 4 shows the parameters that can be tuned in
MLP and their roles. The descriptions of the roles of the hyper parameters have been obtained
from the dedicated GitHub page of SciKit-Learn (2019b).
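As an illustration, the following minimal sketch combines the min-max scaling described above
with a single-hidden-layer network from the RSNNS package; train_x, train_y, test_x and the
hidden-layer size are hypothetical placeholders.

# Minimal sketch: scale the target to [0, 1], train an MLP with one hidden
# layer ('RSNNS' package) and reverse the transformation for the forecasts.
library(RSNNS)

y_min <- min(train_y); y_max <- max(train_y)
scale01   <- function(y) (y - y_min) / (y_max - y_min)   # y' = (y - y_min) / (y_max - y_min)
unscale01 <- function(y) y * (y_max - y_min) + y_min

fit <- mlp(x = as.matrix(train_x), y = scale01(train_y),
           size  = 10,      # units in the single hidden layer (illustrative)
           maxit = 500)     # training iterations (illustrative)

preds <- unscale01(predict(fit, as.matrix(test_x)))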
Random forest
The last ML method is based on the Random Forest algorithm proposed by Breiman (2001).
Random Forest combines multiple decision trees and averages their predictions (Biau & Scornet,
2016). Spiliotis et al. (2020) used RF in their comparative study and implemented it using the R
package randomForest by Liaw, Wiener, et al. (2002). We decided to include this method, as
Random Forest is easy to apply. Furthermore, Choi and Suh (2020) compare Random Forest in
their paper to Support Vector Regression, Linear Regression and a Neural Network, and Random
Forest yields the best results. The Random Forest algorithm allows several hyper parameters to
be tuned, which can be seen in Table 4. The descriptions of the roles of the hyper parameters have
been obtained from the dedicated GitHub page of SciKit-Learn (2019a) and from the paper of
Probst et al. (2019). Throughout the implementation of RF, several problems came to our
attention. One problem is that it is highly computationally intensive; this is why the method is
run on Google Colab in an R script. The model itself is built on the scorecardModelUtils and
randomForest packages by Arya Poddar (2019) and Liaw, Wiener, et al. (2002). We decided to
use these two packages for Random Forest, as the former allows hyper parameter tuning and the
latter is used to train the final model and generate the predictions.
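A minimal sketch of the final training and prediction step with the randomForest package is
given below; train_x, train_y, test_x and the shown values of ntree and mtry are hypothetical
placeholders rather than the tuned values used in this thesis.

# Minimal sketch of the Random Forest step ('randomForest' package);
# the data objects and hyper parameter values are illustrative only.
library(randomForest)

rf_fit <- randomForest(
  x     = as.matrix(train_x),   # lagged demand features
  y     = train_y,              # demand to be predicted
  ntree = 500,                  # number of trees in the forest
  mtry  = 3                     # number of features tried at each split
)

rf_preds <- predict(rf_fit, newdata = as.matrix(test_x))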
It is important to know that many hyper parameters can be tuned for ML methods. Not
every single tuneable hyper parameter is shown in Table 4, as this is beyond the scope of
this thesis. Furthermore, the package used also plays an important role, as some hyper parameters
cannot be tuned in some packages.
The ML methods are expected to perform well. However, the hyper parameter tuning
of those methods is an important part and the most difficult part of implementing ML
methods. By correctly tuning the model, the methods can be reproduced by others, which
allows standardization of the procedure. Another important point regarding ML methods is,
as already mentioned in Section 2.1.3, the lack of interpretability. The so-called black box
problem can occur for ML methods that use complex mathematical operations and data
transformations. In our case, the black box problem is mainly an issue for the MLP and LSTM
methods, as these are deep learning methods. Deep learning is a subset of ML that requires
more data and a longer training time. Although it requires less human intervention, as deep
learning methods learn on their own, they model complex non-linear relations that are difficult
to understand. LightGBM and RF, on the other hand, are easier to interpret, as they are
tree-based methods that can be visualized.
Table 4: This table summarizes the important hyper parameters of the used ML methods and their
roles.
3.1.3 Non-parametric methods
Willemain
Willemain et al. (2004)’s method differs from the statistical and ML forecasting methods
because it forecasts a whole distribution of demand over a fixed lead time. Using a bootstrap
procedure, it forecasts the cumulative distribution of demand over a fixed lead time. Willemain’s
method can be summarised in seven steps:
1. Estimate the transition probabilities of a two-state Markov model from historical demand.
2. Use the Markov model to generate zero and nonzero sequences over the forecast horizon,
conditional on the last observed demand.
3. Replace each nonzero demand with a numerical value drawn at random, with replacement,
from the set of observed nonzero demands.
4. Jitter the nonzero demand values. Jittering means picking a different value located close to
the selected value, which allows a more natural variation of the demand size. (Example:
instead of using the randomly chosen non-zero demand of 7, a close-by value such as 6, 8, 9
or 10 is used.) The maximum value is the previous value plus the jittering value.
5. Sum the predicted values over the forecast horizon to get one single predicted value of
lead-time demand (LTD).
6. Repeat steps 2-5 many times to obtain many LTD values.
7. Sort the LTD values obtained in step 6, such that a distribution of LTD is obtained.
The lead time for Willemain’s method is set to 1 instead of 0, because a lead time of zero would
mean that only the current period is being forecasted. This is because Willemain’s bootstrapping
method forecasts a cumulative distribution of the demand over a certain lead time; if the lead
time is 0, there is no delay between the decision to replenish and when the stock is available, so
there is no delay to account for and the method would only forecast the demand for the current
period. As Haan (2021) mentions that Willemain et al. (2004) successfully show that their method
outperforms SES and the methods based on SES, such as Croston’s method, Willemain’s
bootstrapping method is also included. However, Willemain’s method has one critique point:
when sampling for one single period ahead, the Markov chain is reduced, as it can only be in one
of the two states, zero demand or non-zero demand. To run the method, the code from Nguyen
(2023) is used.
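For illustration, the following is a strongly simplified sketch of the seven bootstrap steps above;
it is not the code of Nguyen (2023) used in this thesis, the jittering is reduced to adding Gaussian
noise, and the demand series, lead time and jitter settings are hypothetical.

# Simplified sketch of Willemain's bootstrap for lead-time demand (LTD);
# illustrative only, with a crude jittering step.
willemain_ltd <- function(demand, L = 3, B = 1000, jitter_sd = 1) {
  occ  <- as.integer(demand > 0)
  prev <- occ[-length(occ)]
  nxt  <- occ[-1]
  # Step 1: transition probabilities of the two-state Markov chain
  p01 <- mean(nxt[prev == 0] == 1)   # P(zero -> non-zero)
  p11 <- mean(nxt[prev == 1] == 1)   # P(non-zero -> non-zero)
  nonzero <- demand[demand > 0]
  state0  <- occ[length(occ)]

  ltd <- replicate(B, {
    # Step 2: simulate a zero/non-zero sequence over the lead time,
    # conditional on the last observed state
    state   <- state0
    seq_occ <- integer(L)
    for (t in 1:L) {
      p          <- if (state == 1) p11 else p01
      state      <- rbinom(1, 1, p)
      seq_occ[t] <- state
    }
    # Steps 3-4: sample non-zero sizes with replacement and jitter them
    sizes <- sample(nonzero, L, replace = TRUE)
    sizes <- pmax(1, round(sizes + rnorm(L, 0, jitter_sd)))
    # Step 5: sum over the horizon to get one LTD value
    sum(seq_occ * sizes)
  })
  # Steps 6-7: the sorted replications form the bootstrapped LTD distribution
  sort(ltd)
}

ltd_dist <- willemain_ltd(demand = c(0, 0, 3, 0, 5, 0, 0, 2, 0, 4), L = 3)
quantile(ltd_dist, 0.95)   # e.g. an LTD level covering 95% of the bootstrapped demand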
Quantile regression
The last method, quantile regression, is also categorised as a distribution-focused method, as the
quantile function is the inverse of the distribution function (Taylor, 2007). Furthermore, quantile
regression estimates the conditional quantile function as a linear combination of the predictors
and does not make assumptions about the distribution of the target variable. Koenker and
Hallock (2001) explain that quantile regression is suited for cases where the conditions of linear
regression are not met (i.e. linearity, homoscedasticity, independence and normality). In the
spare parts forecasting domain, quantile regression allows a specific quantile to be forecast:
taking the 25th quantile, for example, there is a 25% chance that the actual demand for a spare
part is below the forecast and a 75% chance that the demand is above. The quantile regression
model is given by
$$Q_\tau(y_i) = \beta_0(\tau) + \beta_1(\tau)x_{i1} + \dots + \beta_p(\tau)x_{ip}, \quad i = 1, \dots, n, \quad \tau \in (0, 1),$$
where $Q_\tau(y_i)$ represents the $\tau$-th quantile of the dependent variable $y$ for the $i$-th observation,
$\beta_0(\tau), \beta_1(\tau), \dots, \beta_p(\tau)$ are the quantile-specific coefficients of the intercept and the
independent variables at the $\tau$-th quantile, $x_{i1}, \dots, x_{ip}$ are the independent variables for
observation $i$, and $\tau$ represents the quantile level of interest. For every desired quantile, in our
case from 50% to 99%, we fit a quantile regression for every period to predict the next demand
$y$ based on the previous predictions.
As we focus on the upper quantiles (i.e. from 50% to 99%), the values need to
be converted into percentages. For this, no extra package is needed. In fact, after establishing
the quantile regression model with the existing rq function in R, predictions for the desired
quantiles can be made through the predict function by setting the ’level’ argument to a vector
of desired values. The rq function takes as input the formula, the data and the $\tau$ (quantile)
levels of interest. In our case, this gives general insights into the overall performance of the model
across the pre-determined range of quantiles. Quantile regression is included in this comparative
study, as there are not many papers that use this method for spare parts forecasting (Syntetos et
al., 2012).
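As an illustration of this procedure, the sketch below fits quantile regressions for the upper
quantiles with the rq function (available in the quantreg package) and predicts the next period;
the lagged-demand data frame train_df and its columns are hypothetical placeholders.

# Minimal sketch of quantile regression forecasts over the 50th to 99th quantiles;
# train_df with columns 'demand' and 'lag1' is a hypothetical placeholder.
library(quantreg)

taus <- seq(0.50, 0.99, by = 0.01)               # quantile levels of interest

fit <- rq(demand ~ lag1, tau = taus, data = train_df)

# One forecast per quantile for the next period, based on the last observed demand
next_period <- data.frame(lag1 = tail(train_df$demand, 1))
q_forecasts <- predict(fit, newdata = next_period)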
After training a model, the model needs to be tested. Therefore, accuracy measures are
required that allow us to compare the predictions with the actual values. For this, the most
widely used accuracy measures are used (Pinçe et al., 2021). As Haan (2021) describes, following
Pinçe et al. (2021), the most commonly used accuracy measures are absolute accuracy measures.
In this comparative study, the Mean Absolute Scaled Error (MASE), proposed by Hyndman and
Koehler (2006), is one of the absolute accuracy measures. MASE is quite important, as it allows
a scale-free measurement across all time series of different items (Pinçe et al., 2021). The other
absolute accuracy measure is the Mean Squared Error (MSE). These accuracy measures
are defined respectively as follows:
n n ′
1X 1
t=1 |Yt − Yt |
P
M SE = (Yt′ − Yt )2 , M ASE = n
n t=1 ( n11−1 ) ni=2 |Yi − Yi−1 |
P 1
The third accuracy measure is the Root Mean Squared Scaled Error (RMSSE), which has
been elaborated by Hyndman and Koehler (2006) and used in many papers, such as Haan (2021),
Spiliotis et al. (2020), and Theodorou et al. (2023). Like MASE, RMSSE allows a scale-free
measurement across all time series of different items. It is defined as follows:
$$\mathrm{RMSSE} = \sqrt{\frac{\frac{1}{h}\sum_{t=n+1}^{n+h}(y_t - \hat{y}_t)^2}{\frac{1}{n-1}\sum_{t=2}^{n}(y_t - y_{t-1})^2}}$$
Theodorou et al. (2023) describe, based on Kolassa (2016), that squared error measures, such as
RMSSE, are suitable when it comes to estimating the average demand for intermittent data.
Last but not least, the less used Geometric Mean Absolute Error (GMAE) is the fourth
applied absolute accuracy measure in this thesis. GMAE is used in the paper of Pennings et al.
(2017) next to MASE. Pennings et al. (2017) claim that GMAE and MASE are two recently
proposed and widely used metrics. The former does not scale the errors, whereas the latter does.
As we are also using the DLP method, we want to be able to compare the performance of the
method with Pennings et al. (2017)’s results. GMAE is defined as follows:
$$\mathrm{GMAE} = \left(\prod_{s=1}^{t}\left|e_s\right|\right)^{\frac{1}{t}}$$
Here, $e_s$ is the prediction error between the actual value (demand in our case) and the predicted value, and $t$ is the total number of observations. To obtain the GMAE, the absolute errors of all observations are multiplied with each other and the $t$-th root of this product is taken. However, the GMAE is not well-suited for data containing many zeros, which applies to both the prediction data set and the test data set. In the context of GMAE, if the predicted value is zero and the actual value is also zero, the absolute error for that prediction is zero. Since GMAE is the geometric mean of these absolute errors, a single zero absolute error makes the GMAE equal to zero. Table 5 below provides a small example of this sensitivity of GMAE to zero values.
Predicted   Actual   Absolute error
4           3        |3 − 4| = 1
0           1        |1 − 0| = 1
0           0        |0 − 0| = 0
Geometric mean: (1 · 1 · 0)^{1/3} = 0
Table 5: Example of the sensitivity of GMAE to zero absolute errors (t = 3, the total number of observations).
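The zero sensitivity illustrated in Table 5 can be reproduced with a short helper; the vectors below are the hypothetical values from the table, not data from this study:

# Minimal sketch of GMAE, reproducing the zero-sensitivity example of Table 5.
gmae <- function(actual, pred) {
  abs_err <- abs(actual - pred)
  prod(abs_err)^(1 / length(abs_err))          # geometric mean of the absolute errors
}

gmae(actual = c(3, 1, 0), pred = c(4, 0, 0))   # (1 * 1 * 0)^(1/3) = 0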
Now, to be able to compare the accuracy of the different methods on different data sets, a relative measure is needed. Pinçe et al. (2021) use the Percentage Better (PB) and Percentage Best (PBt) and explain that both "rank the performance of different methods based on the percentage of time they perform better or best according to an underlying measure." PB and PBt are relative accuracy measures. Given that Haan (2021) uses PB, we also use PB, as this allows us to compare our findings to theirs on the same data sets.
As already mentioned in Section 2.3 Comparative studies, two types of performance measures are used. Next to the forecasting accuracy measures, inventory performance measures are also important, as high forecasting accuracy does not necessarily imply high inventory performance for spare parts. While most forecasting methods estimate the mean, inventory control measures require an assumption about the demand distribution. In our case, we rely on Nguyen (2023)'s findings, which show that a gamma distribution performs better than a normal distribution, although this only holds when the mean is not too small compared to the variance. Moreover, a company prefers to have too much stock rather than too little, as it can then at least minimize downtime; the loss function is therefore considered to be asymmetrical. Pinçe et al. (2021) present a distribution plot (Figure 5 on page 13 of their paper) showing that the two most commonly used inventory performance measures are the service level and the trade-off curve. In our paper, as in Haan (2021)'s paper, the trade-off curves show the trade-off between the achieved fill rate (AFR) and the holding costs.
Before determining the AFR, an inventory policy needs to be set (Haan, 2021). In this paper, the approach of Van Wingerden et al. (2014), which is also followed by Haan (2021), is adopted. To this end, a base stock level R is determined by evaluating previous demand. Each period, the Inventory Position (IP) is updated and back ordering is allowed. The IP is defined as the stock on hand plus the outstanding replenishment orders minus the backorders. Van Wingerden et al. (2014) state that if the IP drops to the stock level R or below, new stock is ordered. Although a minimum order quantity can be specified, we follow Haan (2021) and do not include one, for simplicity. Furthermore, a zero lead time is assumed, meaning that a replenishment order arrives immediately after it is placed.
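A minimal sketch of such a periodic-review simulation is given below; it assumes that demand holds one item's demand series and R its base-stock level, and the exact cost accounting of Van Wingerden et al. (2014) may differ in detail:

# Minimal sketch: base-stock policy, zero lead time, backordering allowed.
simulate_base_stock <- function(demand, R, unit_holding_cost = 1) {
  ip <- R; filled <- 0; holding <- 0
  for (d in demand) {
    filled  <- filled + min(max(ip, 0), d)     # demand served directly from stock
    ip      <- ip - d                          # unmet demand is backordered
    holding <- holding + max(ip, 0) * unit_holding_cost
    if (ip <= R) ip <- R                       # order up to R, delivered immediately
  }
  c(afr = if (sum(demand) > 0) filled / sum(demand) else NA,
    holding_cost = holding)
}

For every target fill rate, the base stock level can then be increased until the achieved fill rate reaches the target, which traces out the trade-off curve between AFR and holding costs.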
By using the same inventory control measures as Haan (2021), the inventory performance of the same data sets can be compared across different comparative studies and methods. As the trade-off curves in this paper visualize the trade-off between the AFR and the holding costs, a target fill rate (TFR) needs to be set. The fill rate targets used in this paper are 75%, 80%, 85%, 90%, 95%, 99% and 99.9999%, the same targets as used by Haan (2021).
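As an illustration of how a base-stock level reaching a given target fill rate could be derived under the gamma demand assumption mentioned above, the sketch below moment-matches a gamma distribution to an item's demand history; this is one possible approach, not necessarily the exact procedure followed in this thesis:

# One possible way to find the smallest base-stock level S that reaches a
# target fill rate under a gamma demand assumption (illustrative, not the
# thesis' exact procedure).
base_stock_gamma <- function(mu, sigma2, tfr) {
  shape <- mu^2 / sigma2
  rate  <- mu / sigma2
  expected_shortage <- function(S)             # E[(D - S)^+] for D ~ Gamma(shape, rate)
    integrate(function(d) (d - S) * dgamma(d, shape, rate = rate), S, Inf)$value
  S <- 0
  while (1 - expected_shortage(S) / mu < tfr) S <- S + 1
  S
}

base_stock_gamma(mu = 2, sigma2 = 8, tfr = 0.95)   # illustrative values only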
Before training a model, the data is split into a training set and a test set. For this, we apply the same training procedure as Haan (2021), which is shown in Nguyen (2023). The data is split 70%/30%: 70% of the observations are used for training the model and the remaining 30% for testing how accurate the model is. The training is done on a single-SKU basis.
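A minimal sketch of this chronological split for a single SKU, using an illustrative simulated demand series, is shown below:

# Minimal sketch of the chronological 70/30 split applied per SKU.
split_series <- function(y, train_frac = 0.70) {
  n_train <- floor(train_frac * length(y))
  list(train = y[seq_len(n_train)],
       test  = y[(n_train + 1):length(y)])
}

parts <- split_series(rpois(150, lambda = 0.3))   # e.g. 150 weekly observations
lengths(parts)                                    # 105 training and 45 test periods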
The industrial data sets first need to be classified into one of four categories: Erratic, Lumpy, Smooth and Intermittent. The classification follows the classification scheme of Boylan et al. (2008), which is based on Syntetos and Boylan (2005). Boylan et al. (2008) suggest that the classification is done based on two criteria: the mean inter-demand interval $p$ and $CV^2$, the squared coefficient of variation of the demand sizes. The mean inter-demand interval for every item is calculated as
$$p = \frac{\text{total number of time periods}}{\text{number of periods with non-zero demand}},$$
and $CV^2$ as the squared ratio of the standard deviation to the mean of the non-zero demand sizes.
Figure 2: Demand-based categorization for forecasting by Boylan et al. (2008).
4 Data
The data sets for this paper are divided into four industrial data sets and four simulated data sets. The reason for using both is that the simulated data sets allow us to control the environment: by having, for example, a data set that is completely intermittent, the impact of the demand class on a method can be isolated, which helps answer the main research question of this study. Including industrial data allows us to observe how the methods perform in practice. All data sets can be found on the GitHub page of Nguyen (2023), who also cleaned the data and removed the outliers. An important aspect of the data sets is whether they include lead time or not (Haneveld & Teunter, 1997). If lead time is not included in a data set, a lead time of 0 is assumed, meaning that the spare parts are immediately available and no waiting time is required until a spare part is delivered. The data sets are not continuous in time; they have discrete timestamps.
A table summarizing the description of the data sets can be found below.
4.1 Industrial data sets
The first data set includes sales of 1392 items (3451 before cleaning) of a Dutch manufacturing company and will be named "MAN". Cleaning consists of dropping items that do not have more than one demand occurrence in the train and test set, which is required for producing a forecast. The data were collected from the first week of 2012 until the 16th of November 2014 (150 weeks). The data set includes variables such as prices, inventory costs, the lead time, demand frequency and demand size, the minimum order quantity, the fixed order costs and the (weekly) demand dates.
The second industrial data set is gathered over seven years (1996-2002) and contains in-
formation about the demand of 5000 aircraft spare parts of the British Royal Air Force. The
variables of this data set are nearly identical to the ones for the first data set, except for the
inventory costs, which are not included in the ”BRAF” data.
The third data set, "OIL", comes from the oil industry. It contains data about 7644 (14523 before cleaning) spare parts of an oil refinery over a period of 56 months (January 1997 to August 2001) and includes the prices and lead times.
Last but not least, a data set from the automotive industry ("AUTO") is included. It contains sales of 3000 items over 2 years. The included variables are largely identical to those of the previous data sets, except that it contains no price or lead time information. This is why the prices provided for "AUTO" in Table 6 have been calculated by examining the relationship between pricing and monthly order frequency in the other data sets. Haan (2021) provides the formula for the ratio RPS (Ratio Price Sales), which allows this relationship to be examined, as well as the way to calculate the other price statistics of the AUTO data set.
The RPS of the AUTO data set is set as the average of the RPS of the other data sets, which
can be found in Table 7. The average product price is obtained by multiplying the RPS by the
monthly sales. Haan (2021) and Nguyen (2023) also calculate the RMS (Ratio Monthly Sales),
which is used to obtain the individual item price. The RMS is calculated as follows:
As Haan (2021) correctly points out, the mean product price of the AUTO data set is very high compared to the other data sets. This is because the average product price is driven by the higher-frequency portions of the data set in combination with the relatively low average monthly item sales. A small, negative correlation is observed between the average monthly demand (item sales) and the average product price for the MAN, BRAF and OIL data sets. The AUTO data set shows a stronger negative correlation, because its prices have been calculated by respecting the ratios (RPS & RMS) of the other data sets. All correlations are significant at the 5% level.
Data set              Nr. sales    Duration    Prices  Inventory costs  Lead time  Demand freq. & size  Min. order quantity  Fixed order costs
MAN                   1392 items   150 weeks   Yes     Yes              Yes        Yes                  Yes                  Yes
BRAF                  5000 items   7 years     Yes     No               Yes        Yes                  Yes                  Yes
Automotive industry   3000 items   2 years     No      Yes              No         Yes                  Yes                  Yes
OIL                   7644 items   56 months   Yes     No               Yes        No                   No                   Yes
Table 6: This table summarizes the description of the industrial data sets.
Table 7: Descriptive statistics for the MAN, BRAF, AUTO and OIL data sets
4.2 Simulated data sets
Simulated data sets make it possible to replicate a specific demand behaviour. In our case, every simulated data set replicates one of the four demand categories (Erratic, Lumpy, Smooth, Intermittent). By having a clearly dominating class of items, it is easier to conclude which method performs best for which type of data set, and the environment and its impact on performance can be controlled. The four simulated data sets are generated in R through the R package 'tsintermittent' by Kourentzes (2014), following Haan (2021)'s procedure. The required input arguments of the generator are discussed below.
To resemble the industrial data sets, Haan (2021) sets the number of observations per time series to 60 months and the number of time series to 6500 items. To be able to replicate every
data category, the squared coefficient of variation, CV 2 , and the mean inter-demand interval
of non-zero demand, p, need to be chosen for every data set. This is done according to the
cut-off values set by Boylan et al. (2008) in Section 3.4. Table 8 shows the settings for the four
simulated data sets. As in Haan (2021), we observe that the average monthly demand decreases when p increases. The average product price is determined by using the average RPS of 212.633
by following the process in Section 4.1. A negative, significant correlation between monthly
average demand and average product price is observed in all simulated data sets. This means that high-demand items are also the cheaper items.
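A minimal sketch of the generation step is shown below; it assumes the simID() generator of the tsintermittent package with arguments for the number of series, observations per series, average inter-demand interval and squared coefficient of variation (argument names may differ between package versions):

# Minimal sketch of generating one simulated data set with 'tsintermittent'.
library(tsintermittent)

set.seed(1)
# Settings corresponding to SIM2 (mostly lumpy demand): p = 1.50, CV^2 = 0.80
sim2 <- simID(n = 6500, obs = 60, idi = 1.50, cv2 = 0.80)
str(sim2, list.len = 3)   # inspect the generated series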
Furthermore, it is noticeable that the negative correlation coefficient is much stronger for the simulated data sets (see Table 8) than for the industrial data sets (see Table 7). The industrial data sets contain far more zero-demand periods than the simulated data sets, which can also be seen when comparing the mean inter-demand interval p in Table 9. In other words, the intermittency effect is much stronger in the industrial data sets, as there are nearly no zero demand occurrences in the simulated data sets. This raises the question whether the simulated data sets really replicate the behaviour of the industrial data sets, as the simulated data sets also do not take the price into account as an input during the simulation process.
4.3 Classification of the data sets
As previously mentioned, the industrial and simulated data sets need to be classified. For this,
the classification scheme of Boylan et al. (2008) is used. The scheme provides the important
cut-off values of p = 1.32 and CV 2 = 0.49. The formulas that are used to calculate CV 2 and p
for every individual item can be found in Section 3.4.
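A minimal sketch of this classification for a single item, using the cut-off values above, could look as follows (the example series is purely illustrative):

# Minimal sketch of the Boylan et al. (2008) classification for one item.
classify_demand <- function(x, p_cut = 1.32, cv2_cut = 0.49) {
  nz  <- x[x > 0]
  p   <- length(x) / length(nz)           # mean inter-demand interval
  cv2 <- (sd(nz) / mean(nz))^2            # squared CV of the non-zero demand sizes
  if (cv2 > cv2_cut) {
    if (p > p_cut) "Lumpy" else "Erratic"
  } else {
    if (p > p_cut) "Intermittent" else "Smooth"
  }
}

classify_demand(c(0, 3, 0, 0, 12, 0, 0, 0, 1, 0))   # p = 3.33, CV^2 = 1.21 -> Lumpy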
The results of the classification by Nguyen (2023) can be seen in Table 9. We observe that the inter-demand interval p of the industrial data sets is much higher than for the simulated data sets, except for the AUTO data set. The AUTO data set is the only data set that cannot be classified as a single category, as it contains items of every demand type; the majority, however, are classified as smooth and intermittent. Its low inter-demand interval can be explained by the high number of smooth items, which have frequent demand with low demand size variability (Boylan et al., 2008). The same is observed for SIM3, where nearly all items are classified as smooth; SIM3 also shows the lowest inter-demand interval. Regarding the simulated data sets, we observe that they have been classified correctly, although they do not consist purely of one type of demand: SIM1 has mostly erratic items, SIM2 mostly lumpy items and SIM4 mostly intermittent items. With this classification, we will be able to answer which methods perform best on which kind of demand and, respectively, on which data set.
Data   CV^2   p      Erratic items  Lumpy items  Smooth items  Intermittent items  Total items
MAN    0.92   16.41  23             806          1             562                 1392
BRAF   0.63   11.14  0              2095         0             2905                5000
AUTO   0.41   1.32   378            307          1241          1074                3000
OIL    0.18   14.52  0              814          0             6830                7644
SIM1   0.75   1.00   6198           0            302           0                   6500
SIM2   0.80   1.50   410            5614         25            451                 6500
SIM3   0.30   1.05   36             0            6464          0                   6500
SIM4   0.25   1.45   1              7            786           5706                6500
Table 9: Classification of the industrial and simulated data sets.
5 Results
In this section, we first compare the performance of each method for the different data sets and forecasting accuracy measures. Then, in a second stage, the inventory performance is analysed and compared to the forecasting accuracy results.
Method         Measure   SIM1     SIM2     SIM3     SIM4     MAN         BRAF      AUTO     OIL      # Best
Croston        MSE       79.202   79.138   34.730   40.519   12940.104   199.690   86.344   138.605  1
               MASE      0.673    1.027    0.487    0.780    2.499       2.080     0.788    1.998    0
               RMSSE     2.722    3.346    1.878    2.412    5.329       3.300     1.721    1.666    0
               GMAE      4.605    4.725    3.138    3.793    6.481       1.669     2.422    0.852    0
SBA            MSE       78.623   78.834   34.620   40.435   12921.667   199.807   83.089   132.321  3
               MASE      0.664    1.012    0.484    0.778    2.439       2.001     0.777    1.849    0
               RMSSE     2.712    3.337    1.874    2.409    5.304*      3.289     1.710    1.635    2.5*
               GMAE      4.503    4.592    3.102    3.792    6.236       1.534     2.362    0.781    3
DLP            MSE       89.679   85.074   44.229   45.850   13002.323   203.685   112.859  148.810  0
               MASE      0.748    1.104    0.559    0.824    2.589       2.218     0.938    2.212    0
               RMSSE     2.918    3.488    2.126    2.562    5.372       3.362     2.031    1.740    0
               GMAE      5.305    5.227    3.671    3.970    6.903       1.872     3.036    0.987    0
MLP            MSE       78.087   77.406   34.842   39.718   13227.733   201.303   82.672   153.535  3
               MASE      0.679    1.027    0.493    0.776    3.121       2.301     0.822    2.066    1
               RMSSE     2.708    3.316    1.881    2.388    5.494       3.355     1.736    1.674    3
               GMAE      4.729    4.763    3.209    3.783    6.214       2.087     2.686    0.988    1
LSTM           MSE       87.170   86.052   70.132   53.216   865142.462  929.221   108.998  137.520  0
               MASE      0.661    1.121    0.698    0.856    19.087      5.816     0.866    1.730    0
               RMSSE     2.791    3.436    2.651    2.720    30.689      6.064     1.820    1.465    1
               GMAE      4.174    5.459    4.591    4.330    112.417     9.369     2.648    1.067    1
LightGBM       MSE       81.914   81.795   35.957   41.700   13584.090   202.644   95.428   156.013  0
               MASE      0.692    1.053    0.498    0.790    2.974       2.336     0.857    2.077    0
               RMSSE     2.772    3.410    1.909    2.454    5.602       3.375     1.866    1.686    0
               GMAE      4.727    4.756    3.209    3.841    5.958       1.897     2.612    0.926    0
RF             MSE       81.042   82.816   35.965   42.761   13385.695   201.351   88.078   154.925  0
               MASE      0.699    1.073    0.501    0.800    2.942       2.332     0.838    2.072    0
               RMSSE     2.761    3.436    1.911    2.476    5.544       3.372     1.790    1.683    0
               GMAE      4.893    4.922    3.247    3.898    6.064       1.950     2.578    0.940    0
Willemain      MSE       77.928   78.606   34.880   40.755   13111.759   199.775   84.182   132.939  1
               MASE      0.690    1.045    0.497    0.783    2.594       2.344     0.899    2.319    0
               RMSSE     2.714    3.348    1.886    2.420    5.304       3.365     1.863    1.726    0.5*
               GMAE      4.909    4.904    3.273    3.786    4.392       1.082     2.648    0.613    3
Quantile reg.  MSE       83.241   87.917   34.757   40.800   14104.492   202.800   88.210   139.344  0
               MASE      0.638    0.956    0.477    0.779    1.586       1.167     0.746    0.956    7
               RMSSE     2.779    3.512    1.877    2.421    5.371       3.253     1.724    1.487    0
               GMAE**    1.715    1.698    0.954    1.689    0.479       0.000     1.366    0.000    8**
Results rounded to three decimals. The best accuracy is highlighted for each data set and measure.
* 0.5, because only 50% is accounted to the method, as the place is shared with another method.
** The GMAE results of the QR method need to be analyzed with caution.
Table 10: Forecasting accuracy of all the methods on each data set.
The values of the forecasting accuracy measures of all the methods applied to the data sets can be found in Table 10, with the best score highlighted for each data set and measure. The column # Best shows how many times a method is the best performer for the given accuracy measure across all the data sets. The GMAE row of the Quantile regression method is marked with **, as these results need to be analyzed with caution.
When blindly comparing all the results with each other, the quantile regression method is the best performer by far. It outperforms every other method in terms of GMAE and MASE, except for the MASE metric on the SIM4 data set, where MLP is superior. However, the GMAE results of the quantile regression method show zero values twice, which is due to the sensitivity of GMAE towards zero absolute errors: as described in Subsection 3.3 and illustrated in Table 5, GMAE equals zero as soon as a single observation has an absolute error of zero. As can be seen in Table 10, quantile regression shows a GMAE of zero for the BRAF and OIL data sets, and for the other data sets its GMAE is also low compared to the other methods. This is caused by the predictions of the quantile regression method, which predicts many zeros when in fact there is some demand. Because of these particular GMAE results, we decided not to include the GMAE performance of the QR method when calculating the Percentage Better score.
Excluding QR's GMAE results for the reason given above, quantile regression remains the best performer for the remaining accuracy measures across all the data sets, outperforming every other method in terms of MASE except on the SIM4 data set, where MLP is superior. MLP shows the best performance on the simulated data sets, whereas SBA outperforms the other methods in most instances on the industrial data sets; part of this apparent superiority is due to the exclusion of QR's GMAE results from the comparison.
The second-best performing method is SBA. It performs well on both data set types, especially in terms of MSE and GMAE. MLP ranks third and is strong in terms of MSE and RMSSE. Looking at the data sets, MLP outperforms all the other methods on the SIM4 data set in terms of MSE, MASE, RMSSE and GMAE, and in total it outperforms every other method in 8 instances.
Following MLP, in fourth place, Willemain can be found. Willemain outperforms the other methods in 4.5* instances, e.g. MSE for SIM1, and it shares first place with SBA for the RMSSE on MAN. Furthermore, in terms of GMAE, it is superior for the MAN, BRAF and OIL data sets. In fifth place is LSTM, showing superiority in two instances, followed by Croston (superiority in one instance). It is noteworthy that LSTM has not been extensively tuned in order to avoid overfitting: as the data become scarce when creating the sequences, there is less data available for tuning and testing, so the model is kept simple.
The only methods that never outperform the other methods in any instance are DLP, LightGBM and RF.
The Percentage Better score is computed by dividing the number of times a method outperforms every other method by the total number of instances in which the method is evaluated. For example, Croston is the best performer once; dividing 1 by 32 (the number of instances) and multiplying by 100 gives 3.125%. The same calculation is done for every method. Quantile regression is superior to the other methods in 29.167% of the comparisons (computed over MSE, MASE and RMSSE only), followed by SBA with 26.563% and MLP with 25%.
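The scores quoted above can be reproduced directly from the # Best column of Table 10:

# Percentage Better from the '# Best' counts of Table 10 (QR is evaluated over
# 24 instances because its GMAE results are excluded).
best <- c(Croston = 1, SBA = 8.5, DLP = 0, MLP = 8, LSTM = 2,
          LightGBM = 0, RF = 0, Willemain = 4.5, QR = 7)
instances <- c(rep(32, 8), 24)
round(100 * best / instances, 3)
# e.g. Croston 3.125%, MLP 25%, QR 29.167%, SBA ~26.56%, Willemain ~14.06%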
Another observation made while running the methods is that the ML methods take much longer to run, especially RF and LightGBM, even though they do not perform well in terms of forecasting accuracy. Another computationally expensive method is Willemain's method, which is due to the bootstrapping. The run-times of the methods have not been formally measured in RStudio, but the differences in duration were clearly noticeable.
Figure 3a and Figure 3b show the trade-off curves between the achieved fill rate (AFR) and
the inventory holding costs for the SIM1 data set. Plot (a) shows the average achieved fill rate,
whereas plot (b) shows the total achieved fill rate. The SIM1 data set consists mostly of erratic
items and a small part of smooth items. When comparing both plots, the trade-off curves show
a similar pattern. In Plot (a), the quantile regression (QR) method shows higher inventory
holding costs for the same average AFR as the other methods. The other methods are bundled
together and behave similarly. When looking at the total AFR vs inventory holding costs trade-
off curves, LSTM and QR stand out from the other methods, as their inventory holding costs
are higher for the same total AFR. In fact, up to 83% total AFR, they behave similarly to the
other methods. From 83% to 97%, they decouple. In the Appendix 7.1, the table with all the
values for the average AFR, total AFR and the inventory holding costs is provided.
Figure 3: Trade-off curves for the inventory control measures on SIM1. (a) Average achieved fill rate vs inventory holding costs; (b) Total achieved fill rate vs inventory holding costs.
The inventory performance results of the SIM2 data set, which consists mostly of lumpy
items, are shown in Figure 4. Both plots, 4a and 4b show that Quantile regression has higher
costs for the same AFR compared to the other methods. This is the case for the average AFR
and total AFR. One method that stands out at the higher fill rates is LSTM. LSTM performs slightly worse at 75% AFR (total and average) than the other methods except QR, but decouples from the bundle (from 90% average AFR onwards and from 86% total AFR onwards) and outperforms all the other methods in terms of costs and AFR.
Figure 4: Trade-off curves for the inventory control measures on SIM2. (a) Average achieved fill rate vs inventory holding costs; (b) Total achieved fill rate vs inventory holding costs.
The results of the SIM3 data set, which is dominated by smooth demand, show that all
the methods display a similar behaviour between AFR and Inventory holding costs, except for
LSTM. LSTM consistently achieves the same fill rates (total and average) for a higher cost
compared to the other methods.
The trade-off curves for the SIM4 data set can be found in the Appendix 7.1, as they do not
provide more insights. Their behaviour is similar to the curves for SIM3.
Figure 5: Trade-off curves for the inventory control measures on SIM3. (a) Average achieved fill rate vs inventory holding costs; (b) Total achieved fill rate vs inventory holding costs.
The MAN data set is characterized by lumpy and intermittent items. Plot 6a shows that Willemain has the lowest costs for the same AFR as the other methods and, in general, also achieves higher fill rates than the other methods. LSTM shows the highest costs in the range of 71% to 77% average AFR. However, this is not the case for the total AFR: in plot 6b, Willemain is outperformed by all the other methods, and especially the ML methods perform very well. QR has been removed from this plot because its low performance in terms of AFR was distorting the plot; the plot including QR can be found in the Appendix 7.1, Figure 11.
Figure 6: Trade-off curves for the inventory control measures on MAN. (a) Average achieved fill rate vs inventory holding costs; (b) Total achieved fill rate vs inventory holding costs.
The BRAF data set is characterized by lumpy and intermittent items. The average AFR plot 7a and the total AFR plot 7b differ. Plot 7a shows the highest costs for LSTM in the range of 70% to 85% average AFR; LSTM does not achieve higher average fill rates than that. Another method that stands out is Willemain, which provides the highest average AFR. Willemain starts to stand out from the other methods from an average AFR of 93% onwards, where its costs increase sharply. The remaining methods do not stand out, except for MLP, which in the range of 83% to 87% average AFR provides the same average fill rate for lower costs. When comparing the total AFR to the inventory holding costs, LSTM stands out for having the lowest costs; however, its total AFR caps at 80%, whereas all the other methods achieve higher fill rates. Willemain in particular achieves the highest total fill rate (85%), but at much higher costs.
In both plots, QR has been removed as it does not provide any insight: it does not show a curve at all, since the average AFR, total AFR and inventory holding costs are all zero. This is due to the predictions of the QR method for the BRAF data set and the corresponding test data (actual demand). The prediction data frame consists of only zeros, which has a direct impact on the fill rate and the holding costs: the fill rate is calculated as the total supply divided by the total demand, and the holding costs are proportional to the inventory level, which is zero when nothing is stocked. Another contributing factor is that the test data contain mostly zero demand, so the achieved fill rates would also be zero since there is hardly any demand to fulfil, and by the same logic the inventory holding costs would be zero since there is no need to hold inventory. The same is observed for the OIL data set; both the OIL and BRAF data sets consist of lumpy and intermittent items only, with many zero demand values. Table 17 in Appendix 7.1 summarizes the behaviour of the QR method for the BRAF data set.
Figure 7: Trade-off curves for the inventory control measures on BRAF. (a) Average achieved fill rate vs inventory holding costs; (b) Total achieved fill rate vs inventory holding costs.
The AUTO data set contains items of all four categories, with the majority being smooth and intermittent items, as shown in Table 9. Willemain achieves the highest fill rates for both the average AFR and the total AFR, but for those higher fill rates the costs are also higher; the other methods fail to reach the same fill rates. Looking at figures 8a and 8b separately, in the former all methods except LSTM are clumped together up to an average AFR of 96%, and above that only Willemain succeeds. Furthermore, LSTM has higher inventory holding costs but also a lower average AFR (only up to 90%). For the total achieved fill rate, Willemain is outperformed by every other method in terms of inventory holding costs, although it achieves slightly higher total fill rates (1.7% higher) than the other methods.
Figure 8: Trade-off curves for the inventory control measures on AUTO. (a) Average achieved fill rate vs inventory holding costs; (b) Total achieved fill rate vs inventory holding costs.
Last, but not least, the inventory performance results of the OIL data set are shown. The OIL data set is characterized by lumpy and intermittent items. Two aspects stand out in Figure 9. The first is the performance of LSTM in Figure 9a: LSTM performs poorly in terms of average AFR compared to the other methods for the same inventory holding costs, except when compared to Willemain. The second is that Willemain achieves the highest average AFR, at 60%, but at much higher costs. The other methods are bundled together and perform similarly. Figure 9b shows the same behaviour as Figure 9a, except that LSTM no longer stands out and performs similarly to the other methods. This difference in the behaviour of LSTM for the average AFR and the total AFR can be attributed to the nature of the OIL data set: it contains many zero demands and the average inter-demand interval of the non-zero demands, p, is high, meaning that the intermittency effect is much stronger. Hence the different behaviour on average for this highly intermittent data set of slow-moving items (average monthly item sales of 0.63 items, see Table 7).
Figure 9: Trade-off curves for the inventory control measures on OIL. (a) Average achieved fill rate vs inventory holding costs; (b) Total achieved fill rate vs inventory holding costs.
In this section, the findings of this paper are first discussed. Next, the findings are compared to other existing papers and linked to the research questions. Finally, a conclusion and possible directions for further research are presented.
In this paper, nine different methods from three categories were compared: Croston, the Syntetos-Boylan approximation (SBA) and DLP, grouped as statistical methods; Multi-Layer Perceptron (MLP), Long-Short Term Memory (LSTM), LightGBM and Random Forest (RF) as machine learning methods; and Willemain's bootstrapping method and Quantile regression forming the third category of non-parametric methods. The nine methods were applied to eight different data sets: four industrial data sets and four simulated data sets, each simulating a certain demand behaviour. To measure the performance of the methods, forecasting accuracy measures (MSE, MASE, RMSSE and GMAE) and inventory control measures (achieved fill rate and inventory holding costs) were used.
6.1 Findings
Finding 1
Differences in results based on the performance measure used and data set category.
Throughout this paper, we have demonstrated that the performance of a method depends on the performance measure used: the best performing methods in terms of forecasting accuracy are not necessarily the best performing methods in terms of inventory performance. The Percentage Better comparison has shown that Quantile regression outperformed the other methods, followed by SBA, while DLP, LightGBM and RF were overall the worst methods. Based on the inventory control measures, Willemain outperforms the other methods. However, for lumpy demand LSTM outperforms Willemain, and MLP stands out as the best performer for erratic demand. DLP outperforms the other methods for the AUTO data set in terms of total AFR and inventory holding costs.
Finding 2
The handling of the data is very important, and results can differ because of different data cleaning. For example, Pennings et al. (2017) clean the BRAF data differently than we do: in their paper, the BRAF data set consists of only 1131 SKUs, whereas in our paper it consists of 5000 SKUs, and no explanation is given of how the data was cleaned. This results in different accuracy measures for the same methods. For Croston and SBA, Pennings et al. (2017) report better results in terms of MASE than we do, whereas the opposite holds for GMAE. Furthermore, our DLP method performs better in both instances (MASE and GMAE) than that of Pennings et al. (2017) for the BRAF data set.
Finding 3
Machine learning methods are represented in the top performer rankings of both performance measures, i.e. forecasting accuracy and inventory performance. One major aspect of ML methods is the tuning of the hyper parameters, which can improve the performance of the model. For example, by adding a hyper parameter tuning grid and trying multiple combinations of parameter values, we obtained better results for LightGBM on all data sets and all metrics compared to Haan (2021), who does not try multiple values. However, hyper parameter tuning is time-consuming and requires knowledge about the different parameters and how their values affect the model, which is an important point for the reproducibility of the methods.
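A minimal sketch of such a grid search is given below; it assumes the lightgbm R package and illustrative feature matrices x_train/x_valid with targets y_train/y_valid, and the grid values are examples rather than the exact settings used in this thesis:

# Minimal sketch of a small hyper parameter grid search for LightGBM.
library(lightgbm)

grid <- expand.grid(learning_rate    = c(0.05, 0.10),
                    num_leaves       = c(15, 31),
                    min_data_in_leaf = c(10, 20))

dtrain <- lgb.Dataset(data = as.matrix(x_train), label = y_train)

scores <- apply(grid, 1, function(g) {
  params <- list(objective        = "regression",
                 learning_rate    = g[["learning_rate"]],
                 num_leaves       = g[["num_leaves"]],
                 min_data_in_leaf = g[["min_data_in_leaf"]],
                 verbosity        = -1)
  model <- lgb.train(params, dtrain, nrounds = 200)
  preds <- predict(model, as.matrix(x_valid))
  mean((preds - y_valid)^2)                     # validation MSE per combination
})

grid[which.min(scores), ]                       # best combination on the grid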
Finding 4
As observed in Table 10, the GMAE metric equals zero in two instances, namely for the BRAF and OIL data sets when the QR method is applied. As previously explained, the QR method predicts zero demand for many periods, items and quantiles. In combination with zero actual demand, this causes the corresponding absolute errors to equal zero, and because of these zero absolute errors the geometric mean automatically becomes zero even if the other absolute errors are greater than zero. This is only observed for the quantile regression method, as it is the only method that predicts zero demand when there actually is zero demand. The sensitivity of GMAE to zero values is therefore a finding that should always be considered in combination with the method used.
6.2 Comparison with the literature
In the paper of Haan (2021), of which a newer version has been submitted as a working paper to the International Journal of Production Economics (October 2, 2023), seven methods are run across eight data sets. Namely, Croston,
SES, SBA, TSB, Willemain, MLP and LightGBM. Out of those seven methods, we also run
Croston, SBA, Willemain, MLP and LightGBM. In fact, our paper is an extension of Haan
(2021)’s paper as it takes some of their methods plus new methods (LSTM, RF, DLP and
Quantile regression) and runs them on the same eight data sets. The performance measures are
the same. In both papers, MSE, MASE and RMSSE are used as forecasting accuracy measures
and the trade-off curves of service levels are used as inventory performance measures. Our
paper goes one step further and adds a new forecasting accuracy measure, namely GMAE. In
Haan (2021)’s study, SBA is superior overall based on the Percentage Better comparison. SBA
performs second best after Quantile regression based on the Percentage Better score in our paper.
This shows that in both papers SBA proves itself as a reliable method. In terms of inventory
performance, our paper draws the same conclusion as Haan (2021). Willemain’s bootstrapping
method is overall superior. Regarding the ML methods, LightGBM is outperformed by every other method in terms of forecasting accuracy in both papers, even though it is tuned differently. MLP is the second-best performing method for forecasting accuracy in Haan (2021) and the third best in this paper. This indicates that Haan (2021)'s findings are reproducible.
In the recent, not yet published paper of Theodorou et al. (2023), a comparison study with eleven methods is conducted, three of which are also used in our paper: Croston, SBA and LightGBM. These methods are applied to one single data set (retail sales from Makridakis et al. (2022)), which consists mostly of intermittent (73%) and lumpy (17%) items (3% erratic and 7% smooth). In our paper, the OIL, BRAF and SIM4 data sets show similar demand characteristics. In their study, LightGBM stands out for both RMSSE and the trade-off curves for the
higher review periods, whereas Croston and SBA do not show significant superiority compared to the other methods. The performance of LightGBM in their paper is consistent with ours: although we do not analyze the behaviour for higher review periods, the generally low performance of LightGBM is in accordance with our findings.
6.3 Conclusion
In this section, the findings are linked to the research questions of this paper. The first question
is as follows: ”Which methods perform best on what kind of demand respectively for which data
set?”
Throughout this paper, the performance of nine different spare parts demand forecasting methods from three categories has been studied. The methods are categorized as statistical methods, machine learning methods and non-parametric methods, and are run on eight data sets characterized by certain demand behaviours, with demand classified as either lumpy, erratic, intermittent or smooth. The findings suggest that there is no consistently superior method per data set. SBA shows superiority in 3 out of 4 accuracy metrics for the SIM3 data set (smooth demand) and 2 out of 4 accuracy metrics for the AUTO data set. However, these findings are not consistent with the inventory performance measures: for the same data sets, different methods are superior. The SIM3 data set shows the best results for Willemain's method, whereas the OIL data set provides the best results for MLP in terms of average achieved fill rate and for LSTM in terms of total achieved fill rate. SIM4, on the other hand, shows superiority in all four accuracy metrics when the MLP method is applied to it, which again is not consistent with the findings based on the inventory performance measures, where Willemain is superior. Therefore, no definitive answer can be given to this research question, as consistency is lacking.
The second question of this paper is as follows: ”Is the performance of certain methods due
to the measure used?”
This question is aimed at the forecasting accuracy measures and how they affect the reported accuracy. Table 10 indeed shows that some accuracy measures perform consistently well for a method throughout most data sets. The most eye-catching example is the performance of the quantile regression (QR) method for the MASE accuracy measure: QR provides the lowest MASE in 7 out of 8 instances. Other noteworthy examples are GMAE, for which Willemain is superior on 3 out of 4 industrial data sets, and RMSSE, for which MLP is superior on 3 out of 4 simulated data sets. The reason for this is outside the scope of this paper, but future research could be conducted on it.
The third and final research question of this paper is: ”Do machine learning methods perform
better in general than statistical methods?”
Machine learning methods are widely used in other domains, where they appear to perform well. In the spare parts demand forecasting field, machine learning methods are not widely used, as their performance has not yet been fully studied. Furthermore, some ML methods are not easily interpretable because of the black-box problem. In this paper, a total of four ML methods have been used, namely Multi-layer perceptron (MLP), Long-short term memory (LSTM), LightGBM and Random forest (RF). Furthermore, 3 out of the 9 methods are categorized as statistical methods in Section 3.1. Looking at Table 10, the statistical methods are superior in 9.5 out of 96 instances, whereas the ML methods are superior in 10 out of 128 instances. Both LightGBM and RF fail to show superiority in even a single instance, while among the statistical methods only DLP never performs best on any accuracy metric. This shows that, in terms of forecasting accuracy, the statistical methods perform better on average than the ML methods.
However, when looking at the inventory performance measures, the opposite is observed. Only DLP among the statistical methods achieves the highest fill rate for the total AFR, while the ML methods are more often the best performing method when it comes to inventory performance. Consequently, no general conclusion can be drawn from our paper: although ML methods provide promising results in terms of inventory performance, they lag behind the statistical methods in terms of forecasting accuracy.
6.4 Discussion
For future research, ML methods in spare parts demand forecasting should be studied further, as there are many aspects of their implementation that can be analysed. One is the training of the model: do models that have been trained on a single SKU (stock keeping unit) perform better than cross-SKU trained models? Furthermore, the hyper parameter tuning of the ML methods plays a big role. More attention could be paid to the behaviour of the hyper parameters in the spare parts demand forecasting domain, in order to determine which methods are easy to implement, i.e. not complicated to build and tune, do not require much computational power and work with little data.
6.5 Acknowledgments
I would like to express my deepest gratitude to Dekker for the invaluable guidance throughout
this thesis. Furthermore, I would like to thank Dekker, Nguyen, De Haan, Syntetos, Boylan,
Teunter, Duncan and Porras for providing the data sets. Lastly, I would like to mention my
family and friends for supporting me through this process.
The methods have been implemented in RStudio and on Google Colab. The code and the data sets can be found on GitHub. The URL of the GitHub repository is https://github.com/YllorH/SpareParts MasterThesis.
7 References
Aktepe, A., Yanık, E., & Ersöz, S. (2021). Demand forecasting application with regression and
artificial intelligence methods in a construction machinery company. Journal of Intelli-
gent Manufacturing, 32, 1587–1604.
Arya Poddar, K. D., Aiana Goyal. (2019). Scorecardmodelutils: Credit scorecard modelling utils.
https://CRAN.R-project.org/package=scorecardModelUtils
Baryannis, G., Validi, S., Dani, S., & Antoniou, G. (2019). Supply chain risk management
and artificial intelligence: State of the art and future research directions. International
Journal of Production Research, 57 (7), 2179–2202.
Biau, G., & Scornet, E. (2016). A random forest guided tour. Test, 25, 197–227.
Boylan, J. E., Syntetos, A. A., et al. (2006). Accuracy and accuracy-implication metrics for
intermittent demand. Foresight: The International Journal of Applied Forecasting, 4,
39–42.
Boylan, J. E., Syntetos, A. A., & Karakostas, G. C. (2008). Classification for forecasting and
stock control: A case study. Journal of the operational research society, 59, 473–481.
Breiman, L. (2001). Random forests. Machine learning, 45, 5–32.
Callioni, G., de Montgros, X., Slagmulder, R., Van Wassenhove, L. N., & Wright, L. (2005). Inventory-driven costs. Harvard Business Review, 83(3), 135–41.
Chandriah, K. K., & Naraganahalli, R. V. (2021). Rnn/lstm with modified adam optimizer in
deep learning approach for automobile spare parts demand forecasting. Multimedia Tools
and Applications, 80 (17), 26145–26159.
Choi, B., & Suh, J. H. (2020). Forecasting spare parts demand of military aircraft: Compar-
isons of data mining techniques and managerial features from the case of south korea.
Sustainability, 12 (15), 6045.
Croston, J. D. (1972). Forecasting and stock control for intermittent demands. Journal of the
Operational Research Society, 23 (3), 289–303.
de Haan. (2021). GitHub repository for benchmarking spare parts demand forecasting for inter-
mittent demand. https://github.com/danieldehaan96/spdf
Dennis, M. J., & Kambil, A. (2003). Service management: Building profits after the sale. Supply Chain Management Review, 7(1), 42.
Durlinger, P., & Paul, I. (2015). Inventory and holding costs. Durlinger Consultancy, Posterholt,
The Netherlands, Tech. Rep. WP, 4.
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics,
7 (1), 1–26. Retrieved June 14, 2023, from http://www.jstor.org/stable/2958830
Gutierrez, R. S., Solis, A. O., & Mukhopadhyay, S. (2008). Lumpy demand forecasting using
neural networks. International journal of production economics, 111 (2), 409–420.
Haan, d. (2021). Benchmarking spare parts demand forecasting methods (Master's thesis). http://hdl.handle.net/2105/60771
Haneveld, W. K., & Teunter, R. (1997). Optimal provisioning strategies for slow moving spare
parts with small lead times. Journal of the Operational Research Society, 48 (2), 184–194.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9 (8),
1735–1780.
Hyndman, R. J., & Koehler, A. B. (2006). Another look at measures of forecast accuracy.
International journal of forecasting, 22 (4), 679–688.
İfraz, M., Aktepe, A., Ersöz, S., & Çetinyokuş, T. (2023). Demand forecasting of spare parts
with regression and machine learning methods: Application in a bus fleet. Journal of
Engineering Research, 11 (2), 100057.
Kailex. (2020). M5 forecasting competition. https://www.kaggle.com/kailex/m5-forecaster-v2
keras-team. (2021). GitHub repository of the keras package/optimizers/adam.py. https://github.com/keras-team/keras/blob/v2.13.1/keras/optimizers/adam.py
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980.
Koenker, R., & Hallock, K. F. (2001). Quantile regression. Journal of economic perspectives,
15 (4), 143–156.
Kolassa, S. (2016). Evaluating predictive count data distributions in retail sales forecasting.
International Journal of Forecasting, 32 (3), 788–803.
Kourentzes, N. (2013). Intermittent demand forecasts with neural networks. International Jour-
nal of Production Economics, 143 (1), 198–206.
Kourentzes, N. (2014). On intermittent demand model optimisation and selection. International Journal of Production Economics, 156, 180–190. https://doi.org/10.1016/j.ijpe.2014.06.007
Kraus, M., Feuerriegel, S., & Oztekin, A. (2020). Deep learning in business analytics and oper-
ations research: Models, applications and managerial implications. European Journal of
Operational Research, 281 (3), 628–641.
Learned-Miller, E. G. (2014). Introduction to supervised learning. I: Department of Computer
Science, University of Massachusetts, 3.
Liaw, A., Wiener, M., et al. (2002). Classification and regression by randomforest. R news, 2 (3),
18–22.
Lolli, F., Gamberini, R., Regattieri, A., Balugani, E., Gatos, T., & Gucci, S. (2017). Single-
hidden layer neural networks for forecasting intermittent demand. International Journal
of Production Economics, 183, 116–128.
Makridakis, S., Spiliotis, E., & Assimakopoulos, V. (2018). Statistical and machine learning
forecasting methods: Concerns and ways forward. PloS one, 13 (3), e0194889.
Makridakis, S., Spiliotis, E., & Assimakopoulos, V. (2022). The m5 competition: Background,
organization, and implementation. International Journal of Forecasting, 38 (4), 1325–
1336.
Microsoft. (2021). Welcome to lightgbm’s documentation! - lightgbm 3.2.1.99 documentation.
https://lightgbm.readthedocs.io/en/latest/
Nguyen. (2023). GitHub repository for spare parts demand forecasting. https://github.com/KhueNguyenTK/Spare-Part-Demand-Forecasting
Pennings, C. L., Van Dalen, J., & van der Laan, E. A. (2017). Exploiting elapsed time for man-
aging intermittent demand for spare parts. European Journal of Operational Research,
258 (3), 958–969.
Pinçe, Ç., Turrini, L., & Meissner, J. (2021). Intermittent demand forecasting for spare parts:
A critical review. Omega, 105, 102513.
Porras, E., & Dekker, R. (2008). An inventory control system for spare parts at a refinery: An
empirical comparison of different re-order point methods. European Journal of Opera-
tional Research, 184 (1), 101–132.
Probst, P., Wright, M. N., & Boulesteix, A.-L. (2019). Hyperparameters and tuning strategies for
random forest. Wiley Interdisciplinary Reviews: data mining and knowledge discovery,
9 (3), e1301.
Rudin, C. (2019). Stop explaining black box machine learning models for high stakes decisions
and use interpretable models instead. Nature machine intelligence, 1 (5), 206–215.
SciKit-Learn. (2015). Lstm model. https://github.com/scikit-learn/scikit-learn/blob/7f9bad99d6e0a3e8ddf92a7e5561245224dab102/sklearn/model_selection/_search.py#L1056
SciKit-Learn. (2019a). Forest of trees-based ensemble methods. https://github.com/scikit-learn/scikit-learn/blob/7f9bad99d6e0a3e8ddf92a7e5561245224dab102/sklearn/ensemble/_forest.py#L1090
SciKit-Learn. (2019b). Multilayer perceptron. https://github.com/scikit-learn/scikit-learn/blob/7f9bad99d/sklearn/neural_network/_multilayer_perceptron.py#L765
Sherstinsky, A. (2020). Fundamentals of recurrent neural network (rnn) and long short-term
memory (lstm) network. Physica D: Nonlinear Phenomena, 404, 132306.
Shi, Y., Ke, G., Soukhavong, D., Lamb, J., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., et al. (2022). Lightgbm: Light gradient boosting machine. R package version 3.3.4.
Silver, E. A. (1981). Operations research in inventory management: A review and critique.
Operations Research, 29 (4), 628–645.
Smyl, S. (2020). A hybrid method of exponential smoothing and recurrent neural networks for
time series forecasting. International Journal of Forecasting, 36 (1), 75–85.
Spiliotis, E., Makridakis, S., Semenoglou, A.-A., & Assimakopoulos, V. (2020). Comparison of
statistical and machine learning methods for daily sku demand forecasting. Operational
Research, 1–25.
Suomala, P., Sievänen, M., & Paranko, J. (2002). The effects of customization on spare part busi-
ness: A case study in the metal industry. International Journal of Production Economics,
79 (1), 57–66.
Syntetos, A. A., Babai, M. Z., & Gardner Jr, E. S. (2015). Forecasting intermittent inventory
demands: Simple parametric methods vs. bootstrapping. Journal of Business Research,
68 (8), 1746–1752.
Syntetos, A. A., Babai, M. Z., & Altay, N. (2012). On the demand distributions of spare parts.
International Journal of Production Research, 50 (8), 2101–2117.
Syntetos, A. A., & Boylan, J. E. (2001). On the bias of intermittent demand estimates. Inter-
national journal of production economics, 71 (1-3), 457–466.
Syntetos, A. A., & Boylan, J. E. (2005). The accuracy of intermittent demand estimates. Inter-
national Journal of forecasting, 21 (2), 303–314.
Syntetos, A. A., & Boylan, J. E. (2006). On the stock control performance of intermittent
demand estimators. International Journal of Production Economics, 103 (1), 36–47.
Syntetos, A. A., Nikolopoulos, K., & Boylan, J. E. (2010). Judging the judges through accuracy-
implication metrics: The case of inventory forecasting. International Journal of Forecast-
ing, 26 (1), 134–143.
Taylor, J. W. (2007). Forecasting daily supermarket sales using exponentially weighted quantile regression. European Journal of Operational Research, 178(1), 154–167. https://doi.org/10.1016/j.ejor.2006.02.006
Teunter, R. H., & Duncan, L. (2009). Forecasting intermittent demand: A comparative study.
Journal of the Operational Research Society, 60 (3), 321–329.
Teunter, R. H., Syntetos, A. A., & Babai, M. Z. (2011). Intermittent demand: Linking forecasting
to inventory obsolescence. European Journal of Operational Research, 214 (3), 606–615.
Theodorou, E., Spiliotis, E., & Assimakopoulos, V. (2023). Forecasting accuracy and inventory
performance: Fallacies and facts. https://doi.org/10.13140/RG.2.2.26143.33446
Trapero, J. R., Cardós, M., & Kourentzes, N. (2019). Quantile forecast optimal combination to
enhance safety stock estimation. International Journal of Forecasting, 35 (1), 239–250.
Van Jaarsveld, W., & Dekker, R. (2011). Estimating obsolescence risk from demand data to
enhance inventory control—a case study. International Journal of Production Economics,
133 (1), 423–431.
Van Wingerden, E., Basten, R., Dekker, R., & Rustenburg, W. (2014). More grip on inven-
tory control through improved forecasting: A comparative study at three companies.
International journal of production economics, 157, 220–237.
Willemain, T. R., Smart, C. N., & Schwarz, H. F. (2004). A new approach to forecasting inter-
mittent demand for service parts inventories. International Journal of forecasting, 20 (3),
375–387.
Zhang, G., Patuwo, B. E., & Hu, M. Y. (1998). Forecasting with artificial neural networks: The state of the art. International Journal of Forecasting, 14(1), 35–62.
Zhou, C., & Viswanathan, S. (2011). Comparison of a new bootstrapping method with para-
metric approaches for safety stock determination in service parts inventory systems.
International Journal of Production Economics, 133 (1), 481–485.
7.1 Appendix
Figure 10: Trade-off curves for the inventory control measures on SIM4. (a) Average achieved fill rate vs inventory holding costs; (b) Total achieved fill rate vs inventory holding costs.
Figure 11: Trade-off curves for the inventory control measures on MAN. (a) Average achieved fill rate vs inventory holding costs; (b) Total achieved fill rate vs inventory holding costs.
Appendix tables: for each data set, the target fill rates with the corresponding average achieved fill rate, total achieved fill rate and inventory holding costs per method.