Data Processing
St. Petersburg
2019
Descriptive statistics
Descriptive statistics is the primary systematization of data from different sources. It is actively used at the stage of exploratory data analysis, and in some cases it turns out to be entirely sufficient for a complete analysis. Let us consider the main types of descriptive statistics and their practical application.
The measurement of central tendency is the process of selecting a single number that best describes all the values of a selected variable from a data set. On the one hand, it provides information on the distribution of variable values in a compressed form; on the other hand, it leads to a loss of information compared to the frequency distribution of the variable values. The main characteristics of central tendency are the following:
Mean
Mode
Median
The arithmetic mean of a variable is defined as the sum of all values of a variable
divided by the number of values.
The mean is calculated only for numeric scales and for dichotomous data taking the values 0 and 1. There is only one mean for each data set.
Suppose that during their studies at the university a student received the following grades:
5, 4, 2, 5, 4, 3, 3, 4, 5, 3, 5, 5, 5, 2, 5
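A quick way to check this on a computer is shown below; this short Python sketch (not part of the slides) simply divides the sum of the grades by their number.

```python
# Minimal sketch: arithmetic mean of the grades listed above.
grades = [5, 4, 2, 5, 4, 3, 3, 4, 5, 3, 5, 5, 5, 2, 5]

mean = sum(grades) / len(grades)  # sum of all values divided by the number of values
print(mean)                       # 4.0
```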
The mean can also be used for dichotomous data. If the two values of the variable are 0 and 1, the mean of such data indicates the share of ones in the sample. For example, for the following data:
1, 0, 0, 0, 1, 1, 1, 0, 0, 0
The mean equals 0.4. That is, 40% of the sample values are equal to one.
Mode
The mode is the value of a variable that occurs more often than others. It is denoted by
Mo. The mode can be defined on the data of any scale. There can be several values for
the mode. In this case we speak about multimodal distribution of variable values. If none
of the values of the variable in the data set is repeated, it is said that there is no mode.
On the screen you can see an example of calculating the mode for student grades. The
most common grade is 5.
Another example is the calculation of mode for the weather. Sunny weather occurs most
often.
Another characteristic of central tendency is the median. The median is based on the concept of a variation series, i.e. an ordered sequence of the values. Let's consider an example of constructing a variation series from the student's grades. The first row is ordered in ascending order, the second in descending order.
Now we can define the concept of the median (Me). The median is the value that corresponds to the middle element of the variation series. The notion of the "middle element" differs for an even and an odd number of values. For a data set of n values, if n is odd, the middle element's number is (n + 1)/2; if n is even, the median is the arithmetic mean of the two adjacent middle elements with numbers n/2 and n/2 + 1.
The median can be defined for numeric and ordinal data. There is only one median for
each dataset. Let's consider an example of calculating the median for students' grades. The data set shown on the screen has 14 elements, so the median is calculated as the average of the 7th and 8th elements and equals 4.5.
Another example is the calculation of the median for sea wind strength on the Beaufort
scale.
There are the following observations of the strength of the sea wind on the Beaufort scale. On their basis we form a variation series in ascending order of the variable's values. The number of elements is 13, so the seventh element is the median.
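Both characteristics are easy to reproduce in code. The sketch below uses Python's standard statistics module and the 15-grade list from the earlier example (not the 14-element set shown on the screen), so the printed values illustrate the definitions rather than the slide.

```python
# Sketch: mode and median computed with the standard library.
import statistics

grades = [5, 4, 2, 5, 4, 3, 3, 4, 5, 3, 5, 5, 5, 2, 5]

print(statistics.mode(grades))     # 5 - the most frequent grade
print(statistics.median(grades))   # 4 - middle element of the ordered series (15 values)

# For an even number of values the median is the mean of the two middle elements:
print(statistics.median([2, 3, 4, 5]))  # 3.5
```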
Which of these characteristics is better? Which one should be applied if there is a choice? At first glance, it might seem that the mean is the most informative and widely used characteristic. In terms of popularity this is certainly true, but in terms of applicability and usefulness it is not always the best one. Let us give a well-known example on this subject.
Some village has 50 inhabitants. Among them, 49 people are rural residents with a
monthly income of 1 thousand rubles, and one resident is a prosperous farmer with an
income of 451 thousand rubles. We calculate the average income of the villagers. It is
equal to 10 thousand rubles. It is clear that this number does not adequately reflect the
income of the villagers. In this case, it would be much more rational to use mode or
median as a measure of the central tendency (both equal 1 thousand rubles). In addition,
in this case, one number is clearly not enough to describe the income in this village.
The measure of central tendency is just one number that is used to describe the typical value from the sample under study. It does not represent how diverse the data in the sample are. That is why the concepts of the range and the interquartile range (IQR) were introduced.
Range is the difference between the highest and the lowest values of the data set.
For the set of data representing the student's grades that you see on the screen, the range is R = 5 – 2 = 3.
The second quartile Q2 coincides with the median. Q1 is the median of the values less than Q2, and Q3 is the median of the values greater than Q2. There are several conventions for determining the quartile values, which may give slightly different results. For example, when calculating Q1 and Q3 there are two options: to include or to exclude the median (that is, "less/greater" may be understood strictly or not strictly).
That is why most modern tools include two versions of quartile calculation, i.e. excluding or including the median. These functions are called QUARTILE.EXC and QUARTILE.INC, respectively. On the screen you can see an example of the Q1 calculation including the median.
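The same two conventions are available in Python's statistics module; the sketch below only illustrates the exclusive/inclusive distinction (which roughly corresponds to QUARTILE.EXC/QUARTILE.INC), using the grade list from earlier rather than the spreadsheet example on the screen.

```python
# Sketch: quartiles under the exclusive and inclusive conventions.
import statistics

grades = [5, 4, 2, 5, 4, 3, 3, 4, 5, 3, 5, 5, 5, 2, 5]

q_exc = statistics.quantiles(grades, n=4, method='exclusive')  # median excluded from the halves
q_inc = statistics.quantiles(grades, n=4, method='inclusive')  # median included in the halves
print(q_exc)  # [Q1, Q2, Q3]
print(q_inc)  # [Q1, Q2, Q3]
```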
The interquartile range is the difference between the third and the first quartiles and is calculated by the formula IQR = Q3 – Q1.
What is the fundamental difference between the range and the interquartile range? The range is a very simple and "rough" measure of variation, because only the smallest and largest values of the variable are used in its calculation. When calculating the interquartile range, only the extreme values outside the first and the third quartiles are ignored: about 50% of all data fall between the first and third quartiles.
When conducting a preliminary analysis, the so-called box plot is very useful.
It has the form shown on the slide and can be drawn both horizontally and vertically. It displays the minimum, the maximum, and the three quartiles. This allows you to display the key values of the data compactly and expressively.
Quartiles can be used to define outliers, i.e. values that are too different from the others.
Outliers are usually classified into two types: moderate and extreme. Moderate outliers are those located below the first quartile or above the third quartile at a distance of more than 1.5 IQR but less than 3 IQR. Extreme outliers are located below the first quartile or above the third one at a distance of more than 3 IQR. The scheme of outlier determination is shown on the screen. The determination of outliers is very important in the data preparation phase and eliminates values whose validity is questionable.
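A minimal sketch of this rule is given below; the data list is hypothetical and serves only to illustrate the 1.5 IQR and 3 IQR thresholds.

```python
# Sketch: classifying moderate and extreme outliers with the 1.5*IQR / 3*IQR rule.
import statistics

data = [12, 14, 15, 15, 16, 17, 18, 19, 21, 45]   # hypothetical values

q1, _, q3 = statistics.quantiles(data, n=4, method='inclusive')
iqr = q3 - q1

moderate, extreme = [], []
for x in data:
    distance = max(q1 - x, x - q3)      # how far the value lies outside [Q1, Q3]
    if distance > 3 * iqr:
        extreme.append(x)               # extreme outlier
    elif distance > 1.5 * iqr:
        moderate.append(x)              # moderate outlier

print("moderate:", moderate, "extreme:", extreme)
```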
In addition, on the basis of these data, a so-called extended box plot (a box-and-whisker plot with outliers) is drawn, on which the outliers are displayed. It is built in two stages: first, the quartiles are determined and used to identify the outliers (displayed on the plot as dots); then the outliers are excluded from the data, and the minimum, maximum, and quartiles are recalculated and displayed in the form of a regular box plot. You can see an example of such a box plot with outliers on the screen.
Variance of a data set or a data sample is an arithmetic mean of squared deviations of the
values from their mean value. The variance value is calculated by the formula mentioned
above or with the help of any software allowing the calculation of such a function.
Google Sheets has such a function, named VAR. In order to calculate the variance, we should define the data sample; in our case, it is the incomes of the inhabitants of the village mentioned above. The result is shown on the screen.
One more descriptive statistic is tightly connected with the variance: the standard deviation. The standard deviation is the square root of the sample variance. It is calculated by the formula shown on the screen. This formula is normally included in any software; in Google Sheets the function is called STDEV.
The calculation of the standard deviation of the incomes of village inhabitants is shown
on the screen.
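The same two statistics can be reproduced in code. The sketch below uses the village incomes from the earlier example; note that pvariance/pstdev divide by n (as in the definition above), while variance/stdev divide by n - 1, which is what the spreadsheet functions VAR and STDEV normally do.

```python
# Sketch: variance and standard deviation of the village incomes (thousand rubles).
import statistics

incomes = [1] * 49 + [451]            # 49 residents with 1, one farmer with 451

print(statistics.mean(incomes))       # 10
print(statistics.pvariance(incomes))  # population variance (divides by n)
print(statistics.variance(incomes))   # sample variance (divides by n - 1)
print(statistics.stdev(incomes))      # sample standard deviation
```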
So, in this section we have considered the main descriptive statistics that are very
beneficial at the stage of the exploratory analysis of the data.
Data transformation
Data transformation is one of the most common pre-processing procedures; it can reveal characteristics hidden in the data that are not visible in their original form.
We will try to justify the need for transformation using a specific example. We have at our disposal aggregated data on the exam results for 2018. There is no doubt that they are reliable: the data come from official releases of the Federal Service for Supervision in Education and Science.
Too many numbers in the table make it difficult to draw general conclusions. Let's try to
display the numbers of top-rated RNE students in various subjects in the form of the
simplest graphs. The data correspond to aggregated values from different categories.
Therefore, they will suit visualization in the form of a bar or pie chart. Here's what our
data looks like on a bar graph.
Once again, we emphasize that the data are reliable. This is aggregated data, which is
exactly as is. However, it is obvious that they cannot be visualized in this way, because
some values (for example, corresponding to the categories of English, Geography and
French) are displayed inadequately. Let's try to display the same data as a pie chart.
Maybe this picture is a little better. Geography now looks more convincing. However,
English and French are still almost invisible in the diagram. The reason that the specified
values are not visible is the large spread of values among the aggregated data. What
should we do with such data and how to visualize it? A possible solution is to transform
the data in such a way that the spread of values will be reduced and the data itself
becomes at least commensurable. There are many different conversion methods. Let's
consider the most traditional ones and discuss how this transformation affects the
visualization.
The table on the screen shows the most common methods of conversion and aspects of
their use. For example, a natural or decimal logarithm is suitable for data conversion,
keeps order among values, but is not appropriate if there are zero values in the source
data. The square root transformation also maintains the order between values and can handle zero values, but not negative ones. The reciprocal transformation (1/x) reverses the order of the values but cannot handle zero values.
Let's consider how the transformations are applied on the example of a logarithmic
transformation. To perform a logarithmic transformation, you must calculate the
logarithm of each value in the dataset and use that transformed data instead of the
original ones.
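In code this amounts to applying the logarithm element-wise. The sketch below uses hypothetical counts, since the actual RNE figures are in the official table, and it works only for strictly positive values.

```python
# Sketch: natural-log transformation of a small data set (hypothetical counts).
import math

counts = [3, 12, 150, 4800, 22000]

log_counts = [math.log(x) for x in counts]  # defined only for strictly positive values
print(log_counts)
```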
Logarithmic transformations have a significant effect on the form of distribution. The
screen shows a bar chart of top-rated RNE students after the application of natural
logarithmic transformation.
The transformation of the square root has a more moderate effect on the shape of the
distribution.
The following graph shows the bar chart of top-rated RNE students after applying the square root transformation.
There are many possible transformations. How to choose the appropriate one? The
answer to this question is not obvious, although there are formal statistical methods to
choose the transformation. If you do not delve into these theories, then the general strategy for choosing a transformation is to clearly define the purpose of the transformation (for example, visualization of a certain type), apply the most commonly used transformations such as the logarithm, square root, square, or reciprocal, and then choose the best method based on the purpose and results.
Since the transformation methods include the application of mathematical functions to
the source data, it is necessary to pay attention to the change of measurement units. For example, if you apply a logarithmic function to a variable showing the number of top-rated RNE students, the unit of measure becomes the logarithm of the number.
Therefore, when presenting data on charts and diagrams, it is necessary to indicate what
transformations were performed and in which units the data is displayed. If the
transformed data was used to calculate statistics, remember to perform the inverse
transformation to represent the result in the initial units. For example, if square root
transformation was applied, you must make the inverse transformation and square the
result.
Data normalization
Suppose there is one voucher to Artek* and the following information about the students is at the disposal of the school's pedagogical council. You can see it presented on the slide.
*Artek is an International children's center located on the southern coast of Crimea in the village of Gurzuf. It used
to be very high rated in the Soviet time and still is very popular.
Let's assume that all the certificates are of about the same level and there are no incredibly outstanding ones among them. The task of the pedagogical council is to find (and later publish) a formal criterion to be used to select an applicant for the single voucher to Artek. What is the best way to find this criterion?
The task is to assign to each student a value that represents all of their achievements, and then, on the basis of these values, to rank the students and identify the first candidate.
If all the variables and grades were measured in the same scales and units of measurement, it would be possible to simply sum all the values. But this is a very rough approach: sports certificates are issued much more often than the others, so sports achievements would immediately dominate all the rest. The solution is to normalize the values of the variables and then to calculate the final criterion based on them.
Usually the expression of some quality is described by a number. The variable x varies from a minimum value xmin (reflecting the lack of the quality) to a maximum value xmax (the highest degree of the quality). The presence of a quality criterion makes it possible to compare two objects, but only with respect to this parameter. However, one should remember to what extent the parameter can vary, and the ranges and units of measurement of different variables are very diverse.
In addition, it is sometimes necessary to estimate how close a value is to the edges of the range or to its middle. When it comes to comparing or aggregating data on different parameters, the raw values cannot be compared directly. However, each quality parameter can be interpreted as the degree of intensity of that quality, and degrees of intensity can and should be compared and aggregated!
In this case the parameters should be brought to one scale so that the minimum and maximum values of different variables are the same. Exactly this transformation is called normalization. After this transformation, you can compare and aggregate a variety of parameters obtained by different techniques.
With all the variety of numerical characteristics of objects, two broad classes can be distinguished among them:
* unipolar, expressing only the degree to which some quality is present (for example, intense color or a very good grade);
* bipolar, reflecting not only the degree to which a quality is present, but also its "direction".
The normalization methods differ for these classes. Let us consider some of them.
Most often, unipolar values are normalized in the range from 0 to 1. Any continuous
increasing function y = f (x) with a minimum value of 0 and a maximum value of 1 can
act as the normalization function:
Let’s consider possible types of such a function, having the properties mentioned above.
On the slide you can see two types of such a function - exponential and linear. It should
be noted that the last transformation is used more often than others because of its
simplicity. The advantage of the exponential normalization is that it distributes the
initial values evenly over the range from 0 to 1. Moreover, small modifications of this
formula make it easy to increase this evenness of distribution in specific cases.
Bipolar values are usually normalized in the range from -1 to 1. Any continuous
increasing function y = f (x) with a minimum value of minus 1 and a maximum value of
plus 1 can act as the normalization function.
A possible example of such a linear function based on minimum and maximum values
is given on the screen.
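As an illustration, here is a small sketch of the linear (min-max) versions of both normalizations; the helper names unipolar_norm and bipolar_norm are illustrative, and these are common linear choices that may differ in detail from the formulas on the slides.

```python
# Sketch: linear (min-max) normalization.
# unipolar_norm maps values to [0, 1]; bipolar_norm maps them to [-1, 1].
def unipolar_norm(x, x_min, x_max):
    return (x - x_min) / (x_max - x_min)

def bipolar_norm(x, x_min, x_max):
    return 2 * (x - x_min) / (x_max - x_min) - 1

values = [0, 1, 2, 3, 5]                      # e.g. numbers of certificates
x_min, x_max = min(values), max(values)
print([unipolar_norm(v, x_min, x_max) for v in values])  # [0.0, 0.2, 0.4, 0.6, 1.0]
print([bipolar_norm(v, x_min, x_max) for v in values])   # [-1.0, -0.6, -0.2, 0.2, 1.0]
```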
Clearly, there are other possible normalization options; some of them are not linear and are sometimes tied to the specifics of the subject area in which the parameters are normalized. However, in most cases, such a transformation is quite sufficient for subsequent analysis.
Now, armed with the knowledge gained, let's return to our example with the students. What kind of parameters describe them?
Let's consider all the parameters of our example with students. They're all unipolar.
Indeed, the average mark is a unipolar parameter that reflects the unidirectional quality
of educational success, as a rule, it is measured from 1 to 5. The number of certificates
for participation in art competitions is a unipolar parameter (presented as a positive
integer). The number of certificates for sports achievements is a unipolar parameter
(presented as a positiveinteger). The number of certificates for participation in
intellectual competitions is a unipolar parameter (presented as a positive integer).
This means that when normalizing the data, we can use any of the normalizations for
unipolar parameters. For simplicity, we take linear normalization.
Let's consider, for example, how the normalization of the parameter <Number of certificates for participation in art competitions> will be performed. First, we define the minimum and maximum values of this parameter. They are equal to 0 and 5, respectively. We substitute these values into the linear normalization formula. You can see the results on the screen.
We use the linear unipolar transformation formula for all parameters and normalize all
the initial data (moreover, each parameter must be normalized separately). So, we have
normalized values for each parameter. What's next?
It is necessary to set the so-called objective function based on normalized values
(corresponding to students). What is the objective function? This is a mathematical
expression of some quality criterion for an object (process, solution). The objective
function is set in order to obtain one parameter instead of a large number of qualitative
parameters for each studied object, and then, based on the maximum (or minimum)
value of the function, determine the object at which the corresponding extremum is
reached. So what value of the function to use? Maximum? Minimum? It depends on the
specifics of the task and the type of function itself. For example, if this function reflects
the total positive qualities of the student, then this is probably the maximum. And if we
talk about the total cost to perform a task, then it is more logical to use a minimum. In
our case, the sum of normalized values can be used as the objective function, since each
of the values reflects some positive qualitative characteristics of the student, and the
maximum value of such a function is considered to be the best value. So, the student
corresponding to the maximum value of the function will be considered to be the best.
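A possible sketch of such an objective function is shown below. The student names and normalized values are hypothetical, the helper name objective is illustrative, and the optional weights will be useful for the weighted variant discussed a bit later.

```python
# Sketch: an additive objective function over normalized parameters.
def objective(normalized_values, weights=None):
    if weights is None:
        weights = [1] * len(normalized_values)
    return sum(w * v for w, v in zip(weights, normalized_values))

# columns: normalized mark, art, sports, intellectual certificates (hypothetical numbers)
students = {
    "Student A": [1.0, 0.2, 0.4, 0.6],
    "Student B": [0.5, 1.0, 0.2, 0.4],
    "Student C": [0.75, 0.6, 1.0, 0.0],
}

scores = {name: objective(vals) for name, vals in students.items()}
winner = max(scores, key=scores.get)
print(scores, "->", winner)

# Doubling the significance of the normalized mark (the variant discussed later):
weighted = {name: objective(vals, weights=[2, 1, 1, 1]) for name, vals in students.items()}
print(weighted, "->", max(weighted, key=weighted.get))
```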
At our disposal there are at least two possible options to see the result of calculating such
a function: we can add another column to the table in which the sum of normalized
parameters will be calculated and then find a student with the maximum value of the
objective function; another option is to use a wonderful tool for visualization - the stacked
bar chart. This type of chart can be found in most popular software. The peculiarity of
this chart is that it itself sums up the parameters and our task is to set them correctly
and find the column with the maximum accumulated value. We will demonstrate both
options.
So, the slide shows a table in which the value of the objective function is added as the
last column (option A is the sum of all normalized parameters). It is easy to see that the
maximum value of the objective function (2.2) corresponds to Mathew. So Mathew is the winner! Let's consider the second possible option for determining the winner - the stacked bar chart.
We will build a stacked bar chart based on the normalized values and see exactly the same result: it is Mathew who corresponds to the column with the maximum accumulated height. In this case, all you need to take care of is to correctly set the chart type and the values of the source data.
We could stop at this, but it turned out that the pedagogical council insists on doubling the significance of the normalized score for school performance, i.e. if all other normalized values enter the objective function with a coefficient of 1, then the normalized average score will be multiplied by a significance coefficient of 2. The final formula for the objective function in this case is displayed on the screen. Let's try to determine the winner with the new objective function.
The slide shows a table in which option B is used as the objective function. It is easy to see that the maximum value of the objective function corresponds to Anna. So, in this case, the winner is Anna! We can see this option shown on the stacked bar chart.
We cannot change the manner of accumulation in the stacked bar chart, but we can double the values of the performance parameter in the source data and use the new doubled parameter to build the chart. As expected, the winner is Anna.
Time series analysis
Time series analysis assumes that time-series data depends on its past values. The problems of time series analysis are generally divided into two classes: identifying the structure of a series (its trend, seasonal, and cyclic components) and forecasting its future values.
Seasonality describes systematic changes in the level of the series in a fixed period of
time (for example, monthly power consumption corresponds to a period of 12 months).
A cycle reflects the repeating period in the series (for example, economic cycles, high
solar activity periods, and so on).
Time-series forecasting assumes the construction of a function f that gives the forecast value of the series at the point (t + h) based on the values of the time series y1, y2, …, yt and the additional parameter h. Here, h lies within the interval from 1 to H, where H is the forecasting horizon.
Based on the horizon, the forecasting problem is generally divided into short-term and
long-term forecasting. However, the time scale is conditional, as it often depends on the
characteristics of the time series.
There are some things to consider while solving these problems. Let’s elaborate on them.
● It is not always easy to reveal underlying patterns in the history of a series (for
example, to measure the duration of the periods of time, choose the appropriate
analytical function for a trend, and so on).
● Patterns (if there are any) can be distorted by noise in the data. This seems especially characteristic of data obtained from different sensors. That is why the analysis includes data preprocessing to remove noise by way of different smoothing techniques.
● The dynamics of the series in the past do not guarantee similar behavior of the
series in the future, because the behavior can significantly change due to
external factors (for example, oil price dynamics have the tendency to change
dramatically in response to oil output quotas, political events in the petroleum
exporting countries, and so on).
Nevertheless, despite these problems, time-series models can and must be built, because their benefits to business development outweigh the known risks. Moreover, statistics provides mathematical methods that estimate the so-called prediction interval: an interval within which we expect the predicted value to lie with a probability not below a specified one.
This lecture will cover simple examples of time-series modeling. But first, let’s discuss
how to assess the quality of obtained time-series models and how to compare them. Such
comparisons require so-called quality metrics. The metric is a function that defines the
distance between any two members of the set. There are many different forms of such
metrics, but for now, we will stick with some of them.
Before deciding on the metric to use, we need to find out more about the data. To assess
the forecast quality, we need predicted values and actual time-series data. Moreover, we
need to build a model based on one part of the data and verify against another. The data
used to construct a model is known as a training sample, while the data for the model
assessment is called a test sample. How to split a time series into training and test
samples? This can be solved in many different ways. However, in the case of a time series,
the simplest and most logical approach is to split the series into three parts and use the
first two as a training sample while building the model. Next, the third part can serve as
a test sample to assess the quality of the constructed model.
Now we can move on to the metrics. Most of the quality metrics are based on the concept
of forecast error (that we will designate by et). Forecast error at time t is the difference
between the predicted and observed values of the variable at the time t. Therefore,
et = ŷt – yt, where ŷt is the predicted value, and yt is the actual value of the variable.
All the metrics that we will use are based on the assumption that the model that had
minor errors in the past will have minor errors in the future.
The first metric, MAE (mean absolute error), is obtained by dividing the sum of the absolute values of the forecast errors by the number of points in the test sample. The second metric, MSE (mean squared error), is calculated as the sum of squared forecast errors divided by the number of points in the test sample. The third metric, MAPE (mean absolute percentage error), is defined as the mean of the absolute forecast errors divided by the actual values of the time series, expressed as a percentage. In the formulas, et is the forecast error, yt is the actual value of the variable, and n is the number of points in the test sample.
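The three metrics are easy to implement directly from these definitions; in the sketch below the actual and predicted values are hypothetical.

```python
# Sketch: MAE, MSE, and MAPE computed from forecast errors on a test sample.
def mae(actual, predicted):
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    return sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    return 100 * sum(abs((p - a) / a) for a, p in zip(actual, predicted)) / len(actual)

actual = [100, 110, 120, 130]      # hypothetical test-sample values
predicted = [98, 112, 119, 135]    # hypothetical model forecasts
print(mae(actual, predicted), mse(actual, predicted), mape(actual, predicted))
```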
We will use these metrics to evaluate time-series models. To choose the model, we will
consider models with the minimum value of the metrics.
Time Series Smoothing
As we have already noted, some time-series values are noisy (in the sense that they reflect random error). Noisiness is often found in series based on data obtained from different sensors. An example of a noisy time series is shown on the screen. When analyzing time series, you need to identify their structure and estimate all major components, which is not easy because of noise. The good news is that there are many ways to remove it. Let's consider some of the well-known methods widely used for denoising.
The first method is a moving average. The moving average requires setting a window
width for each value of the series variable. The window is placed over neighboring
values of the series (ideally, k values before and k values after the smoothed value).
Next, these neighbors and the initial value of the series are used to calculate the
arithmetic mean. The smoothing formula is shown on the screen.
Here, Yi is the value of the initial series, Si is the value of the smoothed series, and (2
* k + 1) is the window width. The degree of smoothing depends on the window width.
A large window size results in rough smoothing and a possible loss of information
about the series dynamics. A small window size of 5-7 points can result in insufficient
denoising. There is no universal window size. A window size heavily depends on the
domain and the goals of averaging in each particular case.
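A minimal sketch of the centered moving average described above (the noisy series is hypothetical):

```python
# Sketch: centered moving average with window width 2*k + 1.
# Points without k neighbors on both sides are left unsmoothed (None).
def centered_moving_average(series, k):
    smoothed = [None] * len(series)
    for i in range(k, len(series) - k):
        window = series[i - k:i + k + 1]
        smoothed[i] = sum(window) / (2 * k + 1)
    return smoothed

noisy = [3, 5, 4, 6, 8, 7, 9, 11, 10, 12]   # hypothetical noisy series
print(centered_moving_average(noisy, k=1))
```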
Smoothing is widely used in technical analysis of market quotes and is inherent in all
the tools for stock analysis. Let’s look at the example of the initial and smoothed RTS
index chart after applying the moving average when k equals 3. Please note that the
calculations of the moving average do not allow finding the smoothed value for the
first k and last k points of the series. The moving average formula that is shown on
the screen always requires access to future measurements, therefore, it can be
calculated only offline. In practice, the moving average often has to be calculated online (which means that at the point being smoothed we know only past values of the series). This kind of smoothing is often used in technical analysis of financial data, like stock prices. Can we apply the moving average to online calculations?
Yes, we can. However, the calculation formula will include only the neighbors preceding the smoothed value. The formula is shown on the screen. Here, Yi is the value of the initial series, Si is the value of the smoothed series, and (n + 1) is the window width.
This smoothing results in a delay of the smoothed data stream, but still, it can be
successfully used to perform specific tasks.
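A sketch of this trailing (online) version, under the same assumptions as before:

```python
# Sketch: trailing ("online") moving average that uses only the current value
# and the n preceding ones, so it can be computed as new points arrive.
def trailing_moving_average(series, n):
    smoothed = [None] * len(series)
    for i in range(n, len(series)):
        window = series[i - n:i + 1]          # n past values plus the current one
        smoothed[i] = sum(window) / (n + 1)
    return smoothed

noisy = [3, 5, 4, 6, 8, 7, 9, 11, 10, 12]    # hypothetical noisy series
print(trailing_moving_average(noisy, n=2))
```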
There's one more technique often used in online smoothing. It is called exponential smoothing. The simplest form of exponential smoothing is given by the recursive formula shown on the screen. The initial smoothed value is defined as the first point of the time series, and all the subsequent smoothed values are defined as St = α*Yt + (1 − α)*St−1. Here, Yt is the value of the initial series at time t, St−1 is the value of the smoothed series at the point t − 1, and α is the smoothing parameter. Choosing an appropriate value of the smoothing parameter α is crucial. The current smoothed value is obtained from the previous one and from the error of the previous smoothing. The error value that is used for adjustment is defined by the smoothing parameter α. The closer the value of α is to 1, the more of the discrepancy between the smoothed and real values is taken into account when calculating the next value. The closer the value of α is to 0, the more of the discrepancy is considered random, and therefore less of it is used for calculating the next value.
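The recursive formula translates into a few lines of code; the noisy series below is hypothetical.

```python
# Sketch: simple exponential smoothing, S_t = alpha*Y_t + (1 - alpha)*S_{t-1},
# with the first smoothed value equal to the first point of the series.
def exponential_smoothing(series, alpha):
    smoothed = [series[0]]
    for y in series[1:]:
        smoothed.append(alpha * y + (1 - alpha) * smoothed[-1])
    return smoothed

noisy = [3, 5, 4, 6, 8, 7, 9, 11, 10, 12]    # hypothetical noisy series
print(exponential_smoothing(noisy, alpha=0.1))
print(exponential_smoothing(noisy, alpha=0.3))
```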
We can rewrite the exponential smoothing formula in the non-recursive form. The
formula shows that the smoothed value is the weighted sum of all the previous values
of the time series, and the coefficients decrease when the distance between the series
value and the present moment of time increases. For example, if α = 0.1, the formula
takes the form shown on the screen.
How to choose the smoothing parameter? There is no formally correct procedure for choosing α. In practice (at least in exchange activities), α lies between 0.1 and 0.3. We can say that the value of the smoothing parameter is based on the statistician's judgment on the sustainability of change in the parameter at hand.
The screen shows the examples of exponential smoothing of the RTS index with
various values of the smoothing parameter. The smoothing parameter on the first
chart equals 0.1.
On the second chart, the smoothing parameter is 10 times smaller and equals 0.01.
Note how the choice of the smoothing parameter affects the result.
Decomposing Time Series Data into Trends
Earlier, we discussed the major components of a typical time series such as trend,
seasonality, and noise. We also noted that it is possible to denoise data using special
techniques at the stage of preliminary analysis of time series. Separation of the noise
makes the pattern more salient. How to identify a trend and seasonality? Can they be
found analytically (by way of a time-dependent mathematical function)? What will it
give? If we learn to characterize the behavior of a time series, we will be able to predict
its behavior. For business, it means that we can predict car sales or the number of
passengers, consumers, guests, and much more. So, this is actually very interesting.
Well, how to identify a trend in the time series? In practice, we use such analytical
functions as linear, polynomial, exponential, and logarithmic.
Of course, we can also use other functions, but these four are more common, and they
are embedded in many tools. How to choose the best among them? There are formal
methods, but the simplest way is to recall how the graphs of the respective functions look and use the graph of the time series to choose the best-matching one. Let's consider the examples.
It is easy to guess that the time series in Example 1 corresponds to a linear function.
However, to model the behavior of the series, it is not enough to define the type of the
function, which corresponds to the trend line. We need to find the exact function
parameters. So, how to do this? Each of the mentioned types has corresponding
mathematical methods that can be used to find these parameters.
For example, to find the parameters of a linear trend, we can use a least-squares fit. It
allows finding the parameters of the function by using the formulas shown on the screen.
To find the parameters of the other trends, we can apply specific mathematical
techniques. This time, we will not describe them in detail. However, note that the trends
and the corresponding analytical functions are well identified by the simplest tools.
Before we move on to identifying trends, we would like to make one more note.
Sometimes it is hard to choose the trend line among several options. Are there formal
criteria for choosing the type of the trend? In fact, such a criterion exists! It is called the coefficient of determination and is denoted R2.
A coefficient of determination can be used to assess the quality of the selected trend
equation. It takes the values from 0 to 1. For the acceptable trend models, it is assumed
that the coefficient of determination should not be less than 0.5. Models with the
coefficient of determination greater than 0.8 are considered good. When the coefficient
of determination R2 = 1, there’s a functional relationship between the variables (that is
between the initial time series and the trend). The accurate formula for the coefficient
of determination is shown on the screen.
We will use the coefficient of determination that is implemented in many tools to assess
the quality of the trend lines.
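To make both ideas concrete, here is a sketch that fits a linear trend by the standard least-squares formulas and computes R2 from the residual and total sums of squares; the data points are hypothetical and only illustrate the procedure.

```python
# Sketch: least-squares fit of y = a*x + b and the coefficient of determination R^2.
def linear_trend(x, y):
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    a = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / \
        sum((xi - x_mean) ** 2 for xi in x)
    b = y_mean - a * x_mean
    return a, b

def r_squared(x, y, a, b):
    y_mean = sum(y) / len(y)
    ss_res = sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))   # residual sum of squares
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)                     # total sum of squares
    return 1 - ss_res / ss_tot

x = list(range(2000, 2013))                                          # years (hypothetical sample)
y = [184, 192, 199, 208, 215, 224, 230, 239, 247, 252, 261, 270, 278]  # hypothetical values

a, b = linear_trend(x, y)
print(a, b, r_squared(x, y, a, b))
```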
Data processing tools can not only display the trend equation on a chart but also offer
options that allow calculating the parameters of the linear trend without creating a
graph. They are called SLOPE and INTERCEPT. For the trend equation y(x) = a*x + b,
SLOPE calculates the parameter a, and INTERCEPT calculates the parameter b. In some
cases, these options are preferable, because they give a more accurate result compared
to the equation with the rounded parameters (it is shown on the trend chart).
Let’s look at the SLOPE example.
Here, you can also see INTERCEPT and the result. Compare the accuracy of the
representation for the parameter a in the equation on the chart and after applying the
function.
Well, now you know how to identify trends and the corresponding analytical functions.
How to apply this knowledge in practice? You can apply this to forecast the behavior of
simple time series with noise and trends but without seasonality. In other words, it is
possible to model a time series based on the analytical function that corresponds to the
trend. Here we have a suitable time series.
It is the time series that we discussed before. The time series refers to the number of cars per 1,000 persons in the Central Federal District. The data covers the period from 2000 to 2018.
Let’s create a linear graph corresponding to this time series and verify that there’s no
seasonality. Therefore, we can model the series based on the analytical function ax +
b.
Let’s see how effectively the points are modeled by this function. To do this, we will split
the series into training and test samples, find the parameters of the trend on the training
sample, perform forecasting on the test sample, and then assess the forecast quality by
examining the MAPE metric.
We can calculate the forecast values on the test sample (for the years from 2013 to 2018)
using the formula AX+B. The result is shown in the Forecast column. To obtain this
result, you can also use the built-in option FORECAST.LINEAR. It outputs the forecast
value based on the model of the linear trend for the specified time series.
Next, we can assess the forecast quality by examining the MAPE metric. To do so, we
add a new column named Error and calculate the terms in the numerator for the MAPE
metric.
All we’ve got left is to sum up all the values in the Error column and divide it by the
number of test sample elements. The result of 2.9% shows an average deviation from the
actual values and seems reasonable. A little bit later, we will discuss how to model the
series with seasonal components.
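The whole workflow can be sketched in a few lines. The values below are hypothetical (they only mimic a "cars per 1,000 persons" series), so the resulting MAPE will differ from the 2.9% obtained on the real data.

```python
# Sketch: fit a linear trend on the training years, forecast the test years,
# and evaluate the forecast with MAPE.
years = list(range(2000, 2019))
values = [184, 192, 199, 208, 215, 224, 230, 239, 247, 252, 261, 270, 278,
          285, 293, 300, 309, 316, 324]          # hypothetical series

train_x, train_y = years[:13], values[:13]        # 2000-2012
test_x, test_y = years[13:], values[13:]          # 2013-2018

n = len(train_x)
x_mean, y_mean = sum(train_x) / n, sum(train_y) / n
a = sum((x - x_mean) * (y - y_mean) for x, y in zip(train_x, train_y)) / \
    sum((x - x_mean) ** 2 for x in train_x)
b = y_mean - a * x_mean

forecast = [a * x + b for x in test_x]            # linear-trend forecast for the test years
mape = 100 * sum(abs((f - y) / y) for f, y in zip(forecast, test_y)) / len(test_y)
print(forecast)
print(round(mape, 2), "%")
```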
Modeling of Seasonal Time Series
How to identify a seasonal component in time series data and how to use it to model time
series? Let’s find out. Economic time series often contain a seasonal component, because
such scenarios frequently occur in a business context. For example, an overall picture of
power consumption has a pattern that repeats and depends on the month. The number
of restaurant guests also has a pattern and depends on the day of the week. The number
of people in public transport depends on the time of the day. These processes allow us
to define the duration of each recurring period and, as a result, to model seasonal time
series. In these examples, the period is 12 months for power consumption, 7 days for
restaurants, and 24 hours for public transport. But what to do with the series that have
less distinct seasonal components?
How to determine the recurring period and its seasonal components? There are many
complex mathematical theories for determining a time period, but we will simplify our
task by assuming the period is known, and the series at hand is denoised. Let’s find out
how to determine seasonal components for periodic series and how to build
corresponding forecasting models.
• First, we need to get the analytic expression for the trend component. We will
denote the obtained series by d(t).
• Second, we need to determine a seasonal component. To do so, we subtract the
trend component from the initial time series. That is, we create the series r(t) =
f(t) – d(t).
• Now we can build a model to forecast the series based on the trend and seasonal
components.
This scheme of model building is quite typical of the models that have seasonal
components. However, these models can be drastically different depending on a
particular seasonal component. Let’s take a look at the peculiarities of two example
models.
Now we are looking at the initial time series with the seasonal component of constant
amplitude. Aside from the initial data, you can see a graph based on this data, trend line,
and the parameters of the linear trend equation. The period of the time series equals 6.
Since we have the linear trend parameters, we can obtain the seasonal component of the
time series by subtracting the trend component from the initial time series. The results
are shown in the table and on the graph.
Look at the graph of the time series with the seasonal component of constant amplitude,
and you will see why it deserves its name. Local minima and maxima of the seasonal
component lie almost on the same line that is parallel to the x-axis. The characteristic
of this series is that the seasonal component at time t is the same as at time t - n, where
n is the period length.
Therefore, we can wrap it in a simple formula for short-term forecasting of the time
series (for one period ahead). Assume that the last series value is defined at time t. Then,
the short-term forecasting formula at time (t + k) consists of two terms: the value of the
trend component at time (t + k) and the value of the seasonal component at time (t + k -
n).
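A sketch of this short-term forecast is given below on a synthetic series built from an assumed linear trend and a fixed period-6 seasonal pattern.

```python
# Sketch: short-term forecast for a constant-amplitude seasonal series:
# forecast(t + k) = trend(t + k) + seasonal(t + k - n), for k = 1..n.
def trend(t):
    return 2.0 * t + 10.0                         # assumed linear trend d(t)

period = 6                                        # n, the period length
pattern = [3, 1, -2, -4, -1, 3]                   # synthetic seasonal pattern
series = [trend(t) + pattern[t % period] for t in range(18)]   # known part, f(t)

seasonal = [y - trend(t) for t, y in enumerate(series)]        # r(t) = f(t) - d(t)

last_t = len(series) - 1
forecast = [trend(last_t + k) + seasonal[last_t + k - period]
            for k in range(1, period + 1)]        # one period ahead
print(forecast)
```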
What is different in the long-term forecasting model? The difficulty is to consider the
seasonal component only based on the actual values of the series. That is, until time t.
But we can overcome this difficulty. We can use the last known period in the formula, even if we forecast several periods ahead. In the short-term forecasting formula, it is sufficient to use in the seasonal component not the value k itself, which indicates how far we are from the known values of the series, but its value modulo n, where n is the period length.
The next slide shows the long-term forecast of the time series. The forecast is obtained
using the formula, and it covers five periods. The initial time series is shown by the blue
line, and the predicted values are shown in red. The result is quite convincing. However,
you should know that long-term forecasting is always worse than short-term forecasting
since the closer we are to the initial data, the more well-founded our assumptions about
the behavior of the series. Of course, we can and should assess the forecast quality using
the metrics that we’ve already discussed in this lecture.
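The long-term variant only changes how far back we step for the seasonal value; the sketch below continues the previous synthetic example and produces a forecast for five periods.

```python
# Sketch: long-term forecast for the constant-amplitude case. The seasonal value
# is always taken from the last fully known period by stepping back a whole
# number of periods.
import math

def trend(t):
    return 2.0 * t + 10.0

period = 6
pattern = [3, 1, -2, -4, -1, 3]
series = [trend(t) + pattern[t % period] for t in range(18)]
seasonal = [y - trend(t) for t, y in enumerate(series)]

last_t = len(series) - 1
horizon = 5 * period                                   # five periods ahead
forecast = []
for k in range(1, horizon + 1):
    back = period * math.ceil(k / period)              # step back into the last known period
    forecast.append(trend(last_t + k) + seasonal[last_t + k - back])
print(forecast)
```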
As you can see, here’s another time series. Its distinctive quality is the presence of a
seasonal component of growing amplitude. The primary data analysis has been carried
out. The graph corresponding to the initial data and the trend line have been drawn. The
parameters for the linear trend equation have been explicitly calculated.
Let’s subtract the trend component from the initial time series. Now we can visualize the
result. The visualization clearly shows the growing amplitude of seasonal components
of the time series.
The amplitude changes over time in a steady manner. In this case, we could also apply the short-term forecasting model that we used for the time series of constant amplitude, because it still captures the dynamics of the amplitude, although with a lag of one period. Nevertheless, it is better to modify the formula for short-term forecasting of the second time series and rewrite it to include a seasonal component increment for one period. We are most interested in the behavior of the seasonal component, because the analytical formula is enough for the trend component. As we have already noted, time-series models are usually classified as additive or multiplicative. In the additive case, we should specify an increment, while in the multiplicative case, we need a factor. We will choose the latter option. Now we need to find the so-called seasonal parameter. It is the value by which we will multiply the values of the seasonal component in the last known period. There are many ways to calculate this parameter, but the easiest is to compare the amplitude of the last period and the period before last. The ratio of these amplitudes is the desired seasonal parameter.
But how to determine the amplitude in the last period and the period before last? The
amplitude can be defined as the difference between the maximum and minimum values
of the period points. The illustrated example of calculating the seasonal parameter
assumes that the period length equals 6 and that the last known point of the trend
component is in the cell C62.
After the seasonal parameter is determined, we can easily create the short-term
forecasting model. The formula is displayed on the screen. It is practically the same as
for the seasonal series of constant amplitude, but the seasonal component from the last
known period is multiplied by the seasonal parameter.
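Putting the last two steps together, here is a sketch for a synthetic series with growing amplitude: it estimates the seasonal parameter from the last two periods and then applies it in the short-term forecast.

```python
# Sketch: short-term forecast for a series whose seasonal amplitude grows.
# The seasonal parameter is the ratio of the amplitudes (max - min) of the last
# period and the period before last; the last-period seasonal values are then
# multiplied by it. The series below is synthetic.
def trend(t):
    return 2.0 * t + 10.0

period = 6
pattern = [3, 1, -2, -4, -1, 3]
growth = 1.2                                              # amplitude grows by 20% per period

series = [trend(t) + pattern[t % period] * growth ** (t // period) for t in range(18)]
seasonal = [y - trend(t) for t, y in enumerate(series)]   # r(t) = f(t) - d(t)

last = seasonal[-period:]                                 # seasonal values of the last period
previous = seasonal[-2 * period:-period]                  # ...and of the period before last
p = (max(last) - min(last)) / (max(previous) - min(previous))   # seasonal parameter

last_t = len(series) - 1
forecast = [trend(last_t + k) + p * seasonal[last_t + k - period]
            for k in range(1, period + 1)]
print(round(p, 3), forecast)
```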
What is different in the case of long-term forecasting? We need to account for the increment that grows the amplitude of the seasonal component when forecasting several periods ahead.
If we can assume that the seasonal parameter is at least approximately the same for all
periods, then we can use the formula that is shown on the screen to create the model.
Please note the expression (k div n + 1). It reflects the relative number of the period being
forecast. Such periods are enumerated starting at time t.
Please look at the long-term forecast of the time series of growing amplitude. The
forecast is obtained using the discussed formula.
The initial time series is shown on the graph by the blue line, and the predicted values
are shown in red.
Of course, when dealing with real-life time series, we need to assess the forecast quality
and make sure that we’ve chosen the best model. However, we will skip this step because
we have thoroughly discussed the forecast quality assessment earlier.
Well, this time, we covered several simple ways to model seasonal time series. Of course,
forecast accuracy decreases as the forecasting horizon increases. Nevertheless, even
these models can be useful for forecasting the behavior of real-world economic series.