Module 2.3 EDA Part 3 Time Series Data in Python and R
For Module 2.3, we will be putting into practice the theories discussed in Module 2.1 EDA Part 1 Time Series Data using
quarterly sales data of a French retail company from Prof. Rob J. Hyndman and Prof. George Athanasopoulos’s book
Forecasting: Principles and Practice (3rd ed).
These are some of the questions I ask at various stages of model building, more so during EDA.
1. Are there any null values? How many? How do we impute null data?
▪ If NaNs/#NAs are present, first identify why these data points are missing and whether they mean anything. Missing
values can be filled by interpolation, forward-fill, or backward-fill depending on the data and context (a short pandas
sketch of these fill methods appears at the end of this overview). Also make sure null doesn’t mean zero; a true zero is
acceptable but has modeling implications.
▪ It is important to understand how the data was generated (manual entry, ERP system) and what transformations or
assumptions were made before imputing the missing data.
4. If seasonality is present, how does the data change from season to season for each period? What does the
seasonality look like if a trend is also present?
▪ Does it increase/decrease with the trend? Does it change slowly, change rapidly, or remain constant? These are important
observations to make, especially for regression and gradient boosted models. This is also key for deciding whether any
data preprocessing will be needed.
▪ Decompose the series into level, trend, seasonality, and residual error. Observe the patterns in the
decomposed series.
▪ Is the trend constant, growing/slowing linearly, exponentially, or some other non-linear function?
▪ Is the seasonal pattern repetitive?
▪ How is the seasonal pattern changing relative to level? If it is constant relative to level, it shows additive
seasonality; whereas if it is growing relative to level, it's multiplicative. See screenshot from
https://anomaly.io/seasonal-trend-decomposition-in-r/index.html for a more visual explanation.
Figure 1. Visual comparison of additive and multiplicative seasonality
7. What is the distribution of the data? Will we need to perform any transformations?
▪ While normally distributed data is not a requirement for forecasting and does not necessarily improve point
forecast accuracy, it can help stabilize the variance and narrow the prediction interval.
▪ Plot the histogram of the entire dataset and for each time period (i.e., each year) to gauge kurtosis/peakedness
and skewness of the data. It can also help compare different periods and track trends over time.
▪ If the data is severely skewed, consider normalizing the data before training the model.
▪ Common transformations for positively-skewed data include square root, cube root, and logarithm.
▪ Common transformations for negatively-skewed data include power transformations (e.g., squaring the values),
and the logarithm can also be used.
▪ A more powerful data transformation is the Box-Cox transformation, which covers both the log and power
transformations depending on the parameter λ and is defined as follows (a short scipy sketch appears at the end of
this overview):
w_t = log(y_t), if λ = 0;
w_t = (y_t^λ − 1) / λ, otherwise.
o When trends are present in a time series, shorter lags typically have large positive correlations because
observations closer in time tend to have similar values. The correlations taper off slowly as the lags
increase. In this ACF plot, the autocorrelations decline slowly. The first five lags are significant. In
practice we usually set k = 1 for a time series that exhibits a strong trend like this one. A sinusoidal
(wave) pattern that converges to 0, possibly alternating between negative and positive signs, also signifies a strong
trend.
Figure 3. ACF plot of time series that exhibits strong trend at 5% level of significance
o When seasonal patterns are present, the autocorrelations are larger for lags at multiples of the
seasonal frequency than for other lags. When a time series has both a trend and seasonality, the ACF
plot displays a mixture of both effects. Notice how you can see the wavy correlations for the seasonal
pattern and the slowly diminishing lags of a trend.
Figure 4. ACF plot of time series that exhibits BOTH trend and seasonality at 5% level of significance
▪ A PACF plot is only appropriate if you will develop a classical autoregressive model such as ARIMA. Typically, you will
use the ACF to determine whether an autoregressive model is appropriate. If it is, you then use the PACF to
help you choose the model terms. We will discuss this in more detail in Module 3.
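Before moving on to the worked example, here is a minimal Python sketch of two of the operations mentioned above: filling missing values and applying a Box-Cox transformation. The series s and its values are illustrative only and are not taken from the dataset; the pandas and scipy calls are the standard ones.
import pandas as pd
import numpy as np
from scipy import stats
#Illustrative series with two gaps (values are made up; the real dataset is loaded later)
s = pd.Series([12.0, 14.0, np.nan, 18.0, np.nan, 25.0], name="Sales")
s.interpolate()   #linear interpolation between neighbouring observations
s.ffill()         #forward-fill: carry the last observed value forward
s.bfill()         #backward-fill: use the next observed value
#Box-Cox transformation; scipy estimates lambda by maximum likelihood when it is not supplied
w, lam = stats.boxcox(s.interpolate())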
We will need the following libraries for this exercise. These are the most basic and commonly used packages and libraries for a
data analytics project in Python.
a. Pandas
b. Numpy
c. Matplotlib
d. Seaborn
e. Altair
f. Statsmodels
g. Scipy
import pandas as pd
import numpy as np
#Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns
import altair as alt
plt.style.use('seaborn-white')
%matplotlib inline
#Statistics libraries
import statsmodels.api as sm
from scipy import stats
from scipy.stats import anderson
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import month_plot, seasonal_plot, plot_acf, plot_pacf, quarter_plot
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.stats.diagnostic import acorr_ljungbox as ljung
path = 'https://raw.githubusercontent.com/pawarbi/datasets/master/timeseries/ts_frenchretail.csv'
#Sales numbers are in thousands, so I am dividing by 1000 to make it easier to work with numbers, especially squared errors
data = pd.read_csv(path, parse_dates=True, index_col="Date").div(1_000)
data.index.freq='Q'
data.head()
The data.head() line in the code is a sanity check to ensure that you imported the correct file. It shows the first five
rows of your dataset.
In Python, it is important to set the frequency of the time series data. You can set data.index.freq to any of the following:
- Quarterly = ‘Q’
- Monthly = ‘M’
- Weekly = ‘W’
#forecast horizon
h = 6
#Hold out the last h observations for testing; the rest are for training (no shuffling)
train, test = data.iloc[:-h], data.iloc[-h:]
train_length = len(train)
Before analyzing the data, first split it into train and test (hold-out) sets for model evaluation. All EDA and model fitting/selection
should be done on the training data only; never look at the test set until the end, to avoid bias. Typically, we want at least 3-4
full seasonal cycles for training, and the test set should be no shorter than the forecast horizon.
In this example, we have 24 observations of quarterly data, which means 6 full cycles (24/4). Our forecast horizon is 6
quarters (h = 6 above), so the training set should contain at least 16 observations (four full cycles) and at most 18 (24 − 6).
We will use the first 18 observations for training and keep the last 6 for validation. Unlike a typical train-test split, we cannot
shuffle the data before splitting; the temporal structure must be retained.
CODE BLOCK #4: Data integrity check
Observations:
a. No null values
b. Length of the train set is 18 and we have 18 unique dates/quarters, so no duplicate dates
c. Each quarter has 1 observation, so no duplicates and time series is continuous
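The observations above can be verified with a few quick checks. The lines below are a minimal sketch of such an integrity check, not necessarily the original Code Block #4.
print(train.isnull().sum())            #count of null values per column
print(len(train))                      #length of the training set
print(train.index.nunique())           #number of unique dates
print(train.index.duplicated().any())  #True if any date appears more than once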
#Create line chart for Training data. Index is reset to use the Date column.
train_chart = alt.Chart(train.reset_index()).mark_line(point=True).encode(
    x='Date',
    y='Sales',
    tooltip=['Date', 'Sales'])
#4-quarter rolling mean and its text labels (rebuilt here so the layered chart below is self-contained)
rolling = train.rolling(4).mean().dropna().reset_index()
rolling_mean = alt.Chart(rolling).mark_line(color='red').encode(x='Date', y='Sales')
text = rolling_mean.mark_text(dy=-10).encode(text=alt.Text('Sales', format='.1f'))
#Add zoom-in/out
scales = alt.selection_interval(bind='scales')
#Combine everything
(train_chart + rolling_mean + text).properties(
    width=600,
    title="French Retail Sales & 4Q Rolling mean (in '000)").add_selection(
    scales
)
Matplotlib and Seaborn create static charts, whereas plots created with Altair are interactive. You can hover over the data
points to read tooltips. The most useful feature is the ability to zoom in and out. Time series data can be dense, and it is
important to examine each time period to get insights; with zoom, this can be done interactively without slicing the time
series.
NOTE: You can choose to perform your time series plotting in Excel instead if you find writing the code too confusing.
Observations:
a. Sales have gone up each year from 2012-2016. A positive trend is present.
b. Typically, sales go up from Q1 to Q3, peak in Q3, then drop in Q4. This is a seasonal pattern. The model should capture
both seasonality and trend.
c. The series is not stationary, as seen from the upward-sloping rolling mean line.
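A seasonal subseries plot is a quick way to confirm observation (b). Below is a minimal sketch using the quarter_plot function imported earlier; it assumes the quarterly index frequency set above.
#One panel per quarter, showing how each quarter's sales evolve year over year
quarter_plot(train['Sales']);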
CODE BLOCK #7: Density plot of time series and each year
Observations:
a. The density plot shows the data looks roughly normally distributed. The bimodal appearance within quarters is due to the
small sample size. Peaks shift to the right from 2012 to 2015, indicating an increasing average. No structural breaks are found.
b. The distribution becomes wider as the years progress, indicating higher spread/variation (as also seen in the boxplot).
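The density plots can be produced along the following lines. This is a sketch only; the per-year grouping and the side-by-side layout are assumptions rather than the exact contents of Code Block #7.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
#Density of the full training set
sns.kdeplot(train['Sales'], ax=ax1)
ax1.set_title('All training data')
#One density curve per year to compare location and spread across periods
for year, grp in train.groupby(train.index.year):
    sns.kdeplot(grp['Sales'], ax=ax2, label=str(year))
ax2.legend();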
Always add a semicolon (;) after a statsmodels plotting call; otherwise the plot is printed twice in a notebook. Also, by default
the statsmodels plots are small and most of the plotting functions do not take a figsize argument, so use rcParams to define
the plot size (see the sketch below).
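The decomposition behind the observations below can be reproduced roughly as follows. This is a sketch: the multiplicative model is an assumption based on the seasonal swings growing with the level, and the variable name decompose is chosen to match the Ljung-Box discussion further down.
from matplotlib import rcParams
rcParams['figure.figsize'] = (12, 8)   #set a larger default size for statsmodels plots
#Decompose the training series into trend, seasonal, and residual components
decompose = seasonal_decompose(train['Sales'], model='multiplicative')
decompose.plot();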
Observations:
a. The trend is more than linear; notice a small upward take-off after 2013-07. Also notice that the trend is projecting upward.
b. The seasonal pattern is consistent.
c. Residuals are whatever is left after fitting the trend and seasonal components to the observed data. It is the
component we cannot explain. We want the residuals to be i.i.d. (i.e., uncorrelated). If the residuals have a pattern,
it means there is still some structural information left to be captured. For example, the residuals here show a
wavy pattern, which is not good. We need to perform the Ljung-Box test to confirm whether they are i.i.d. as a group.
d. We do not want to see any recognizable patterns in the residuals (e.g., waves, upward/downward slope, funnel
pattern, etc.)
CODE BLOCK #9: Performing an initial Ljung-Box Test on residuals of decomposed time series
The Ljung-Box test is a statistical test that checks whether autocorrelation exists in a time series. It uses the following hypotheses:
H0: The residuals are independently distributed (no autocorrelation).
HA: The residuals are not independently distributed; they exhibit serial correlation.
Ideally, we would like to fail to reject the null hypothesis. That is, we would like the p-value of the test to be greater
than 0.05, because this means the residuals of our time series model are independent, which is often an assumption we
make when creating a model.
The first argument in this code is the dataframe that we want to run the test on. The dataframe decompose.resid.dropna()
refers to the residuals of the decompose dataframe, which we declared in Code Block #8.
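The call looks roughly like this; this is a sketch using the acorr_ljungbox import aliased as ljung above, and the lag setting is an assumption.
#Ljung-Box test on the residuals of the decomposition
ljung(decompose.resid.dropna(), lags=[5])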
The initial test run above shows that the residuals are uncorrelated. If the residuals were correlated, we could apply
transformations to see whether that stabilizes the variance.
CODE BLOCK #10: Test for stationarity and difference the time series if needed
#ADF test on the original training series
adf1 = adfuller(train['Sales'])[1]
print(f"p value:{adf1}", ", Series is Stationary" if adf1 < 0.05 else ", Series is Non-Stationary")
#Differencing (first order), then re-run the ADF test
de_trended = train['Sales'].diff(1).dropna()
adf2 = adfuller(de_trended)[1]
print(f"p value:{adf2}", ", Series is Stationary" if adf2 < 0.05 else ", Series is Non-Stationary")
As suspected through our time series plots, the series is not stationary. The first p-value below refers to the first ADF test
in the above code block. The second refers to the ADF test after we performed a first order differencing.
CODE BLOCK #11: ACF and PACF plots of the de-trended time series
As explained earlier, for EDA purposes we care more about the ACF plot than the PACF plot. The PACF plot allows us
to tune the parameters of a regression-based model, but is unnecessary for supervised machine learning models.
Note that for the PACF line in the code, we set the lags equal to 5 because of the small sample size. The PACF plot is more
sensitive to sample size and will return an error if we do not specify a lag size. This means the PACF will only test
partial autocorrelations up to 5 lags. For a sufficiently large sample, there is no need to specify the lag, as a suitable
default lag length is used automatically.
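A sketch of how these plots can be generated from the differenced series; the side-by-side layout is an assumption.
#ACF and PACF of the de-trended (differenced) series
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
plot_acf(de_trended, ax=ax1);
plot_pacf(de_trended, lags=5, ax=ax2);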
Observations:
a. The ACF plot shows the autocorrelation coefficients are insignificant at all lag values (within the blue 95% CI band), except
lag 1.
b. The ACF plot is sinusoidal, which means the time series exhibits a strong trend.
As mentioned above, a time series does not have to be Gaussian for accurate forecasting, but if the data is highly skewed
it can affect model selection and forecast uncertainty. In general, if the series is non-Gaussian, it should be normalized
through transformations before modeling. Normalization will also help us decide whether to use regression-based models,
tree-based models, or neural network models later.
To visually check for normality, plotting a histogram against a density plot is ideal. You can also create a Q-Q plot to check
for normality. Points on the Normal Q-Q plot provide an indication of univariate normality of the dataset. If the data is
normally distributed, the points will fall on the 45-degree reference line. If the data is not normally distributed, the points
will deviate from the reference line.
#Distribution Plot
sns.distplot(train["Sales"]);
#Q-Q Plot against a normal distribution
sm.qqplot(train["Sales"], fit=True, line='45');
#Jarque-Bera normality test
stats.jarque_bera(train["Sales"])
SignificanceResult(statistic=1.0509757294798883, pvalue=0.5912668357500077)
Observations:
a. Q-Q plot shows the data follows the 45-degree line very closely, deviates slightly in the left tail.
b. Density plot shows a distribution close to a perfectly normal curve.
c. The Jarque-Bera test shows the data is consistent with a normal distribution. The Jarque-Bera test is a goodness-of-fit test that
determines whether the sample data have skewness and kurtosis that match a normal distribution. The null
hypothesis is a joint hypothesis of the skewness being zero and the excess kurtosis being zero. Since the p-value
above is not less than .05, we fail to reject the null hypothesis.
d. If the p-value is “small” (that is, if there is a low probability of sampling data from a normally distributed
population that produces such an extreme value of the statistic), this may be taken as evidence against the null
hypothesis in favor of the alternative: the data were not drawn from a normal distribution.
e. Note that the inverse is not true; the test is not used to provide evidence for the null hypothesis.
While Python is the more convenient choice for EDA, R has its advantages, especially when it comes to statistical analysis. For most
time series analysis projects, we will need the following packages:
1. tidyverse
2. forecast
3. FinTS
4. tseries
As with Python, we only install packages once on our machines; however, we need to load the packages every time we
perform EDA.
When installing packages for the first time, you will be asked to select a CRAN mirror. You can select any of the mirror sites
listed. I normally choose Taiwan or Philippines.
#install packages
install.packages("tidyverse")
install.packages("forecast")
install.packages("FinTS")
install.packages("tseries")
install.packages("urca")
#load libraries
library(tidyverse)
library(forecast)
library(FinTS)
library(tseries)
library(zoo)
library(lubridate)
library(urca)
NOTE: There is no need to install the zoo and lubridate packages separately; they are installed as dependencies of forecast and tidyverse, respectively.
path = 'https://raw.githubusercontent.com/pawarbi/datasets/master/timeseries/ts_frenchretail.csv'
data = read.csv(path)
head(data)
In practice, the train-test split is often done at 70% train, 30% test. To recreate the Python code, we used a
75-25 split in R.
The first portion of the code scales our Sales data to '000s and then converts the dataframe into a time series (ts) object.
You will notice that, unlike in Python where it is easy to properly scale the time series, in R it is harder to visualize
the original time series plot. Still, we can see in the R plot that the time series is trending upward and has seasonality.
plot(as.ts(decompose_train$seasonal))
plot(as.ts(decompose_train$trend))
plot(as.ts(decompose_train$random))
plot(decompose_train)
Like the Python decomposition visuals, we can capture both trend and seasonality of the retail sales time series using
R. Here, however, we need to specify whether the seasonality is additive or multiplicative. Refer back to the overview
above on how to identify additive and multiplicative seasonality.
CODE BLOCK #7: Performing an initial Ljung-Box Test on residuals of decomposed time series
While we got a different p-value compared to when the test was done in Python, the results are the same in that we fail to
reject the null hypothesis, i.e., the residuals of the decomposed time series are not autocorrelated.
CODE BLOCK #8: Test for stationarity and difference the time series if needed
The second part of the code performs the differencing of the time series to transform it into a stationary time series. As
in Python, we set the order of differencing equal to one. After re-running the ADF test, we can say that the
differenced time series is stationary, as shown below.
The selectlags argument controls how the test selects the number of lags that minimizes the information
criterion. AIC stands for Akaike Information Criterion. We will discuss this further in Module 3.
CODE BLOCK #9: ACF and PACF plots of the de-trended time series
We got the same ACF plot results in R as in Python. There is a difference in our PACF plots, but for
the purposes of EDA, let us focus first on the ACF plots.
As in Python, the R visuals show that the French retail sales time series looks normal. We also ran the Jarque-Bera
test; since the p-value is greater than .05, we fail to reject the null hypothesis that the sample comes from a normal
distribution.