AST4
Laurent Oudre
Master MVA
2023-2024
Contents
1. Problem statement
2. Denoising
3. Detrending
4. Interpolation of missing samples
5. Outlier removal
▶ Typical use case: noisy time series with outliers and missing values
▶ In order to apply ML algorithms, the data scientist needs to clean and consolidate the data
▶ Time-consuming and tedious task: fortunately, ML also provides tools for this purpose!
▶ Careful! All these preprocessing steps have a strong impact on the expected results and on the future learned rules!
Introductory example
[Figure: ECG signal (µV) vs. time (s), between 30 s and 35 s]
Once all preprocessing steps have been performed, it becomes possible to retrieve the heartbeats and thus to perform ML
Contents
1. Problem statement
2. Denoising
2.1 Filtering
2.2 Sparse approximations
2.3 Low-rank approximations
2.4 Other techniques
3. Detrending
4. Interpolation of missing samples
5. Outlier removal
Denoising
Notion of AWGN
The observed signal is the sum of the clean signal and of an Additive White Gaussian Noise (AWGN): y[n] = x[n] + b[n], with b[n] ∼ N(0, σ²)
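A minimal sketch (assuming a hypothetical sine signal and illustrative parameters) of how AWGN is simulated, together with a numerical check of the spectral property used below:

```python
import numpy as np

rng = np.random.default_rng(0)

N = 2048
sigma2 = 0.01                            # noise variance sigma^2
n = np.arange(N)
x = np.sin(2 * np.pi * 0.01 * n)         # hypothetical clean signal
b = rng.normal(0.0, np.sqrt(sigma2), N)  # AWGN: b[n] ~ N(0, sigma^2)
y = x + b                                # observed noisy signal

# In a frequency band where only noise is present, the average power
# |Y[k]|^2 / N is close to sigma^2, i.e. 10*log10(sigma^2) = -20 dB.
Y = np.fft.rfft(y)
noise_band = np.abs(Y[len(Y) // 2:]) ** 2 / N   # band far from the sine
print(10 * np.log10(noise_band.mean()))         # approximately -20 dB
```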
Example
[Figure: original and noisy versions of a usage signal (Usage vs. minutes)]
Filtering
▶ The first solution consists in using results from signal processing and statistics
▶ Knowing that γ_x[m] = E[x[n]x[n + m]] and using the fact that x[n] and b[n] are uncorrelated, we get that
$$|Y[k]|^2 = |X[k]|^2 + N\sigma^2$$
▶ Adding AWGN is equivalent to adding a constant to the power spectrum of the signal (in linear scale)
Example
[Figure: log-spectrum 10 log₁₀(|X[k]|²/N) (dB) of the original and noisy signals (σ² = 0.01), frequency in 10⁻³ Hz]
In the frequency band where only AWGN is present (here with σ² = 0.01), the log-spectrum is equal to
$$10 \log_{10}\left(\frac{|Y[k]|^2}{N}\right) = 10 \log_{10}\left(\frac{|X[k]|^2}{N} + \sigma^2\right) \approx 10 \log_{10}(0.01) = -20\ \text{dB}$$
Example
[Figure: log-spectrum of the noisy signal (dB) with the 10 log₁₀(σ²) noise level overlaid, frequency in 10⁻³ Hz]
By plotting the log-spectrum of the noisy signal and knowing the noise variance σ², one can guess that all frequencies greater than e.g. 0.001 Hz are likely to contain only noise.
Filter design
▶ By observing the log-spectrum of the noisy signal and using prior knowledge on either the original signal bandwidth or the noise level, we can determine the type of filter and the associated cut-off frequencies to be used for denoising
▶ From there, it is only digital filter design (out of scope for this course!). Two popular solutions (see the sketch below):
▶ Moving average filter of length L:
$$\hat{x}[n] = \frac{1}{L}\sum_{k=0}^{L-1} y[n-k]$$
which acts as a low-pass filter with cut-off frequency $f_c \approx \frac{0.442947 \times F_s}{\sqrt{L^2-1}}$
▶ Butterworth filters: can be low-pass, band-pass, etc.
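Both options can be sketched with scipy; the sampling rate and the input signal below are placeholders, and the cut-off formula above is used to match the moving average and the Butterworth filter:

```python
import numpy as np
from scipy import signal

fs = 1.0 / 60.0  # hypothetical sampling rate: one sample per minute (Hz)

def moving_average(y, L):
    """Moving average of length L: x_hat[n] = (1/L) * sum_k y[n-k]."""
    return signal.lfilter(np.ones(L) / L, [1.0], y)

def butter_lowpass(y, fc, fs, order=4):
    """Zero-phase Butterworth low-pass filtering with cut-off fc (Hz)."""
    b, a = signal.butter(order, fc, btype="low", fs=fs)
    return signal.filtfilt(b, a, y)

# Approximate -3 dB cut-off of a length-L moving average (formula above):
L = 25
fc = 0.442947 * fs / np.sqrt(L**2 - 1)

y = np.random.default_rng(0).normal(size=1500)  # placeholder noisy signal
x_ma = moving_average(y, L)
x_bw = butter_lowpass(y, fc, fs)
```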
Example
[Figure: denoising with low-pass filtering — noisy, original, and denoised signals (Usage vs. minutes)]
Sparse approximations
▶ Approximate the signal as a sparse linear combination of atoms d_k from a dictionary:
$$\hat{x} = \sum_{k \in K} z_k d_k, \quad \text{with } |K| < N$$
Dictionaries
▶ Wavelet dictionary with wavelet function ψ(t) and scaling function ϕ(t)
[Percival et al., 2000; Mallat, 1999]
Sparse coding
Given an input dictionary D, the denoising task is equivalent to a sparse coding task, and all previously seen algorithms can be used to that aim (see Lecture 3)
▶ ℓ0-based algorithms with hard thresholding:
$$z^* = \underset{\|z\|_0 = K_0}{\mathrm{argmin}}\ \|x - Dz\|_2^2$$
▶ ℓ1-based algorithms with soft thresholding:
$$z^* = \underset{z}{\mathrm{argmin}}\ \|x - Dz\|_2^2 + \lambda \|z\|_1$$
Set to zero the coefficients that are lower than a given threshold (and shrink the other ones):
$$S_\lambda(z) = \mathrm{sign}(z) \times \max\left(|z| - \lambda,\ 0\right)$$
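As a minimal sketch, the thresholding step can be applied to the coefficients of an orthogonal wavelet dictionary using PyWavelets; the wavelet, decomposition level, and threshold λ below are illustrative choices, not prescribed values:

```python
import numpy as np
import pywt  # PyWavelets

def wavelet_denoise(y, wavelet="db4", level=4, lam=0.1, mode="soft"):
    """Denoise y by thresholding its wavelet coefficients.
    mode="soft" applies S_lam(z) = sign(z) * max(|z| - lam, 0);
    mode="hard" zeroes the coefficients with |z| < lam."""
    coeffs = pywt.wavedec(y, wavelet, level=level)
    # Keep the approximation coefficients, threshold the detail coefficients.
    coeffs = [coeffs[0]] + [pywt.threshold(c, lam, mode=mode)
                            for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[:len(y)]
```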
Example
[Figure: denoising with sparse approximation (Usage vs. minutes)]
Example
[Figure: influence of the number of kept coefficients K₀]
How to set K₀ or λ
▶ Define the normalized coherence of a signal x with the dictionary D:
$$\lambda_D(x) = \max_{d \in D} \frac{|\langle x, d \rangle|}{\|x\|_2}$$
▶ At iteration ℓ of a greedy pursuit, the residual energies satisfy
$$\frac{\|r^{(\ell)}\|_2^2}{\|r^{(\ell-1)}\|_2^2} = 1 - \lambda_D^2\left(r^{(\ell-1)}\right)$$
▶ The decrease of the ℓ2 norm of the residual is therefore linked to the normalized coherence of the residual with the dictionary
▶ If λ_D(r^(ℓ−1)) is large, it is worth continuing
▶ If λ_D(r^(ℓ−1)) becomes too small, the algorithm can stop
▶ When can we say that the coherence becomes too low?
▶ One interesting question is therefore: what is the value of λ_D(r) when the residual r is pure noise?
▶ If r is pure noise with a known distribution p(r) (e.g. AWGN), we can be interested in the distribution of λ_D(r) under this model, e.g. its typical value (see the sketch below)
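One empirical answer is a Monte Carlo estimate of λ_D(r) for pure AWGN; in the sketch below the random Gaussian dictionary is a hypothetical stand-in, and any dictionary with unit-norm atoms can be substituted:

```python
import numpy as np

def coherence(r, D):
    """Normalized coherence lambda_D(r) = max_d |<r, d>| / ||r||_2,
    for a dictionary D whose columns are unit-norm atoms."""
    return np.max(np.abs(D.T @ r)) / np.linalg.norm(r)

rng = np.random.default_rng(0)
N, n_atoms = 256, 512
D = rng.normal(size=(N, n_atoms))
D /= np.linalg.norm(D, axis=0)              # normalize the atoms
# Average coherence of pure AWGN residuals with the dictionary:
noise_level = np.mean([coherence(rng.normal(size=N), D)
                       for _ in range(200)])
# A pursuit can stop once coherence(residual, D) drops near noise_level.
```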
Dictionary learning
Instead of using a fixed dictionary, both the dictionary D and the sparse codes Z can be learned from the data, so that
$$\hat{X} = DZ \approx X$$
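As an illustrative sketch (not the specific algorithm of these slides), a dictionary can be learned on overlapping patches of the signal with scikit-learn; the patch length, number of atoms, and sparsity level below are arbitrary choices:

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

def denoise_dictionary(y, w=64, n_atoms=32, n_nonzero=3):
    """Learn a dictionary on overlapping patches of y, reconstruct each
    patch from a sparse code (OMP), and average the overlaps."""
    # Overlapping patches as rows of a matrix (one patch per start index).
    P = np.stack([y[i:i + w] for i in range(len(y) - w + 1)])
    dl = MiniBatchDictionaryLearning(n_components=n_atoms,
                                     transform_algorithm="omp",
                                     transform_n_nonzero_coefs=n_nonzero,
                                     random_state=0)
    Z = dl.fit(P).transform(P)      # sparse codes, one row per patch
    P_hat = Z @ dl.components_      # denoised patches: X_hat = Z D
    # Average the overlapping reconstructions back into a 1D signal.
    x_hat = np.zeros(len(y))
    counts = np.zeros(len(y))
    for i, p in enumerate(P_hat):
        x_hat[i:i + w] += p
        counts[i:i + w] += 1
    return x_hat / counts
```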
Example
[Figure: denoising with dictionary learning — noisy, original, and denoised signals (Usage vs. minutes)]
Trajectory matrix
▶ The Singular Value Decomposition (SVD) of the trajectory matrix writes X = UΛVᵗ, where
▶ U and V are orthogonal matrices
▶ Λ is a diagonal matrix containing on its diagonal at most Nw singular values λ₁ ≥ … ≥ λ_{Nw}
so that
$$X = \sum_{k=1}^{N_w} \lambda_k u_k v_k^t$$
▶ For a zero-mean stationary signal, the lag-covariance matrix for lag Nw can be estimated as
$$C_X = \frac{1}{N_f} X X^t$$
▶ This positive definite matrix admits the eigendecomposition
$$C_X = \tilde{V} \tilde{\Lambda} \tilde{V}^t$$
This principle is the core of the Singular Spectrum Analysis (SSA) algorithm [Vautard et al., 1992]:
1. Compute the SVD of the trajectory matrix
$$X = \sum_{k=1}^{N_w} \lambda_k u_k v_k^t$$
2. Keep only the components associated with the largest singular values, assumed to carry the signal
3. Reconstruct the denoised signal from the truncated trajectory matrix by diagonal averaging
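A compact sketch of this SSA pipeline (embedding, truncated SVD, diagonal averaging) in plain NumPy, with the selected components left as a parameter:

```python
import numpy as np

def ssa(y, Nw=32, components=(0, 1)):
    """SSA denoising: embed y into its trajectory matrix, truncate the
    SVD to the selected components, reconstruct by diagonal averaging."""
    N = len(y)
    Nf = N - Nw + 1
    X = np.stack([y[i:i + Nf] for i in range(Nw)])   # Nw x Nf Hankel matrix
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Rank-truncated trajectory matrix keeping only the chosen components.
    X_hat = sum(s[k] * np.outer(U[:, k], Vt[k]) for k in components)
    # Diagonal averaging: average X_hat over each anti-diagonal i + j = n.
    x_hat = np.zeros(N)
    counts = np.zeros(N)
    for i in range(Nw):
        x_hat[i:i + Nf] += X_hat[i]
        counts[i:i + Nf] += 1
    return x_hat / counts
```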
Example
[Figure: singular values of the trajectory matrix and the first reconstructed components, with Nw = 32]
Singular values and reconstructed components with Nw = 32. From the graphs it appears that the first two singular values are likely to correspond to signal
Example
[Figure: denoising with SSA — noisy, original, and denoised signals (Usage vs. minutes)]
Denoising with SSA with Nw = 32, using only the first two components.
Other techniques
Other denoising approaches include, e.g., empirical mode decomposition [Flandrin et al., 2004; Boudraa et al., 2006] and independent component analysis [Comon, 1994]
Contents
1. Problem statement
2. Denoising
3. Detrending
3.1 Least Square regression
3.2 Other approaches
4. Interpolation of missing samples
5. Outlier removal
Trend+Seasonality model
Detrending
Given a signal x[n], estimate and remove the trend component x_trend[n]
Standard models
Least-square regression
▶ Model the trend as a linear combination of K + 1 basis functions β₀, …, β_K; the least-square estimator minimizes
$$\|x - \beta\alpha\|^2$$
where
$$\beta = \begin{pmatrix} \beta_0(0) & \cdots & \beta_K(0) \\ \beta_0(T_s) & \cdots & \beta_K(T_s) \\ \vdots & \ddots & \vdots \\ \beta_0((N-1)T_s) & \cdots & \beta_K((N-1)T_s) \end{pmatrix}$$
▶ Closed-form solution:
$$\hat{\alpha} = \left(\beta^T \beta\right)^{-1} \beta^T x, \qquad x_{\text{trend}} = \beta\hat{\alpha}$$
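For instance, with a polynomial basis β_k(t) = tᵏ, the closed-form solution can be computed with a least-square solver (a sketch; Ts and K are free parameters):

```python
import numpy as np

def detrend_ls(x, Ts=1.0, K=2):
    """Least-square detrending with a polynomial basis beta_k(t) = t^k:
    alpha_hat = (B^T B)^{-1} B^T x, x_trend = B alpha_hat."""
    t = np.arange(len(x)) * Ts
    B = np.vander(t, K + 1, increasing=True)   # columns: 1, t, ..., t^K
    alpha_hat, *_ = np.linalg.lstsq(B, x, rcond=None)
    x_trend = B @ alpha_hat
    return x - x_trend, x_trend
```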
Example
[Figures: least-square detrending examples — original signal, estimated trend, and detrended signal (Time in s)]
Other approaches
Other approaches include, e.g., detrending with empirical mode decomposition [Flandrin et al., 2004]
Contents
1. Problem statement
2. Denoising
3. Detrending
4. Interpolation of missing samples
4.1 Polynomial interpolation
4.2 Low-rank interpolation
4.3 Model-based interpolation
5. Outlier removal
[Figure: usage signal with missing samples (Usage vs. minutes)]
Polynomial interpolation
Given a time series x that we want to interpolate on the integer interval T = ⟦n_start, n_end⟧, the easiest interpolation strategy consists in using polynomial models for the reconstruction
▶ Constant value:
$$\forall n \in T, \quad \hat{x}[n] = \frac{x[n_{\mathrm{start}}-1] + x[n_{\mathrm{end}}+1]}{2}$$
▶ Linear interpolation:
$$\forall n \in T, \quad \hat{x}[n] = \beta_1 n + \beta_0$$
▶ Cubic interpolation:
$$\forall n \in T, \quad \hat{x}[n] = \beta_3 n^3 + \beta_2 n^2 + \beta_1 n + \beta_0$$
where the coefficients β_k are determined by solving a system of equations based on x[n_start − 2], x[n_start − 1], x[n_end + 1] and x[n_end + 2]
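A minimal sketch of linear and cubic-spline gap filling with NumPy/SciPy; here the missing set is given as a boolean mask rather than a single interval, which also covers several gaps at once:

```python
import numpy as np
from scipy.interpolate import CubicSpline

def interpolate_gap(x, missing):
    """Fill the masked samples by linear and cubic-spline interpolation
    from the known samples."""
    n = np.arange(len(x))
    known = ~missing
    x_lin = x.astype(float)
    x_lin[missing] = np.interp(n[missing], n[known], x[known])
    x_cub = x.astype(float)
    x_cub[missing] = CubicSpline(n[known], x[known])(n[missing])
    return x_lin, x_cub
```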
Example
[Figure: linear vs. cubic spline interpolation of the missing samples — original and interpolated signals (Minutes)]
Low-rank interpolation
▶ The low-rank assumption on the trajectory matrix can also be used for reconstructing missing samples:
$$X = \begin{pmatrix} x[0] & x[1] & \cdots & x[N-N_w] \\ x[1] & x[2] & \cdots & x[N-N_w+1] \\ \vdots & \vdots & \ddots & \vdots \\ x[N_w-1] & x[N_w] & \cdots & x[N-1] \end{pmatrix}$$
▶ In this case, we will use the Singular Value Decomposition adapted to data with missing values [Srebro et al., 2003]
▶ These techniques are efficient for medium-size missing patches, as the low-rank assumption is usually only valid for relatively small windows
Principle
Truncate the SVD of the trajectory matrix to its K largest components:
$$\hat{X} = \sum_{k=1}^{K} \lambda_k u_k v_k^t$$
Low-rank interpolation
1. Initialize the missing samples with simple rough estimates (e.g. linear interpolation)
2. Build the trajectory matrix and compute its rank-K truncation X̂
3. Update the missing samples from the reconstruction of X̂ (diagonal averaging), keeping the known samples untouched
4. Repeat steps 2 and 3 until convergence
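A sketch of this iterative scheme in NumPy, using the Nw = 300, K = 10 setting of the example below; for simplicity, the convergence test is replaced by a fixed number of iterations:

```python
import numpy as np

def lowrank_interpolate(x, missing, Nw=300, K=10, n_iter=20):
    """Iterative low-rank interpolation: alternate a rank-K projection of
    the trajectory matrix with re-imposing the known samples."""
    N = len(x)
    Nf = N - Nw + 1
    n = np.arange(N)
    y = x.astype(float)
    # Step 1: rough initialization of the missing samples.
    y[missing] = np.interp(n[missing], n[~missing], x[~missing])
    for _ in range(n_iter):
        # Step 2: trajectory matrix and rank-K truncated SVD.
        X = np.stack([y[i:i + Nf] for i in range(Nw)])
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        X_hat = (U[:, :K] * s[:K]) @ Vt[:K]
        # Step 3: diagonal averaging, keeping the known samples untouched.
        y_hat = np.zeros(N)
        counts = np.zeros(N)
        for i in range(Nw):
            y_hat[i:i + Nf] += X_hat[i]
            counts[i:i + Nf] += 1
        y[missing] = (y_hat / counts)[missing]
    return y
```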
Example
[Figure: low-rank interpolation — original signal and low-rank approximation (Minutes), with Nw = 300, K = 10]
Model-based interpolation
▶ Problem: how do we estimate the parameters from a time series with missing
data?
▶ Iterative solution
1. Initialization of the missing samples with simple rough estimates (set to zero,
constant or linear interpolation…)
2. Parameter inference from all samples
3. Reconstruction of the missing samples from the learned model
4. Repeat steps 2 and 3 until convergence
AR-based interpolation
▶ For an AR(p) model, given estimates â of the parameters, the signal can be reconstructed by assuming that
$$x[n] \approx -\sum_{i=1}^{p} \hat{a}_i x[n-i]$$
▶ The missing samples are chosen so as to minimize the prediction error energy
$$E(x) = \sum_{n=p}^{N-1} \left( x[n] + \sum_{i=1}^{p} \hat{a}_i x[n-i] \right)^2$$
AR-based interpolation
$$x^* = \underset{\tilde{x}:\ \forall n \notin T,\ \tilde{x}[n]=x[n]}{\mathrm{argmin}}\ E(\tilde{x})$$
▶ This optimization problem has a closed-form solution (least-square estimates), obtained by rewriting E(x) as a sum of terms depending on the missing samples n ∈ T and of terms depending only on the known samples
▶ Denoting by x_T the vector of missing samples, the criterion rewrites
$$E(x) = x_T^T B x_T + 2 x_T^T d + C$$
where, with the convention â₀ = 1 and $b_m = \sum_{l=0}^{p-m} \hat{a}_l \hat{a}_{l+m}$:
▶ ∀(t, t′) ∈ T²,
$$b_{t,t'} = \begin{cases} b_{|t-t'|} & \text{if } 0 \le |t-t'| \le p \\ 0 & \text{otherwise} \end{cases}$$
▶ ∀t ∈ T,
$$d_t = \sum_{\substack{-p \le k \le p \\ t-k \notin T}} b_{|k|}\, x[t-k]$$
▶ C is a constant depending only on the known samples
▶ The final problem is simply a linear system and thus easy to solve:
$$B x_T = -d$$
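A direct sketch of this linear system [Janssen et al., 1986], where the AR coefficients â are assumed already estimated and boundary terms are ignored:

```python
import numpy as np

def ar_interpolate(x, missing, a_hat):
    """Fill missing samples of an AR(p) signal by solving B x_T = -d;
    known samples are left untouched."""
    a = np.concatenate(([1.0], a_hat))   # convention a_0 = 1
    p = len(a_hat)
    # b_m = sum_{l=0}^{p-m} a_l a_{l+m}: autocorrelation of (1, a_1..a_p).
    b = np.array([a[:len(a) - m] @ a[m:] for m in range(p + 1)])
    T = np.flatnonzero(missing)
    B = np.zeros((len(T), len(T)))
    d = np.zeros(len(T))
    for i, t in enumerate(T):
        for j, t2 in enumerate(T):
            if abs(t - t2) <= p:
                B[i, j] = b[abs(t - t2)]
        # d_t sums over the known neighbors of t (missing ones excluded).
        for k in range(-p, p + 1):
            if 0 <= t - k < len(x) and not missing[t - k]:
                d[i] += b[abs(k)] * x[t - k]
    x_hat = x.astype(float)
    x_hat[T] = np.linalg.solve(B, -d)
    return x_hat
```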
AR-based interpolation
Example
[Figure: AR(90) interpolation of the missing samples (left, Minutes) and RMSE across iterations (right)]
Contents
1. Problem statement
2. Denoising
3. Detrending
4. Interpolation of missing samples
5. Outlier removal
5.1 Isolated samples
5.2 Contiguous samples
Outlier removal
[Figure: two examples of outliers — an isolated sample (left) and contiguous corrupted samples (right) (Time in s)]
Outliers, also called impulsive noise (as opposed to AWGN), correspond to spurious samples (isolated or contiguous) that take unlikely values
Outlier removal
Isolated samples
[Figure: signal with an isolated outlier (Time in s), shown at two zoom levels]
Histogram
[Figure: histogram of the sample values (Count in %, Value)]
If the values taken by the impulsive noise are particularly large with respect to the
signal, they can be detected by looking at the histogram of the values taken by the
samples: similar to outlier detection in statistical data
Median filtering
▶ Outliers can be detected AND removed by using a sliding median filter that replaces each value by the median of the samples in a window of length 2w + 1:
$$\hat{x}[n] = \mathrm{median}\left(x[n-w], \ldots, x[n+w]\right)$$
▶ Median filtering allows to smooth the time series while preserving the discontinuities
▶ Example: original signal [0.3 0.4 0.45] and noisy signal [0.3 0.9 0.45]
▶ Moving average filter: 0.9 → 0.55
▶ Median filter: 0.9 → 0.45
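The toy example above, checked numerically (note that scipy's medfilt zero-pads at the edges):

```python
import numpy as np
from scipy.signal import medfilt

x = np.array([0.3, 0.9, 0.45])
print(np.convolve(x, np.ones(3) / 3, mode="valid"))  # [0.55]: moving average
print(medfilt(x, kernel_size=3))  # [0.3, 0.45, 0.45]: the spike 0.9 -> 0.45
```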
Median filtering
[Figure: signal with isolated outliers before and after median filtering (Time in s)]
Contiguous samples
[Figure: signal with contiguous corrupted samples (Time in s), shown at two zoom levels]
Contiguous samples
▶ When the impulsive noise corrupts groups of contiguous samples, studying the values is not sufficient
▶ In order to retrieve the set of outliers T, using a model may be necessary
▶ Outliers: samples that are far from their predicted values according to a model
▶ Same principle as model-based interpolation: parameter estimation, detection, interpolation, and reiterate
▶ Note: this task is close to the Anomaly Detection task (see Lecture 5)
▶ For instance, with an AR(p) model
$$x[n] = -\sum_{i=1}^{p} a_i x[n-i] + b[n]$$
the prediction residual is e[n] = x[n] + Σᵢ âᵢ x[n−i]
▶ If the model is adapted, the parameters well estimated and the noise variance low, this quantity must be rather small for samples that are not outliers [Oudre, 2015]
▶ Detection method with threshold λ: flag sample n as an outlier whenever |e[n]| > λ (see the sketch below)
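A sketch of this detector, where the residual is computed as an FIR filtering of x by [1, â₁, …, â_p] and the threshold λ is a free parameter:

```python
import numpy as np
from scipy.signal import lfilter

def detect_outliers_ar(x, a_hat, lam):
    """Flag samples whose AR prediction residual
    e[n] = x[n] + sum_i a_hat[i] * x[n-i] exceeds lam in magnitude."""
    e = lfilter(np.concatenate(([1.0], a_hat)), [1.0], x)
    return np.flatnonzero(np.abs(e) > lam), e
```

The detected samples can then be filled with the AR-based interpolation scheme of the previous section, and the estimation/detection/interpolation loop reiterated.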
[Figure: signal with contiguous outliers and detection result (Time in s)]
Example
[Figures: iterative detection and interpolation of contiguous outliers over successive iterations (Time in s)]
References
▶ Rubinstein, R., Bruckstein, A. M., & Elad, M. (2010). Dictionaries for sparse representation modeling. Proceedings of the IEEE, 98(6), 1045-1057.
▶ Percival, D. B., & Walden, A. T. (2000). Wavelet methods for time series analysis (Vol. 4). Cambridge University Press.
▶ Mallat, S. (1999). A wavelet tour of signal processing. Elsevier.
▶ Vautard, R., Yiou, P., & Ghil, M. (1992). Singular-spectrum analysis: A toolkit for short, noisy chaotic signals. Physica D: Nonlinear Phenomena, 58(1-4), 95-126.
▶ Flandrin, P., Goncalves, P., & Rilling, G. (2004, September). Detrending and denoising with empirical mode decompositions. In 2004 12th European Signal
Processing Conference (pp. 1581-1584). IEEE.
▶ Boudraa, A. O., & Cexus, J. C. (2006). Denoising via empirical mode decomposition. Proc. IEEE ISCCSP, 4(2006).
▶ Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36(3), 287-314.
▶ Lepot, M., Aubin, J. B., & Clemens, F. H. (2017). Interpolation in time series: An introductive overview of existing methods, their performance criteria and
uncertainty assessment. Water, 9(10), 796.
▶ McKinley, S., & Levine, M. (1998). Cubic spline interpolation. College of the Redwoods, 45(1), 1049-1060.
▶ Srebro, N., & Jaakkola, T. (2003). Weighted low-rank approximations. In Proceedings of the 20th International Conference on Machine Learning (ICML-03)
(pp. 720-727).
▶ Janssen, A. J. E. M., Veldhuis, R., & Vries, L. (1986). Adaptive interpolation of discrete-time signals that can be modeled as autoregressive processes. IEEE
Transactions on Acoustics, Speech, and Signal Processing, 34(2), 317-330.
▶ Oudre, L. (2015). Automatic detection and removal of impulsive noise in audio signals. Image Processing On Line, 5, 267-281.