
Time Series Analysis and Anomaly Detection of Industrial IoT Data
Part I: Foundations of Time Series Analysis
  Section 1: Deconstructing Time Series Data
    1.1 The Four Core Components of a Time Series
    1.2 Modeling Component Interactions: Additive vs. Multiplicative Decomposition
  Section 2: The Principle of Stationarity
    2.1 Defining Stationarity: A Time-Invariant Process
    2.2 The Importance of Stationarity in Modeling
    2.3 Validating Stationarity: Visual and Statistical Tests
      Table 2.1: Comparison of Stationarity Tests (ADF vs. KPSS)
    2.4 Achieving Stationarity: Common Transformations
  Section 3: Analyzing Temporal Dependencies with Correlation Functions
    3.1 The Autocorrelation Function (ACF): Measuring Total Correlation
    3.2 The Partial Autocorrelation Function (PACF): Isolating Direct Correlation
    3.3 Application in Model Identification: The ARIMA Framework
      Table 3.1: ACF/PACF Signature Patterns for Model Identification
Part II: A Comprehensive Guide to Time Series Anomaly Detection
  Section 4: A Taxonomy of Anomalies
  Section 5: Distance-Based Detection: The k-Nearest Neighbors (k-NN) Approach
    5.1 Core Principle: Anomalies as Isolated Points
    5.2 Algorithmic Breakdown for Anomaly Detection
    5.3 Advantages and Critical Limitations
  Section 6: Density-Based Detection: The Local Outlier Factor (LOF) Algorithm
    6.1 Core Principle: Relative Density as an Anomaly Indicator
    6.2 Algorithmic Breakdown: From Distance to a Factor Score
    6.3 Interpretation and Application
    6.4 Strengths and Weaknesses
  Section 7: Deep Learning Approaches to Anomaly Detection
    7.1 A Modern Taxonomy of Deep Learning Models
    7.2 The Reconstruction Paradigm: An In-Depth Analysis of Autoencoders
    7.3 Architecture Deep Dive: LSTM Autoencoders for Sequential Data
    7.4 Training, Inference, and Thresholding
    7.5 Survey of Advanced Architectures
  Section 8: Synthesis and Recommendations
    8.1 Comparative Analysis of Detection Methodologies
      Table 8.1: Comparative Matrix of Anomaly Detection Algorithms
    8.2 A Decision Framework for Selecting the Right Model
    8.3 Future Directions and Open Research Challenges

Part I: Foundations of Time Series Analysis
The analysis of time series data represents a distinct and challenging subfield of data science and statistics. Unlike cross-sectional data, where observations are independent, time series data is defined by its temporal ordering, where each data point is recorded at a consistent interval over a period. This inherent sequence introduces dependencies between observations, violating the assumptions of many standard statistical methods and necessitating a specialized set of tools and principles for analysis and modeling. The primary objectives of time series analysis are twofold: to understand the underlying structures and patterns within the historical data, and to leverage this understanding to model the time series or detect deviations from normal behavior. This foundational part of the report establishes the conceptual and statistical groundwork required for any rigorous application, with a particular focus on preparing data for the ultimate goal of anomaly detection.

Section 1: Deconstructing Time Series Data

The first step in any time series analysis is to decompose the data into its constituent components. This process allows an analyst to isolate and understand the different forces that combine to produce the observed data sequence. By breaking down the series, one can identify long-term movements, predictable cycles, and random noise, which is essential for accurate modeling and the identification of unusual events.

1.1 The Four Core Components of a Time Series

A time series can be conceptually broken down into four fundamental components. The systematic identification of these patterns is the first step toward building a successful detection model.

● Trend (T): The trend represents the long-term, secular movement of the series, indicating a general direction of increase, decrease, or stability over the entire observed period. It reflects the varying mean of the time series data. A trend does not need to be linear; it can be quadratic, exponential, or change direction over time. For instance, the trend of the overall vibration level of a machine developing a fault is mostly positive.

● Seasonality (S): Seasonality refers to predictable, repeating patterns or fluctuations that occur at a fixed and known frequency. These patterns are tied to calendar-based intervals, such as the time of day, day of the week, month, or quarter. Examples are ubiquitous in business, nature, and industrial data; for example, surface temperature (especially of outdoor machines) contains diurnal and yearly seasonal patterns. The key characteristic of seasonality is its constant and predictable period.

● Cyclicality (C): Cyclical variations are rises and falls that do not occur at a fixed or known frequency. Unlike seasonality, their duration and amplitude can vary from one cycle to the next, and they typically unfold over longer horizons, such as business or machine duty cycles.

● Irregularity / Noise / Residual (R): This component, also referred to as noise or the error term, represents the random, unpredictable fluctuations that remain after the trend, seasonal, and cyclical components have been removed from the series. These variations are caused by short-term, uncontrollable events such as sensor error or a sudden change in operating condition. In the context of modeling, this residual is what is left over after accounting for the predictable patterns.
1.2 Modeling Component Interactions: Additive vs. Multiplicative Decomposition

The relationship between these components can be formalized through a decomposition model. The choice of model depends on how the components interact with each other, particularly how seasonality relates to the trend.

● Additive Model: An additive model is expressed as:

  Y_t = T_t + S_t + R_t

  This model is appropriate when the magnitude of the seasonal variation is relatively constant over time and does not depend on the level of the trend.

● Multiplicative Model: A multiplicative model is expressed as:

  Y_t = T_t × S_t × R_t

  This model is necessary when the seasonal variation increases or decreases in magnitude in proportion to the level of the trend.

A key property of the multiplicative model is that it can be converted into an additive model by applying a logarithmic transformation:

  log(Y_t) = log(T_t) + log(S_t) + log(R_t)

This transformation often stabilizes the variance and makes the series easier to model.
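
As a concrete illustration, the sketch below decomposes a series both ways with statsmodels' seasonal_decompose. The synthetic monthly series, the column construction, and the period of 12 are illustrative assumptions, not values from this report.

```python
# A minimal sketch of classical decomposition with statsmodels; the synthetic
# monthly series and the seasonal period of 12 are assumptions for illustration.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly series: upward trend plus a seasonal swing that grows with the trend
idx = pd.date_range("2018-01-01", periods=96, freq="MS")
trend = np.linspace(10, 30, 96)
seasonal = 1 + 0.2 * np.sin(2 * np.pi * np.arange(96) / 12)
y = pd.Series(trend * seasonal + np.random.default_rng(0).normal(0, 0.5, 96), index=idx)

additive = seasonal_decompose(y, model="additive", period=12)              # Y = T + S + R
multiplicative = seasonal_decompose(y, model="multiplicative", period=12)  # Y = T * S * R

# The residual component is where point anomalies are easiest to spot
residual = multiplicative.resid.dropna()
```
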
The choice between an additive and multiplicative model is a critical first step in building a robust analysis, as it forms a fundamental assumption about the data-generating process. This decision directly impacts how "normal" behavior is defined, which is a prerequisite for identifying deviations. The core task of anomaly detection is to identify data points that depart from an expected pattern. This expected pattern is defined by the series' components.

To correctly identify a contextual anomaly, one whose abnormality depends on its context, the model must first understand the true relationship between trend and seasonality. If a practitioner mis-specifies this relationship, for instance by using an additive model for data where seasonality grows with the trend, the baseline for "normal" will be incorrect. During high-trend periods, the model will expect a smaller seasonal swing than is actually normal, leading to a high number of false positives. Conversely, during low-trend periods, it will expect a larger swing, potentially missing true anomalies and generating false negatives. Therefore, correct decomposition is not a statistical formality but a mandatory precursor to accurate contextual anomaly detection.
Section 2: The Principle of Stationarity

The concept of stationarity is one of the most important principles in time series analysis. It provides a theoretical foundation for many classical forecasting models and serves as a crucial data processing objective. A stationary process is one whose statistical properties do not change over time, making it far easier to analyze and predict than a non-stationary one.

2.1 Defining Stationarity: A Time-Invariant Process

A time series is considered stationary if its underlying statistical properties are independent of the point in time at which they are observed. This concept is formalized in two main ways:

● Strict Stationarity: A process is strictly stationary if the joint probability distribution of any set of observations (X_{t1}, X_{t2}, ..., X_{tk}) is identical to the joint probability distribution of a time-shifted set (X_{t1+h}, X_{t2+h}, ..., X_{tk+h}) for any time points and any time shift h. This is a very strong condition that implies all statistical moments (mean, variance, skewness, etc.) are constant over time. In practice, it is a difficult condition to verify and is rarely met by real-world data.

● Weak (or Covariance) Stationarity: This is a more practical and commonly used definition. A process is weakly stationary if it satisfies three conditions:

  1. The mean is constant and finite for all time: E[X_t] = μ.
  2. The variance is constant and finite for all time: Var(X_t) = σ².
  3. The autocovariance between any two observations depends only on the lag (the time difference) between them, not on their absolute position in time.

From this definition, it follows directly that any time series exhibiting a clear trend (a non-constant mean) or seasonality (a predictable, time-dependent pattern in the mean) is, by definition, non-stationary.

2.2 The Importance of Stationarity in Modeling

Stationarity is a critical assumption for many classical time series models, most notably the Autoregressive Integrated Moving Average (ARIMA) family of models. A stationary process is fundamentally easier to analyze because its statistical properties are consistent over time. This consistency allows models to learn the underlying structure of the data and make more reliable forecasts. When a series is stationary, we can assume that the patterns observed in the past will continue into the future. By transforming a non-stationary series into a stationary one, we can effectively apply standard regression-based techniques that would otherwise be invalid for time-dependent variables.

The process of achieving stationarity should be viewed as more than just a data preparation step for specific models. It is, at its core, a powerful signal isolation technique. A time series is a composite of predictable elements (Trend, Seasonality) and unpredictable ones (Irregularity/Noise). Anomalies are, by definition, unexpected, rare, and irregular events that deviate from the norm; they are conceptually part of the "Irregularity" component. The techniques used to induce stationarity, such as differencing and detrending, are explicitly designed to remove the trend and seasonal components. The result of this process is a stationary residual series whose fluctuations represent the "noise" around a constant mean. Therefore, the act of making a series stationary is the first and most fundamental step in isolating the very signal that contains the anomalies. Any anomaly detection method that operates on the statistical properties of the data, such as thresholding based on standard deviations, will perform more reliably and accurately on these stationary residuals than on the raw, non-stationary data.
2.3 Validating Stationarity: Visual and Statistical Tests

Before applying transformations, one must first determine if a series is non-stationary. This can be done through both visual inspection and formal statistical tests.

● Visual Inspection: A simple time plot of the data is often the first and most intuitive check. Obvious upward or downward trends, or clear changes in the variance (e.g., the fluctuations becoming wider or narrower over time), are strong visual indicators of non-stationarity. Another simple method is to split the series into two or more contiguous parts and compare their summary statistics (mean, variance); significant differences suggest non-stationarity.

● Statistical Tests: For a more objective and rigorous assessment, unit root tests are employed. A "unit root" is a feature of some stochastic processes that can cause problems in statistical inference involving time series models. The presence of a unit root is a mathematical confirmation of non-stationarity. The two most common tests for this are the Augmented Dickey-Fuller (ADF) test and the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test. It is crucial to understand that these two tests operate with opposing null hypotheses.

  ○ Augmented Dickey-Fuller (ADF) Test: The null hypothesis (H0) of the ADF test is that the time series is non-stationary (i.e., it possesses a unit root). The alternative hypothesis is that the series is stationary. Therefore, a low p-value (typically < 0.05) provides evidence to reject the null hypothesis and conclude that the series is stationary.

  ○ Kwiatkowski-Phillips-Schmidt-Shin (KPSS) Test: The null hypothesis (H0) of the KPSS test is that the time series is stationary around a deterministic trend. The alternative is non-stationarity. In this case, a low p-value (< 0.05) leads to the rejection of the null hypothesis, suggesting that the series is non-stationary and requires differencing.

The opposing nature of these hypotheses can be a source of confusion, but using them in tandem can provide a more robust conclusion. For example, if the ADF test fails to reject non-stationarity and the KPSS test rejects stationarity, one can be very confident that the series is non-stationary.
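
The sketch below shows one way to run the two tests in tandem with statsmodels; the helper function name and the 0.05 significance level are illustrative assumptions, and `series` is assumed to be a pandas Series of observations.

```python
# A minimal sketch of running ADF and KPSS side by side with statsmodels.
from statsmodels.tsa.stattools import adfuller, kpss

def check_stationarity(series, alpha=0.05):
    adf_p = adfuller(series.dropna(), autolag="AIC")[1]              # H0: non-stationary (unit root)
    kpss_p = kpss(series.dropna(), regression="c", nlags="auto")[1]  # H0: stationary
    adf_stationary = adf_p < alpha       # reject "non-stationary"
    kpss_stationary = kpss_p >= alpha    # fail to reject "stationary"
    if adf_stationary and kpss_stationary:
        verdict = "stationary"
    elif not adf_stationary and not kpss_stationary:
        verdict = "non-stationary"
    else:
        verdict = "inconclusive (tests disagree)"
    return adf_p, kpss_p, verdict
```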

Table 2.1: Comparison of Stationarity Tests (ADF vs. KPSS)

To prevent common misinterpretations of test results, the following table provides a clear, at-a-glance reference for these two fundamental tests.

Feature | Augmented Dickey-Fuller (ADF) | Kwiatkowski-Phillips-Schmidt-Shin (KPSS)
Null Hypothesis (H0) | The series is non-stationary (has a unit root). | The series is stationary (around a mean or trend).
p-value < 0.05 | Reject H0: the series is stationary. | Reject H0: the series is non-stationary.
p-value > 0.05 | Fail to reject H0: the series is non-stationary. | Fail to reject H0: the series is stationary.
Primary Use | To test if differencing is required. | To confirm if a series is already stationary.

2.4 Achieving Stationarity: Common Transformations

If a time series is found to be non-stationary, it must be transformed before it can be used with many classical models. Several techniques exist to achieve this; a short code sketch of these transformations follows this list.

● Differencing: This is the most common method for removing a trend. First-order differencing involves creating a new series by subtracting the previous observation from the current observation: Y'_t = Y_t − Y_{t−1}. This transformation effectively removes a linear trend. If the trend is quadratic, second-order differencing (Y''_t = Y'_t − Y'_{t−1}) may be necessary, though it is rare to require more than two levels of differencing. If seasonality is present, it can be removed with seasonal differencing, where the observation from the previous season is subtracted: Y'_t = Y_t − Y_{t−m}, where m is the seasonal period (e.g., 12 for monthly data).

● Power Transformations: When the variance of a series is not constant (a condition known as heteroscedasticity), a power transformation can be applied to stabilize it. This should typically be done before differencing. The most common transformations are the logarithm, square root, and cube root. A log transform is particularly effective for data exhibiting exponential growth, as it can convert the exponential trend into a linear one, which can then be removed by differencing.

● Detrending: An alternative to differencing is to explicitly model and remove the trend. This can be done by fitting a regression model (e.g., linear or quadratic) with time as the predictor variable and then subtracting the fitted trend line from the original series. The remaining residuals should form a stationary series.
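
The sketch below shows one possible pandas/numpy implementation of these transformations; `series` is assumed to be a pandas Series indexed by time, and m = 12 assumes monthly data.

```python
# A minimal sketch of the common stationarity transformations described above.
import numpy as np
import pandas as pd

def first_difference(series):
    # Y'_t = Y_t - Y_{t-1}: removes a linear trend
    return series.diff().dropna()

def seasonal_difference(series, m=12):
    # Y'_t = Y_t - Y_{t-m}: removes a seasonal pattern of period m
    return series.diff(m).dropna()

def log_transform(series):
    # Stabilizes variance when fluctuations grow with the level of the series
    return np.log(series)

def detrend_linear(series):
    # Fit a linear trend against time and subtract it; residuals should be stationary
    t = np.arange(len(series))
    slope, intercept = np.polyfit(t, series.to_numpy(), deg=1)
    return pd.Series(series.to_numpy() - (slope * t + intercept), index=series.index)
```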

Section 3: Analyzing Temporal Dependencies with Correlation Functions

Once a time series is stationary, the next step is to investigate the structure of its temporal dependencies. This is accomplished by analyzing the correlation between an observation and its past values. The primary tools for this analysis are the Autocorrelation Function (ACF) and the Partial Autocorrelation Function (PACF). These functions are indispensable for understanding the memory of a process and for identifying the appropriate structure of ARIMA models.

3.1 The Autocorrelation Function (ACF): Measuring Total Correlation

The Autocorrelation Function (ACF) measures the linear relationship between a time series and its lagged values. Specifically, the ACF at lag k calculates the correlation coefficient between observations that are k time steps apart, i.e., the correlation between X_t and X_{t−k}.

An important characteristic of the ACF is that it measures the total correlation. This includes both the direct correlation between X_t and X_{t−k} and any indirect correlation that is mediated through the intervening lags (X_{t−1}, X_{t−2}, ..., X_{t−k+1}). For example, a strong correlation at lag 2 could be due to X_{t−2} directly influencing X_t, or it could be an artifact of X_{t−2} strongly influencing X_{t−1}, which in turn strongly influences X_t. The ACF captures both of these effects.

When plotted, the ACF provides strong visual cues about the nature of the time series. A plot of a non-stationary series will typically show an ACF that decays very slowly to zero, as each observation is highly correlated with its recent predecessors due to the trend. For a stationary series, significant spikes at regular intervals (e.g., at lags 12, 24, 36 for monthly data) are a clear indicator of seasonality.
3.2 The Partial Autocorrelation Function (PACF): Isolating Direct Correlation

The Partial Autocorrelation Function (PACF) provides a more refined view of temporal dependence. The PACF at lag k measures the correlation between X_t and X_{t−k} after removing the linear influence of the intermediate lags (X_{t−1}, X_{t−2}, ..., X_{t−k+1}). In essence, it isolates the direct relationship between two observations at a specific lag, controlling for the effects of the shorter lags.

This ability to measure direct correlation makes the PACF the primary tool for identifying the order of an Autoregressive (AR) process. An AR(p) process is one where the current value is a linear combination of the p previous values. The PACF of such a process will show a significant spike at lag p and then abruptly cut off to zero (or within the confidence interval) for all subsequent lags.
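
To see these signatures in practice, the sketch below plots the ACF and PACF of a synthetic AR(2) process with statsmodels; the process coefficients and the 40-lag horizon are illustrative assumptions.

```python
# A minimal sketch of ACF/PACF inspection with statsmodels on a synthetic AR(2) series.
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

rng = np.random.default_rng(0)
x = np.zeros(500)
for t in range(2, 500):
    x[t] = 0.6 * x[t - 1] + 0.3 * x[t - 2] + rng.normal()  # stationary AR(2)

fig, axes = plt.subplots(2, 1, figsize=(10, 6))
plot_acf(x, lags=40, ax=axes[0])    # tails off gradually for an AR process
plot_pacf(x, lags=40, ax=axes[1])   # should cut off sharply after lag 2 here
plt.tight_layout()
plt.show()
```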

3.3 Application in Model Identification: The ARIMA Framework

The combined analysis of ACF and PACF plots is the cornerstone of the Box-Jenkins methodology for identifying the parameters of ARIMA models.

An ARIMA(p, d, q) model has three components:

● AR(p) - Autoregressive: This component specifies that the current value of the series is regressed on its own p previous values. The order p is determined by examining the PACF plot, which should exhibit a sharp cutoff after lag p.

● I(d) - Integrated: This component specifies the number of times d that the raw data has been differenced to achieve stationarity. This is determined prior to ACF/PACF analysis using the methods described in Section 2.

● MA(q) - Moving Average: This component specifies that the current value is a function of the q previous forecast errors (or random shocks). The order q is determined by examining the ACF plot, which should exhibit a sharp cutoff after lag q.

For data with strong seasonality, the SARIMA(p, d, q)(P, D, Q)m model is used. This is an extension of ARIMA that adds seasonal components. The parameters (P, D, Q) are the seasonal counterparts to (p, d, q), and m represents the length of the seasonal period (e.g., m = 12 for monthly data with a yearly pattern). This model is necessary when seasonality is a significant factor that a standard ARIMA model cannot adequately capture.
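
A minimal fitting sketch with statsmodels follows; `series` and the (1, 1, 1) orders are placeholders to be replaced by the values suggested by the ACF/PACF analysis, not parameters taken from this report.

```python
# A minimal sketch of fitting ARIMA and SARIMA models with statsmodels.
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX

arima_fit = ARIMA(series, order=(1, 1, 1)).fit()            # ARIMA(p, d, q)
sarima_fit = SARIMAX(series, order=(1, 1, 1),
                     seasonal_order=(1, 1, 1, 12)).fit(disp=False)  # SARIMA(p,d,q)(P,D,Q)m

# In-sample residuals can feed a simple threshold-based anomaly check
residuals = arima_fit.resid
```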

Beyond their classical use for identifying a single, static model for an entire series, correlation functions can be adapted into dynamic feature engineering tools for detecting more complex anomalies. The expected correlation structure of a process is a key part of its "normal" behavior. A sudden change or break in this correlation structure is, itself, a type of anomaly. An anomaly is a deviation from the norm, and this "norm" encompasses not just the values of the data but also their interrelationships. A "pattern change" or "shapelet" anomaly might not involve extreme point values but could manifest as a shift in how current values relate to past values. Instead of computing a single ACF and PACF for the entire series, one could compute these functions over a rolling window. This process would generate a new time series for each significant lag, where the values are the ACF or PACF coefficients at that lag. A significant and abrupt change in this new time series of correlation coefficients would signal a structural break in the underlying process. This represents a sophisticated, collective anomaly that would be entirely invisible to simple point-based detectors. This technique elevates ACF and PACF from static analysis tools to dynamic feature generators for advanced, state-aware anomaly detection systems.
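
A small sketch of this rolling-correlation idea is given below; `series` is assumed to be a pandas Series, and the window length and lag are arbitrary illustrative choices.

```python
# A minimal sketch of the rolling-ACF feature described above: compute the
# autocorrelation at a given lag over a sliding window, producing a new time
# series of correlation coefficients.
import pandas as pd

def rolling_autocorr(series, window=100, lag=1):
    return series.rolling(window).apply(lambda w: w.autocorr(lag=lag), raw=False)

# An abrupt jump or drop in this derived series indicates a structural break
# (a collective anomaly) even when individual values look unremarkable.
```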

Table 3.1: ACF/PACF Signature Patterns for Model Identification

The following table provides a classic reference for interpreting correlograms, a fundamental skill for translating the visual patterns of the plots into concrete model specifications for stationary data.

Process | ACF Behavior | PACF Behavior
AR(p) | Tails off (decays exponentially or with a damped sine wave) | Cuts off sharply after lag p
MA(q) | Cuts off sharply after lag q | Tails off (decays exponentially or with a damped sine wave)
ARMA(p,q) | Tails off (begins after lag q) | Tails off (begins after lag p)
Part II: A Comprehensive Guide to Time Series Anomaly Detection
Having established the foundational principles of time series analysis, this part of the report transitions to the primary focus: the detection of anomalies. Anomaly detection, also known as outlier or novelty detection, is the process of identifying data points, events, or observations that deviate significantly from the expected pattern of a dataset. In the context of time series, these deviations can signal critical events such as operating condition changes, sensor issues, and faults. This section provides a systematic exploration of anomaly detection, beginning with a formal classification of anomaly types and then proceeding to a deep, comparative analysis of three major methodological families: distance-based, density-based, and deep learning approaches.

Section 4: A Taxonomy of Anomalies

The effectiveness of any anomaly detection system is critically dependent on a clear understanding of the type of anomaly it is designed to find. Not all anomalies are created equal, and an algorithm optimized for one type may be completely blind to another. The literature broadly classifies anomalies into three primary categories, which serve as a foundational taxonomy for the field.

● Point Anomalies: A point anomaly is an individual data point that deviates sharply and significantly from the rest of the data. Also known as a global outlier, this is the simplest and most common form of anomaly. These anomalies typically represent one-off events, measurement errors, or system glitches that cause a value to fall far outside the normal range. For example, in a dataset of daily temperature readings, a single reading of an impossibly high temperature would be a point anomaly. Their distinct and isolated nature makes them relatively straightforward to detect with statistical methods like Z-scores or simple distance measures.

● Contextual (or Conditional) Anomalies: A contextual anomaly is a data point that is considered anomalous only within a specific context. The value of the data point itself may not be extreme or unusual in a broader sense, but its occurrence at a particular time or under specific circumstances makes it abnormal. The context provides the baseline for expected behavior. Detecting these anomalies requires the model to understand the context, such as the running speed of the machine, seasonality, time of day, or other recurring patterns.

● Collective Anomalies: A collective anomaly occurs when a sequence or collection of related data points is anomalous as a group, even if each individual point within the sequence appears normal in isolation. The anomaly lies in the combined behavior or pattern of the group. This often indicates a sustained issue, a systemic shift, or a coordinated event. For example, a slight but persistent daily drop in the volume of data processed by a pipeline might not trigger any alarms on a single day. However, the collective downward trend over a week is an anomalous pattern that signals a developing problem.

This anomaly taxonomy is not merely a descriptive classification; it serves as a prescriptive framework that directly dictates the necessary capabilities of the detection algorithm. There is a direct mapping from the type of anomaly being targeted to the required model architecture and its level of awareness.

1. Point anomalies are defined by their value in isolation from others. This implies that a stateless algorithm, which evaluates each point individually against a global or local threshold (such as a Z-score or a simple distance-based score), is sufficient for their detection.

2. Contextual anomalies are defined by their value relative to their temporal context. This implies that the algorithm must be context-aware. It needs to model or be explicitly provided with information about recurring patterns like seasonality or time of day to establish a context-specific baseline for what constitutes "normal" behavior.

3. Collective anomalies are defined by the behavior of a sequence of points as a whole. This implies that the algorithm must be stateful or sequence-aware. It cannot evaluate points individually but must process a window or sequence of data to identify anomalous patterns. This requirement points directly toward models like Recurrent Neural Networks (e.g., LSTMs) or Temporal Convolutional Networks (TCNs), which are explicitly designed to process sequences and maintain an internal state or memory.

A practitioner who fails to match the algorithm's capability to the target anomaly type is destined to fail. Attempting to find a collective anomaly with a point-based Z-score method is conceptually flawed and will not work. Likewise, trying to find a contextual anomaly without providing seasonal or temporal context to the model is equally bound to fail. Therefore, a clear characterization of the target anomaly type is the first and most crucial step in designing an effective detection solution.

Section 5: Distance-Based Detection: The k-Nearest Neighbors (k-NN) Approach

Distance-based methods are among the most intuitive approaches to anomaly detection. They operate on a simple yet powerful premise: normal data points tend to exist in dense neighborhoods, while anomalous points are isolated and lie far from their peers in the feature space. The k-Nearest Neighbors (k-NN) algorithm is a classic and widely used example of this approach.

5.1 Core Principle: Anomalies as Isolated Points

The k-NN algorithm is a non-parametric, instance-based, or "lazy" learning method. It is considered "lazy" because it does not build an explicit model during a training phase; instead, it stores the entire training dataset and performs computations only at the time of prediction or inference. For anomaly detection, the core assumption is that an anomalous data point will have a much larger distance to its nearest neighbors compared to a normal data point.

5.2 Algorithmic Breakdown for Anomaly Detection

The application of k-NN for anomaly detection is a straightforward, multi-step process:

1. Select the Hyperparameter k: The analyst must first choose the number of neighbors, k, to consider for each point. This is a critical hyperparameter that significantly influences the algorithm's performance.

2. Calculate Distances: For a given data point (either from the training set or a new, unseen point), its distance to every other point in the dataset is calculated. Several distance metrics can be used, with the choice depending on the nature of the data. The most common is the Euclidean distance for continuous, numerical data. Other options include the Manhattan distance (also for continuous data) and the Hamming distance for categorical data. The Euclidean distance between two vectors p and q in an n-dimensional space is given by:

   d(p, q) = sqrt( Σ_{i=1}^{n} (q_i − p_i)² )

3. Identify Nearest Neighbors: After calculating all distances, the k points with the smallest distances to the target point are identified as its nearest neighbors.

4. Calculate Anomaly Score: The anomaly score for the target point is then calculated based on these neighbors. A common and effective method is to define the anomaly score as the distance to the k-th nearest neighbor. A point with a significantly larger score than most other points in the dataset is flagged as an anomaly. Another approach is to use the average distance to all k nearest neighbors.
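
The steps above map onto a few lines of scikit-learn; in the sketch below, the array X (shape (n_samples, n_features), e.g. sliding windows of the series), the value of k, and the percentile cutoff are illustrative assumptions.

```python
# A minimal sketch of the k-NN anomaly score described above.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_anomaly_scores(X, k=5):
    # k + 1 neighbors because each point's nearest neighbor is itself
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    distances, _ = nn.kneighbors(X)
    # Anomaly score = distance to the k-th nearest neighbor (column 0 is the point itself)
    return distances[:, -1]

# Example thresholding: flag points with unusually large scores
# scores = knn_anomaly_scores(X); anomalies = scores > np.percentile(scores, 99)
```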

5.3 Advantages and Critical Limitations

The k-NN algorithm offers several advantages that make it an attractive baseline method for anomaly detection.

● Advantages: Its primary strengths are its simplicity and intuitiveness. The logic is easy to understand and implement from scratch. As a non-parametric method, it makes no assumptions about the underlying distribution of the data. Furthermore, because it has no explicit training phase, it can easily adapt to new data points as they become available.

● Limitations: Despite its simplicity, k-NN suffers from several significant drawbacks that limit its applicability in many modern scenarios.

  ○ Computational Complexity: The need to calculate the distance from a target point to every other point in the dataset makes the algorithm computationally expensive. The complexity is typically O(N²) for a dataset of size N, which becomes prohibitive for large datasets.

  ○ Parameter Sensitivity: The performance of k-NN is highly sensitive to the choice of the hyperparameter k. A small value of k can make the model susceptible to noise, where small, insignificant clusters might be incorrectly flagged as anomalies. Conversely, a large value of k can cause the algorithm to overlook smaller, more localized anomalies.

  ○ The Curse of Dimensionality: This is arguably the most critical limitation of k-NN and other distance-based methods. In high-dimensional feature spaces, the concept of distance becomes less meaningful. As the number of dimensions increases, the distance between any two points in the space tends to become almost equal. This phenomenon severely degrades the performance of k-NN, as it becomes difficult to distinguish between "near" and "far" neighbors.

The "curse of dimensionality" is not just a theoretical concern but a practical barrier that defines the utility of k-NN in modern time series analysis. While k-NN can be an effective and interpretable baseline model for finding point anomalies in univariate or low-dimensional multivariate series, it is theoretically and practically ill-suited for the high-dimensional data commonly generated by IoT sensors, financial systems, and industrial monitoring equipment. The algorithm's core mechanism (the distance metric) becomes unreliable in these settings. This establishes a clear, practical guideline for practitioners: if the time series has a low number of dimensions (e.g., fewer than 10), k-NN is a reasonable starting point. However, if the dimensionality is high, its use is discouraged. In such cases, methods with implicit or explicit dimensionality reduction capabilities, such as Autoencoders or Isolation Forests, should be considered immediately, as they are designed to overcome this fundamental challenge.
Section 6: Density-Based Detection: The Local Outlier Factor (LOF) Algorithm

Density-based methods offer a more nuanced approach to anomaly detection than simple distance-based techniques. Instead of just measuring isolation, they consider the density of a point's local neighborhood. The Local Outlier Factor (LOF) algorithm is a seminal and powerful example of this approach, designed to identify outliers by measuring their degree of isolation relative to their surrounding neighborhood.

6.1 Core Principle: Relative Density as an Anomaly Indicator

LOF is an unsupervised, density-based algorithm that assigns an anomaly score to each data point by measuring its local density deviation with respect to its neighbors. The fundamental idea behind LOF is that an anomalous point will have a substantially lower local density than its neighbors, making it a "local" outlier. This focus on local, relative density allows LOF to successfully identify anomalies in datasets where different regions have different densities, a scenario where global distance-based methods might fail.

6.2 Algorithmic Breakdown: From Distance to a Factor Score

The LOF algorithm builds upon concepts from k-NN but computes a more sophisticated score through a series of steps:

1. k-distance of a point: For any given point A, its k-distance is defined as the distance to its k-th nearest neighbor. This establishes the radius of the local neighborhood.

2. Reachability Distance (RD): The reachability distance of a point A from a neighbor B is defined as the maximum of either the true distance between A and B or the k-distance of B:

   RD_k(A, B) = max(k-distance(B), d(A, B))

   This has a smoothing effect: for points A that are very close to B (i.e., within B's dense neighborhood), their reachability distance from B is capped at B's k-distance. This prevents points in a dense cluster from having artificially low reachability distances.

3. Local Reachability Density (LRD): The LRD of a point A is the inverse of the average reachability distance from A to all of its k nearest neighbors.

4. Local Outlier Factor (LOF): Finally, the LOF score of point A is calculated as the ratio of the average LRD of its k neighbors to its own LRD. This score is a measure of how isolated a point is relative to its surrounding neighborhood.
6.3 Interpretation and Application

The resulting LOF score for each point is interpreted as follows:

● LOF ≈ 1: The point has a density similar to its neighbors and is considered an inlier.

● LOF < 1: The point is in a region that is denser than its neighbors, making it a strong inlier.

● LOF > 1: The point is in a region that is less dense (more sparse) than its neighbors, indicating it is a potential outlier or anomaly.

In practice, a threshold is set on the LOF score (e.g., 1.5 or 2.0) to formally classify points as anomalies. As with k-NN, the choice of the neighborhood size k (often referred to as n_neighbors or minPts in software libraries) is a critical hyperparameter that must be tuned for the specific dataset.
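
A minimal scikit-learn sketch of this scoring and thresholding is shown below; the array X (shape (n_samples, n_features), e.g. sliding windows of the series), the n_neighbors value, and the 1.5 cutoff are illustrative assumptions.

```python
# A minimal sketch of LOF scoring with scikit-learn's LocalOutlierFactor.
from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor(n_neighbors=20, contamination="auto")
labels = lof.fit_predict(X)                  # -1 for outliers, +1 for inliers
lof_scores = -lof.negative_outlier_factor_   # ~1 for inliers, noticeably > 1 for outliers

# Alternatively, apply a manual threshold on the factor itself
anomalies = lof_scores > 1.5
```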

6.4 Strengths and Weaknesses

● Strengths: The primary advantage of LOF is its ability to identify local outliers in datasets with clusters of varying densities. A point that is part of a sparse cluster might have a large distance to its neighbors, but if its neighbors are also part of that same sparse cluster, its LOF score will be close to 1. A global method might incorrectly flag all points in that sparse cluster as anomalies.

● Weaknesses: LOF shares some of the same limitations as k-NN. It can be computationally expensive due to the numerous distance calculations. It is also sensitive to the choice of k and can produce a high rate of false positives if not carefully tuned. Furthermore, while more robust than k-NN, it can still suffer from the curse of dimensionality in very high-dimensional spaces.

The true innovation of LOF lies in its use of a relative density measure. This relativity makes it uniquely suited for analyzing time series that exhibit natural shifts in volatility or behavior, often referred to as regime changes. Such shifts are common in financial markets and industrial operational data. Consider a stock price time series that has periods of stable, low volatility (forming a dense cluster of data points) and periods of turbulent, high volatility (forming a sparser cluster). A simple, global density algorithm might flag all points in the high-volatility period as anomalous simply because their absolute density is low. LOF, in contrast, would evaluate a point within the volatile period and observe that its neighbors are also in a sparse region. Their LRDs would be similar, resulting in an LOF score close to 1, correctly identifying the point as part of a "normal" (albeit volatile) regime. An anomaly for LOF would be a point that represents a transition between regimes or a point that is isolated even from its local neighborhood. For example, a single "flash crash" data point would be in a very sparse region, but its immediate neighbors (from just before the crash) would be in a dense region. This large discrepancy in local densities would yield a very high LOF score, correctly flagging the event as anomalous. Thus, LOF's relative nature provides robustness against the inherent non-stationarity of variance (heteroscedasticity) found in many real-world time series.
Section 7: Deep Learning Approaches to Anomaly Detection

In recent years, deep learning has emerged as the state-of-the-art paradigm for a wide range of machine learning tasks, and time series analysis is no exception. Deep neural networks have proven exceptionally capable of modeling the complex, high-dimensional, and non-linear patterns that characterize modern time series data from sources like financial markets, IoT sensors, and healthcare monitoring systems.

7.1 A Modern Taxonomy of Deep Learning Models

Deep learning models for time series anomaly detection (TSAD) can be broadly categorized based on their core strategy. This taxonomy helps to structure the vast landscape of available architectures and understand their fundamental approaches to identifying deviations from normalcy.

● Forecasting-Based Models: These models are trained to predict the next point or a future sequence of points based on a window of recent historical data. The underlying assumption is that normal, predictable data can be forecasted with low error. An anomaly is then declared when there is a large discrepancy (a high prediction error) between the model's forecast and the actual observed value. Architectures like LSTMs, GRUs, and Transformers are commonly used in this approach.

● Reconstruction-Based Models: This is the most prevalent unsupervised approach. These models are trained to learn a compressed, low-dimensional representation of normal data and then reconstruct the original input from this representation. The principle is that the model will become an expert at reconstructing normal patterns. When an anomalous input is provided, the model will struggle to reconstruct it accurately, resulting in a high reconstruction error. This error serves as the anomaly score. Autoencoders (AEs) and their variants (VAEs, GANs) are the cornerstone of this approach.

● Representation-Based Models: These models focus on learning rich, informative embeddings (representations) of the time series data in a latent space. The goal is to learn a mapping where normal data points cluster together and anomalies are mapped to sparse regions of the space. Anomaly detection is then performed in this learned latent space using a secondary technique like clustering or density estimation. Contrastive learning methods are a key example of this strategy.

7.2 The Reconstruction Paradigm: An In-Depth Analysis of Autoencoders

The reconstruction-based approach, particularly using Autoencoders (AEs), has become a dominant strategy for unsupervised anomaly detection in complex time series.

● Core Principle: An Autoencoder is a type of unsupervised neural network that is trained to reconstruct its own input. It is composed of two main parts: an encoder, which compresses the high-dimensional input data into a lower-dimensional latent space representation (also called a bottleneck), and a decoder, which takes this compressed representation and attempts to reconstruct the original input.

● Application to Anomaly Detection: The power of AEs for anomaly detection comes from a specific training strategy: the model is trained exclusively on data that is known to be normal. Through this process, the network becomes an expert at learning the intricate patterns and correlations inherent in normal data, enabling it to reconstruct normal inputs with very low error. When the trained model is subsequently presented with an anomalous input (one that does not conform to the learned patterns), it will be unable to reconstruct it accurately. This failure results in a high reconstruction error, which serves as a powerful and reliable anomaly score.

The power of the autoencoder approach can be understood as a sophisticated, non-linear manifold learning technique. The reconstruction error is not an arbitrary metric but a geometrically meaningful measure of a data point's distance to a learned "manifold of normality." In geometric terms, the set of all possible "normal" data points can be conceptualized as lying on or near a complex, lower-dimensional surface (a manifold) that is embedded within the high-dimensional input space. By training the AE to minimize reconstruction error exclusively on normal data, the process effectively forces the encoder-decoder pair to learn the shape of this normal manifold. Anomalies, by definition, do not follow these normal patterns and therefore lie "off-manifold". When an anomalous point is passed to the encoder, it is projected onto the learned latent space, but this projection is inherently flawed because the point was not part of the space the AE was trained on. The decoder then attempts to reconstruct the original point from this flawed projection, inevitably resulting in a high error. The reconstruction error, therefore, serves as a proxy for the distance of a data point to the learned manifold. This provides a far more robust and nuanced definition of normalcy than linear methods like PCA or direct distance metrics like k-NN, explaining why AEs are so powerful for complex, high-dimensional data.


7.3 Architecture Deep Dive: LSTM Autoencoders for Sequential Data

For time series data, standard AEs with fully connected (dense) layers are insufficient because they process each input independently and fail to capture temporal dependencies. To address this, Long Short-Term Memory (LSTM) networks are integrated into the autoencoder architecture. LSTMs are a special type of Recurrent Neural Network (RNN) explicitly designed to learn from sequential data by maintaining an internal memory or cell state, making them ideal for this task.

The architecture of an LSTM Autoencoder uses LSTM layers in both the encoder and the decoder:

1. The Encoder consists of one or more LSTM layers that process an input sequence (a window of time series data). It reads the sequence step-by-step and compresses the information into a single fixed-size vector, which represents the final hidden state of the LSTM. This vector is the latent space representation of the entire input sequence.

2. A RepeatVector layer is then used to duplicate this latent vector, creating a sequence of identical vectors, one for each time step of the desired output sequence. This provides the initial input for the decoder at every time step.

3. The Decoder consists of one or more LSTM layers that take the repeated latent vector sequence as input and work to reconstruct the original input sequence, one time step at a time. The final output is a sequence of the same length as the input.

Python implementations using libraries like Keras and TensorFlow demonstrate how these layers are stacked to create the full model.
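
The sketch below is one such Keras stacking, following the encoder / RepeatVector / decoder layout described above; the window length, feature count, and layer sizes are illustrative assumptions rather than values from this report.

```python
# A minimal Keras sketch of an LSTM Autoencoder for windows of time series data.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, RepeatVector, TimeDistributed, Dense

timesteps, n_features = 30, 1   # length of each input window and number of sensors (assumed)

model = Sequential([
    # Encoder: compress the whole window into a single latent vector
    LSTM(64, activation="tanh", input_shape=(timesteps, n_features)),
    # Repeat the latent vector once per output time step
    RepeatVector(timesteps),
    # Decoder: rebuild the sequence step by step from the repeated latent vector
    LSTM(64, activation="tanh", return_sequences=True),
    TimeDistributed(Dense(n_features)),
])
model.compile(optimizer="adam", loss="mse")
model.summary()
```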

7.4 Training, Inference, and Thresholding

The process of using an LSTM Autoencoder for anomaly detection involves three key phases:

● Training: The model is trained in a purely unsupervised manner, where the input data also serves as the target output. The objective is for the model to learn an identity function for normal data. The training call is typically model.fit(X_train, X_train). The loss function used to guide the training is almost always a measure of reconstruction error, such as Mean Squared Error (MSE) or Mean Absolute Error (MAE), calculated between the original input and the reconstructed output.

● Inference: Once the model is trained, it can be used to detect anomalies in new, unseen data. A new sequence is passed through the trained model to generate its reconstruction. The reconstruction error for this new sequence is then calculated.

● Thresholding: This is a critical final step that translates the continuous reconstruction error into a binary anomaly/normal classification. A threshold must be set on the error score. A common and effective method is to first calculate the reconstruction errors for all the sequences in the (normal) training dataset. The distribution of these errors represents the range of "normal" error. The anomaly threshold is then set at a high percentile of this distribution, such as the 95th or 99th percentile. Any new sequence whose reconstruction error exceeds this threshold is flagged as an anomaly.
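
Continuing the sketch above, the snippet below shows one way to train, score, and threshold; the X_train / X_new arrays, epoch count, and 99th-percentile cutoff are illustrative assumptions.

```python
# A minimal sketch of training, inference, and percentile thresholding for the
# autoencoder above; X_train and X_new are assumed to be arrays of shape
# (n_windows, timesteps, n_features), with X_train containing only normal data.
import numpy as np

# Training: the model learns to reproduce its own (normal) input
model.fit(X_train, X_train, epochs=50, batch_size=64, validation_split=0.1, verbose=0)

# Reconstruction error (MAE per window) on the normal training data
train_errors = np.mean(np.abs(model.predict(X_train, verbose=0) - X_train), axis=(1, 2))

# Threshold at a high percentile of the "normal" error distribution
threshold = np.percentile(train_errors, 99)

# Inference: score new windows and flag those that exceed the threshold
new_errors = np.mean(np.abs(model.predict(X_new, verbose=0) - X_new), axis=(1, 2))
anomalies = new_errors > threshold
```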

7.5 Survey of Advanced Architectures

While LSTM Autoencoders serve as a powerful and widely used baseline, the field of deep learning for TSAD is rapidly advancing. Other key architectures that offer distinct advantages include:

● Variational Autoencoders (VAEs): A probabilistic extension of the AE that learns a probability distribution for the latent space rather than a single point. This allows it to model uncertainty more effectively and can lead to more robust anomaly detection.

● Generative Adversarial Networks (GANs): These models use a two-player game between a generator (which creates fake data) and a discriminator (which tries to distinguish fake from real data). For anomaly detection, the trained discriminator can be used to identify inputs that do not conform to the learned distribution of normal data.

● Transformers: Originally developed for natural language processing, Transformer architectures have proven highly effective for time series. Their self-attention mechanism allows them to weigh the importance of different time steps and process entire sequences in parallel, enabling them to efficiently capture very long-range dependencies that can challenge LSTMs.
Section 8: Synthesis and Recommendations

The selection of an appropriate anomaly detection algorithm is not a one-size-fits-all decision. It requires a careful consideration of the data's characteristics, the nature of the expected anomalies, and the computational constraints of the application. This final section synthesizes the analyses of the distance-based, density-based, and deep learning methods into a comparative framework and provides a practical decision guide for practitioners.

8.1 Comparative Analysis of Detection Methodologies

The three families of algorithms (k-Nearest Neighbors, Local Outlier Factor, and LSTM Autoencoders) operate on fundamentally different principles and exhibit distinct trade-offs in performance, complexity, and applicability.

Table 8.1: Comparative Matrix of Anomaly Detection Algorithms

The following table distills the detailed analysis into a single, actionable decision-making tool, comparing the methods across the key factors a data scientist would consider when selecting a model.
Feature: Core Principle
  - k-Nearest Neighbors (k-NN): Distance-based. Anomalies are isolated points that are far from their neighbors in feature space.
  - Local Outlier Factor (LOF): Density-based. Anomalies are located in regions of lower relative density compared to their local neighborhood.
  - LSTM Autoencoder: Reconstruction-based. Anomalies are patterns that the model, trained on normal data, cannot accurately reconstruct.

Feature: Primary Anomaly Type
  - k-NN: Point anomalies.
  - LOF: Point and simple Contextual anomalies.
  - LSTM Autoencoder: Point, Contextual, and Collective anomalies.

Feature: Data Suitability
  - k-NN: Best for low-dimensional (univariate or few-variable multivariate) data. Performance degrades severely with high dimensionality.
  - LOF: Better than k-NN for data with varying cluster densities, but still struggles with very high dimensions.
  - LSTM Autoencoder: Excellent for high-dimensional, sequential, and non-linear data where temporal patterns are critical.

Feature: Computational Complexity
  - k-NN: Computationally expensive inference (O(N²)). No training phase ("lazy learner").
  - LOF: High inference cost (O(N²)). No training phase in the traditional formulation, although an LOF model can be fitted on normal data and new data can then be compared against this normal/baseline data.
  - LSTM Autoencoder: Computationally expensive training phase, but very fast inference (O(N)) once the model is trained.

Feature: Key Advantage
  - k-NN: Simple, intuitive, and easy to implement. A good, interpretable baseline.
  - LOF: Effectively detects local outliers that global methods miss. Robust to datasets with clusters of varying densities.
  - LSTM Autoencoder: Learns complex temporal dependencies and non-linear patterns automatically. State-of-the-art performance on complex sequential data.

Feature: Key Disadvantage
  - k-NN: Fails in high-dimensional spaces due to the "curse of dimensionality." Highly sensitive to the choice of k.
  - LOF: Performance degrades in very high dimensions.
  - LSTM Autoencoder: Requires a large amount of purely normal data for training. Can be a "black box," making results difficult to interpret without additional techniques.
8.2 A Decision Framework for Selecting the Right Model

Based on the comparative analysis, a pragmatic, step-by-step framework can guide the selection process:

1. Define the Anomaly First: This is the most critical step. Characterize the target anomaly based on the taxonomy in Section 4. Is the goal to find sudden spikes (Point), values that are unusual for a specific time (Contextual), or subtle, developing patterns (Collective)? The answer to this question will immediately narrow the field of appropriate algorithms.

2. Assess Data Characteristics: Analyze the properties of the time series data. What is its dimensionality? Is it a single sensor reading or hundreds? How large is the dataset? Is the data stationary or does it exhibit clear trends and seasonality?

3. Start Simple for Simple Problems: For low-dimensional data (e.g., univariate) where the primary target is detecting point anomalies, begin with a simple and interpretable baseline like k-NN or a statistical method like Isolation Forest. Their performance will provide a valuable benchmark.

4. Handle Localized Complexity with Density: If the data is known to have regions of varying density or volatility (e.g., financial data with high- and low-volatility regimes), and the goal is to find local outliers, LOF is a superior choice to global distance methods like k-NN.

5. Scale Up with Deep Learning for Complex, Sequential Data: When faced with high-dimensional, complex, and sequential data, and especially if the target includes subtle contextual or collective anomalies, a deep learning approach like an LSTM Autoencoder is the most powerful and appropriate choice. Its ability to learn temporal dependencies from raw data without manual feature engineering is a significant advantage.

6. Iterate and Evaluate: No single model is a panacea. The best practice is to deploy a candidate model, rigorously evaluate its performance (paying close attention to the trade-off between false positives and false negatives), and use the results to inform further iterations, such as hyperparameter tuning or selecting a more advanced architecture.
Thank You

You can contact us @
https://indus-analytics.com/contact/
