DATA MINING 2

Module 1: Imbalanced learning and anomaly detection (CRISP, evaluation, imbalanced learning, anomaly detection)

Module 2: Advanced classification methods (naïve Bayes classifier, rule-based classifiers, logistic regression, support vector machines, ensembles, neural networks).
Module 3: Time series (similarity, approximation, motifs, shapelets, classification, clustering).
Module 4: Sequential patterns and advanced clustering (sequential pattern mining, X-means, OPTICS, transactional clustering).
Module 5: Ethical principles.

KNN extension

It's an instance-based, memory-based classifier: there is no real training phase.
Nearest-neighbor classification is part of a more general technique known as instance-based learning, which
uses specific training instances to make predictions without having to maintain an abstraction.
Instance-based learning algorithms require a proximity measure to determine the similarity or distance between
instances and a classification function that returns the predicted class of a test instance based on its proximity to
other instances.
Advantage: it can adapt to data not previously taken into account, simply by storing a new instance or removing an old one.
Disadvantage: it is a lazy learner, so it does not build an explicit model; moreover, classifying unknown records is relatively expensive: with n training instances, classifying a single test instance is O(n), since all n instances must be scanned.

Main elements:
1. Training set of stored records
2. Distance metric to compute the distance between records
3. The value of k, the number of nearest neighbors to retrieve

Given a set of training records and a test record:

- Compute the distance from the test record to each training record
- Identify the k closest records
- Use the class labels of the nearest neighbors to determine the class label of the unknown record
- The k-nearest neighbors of a record x are the data points that have the k smallest distances from x

• If k is too small, the classifier is sensitive to noise points and can overfit the noise in the training set.
• If k is too large, the neighborhood may include points of other classes.
In general we use k = sqrt(N), where N is the number of samples in the training dataset.

Euclidean distance:
The problem with Euclidean distance is that high-dimensional data suffers from the curse of dimensionality.
Solution: normalize the vectors to unit length. Attributes can also be scaled (e.g. with the z-score) to prevent the distance measurements from being dominated by a single attribute.
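A minimal sketch of these elements with scikit-learn (X_train, y_train, X_test are assumed to come from a previous data preparation step; the k = sqrt(N) rule of thumb is the one mentioned above):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# z-score scaling first, so that no single attribute dominates the Euclidean distance
k = int(np.sqrt(len(X_train)))            # rule of thumb: k = sqrt(N)
knn = make_pipeline(StandardScaler(),
                    KNeighborsClassifier(n_neighbors=k, metric="euclidean"))
knn.fit(X_train, y_train)                 # "training" = just storing the instances
print(knn.predict(X_test[:5]))            # classify a few unknown records
```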

PEBLS (Parallel Exemplar-Based Learning System): it also works with categorical attributes. PEBLS is a nearest-neighbor learning system (k=1) designed for applications where the instances have symbolic feature values; in fact it works with both continuous and nominal attributes.
For nominal attributes, the distance between two nominal values is computed using the Modified Value Difference Metric (MVDM):
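A minimal sketch of how MVDM can be computed from class-conditional value counts (the function name and the toy data are illustrative assumptions, not the PEBLS implementation):

```python
from collections import Counter, defaultdict

def mvdm(values, labels, v1, v2):
    """MVDM between two nominal values v1 and v2:
    sum over the classes c of |n(v1, c)/n(v1) - n(v2, c)/n(v2)|."""
    counts = defaultdict(Counter)              # value -> class -> count
    for v, y in zip(values, labels):
        counts[v][y] += 1
    n1, n2 = sum(counts[v1].values()), sum(counts[v2].values())
    return sum(abs(counts[v1][c] / n1 - counts[v2][c] / n2)
               for c in set(labels))

# toy example: a binary "Refund" attribute against a binary class
print(mvdm(["Yes", "No", "No", "Yes", "No"],
           ["C0", "C0", "C1", "C0", "C1"], "Yes", "No"))
```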

Distance between records:

CRISP (Cross-Industry Standard Process for DM)


Why should there be a standard process?
- The data mining process must be trusted and repeatable by people who have little background in DM
- It provides a framework for recording experiences (allows projects to be replicated)
- It enables project planning and management

CRISP-DM is non-proprietary, application/industry neutral and tool neutral; it focuses on business issues and provides a framework for guidance and an experience base (templates for analysis).
In a company there are two additional phases compared to the usual process: business understanding and deployment.

Stages:

• Business understanding: understand the objective of the project, the requirements and the definition of the DM problem
• Data understanding: initial data collection and familiarization, identification of data quality problems
• Data preparation: prepare the data: tables, records, attribute selection, data transformation and cleaning (90% of the time)
• Modeling: selection and application of modeling techniques, parameter calibration
• Evaluation: evaluation of the achievement of the business objectives (KPI indicators) and issues
• Deployment (distribution): deployment of the resulting model, implementation of a repeatable DM process

Business Understanding:
- Determine business objectives:
• Comprehensive understanding, from a business perspective, of what the customer really wants to achieve
• Discovering important factors (at the outset) that can influence the outcome of the project
• Neglecting this step leads to spending a great deal of effort producing the right answer to the wrong question
- Assess situation:
• More detailed fact finding about all resources, constraints (e.g. privacy), assumptions and other factors that need to be considered
• Examine the details
- Determine DM goals:
• A business goal states objectives in business terms: "increase catalog sales to existing customers"
• A DM goal states objectives in technical terms: "predict how many widgets a customer will buy, given their purchases over the past three years, demographic information and the price of the item"
- Produce project plan:
• Describes the intended plan to achieve the DM goals and the business goals
• The plan should specify the intended series of steps to be performed during the rest of the project, including an initial selection of tools and techniques. Planning is essential!

Data understanding: explore the data, verify the quality, find outliers
- Collect initial data:
• Acquire within the project the data listed in the project resources
• Includes data loading if necessary for data understanding
• Possibly leads to initial data preparation steps
• If acquiring multiple data sources, integration is an additional issue, either here or in the later data preparation phase
- Describe data:
• Examine the "gross" or "surface" properties of the acquired data
• Report on the results
- Explore data:
• Tackles the data mining questions, which can be addressed using querying, visualization and reporting, including: distribution of key attributes, through aggregations; relations between pairs of attributes; properties of significant subpopulations
• May address directly the data mining goals
• May contribute to the data description and quality reports
• May feed into the transformation and other data preparation needed
- Verify data quality:
• Examine the quality of the data, addressing questions such as: "Is the data complete?", "Are there missing values in the data?"

Data preparation (90% of the time): collection, assessment, consolidation and cleaning, data selection, transformation:

- Select data:
• Decide on the data to be used for the analysis
• Criteria include relevance to the data mining goals, quality, and technical constraints such as limits on data volume or data types
• Covers selection of attributes as well as selection of records in a table
- Clean data:
• Raise the data quality to the level required by the selected analysis techniques
• May involve selection of clean subsets of the data, the insertion of suitable defaults or more ambitious techniques such as the estimation of missing data by modeling
- Construct data: constructive data preparation operations such as the production of derived attributes, entire new records or transformed values for existing attributes
- Integrate data: methods whereby information is combined from multiple tables or records to create new records or values
- Format data: formatting transformations refer to primarily syntactic modifications made to the data that do not change its meaning, but might be required by the modeling tool

Modelling:
- Select the modeling technique (based upon the DM objectives)
- Build the model (parameter settings)
- Assess the model (rank the models)
- Select modeling technique:
• Select the actual modeling technique to be used, e.g. decision tree, neural network
• If multiple techniques are applied, perform this task for each technique separately
- Generate test design:
• Before actually building a model, generate a procedure or mechanism to test the model's quality and validity. E.g. in classification it is common to use error rates as quality measures for data mining models; therefore, typically separate the dataset into train and test set, build the model on the train set and estimate its quality on the separate test set
- Build model:
• Run the modeling tool on the prepared dataset to create one or more models
- Assess model:
• Interpret the models according to domain knowledge, the data mining success criteria and the desired test design
• Judge the success of the application of modeling and discovery techniques more technically
• Contact business analysts and domain experts later in order to discuss the data mining results in the business context
• Considers only the models, whereas the evaluation phase also takes into account all the other results that were produced in the course of the project

Evaluation: to understand the client's KPIs

- Evaluation of the model (how well it performed on test data)
- Methods and criteria (depend on the model type)
- Interpretation of the model (importance and difficulty depend on the algorithm)
- Evaluate results
- Review process
- Determine next steps

Deployment: determining how the results are to be used and who is to use them, i.e. who will be the end user, and how often the DM results should be used.
The knowledge gained will need to be organized and presented in a way that the customer can use it.
Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise.
- Plan deployment:
• In order to deploy the data mining result(s) into the business, take the evaluation results and devise a strategy for deployment
• Document the procedure for later deployment
- Plan monitoring and maintenance: over time things change (shopping behavior, for example) and the model needs to be updated
• Important if the data mining results become part of the day-to-day business and its environment
• Helps to avoid unnecessarily long periods of incorrect usage of data mining results
• Needs a detailed monitoring plan
• Takes into account the specific type of deployment
- Produce final report:
• The project leader and his team write up a final report
• May be only a summary of the project and its experiences
• May be a final and comprehensive presentation of the data mining result(s)
- Review project:
• Assess what went right and what went wrong, what was done well and what needs to be improved

So CRISP is useful in DM because it provides a uniform framework for guidelines and experience documentation. It is also flexible enough to handle different business/agency issues and different data.

Websites • http://www.crisp-dm.org/ • http://www.spss.com/ • http://www.kdnuggets.com/

Performance evaluations

The confusion matrix is the basis for evaluating the performance of our model, which rests on the predictive ability of the model.

• True positive (TP = a) or f++, which corresponds to the number of positive examples correctly predicted by the classification model.
• False negative (FN = b) or f+−, which corresponds to the number of positive examples wrongly predicted as negative by the classification model.
• False positive (FP = c) or f−+, which corresponds to the number of negative examples wrongly predicted as positive by the classification model.
• True negative (TN = d) or f−−, which corresponds to the number of negative examples correctly predicted by the classification model.

• Thanks to the confusion matrix it is possible to calculate the accuracy of our model.
• How is the accuracy calculated? As the number of correct predictions divided by the overall number of records: Accuracy = (TP + TN) / (TP + TN + FP + FN).

Limitation of Accuracy
• Since the accuracy measure treats every class as equally important, it may not be suitable for analyzing imbalanced data sets, where the rare class is considered more interesting than the majority class.

Example: consider a 2-class problem with a predominant class:
– Number of Class 0 examples = 9990
– Number of Class 1 examples = 10
• A model that always predicts class 0 reaches accuracy 9990/10000 = 99.9%
– Accuracy is misleading because the model does not detect any class 1 example; class 1 is lost completely, without ever being considered.

Cost matrix

The cost matrix allows us to evaluate a model beyond plain accuracy, without changing anything in the training or the test set: we only act on the evaluation.

A cost matrix encodes the penalty of classifying records from one class as another.
Let C(i, j) denote the cost of predicting a record from class i as class j.
With this notation, C(+, −) is the cost of committing a false negative error, while C(−, +) is the cost of generating a false alarm.
A negative entry in the cost matrix represents the reward for making a correct classification. Given a collection of N test records, the overall cost of a model M is the sum over all cells of the confusion matrix weighted by the corresponding costs:
Cost(M) = TP·C(+,+) + FN·C(+,−) + FP·C(−,+) + TN·C(−,−)

Example of computing the Cost of Classification

• Consider the cost matrix: the cost of committing a false negative error is a hundred times larger than the cost of committing a false alarm. In other words, failing to detect one positive example is as bad as committing a hundred false alarms.
• Cost is something we would like to reduce as much as possible.
• Final result: despite improving both its true positive and false positive counts, model M2 is still inferior, since the improvement comes at the expense of increasing the more costly false negative errors. A standard accuracy measure would have preferred model M2 over M1.
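A minimal sketch of the cost computation (the confusion-matrix counts and the cost values below are illustrative assumptions, not necessarily the slide's exact numbers):

```python
import numpy as np

def total_cost(conf, cost):
    """Overall cost of a model: sum over all cells of f(i, j) * C(i, j)."""
    return int(np.sum(np.asarray(conf) * np.asarray(cost)))

#                  predicted +  predicted -
cost = np.array([[-1, 100],    # actual +  (C(+,+), C(+,-))
                 [ 1,   0]])   # actual -  (C(-,+), C(-,-))
conf_M1 = np.array([[150, 40],   # TP, FN
                    [ 60, 250]]) # FP, TN
print(total_cost(conf_M1, cost))   # false negatives dominate the total cost
```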

Cost vs Accuracy

If we compare the two, we see that cost is a linear function of accuracy when the cost assigned to a false positive and to a false negative is the same (= q) and the cost assigned to a true positive and to a true negative is the same (= p). With a = TP, b = FN, c = FP, d = TN:
Cost = p·(a + d) + q·(b + c) = N·[q − (q − p)·Accuracy]

Using the cost matrix we can change the way we evaluate performance, and this helps us evaluate a classifier on a large (possibly imbalanced) dataset better than accuracy alone.

Cost-Sensitive Measures
Measures related to the cost matrix:

• Precision determines the fraction of records that actually turn out to be positive among those the classifier has declared as positive. The higher the precision, the lower the number of false positive errors committed by the classifier. Precision is biased towards C(Yes|Yes) and C(Yes|No).
Precision = TP / (TP + FP)

• Recall measures the fraction of positive examples correctly predicted by the classifier, i.e. the fraction of positive records that the model correctly recognized as positive. The measure says how much of the actual set of positive cases the correct predictions cover.
Recall = TP / (TP + FN)

• Precision and recall can be summarized into another metric known as the F1 measure. In principle, F1 represents the harmonic mean of recall and precision, i.e.:
F1 = 2 · Precision · Recall / (Precision + Recall)

• Weighted accuracy: we use weights as in the example of the cost matrix.
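A minimal sketch of computing these measures with scikit-learn (y_test and y_pred are assumed to come from a previously trained classifier, with 0/1 labels):

```python
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

print(confusion_matrix(y_test, y_pred))        # [[TN, FP], [FN, TP]] for 0/1 labels
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))   # TP / (TP + FP)
print("recall   :", recall_score(y_test, y_pred))      # TP / (TP + FN)
print("F1       :", f1_score(y_test, y_pred))          # harmonic mean of the two
```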



Data partitioning

1. The dataset is usually divided into 2 parts (the Holdout technique): train (70%) and test (30%).
The train set is used to train the model and the test set to test it, to see whether we get good results. If the result is not good we go back and apply another model.

2. Or we can divide the dataset into three parts:
1. train = 70% → used for training the model and for parameter selection
2. test = 30%
3. validation = 30% of the train → it is not used for training, but to test the combinations of the various parameters. It is reused as many times as needed until we find the best parameters. After finding the best combination we can test the model.

In this schema we are still applying a holdout validation: we fix the train and the validation, i.e. the same operation as in 1. is repeated, but directly inside the train. As in schema 1, the train can be biased due to the random data selection we make, and as a consequence so can the validation.

3. Cross validation:
When we do cross validation, we are applying a sort of schema 2, because the validation set is taken from within the data partition. In this schema too we try to find the best parameters, as in 2., and then we test the model.

In this case, unlike 1 and 2, we repeat the operation of 2. k times, with fold sizes depending on how many times we want to repeat it. There is a validation part here too, but it moves every time we analyze a different fold, which reduces the bias.
Tip: first do a holdout separation, play with the data and the feature engineering, and then test the model at the end. It is not recommended to do cross validation directly on the whole dataset, because then no part of the data would remain truly held out for the final test.
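A minimal sketch of the holdout and cross-validation schemes with scikit-learn (X, y and the decision-tree model are illustrative assumptions):

```python
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# 1. Holdout: 70% train / 30% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# 3. k-fold cross-validation, run on the training part only, to pick parameters;
#    the held-out test set is used once, at the very end
for depth in (3, 5, 10):
    scores = cross_val_score(DecisionTreeClassifier(max_depth=depth),
                             X_train, y_train, cv=5)
    print(depth, scores.mean())

final_model = DecisionTreeClassifier(max_depth=5).fit(X_train, y_train)
print("test accuracy:", final_model.score(X_test, y_test))
```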
Cross Validation, considering TIME

What we do is train on the "past" and try to predict the "future".

The black line represents time.
We could "look ahead" and test the model later, or do a cross validation that takes time into account. When we do cross-validation for time series classification or event prediction, it is crucial to organize the evaluation like this, because otherwise we might split the dataset into train and test in a way that makes us "predict" only things that already happened in the past.
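A minimal sketch of a time-aware split with scikit-learn's TimeSeriesSplit (assuming X and y are sorted chronologically):

```python
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    # each fold trains on the "past" and validates on the following "future" block
    print(f"fold {fold}: train [0..{train_idx[-1]}] -> "
          f"test [{test_idx[0]}..{test_idx[-1]}]")
```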

ROC (Receiver Operating Characteristic) curve

• Developed in the 1950s for signal detection theory, to analyze noisy signals
• Characterizes the trade-off between positive hits and false alarms
• The ROC curve plots the true positive rate (on the y-axis) against the false positive rate (on the x-axis)
• The performance of each classifier is represented as a point on the ROC curve
• We can observe how the performance of the algorithm changes if an object of the positive or negative class is moved
• Changing the sample distribution or the cost matrix changes the location of the point

• It illustrates the ability of a binary classifier as its discrimination threshold THR is varied.
• The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various THRs.
• The TPR = TP / (TP + FN) is also known as sensitivity, recall or probability of detection.
• The FPR = FP / (TN + FP) is also known as probability of false alarm and can be calculated as (1 − specificity).

If we have:
- a 1-dimensional data set containing 2 classes (positive and negative)
- any point located at x > t is classified as positive
then, given a certain threshold, we can classify all the instances we have in a test set and compute the FPR and TPR.

(FPR, TPR):
• (0,0): declare everything to be the negative class
• (1,1): declare everything to be the positive class
• (0,1): ideal

• Diagonal line: random guessing
• Below the diagonal line: the prediction is the opposite of the true class

Using ROC for Model Comparison

• No model consistently outperforms the other:
• M1 is better for small FPR
• M2 is better for large FPR
• Area under the ROC curve (AUC):
• Ideal: area = 1
• Random guessing: area = 0.5

How to Construct an ROC curve
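A minimal sketch of the construction: sort the records by predicted score (descending), use every score value as a threshold, and record the (FPR, TPR) pair at each cut-off (names and toy data are illustrative):

```python
import numpy as np

def roc_points(y_true, y_score):
    """ROC points obtained by using every score value as a threshold THR."""
    y = np.asarray(y_true)[np.argsort(y_score)[::-1]]    # sort by score, descending
    P, N = y.sum(), len(y) - y.sum()
    tpr = np.concatenate([[0.0], np.cumsum(y) / P])      # true positive rate
    fpr = np.concatenate([[0.0], np.cumsum(1 - y) / N])  # false positive rate
    return fpr, tpr

fpr, tpr = roc_points([1, 1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.6, 0.4, 0.2])
print(np.trapz(tpr, fpr))   # area under the curve (AUC)
```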

Lift chart
• The lift curve is a popular technique in direct marketing.
• The input is a dataset that has been "scored" by appending to each case the estimated probability that it will belong to a given class; therefore the inputs are the same as for the ROC.
• The cumulative lift chart (also called gains chart) is constructed with the cumulative number of cases (in descending order of probability) on the x-axis and the cumulative number of true positives on the y-axis.
• The dashed line is a reference line. For any given number of cases (the x-axis value), it represents the expected number of positives we would predict if we did not have a model but simply selected cases at random. It provides a benchmark against which we can see the performance of the model.

Notice: "Lift chart" is a rather general term, often used to identify other kinds of plots as well. Don't get confused!

1. Sort the probabilities
2. Generate the actual classes
3. Then count the TPs (cumulative actual classes)

• From the lift chart we can easily derive an "economical value" plot, e.g. in target marketing.
Given our predictive model, how many customers should we target to maximize income? We could use the lift to understand how many consumers to call to maximize profit:
- Profit = UnitB*MaxR*Lift(X) − UnitCost*N*X/100
- UnitB = unit benefit, UnitCost = unit postal cost
- N = total customers
- MaxR = expected potential respondents in the whole population (N)
- Lift(X) = lift chart value for X, in [0,..,1]
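A minimal sketch of the cumulative gains computation behind the chart (names are illustrative; plotting is omitted):

```python
import numpy as np

def cumulative_gains(y_true, y_score):
    """Sort cases by estimated probability (descending) and accumulate the TPs."""
    order = np.argsort(y_score)[::-1]
    cum_tp = np.cumsum(np.asarray(y_true)[order])        # y-axis: cumulative TPs
    n_cases = np.arange(1, len(y_true) + 1)              # x-axis: cases targeted
    baseline = n_cases * cum_tp[-1] / len(y_true)        # dashed random-selection line
    return n_cases, cum_tp, baseline
```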

ROC example:

IMBALANCED LEARNING
Imbalanced classes

Most classification methods assume that classes are reasonably balanced, while in reality classes are often imbalanced (one class is really popular and the other is rare). If a class is rare it doesn't mean it isn't interesting: in the medical field it is interesting to find out how to cure the population affected by HIV, which is 0.4% of the total USA population. E.g., about 2% of credit card accounts are defrauded per year (most fraud detection domains are heavily imbalanced).

Evaluating Classifiers on Imbalanced Data

When we evaluate performance on imbalanced data, accuracy is not a representative measure and we need to use a confusion matrix or a cost matrix: a trivial classifier that always predicts the most common class may already reach an accuracy of 99.9%.
Example, assuming the test set contains 100 records:
- Positive cases = 75, Negative cases = 25
- Is a classifier with 70% accuracy good? No, the trivial classifier (always positive) reaches 75%.
- Positive cases = 50, Negative cases = 50
- Is a classifier with 70% accuracy good? At least much better than the trivial classifier.

To say whether our classifier's accuracy is interesting, we need to compare it with the accuracy of the trivial classifier that always returns the majority class (50% in the case of perfect balancing).

Handling Imbalanced Data


There are several things we can do to treat imbalanced data:
- We need to balance the training set, by Undersampling the majority class or Oversampling the
minority class. These approaches are applied on the training set and the resulting classifier is
applied on the original test set (without any changes) because we want to test the classifier in a
real scenario where our data is imbalanced.
- At the algorithm level we might adjust the class weight by making the algorithm more sensitive to rare
classes, adjust the decision threshold (which is 0.5 by default) or design a new algorithm to
perform well on imbalanced data.
- Switch to anomaly detection
- Do nothing and hope to be lucky

Undersampling the majority class


This approach requires that the majority class is scaled down to the level of the minority class, which can be
achieved by:
- Random Undersampling: Under-sample the majority class(es) by randomly picking samples with or
without replacement.
- Neighbor-based approaches, eg, Condensed Nearest Neighbor, Tomek Links, etc.

Random Undersampling
Under-sample the majority class(es) by randomly picking samples, with or without replacement.

As we can see in the right figure, after undersampling we can still distinguish the decision boundaries. Random undersampling is efficient because it doesn't require any strategy (although, to be sure of the results, it should be repeated several times), and it avoids biases in choosing which samples to keep. It still remains a random selection, so this should be considered a weakness, because it could lead to wrong choices.
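A minimal sketch with the imbalanced-learn library (assuming it is installed and that X_train, y_train come from a previous holdout split); only the training set is resampled, the test set stays untouched:

```python
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42)   # sampling with/without replacement is configurable
X_res, y_res = rus.fit_resample(X_train, y_train)
print("before:", Counter(y_train), "after:", Counter(y_res))
# the classifier is then trained on (X_res, y_res) and evaluated on the original test set
```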

Condensed Nearest Neighbor (CNN)

This approach is more robust than random selection, but it is computationally expensive: it performs a smart undersampling by removing majority points that have a minority point among their k-NN, and it is sensitive to noise. So, if a minority point is in the nearest neighborhood of a majority point and this causes a misclassification, then this majority point is removed.

Algorithm:

a) pass is initialized at 1.
b) We randomly select an instance x among all instances in the training set D, so we have two different sets, D(1) (D at the first iteration) and E.
c) We initialize D at the next iteration, D(pass+1), as an empty set, and we initialize a counter to 0.
d) Another x is chosen randomly from D (the previous x is not in D anymore), and x is classified through a nearest-neighbor classification using E as the reference set.
e) If the classifier built at step d) finds the right class with a simple approach such as 1-NN, the new x is added to the set D(pass+1) (D at iteration pass+1). Otherwise, if x is misclassified, it is added to the set E, and the counter is increased by one.
f) The x is removed from this iteration. The instances that are discarded are the closest ones (because they are the ones in the neighborhood) to every pre-selected instance.
g) If the set D(pass) is not empty, the algorithm goes back to step d) and repeats the process.
h) If the counter is 0, it means that no instances have been misclassified and the algorithm ends, because we have reached a situation where a simple classifier such as 1-NN is a good classifier.
If the counter is not equal to 0, pass is increased and the algorithm goes back to step b) and repeats the process.
So if a minority point is in the nearest neighborhood of a majority point, and this leads to a misclassification, then this majority point is removed.

Original paper description of the algorithm steps:

Oversampling the minority class

This approach is often preferred over undersampling, because we are not removing data from our analysis (even when working with big data).

This can be done with:

• Random Oversampling
Over-sample the minority class(es) by choosing random samples with replacement. This repetition of points can change the behavior of an algorithm, because the weight of instances can affect the way (for example) a decision tree makes a split, since we are augmenting the population that belongs to a certain portion of the space.
An advantage of this approach is that it is fast, but we could end up with a lot of repetitions if we don't perform several tries.
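A minimal sketch, again with imbalanced-learn (the sampling_strategy value is an illustrative assumption):

```python
from collections import Counter
from imblearn.over_sampling import RandomOverSampler

# sampling_strategy=0.5 -> the minority class is resampled up to 50% of the majority size
ros = RandomOverSampler(sampling_strategy=0.5, random_state=42)
X_res, y_res = ros.fit_resample(X_train, y_train)
print("before:", Counter(y_train), "after:", Counter(y_res))
```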

• Synthetic Minority Oversampling Technique (SMOTE)

Oversample the minority class(es) by adding points through interpolation: we ignore the majority class, we select two points belonging to the minority class, and a (synthetic) point is generated in a random position between the two points.
This algorithm operates in the "feature space" rather than in the "data space", and effectively forces the decision region of the minority class to become more general. Note that we have no guarantee that the new point would really be labeled in the same way as the two points.

The minority class is over-sampled by taking each minority class sample and introducing synthetic examples along the line segments joining any/all of the k minority class nearest neighbors. Depending upon the amount of over-sampling required, neighbors from the k nearest neighbors are randomly chosen (by default k = 5). E.g., if the amount of over-sampling needed is 200%, only two neighbors from the five are chosen and one sample is generated in the direction of each.
[It must be considered that this method can influence our work, keep it in mind!]

SMOTE – Samples Generation


• Take the difference between the feature vector (sample) under consideration and its nearest neighbor.
• Multiply this difference by a random number between 0 and 1, and add it to the feature vector under
consideration.
• This causes the selection of a random point along the line segment between two specific features.
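A minimal NumPy sketch of this interpolation step for a single synthetic sample (in practice one would use a library implementation such as imblearn's SMOTE; names and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_interpolate(x, x_nn):
    """Generate one synthetic sample between a minority point x and one of its
    randomly chosen minority-class nearest neighbors x_nn."""
    diff = x_nn - x                # difference between the two feature vectors
    gap = rng.random()             # random number in [0, 1)
    return x + gap * diff          # random point on the segment x -> x_nn

x, x_nn = np.array([2.0, 3.0]), np.array([4.0, 1.0])
print(smote_interpolate(x, x_nn))
```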

Alternatives: SMOTENC, BorderlineSMOTE, SVMSMOTE, ADASYN

Acting on the classifier: Adjust the class weight

The classifier can be trained considering different costs to be paid for misclassification errors on the minority classes, or giving different importance to the classes in the training process of the algorithm. This is generally done using the "class weight".

The full line is the decision boundary of the classifier without weights, which cuts the red points in half. By giving weights to the classes, the decision boundary includes all the red points in one area, so misclassifications are less likely. In other terms: we obtain a high recall with respect to the minority class.

The objective of the classifier is to find a model which minimizes the total cost (we consider the cost matrix at training time, not just to evaluate the performance of the classifier).

With this approach we can treat the classification as an optimization problem (minimize the cost of misclassification); this is called a Meta-Cost Sensitive Classifier.
The algorithm needs to compute the expected risk of classifying x with class i:
R(i|x) = Σ_j P(j|x) · C(i, j)
where P(j|x) is the probability of classifying the instance x as class j, and C(i, j) is the cost of misclassifying class i as class j. Now, the input for the classifier is the cost matrix in addition to the training set.
In this way we re-label the training data with the class i having the lowest risk and learn a model on the cost-sensitive training data.
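A minimal sketch of adjusting class weights in scikit-learn (the 1:10 weighting is an illustrative assumption, not a prescribed value):

```python
from sklearn.tree import DecisionTreeClassifier

# errors on the minority class 1 cost ten times more during training
clf = DecisionTreeClassifier(class_weight={0: 1, 1: 10}, random_state=42)
clf.fit(X_train, y_train)

# alternatively, weights inversely proportional to the class frequencies
clf_bal = DecisionTreeClassifier(class_weight="balanced", random_state=42)
clf_bal.fit(X_train, y_train)
```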

Adjust the Decision Threshold

Several classification methods compute scores in terms of the probability of belonging to a class, and then assign the class.

Generally we have:
- Score p > 50% → class = Y (p is the probability of being classified as Y)
- Otherwise → class = N
E.g.: decision trees have p = #positive / #cases over each leaf

What if we generalize the schema into:
- Score p > CertainThreshold (THR)% → class = Y
- Otherwise → class = N

For each THR (in [0-100]) we get a different set of predictions. If we increase the threshold, we require a higher confidence before predicting Y, and we reduce the number of false positives (higher precision).
Instead, by reducing the THR we increase the recall, because we find more of class Y.
Changing the THR also changes the confusion matrix, which changes all the indicators derived from it: Accuracy, True Positive Rate (TPR), False Positive Rate (FPR), etc.
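A minimal sketch of moving the decision threshold on top of a probabilistic classifier (clf, X_test and y_test are assumed from the previous steps):

```python
from sklearn.metrics import precision_score, recall_score

p_pos = clf.predict_proba(X_test)[:, 1]          # probability of the positive class

for thr in (0.3, 0.5, 0.7):
    y_pred = (p_pos > thr).astype(int)           # lower THR -> more predicted positives
    print(thr,
          "precision:", precision_score(y_test, y_pred),
          "recall:", recall_score(y_test, y_pred))
```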

Maximum Likelihood Estimation (MLE)

A method for determining the values of the parameters of a model.
The parameter values are found such that they maximize the likelihood that the process described by the model produced the data that were actually observed.
The basic question is: which model fits the data best? The running example assumes a Normal (Gaussian) distribution with mean and standard deviation as parameters.

The smaller the standard deviation, the narrower the bell curve.

E.g., we have 3 data points 9, 9.5, 11 and we want to calculate the total probability of observing all the data, i.e. the joint probability distribution of all the observed data points, taking into account that each data point is generated independently of the others.

If all events are independent, the total probability of observing this data is the product of the probabilities of observing each data point individually (i.e. the product of the marginal probabilities).

The Log-Likelihood

The maximum value is found by differentiation, i.e. we take the derivative of the function with respect to a parameter, set it to zero and solve for the parameter.
Since the previous expression is not easy to differentiate, we simplify the calculation by considering the natural logarithm of the expression.

This expression can be easily differentiated to find the maximum: setting the derivative with respect to the mean to zero gives the sample mean, μ = (x1 + x2 + ... + xn)/n.

The same thing can be done with the standard deviation.
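A minimal numeric sketch of MLE for the Gaussian case, using the three points of the example (the closed-form solutions are the sample mean and the biased standard deviation; SciPy is assumed):

```python
import numpy as np
from scipy.stats import norm

data = np.array([9.0, 9.5, 11.0])

def log_likelihood(mu, sigma):
    # sum of the log marginal densities (the points are assumed independent)
    return np.sum(norm.logpdf(data, loc=mu, scale=sigma))

mu_hat = data.mean()              # derivative w.r.t. mu set to zero -> sample mean
sigma_hat = data.std(ddof=0)      # the MLE uses the 1/n (biased) estimator
print(mu_hat, sigma_hat, log_likelihood(mu_hat, sigma_hat))
```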

Anomaly & Outliers Detection

An anomaly may be the result of one of several specific causes or of other causes that we did not consider. Indeed, the anomalies in a data set may have several sources, and the underlying cause of any particular anomaly is often unknown. In practice, anomaly detection techniques focus on finding objects that differ substantially from most other objects, and the techniques themselves are not affected by the source of an anomaly.
Thus, the underlying cause of the anomaly is only important with respect to the intended application.

Outlier = an observation that is completely different from all the others in the dataset, as if it were generated by a different mechanism.
There is a statistical intuition that says that normal objects in the real world are generated by a "generating mechanism", like a statistical process. Abnormal objects deviate from this generating mechanism = outliers.

Anomalies/outliers = a set of data points that are considerably different from the remainder of the data.
The natural implication is that anomalies are relatively rare.
One in a thousand still occurs often if you have lots of data, and context is important, e.g. freezing temperatures in July.

Applications of outlier detection: fraud detection (recognizing abuses of credit cards), medicine (e.g. some unusual symptoms), public health.
Importance of anomaly detection: the ozone depletion history. In 1985 three researchers were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels. Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations? The ozone concentrations recorded by the satellite were so low that they were being treated as outliers by a computer program and discarded!

Causes of anomalies can be:
• data from different classes (measuring the weight of oranges, but a few grapefruits are mixed in)
• natural variation (unusually tall people)
• data errors (a 2-year-old weighing 200 pounds)

Distinction between noise and outliers

Noise is erroneous, perhaps random, values or contaminating objects. Consider for example a weight recorded incorrectly.
Noise is not that interesting compared to outliers; it doesn't necessarily produce unusual values or objects.

Anomalies can be interesting if they are not the result of noise. Noise and anomalies are related but still distinct concepts.

The number of attributes can create problems. Many outliers are defined in terms of a single attribute: height, shape, color.
It can be difficult to find an anomaly using all the attributes (because you have many dimensions) due to noisy or irrelevant attributes; an object may be anomalous only with respect to some attributes.

Anomaly scoring: we look for a value associated with each object that expresses its degree of outlierness. Many anomaly detection techniques provide only a binary categorization: an object is an anomaly or it isn't, and this is especially true of classification-based approaches.
Other approaches assign a score to all points. This score measures the degree to which an object is an anomaly and allows objects to be ranked.
In the end, you often need a binary decision, e.g. should this credit card transaction be flagged?

Other issues for Anomaly Detection
• Find the anomalies one at a time or many at once: with many anomalies, swamping and masking effects can occur
• Evaluation: it is difficult to measure the correctness of an anomaly
• Efficiency: we want to find outliers as fast as possible
• Context: how anomalies are treated depends on the context
• Degree to which a point is "anomalous": some techniques signal whether an object is an anomaly only in a binary way: the object is either an anomaly or it isn't. Often, this doesn't reflect the underlying reality that some objects are more extreme anomalies than others.

Variants of the Anomaly Detection Problem

Given a dataset D, find all data points x in D with anomaly score greater than a certain threshold.
or:
Given a dataset D, find the n data points x in D with the largest (top-n) anomaly scores.
or:
Given a dataset D, containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D.

Model-based anomaly detection



Many anomaly detection techniques first build a model of the data.


There are three basic approaches to anomaly detection: unsupervised, supervised, and semi-supervised.
• unsupervised: Anomalies are those points that don't fit well or those points that distort the model
(statistical distribution, clusters, regression, geometric, graph)
• supervised: anomalies are regarded as a rare class and we need to have training data
• semi-supervised: sometimes training data contains labeled normal data, but has no information
about the anomalous objects. In the semi-supervised setting, the objective is to find an anomaly
label or score for a set of given objects by using the information from labeled normal objects. In
many practical situations, it can be difficult to find a small set of representative normal objects.

Machine learning for outlier detection


If the ground truth of anomalies is available we can prepare a classification problem to unveil outliers. As
classifiers we can use all the available machine learning approaches: Ensembles, SVM, DNN. The problem
is that the dataset would be very unbalanced. Thus, ad-hoc formulations/implementation should be adopted.
In some cases, it is difficult to build a model; eg, because the statistical distribution of the data is
unknown or no training data is available.

Additional anomaly detection techniques

• Proximity-based (e.g. KNN): anomalies are points far away from other points; in some cases this can be detected graphically
• Density-based (e.g. DBSCAN): low-density points are outliers
• Pattern matching: create profiles or templates of atypical but important events or objects

Classification of outlier detection approaches

• Global vs local outlier detection: the adjectives global and local refer to the portion of data taken into consideration: global if the whole dataset is used to decide whether an instance is an outlier, local if only a small portion (reference set) is used. DBSCAN- and KNN-based methods can be considered local.
It concerns the resolution of the reference set with respect to which the "outlierness" of a particular data object is determined.
- Global approaches
• The reference set contains all other data objects
• Basic assumption: there is only one normal mechanism
• Basic problem: other outliers are also in the reference set and may falsify the results
- Local approaches
• The reference set contains a (small) subset of data objects
• No assumption on the number of normal mechanisms in the data
• Basic problem: how to choose a proper reference set
NB: some approaches are somewhat in between: the resolution of the reference set [= a portion of the training set used to judge whether a point is an outlier, or to assign a point an outlierness score] is varied, e.g. from a single object (local) to the entire database (global), automatically or through a user-defined input parameter.

• Labeling vs scoring outliers: concerns the output of the algorithm.

Labeling approaches: binary output, data objects are labeled either as normal or as outliers.
Scoring approaches: continuous output, a score is computed for each object, and data objects can be sorted according to their score.
NB: many scoring approaches focus on determining the top-n outliers (the parameter n is usually given by the user). Scoring approaches can usually also produce a binary output if necessary (e.g. by defining a suitable threshold on the scoring values).

• Modeling properties: concerns what notion of "outlierness" is modeled.

1. Model-based approaches, classified by the properties of the underlying model:
• Rationale: apply a model to represent the normal data points; outliers are points that do not fit the model
• Sample approaches: probabilistic tests based on statistical models, depth-based approaches, deviation-based approaches, some subspace outlier detection approaches
2. Proximity-based approaches, notion of proximity:
• Rationale: examine the spatial proximity of each object in the data space; if the proximity of an object deviates considerably from the proximity of other objects, it is considered an outlier
• Sample approaches: distance-based approaches, density-based approaches, some subspace outlier detection approaches
3. Angle-based approaches, notion of angle:
• Rationale: examine the spectrum of pairwise angles between a given point and all other points; outliers are points whose spectrum features high fluctuation
4. Visual approaches such as boxplots or scatter plots. Limits: they are not automatic and they are subjective.

From the visual box-plot to an automatic approach

• The IQR of a set of values is calculated as the difference between the upper and lower quartiles, Q3 and Q1: IQR = Q3 − Q1
• x is an outlier if x < Q1 − k·IQR or x > Q3 + k·IQR (generally k = 1.5)
• In a boxplot, the highest and lowest occurring values within this limit are indicated by the whiskers of the box, and any outliers as individual points.

Limit: it can only be used on one dimension at a time.
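A minimal sketch of the boxplot rule applied to a single attribute (the toy data is illustrative):

```python
import numpy as np

def iqr_outliers(x, k=1.5):
    """Flag values below Q1 - k*IQR or above Q3 + k*IQR (one attribute at a time)."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

x = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102.0])
print(x[iqr_outliers(x)])    # the extreme value 102 is flagged
```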

Statistical Approaches
Probabilistic definition of an outlier: an outlier is an object that has a low probability with respect to the probability distribution assumed for the data.
It usually assumes a parametric model describing the data distribution (e.g. a normal distribution).
Apply a statistical test that depends on: the data distribution, the distribution parameters (mean, variance) and the expected number of outliers.
Issues: identifying the distribution of the data set (heavy-tailed distributions), the number of attributes, and whether the data is a mixture of distributions.

Assuming we are in one dimension with a Gaussian distribution, the boundaries can be drawn and all points outside them are outliers.
For example, with x: age and y: number of exams passed, in the center there will be those students who have passed an average number of exams at an average age compared to the ages present in the data.

Grubbs' Test

Allows us to find outliers in univariate data. It can only be used on one dimension at a time, and it assumes that the data follow a normal distribution.

• It detects one outlier at a time, removes it, and repeats. There are two hypotheses: H0: there is no outlier in the data (null hypothesis); HA: there is at least one outlier.
Grubbs' test statistic:
G = max_i |X_i − mean(X)| / s, where s is the sample standard deviation.

• one-sided test with significance level alpha/N
• two-sided test with significance level alpha/(2N)

Reject the null hypothesis H0 of no outliers if:
G > ((N − 1)/sqrt(N)) · sqrt( t^2 / (N − 2 + t^2) ), where t is the critical value of the t-distribution with N − 2 degrees of freedom at significance level alpha/(2N).

A two-tailed test is appropriate if the estimated value may be greater or less than a certain range of values, for example, whether a test taker may score above or below a specific range of scores. This method is used for null hypothesis testing: if the estimated value falls in the critical areas, the alternative hypothesis is accepted over the null hypothesis.

A one-tailed test is appropriate if the estimated value may depart from the reference value in only one direction, left or right, but not both. An example is whether a machine produces more than one percent defective products. In this situation, if the estimated value falls in the one-sided critical area corresponding to the direction of interest (greater than or less than), the alternative hypothesis is accepted over the null hypothesis.
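A minimal sketch of one step of the two-sided test, using the t-distribution critical value (SciPy assumed; the data is illustrative):

```python
import numpy as np
from scipy import stats

def grubbs_step(x, alpha=0.05):
    """Return the index of the most extreme value and whether H0 (no outlier)
    is rejected for it (two-sided Grubbs' test, one pass)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    dev = np.abs(x - x.mean())
    g = dev.max() / x.std(ddof=1)                       # Grubbs' statistic
    t2 = stats.t.ppf(1 - alpha / (2 * n), n - 2) ** 2   # critical t value, squared
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t2 / (n - 2 + t2))
    return int(dev.argmax()), bool(g > g_crit)

print(grubbs_step([8.5, 9.0, 9.4, 9.6, 10.1, 10.3, 25.0]))
```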

Likelihood Approach
Works with 2 sets: suppose the dataset D contains samples from a mixture of two probability distributions: M (the majority distribution) and A (the anomalous distribution).
General approach:
• initially, assume that all data points belong to M
• then compute Lt(D): the log likelihood of D at time t
• for each point xt belonging to M, move it to A:
- let Lt+1(D) be the new log likelihood
- compute the difference Δ = Lt(D) − Lt+1(D)
- if Δ > c (some threshold), then xt is declared an anomaly and moved permanently from M to A

How is the likelihood modeled?

Data distribution: D(x) = (1 − λ)·M(x) + λ·A(x)
- λ is a parameter that controls the importance of one distribution over the other
- M is the probability distribution estimated from the data; it can be based on any method (naïve Bayes, maximum entropy, etc.)
- A is initially assumed to be a uniform distribution
Likelihood at time t:
Lt(D) = [ product over xi in Mt of (1 − λ)·P_Mt(xi) ] · [ product over xi in At of λ·P_At(xi) ]

Strengths/Weaknesses of the statistical approach

Pros:
• firm mathematical foundation
• can be very efficient
• good results if the distribution is known

Cons:
• in many cases, the data distribution may not be known
• for high-dimensional data, it may be difficult to estimate the true distribution
• anomalies can distort the parameters of the distribution (mean and standard deviation are very sensitive to outliers)

Depth-based Approaches
The general idea is to look for outliers at the boundaries of the data space, but independently of the statistical distribution. The complicated part is to organize the data objects into convex hull layers, i.e. nested groupings containing all the data.
Outliers are objects in the outer layers.
Assumption: outliers are located at the boundaries of the data space.

Model [Tukey 1977]
• Points on the convex hull of the full data space have depth = 1
• Points on the convex hull of the data set after removing all points with depth = 1 have depth = 2
• ...
• Points having a depth ≤ k are reported as outliers

It is a global model; it returns a label once you specify the layer (depth) at which you consider points to be outliers.

• Similar idea to classical statistical approaches (k = 1 distributions) but independent of the chosen kind of distribution
• Convex hull computation is usually only efficient in 2D/3D spaces
• Originally outputs a label but can easily be extended to scoring (take the depth as the scoring value)
• Uses a global reference set for outlier detection
• Sample algorithms: ISODEPTH, FDC

Exercise:

If k = 1 then A and B are not outliers
If k = 2, A is an outlier and B is not
If k = 3, A and B are both outliers
When does this approach fail? In high dimensions, but also when our data has a ring shape.

Anomaly & Outliers Detection pt. 2

Deviation-based Approaches:
• General idea: given a set of data points (local group or global set), the outliers are points that do not fit the general characteristics of that set, i.e. the variance of the set is minimized when removing the outliers.
• Basic assumption: outliers are the outermost points of the data set.
• The model was created by Arning et al. in 1996.
We consider a smoothing factor SF(I) that computes, for each I ⊆ DB (database), how much the variance of DB is decreased when I is removed from DB.
If two subsets give an equal decrease in variance, a smaller exception set E is better.
The outliers are the elements of E ⊆ DB for which the following holds: SF(E) ≥ SF(I) for all I ⊆ DB, i.e. the smoothing factor of E is greater than or equal to the smoothing factor of any other subset I that can be selected from the database.

The idea is to find the subset E that maximizes the smoothing factor over all possible subsets of the database, i.e. to find those points whose removal reduces the variance of the dataset the most.
We have to select some minimum and maximum number of outliers that we want to find, and then E is chosen accordingly.

Discussion:
• Similar idea to classical statistical approaches (k = 1 distributions) but independent of the chosen kind of distribution; we do not assume any particular kind of distribution for the data
• The naïve solution is in O(2^n) for n data objects
• Heuristics like random sampling or best-first search are applied to speed up the process
• Applicable to any data type (depends on the definition of the smoothing function)
• Originally designed as a global method
• Outputs a labeling

Distance-based Approaches
• General idea: the outlierness of a point is judged based on the distance(s) to its neighbors. There are many proposed variants.
• Basic assumption: normal data objects have a dense neighborhood; outliers are far apart from their neighbors, i.e. they have a less dense neighborhood.

There are several different techniques:

• Approach 1: an object is an outlier if a specified fraction of the objects in the dataset is more than a specified distance away (Knorr, Ng 1998). Some statistical definitions are special cases of this. This gives a labeling: the point is an outlier or it is not.
• Approach 2: the outlier score of an object is the distance to its k-th nearest neighbor. Here we have a score expressing how much the point is an outlier.

Analysis of Approach 2:

Graphically: the bluer the point, the more it is an inlier; the closer it gets to red, the more it is an outlier. Here the distance is calculated with respect to the first nearest neighbor.

One of the possible weaknesses is illustrated by the graph below.
The red dot appears to be more of an outlier than point D, because the distance between the two teal dots is smaller.
For this reason, this approach is LOCAL and depends strongly on the reference set (i.e. the single closest neighbor).

If we use the 6-nearest neighbors, as in this case:

we may want to identify two clusters, and not a set of separate points recognized as outliers.

In this case we have two clusters and a point D. Point D is recognized as an outlier, while the blue point above cluster 2 is not, because its distance from the closest cluster is much smaller than that of point D. With clusters of different densities the algorithm has problems, because it identifies the blue point as an inlier and point D as an outlier.

Formal definition of Approach 1:

DB(ε, π)-Outliers
• Basic model [Knorr and Ng 1997]
• The parameters are: a radius ε and a percentage π
• A point p is considered an outlier if at most π percent of all other points have a distance to p less than ε, i.e. it is close to few points:
OutlierSet(ε, π) = { p | |{ q in DB : dist(p, q) < ε }| / |DB| ≤ π }
• The approach is local, it labels the points, and it has two parameters, i.e. epsilon and pi.
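A minimal sketch of this definition with brute-force pairwise distances (names are illustrative):

```python
import numpy as np
from scipy.spatial.distance import cdist

def db_outliers(X, eps, pi):
    """DB(eps, pi)-outliers: p is an outlier if at most a fraction pi of the
    other points lie within distance eps of p."""
    D = cdist(X, X)
    n = len(X)
    frac_close = (np.sum(D < eps, axis=1) - 1) / (n - 1)   # exclude p itself
    return frac_close <= pi
```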

Exercise (A):

We need to take points A and B and count the points that lie inside the radius ε. First we try with A, then with B. The points in the radius of A are 5, 6, B, 4. We divide by 10 because there are 10 points in total, obtaining 4/10 = 0.4; since 0.4 is greater than π = 0.15 and not smaller, A is not an outlier.

If we specify k = 2, the outlier scores according to the Manhattan distance will be:
- for A = 2
- for B = 2
- for point 1 = 4, because the outlier score is the distance to the second nearest neighbor. This means that 1 is more of an outlier than 4.

Algorithms for the Distance-based Approach

• Index-based [Knorr and Ng 1998]
• Compute the distance range join using a spatial index structure
• Exclude a point from further consideration if its ε-neighborhood contains more than Card(DB)·π points
• Nested-loop based [Knorr and Ng 1998]
• Divide the buffer in two parts
• Use the second part to scan/compare all points with the points from the first part
• Grid-based [Knorr and Ng 1998]
• Build a grid such that any two points from the same grid cell have a distance of at most ε to each other
• Points need only be compared with points from neighboring cells

Outlier scoring based on kNN distances

General models
• Take the kNN distance of a point as its outlier score [proposed in 2000]
• A modification: aggregate the distances of a point to all of its 1NN, 2NN, ..., kNN into a single outlier score. We specify some value k and then, instead of using only the k-distance, we sum the distances or take their mean.
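A minimal sketch of both variants using scikit-learn's NearestNeighbors (names are illustrative):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_outlier_score(X, k=5, aggregate="kth"):
    """Outlier score = distance to the k-th NN, or the mean of the 1..k NN distances."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1: the point itself is returned
    dist, _ = nn.kneighbors(X)
    dist = dist[:, 1:]                                # drop the zero self-distance
    return dist[:, -1] if aggregate == "kth" else dist.mean(axis=1)
```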

Algorithms - general approaches

• Nested loop
• Naïve approach: for each object compute the kNNs with a sequential scan
• Enhancement: use index structures for the kNN queries
• Partition-based
• Partition the data into micro-clusters
• Aggregate information for each partition (e.g. minimum bounding rectangles)
• This allows pruning micro-clusters that cannot qualify when searching for the kNNs of a particular point

Outlier Detection using the In-degree Number

• Idea: construct the kNN graph for the data set
• Vertices: data points
• Edges: if q ∈ kNN(p) (i.e. if point q belongs to the kNN of p) then there is a directed edge from p to q
• A vertex that has an in-degree less than or equal to T (a user threshold) is an outlier
Example:

We can specify a limit T = 2: a vertex is an outlier if it has at most 2 incoming links. In our case E is an outlier; the in-degrees are D = 4, C = 3, A = 4, B = 4.

Discussion
• The in-degree of a vertex in the kNN graph equals the number of reverse kNNs (RkNN) of the corresponding point
Example:

• The RkNNs of a point p are those data objects having p among their kNNs
• Intuition of the model: outliers are
• points that are among the kNNs of fewer than T other points, i.e.
• points that have fewer than T RkNNs
• Outputs an outlier label
• It is a local approach (depending on the parameter k)

Strengths/Weaknesses of Distance-Based Approaches

Pros
• Simple to implement, use, understand and justify

Cons
• Generally expensive – O(n^2)
• Sensitive to parameters
• Sensitive to variations in density
• Distance becomes less meaningful in high-dimensional spaces
• Difficult with categorical variables

Density-based Approaches
• General idea
• Compare the density around a point with the density around its local neighbors
• The relative density of a point compared to its neighbors is computed as an outlier score
• Approaches differ in how they estimate density

• Basic assumptions
• The density around a normal data object is similar to the density around its neighbors
• The density around an outlier is considerably different from the density around its neighbors

Density can be defined in various ways; the simplest definition directly yields an outlier score.
• Density-based outlier: the outlier score of an object is the inverse of the density around the object.
• It can be defined in terms of the k nearest neighbors

• One definition: the inverse of the distance to the k-th neighbor
• Another definition: the inverse of the average distance to the k neighbors
• DBSCAN definition: we run DBSCAN and then identify the outliers
• If there are regions of different density, this approach can have problems:

Relative Density
• It is better to consider the relative density, with respect to a given subset of points.
• The density of a point relative to that of its k nearest neighbors is considered:

The density in the numerator can be defined as 1 / (distance of x from z), where z is the k-th NN.

ALGORITHM: we calculate the outlier score of each object for a specified number of neighbors (k) by first computing the density of the object, density(x, k), based on its nearest neighbors. The average density of the neighbors of the point is then calculated and used to compute the average relative density of the point. This quantity provides an indication of whether x is in a denser or sparser region of the neighborhood than its neighbors, and it is taken as the outlier score of x.

Local Outlier Factor (LOF)


This is a relative approach that uses a different definition of density and works better in practice.
Motivation for proposing it:
• Distance-based outlier detection models have problems with different densities
• How can we compare the neighborhoods of points coming from areas of different densities?
Example

• Using the distance-based outlier model DB(ε, π):
  • it is impossible to find parameters ε and π such that o2 is an outlier but none of the points in cluster C1 (e.g. q) is an outlier
• Using the kNN-distance to find the outliers:
  • it is difficult to find a k such that o1 and o2 are well classified. Indeed, the kNN-distances of objects in C1 (e.g. q) are larger than the kNN-distance of o2

Solution: consider relative density


• For each point, compute the density of its local neighborhood
• Compute local outlier factor (LOF) of a sample p as the average of the ratios of the density of sample p
and the density of its nearest neighbors
• Outliers are points with largest LOF value
• It is a local approach, it returns a score and we can decide to consider the score or the top n

Some features of LOF:

1. It adopts the reachability distance: the first difference with respect to the trivial density-based approach
• It introduces a smoothing factor
• In this case the order of the operands matters:
  reach-dist_k(p, o) = max{ k-distance(o), dist(p, o) }
• o = the point taken as the center, k = defines the k-nearest neighborhood, p = the point whose distance I want to compute

• the reachability distance is the maximum between the distance from p to o and the k-th distance of the neighborhood of o.
This means that if I measure the distance from o to A, I replace this distance with the k-distance, i.e. the radius of the circle in our case; this is because point A is inside the circle, i.e. it is closer.
For a point p2 that lies outside the circle, instead, I use the actual distance dist(p2, o).

2. The local reachability density (lrd) of point p is calculated as the inverse of the average reachability distance of the kNNs of p. This implements the previous density definition, but using the reachability distance instead:
  lrd_k(p) = 1 / ( Σ_{o ∈ kNN(p)} reach-dist_k(p, o) / |kNN(p)| )
• At the denominator: the cardinality of the kNN of p
• At the numerator: the sum, over the points o belonging to the kNN of p, of the reachability distance reach-dist_k(p, o); i.e. we take each neighbor o as the center of the analysis and compute its reachability distance with respect to p.

Example:

The reachability distance between P and A = the greater of the distance (P, A) and the distance (A, B). This means that P is a nearest neighbor of A, but not of C, because P is further away than the k-th nearest neighbor of C.

3. Local outlier factor (LOF) of point p



The LOF is calculated as the sum, over the neighbors o of p, of the ratios lrd(o) / lrd(p), divided by the cardinality of the kNN of p (i.e. the average of these ratios):
  LOF_k(p) = ( Σ_{o ∈ kNN(p)} lrd_k(o) / lrd_k(p) ) / |kNN(p)|

Example (k = 2): calculate the LOF of A and B; here the professor uses the Manhattan distance as the base distance, not the reachability distance:

1. First step= calculate the KNN of points A and B:


KNN(A)={B,4}
knn(B)={A,4}

2. CONSIDER THE KNN of the k-th NNs of A and B:


knn(6)={A,3}
KNN(4)={B,2}

3. We calculate the local reachability distance


LRD(A)1/((1+6)/2) = 0.66
LRD(B)1/((1+2)/2) = 0.66
LRD(6)1/((2+2)/2) = 0.5
LRD(4)1/((2+1)/2) = 0.66

4. We calculate the LOF of A as the sum of the ratios between the lrd of each point in the neighborhood of A and the lrd of A, divided by the cardinality of the neighborhood:
LOF(A) = (0.66/0.66 + 0.5/0.66) / 2 (cardinality of kNN(A)) = 0.91
LOF(B) = (0.66/0.66 + 0.66/0.66) / 2 = 1

Properties
When is a point an outlier?
• LOF ≈ 1: point is in a cluster (region with homogeneous density around the point and its neighbors)
• LOF >> 1: point is an outlier
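In practice LOF does not need to be computed by hand; a minimal sketch (mine) using scikit-learn's LocalOutlierFactor, with arbitrary data and parameters:

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.vstack([np.random.randn(100, 2),           # dense cluster
               5 + 0.2 * np.random.randn(20, 2),  # smaller, tighter cluster
               [[10.0, -10.0]]])                  # an isolated point

lof = LocalOutlierFactor(n_neighbors=10)
labels = lof.fit_predict(X)                       # -1 = outlier, 1 = inlier
scores = -lof.negative_outlier_factor_            # the actual LOF values

print(scores[-1], labels[-1])   # the isolated point has LOF >> 1 and label -1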

Discussion
• The choice of k (MinPts in the original paper) specifies the reference set
• It implements a local approach (the resolution depends on the user's choice of k): I can decide to use the raw outlier score, or to pick a top number of points and label them as outliers
• It is a local approach, initially defined for scoring (it assigns an LOF value to each point)

As you can see in the 2D and 3D representations, if the cluster is denser then the LOF of its points will be lower.

Mining Top n Local Outliers


Idea:
• Usually, a user is only interested in the top-n outliers • Do not
compute the LOF for all data objects => save runtime

Method
• Compress data points into micro-clusters using the CFs of BIRCH [Zhang et al. 1996]. We focus on the micro-clusters and compute the LOF only for points within the smallest micro-clusters. In this way the computation is optimized, because we refer only to a small subset.
• Derive upper and lower bounds of the reachability distances, lrd values, and LOF values for points within a micro-cluster
• Compute upper and lower bounds of the LOF values for micro-clusters and sort the results w.r.t. ascending lower bound

• You can also decide to prune micro-clusters that cannot accommodate points among the top-n outliers (n highest LOF values)
• Iteratively refine the remaining micro-clusters and prune points accordingly

Connectivity-based outlier factor (COF)


Weakness of LOF: if the parameter k is not set appropriately for the dataset, every point ends up with roughly the same LOF and even the point in the center (in the example) is not classified as an outlier. COF overcomes this problem of LOF.

• Motivation for using COF: in regions of low density it may be hard to detect outliers, so choosing a low value for k is often not appropriate
• Solution given by COF: treat "low density" and "isolation" differently
• Example

Influenced Outliers (INFLO)


Motivation for using INFLO
• If clusters of different densities are not clearly separated, LOF will have problems

Idea
• Take a symmetric neighborhood relationship into account, instead of using the simple direct neighborhood. Instead of using the kNN alone, we consider the k-influence space, defined as the union of the k nearest neighborhood kNN(p) and the reverse nearest neighbors RkNN(p).

Model
• Density is simply measured by the inverse of the kNN distance, i.e., den(p) = 1/k-distance(p)
• Influenced outlierness of a point p:

it is the sum of den(o) over all points o of the k-influence space, divided by the cardinality of the k-influence space and normalized with respect to the quantity den(p).
• That is, INFLO takes the ratio of the average density of the objects in the neighborhood of a point p (i.e., in kNN(p) ∪ RkNN(p)) to p's density:
  INFLO_k(p) = ( Σ_{o ∈ IS_k(p)} den(o) / |IS_k(p)| ) / den(p)

Proposed algorithms for optimizing top-n outliers •


Index-based
• Two-way approach
• Micro cluster based approach

Properties
Similar to LOF:
• INFLO ≈ 1: point is in a cluster
• INFLO >> 1: point is an outlier

Discussion
• Outputs an outlier score
• Originally proposed as a local approach (the resolution of the reference set IS_k can be adjusted by the user by setting parameter k)

Strengths/Weaknesses of Density-Based Approaches


Pros
• Simple
Cons
• Expensive – O(n²)
• Sensitive to parameters
• Density becomes less meaningful in high-dimensional space

Compared to the distance-based approach, density-based approaches solve the problem of different densities, but they remain expensive and still suffer in high-dimensional spaces.

Clustering and Anomaly Detection


(LESSON WEDNESDAY 3.3.21)
The question now is: Are outliers
just a side product of clustering algorithms?
Yes, as clustering algorithms are NOT designed to find outliers, but to find similar points, group them and
separate them from other points. If these are then outliers or points belonging to other clusters, it is not considered
by the clustering algorithm. • Some approaches
like DBSCAN do not assign points to clusters, so they label them as noise points (DBSCAN, OPTICS) and it is
easy to identify outliers using a clustering algorithm. • So the idea is to run the clustering
algorithm and label all the points labeled as noise as outliers .
The problem here is that, as said before:
• clustering algorithms are optimized to find clusters (obviously) → how good they are at finding outliers strictly depends on how well the clustering algorithm captures the structure of the clusters

• A set of many abnormal data objects that are similar to each other would be recognized as a cluster rather than as noise/outliers.
These problems arise especially with k-means: we may assume that clusters with fewer than a certain threshold of points are outliers, but points that are very different from each other can still end up grouped in the same (large enough) cluster, and therefore they are NOT recognized as outliers.
In DBSCAN, however, we manage to label them as noise.

Clustering-Based Approaches
• Clustering-based Outlier: an object is a cluster-based outlier if it does not strongly belong to any cluster. For approaches like DBSCAN it is simple (noise points are outliers).
  • For prototype-based clusters, an object is an outlier if it is not close enough to a cluster center. We have to decide on a strategy of our choice.
  • For density-based clusters, an object is an outlier if its density is too low.
  • For graph-based clusters, an object is an outlier if it is not well connected. In a sense we can imagine them as the result of a hierarchical clustering, considering the dendrogram: the object is an outlier if in the dendrogram it is merged in one of the last steps and it is alone; such a point is certainly very different from the others, otherwise it would have been merged earlier.
• Other issues include the impact of outliers on the clusters and on the number of clusters: the presence of outliers can affect the computation of the clustering algorithm. We would not have this problem if we knew that there are no outliers in the dataset: once the algorithm has run, we know for each point whether or not it is an outlier, but an approach like DBSCAN can decide whether an external point is core, border or noise only WHILE the algorithm is being run. Possible solution: we can consider the labeling of the points in the dataset frozen and then, when a new point arrives, simply estimate whether it is core, border, or noise. In any case, we cannot use the clustering algorithm as it is, but we need to make sure that its result can be used.

For example, in the following case we see that it is easy (for example by choosing k = 9) to end up in the situation highlighted: we have 4 well-defined clusters, while the other 5 clusters, in reality, should just be isolated points. After that, we can compute the distance between a certain point and the centers of all clusters having at least a certain number of points (for example we calculate the distance from the point indicated by the pointer to its center and to all other cluster centers), then we average and decide on a threshold. This is distance-based reasoning.

Below, let's imagine we have the two clusters we see: for point C the distance is computed with respect to the
cluster on the left, while for point D with respect to the one on the right. The {distance from point A, with
respect to the left cluster} >{ distance D with respect to the right cluster}. Again, this is a density issue, of course.
We have a biased evaluation.

If instead we consider the relative distance rather than the absolute distance, we can solve the problem; a minimal sketch of this idea with k-means follows.
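A sketch (mine, not from the lecture) of the relative-distance idea: the score of a point is its distance from the assigned centroid, divided by the median distance of that cluster's points from the same centroid; data and the number of clusters are arbitrary.

import numpy as np
from sklearn.cluster import KMeans

def kmeans_relative_distance(X, n_clusters=3):
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    labels, centers = km.labels_, km.cluster_centers_
    dist = np.linalg.norm(X - centers[labels], axis=1)      # absolute distances
    scores = np.empty(len(X))
    for c in range(n_clusters):
        mask = labels == c
        scores[mask] = dist[mask] / np.median(dist[mask])   # relative distances
    return scores                                           # large score -> candidate outlier

X = np.vstack([np.random.randn(100, 2), 4 + 0.3 * np.random.randn(100, 2), [[12.0, 12.0]]])
print(kmeans_relative_distance(X)[-1])                      # the injected point gets a high score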

Strengths/Weaknesses of Clustering-Based Approaches


Pros
• Simple
• Many clustering techniques can be used
Cons
• Can be difficult to decide on a clustering technique •
Can be difficult to decide on number of clusters
• Outliers can distort the clusters (and therefore the evaluation of the point as an outlier or not is distorted). The
risk is to cause bias to the future model.

Since all of these techniques are unsupervised, we can apply more than one technique for outlier detection and then decide on a policy for which points are outliers (for example, a point is an outlier if the majority of the methods we used say so).

High-dimensional Approaches

Challenges: the curse of dimensionality


So far we have seen methods that have major problems with dimensionality: •
Relative contrast between distances decreases with increasing dimensionality •
Data is very sparse, almost all points are outliers •
Concept of neighborhood becomes meaningless as the number of dimensions increases

Solutions
• Use more robust distance functions and find full-dimensional outliers: we did this before with clustering-based, statistical, density-based and deviation-based approaches. Here we mean that all the dimensions of the feature space contribute simultaneously to the identification of outliers, so as not to suffer from the dimensionality problem.
• Find outliers in projections (subspaces) of the original feature space: for example, we reduce the dimensionality and then run the algorithm we want with more confidence.

ABOD – Angle-based Outlier Degree


• Angles are more stable than distances in high dimensional spaces (eg the popularity of cosine-based
similarity measures for text data). For example, when we consider the cosine similarity we now see:

• Object o is an outlier if most of the other objects are located in similar directions
• Object o is no outlier if many other objects are located in varying directions
In the image below we consider directions based on angles. The notion of outlier is similar to that of the depth-based approach, as we assume that the outlier is located at the boundary instead of at the center.

EXAM QUESTION: we cannot use the depth-based approach in high-dimensional spaces, as it requires the computation of a convex hull, i.e. finding the "perimeter" of the dataset; this is simple in a few dimensions, but becomes computationally expensive in high-dimensional datasets.

Model
• Consider, for a given point p, the angle between any two instances x and y: I calculate the angle formed as follows, comparing the vector going from p to x with the one going from p to y

• Consider the spectrum of all these angles: we calculate the angles between p and every other possible pair of points in the dataset and then look at the spectrum: if it is small, we have a low outlier score → the broadness of this spectrum is a score for the outlierness of a point. The spectrum therefore looks at the variance, which is always greater for the inliers: an outlier is a point with a small variance of the angles towards all other points, while an inlier has a large variance of these angles. This is a global method, as it considers all points in the dataset.

Spectrum:

How do we measure this variance?
• Measure the variance of the angle spectrum: we measure the cosine between the vectors x − p and y − p and then compare these quantities over all pairs. The denominator is needed for low-dimensional data (where angles are less reliable): the assumption of the formula is that in low-dimensional datasets the vector x − p alone is not reliable, since the points could have a globular shape and there would be many similar angles.

Properties
• Small ABOD => outlier
• High ABOD => no outlier

ABOD Algorithms
• The naïve algorithm is O(n³): computationally expensive
• Approximate algorithm based on random sampling for mining top-n outliers: for example, if we select a sample of m points, where m is much smaller than n, to calculate the ABOD of the n points, the complexity becomes n·m², while in the previous case it was n·n² = n³
- Do not consider all pairs of other points x, y in the database to compute the angles
- Compute ABOD based on samples => lower bound of the real ABOD

- Alternatively, run k-means with a high k and use the centroids as sample points to calculate the various angles, so that the solution is slightly more stable; but again, if the choice of k is not reliable, we may end up using as a centroid a point in the middle of the space that represents nothing → bias!
- Filter out points that have a high lower bound
- Refine (compute the exact ABOD value) only for a small number of points: instead of using all n points we choose only q points (e.g. q = 10); running them against the sample of m data points, the final complexity becomes q·m², which is much less than the original, but the result is not as accurate.

Discussion
• Global approach to outlier detection
• Outputs an outlier score
• ABOD: useful for high-dimensional data.
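A rough sketch (mine, not the course code) of the ABOD idea: for each point p, take the variance over all pairs (x, y) of the distance-weighted cosine of the angle at p; small values indicate outliers. It is the naïve O(n³) version, so only for tiny datasets.

import numpy as np
from itertools import combinations

def abod_scores(X):
    n = len(X)
    scores = np.empty(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        vals = []
        for a, b in combinations(others, 2):
            u, v = X[a] - X[i], X[b] - X[i]
            # cosine of the angle, further weighted by the squared vector lengths
            vals.append(np.dot(u, v) / (np.dot(u, u) * np.dot(v, v)))
        scores[i] = np.var(vals)
    return scores              # small ABOD => outlier

X = np.vstack([np.random.randn(30, 2), [[8.0, 8.0]]])
print(abod_scores(X).argmin())  # index of the most outlying point (likely the appended one)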

Grid-based Subspace Outlier Detection Model

Partition the data space with an equi-depth grid (φ = number of cells in each dimension): in the image in question we have 3 dimensions (we can have more); we generate a grid by binning each dimension into φ ranges. φ represents the number of cells in each dimension and k is the number of dimensions of a (sub-space) cell. It is like binning along all possible dimensions.

• Sparsity coefficient S(C) for a k-dimensional grid cell C: we count how many points fall in the cell and compare this count with the expected number of points (the formula is on the slide). If S(C) < 0, the points belonging to that cell are candidate outliers, as it means that there are fewer points in the cell than expected, and vice versa.
• where count(C) is the number of data objects in C
• S(C) < 0 => count(C) is lower than expected
• Outliers are those objects that are located in lower-dimensional cells with negative sparsity coefficient

→ it is a global, labeling method, as it labels all the points inside a cell as outliers or inliers. We can turn it into a scoring method by assigning the same score to all points that belong to the same cell.

Algorithm
• Find the m grid cells (projections) with the lowest sparsity coefficients; finding the right parameters is crucial → the dimensionality is given, but φ must be defined: if it increases, the model is more accurate but also more complex (the number of cells grows exponentially with the number of dimensions)
• The brute-force algorithm is O(φ^d): we have to calculate the sparsity coefficient for each cell
• Evolutionary algorithm (input: m and the dimensionality of the cells) → to look at if interested, but it is not required for the exam

Discussion
• Results need not be the points from the optimal cells
• Very coarse model (all objects that are in a cell with fewer points than expected)
• Quality depends on grid resolution and grid position: if we are in 2 dimensions and we have something like the image below, and the grid is placed this way, we are splitting two clusters of points
• Outputs a labeling
• Implements a global approach (key criterion: globally expected number of points within a cell)

Model-based Approaches
This is the only approach where a model is “learned” from the data in order to assign an outlier
score.
Isolation Forest
Idea: few and different instances can be isolated more quickly. It is the only approach that considers instances that are both different from each other and few, based on Ricky's experience.
• Given the dataset, build a forest of trees.
• For each tree (so as not to have bias):
1. Get a sample of the data
2. Randomly select a dimension (e.g. dimension y); it is not the same process as in decision trees, in which we look for the best values to split after calculating the impurity etc.; here we randomly take a dimension
3. Randomly select a value on that dimension. In the image below, the points have already been sampled from a larger dataset.

4. Draw a straight line through the data at that value and split the data: for example, if the value is > 2 I go to the right, otherwise to the left; I create a rule.

5. Repeat until the tree is complete: each leaf contains ONE AND ONLY ONE POINT (or I set a minimum size and stop earlier). Each region identified by the splits (see image) contains one point.

6. Generate multiple trees -> forest



- Intuition: anomalies will be isolated in only a few steps: outliers are points that can be found in the initial branches of the tree, as they require only a few splits to be isolated.
Since the approach is random, it obviously has to be repeated several times.
- Nominal points (i.e. inliers) need more splits to be separated.

So how do I find outliers now? I calculate the number of steps I have to follow in each tree from root to leaf to find a specific point and I average these values. On the left we have a tree: in blue a very long path (the one for the inlier) and in red a short one (for the outlier). Similarly, in the right graph, in blue we see many steps for the inliers and vice versa. The formula shows how to compute the outlier score, where E, h, c are functions capturing the aforementioned path lengths. The simpler it is to isolate a point from the others with this random structure, the more the point can be considered an outlier, as it means that it is very different (and distant) from the others and therefore is found alone in a region of space.
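A minimal sketch (mine) using scikit-learn's IsolationForest; the data and parameters are arbitrary.

import numpy as np
from sklearn.ensemble import IsolationForest

X = np.vstack([np.random.randn(200, 2), [[7.0, 7.0]]])   # one injected anomaly

iso = IsolationForest(n_estimators=100, random_state=0).fit(X)
labels = iso.predict(X)            # -1 = anomaly, 1 = nominal point
scores = iso.score_samples(X)      # the lower, the more anomalous

print(labels[-1], scores[-1])      # the injected point should be labeled -1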

Isolation Forest
• Computationally efficient
• Parallelizable: we can build all the isolation trees we want in parallel
• Handles high-dimensional data: I do not have to calculate distances considering the various features at each split. It is not as simple an approach as the others, but for now it appears to be among the most useful and used.
• Inconsistent scoring can be observed: below, in red we have an outlier and in yellow an inlier. Although some of the highlighted regions are not dark red but lighter (so the points there look less anomalous), the model assumes that the points in the dark red areas are more likely to be outliers.

Especially the last picture above is wrong.


These problems can be solved through the following approach:

Extended Isolation Forest


• Idea: Few and different instances can be isolated quicker
• Given the dataset build a forest of trees.
• For each tree:
1. Get a sample of the data
2. Randomly select a normal vector (instead of a dimension; everything else stays the same)
3. Randomly select an intercept (the normal vector and the intercept together define the splitting hyperplane)
4. Draw a straight line through the data at that value and split the data
5. Repeat until the tree is complete
6. Generate multiple trees -> forest
The result is the following: we no longer have orthogonal divisions of the space, but oblique ones:

Characteristics
Isolation Forest
• Computationally efficient
• Parallelizable
• Handles high-dimensional data
• Inconsistent scoring can be observed
Extended Isolation Forest
• Computationally efficient
• Parallelizable
• Handles high-dimensional data
• Consistent scoring: in the previous case the score looked more like a sort of Gaussian distribution (concentrated in the center rather than following the shape of the data), while this time the model captures the structure better: in the image below we see on the left the old result (Isolation Forest) and on the right the new one (Extended Isolation Forest)

Summary
• Different models are based on different assumptions
• Different models provide different types of output (labeling/scoring)
• Different models consider outliers at different resolutions (global/local)
• Thus, different models will produce different results
• A thorough and comprehensive comparison between the different models and approaches is still missing

MODULE 2
Naïve Bayes Classifiers
Bayes Classifier
The Bayes classifier is a probabilistic framework (based on probabilities) for solving classification problems, where P(X=x) is the probability (between 0 and 1, where 0 = never happens and 1 = always happens) that the event X=x happens (e.g. X = "a student attended the DM2 course", and x can be "Yes" or "No" if it is a binary variable).
With two events happening together, we compute the joint probability P(X=x, Y=y) where Y is another event
(Ex: X = “student attended DM2 course, Y="student has black eyes").
Instead, we call conditional probability P(Y=y|X=x) the probability of a student having black eyes GIVEN
THAT the student attended DM2 course.

The joint and conditional probabilities are needed to compute the Relationship between X and Y, which is: P(X,Y)
= P(Y|X) P(X) = P(X|Y) P(Y), which is the basis for the Bayes Theorem.
In fact the Bayes Theorem is described as: P(Y|X) = P(X|Y)P(Y) / P(X) ÿ The probability of obtaining a certain
outcome Y given the features X is equal to the probability of having certain features X given the outcome Y
multiplied by the probability of the outcome Y divided by the probability of the feature.
With this formulation we can notice that Y is the class variable and X is the set of features describing a
certain instance.

Another useful property used to show how to apply the Bayesian classifier:


P(X=x) = P(X=x, Y=0) + P(X=x, Y=1) ÿ The probability that event X assumes value x equals the probability of X=x
happening together with event Y with value 0 (“No”) plus the probability of X=x happening together with event
Y with value 1 (“Yes”).

Example on Bayes' theorem:


Consider a football game. Team 0 wins 65% of the time, Team 1 the remaining 35%. Among the games won by
Team 1, 75% of them are won playing at home. Among the games won by Team 0, 30% of them are won at Team
1's field.
If Team 1 is hosting the next match, which team will most likely win?
Y = “team wins”, y=0 team 0 wins, y=1 team 1 wins X = “team hosts
match”, x=0 team 0 hosts match, x=1 team 1 hosts match

• P(Y = 0) = 0.65 ÿ the probability that Team 0 wins 65% of the time •
P(Y = 1) = 0.35 ÿ the probability that Team 1 wins 35% of the time •
P(X = 1|Y = 1) = 0.75 ÿ Among the games won by Team 1, 75% of them are won playing at home. • P(X =
1|Y = 0) = 0.30 ÿ Among the games won by Team 0, 30% of them are won at Team 1's field

• Objective P(Y = 1|X = 1): What is the probability that Team 1 wins the game given that it plays at home?
Solution: apply the Bayes theorem formula P(Y|X) = P(X|Y)P(Y) / P(X):
P(Y = 1|X = 1) = P(X = 1|Y = 1) P(Y = 1) / P(X = 1) =
I rewrite the denominator by applying the formula P(X=x) = P(X=x, Y=0) + P(X=x, Y=1):
= 0.75 x 0.35 / (P(X = 1, Y = 1) + P(X = 1, Y = 0)) =
I apply the relationship formula P(X,Y) = P(Y|X) P(X) = P(X|Y) P(Y):
= 0.75 x 0.35 / (P(X = 1|Y = 1) P(Y = 1) + P(X = 1|Y = 0) P(Y = 0)) =
Now that all terms are known we can substitute the values:
= 0.75 x 0.35 / (0.75 x 0.35 + 0.30 x 0.65) = 0.5738
→ The probability that Team 1 wins when playing at home is more than 50%.
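A quick check (mine) of the arithmetic above:

p_y1, p_y0 = 0.35, 0.65
p_x1_given_y1, p_x1_given_y0 = 0.75, 0.30

p_x1 = p_x1_given_y1 * p_y1 + p_x1_given_y0 * p_y0   # law of total probability
p_y1_given_x1 = p_x1_given_y1 * p_y1 / p_x1          # Bayes theorem
print(round(p_y1_given_x1, 4))                       # 0.5738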

We treat the relation probabilistically using P(Y|X):

Application of the Bayes Theorem for Classification


The idea is that we learn the posterior probability P(Y|X) for every combination of X and Y.
By knowing these probabilities, a test record X' can be classified by finding the class Y' that maximizes the
posterior probability P(Y'|X'). This is equivalent of choosing the value of Y' that maximizes P(X'|Y')P(Y'),
and that's because we are going to have P(X') as a denominator.

The Naïve Bayes Classifier estimates the class-conditional probability by assuming that all the attributes (columns) are conditionally independent given the class label y. This is why it is called Naive: we assume conditional independence among the different features (which is different from statistical independence between rows of the data, as discussed in anomaly detection). So, if we want to use this classifier, we should make sure that the attributes are independent of each other by checking for correlation, keeping only one attribute out of two highly correlated ones.

The conditional independence is stated as:

where each attribute set X = {X1, X2, … Xd}.


This formula explains that: The probability of X given a specific value of Y is equal to the product of the
probability of the single values of every column X given the target y.

Given three variables Y, X1,X2 we can say that Y is independent from X1 given X2 if the following condition
holds:

The probability of Y given X1 and X2 is equal to the probability of Y given X2, this means that Y is
independent from X1, so X1 doesn't affect the probability of Y.
With the conditional independence assumption, instead of computing the class-conditional probability for every
combination of X we only have to estimate the conditional probability of each Xi (every value of the columns)
given Y. Thus, to classify a record the naive Bayes classifier computes the posterior for each class Y and takes
the maximum class as result.

How the Naive Bayes classifier is applied:



How to estimate the probability from data:


For Categorical attributes

• Probability of class y: P(Y = y) = Ny / N, where:
  - Ny = number of records with outcome y
  - N = total number of records

• Probability for categorical attributes: P(X = x | Y = y) = Nxy / Ny, where:
  - Nxy = number of records with value x and outcome y
  - Ny = number of records with outcome y

Examples:
• P(Evade = Yes) = 3/10 → consequently P(Evade = No) = 7/10
• P(Marital Status = Single | Yes) = 2/3, P(Marital Status = Single | No) = 2/7

For continuous attributes


Discretize the range into bins • one
ordinal attribute per bin • violates
independence assumption, because this is tied to the binning

Two-way split: (X < v) or (X > v), so we have two bins


• choose only one of the two splits as new attribute

Probability density estimation:


• Assume the attribute follows a normal distribution

• Use data to estimate parameters of distribution (eg, mean and standard deviation) • Once probability
distribution is known, can use it to estimate the conditional probability P(X|y)

Assuming Normal distribution:


- μij can be estimated as the mean of Xi (every continuous attribute; in our case it is only Taxable Income) for the records that belong to class yj.
- σij is the standard deviation of the continuous attribute Xi for the records belonging to class yj.
- The class-conditional probability is then estimated with the Gaussian density: P(Xi = xi | Y = yj) = 1/(sqrt(2π)·σij) · exp( −(xi − μij)² / (2·σij²) )

Ex: P(Xi = xi | Y = y) = P(Taxable Income = 120 | No) = 0.0072, using mean μij = 110 and standard deviation σij = 54.54.
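A small check (mine) of the value above, evaluating the Gaussian density with the class-conditional mean and standard deviation of Taxable Income for class No:

from scipy.stats import norm

print(norm.pdf(120, loc=110, scale=54.54))   # ~0.0072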

Exercises:

M-estimate of Conditional Probability (the values used are taken from the previous year)
If one of the conditional probabilities is zero, then the entire expression becomes zero, which is a problem especially when this happens for both classes.
Example: given X = {Refund = Yes, Divorced, Income = 120k}, if P(Divorced|No) is zero instead of 1/7, then:

- P(X|No) = 3/7 x 0 x 0.00072 = 0
- P(X|Yes) = 0 x 1/3 x 10^-9 = 0
In this case we cannot say if one probability is higher than the other. But we can use the M-estimate to compute the probability in a different way:

m and p are user-specified parameters: we correct the original fraction by adding m·p at the numerator and m at the denominator (p is, for example, a prior estimate of the probability of observing xi among records with class yj).
In the example with m = 3 and p = 1/m = 1/3 (ie, Laplacian estimation) we have:
P(Married |Yes) = (0+3*1/3)/ (3+3) = 1 / 6
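A tiny helper (mine) implementing the m-estimate correction used above:

def m_estimate(n_xy, n_y, m, p):
    """P(x|y) corrected with m*p at the numerator and m at the denominator."""
    return (n_xy + m * p) / (n_y + m)

# P(Married | Yes) with N_xy = 0, N_y = 3, m = 3, p = 1/3 (Laplacian estimation)
print(m_estimate(0, 3, 3, 1/3))   # 1/6 ≈ 0.1667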

Recap on Naive Bayes Classifier


- Robust to isolated noise points (points that are noise with respect to a certain class), because such points have little influence on the estimated probabilities, so the classifier still handles them well.
- Handles missing values by ignoring the instance during the probability estimate calculations; since conditional independence between the columns is assumed, a missing attribute can simply be left out of the product at classification time.
- Robust to irrelevant attributes, because their conditional probabilities are similar for every class and multiplying by them barely changes the comparison.
- Problem: the independence assumption may not hold for some attributes
- In that case, use other techniques such as Bayesian Belief Networks (BBN, not treated in this course), which consider the dependencies between attributes.

Lesson Monday 15/03


Linear and logistic regression
Regression
• Given a dataset containing N observations with two variables (Xi, Yi), i = 1, 2, ..., N
• Regression is the task of learning a target function f that maps each input attribute set X to an output Y.
• The goal is to find the target function that fits the input data with the least error.
• The error function to be minimized can be expressed as E = Σi [Yi − f(Xi)]² (minimize the residuals)



Linear regression
Linear regression is a linear approach to modeling the relationship between a dependent variable Y and one or more independent
(explanatory) variables X.
When there is only one explanatory variable it is called simple linear regression. If instead there are more explanatory variables, the
process is called multiple linear regression. When there are multiple correlated dependent variables, the process is called
multivariate linear regression.

What does it mean to predict Y?

Given X = 5, there are many different values of Y at X = 5.
When we say we predict Y at X = 5, we are really asking: what is the expected value (average) of Y at X = 5?

Formally, the regression function is given by E(Y|X=x) : expected value of Y given a specific X (like Bayes). This is the expected
value of Y at X=x.
The ideal or optimal predictor of Y based on X, called regressor, is therefore: f(X) = E(Y | X=x)
Then the yellow line finds the mean value of the points

Simple Linear Regression

Y = mX + b or Y = β0 + β1·X: two different notations for the same formula; the second is more general.

• m or β1 is the slope, i.e. the inclination of the line
• b or β0 is the intercept (bias), which adjusts the height at which the line crosses the Y axis (the value of Y at X = 0)
In general, this relationship cannot hold exactly for the largely unobserved population.
The deviations of the observed Y from the line are called errors or residuals, so we need to minimize them by looking for the best parameters.
The goal is to find the estimated values m' and b' for the parameters m and b that provide the "best" fit for the data points.

Least Square Method


A standard approach to do this is to apply the method of least squares, which tries to find the parameters m, b that
minimize the sum of squared error

SSE = Σi (yi − (m·xi + b))², also known as the residual sum of squares.
The method works starting with random m and b, iteratively changing them using the corresponding partial derivatives of the objective (as is done for the likelihood), until convergence is achieved.

What I want to do is find the blue line that minimizes the sum of all the gray lines, which represent the errors/residuals. It is not exact, as it is a generalization over the two variables.
The blue line shows the least-squares fit. The lines from the red points to the regression line illustrate the residuals.
For any other choice of slope m or intercept b, the SSE between that line and the observed data would be larger than the SSE of the blue line.
The smaller the error, the better the model.
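A sketch (mine) of a least-squares fit on synthetic data: the closed-form estimates of m and b, checked against numpy's polyfit.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 1.5, 50)       # true slope 2, intercept 1

m_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b_hat = y.mean() - m_hat * x.mean()
sse = np.sum((y - (m_hat * x + b_hat)) ** 2)     # the quantity being minimized

print(m_hat, b_hat, sse)
print(np.polyfit(x, y, deg=1))                   # same slope and intercept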

Examples:

The X-axis shows different advertising amounts for TV, radio and newspapers, and the Y-axis shows the sales amount. The blue lines represent the linear regressions. For example, if I invest 20 in radio, how much will I sell? According to the graph the answer is about 12.5, but the real observations are the red points and must be considered.

Alternative Fitting Methods



• Linear regressions are often fitted using the least squares approach
• However, they can be fitted in other ways, such as minimizing a penalized version of the least squares cost function as in ridge regression (L2-norm penalty) and lasso (L1-norm penalty).
• Tikhonov regularization, also called ridge regression, is a method of regularizing ill-posed problems, particularly useful for mitigating multicollinearity, which commonly occurs in models with a large number of parameters (as in multiple linear regression, i.e. when you have multiple X to predict a single Y)

[Multicollinearity: is a phenomenon in which one predictor variable in a multiple regression model can be linearly
predicted from the others with a substantial degree of accuracy. In this situation, the coefficient estimates of the
multiple regression may change erratically in response to small changes in the model or the data.

It concerns statistical models expressed through a linear equation, when some or all of the variables are
strongly correlated with each other, making it very difficult, and sometimes impossible, to identify the influence
of the variables separately and also to obtain a sufficiently reliable estimate of their individual effects .]

• Lasso (least absolute shrinkage and selection operator) performs both variable selection and regularization in order to improve the prediction accuracy and interpretability of the statistical model it produces.

Linear Regression Models Objective functions

(this type of regularization will be re-presented in the neural network)


λ is a coefficient that adds a penalty term to the error function (a larger λ increases the weight of the penalty).
Important to remember: whichever model you decide to use, the objective function above is the quantity to minimize.
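A sketch (mine) comparing plain least squares with the penalized versions in scikit-learn; alpha is the regularization strength (the λ of the formulas above) and the synthetic data is arbitrary.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 4] = X[:, 0] + 0.01 * rng.normal(size=100)      # nearly collinear feature
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 2))
# Ridge shrinks the correlated coefficients; Lasso can set some exactly to zero.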

Evaluating Regression (evaluating a regression model)


1. Coefficient of determination R²
It is the proportion of the variance in the dependent variable that is predictable from the independent variable(s):
R² = 1 − Σi (yi − ŷi)² / Σi (yi − ȳ)²
The sum of squared errors is at the numerator of this ratio, and the sum of squared differences between the real values and their average is at the denominator.
The closer the result is to 1, the better; a value of 0 means that there is no relationship between the prediction and the dependent variable.
Range: from 0 to 1.

2. Mean Squared/Absolute Error MSE/MAE


A risk metric corresponding to the expected value of the squared (quadratic) or absolute error/loss:
MSE = (1/n) Σi (yi − ŷi)²    MAE = (1/n) Σi |yi − ŷi|
i.e. the sum of the squared differences of the residuals divided by the number of samples; the formula on the right is the equivalent one, but with absolute values.
Sometimes you can also find the RMSE (root mean squared error, i.e. the square root of the MSE).
Range: from 0 to infinity.
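A sketch (mine) of these evaluation metrics using scikit-learn, on made-up values:

from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5,  0.0, 2.0, 8.0]

print(r2_score(y_true, y_pred))                    # closer to 1 is better
print(mean_squared_error(y_true, y_pred))          # MSE
print(mean_squared_error(y_true, y_pred) ** 0.5)   # RMSE
print(mean_absolute_error(y_true, y_pred))         # MAE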

Example:

In the example, intercept = β0 and coefficient = β1. The black dots are the observations.
The R² obtained is very good.

Linear regression Recap:


• Linear regression is used to fit a linear model to data where the dependent variable is continuous.
• Given a set of points (Xi, Yi), we wish to find a linear function (or line in 2 dimensions) that "goes through" these points.
• In general, the points are not exactly aligned.
• The objective is to find the line that best fits the points.

Logistic Regression

Logistic regression is used to curve fit data where the dependent variable is binary or dichotomous.

For example: predict the response to treatment, where we might code survivors as 1 and those who don't
survive as 0, or pass/fail, win/lose, healthy/sick, etc.

Assuming we have these green and blue dots, we have continuous


variables along the X axis, while the dependent variable has only two
values: one green and one blue

Trying to apply Linear Regression → Problem
Drawing a line between the means of the two variable levels is problematic in two ways:
a. the line appears to oversimplify the relation
b. it provides predictions that cannot be observable values of Y for extreme values of X (it predicts values that may not even be in the codomain, i.e. on the red axis)
This is analogous to fitting a linear model to the probability of the event. Probabilities can only have values in [0, 1].
So, we need a different approach to ensure our model is appropriate for the data.

A problem with Linear Regression


Taking into consideration the red dots drawn (mean values), we can try to fit a linear regression on the mean values for each level of X.
The mean of a binomial variable coded as (1, 0) is a proportion. We can plot the conditional probabilities of Y for each level of X. It is possible to fit a linear model to these probabilities, but the linear model does not predict the maximum likelihood estimates for each group (the mean values, which are the largest circles in the graph), and it still produces, for extreme values of X, predictions that are not observable values of the dependent variable.
There is also still the problem that the line can go above and below the theoretical maximum and minimum probability.

A better solution is to use another function instead of the straight line



As mentioned earlier, we can model the nonlinear relationship between X and Y by transforming one of the variables.
The common result of such a transformation into a sigmoid function (red line) is the logit transformation. The function has two horizontal asymptotes (at 0 and 1), so the tendency that the straight line had previously does not occur here.
Logit transformations impose a cumulative (normal-like) function on the data and are easy to handle because the function can be simplified to a linear equation.

Before investigating the transformation into a linear equation, we need to focus on one concept: Odds
Assuming you have this data:

Given an event with probability p of being 1, the odds of that event are given by: odds = p / (1 - p)
[p = probability]

The odds of being sick given a normal fever value are: Odds(Sick|Normal) = P(sick) / (1 − P(sick)) = (402/4016) / (1 − 402/4016) = 0.1001 / 0.8999 = 0.111
The odds of not being sick with a normal value are the reciprocal: Odds(not Sick|Normal) = 0.8999/0.1001 = 8.99

For a high fever value we have:
• Odds(Sick|High) = 101/345 = 0.293
• Odds(not Sick|High) = 345/101 = 3.416

So in this case the ratio between the Yes and No counts can be used as the odds, conditioned on the status. The interesting thing is that when we move from Normal to High, the odds of being Sick almost triple: odds ratio = 0.293/0.111 = 2.64, to be read as: the odds of being Sick are 2.64 times higher with high values.
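A quick check (mine) of the odds and odds ratio computed above, using the counts reported in the text:

def odds(p):
    return p / (1 - p)

p_sick_normal = 402 / 4016
p_sick_high = 101 / (101 + 345)

odds_normal = odds(p_sick_normal)     # ~0.111
odds_high = odds(p_sick_high)         # ~0.293 (equivalently 101/345)
print(odds_normal, odds_high, odds_high / odds_normal)   # odds ratio ~2.64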

Logit transform
This is connected with the logit transformation because the logit of a probability is the natural logarithm of the odds, i.e. the natural
logarithm of prob/1-prob: logit( p) = ln(odds) = ln(p/(1-p) )

Logistic regression
In logistic regression we look for a model:
logit(p) = ln( p / (1 − p) ) = β0 + β1·X

Therefore the log-odds (the logit) is considered linearly related to the independent variable X.
In the logistic regression model we model the dependent variable Y as the logit of the probability of obtaining a certain class.

That is, we assume that the log of the odds (i.e. the logit) is linearly related with the independent variable X.
In this way it is possible to solve an ordinary (linear) regression using the Least Square Method but solving a classification task.

Recovering probabilities

Starting from the fact that the logit is the natural logarithm of p/(1−p), in the first step it is equated to the linear regression formula; in the second step both sides are exponentiated; finally p is isolated and, after a little rewriting, we obtain

p = e^(β0+β1·X) / (1 + e^(β0+β1·X)) = 1 / (1 + e^−(β0+β1·X)), i.e. the sigmoid function.

Interpretation of Beta1

The ratio between the two odds (for X + 1 and for X) is equal to e raised to the power of β1.
The exponentiated slope describes the proportional rate at which the predicted odds change with each successive unit of X.

Example:

One additional hour of study (e.g. passing from 0.50 to 1.50 hours) is estimated to increase the log-odds by 1.5046, thus multiplying the odds by e^1.5046 ≈ 4.5 (e^β1).
For example, for a student who studies 2 hours we have an estimated probability of passing the exam of 0.26. Similarly, for a student who studies 4 hours, the estimated probability of passing the exam is 0.87.
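A sketch (mine) reproducing the probabilities above through the sigmoid. The slope 1.5046 is given in the text; the intercept of about −4.08 is my assumption, as it is not stated explicitly but is consistent with the reported probabilities.

import math

b0, b1 = -4.0777, 1.5046   # b0 is an assumed value, not given in the notes

def p_pass(hours):
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * hours)))

print(round(p_pass(2), 2))   # ~0.26
print(round(p_pass(4), 2))   # ~0.87
print(math.exp(b1))          # ~4.5: odds multiplier per extra hour of study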

EXAMPLE IN THE CODE.

Logistic regression= we use the same approach as linear regression, but to solve a classification problem. We do this by
transforming the variable y, expressing it as the logit of the probability of observing a certain class.

We use the odds to model the probability because the odds reveal the multiplicative increase we get when X increases by one unit. We can express the probability with the sigmoid function, bounded between 0 and 1. In this way we can represent the relationship between Y and X not as a plain regression but through a function whose output is a probability between 0 and 1.

Rule-based classifiers
Classification of records using a collection of "if ... then ..." rules.
Rule: (condition) → y, where:
• the condition is a combination of attribute tests
• y is the class label

Examples of classification rules:
- (Blood Type = Warm) ∧ (Lay Eggs = Yes) → Birds
- (Taxable Income < 50K) ∧ (Refund = Yes) → Evade = No

Example:
In the example there are 5 rules which classify the animal species e.g. with blood type and the final class is the
species, which can therefore be predicted based on characteristics

Application of the Rule-Based Classifier


A rule r covers an instance x if the instance's attributes satisfy the rule's condition

Question: the rule R2: (Give Birth = no) ∧ (Live in water = yes) → Fishes covers:
a. salmon, eel → this is the right one
b. salmon, eel, dolphin
c. frog, eel
d. frog, turtle

Rule coverage and Accuracy


The quality of a rule is measured with two quantities:

Coverage of a rule: the fraction of records that satisfy the antecedent of the rule, out of all the records in the reference data set.

Accuracy of a rule: the fraction of records that satisfy both the antecedent and the consequent of the rule, out of the records covered by the rule (i.e. those satisfying the antecedent).
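A sketch (mine) of how coverage and accuracy of a rule could be computed with pandas; the mini dataset, column names and the rule below are hypothetical.

import pandas as pd

def rule_coverage_accuracy(df, antecedent, consequent):
    covered = df[antecedent(df)]                    # records satisfying the condition
    coverage = len(covered) / len(df)
    accuracy = consequent(covered).mean() if len(covered) else 0.0
    return coverage, accuracy

df = pd.DataFrame({"Give Birth": ["no", "no", "yes", "no"],
                   "Live in water": ["yes", "yes", "no", "no"],
                   "Class": ["fishes", "reptiles", "mammals", "birds"]})

antecedent = lambda d: (d["Give Birth"] == "no") & (d["Live in water"] == "yes")
consequent = lambda d: d["Class"] == "fishes"
print(rule_coverage_accuracy(df, antecedent, consequent))   # (0.5, 0.5)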

Table for the questions that will be addressed from now on:

What is the coverage of R2: (Give Birth = no) ∧ (Live in water = yes) → Fishes?
a. 3/20  b. 2/10  c. 1/20  d. 2/20 → this is the right one
What is the accuracy of R5: (Live in water = sometimes) → Amphibians?
a. 4/4  b. 2/4 → this is right (frog and salamander)  c. 2/20  d. 2/2

How does a Rule-Based Classifier work?

Characteristics of Rule Sets: Strategy 1


• Mutually exclusive rules
The classifier contains mutually exclusive rules if the rules are independent of each other. Each record
is covered by at most one rule, so no two rules cover the same record. If a record is covered by
a rule, there are no doubts and the outcome of that rule is used to make the prediction

• Exhaustive rules

The classifier has exhaustive coverage if it represents every possible combination of attribute values.
Each record is covered by at least one rule. There are no records covered by none of the rules you
have in your rule-based classifier.
You can think of the rule-based classifier as a list of rules, and if there is a record that doesn't
satisfy any rule in the list you have, then the rule-based classifier and the rules you have are not
exhaustive

Characteristics of Rule Sets: Strategy 2


• Rules are not mutually exclusive
A record can activate (trigger) more than one rule; so if a record is covered by more than one rule, which rule applies when the rules predict different classes? E.g., in the example above, a penguin is matched both by a rule saying that it lays eggs and has wings → bird, and by a rule saying that it lives in water → amphibian; both rules are true, so is a penguin a bird or an amphibian?
The solution?
→ ordered rule set: order the rules according to some score and use the classification of the highest-scoring rule that fires
→ unordered rule set (using grading/voting schemes): you end up with the final class supported by the most rules

• Rules are not exhaustive


A record may not trigger any rules. The
solution? Use a default class, such as majority/minority class, or use some other type of
classifier if the rule-based classifier is bad

Question: what kind of rules does the decision tree classifier return?
a. mutually exclusive and not exhaustive
b. mutually exclusive and exhaustive → the rules provided by the decision tree are exhaustive because there are no records left unclassified: the whole attribute space is covered by construction. They are also mutually exclusive by construction, because they are given by the structure of the tree: each path from the root to a leaf identifies a set of records, and a record belongs to exactly one leaf, it cannot belong to more than one → therefore mutually exclusive
c. not mutually exclusive and not exhaustive
d. not mutually exclusive and exhaustive

What kind of rules does Apriori return?
a. not mutually exclusive and not exhaustive → not exhaustive because with the rules obtained from Apriori there is no guarantee that every record in the dataset is covered by some extracted rule. For example, in the Titanic dataset a set of rules with very high coverage is extracted, but some records (e.g. first-class child passengers) may still not be covered by any extracted rule. On the other hand, it is easy to see that the rules are not mutually exclusive, because a record can match several rules: e.g. a young female in first class may actually satisfy multiple rules
b. mutually exclusive and exhaustive
c. mutually exclusive and not exhaustive
d. not mutually exclusive and exhaustive

Ordered rule set


When you have a rule set that is mutually exclusive and exhaustive you always know which rule will apply
without problems; problems arise when a record triggers more than one rule and you have to decide which rule
to trigger. You can:
• Sort the rules according to their priority
An ordered set of rules is known as a decision list
• When a test record is presented to the classifier:
a. it is assigned to the class label of the highest-ranked rule it triggers
b. if none of the rules is triggered (i.e. in a non-exhaustive situation), it is assigned to the default class

In this case the turtle will belong to the rule class R4 as its order of precedence is higher than R5.

In this way, rules are put together regardless of the class to which they belong.

Rule Ordering Schemes


We can also use an ordering scheme that groups together rules that have the same outcome:
• Rule-based ordering: individual rules are ranked according to their quality (using for example accuracy or precision to evaluate the quality)
• Class-based ordering: rules that belong to the same class appear together (you can use a majority voting scheme)
In the example for the class-based ordering, it can be seen that when a record triggers rules of both classes (yes and no), one can count how many triggered rules belong to class yes and how many to class no, and use the majority class.

Building Classification
Rules Direct method: designed to extract rules directly from data, ex. RIPPER, CN2, Holte's 1R
Indirect method: extracting rules from other classification models (ex. decision trees, neural networks, etc.)
Ex. C4.5rules

17/03

Direct method: Sequential Covering (CAP 5.1.3 BOOK)

This is the idea of Ripper, one of the direct methods that belongs to a family of methods that have the property of being
sequential covering methods. Let's see the steps:

1. Start with an empty rule
2. Grow a rule using the Learn-One-Rule function
3. Remove the training records covered by the rule
4. Repeat steps 2 and 3 until the stopping criterion is met.

The most obvious stopping condition is the one where the remaining dataset is empty or contains fewer records than a certain threshold.

Example of Sequential Covering

We start from (i) Original data: all these


methods start from the data to which we assign
a positive and a negative class. If we are in a
situation with a binary problem, it will be
easier. If we have a multi-class problem, as
in the Iris dataset, it is necessary to identify one
class as positive, identifying the rest as
negative.
After that we move on to Step 1 (ii) where
the goal is to find a rule for the class
positive: that is, we are looking for a rule for a portion of data of the positive class.

Once a rule has been found which


defines these data R1, one passes to
Step 2 by repeating the steps from the
beginning, ie covering the positive class.
Here is the idea of covering: we
want to cover a subset of data.
However, it is possible to see in the
image (iii) that elements of the negative
class are also identified, which will be
removed from the dataset.

Rule growing: construction of a rule



To create a rule there are two possible strategies:

1. General-to-specific, i.e. adding one condition at a time. Under the general-to-specific strategy, an initial rule r: {} → y is created, where the left-hand side is an empty set and the right-hand side contains the target class. The rule has poor quality because it covers all the examples in the training set. New conjuncts are subsequently added to improve the rule's quality.

2. Specific-to-general, trying to generalize. In this case we start from very specific rules that match few records and remove the conditions that are not shared, i.e. that are different. Ex: the specific rules we have in the slide share Refund and Status; we will remove Income and keep Refund and Status, which will be useful for predicting the Yes class.

Direct method: RIPPER

The RIPPER method: how does it manage the "sequential covering" and the various steps of the rules?
For a 2-class problem, choose one of the classes as the positive class and the other as the negative class:
- Learn rules only for the positive class.
- If no rule fires for a record, it means that by default the record is classified with the negative class.
- If we want the rules for the negative class, they can be obtained by combining the negations of the rules found (NOT rule 1, NOT rule 2, ...): then the record is classified with the negative class. Generally it is a good choice to select the minority class as the positive class, because it is easier to find precise rules for it and fewer records are removed from the dataset, which is computationally cheaper.

For a multi-class problem: sort the classes according to increasing prevalence (fraction of instances that belong to a particular class):
- learn the rule set for the smallest class first, treating the rest as the negative class
- repeat with the next smallest class as the positive class

Ex: 10% class A, 60% class B, 30% class C:
start by labeling A as the positive class and B, C as the negative class; find the rules for class A and remove the data that satisfy them. Then put C as the positive class, since it is second in terms of prevalence in the dataset, and B as the negative class. At each step we are in a 2-class problem and solve it as such.

Question on the multi-class problem: we delete the subset of the dataset we have just covered, right?

Answer: Yes, it's the coverage idea. If you cover a piece of data with a rule, it is removed. Then another rule is generated, and we'll
remove data that matches it and so on, looking at the same class.
Once the process is finished, we will start it again by changing our positive class. We will then create rules etc
etc

• 1. How to build a rule in the RIPPER method:


- Start with an empty rule.
- Add conjuncts as long as they improve FOIL's information gain (FOIL: First Order Inductive Learner, an early rule-based algorithm). The FOIL gain is a number: the higher, the better.
- Stop when the rule no longer covers negative examples, i.e. when it has maximum accuracy.
- Immediately prune the rule using incremental reduced error pruning
- Pruning measure: v = (p − n) / (p + n)
  - p: number of positive examples covered by the rule in the validation set
  - n: number of negative examples covered by the rule in the validation set
- Pruning method: eliminate any trailing sequence of conditions that maximizes v

Ex: Let's imagine we have 3 attributes A, B, C. We start with an empty rule, into which we insert a condition that refers to A, for example "A = blue?", and we calculate the FOIL information gain. Then we try "B = high?" and compute the FOIL gain for it too; we also calculate it for "B = low", and so on, adding conditions to the rule until it no longer covers negative examples. After this, the rule is pruned using a metric.

In this case we have a validation set on which the numbers p and n of covered records are calculated, respectively the positive and negative examples covered. The pruning method is based on the removal of the trailing conditions that maximize v = (p − n)/(p + n) (it is a bit like the way the best split was selected in decision trees: add a condition, then a new condition, etc., until no negative examples are covered). Finally I select the best set of conditions according to the FOIL information gain. However, it is a somewhat expensive process, because many conditions must be tested. In addition, the order in which I add the conditions can change the resulting rule (greedy approach).

Rules evaluation:

Evaluation methods include accuracy, statistical tests, metrics such as the Laplace estimate or the m-estimate, and the FOIL gain.

How is the FOIL information gain calculated? (FOIL: First Order Inductive Learner.)
Assume we have two rules R0 and R1: R0 is the rule at the current step, while R1 is the rule obtained by adding a new condition to R0. If R0 covers p0 positive and n0 negative examples and R1 covers p1 positive and n1 negative examples, the gain is

FOIL Gain(R0, R1) = p1 * [ log2( p1 / (p1 + n1) ) - log2( p0 / (p0 + n0) ) ]

The formula is reminiscent of the information gain we had to calculate for decision trees when entropy was adopted. We keep computing this gain until it no longer increases or no negative instances are covered anymore (at that point it is useless to add new conditions); then we refine the rule using the measure v for pruning, so as to select the rule with the highest gain.
We will then remove the portion of data that satisfies this rule and repeat the procedure.
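As an illustration, here is a minimal Python sketch of the FOIL gain formula above; the counts p0, n0, p1, n1 are assumed to have already been computed on the training data, and the function name foil_gain is ours, not part of any library.

import math

def foil_gain(p0, n0, p1, n1):
    # p0, n0: positives/negatives covered by the rule before adding the condition
    # p1, n1: positives/negatives covered after adding the condition
    if p1 == 0:
        return float("-inf")   # the extended rule covers no positive example
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

# Example: the rule covered 100 positives and 90 negatives; after adding "A=blue"
# it covers 60 positives and 10 negatives -> the gain is positive, so we keep the condition.
print(foil_gain(100, 90, 60, 10))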

Let's go back to the previous example: we have the data (fig. 1). We then look for a rule, which we imagine is represented by the rectangle at the bottom left (fig. 2). Some points satisfy the rule and they are the ones to be removed (fig. 3). We then repeat the procedure (fig. 4). We stop when a stopping condition is met; one of them is the minimum description length, MDL.

Example on MIRO to explain the pruning technique:

Let's imagine that at some point we have a rule R = {A, B, C, D} and we stop because the FOIL information gain no longer grows. What we do is count the numbers n and p, respectively the negative and positive examples covered by the rule, not on the dataset Xtrain but on a validation set Xval.
We then calculate v = (p - n)/(p + n). After that we iteratively remove conditions:
- once A is removed, the rule is R = {B, C, D}, and we calculate v again;
- we remove B and compute v on the rule R = {C, D}, then remove C, compute the new v, and so on. We choose the rule that maximizes v.

The goal of this procedure is to understand whether the rule is general enough: the more conditions a rule has, the more specific it is. So the goal is to test the accuracy (or, in our case, the coverage) of the rule on a dataset that was not used to create it. If the accuracy increases when conditions are removed, it means that we are generalizing, i.e. the rule we created was too specific. This is the idea of pruning: reduce the complexity.
NB: pay attention, because in RIPPER there are two types of pruning, and the one we have just described only concerns individual rules.
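A minimal sketch of this incremental reduced error pruning step, assuming a rule is represented as a list of conditions and that a hypothetical helper covers(rule, x) tells whether a record satisfies the rule (both names are ours, not from any library); following the "trailing sequence" idea, conditions are dropped from the end while v does not decrease.

def v_metric(rule, X_val, y_val, covers, positive_class):
    # p, n: positive/negative validation records covered by the rule
    covered = [y for x, y in zip(X_val, y_val) if covers(rule, x)]
    p = sum(1 for y in covered if y == positive_class)
    n = len(covered) - p
    return (p - n) / (p + n) if (p + n) > 0 else float("-inf")

def reduced_error_pruning(rule, X_val, y_val, covers, positive_class):
    # iteratively drop the last-added condition while v does not decrease
    best_rule = rule
    best_v = v_metric(rule, X_val, y_val, covers, positive_class)
    while len(best_rule) > 1:
        candidate = best_rule[:-1]
        cand_v = v_metric(candidate, X_val, y_val, covers, positive_class)
        if cand_v >= best_v:
            best_rule, best_v = candidate, cand_v
        else:
            break
    return best_rule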

• 2. Build a set of rules:
• Use the sequential covering algorithm
- Find the best rule covering the current set of positive examples
- Eliminate both positive and negative examples covered by the rule
• Each time a rule is added to the set of rules, calculate the new description length
- Stop adding new rules when the new description length is a number of bits d (a parameter) longer than the minimum description length obtained so far.
The idea is to minimize the description length, which is calculated each time a rule is added to the set. Whenever this measure exceeds the previously calculated minimum (by more than d bits), we stop looking for rules for the current class: this is one of the stopping conditions.

Minimum Description Length (MDL)

It is one of the stopping criteria of RIPPER.
It is calculated as the combination of the model complexity and the cost of the model with respect to the dataset we are analyzing:

Cost(Model, Data) = Cost(Data | Model) + alpha * Cost(Model)

The formula represents the cost of the model (in this case the set of rules) plus the cost of the data given the model. The alpha that multiplies the cost of the model is a parameter that weights the contribution of the model cost.
The first part of the formula, i.e. the cost of the data given the model, Cost(Data|Model), is typically calculated as the misclassification error, i.e. it measures the performance of the current model.
In the slide there is an example on decision trees, because there too it is possible to calculate the cost of a growing model. It is not important to remember the exact cost of the model, but it is important to remember how the construction of the rules is stopped.

• 3. Optimize the set of rules:

We optimize the set because the procedure has many greedy steps, which can cause weak generalization or loss of correctness when classifying the data. Then:
• For each rule r in the rule set R, consider 2 alternative rules:
- Replacement rule (r*): develop a new rule from scratch on the data covered by r
- Revised rule (r'): add conditions to extend rule r
• Compare the rule set containing r with the rule sets containing r* and r'
• Choose the rule set that minimizes the MDL principle
• Repeat rule generation and rule optimization for the remaining positive examples

There are therefore two optimizations: one during the construction of the rules, which looks only at the specific rule, and another at the end, where the entire rule set is optimized.

Comprehension questions:
- What type of rules does RIPPER produce? Mutually exclusive and exhaustive: sequential covering guarantees exhaustiveness, i.e. that in the end all records are covered, because for the last class there is an empty-condition default rule that says "everything not covered by the other rules is classified with this rule". The rules are mutually exclusive by construction, since a record is assigned to, and removed by, a specific rule.

- RIPPER builds rules... general-to-specific, but then refines them aiming at generality.

- What are the RIPPER steps?
- Start from an empty rule
- Build a rule using the FOIL gain
- Prune the rule using reduced error pruning
- Remove the records covered by the rule
- Repeat the previous steps until the MDL no longer decreases
- Optimize the rule set by considering alternative rules and selecting those that minimize the MDL
Note: the step about pruning through reduced error pruning refers to the measure v described above.
(Min 46:00 of the recording: notebook example.)

• Indirect method: C4.5rules

Extract rules from an unpruned decision tree.

• For each rule r: A → y,
- Consider an alternative rule r': A' → y, where A' is obtained by removing one of the conjunctions in A
- Compare the pessimistic error rate of r with that of all the alternative rules r'
- Prune r if one of the alternative rules has a lower pessimistic error rate
- Remove duplicate rules
- Repeat until the generalization error can no longer be improved

Pessimistic Error Estimate:

For a rule set T with k rules, the estimate adds to the observed error a penalty term that grows with k (the same idea as the pessimistic estimate used for decision trees).

C4.5rules, instead of ordering the rules, orders subsets of rules (class ordering):
• Each subset is a set of rules with the same consequent (class)
• It computes the description length of each subset
- Description length = L(error) + g * L(pattern), where g is a parameter that accounts for the presence of redundant attributes in a rule set (default = 0.5)


Benefits of Rule-Based Classifiers:

• They have very similar characteristics to decision trees


-As expressive as decision trees
- Easy to interpret
-Performance comparable to decision trees
-Can handle redundant attributes
• More suitable for handling unbalanced classes due to their sequential covering procedure •
More difficult to handle missing values in the test set

SUPPORT VECTOR MACHINE (SVM)

Let's start talking about slightly more complex classifiers. An SVM basically represents the decision boundary (a linear hyperplane) through a subset of the training examples, called support vectors, which give the method its name.

Let's illustrate the basic idea behind SVM by introducing the concept of maximal margin hyperplane: the hyperplane is chosen so that it separates the two classes as well as possible. All SVM computations revolve around identifying the support vectors and simultaneously selecting the correct hyperplane.



In a situation like the one represented (fig. 1) it is possible to have an infinite number of possible separating hyperplanes (fig. 2). However, the best hyperplane we can find is B1 (fig. 3), because it is the one that maximizes the margin (fig. 4). The margin is the distance between the two closest instances of the two classes: the distance between b11 and b12 is greater than the distance between b21 and b22. So we have to find the vector w that maximizes the margin, which is equal to 2/||w||.

Linear SVM: separable case

A linear SVM is a classifier that looks for the hyperplane with the largest margin, which is why it is called a "maximal margin classifier". In this example the decision boundary is a straight line because we are in two dimensions. Learning the model is equivalent to determining the parameters w and b, which are obtained by maximizing the margin on the training data. The linear model is represented by the function

f(x) = +1 if w·x + b >= 1, and -1 if w·x + b <= -1

which shows how the classifier uses w and b; x is the instance that we want to classify.
The arrows above the letters in the slides indicate vectors, meaning that each element corresponds to more than one feature. The decision boundary is found by setting w·x + b = 0, while by setting w·x + b = 1 or -1 we find the two margin boundaries.
The goal of SVM is to identify w and b such that the constraints are satisfied.
Example of dot product calculation:

Given w = [0.3, 0.2], x = [1, 2], b = -2, the resulting score is w·x + b = 0.3*1 + 0.2*2 + (-2) = -1.3.
So, according to the formula, the instance is classified as -1. If the result had been 0 (or positive), we would have classified it as +1; an exact 0 rarely happens in practice.
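The same computation as a minimal Python sketch (values taken from the example above):

import numpy as np

w = np.array([0.3, 0.2])   # learned weight vector
b = -2                     # learned bias
x = np.array([1, 2])       # instance to classify

score = np.dot(w, x) + b   # 0.3*1 + 0.2*2 - 2 = -1.3
label = 1 if score >= 0 else -1
print(score, label)        # -1.3 -> classified as -1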

The distance of a point x from the decision boundary w·x + b = 0 is |w·x + b| / ||w||, where the denominator is the norm of w.

The distance between each margin hyperplane and the decision boundary is 1/||w||. Consequently, the distance from one side of the margin to the other is 2/||w||. To find the best hyperplane, the SVM aims to minimize ||w||, which is equivalent to maximizing the margin 2/||w||.
Learning Linear SVM

• Learning the SVM model is equivalent to determining web b. •How to find web? maximizing
the margin, which is equivalent to with num2 L(w), following margin
minimize reverse of the

• Facts save the constraints y


• This is a constrained optimization, the problem that can be solved by Lagrange multiplier.
using The Method of the

• Introduce the Lagrange multiplier

: we see that y(w*x+b) ÿ1 (Lagrange multiplier method) is equivalent to the original


formulation; In the next slide we will try to make things clearer, they won't be asked but it is
to explain better.
NB: =ÿ, a lamba is associated to all points; if =0, the point is not a support vector. If ÿ0 it is support vector = it is a sort of score.

NB: the following derivation is not to be memorized, it is only for explanation.

Example of Linear Support Vector Machine:

In this example we have two features, x1 and x2. The first two instances, which have a very high λ, are the support vectors and define the margins.

Example 2, geometric interpretation: in this case a λ is associated with each point. We see that we have 3 support vectors:
- λ1 = 0.8
- λ6 = 1.4
- λ8 = 0.6

In the linear case the goal is to find the hyperplane that minimizes ||w||. The problem is solved with the Lagrangian method, so as to find w and b and build the classification function.

Non-separable case:

When the problem is not linearly separable, as in this figure, we have to account for errors in our solution. We therefore have to find the hyperplane that best distinguishes the two classes while keeping the error small.
• To do this, we need to introduce slack variables: the inequality constraints need to be "slacked" to accommodate non-linearly separable data. This is done by introducing the slack variables ξ in the constraints of the optimization problem.
• ξ provides an estimate of the error of the decision boundary on misclassified training examples: if we want to measure how much a point P has been misclassified, we measure the distance between P itself and the closest point of the margin.
So here we have to minimize one more thing than in the linear case, a penalty term C·Σ ξ^k, a sort of adjustment that depends on the slack variables.

Remember that there is a slack variable for each record: it is 0 if the record is classified correctly, while a positive value (which depends on the distance from the decision boundary) means that the record is not well classified. This second scenario explains the two new user-specified parameters representing the misclassification penalty:

- C, which measures the intensity of the penalty
- k, the exponent of the slack variable ξ

The constraints also change: we have to add the slack variable to them.

The Lagrange multipliers are now bounded in the range 0 <= λ <= C.

The slack variables:
- estimate the error of the decision boundary on poorly classified examples
- help handle misclassified instances
Lesson 03/22/2021

Non-linear SVM
What if we want to solve a non separable problem, using a linear SVM?
The answer is that we can't.
In fact if the decision boundaries were not linear we would have a situation like this (initial data space):

In this case we cannot draw a straight line that separates the data. What can we do to use the SVM then?
We have to try to reframe the problem in a way that allows us to solve it.

In this image it is possible to notice that: if the problem were two-dimensional we will not be able to solve it,
while, considering it as a three-dimensional problem we could separate the points.
A problem that was not separable in two dimensions becomes linearly separable in three
dimensions.

How do we find the correct transformation φ?

The trick is to transform the data from the original space x into a new space φ(x) so that a linear decision boundary can be used. (Another way to look at the previous example is to transform the initial data space like this:)

A possible transformation of the initial problem corresponds to:

Depending on the data, we then find the decision boundary with the linear classifier.

Knowing the transformation φ, the decision boundary of the SVM moves from w·x to w·φ(x), where φ(x) transforms x from n dimensions to m dimensions, with usually m >> n, because the idea is to increase the number of dimensions in order to find a linear decision boundary.

Assuming we have this function φ, the optimization problem remains the same:

**

• We replace x by φ(x) and everything else stays exactly the same:

Problems:

• The function φ(x) is difficult to use; in fact it is as if we had a new problem.
• Which type of mapping function φ(x) should be used?
• What do we do in a very high-dimensional space?
• Most calculations involve the dot product φ(xi)·φ(xj)
• We may run into another problem called the "curse of dimensionality"

The Kernel Trick

The formulation above (**) is the mathematical one, but it is not exactly what we solve with the Lagrangian method. The problem that is actually solved is the dual one, where the data only appear as a dot product (xi·xj); with the transformation this part would be replaced by φ(xi)·φ(xj). What interests us, therefore, is only the result of this product.
Let's imagine we can compute this quantity with a function k(xi, xj) that replaces φ(xi)·φ(xj).

This procedure is called the kernel trick and consists in defining the kernel function k(xi, xj), which takes two vectors as input and computes the product of the two transformed vectors.

Indeed, k is a kernel function expressed only in terms of the coordinates of the original space, i.e. it does not use the extra dimensions introduced by φ, but considers only the dimensions of the original space. This is why it is called a "trick": we replace the explicit definition of the function φ with the function k.

The three most used kernel functions are:

1. Polynomial kernel with degree d: k(x, y) = (x·y + 1)^d
- this is the most used
- dot product of the two vectors, plus 1, all raised to the power d
2. Radial basis function (RBF) kernel with width σ: k(x, y) = exp(-||x - y||^2 / (2σ^2))
- related to radial basis function neural networks
- the corresponding feature space is infinite-dimensional
3. Sigmoid kernel with parameters κ and θ: k(x, y) = tanh(κ x·y - θ)
- does not satisfy the Mercer condition for all κ and θ

To solve a classification problem with a non-linear SVM, the hardest part is selecting the correct kernel function. Also, depending on the chosen function there are new and different hyperparameters that need to be optimized.
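A minimal sketch of the three kernels above in Python (the sign in front of θ in the sigmoid kernel varies across textbooks, so take that as an assumption); in practice one usually relies on a library implementation such as scikit-learn's SVC.

import numpy as np

def polynomial_kernel(x, y, d=2):
    # (x . y + 1)^d
    return (np.dot(x, y) + 1) ** d

def rbf_kernel(x, y, sigma=1.0):
    # exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=1.0, theta=0.0):
    # tanh(kappa * x . y - theta); does not satisfy Mercer's condition for every choice
    return np.tanh(kappa * np.dot(x, y) - theta)

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial_kernel(x, y), rbf_kernel(x, y), sigmoid_kernel(x, y))

# With scikit-learn, for example:
# from sklearn.svm import SVC
# clf = SVC(kernel='rbf', C=1.0, gamma='scale')   # C is the misclassification penalty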

Justification for the fact that we can replace the product xi^T xj (the same as writing xi·xj) with k(xi, xj):

If each data point were mapped into a high-dimensional space with a transformation φ: x → φ(x), the inner product would become φ(xi)^T φ(xj).

A kernel function is equivalent to an inner product in some feature space.

Example: suppose we are in a two-dimensional space, so each vector is x = [x1 x2]. We adopt the polynomial kernel of degree 2, k(xi, xj) = (1 + xi^T xj)^2, and we have to show that k(xi, xj) = φ(xi)^T φ(xj) for a suitable φ.

Expanding (1 + xi^T xj)^2, the resulting expression can be rearranged as the product of two vectors, one depending only on xi and one only on xj, which gives φ explicitly.

• In this way we prove the property: the kernel function implicitly maps to a high-dimensional space, without the need to compute each φ(x) explicitly.

Advantages of using the kernel trick:

• The objective function is expressed through k(xi, xj):
• We do not need to know the mapping function φ
• We do not have to calculate the product φ(x)·φ(y), avoiding the "curse of dimensionality"
• We need to remember that not all functions can be kernels:
- it is necessary to be sure that there exists a corresponding φ in some high-dimensional space
- the kernel function must respect Mercer's theorem, which ensures that the function can always be expressed as a dot product in some multidimensional space.

More precisely, Mercer's theorem requires the kernel function to be positive semi-definite.
This implies that the n x n kernel matrix, whose (i,j)-th entry is k(xi, xj), is always positive semi-definite.
This means that the optimization problem can always be solved in polynomial time!

• So the kernel trick allows us to solve non-linearly separable problems using a kernel function. Not all functions can be used, but only those that respect Mercer's theorem.

Problems using the kernel trick:

• The problem remains the same as seen previously; the only thing that changes is that the product xi·xj is replaced with the kernel function.

At the optimum, the partial derivatives of L with respect to w and b must be 0. Taking the derivatives, setting them to 0, substituting back and simplifying, we obtain:

Example:

- we have three instances of class 1: 1, 2, 6
- we have two instances of class 2: 4, 5
- the goal is to maximize the dual quantity, under the constraints on the alphas
- the kernel function is the polynomial one
- using a quadratic problem solver, we obtain the alphas
- the result we get is a polynomial function, a parabola
- we can obtain b by setting the function to 0
- when we have a value to classify, it is substituted in place of z^2 and z; if the value is > 1 it is classified as 1, if it is less than 1 it is classified as -1

The obtained function is represented by the blue line and perfectly separates the instances we have.

Features of the SVM



• Since the learning problem is formulated as a (convex) optimization problem, there are efficient algorithms to find the global minimum of the objective function (many other methods use greedy approaches and find a local solution)
• Overfitting is addressed by maximizing the decision boundary margin, but the user needs to define the type of kernel function and the cost parameter
• It is difficult to manage missing values, because the formulation is mathematical, so it is better to remove them or replace them with other values
• Robust to noise, thanks to the slack variables
• High computational complexity to build the model if we have a large dataset

Questions on Wooclap:

1. The kernel trick:
- avoids mapping a vector into a high-dimensional space
- does not mathematically influence the approach used to find the support vectors
2. How does SVM handle errors? With slack variables. By minimizing w, we find the decision boundary and deal with overfitting. With the kernel trick we make the SVM usable for non-linearly separable problems.
3. Which are the macro settings analysed?
- Linear SVM, separable case
- Linear SVM, non-separable case
- Non-linear SVM

Neural networks (Linear Perceptrons)

At the basis of neural networks lies the neuron metaphor.
The basic idea is to simulate the behavior of a neuron, in which synapses (inputs) enter the body of the neuron and produce an output; in our case the output is produced through mathematical formulations. The model output derives from a dot product of the input values, returning an integer value (0, 1, 2, ...) if we have discrete classes, or a real value if we solve a regression problem (continuous values).

Artificial Neural Networks (ANNs)


The "neuron" can be thought of as a black box with various inputs and an output, and we treat it as
such because the mathematical formulations according to which the output is generated are very complex.

The result Y is calculated by the sign function which in this case is the activation function, where the
weights (0.3) of the inputs X1,X2,X3 and the bias or threshold (t=0.4) are entered .

Question: What is the bias?

The bias is like the intercept in linear regression and the b in SVM. It is a constant quantity that is added to the dot product and shifts the activation threshold. The bias input is fixed at 1; what we learn is its coefficient, i.e. its importance. If the coefficient is 0, the bias does not contribute to the result.

The general definition of the linear perceptron can be represented in the following two ways:

Artificial neurons receive real input values and can output unipolar values {0,1} or bipolar values {-1,+1} (obviously only with a binary class).
Weights are denoted w, sigma or theta (θij) and indicate the strength of the connection from unit j to unit i; in the linear perceptron they indicate the strength between input and output. The greater the weight, the greater the contribution of that input to the output.
The weights are adjusted through an optimization algorithm that aims to minimize the cost function: for example, a child learns to do something through rewards and punishments, so the model tries to produce the output that minimizes the "punishments".

The bias b is a constant that can be incorporated into the dot product by adding a fixed input x0 = 1 with its own weight w0, such that:

The result is passed as a parameter to the activation function: in the simplest case this is the identity function, otherwise we have the sign function or the logistic unit.

A simple linear neuron

Graphical representation of the perceptron classifier (at the exam he wants us to be able to draw this graph):

What contributes to the output? The bias (which has a fixed input = 1) times its coefficient, plus the n input dimensions times their coefficients (theta).
In the case of the linear activation function the output is y = hθ(x) = σ(θ^T x), given by the dot product of the coefficients with the x; here σ is the identity, σ(a) = a.
σ (sigma) represents the activation function.

Linear Threshold Unit (aka Perceptron)

The structure is the same; what changes is the σ, which is 1 if a (in our case a = θ^T x) >= 0 and -1 if a < 0.

The Logistic Neuron


It uses the sigmoid function which has an advantage over the two previous models: we can have an
output expressed also in terms of probabilities; since the range is (0,1) we can say that

“it will be class 1 with a probability of 0.8”.

Perceptron
This is a single layer network, which contains only inputs and outputs.
Its activation function is f=sign(w•x).
The application of the method is very simple: you just need to do the dot product between the inputs and the
weights. If it is >=0 the result will be 1, otherwise -1.

Exercises (which may be asked in the exam):

- we have this pre-trained linear perceptron (right), where the activation function is drawn (the sign function)
- Question: provide the classification for these test instances
- Remember that x0 is always equal to 1
- Result: y = sign(w0*x0 + w1*x1 + w2*x2 + w3*x3) = sign(...)
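A minimal Python sketch of this prediction step; the trained weights shown in the slide are not reported here, so the ones below are hypothetical, only to show the computation.

import numpy as np

def perceptron_predict(weights, x):
    # weights = [w0, w1, ..., wm]; x = [x1, ..., xm]; x0 is always 1 (bias input)
    x_ext = np.insert(np.asarray(x, dtype=float), 0, 1.0)
    return 1 if np.dot(weights, x_ext) >= 0 else -1

w = np.array([-1.0, 0.5, 0.5, 0.2])       # hypothetical trained weights (w0 is the bias)
print(perceptron_predict(w, [1, 0, 1]))   # sign(-1 + 0.5 + 0 + 0.2) = sign(-0.3) = -1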

Learning Iterative Procedure

During the training phase the weights are adjusted until the perceptron outputs become consistent with the true outputs of the training examples.

- We start with a random initialization of the weights (w0, w1, ..., wm)
- we repeat the procedure below for each training instance in the data
- we calculate the classification given the current weights w(k) (k indicates the k-th iteration)
- we update the weights: the weight at iteration k+1, w(k+1), is equal to the weight at iteration k plus [the learning rate * (the real class yi - the class predicted by the linear perceptron in the current iteration) * the current value of x]
- all this is repeated until a certain stopping condition is reached. Usually a fixed number of iterations is used; another option is to stop when there are no more errors.

Perceptron Learning Rule

The weight update formula works as follows: we update the weights in proportion to the error e = y - f(x,w):

- If y = f(x,w), e = 0: no update
- If y > f(x,w), e = 2: the weights must be increased, so that f(x,w) increases (y = 1 and f(x,w) = -1)
- If y < f(x,w), e = -2: the weights must be decreased, so that f(x,w) decreases (y = -1 and f(x,w) = 1)
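A minimal sketch of the whole learning loop (random or zero initialization, prediction, update with the rule above), on a hypothetical AND-like toy dataset; the function name train_perceptron is ours.

import numpy as np

def train_perceptron(X, y, lam=0.3, epochs=10):
    # X: n x m training instances, y: labels in {-1, +1}, lam: learning rate
    X_ext = np.hstack([np.ones((len(X), 1)), X])      # prepend x0 = 1 for the bias
    w = np.zeros(X_ext.shape[1])                      # (or a random initialization)
    for _ in range(epochs):
        for xi, yi in zip(X_ext, y):
            f = 1 if np.dot(w, xi) >= 0 else -1       # current prediction
            w = w + lam * (yi - f) * xi               # no change when the error is 0
    return w

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])        # hypothetical toy data (AND)
y = np.array([-1, -1, -1, 1])
print(train_perceptron(X, y))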

The Learning Rate

The learning rate λ is a parameter between 0 and 1, used to control the adjustment made in each iteration.

If it is close to 0, the new weight is affected more by the value of the old weight.
If it is close to 1, the new weight is affected more by the current adjustment.

The learning rate can be adaptive: a large λ (close to 1) is preferable at the beginning of training and gradually decreases over the iterations; in the long run we prefer a small λ so that we can keep track of what the model is doing.

Exercise: Train Linear Perceptron

In this exercise the goal is to train a linear perceptron considering three training instances (a, b, c) in two dimensions (x1, x2), with λ = 0.3 and activation function f = sign.

1) We start with random coefficients w0 = -1 (bias), w1 = 0, w2 = 0, obtaining a dot product XW = -1. So the activation is equal to -1 (value returned by the classifier), and since Y = -1 we get error = 0 (correct classification).

The quantity to add to each weight (we call it Δ) would be Δ = λ * error * x, but in this case the classification is correct (error = 0), so no update is necessary: the quantity inside the square brackets is the error, and Δ is 0 for each factor/dimension.

2) Since the weights have not been updated, the weights and bias are the same as in the first iteration. Both the dot product and the activation equal -1, but Y = 1 (instance b), so error = 2.

Δ = λ * error * x0
Remember that x0 is always equal to 1.

Δ0 = 0.3 * 2 * 1 = 0.6
Δ1 = 0.3 * 2 * 0 (where 0 is the value of x1 of instance b) = 0
Δ2 = 0.3 * 2 * 0 = 0

3) In this iteration only w0 varies, going from -1 to -0.4 (we add Δ0), and we use instance c to calculate each factor.
4) I start using instance a again: every time I run out of instances I start again from the beginning.

...

12) The last iterations (10, 11, 12) all have error = 0 and always the same weights, therefore all classifications are correct. The weights to use are those calculated at the last iteration. In this case we may be overfitting, so it would have been better to stop earlier.

Nonlinearly Separable Data

A weakness of the linear perceptron is that it cannot work with data that are not linearly separable. Since the output is computed from a linear combination of the variables, the decision boundary is also linear.

In this example we see that the data cannot be separated properly using just one line, so a more complex model is needed.

Questions:

- Which function is not used by a linear perceptron? LASSO (it is a regularization term for the linear regression problem, NOT an activation function)
- What happens if the learning rate is close to 0? The new weight is mainly influenced by the value of the old weight.

Lesson Wednesday 24.3.21

Neural Networks and Deep Neural Networks

Why now?

1. Big Data: the era began around 2010
2. GPUs: the diffusion and low cost of Graphical Processing Units for videogames and machine learning (and DM)
3. NN theory, whose development had stalled in the late 90s due to lack of data and computational resources

It is therefore a concept that already existed but has become successful lately thanks to the elements listed above.

A possible vision

For example, the KNN method sits between machine learning and deep learning without representation learning, while classical AI is AI without machine learning. In all of this, DM can be positioned as follows: it is about data analytics, which is not really learning but exploring, as is pattern mining; BUT since data underlies all of these, we can say that DM is an intersection.

Deep learning

We move from a Raw data representation to a Higher-level representation by expressing the concepts mathematically

In Representation learning the idea is to obtain different levels of abstraction at different times in
the structure of the model that we define and the goal is to define a deep neural network model.

Last time we saw this problem: with the linear perceptron we were not able to solve the XOR problem. A possible solution is a decision boundary like the following, but this cannot be obtained with a linear perceptron, which only finds a linear hyperplane.

Hence the idea of

Multilayer Neural Networks

• Hidden Layers: intermediary layers between input and output layers.

• More general activation functions (sigmoid, linear, hyperbolic tangent, etc.). We can adopt
different ones for each Layer

• Multi-layer neural network can solve any type of classification task involving nonlinear decision
surfaces.

• The perceptron is single layer. Recall how a perceptron with only 2 inputs, one node and one output is made: now we can think of each hidden node (n3, n4) as a SINGLE perceptron that tries to construct one hyperplane, while the output node combines their results to return the final decision boundary. See the drawing: green for the first hyperplane and blue for the second. The last part of the network is seen as a further perceptron, so we have 3 linear perceptrons where the first 2 identify the red and blue hyperplanes (XOR data), while the last one combines the results to obtain the final decision boundary.

So the general structure of an ANN (which is called ANN or NN when it has at least one hidden layer).

Another possible name is Multilayer NN or Multilayer Perceptron. What we have in each hidden layer and in the output layer is the same structure as last time: inputs coming from the previous layer's nodes, combined through a dot product with the weights, an activation function, and an output.

What is a Deep Neural Network?

A network with multiple hidden layers.

The number of units in each layer does NOT define a deep NN; the number of hidden layers does.
When we have a single hidden layer we talk about a NN, while with more hidden layers we talk about a Deep NN (the depth is given by the number of hidden layers). The number of units inside each layer does NOT determine the depth of the network. Researchers are also studying the effect of wide NNs, where "wide" refers to the number of neurons in each layer.

Number of output layers and hidden layers is not related: we could also have something like this. Obviously
the more hidden layers there are, the easier it is to have more complex representations than those with
a binary 01 output.

Artificial Neural Networks (ANNs)

• Various types of neural network topology

• single-layered network (perceptron) versus multi-layered network

• Feed-forward versus recurrent network

There is no single best way to find the parameters - it depends on many different factors.

Another element that differentiates linear perceptrons from NNs is the variety of activation functions available.

Deep neural networks

This NN is called Dense, as we have no missing connections and we have all the connections
between a node and the subsequent node. It is the simplest and one of the most used.

However, deep learning is much more than having many layers…

First of all, how do we calculate the output? The difficult part is training a structure like this: in the perceptron we use the errors made on each training record to adjust the weights.
Now, how do we repeat the same thing going back from the output towards the input? Consider also that we no longer have to update only the weights connecting input and output, but all the weights connecting each pair of layers: we have many more coefficients. The direction drawn by the professor shows how the class of a DNN is calculated (the forward pass).

Exercise

We have 2 inputs, 1 hidden layer and 1 output layer. What we read are the trained weights; in each neuron we use the activation function sign applied to the weighted sum minus 0.2. We have to label each test instance. The strategy is to compute the weighted sum for each neuron, apply the activation function, and use the output of one neuron as the input of the following one (in the next layer, following the arrows).

The output of the first node is -1, and we continue for each hidden node (H1, H2, H3).

When we get to the output we proceed as last time.

Here, noticing that equal inputs give equal outputs, we could have saved time.

NOTE: the activation function is NOT applied to the inputs (which are recognized because they only have outgoing arrows).
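A minimal Python sketch of this forward pass, under the assumption that each neuron applies sign to its weighted sum with a bias of -0.2; the weights below are hypothetical, since the ones in the slide are not reported here.

import numpy as np

def neuron(inputs, weights, bias=-0.2):
    # weighted sum plus the (assumed) bias of -0.2, then the sign activation
    s = np.dot(inputs, weights) + bias
    return 1 if s >= 0 else -1

def forward_pass(x, hidden_w, output_w):
    # hidden_w: one weight vector per hidden neuron; output_w: weights of the output neuron
    hidden_out = np.array([neuron(x, w) for w in hidden_w])
    return neuron(hidden_out, output_w)

hidden_w = [np.array([0.5, -0.4]), np.array([-0.3, 0.7]), np.array([0.2, 0.2])]  # hypothetical
output_w = np.array([0.6, -0.1, 0.4])                                            # hypothetical
print(forward_pass(np.array([1, 0]), hidden_w, output_w))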

Representation learning

These additional layers are a way to extract different levels of representation for the data: the subsequent levels have the goal of extracting and representing increasingly complex concepts, and the ultimate goal is to discriminate between the classes. The concept learned at each level is not general, but tailored to the purpose for which the network was created: if I have a dataset of images and I have to distinguish between cats and foxes, both have triangle-like ears, so this concept is not useful for discriminating between the two animals; it is more useful to focus on colour, size, mouth, tail... The discriminant aspects can be captured at different levels of abstraction. Why do we increase the number of layers in deep neural networks? Because in this way we can also address non-linearly separable decision problems, and because the idea is to represent in the various hidden layers complex concepts that can help the final classification.

Activation Functions

A newer activation function, one of the most used (especially to replace the sigmoid), is the hyperbolic tangent: it has better properties than the sigmoid while keeping roughly the same shape, but the codomain changes: the sigmoid outputs values in (0,1) while the hyperbolic tangent outputs values in (-1,1).

Other activation function: ReLU (Rectified Linear Unit)

It takes the value 0 for inputs < 0 and the input itself for values > 0, i.e. ReLU(z) = max(0, z), where z is the parameter passed.

Comparison between the 3 most adopted activation functions:

Blue: ReLU

Green: sigmoid

Red: soft version of the ReLU (softplus). It avoids producing outputs exactly equal to 0 and helps against the vanishing gradient.
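For reference, a minimal sketch of these activation functions in Python (nothing here is specific to any deep learning library):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))        # range (0, 1)

def tanh(z):
    return np.tanh(z)                       # range (-1, 1)

def relu(z):
    return np.maximum(0.0, z)               # 0 for z < 0, z otherwise

def softplus(z):
    return np.log(1.0 + np.exp(z))          # smooth ("soft") version of the ReLU

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z), softplus(z))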

Learning Multi-layer Neural Network (or deep neural network)

• Can we apply perceptron learning to each node, including hidden nodes?

• The perceptron computes the error e = y - f(w,x) and updates the weights accordingly: we look at the difference between the true outcome and that of the network.

• Problem: how do we determine the true value of y for the hidden nodes?

• We cannot really determine it, but we can approximate the error in the hidden nodes by the error in the output nodes → backpropagate the error from the output nodes to the input nodes.

• Problems:

1. It is not clear how adjustments in the hidden nodes (the weights connecting the various layers) affect the overall error
2. There is no guarantee of convergence to the optimal solution. We cannot do anything about this.

To solve the first problem we can apply a strategy based on gradient descent for training the multilayer NN, which works as follows:

1. Error function to minimize → we define the loss function we want to minimize, E(w) = Σ [y_i - f(w, x_i)]^2, where y are the real labels (ground truth) and f, applied to the instances given the current weights, is the output of the NN. We want to find a global minimum of this loss function.

2. Weight update: w_j(k+1) = w_j(k) - λ * ∂E/∂w_j, where the first term is the weight at iteration k+1 and the second the weight at the k-th iteration. The quantity after λ (which was easy to define for the linear perceptron) is now expressed as the first derivative of the loss function with respect to the weight we have to update. We can solve the problem with a stochastic gradient descent algorithm only if the function f is differentiable: that is why we can use any activation function in the NN as long as it is differentiable, otherwise we cannot backpropagate the error.

3. For the sigmoid function: last time we did NOT use a sigmoid as activation function, but the loss is practically the same (the same error function to minimize). The derivative of the loss with respect to the weights becomes the update quantity for the output if we are in the last layer (see the arrow from the weight update to the sigmoid function).

This is the final part of the computation for backpropagating the error. It must also be repeated for the other internal hidden layers.

The techniques used to find these weight updates are stochastic gradient descent methods. To give a visual idea:

The techniques used to find these weight updates are the Stochastic Gradient Descent methods. To give a
visual idea:

Gradient descent for Multilayer NNs



When we are at a certain point on the loss surface (in blue), once we have the weight at iteration k, we want to find the direction (orange) in which to push the weights downhill so that the loss is minimized, going from weight k to weight k+1.

We are interested in the idea of backpropagating the error: we do not have the true errors in the inner layers, so with the stochastic gradient descent method we backpropagate the error. We update each weight through the derivative of the loss with respect to that weight. How do we calculate the derivative?

- for output neurons it is the same as last time
- for hidden neurons: the delta quantity is given by a fixed part, the derivative of the activation function (output times 1 minus output), times the sum of the deltas of the neurons in the subsequent layer weighted by the connecting weights. To compute the delta for the neurons at layer j I already need to know the deltas of those at layer j+1. At some point, of course, we reach the output layer, where I can calculate the delta quantity directly because I have the target value. It is called backpropagation because I calculate the deltas starting from the output layer and then move back through the various hidden neurons.

So, in training a multilayer NN, given an input x(i) we have these different computations:

to update the weights we use the backward pass, which updates the weights backwards, i.e. first the delta of the output (y) is calculated, then the deltas of layer j-1 (z1), then of layer j-2 (x1), and the same for all hidden layers. The forward pass does the computation needed to estimate the errors: with it we calculate the error at the output layer, and then we can backpropagate it and do the computation for the earlier layers.
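A minimal sketch of one forward plus backward step on a tiny 2-2-1 network with sigmoid units and squared error (biases omitted for brevity, weights hypothetical); it only illustrates the delta rule described above, not any library API.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1 = np.array([[0.1, 0.4], [0.8, 0.6]])   # hidden weights, one row per hidden unit
W2 = np.array([0.3, 0.9])                 # output weights
x, t, lam = np.array([1.0, 0.5]), 1.0, 0.5

# Forward pass
h = sigmoid(W1 @ x)          # hidden outputs
o = sigmoid(W2 @ h)          # network output

# Backward pass: deltas go from the output layer back to the hidden layer
delta_o = o * (1 - o) * (t - o)             # output neuron: the target t is known
delta_h = h * (1 - h) * (W2 * delta_o)      # hidden neurons: use the delta of the next layer

# Weight updates (one gradient descent step)
W2 = W2 + lam * delta_o * h
W1 = W1 + lam * np.outer(delta_h, x)
print(o, delta_o, delta_h)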

Error Backpropagation

Taking the first derivative of the activation function, we are left with only basic operations, without logarithms or exponentials.

Vanishing Gradient Problem

It can happen that the delta quantity (i.e. the gradient ∂E/∂w(j)) that we backpropagate, depending on the data and on the activation function, is very close to 0. Consequently this quantity vanishes and, at each iteration, the value at iteration (k+1) is the same as at iteration (k): the weights remain the same and the NN does not learn anything. This phenomenon typically occurs in a specific type of NN, the recurrent NN.

Question: the weights are initially random, as they were in perceptrons. It is a kind of parameter: we can have better initial sets than random ones. He will show us later.

What is a NN?

The loss function...

- must be differentiable (because in the loss function we have the activation function applied to the last layer)
- a possibility is the quadratic function

On the Key Importance of Error Functions

• The error/loss/cost function reduces all the various good and bad aspects of a possibly
complex system down to a single number, a scalar value, which allows candidate solutions to
be compared. the choice of the loss function is crucial, since it reduces everything to a single
number.

• It is important, therefore, that the function faithfully represent our design goals.

• If we choose a poor error function and obtain unsatisfactory results, the fault is ours for
badly specifying the goal of the search.

Objective Functions for NN

• Regression: a problem where we predict a real-valued quantity.

• AF: output layer: one node with a linear activation unit (the result returned by the NN)

• E: loss function: quadratic loss (Mean Squared Error, MSE)

• Classification: classify an example as belonging to one of K classes

• AF: output layers:

• one node with a sigmoid activation unit (if the number of classes is K = 2)

• K output nodes in a softmax layer, each output node giving the probability of the corresponding class (K > 2)

• E: loss function: cross-entropy (i.e. negative log likelihood)

Two examples of loss functions: in the forward pass we calculate the error with the quadratic loss and in the backward pass we use its first derivative; or we use the cross-entropy. The quadratic loss is normally used for K = 2 and the cross-entropy for K > 2.

Design Issues in ANN (must be fixed)

• Number of nodes in the input layer

- One input node per binary/continuous attribute

- k or log2(k) nodes for each categorical attribute with k values; this can reduce the expressiveness of the network, so usually a particular encoding such as One-Hot Encoding (OHE), with k units for k values, is used

• Number of nodes in the output layer

- One output node for a binary class problem

- k or log2(k) nodes for a k-class problem

• Number of nodes in the hidden layers

• Number of hidden layers

• Initial weights and biases

Characteristics of ANN

• Multilayer ANNs are universal approximators, but they can suffer from overfitting if the network is too large

• Gradient descent may converge to a local minimum → underfitting

• Model building can be very time consuming, but testing can be very fast. We can also download already trained networks and continue their training before applying them.

• They can handle redundant attributes because the weights are learned automatically.

• Sensitive to noise in the training data: the data should be carefully cleaned, with no bias and no noise. One of the reasons why NNs have become famous in recent times is that with big data the noise impacts training less (unless the dataset is really noisy).

• Not able to handle missing attributes.

Tips and Tricks of NN Training

Datasets should normally be split into:

• Training set: used to update the weights; here the weights are learned. Records in this set are repeatedly presented in random order and the weight update equations are applied after a certain number of records.

• Validation set: used to decide when to stop training (by monitoring the error) and to select the best model configuration. Here we monitor the loss and, when it stops decreasing, we stop the training. The validation set is CRITICAL.

What is a validation set? A dataset used during the training phase to check the NN performance, but not used to calculate the NN weights. Recall that in RIPPER the validation set is used to prune the rules: we were checking whether an already extracted rule was too specific or not. Here we just check the loss, but we do NOT use the data in the validation set to change the weight values.

• Test set: used to test the performance (accuracy, F1, ...) of the neural network. It should not be used as part of the neural network development and model selection cycle.

Before Starting: Weight Initialization



• The choice of the initial weight values is important, as it decides the starting position in weight space, i.e. how far we are from the global minimum.

- The aim is to select weight values which produce midrange function signals
- Select weight values randomly from a uniform probability distribution
- Normalize the weight values so that the number of weighted connections per unit produces a midrange function signal
- Weights can be placed so that there is some sort of signal at the beginning

• Try different random initialisations to
- Assess robustness
- Have more opportunities to find optimal results

Two learning modes (plus one) → how to train the network

- Sequential mode (online, stochastic, or per-pattern)

• Weights are updated after each record is presented
• Many weight updates: convergence can be quicker, but learning is less stable and more influenced by the initial set of weights

- Batch mode (off-line or per-epoch). All the records are passed to the NN

• Weights are updated after all records have been presented, i.e. once per epoch
• Can be very slow and can lead to being trapped in early local minima, because the majority of the training records activate the network's neurons in a particular way, producing a weight adjustment that is not good for the DIFFERENT training records in the data

- Minibatch mode (a blend of the two above), the MOST ADOPTED

• Weights are updated after a few records (from tens to thousands) are presented → subsets of training records are selected and classified in the forward step; then the backpropagation step is applied, the weights are updated, and another minibatch is processed until all the records have been analyzed. After this, one epoch has passed and the second one starts with another random selection of minibatches.

• Best of both (and good for GPUs)
Convergence Criteria

- Learning is obtained by repeatedly supplying the training data and adjusting the weights by backpropagation

• Typically 1 presentation of the training set = 1 epoch → all the training records have been classified by the NN

- We need a stopping criterion to define convergence

• The Euclidean norm of the gradient vector reaches a sufficiently small value

• The absolute rate of change in the average squared error per epoch is sufficiently small

• Validation for generalization performance: stop when the generalization performance reaches a peak → the typical choice. We have a maximum number of epochs (say 100) and we start the training. After each epoch we check the loss on the validation set.

We note that on the training set the loss decreases, while on the validation set (since it is not the same data used to adjust the weights) the model can go into overfitting: the validation loss decreases up to a certain point and then increases. The minimum point is the right one at which to stop the training, otherwise we are overtraining.

Typically we consider a certain number of iterations: for example, every 5 epochs we store the model and the loss on validation and training. If after 10 more epochs the validation loss has increased while the training loss keeps decreasing, we can apply this heuristic, stop the computation, and take the model saved 10 epochs ago, which was the one with the lowest validation loss observed and therefore the HIGHEST LEVEL OF GENERALIZATION.

Early Stopping

• Running too many epochs may overtrain the network and result in overfitting and perform
poorly in generalization

• Keep a hold-out validation set and test accuracy after every epoch. Maintain weights for
best performing network on the validation set and stop training when error increases
beyond this

• Always let the network run for some epochs before deciding to stop (patience parameter),
then backtrack to best result
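A minimal sketch of this early-stopping logic with a patience parameter; train_one_epoch, validation_loss and the model's get_weights/set_weights methods are hypothetical placeholders for whatever training framework is being used.

def fit(model, max_epochs=100, patience=10):
    best_loss, best_weights, epochs_without_improvement = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)                    # forward + backward passes on the training set
        val_loss = validation_loss(model)         # loss on the validation set (no weight update here)
        if val_loss < best_loss:
            best_loss, best_weights = val_loss, model.get_weights()
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                             # stop and backtrack to the best model
    model.set_weights(best_weights)
    return model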

Model Selection

• Too few hidden units prevent the network from adequately fitting the data and learning the concept.

• Too many hidden units lead to overfitting, unless you regularize heavily (e.g. dropout, weight decay, weight penalties).

• Cross-validation should be used to determine an appropriate number of hidden units, using the validation error to select the model with the optimal number of hidden layers and nodes. We can do a random search: typically, in the literature, one starts from a set of network architectures and then changes some things (number of epochs, activation function, ...).

Regularization to recover from Overfitting

• Constrain the learning model to avoid overfitting and help improve generalization.

- Add penalty terms to the loss function that punish the model for excessive use of resources

• Limit the magnitude of the weights used to learn a task: if a weight is too high, a particular combination of features can contribute too much to the final output and lead to overfitting

• Limit the total activation of the neurons in the network

Terms for regularization



Dropout Regularization

It is the most common type of regularization: it is one of the best ways to prevent
overfitting. A network with dropped units is not used at the prediction time, but the full network
with the resulting weights is used. When we have dropouts, some weights are not used
and therefore not adjusted, but eventually all are used. If we use it at the prediction
time, the predictions are given with a confidence interval.

Momentum

• Add a term to the weight update equation to store an exponentially weighted history of the previous weight changes

- This reduces instability problems while increasing the rate of convergence

• If the weight changes tend to have the same sign, the momentum term increases and gradient descent speeds up convergence on shallow gradients

• If the weight changes tend to have opposing signs, the momentum term decreases and gradient descent slows down, reducing oscillations (it stabilizes)

• It can help escape being trapped in local minima. For example, if we have this loss function and our initial random weights are up there, with the stochastic gradient descent method we end up in a local minimum (LM). Without a technique like momentum, dropout, etc., it is easy for that local minimum to be the only solution found, while the model should notice that there are better solutions that minimize the loss even more and lead to higher accuracy.

Which of the following are valid regularization techniques for NN ?

- Dropouts
- Add penalty terms to loss
- Reducing the number of hidden units
- Increasing the number of hidden layers
- Using cross-validation

Question: regarding the number of hidden layers, if we change a layer we have to recalculate the weights. We start with an architecture and train it from scratch: for example, we start with 5 layers and train with 5 layers; if we then change the layers, the weights also change.

lesson 29.03.2021

Choosing the Optimization Algorithm:

- Standard Stochastic Gradient Descent (SGD) is one of the best algorithms, where the only
requirement is that the activation function must be differentiable, and it is often used
with momentum. Disadvantages: Difficult to get the best learning rate and convergence
is unstable.
- RMSprop is another algorithm with an adaptive learning rate ÿ (it starts from a high value and
the algorithm decreases it via a weighted moving average of the squared gradient). This
approach speeds convergence through faster gradients when needed.
- Adagrad extends RMSprop with element-by-element gradient scaling.
- ADAM is like Adagrad but adds an average of the previous gradients like the
momentum, which decreases exponentially.

Convolutional Neural Networks


(He didn't include them in the course, but he talks about them because they are the most used)
They are neural networks mainly used for the classification of images and time series, and the crucial
difference with the neural networks seen so far lies in the use of convolutional layers which
act as a "filter" to be applied to the image to be classified, compressing the image into an array
of values, which in turn will be compressed into a vector.

Recurrent Neural Network (He


didn't include them in the course, but he talks about them because they are the most used)
Typically used in Natural Language Processing (NLP), where the nodes of the network remember
the previous states, as the current state of the node depends on a function that has as
parameters the old state and an input vector. For example, if the letter "h" enters the input,
the algorithm calculates that the next letter will be "e", which will be sent as input and the letter "l" will
be calculated, and so on until the final word " hello”. This is what happens when Google predicts
what we are typing when we do a search.

Ensemble methods
Aggregate the predictions of multiple classifiers to improve accuracy, by building a set of base classifiers from the training set. The classes predicted by the base classifiers on the test set are then combined to obtain a joint result.

The wisdom of the crowds


Knowledge derived from an independent group of individuals is superior to that of a single individual. Even if
the group is made up of both experts and non-experts, the estimate of the whole group will be better than that
of the single expert (also because they may be influenced by their previous knowledge or it is difficult
to find experts for certain topics) .

The advantages of the group of people are:

- Diversity of opinion: Different people have different backgrounds and experiences


- Independence: it is assumed that the decision of the individual is not influenced by that of the
crowd

- Decentralization: individuals specialize in some "local knowledge"


- Aggregation: how the different opinions are aggregated in the final result

In the field of machine learning, the combination of classifiers allows for more accurate predictions.

For example: suppose we have 25 classifiers, each of which has an error rate ε = 0.35, and the errors are uncorrelated. The ensemble (majority vote) makes an incorrect prediction only if at least 13 of the 25 classifiers are wrong, so its error probability is

P(error) = Σ_{i=13..25} C(25, i) * 0.35^i * 0.65^(25-i) ≈ 0.06

which is much less than the error of the single classifier (0.35).
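The same binomial sum computed in Python (requires Python 3.8+ for math.comb):

from math import comb

eps, n = 0.35, 25
# The majority-vote ensemble errs when at least 13 of the 25 base classifiers err
p_ensemble_error = sum(comb(n, i) * eps**i * (1 - eps)**(n - i) for i in range(13, n + 1))
print(round(p_ensemble_error, 3))   # ~0.06, much lower than the base error 0.35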

This graph shows that, as long as the error probability of the single classifier is less than 0.5, it is better to use an ensemble classifier; otherwise the single "individual" is doing no better than random guessing and the ensemble classifier would be worse than the single one.

Exercises on ensemble classifier


Consider three independent models for the same data, with poor performance:

- Error1 = 45%
- Error2 = 40%
- Error3 = 35%

We ask ourselves: is it better to use model 3 alone because it has the best performance, or is it
better to use them all together?
In the previous example, the formula for calculating the probability of an ensemble failing assumes
that the errors are all the same, but in this case, they are different for each model.
Now we must explicitly enumerate each case:

We now get 8 possible cases (2 outcomes ^ 3 models):



The cases we want to avoid are those where at least two models make a classification error
(highlighted lines). For each unfavorable case we calculate the probability of obtaining that particular
combination by multiplying the respective individual probabilities, and adding the probabilities of each
unfavorable case we obtain the total probability (probability that those four unfavorable cases occur).
Since the ensemble probability is 35.15%, it is convenient to use only model 3 since it has a lower error
probability.
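A minimal Python sketch of this enumeration (the variable names are illustrative; it simply reproduces the computation above):

```python
from itertools import product

errors = [0.45, 0.40, 0.35]   # error rates of the three independent models

p_ensemble = 0.0
for outcome in product([True, False], repeat=3):    # True = that model is wrong
    if sum(outcome) >= 2:                            # majority (at least 2 of 3) is wrong
        p = 1.0
        for wrong, err in zip(outcome, errors):
            p *= err if wrong else (1 - err)
        p_ensemble += p                              # add this unfavorable case

print(round(p_ensemble, 4))   # 0.3515, i.e. the 35.15% computed above
```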

Question: what makes a crowd wise?

Answer: diversity of opinions, independence, specific and localized knowledge (decentralization), aggregation

Types of ensemble methods

Types: bagging and boosting (manipulate data distribution) and random forests (manipulate input
features).

Random Forests
Is a class of ensemble methods created for decision trees, which combines the predictions of multiple decision trees via a majority vote (the mode of the classes predicted by the individual trees). It is important to remember that the trees are not all the same.

Each decision tree is built on a bootstrap sample (a sample from the training set) based on values
from an independent set of random vectors and a random selection of the columns.

Let's assume we have a dataset with n records and M features, and that we create 3 decision trees for our random forest. Each tree is built on a bootstrap sample of the training set: the cardinality of each sample is still n, but the samples do not contain exactly the same instances. Records are selected at random with replacement (at every draw each record has probability 1/n of being picked), so some records appear several times and others not at all. It is as if each decision tree were built from a dataset that has the same size as the original dataset, combined with a random selection of the columns.

Each tree is evaluated on m attributes randomly selected from the M available attributes. The subset size m is usually chosen according to simple rules (typical choices in the literature are m = √M or m = log2(M) + 1).
The number of columns of each sample is therefore much smaller than the total number of columns of the original dataset.

Advantages of random forest
It is one of the most accurate and efficient classification algorithms; it works even on very large datasets and with a large number of features (also because it estimates which features are more important). Another important aspect is that it does not need data normalization, because the variables are analyzed independently of each other.
Curiosity: a Deep Random Forest can also be built, i.e. an ensemble of the results of several random forests.
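As a hedged illustration (assuming scikit-learn is available; the course does not prescribe a specific library), the two ideas above, a bootstrap sample per tree and a random subset of features per split, map to code like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rf = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    max_features="sqrt",   # m = sqrt(M) features considered at each split
    bootstrap=True,        # each tree is trained on a bootstrap sample of size n
    random_state=42,
)
rf.fit(X_train, y_train)
print("accuracy:", rf.score(X_test, y_test))
print("3 most important features:", rf.feature_importances_.argsort()[-3:])
```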

Question: In a Random Forest classifier… each base classifier is a Decision Tree and each base classifier works on a bootstrap sample.

31/03

bagging

To define bagging, we need to remember the concept of bootstrap: it is a statistical re-sampling technique that selects a random subset of samples from a given dataset.
In particular, consider a dataset with n records X = {x1, …, xn}. If we want to create m re-samples of X, all with the same constant size n, we repeat the same operation m times: we randomly select records from X, obtaining {X1*, …, Xm*}, where each record has probability 1/n of being selected at each draw and can be picked more than once or not at all. In this way each bootstrap sample Xi* has the same size as X, but not the same records.
This is the same procedure used by Random Forest to select the instances for the decision trees.

Bagging comes from Bootstrap AGGregatING .



Given the definition of bootstrap, which is essentially sampling with replacement, the idea is the following: given an original dataset like the one below, you decide how many bagging rounds you want to do, in this case 3.
So m = 3 and from the original data we bootstrap 3 datasets: we have 10 records and, for example, in the first round the tenth record is selected 3 times and the fifth 2 times, etc.

We create these m = 3 datasets and then train a classifier on each of these bootstrap samples (for example, a KNN for the first round, one for the second and one for the third). At the end, given a new instance, a prediction is made as in Random Forest.
Bagging can also reduce dimensionality if a random subset of columns is selected.

Example:
Considering this 1-dimensional dataset:

The classifier is a decision stump: it can only make decisions based on a single feature (in this example we have only one feature, x). Note that with a single split it is not possible to separate the classes perfectly:
- Decision rule: x <= k versus x > k
- The split point k is chosen based on the entropy

First step:

In the example 10 stumps are learned, the decision threshold moves according to the selected data.

At the end, the 10 learned models can be represented in this way:

When you have a new instance to classify, you can use majority voting to determine the class assigned by the ensemble.
For example, given x = 0.5: round 1 gives -1, round 2 gives 1, round 3: -1, round 4: -1, round 5: -1, rounds 6, 7, 8, 9: -1 and round 10: 1 → therefore 8 votes for -1 and 2 votes for 1 → the class is -1, and it is correct, as we can see from the original data.

It can also be represented more compressed like this, using the majority vote to determine the class of
the ensemble classifier:

In this representation we can sum the 1 and -1 predicted as classes and take the sign of the sum, obtaining the decision boundary of the classifier.
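A minimal sketch of the whole procedure on a 1-dimensional dataset with decision stumps as base classifiers (the data and the helper functions are illustrative, not the exact values used in the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
y = np.array([1, 1, 1, -1, -1, -1, -1, 1, 1, 1])   # not separable by a single split

def fit_stump(xs, ys):
    """Try every threshold k and both orientations, keep the most accurate rule."""
    best = None
    for k in np.unique(xs):
        for sign in (1, -1):
            acc = np.mean(np.where(xs <= k, sign, -sign) == ys)
            if best is None or acc > best[0]:
                best = (acc, k, sign)
    return best[1], best[2]                          # threshold and orientation

stumps = []
for _ in range(10):                                  # 10 bagging rounds
    idx = rng.integers(0, len(x), len(x))            # bootstrap sample (with replacement)
    stumps.append(fit_stump(x[idx], y[idx]))

def predict(x_new):
    votes = sum(np.where(x_new <= k, sign, -sign) for k, sign in stumps)
    return np.sign(votes)                            # majority vote = sign of the sum

print(predict(np.array([0.5])))                      # ensemble prediction for x = 0.5
```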

Visual Example:

You can imagine that in the first bagging iteration, using for example a decision tree, the first base classifier learns certain decision boundaries; the same happens in all the other iterations, and each time the decision boundaries have a different shape. Putting them all together, you obtain something resembling the true decision boundary used to generate the data (from the previous figure).

Question: Correct statements on bagging: Random Forest adopts bagging (bagging is much more general
than RF), in bagging each base classifier adopts a sample of the features.

Boosting

Boosting is similar to bagging, except that each random selection of data for one bootstrap sample affects the next one: the idea is to increase, in the next bootstrap sample, the probability of picking instances that were misclassified by the classifier trained so far.
- An iterative procedure to adaptively change the distribution of the training data by focusing more on previously misclassified records
- Initially, all records are assigned equal weights
- Unlike bagging, weights may change at the end of each boosting round
- Records that are classified incorrectly have their weights increased: the greater the weight → the greater the probability of being selected
- Records that are classified correctly have their weights decreased

For example, in this example it can be seen that the fourth record appears 5 times → it is difficult to classify, therefore it is more likely to be selected in subsequent rounds.
Each model is associated with an importance (alpha): when we classify a new record, we have to weight the prediction of each model in relation to its importance.
The final outcome could be 1 even if the majority of the predictions is 0.

Question: Correct Statements on Boosting: each iteration and sampling affects the next one, correctly
classified instances get low weight.

AdaBoost
It is an algorithm that builds base classifiers C1, C2, …, CT. The error rate of each classifier is calculated taking into account the weight that each instance has at the current iteration:
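The slide formula is not reproduced in these notes; a common way to write the weighted error of the base classifier $C_i$, consistent with the worked example later (where the error of the first stump is the sum of the weights of its misclassified records), is:

$$\varepsilon_i = \sum_{j=1}^{N} w_j \,\delta\big(C_i(x_j) \neq y_j\big)$$

with the weights $w_j$ normalized so that they sum to 1.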

This is the classification error of a single classifier, weighted (w) with respect to the importance that each instance has at the current time on the dataset.
We then use the error to compute the importance of the classifier:
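This is the standard AdaBoost importance (it matches the value ½ log(7) ≈ 0.97 computed later in the notes for an error of 1/8):

$$\alpha_i = \frac{1}{2}\ln\frac{1-\varepsilon_i}{\varepsilon_i}$$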

In the graph it can be seen that the domain is between 0 and 1 and the codomain between 5 and -5, so the importance will be between -5 and 5 while the error stays between 0 and 1.
This is because we have a high positive importance when the error is close to zero (e.g. an error of 0.2 → importance about 1.5, but an error of 0.8 → importance about -1.5).

AdaBoost algorithm

This way AdaBoost updates the weights in subsequent iterations:
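A reconstruction of the update rule described below:

$$w_j^{(i+1)} = \frac{w_j^{(i)}}{Z_i} \times \begin{cases} e^{-\alpha_i} & \text{if } C_i(x_j) = y_j \\ e^{\alpha_i} & \text{if } C_i(x_j) \neq y_j \end{cases}$$

where $Z_i$ is a normalization factor that makes the new weights sum to 1.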

Each record has an associated weight that tells how important it is for the classification, and the update depends on whether the classifier at a certain iteration classifies it correctly or not. The new weight equals the old weight times e raised to the power -α (minus the importance): the negative exponent reduces the weight of correctly classified records, while the exponent is positive (the weight increases) if the base classifier at iteration j misclassifies the record.
If an intermediate round produces an error rate greater than 50%, the weights are reset to 1/n and the resampling procedure is repeated.

Classification:
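The final prediction (a reconstruction of the standard rule, consistent with the two ways of classifying the point 0.5 shown later) is:

$$C^*(x) = \arg\max_{y}\sum_{i=1}^{T}\alpha_i\,\delta\big(C_i(x)=y\big) \quad\text{or, for classes } \pm 1,\quad C^*(x)=\operatorname{sign}\Big(\sum_{i=1}^{T}\alpha_i\,C_i(x)\Big)$$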

Algorithm:

Explained with the previous example:



In the first iteration all instances have the same weight and we obtain the first stump: the best split classifies all the records to the left of the red line as -1 and those to the right as 1.

Referring to the table above, we have misclassifications for the records at 0.1, 0.2, 0.3 and correct classifications for all the others. If I classify the original data using the first stump, I can see how the weight decreases from 0.4 onwards and increases for 0.1, 0.2, 0.3. At the second iteration, instead, the points from 0.4 to 0.7 are misclassified → the weight of all the other points decreases while the weight of those elements increases.

At the third iteration the decision threshold of the stump is at 0.2, and so on.

In the last table there are the weights of all the elements, which allow us to calculate the importance of the 3 classifiers.

For example, if I want to classify the point 0.5: it is smaller than 0.75, so it is classified as -1 in the first round, 1 in the second and -1 in the third → I weight these different results with respect to the importance of each classifier.

For class -1: importance 1.738 + 4.1195 = 5.85. For class 1: importance 2.77. Class -1 has the greater total importance, so the prediction is -1, which is correct.
Another possibility is to use the classification formula directly: plugging 0.5 into it, the result is -3.08, and since it is negative the sign is -1.

If we want to see a visual example, this time we have something a little different:

The circled points are misclassified records.

Here too the circled points are misclassified, and therefore their weight (and influence) increases.
The third plot is the weighted strong classifier obtained as a combination of the previous ones.

We then go on until we approach the real decision boundaries.

As the iterations increase, the misclassified instances get closer and closer to the decision boundaries; this is reasonable because they are the hardest points to separate. The algorithm stops after a predefined number of base classifiers has been created.

AdaBoost-Reloaded

- AdaBoost uses stumps, i.e. decision trees with a single node and two leaves. It generally uses a forest of stumps.
- Stumps are not very good at making accurate classifications: they are weak learners.
- The combination of various weak learners creates a strong learner.
- In a forest of stumps, some stumps have more importance than others in the final classification.
- Each stump is not created independently from the others.
- The mistakes that the first stump makes influence the second stump, and so on.

AdaBoost - Step by Step Training

With this dataset with 3 features, we initially have all records with the same weight: 1/ number of
samples.
What we have to do is look for the first stump in the forest.
All weights are the same so you can ignore them for now.

At this point the gain is calculated with the GINI in this case:

· We want to minimize the impurity, so the best split is the last one.
· This is therefore the first stump of our forest; first we want to calculate the error of the stump, which is 1/8.
· Now we want to determine the importance of the stump for the final classification and to update the weights (with the importance formula seen before).
· We determine the importance based on how well the samples are classified, in terms of total error.
· The total error is always between 0 and 1.

Since the error is very low we have a high importance: in our case it is ½ log(7) = 0.97. If we had chosen another stump, such as chest pain, we would have had a different error, 3/8, and an importance of 0.42. Now we use the importance (0.97) to increase the weight of the samples that are misclassified and to reduce the weight of the correctly classified ones.

The new sample weight will be:

This graph shows how the scaling factor applied to the weights varies as the importance varies.

In this case we have 1/8 · e^0.97 = 1/8 · 2.64 = 0.33 > 0.125 (= 1/8).

For the correctly classified instances, the other part of the formula is used, obtaining: 1/8 · e^-0.97 = 1/8 · 0.38 = 0.05 < 0.125.
So now we have a new column of weights and a column with the normalized weights, which together add up to 1.
With the new weights, updated and normalized, we can replace the old ones (so we replace the values of Sample Weight with those of Norm. Weight).

Now you can use the modified sample weight to make the second stump in the forest.
We have two possibilities: in theory we could reuse the original dataset and adopt the weighted Gini index to determine which variable should split the next stump. Alternatively, we can create a new training set that contains duplicate copies of the samples with higher sample weights. Possible example:

Finally, assuming we ran AdaBoost for 6 iterations, given a new patient we want to predict whether he will have heart disease.
We add up the importance of the stumps voting for each class: as you can see, heart disease = Yes has a higher total importance, so it is more likely that the patient has heart disease.

Other exercises to try:



(The solutions are in the slides; these are questions that could potentially be asked in the exam, e.g. how do you obtain the normalized weights? This exercise could be asked in the exam.)

Lesson 12/04/2020
TIME SERIES
What is a Time Series?
A TIME SERIES is a collection of observations made sequentially over time, usually at constant time intervals.
Relative to one dimension or multiple dimensions.

Time Series are Ubiquitous
• You can measure many things, and things change over time:
• Blood pressure
• Donald Trump's popularity rating
• The annual rainfall in Pisa
• The value of your stocks
• In addition, other data types can be thought of as time series:
• Text data: word counts (the appearance of every word can be considered an instant)
• Images: edges displacement
• Videos: object positioning

Time Series Problems:
• Lots of data
• The similarity between two time series is difficult to define
• We can have different formats
• Different sampling rates (sampling rate = time passing between two subsequent observations)
• Noise, missing values, etc.

What can be done using time series?


• we can observe trends and seasonality (whether the time series grows, decreases, has strange shapes, and therefore whether there is noise to remove)
• clustering
• motif discovery = the analogue of pattern discovery → recognize parts of the time series that appear multiple times (called motifs)
• we can discover rules (as in pattern mining and rule extraction)
• we can forecast (forecasting)
• we can perform classification

What these methods have in common is SIMILARITY


What is similarity? It is the quality or state of being similar, similarity as similarity of
characteristics.
In the Time Series we can recognize two types of similarity:
- similarity at the level of shape -
similarity at the structural level

Structural-based Similarities
• For long time series, similarity based on shape gives very poor results. Here it seems that A and B are very similar, but we might find that time series B is evaluated as more similar to C.
• We need to measure similarity based on the high-level structure.
• The basic idea is to: 1. extract global features from the time series, 2. create a feature vector and 3. use it to measure similarity and/or classify (see the sketch below).
Examples of features:
• mean, variance, skewness, kurtosis
• 1st derivative mean, 1st derivative variance, ...
• regression parameters, forecasting parameters, Markov model
• e.g. time series a has a max value of 11, b of 12, etc.

Compression Based Dissimilarity


Sometimes, a dissimilarity is calculated in a compressed version of the time series. The features we
use are no longer understandable, but are the “hidden features” used by a compression algorithm
to represent a time series.
Compression Dissimilarity Measure (CDM) function:

C(x,y) = compression algorithm applied to the two time series concatenated together
C(x) = compression algorithm applied to the single time series

The more similar the time series are, the smaller the resulting ratio will be.
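The formula itself is not in these notes; from the two definitions above it is presumably the ratio

$$CDM(x,y) = \frac{C(x,y)}{C(x)+C(y)}$$

where C(·) denotes the size of the compressed representation.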

Ex: if we use hierarchical clustering using Euclidean distances in this dataset the result will be this= at the
end the yellow time series will be merged with the red one.

Running the same algo with the CDM distance we will get the clustering we want:

NB= these distances are not widely used

Shape based similarities


Distance measure notions:
- Let A and B be two objects in the universe of possible objects. The distance (dissimilarity) is
denoted by D(A, B).
- Distance measure properties:
• D(A,B) = D(B,A)  Symmetry
• D(A,A) = 0  Constancy
• D(A,B) = 0 iff A = B  Positivity
• D(A,B) ≤ D(A,C) + D(B,C)  Triangular Inequality

(when the last property doesn't hold, we have the figure on the right)

Euclidean Distance
From now on we assume that the time series we analyze have the same length, i.e. the same number of observations. Furthermore, assume that the sampling rate is the same, so we can perfectly align the first observation of time series Q with the first observation of time series C.
• Q = q1 … qn
• C = c1 … cn

How is the Euclidean distance calculated ? Root of the sum of the squared differences, at each time interval.
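In formula:

$$D(Q,C) = \sqrt{\sum_{i=1}^{n}\big(q_i - c_i\big)^2}$$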

Problems with Euclidean Distance:


• Euclidean distance is very sensitive to "biases" (distortions) in the data.
• These distortions are dangerous and should be removed.
• Most common distortions:
• Offset Translation
• Amplitude Scaling
• Linear Trend
• Noise
• They can be removed using the appropriate transformations.

Transformation I: Offset Translation



In the first image (top left): the two time series have the same shape.
In the second image (top right): if we calculate the Euclidean distance we find that they are quite dissimilar.
In the third and fourth images: we want to place the time series one above the other, i.e. perform a normalization. How do we normalize? A simple way is to subtract the mean of the time series from the time series itself.
Ex: Q = Q - mean(Q) → the new time series Q is given by Q minus its mean. The same for C.

Transformation II: Amplitude Scaling

We have two time series, the blue and the green. With amplitude scaling we not only subtract the
mean, but also divide by the standard deviation (corresponds to the standard scaling
normalization).
The idea is to see how similar the two time series are without considering the mean and the standard
deviation.

What is the difference between this normalization and the standard scaling seen for ordinary datasets? (He will ask for it in the exam)
When we do standard scaling:
- we take the mean and the standard deviation of column A: mean(A) and STD(A)
- x'i = (xi - mean(A)) / STD(A)
- here each row r1, r2, … is a time series
When we do amplitude scaling, instead, we do not calculate the mean and the std of each column (each timestamp): what we do is calculate the mean and the STD per row. The formula has the same shape: x'i = (xi - mean(t1)) / STD(t1).

In the first case, therefore, we have a normalizer for which we have a transform and an inverse transform. In the second case we cannot do an inverse transform, because each normalization refers specifically to one time series. That is, if we run clustering on time series and in preprocessing we need to run amplitude scaling, at the end the time series representing the centroid will be amplitude-scaled: if we later want to represent the points again in the original scale, we cannot apply an inverse, because the normalization was applied separately for every time series.
In short: we normalize per row (per time series) for data that comes from a time series; we normalize per column for data that comes from ordinary attributes, and in that case it is possible to do a denormalization.
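A small numpy sketch of the difference (the matrix values are hypothetical; rows are time series, columns are timestamps):

```python
import numpy as np

X = np.array([[1.0, 2.0, 3.0, 4.0, 5.0],       # time series t1
              [10.0, 20.0, 30.0, 40.0, 50.0],   # time series t2
              [2.0, 2.5, 3.0, 3.5, 4.0]])       # time series t3

# Standard scaling (column-wise): mean/std computed per timestamp, invertible
col_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Amplitude scaling (row-wise): mean/std computed per time series, so each
# series gets its own normalization and no single inverse transform exists
row_scaled = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
```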

Transformation III: Linear Trend


We can remove the linear trend to compare two time series: we find a straight line that fits the series
well, and then the straight line is subtracted from the time series, so we can compare the two series.

Transformation IV: Noise


As you can see from the two time series in the image below: they are very similar but there is some noise.

To remove noise from the time series, a smoothing function is defined that averages each data point with its neighbors. Smoothing = making points that are close in time similar to each other, i.e. points that fall in the same time interval should have similar values.

Moving Average
One of the most common methods for removing noise is the Moving Average (MA), in which a time window of length w and a TS t are defined.
Moving average calculation:

According to this formula i is the central point and the average is calculated considering the value of the previous data
point (i-1) and of the following one (i+1).

Example with w=3:

Since the mean is calculated using previous and next, the first item in the list has no previous, so the moving average
cannot be calculated (and the last will not have the next). There are two solutions to this problem: 1) The missing value of
ma can be replaced with the next value (in this case

22.0)
2) Or add a row at position 0 with the same value as row 1, and a row
following the last one (in this case in position 6) with the same value as the previous one, like this:
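A pandas sketch of the centered moving average with w = 3, together with the padding trick of option 2 (the series values are hypothetical):

```python
import pandas as pd

ts = pd.Series([4.0, 6.0, 5.0, 7.0, 8.0, 6.0])   # hypothetical time series
ma = ts.rolling(window=3, center=True).mean()     # NaN at both ends

# Option 2: pad the series with a copy of its first and last values, so the
# centered moving average is defined everywhere, then drop the padding
padded = pd.concat([ts.iloc[:1], ts, ts.iloc[-1:]], ignore_index=True)
ma_padded = (padded.rolling(window=3, center=True).mean()
             .iloc[1:-1].reset_index(drop=True))
```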

Dynamic Time Warping



Sometimes two time series that are conceptually equivalent evolve with different speeds, at least
in some moments

In the first image (fixed time axis = sequences are aligned "one to one") we see that there is a fixed alignment up to a certain point; then a misalignment develops.
In the second image (Warped Time Axis) we see how Dynamic Time Warping manages to correct the misalignments. This is why the Euclidean distance alone is not enough.

How is Dynamic Time Warping calculated?


1. We create a matrix of dimensions |Q| by |C| and populate it with the distance between each pair of points of our two time series. What does it mean? Each cell contains a distance. In the end we get a matrix like the one on the right.

2. The Euclidean distance works only on the diagonal of the matrix. The sequence of comparisons performed:
• Start at the pair of points (0,0)
• After the point (i, i) move to (i+1, i+1)
• End the process on (n, n)

The DTW distance can move "freely" off the diagonal of the matrix. These off-diagonal cells correspond to temporally shifted points in the two time series.

Every possible warping between two time series is a path through the matrix.
• The constrained sequence of comparisons performed:
• Start at the pair of points (0,0)
• After the point (i, j), i or j increases by one, or both (see image on the bottom right)
• End the process on (n, n)

This path must have a guarantee of optimality, so that we can say we have discovered the best alignment between Q and C: it is as if we took the two time series, separated them, tried to match each point with the point that best corresponds to it, and finally returned the sum of these distances. We need Dynamic Time Warping because the Euclidean distance cannot handle time series that grow and decrease at different speeds while keeping the same shape.
All the preprocessing required for the Euclidean distance is also required here (amplitude scaling, noise removal, ...).
One problem DTW may have is running time, because it has to compute the distance between points for every pair of time series we have: if a time series has length m, computing the distance between two objects costs about m^2, and this has to be multiplied by the number of pairs of time series.

Final questions:
1. Which of the following information can be represented as a time series? covid-19
infections, temperatures in a region, number of purchase orders received by the
company, blood pressure.
2. How to calculate a structural-based distance between two time series?
- define the feature set we are interested in - compute
the same feature for the two time series - create two
vectors with the feature values - apply a
traditional distance function between the feature vectors 3. Which
of the following transformations of the time series can be used to remove the different
variability before shape-based distance calculation? Amplitude scaling, because
different variability means that time series go up and down.
4. Which of the following time series transformations can be used to remove different
range effect before shape-based distance calculation? Offset translation 5. Which
of the following time series transformations can be used to remove
noise? Moving averages

Lesson Wednesday 14/04

Let's continue: How is Dynamic Time Warping calculated?


Every possible warping (deformation) between two time series is a path through the matrix.
We can find the best one by using the DTW definition recursively: γ(i,j) represents the cost of the best path to reach cell (i,j); the final DTW distance is the value in the top-right cell, and the objective is to reach it with the minimum cost.

But how is this cost calculated? It is the distance between the two time series at times i and j (i.e. how much the values differ at those two times) plus the minimum of the three values preceding cell (i,j), i.e. cells (i-1, j), (i-1, j-1) and (i, j-1).
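Written explicitly (a reconstruction consistent with the description above):

$$\gamma(i,j) = d(q_i, c_j) + \min\big\{\gamma(i-1,j),\; \gamma(i-1,j-1),\; \gamma(i,j-1)\big\}$$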

The idea is that, since we construct the path from the bottom-left point to the top-right point, the formula fills the matrix from bottom to top.
NB: the path can only move up, right, or diagonally up-right; it never goes down or backwards.

How can we compute d efficiently using a Dynamic Programming Approach?

Step 1: calculate the matrix of all possible distances between the two time series, i.e. we consider all pairs of timestamps of the two time series and compare the values. Typically the absolute value of the differences is used, but the squared difference can also be used.

Step 2: calculate the cumulative cost matrix, i.e. the matrix of the costs of all the paths. It starts from the bottom left, i.e. cell (1,1), copying the first value, and the rest of the matrix is calculated with the formula, repeating the procedure for all the columns. In the top-right box we will have the result, which corresponds to the final cost between time series Q and C.

Step 3: along the matrix it is possible to find the best path, the one with the lowest cost, also called the "best alignment" between each pair of points.
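A minimal Python sketch of the three steps, using the absolute difference as point-to-point cost (the function name is illustrative):

```python
import numpy as np

def dtw(q, c):
    """Return the DTW distance and the cumulative cost matrix."""
    n, m = len(q), len(c)
    d = np.abs(np.subtract.outer(q, c))      # step 1: point-to-point cost matrix
    g = np.full((n, m), np.inf)              # step 2: cumulative cost matrix
    g[0, 0] = d[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            prev = min(g[i - 1, j] if i > 0 else np.inf,
                       g[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
                       g[i, j - 1] if j > 0 else np.inf)
            g[i, j] = d[i, j] + prev
    return g[n - 1, m - 1], g                # step 3: best path cost is in the corner

dist, _ = dtw([4, 3, 6, 1, 0], [3, 6, 7, 0, 1])
print(dist)                                  # 4.0, as in the exercise below
```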

Let's look at an example graphically:

Let's assume that we have already calculated the distances from step 1 and start from step 2.
At the beginning we copy the value corresponding to box (1,1): in that case the first box is exactly equal to the distance d(q,c), as written in the slide.
After that, to calculate the box above it, we apply the formula, noting however that we do not yet have all 3 predecessors to compare: we are in position γ(i,1), i.e. in the first column, and the value to insert is the current value of the cost matrix d(q,c) plus the minimum among (the cell of the previous column, which is missing), (the cell of the previous row in the same column) and (the diagonal cell, also missing). The only accessible value is therefore the middle one, i.e. γ(i-1,1).

We then repeat the procedure for the entire matrix, considering all three values.

Exercise: Dynamic Programming Approach

First of all we have to build the point-to-point cost matrix (the one represented here) using the absolute difference between the points. To ease the computation, the values of the two series are shown on the sides of the matrix: in the first position we have 2 = |3-5|, then above it 4 = |3-7|, 3 = |3-6| and so on.

After that we calculate the cumulative cost matrix, using the formula.
First of all we copy the first value (which is 2), while the value in the cell above is given by 4 + min(missing, missing, 2) = 4 + 2 = 6.

To calculate the value in the cell above that, we do 3 + min(6, missing, missing) = 3 + 6 = 9, and so on. Let's move to the second column, so as to apply the full formula. In correspondence with the first value of column 2 of the cumulative cost matrix we have 4, which is given by 2 (blue square) + min(2, missing, missing) = 4. In the box above, corresponding to 2, we have 0 + min(2, 6, 4), found as written in blue in the slide below, therefore 0 + 2 = 2.

For the cell above, the value 3 is given by 1 + min(6, 9, 2) = 3. We then fill the whole matrix and finally look for the path with the lowest cost.

DTW - Exercise 1: Given two time series, calculate the cumulative cost matrix and the distance between them:

t1 = <4,3,6,1,0>
t2 = <3,6,7,0,1>

A) Calculate the distance between t1 and t2 using DTW, with the distance between points calculated as d(x,y) = |x - y|.

B) If we repeat the computation of point A, but this time with a Sakoe-Chiba band of size r = 1, does the result change? Why?

C) If we compute DTW(T1,T2) where T1 is equal to t1 in reverse order (i.e. T1 = <0,1,6,3,4>) and the same for T2 (that is, T2 = <3,6,7,0,1>), is it true that DTW(T1,T2) = DTW(t1,t2)? Discuss the problem without doing the calculations.

→ NB: sometimes the point-to-point distance can be calculated either as the absolute difference or as the squared difference; the exam may ask for either one.

We start with the point-to-point cost matrix, calculated as the absolute difference: |4-3| = 1, |4-6| = 2, |4-7| = 3 and so on throughout the matrix.
Let's move on to the cumulative cost matrix:
- we copy the first value, i.e. 1
- the value above it is given by 2 + min(1, x, x) = 3; the value in the third row is given by 3 + min(3, x, x) = 6, and so on
- the first value of the second column (1) is given by 0 + min(1, x, x) = 1; the value in position (2,2) is given by 3 + min(3, 1, 1) = 4; the value above it is 4 + min(6, 3, 4) = 7; then in position (2,4) we have 3 + min(10, 6, 7) = 9; the value in the last row is 2 + min(13, 10, 9) = 11
- we complete the rest of the table in the same way.
The DTW value, i.e. the final result, is the one in the top-right box, in our case 4. The answer to point A) is therefore 4.

We can also find the best path, starting from the result and moving through the boxes with the lowest values. The path in this case will be the one shown.

This is the comparison made by DTW:

This figure represents a more realistic example, shown in 3 dimensions. The blue area in the bottom representation is the part with the lowest cost in the cumulative cost matrix; it is a bit like finding the valley between the mountains, where the mountains are the points where the mismatch is maximum and the cost is higher.
• This example shows 2 one-week periods from the power demand time series.
• Note that although both describe 4-day work weeks, the blue sequence had Monday as a holiday, and the red sequence had Wednesday as a holiday.
• We can see that in both lines we have two pairs of peaks: the points in the first red peak probably pay a higher cost because of the points they are compared with in the other line.
• This is the idea of DTW: capturing the common shapes regardless of temporal shifts.

There are many different and famous time series classification datasets that help us recognize faces, understand whether a person is pointing a gun or not, etc. As mentioned, using a k-NN classifier with the Euclidean distance or with DTW gives different results. Let's see them:

What we've seen so far...
• Dynamic Time Warping gives much better results than the Euclidean distance on many problems.
• Dynamic Time Warping is very, very slow to calculate!
• Is there anything we can do to speed up the similarity search under DTW? There are 2 ways.

1. Fast approximations for the DTW

Approximate the time series with some compressed or downsampled representation and compute the DTW on the new representation. Imagine we have an image on our PC and we cut out part of the pixels, so as to make fewer calculations: how can we cut them and perform this "squeeze"?

• There is usually strong visual evidence suggesting how to operate, and it works well.
• There is good experimental evidence in the literature for the usefulness of the approach for clustering, classification, etc.
• With these approximations the number of points is reduced, but in any case the general shape of the time series is preserved. In the slide we see that although there is a point every 20 seconds instead of every second, the line remains more or less the same. It is therefore not necessary to make a large number of comparisons, because with the approximation a good balance is found between data and result.

• If we don't want to make many approximations, or if the approximation is not enough, we can use Global Constraints.

2. Global Constraints

These constraints are used to speed up the calculations and to prevent pathological deformations. The idea is to focus on a particular area: we limit the displacement of the warping path with respect to the diagonal (the Euclidean alignment), therefore we set global limits in the matrix of point distances. This speeds up the calculations and prevents pathological warpings.

So we have: w_k = (i, j)_k such that j - r ≤ i ≤ j + r

where r can be considered as a window that reduces the number of calculations: it defines the allowable range of warping for a given point in a sequence.
Two possible constraints are:

Sakoe-Chiba band → the band within which the distances have to be calculated is constant for the whole matrix (dark gray band). You decide a window representing the maximum number of cells that can be used off the diagonal.

Itakura Parallelogram → same concept, but in this case the band used is a parallelogram, so the width grows up to the middle of the matrix and then shrinks again.

We simply have to calculate the point-to-point cost matrix and the cumulative cost matrix only for the points inside the band → it is like a window that reduces the number of computations and avoids unnecessary calculations.
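A sketch of how the Sakoe-Chiba band changes the dtw() sketch shown earlier: cells farther than r from the diagonal are simply never filled in.

```python
import numpy as np

def dtw_band(q, c, r):
    """DTW constrained to the band |i - j| <= r (Sakoe-Chiba)."""
    n, m = len(q), len(c)
    g = np.full((n, m), np.inf)
    g[0, 0] = abs(q[0] - c[0])
    for i in range(n):
        for j in range(max(0, i - r), min(m, i + r + 1)):   # only cells inside the band
            if i == 0 and j == 0:
                continue
            prev = min(g[i - 1, j] if i > 0 else np.inf,
                       g[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
                       g[i, j - 1] if j > 0 else np.inf)
            g[i, j] = abs(q[i] - c[j]) + prev
    return g[n - 1, m - 1]
```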

Accuracy vs width of the warping window → in the slide below we have an evaluation on the previous datasets using the Sakoe-Chiba band. Along the x axis we have the warping width, while on the y axis we have the accuracy. For w = 1 we have the Euclidean distance, but when we move towards w = 2 the performance starts to grow, up to w = 5 (in this case), and then it decreases slightly and remains constant. This means that in practice you do not have to calculate all the possible values.

→ Obviously, as the size of the band increases, the accuracy also increases (considering a classification task). If we set a width of one for the Sakoe-Chiba band we are calculating the Euclidean distance (we only consider the diagonal); as the band is enlarged the accuracy increases until it reaches a constant value → once this value is reached, it is useless to broaden the band further.

SUMMARY of time series similarity
• If you have short time series, it is preferable to use DTW, after searching for the best size of the warping window.
• If you have long time series, there are two ways:
- if something is known about your data => extract features
- if nothing is known about the data => compression/approximation-based dissimilarity.

B) the answer is that the result does not change, because r=1 means
moving by one with respect to the diagonal, creating a band limited
by the two blue segments. So we see that the shortest path still
remains within the Sakoe-Chiba band.

C) if we do the inverse of the two time series we get the same results,
despite having both the point to point matrix and the cumulative matrix
different.

Time Series - Approximation:

The idea is to use approximations to represent a TS in a new, smaller and simpler space and use this new representation for computation.
• Approximation is a special form of dimensionality reduction specifically designed for TSs. The difference is that in ordinary dimensionality reduction pieces are directly eliminated or derived variables such as the PCA components are used, while here each time series generally expresses a particular attribute (such as temperature, humidity, etc.) and we observe how that variable evolves at different times; approximation is like a "timestamp reduction".
• Approximation vs Compression: the approximated space is always understandable, while the compressed space is not necessarily understandable.

An example of approximation:
• In this graph we have a time series with 128 points; the raw data used to produce the graph is also shown as a column of numbers (only the first 30 points are displayed).
• We can decompose the data into 64 pure sine waves using the Discrete Fourier Transform (only the first sine waves are shown).
• The Fourier coefficients are shown as a column of numbers (only the first 30 or so coefficients are displayed).
• Note that in this phase we have not done any dimensionality reduction: we have simply changed the representation.
• However, note that the first few sine waves tend to be larger (equivalently, the magnitude of the Fourier coefficients tends to decrease as you move down the column). We can therefore truncate and throw away most of the small Fourier coefficients that are not necessary to represent our TS, i.e. those numbers that have little effect.

We can therefore say that it is possible to represent the TS C' only by means of the numbers in orange.

As we can see, the points describing C were 128, while those describing C' are only 8: we have removed 15/16 of the data, keeping 1/16 of it. It must be said that if we want to reconstruct the time series starting from these Fourier coefficients, it is not guaranteed that we obtain the original TS; this is due to the approximation made → there is no guarantee of a lossless reconstruction, i.e. it is not necessarily identical to the initial series.
Instead of choosing the first coefficients (as we have done), it is also possible to take the coefficients we think are the best; this certainly helps to increase the quality, but it cannot be done in all situations.
Time Series can be compressed via various transformations/approximations.

1) Discrete Fourier Transform (DFT)

- Represents the time series as a linear combination of sines and cosines, but retains only the first n/2 coefficients to represent the time series, since each sine wave requires 2 numbers, one for the "phase" (w) and one for the amplitude (A, B).
- A time series represented with the DFT is said to be in the "frequency domain".
- Many of the Fourier coefficients have very low amplitude (such as those in rows 2 and 3) and contribute little to the signal reconstruction. They can be deleted without incurring a big loss of information → space saving!
- Pros: good ability to compress most natural signals, precisely because they are made of these repeated waves; some DFT algorithms cost O(n·log(n)); fast to calculate.
- Cons: difficult to handle sequences of different lengths, and it does not support weighted distance measures. It can only be applied to time series whose length is a multiple of 2; it is not widely used.
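A small numpy sketch of the truncation idea described above (the series is synthetic, and the number of kept coefficients is arbitrary):

```python
import numpy as np

ts = np.sin(np.linspace(0, 8 * np.pi, 128)) + 0.3 * np.random.randn(128)
coeffs = np.fft.rfft(ts)                 # frequency-domain representation
coeffs[8:] = 0                           # keep only the first 8 coefficients
approx = np.fft.irfft(coeffs, n=128)     # lossy reconstruction, as noted above
```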

2) Discrete Wavelet Transform (DWT)

- Represents the time series as a linear combination of wavelet basis functions, but retains only the first N coefficients. The idea is similar to Fourier's, but the shapes of the waves are different: before, the waves were curved, now they are step-like.
- Wavelets represent data in terms of the sum and difference of a prototype function (the "analyzing" or "mother" wavelet).
- Wavelets are localized in time, i.e. some of the wavelet coefficients represent small, local subsections of the data. This is in contrast with the Fourier coefficients, which always represent a global contribution to the data.
- Haar wavelets seem as powerful as other wavelets for most problems and are very easy to code.
- Pros: good ability to compress stationary signals; there are fast linear-time algorithms for the DWT.
- Cons: it is defined only for sequences whose length is an integral power of two; it cannot support weighted distance measures; moreover, wavelets approximate the left side of the signal at the expense of the right side.

(Exercise that the professor will not ask but which is useful for understanding how it works)

3) Singular Value Decomposition (SVD)

Approach used for dimensionality reduction.
• Represents time series as a linear combination of "eigenwaves", but retains only the first N coefficients.
• SVD is similar to the Fourier and wavelet approaches in that the data is represented in terms of a linear combination of shapes (in this case the eigenwaves). Having a linear combination means that each eigenwave (as in this case, the one in the first component) has a certain coefficient beta that marks its importance.
Suppose we want to represent the TS using eigenwaves 1, 2 and 5; we will have 3 betas that tell us how much of the final TS can be represented using them.
• SVD differs from the other methods in that the eigenwaves are data dependent.
• We have already seen that we can consider TSs as points in a high-dimensional space.
• We can rotate the axes so that axis 1 is aligned with the direction of maximum variance, axis 2 is aligned with the direction of maximum variance orthogonal to axis 1, etc.

→ The idea is that, since the first eigenwaves contain most of the variance, the rest can be truncated while ensuring a good reconstruction.

4) Piecewise Linear Approximation (PLA)

Another technique widely used for approximation, especially in data science and computer science applications, because it has a more geometric approach. PLA represents the time series as a sequence of straight lines (segments).
The lines can be connected (K/2 lines are allowed) or not connected (K/3 lines), with K being the "optimal number" of values used to represent a particular time series. However, there is no deterministic way to find the optimal K: there is a trade-off between compactness and accuracy, because with many segments you can obtain a good approximation, with few segments you cannot.
Each segment is described by its length and its right height, since the left height can be influenced by the adjacent segment. If you have such a line, it is segmented like this: some segments (like the middle one) do not represent the curve very well. The information to store is therefore the length and the right height of each segment.
- Pros: it compresses the data; it is a good noise filter; this type of representation is able to support some interesting non-Euclidean similarity measures.

5) Piecewise Aggregate Approximation (PAA)

- PAA is a way to do something similar to PLA in a simpler way.
- Represents the time series as a sequence of basic box functions, each box having the same size.
- Approximates a TS by dividing it into segments/frames of equal length and recording the average value of the data points that fall within each segment.
- Reduces the data from n dimensions to N dimensions by dividing the time series into N equi-sized "frames": the average value of the data falling within each frame is calculated, and the vector of these values becomes the reduced representation of the data.

- Pros: Extremely fast to calculate; Supports non-Euclidean measures; Supports weighted Euclidean
distance
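A minimal PAA sketch (assuming the series length is divisible by the number of segments):

```python
import numpy as np

def paa(ts, n_segments):
    """Split the series into equal-size frames and replace each frame with its mean."""
    ts = np.asarray(ts, dtype=float)
    return ts.reshape(n_segments, -1).mean(axis=1)

print(paa([2, 4, 6, 8, 1, 1, 3, 5], n_segments=4))   # -> [3. 7. 1. 4.]
```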

6) Adaptive Piecewise Constant Approximation (APCA)

- Modification of the previous approach: instead of having segments of fixed length, the length is arbitrary.
- It allows segments to have arbitrary lengths, which in turn requires two numbers per segment: the first number records the average value of all the data points in the segment, and the second records the length of the segment.
- APCA has the advantage of being able to place a single segment in an area of low activity and many segments in areas of high activity.
- This method allows you to better capture the structure of the TS.
- Pros: fast to compute, O(n); supports non-Euclidean measures; supports the weighted Euclidean distance.

Just for completeness: a TS can be segmented using a predefined segment length w, a predefined number of segments k, or using change point detection methods. The latter is the best way because it does not impose a parameter that may not fit the data.

7)
Symbolic Aggregate Approximation (SAX)
- The idea is to convert the data into a discrete format, with a small size alphabet.
Each part of the representation contributes the same amount of information about the shape
of the TS.
Step 1: a time series T of length n is divided into w segments of equal size; the values in each segment are then approximated and replaced by a single coefficient, which is given by their mean. By aggregating these w coefficients, the PAA representation of T is formed.

Step 2: next, we determine the breakpoints that divide the distribution space into α equally probable regions, where α is the user-specified size of the alphabet (the MDL principle could be used instead of having the user choose it).
Breakpoints are determined in such a way that the probability of a segment falling into any of the regions is approximately the same. If the symbols were not equiprobable, some substrings would be more likely than others, and as a result we would have injected a probabilistic bias into the process. Once the breakpoints are determined, each region is assigned a symbol. The PAA coefficients can then be easily associated with the symbols corresponding to the regions in which they reside. Symbols are assigned bottom-up, i.e. the PAA coefficient falling in the lowest region is converted to "a", the one above to "b", and so on.

→ The first step is to apply PAA to the time series, splitting the TS vertically into segments of a default size.

→ Then the value axis is analyzed horizontally and the space is divided so that each region is equiprobable (each letter is used roughly the same number of times).

→ Finally, each segment is represented with the letter of the partition it falls into. Letters are assigned bottom-up, i.e. the PAA coefficient falling into the lowest region is converted to "a", the one above to "b", and so on.

What we have discussed so far is a sort of preprocessing phase for any task on time series.

Clustering:
It is based on the similarity between time series.
• The most similar data is grouped into clusters, but the clusters themselves should be
different. • These groups to find are not predefined, it is an unsupervised learning
task. • The two general methods of grouping time series are: -
Partitional clustering -
Hierarchical clustering

Hierarchical clustering:
• Calculates the pairwise distances and then merges similar clusters bottom-up, without the need to provide the number of clusters.
• It is one of the best tools for data evaluation, thanks to the dendrogram built over the different time series of the sector of interest.
• Its application is limited to small datasets due to its quadratic computational complexity.

Partitional clustering:
• Typically uses K-Means (or some variant) to optimize the objective function by minimizing the sum of the intra-cluster squared errors.
• K-Means is perhaps the most commonly used clustering algorithm in the literature; one of its shortcomings is the fact that the number of clusters, K, must be pre-specified.
• The distance function also plays a fundamental role, both for the quality of the results and for the efficiency.

Types of Time Series Clustering:
• Whole clustering: similar to conventional clustering of discrete objects. Given a dataset of individual time series, the goal is to group similar time series into the same cluster.
• Feature-based clustering: extract features or time series motifs (see later lessons) and use them as features to cluster the time series.
• Compression-based clustering: compress the time series and perform clustering on the compressed versions.
• Subsequence clustering: given a single long time series, clustering is performed on the individual subsequences extracted from it with a sliding window.

Exercises can be found at the bottom of the slides.



Lesson 04/19/21
Time Series – Matrix Profiles, Motifs & Discords

Time Series Motif Discovery


Motifs are patterns that repeat in a single TS.

Why finding Motifs?



- TS Pattern mining: Once you find the Motifs, you can apply Pattern Mining and
generate rules.
- TS Classification: We can make predictions using classifiers built on
subsequences
- TS Anomaly detection: You can find anomalies (which we will call Discords)
in the TS, through patterns that are very different from the Motifs.

How do we find Motifs?


Given a predefined motif length m (a portion of length m of the TS), a brute force method will compare each portion of length
m. The problem with this approach is the slowness caused by the cumbersome calculations.

The first approach to speed up calculations is based on the SAX method.

Example:
We define m=16 (in the slides he calls it "n") in a TS of length 1000. Then we represent the TS using the
SAX method which will convert it into a sequence of letters, every 16 timestamps.

In the table we get:


1) In line 1 (in red) we have the first motif (C1) represented by the four letters “acba”, because each w has length 4,
multiplied by 4 letters = 16 timestamps.

2) Then I randomly choose two columns (in this case columns 1 and 2), and count the occurrences of the letters: the first
pair "ac" is found both in row 1 and in row 58, the pair "bc" is found in row 2 and 985, and so on.

3) I choose two other random columns and repeat the count of occurrences.

4) After a certain number of permutations we obtain a matrix in which, in the x and y axes, we have the indices
of the subparts of the motifs. In the cells instead we have how many times we have found the same
subpart of a motif expressed in terms of SAX.

5) Let's take the maximum value which in this case is 27, which indicates in the axes that the motif of
length 16 is found (and begins) at timestamp 1 and timestamp 58.

matrix profile
The Matrix Profile is a data structure that annotates a TS, and is often used to find motifs.

Also here a subsequence length m is specified (it cannot be smaller than 1): by sliding a window of length m to the right over the TS, we find all the subsequences of length m. This gives |T| - m + 1 subsequences, where T is the time series.

Now it is possible to find the pairwise distance between all subsequences.



With this matrix the closest nearest neighbor is found for each subsequence (for each sequence the
comparison with the smallest distance is taken, discarding the others).

Later it is possible to find the Matrix Profile, which corresponds to the distance between a subsequence
and the closest subsequence, and the distance is stored in a vector. Graphically it is as if you were
generating a new TS where a point (black line) will indicate the distance between the yellow subsequence
and its nearest neighbor (probably the portion circled in red).

The Matrix Profile Index, on the other hand, shows where the closest similar subsequence is found:
in the following example, the subsequence (of length 16) at position 20 has the one at position 194 as its
nearest neighbour.

This system allows us to find the nearest neighbors in constant time.


From reading the Matrix Profile, we realize that the minimum points (base of the "valleys") are the
points with the smallest distance (greatest similarity) between the subsequences.
The pointers in the matrix profile index are not necessarily symmetric: if A points to B, B does not
necessarily point to A.

How to read a matrix profile


For low values, a subsequence of the original TS must have a relatively similar subsequence
somewhere in the data (these sections are the Motifs).
For high values, a subsequence has no similar neighbor, i.e. it has a unique shape (these values are called Discords).

How to compute the Matrix Profile


Given a time series T and a subsequence of length m:
- A distance vector is initialized in which all values are "inf" (infinite).

- At the first iteration a subsequence Ti is selected, which is compared with all the subsequences of the time series T by computing the distances, with a complexity of O(|T| log(|T|)). The distance "0" is the one calculated between the selected subsequence and itself, therefore it is not considered.

- I update the vector with the "inf" values, inserting the minimum of the values of each cell (at the first
iteration we will have only one value per cell, so the minimum will be the value just calculated).
We ignore the value "0".

- I select another subsequence Tj and repeat the previous steps for calculating the distances,
and finally I apply the minimum operator by updating the initial vector.

- Repeat all steps until each timestamp has been included at least once in a subsequence.

In the worst case the complexity is O(|T|^2 log(|T|)).
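A brute-force Python sketch of this procedure, using the Manhattan distance between subsequences and excluding only the self-match (the function name is illustrative):

```python
import numpy as np

def matrix_profile(t, m):
    """Return the matrix profile and the matrix profile index of t for length m."""
    t = np.asarray(t, dtype=float)
    n_sub = len(t) - m + 1
    subs = np.array([t[i:i + m] for i in range(n_sub)])
    mp = np.empty(n_sub)
    mpi = np.empty(n_sub, dtype=int)
    for i in range(n_sub):
        dists = np.abs(subs - subs[i]).sum(axis=1)   # distance to every subsequence
        dists[i] = np.inf                            # ignore the self-match (distance 0)
        mp[i] = dists.min()                          # matrix profile value
        mpi[i] = dists.argmin()                      # matrix profile index
    return mp, mpi

mp, mpi = matrix_profile([2, 1, 3, 4, 7, 5, 3, 4, 7, 5], m=4)   # the exercise below
print(mp)   # the zeros reveal motifs: <3,4,7,5> appears twice
```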

Exercise: Matrix Profile


Given the TS x = <2,1,3,4,7,5,3,4,7,5>:
- Build the Matrix Profile for x with m = 4, using the Manhattan distance as the distance function between subsequences.
- Draw the Matrix Profile.
- Identify the motifs with distance equal to 0 and length equal to m.
- Which would be a correct value of m that would have retrieved fewer motifs with distance equal to 0?

1) We write the TS twice to calculate the distances. The two highlighted time windows are equal, so their distance is 0; since this is the self-match, it is ignored and the vector keeps the value "inf".

2) We advance the time window by 1 and calculate the distances, after which we put the minimum in the matrix. We repeat this step until the end of the vector in red.

3) We start over, moving the yellow vector forward by 1.

4) We do all the distance calculations for each shift, and in each row we choose the minimum value. We repeat the steps until the end of the matrix.

5) Finally we obtain the values (in blue) of the matrix profile.

6) The points where the matrix profile is 0 are the Motifs.



21/04

Reminder: given a time series T, the matrix profile of a TS indicates the distance of each subpart to the most similar subpart in the same TS. Local minima correspond to motifs, subparts of the TS that appear multiple times. Then, given these distances, we can easily find the top-k motifs. How?

Motif Discovery from Matrix Profile:

It is useful to think of the subsequences of the TS as points in an m-dimensional space.

A motif can therefore be considered as a sequence representable as a point in this space. In 2 dimensions we will have a situation like this: two subsequences are close together if their distance is small. In this representation, the densest regions of points (where each point corresponds to a certain subsequence) correspond to the depressions in the matrix profile. What happens is that many similar subsequences lie close together, while different subparts are far apart.

Top-K Motifs

In the figure we show a way to extract the top-K motifs.

We need a parameter R, with 1 < R < a small number (which we can consider equal to 3); assume R = 2 for now.

Top-1 motif

We start by finding the closest pair of points, the motif pair. The motif pair is formed by the two subsequences with the lowest value in the entire matrix profile; moreover we know that the first motif pair is linked in the matrix profile index and its two members have the same minimal distance. Once this pair is found (represented at the top left of the slide), we want to extend it with the other closest points:
- Called D1 the distance between the two closest points, I select a radius D1*R, where R is the selected parameter.
- I draw the radius around the two points: if a point falls inside the circle, then this point is similar enough to the pair to be considered part of the motif.

All points that lie within one of these circles are added to the motif; in this case only one. The Top-1 motif, composed of three members, is complete. The three occurrences found can be represented in blue, as in this example: we can therefore represent these 3 parts of the ts with the same shape, 3 times.

Top-2 motif:

To find the next closest pair of points it is enough to repeat the previous procedure, excluding the points belonging to the top-1 motif.
Along the Matrix Profile we look for the smallest distance, disregarding the distances identified in the previous iteration. We find the closest pair, then we find the neighbours of these points by drawing a larger circle, of radius D2*R.

I will then consider the 2 closest points plus their neighbours: the Top-2 motif will have 4 members. Each point inside the circle is part of the Top-2 motif.

I continue until I find the K number of motifs that I set out to find, or until the minimum description length (MDL), used as a stop criterion, tells me to stop.

Note that the selected distances increase: D1 < D2 < D3 < … < DK.

- Recap: at the first iteration I look for the smallest value in the MP; suppose it is in the cell Di,j, so I will connect motif j to motif i. After that I trace the radius around them, which is calculated as eps = D1*R. If I see that there are points with a distance smaller than eps (i.e. that fall within the radius), they are also part of the motif. At the next iteration I repeat the steps.

The distance D1 is compared with the distances found in the dataset: i.e. if this distance is 5 and I have decided R = 2, this means that I will look for distances smaller than or equal to 10. At the next iteration, supposing instead we have a distance equal to 11, the radius to use will be 11*2 = 22.
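A minimal sketch of this Top-K motif procedure, assuming a matrix profile `mp` and its index `mpi` are already available; the Euclidean distance used to check which other subsequences fall inside the D*R radius, and the function name, are our own illustrative choices (the distance should match the one used to build the profile).

```python
import numpy as np

def top_k_motifs(T, m, mp, mpi, k=3, R=2.0):
    """Sketch of Top-K motif discovery from a matrix profile.
    At each step: take the pair with the smallest remaining profile value D,
    add every other subsequence within radius D*R of the pair, then exclude
    all the members from the following iterations."""
    T = np.asarray(T, dtype=float)
    mp = np.asarray(mp, dtype=float).copy()
    dist = lambda a, b: np.sqrt(np.sum((a - b) ** 2))
    motifs = []
    for _ in range(k):
        i = int(np.argmin(mp))
        if not np.isfinite(mp[i]):
            break                              # nothing left to pair
        j, D = int(mpi[i]), mp[i]
        members = {i, j}
        for c in range(len(mp)):
            if c in members or not np.isfinite(mp[c]):
                continue
            # a candidate joins the motif if it is inside the circle of one of the pair
            d = min(dist(T[c:c + m], T[i:i + m]), dist(T[c:c + m], T[j:j + m]))
            if d <= D * R:
                members.add(c)
        motifs.append(sorted(members))
        mp[list(members)] = np.inf             # excluded in the next iterations
    return motifs
```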

Anomaly Discovery from matrix profile:

The necessary steps to identify an anomaly in a time series are the following:
- I fix a parameter E, the number of subsequences to be excluded near an anomaly: this time we exclude the points that are too close to the anomaly just found. In our example E = 2.
- I find the subsequence that has the highest distance (from its first nearest neighbour) in the MP: this is the first, most important anomaly (the one identified by the arrow).
- I find the E subsequences closest to the anomaly and exclude them, removing them from the matrix profile (the points identified by the yellow and light blue arrows).

I repeat the procedure: find the point with the greatest distance, look for the E = 2 closest points and remove all three. The process is repeated until:
- I have found a number K of anomalies, or
- the minimum description length (MDL), used as a stop criterion, tells me to stop.

Typically for anomalies it is interesting to look only at the Top-1, while for motifs K = 3 or 5 is common. It depends on the length and type of the ts one has.
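A possible sketch of the anomaly (discord) search just described; as a simplification, the exclusion of the E subsequences closest to the anomaly is interpreted here as excluding the E positions adjacent to it, which is an assumption, and the function name is ours.

```python
import numpy as np

def top_k_discords(mp, k=1, E=2):
    """Sketch of anomaly (discord) discovery from a matrix profile: repeatedly
    take the subsequence with the LARGEST profile value, then exclude it
    together with its E neighbouring positions on each side."""
    mp = np.asarray(mp, dtype=float).copy()
    discords = []
    for _ in range(k):
        if not np.any(np.isfinite(mp)):
            break                                   # everything has been excluded
        i = int(np.argmax(np.where(np.isfinite(mp), mp, -np.inf)))
        discords.append(i)
        lo, hi = max(0, i - E), min(len(mp), i + E + 1)
        mp[lo:hi] = -np.inf                         # exclude the anomaly and its neighbours
    return discords
```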

: To summarize:

- The matrix profile is a model to efficiently describe a single time series; it can be extracted effectively and the process is not particularly affected by the shapes, because everything refers to the ts itself.

- Given a matrix profile you can extract many patterns: we know how to extract motifs and anomalies.

The only thing that is a bit difficult to tune is the parameter m, the one that represents the length of the window used to build the MP: it is what could affect all the calculations. Generally it is better to take a small m, but not too small, about 3 or 4. If you took too large a window you would fall into an already known problem, i.e. the curse of dimensionality! As long as small portions are being compared we can use the Euclidean or the Manhattan distance; if the portions were longer, these two distances would not be particularly reliable: the DTW distance would be more suitable, or one could apply an approximation (in this second case, however, the result would be influenced by the choice of the type of approximation) → exam question.



Recap classification: Time Series Classification:

Given a set X of n time series, X = {x1, x2, …, xn}, where each time series has m ordered values (i.e. all have the same length m), xi = <xt1, xt2, …, xtm>, and a class value Ci.

Objective: to find a function that maps the set of possible time series to the set of class values.

E.g.: let's assume that each time series is associated with a class, and let's take as example a time series like the pulse measured by a smartwatch. The goal of classification is to find a function f that identifies the classes run, sleep, swim, etc.

We generally assume that all time series have the same length m.

The main methods for TS classification are particular types of Deep Neural Networks (convolutional neural networks) but also Ensemble Methods. Both techniques require raw data as input, i.e. all timestamps. We can practically obtain the same performances with a transformation of the ts dataset, using a Shapelet-based classifier.

SHAPELET
A shapelet is a pattern/subsequence that is most representative of a class with respect to a given Time Series dataset. Thus shapelets are discriminative subsequences of the Time Series that best predict the target variable. We will see that shapelets are used for the classification task. Shapelets can provide interpretable results and can be more precise than other Time Series classifiers because they address local rather than global characteristics.

Shapelet-based Classification

So the idea is:

1) Represent a Time Series as a vector of distances from representative subsequences, called shapelets.

2) Use this transformed dataset as input to machine learning classifiers.

: Simple classifiers such as decision trees, KNN-based approaches and logistic regression achieve good performances, comparable to deep neural networks.
For example, we can compare two leaves. A shape can be converted into a one-dimensional time series representation. These representations are successfully used for classification, clustering and outlier detection on "shapes". Let's imagine that each leaf image is a time series obtained by taking the edge of the leaf and spreading it flat.

→ Shapelets are representative parts (subsequences) of the ts which are well suited for discovering the classes of the ts themselves. We can identify the red part on the ts which lets us distinguish urtica from verbena; this is captured by the parts in light blue.
VS the motif, instead, which was a pattern describing a single ts and had no relationship with the classes of the ts.

In the end I can represent a dataset through the distances of the time series from the shapelets; then, after this learning phase in which I calculate the distance between the shapelet of a certain leaf and the one identified for the two classes, I can build a decision tree based on the shapelets which discriminates between the two classes using a certain distance as split criterion (derived during the training phase).
Basically, what you do after finding a shapelet and recording its distance from the closest matching subsequence (relative to all the other objects in the database) is to build a simple decision-tree classifier, like the one in the figure.

Example: the question will be "Does Q have a subsequence within a distance of 5.1 from the shapelet?"
→ YES, then the leaf is of the verbena urticifolia type; NO, then the leaf is of the urtica dioica type.

Shapelet-based Classifier: procedure to extract k shapelets: I start with a TS dataset of dimensionality NxM (with N = number of time series and M = number of timestamps). I apply a procedure to extract a number k of shapelets. Then I pass the TS dataset and the shapelets to the shapelet transformer, so as to obtain a shapelet dataset of dimensionality NxK, where each cell is the distance of a ts from a certain shapelet.

E.g.: if I have two ts and two shapelets, I calculate the distance between shapelet 1 and each ts, and the same for shapelet 2, taking the best alignment that the shapelet can have with the TS.

I can then represent TS T1 by its distances from S1 and S2; if I had other time series, I would build a table like the one on the side. Knowing the classes of the ts, I can add them to the table (Y = A, B).

: it is clear that, given a ts, if the distance between the ts itself and shapelet 1 is small, then the ts is of type A, otherwise it is of type B.
: one advantage of shapelets is that they are accurate and robust.
: the number of shapelets is a parameter to choose; it depends on the length, the variety and the presence of noise in the TS. However K << M.
: this technique is not widely used for image recognition (in that case we prefer to use deep neural networks) because transforming images into TS is often not easy, due to the presence of the background. If we have simple images and we want to use only the outline, with a completely black or white background, as in our case with the leaves, it can be used efficiently.

: Given these two TS, T and S (with S shorter than T), how do you calculate the distance between them?
I align the two TS and calculate all the distances (in green) using either the Euclidean or the Manhattan distance; in this way I obtain the distance for that alignment. After that I move S forward by 1 timestamp and recalculate the distances. As could be imagined, the best alignment between the two time series will be the one called dK:
the distance between the two TS is dK = the minimum of all the distances found by sliding S on T.
: we therefore understand how important it is to have a normalized and scaled TS.
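A small sketch of this best-alignment distance between a time series T and a shorter subsequence S, using the Euclidean distance on each alignment (the Manhattan distance could be plugged in the same way); the function name is an assumption of ours.

```python
import numpy as np

def subsequence_dist(T, S):
    """Distance between a time series T and a (shorter) subsequence S:
    slide S over T, compute the distance of every alignment and keep the
    minimum, i.e. the best-matching position of S in T."""
    T, S = np.asarray(T, float), np.asarray(S, float)
    m = len(S)
    return min(np.sqrt(np.sum((T[i:i + m] - S) ** 2))   # Euclidean on each alignment
               for i in range(len(T) - m + 1))
```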

Dataset example:

here 4 time series relating to the monitoring of a heartbeat are represented. In each cell we have the heart rate value for one second; each second is represented by a column. The first column represents the first second, while the last column represents second 60x10. The class column instead represents the classes, where S stands for sitting, while R stands for running.

After that we need to turn it into a shapelet dataset, with one column for each shapelet. In the first cell we will have the distance between shapelet 1 and TS1.

→ The shapelet therefore represents a sort of approximation of the time series and has the property of identifying the different classes.

NB: we could have used the motifs as well, but they would not have given us any guarantee on the separation of the classes.

Extract the shapelets

in the literature there are many ways to extract shapelets. We will present only one, which is not even very efficient because it is based on a brute-force approach; it is not particularly used in practice.

- Let's imagine we have a time series dataset. I define a window of length m1 and slide it over TS1.
- I extract all the subsequences of length m1 by moving the window over TS1, so as to have a set of candidate subsequences, and I update the candidate pool. I do the same for TS2 and the other ts.
- I can decide to specify a new length m2 and slide it as well, again updating the candidate table.
- The more lengths I try, the more expensive the computation becomes, but the better the shapelets will be. It does not matter if the shapelets have different lengths.

- I test the utility of the candidate shapelets, using the concept of distance from a subsequence: the distance between a certain time series and a subsequence S is the function that returns a value d corresponding to the minimum distance between S and the subsequences of the same length as S in the time series.

- The distance from the time series to the subsequence, SubsequenceDist(T, S), is a distance function that takes the time series T and the subsequence S as input and returns a non-negative value d, which is the distance from T to S:

SubsequenceDist(T, S) = min(Dist(S, S')), for S' ∈ S_T^|S|, where S_T^|S| is the set of all possible subsequences of T of length |S|.

Intuitively, it is the distance between S and its best matching position in T; I find it by sliding S and calculating the various distances.

The figure illustrates the best matching position in the time series T for the subsequence S.

Test the utility of candidate shapelets → it is necessary to estimate how useful the candidate subsequences are to discriminate between the two classes. Given a certain candidate subsequence (candidate shapelet), I order the TS dataset according to the distance of each TS from the candidate (using the subsequence distance seen above).
I compare the candidate with the different time series (represented here in red above the leaves). From the previous step I know the distances between each TS and the candidate; for example we will have distances of 0, 2, 4, 5, 7, 8, … I can then sort all the ts by these distances.

Afterwards, I can also consider the class of each time series; for a certain candidate subsequence I get the situation in the figure.

The squares and circles indicate the class each TS belongs to; the position of a TS on the line is its distance from the candidate shapelet. What I want to find is the optimal split point between the two classes, the one that maximizes the information gain. This step is the usual one performed by the decision tree classifier.
Finally, after having found the split point and calculated the information gain of each candidate, I choose as shapelet the candidate with the highest information gain. How do we calculate this information gain? Through the entropy.

ENTROPY: a Time Series dataset D consists of two classes, A and B. If the proportion of objects in class A is p(A) and the proportion of objects in class B is p(B), the entropy of D is:
I(D) = -p(A) log(p(A)) - p(B) log(p(B)).
Given a strategy that splits D into two subsets D1 and D2, the information remaining in the dataset after the split is defined by the weighted average entropy of each subset.
If the fraction of objects in D1 is f(D1) and in D2 is f(D2), the total entropy of D after the division is:
Î(D) = f(D1) I(D1) + f(D2) I(D2)
After that we calculate the information gain, in the same way as it is done in the decision tree, as a difference:

INFORMATION GAIN: given a certain split strategy sp, which divides D into two subsets D1 and D2, the entropy before and after the division is I(D) and Î(D). The information gain for this splitting rule is:
Gain(sp) = I(D) - Î(D) = I(D) - (f(D1) I(D1) + f(D2) I(D2))

: After that I can evaluate this measure to partition the dataset with respect to all candidate shapelets, because as splitting rule (sp) we usually use the distance from T to a shapelet S. To find the best shapelet (the one that splits best) we may have to test many candidates. We take the shapelet with the greatest information gain and use it to find the best splitting point, continuing as is done in the decision tree for attributes.
For example, in the brute-force algorithm we sort the objects by distance and find an optimal split point between two nearby distances. Since a shapelet is simply a time series of some length less than or equal to the length of the shortest time series in our dataset, there is an infinite number of possible shapes it could have; in practice, the candidates can be very many.
: This process is reliable in finding the best split; the pitfall lies in the number of candidate shapelets we have to look through. With the information gain definition, we are guaranteed that the shapelet with the highest information gain is the best way to separate the TS (according to the IG definition, not in an absolute sense).

Example:

let's imagine we have a candidate shapelet and five time series belonging to three different classes: A, B, C. I calculate the distance between the shapelet and each time series, obtaining for example the distances 0.3, 9.4, 0.8, 1, 0.5, 0.6: these are the distances between the candidate shapelet S1 and the TSs of the dataset.
Now we order the distances in increasing order, keeping track of the class of each TS.
I then find the different splitting points SP and calculate the information gain for each one. Suppose the gains are: SP1 = 0.5, SP2 = 0.7, SP3 = 0.8.
I repeat the same procedure for the other candidate shapelets with respect to the TS dataset.

→ We then select the best split point, given by the maximum value of the information gain, and take the top-k candidate shapelets. Supposing the Top-1 is this one, we select SP3 because it has an information gain of 0.8.

26/04

Recap Shapelets: example

We have a dataset of ts with classes A and B.


We want to run a classification
algorithm for these ts by doing a shapelets
discovery, so we want to base our
classification algorithm on the
shapelets.
What is a shapelet? It is a subpart of a ts
that is useful for recognizing a
class. In this case we can have shapelets
like the ones in red for example.
What we want to do is identify the set S
of shapelets and transform the ts
dataset into a dataset with distances from S.

Now we need to extract the set S of shapelets


so that it has a high discriminative power, so for example the shapelet S1 is discriminative for the class A.
We need to remember that when we talk about the distance between the ts and the subsequence
we mean the minimum distance between the ts and the shapelet which is the best
alignment So:

The distance from the ts to the subsequence, SubsequenceDist(T, S), is a distance function that takes the ts T and the subsequence S as input and returns a non-negative value d, which is the distance between T and S:
SubsequenceDist(T, S) = min(Dist(S, S')), for S' ∈ S_T^|S| (S_T^|S| is the set of all possible subsequences of T of length |S|).
Intuitively, this is the distance between S and the best-fit position in T.

At this point a length m is decided and from each ts the possible candidate shapelets are extracted. When you have the candidates, you need to select the best shapelets to implement the transformation mentioned above. For example, wanting k = 3 means I want the 3 best shapelets. What do we do? Consider the first candidate and put the distances between each ts and the candidate on an axis, find the best split point and separate the dataset. Calculate the various Information Gains considering all the pairs of consecutive distances, then repeat the process with all the other candidates; finally, take the top-k shapelets with respect to the IG.

Problem: extracting shapelets is quite expensive, because if we establish a minimum and a maximum length for the shapelet size then the total number of candidates for a dataset D is

Σ_{l=min}^{max} Σ_{Ti ∈ D} (|Ti| − l + 1)

(l is the m used before; |Ti| is the length of the ts).

For each candidate, the distance between that candidate and each instance in the training set must be computed.
Example: 200 instances of length 275 → 7,480,200 shapelet candidates. We need to calculate the distances between this number of candidates and every instance, each comparison sliding along a series of length 275.

Speedups:
Since calculating distances from TSs to SC is expensive there are ways to reduce the time: • Distance
Early Abandon: reduces the distance calculation time between two TSs • Admissible
Entropy Pruning: reduces the number of distance calculations

Distance Early Abandon


We only need the minimum distance between the TS and the candidate shapelet S, because what we want to find is the best-matching location.
The method is to keep track of the best distance obtained so far, and to stop the calculation of an alignment as soon as its partial distance becomes greater than the best so far.
In the example, moving the candidate forward by 1 timestamp and repeating the calculation, at the fifth alignment one realizes that the distance is already > 0.4 (which is the best alignment obtained), so it is useless to continue: the distance can only increase and stay > 0.4, so you move on to the next alignment. If at the end of an alignment the distance is less than 0.4, the minimum distance found so far is updated and you go on. Finally, the distance obtained is returned as the minimum distance.
So we compare a long TS with a very small one looking at each timestamp, and if during the calculation we realize that the partial distance is already greater than the best one found at a certain point, we stop and start again from the next alignment.


Orange curve: shapelet of length 3.
The distances of the alignments with the shapelet are calculated; as you can see, t3 becomes the first minimum, then t2, then t1 and then t0. We see that t4 starts with 0.5: since it is already greater than our minimum 0.02, we stop. The same happens with t5, which starts with 1.

→ The calculation of the other alignments is therefore stopped as soon as the current minimum is exceeded.
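A sketch of the early-abandoning version of the best-alignment distance: the partial sum of squared differences of an alignment is abandoned as soon as it exceeds the best (squared) distance found so far, because it can only keep growing. Function name and details are illustrative.

```python
import numpy as np

def subsequence_dist_early_abandon(T, S):
    """Best-alignment distance between T and the shorter S, with early abandon."""
    T, S = np.asarray(T, float), np.asarray(S, float)
    m, best_sq = len(S), np.inf
    for i in range(len(T) - m + 1):
        acc = 0.0
        for j in range(m):
            acc += (T[i + j] - S[j]) ** 2
            if acc >= best_sq:          # early abandon: this alignment cannot win
                break
        else:
            best_sq = acc               # alignment completed -> new best so far
    return np.sqrt(best_sq)
```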

Admissible Entropy Pruning

We only need the best shapelet for each class.
Procedure: assume we keep k = 1 shapelet, so only the best one (binary classification problem), and we know the information gain of the current best candidate, e.g. 0.42. When we have a new candidate to test, we compute the distances only on a sample of the training set.

For example, we have 100 instances and we calculate only 5 distances.
Now we identify the best possible scenario, which in the example is the division with the blue line, the one that puts all the instances of class 1 on one side and all the others on the other, and we calculate its Information Gain. Since the IG is maximal when there is the best separation, if even with this optimistic simulation the IG of the best candidate so far is not reached, it means that the IG obtainable with this candidate cannot be the best.
Then we stop the computation and test the next candidate, and so on → a way to reduce the complexity, this time avoiding a large part of the distance calculations, and you are guaranteed not to miss anything.

Obviously, the larger the sample of training instances used to calculate the distances, the more accurate the estimate will be, but this will be the only portion of data for which distances need to be calculated, and early abandon can be used to optimize these distances.

Shapelet Summary:

1) Extract all possible subsequences from a given set of the dataset (candidate shapelets).

2) For each candidate shapelet:
   a. calculate the distance with each time series, taking the minimum distance (best alignment);
   b. evaluate the discriminative power of the shapelet using the Information Gain.

3) Keep the best k shapelets, those with the highest IG.

4) Transform the ts data into a dataset of distances from the shapelets to train an ML model (see the sketch below).
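As referenced in step 4, a minimal sketch of the whole shapelet-transform pipeline, assuming the set of shapelets has already been selected with the information-gain procedure above; scikit-learn's DecisionTreeClassifier appears only as an example of downstream ML model, and the variable names in the usage comments are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def subsequence_dist(T, S):
    """Best-alignment (minimum sliding) Euclidean distance of S over T."""
    m = len(S)
    return min(np.linalg.norm(np.asarray(T[i:i + m], float) - np.asarray(S, float))
               for i in range(len(T) - m + 1))

def shapelet_transform(X, shapelets):
    """TS dataset (N x M) -> distance dataset (N x K): cell (i, k) is the
    best-alignment distance of time series i from shapelet k."""
    return np.array([[subsequence_dist(ts, s) for s in shapelets] for ts in X])

# Hypothetical usage (shapelets = top-k subsequences chosen by information gain):
# Z_train = shapelet_transform(X_train, shapelets)
# clf = DecisionTreeClassifier().fit(Z_train, y_train)
# preds = clf.predict(shapelet_transform(X_test, shapelets))
```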

An alternative way to extract the shapelets: the idea is to solve an optimization problem to find the shapelets. The radical difference is that in this case the shapelets are not taken from the existing subsequences of the ts: a shapelet is initialized randomly and after each iteration it is modified and improved to minimize an objective function; for this reason it is an optimization problem (remember the structure of a Neural Network, you don't need to know it in detail).
- The minimum distance (M) between the TSs and the shapelets can be used as a predictive factor to approximate the TS labels (Y) using a linear model (W).
- The logistic regression loss can measure the quality of the prediction.
- The goal is to minimize a loss function across all instances (I).
- We can find the optimal shapelets for the objective function via a NN-style method, updating the shapelets in the direction that decreases the objective, i.e. following its gradient. Similarly, the weights can be updated jointly by minimizing the objective function.

Difference between motif and shapelet? (typical exam question)

A motif is a pattern/subsequence repeated in a single ts and the


extraction of a motif has no relation to a class

A shapelet is a pattern/subsequence that is most


representative of a class with respect to a dataset of TSs

04/29/2021

SEQUENTIAL PATTERN MINING

It's about extracting patterns on sequential data considering time, so essentially we're going from
time series to transactional sequences.

An example of a customer's purchase sequence in an online store: <{Digital Camera, iPad} {memory
card} {headphone, iPad cover}>; items between brackets are bought together.

Frequent itemsets and basic transactions do not contain the notion of time, whereas in sequential pattern
mining we have to consider it. In the apriori algorithm we only considered the frequency of that pattern, for
example if a customer buys chicken and fish 5 times and our threshold is 5 we know that the pair <chicken,
fish> is a frequent itemset. Now let's consider sequences, ie the purchase of chicken and meat is followed
by the purchase of tomatoes which in turn is followed by the purchase of fish. This is not a rule, but a
sequence of events. We then check which sequences appear frequently.

A sequence is a collection of transactions (elements); an element (transaction) is a set of events or items. So I go to the supermarket several times and each time I make a transaction (a purchase) of a set of items that I buy all at the same time.

This type of data can be represented in many ways



Object identifies the customer sequence. Customer A at the first purchase buys items 2,3 and 5 at time 10.

In general, instead of the actual timestamp, to simplify, we will use a sequence of numbers starting from 1, to indicate whether it was the first, second or third purchase for that customer, etc., assuming that each purchase takes place at a different time than the one before. So we are not considering time as a continuous variable in our models, but we are discretizing it by representing the events simply one after the other: I don't care whether the next purchase happens after two hours, two weeks or two months, it will always have timestamp 2.

A sequence is an ordered list of elements called transactions S=<e1,e2,e3...>

Each transaction is assigned a specific time, so the first element is at time 1, the second at time 2 and
so on. Each transaction contains a set of items e(i)={i1, i2,…, ik}.

The length of a sequence |s| it is given by the number of elements (transactions) within it.

A k-sequence is a sequence that contains k items.

When we have only one item in the transaction we call it a singleton, when we have more than one we call
it a complex.

A sequence <a1 a2 … an> is contained in another sequence <b1 b2 … bm> (with m >= n) if there exist indices i1 < i2 < … < in such that transaction a1 is a subset of the transaction of b with index i1, transaction a2 is a subset of the transaction of b with index i2, and so on.

So in this set we have that the first transaction of the first sequence is a subset of the second transaction
of the second sequence (ie {A} is a subset of {A,C}); we have that the second transaction of the first
sequence is a subset of the third transaction of the second sequence ({B,C} is a subset of {A,B,C}); finally
{D} is a subset of {D}, so the first sequence is a subsequence of the second and in this case i1=1, i2=2
and i3=5.

Given a sequence, a subsequence of it can have multiple occurrences; these occurrences are
represented by the indexes of the transactions of which the transactions of the subsequence are subsets.

Occurrences of the subsequence <{B}{E}> can be <1,2>, <1,6> or <3,6> because {B} can be a subset of
{B,F} or of {A,B }; we don't consider {B,E} because there are no transactions containing {E} after it.

The support of a subsequence w is the fraction of input sequences that contain the subsequence w.

w is a frequent subsequence if its support is not lower than the minsup threshold.

Recall that even if a subsequence can be mapped multiple times onto a given sequence, that sequence still counts only once: when we calculate the support we just need to check whether the sequence contains at least one occurrence of the subsequence.
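A small sketch of the containment check and of the support computation just described; sequences are represented as lists of itemsets (Python sets), the greedy left-to-right matching is enough to decide containment, and the function names are ours.

```python
def is_contained(sub, seq):
    """True if the sequence `sub` (list of itemsets) is contained in `seq`:
    each transaction of `sub` must be a subset of a distinct transaction of
    `seq`, in the same order."""
    i = 0
    for transaction in seq:
        if i < len(sub) and set(sub[i]) <= set(transaction):
            i += 1
    return i == len(sub)

def support(sub, database):
    """Fraction of input sequences that contain `sub` (each sequence counts
    at most once, regardless of how many occurrences it has)."""
    return sum(is_contained(sub, seq) for seq in database) / len(database)

# e.g. is <{A}{B,C}{D}> contained in <{A,C}{A,B,C}{A}{D}> ?
print(is_contained([{'A'}, {'B', 'C'}, {'D'}],
                   [{'A', 'C'}, {'A', 'B', 'C'}, {'A'}, {'D'}]))   # True
```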

Sequential pattern mining

Given a sequence database and a user-given minsup, the idea is to find all subsequences with support >=
minsup, so it's roughly the same definition as DM1's frequent pattern mining.

The trivial approach is to generate all possible k-subsequences and compute the support, but it is
impossible because we have a combinatorial explosion.

Even if we generate subsequences from a given input sequence the number of k-subsequences to extract is
too high.

GSP

To effectively run sequential pattern mining there is a first algorithm which is the basic one, namely the
Generalized Sequential Pattern (GSP), which however is not widely used because there are more efficient
implementations.

The idea of the GSP is the same as apriori, that is, it starts from the short patterns and then looks for the
longer ones at each iteration so that they respect the support threshold. Also this algorithm is based on
the anti-monotonicity principle of the support, so if the sequence S1 is contained in the sequence S2 then the
support of S2 cannot be greater than the support of S1.

An intuitive proof of this is that any input sequence that contains S2 will also contain S1.

At the first step the algorithm scans the dataset and takes the sequences with an element (a transaction, it
can be <{i1}> but also < {i1,i2}>);

step 2 is repeated until frequent new sequences are discovered and consists of:

-Candidate generation, i.e. generate the candidates by joining two pairs of frequent subsequences
found previously and generate the candidate sequences that contain k-item

-Candidate pruning, i.e. remove candidates that contain (k-1)-subsequences that are infrequent

-Support Counting for newly generated sequences

-Candidate Elimination of sequences that do not satisfy the minsup

When we consider for example the 2-subsequences, these can be formed either by a single transaction with
two items, or by two transactions with a single item.

To run this algorithm it is important to establish an order of the items within the transactions, and therefore
by establishing an order through timestamps we give an order to how the algorithm works.

Candidate generation

The base case is when I join two frequent 1-sequences to produce two new candidate 2-sequences, i.e. <{i1}{i2}
> and <{i1,i2}>. We also need to remember the special case, i.e. the case where we join the same item with itself
<{i1}{i1}> .

In the general case I know all (k-1)-frequent sequences and a (k-1)-frequent sequence w1 is combined with
another w2 to produce a k-sequence if the subsequence obtained by removing the first item in w1 is the same as
that obtained by removing the last item in w2. The result of this operation is obtained by taking w1 and extending it with the last item of w2 (either appended to the last element of w1 or added as a new element, depending on how it appears in w2). There are 3 possible cases:

Here too we must remember the special case in which we have <{a} {a}> + <{a} {a}> = <{a} {a} {a}>
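A sketch of this general join step (itemsets represented as sorted tuples); the helper names are our own, and the base case for joining 1-sequences is not handled here.

```python
def gsp_join(w1, w2):
    """GSP candidate generation (sketch): merge two frequent (k-1)-sequences
    w1, w2 (lists of tuples of items, items kept sorted) when dropping the
    first item of w1 gives the same sequence as dropping the last item of w2."""
    def drop_first(s):
        head = s[0][1:]
        return ([head] if head else []) + list(s[1:])
    def drop_last(s):
        tail = s[-1][:-1]
        return list(s[:-1]) + ([tail] if tail else [])
    if drop_first(w1) != drop_last(w2):
        return None                               # the two sequences cannot be joined
    last_item = w2[-1][-1]
    if len(w2[-1]) == 1:
        # the last item of w2 was alone in its element -> becomes a new element
        return list(w1) + [w2[-1]]
    # otherwise it is appended to the last element of w1
    return list(w1[:-1]) + [w1[-1] + (last_item,)]

print(gsp_join([(1,), (2, 3)], [(2, 3), (4,)]))   # <{1}{2,3}> + <{2,3}{4}> -> <{1}{2,3}{4}>
print(gsp_join([(1,), (2, 3)], [(2, 3, 4)]))      # <{1}{2,3}> + <{2,3,4}>  -> <{1}{2,3,4}>
```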

Candidate pruning

It is based on the a priori principle: if a k-sequence W contains a (k-1)-subsequence which is not frequent, then
W is not frequent and can be pruned.

Given a candidate sequence, we enumerate all its (k-1)-subsequences by deleting one item from the sequence; therefore for a sequence of length k we have k (k-1)-subsequences. We already know that the subsequences obtained by removing the first item and by removing the last item are certainly frequent, because they are the ones that led to the generation of the candidate, so the subsequences actually to be checked are k-2, and we check whether they were generated in the previous iteration. If they were not, the candidate k-sequence in question can be pruned.

In the case where k=2 pruning has no use because they are certainly generated by frequent
subsequences.

Lesson 03/05

TIMING CONSTRAINTS

• Timing constraints are very useful because it might not be very interesting to find a sequential pattern of two transactions that occurred six months apart. E.g.: Sequential Pattern {milk} → {cookies}.
If the purchase of the cookies happened six months after the purchase of the milk, it would certainly not be useful information.
• Too short a time is also uninteresting. If we have {cheese A} → {cheese B}, we may have a purchase of cheese A and cheese B 20 minutes apart, but this may just be because we forgot to get cheese B together with cheese A.

• It depends a lot on the total time span of the process:

We have 3 types of time constraints:

1. xg: max-gap → each element of the pattern instance must occur at most xg time after the previous one.
2. ng: min-gap → each element of the pattern instance must occur at least ng time after the previous one. It allows us to discard pairs of transactions whose time interval is too short to be interesting.
3. ms: maximum span → the overall duration of the pattern instance must be at most ms.

Example:

In this case we say that the gap between A and C <= xg. This allows us to avoid considering pattern
instances that occurred at very distant points in time.

Exercise:

xg: 2, ng:0, ms:4


Consecutive elements are at most at distance 2 and the overall duration is at most 4 time units.
Is the first subsequence <{6},{5}> contained in the first sequence? Does the subsequence also comply with the constraints?

- <{2,4},{3,5,6},{4,7},{4,5},{8}>
- <{occurs at time 0}, {occurs at time 1}, {occurs at time 2}, {occurs at time 3}, {occurs at time 4}>
- The gap between transaction {3,5,6} and transaction {4,5} is 3-1 = 2. Now we need to check whether 2 <= xg, i.e. 2 <= 2 in this case.
- So: it is contained.

- Let's move on to the second subsequence.
- We see that the distance between elements 1 and 4 is 3-0 = 3.
- In this case 3 is not <= xg.
- The second subsequence is not contained.

The final solution will be:

Mining Sequential Patterns with Timing Constraints

Approach 1:

- Extraction of sequential models without time constraints


- Post process detected patterns
- Dangerous: It could generate billions of sequential patterns to get only a few ones with
time constraints

Approach 2:

- Modifying the GSP algorithm to directly eliminate candidates that violate time constraints.
- Question: is the Apriori principle still valid? The answer is both yes and no: the apriori principle is satisfied under certain constraints and is not satisfied under others.
principle is satisfied under certain constraints and is not satisfied under other constraints

Apriori Principle with Time Constraints

Case1: max-span

- As you can see we have the two input sequences S1 and S2.
- The span for S2 is 4 (4-0=4) and for S1 (1-0=1) it is 1.
- When S1 has fewer elements, S1 span can only decrease: if S2 span is OK, then S1 span is OK.
- Since S1 is contained in S2, we know that supp(S1) >= supp(S2) and that the span of S1 <= the span of S2. So if the max-span constraint is respected by S2 it is also respected by S1, therefore the apriori principle is respected.

Case2: min-gap

- In this case the gap of S1 is given by: position of D - position of A, i.e. 4-0 = 4.
- While for S2 we have to check two gaps, one between each pair of consecutive transactions: 4-1 = 3 and 1-0 = 1.

- The min-gap constraint requires the gap between two consecutive transactions of the subsequence to be greater than the min-gap.
- Therefore, if the constraint is respected by S2 then it is also respected by S1: S1 has fewer elements and, by construction, each gap of the smaller subsequence is at least as large as the corresponding gap of the larger sequence.

Case3: max-gap

-
The most unfortunate case is this.

- In this case, when S1 has fewer elements than S2, the gaps can only increase: S1 may skip some internal elements that are contained in S2. For instance, if the max-gap is 3, the gaps of S2 (1 and 3) respect it, but the gap of S1 (4) does not.
- This is because, if S1 has fewer elements, a gap in S1 can only be larger than the corresponding gaps in S2. Even if the gaps of S2 respect the time constraint, a gap of S1 can violate it, so the apriori principle does not hold for the max-gap constraint.

Example:
Apriori principle for Sequence Data

The table is divided according to A,B,C,D,E.


Max-gap, min-gap, maximum span and minsup are the constraints, even if in this case maximum span = 5 is useless because
there are no sequences that have more than 3 transactions.

- Consider the transaction/subsequences <{2}{5}>:


We select the transactions underlined in blue in the table.

However, since we have max-gap = 1, we cannot consider the selected events, because for object A the difference between the corresponding timestamps (3-1 = 2) is greater than 1.

- So let's choose different occurrences:

Now we no longer consider those highlighted in blue, but the purple ones. The support is therefore 40%, because it is given by 2 out of 5 objects. The occurrences under (D) do not respect the max-gap either, so we eliminate them.

- Now consider <{2}{3}{5}>:

All the occurrences highlighted in green respect the maximum-gap constraint, therefore the support is 60%, because we have 3 out of 5 objects.

- We note that the sequence S1 is a subsequence of S2:

In this case S1 does not respect the minsup, while S2 respects it even though it is longer. This could never happen with the plain apriori principle: if a sequence of length k respects the minsup, then its subsequences of length k-1 must respect it too, because the support decreases monotonically with the length (here instead, because of the max-gap constraint, the support increases as the sequence gets longer).

- To solve this problem we have to refer to the notion of Contiguous Subsequences.

Contiguous Subsequences
Definition: s is a contiguous subsequence of a given sequence w = <e1><e2>…<ek> if any of the following conditions is true:

1) s is obtained from w by deleting an item from e1 or from ek (this is the condition we are mostly interested in, to avoid internal jumps; the others we can ignore);
2) s is obtained from w by deleting an item from an element ei which contains at least 2 items;
3) s is a contiguous subsequence of s' and s' is a contiguous subsequence of w (recursive definition).

Examples: s = <{1}{2}>

- s is a contiguous subsequence of <{1}{2 3}>, <{1 2}{2}{3}>, <{3 4}{1 2}{2 3}{4}>, because we can obtain it by removing items as described above;

- s is not contiguous subsequence of <{1}{3}{2}> and <{2}{1}{3}{2}>

Modified candidate pruning step


Via contiguous subsequences, the pruning method changes like this:

- From: without the max-gap constraint, a candidate k-sequence is pruned if at least one of its (k-1)-subsequences is infrequent.
- To: with the max-gap constraint, a candidate k-sequence is pruned if at least one of its contiguous (k-1)-subsequences is infrequent. Non-contiguous subsequences are not used for pruning, because with the max-gap they may be infrequent even when the candidate is frequent.

So pruning power is reduced, because there are fewer subsequences to consider since only contiguous ones are taken.

Sequential Pattern - Exercise 1 (the


same as last time, but with the time constraint mingap > 1)

The occurrence <1,2> is eliminated because it does not respect the constraint min-gap > 1: we require a gap > 1, but 2-1 = 1, therefore it does not respect the constraint.

The same goes for <0,1,2> and <0,1,6>: in this case we always have 1-0 = 1.

We eliminate the occurrences <3,4> and <6,7> because they do not respect min-gap > 1.

In the last one we eliminate all the occurrences, again because of the min-gap constraint. So the sequence w3 is not contained in the initial sequence if we apply the constraint min-gap > 1.

Sequential Patterns - Exercise 2



(with the time constraint max-gap = 4; the steps are not shown, but it is enough to discard the occurrences that have a gap greater than 4, as for the occurrence <0,5> in the first line). That is, just do for example:
- <0,4> → 4-0 = 4 (we keep it)
- <0,5> → 5-0 = 5 (we discard it)
- etc.

Sequential Pattern - Exercise 3


In this exercise there are only two small patterns: we have to calculate the occurrences and the support. When calculating the support we don't have to count how many times the pattern appears inside a sequence, it only matters whether it appears at least once; in this case {A} → {D} appears in 5 sequences, while {B} → {C,D} appears in 4.

GSP - Exercise 1

If the professor asks us to simulate the GSP he will not ask us to consider the constraints because it would
become too complicated even with small datasets. We just need to remember that the maxgap does not satisfy
the a priori principle of the algorithm.

Let's start with single transactions and remove the ones that don't meet the support threshold:

A=¾
etc..

We build the ones for the next iteration and remove the ones that don't respect support. In this case it is not possible to do
the pruning step.

Let's build those of length 3:


We need to highlight those removed by pruning.

We need to check A → D:

A → D does not respect the support, so we remove it, while the last 3 are removed because their support is less than 60%.

So the output is all those that haven't been deleted.

GSP - Exercise 2

05/05/21

ADVANCED CLUSTERING
For the last part of this module there is no real reference to the book, but at the end of the slides there will be some papers. We will not go into too much detail: we present these advanced clustering methods only to know and understand that there are others, beyond those already studied in DM1.

Bisecting K-MEANS:
The bisecting K-means algorithm is a simple extension of the K-means algorithm based on the following idea: to obtain a number of clusters equal to K, the set of all points is split into two clusters, one of these clusters is selected and split further, and so on, until K clusters are produced.
- It is a variant of K-means which produces a hierarchical clustering.
- The K must be specified.

: in particular, we start from a single cluster and split it into two partitions (a 2-means); the two clusters obtained from the split are added to the list of clusters. The cluster chosen for the next split is then removed from the list and bisected in its turn, while the "good" clusters, i.e. those with the lowest SSE, remain in the list untouched. This procedure is repeated until the number K specified by the user is reached.
: There are several ways to choose the cluster to split: we can choose the largest at each step, or the one with the largest SSE, or a criterion based on both size and SSE. Different choices will generate different clusterings.

Practically, we are using the K-means algorithm locally, i.e. to perform the bisection of single clusters. Thus, the final set of clusters does not necessarily represent a clustering that is a local minimum with respect to the total SSE.
Advantages: if I track the splitting of the various clusters, I can obtain a hierarchy. It works with areas of different density: the best clusters, which are more compact and have lower SSE, are not split further, while the sparser areas end up split into many small clusters.
Limits: the K must be specified by the user; if it is not, i.e. it is not known or has not been specified, the algorithm ends up exhaustively in singleton clusters, which takes a long time, and singleton clusters are meaningless (over-splitting).

: Intermediate clusters are more likely to correspond to real classes.

→ A crucial difference between bisecting K-means and hierarchical clustering is that the latter follows a bottom-up strategy: it starts from clusters containing a single instance and merges them step by step. Bisecting K-means instead starts from a single large cluster and at each iteration a cluster is split.
: However, the algorithm has no criterion for stopping the bisection before singleton clusters are reached.
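A minimal sketch of bisecting K-means, using scikit-learn's KMeans for the local 2-means and choosing the cluster with the largest SSE for splitting (one of the criteria mentioned above); the function name and the number of local trials are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, K, n_trials=5):
    """Keep a list of clusters (as index arrays); repeatedly pick the one with
    the largest SSE, bisect it with a local 2-means, and add the two children,
    until the list contains K clusters."""
    X = np.asarray(X, dtype=float)
    clusters = [np.arange(len(X))]                      # start from one big cluster
    def sse(idx):
        c = X[idx].mean(axis=0)
        return float(((X[idx] - c) ** 2).sum())
    while len(clusters) < K:
        worst = max(range(len(clusters)), key=lambda i: sse(clusters[i]))
        idx = clusters.pop(worst)                       # remove it and bisect it
        best_labels, best_score = None, np.inf
        for _ in range(n_trials):                       # keep the best of several 2-means runs
            km = KMeans(n_clusters=2, n_init=1).fit(X[idx])
            if km.inertia_ < best_score:
                best_score, best_labels = km.inertia_, km.labels_
        clusters += [idx[best_labels == 0], idx[best_labels == 1]]
    return clusters
```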

X-MEANS:
The X-means algorithm is also an extension of the K-means algorithm; here you are not asked to specify the exact number of clusters K, and for this reason it is necessary to introduce a stop criterion called BIC, the Bayesian Information Criterion.

The Bayesian Information Criterion is a strategy that serves to stop the bisection procedure when significant clusters are reached, in order to avoid over-splitting, i.e. further splits that would be useless. The BIC can be adopted as a splitting criterion to decide whether a cluster should be split or not: it measures the improvement in cluster structure between a cluster and its children. If the parent's BIC is less than the children's BIC, then we accept the bisection.

: The procedure is as follows: the BIC of the parent cluster is measured and the value is recorded; we do the same with the child clusters, obtained by a 2-means division of the parent. If the value of the children is greater than that of the parent, we accept the split; if instead it is less, the procedure stops. → The over-splitting phenomenon is thus stopped and singletons are avoided.

The X-Means algorithm therefore starts with a K equal to the lower limit r1 of the specified interval [r1, rmax] (rmax is the maximum number of clusters to test) and continues to add centroids where necessary, until the upper limit is reached. → The appropriate value of K is sought in the given range [r1, rmax]. During this process the set of centroids that achieves the best score is recorded, and this is what will be shown in the output.

The algorithm consists of the following 2 steps, repeated until completion:

1) Improve-Params: performs the K-means with the current value of K. Let's imagine we have K = 3 and a situation like the one in the figure.

2) Improve-Structure: recursively split each cluster into two and use a local BIC to decide the split. We take cluster A and split it, creating the children A1 and A2, and calculate their BIC. The current configuration is recorded with a global BIC calculated on the entire configuration. We then look at the local BIC: we stop splitting if the BIC of the children is less than that of the parent; if the local BIC is not respected or if k > rmax, the algorithm stops and returns the model with the best score, otherwise it repeats the operations.

Let's refer to our example in red: we start from a situation in which we have three clusters A, B and C. We start by bisecting cluster A, so as to create the two child clusters A1 and A2. We now check the BIC of the children (by BIC, for now, we mean a quality measure of a set of clusters): if the children have a lower BIC than the parent's we stop, if instead it is higher we continue with the split. Suppose we are in this second case and we split A2 into A21 and A22, as shown in the figure. The same is done for B and C. However, imagine that the bisecting procedure stops immediately for both cluster B and cluster C due to a low BIC; what I end up with are 5 clusters: B, C, A1, A21, A22.

→ The idea of X-means is therefore to start by running K-means with the current value of K, then refine the clusters with a bisecting K-means which stops either by evaluating the local BIC or when the total number of clusters k exceeds the upper extreme of the chosen interval (that is, if k > rmax it stops and returns the model with the best score, otherwise it repeats the operations). Afterwards, once the 5 clusters have been obtained, the global BIC score is calculated (in our case for B, C, A1, A21, A22) to decide which K to return in output. It is called global because it considers all the clusters, which cover the entire dataset. At the end the algorithm returns the configuration with the largest global BIC, corresponding to a certain K contained in the interval [r1, rmax].

EXAMPLE X-Means:

1) We perform the K-means with K = 3 and we obtain this configuration

2) We split each centroid into 2 children, which are moved apart by a distance proportional to the size of the region, in opposite directions along a randomly chosen vector.

3) We perform the 2-means locally in each region for each pair of children. It is local in the sense that the children compete with each other only for the points in the parent's region, and no others.

4) We compare the BIC of the parent with that of the children: only the centroids with the highest BIC survive. In this example the algorithm returns 4 clusters.

BIC Formula in X-Means


the BIC value of a model Mj for a data collection D was defined by Kass and Wasserman as:

BIC(Mj) = l̂j(D) − (pj/2)·log R

: The BIC approximates the probability that the clustering in Mj describes the real clusters in the data.

: the BIC is a score; the higher, the better. It is calculated considering the likelihood of the dataset D: l̂j(D) is the log-likelihood of the dataset D according to the j-th model, and pj is the number of parameters in Mj. It is also called the Schwarz criterion. R is the number of points, while M is the number of dimensions (but if it has a subscript, i.e. it is in the form Mj, it indicates the j-th model).
The likelihood is therefore the probability that the data are "explained" by the clusters, according to the spherical Gaussian assumption of K-means. Focusing on the set Dn of points belonging to a centroid and linking the maximum likelihood estimates to the estimate of how close the points of the cluster are to the centroid, the likelihood is calculated as follows:

RECAP: to understand the X-means it is necessary to know the K-means and to understand how bisecting k means works,
remembering that the latter does not have a stop criterion unless we indicate a precise K. In X means we must then introduce the
BIC concept, the Bayesian Information Criterion which is calculated twice: - 1) locally, to understand if the structure can be
increased by adding splits or not;
- 2) globally, to evaluate the total clustering.

K-means origins: Expectation-Maximization

Model-based Clustering (probabilistic)


In DM1, when dealing with clustering algorithms, it was assumed that each point belonged completely to one cluster, making a HARD assignment; the idea here is that clustering can also be probabilistic, i.e. admitting that an instance is part of a certain cluster with a certain probability.
We will now see a generalization, the Expectation-Maximization procedure, which reveals the origin of K-means.

• To understand our data, we assume that there is an unknown generative process (a model) that creates/describes the data, and we will try to find the model that best fits the data itself. So we would like to approximate how the data is generated.
- It is possible to define models of different complexity, but it is assumed that the model is a distribution from which the data points are sampled.
- Example: the data are the heights of all the people in Greece.

• In most cases a single distribution is not sufficient to describe all the data points: different subsets of the data follow different distributions.
- Example: the data are the heights of all the people in Greece and China; this means that they belong to different clusters and that each population is described by its own model. In these cases the distribution of the heights of the Greeks alone is not enough: we need a mixture model,
Machine Translated by Google

since we are talking about two different distributions corresponding to different clusters in the data so that the two different
populations can be better represented.

Our goals are to find which cluster each data point belongs to, but also to find the best parameters to describe the distributions. The question is: how are mixture models extracted? The answer is: through the Expectation-Maximization algorithm.
EM (Expectation-Maximization) algorithm:

Initialize the parameter values in θ (theta, the parameter vector) with some random values.
Repeat until the procedure converges:
- E-Step: given the parameters θ, estimate the membership probabilities P(Gj|xi);
- M-Step: given the membership probabilities, calculate the values of the parameters θ that maximize the likelihood of the data.
Example:
- E-step: estimate the probability with which a point belongs to a certain distribution. Recalling the example of the height of people in
Greece and China, what is the probability that some point is part of the Chinese distribution rather than the Greek distribution? After
defining this membership probability, I move on to the next step: - M-Step: adjust the probability calculated in the E-step and calculate
the parameters ÿ that maximize the date

likelihood.

The EM algorithm is a generalization of K-means. Indeed, the K-means algorithm for Euclidean data is a special case of the EM
algorithm for spherical Gaussian distributions with equal covariance matrices, but different means

The Expectation Step: Assign points to a cluster.


- K-means: corresponds to the step where each object is assigned to a cluster → HARD assignment.
- Exp/Max: on the contrary, each object is assigned to each cluster (distribution) with a certain probability → SOFT assignment, given by the membership probability.
The Maximization Step: calculation of the parameters.
- K-means: the centroids of the clusters are computed.
- Exp/Max: on the contrary, all the parameters of the distributions, as well as the weight parameters, are selected to maximize the probability (likelihood).

We now present another implementation of the Expectation-Maximization algorithm, the Gaussian Mixture Model, which assumes Gaussian distributions to model the data and performs not a hard assignment like K-means, but a probabilistic assignment. It can be considered a brother of K-means.

Example: Mixture Gaussian Distribution


We can talk about mixture models in terms of Gaussian distributions. For example, the height data of all people in Greece follows a (normal) Gaussian distribution. The probability density function of a one-dimensional normal Gaussian distribution at a point x is:

P(x) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²))

MODEL: a Gaussian distribution is defined by the pair of parameters θ = (μ, σ), where μ is the mean and σ is the standard deviation; generally, the model is defined by a vector of parameters θ.
Our goal is to find the normal distribution N(μ, σ) that best fits our data, so we want to find the best values for μ and σ. But what does "best fit" mean? It means maximizing the Maximum Likelihood Estimation (MLE).

Maximum Likelihood Estimation (MLE)


The MLE is the best fit, therefore the one we want to maximize.
As already mentioned, we want a procedure that estimates μ and σ when they are unknown. One approach is to choose the parameter values for which the data is most probable. In other words: suppose we have a vector X = {x1, …, xn} of values and we want to fit a Gaussian model N(μ, σ) to the data. The probability of observing a point xi is P(xi | μ, σ), and the probability of observing all the points (assuming independence) is:

P(X | μ, σ) = ∏_{i=1..n} P(xi | μ, σ)



We want to find the parameters θ = (μ, σ) that maximize the probability P(X|θ), and to do this we have to consider the likelihood as a function of the parameters. The probability P(X|θ) seen as a function of θ is the Likelihood function (it is the same quantity as before, but viewed as a function of θ instead of x):

L(θ) = P(X | θ) = ∏ i=1..n P(xi | θ)

However, it is usually easier to work with the Log-Likelihood (n is the number of data points we have):

LL(θ) = ∑ i=1..n log P(xi | θ)

Thus, Maximum Likelihood Estimation for the Gaussian model consists in finding the parameters μ, σ that maximize LL(θ). (In K-means only the sample mean is used.)

It is important to note that the parameters that maximize the log-likelihood also maximize the likelihood, since the logarithm is a monotonically increasing function.
If we have no a priori information about X or θ, maximizing P(θ|X) is the same as maximizing P(X|θ).
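As a concrete illustration (a minimal sketch, not from the original notes: the data and values are invented), the MLE of a single Gaussian is just the sample mean and sample standard deviation, and we can verify numerically that it maximizes the log-likelihood:

import numpy as np

def gaussian_log_likelihood(x, mu, sigma):
    # LL(theta) = sum_i log N(x_i | mu, sigma)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2))

# synthetic heights (e.g. the "Greece" population of the example)
rng = np.random.default_rng(0)
x = rng.normal(loc=178, scale=7, size=1000)

# MLE for a single Gaussian: sample mean and sample standard deviation
mu_hat, sigma_hat = x.mean(), x.std()
print(mu_hat, sigma_hat, gaussian_log_likelihood(x, mu_hat, sigma_hat))

# any other parameter choice gives a lower log-likelihood
print(gaussian_log_likelihood(x, mu_hat + 5, sigma_hat))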

Example Mixture of Gaussians:


Once we have found θ = (μ, σ) to model the distribution, we obtain a situation like the one in the figure. As an example we can consider the heights of people in Greece and China: in this case the result is a mixture of two Gaussians, one for the Greek population and one for the Chinese population. Identifying, for each value, which Gaussian is most likely to have generated it gives us the clustering.

Mixture model
A value xi is generated according to the following process:

1. First the nationality is selected: with probability πG we select Greece, with probability πC we select China, so that πG + πC = 1.

2. Given the nationality, we generate the point from the corresponding Gaussian, i.e. from N(μG, σG) or N(μC, σC).

Finally, the model will have the parameters θ = (πG, πC, μG, σG, μC, σC) (in K-means the mixing weights were 1 for one class and 0 for the other).

For a single value xi we have:

P(xi | θ) = πG · N(xi | μG, σG) + πC · N(xi | μC, σC)

For all values X = {x1, …, xn}:

P(X | θ) = ∏ i=1..n P(xi | θ)

We want to estimate the parameters θ that maximize this likelihood. Once we have the parameters, we can estimate the membership probabilities P(G|xi) and P(C|xi) for each point: this is the probability that a point belongs to the Greek or Chinese population (cluster).

Then we apply the Expectation-Maximization algorithm:

We initialize the parameter values θ with random values.
Repeat until convergence:
- E-Step: given the parameters, estimate the membership probabilities P(G|xi) and P(C|xi);
- M-Step: compute the parameter values that (in expectation) maximize the data likelihood.

→ K-means is not the only instance of the Exp/Max algorithm; there can be more than one implementation.
Limitations: it assumes that all data can be represented by Gaussian distributions.
Advantages: probabilistic assignment, and the estimated parameters can be used to describe the data.
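A minimal sketch of how the soft (probabilistic) assignment of a Gaussian mixture can be obtained in practice with scikit-learn; the dataset and parameter values are invented for illustration (two 1-D "height" populations, as in the Greece/China example):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
# two 1-D populations stacked in one dataset
x = np.concatenate([rng.normal(178, 7, 500), rng.normal(168, 6, 500)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0)
gmm.fit(x)                      # runs EM internally

print(gmm.weights_)             # mixing weights (pi_G, pi_C)
print(gmm.means_.ravel())       # means of the two Gaussians
print(gmm.predict_proba(x[:3])) # SOFT assignment: membership probabilities
print(gmm.predict(x[:3]))       # hard labels from the most likely component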

→ The difference between K-means and Gaussian mixture models is that, first of all, K-means does not use probabilities, because it makes HARD assignments of the type "this point belongs to this cluster, full stop", and the only parameter considered is the mean. In the Gaussian mixture model we have two parameters per component (mean and standard deviation).

DBSCAN evolution: OPTICS

Recall that DBSCAN worked well in density terms to separate the data and was resistant to noise. Conversely, it could not work well with clusters of different densities or with large datasets. OPTICS does not solve the latter problem, but it does manage to work well with clusters of different densities, sizes and shapes.

OPTICS: acronym of Ordering Points To Identify the Clustering Structure. The basic idea is to order the points based on density, in order to extract and identify the clustering structure and the clusters themselves, without parameter-estimation problems.
• It produces a special ordering of the dataset with respect to its density-based clustering structure.
• This cluster ordering contains information equivalent to density-based clusterings corresponding to a broad range of parameter settings.
• It is good for automatic and interactive cluster analysis, including finding an intrinsic clustering structure, without using parameters. In fact, both K-means and DBSCAN are very parameter-sensitive models, and the further we went with the study of these models, the clearer it became that it is not always easy to find the right configuration of all the required parameters. This is the motivation behind these advanced clustering methods.
• It can be represented graphically or using visualization techniques.

OPTICS requires only two parameters:

• ε, the maximum distance (radius) to consider;
• MinPts, the number of points required to form a cluster.

• We must recall the notion of core point: a point p is a core point if at least MinPts points are located within its ε-neighborhood, i.e. within the radius. In the example opposite, with MinPts=5, point p is a core point because its ε-neighborhood (ε=3) contains 5 points.
• Core distance: it is the minimum radius value necessary to classify a given point as a core point. In other words, it is the distance to the furthest of the MinPts nearest points in the neighborhood. If the given point is not a core point, the core distance is undefined.

In the example the core distance is 3mm, since the point Z is the furthest of these points from p, still inside its neighborhood. The core distance is at most equal to the value of ε.

• Reachability distance: the reachability distance of a point q from a point p is the maximum between the core distance of p and the distance between p and q.
In the example the reachability distance is equal to 7mm, obtained by taking the larger value between the core distance (3mm) and the distance between p and q (7mm).
The reachability distance between p and r is 3mm, because the distance between p and r is 2mm, while the core distance of p is 3mm.
• The reachability distance is not defined if p is not a core point.
• In other words, if q is within the core distance of p, the core distance is used, otherwise the real distance is used.

The output of OPTICS is the reachability plot, a special kind of dendrogram: OPTICS gives the points in a particular order, annotated with their minimum reachability distance.
• The cluster structure can be obtained easily: the procedure always extracts the point q which is closest to p.
• x axis: the order of the points as processed by OPTICS;
• y axis: the reachability distance.
• The output will always have a shape like this, with peaks and valleys; the reachability distance helps to understand when a cluster ends and a new one begins. The points belonging to a cluster have a low reachability distance from their nearest neighbor, so the clusters are precisely the valleys in the reachability plot. The deeper the valley, the denser the cluster.

• Clusters are extracted by:
1. selecting a range on the x-axis after a visual inspection,
2. selecting a threshold on the y-axis,
3. using various algorithms that try to detect valleys by steepness, knee detection, or local maxima.
Clusterings obtained this way are usually hierarchical, and cannot be reproduced by a single DBSCAN run.

• OPTICS pseudo-code:

- For each point p in the dataset, initialize the reachability distance of p as undefined.
- For each unprocessed point p in the dataset:
  - get the N neighbors of p (the points within the radius ε);
  - mark p as processed and add it to the ordered output list;
  - if p is a core point:
    - initialize a priority queue Q, used to obtain the point closest to p in terms of reachability;
    - call the function update(N, p, Q) *;
    - for every point q in Q:
      - get the N' neighbors of q;
      - mark q as processed and add it to the ordered output list;
      - if q is a core point, call update(N', q, Q), and so on.

*: this procedure always extracts the point q that is closest to p. When the procedure has reached the end, it starts again from a new unprocessed point.

Insights on the radius parameter ε:
- Core distance and reachability distance are undefined if there are no sufficiently dense clusters available and the radius ε is too small.
- Given a sufficiently large ε, each ε-neighborhood covers the entire database, but this usually never happens.
- The parameter ε is used to cut out cluster densities that are not considered interesting, in order to speed up the algorithm.
- The ε parameter is not really necessary, because it can be set to the maximum possible value.
- However, when a spatial index is available, it plays a practical role in the complexity.
- OPTICS abstracts DBSCAN by removing this parameter, or at least by allowing it to be set to the maximum value.
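A minimal usage sketch of scikit-learn's OPTICS implementation on invented blob data (here max_eps plays the role of ε, left at its maximum value, and min_samples plays the role of MinPts):

import numpy as np
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# blobs with different densities
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=[0.3, 1.0, 2.5], random_state=0)

optics = OPTICS(min_samples=10, max_eps=np.inf)  # max_eps left at its maximum value
optics.fit(X)

# reachability plot data: points in processing order with their reachability distance
order = optics.ordering_
reachability = optics.reachability_[order]
print(reachability[:10])

# cluster labels extracted from the reachability structure
print(np.unique(optics.labels_))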

Transactional clustering

Clustering means grouping objects into different sets such that the distances between objects in the same cluster are minimized and the distances between objects belonging to different clusters are maximized, so objects in the same cluster should be similar to each other. Usually the Euclidean distance, the Manhattan distance or, as we have seen for time series, DTW are used. All these distances are fine for purely numerical data, but problems arise when we want to compare objects that have categorical features; the same issue extends to boolean attributes.

There is a specific type of data that is mainly composed of categorical attributes, i.e. market basket data, the same data type we have seen for pattern mining and sequential pattern mining: each piece of information is stored in a transaction which represents a set of items "bought" together. It may be interesting to apply clustering to this type of data as well, for example to detect which customers buy in a similar way. Clustering and pattern mining are completely different tasks: on the one hand the goal is to extract (sub)sets of items often bought together, while on the other hand the goal is to group the transactions so that those in the same cluster are similar, and therefore the transactions in the same cluster may exhibit the same patterns.

A typical way to represent transactional data is the one already seen made by sets but another way is to use vectors with
boolean attributes saying if a single item has been bought or not in a transaction.

There is a strong connection between booleans and categoricals, in fact it is always easy to convert from booleans to
categoricals and vice versa.

So given this type of representation we can use Euclidean distance to measure the closeness between pairs of
transactions. For example, the smallest distance is the one between P1 and P2 which is equal to 1.

The problem comes when we introduce the notion of centroid and also the notion of distance itself.
Let's assume we are running hierarchical clustering with average distance in which we consider the centroids of
the clusters to calculate the distance. So if we take the cluster that contains the points P1 and P2, the centroid will
be (1,1,0.5,1) and when I have to create the next cluster I calculate the distance between the centroid of P12 and
the points P3 and P4.

We obtain that the smallest distance is the one between P3 and P4. This is true from a mathematical point of view but meaningless from a semantic point of view, because P3 and P4 do not have any item in common: their similarity is only due to the fact that they both did not buy items 1 and 2, not to having bought the same items. For this reason a better solution might be to use the Jaccard distance. In conclusion, using distance metrics designed for numeric attributes is often not appropriate for categorical ones.
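A minimal sketch of the Jaccard distance on the set representation; the transactions below are hypothetical (the actual P1..P4 of the example are shown only in the figure), but they illustrate why two baskets with no items in common are no longer considered close:

def jaccard_distance(a, b):
    # a, b are sets of items; distance = 1 - |intersection| / |union|
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

P1 = {'item1', 'item2', 'item3'}
P2 = {'item1', 'item2', 'item3', 'item4'}
P3 = {'item5'}
P4 = {'item6'}

print(jaccard_distance(P1, P2))  # small: many shared items
print(jaccard_distance(P3, P4))  # 1.0: no shared items, so not similar at all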

For this, algorithms have been created that work for categorical and transactional data.

K-Modes

It is a "brother" of K-means designed for categorical data. The goal is always to minimize the intra-cluster distance.

In K-means the Euclidean distance was used, while in K-modes the distance is the sum of the mismatches between the attributes of the two objects. The mismatch is 0 if the two objects have the same value for that attribute, and 1 otherwise.

Example:

The distance between objects x and y is 2 because there are two attributes on which their values do not match.

The second difference is that while in K-means we use the mean as the centroid, for categorical attributes we cannot compute it, so the mode of each column is used instead. For example, if we have attributes A1, A2, A3 and most of the objects within the cluster have A1=1, most have A2=1 and most have A3=0, then the centroid will be (1,1,0). The mode is computed independently for each column.

Mode calculation example:

The algorithm starts by randomly selecting the initial objects as modes, then it scans all the data and assigns each object to the closest cluster, identified by its mode. Then it recomputes the mode of each cluster and repeats until no object changes cluster.

Note that this distance still does not solve the problem of matching on zeros: two transactions are also considered similar when they both did not buy an item.
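A minimal sketch of the two ingredients of K-Modes, the matching dissimilarity and the per-column mode used as centroid; the toy categorical objects below are invented, not the ones in the figure:

import numpy as np

def matching_dissimilarity(x, y):
    # number of attributes on which the two objects differ
    return sum(int(a != b) for a, b in zip(x, y))

def column_modes(cluster):
    # the "centroid" of a cluster of categorical objects: the mode of each column
    cluster = np.asarray(cluster, dtype=object)
    modes = []
    for j in range(cluster.shape[1]):
        values, counts = np.unique(cluster[:, j], return_counts=True)
        modes.append(values[np.argmax(counts)])
    return modes

x = ['red', 'small', 'yes']
y = ['red', 'large', 'no']
print(matching_dissimilarity(x, y))            # 2 mismatches

cluster = [['red', 'small', 'yes'],
           ['red', 'small', 'no'],
           ['blue', 'small', 'yes']]
print(column_modes(cluster))                   # ['red', 'small', 'yes']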

TX-MEANS
T stands for transactional, so it is the X-means for transactional data. It is a parameter-free clustering algorithm that automatically estimates the number of clusters, like X-means. In addition, it provides the representative transaction of each cluster, which summarizes the pattern captured by that cluster.

It starts from a single cluster that contains all baskets; then it binary splits clusters with multiple iterations until
the BIC tells us we don't have to split the cluster anymore.

X-means starts by running K-means with some k and then tries to split each cluster using bisecting K-means until the BIC stops improving; finally it computes the global BIC and returns the clustering that maximizes it. TX-means applies the bisecting procedure directly until the BIC stops improving. In the end it allows us to extract from each cluster a representative basket, which is also used by TX-means itself during the bisecting procedure.

The difference with bisecting K-means (with k=2) is the GETREPR function, which returns a specific type of centroid that is representative for the baskets in the cluster.

This procedure starts from a representative basket formed by the items present in all transactions, and iteratively adds to this basket the most frequent items until the distance between the representative basket and all the baskets in the analyzed cluster no longer decreases.

Suppose we have

T1=ABC

T2=ABCD

T3=AB

T4=AC

At the first iteration r = {A}, because A is the only item present in all transactions. The distance between r and the 4 transactions is computed using the Jaccard coefficient. If the distance decreases, the step is repeated by adding the most frequent items, which in our case are B and C, and the distance between r and the transactions is recomputed. The resulting representative basket r1_new is used as a centroid for the bisecting procedure, which returns two clusters with their representative baskets; the split is kept only if the BIC of the children is greater than that of the parent, otherwise the procedure stops (it is a top-down strategy).

A peculiarity of the tx means is that it can be used on large datasets because it can first be applied to a subset of
the dataset and then assign the remaining baskets to the clusters obtained using a nearest neighbor approach.

Example:

suppose we have this cluster



the centroid extracted from the k-mode is 111000.

If instead we use tx means at the beginning r=A, in the second iteration B and C are added. Then the distance between r=
ABC and each transaction is calculated.

d(r,T0)= J(r,T0)= ¾ etc

The sum of these distances, divided by the number of transactions, gives the (average) distance between r and all the transactions. At each iteration we look for the r that is closest to all the transactions.
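A minimal sketch of this representative-basket idea, under the assumption that the basket is grown greedily using the average Jaccard distance as described above (the GETREPR details of the real TX-Means implementation may differ); the transactions are the T1..T4 of the example:

def jaccard_distance(a, b):
    return 1.0 - len(a & b) / len(a | b) if (a | b) else 0.0

def avg_distance(r, baskets):
    return sum(jaccard_distance(r, t) for t in baskets) / len(baskets)

def representative_basket(baskets):
    # start from the items present in ALL transactions
    r = set.intersection(*baskets)
    candidates = sorted(set.union(*baskets) - r,
                        key=lambda i: sum(i in t for t in baskets), reverse=True)
    best = avg_distance(r, baskets)
    for item in candidates:           # greedily add the most frequent items
        new_dist = avg_distance(r | {item}, baskets)
        if new_dist >= best:          # stop when the distance no longer decreases
            break
        r, best = r | {item}, new_dist
    return r

baskets = [{'A', 'B', 'C'}, {'A', 'B', 'C', 'D'}, {'A', 'B'}, {'A', 'C'}]
print(representative_basket(baskets))   # {'A', 'B', 'C'}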

ROCK (RObust Clustering using link)

It is one of the first algorithms proposed for transactional data, while TX-Means is one of the last.

It is a hierarchical clustering algorithm that uses links between clusters instead of the classic notion of distance.

There is also an idea of neighborhood, used to define the number of links, which in turn gives an idea of similarity. Given a distance (which in this case is the Jaccard distance) and a certain threshold θ, we say that two transactions are "neighbors" if their similarity is greater than θ.

Example:

The objects in common are 3, while the union of A and B contains 6 objects, so the Jaccard similarity is 3/6 = 0.5.

If 0.5 is greater than θ, then A and B are neighbors, otherwise they are not.

A link defines the number of neighbors common to two objects.

The higher the value of the link between two objects, the higher the probability that the two objects belong to the same cluster. It does not directly mean that A and B are formed by similar items, but it does mean that the sets of transactions similar to A and similar to B have a high overlap. So we can say that similarity is a local concept, because it only observes whether two transactions are similar in terms of the items they contain, while the link captures global (or pseudo-local) information, because it compares the neighbors of A and B.

Obviously a point is considered a neighbor of itself.

Example:

The basic idea is this, even if in practice it is not exactly like this, because a criterion function is used which still relies on the number of links. The idea of ROCK is to maximize this function to return the optimal partition of the transactions. The function is maximized when transactions with the same neighbors (links) are in the same cluster, while transactions with different links are kept in different clusters.

Even though ROCK is a hierarchical clustering algorithm, it takes the number k of clusters as a parameter, so it stops the hierarchy once it reaches k different clusters.

To identify the best pair of clusters to merge, a measure called goodness is computed:

g(Ci, Cj) = link(Ci, Cj) / [ (ni + nj)^(1+2f(θ)) − ni^(1+2f(θ)) − nj^(1+2f(θ)) ],  with f(θ) = (1−θ)/(1+θ)

The denominator is the expected number of cross-links between the two clusters.

How the algorithm works

It takes as input a set of transactions, the number k of clusters we want and the similarity threshold θ; the output is the clustered data.

During the first step it takes a sample from the dataset and this happens because the whole procedure is very
expensive from a computational point of view, so it is done to ensure scalability for too large datasets. The
initial sample is used to form the clusters and the remaining unused data is assigned to these clusters
using the concept of distance.

In the second step it runs a hierarchical agglomerative clustering algorithm: it starts by assigning a cluster to each point (a bottom-up procedure, like all the hierarchical algorithms seen so far), then it computes the similarity between each pair of clusters using the goodness measure (which uses the notion of link), and the two clusters with the highest similarity are merged. Finally the stop condition is checked, which is simply whether we have reached the required number k of clusters; if not, we go back to step two.

In the third step, label the data, i.e. assign the remaining data to the formed clusters, selecting a random
sample from each cluster and assigning each point to the cluster with which it has the highest link value.

ROCK algorithm exercise:

The second table, called the adjacency table, is obtained by putting the value 1 in the cells where the Jaccard similarity is greater than θ, which in this case is 0.3, and 0 in the cells where it is less than or equal.

The third table is the link table, also called the common-neighbors table: it contains the number of common neighbors of each pair of transactions (the cardinality of the intersection of their neighbor sets) and can also be obtained by multiplying the adjacency table by itself.

P1={1,2,3}

P2={1,2,3}

P4={2,4}

So the intersection between P1 and P2 is {1,2,3} which has cardinality 3, while between P1 and P4 it is {2} which has cardinality 1.

At the first iteration the calculation is simple because n=1 and m=1 given that they are the cardinalities of the clusters (they are
the ones we called ni and nj before) and that we start from all clusters formed by a single point. For the first iteration, the calculation
of the denominator can be done only once because it will be the same for all pairs.
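A minimal sketch of how the adjacency and link tables can be computed; the toy transactions and θ are invented (the actual ones of the exercise are in the figure):

import numpy as np

def jaccard_similarity(a, b):
    return len(a & b) / len(a | b)

transactions = [{'a', 'b', 'c'}, {'a', 'b', 'd'}, {'a', 'c', 'd'}, {'e', 'f'}]
theta = 0.3
n = len(transactions)

# adjacency table: 1 if the Jaccard similarity is greater than theta
A = np.zeros((n, n), dtype=int)
for i in range(n):
    for j in range(n):
        if i != j and jaccard_similarity(transactions[i], transactions[j]) > theta:
            A[i, j] = 1
np.fill_diagonal(A, 1)   # a point is a neighbor of itself

# link table (common neighbors): adjacency table multiplied by itself
links = A @ A
print(A)
print(links)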

CLOPE (Clustering with sLOPE)

It is always a very efficient transactional data clustering algorithm for large datasets. Like ROCK, it uses a global
criterion function whose goal is to increase the overlapping of transaction items in the same cluster by increasing the height-
to-width ratio of the cluster histogram.

Example of cluster histogram:

For each cluster I write down the items within the cluster and put a square on that item for each time it appears within the
cluster.

Clustering 1 clusters the top 3 transactions in one cluster and the bottom 2 in another cluster, while clustering 2 clusters the top 2
transactions and the bottom 3 in another cluster.

How do we tell which of the two clusterings is better? The idea is that the fewer squares we have to add to "complete" these histograms into a rectangle, the better the clustering.

In clustering 1 we have to add 5 squares, while in clustering 2 we have to add 8, so clustering 1 is better. That's the idea but we have
to calculate it mathematically.

D(C), the set of distinct items, has cardinality 4 for the first cluster of clustering 1 and 3 for the second cluster.

S(C) is the total number of squares in the cluster, i.e. the total number of item occurrences in the cluster; W(C) = |D(C)| is the width and H(C) = S(C)/W(C) is the height of the histogram.

The higher the ratio between H(C) and W(C), the more "compact" the histogram is. This quantity is used by the criterion function, which evaluates the goodness of a clustering through the gradient of each cluster, defined as G(C) = H(C)/W(C) = S(C)/W(C)².

Generalizing, a parameter r called repulsion is used: the larger it is, the more the algorithm prefers clusters whose transactions share a larger portion of items. A data partition that maximizes the resulting profit function gives a good clustering.

This algorithm takes as input the dataset, the repulsion and the maximum number of clusters.

CLOPE is a bottom-up algorithm: we already know the maximum number of clusters we want.

It consists of two stages:



In the first phase it takes each transaction and adds it to a new or existing cluster, choosing the one that maximizes the profit function. After having assigned each transaction to a cluster, we move on to phase two, in which we ask, for each individual transaction, whether it is better to keep it in its current cluster or to move it to another cluster; these movements are also driven by the profit function. Phase 2 is repeated until no transaction changes cluster anymore.

If we have transactions A,B,C,D,E,F and we want 3 clusters, first we take transaction A and insert it for example in cluster
1, then we take B and calculate the similarity with the first cluster and with the second and we insert it where it has the
greatest similarity, then we take C and do the same thing.

At the end of the first phase we get for example:

C1 = ACE

C2 = BD

C3= F

and we pass to phase 2: we start from A and check whether moving A to C2 or C3 increases the profit; the same is done for all the other transactions. This is iterated until no transaction needs to be moved anymore.
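A minimal sketch of the quantities used by the CLOPE criterion (S, W and the profit with repulsion r, following the standard definitions recalled above); the transactions and the two partitions are invented, only in the spirit of the histogram example:

def cluster_stats(cluster):
    # S(C): total number of item occurrences, W(C): number of distinct items (width)
    S = sum(len(t) for t in cluster)
    W = len(set().union(*cluster))
    return S, W

def profit(clustering, r=2.0):
    # Profit_r = sum_C [ S(C) / W(C)^r * |C| ] / sum_C |C|
    num = 0.0
    for C in clustering:
        S, W = cluster_stats(C)
        num += S / (W ** r) * len(C)
    den = sum(len(C) for C in clustering)
    return num / den

t = [{'a', 'b'}, {'a', 'b', 'c'}, {'a', 'c', 'd'}, {'d', 'e'}, {'d', 'e', 'f'}]

clustering1 = [t[:3], t[3:]]
clustering2 = [t[:2], t[2:]]
print(profit(clustering1), profit(clustering2))   # the first partition has the higher profit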

12/05

Explainability

Definitions:

- To interpret means to give or provide the meaning, or to explain and present something in terms of concepts that are understandable.
- In AI, Data Mining and Machine Learning, interpretability is defined as the ability to explain or to provide meaning in terms that a human can understand.

Blackbox model

A black box is a model whose internals are either unknown to the observer or known but not interpretable by humans. It is important to discuss the reasons for explaining black-box models.

Even when it is possible to access the inside of the box, the decision process can be so complex that a human cannot understand how, given a certain input, a certain output is obtained.

Examples: Deep Neural Networks, SVMs, Ensembles → it is difficult to understand all the paths that can lead from every possible input to every possible output.

Interpretable models
However, we know that there are some interpretable models. One of these is the Decision Tree: we can follow the logical reasoning that leads from the root to each leaf. Another interpretable model is the Linear Model, such as a linear or logistic regressor, in the sense that we can read it by observing the coefficients and considering their sign and their magnitude.

In the following case, for example, there are models trained on the Titanic dataset: for the linear model the final result is "survived = Yes", and what is the explanation? That the person survives mainly thanks to Gender (F), which makes a large positive contribution to the outcome, while the age and the class in which she is travelling give a negative contribution; the final outcome is still "survived"!
We also examined other purely rule-based classifiers that are interpretable, like decision trees, which can be linearized into a set of decision rules with the "if-then" format: if condition1 ∧ condition2 ∧ condition3, then outcome.

Reasons for methods of explanation


Let's look at 3 examples:

1. COMPAS recidivism black bias


This result was obtained by ProPublica (an American newspaper), which studied a score used in courts of several American states to decide whether a prisoner deserves house arrest or should remain in prison; the score is calculated automatically from the features that describe the prisoners. A low vs. high risk of committing a new crime decides the fate of the prisoners on the basis of this score.

ProPublica discovered that this score was biased against black people, who were rated as more dangerous than white people regardless of whether they actually committed more crimes!

This is due to the presence of bias in the dataset used to learn


this COMPAS score.
2. The background bias
We have a dataset of wolves and huskies and we trained an image classifier capable of distinguishing huskies from wolves. After training, we give the classifier new test images and notice that it does not work very well. Why? We find that the reason for classifying an image as a wolf is the presence of white (snowy) areas in the background, not the characteristics of the animal!
This is not acceptable: we would like the machine to reason as a human would, and in this case we cannot trust the classifier!
3. Right to explanation
The EU introduced the General Data Protection Regulation (GDPR), which entered into force on 25 May 2018 and contains a right to explanation: if an automated decision is applied to an individual, that person has the right to know why that prediction/classification was made about them.
For example, on Facebook, next to an advertisement there is a small arrow through which you can ask why you are seeing that particular advertisement.

We can talk about "Explanation" in different fields of AI:


- Machine Learning: Feature importance, Partial dependence Plot; Auto-encoder
- Computer Vision (which makes something clearer): Uncertainty Map; Saliency Map
- Knowledge representation and reasoning: diagnosis inference
- Multi-agent systems
- NLP
- Planning and scheduling.
- Robotics (…)

What is the ultimate goal?


We do not need just a classifier or an explainer, BUT a way to be able to "converse" with the machines.

The user asks the machine: why is the first image classified as a fish? Because there is a saliency map with green and red areas, the green ones suggesting that the image is a fish, the red ones a dog. Since there is more green than red, the machine says it is a fish. BUT the green areas are mainly in the background of the fish. You may suspect that you are facing a background bias.
Then you can ask the machine to show some of the training examples that affect the prediction:

Maybe it is the case that the machine recognizes the anemones in the background, so the machine has learned that it is a fish thanks to the anemones.

So you ask the machine


if you can try to remove the background and classify the image again:

This is the same behavior humans adopt when they try to convince another person of their thesis, so it is important to have access to this information!

Role-based interpretability
It is not possible to have a universally interpretable explanation! The question shifts from "Is the explanation interpretable?" to "To whom is the explanation interpretable?"
In the literature, 3 types of users are recognized: End
users: "Can I contest the decision?" “What could I do differently to get a
positive result?”
Engineers, data scientists: “Does my system work as designed?”

Regulators: “is it compliant?”


An explanation tool should model the user's background. Not only the background is important, but also the time available to decide on a choice (a doctor and a data scientist make different decisions with different time budgets).
We do not have universal explainers but user-based explainers.

XAI (Explainable AI) is interdisciplinary For


millennia, philosophers have asked the questions of what constitutes an
explanation, what is the function of explanations and their structure.
Many researchers who deal with different disciplines deal with these topics!

How to open a black box?


In recent years, a taxonomy of explanation methods and of the problems they need to solve has been established. At a very high level we can distinguish between explainable-by-design methods and black box explanation methods.

In the first case we have a black box system and we have to replace it with a transparent system; this is called intrinsic explainability because the final model will be intrinsically explainable. E.g.: we have a neural network as a black box system and somehow we replace it with a decision tree, which, as we have said, is transparent, being an interpretable model.

This approach has some difficulties with images or texts, which are not easy to handle: we can represent tabular data with a decision tree, but we cannot do it easily with image or textual data.

The second type are the black box explanation methods: in this case we want to keep the black box AI system and add an explanation sub-system, so we will have the black box system responsible for returning the classification and the explanation system responsible for returning the explanation.

Explainable by design methods

We have dataset X, we learn an interpretable and transparent model “c”, and this model c is able to return
both the outcome and the explanation to the user as output.

Example with Decision Tree:

Can I play tennis knowing the weather is Sunny and Humidity is Normal? According to the model the answer is
"Yes".

Black box explanation methods



In this case there are two versions: a global one and a local one.

From dataset X we can have a black box model that returns


only the prediction and on the other side the explanation model
that provides the explanation. Eventually the user will have
both!

We refer to a global explanation when e = f(b, x) gives the user the overall logic of the black box b, so with this explanation we are able to reproduce the behavior of the black box for every possible input (as with the previous decision tree).

If instead we have a local explanation (in the form of a rule, for example), we know the reason why the classifier output is yes or no for a specific instance.
Global explainability provides a complete view of the black box, for all possible inputs and all possible outcomes, while a local explanation reveals why a specific instance analyzed at a specific moment was classified in a certain way.

Research mostly focuses on local explanations because global ones are much more difficult to extract!

Model specific and model agnostic


Another distinction that can be made is between model specific and model agnostic explainers.

In the model-specific case we have an explanation method strictly dependent on the black box: it means, for example, that the explanation method is able to explain only random forests, because it uses some components that exist only in a particular black box or black box family; if I want to explain a random forest, the explanation method will need to handle the individual trees.

In the model agnostic explainer, on the other hand, the explanation method is independent of the black box so
it can be used in any black box regardless of the type of model. It can be very useful in scenarios where we
don't physically have the black box but it is for example an API, so we only have to worry about inputs and
output.

Data types

Currently, different types of data are analyzed for explainability, mainly three: tabular data, images and, with some approaches, texts.

Types of explanations (in terms of the interpretable models used to capture and return the reasoning of the classifier to humans), categorized by data type:

For tabular data we certainly have rule-based explanations expressed in a logical form, decision trees, and feature importance, which reveals how important each feature is for the output of a specific instance; feature importance can only be local because it refers to a specific instance.
Prototypes and counter-exemplars are instances which are, on the one hand, prototypically similar to the analyzed instance, obtaining the same output and clarifying the reason why the classifier returns a certain label for that instance.
While the counter-exemplars are instances similar to those analyzed but with a different outcome.

For images and text we also have other options. The most famous is the saliency map (the one of the fish example): a heatmap overlaid on an image, which uses one color if a pixel/area is positively important for the classification and another color if it is negatively important.
The saliency map is the feature importance counterpart for images, while a similar version for texts is the sentence
highlighting, which colors certain words differently if they have a positive or negative impact in the classifier.

Explanations and Explanation Methods


TREPAN
Proposed in 1996, it is a global explainer designed to "explain" Neural Networks, but in practice it can be used for any type of black box (model agnostic). The idea is to approximate a NN with a decision tree that uses best m-of-n rules, i.e. out of n conditions, at least m must be satisfied to follow the path.

Pseudocode:

We have to focus on two points. The first is the black box auditing: the tree that explains the NN is not built on the original labels of the dataset but on the labels assigned to the data by the NN, because we want to predict the class provided by the black box (even if it is wrong). The second aspect is line 5, the random generation, which allows a certain number of random records to be generated at each split, respecting the conditions of the tree defined up to that point.

For example, if at the first split we have 100 records, at the second step we will have (e.g.) 70 and 30. When we learn the second split we do not want to do it on 30 instances but on 100, therefore TREPAN generates 70 synthetic instances respecting the condition (UniformityCellSize < 2.5) and applies the black box b on the generated set z (obtaining b(z)) of 70 instances (in blue). So we will have 30 real elements and 70 "artificial" ones.

Next, class y is used in line 9 to find the best split (with the label assigned by the black box, so it's not the
real label).

(Example Trepan Jupyter code)

LIME (Local Interpretable Model-Agnostic Explanations)


It is the most famous explanation technique, proposed in 2016 and follows a simple idea: even if the global
decision boundary is complex and it is difficult to explain the reasons for each classification, in the
neighborhood of a single decision (red cross highlighted), the boundary is simple.

Considering the red cross in the example, we can say that it is classified as a red cross because it is located
to the left of the line dividing the dots, regardless of the area in which it is located.
Thus a single decision can be verified by checking the black box in the neighborhood of a given instance
and learning a local decision.

LIME can be used to explain any type of black box, providing local explanations; here you can see the application to tabular data and images. LIME is model-agnostic and does not strictly depend on a data type.

We see that duration_in_month makes a large negative contribution (0.11) against class 1, while account_check_status makes a positive contribution (0.99) for class 1. LIME's idea is to synthetically generate instances near the instance to explain, learn a local surrogate expressed as a linear or logistic regressor (with Lasso or Ridge penalty), and then use the coefficients of this regressor as feature importances.

Example:
b(·) indicates the probability over the possible classes. In the second step we synthetically generate instances similar to the initial one and see how the probabilities change, repeating this a number of times. Then we train a logistic (or linear, depending on the case) regressor on these instances. After learning the regressor, its coefficients are reported as the explanation.

- Z is a set (initially empty) which constitutes the neighborhood.
- x is the instance to be explained.
- In this case we are not using real data, but synthetic data. LIME assumes that any data type can be represented in a human-interpretable form.
If you are dealing with images or texts, a transformation is applied: for images there is a tessellation of the image, where each part (superpixel) becomes a feature, while for texts there is a fragmentation into words or sentences.

This notion of interpretability is used in sample_around(x') (line 5), as was done in the example: some features are randomly selected and their values are randomly changed according to the known distributions. The function on line 6, bel2real(z'), maps the interpretable representation back into the real feature space; then the black box is queried and, for each instance, the following are stored: zi (the instance), b(zi) (the prediction of the black box), and d(x, zi) (the distance between x and the generated instance). Finally, a Lasso problem is solved, producing the weights w which express how important each feature is for the classification.

For images: the image is divided into areas of the same color (superpixels), and the original image x' is represented as a vector of interpretable superpixels expressing their presence/absence. In the neighborhood, some superpixels are set to 0, obtaining a copy of the image where some parts are obscured. We then train a linear regression model and assign a weight to each superpixel, to understand which parts are the most important for being able to say that it is a fox.
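A minimal from-scratch sketch of the LIME idea for tabular data (perturb around x, weight the synthetic instances by proximity, fit a weighted Ridge surrogate and read its coefficients); this is a simplified illustration with invented data, not the actual LIME library implementation:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
black_box = RandomForestClassifier(random_state=0).fit(X, y)   # the black box b

def lime_like_explanation(x, predict_proba, n_samples=1000, kernel_width=1.0):
    rng = np.random.default_rng(0)
    # 1. generate synthetic instances Z in the neighborhood of x
    Z = x + rng.normal(scale=X.std(axis=0), size=(n_samples, x.shape[0]))
    # 2. query the black box on the synthetic instances
    yz = predict_proba(Z)[:, 1]
    # 3. weight each synthetic instance by its proximity to x
    d = np.linalg.norm(Z - x, axis=1)
    w = np.exp(-(d ** 2) / (kernel_width ** 2))
    # 4. fit a weighted linear surrogate and use its coefficients as feature importance
    surrogate = Ridge(alpha=1.0).fit(Z, yz, sample_weight=w)
    return surrogate.coef_

x = X[0]
print(lime_like_explanation(x, black_box.predict_proba))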

17/05/2021 - last lesson

LORE (Local Rule-based Explainer)


• LORE extends LIME by adopting a decision tree instead of a linear regression, and it generates synthetic instances through a genetic procedure that takes into account both instances with the same label and instances with different labels.
• The explanation is given in terms of rules and counterfactual rules.
• In the example below the rule is highlighted in green. But we also have another piece of information, i.e. the counterfactual rule: if the income is > 900 or the job is something other than 'clerk' --> the loan will be granted.

• LORE can be generalized to work with images and text, using the same data representation adopted by LIME.
• LORE is similar to LIME, but there are two differences:
1. it uses a genetic approach instead of a random approach;
2. it uses a decision tree instead of a linear regressor.

Algorithm

- Points 2 and 3: the "genetic approach".
This approach uses instances as evolving 'chromosomes' and optimizes a 'fitness function'. The idea is to generate two parts of the neighborhood:
- one, called Z=, containing instances similar to x that receive the same label from the black box;
- one, called Z!=, containing instances similar to x but that receive a different label from the black box.

- In the image below: the star is the point to explain. The yellow and orange dots are the instances generated by the genetic approach.

- Subsequently, at point 4, Z= and Z!= are combined.
- At point 5 a decision tree is built using the labels assigned by the black box to the set Z. At this point we are building a local surrogate, which for LIME was a linear regressor; here it is a decision tree.

- At points 6 and 7, given the decision tree c, we can extract the factual rule and the counterfactual rules.

The counterfactual rules highlight the smallest changes to the instance that would lead to a different outcome.

Recap:
- in points 2 and 3 we have the (genetic) generation of the neighborhood;
- in line 5 we have the 'black box auditing', i.e. the black box query.

LORE ON MEDICAL IMAGES

LORE was designed for tabular data, but by using LIME's data transformation it can also be applied to images and textual data.

SHAP

• It uses the notion of Shapley values.
• Imagine treating the features as if each feature/column were a 'player' in a match, and we have to select the best players. Shapley values are a technique from coalitional game theory used to find out how much each player contributes to the final outcome.
• Ex: a black box predicts apartment prices. For a certain apartment €300,000 is predicted, and the prediction needs to be explained. The apartment has an area of 50 m², is located on the 2nd floor, has a park nearby and cats are banned.
• Shapley values and game theory: if we consider the whole dataset, we have an average prediction of €310,000. The question is how much each feature value contributes to the prediction with respect to the average prediction.
• The game is played for a single instance, so this is a local approach.

• The gain is the difference between the current prediction and the average prediction over all the instances.

• The players are the feature values of the instance, which collaborate to receive a certain gain.
• More precisely we have these contributions: park-nearby contributed €30,000; area-50 contributed €10,000; floor-2nd contributed €0; cat-banned contributed -€50,000. The contributions add up to -€10,000, i.e. the final prediction minus the average predicted apartment price.

• This last part is a form of feature importance.

In summary: SHAP is an approach that uses Shapley values to return a local, model-agnostic explanation in the form of feature importance.

Shapley Values Example


- The Shapley value is the average marginal contribution of a feature value over all possible coalitions (combinations of fixed feature values).
- We evaluate the contribution of cat-banned when it is added to a coalition of park-nearby and area-50.
- We simulate that only park-nearby and area-50 are in the coalition by randomly drawing another apartment from the data and using its value for the floor feature.
- The 2nd floor is replaced by the randomly drawn 1st floor.
- Then we predict the price of the apartment with this combination (€310,000).

So we fix the coalition (park-nearby and the 50 m²), and randomly draw values for the features that can change (the floor, in this case) before querying the black box.

- In a second step, we remove cat-banned from the coalition by replacing it as well with a random value. In the example the drawn value was cat-allowed, but it could have been cat-banned again.

- We predict the apartment price for the park-nearby + area-50 coalition (€320,000).
-
The contribution of cat-banned was €310,000 - €320,000 = - €10,000. This estimate depends on the values of the
randomly drawn apartment that served as a "donor" for the cat and the floor feature values.

- We get better estimates if we repeat the sampling step and average the contributions.

- Later we will repeat the calculation for different possible coalitions.

-
The Shapley value is the average of all marginal contributions in all possible coalitions.
-
- Computation time increases exponentially with the number of features.
- For each of these coalitions we calculate the expected price per apartment with and without the
characteristic value 'catbanned' and take the difference to get the marginal contribution.
- We replace the values of the features that are not in the coalition with random feature values drawn from other apartments, in order to get a prediction from the black box.
- If we estimate the Shapley values for all the features, we get the full distribution of the prediction (minus the mean)
among the feature values.

The professor does not require us to remember this whole process in detail. Crucially, SHAP does not use a local surrogate to estimate feature importance.
From the user's point of view it is similar to LIME, because it returns the explanation as feature importance, but SHAP does not use a local surrogate: it uses game theory, summarized by the following formula:

φi(f) = Σ over S ⊆ F\{i} of [ |S|! (|F| − |S| − 1)! / |F|! ] · [ f(S ∪ {i}) − f(S) ]

Where:
- S = a subset (coalition) of features;
- F = the set of all features;
- i = the feature we are analyzing at a certain moment;
- φi = the Shapley value of feature i;
- f = the black box, evaluated on the original instance x with the features outside the coalition replaced by randomly selected values from the dataset.

The crucial aspect (the formula itself does not need to be memorized): the Shapley value is the average of all the marginal contributions, so what matters is understanding what the Shapley values are.

The 'table' below is the explanation given by SHAP. It says that the variable called 'LSTAT', which has a value of 4.98, has the largest positive contribution, because it 'moves' the prediction toward 24.41, while RM = 6.575 moves the prediction in the other direction. We therefore have an importance value for each feature.

SHAP on tabular data



SHAP works with tabular data. In this case the coalition contains Age, Weight and Color. The values on the right are the values of the instance x under analysis, while in z Age is kept fixed and Weight and Color are chosen randomly. This is an example of the instances created by SHAP that need to be tested:

SHAP on images
SHAP can also work with images.

In this case the image is represented by three superpixels. In the original image all three are present, while in the perturbed instance sp3, i.e. the dog's face, is removed and replaced by a gray circle.
The replacement color we choose affects the result, and this is a weakness of the approach.

RECAP:

- SHAP is local, model-agnostic and data-agnostic → similar to LIME;
- but it is theoretically different: it does not use a surrogate model; it tests different subsets of features, keeping the feature of interest fixed and computing the average marginal contribution of its value, in order to return the explanation in the form of feature importance (like LIME).
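A minimal Monte-Carlo sketch of the Shapley value estimation described above (random coalitions built by mixing the instance x with a randomly drawn "donor" instance z); it uses the sampling approximation of the exact formula, and the model and data are invented:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=500, n_features=4, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)

def shapley_value(model, X, x, feature, n_iter=200):
    rng = np.random.default_rng(0)
    contributions = []
    for _ in range(n_iter):
        z = X[rng.integers(len(X))]            # random "donor" instance
        perm = rng.permutation(X.shape[1])     # random order of the features
        pos = np.where(perm == feature)[0][0]
        # the coalition = the features that come before `feature` in the permutation
        x_plus, x_minus = z.copy(), z.copy()
        x_plus[perm[:pos + 1]] = x[perm[:pos + 1]]   # coalition + the feature, taken from x
        x_minus[perm[:pos]] = x[perm[:pos]]          # coalition only, taken from x
        contributions.append(model.predict([x_plus])[0] - model.predict([x_minus])[0])
    return np.mean(contributions)              # average marginal contribution

x = X[0]
print([round(shapley_value(model, X, x, j), 2) for j in range(X.shape[1])])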

Saliency Maps

A saliency map is an image superimposed on the original one, where the brightness of a pixel represents what is
"salient" (important). A positive (red) value indicates that the pixel contributed positively to the classification, while a negative
(blue) contributed negatively.

There are two methods to create an MS:

1) Assign each pixel to a saliency value.

- Advantages: you don't have to apply a segmentation algorithm.


- Disadvantages: It can be confusing to interpret.

2) Segment the image into groups of pixels (called superpixels or segments) and assign a saliency value to each group.

- Advantages: Different parameters can be chosen to perform a more accurate segmentation.


- Disadvantages: If unrepresentative segments are created, explainability will be inaccurate.

The examples indicated by the arrow represent the first method, in which each pixel has its own saliency value. The "lime" method
indicates the second method of creating an SM.

Integrated Gradient The


INTGRAD approach can only be used on Deep Neural Networks, therefore it is a model specific approach, through which
a saliency map is created where each pixel has its own saliency value.

This method uses an all-black or all-white baseline image and builds a path of images from the baseline x' to the input image x, gradually transforming the baseline into the input and accumulating the gradients of the network output along this path.
Finally the saliency map is obtained by overlaying the accumulated (integrated) gradients on the image: the higher the integrated gradient of a pixel, the more important it is for the classification.

MASK
The idea is to change the input image x to image xR, such that a region R covering x is found (adding a
blur or noise), thus obtaining a large discrepancy between the two images in terms of probability for a certain
class.
As seen from the following example, the classifier recognizes the flute with a probability of 0.9973, while
in the modified image it recognizes it with a probability of 0.0007. So the learned mask indicates the most
important region (in blue) for recognizing the flute.

Example-based explanations

Example-based explanations are methods that work for any data type and provide local explanations (like saliency maps, feature importance, etc.).

These explanation methods make sense for humans only if they are used to compare one instance with another: they work well for images, tabular data with few features, and short texts.

There are two large families of exact-based methods:

1) Prototypes: prototypes are instances similar to the one under analysis, which obtain the same class
as the one under analysis.
So if an instance X is classified with class y, and an instance X' similar to X is classified with class y, then
X' is said to be a prototype of X.

Among the prototypes we can distinguish Criticisms and Influential Instances:

- A Criticism is still a prototype, classified with the same label as X, but it is not well represented by the set of prototypes, so it is like an outlier.
- Influential Instances are data points that have a big impact on the trained model if they are removed (graph below).

2) Counterfactuals: X'' (which must be similar to X) is counterfactual of X if the classification of X'' is different from y.

Counterfactual Explanation

• A counterfactual explanation describes a causal situation in the form: "If X had not occurred, Y would not have occurred."

• Thinking in counterfactual terms requires imagining a hypothetical reality that contradicts the observed facts.

• Although the relationship between the inputs and the predicted outcome may not be causal, we can view the inputs of a model as the cause of the prediction; this also happens in Machine Learning approaches. In these terms we can say that a counterfactual is a sort of causal explanation.

• What we want from a counterfactual is the smallest amount of changes/perturbations of the features of the instance under analysis; that is, a counterfactual explanation of a prediction describes the smallest change in the feature values that changes the prediction to a predefined output.

Generate counterfactual explanations

1. A first simple and naive approach to generate counterfactual explanations is search by trial and error: we randomly change the feature values (random component) of the instance of interest and stop when the desired output is predicted → a random value is taken, the output is observed, and if no counterfactual is found, two features are changed, and so on.

2. As an alternative we have an optimized approach: we can define a loss function that considers the instance of interest, the counterfactual and the desired (counterfactual) outcome. Then we can find the counterfactual explanation that minimizes this loss using an optimization algorithm.

→ Many methods proceed this way, but they differ in the definition of the loss function and in the optimization method; we present one:

Optimize CF search

Algorithm proposed by Sandra Wachter et al., with:

x = the instance under analysis

x' = the modified instance (the counterfactual)

y' = the outcome we want to achieve

→ We start from f(x) = y and we want to reach f(x') = y', with y ≠ y'.

In the slide it is possible to see the loss function: the first term pushes (f(x') − y')² towards 0, i.e. it pushes f(x') towards y', while the second term is the distance d(x, x') between the two instances, which must be minimized; λ balances the two terms:

L(x, x', y', λ) = λ · (f(x') − y')² + d(x, x')

For the first term, either the strict equality between the two classes or the probability of obtaining a certain class can be used; the latter is a bit more flexible for the optimization procedure.

How the optimized CF search works:

1. We start by sampling a random instance as the initial counterfactual candidate x'.

2. The loss function L is optimized (with respect to x').

3. While |f(x') − y'| is above a certain tolerance, i.e. the prediction for x' is not yet close enough to the desired outcome y':

4. increase λ and go back to point 2.

5. The counterfactual x' that minimizes the loss is returned.

Note: the professor considers counterfactual explanations the most useful ones in real case studies, much more than rule-based explanations, etc.
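A minimal sketch of the simple trial-and-error counterfactual search described above (random perturbations of x until the predicted class changes, keeping the smallest change found); the model and data are invented, and the Wachter et al. loss-based version would replace this random search with an optimizer:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
black_box = RandomForestClassifier(random_state=0).fit(X, y)

def trial_and_error_counterfactual(x, desired_class, predict, n_trials=5000):
    rng = np.random.default_rng(0)
    best, best_dist = None, np.inf
    scale = X.std(axis=0)
    for _ in range(n_trials):
        # randomly perturb a random subset of the features of x
        mask = rng.random(x.shape[0]) < 0.5
        x_cf = x + mask * rng.normal(scale=scale)
        if predict([x_cf])[0] == desired_class:
            dist = np.abs(x_cf - x).sum()          # prefer the smallest change
            if dist < best_dist:
                best, best_dist = x_cf, dist
    return best

x = X[0]
y_hat = black_box.predict([x])[0]
cf = trial_and_error_counterfactual(x, desired_class=1 - y_hat, predict=black_box.predict)
print(x, cf)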

Partial dependency plot

In the literature we have another family of explainers, called visual inspectors (or inspectors for black boxes); the most famous of them is the partial dependence plot.
The partial dependence plot (PDP) shows the marginal effect that a feature has on the predicted outcome of a model → it can be seen as a simplification of SHAP: it essentially modifies one feature while keeping the others fixed.

In particular, the partial dependence function tells us, for a given value (or values) of the feature(s) S, the mean marginal effect on the prediction:

f_S(x_S) = (1/n) · Σ i=1..n f(x_S, x_C(i))

where the x_C(i) are the actual values in the dataset of the features we are not interested in, and n is the number of instances.

Looking at the picture we can see the steps:

- We start from x, the instance under analysis, and change one feature, keeping the others fixed; in the example Age goes from 50 to 53.

- On the right of the figure we can see the partial dependence plot for Age. The two lines represent the probability of high and low risk as Age varies. We can see that Age > 55, other things being equal, creates a jump between low and high risk: the probability of high risk grows a lot.

→ It is therefore a matter of introducing perturbations on the input values to understand how much each feature affects the prediction, using the PDP.
→ The input is changed one variable at a time.
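A minimal manual sketch of the partial dependence computation for one feature of an invented dataset (scikit-learn also offers this via sklearn.inspection, but the explicit loop makes the averaging over the other features visible):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=500, n_features=4, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)

def partial_dependence(model, X, feature, grid):
    pdp = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, feature] = v                     # fix the feature of interest to the grid value
        pdp.append(model.predict(X_mod).mean())   # average over the real values of the other features
    return np.array(pdp)

grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 20)
print(partial_dependence(model, X, feature=0, grid=grid))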

Open the black box! The professor invites us to take responsibility against the unwanted effects of automated decision-making processes; this is needed:

• To reveal and protect against new vulnerabilities

• To implement the "right to explanation"

• To improve industry standards for the development of AI-powered products, increasing the trust between companies and consumers

• To help people make better decisions

• To align algorithms with human values



• Preserve (and expand) human autonomy

To achieve these results, however, there are some open research questions:

• There is no agreement on what an explanation is and there is still no formalism for them

• How to evaluate the goodness of the explanations?

• There is no work that seriously addresses the problem of quantifying the degree of intelligibility of an
explanation for humans

• What if there is a cost to query a black box?

→ Python examples
