DM2 Notes Latest
DATA MINING 2
Module 1: Imbalanced learning and anomaly detection (CRISP, evaluation, imbalanced learning, anomaly detection)
Module 2: Advanced classification methods (naïve bayes classifier, rule-based classifiers, logistic regression, support
vector machines, ensemble, neural networks).
Module 3: Time series (similarity, approximation, motifs, shapelets, classification, clustering).
Module 4: sequential patterns and advanced clustering (sequential pattern mining, x-means, OPTICS,
transactional clustering).
Module 5: ethical principles.
KNN extension
It is an instance-based, memory-based classifier, so there is no real training phase.
Nearest-neighbor classification is part of a more general technique known as instance-based learning, which
uses specific training instances to make predictions without having to maintain an abstraction.
Instance-based learning algorithms require a proximity measure to determine the similarity or distance between
instances and a classification function that returns the predicted class of a test instance based on its proximity to
other instances.
Advantage: it can adapt to data not previously taken into account, simply by storing a new instance or eliminating an old one.
Disadvantage: it is a lazy learner, so it does not build a model explicitly; moreover, classifying unknown records is relatively expensive: with n training instances, classifying a single instance is O(n) in the worst case, because all n instances must be scanned.
Main elements:
1. Training set of stored records
2. Distance metric to compute the distance between records
3. The value of k, the number of nearest neighbors to find
• If k is too small, the classifier is sensitive to noise points and can overfit due to noise in the training set.
• If k is too large, the neighborhood may include points of other classes.
In general we use k= sqrt(N) where N is the number of samples in the training dataset
Euclidean distance:
The problem with Euclidean distance is that it suffers from the curse of dimensionality with high-dimensional data.
Solution: normalize the vectors to unit length. Attributes can also be scaled (e.g. with a z-score) to prevent distance measurements from being dominated by one attribute.
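To make the above concrete, a minimal sketch of k-NN classification with scikit-learn (an assumption: the library is available; the dataset is synthetic and only illustrative), using the k = sqrt(N) heuristic and z-score scaling:

```python
# Hedged sketch: k-NN classification with scikit-learn, using the k = sqrt(N)
# rule of thumb and z-score scaling mentioned above (data is synthetic).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

scaler = StandardScaler().fit(X_train)          # z-score scaling, fitted on the training set only
k = int(np.sqrt(len(X_train)))                  # k = sqrt(N) heuristic
knn = KNeighborsClassifier(n_neighbors=k, metric="euclidean")
knn.fit(scaler.transform(X_train), y_train)     # "training" is just storing the instances
print(knn.score(scaler.transform(X_test), y_test))
```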
PEBLS (Parallel Exemplar-Based Learning System): it also allows working with categorical attributes. PEBLS is a nearest-neighbor learning system (k=1) designed for applications where the instances have symbolic feature values; in fact it works with both continuous and nominal attributes.
For nominal attributes, the distance between two nominal values is computed using the Modified Value Difference Metric (MVDM):
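The MVDM formula itself is not in the notes; a hedged reconstruction of the standard formulation, for two values V1 and V2 of a nominal attribute, is:

d(V1, V2) = Σ_i | n1i/n1 − n2i/n2 |^k

where n1i is the number of records with value V1 and class i, n1 the total number of records with value V1 (and similarly n2i, n2 for V2); k is usually 1 or 2.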
CRISP-DM
Stages:
• Business understanding: understand the objective of the project, the requirements and the definition of the DM problem
• Data understanding: initial data collection and familiarization, identification of data quality problems
• Data preparation: prepare the data: tables, records, attribute selection, data transformation and cleaning (90% of the time)
• Modeling: model selection, application of techniques, parameter calibration
• Evaluation: evaluation of the achievement of business objectives (KPI indicators) and issues
• Deployment (distribution): deployment of the resulting model, implementation of a repeatable DM process
Business Understanding:
- Determine business objectives:
• Comprehensive understanding, from a business perspective, of what the customer really wants to achieve
• Discovering the important factors (at the outset) that can influence the outcome of the project
• Neglecting this step means spending a great deal of effort to produce the right answer to the wrong question
- Assess situation:
• more detailed fact finding about all resources, constraints (e.g. privacy), assumptions and other factors that need to be considered
• examine the details
- Determine DM goals:
• A business goal states objectives in business terms: “increase catalog sales to existing customers”
• A DM goal states objectives in technical terms: “predict how many widgets a customer will buy, given their purchases over the past three years, demographic information and the price of the item.”
Data understanding: explore the data, verify the quality, find outliers
- Collect initial data:
• Acquire within the project the data listed in the project resources
• Includes data loading if necessary for data understanding
• Possibly leads to initial data preparation steps
• If acquiring multiple data sources, integration is an additional issue, either here or in the later data preparation phase
- Describe data:
• Examine the “gross” or “surface” properties of the acquired data
• Report on the results
- Explore data:
• Tackles the data mining questions, which can be addressed using querying, visualization and reporting, including: distribution of key attributes, through aggregations; relations between pairs of attributes; properties of significant subpopulations
• May address directly the data mining goals
• May contribute to data description and quality reports
• May feed into the transformation and other data preparation needed
- Verify data quality:
• Examine the quality of the data, addressing questions such as: “Is the data complete?”, “Are there missing values in the data?”
Data preparation (90% of the time): collection, assessment, consolidation and cleaning, data selection, transformation:
- Select data:
• Decide on the data to be used for analysis
• Criteria include relevance to the data mining goals, quality and technical constraints such as limits on data volume or data types
• Covers selection of attributes as well as selection of records in a table
- Clean data:
• Raise the data quality to the level required by the selected analysis techniques
• May involve selection of clean subsets of the data, the insertion of suitable defaults or more ambitious techniques such as the estimation of missing data by modeling
- Construct data: constructive data preparation operations such as the production of derived attributes, entire
new records or transformed values for existing attributes
- Integrate data: methods whereby information is combined from multiple tables or records to
create new records or values
- Format data: formatting transformations refer to primarily syntactic modifications made to the data that do not
change its meaning, but might be required by the modeling tool
Modelling:
- Select the modeling technique (based upon the DM objectives)
- Build the model (parameter settings)
- Assess the model (rank the models)
- Select modeling technique:
• Select the actual modeling technique that is to be used, e.g. decision tree, neural network
• If multiple techniques are applied, perform this task for each technique separately
- Generate test design:
• Before actually building a model, generate a procedure or mechanism to test the model's quality and validity. E.g. in classification it is common to use error rates as quality measures for data mining models; therefore the dataset is typically separated into a train and a test set, the model is built on the train set and its quality is estimated on the separate test set
- Build model:
• Run the modeling tool on the prepared dataset to create one or more models
- Assess model:
• The data mining engineer interprets the models according to his domain knowledge, the data mining success criteria and the desired test design
• Judges the success of the application of modeling and discovery techniques in technical terms
• Contacts business analysts and domain experts later in order to discuss the data mining results in the business context
• Considers only the models, whereas the evaluation phase also takes into account all the other results that were produced in the course of the project
Deployment: determining how the results are to be used and who will use them, i.e. who will be the end user, and how often the deployed DM results should be used.
The knowledge gained will need to be organized and presented in a way that the customer can use it.
However, depending on the requirements, the deployment phase can be as simple as generating a report or
as complex as implementing a repeatable data mining process across the enterprise.
- Plan deployment:
• In order to deploy the data mining result(s) into the business, take the evaluation results and conclude a strategy for deployment
• Document the procedure for later deployment
- Plan monitoring and maintenance: over time things change (shopping attitudes, for example), and the model needs to be updated
• Important if the data mining results become part of the day-to-day business and its environment
• Helps to avoid unnecessarily long periods of incorrect usage of data mining results
• Needs a detailed plan of the monitoring process
• Takes into account the specific type of deployment
- Produce final report:
• The project leader and his team write up a final report
• May be only a summary of the project and its experiences
• May be a final and comprehensive presentation of the data mining result(s)
- Review project:
• Assess what went right and what went wrong, what was done well and what needs to be improved
So CRISP is useful in DM because it provides a uniform framework for guidelines and experience documentation. It is also flexible enough to handle different business/agency issues and different data.
Performance evaluations
The confusion matrix is a measure of the performance of our model, based on its predictive ability.
• True positive (TP = a), or f++: the number of positive examples correctly predicted by the classification model.
• False negative (FN = b), or f+−: the number of positive examples wrongly predicted as negative by the classification model.
• False positive (FP = c), or f−+: the number of negative examples wrongly predicted as positive by the classification model.
• True negative (TN = d), or f−−: the number of negative examples correctly predicted by the classification model.
• Thanks to the confusion matrix it is possible to calculate the accuracy of our model.
• How is the accuracy calculated? We divide the number of correct predictions in the table by the overall number of records.
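Written out from the confusion-matrix entries:

Accuracy = (TP + TN) / (TP + TN + FP + FN) = (a + d) / (a + b + c + d)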
Limitation of Accuracy
• Since the accuracy measure treats every class as equally important, it may not be suitable for analyzing
imbalanced data sets, where the rare class is considered more interesting than the majority class.
Cost matrix
The cost matrix allows me to evaluate a model, beyond plain accuracy, without changing anything in the training or the test: we only change the evaluation.
A cost matrix encodes the penalty of classifying records from one class as another.
Let C(i, j) denote the cost of predicting a record from class i as class j.
With this notation, C(+, −) is the cost of committing a false negative error, while C(−, +) is the cost of generating a false alarm.
A negative entry in the cost matrix represents the reward for making a correct classification. Given a collection of N test records, the overall cost of a model M is:
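A hedged reconstruction of the missing formula, written from the confusion-matrix counts:

Cost(M) = TP × C(+, +) + FN × C(+, −) + FP × C(−, +) + TN × C(−, −)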
• Consider the cost matrix: the cost of committing a false negative error is a hundred times larger than the cost of committing a false alarm. In other words, failing to detect any positive example is as bad as committing a hundred false alarms.
• Cost is something we would like to reduce as much as possible.
• Final result: despite improving both its true positive and false positive counts, model M2 is still inferior, since the improvement comes at the expense of increasing the more costly false negative errors. A standard accuracy measure would have preferred model M2 over M1.
Cost vs Accuracy
If we compare the two, we see that the accuracy is proportional to the cost, if the cost given to false positives and false negatives is the same (= q) and the cost given to true positives and true negatives is the same (= p).
Finally we calculate the final cost with the final formula.
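A hedged reconstruction of that final formula: with C(+,+) = C(−,−) = p, C(+,−) = C(−,+) = q and N test records,

Cost = p × (TP + TN) + q × (FP + FN) = p × N × Accuracy + q × N × (1 − Accuracy) = N × [q − (q − p) × Accuracy]

so cost and accuracy determine each other linearly.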
Using the cost matrix we can change the way we evaluate performance, and this helps us evaluate a classifier working with a large dataset, alongside the accuracy.
Cost-Sensitive Measures
Measures created from the confusion matrix:
• Precision determines the fraction of records that actually turns out to be positive in the group the classifier has
declared as a positive class. The higher the precision is, the lower the number of false positive errors committed by the classifier.
• Recall measures the fraction of positive examples correctly predicted by the classifier.
The fraction of positive records that the model correctly recognized as positive. The measure says how much correct
prediction can cover the actual set of positive cases.
• Precision and recall can be summarized into another metric known as the F1 measure. In principle, F1 represents a harmonic
mean between recall and precision, ie:
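The formulas (standard definitions, written from the confusion-matrix counts):

Precision p = TP / (TP + FP)
Recall r = TP / (TP + FN)
F1 = 2·r·p / (r + p) = 2·TP / (2·TP + FP + FN)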
Data partitioning
1. Holdout: the dataset is usually divided into 2 parts: train (70%) and test (30%). The train set is used to train the model and the test set to test it, to see if we get good results. If the result is not good we go back and apply another model.
2. Holdout with validation: we fix the train and the validation set, i.e. the same operation as in 1. is repeated, but directly inside the train set. The train, as in 1., can be biased due to the random data selection we make, and as a consequence so can the validation.
3. Cross validation: when we do cross validation we are applying a sort of 2., because the validation set is taken from the data partition. In this model we always try to find the best parameters as in 2., and then we test the model. In this case, unlike 1. and 2., we repeat the operation of 2. k times, with different partitions depending on how many times we want to repeat it. There is a validation part here too, but it moves every time we analyze one of the k parts, to find the biases.
Tip: first do a holdout separation, play with the data and feature engineering, and then test the model at the end. It is not recommended to do cross validation directly on the whole dataset, because otherwise we would have parts as big as the train.
Cross Validation, considering TIME
ROC curve (Receiver Operating Characteristic):
• It illustrates the ability of a binary classifier as its discrimination threshold (THR) is varied.
• The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various THRs.
• The TPR = TP / (TP + FN) is also known as sensitivity, recall or probability of detection.
• The FPR = FP / (TN + FP) is also known as probability of false alarm and can be calculated as (1 − specificity).
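A minimal sketch of how a ROC curve can be computed in practice with scikit-learn (an assumption: the library is available; the classifier and data are only illustrative):

```python
# Hedged sketch: plotting a ROC curve from the scores of a probabilistic classifier.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]          # probability of the positive class
fpr, tpr, thresholds = roc_curve(y_te, scores)  # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y_te, scores))

plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], "--")                  # random-guessing diagonal
plt.xlabel("FPR"); plt.ylabel("TPR"); plt.show()
```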
If we have:
- a 1-dimensional data set containing 2 classes (positive and negative)
- any point located at x > t is classified as positive
then, given a certain threshold, we can classify all the instances in the test set and calculate the FPR and TPR.
(FPR, TPR):
• (0,0): declare everything to be negative class
• (1,1): declare everything to be positive class
• (0,1): ideal
• Diagonal line: random guessing
• Below the diagonal line: prediction is opposite of the true class
Lift chart
• The lift curve is a popular technique in direct marketing.
• The input is a dataset that has been “scored” by appending to each case the estimated probability that it will belong to a given class; therefore the inputs are the same as for the ROC.
• The cumulative lift chart (also called gains chart) is constructed with the cumulative number of cases (in descending order of probability) on the x-axis and the cumulative number of true positives on the y-axis.
• The dashed line is a reference line. For any given number of cases (the x-axis value), it represents the expected number of positives we would predict if we did not have a model but simply selected cases at random. It provides a benchmark against which we can see the performance of the model.
Notice: “Lift chart” is a rather general term, often used to identify also other kinds of plots. Don't get confused!
• From the lift chart we can easily derive an “economical value” plot, e.g. in target marketing.
Given our predictive model, how many customers should we target to maximize income? We could use the lift to understand how many consumers to call to maximize profit:
→ Profit = UnitB*MaxR*Lift(X) − UnitCost*N*X/100
→ UnitB = unit benefit, UnitCost = unit postal cost
→ N = total customers
ROC example:
IMBALANCED LEARNING
Imbalanced classes
Most classification methods assume that classes are reasonably balanced, while in reality classes are often imbalanced (one class is really popular and the other is rare). If a class is rare it doesn't mean it isn't interesting: in the medical field it is interesting to find out how to cure the population affected by HIV, which is 0.4% of the total USA population; e.g. about 2% of credit card accounts are defrauded per year (most fraud detection domains are heavily imbalanced).
- Is a classifier with 70% accuracy good? No, the trivial classifier (always positive) reaches 75%.
To say whether our classifier's accuracy is interesting we need to compare it with the accuracy of the trivial classifier that always returns the majority class (case of perfect balancing).
Random Undersampling
Under-sample the majority class(es) by randomly picking samples, with or without replacement.
As we can see in the right figure, we can still distinguish the decision boundaries. So random undersampling is efficient because it doesn't require any strategy (although, to be sure of the results, it should be repeated several times), and it avoids biases in choosing which samples to keep. It still remains a form of random selection, so this should be considered a weakness, because it could lead to wrong choices.
This approach is more robust than random selection, but it is computationally expensive, as it performs a smart undersampling by removing majority points that have a minority point among their k-NN, and it is sensitive to noise. So, if a minority point is in the nearest neighborhood of a majority point and this causes a misclassification, then this majority point is removed.
Algorithm:
a) pass is initialized at 1.
b) We randomly select an instance x among all instances in the training set D, so we have two different sets, D(1) (D at the first iteration) and E.
c) We initialize D at the next iteration, D(pass+1), as an empty set, and we initialize a counter to 0.
d) Another x is chosen randomly from D (the previous x is not in D anymore), and now x is classified through a nearest-neighbor classification using E as reference set.
e) If the classifier built at step d) finds the right class with a simple approach such as kNN, the new x is added to the set D(pass+1) (D at iteration pass+1). Otherwise, if x is misclassified, it is added to the set E, and the counter is increased by one.
f) The x is removed from this iteration. The instances that are discarded are the closest (because they are the ones in the neighborhood) to every pre-selected instance.
g) If the set D(pass) is not empty, go back to step d) and repeat the process.
h) If the counter is 0, it means that no instances have been misclassified and the algorithm ends, because we reached a situation where a simple classifier such as kNN is a good classifier.
If the counter is not equal to 0, pass is increased and the algorithm goes back to step b) and repeats the process.
So if a minority point is in the nearest neighborhood of a majority point, and this leads to a misclassification, then this majority point is removed.
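The notes seem to describe an edited-nearest-neighbour style of smart undersampling (majority points misclassified by a k-NN built on their neighbourhood are removed). Assuming that reading, a minimal sketch with the imbalanced-learn library:

```python
# Hedged sketch: edited-nearest-neighbours undersampling with imbalanced-learn,
# assuming this is the kind of "smart undersampling" the notes describe.
from collections import Counter
from imblearn.under_sampling import EditedNearestNeighbours
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
enn = EditedNearestNeighbours(n_neighbors=3)   # remove majority points misclassified by their 3-NN
X_res, y_res = enn.fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))
```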
SMOTE (Synthetic Minority Over-sampling Technique)
The minority class is over-sampled by taking each minority class sample and introducing synthetic examples
along the line segments joining any/all of the k minority class nearest neighbors. Depending upon the
amount of over-sampling required, neighbors from the k nearest neighbors are randomly chosen (by
default k=5). Eg, if the amount of over-sampling needed is 200%, only two neighbors from the five are
chosen and one sample is generated in the direction of each.
[it must be considered that this method can influence our work, keep it in mind!]
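A minimal sketch of SMOTE with the imbalanced-learn library (an assumption: the library is available; the dataset is synthetic), using the default k = 5 neighbours mentioned above:

```python
# Hedged sketch: SMOTE oversampling with imbalanced-learn (default k_neighbors=5).
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
sm = SMOTE(k_neighbors=5, random_state=0)
X_res, y_res = sm.fit_resample(X, y)           # synthetic minority points on segments between neighbours
print(Counter(y), "->", Counter(y_res))
```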
The full line is the decision boundary of the classifier without weights, which cuts the red points in half. By giving weights to classes, the decision boundary includes all the red points in one area, so misclassifications are less likely. In other terms: we have a high recall with respect to the minority class.
The objective of the classifier is to find a model which minimizes the total cost (we consider the cost matrix at training time, not to evaluate the performance of the classifier).
With this approach we can treat classification as an optimization problem (minimize the cost of misclassification); we can call this a meta cost-sensitive classifier.
The algorithm needs to compute the expected risk of classifying x with class i:
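A hedged reconstruction of the missing expected-risk formula (MetaCost-style conditional risk):

R(i | x) = Σ_j P(j | x) × C(i, j)

and the classifier predicts the class i that minimizes R(i | x).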
Several classification methods compute scores in terms of probability of belonging to a class, and then assign the class.
Generally we have:
- Score p > 50% → class = Y (p is the probability of being classified as Y)
- Otherwise → class = N
E.g. decision trees have p = #positive / (#positive + #negative) cases in each leaf.
For each THR (in [0, 100]) we get a different set of predictions. If we increase the threshold, we are saying that the confidence must be higher, and we increase the precision by reducing the false positives. Instead, by reducing the THR we increase the recall, because we find more of class Y.
Changing the THR also changes the confusion matrix, which changes all the indicators derived from it: accuracy, true positive rate (TPR), false positive rate (FPR), etc.
The smaller the standard deviation, the narrower the bell (pink bell chart).
Ex. we have 3 data points 9, 9.5, 11 and we want to calculate the total probability of observing all the data, i.e. the joint probability distribution of all the observed data points, taking into account that each data point is generated independently of the others. If all events are independent, the total probability of observing this distribution is the product of observing each data point individually (i.e. the product of the marginal probabilities).
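A small sketch of that computation (the mean and standard deviation below are illustrative assumptions, not values from the notes):

```python
# Hedged sketch: joint likelihood of the observations 9, 9.5, 11 under an assumed Gaussian.
import numpy as np
from scipy.stats import norm

data = np.array([9.0, 9.5, 11.0])
mu, sigma = 10.0, 1.0                                        # assumed parameters, not from the notes
likelihood = np.prod(norm.pdf(data, loc=mu, scale=sigma))    # product of the marginal densities
log_likelihood = np.sum(norm.logpdf(data, loc=mu, scale=sigma))
print(likelihood, log_likelihood)
```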
Outlier = observation that is completely different from all others in the dataset and is as if it were generated by a different
mechanism.
There is a statistical intuition that says that normal objects in the real world are generated following a "generating mechanism",
like statistical processes. Abnormal objects deviate from this generating mechanism = outliers.
Anomalies/outliers = set of data points that are considerably different than the remainder of the data.
The natural implication is that anomalies are relatively rare.
One in a thousand occurs often if you have lots of data, and context is important, e.g. freezing temperatures in July.
Applications of outlier detection can be: fraud detection (recognizing abuses in credit cards), medicines (eg some unusual
symptoms), public health.
Importance of anomaly detection: ozone depletion history. In 1985 three researchers were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels. Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations? The ozone concentrations recorded by the satellite were so low they were being treated as outliers by a computer program and discarded!
Noise is erroneous, perhaps random, values or contaminating objects. Take into consideration for example the weight recorded
incorrectly.
Noise is not that interesting compared to outliers, it doesn't necessarily produce unusual values or objects.
Anomalies can be interesting if they are not the result of noise. Noise and anomalies are related but still distinct concepts.
The number of attributes can create problems. Many outliers are defined in terms of single attributes: height, shape, color. It can be difficult to find an anomaly using all attributes (because you have many dimensions) due to noisy or irrelevant attributes; an object may be anomalous only with respect to some attributes.
Anomaly scoring: we look for a value that is associated with an object and provides a level of outlierness. Many anomaly detection techniques provide only a binary categorization: an object is an anomaly or it isn't, and this is especially true of classification-based approaches.
Other approaches assign a score to all points. This score measures the degree to which an object is an anomaly, and it allows objects to be ranked.
In the end, you often need a binary decision, ex. should this credit card transaction be flagged?
or:
Given a dataset D, find the data points x belonging to D that have the top-n greatest anomaly scores.
or:
Given a dataset D, containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D.
Statistical Approaches
Probabilistic definition of an outlier: an outlier is an object that has a low probability with respect to the probability distribution model of the data.
It usually assumes a parametric model describing the distribution of the data (e.g. normal distribution).
Apply a statistical test that depends on: the data distribution, the distribution parameters (mean, variance) and the expected number of outliers.
Issues: identifying the distribution of the data set (heavy-tailed distributions), the number of attributes, and whether the data is a mixture of distributions.
Assuming a Gaussian distribution, the boundaries can be drawn and all points outside are outliers.
For example with x: age, y: number of exams passed, in the center there will be those students who have passed a typical number of exams at an average age compared to the ages present.
Grubbs' Test
Allows finding outliers in univariate data. It can only be used with one dimension at a time, assuming a normal distribution.
• Detects one outlier at a time, removes the outlier, and repeats. There are two hypotheses: H0: there is no outlier in the data (null hypothesis); HA: there is at least one outlier.
Grubbs' test statistic:
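A hedged reconstruction of the statistic (standard two-sided Grubbs' test):

G = max_i |Yi − Ȳ| / s

where Ȳ is the sample mean and s the sample standard deviation; H0 is rejected at significance level α if

G > ((N − 1)/sqrt(N)) × sqrt( t²_(α/(2N), N−2) / (N − 2 + t²_(α/(2N), N−2)) )

with t_(α/(2N), N−2) the critical value of the t-distribution.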
A two-tailed test is appropriate if the estimated value is greater or less than a certain range of values, for example, whether a test
taker may score above or below a specific range of scores. This method is used for null hypothesis testing and if the estimated
value exists in the critical areas, the alternative hypothesis is accepted over the null hypothesis.
A one-tailed test is appropriate if the estimated value may depart from the reference value in only one direction, left or right, but
not both. An example can be whether a machine produces more than one percent defective products. In this situation, if the
estimated value exists in one of the one-sided critical areas, depending on the direction of interest (greater than or less than), the
alternative hypothesis is accepted over the null hypothesis.
Likelihood Approach
Works with 2 sets: suppose dataset D contains samples from a mixture of two probability distributions: M (the majority distribution) and A (the anomalous distribution).
General approach:
• Initially, assume that all data points belong to M
• Then compute Lt(D): the log likelihood of D at time t
• For each point xt belonging to M, move it to A:
  - Let Lt+1(D) be the new log likelihood
  - Compute the difference, Δ = Lt(D) − Lt+1(D)
  - If Δ > c (some threshold), then xt is declared an anomaly and moved permanently from M to A
Data distribution:
- λ is a parameter that controls the importance of one distribution over the other
- M is the probability distribution estimated from the data; it can be based on any method (naïve Bayes, maximum entropy, etc.)
- A is initially assumed to be a uniform distribution
Likelihood at time t:
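A hedged reconstruction of the missing formula, following the standard mixture formulation D = (1 − λ)·M + λ·A:

Lt(D) = Π_(xi ∈ Mt) (1 − λ)·P_Mt(xi) × Π_(xi ∈ At) λ·P_At(xi)
LLt(D) = |Mt|·log(1 − λ) + Σ_(xi ∈ Mt) log P_Mt(xi) + |At|·log λ + Σ_(xi ∈ At) log P_At(xi)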
Cons:
• In many cases, the data distribution may not be known
• For high-dimensional data, it may be difficult to estimate the true distribution
• Anomalies can distort the parameters of the distribution (mean and standard deviation are very sensitive to outliers)
Depth-based Approaches
The general idea is to look for outliers at the boundaries of the data space, but independently of the statistical distributions. The complicated part is to create convex hull layers to organize the data objects, i.e. a nesting of hulls containing all the data.
Outliers are objects in the outer layers.
Assumption: outliers are located at the boundaries of the data space.
• Similar idea to classical statistical approaches (k = 1 distributions) but independent from the chosen kind of distribution
• Convex hull computation is usually only efficient in 2D / 3D spaces
• Originally outputs a label, but can be extended to scoring easily (take the depth as scoring value)
• Uses a global reference set for outlier detection
• Sample algorithms: ISODEPTH, FDC
Exercise:
Deviation-based Approach
The idea is to find the subset E that maximizes the smoothing factor over all possible subsets of the database, i.e. to find those points that reduce the variance of the dataset the most when removed. We have to select some sort of maximum and minimum number of outliers that we want to find, and then E is set from that.
Discussion:
• Similar idea to classical statistical approaches (k = 1 distributions) but independent from the chosen kind of distribution: we do not assume any particular kind of distribution for the data
• The naïve solution is in O(2^n) for n data objects
• Heuristics like random sampling or best-first search are applied to speed up the process
• Applicable to any data type (depends on the definition of the smoothing function)
• Originally designed as a global method
• Outputs a labeling
Distance-based Approaches
• General idea: the outlierness of a point is judged based on the distance(s) to its neighbors. There are many proposed variants.
• Basic assumption: normal data objects have a dense neighborhood, while outliers are far apart from their neighbors, i.e. they have a less dense neighborhood.
• Approach 1: an object is an outlier if a specified fraction of the objects in the dataset is more than a specified distance away (Knorr, Ng 1998). Some statistical definitions are special cases of this. Here we have a labeling definition of whether the point is an outlier or not.
• Approach 2: the outlier score of an object is the distance to its k-th nearest neighbor. Here we have a score of how much the point is an outlier.
Analysis of Approach 2:
Graphically: the bluer the point, the more it is an inlier; if instead it approaches red, it is an outlier. The distance is calculated with respect to the first nearest neighbor.
One of the possible weaknesses is shown in the graph below: the red dot appears to be more of an outlier than point D, because the distance between the two teal dots is smaller.
For this reason, the second approach is called LOCAL and strongly depends on the reference set (i.e. the single closest neighbor).
We may want to identify two clusters and not a set of separate points recognized as outliers.
In this case we have two clusters and a point D. Point D is recognized as an outlier, while the blue point above cluster 2 is not, because its distance from the closest cluster is much less than that of point D. With different densities the algorithm has problems, because it identifies the blue point as an inlier and point D as an outlier.
• A point p is considered an outlier if at most π percent of all other points have a distance to p less than ε, i.e. it is close to few points.
• The approach is local, it labels the points and has two parameters, i.e. epsilon (ε) and pi (π).
Exercise (A):
We need to take points A and B and calculate the points that are inside the epsilon radius. First let's try with A, then with B. The points in the radius of A are 5, 6, B, 4. We divide by 10 because the points are 10 in total. Since 0.4 is greater than 0.15 (and not less), A is not an outlier.
If we specify k=2, the outlier scores according to the Manhattan distance will be:
- for A = 2
- for B = 2
- for 1 = 4 → because the outlier score is the distance to the second nearest neighbor. This means that 1 is more of an outlier than 4.
• A vertex of the kNN graph that has an indegree less than or equal to T (a user threshold) is an outlier
Example:
We can specify a threshold T=2: a vertex is an outlier if it has fewer than 2 incoming links. In our case E is an outlier; D=4, C=3, A=4, B=4.
Discussion
• The indegree of a vertex in the kNN graph equals the number of reverse kNNs (RkNN = reverse kNN) of the corresponding point
Example:
• The RkNNs of a point p are those data objects having p among their kNNs
• Intuition of the model: outliers are
  • points that are among the kNNs of less than T other points
  • points that have less than T RkNNs
• Outputs an outlier label
• Is a local approach (depending on parameter k)
Cons
• It is generally expensive – O(n²)
• Sensitive to parameters
• Sensitive to variations in density
• Distance becomes less meaningful in high-dimensional space
• Difficult with categorical variables
Density-based Approaches
• General idea
  • Compare the density around a point with the density around its local neighbors
  • The relative density of a point compared to its neighbors is computed as an outlier score
  • Approaches differ in how they estimate density
• Basic assumptions
  • The density around a normal data object is similar to the density around its neighbors
  • The density around an outlier is considerably different from the density around its neighbors
Density can be defined in various ways; the simplest defines the outlier score as the inverse of the density.
• Density-based outlier: the outlier score of an object is the inverse of the density around the object.
• It can be defined in terms of the k nearest neighbors.
Relative Density
• It is better to consider only the relative density, with respect to a given subset of points
• The density of a point relative to that of its k nearest neighbors is considered.
ALGORITHM: we calculate the outlier score of each object for a specified number of neighbors (k) by first computing the density of an object, density(x, k), based on its nearest neighbors. The average density of the neighbors of a point is then calculated and used to compute the average relative density of the point. This quantity indicates whether x is in a denser or sparser region of the neighborhood than its neighbors, and it is taken as the outlier score of x.
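The corresponding formulas (standard definitions, with N(x, k) the set of k nearest neighbors of x):

density(x, k) = ( (1/|N(x, k)|) × Σ_(y ∈ N(x, k)) dist(x, y) )^(−1)
average relative density(x, k) = density(x, k) / ( (1/|N(x, k)|) × Σ_(y ∈ N(x, k)) density(y, k) )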
• The reachability distance of p with respect to o is the maximum between the distance from p to o and the k-th distance of the neighborhood of o.
This means that if I measure the distance from o to A, I replace this distance with the k-distance, i.e. the distance given by the circumference in our case, because point A is inside the circle, i.e. it is closer. For the point p2, which is outside the circle, I instead use the distance dist(p, o).
2. The local reachability distance (lrd) of a point p is calculated as the inverse of the average reachability distances of the kNNs of p. This formula follows the definition above, but uses the reachability distance instead of the plain distance.
Example:
The reachability distance between P and A is the greater of the distance dist(P, A) and the k-distance of A. This means that P is a nearest neighbor of A, but not of C, because P is further away than the k-th nearest neighbor of C.
3. The LOF of p is calculated as the average of the ratios between the lrd of the neighbors of p and the lrd of p, i.e. their sum divided by the cardinality of the kNN of p.
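For reference, the standard LOF formulas (Breunig et al.):

reach-dist_k(p, o) = max{ k-distance(o), dist(p, o) }
lrd_k(p) = ( (1/|kNN(p)|) × Σ_(o ∈ kNN(p)) reach-dist_k(p, o) )^(−1)
LOF_k(p) = (1/|kNN(p)|) × Σ_(o ∈ kNN(p)) lrd_k(o) / lrd_k(p)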
Ex. (k=2, calculate the LOF of A and B; here the professor uses the Manhattan distance, not the reachability distance):
4. We calculate the LOF of A as the sum of the ratios between the local reachability distances of the points in the neighborhood and the local reachability distance of A, divided by the cardinality of the neighborhood:
LOF(A) = (0.66/0.66 + 0.5/0.66) / 2 (cardinality of A's neighborhood) = 0.91
LOF(B) = (0.66/0.66 + 0.66/0.66) / 2 = 1
Properties
When is a point an outlier?
• LOF ≈ 1: the point is in a cluster (region with homogeneous density around the point and its neighbors)
• LOF >> 1: the point is an outlier
Discussion
• The choice of k (MinPts in the original paper) specifies the reference set
• It implements a local approach (the resolution depends on the user's choice of k), i.e. I can decide to consider the outlier score, or define a top number of outliers I want and label them as outliers
• It is in any case a local approach, initially defined for scoring (it assigns an LOF value to each point)
As you can see in the 2D and 3D representations, if the cluster is denser then the LOF values inside it will be lower.
Method
• Compress data points into micro-clusters using the CFs of BIRCH [Zhang et al. 1996]. Focus on the micro-clusters and compute the LOF only for points within the smallest micro-clusters. In this way the computation is optimized, because we refer only to a small subset.
• Derive upper and lower bounds of the reachability distances, lrd-values and LOF-values for points within a micro-cluster
• Compute upper and lower bounds of LOF values for micro-clusters and sort the results w.r.t. ascending lower bound
• You can also decide to prune micro-clusters that cannot accommodate points among the top-n outliers (the n highest LOF values)
• Iteratively refine the remaining micro-clusters and prune points accordingly
COF (Connectivity-based Outlier Factor)
• Motivation for COF: in regions of low density it may be hard to detect outliers, so choosing a low value for k is often not appropriate
• Solution given by COF: treat “low density” and “isolation” differently
• Example
Idea (of INFLO)
• Take the symmetric neighborhood relationship into account, instead of using the simple direct neighborhood. Instead of using the kNN, we consider the k-influence space, defined as the union of the k nearest neighborhood kNN(p) and the reverse nearest neighborhood RkNN(p).
Model
• Density is simply measured by the inverse of the kNN distance, i.e. den(p) = 1/k-distance(p)
• The influenced outlierness of a point p is the sum of the densities of the points in the k-influence space, divided by the cardinality of the k-influence space and normalized with respect to den(p)
• INFLO takes the ratio of the average density of the objects in the neighborhood of a point p (i.e. in kNN(p) ∪ RkNN(p)) to p's density
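In formula (standard INFLO definition, with kIS(p) = kNN(p) ∪ RkNN(p)):

den(p) = 1 / k-distance(p)
INFLO_k(p) = ( (1/|kIS(p)|) × Σ_(o ∈ kIS(p)) den(o) ) / den(p)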
Properties
Similar to LOF:
• INFLO ≈ 1: the point is in a cluster
• INFLO >> 1: the point is an outlier
Discussion
• Outputs an outlier score
• Originally proposed as a local approach (the resolution of the reference set kIS can be adjusted by the user by setting the parameter k)
The density-based approaches, compared to the distance-based approaches, solve the problem of different densities, but they still remain expensive and suffer in high-dimensional spaces.
• A set of many abnormal data objects that are similar to each other would be recognized as a cluster rather than as noise/outliers.
These problems arise especially with k-means, if we assume that clusters with less than a certain threshold of points are outliers: these strange points, because of the large dataset, end up in the same cluster, so we have very different points which are nevertheless grouped in the same cluster and therefore are NOT recognized as outliers.
In DBSCAN, however, we manage to label them as noise.
Clustering-based Approaches
• Clustering-based outlier: an object is a cluster-based outlier if it does not strongly belong to any cluster. So for approaches like DBSCAN it's simple (noise points are outliers).
• For prototype-based clusters, an object is an outlier if it is not close enough to a cluster center. We have to decide on a strategy, which we choose.
• For density-based clusters, an object is an outlier if its density is too low.
• For graph-based clusters, an object is an outlier if it is not well connected. In a sense we can imagine them as the result of a hierarchical clustering, considering the dendrogram: the object is an outlier if, in the dendrogram, it is merged in one of the latest steps and it is alone; therefore the point is certainly very different from the others, otherwise it would have been merged earlier.
• Other issues include the impact of outliers on the clusters and the number of clusters: the presence of outliers can affect the computation of the clustering algorithm. We do not have this problem if we know that there are no outliers in our dataset: when we identify a point, after running the algorithm, we know whether or not it is an outlier. An approach like DBSCAN, for example, labels a point as noise, border or core WHILE the algorithm is being run, so with an external point it struggles. Possible solution: we can consider the labeling of the points in the dataset as frozen and then, when adding another point, we simply estimate whether it is core, border, or noise. In any case, we can't use the clustering algorithm as it is, but we need to make sure that the result can be used.
For example, in the following case we see that it is easy (for example by choosing k = 9) to end up in the situation highlighted: we have 4 well-defined clusters, while 5 are clusters which, in reality, are just single points.
After that, we can compute the distance between a certain point and the centers of all clusters with at least a certain number of points (for example we calculate the distance from the point indicated by the pointer to its center and to all the other cluster centers), then we average and decide on a threshold. This is a distance-based way of reasoning.
Below, let's imagine we have the two clusters we see: for point C the distance is computed with respect to the cluster on the left, while for point D with respect to the one on the right. The distance of point C from the left cluster is greater than the distance of D from the right cluster. Again, this is a density issue, of course: we have a biased evaluation.
If instead we consider the relative distance instead of the absolute distance, we can solve the problem.
Since all of these techniques are unsupervised, we can apply more than one technique for outlier detection and then decide on a policy for which points are outliers (for example, a point is an outlier if a certain fraction of the methods we used say so).
Solutions
• Use more robust distance functions and find full-dimensional outliers: we did this even before with clustering, statistical, density and deviation-based approaches. Here we mean that all the dimensions of the feature space contribute simultaneously to the identification of outliers, so as not to have the dimensionality problem.
• Find outliers in projections (subspaces) of the original feature space: for example, we reduce the dimensionality and then run the algorithm we want with more confidence.
ABOD (Angle-Based Outlier Detection)
• Object o is an outlier if most of the other objects are located in similar directions
• Object o is not an outlier if many other objects are located in varying directions
In the image below we consider directions based on angles. The notion of outlier is similar to that of the depth-based approach, as we assume that the outlier is located at the boundaries instead of at the center.
EXAM QUESTION: we cannot use the depth-based approach in high-dimensional space, as it requires the computation of a convex hull, which means finding the perimeter of the dataset; this is simple in few dimensions but becomes computationally expensive in high-dimensional datasets.
Model
• Consider, for a given point p, the angle between any two instances x and y: I calculate the angle comparing the vector going from p to x and the one going from p to y
• Consider the spectrum of all these angles: we calculate the angle between p and any other possible combination of points in the dataset and then look at the spectrum: if it is small, we have a low ABOD score (i.e. an outlier)
→ The broadness of this spectrum is a score for the outlierness of a point. The spectrum therefore looks at the variance, which is always greater for the inliers: an outlier is a point that has a small variance of the angles to all other points, while an inlier has a large variance of the angles. This is a global method, as it considers all points in the dataset.
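For reference, the original ABOD formulation (Kriegel et al. 2008) also weights the angles by the inverse squared distances; the score is the variance over all pairs x, y in the dataset:

ABOF(p) = VAR_(x, y ∈ D) ( ⟨x − p, y − p⟩ / (‖x − p‖² · ‖y − p‖²) )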
Spectrum:
Properties
• Small ABOD => outlier
• High ABOD => no outlier
ABOD Algorithms
• The naïve algorithm is in O(n^3): computationally expensive
• Approximate algorithm based on random sampling for mining the top-n outliers: for example, if we select a sample of m points, where m is much smaller than n, to calculate the ABOD of the n points, the complexity becomes n*m^2, while in the previous case it was n*n^2 = n^3
- Do not consider all pairs of other points x, y in the database to compute the angles
- Compute ABOD based on samples => a lower bound of the real ABOD
- K-means with a high k: we then use the centroids as sample points to calculate the various angles, so that the solution is slightly more stable; but again, if the choice of k is not reliable, we can end up using as a centroid a point in the middle of the space which represents nothing → bias!
- Filter out points that have a high lower bound
- Refine (compute the exact ABOD value) only for a small number of points: instead of using all n points we choose q = 10 points, applying these q on the sample of m data; the final complexity becomes q*m^2, which is much less than the original, but the result is not as accurate.
Discussion
• Global approach to outlier detection
• Outputs an outlier score
• ABOD is usable for high-dimensional data
Grid-based Approach
• Sparsity coefficient S(C) of a k-dimensional grid cell C: we count how many points we have in a certain cell and subtract the expected amount shown in the slide. If the count is < 0, the points belonging to that cell are outliers, as it means that there are fewer points in the cell than expected, and vice versa.
→ It is a global and labeling method, as it labels all points inside a cell as outliers or inliers. We can turn it into a scoring method by assigning the same score to points that belong to the same cell.
Algorithm
• Find the m grid cells (projections) with the lowest sparsity coefficients; finding the right parameters is crucial → the dimensionality is given, but φ must be defined: if it increases, the model is more accurate, but also more complex (exponentially increasing number of cells, as we have more dimensions)
• The brute-force algorithm is in O(φ^d): we have to calculate the sparsity coefficient for each cell
• Evolutionary algorithm (input: m and the dimensionality of the cells) → to look at if we are interested, but it is not used for the exam
Discussion
• The results need not be the points from the optimal cells
• Very coarse model (all objects that are in a cell with fewer points than expected)
• Quality depends on grid resolution and grid position: if we are in 2 dimensions and we have something like the image below, if the grid is placed this way we are splitting two clusters of points
• Outputs a labeling
• Implements a global approach (key criterion: globally expected number of points within a cell)
Model-based Approaches
This is the only family of approaches where a model is “learned” from the data in order to assign an outlier score.
Isolation Forest
Idea: few and different instances can be isolated more quickly. It is the only approach that considers instances that are both different from each other and few (based on Ricky's experience).
• Given the dataset, build a forest of trees.
→ For each tree (so as not to have bias):
1. Get a sample of the data
2. Randomly select a dimension (e.g. dimension y); it is not the same process as in a decision tree, where we look for the best values to split on after calculating the impurity etc.: here we randomly take a dimension and a value
3. Randomly select a value on that dimension. (In the image below, the points have already been selected starting from a larger dataset.)
4. Draw a straight line through the data at that value and split the data: for example I say: if the value is > 2 I go to the right, otherwise to the left. I create a rule.
5. Repeat until the tree is complete: in each leaf I have ONE AND ONLY ONE POINT (or I select a minimum value and finish earlier). In each region identified by the splitting (see image) I have a point.
- Intuition: anomalies will be isolated in only a few steps: outliers are points that can be found in the initial branches of the tree, as they require only a few splits to be isolated. Since the approach is random, it obviously has to be repeated several times.
- Nominal points (i.e. inliers) need more splits to be separated.
So how do I find outliers now? I calculate the number of steps I have to follow in each tree, from root to leaf, to find a specific point, and I average these values. On the left we have a tree, with a very long path in blue (the one for an inlier) and a short one in red (for an outlier); similarly, in the right graph, in blue we see many steps for the inliers, and vice versa. The formula shows how to compute the outlier score, where E, h and c are functions capturing the aforementioned path lengths. The simpler it is to isolate a point from the others with this random structure, the more the point can be considered an outlier, as it means that it is very different (and distant) from the others and therefore is found alone in a region of space.
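A hedged reconstruction of that formula (from the Isolation Forest paper):

s(x, n) = 2^( −E[h(x)] / c(n) )

where h(x) is the path length of x in a tree, E[h(x)] its average over the forest, and c(n) the average path length of an unsuccessful search in a binary search tree with n points (a normalization factor); s close to 1 → outlier, s well below 0.5 → inlier.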
Isolation Forest
• Computationally efficient
• Parallelizable: we can build all the isolation trees we want in parallel
• Handles high-dimensional data: I don't have to calculate distances considering the various features at each split. It is not a simple approach like the others, but for now it appears to be the most useful and used.
• Inconsistent scoring can be observed: below, in red we have an outlier and in yellow an inlier. Although the regions drawn around the blue points are not too red, but lighter (so those points are less outlier), the model still assumes that the points in the dark red areas are more likely to be outliers.
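A minimal sketch with scikit-learn's IsolationForest (the standard, non-extended variant; the contamination value and the data are illustrative assumptions):

```python
# Hedged sketch: Isolation Forest with scikit-learn on synthetic 2D data.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_inliers = rng.normal(loc=0.0, scale=1.0, size=(300, 2))
X_outliers = rng.uniform(low=-6, high=6, size=(10, 2))
X = np.vstack([X_inliers, X_outliers])

iso = IsolationForest(n_estimators=100, contamination=0.03, random_state=42)
labels = iso.fit_predict(X)          # -1 = outlier, 1 = inlier
scores = iso.score_samples(X)        # the lower the score, the more anomalous
print((labels == -1).sum(), "points flagged as outliers")
```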
Characteristics
Isolation Forest
• Computationally efficient
• Parallelizable
• Handles high-dimensional data
• Inconsistent scoring can be observed
Extended Isolation Forest
• Computationally efficient
• Parallelizable
• Handles high-dimensional data
• Consistent scoring: in the other case we had more of a sort of Gaussian distribution, i.e. points in the center rather than at the edge of the plot, while this time the model understands better: in the image below we see on the left the old result (Isolation Forest) and on the right the new one (Extended Isolation Forest)
Summary
• Different models are based on different assumptions
MODULE 2
Naïve Bayes Classifiers
Bayes Classifier
The Bayes classifier is a probabilistic framework (based on probabilities) for solving classification problems, where P(X=x) is the probability (between 0 and 1, where 0 = the event never happens and 1 = it always happens) that the event X=x happens (Ex: X = “a student attended the DM2 course” and x can be “Yes” or “No” if it is a binary variable).
With two events happening together, we compute the joint probability P(X=x, Y=y), where Y is another event (Ex: X = “student attended DM2 course”, Y = “student has black eyes”).
Instead, we call conditional probability P(Y=y|X=x) the probability of a student having black eyes GIVEN THAT the student attended the DM2 course.
The joint and conditional probabilities are needed to express the relationship between X and Y, which is: P(X,Y) = P(Y|X) P(X) = P(X|Y) P(Y), and which is the basis of the Bayes theorem.
In fact the Bayes theorem is: P(Y|X) = P(X|Y)P(Y) / P(X) → the probability of obtaining a certain outcome Y given the features X is equal to the probability of having certain features X given the outcome Y, multiplied by the probability of the outcome Y, divided by the probability of the features.
With this formulation we can notice that Y is the class variable and X is the set of features describing a certain instance.
• P(Y = 0) = 0.65 → the probability that Team 0 wins is 65%
• P(Y = 1) = 0.35 → the probability that Team 1 wins is 35%
• P(X = 1|Y = 1) = 0.75 → among the games won by Team 1, 75% of them are won playing at home
• P(X = 1|Y = 0) = 0.30 → among the games won by Team 0, 30% of them are won at Team 1's field
• Objective P(Y = 1|X = 1): what is the probability that Team 1 wins the game given that it plays at home?
Solution: apply the Bayes theorem formula P(Y|X) = P(X|Y)P(Y) / P(X):
P(Y = 1|X = 1) = P(X = 1|Y = 1) P(Y = 1) / P(X = 1)
I rewrite the denominator by applying the formula P(X=x) = P(X=x, Y=0) + P(X=x, Y=1):
= 0.75 x 0.35 / (P(X = 1, Y = 1) + P(X = 1, Y = 0))
I apply the relationship formula P(X,Y) = P(Y|X) P(X) = P(X|Y) P(Y):
= 0.75 x 0.35 / (P(X = 1|Y = 1) P(Y = 1) + P(X = 1|Y = 0) P(Y = 0))
Now that all terms are known we can substitute the values:
= 0.75 x 0.35 / (0.75 x 0.35 + 0.30 x 0.65) = 0.5738
→ The probability that Team 1 wins when playing at home is more than 50%
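The same computation, step by step, as a small script (values taken from the example above):

```python
# Hedged sketch: the Team 0 / Team 1 Bayes example computed explicitly.
p_y1 = 0.35                      # P(Y = 1): Team 1 wins
p_y0 = 0.65                      # P(Y = 0): Team 0 wins
p_x1_given_y1 = 0.75             # P(X = 1 | Y = 1): Team 1 wins at home
p_x1_given_y0 = 0.30             # P(X = 1 | Y = 0): Team 0 wins at Team 1's field

# Law of total probability: P(X = 1) = P(X=1|Y=1)P(Y=1) + P(X=1|Y=0)P(Y=0)
p_x1 = p_x1_given_y1 * p_y1 + p_x1_given_y0 * p_y0

# Bayes theorem: P(Y = 1 | X = 1)
p_y1_given_x1 = p_x1_given_y1 * p_y1 / p_x1
print(round(p_y1_given_x1, 4))   # 0.5738
```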
The Naïve Bayes classifier estimates the class-conditional probability by assuming that all the attributes (the columns) are conditionally independent given the class label y. This is why it's called naïve: we assume conditional independence among the different features (which is different from statistical independence between rows of the data, as discussed in anomaly detection). So, if we want to use this classifier, we should make sure that the attributes are independent of each other by checking for correlation, keeping only one attribute out of two highly correlated ones.
Given three variables Y, X1, X2 we say that Y is independent from X1 given X2 if the following condition holds:
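P(Y | X1, X2) = P(Y | X2)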
The probability of Y given X1 and X2 is equal to the probability of Y given X2; this means that Y is independent from X1, so X1 doesn't affect the probability of Y.
With the conditional independence assumption, instead of computing the class-conditional probability for every combination of X, we only have to estimate the conditional probability of each Xi (every value of the columns) given Y. Thus, to classify a record, the naïve Bayes classifier computes the posterior probability for each class Y and takes the class with the maximum posterior as the result.
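As a compact summary (standard naïve Bayes formulation, consistent with the text above):

P(Y | X1, …, Xd) ∝ P(Y) × Π_i P(Xi | Y)
predicted class = argmax_Y P(Y) × Π_i P(Xi | Y)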
Examples:
• P(Evade = Yes) = 3/10 → consequently P(Evade = No) = 7/10
• P(Marital Status = Single|Yes) = 2/3 → P(Marital Status = Single|No) = 2/7
• Use the data to estimate the parameters of the distribution (e.g. mean and standard deviation)
• Once the probability distribution is known, we can use it to estimate the conditional probability P(Xi|y)
Exercises:
m-estimate: m is a parameter and p is a user-specified parameter; we correct the original fraction by adding m*p at the numerator and m at the denominator (e.g. p is the prior probability of observing xi among records with class yj), with m and p specified by the user.
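The corrected fraction described above, written out (standard m-estimate):

P(xi | yj) = (n_c + m·p) / (n + m)

where n is the number of training records of class yj and n_c the number of those records that have value xi.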
In the example, with m = 3 and p = 1/m = 1/3 (i.e. Laplacian estimation) we have:
P(Married|Yes) = (0 + 3*1/3) / (3 + 3) = 1/6
Linear regression
Linear regression is a linear approach to modeling the relationship between a dependent variable Y and one or more independent
(explanatory) variables X.
When there is only one explanatory variable it is called simple linear regression. If instead there are more explanatory variables, the
process is called multiple linear regression. When there are multiple correlated dependent variables, the process is called
multivariate linear regression.
Formally, the regression function is given by E(Y|X=x) : expected value of Y given a specific X (like Bayes). This is the expected
value of Y at X=x.
The ideal or optimal predictor of Y based on X, called regressor, is therefore: f(X) = E(Y | X=x)
The yellow line finds the mean value of the points.
Two different notations but the same formula; the second is more general.
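As a hedged reconstruction, the two notations referenced above are presumably the simple and the (more general) multiple linear regression models:

Simple: Y = β0 + β1·X + ε
Multiple: Y = β0 + β1·X1 + β2·X2 + … + βp·Xp + ε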
Examples:
There are different amounts of TV, radio and newspaper advertising on the x-axis, and the sales amount on the y-axis. The blue lines represent the linear regressions. For example, if the radio budget is 20, how much will we sell? According to the graph the answer is 12.5, but the real observations are the red points and must be considered.
• Linear regressions are often fitted using the least squares approach.
• However, they can be fitted in other ways, such as minimizing a penalized version of the least squares cost
function as in ridge regression (L2-norm penalty) and lasso (L1-norm penalty).
• Tikhonov regularization, also called ridge regression, is a method of regularizing ill-posed problems,
particularly useful for mitigating multicollinearity, which commonly occurs in models with a large number of
parameters (i.e., in multiple linear regression, when several X variables are used to predict a single Y).
[Multicollinearity: is a phenomenon in which one predictor variable in a multiple regression model can be linearly
predicted from the others with a substantial degree of accuracy. In this situation, the coefficient estimates of the
multiple regression may change erratically in response to small changes in the model or the data.
It concerns statistical models expressed through a linear equation, when some or all of the variables are
strongly correlated with each other, making it very difficult, and sometimes impossible, to identify the influence
of the variables separately and also to obtain a sufficiently reliable estimate of their individual effects.]
• Lasso (least absolute shrinkage and selection operator) performs both variable selection and regularization in order
to improve the prediction accuracy and the interpretability of the statistical model it produces.
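A hedged sketch (not from the notes) of the three fitting approaches mentioned above, using scikit-learn; the data is synthetic and the penalty strengths are illustrative.

from sklearn.linear_model import LinearRegression, Ridge, Lasso
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.01 * rng.normal(size=100)        # nearly collinear column (multicollinearity)
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=100)

for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
    model.fit(X, y)
    print(type(model).__name__, model.coef_.round(2))   # lasso tends to zero out redundant coefficients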
R²: the closer the result is to 1, the better; it can also be 0, which means that there is no (linear) relationship
between the prediction and the dependent variable. Its range is from 0 to 1.
The error formulas are the same kind of expression, but in the second the absolute values of the errors are used
instead of the squares (MAE instead of MSE).
Sometimes you can also find the RMSE (root mean squared error, i.e. the square root of the MSE). Its range is
from 0 to infinity.
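A small sketch (not from the notes) computing the metrics above with numpy; the observed and predicted values are hypothetical.

import numpy as np
y = np.array([3.0, 5.0, 7.5, 9.0])
y_pred = np.array([2.8, 5.3, 7.0, 9.4])
mse = np.mean((y - y_pred) ** 2)
mae = np.mean(np.abs(y - y_pred))
rmse = np.sqrt(mse)
r2 = 1 - np.sum((y - y_pred) ** 2) / np.sum((y - np.mean(y)) ** 2)
print(mse, mae, rmse, r2)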
Example:
In the example the intercept is β0 and the coefficient is β1. The black dots are the observations.
The resulting R² is very good (close to 1).
Logistic Regression
Logistic regression is used to fit a curve to data where the dependent variable is binary or dichotomous.
For example: predicting the response to a treatment, where we might code survivors as 1 and those who don't
survive as 0, or pass/fail, win/lose, healthy/sick, etc.
The logit transformation imposes a cumulative, S-shaped function on the data and is easy to work with.
Given an event with probability p of being 1, the odds of that event are given by: odds = p / (1 - p) [p = probability].
The odds of being Sick given a Normal fever value are: Odds(Sick|Normal) = P(Sick) / (1 - P(Sick)) =
(402/4016) / (1 - 402/4016) = 0.1001 / 0.8999 = 0.111. The odds of not being Sick with a Normal value are the
reciprocal: Odds(not Sick|Normal) = 0.8999/0.1001 = 8.99.
So the ratio between the Yes and No counts can be used as odds, conditioned on the fever status. The interesting
thing is that when we move from Normal to High, the odds of being Sick roughly triple: odds ratio = 0.293/0.111 = 2.64,
to be read as: the odds of being Sick are 2.64 times higher with High values.
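A quick sketch (not from the notes) reproducing the numbers above; only the counts for the Normal group appear in the notes, while the odds value 0.293 for the High group is taken as given.

def odds(p):
    return p / (1 - p)

p_sick_normal = 402 / 4016
odds_normal = odds(p_sick_normal)            # ≈ 0.111
odds_high = 0.293                            # value reported in the notes for High fever
print(round(odds_normal, 3), round(odds_high / odds_normal, 2))  # odds ratio ≈ 2.64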
Logit transform
This is connected with the logit transformation, because the logit of a probability is the natural logarithm of the odds,
i.e. the natural logarithm of p/(1-p): logit(p) = ln(odds) = ln(p / (1 - p))
Logistic regression
In logistic regression we look for a model:
logit(p) = ln(p / (1 - p)) = β0 + β1·x
Therefore the log odds (logit) is assumed to be linearly related to the independent variable X.
In the logistic regression model we model the dependent variable Y as the logit of the probability of getting a certain class.
That is, we assume that the log of the odds (i.e. the logit) is linearly related to the independent variable X.
In this way it is possible to solve an ordinary (linear) regression using the least squares method while actually solving a classification task.
Recovering probabilities
Starting from the fact that logit is represented by the natural logarithm of the probability/1-probability, in the first step it is equated to
the linear regression formula.
In the second step it transforms making it exponential, finally p is isolated, and rewritten a little obtaining
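A tiny sketch (not from the notes) of the inverse transformation above; the coefficient values and x are hypothetical.

import math

def prob_from_logit(b0, b1, x):
    # recover p from logit(p) = b0 + b1*x
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

print(prob_from_logit(-1.0, 0.5, 3.0))  # hypothetical β0, β1, x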
Interpretation of Beta1
The ratio between the two odds (at X = x+1 and X = x) is equal to e raised to the power of β1.
The slope exponent e^β1 describes the proportional rate at which the predicted odds change with each successive unit of X.
Example:
One additional hour of study (e.g. passing from 0.50 to 1.50 hours) is estimated to increase the log-odds by 1.5046,
i.e. to multiply the odds by e^1.5046 = 4.5 (e^β1).
For example, for a student who studies 2 hours we have an estimated probability of passing the exam of 0.26. Similarly, for a
student who studies 4 hours, the estimated probability of passing the exam is 0.87.
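A hedged sketch of the example above: β1 = 1.5046 comes from the notes, while the intercept β0 ≈ -4.08 is an assumption chosen to be consistent with the reported probabilities 0.26 and 0.87.

import math
b1 = 1.5046     # slope from the example above
b0 = -4.0777    # assumed intercept, consistent with the reported probabilities

def p_pass(hours):
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * hours)))

print(round(p_pass(2), 2), round(p_pass(4), 2))  # ≈ 0.26 and 0.87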
Logistic regression = we use the same approach as linear regression, but to solve a classification problem. We do this by
transforming the variable y, expressing it as the logit of the probability of observing a certain class.
We use the odds to model the probability because the odds reveal the increase we get when x increases by one unit. We can
express the result with the sigmoid function, bounded between 0 and 1. In this way we represent the relationship between y
and x not as a regression line, but as a function that returns a probability between 0 and 1.
Rule-based classifiers
Classification of records using a collection of "if... then..." rules.
Rule: (condition) → y, where condition is a conjunction of attribute tests and y is the class label.
Example:
In the example there are 5 rules which classify the animal species based on characteristics such as blood type; the final
class is the species, which can therefore be predicted from the attributes.
Question: which records does rule R2, (Give Birth = no) ∧ (Live in water = yes) → Fishes, cover?
a. salmon, eel → this is the right one
b. salmon, eel, dolphin
c. frog, eel
d. frog, turtle
Table for the questions that will be addressed from now on:
What is the coverage of R2: (Give Birth = no) ∧ (Live in water = yes) → Fishes?
a. 3/20  b. 2/10  c. 1/20  d. 2/20 → this is the right one
What is the accuracy of R5: (Live in water = sometimes) → Amphibians?
a. 4/4  b. 2/4 → this is right (frog and salamander)  c. 2/20  d. 2/2
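A small sketch (not from the notes) of how coverage and accuracy of a rule are computed on a toy table; the records below are hypothetical stand-ins for the table in the slides.

# Each record: (gives_birth, lives_in_water, species_class)
records = [
    ("no", "yes", "fishes"), ("no", "yes", "fishes"),   # e.g. salmon, eel
    ("yes", "yes", "mammals"),                          # e.g. dolphin
    ("no", "sometimes", "amphibians"),
] + [("yes", "no", "mammals")] * 16                     # filler to reach 20 records

def covers(r):
    # condition of R2: (Give Birth = no) AND (Live in water = yes)
    return r[0] == "no" and r[1] == "yes"

covered = [r for r in records if covers(r)]
coverage = len(covered) / len(records)                      # fraction of records matching the condition
accuracy = sum(r[2] == "fishes" for r in covered) / len(covered)
print(coverage, accuracy)                                   # 2/20 and 2/2 with this toy data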
• Exhaustive rules
The classifier has exhaustive coverage if it accounts for every possible combination of attribute values:
each record is covered by at least one rule, and no record is left uncovered by all the rules in the rule-based classifier.
You can think of the rule-based classifier as a list of rules; if there is a record that doesn't satisfy any rule in the
list, then the rule set is not exhaustive.
In this case the turtle will be assigned the class of rule R4, since its order of precedence is higher than that of R5.
With this ordering, rules are sorted regardless of the class to which they belong.
Building Classification Rules
Direct method: extract rules directly from data, e.g. RIPPER, CN2, Holte's 1R.
Indirect method: extract rules from other classification models (e.g. decision trees, neural networks, etc.), e.g. C4.5rules.
17/03
This is the idea of RIPPER, one of the direct methods, which belongs to the family of sequential covering methods.
There are two rule-growing strategies:
1. General-to-specific, i.e. adding one condition at a time. Under the general-to-specific strategy, an initial rule r: {} → y is
created, where the left-hand side is an empty set and the right-hand side contains the target class. The rule has poor quality
because it covers all the examples in the training set. New conjuncts are subsequently added to improve the rule's quality.
2. Specific-to-general, trying to generalize. In this case we start from very specific rules that match individual records, and we
remove the conditions that are not shared, i.e. those that differ between records. E.g.: the specific rules in the slide share Refund
and Status, so we remove Income and keep Refund and Status, which will be useful for predicting the Yes class.
The RIPPER method: how does it manage the sequential covering and the various steps of the rules?
→ For a 2-class problem, choose one of the classes as the positive class and the other as the negative class:
- Learn rules only for the positive class.
- If no rule fires for a record, by default its class will be the negative one.
- The rules for the negative class are implicitly obtained as the combination of the learned rules preceded by NOT
(not rule 1, not rule 2, ...: then the record is classified with the negative class). Generally it is a good choice to select the
minority class as the positive class, because it is easier to find precise rules for it and fewer records have to be removed
from the dataset, which is computationally cheaper.
→ For a multi-class problem: sort the classes according to increasing prevalence (fraction of instances belonging to each
class); rules are learned starting from the smallest class, treating the remaining classes as negative, and the process is
repeated; the most frequent class becomes the default.
Multi-class problem question: after classifying a subset of the dataset with a rule, do we delete it, right?
Answer: Yes, that's the covering idea. If a rule covers a piece of data, that data is removed. Then another rule is generated,
we remove the data that matches it, and so on, always looking at the same class.
Once the process is finished for that class, we start it again changing the positive class, creating new rules, and so on.
Rule growing and pruning in RIPPER:
- Grow the rule by adding one conjunct at a time, choosing the one that improves FOIL's information gain.
- Stop when the rule no longer covers negative examples, i.e. when it reaches maximum accuracy on the covered records.
- Immediately prune the rule using incremental reduced error pruning.
- Pruning measure: v = (p - n) / (p + n), where p is the number of positive examples covered by the rule in the validation
set and n is the number of negative examples covered by the rule in the validation set.
- Pruning method: eliminate any trailing sequence of conditions that maximizes v.
E.g.: Let's imagine we have 3 attributes A, B, C. We start with an empty rule and insert a condition that refers to A, for example
"A = blue?", and we compute the FOIL information gain. Then we try "B = high?" and compute the FOIL information gain for it too,
and likewise for "B = low", adding conditions to the rule until it no longer covers negative examples. After this, the rule
is pruned using a metric.
For pruning we have a validation set on which the numbers of covered records p and n are computed, respectively the
positive and negative examples covered. The pruning method removes the trailing conditions that maximize v = (p - n)/(p + n)
(it is a bit like the way the best split was selected in decision trees: add a condition, then another condition, etc., as long
as negative examples are still covered). Finally the best set of conditions is selected according to the FOIL information gain.
However, this is a somewhat expensive process because many conditions have to be tested. In addition, the order in which
conditions are added can change the resulting rule (greedy approach).
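A small sketch (not from the notes) of the two quantities mentioned above, the FOIL information gain and the pruning metric v; the counts passed to the functions are hypothetical.

import math

def foil_gain(p0, n0, p1, n1):
    # p0, n0: positives/negatives covered before adding the condition
    # p1, n1: positives/negatives covered after adding the condition
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

def pruning_v(p, n):
    # metric used by incremental reduced error pruning on the validation set
    return (p - n) / (p + n)

print(round(foil_gain(100, 400, 30, 10), 2))   # hypothetical counts
print(pruning_v(p=30, n=10))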
Rules evaluation:
Rule validation: the goal of this procedure is to understand whether the rule is general enough; the more conditions a rule
has, the more specific it is. So the goal is to test the accuracy (or, in our case, the coverage) of a rule on a dataset that was
not used to create the rule.
Rule pruning: if the accuracy increases by removing conditions, it means that we are generalizing, i.e. the rule we created
was too specific. This is the idea of pruning: to reduce the complexity.
NB: pay attention, because in RIPPER there are two types of pruning, and the one just described is only the one related
to the single rules.
Rule-set optimization: there are therefore two optimizations: one during the construction of the rules, which looks only at the
specific rule; another occurs at the end, where the entire rule set, i.e. the set of all rules, is optimized.
Comprehension questions:
- What type of rules does RIPPER produce? → Mutually Exclusive and Exhaustive: the sequential covering guarantees
exhaustiveness, i.e. that in the end all the records are covered, because for the last (default) class there is an empty
condition that says "everything that is not matched by the other rules is classified using this rule". They are Mutually
Exclusive because of how the rules are constructed: a record is assigned to, and removed by, one specific rule.
- RIPPER builds rules general-to-specific, but then refines (prunes) them aiming at generality.
Example:
In a situation like the one represented (fig. 1) it is possible to have an infinite number of possible solutions (fig. 2).
However, the best hyperplane we can find is B1 (fig. 3), because it is the one that maximizes the margin (fig. 4).
The margin is the distance between the two parallel hyperplanes passing through the closest instances of the two classes:
the distance between b11 and b12 is greater than that between b21 and b22. So we have to find the vector w which
maximizes the margin 2/||w||.
The arrows above the letters indicate vectors, meaning that each element corresponds to more than one feature.
The decision boundary is found by setting w·x + b = 0, while by setting w·x + b = 1 or w·x + b = -1 we find the two margin boundaries.
Concerning the decision boundary: the goal of the SVM is to identify w and b such that the constraints are satisfied.
The distance of a point x from the decision boundary line w·x + b = 0 is |w·x + b| / ||w||.
The distance between a record on the margin and the decision boundary is calculated as 1/||w||, where the denominator
is the norm of w. Consequently, the distance from one side of the margin to the other is 2/||w||. To find the best
hyperplane, the SVM aims to minimize ||w||, which is equivalent to maximizing the distance 2/||w|| between the margins.
• Learning the SVM model is equivalent to determining w and b.
• How do we find w and b? By maximizing the margin, which is equivalent to minimizing the inverse of the margin, i.e.
the objective L(w) = ||w||²/2.
• In the example in the slides, the only non-zero Lagrange multipliers are λ6 = 1.4 and λ8 = 0.6 (the corresponding
records are the support vectors).
Non-separable case:
When the problem is not linearly separable, as in this figure, we will have to account for errors in our solution.
We will therefore have to find the hyperplane that best distinguishes between the two classes while reducing the error.
• To do this, we need to introduce slack variables. The inequality constraints need to be "slacked" to accommodate
nonlinearly separable data. This is done by introducing the slack variables ξi in the constraints of the optimization problem.
• ξ provides an estimate of the error of the decision boundary on misclassified training instances: if we want to measure
how much a point P has been misclassified, we measure the distance between P itself and the closest point of the margin.
So here we will have to minimize one more term than in the linearly separable case, a penalty term C·(Σi ξi)^k, a sort of
adjustment that depends on the slack variables: it penalizes the solution when points are not well classified. This explains
the use of two new user-specified parameters, C and k.
The constraints also change: the slack variable is added to them.
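A hedged sketch (not from the notes) of a soft-margin linear SVM with scikit-learn; the data is synthetic and the value of C is illustrative.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-1.0, size=(50, 2)), rng.normal(loc=1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0)   # C controls the penalty on the slack variables
clf.fit(X, y)
print(clf.coef_, clf.intercept_, len(clf.support_))  # w, b and number of support vectors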
Lesson 03/22/2021
Non-linear SVM
What if we want to solve a non separable problem, using a linear SVM?
The answer is that we can't.
In fact if the decision boundaries were not linear we would have a situation like this (initial data space):
In this case we could not draw a diagonal that separates the data. What can we do to use the SVM then?
We have to try to reframe the problem in a way that allows us to solve it.
In this image it is possible to notice that: if the problem were two-dimensional we will not be able to solve it,
while, considering it as a three-dimensional problem we could separate the points.
A problem that was not separable in two dimensions becomes linearly separable in three
dimensions.
Depending on the data, we need to find the decision boundaries via the classifier.
Knowing the transformation φ, the decision boundary for the SVM moves from w·x to w·φ(x), where φ(x) transforms x
from n dimensions to m dimensions, with usually m >> n (m much bigger than n), because the idea is that by increasing
the number of dimensions it becomes possible to find a linear decision boundary.
Assuming we have this function φ, the optimization problem remains the same (**).
Problems:
• The function φ(x) is difficult to use; in fact it is as if we had a new problem.
• Which type of mapping function φ(x) should be used?
• What do we do if we end up in a very high dimensional space?
• Most calculations involve the dot product φ(xi)·φ(xj).
• We may run into another problem called the "curse of dimensionality".
This is the mathematical formulation (**), but it is not exactly what we solve with the Lagrangian method.
The problem that is actually solved is the dual one, where x only appears as a dot product (xi · xj); this part would be
replaced by φ(xi)·φ(xj). What interests us, therefore, is only the result of this product.
Let's imagine we can compute this quantity with a function k(xi, xj) that replaces (xi · xj).
This procedure is called the kernel trick and consists in defining the kernel function k(xi, xj), which takes two vectors
as input and returns the value of the dot product of the two vectors after the transformation.
Indeed, K is a kernel function expressed only in terms of coordinates of the original space, i.e. it does not use the extra
dimensions introduced by φ, but considers only the dimensions of the original space. This is why it is called a "trick":
we replace the definition of the function φ with the function k.
To solve a classification problem with a non-linear SVM, the hardest part is selecting the correct kernel. Also, depending
on the chosen function there will be new and different hyperparameters that need to be optimized.
Justification for the fact that we can replace xiᵀxj with k(xi, xj):
If each data point were mapped into a high dimensional space with a transform φ: x → φ(x), the dot product in that space
would become φ(xi)ᵀφ(xj).
Example: suppose we are in a two-dimensional space, so each vector has two components, x = [x1, x2]. We adopt the
polynomial function as kernel function: k(xi, xj) = (1 + xiᵀxj)².
We have to show that k(xi, xj) = (1 + xiᵀxj)² can be written as φ(xi)ᵀφ(xj) for a suitable φ.
Expanding the square, the resulting expression can be rearranged as the dot product of two vectors (the same terms,
only grouped by xi first and then by xj), which gives the mapping φ.
• In this way we can prove the property.
• The kernel function implicitly maps to high dimensional spaces (without the need to compute each φ(x) explicitly).
• So the kernel trick allows us to solve non-linearly separable problems using a kernel function. Not all functions can be
used, but only those that respect Mercer's theorem.
• The problem remains the same as seen previously; the only thing that changes is that the product xi·xj is replaced
with the kernel function.
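A small numerical check (not from the notes) that the polynomial kernel (1 + x·y)² equals the dot product after an explicit mapping φ; the particular φ used here is one standard choice for the two-dimensional case described above.

import numpy as np

def phi(v):
    x1, x2 = v
    return np.array([1, np.sqrt(2)*x1, np.sqrt(2)*x2, x1**2, np.sqrt(2)*x1*x2, x2**2])

def poly_kernel(a, b):
    return (1 + a @ b) ** 2

a, b = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(poly_kernel(a, b), phi(a) @ phi(b))   # same value, without/with the explicit mapping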
At the extremum, the partial derivatives of L with respect to w and b must be 0. Taking the derivatives, setting them equal
to 0, substituting them back and simplifying, we obtain the dual problem.
Example:
- When we have a value to classify, it is plugged in place of z² and z. If the resulting value is > 1 it is classified as 1,
if it is less than 1 it is classified as -1.
The obtained function is represented as the blue line and perfectly separates the instances we have.
• Since the learning problem is formulated as a (convex) optimization problem, there are efficient algorithms to find the
global minimum of the objective function (many other methods use greedy approaches and find a local solution).
• Overfitting is addressed by maximizing the margin of the decision boundary, but the user needs to choose the type of
kernel function and the cost parameter.
• It is difficult to handle missing values, because the formulation is purely mathematical, so it is better to remove them
or replace them with other values.
• Robust to noise, thanks to the slack variables.
• High computational complexity for building the model if we have a large dataset.
The result Y is calculated by the sign function, which in this case is the activation function; it receives the weighted
inputs X1, X2, X3 (weights 0.3) and the bias or threshold (t = 0.4).
The general definition of the Linear Perceptron can be represented in the following two ways:
Artificial neurons receive real input values and can output unipolar values {0,1} or bipolar values {-1,+1}
(obviously these outputs are obtained only with a binary class).
Weights are denoted w, sigma or theta (θij) and indicate the strength of the connection from unit j to unit i; in the
Linear Perceptron they indicate the strength between input and output. The greater the weight, the greater the
importance of that input's contribution to the output.
The weights are adjusted through an optimization algorithm that aims to minimize a cost function: for example, a child
learns to do something through rewards and punishments, so the model will try to produce the output that minimizes
the "punishments".
The bias b is a constant that can be incorporated into the dot product as an extra weight multiplying a fixed input
x0 = 1, such that:
The result is passed as a parameter to the activation function, which in the simplest case is the identity function;
otherwise we would have the sign function or the logistic unit.
What contributes to the output? The bias (which has a fixed input = 1) times its coefficient (theta), plus the n input
dimensions times their coefficients.
In the case of the linear activation function the output is y = hθ(x) = σ(θᵀx), given by the dot product of the coefficients
with the x; for the identity activation σ(a) = a = net. σ represents the activation function.
Perceptron
This is a single layer network, which contains only inputs and outputs.
Its activation function is f=sign(w•x).
The application of the method is very simple: you just need to do the dot product between the inputs and the
weights. If it is >=0 the result will be 1, otherwise -1.
- We have this pre-trained linear perceptron (right), where the activation function is drawn.
- Question: provide the classification for these test instances.
- Remember that x0 is always equal to 1.
- We calculate the classification, given the current weights wk (k indicates the k-th iteration).
- We update the weights: the weight at iteration k+1 (wk+1) is equal to the weight at iteration k plus [the learning rate ×
(the real class yi − the classification given by the linear perceptron at the current iteration) × the current value of x].
- All this is repeated until a certain stopping condition is reached. Usually a certain number of iterations is fixed;
another option is stopping when there are no more errors.
If the learning rate is close to 0, the new weight is affected more by the value of the old weight.
If it is close to 1, the new weight is affected more by the current adjustment.
The learning rate can be adaptive: a large λ (close to 1) is preferable at the beginning of the training and gradually
decreases over the iterations, because in the long run we prefer a small λ so that we can keep track of what the model
is doing.
In this exercise the goal is to train a linear perceptron considering three training instances (a, b, c) in two dimensions
(x1, x2), with λ = 0.3 and activation function f = sign.
1) We start with random coefficients w0 = -1 (bias), w1 = 0, w2 = 0, obtaining a dot product XW = -1. So the activation is
equal to -1 (value returned by the classifier), and since Y = -1, we get error = 0 (correct classification).
The quantity to add to each weight (we call it Δ) would be Δ = λ · error · x, but in this case the classification is correct
(error = 0), therefore it is not necessary to update the weights: the quantity inside the square brackets is the error,
and Δ will be 0 for each factor/dimension.
2) Since it hasn't undergone any update, the weights and bias are the same as in the first iteration. Both the dot product
and the activation equal -1, but Y = 1 (instance b), so error = 2.
Δ0 = λ · error · x0, and remember that x0 is always equal to 1.
Δ0 = 0.3 · 2 · 1 = 0.6
Δ1 = 0.3 · 2 · 0 = 0 (where 0 is the value of x1 of instance b)
Δ2 = 0.3 · 2 · 0 = 0
3) In this iteration only w0 varies, going from -1 to -0.4 (we add Δ0), and we use instance c to calculate each factor.
4) We start using instance a again: every time we run out of instances we start again from the beginning.
...
12) The last iterations (10, 11, 12) all have error = 0 and always the same weights, therefore all the classifications are
correct (it could be overfitting). The weights to use are those calculated at the last iteration. In this case we do have
overfitting, so it would have been better to stop earlier.
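A minimal sketch (not from the notes) of the training loop described above; λ = 0.3, the sign activation and the initial weights (-1, 0, 0) follow the exercise, while the three training instances are hypothetical, since the full table is not reproduced in the notes.

import numpy as np
X = np.array([[1, 1, 1], [1, 0, 0], [1, 2, 1]], dtype=float)  # x0 = 1 plus two features (hypothetical instances a, b, c)
y = np.array([-1, 1, -1])                                     # hypothetical labels
w = np.array([-1.0, 0.0, 0.0])                                # initial bias and weights from the exercise
lam = 0.3
for epoch in range(12):
    for xi, yi in zip(X, y):
        pred = 1 if xi @ w >= 0 else -1
        w += lam * (yi - pred) * xi                           # update only when there is an error
print(w)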
In this example we see that the data cannot be separated properly using just one line, so a more complex model
will need to be used.
Questions:
- Which function is not used by a linear perceptron? LASSO (it is a regularization term for the linear regression problem,
NOT an activation function).
- What happens if the learning rate is close to 0? The new weight is mainly influenced by the value of the old weight.
Why now?
Deep learning is therefore a concept that already existed, but it has become successful lately thanks to the elements listed above.
A possible vision
For example, the KNN method sits between machine learning and deep learning without representation learning, while
classical AI is AI without machine learning. In all of this, where does DM fit? It is about data analytics, which isn't really
learning but exploring (as is pattern mining); BUT since data underlies all of these areas, we can say that DM is an
intersection of them.
Deep learning
We move from a Raw data representation to a Higher-level representation by expressing the concepts mathematically
In representation learning the idea is to obtain different levels of abstraction at different points in the structure of the
model that we define, and the goal is to define a deep neural network model.
Last time we saw this problem: with the linear perceptron we were not able to separate the XOR problem. A possible
solution is a decision boundary like the following, but this is not achievable with a linear perceptron alone, which can
only find a linear hyperplane.
• More general activation functions (sigmoid, linear, hyperbolic tangent, etc.). We can adopt different ones for each layer.
• A multi-layer neural network can solve any type of classification task involving nonlinear decision surfaces.
• The perceptron is single layer. Recall how a perceptron with only 2 inputs, one node and one output is made: now we can
think of each hidden node (n3, n4) as a SINGLE perceptron that tries to construct one hyperplane, while the output node
combines their results to return the decision boundary. See the drawing: green for the first and blue for the second. The
last part of the network can be seen as a further perceptron, therefore we have 3 linear perceptrons, where the first 2
identify the red and blue hyperplanes (XOR data), while the last one combines the results to obtain the final decision
boundary.
So this is the general structure of an ANN (which is called ANN or NN when it has at least one hidden layer).
Another possible name is Multilayer NN or Multilayer Perceptron. What we have in each hidden layer and in the output
layer is the same structure as last time: inputs coming from the previous layer's nodes, combined with a dot product with
the weights, an activation function, and an output.
The number of units in the internal nodes does NOT define the depth of a NN; the number of hidden layers does.
When we have a single hidden layer we talk about a NN, otherwise if we have more hidden layers we talk about a Deep NN
(depth given by the number of hidden layers). The number of units inside each layer does NOT determine the depth of the
network. Researchers are also studying the wide NN effect, where wide refers to the number of neurons in each layer.
The number of output nodes and the number of hidden layers are not related: we could also have something like this.
Obviously, the more hidden layers there are, the easier it is to build representations more complex than a binary 0/1 output.
There is no single best way to find the parameters: it depends on many different factors.
Another element that differentiates linear perceptrons from NNs is the different types of activation functions.
This NN is called Dense, as there are no missing connections: every node is connected to every node of the subsequent
layer. It is the simplest and one of the most used architectures.
First of all, how do we calculate the output? The difficult thing is to train a structure like this: in the perceptron we used
the error made on each training record to adjust the weights.
Now, how do we repeat the same thing going from the output back towards the input? Considering that, moreover, we now
have to update not only the weights connecting input and output, but all the weights connecting each pair of layers: we
have many more coefficients. The direction drawn by the professor corresponds to computing the class of a DNN (the
forward pass).
Exercise
We have 2 inputs, 1 hidden layer and 1 output layer. What we read are the trained weights. In each layer we use the
activation function sign(S - 0.2), where S is the weighted sum. We have to label each test instance. The strategy is: for
each neuron we compute the weighted sum, apply the activation function, and the output of one neuron becomes the
input of the following one (in the following layer, following the arrows).
The output of the first node is -1 and we continue for each hidden node (H1, H2, H3).
Here, having noticed that the same inputs give the same outputs, we could have saved time.
NOTE: the activation function is NOT applied to the inputs (which are recognizable because they only have outgoing arrows).
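A minimal forward-pass sketch (not from the notes) for a dense network with the sign(S - 0.2) activation used in the exercise; the weight values are hypothetical, since the trained weights are only shown in the slides.

import numpy as np

def activation(s):
    return np.where(s - 0.2 >= 0, 1, -1)   # sign(S - 0.2), as in the exercise

W_hidden = np.array([[0.5, -0.3], [0.2, 0.8], [-0.6, 0.1]])  # 3 hidden nodes x 2 inputs (hypothetical)
W_output = np.array([0.4, -0.7, 0.3])                        # output node weights (hypothetical)

x = np.array([1.0, -1.0])          # one test instance
h = activation(W_hidden @ x)       # hidden layer outputs (no activation on the inputs themselves)
y = activation(W_output @ h)       # network output
print(h, y)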
Representation learning
These additional layers are a way to extract different levels of representation for the data: the subsequent levels have the
goal of extracting and representing complex concepts in the data, and the ultimate goal is to discriminate between
different classes. The concept learned at each level is not general, but specific to the purpose for which the network was
created: if I have a dataset of images and I have to distinguish between cats and foxes, both have triangle-shaped ears,
so this complex concept is not useful for discriminating between the 2 animals; it is more useful to focus on colour, size,
mouth, tail... The discriminant aspects can be captured at different levels of abstraction. Why do we increase the number
of layers in deep neural networks? Because in this way we can also address non-linearly separable decision problems,
and because the idea is to represent in the various hidden layers complex concepts that can help the final classification.
Activation Functions
A newer activation function, one of the most used (especially to replace the sigmoid), is the hyperbolic tangent: it has
better properties than the sigmoid while keeping roughly the same shape, but the codomain changes: the sigmoid goes
from 0 to 1 and the hyperbolic tangent goes from -1 to 1.
ReLU takes the value 0 for inputs < 0 and the input itself for values > 0: ReLU(z) = max(0, z).
In the figure: blue = ReLU, green = sigmoid, red = softplus (a smooth version of ReLU). The soft version avoids an
output that is exactly 0 everywhere below zero and mitigates the vanishing gradient.
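A small numpy sketch (not from the notes) of the activation functions mentioned above.

import numpy as np
def sigmoid(z):  return 1.0 / (1.0 + np.exp(-z))
def tanh(z):     return np.tanh(z)
def relu(z):     return np.maximum(0.0, z)
def softplus(z): return np.log1p(np.exp(z))   # smooth approximation of ReLU

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z), softplus(z))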
• The perceptron computes the error e = y - f(w, x) and updates the weights accordingly: we look at the difference
between the true outcome and the one produced by the network.
• In the hidden nodes we can't really determine the error, but we can approximate it using the error at the output
nodes → we backpropagate the error from the output nodes towards the input nodes.
• Problems:
1. It is not clear how an adjustment in the hidden nodes (the weights connecting the various layers) affects the overall error.
2. There is no guarantee of convergence to the optimal solution; we can't do much about this.
To address the first problem we can apply a strategy based on gradient descent for training the multilayer NN, which
works like this:
1. Error function to minimize → we define the loss function we want to minimize, where y are the real labels (ground
truth) and f, applied to a set of instances given the weights at the current time, is the output that comes from the NN.
We want to find a global minimum of this loss function.
2. Weight update: the first element indicates the weight at the (k+1)-th iteration and the other the weight at the k-th
iteration. The quantity after lambda (which in the linear perceptron was easy to define) is this time expressed as the
first derivative of the loss function with respect to the weight we have to update. We can solve the problem using a
stochastic gradient descent algorithm only if the function f is differentiable: that's why we can use any activation
function we want in the NN as long as it is differentiable, otherwise we can't backpropagate the error.
3. For the sigmoid function: last time we did NOT use a sigmoid as activation function, but this formula is practically
the same (the same as for the error function to minimize). The derivative of the loss with respect to the weights
becomes the quantity shown for the output if we are in the last layer (see the arrow from the weight update to the
sigmoid function).
This is the final part of the computation for backpropagating the error. It must also be repeated for the other internal
hidden layers.
The techniques used to find these weight updates are the stochastic gradient descent methods. To give a visual idea:
We are interested in the idea of backpropagating the error: we don't have the true errors in the inner layers, so with the
stochastic gradient descent method we backpropagate the error from the output. We then update each weight through
the derivative of the loss with respect to that weight. How do we calculate the derivative?
In training a multilayer NN, given an input x(i) we have these different computations: to update the loss we use the
backward pass, which updates the weights backwards, i.e. first the delta of the output (y) is calculated, then the delta of
layer j-1 (z1), then of j-2 (x1), and the same is done for the hidden layers. The forward pass does the computation needed
to estimate the errors: with it we calculate the deviation of the error at the output layer, and then we can backpropagate
it and do the computation for the earlier layers.
Error Backpropagation
Taking the first derivative of the activation function we obtain only basic operations, without logarithms or exponentials.
It can happen that the delta quantity (i.e. the gradient ∂E/∂w(j)) that is backpropagated, depending on the data and on
the activation function, is very close to 0. Consequently this quantity vanishes, and at each iteration the value at
iteration (k+1) is the same as at iteration (k): the weights stay the same and the NN is not learning anything. This
phenomenon (vanishing gradient) usually occurs in a specific type of NN, the recurrent NN.
Question: the weights are initially random, as they were in perceptrons. It's a kind of parameter: we can have better
initial sets than random ones, as we will see later.
What is a loss function in a NN?
- It must be differentiable (because in the loss function we have the activation function applied to the last layer).
- A possibility is the quadratic function.
• The error/loss/cost function reduces all the various good and bad aspects of a possibly complex system down to a
single number, a scalar value, which allows candidate solutions to be compared. The choice of the loss function is
crucial, since it reduces everything to a single number.
• It is important, therefore, that the function faithfully represents our design goals.
• If we choose a poor error function and obtain unsatisfactory results, the fault is ours for badly specifying the goal of
the search.
• Activation function for the output layer: one node with a linear activation unit (the result returned by the NN).
• One node with a sigmoid activation unit (if the number of classes is K = 2).
• K output nodes in a softmax layer (each output node with its own probability for the classification of that class) (K > 2).
Two examples of loss functions: in the forward pass we calculate the error with the quadratic loss, and in the backward
pass we use its first derivative; or we use cross entropy. The quadratic loss is normally used for K = 2 and the cross
entropy for K > 2.
For the input: k or log2(k) nodes for each categorical attribute with k values, but this can reduce the expressiveness of
the network, so usually OHE is used (or a particular encoding, e.g. ...).
Characteristics of ANN
• Multilayer ANNs are universal approximators but could suffer from overfitting if the network is too large.
• Model building can be very time consuming, but testing can be very fast. We can also download already trained
networks and continue their training before applying them.
• Sensitive to noise in the training data: data should be carefully cleaned, with no bias and no noise. One of the reasons
why NNs have become famous recently is that with big data the noise impacts training less (if it's not a really noisy
dataset).
• Training set: used to update the weights; here the weights are learned. Records in this set are repeatedly presented
in random order. The weight update equation is applied after a certain number of records.
• Validation set: used to decide when to stop training, only by monitoring the error, and to select the best model
configuration. Here we monitor the loss and, when it stops decreasing, we stop the training. The validation set is CRITICAL.
What is a validation set? A dataset used during the training phase to check the NN performance, but not used to calculate
the NN weights. Recall that in RIPPER the validation set is used to prune the rules: we were checking whether a rule was
too general or not, but the rule was already extracted. Here we just check the loss, but we DO NOT use the data in the
validation set to change the weight values.
• Test set: used to test the performance (accuracy, F1, ...) of the neural network. It should not be used as part of the
neural network development and model selection cycle.
- Assess robustness
- Have more opportunities to find optimal results
• Online/stochastic mode: many weight updates, which can give quicker convergence but also makes learning less stable
and more influenced by the initial set of weights.
- Batch mode (off-line or per-epoch): all the records are passed to the NN.
• Weights are updated after all records have been presented (one update per batch/epoch).
• It can be very slow and lead to being trapped in early local minima, since the majority of training records activate the
network's neurons in a particular way, and therefore we get an error assessment which is not good for DIFFERENT
training records in the data.
• Mini-batch mode: weights are updated after a few records (from tens to thousands) are presented → subsets of
training records are selected and classified in the forward step; then the backpropagation step is applied, the weights
are updated, and another mini-batch follows until all the records are analyzed. After this, one epoch has passed and
the second one starts with another random selection of mini-batches.
Convergence Criteria
• Typically 1 presentation of the training set = 1 epoch → all the training records have been classified by the NN once.
• Stop when the absolute rate of change of the average squared error per epoch is sufficiently small, or when the
validation error reaches a minimum → typical criterion. We set a maximum number of epochs (say 100) and start the
training; after each epoch we check the loss on the validation set.
We must note that on the training set the loss decreases, while on the validation set (since it is not the same data used
to adjust the weights) it can happen that the model goes into overfitting: the loss decreases up to a certain point and
then increases. The minimum point is the right one at which to stop the training, otherwise we are overtraining.
Typically we consider a certain number of iterations and, for example, every 5 epochs we store the model and the loss
for validation and training. If after 10 epochs the validation loss has increased while the training loss keeps decreasing,
we can apply this heuristic: stop the computation and take the model saved 10 epochs ago, which was the one with the
lowest observed loss on the validation set, and therefore has the HIGHEST LEVEL OF GENERALIZATION.
Early Stopping
• Running too many epochs may overtrain the network, resulting in overfitting and poor generalization.
• Keep a hold-out validation set and test the error after every epoch. Maintain the weights of the best-performing network
on the validation set and stop training when the error increases beyond this.
• Always let the network run for some epochs before deciding to stop (patience parameter), then backtrack to the best result.
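A minimal early-stopping loop (not from the notes): the training step is omitted and the per-epoch validation losses are simulated; the patience value is an illustrative assumption.

import copy
patience, best_loss, best_model, wait = 3, float("inf"), None, 0
model = {"weights": [0.0]}                               # stand-in for a real network
val_losses = [0.9, 0.7, 0.6, 0.55, 0.57, 0.6, 0.65]      # simulated validation losses per epoch
for epoch, val_loss in enumerate(val_losses):
    # train_one_epoch(model) would go here
    if val_loss < best_loss:
        best_loss, best_model, wait = val_loss, copy.deepcopy(model), 0   # keep best weights so far
    else:
        wait += 1
        if wait >= patience:                             # stop after `patience` epochs without improvement
            print("early stop at epoch", epoch)
            break
print("best validation loss:", best_loss)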
Model Selection
• Too few hidden units prevent the network from adequately fitting the data and learning the concept.
• Too many hidden units lead to overfitting, unless you regularize heavily (e.g. dropout, weight decay, weight penalties).
• Regularization constrains the learning model to avoid overfitting and help improve generalization.
- Add penalty terms to the loss function that punish the model for excessive use of resources.
• Limit the magnitude of the weights used to learn the task: if a weight is too high, a particular combination of features
can have too high a contribution to the final output and lead to overfitting.
Dropout Regularization
It is the most common type of regularization and one of the best ways to prevent overfitting. During training some units
are randomly dropped; the network with dropped units is not used at prediction time, where the full network with the
resulting weights is used instead. When we apply dropout, some weights are not used at a given step and therefore not
adjusted, but eventually all of them are used. If we also use dropout at prediction time, the predictions come with a
confidence interval.
Momentum
• If weight changes tend to have the same sign, the momentum term increases and gradient descent speeds up
convergence on shallow gradients.
• If weight changes tend to have opposing signs, the momentum term decreases and gradient descent slows down to
reduce oscillations (it stabilizes).
• It can help escape being trapped in local minima. For example, if we have this loss function and our initial random
weights are up there, with the stochastic gradient descent method we end up in a local minimum (LM). Without a
technique like momentum, dropout, etc., it is easy for that local minimum to be the one selected, while the model
should notice that there are better solutions that minimize the loss even more and lead to higher accuracy.
- Dropouts
- Add penalty terms to loss
- Reducing the number of hidden units
- Increasing the number of hidden layers
- Using cross-validation
Question: regarding the number of hidden layers, if we change a layer do we have to recalculate the weights?
We start with an architecture and train it from scratch: for example, we start with 5 layers and train with 5 layers; if we
then change the layers, the weights also change.
Lesson 29.03.2021
- Standard Stochastic Gradient Descent (SGD) is one of the best algorithms, where the only requirement is that the
activation function must be differentiable, and it is often used with momentum. Disadvantages: it is difficult to choose
the best learning rate and convergence can be unstable.
- RMSprop is another algorithm, with an adaptive learning rate λ (it starts from a high value and the algorithm decreases
it via a weighted moving average of the squared gradient). This approach speeds up convergence through faster gradients
when needed.
- Adagrad extends this with element-by-element gradient scaling.
- ADAM is like Adagrad but adds an average of the previous gradients, like momentum, which decays exponentially.
Ensemble methods
Aggregate predictions from multiple classifiers to improve accuracy, by building a set of base classifiers from the
training set. The classes predicted by the various base classifiers on the test set are combined to obtain a joint result.
In the field of machine learning, the combination of classifiers allows for more accurate predictions.
For example: suppose we have 25 classifiers, each of which has an error rate ε = 0.35, and the errors are uncorrelated.
The probability that the ensemble classifier makes an incorrect prediction is:
Having a wrong ensemble prediction means that at least 13 classifiers (the majority in this case) out of 25 must make a mistake:
P(ensemble error) = Σ_{i=13}^{25} C(25, i) · 0.35^i · 0.65^(25-i) ≈ 0.06
which is much less than the error of the single classifier (0.35).
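A quick numerical check (not from the notes) of the 0.06 value, using the binomial expression above.

from math import comb
eps, n = 0.35, 25
p_err = sum(comb(n, i) * eps**i * (1 - eps)**(n - i) for i in range(13, n + 1))
print(round(p_err, 3))   # ≈ 0.06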
This graph shows that, as long as the probability of an incorrect prediction of the single classifier is less than 0.5, it is
better to use an ensemble classifier; otherwise it would mean that the single "individual" is doing no better than random
guessing, and the ensemble classifier would be worse than the single one.
- Error1 = 45%
- Error2 = 40%
- Error3 = 35%
We ask ourselves: is it better to use model 3 alone, because it has the best performance, or is it better to use all of them together?
In the previous example, the formula for the probability of the ensemble failing assumes that the errors are all the same;
in this case they are different for each model, so we must explicitly enumerate each case.
The cases we want to avoid are those where at least two models make a classification error (highlighted lines). For each
unfavorable case we calculate the probability of that particular combination by multiplying the respective individual
probabilities, and adding the probabilities of all the unfavorable cases we obtain the total probability (the probability
that one of those four unfavorable cases occurs).
Since the ensemble error probability is 35.15%, it is more convenient to use only model 3, which has a (slightly) lower
error probability.
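A small sketch (not from the notes) reproducing the 35.15% by enumerating the cases where at least two of the three models are wrong.

from itertools import product
errors = [0.45, 0.40, 0.35]
p_ensemble_wrong = 0.0
for outcome in product([0, 1], repeat=3):        # 1 = the model makes an error
    if sum(outcome) >= 2:                        # the majority of the three models is wrong
        p = 1.0
        for e, wrong in zip(errors, outcome):
            p *= e if wrong else (1 - e)
        p_ensemble_wrong += p
print(round(p_ensemble_wrong, 4))                # 0.3515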
Types: bagging and boosting (which manipulate the data distribution) and random forests (which also manipulate the
input features).
Random Forests
It is a class of ensemble methods created for decision trees, which combines the predictions of multiple decision trees
via the mode (majority vote) of the classes predicted by the individual trees. It is important to remember that the trees
are not all the same.
Each decision tree is built on a bootstrap sample (a sample with replacement from the training set), based on the values
of an independent set of random vectors, and with a random selection of the columns.
Let's assume we have a dataset with n records. Let's create 3 decision trees for our random forest. The cardinality of each
bootstrap sample is still n, but the samples don't contain the same instances: the cardinality is maintained, but not all
bootstrap samples necessarily contain all the original instances. This is obtained by randomly selecting instances of the
original dataset, each with probability 1/n.
More precisely, random records are selected with probability 1/n (where n is the length of the dataset), and the generated
samples will have the same length n because some random rows will be repeated several times to fill the spaces. It is as
if each decision tree were built from a dataset that has the same size as the original dataset, with in addition a random
selection of the columns.
Each tree is evaluated on m attributes randomly selected from the M available attributes; usually m is on the order of sqrt(M).
Advantages of random forest: it is one of the most accurate and efficient classification algorithms; it works even on very
large datasets and with a large number of features (and it estimates which features are more important). Another
important aspect is that it does not need data normalization, because the variables are analyzed independently of each other.
Curiosity: a Deep random forest can be built, i.e. an ensemble of the results of several random forests.
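A minimal scikit-learn sketch (not from the notes); the data is synthetic and the hyperparameter values are illustrative.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", bootstrap=True, random_state=0)
rf.fit(X, y)
print(rf.score(X, y), rf.feature_importances_.round(2))   # accuracy and estimated feature importances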
Question: in a Random forest classifier... each base classifier is a Decision Tree and each base classifier works on a
bootstrap sample.
31/03
Bagging
Given the definition of bootstrap, which is essentially sampling with replacement, the idea is that given an original dataset
like the one below, you can decide how many bagging rounds you want to do, in this case 3.
So m = 3 and from the original data we create 3 bootstrap samples; we have 10 records and, for example, in the first round
the tenth record is selected 3 times and the fifth 2 times, etc.
We create these datasets with m = 3 and then train a classifier on each of the bootstrap samples: for example, a KNN for
the first round, one for the second and one for the third; at the end, given a new instance, a prediction is made by
combining the base classifiers, as in Random Forest.
(Random Forest, in addition, also reduces dimensionality by randomly selecting a subset of the columns.)
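A small sketch (not from the notes) of bootstrap sampling and a bagged ensemble; the KNN base classifiers follow the description above, while the data values are hypothetical.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
y = (X[:, 0] > 0).astype(int)                                 # tiny hypothetical dataset of 10 records

models = []
for _ in range(3):                                            # m = 3 bagging rounds
    idx = rng.integers(0, len(X), size=len(X))                # sampling with replacement, same cardinality
    models.append(KNeighborsClassifier(n_neighbors=3).fit(X[idx], y[idx]))

votes = np.array([m.predict(X[:2]) for m in models])          # predictions of each base classifier
print((votes.sum(axis=0) >= 2).astype(int))                   # majority vote for two test points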
Example:
Considering this 1-dimensional dataset:
The base classifier is a decision stump: it can only make a decision on a single feature (in this example we have only one
feature, so it uses x). Note that with a single split it is not possible to separate the classes perfectly:
- Decision rule: x <= k versus x > k
- The split point k is chosen based on the entropy
First step:
In the example 10 stumps are learned; the decision threshold moves according to the data selected in each round.
It can also be represented in a more compressed way, using the majority vote to determine the class of the ensemble classifier:
In this representation we can add up the 1 and -1 predicted as classes, take the sign of the sum and obtain the decision
boundary of the ensemble classifier.
Visual Example:
Question: correct statements on bagging: Random Forest adopts bagging (but bagging is much more general than RF);
in bagging each base classifier is trained on a bootstrap sample of the records.
Boosting
Boosting is similar to bagging, except that each random selection of data for one bootstrap sample affects the next one:
the idea is to increase the probability of picking, in the next bootstrap sample, the instances that were misclassified by
the previously trained classifier.
→ An iterative procedure to adaptively change the distribution of the training data by focusing more on previously
misclassified records.
→ Initially, all records are assigned equal weights.
→ Unlike bagging, weights can change at the end of each boosting round.
→ Records that are classified incorrectly have their weights increased: the greater the weight, the greater the
probability of being selected.
→ Records that are classified correctly have their weights decreased.
For example, in this example it can be seen that since the fourth record appears 5 times, it is difficult to classify, and
therefore it is more likely to be selected in subsequent rounds. Each model is associated with an importance (alpha);
when we classify a new record, we have to weight the classification of each model in relation to its importance.
The final outcome could be 1 even if the majority vote is 0.
Question: correct statements on Boosting: each iteration and sampling affects the next one; correctly classified
instances get a low weight.
AdaBoost
It is an algorithm that builds base classifiers C1, C2, ..., CT.
The error rate of each classifier is calculated depending on the weight that each instance has at the current time:
εi = Σj wj · δ(Ci(xj) ≠ yj)
This is the classification error of a single classifier, weighted (w) with respect to the importance that each instance has
at the current time in the dataset.
We then use the error to compute the importance of the classifier, which is evaluated as:
αi = ½ ln((1 - εi) / εi)
In the graph it can be seen that the domain (the error) is between 0 and 1 and the codomain roughly between -5 and 5,
so the importance will be between -5 and 5 while the error stays between 0 and 1. The importance is large and positive
when the error is close to zero (e.g. an error of 0.2 gives importance about 1.5, while an error of 0.8 gives importance
about -1.5).
AdaBoost algorithm
Each record has an associated weight that tells how important it is for the classification, and the update depends on
whether the classifier at a certain iteration classifies it correctly or not. The new weight equals the old weight times e
raised to the power of -α (the importance): the minus sign serves to reduce the weight of correctly classified records,
while the exponent is positive (+α) if the base classifier at iteration j misclassifies the record.
If an intermediate round produces an error rate greater than 50%, the weights are reset to 1/n and the resampling
procedure is repeated.
Classification:
Algorithm:
In the first iteration all instances have the same weight and we obtain the first stump: the best split classifies all the
records to the left of the red line as -1 and those to the right as 1.
The first stump misclassifies the points from 0.4 onwards and their weight increases (0.1, 0.2, 0.3); at the second iteration
the points from 0.4 to 0.7 are misclassified → the weight of all the others decreases while the weight of those elements increases.
For example, if I want to classify the point 0.5: it is smaller than 0.75, so it is classified as -1 in the first round, 1 in the
second and -1 in the third; I then weight these different results with respect to the importance of each round.
If we want to see a visual example, this time we have something a little different:
Here too the misclassified circles have their importance increased.
The third plot is the weighted strong classifier obtained as the combination of the previous ones.
As the iterations increase, the misclassified instances get closer and closer to the decision boundaries, which is
reasonable because they are the most difficult to separate. The algorithm stops after a predefined number of base
classifiers have been created.
AdaBoost-Reloaded
With this dataset with 3 features, we initially have all records with the same weight: 1 / number of samples.
What we have to do is look for the first stump of the forest. All weights are the same, so we can ignore them for now.
At this point the gain is calculated, with GINI in this case:
• We want to minimize impurity, so the best split is the last one.
• So this is the first stump of our forest, and first we want to calculate the error of the stump, which is 1/8.
• Now we want to determine the importance of the stump for the final classification and to update the weights (with the
importance formula seen before).
• We determine the importance based on how well the stump classifies the samples, in terms of total error.
• The total error is always between 0 and 1.
Since the error is very low we have a high importance; in our case it is ½ log(7) = 0.97. If we had chosen another stump,
like chest pain, we would have had another error, 3/8, and an importance of 0.42.
Now we use the importance (0.97) to increase the weight of the samples that are misclassified and to reduce the weight
of the correctly classified ones.
This graph shows how the scaling factor of the weights varies with respect to the variation of the importance.
In this case, for the misclassified sample, the new weight is 1/8 · e^0.97 = 1/8 · 2.64 = 0.33 > 0.125 (= 1/8).
To update the correctly classified instances, the other part of the formula is used, obtaining: 1/8 · e^(-0.97) = 1/8 · 0.38 = 0.05 < 0.125.
So now we have a new column of weights and a column with the normalized weights, which together add up to 1.
With the new weights, updated and normalized, we replace the old ones (so we replace the values of Sample Weight with
those of Norm. Weight).
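A small sketch (not from the notes) reproducing the importance and weight updates above; the misclassification pattern (1 record wrong out of 8) follows the example.

import numpy as np
n = 8
w = np.full(n, 1 / n)
misclassified = np.array([True] + [False] * (n - 1))    # the stump gets 1 record out of 8 wrong
error = w[misclassified].sum()                          # 1/8
alpha = 0.5 * np.log((1 - error) / error)               # importance ≈ 0.97
w = np.where(misclassified, w * np.exp(alpha), w * np.exp(-alpha))   # increase/decrease the weights
w_norm = w / w.sum()                                    # normalized weights sum to 1
print(round(alpha, 2), w.round(2), w_norm.round(3))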
Now you can use the modified sample weights to build the second stump of the forest.
We have two possibilities: in theory we could reuse the original dataset and adopt the weighted Gini index to determine
which variable should split the next stump. Alternatively, you can create a new training set that contains duplicate copies
of the samples with higher sample weights. Possible example:
Finally, assuming we ran AdaBoost 6 times, given a new patient we want to predict whether he will have heart disease.
You add up the importances of the stumps voting for each class and, as you can see, Heart Disease = Yes has a higher
total importance, so it is more likely that the patient has heart disease.
(There are solutions in the slides; these are questions that could potentially be asked in the exam, e.g. how do you obtain
the normalized weights? This exercise could be asked in the exam.)
Lesson 12/04/2020
TIME SERIES
What is a Time Series?
A TIME SERIES is a collection of observations made sequentially over time, usually at constant time intervals. It can be
relative to one dimension or to multiple dimensions.
• In addition, other data types can be thought of as time series:
• Text data: word counts (the appearance of every word can be considered an instant)
• Images: edge displacement
• Videos: object positioning
• We can do motif discovery (the analogue of pattern discovery) → recognize recurring parts of the time series.
Structural-based Similarities
• For long time series, similarity based on shape gives very poor results. Here it seems that A and B are very similar, but we might find that time series B is evaluated as more similar to C.
• We need to measure similarity based on the high-level structure.
• The basic idea is: 1. extract global features from the time series, 2. create a feature vector and, 3. use it to measure similarity and/or classify.
Examples of features: mean, variance, skewness, kurtosis; mean of the 1st derivative, variance of the 1st derivative, ...; regression parameters, prediction, Markov model.
E.g. time series A has a max value of 11, B of 12, etc.
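A possible sketch of this feature-based comparison (assuming NumPy/SciPy; the feature set below is just an example):

import numpy as np
from scipy.stats import skew, kurtosis

# build a structural feature vector for a time series
def structural_features(ts):
    ts = np.asarray(ts, dtype=float)
    d1 = np.diff(ts)                      # first derivative
    return np.array([
        ts.mean(), ts.var(), skew(ts), kurtosis(ts),
        d1.mean(), d1.var(),
    ])

# compare two series with a traditional (Euclidean) distance on the features
def structural_distance(a, b):
    return np.linalg.norm(structural_features(a) - structural_features(b))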
The more similar the time series are, the more similar the ratio will be.
Ex: if we use hierarchical clustering with Euclidean distances on this dataset, the result will be this: at the end the yellow time series is merged with the red one.
Running the same algorithm with the CDM distance we get the clustering we want:
(when the last property doesn't hold, we have the figure on the right)
How is the Euclidean distance calculated? As the root of the sum of the squared differences at each time instant: D(Q,C) = sqrt(Σ_i (q_i − c_i)²).
In the first image (top left) the shape of the two time series is the same. In the second image (top right), if we calculate the Euclidean distance we find that they are a bit dissimilar.
In the third and fourth images we want to place the time series one above the other, i.e. perform a normalization (offset translation). How does it work? A simple way is to subtract the mean value from the time series, e.g. Q = Q − mean(Q); the same for C.
We have two time series, the blue and the green. With amplitude scaling we not only subtract the mean, but also divide by the standard deviation (this corresponds to the standard scaling normalization). The idea is to see how similar the two time series are without considering the mean and the standard deviation.
What is the difference between this normalization and standard scaling? (He will ask for it in the exam)
When we do standard scaling:
- we take the mean and the standard deviation of column A: mean(A) and STD(A)
- x'_i = (x_i − mean(A)) / STD(A)
and we can invert it (denormalize), because the normalization is applied per column (per attribute), for data that comes from a table. For time series, instead, the normalization is applied per time series: each series is normalized with its own mean and standard deviation.
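A small sketch of the two normalizations being contrasted (assuming each row of X is one time series and each column one timestamp):

import numpy as np

def znorm_series(ts):
    # time-series normalization: each series uses its own mean and std
    ts = np.asarray(ts, dtype=float)
    return (ts - ts.mean()) / ts.std()

def znorm_columns(X):
    # standard scaling on tabular data: one mean/std per column (attribute),
    # which can be stored to invert (denormalize) the transformation later
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)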
To remove noise from the time series, a smoothing function is defined that averages each data point with its neighbors. The idea of smoothing is that points that appear in the same time interval and look similar should end up with similar values.
Moving Average
One of the most common methods for removing noise is the Moving Average (MA), in which a time window of length w and a TS t are defined.
Moving average calculation:
According to this formula, i is the central point and the average is calculated considering the value of the previous data point (i−1) and of the following one (i+1).
Since the mean is calculated using the previous and the next point, the first item in the list has no previous (and the last has no next), so the moving average cannot be calculated there. There are two solutions to this problem:
1) The missing value of ma can be replaced with the next value (in this case 22.0)
2) Or add a row at position 0 with the same value as row 1, and a row after the last one (in this case at position 6) with the same value as the previous one, like this:
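A minimal sketch of the moving average with w = 3, using the padding of solution 2 above (the series values are only an example):

# pad the series at both ends by repeating the first and last values
def moving_average(ts, w=3):
    half = w // 2
    padded = [ts[0]] * half + list(ts) + [ts[-1]] * half
    return [sum(padded[i:i + w]) / w for i in range(len(ts))]

# e.g. moving_average([22.0, 21.0, 23.5, 22.5, 25.0]) smooths the series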
Sometimes two time series that are conceptually equivalent evolve with different speeds, at least
in some moments
In the first image (fixed time axis = sequences are aligned "one to one") we see that there is a fixed alignment up to a certain point, then a misalignment develops.
In the second image (warped time axis) we see how Dynamic Time Warping manages to correct the misalignments. This is why the Euclidean distance is not enough and we need DTW.
This path must have a guarantee of optimality, so that we can say we have discovered the best alignment between Q and C: it is as if we took the two time series, separated them, matched each point with the point that best corresponds to it, and finally returned the sum of these distances. We need dynamic time warping because the Euclidean distance cannot handle time series that grow and decrease at different speeds while keeping the same shape.
Then all the preprocessing required in the Euclidean distance is also required here (amplitude
scaling, noise removal).
One problem DTW may have is running time, because it has to calculate the distance between points for every pair of time series we have. If a time series has length m, computing the distance between two objects costs O(m²), and this has to be multiplied by the number of pairs of time series.
Final questions:
1. Which of the following information can be represented as a time series? Covid-19 infections, temperatures in a region, number of purchase orders received by the company, blood pressure.
2. How to calculate a structural-based distance between two time series?
- define the feature set we are interested in
- compute the same features for the two time series
- create two vectors with the feature values
- apply a traditional distance function between the feature vectors
3. Which of the following transformations of the time series can be used to remove the different variability before shape-based distance calculation? Amplitude scaling, because different variability means that the time series go up and down differently.
4. Which of the following time series transformations can be used to remove a different range effect before shape-based distance calculation? Offset translation
5. Which of the following time series transformations can be used to remove noise? Moving averages
γ(i,j) represents the cost of the best path to reach cell (i,j); the top-right cell contains the total cost of the best alignment, which is the value we want to minimize.
But how is this cost calculated? It is the distance between the two time series at times i and j (i.e. how much the values differ at those two times) plus the minimum of the three cells preceding cell (i,j), i.e. (i−1, j), (i−1, j−1) or (i, j−1):
γ(i,j) = d(q_i, c_j) + min{ γ(i−1, j), γ(i−1, j−1), γ(i, j−1) }
The idea is that, since we construct the path from the bottom-left point to the top-right point, the formula is the one contained in the matrix (to be read from bottom to top).
NB: the path never goes down and never goes back!
1 Step: calculate the matrix of all possible distances between the two time series; i.e. we consider
all possible times of all possible time series and compare the values. Typically the absolute value
of the distances is used, but the squared difference can also be used.
2 Step: calculate the cumulative cost matrix, i.e. the matrix of all the costs of the paths. It starts from the bottom left, i.e. (1,1), copying the first value and calculating the rest of the matrix with the formula, repeating the procedure for all the columns. In the top-right box we will have the result, which corresponds to the final cost between time series Q and C.
3 Step: along the matrix it is possible to find the best path, with the lowest cost, also called "best
alignment" between each pair of points.
Let's assume that we have already calculated the distances from step 1 and start from step 2.
At the beginning we have to copy the value corresponding to box (1,1); in that case the first box is exactly equal to the distance d(q,c), as written in the slide.
After that, to calculate the box above it, we apply the formula, noting however that we do not yet have the 3 numbers to compare: we are in position γ(i,1), i.e. in the first column, and the value to insert is the current value in the cost matrix d(q,c) + the minimum among (the cell of the previous column, which is missing), (the cell of the previous row in the same column) and (another missing value). So the only value accessible to us is the one in the middle, i.e. γ(i−1,1).
We then repeat the procedure for the entire matrix, considering all three values.
First of all we have to build the point-to-point cost matrix (the one represented here) using the absolute difference between the points. To facilitate the calculation, the numbers are shown on the side of the matrix, therefore in the first position we have 2 = |3−5|, then above 4 = |3−7|, 3 = |3−6| and so on.
To calculate the value in the cell above, we do 3 + min(6, missing value, missing value) = 3+6 = 9, and so on. Let's move to the second column, so as to apply the formula properly. In correspondence with the first value of column number 2 of the cumulative cost matrix we have 4, which is given by 2 (blue square) + min(2, missing value, missing value) = 4. In the box above, in correspondence with 2, we have 0 + min(2,6,4), found as written in blue in the slide below, therefore 0+2 = 2.
For the cell above, 3 is given by 1 + min(6,9,2) = 3. We fill the whole matrix and then trace back the path with the lowest cost.
DTW - Exercise 1: Given two time series, calculate the cumulative cost matrix and the distance between the two.
t2 = <3,6,7,0,1>
A) Calculate the distance between t1 and t2 using DTW, with the distance between points calculated as d(x,y) = |x − y|
B) If we repeat the computation of point A, but this time with a Sakoe-Chiba band of width r = 1 for T2 (that is, T2 = <3,6,7,0,1>), is it true that DTW(T1,T2) = DTW(t1,t2)? Discuss the problem without doing the calculations.
→ NB: sometimes it is possible to calculate the point-to-point distance both with the squared difference and with the Manhattan (absolute) difference; he can ask for either in the exam.
We start with the point-to-point cost matrix, calculated here with the absolute difference: |4−3|=1, |4−6|=2, |4−7|=3 and so on throughout the matrix.
Let's move on to the cumulative cost matrix:
- we copy the first value, i.e. 1
- the value above is given by 2 + min(1, x, x) = 3; the value corresponding to the third row is given by 3 + min(3, x, x) = 6, and so on.
- the first value of the second column is given by 0 + min(1, x, x) = 1; the value in position (2,2) is given by 3 + min(3,1,1) = 4; the value above it is 4 + min(6,3,4) = 7; then in position (2,4) we have 3 + min(10,6,7) = 9; the value of the last row is 2 + min(13,10,9) = 11
- we complete the table like this.
= The DTW value, i.e. the final result is the one corresponding to the box at the top right, in our case
4. The result at point A) is therefore 4.
There are many different and famous time series classification datasets that help us recognize faces, understand if a person is pointing a gun or not, etc., and, as said, using a k-NN classifier with the Euclidean distance or with DTW gives different results. Let's see them:
where "r" can be considered as a window that reduces the number of calculations. Term defining the
allowable range of warping for a given point in a sequence.
distances is constant for the whole matrix (dark gray band). you decide
We will simply have to calculate the point to point cost matrix and the cumulative costs matrix
only for the points inside the range ÿ it's like a window that decreases my calculations and that mi
Accuracy vs Width of the warping windowÿ in the slide below we have a resolution
of the previous dataset using Sakoe-Chiba band. along the x axis we have the warping width, while
Machine Translated by Google
on the y-axis the accuracy. We see that for w=1 we have the Euclidean distance, but when we start a
moving towards w = 2 the performance starts to grow more and more until w = 5 (in our case) for
then decrease and remain constant. This means that you practically do not have to calculate all the
possible values.
ÿ obviously, as the size of the band considered increases, the accuracy also increases
we are calculating the Euclidean distance (in fact I only consider the diagonal), as I enlarge it
the accuracy increases to then reach a constant value ÿ once this value is reached, all that is needed
B) the answer is that the result does not change, because r=1 means
moving by one with respect to the diagonal, creating a band limited
by the two blue segments. So we see that the shortest path still
remains within the Sakoe-Chiba band.
C) if we do the inverse of the two time series we get the same results,
despite having both the point to point matrix and the cumulative matrix
different.
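A sketch of DTW restricted to a Sakoe-Chiba band of half-width r (a hypothetical helper, not the implementation from the slides; it assumes the band covers the final cell, i.e. the two lengths differ by at most r):

import numpy as np

# only cells with |i - j| <= r are filled, reducing the computations
def dtw_sakoe_chiba(q, c, r=1):
    n, m = len(q), len(c)
    gamma = np.full((n + 1, m + 1), np.inf)
    gamma[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - r), min(m, i + r) + 1):
            d = abs(q[i - 1] - c[j - 1])
            gamma[i, j] = d + min(gamma[i - 1, j],
                                  gamma[i, j - 1],
                                  gamma[i - 1, j - 1])
    return gamma[n, m]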
The difference with dimensionality reduction is that in reduction pieces are directly eliminated, or techniques such as PCA are applied to the variables, whereas in a time series all the values generally refer to the same attribute (such as temperature, humidity, etc.) observed at different times, so approximating a time series is like a "timestamp reduction".
• Approximation vs Compression: the approximated space is always understandable, while the compressed space is not necessarily understandable.
As we can see, the points describing C were 128, those describing C' are only 8; therefore we have removed 15/16 of the data, keeping 1/16 of it. Note that if we want to reconstruct the time series starting from these Fourier coefficients, it is not certain that we obtain the original TS; this is due to the approximation made. → There are no guarantees of a lossless reconstruction, that is, the reconstruction is not necessarily identical to the initial series.
Instead of choosing the first coefficients (as we have done), it is also possible to take the coefficients we think are the best; this certainly helps to increase quality, but it cannot always be done, and we also need to keep track of which coefficients represent the final TS.
• SVD differs from other methods in that the eigenwaves are data dependent.
• We have already seen that we can consider TSs as points in a high-dimensional space.
• We can rotate the axes so that axis 1 is aligned with the direction of maximum variance, axis 2 is aligned with the direction of maximum variance orthogonal to axis 1, etc.
→ the idea is that, since the first eigenwaves contain most of the variance, the rest of the time series can be truncated while still ensuring a good reconstruction.
I see that there will be some segments (like the middle one)
that don't represent the curve very well. I will now have to
calculate the information by measuring the right length and
height.
- Pros: compress data; it is a good noise filter; this type of representation is able to support some
interesting non-Euclidean similarity measures
- Pros: Extremely fast to calculate; Supports non-Euclidean measures; Supports weighted Euclidean
distance
For completeness: a TS can be segmented using a predefined length w, a predefined number of segments k, or change point detection methods. The latter is the best way because it does not require a parameter whose choice could distort the result.
7) Symbolic Aggregate Approximation (SAX)
- The idea is to convert the data into a discrete format, with a small size alphabet.
Each part of the representation contributes the same amount of information about the shape
of the TS.
-1 step: A Time Series T of length n is divided into w segments of equal size; the values in each
segment are then approximated and replaced by a single coefficient, which is given by their mean. By
aggregating these w coefficients, the PAA representation of T is formed.
Machine Translated by Google
-2 step: Next, we determine the breakpoints that divide the distribution space into α equally probable regions, where α is the user-specified size of the alphabet (the MDL could be used instead of having the user choose it).
- Breakpoints are determined in such a way that the probability of a segment falling into each of the regions is approximately the same. If the symbols were not equiprobable, some substrings would be more likely than others, and we would have injected a probabilistic bias into the process. Once the breakpoints are determined, each region is assigned a symbol. The PAA coefficients can then be easily associated with the symbols corresponding to the regions in which they reside. Symbols are assigned bottom-up, i.e. the PAA coefficient falling in the lowest region is converted to "a", the one above to "b" and so on.
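A rough sketch of PAA + SAX for a z-normalized series and an alphabet of size 4 (the breakpoints below are the standard Gaussian quantiles; names are illustrative):

import numpy as np

def paa(ts, w):
    ts = np.asarray(ts, dtype=float)
    return ts.reshape(w, -1).mean(axis=1)     # assumes len(ts) divisible by w

def sax(ts, w, breakpoints=(-0.67, 0.0, 0.67), alphabet="abcd"):
    symbols = []
    for coeff in paa(ts, w):
        # symbols are assigned bottom-up: lowest region -> 'a', next -> 'b', ...
        idx = sum(coeff > b for b in breakpoints)
        symbols.append(alphabet[idx])
    return "".join(symbols)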
: what we discussed so far was a sort of preprocessing phase for each task with the TS.
Clustering:
It is based on the similarity between time series.
• The most similar data is grouped into clusters, but the clusters themselves should be
different. • These groups to find are not predefined, it is an unsupervised learning
task. • The two general methods of grouping time series are: -
Partitional clustering -
Hierarchical clustering
Hierarchical clustering: • Calculates the pairwise distances and then combines similar clusters bottom-up, without the need to provide the number of clusters. • It is one of the best tools for data exploration, since the dendrogram relates the different time series to the sector of interest. • Its application is limited to small datasets due to its quadratic computational complexity.
Partitional Clustering: • Typically uses K-Means (or some variant) to optimize the objective function by minimizing the intra-cluster sum of squared errors. • K-Means is perhaps the most commonly used clustering algorithm in the literature; one of its shortcomings is the fact that the number of clusters, K, must be pre-specified. • The distance function also plays a fundamental role.
Time Series Clustering Types:
• Whole Clustering: similar to conventional clustering of discrete objects. Given a dataset of individual time series, the goal is to group similar time series into the same cluster.
• Feature-Based Clustering: extract features or time series motifs (see later lessons) as features and use them to cluster the time series. • Compression-based clustering: compress the time series and perform clustering on the compressed versions. • Subsequence clustering: given a single time series, clustering is performed on the individual subsequences extracted from the long time series with a sliding window.
Lesson 04/19/21
Time Series – Matrix Profiles, Motifs & Discords
- TS Pattern mining: Once you find the Motifs, you can apply Pattern Mining and
generate rules.
- TS Classification: We can make predictions using classifiers built on
subsequences
- TS Anomaly detection: You can find anomalies (which we will call Discords)
in the TS, through patterns that are very different from the Motifs.
Example:
We define m=16 (in the slides he calls it "n") in a TS of length 1000. Then we represent the TS using the
SAX method which will convert it into a sequence of letters, every 16 timestamps.
2) Then I randomly choose two columns (in this case columns 1 and 2), and count the occurrences of the letters: the first
pair "ac" is found both in row 1 and in row 58, the pair "bc" is found in row 2 and 985, and so on.
3) I choose two other random columns and repeat the count of occurrences.
4) After a certain number of permutations we obtain a matrix in which, in the x and y axes, we have the indices
of the subparts of the motifs. In the cells instead we have how many times we have found the same
subpart of a motif expressed in terms of SAX.
5) Let's take the maximum value which in this case is 27, which indicates in the axes that the motif of
length 16 is found (and begins) at timestamp 1 and timestamp 58.
Matrix Profile
The Matrix Profile is a data structure that annotates a TS, and is often used to find motifs.
Here too a subsequence length m is specified (it cannot be smaller than 1): moving a window of length m to the right over the TS, we find all the subsequences of length m; there are |T|−m+1 of them, where |T| is the length of the time series.
With this matrix the closest nearest neighbor is found for each subsequence (for each sequence the
comparison with the smallest distance is taken, discarding the others).
Later it is possible to find the Matrix Profile, which corresponds to the distance between a subsequence
and the closest subsequence, and the distance is stored in a vector. Graphically it is as if you were
generating a new TS where a point (black line) will indicate the distance between the yellow subsequence
and its nearest neighbor (probably the portion circled in red).
The Matrix Profile Index, on the other hand, shows where the closest similar subsequence is found:
in the following example, the subsequence (of length 16) at position 20 has the one at position 194 as its
nearest neighbour.
- I update the vector with the "inf" values, inserting the minimum of the values of each cell (at the first
iteration we will have only one value per cell, so the minimum will be the value just calculated).
We ignore the value "0".
- I select another subsequence Tj and repeat the previous steps for calculating the distances,
and finally I apply the minimum operator by updating the initial vector.
- Repeat all steps until each timestamp has been included at least once in a subsequence.
1) We write the TS twice to calculate the distances. The two highlighted time windows are identical, so their distance is 0; this trivial self-match is set to "inf" so that it is ignored.
2) I advance the time window by 1 and calculate the distances, after which I put the minimum in the
matrix. I perform this step until the end of the vector in red.
4) I do all distance calculations for each permutation and in each row I choose the value
minimum. I repeat the steps until the end of the matrix.
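A naive quadratic sketch equivalent to the procedure above; real implementations (e.g. STOMP) are much faster, this is only to fix the idea:

import numpy as np

# for every subsequence of length m, store the distance to its nearest
# neighbour, excluding trivial self-matches (overlapping windows)
def matrix_profile(T, m):
    T = np.asarray(T, dtype=float)
    n_sub = len(T) - m + 1                       # |T| - m + 1 subsequences
    subs = np.array([T[i:i + m] for i in range(n_sub)])
    mp = np.full(n_sub, np.inf)
    mp_index = np.zeros(n_sub, dtype=int)
    for i in range(n_sub):
        for j in range(n_sub):
            if abs(i - j) < m:                   # skip trivial matches near i
                continue
            d = np.linalg.norm(subs[i] - subs[j])
            if d < mp[i]:
                mp[i], mp_index[i] = d, j
    return mp, mp_index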
21/04
Reminder: given a time series T, the matrix profile indicates, for each subsequence, the distance to the most similar subsequence in the same TS. Local minima correspond to motifs, i.e. subparts of the TS that appear multiple times. Given these distances, we can easily find the top-k motifs. How?
Top-K Motifs
Top-1 motif:
- The motif pair is given by the two closest subsequences, i.e. the two points at the smallest distance in the matrix profile.
- I take the distance between the two closest points.
- I draw the radius around the points → if a point falls inside the circle then it is also part of the motif.
The three motifs found can be represented in blue, as in this example: we can therefore represent these 3 parts of the TS with the same shape, 3 times.
Top-2 motif:
To find the closest two points it will be enough to repeat the previous procedure for the points, excluding
those belonging to the top-1 motif.
Along the Matrix Profile we will find the smallest distance, disregarding the distances identified in the
previous iteration. We find the closest one, we find the neighbors of the closest points,
I will then consider the 2 closest points + their neighbors: → the Top-2 motif will have 4 members. Each point inside the circle is part of the top-2 motif.
- Recap: at the first iteration I look for the smallest value in the MP; suppose it is in box Di,j, so I will connect motif j to motif i. After that I will trace the radius around them, calculated as eps = D1*R. If there are points with a distance smaller than eps (i.e. that fall within the radius), they will also be part of the motif. At the next iteration I repeat the steps.
The distance D1 is compared with the other distances in the dataset: if this distance is 5 and I have decided R=2, it means that I will look for distances smaller than or equal to 10. If at the next iteration we have a distance equal to 11, what I will have to calculate will be 11*2 = 22.
The necessary steps to identify an anomaly in a time series are the following:
- I identify a parameter E of subsequences to be excluded near an anomaly; this time we exclude the points that are too close. In our example: E=2.
- I find the subsequence whose distance (from its first nearest neighbour) is the highest in the MP.
The one identified by the arrow is the first major anomaly. Next, we look for the other closest points, those identified by the yellow arrows. We then consider the E=2 closest points (light blue) and eliminate all the anomalies found in the previous steps.
I repeat the procedure: find the points with the greatest distance, look for the closest points E=2 and remove
all three. We will repeat the process until: - I have found a number K of anomalies.
- I use the minimum descriptive length MDL as a stop criterion. typically for
anomalies, it is interesting to see only the Top1s, for motifs instead K=3 or 5. It depends on the length and type
of the Ts that one has.
To summarize:
→ The matrix profile is a model to efficiently describe a single time series; it can be extracted effectively and the process is not particularly affected by the shapes of the series.
→ Given a matrix profile you can extract many patterns: we know how to extract motifs and anomalies. The only thing that is a bit difficult to tune is the parameter m, the one that represents the length of the window used to create the MP → it is what could affect all the calculations. Generally it is better to take a small m, but not too small, about 3 or 4. If you took too large a window you would fall into an already known problem, i.e. the curse of dimensionality! As long as small portions are being compared we can use the Euclidean or the Manhattan distance; if the portions were more substantial, the two distances mentioned would not be particularly reliable: the DTW distance would be more suitable, or we could apply an approximation (in this second case, however, we would be influenced by the choice of the type of approximation).
Given a set X of n time series, X = {x1, x2, …, xn}, where each time series has m ordered values (i.e. the same length m), xi = <xt1, xt2, …, xtm>, and a class value Ci.
Objective: to find a function that maps the set of possible time series to the set of
class values.
Ex → let's assume that each time series is associated with a class, and let's take for example a time series like the pulse measured by a smartwatch: the goal of classification would be to predict its class.
We generally assume that all time series have the same length m.
The methods for TS classification are: a particular type of Deep Neural Networks (convolutional neural networks), but also Ensemble Methods. Both techniques require raw data as input, i.e. all timestamps. We could practically obtain the same performances with a transformation of the data.
SHAPELETS
A shapelet is a pattern/subsequence that is most representative of a class with respect to a given time series dataset. Shapelets are thus discriminative subsequences of the time series that best predict the target variable. We will see that shapelets are used for the classification task. Shapelets can provide interpretable results and can be more precise than other time series classifiers because they address local rather than global characteristics.
Shapelet-based Classification
→ Shapelets are representative subsequences of the TS that are well suited to discriminating the classes of the TS themselves. We can identify the red part on the TS which lets us distinguish urtica from verbena; this is captured by the parts in light blue.
VS: the motif, instead, was a pattern that described a single TS; it had no relationship with the classes of the TS.
In the end I can represent the dataset through the distances of each time series from the shapelets; then, after the learning phase in which I calculate the distance between the shapelet and the time series of the two classes, I can build a decision tree based on the shapelets which discriminates between the two classes using a certain distance threshold as the split criterion (derived during the training phase).
Basically, what you do after finding a shapelet and recording its distance from the closest matching subsequence (relative to all other objects in the database) is compiling a simple decision-tree classifier, like the one in the photo: is the distance below the threshold?
→ YES, then the leaf is of the verbena urticifolia type; NO, then the leaf is of the urtica dioica type.
Ex: if I have two shapelets and a TS, I calculate the distance between shapelet 1 and the TS, and the same for shapelet 2, taking the best alignment the shapelets can have with the TS.
I can then represent TS T1 by its distances from S1 and S2. If I had other time series, I would build a table like the one on the side. Knowing the classes of the TS, I can add them (Y = A, B) to the table.
: it is clear that given a ts, if the distance between the ts itself and the shapelet1 is small, then the ts is of type A, otherwise
it is of type B. :one advantage
of shapelets is that they are accurate and robust. : the number of shapelets
is a parameter to choose, it depends on the length, variety and the presence of noise in the TS. However K<<M. :
this technique is not widely used for image recognition (in
that case we prefer to use deep neural networks) because often transforming images into TS is not very easy
due to the presence of the background. If we have simple images and we want to use only the outline, with a completely
black or white background, as in our case with the leaves, it is possible to use it efficiently.
:Given these two TS, T and S, how do you calculate the distance between them?
I align the two TS and calculate all the distances (in green)
The distance between the two TS is d = the minimum of all the distances found by moving the TS S over T. We therefore understand how important it is to have a normalized and scaled TS.
Dataset example:
NB: we could have used the motifs as well but they would not
have given us any guarantee on the division of the classes.
in the literature there are many ways to extract shapelets. We will present only one, which is not even very
efficient because it is based on a brute force approach. It's not particularly used.
- Let's imagine we have a time series dataset; I define a certain window of length m1 and slide it over TS1.
- I extract all the subsequences of length m1, moving the window on TS1, so as to have a set of candidate sequences, and I update the candidate pool. I do the same thing for TS2.
- I can decide to specify a new length m2 and slide it as well, again updating the candidate table.
- The more lengths I try, the more expensive the calculation will be, but the better the shapelets will be. It doesn't matter if the shapelets have different lengths.
- I test the utility of candidate shapelets: I use the concept of distance from a subsequence → the distance between a certain time series and a subsequence S is the function that returns the value d corresponding to the minimum distance between S and the subsequences of the series that have the same length as S.
- The distance from the time series to the subsequence, SubsequenceDist(T, S), is a distance function that takes the time series T and the subsequence S as input and returns a non-negative value d:
→ SubsequenceDist(T, S) = min(Dist(S, S')), for S' ∈ S_T^|S|, where S_T^|S| is the set of all possible subsequences of T of length |S|.
Intuitively, it is the distance between S and its best match position in T, I find it by sliding S and calculating
the various distances.
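A minimal sketch of SubsequenceDist, sliding S over T and keeping the best match:

import numpy as np

def subsequence_dist(T, S):
    T, S = np.asarray(T, dtype=float), np.asarray(S, dtype=float)
    m = len(S)
    # Euclidean distance of S to every window of T of the same length
    return min(np.linalg.norm(T[i:i + m] - S) for i in range(len(T) - m + 1))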
The figure illustrates the best corresponding position in Time Series T for the subsequence S
Test the utility of candidate shapelets → it is necessary to estimate how useful candidate subsequences are for discriminating between the two classes; given a certain candidate subsequence, I order the TS dataset according to the distance of each TS from the candidate (using the distance seen above).
I compare the candidate with the different time series, represented here in red above the leaves. From the previous step (figure above) I know the distances between each TS and the candidate, for example 0, 2, 4, 5, 7, 8, ...; I can then sort all the TS by these distances.
Afterwards, I can consider the class of each time series; for a certain candidate sequence I get the picture below. The squares and circles indicate the class each TS belongs to, and the position of each TS on the line is its distance from the candidate shapelet. What I want to find is the optimal split point between the two classes, the one that maximizes the information gain. This step is the usual one performed by the decision tree classifier.
Finally, after having found the split point and calculated the information gain of each candidate, I choose the shapelet with the highest information gain. How do we calculate this information gain? One way is entropy.
ENTROPY: A time series dataset D consists of two classes, A and B. If the proportion of objects in class A is p(A) and the proportion of objects in class B is p(B), the entropy of D is:
I(D) = -p(A) log(p(A)) - p(B) log(p(B))
Given a strategy that splits D into two subsets D1 and D2, the information remaining in the dataset after the split is defined by the weighted average entropy of each subset. If the fraction of objects (frequency) in D1 is f(D1) and in D2 is f(D2), the total entropy of D after the division is:
Î(D) = f(D1) I(D1) + f(D2) I(D2)
After that we calculate the information gain, in the same way as it is done in the decision tree, as a difference:
INFORMATION GAIN: Given a certain splitting strategy "sp" which divides D into two subsets D1 and D2, the entropy before and after the division is I(D) and Î(D). The information gain for this splitting rule is:
Gain(sp) = I(D) - Î(D) = I(D) - (f(D1) I(D1) + f(D2) I(D2))
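A small sketch of the entropy / information-gain computation for a candidate split point sp over the shapelet distances (helper names are illustrative):

import numpy as np

def entropy(labels):
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(distances, labels, sp):
    # split the dataset by distance from the candidate shapelet at point sp
    distances, labels = np.asarray(distances), np.asarray(labels)
    left, right = labels[distances <= sp], labels[distances > sp]
    f1, f2 = len(left) / len(labels), len(right) / len(labels)
    return entropy(labels) - (f1 * entropy(left) + f2 * entropy(right))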
:After that I can evaluate this measure to partition the dataset against all candidate shapelets, because as a splitting
rule (sp) we usually use the distance from T to a shapelet S. To find the best Shapelet (the one that splits best) we
may have to test many candidates.
We take the shapelet with the greatest information gain and use it to find the best splitting point, exactly as is done in the decision tree for attributes.
For example, in the brute force algorithm we sort the objects by distance and find an optimal split point between two nearby distances. Since a shapelet is simply a time series of some length less than or equal to the length of the shortest time series in our dataset, there is an enormous number of possible shapes it could have; however many candidates there are, this process is reliable in finding the best split: the pitfall lies in the dimensionality of the shapelet we are looking for. With the information gain definition, we are guaranteed that the shapelet with the highest information gain is the best way to separate the TS (according to the IG definition, not in absolute terms).
Example:
→ So we will select the best split point given by the maximum value of the information gain, and then take the top-k candidate shapelets. Suppose the Top-1 is here: we will select SP3 because it has the highest information gain.
26/04
The distance from the TS to the subsequence, SubsequenceDist(T, S), is a distance function that takes the TS T and the subsequence S as input and returns a non-negative value d, which is the distance between S and its best matching location in T.
Problem: extracting shapelets is quite expensive, because if we establish a minimum and a maximum length for the shapelet size then the total number of candidates for a dataset D is very large:
Speedups:
Since calculating distances from TSs to SC is expensive there are ways to reduce the time: • Distance
Early Abandon: reduces the distance calculation time between two TSs • Admissible
Entropy Pruning: reduces the number of distance calculations
For example, we have 100 instances and I only calculate the distances 5 times.
Now we need to identify the best possible scenario, which in the example is the division with the blue line, which puts all the instances of class 1 on one side and all the others on the other, and we calculate its Information Gain. Since the IG is maximal when there is the best separation, if even with this optimistic scenario the IG of the best candidate shapelet found so far is not reached, it means that the IG obtained with this split point cannot be the best.
We then stop the computation and test the next candidate, and so on → a way to reduce the complexity, this time largely avoiding the calculation of distances, while still being guaranteed to find the best split.
Shapelet Summary:
- We can find the optimal shapelet for the objective function via a NN-style method, by updating the shapelets in the direction of the minimum of the objective, i.e. following the gradient. Similarly, the weights can be updated jointly by minimizing the objective function.
04/29/2021
It's about extracting patterns on sequential data considering time, so essentially we're going from
time series to transactional sequences.
An example of a customer's purchase sequence in an online store: <{Digital Camera, iPad} {memory
card} {headphone, iPad cover}>; items between brackets are bought together.
Frequent itemsets and basic transactions do not contain the notion of time, whereas in sequential pattern
mining we have to consider it. In the apriori algorithm we only considered the frequency of that pattern, for
example if a customer buys chicken and fish 5 times and our threshold is 5 we know that the pair <chicken,
fish> is a frequent itemset. Now let's consider sequences, ie the purchase of chicken and meat is followed
by the purchase of tomatoes which in turn is followed by the purchase of fish. This is not a rule, but a
sequence of events. We then check which sequences appear frequently.
The sequence is the collection of transactions or elements; an item or transaction is a set of events or
items. So I go to the supermarket several times and each time I make a transaction (a purchase) of a
set of items that I buy all at the same time.
Object identifies the customer sequence. Customer A at the first purchase buys items 2,3 and 5 at time 10.
In general, instead of the actual timestamp, to simplify, we will use the sequence of numbers starting from
1, to indicate whether it was the first, second or third purchase for that customer, etc., assuming that
each purchase takes place at a different time than the one before. so we are not considering time as a
continuous variable in our models but we are discretizing it by representing the events simply as one
after the other, I don't care if the next purchase is after two hours, two weeks or two months, it will
always have a timestamp 2.
Each transaction is assigned a specific time, so the first element is at time 1, the second at time 2 and
so on. Each transaction contains a set of items e(i)={i1, i2,…, ik}.
The length of a sequence |s| is given by the number of elements (transactions) within it. When we have only one item in a transaction we call it a singleton; when we have more than one we call it a complex element.
A sequence <a1 a2 … an> is contained in another sequence <b1 b2 … bm> (with m >= n) if there exist indices i1 < i2 < … < in such that transaction a1 is a subset of the transaction of b with index i1, transaction a2 is a subset of the transaction of b with index i2, and so on.
So in this set we have that the first transaction of the first sequence is a subset of the second transaction
of the second sequence (ie {A} is a subset of {A,C}); we have that the second transaction of the first
sequence is a subset of the third transaction of the second sequence ({B,C} is a subset of {A,B,C}); finally
{D} is a subset of {D}, so the first sequence is a subsequence of the second and in this case i1=1, i2=2
and i3=5.
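A possible sketch of this containment test, greedily matching each transaction of the candidate subsequence to the earliest later transaction that contains it (the example sequences below are hypothetical):

# a and b are lists of sets (transactions)
def is_contained(a, b):
    j = 0
    for trans_a in a:
        # advance in b until a transaction that contains trans_a is found
        while j < len(b) and not set(trans_a) <= set(b[j]):
            j += 1
        if j == len(b):
            return False
        j += 1
    return True

# e.g. is_contained([{"A"}, {"B", "C"}, {"D"}],
#                   [{"B"}, {"A", "C"}, {"A", "B", "C"}, {"E"}, {"D"}])  -> True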
Given a sequence, a subsequence of it can have multiple occurrences; these occurrences are
represented by the indexes of the transactions of which the transactions of the subsequence are subsets.
Occurrences of the subsequence <{B}{E}> can be <1,2>, <1,6> or <3,6> because {B} can be a subset of
{B,F} or of {A,B }; we don't consider {B,E} because there are no transactions containing {E} after it.
The support of a subsequence w is the fraction of sequences that contains the subsequence w.
Recall that even if a subsequence can be mapped multiple times in a sequence, this does not mean that the subsequence is counted multiple times for that sequence: when we calculate the support we just need to check whether the sequence contains at least one mapping of the subsequence.
Given a sequence database and a user-given minsup, the idea is to find all subsequences with support >=
minsup, so it's roughly the same definition as DM1's frequent pattern mining.
The trivial approach is to generate all possible k-subsequences and compute the support, but it is
impossible because we have a combinatorial explosion.
Even if we generate subsequences from a given input sequence the number of k-subsequences to extract is
too high.
GSP extension
To effectively run sequential pattern mining there is a first algorithm which is the basic one, namely the
Generalized Sequential Pattern (GSP), which however is not widely used because there are more efficient
implementations.
The idea of the GSP is the same as apriori, that is, it starts from the short patterns and then looks for the
longer ones at each iteration so that they respect the support threshold. Also this algorithm is based on
the anti-monotonicity principle of the support, so if the sequence S1 is contained in the sequence S2 then the
support of S2 cannot be greater than the support of S1.
An intuitive proof of this is that any input sequence that contains S2 will also contain S1.
At the first step the algorithm scans the dataset and takes the sequences with an element (a transaction, it
can be <{i1}> but also < {i1,i2}>);
step 2 is repeated until frequent new sequences are discovered and consists of:
-Candidate generation, i.e. generate the candidates by joining two pairs of frequent subsequences
found previously and generate the candidate sequences that contain k-item
-Candidate pruning, i.e. remove candidates that contain (k-1)-subsequences that are infrequent
When we consider for example the 2-subsequences, these can be formed either by a single transaction with
two items, or by two transactions with a single item.
To run this algorithm it is important to establish an order of the items within the transactions, and therefore
by establishing an order through timestamps we give an order to how the algorithm works.
Candidate generation
The base case is when I join two frequent 1-sequences to produce two new candidate 2-sequences, i.e. <{i1}{i2}
> and <{i1,i2}>. We also need to remember the special case, i.e. the case where we join the same item with itself
<{i1}{i1}> .
In the general case I know all (k-1)-frequent sequences and a (k-1)-frequent sequence w1 is combined with
another w2 to produce a k-sequence if the subsequence obtained by removing the first item in w1 is the same as
that obtained by removing the last item in w2. The result of this operation is obtained by taking w1 and
extending it with the last element of w2. There are 3 possible cases:
Here too we must remember the special case in which we have <{a} {a}> + <{a} {a}> = <{a} {a} {a}>
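A rough sketch of the general-case join rule (sequences represented as lists of tuples, items kept in a fixed order; helper names are illustrative and the k = 2 base case is not handled here):

# remove the first item of the first transaction / the last item of the last one
def drop_first(seq):
    head = seq[0][1:]
    return ([head] if head else []) + list(seq[1:])

def drop_last(seq):
    tail = seq[-1][:-1]
    return list(seq[:-1]) + ([tail] if tail else [])

def join(w1, w2):
    if drop_first(w1) != drop_last(w2):
        return None
    last = w2[-1]
    if len(last) == 1:
        return list(w1) + [last]            # last item of w2 becomes a new transaction
    return list(w1[:-1]) + [w1[-1] + (last[-1],)]   # last item extends w1's last transaction

# e.g. join([("A",), ("B",)], [("B",), ("C",)]) -> [("A",), ("B",), ("C",)]
#      join([("A",), ("B",)], [("B", "C")])     -> [("A",), ("B", "C")]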
Candidate pruning
It is based on the a priori principle: if a k-sequence W contains a (k-1)-subsequence which is not frequent, then
W is not frequent and can be pruned.
Given a sequence, we enumerate all the (k-1)-subsequences by deleting one item from the sequence; therefore, for a sequence of length k we have k (k-1)-subsequences. We already know that the subsequences formed by the sequence without the first item and by the sequence without the last item will certainly be frequent, because they are the ones that led to the generation of the candidate, so the subsequences to be checked are actually k-2, and we check whether they were generated in the previous iteration. If they were not, the candidate k-sequence in question can be pruned.
In the case where k=2 pruning has no use because they are certainly generated by frequent
subsequences.
Lesson 03/05
TIMING CONSTRAINTS
• Timing constraints are very useful, because it might not be very interesting to find a sequential pattern of two transactions that occurred six months apart. Ex: Sequential Pattern {milk} → {cookies}: if the purchase of the milk happened six months after the purchase of the cookies, it would certainly not be useful information.
• Too short a time is also uninteresting. If we have: {cheese A} → {cheese B}, we may have a purchase of cheese A and cheese B 20 minutes apart, but this may simply be because we forgot to get cheese B along with cheese A.
1. xg: max-gap → each element of the pattern instance must occur at most xg time after the previous one.
2. ng: min-gap → each element of the pattern instance must occur at least ng time after the previous one. It allows us to discard patterns whose transactions are too close in time to be interesting.
3. ms: maximum span → the overall duration of the pattern instance must be at most ms.
Ex:
In this case we say that the gap between A and C <= xg. This allows us to avoid considering pattern
instances that occurred at very distant points in time.
Exercise:
- <{2,4},{3,5,6},{4,7},{4,5},{8}>
- <{happens at time 0}, {happens at time 1}, {happens at time 2}, {happens at time 3}, {happens at time 4}>
- The gap between transaction {3,5,6} and transaction {4,5} is 3-1 = 2
- Now we need to see if 2 <= xg, that is if 2 <= 2 in this case.
- So: it is contained
Approach 1:
Approach 2:
- Modifying the GSP algorithm to directly eliminate candidates that violate the time constraints.
- Question: Is the Apriori principle still valid? The answer is both yes and no, because the apriori principle is satisfied under certain constraints and not satisfied under others.
Case1: max-span
- As you can see, we have the two input sequences S1 and S2.
- The span for S2 is 4 (4-0 = 4) and for S1 it is 1 (1-0 = 1).
- When S1 has fewer elements, the span of S1 can only decrease: if the span of S2 is OK, then the span of S1 is OK.
- If S1 is contained in S2, then supp(S1) >= supp(S2) and span(S1) <= span(S2). So if the max-span constraint is respected for S2, it is also respected for S1: the apriori principle is respected.
Case2: min-gap
- In this case the gap of S1 is given by position of D - position of A, so 4-0 = 4.
- For S2, instead, we have to check two gaps, one between each pair of consecutive transactions: 4-1 = 3 and 1-0 = 1.
- Also with respect to the min-gap the constraint is preserved: the min-gap constraint requires the gap between two consecutive transactions of the subsequence to be greater than the min-gap.
- Therefore, if the constraint is respected by S2 then it will also be respected by S1. This is because S1 has fewer elements, and by construction the gaps of the smaller subsequence are greater than or equal to the gaps of the larger sequence.
Case3: max-gap
- The most unfortunate case is this one.
- In this case, when S1 has fewer elements than S2, the gap can only increase: S1, contained in S2, may skip some internal elements. In this case, if the constraint is gap <= 3, then it is satisfied for gap = 1 and gap = 3, but not for gap = 4.
- This is because if S1 has fewer elements, the gap for S1 can only increase compared to the gap for S2. Even if the gaps of S2 respect the time constraint, a gap of S1 can exceed it, so the apriori principle does not hold for the max-gap.
Example:
Apriori principle for Sequence Data
However, since we have the max-gap = 1, we cannot consider these events selected, because in the part of object A
the difference between the corresponding timestamps (3-1 = 2) is greater than 1.
Now we no longer consider those highlighted in blue, but we consider the purple ones.
The support is therefore 40%. The occurrences under (D) do not respect the max-gap either, so we eliminate them.
All those highlighted in green respect the maximum gap constraint, therefore the support is 60%.
In this case S1 does not respect minsup, while S2 respects it even if it is longer. But this can never happen with the apriori,
because if a sequence of length k respects the minsup then a sequence of length k-1 must respect the minsup, because the support
decreases monotonically (this does not happen since, in this case, as the length of the sequence increases, the support also
increases).
Contiguous Subsequences
Definition: s is a contiguous subsequence of a given sequence w = <e1><e2>...<ek> if one of the following conditions is true:
1) s is obtained from w by deleting an item from e1 or from ek (we are mainly interested in this condition, to avoid internal jumps; the others we can ignore)
2) s is obtained from w by deleting an item from an element ei which contains at least 2 items
3) s is a contiguous subsequence of s' and s' is a contiguous subsequence of w (recursive definition)
Examples: s = <{1}{2}>
-
s is contiguous subsequence of <{1}{2 3}>, <{1 2}{2}{3}>, <{3 4}{1 2}{2 3}{4}>, because we could remove the values
as follows:
- We go from: without the max-gap constraint, a candidate k-sequence is pruned if at least one of its (k-1)-subsequences is infrequent.
- To: with the max-gap constraint, a candidate k-sequence is pruned if at least one of its contiguous (k-1)-subsequences is infrequent. Non-contiguous subsequences may fail to meet the min support threshold even when the sequence is frequent, so they can no longer be used for pruning.
So the pruning power is reduced, because there are fewer subsequences to consider, since only the contiguous ones are taken.
The occurrence <1,2> is eliminated because it does not respect the constraint min-gap > 1: we are looking for a gap > 1, but 2-1 = 1, therefore it does not respect the constraint.
The same goes for <0,1,2> and <0,1,6>; in this case we always have 1-0 = 1.
We eliminate the occurrences <3,4> and <6,7> because they do not respect min-gap > 1.
In the last one we eliminate all the occurrences, again because of the min-gap constraint. So the sequence w3 is not contained in the initial sequence if we apply the constraint min-gap > 1.
(With the time constraint max-gap = 4 he did not show the steps, but it is enough to discard the occurrences that have a gap greater than 4, as for the occurrence <0,5> in the first line.) That is, just do, e.g.:
- <0,4> → 4-0 = 4 (we keep it)
- <0,5> → 5-0 = 5 (we discard it)
- etc.
GSP - Exercise 1
If the professor asks us to simulate the GSP he will not ask us to consider the constraints because it would
become too complicated even with small datasets. We just need to remember that the maxgap does not satisfy
the a priori principle of the algorithm.
Let's start with single transactions and remove the ones that don't meet the support threshold:
A=¾
etc..
We build the ones for the next iteration and remove the ones that don't respect support. In this case it is not possible to do
the pruning step.
We need to check A → D:
A → D doesn't respect the support, so we remove it. The last 3 we removed because their support is less than 60%.
GSP - Exercise 2
05/05/21
ADVANCED CLUSTERING
For the last part of this module there is no real reference to the book but at the end of the slides there will be some
papers. We will not go into too much detail but we present these advanced clusterings only to know and
understand that there are others, beyond those already studied in DM1.
Bisecting K-MEANS:
The bisecting K-means algorithm is a simple extension of the K-means algorithm based on the following idea: to obtain K clusters, the set of all points is split into two clusters, one of these is selected and split further, and so on, until K clusters are produced. - It is a variant of K-Means which produces a hierarchical clustering.
In particular, we start from a single cluster and split it into two partitions (a 2-means); after the split we get two clusters that are added to the list of clusters. After that we remove from the list the "worst" cluster, i.e. the one with the highest SSE, and split it further.
This procedure is repeated until the number K specified by the user is obtained.
: There are several ways to choose the cluster to split: we can choose the largest at each step, or the one with the
largest SSE, or a criterion based on both size and SSE. Different choices will generate
different clusters.
Practically, we are using the K-means algorithm locally, i.e. to bisect single clusters. Thus, the final set of clusters does not represent a clustering that is a local minimum with respect to the total SSE.
Advantages: if I track the splitting of the various clusters, I can get a hierarchy. It works with areas of different density: the best clusters, which are more compact and have lower SSE, are not removed from the cluster list, while the sparser areas are split into many small clusters.
Limits: the K must be specified by the user, but if it is not, i.e. it is not known or has not been specified, the
algorithm ends exhaustively in singleton clusters, but this takes a long time. Also singleton clusters are meaningless
(e.g. over-splitting).
→ A crucial difference between bisecting K-means and hierarchical clustering is that the latter follows a bottom-up strategy, because it starts from clusters containing a single instance each and merges them step by step; bisecting K-means instead starts from a single large cluster and at each iteration a cluster is split.
:However, the algorithm has no criterion for stopping the bisection before singleton clusters are reached
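A rough sketch of the procedure, assuming scikit-learn's KMeans and using the largest-SSE criterion to pick the cluster to split (other criteria, e.g. cluster size, are equally valid):

import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, K, seed=0):
    clusters = [np.arange(len(X))]                       # start with one cluster
    while len(clusters) < K:
        # SSE of each cluster around its own centroid
        sse = [((X[idx] - X[idx].mean(axis=0)) ** 2).sum() for idx in clusters]
        worst = clusters.pop(int(np.argmax(sse)))        # cluster to split
        km = KMeans(n_clusters=2, n_init=10, random_state=seed).fit(X[worst])
        clusters.append(worst[km.labels_ == 0])
        clusters.append(worst[km.labels_ == 1])
    return clusters                                      # list of index arrays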
X-MEANS:
The X-means algorithm is also an extension of the K-means algorithm; here you are not asked to specify the
number of clusters K and for this it is necessary to introduce a stop criterion called BIC, Bayesian Information
Criterion.
The procedure is as follows: the BIC of the parent cluster is measured and the value recorded; we do the same with the child clusters, obtained by a 2-means division of the parent. If the value of the children is greater than that of the parent, we keep the split; if instead it is less, the procedure stops. → The over-splitting phenomenon is thus stopped and singletons are avoided.
The X-Means algorithm therefore starts with K equal to the lower limit of the specified interval [r1, rmax] (rmax is the maximum number of clusters to test) and continues to add centroids where necessary, until the upper limit is reached. → The appropriate value of K is sought in the given range [r1, rmax]. During this process the set of centroids that achieves the best score is recorded, and this is what will be shown in the output.
1) Improve-Params: performs the K-Means with the current value of K. Let's imagine we have K=3 and a situation like this →
2) Improve-Structure: recursively split each cluster into two and use a local BIC to decide whether to keep the split. We take cluster A, split it, creating children A1 and A2, and calculate their BIC. The current configuration is recorded with a global BIC calculated on the entire configuration. We then look at the local BIC: we stop the split if the BIC of the children is less than that of the parent, i.e. if it does not respect the local BIC; if k > rmax the algorithm stops and returns the model with the best score, otherwise it re-executes the operations.
→ The idea of X-MEANS is therefore to start with a K-means with the current value of K, then refine the clusters with a bisecting K-means which stops either by evaluating the BIC or when the total number of clusters k exceeds the upper extremum of the chosen interval (that is, if k > rmax it stops and returns the model with the best score, otherwise it re-executes the operations). Afterwards, once the clusters have been obtained (in our case B, C, A, A1, A21, A22), the global BIC score is calculated to decide which K to return in output. It is called global because it considers all the clusters, which cover the entire dataset. At the end the algorithm returns the model with the largest global BIC corresponding to a certain K, contained in the interval [r1, rmax].
EXAMPLE X-Means:
2) We split each centroid into 2 children, which are moved apart in opposite directions, along a randomly chosen vector, by a distance proportional to the size of the region.
3) We run 2-means locally in each region for each pair of children. It is local in the sense that the children compete only for the points in the parent's region, and for no others.
The BIC approximates the probability that the clustering in Mj describes the real clusters in the data.
The BIC is a score: the higher, the better. It is calculated considering the likelihood of dataset D, where the log-likelihood of D according to the j-th model appears together with pj, the number of parameters of Mj. It is also called the Schwarz criterion. R is the number of points of cluster j, while M is the number of dimensions (when it carries the subscript, i.e. it is written Mj, it denotes model/cluster j).
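The formula itself is not reproduced in these notes; in the usual notation (which should correspond to the description above) it reads:

```latex
BIC(M_j) \;=\; \hat{l}_j(D) \;-\; \frac{p_j}{2}\,\log R
```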
The likelihood is therefore the probability that the data are "explained" by the clusters according to the spherical Gaussian
assumption of the K-means. Focusing on the set Dn of points belonging to the centroid and linking the estimates of maximum
likelihood to the estimate of how close the points of the cluster are to the centroid, the likelihood is calculated as follows:
RECAP: to understand X-means it is necessary to know K-means and to understand how bisecting K-means works, remembering that the latter has no stopping criterion unless we indicate a precise K. In X-means we therefore introduce the BIC, the Bayesian Information Criterion, which is calculated twice:
- 1) locally, to understand whether the structure can be improved by adding splits or not;
- 2) globally, to evaluate the total clustering.
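A minimal Python sketch of the local split test described above (the spherical-Gaussian BIC below is an illustrative choice, not the exact formula from the slides):

```python
# X-means "Improve-Structure" test: keep a 2-means split only if the children's BIC
# beats the parent's BIC (illustrative sketch).
import numpy as np
from sklearn.cluster import KMeans

def spherical_bic(X, labels, centers):
    """BIC = log-likelihood - (p/2) * log(R), spherical Gaussians, one variance per cluster."""
    R, M = X.shape
    K = len(centers)
    ll = 0.0
    for k in range(K):
        Xk = X[labels == k]
        Rk = len(Xk)
        if Rk == 0:
            continue
        var = np.sum((Xk - centers[k]) ** 2) / (Rk * M) + 1e-12
        ll += (Rk * np.log(Rk / R)                    # mixing-proportion term
               - Rk * M / 2 * np.log(2 * np.pi * var) # Gaussian normalisation
               - Rk * M / 2)                          # sum of squared deviations / variance
    p = K * (M + 1)                                   # K means (M coords each) + K variances
    return ll - p / 2 * np.log(R)

def split_if_better(X):
    """Return two child clusters if their BIC is higher than the parent's, else the parent."""
    parent_bic = spherical_bic(X, np.zeros(len(X), dtype=int), X.mean(axis=0, keepdims=True))
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    children_bic = spherical_bic(X, km.labels_, km.cluster_centers_)
    if children_bic > parent_bic:
        return [X[km.labels_ == 0], X[km.labels_ == 1]]
    return [X]
```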
• To understand our data, we assume that there is an unknown generative process (a model) that creates/describes the data, and we will try to find the model that best fits the data itself. So we would like to approximate how the data is generated.
- It is possible to define models of different complexity, but it is assumed that the model is a distribution from which the data points are sampled.
- Example: the data are the heights of all the people in Greece.
• In most cases a single distribution is not sufficient to describe all the data points: different subsets of the data follow different distributions.
- Example: the data are the heights of all the people in Greece and China; this means that they belong to different clusters and that each is described by its own model. In these cases the distribution of the heights of the Greeks alone is not enough for us, but we need a mixture model,
since we are talking about two different distributions corresponding to different clusters in the data so that the two different
populations can be better represented.
Our goals are to find which cluster each data point belongs to, and also to find the parameters that best describe each distribution. The question is: how are mixture models extracted? The answer will be: through the Expectation-Maximization algorithm.
EM (Expectation-Maximization) algorithm:
Initialize the parameter values in θ (theta, the parameter vector) with some random values.
Repeat until the procedure converges:
- E-Step: given the parameters θ, estimate the membership probabilities P(Gj|xi);
- M-Step: given the probabilities, compute the values of the parameters θ that maximize the likelihood of the data.
Example:
- E-step: estimate the probability with which a point belongs to a certain distribution. Recalling the example of the heights of people in Greece and China: what is the probability that a given point comes from the Chinese distribution rather than the Greek one? After defining this membership probability, we move on to the next step:
- M-Step: given the probabilities calculated in the E-step, compute the parameters θ that maximize the data likelihood.
The EM algorithm is a generalization of K-means. Indeed, the K-means algorithm for Euclidean data is a special case of the EM algorithm for spherical Gaussian distributions with equal covariance matrices but different means.
- Exp/Max: by contrast, all the parameters of the distributions, as well as the mixture weights, are selected so as to maximize the probability (likelihood).
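A minimal numpy sketch of the E/M loop for the running example (a mixture of two 1-D Gaussians, e.g. heights from two populations); all names and the initialisation are illustrative, not taken from the course material:

```python
# EM for a mixture of two 1-D Gaussians (illustrative sketch).
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def em_two_gaussians(x, n_iter=100):
    # initial guess for theta = (pi, mu1, sigma1, mu2, sigma2)
    pi = 0.5
    mu1, mu2 = x.min(), x.max()
    s1 = s2 = x.std() + 1e-6
    for _ in range(n_iter):
        # E-step: membership probabilities P(G1 | xi) and P(G2 | xi)
        p1 = pi * gaussian_pdf(x, mu1, s1)
        p2 = (1 - pi) * gaussian_pdf(x, mu2, s2)
        r1 = p1 / (p1 + p2)
        r2 = 1 - r1
        # M-step: re-estimate the parameters that maximize the likelihood
        pi = r1.mean()
        mu1 = np.sum(r1 * x) / np.sum(r1)
        mu2 = np.sum(r2 * x) / np.sum(r2)
        s1 = np.sqrt(np.sum(r1 * (x - mu1) ** 2) / np.sum(r1)) + 1e-6
        s2 = np.sqrt(np.sum(r2 * (x - mu2) ** 2) / np.sum(r2)) + 1e-6
    return pi, (mu1, s1), (mu2, s2)

# e.g. heights (cm) sampled from two populations
x = np.concatenate([np.random.normal(170, 6, 500), np.random.normal(160, 5, 500)])
print(em_two_gaussians(x))
```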
We will now present another implementation of the Expectation-Maximization algorithm, the Gaussian Mixture Model, which assumes Gaussian distributions to model the data and performs not a hard assignment like K-means but a probabilistic one. It can be seen as a "brother" of K-means.
MODEL: a Gaussian distribution is defined by the pair of parameters θ = (μ, σ), where μ is the mean and σ is the standard deviation; in general, the model is defined by a vector of parameters θ.
Our goal is to find the normal distribution N(μ, σ) that best fits our data, so we want to find the best values for μ and σ. But what does "best fit" mean? It means maximizing the likelihood (Maximum Likelihood Estimation, MLE).
We want to find the parameters θ = (μ, σ) that maximize the probability P(X|θ); to do this we have to consider the likelihood as a function of the parameters. The probability P(X|θ), seen as a function of θ, is the likelihood function (it is the same quantity as before, but as a function of θ instead of x):
However, it is usually easier to work with the log-likelihood (n is the number of data points we have):
Thus, Maximum Likelihood Estimation for the Gaussian model consists in finding the parameters μ, σ that maximize LL(θ). (In K-means only the sample mean is estimated.)
It is important to note that the parameters that maximize the log-likelihood also maximize the likelihood, since the logarithm is a monotonically increasing function.
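The formulas are not reproduced in the notes; in the usual notation the log-likelihood and its maximizers are:

```latex
LL(\theta) = \sum_{i=1}^{n} \log \mathcal{N}(x_i \mid \mu, \sigma)
           = -\frac{n}{2}\log\!\left(2\pi\sigma^{2}\right)
             - \frac{1}{2\sigma^{2}} \sum_{i=1}^{n} (x_i - \mu)^{2},
\qquad
\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\quad
\hat{\sigma}^{2} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{\mu})^{2}.
```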
If we have no a priori information about X or θ, maximizing P(θ|X) is the same as maximizing P(X|θ).
In the example there are two Gaussians: one for the Greek people and one for the Chinese people. Identifying, for each value, which Gaussian is most likely to have generated it gives us the clustering.
Mixture model
A value xi is generated according to the following process:
1. First the nationality is selected: with probability πG we select Greece, with probability πC China.
2. Given the nationality, we generate the point from the corresponding Gaussian. Given the nationality and the corresponding probability, we will use:
Finally, the model will have the following parameters (in the k-means the green part was 1 for one class and 0 for the other)
This is the probability that a point belongs to the Greek or Chinese population (cluster)
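Written out (the notes do not reproduce the formula, but it follows from the generative process just described), the mixture density and the membership probability of, say, the Greek component are:

```latex
p(x) = \pi_G\,\mathcal{N}(x \mid \mu_G, \sigma_G) + \pi_C\,\mathcal{N}(x \mid \mu_C, \sigma_C),
\qquad
P(G \mid x) = \frac{\pi_G\,\mathcal{N}(x \mid \mu_G, \sigma_G)}{p(x)},
\qquad \pi_G + \pi_C = 1 .
```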
→ K-means is not the only instantiation of the Exp/Max algorithm; we can have more than one implementation.
Limitations: it assumes that all the data are generated by Gaussian distributions. Advantages: probabilistic assignment, and the estimated parameters can be used to describe the data.
→ The difference between K-means and Gaussian mixture models is, first of all, that K-means does not use probabilities: it makes HARD assignments of the type "this point belongs to this cluster, full stop", and the only parameter it considers is the mean (it does not care about probability).
In the Gaussian model we have two parameters (mean and standard deviation).
OPTICS
It does not completely solve the latter problem, but it manages to work well with clusters of different densities, sizes and shapes.
• We must recall the notion of core point: a point p is a core point if at least MinPts points are located within its ε-neighborhood, i.e. within the radius ε. In the example opposite, with MinPts=5, point P is a core point because in its neighborhood of radius ε=3 it has 5 points.
• Core distance: it is the minimum radius value necessary to classify a given point as a core point; in other words, it is the distance from the point to the furthest of the MinPts points in its neighborhood. If the given point is not a core point, the core distance is undefined.
In the example the core distance is 3 mm, since point Z is the furthest such point from P while still inside its neighborhood. The core distance is at most equal to the value of ε.
• Reachability distance: intuitively, the reachability distance of a point o with respect to a core point p is the maximum between the core distance of p and the actual distance d(p, o); if p is not a core point it is undefined.
• The cluster structure can be obtained by processing the points in order of reachability:
- for each point p in the dataset, initialize the reachability distance of p as undefined;
- for each unprocessed point p in the dataset: one is taken at random, its N neighbors are obtained (neighbours = the points within the radius ε), p is marked as processed and it is added to an ordered output list.
→ This procedure always extracts next the point q that is closest (in reachability terms) to the points already processed; when it reaches the end, it starts again from a new point.
- The parameter ε is required to cut out cluster densities that are not considered interesting, in order to speed up the algorithm.
- The ε parameter is not strictly necessary, because it can be set to the maximum possible value.
- However, when a spatial index is available, it plays a practical role in the complexity.
- OPTICS abstracts DBSCAN by removing this parameter, or at least by requiring only its maximum value.
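As a quick practical reference (not part of the original notes), scikit-learn provides an OPTICS implementation where max_eps plays the role of ε and min_samples the role of MinPts:

```python
# OPTICS with scikit-learn: reachability ordering + cluster labels.
import numpy as np
from sklearn.cluster import OPTICS

X = np.random.RandomState(0).rand(300, 2)      # toy 2-D data
opt = OPTICS(min_samples=5, max_eps=np.inf)    # max_eps can be left at its maximum value
opt.fit(X)

print(opt.ordering_[:10])                       # processing order of the points
print(opt.reachability_[opt.ordering_][:10])    # reachability-plot values in that order
print(np.unique(opt.labels_))                   # extracted cluster labels (-1 = noise)
```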
Transactional clustering
Clustering is grouping objects into different sets such that distances between objects in the same cluster are minimized and
distances between objects belonging to different clusters are maximized, so objects in the same cluster should be similar
to each other. Usually the Euclidean distance, the Manhattan distance or, as we have seen for time series, the DTW are used. All these distances are fine for purely numerical data, but problems arise when we want to compare objects that have categorical features; the same problem extends to boolean attributes.
There is a specific type of data that is mainly composed of categorical attributes, i.e. market basket data, i.e. the same data
types we have seen for pattern mining and sequential pattern mining, i.e. each piece of information is stored in a transaction
which represents a set of objects "bought" together. It may be interesting to apply clustering to this type of data as well,
for example to detect which customers buy in a similar way. Clustering and pattern mining are nonetheless completely different tasks: on the one hand the goal is to extract (sub)sequences of items often bought together, while on the other hand the goal is to group the transactions so that those in the same cluster are similar, and therefore transactions in the same cluster may exhibit the same patterns.
A typical way to represent transactional data is the one already seen made by sets but another way is to use vectors with
boolean attributes saying if a single item has been bought or not in a transaction.
There is a strong connection between booleans and categoricals, in fact it is always easy to convert from booleans to
categoricals and vice versa.
So given this type of representation we can use Euclidean distance to measure the closeness between pairs of
transactions. For example, the smallest distance is the one between P1 and P2 which is equal to 1.
The problem comes when we introduce the notion of centroid and also the notion of distance itself.
Let's assume we are running hierarchical clustering with average distance in which we consider the centroids of
the clusters to calculate the distance. So if we take the cluster that contains the points P1 and P2, the centroid will
be (1,1,0.5,1) and when I have to create the next cluster I calculate the distance between the centroid of P12 and
the points P3 and P4.
I obtain that the smallest distance is the one between P3 and P4, and this is true from a mathematical point of view but meaningless from a semantic point of view, because P3 and P4 do not have any item in common: their similarity is only due to the fact that neither of them bought items 1 and 2, not to having bought the same items. For this reason a better solution might be to use the Jaccard distance. In conclusion, using distance metrics designed for numeric attributes is often not appropriate for categorical ones.
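A small sketch of the point just made, using hypothetical boolean baskets (the vectors below are illustrative, not the exact P1–P4 of the original table):

```python
# Same Euclidean distance, very different Jaccard distance (illustrative baskets).
import numpy as np
from scipy.spatial.distance import euclidean, jaccard

A = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])   # bought items 0, 1, 2
B = np.array([1, 1, 0, 1, 0, 0, 0, 0, 0, 0])   # bought items 0, 1, 3
C = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0])   # bought only item 8
D = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])   # bought only item 9

print(euclidean(A, B), euclidean(C, D))   # both sqrt(2): Euclidean sees the pairs as equally close
print(jaccard(A, B), jaccard(C, D))       # 0.5 vs 1.0: Jaccard sees that C and D share nothing
```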
For this, algorithms have been created that work for categorical and transactional data.
K-Modes
It is a "brother" of K-means designed for categorical data. The goal is still to minimize the intra-cluster distance.
In K-means the Euclidean distance was used, while in K-modes the distance between two objects is the number of mismatches between their attributes, summed over the attributes. The mismatch is 0 if the two objects have the same value for that attribute, and 1 otherwise.
Example:
The distance between objects x and y is 2 because there are two mismatches: one on the OH attribute and one on the
m.
The second difference is that while in K-means we use the mean as the centroid, for categorical attributes we cannot compute it, so the mode of each column is used instead. For example, with attributes A1, A2, A3: if most of the objects within the cluster have A1=1, most have A2=1 and most have A3=0, then the centroid will be (1, 1, 0). The mode is computed independently for each column.
The algorithm starts by randomly selecting the initial objects as the modes, then it scans all the data and assigns each object to the closest cluster, identified by its mode. Then it recalculates the mode of each cluster and repeats until no cluster changes anymore.
This distance, however, still does not solve the problem of the zeros (two objects also look similar when they both simply lack the same items).
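A minimal sketch of the two ingredients just described, mismatch distance and per-column mode (illustrative, not the full K-Modes algorithm):

```python
# Mismatch distance and per-column mode (the two K-Modes ingredients).
from collections import Counter

def mismatch_distance(x, y):
    """Number of attributes on which the two objects disagree."""
    return sum(a != b for a, b in zip(x, y))

def cluster_mode(objects):
    """Mode computed independently for each column: this plays the role of the centroid."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*objects)]

X = [["a", "y", "low"], ["a", "n", "low"], ["b", "y", "low"]]
print(mismatch_distance(X[0], X[1]))   # 1 (they disagree only on the second attribute)
print(cluster_mode(X))                 # ['a', 'y', 'low']
```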
TX-MEANS
T stands for transactional, so it's the x-means for transactional data. It is a parameterless clustering
algorithm. Automatically estimates the number of clusters as the x-means. This algorithm provides the
representative transaction of each cluster, which summarizes the pattern captured by that cluster.
It starts from a single cluster that contains all baskets; then it binary splits clusters with multiple iterations until
the BIC tells us we don't have to split the cluster anymore.
X-means starts by running K-means with some K and then tries to split each cluster using bisecting K-means until the BIC stops it; finally it computes the global BIC and returns the clustering that maximizes it. TX-means instead applies the bisecting procedure directly, until the BIC stops it. In the end it allows us to extract from each cluster a representative basket, which TX-means also uses to run the bisecting procedure.
The difference with bisecting K-means (with k=2) is the GETREPR function, which returns a specific type of centroid that is representative of the baskets.
This procedure starts from a representative basket formed by the items present in all transactions, and iteratively adds to this basket the most frequent items until the distance between the representative basket and all the baskets in the analyzed cluster no longer decreases.
Suppose we have
T1=ABC
T2=ABCD
T3=AB
T4=AC
At the first iteration r = A, because it is the only item present in all transactions. The distance between r and the 4 transactions is calculated using the Jaccard coefficient. If the distance decreases, the step is repeated by adding the most frequent items, which in our case are B and C, and the distance between r and the transactions is recalculated. This new representative basket r1_new is then used as a centroid for the bisecting procedure, which returns two clusters with their representative baskets; if the BIC of the children is not better than that of the parent, the procedure stops. (It is a top-down strategy.)
A peculiarity of the tx means is that it can be used on large datasets because it can first be applied to a subset of
the dataset and then assign the remaining baskets to the clusters obtained using a nearest neighbor approach.
Example:
If instead we use TX-means: at the beginning r = A; in the second iteration B and C are added. Then the distance between r = ABC and each transaction is calculated.
The sum of these distances, divided by the number of transactions, gives the (average) distance between r and all the transactions. At each iteration we look for the r that is closest to all the transactions.
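A sketch of the representative-basket construction as described above (the helper names and the exact greedy stopping rule are my own reading, not the official TX-means code):

```python
# Build a representative basket: start from the items shared by all baskets, then greedily
# add the most frequent remaining items while the mean Jaccard distance keeps decreasing.
from collections import Counter

def jaccard_distance(a, b):
    a, b = set(a), set(b)
    return 1 - len(a & b) / len(a | b) if (a | b) else 0.0

def mean_distance(rep, baskets):
    return sum(jaccard_distance(rep, b) for b in baskets) / len(baskets)

def representative_basket(baskets):
    rep = set.intersection(*map(set, baskets))           # items present in every basket
    freq = Counter(i for b in baskets for i in b if i not in rep)
    best = mean_distance(rep, baskets)
    for item, _ in freq.most_common():
        candidate = rep | {item}
        d = mean_distance(candidate, baskets)
        if d < best:                                      # keep adding while the distance decreases
            rep, best = candidate, d
        else:
            break
    return rep

baskets = [{"A", "B", "C"}, {"A", "B", "C", "D"}, {"A", "B"}, {"A", "C"}]
print(representative_basket(baskets))    # starts from {A}, then adds B and C
```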
ROCK
It is one of the first algorithms proposed for transactional data, while TX-Means is one of the most recent.
It is a hierarchical clustering algorithm that uses links between clusters instead of the classic notion of distance.
There is also a notion of neighborhood, and the number of links is used to give an idea of similarity. Given a distance (which in this case is the Jaccard distance) and a certain threshold θ, we say that two transactions are "close" (neighbors) if their similarity is greater than θ.
Example:
The objects in common are 3, while the objects in the union of A and B are 6, so the similarity is 3/6.
The higher the value of the link between two objects, the higher the probability that the two objects belong to the same cluster. It does not directly mean that A and B are formed by similar items, but it means that the sets of transactions similar to A and similar to B have a high overlap. So we can say that similarity is a
local concept because it observes only if two transactions are similar in terms of items contained within
them, while link captures global (or pseudo-local) information because it compares the neighbors of A and B.
Example:
The basic idea is this, even if in practice it is not exactly like this, because a Criterion Function that still uses the number of links is employed. The idea of ROCK is to maximize this function so as to return the optimal partition of the transactions. The function is maximized when transactions with many common links are in the same cluster, while transactions with different links are kept in separate clusters.
Even though ROCK is hierarchical clustering, it takes the number k of clusters as a parameter, so it stops the
hierarchy once it reaches k different clusters.
To identify the best pair of clusters to merge, a measure called goodness is calculated:
It takes as input a set of transactions, the number k of clusters we want and the similarity threshold θ; the output is a grouping of the clustered data.
During the first step it takes a sample from the dataset; this is done because the whole procedure is computationally very expensive, so sampling ensures scalability for very large datasets. The initial sample is used to form the clusters, and the remaining unused data are assigned to these clusters using the concept of distance.
In the second step it runs a hierarchical agglomerative clustering algorithm: it starts by assigning a cluster to each point (a bottom-up procedure, like all the hierarchical algorithms seen so far), then it calculates the similarity between each pair of clusters using the goodness measure (which uses the notion of link) as the similarity, and the two clusters with the highest similarity are merged. Finally, the stop condition is checked, which is simply whether we have reached the required number k of clusters; if not, we go back to step two.
In the third step it labels the data, i.e. it assigns the remaining data to the formed clusters, selecting a random sample from each cluster and assigning each point to the cluster with which it has the highest link value.
The second table, called the adjacency table, is obtained by putting the value 1 in the cells where the Jaccard similarity is greater than θ (which in this case is 0.3), and 0 in the cells where it is less than or equal.
The third table is the link table, also called the common-neighbors table: it contains the number of common neighbors of each pair of transactions, and it can also be obtained by multiplying the adjacency table by itself.
P1={1,2,3}
P2={1,2,3}
P4={2,4}
So the intersection between P1 and P2 is {1,2,3} which has cardinality 3, while between P1 and P4 it is {2} which has cardinality 1.
At the first iteration the calculation is simple because n=1 and m=1 given that they are the cardinalities of the clusters (they are
the ones we called ni and nj before) and that we start from all clusters formed by a single point. For the first iteration, the calculation
of the denominator can be done only once because it will be the same for all pairs.
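A small sketch of the three tables described above: Jaccard similarities, the θ-thresholded adjacency (neighbor) table, and the link table obtained by multiplying the adjacency matrix by itself (the baskets and the self-neighbor convention are illustrative choices):

```python
# Jaccard similarity -> adjacency (neighbors, threshold theta) -> links = A @ A.
import numpy as np

baskets = [{"a", "b", "c"}, {"a", "b", "d"}, {"c", "d", "e"}, {"a", "c"}]
theta = 0.3

sim = np.array([[len(x & y) / len(x | y) for y in baskets] for x in baskets])
adj = (sim > theta).astype(int)
np.fill_diagonal(adj, 0)            # here a point is not counted as its own neighbor
links = adj @ adj                   # links[i, j] = number of common neighbors of i and j

print(np.round(sim, 2))
print(adj)
print(links)
```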
CLOPE
It is also a very efficient clustering algorithm for transactional data and large datasets. Like ROCK, it uses a global criterion function, whose goal is to increase the overlap of the transaction items within the same cluster by increasing the height-to-width ratio of the cluster histogram.
For each cluster I write down the items within the cluster and put a square on that item for each time it appears within the
cluster.
Clustering 1 clusters the top 3 transactions in one cluster and the bottom 2 in another cluster, while clustering 2 clusters the top 2
transactions and the bottom 3 in another cluster.
How do we tell which of the two clustering is better? The idea is that the easier it is to generate a “square” with these histograms,
the better the clustering.
In clustering 1 we have to add 5 squares, while in clustering 2 we have to add 8, so clustering 1 is better. That's the idea but we have
to calculate it mathematically.
D(C), the set of distinct items of a cluster, for the first cluster of clustering 1 contains 4 items, while for the second it contains 3; its size gives the width W(C) of the histogram.
S(C) is the total number of squares in the cluster, i.e. the total number of item occurrences, and the height of the histogram is H(C) = S(C)/W(C).
The higher the ratio between H(C) and W(C), the more "complete" the histogram will be. This quantity is used by the criterion function, which evaluates the goodness of a clustering through the gradient of a cluster, defined as G(C) = H(C)/W(C).
Generalizing, a parameter r called repulsion is used: the larger it is, the more the algorithm prefers transactions that have a greater portion of items in common; a data partition that maximizes the profit function yields a good clustering.
This algorithm takes as input the dataset, the repulsion and the maximum number of clusters.
In the first phase it extracts each transaction and adds it to a new or existing cluster, maximizing the profit function.
After every transaction has been assigned to a cluster, we move on to phase two, in which we ask, for each individual transaction, whether it is better to keep it in its current cluster or to move it to another cluster; these movements also take place according to the profit function. Phase 2 is repeated until no transaction moves anymore, i.e. all transactions remain in the same cluster.
If we have transactions A,B,C,D,E,F and we want 3 clusters, first we take transaction A and insert it for example in cluster
1, then we take B and calculate the similarity with the first cluster and with the second and we insert it where it has the
greatest similarity, then we take C and do the same thing.
C1 = ACE
C2 = BD
C3= F
and we pass to phase 2: starting from A, we check whether moving A to C2 or C3 increases the profit; the same is done for all the other transactions. The procedure iterates until no transaction needs to be moved anymore.
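A sketch of the histogram statistics and of the profit criterion described above (the profit formula used here, sum of S(C)/W(C)^r · |C| divided by the total number of transactions, is my reading of the missing slide formulas):

```python
# Cluster histogram statistics and the repulsion-based profit of a partition of transactions.
from collections import Counter

def histogram_stats(cluster):
    """cluster = list of transactions (sets of items). Returns S(C), W(C), H(C)."""
    occ = Counter(item for t in cluster for item in t)
    S = sum(occ.values())          # total number of "squares" (item occurrences)
    W = len(occ)                   # width: number of distinct items, |D(C)|
    return S, W, S / W             # H(C) = S(C) / W(C)

def profit(clustering, r=2.0):
    """Higher repulsion r favors clusters whose transactions share a larger portion of items."""
    num = 0.0
    for c in clustering:
        if not c:
            continue
        S, W, _ = histogram_stats(c)
        num += S / (W ** r) * len(c)
    return num / sum(len(c) for c in clustering)

c1 = [{"a", "b"}, {"a", "b", "c"}, {"a", "c", "d"}]
c2 = [{"d", "e"}, {"d", "e", "f"}]
print(profit([c1, c2], r=2.0))
```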
12/05
Explainability
Definitions:
- To interpret means to give or provide the meaning, or to explain and present something in terms of concepts that are understandable.
- In AI, Data Mining and Machine Learning, interpretability is defined as the ability to explain or to provide meaning in terms understandable to a human.
Blackbox model
A black box is a model whose internals are either unknown to the observer or known but not interpretable by humans. It is important to discuss the reasons for wanting an explanation of black-box models.
It is also possible to have access to the box and still find that the decision process is so complex that, as a human, it is not possible to understand how, given a certain input, a certain output is obtained.
Examples: Deep Neural Networks, SVMs, Ensembles → it is difficult to understand all the possibilities that can lead from every possible input to every possible output.
Interpretable models
However, we know that there are some interpretable models. One of these is the Decision Tree: we can follow the logical reasoning that leads from the root to each leaf. Another model that can be interpreted is the Linear Model (a linear or logistic regressor), in the sense that we can observe the coefficients and consider their sign and their magnitude.
In the following case, for example, there are models trained on the Titanic Dataset: for the linear model, some attributes of the passenger give a positive contribution to the outcome, while her age and the class in which she is travelling give a negative contribution; the final outcome is still "survived"!
We also examined other purely rule-based classifiers, which are interpretable, and decision trees can be linearized into a set of decision rules with the "if-then" format: if condition1 ∧ condition2 ∧ condition3, then outcome.
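As a small practical illustration (not from the notes), scikit-learn can print a trained decision tree as exactly this kind of if-then structure:

```python
# Linearizing a decision tree into readable if-then rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)
print(export_text(tree, feature_names=list(iris.feature_names)))
```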
Through this technique ProPublica discovered that this score was biased against black people, who were considered more dangerous than white people regardless of whether they actually committed more crimes!
One asks the machine: why is the first image classified as a fish? Because there is a saliency map with green and red areas, the green ones suggesting that the image is a fish, the red ones a dog. Since there is more green than red, the machine says it is a fish. BUT the green areas are mainly in the background of the fish image, so you may suspect a background bias.
Then you can ask the machine whether you can see some of the training examples used that affect the prediction:
this is the same behavior humans adopt when they try to convince a person of their thesis, so it is important to have access to this information!
Role-based interpretability
It is not possible to have a universally interpretable explanation! The question shifts from "Is the explanation interpretable?" to "To whom is the explanation interpretable?".
In the literature, 3 types of users are recognized:
End users: "Can I contest the decision?" "What could I do differently to get a positive result?"
Engineers, data scientists: "Does my system work as designed?"
There are different ways in which the problem can be solved. At a very high level we can distinguish between explainable-by-design methods and black box explanation methods.
In the first case we have a black box system and we have to replace it with a transparent system; this is called intrinsic explainability because the final model will be intrinsically explainable. E.g.: we have a neural network as the black box system and somehow we replace it with a decision tree which, as we have said, is transparent, being an interpretable model.
This approach has some difficulties with images or texts, which are not easy to handle: we can represent tabular data with a decision tree, but we cannot easily do it with image or textual data.
The second type of methods are the black box explanation methods: in this case we want to keep the black box AI system and add an explanation sub-system, so we will have the black box system responsible for returning the classification and the explanation system responsible for returning the explanation.
We have dataset X, we learn an interpretable and transparent model “c”, and this model c is able to return
both the outcome and the explanation to the user as output.
Can I play tennis knowing the weather is Sunny and Humidity is Normal? According to the model the answer is
"Yes".
In this case there are two versions: a global one and a local one.
We refer to a global explanation when e = f(b, x) gives the user the overall logic of black box b: with this explanation we are able to reproduce the behavior of the black box for every possible input (e.g. the previous decision tree).
If instead we have a local explanation (in the form of a rule, for example), we know the reason why the classifier answers yes or no for the specific instance under analysis.
Global explainability provides a complete view of the black box, for all possible inputs, for all possible outcomes,
while a local explanation reveals why a specific instance analyzed at a specific moment was classified in a certain
way.
Research is mostly based on local explanations, because global ones are much more difficult to extract!
In the case of a model-specific explainer, the explanation method strictly depends on the black box: for example, it may be able to explain only random forests, because the explanation method uses some components that exist only in a particular black box or black box family; if I want to explain a random forest, the explanation method will need to handle the individual trees.
In a model-agnostic explainer, on the other hand, the explanation method is independent of the black box, so it can be used with any black box regardless of the type of model. This can be very useful in scenarios where we do not physically have the black box but only, for example, an API, so we only have to deal with inputs and outputs.
Data types
Currently, different types of data are analyzed for explainability, mainly three: tabular data, images and, with some approaches, texts.
For tabular data we certainly have rule-based explanations expressed in a logical form, decision trees, and feature importance, which reveals how important each feature is for a specific output; the latter can only be local, because it refers to a specific instance.
Prototypes and counter-exemplars are instances: prototypes are instances similar to the analyzed one that get the same output, clarifying the reason why the classifier returns a certain label for that instance, while counter-exemplars are instances similar to the analyzed one but with a different outcome.
For images and text we also have other options, the most famous is the saliency map concept (that of the
fish example), it is a heatmap placed on an image that has a certain color if a pixel/area is positively important for the
classification and another color if it is negatively important.
The saliency map is the feature importance counterpart for images, while a similar version for texts is the sentence
highlighting, which colors certain words differently if they have a positive or negative impact in the classifier.
Pseudocode:
We have to focus on two points. The first is black box auditing: the explanation of the NN is built not on the original labels of the dataset but on the labels assigned to the data by the NN, because we want to predict the class provided by the black box (even if it is wrong). The second aspect is line 5, random generation, which produces a certain number of random records at each split, respecting the conditions of the tree defined up to that point.
For example, if at the first split we have 100 records, at the second step I will have (e.g.) 70 and 30. When I want to learn the second split I do not want to do it on 30 instances but on 100, therefore TREPAN generates 70 synthetic instances respecting the condition (UniformityCellSize < 2.5) and applies the black box b to that set z, obtaining b(z) for the 70 instances (in blue). So we will have 30 real elements and 70 "artificial" ones.
Next, class y is used in line 9 to find the best split (with the label assigned by the black box, so it's not the
real label).
Considering the red cross in the example, we can say that it is classified as a red cross because it is located
to the left of the line dividing the dots, regardless of the area in which it is located.
Thus a single decision can be verified by checking the black box in the neighborhood of a given instance
and learning a local decision.
LIME can be used to explain any type of black box by providing local explanations; here you can see its application to tabular data and to images, since LIME is local and model/data agnostic, so it does not strictly depend on a data type.
We see that duration_in_month makes a large contribution to the classification with a negative contribution (0.11) against class
1, and account_check_status makes a positive contribution (0.99) for class 1. Lime's idea is to synthetically generate instances near
the instance to explain and then learn a local surrogate expressed as a logistic regressor (with Lasso or Ridge penalty) and then
use the coefficients of this regressor as feature importances
Example:
b(x) indicates the probability over the possible classes.
In the second step we synthetically generate instances similar to the initial one and we see how the probabilities change, and so on for a few iterations.
Then what we do is train a Logistic (or Linear, depending on the case) regressor.
After learning the regressor, its coefficients are reported as the explanation.
This notion of an interpretable representation is used by sample_around(x') (line 5): as was done in the example, some features are randomly selected and their values are changed randomly according to the known distributions. The function on line 6, bel2real(z'), turns the interpretable representation back into the real one; then the black box is queried and the instances are stored: zi (the instance), b(zi) (the prediction of the black box for the class of the instance to be explained), d(x, z) (the distance between x and the generated instance). Finally, a Lasso problem is solved, providing the weights w, which express how important (or not) each feature is for the classification.
Considering that an image can be divided into areas of the same color (superpixels), the original image x is represented in the interpretable space by the presence/absence of each superpixel; then, in the neighborhood, some superpixels are set to 0, obtaining the same image where some parts are obscured. We train a linear regression model and assign a weight to each superpixel, to understand which are the most important features for saying that the image is, for example, a fox.
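A from-scratch sketch of the local-surrogate idea just described (perturb around x, weight by proximity, fit a linear model, read the coefficients); it is a simplified stand-in, not the actual LIME implementation:

```python
# LIME-style local surrogate for tabular data (simplified sketch).
import numpy as np
from sklearn.linear_model import Ridge

def explain_instance(black_box, x, n_samples=1000, scale=0.5, kernel_width=1.0):
    """black_box(X) must return the probability of the class of interest for each row of X."""
    rng = np.random.default_rng(0)
    Z = x + rng.normal(0.0, scale, size=(n_samples, len(x)))   # perturb around x
    y = black_box(Z)                                           # query the black box
    d = np.linalg.norm(Z - x, axis=1)
    w = np.exp(-(d ** 2) / kernel_width ** 2)                  # proximity kernel
    surrogate = Ridge(alpha=1.0).fit(Z, y, sample_weight=w)    # local linear surrogate
    return surrogate.coef_                                     # feature importances

# Hypothetical usage with any model exposing predict_proba:
# coefs = explain_instance(lambda Z: model.predict_proba(Z)[:, 1], x_to_explain)
```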
• LORE can be generalized to work with images and text, using the same data representation adopted by LIME.
• LORE is similar to LIME, but there are two differences:
1. it uses a genetic approach instead of a random approach;
2. it uses a decision tree instead of a linear regressor.
Algorithm
This approach uses instances as evolving "chromosomes" and optimizes a "fitness function". The idea is to generate the neighborhood in two parts:
- one called "Z equal", containing instances similar to X that receive the same label from the black box;
- one called "Z not equal", containing instances similar to X but that receive a different label from the black box.
- In the image below: the star is the point to explain; the yellow and orange dots are the instances generated by the genetic approach.
- At points 6 and 7, given the decision tree c, we can extract the factual rule and the counterfactual rule. The counterfactual rules highlight the smallest change to the instance needed to obtain a different result.
Recap:
- in points 2 and 3 we have the generation of the (synthetic) dataset;
- in line 5 we have the "black box auditing", i.e. the query of the black box.
LORE is designed for tabular data, but using LIME's data transformation it can also be used for images and textual data.
SHAP
• the gain is given by the difference between the current prediction and the average prediction over all the instances;
• the players are the feature values of the instance, which collaborate to receive a certain gain;
• more precisely, we have these contributions: park-nearby contributed €30,000; area-50 contributed €10,000; floor-2nd contributed €0; cat-banned contributed -€50,000. The contributions add up to -€10,000, the final prediction minus the average predicted apartment price.
In summary: SHAP is an approach that uses Shapley values to return a local, model-agnostic explanation in the form of feature importance.
- Let's simulate that only park-nearby and area-50 are in a coalition, by randomly drawing another apartment from the data and using its value for the floor feature.
- The second floor is replaced by the randomly drawn first floor.
- We then predict the price of the apartment with this combination (€310,000).
So we fix the park and the 50 m² (the coalition), randomly select the features that can change (the floor in this case) and then test the possible values of the feature we are evaluating.
- In a second step, we remove the cat from the coalition by replacing it with a random value.
In the example the drawn value was cat-allowed, but it could also have been cat-banned again.
- We predict the apartment price for the park-nearby and area-50 coalition (€320,000).
- The contribution of cat-banned was €310,000 - €320,000 = -€10,000. This estimate depends on the values of the randomly drawn apartment that served as a "donor" for the cat and floor feature values.
- We get better estimates if we repeat this sampling step and average the contributions.
- The Shapley value is the average of the marginal contributions over all possible coalitions.
- Computation time increases exponentially with the number of features.
- For each of these coalitions we calculate the predicted apartment price with and without the feature value "cat-banned" and take the difference to get the marginal contribution.
- We replace the values of the features that are not in the coalition with random feature values drawn from another apartment, to get a prediction from the black box.
- If we estimate the Shapley values for all the features, we get the full distribution of the prediction (minus the mean) among the feature values.
The professor does not want us to remember this whole process. Crucially, SHAP does not use a local surrogate to estimate feature importance.
From an observational point of view it is similar to LIME, because it returns the explanation as feature importance, but SHAP does not use a local surrogate model: it uses game theory, summarized by the formula:
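The formula is not reproduced in these notes; its standard form (which should correspond to the one on the slide) is:

```latex
\phi_i \;=\; \sum_{S \subseteq F \setminus \{i\}}
      \frac{|S|!\,\bigl(|F|-|S|-1\bigr)!}{|F|!}\,
      \Bigl[ f_{S \cup \{i\}}\bigl(x_{S \cup \{i\}}\bigr) - f_{S}\bigl(x_{S}\bigr) \Bigr]
```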
Where (in the standard Shapley-value notation) F is the set of all features, S ranges over the subsets of features that do not contain feature i, and fS denotes the prediction obtained using only the features in S.
The crucial aspect (we are not asked to remember the formula) is that the Shapley value is the average of all the marginal contributions, so it is important to understand what Shapley values are.
The "table" below is the explanation given by SHAP, which says that the variable "LSTAT", with value 4.98, has the largest positive contribution, because it "moves" the prediction toward 24.41, while RM = 6.575 moves the prediction in the other direction. We therefore have an importance value for each feature.
SHAP works with tabular data. In this case the coalition contains Age, Weight and Color. The values on the right are the values of the instance x under analysis, while in z Age is kept fixed and Weight and Color are chosen randomly. This is an example of the instances created by SHAP that need to be tested:
SHAP ON IMAGES
SHAP can also work with images.
In this case the image is represented by three superpixels. In the original image all three are present, while in the image perturbed by SHAP superpixel sp3, i.e. the dog's face, is replaced by a gray circle.
The replacement color we choose affects the result, and this is a weakness of the approach.
RECAP:
- SHAP is local, model-agnostic and data-agnostic → similar to LIME;
- but it is theoretically different: it does not use a surrogate, it tests different subsets of features, keeping some fixed and computing the average marginal contribution of a particular feature value, in order to return the explanation in the form of feature importance (like LIME).
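A sketch of the sampling procedure described in the apartment example (Monte-Carlo estimation of one feature's marginal contribution, using random "donor" rows for the features outside the coalition); it is illustrative, not the optimized algorithms shipped with the shap library:

```python
# Monte-Carlo estimate of the Shapley value of one feature (simplified sketch).
import numpy as np

def shapley_value(predict, X, x, feature, n_iter=500, seed=0):
    """predict(X) -> predictions; X = background data; x = instance; feature = column index."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    contributions = []
    for _ in range(n_iter):
        donor = X[rng.integers(len(X))]                 # random "donor" instance
        order = rng.permutation(n_features)             # random coalition order
        pos = np.where(order == feature)[0][0]
        with_f, without_f = donor.copy(), donor.copy()
        before = order[:pos]                            # features already "in the coalition"
        with_f[before] = x[before]; without_f[before] = x[before]
        with_f[feature] = x[feature]                    # only difference: the feature of interest
        contributions.append(predict(with_f[None, :])[0] - predict(without_f[None, :])[0])
    return float(np.mean(contributions))

# Hypothetical usage: shapley_value(model.predict, X_train, x_to_explain, feature=2)
```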
Saliency Maps
A saliency map is an image superimposed on the original one, where the brightness of a pixel represents what is
"salient" (important). A positive (red) value indicates that the pixel contributed positively to the classification, while a negative
(blue) contributed negatively.
There are two ways to build one: 1) assign a saliency value to each individual pixel; 2) segment the image into groups of pixels (called superpixels or segments) and assign a saliency value to each group.
The examples indicated by the arrow represent the first method, in which each pixel has its own saliency value; the "lime" method is an example of the second way of creating a saliency map.
This method uses an all-black or all-white baseline image and tries to make it equal to the image to be classified, building a path from the baseline image x' to the input image x and adjusting the colors while observing the neural network. Finally the saliency map is obtained by overlaying the contribution of each point: the higher the gradient at a point, the more important it is for the classification.
MASK
The idea is to change the input image x to image xR, such that a region R covering x is found (adding a
blur or noise), thus obtaining a large discrepancy between the two images in terms of probability for a certain
class.
As seen from the following example, the classifier recognizes the flute with a probability of 0.9973, while
in the modified image it recognizes it with a probability of 0.0007. So the learned mask indicates the most
important region (in blue) for recognizing the flute.
Example-based explanations
Example-based explanations are methods that work for any data type and are local explanations (like saliency maps, feature importance, etc.).
These explanation methods make sense for humans only if they are used to compare one instance with another: they work well for images, tabular data with few features, and short texts.
1) Prototypes: prototypes are instances similar to the one under analysis which obtain the same class as the one under analysis.
So if an instance X is classified with class y, and an instance X' similar to X is also classified with class y, then X' is said to be a prototype of X.
- A criticism is still a prototype, classified with the same label as X, but it is not represented well by the set of prototypes, so it is something like an outlier.
- Influential instances are data points that have a big impact on the trained model if they are removed (graph below).
2) Counterfactuals: X'' (which must be similar to X) is counterfactual of X if the classification of X'' is different from y.
Counterfactual Explanation
• A counterfactual explanation describes a causal situation in the form: "If X had not occurred, Y would not have occurred."
• Thinking in counterfactual terms requires imagining a hypothetical reality that contradicts the observed facts.
• Although the relationship between the inputs and the outcome to be predicted may not be causal, we can view the inputs of a model as the cause of the prediction; this also happens in Machine Learning approaches. In these terms we can say that a CF is a sort of causal explanation.
• What we want from a CF is the smallest number of changes/perturbations of the features of the instance under analysis; that is, a counterfactual explanation of a prediction describes the smallest change in feature values that changes the prediction to a predefined output.
1. A first, simple and naive approach to generate counterfactual explanations is search by trial and error: we randomly change feature values (the random component) of the instance of interest and stop when the desired output is predicted → a random value is tried and the output is observed; if no CF is found, two features are changed, and so on.
2. As an alternative we have an optimization-based approach: here we define a loss function that considers the instance of interest, the counterfactual and the desired (counterfactual) outcome. Then we can find the counterfactual explanation that minimizes this loss using an optimization algorithm.
→ Many methods proceed this way, but differ in their definition of the loss function and in the optimization method; we present one:
Optimized CF search
In the slide it is possible to see the loss function: the first term pushes (f(x') - y')² towards 0, i.e. it forces f(x') towards the desired outcome y', while the second term, the distance between the two instances, must be minimized.
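Written out (this is the usual form of such a loss, e.g. the one by Wachter et al.; the slide may differ in details):

```latex
L(x', \lambda) \;=\; \lambda \,\bigl(f(x') - y'\bigr)^{2} \;+\; d(x, x'),
\qquad
x'_{\mathrm{cf}} \;=\; \arg\min_{x'} \max_{\lambda} \; L(x', \lambda).
```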
For the first term, either the strict equality between the two classes or the probability of obtaining a certain class can be used; the latter makes the optimization procedure a little more flexible.
If the condition is not yet satisfied, i.e. if the difference between the prediction for x' and y', in absolute value, is not below a certain threshold, the weight of the first term is increased and the optimization is repeated.
→ The professor thinks that CF explanations are the most useful explanations in real case studies, much more than rule-based explanations, etc.
In the literature we have another family of explainers, called visual inspectors or inspectors for black boxes; the most famous of them is the partial dependence plot.
The partial dependence plot (PDP) shows the marginal effect that one or two features have on the predicted outcome of a model → it can be seen as a simplification of SHAP; it is essentially defined by the following partial function:
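The partial function is missing from these notes; its standard form is:

```latex
\hat{f}_S(x_S) \;=\; \frac{1}{n} \sum_{i=1}^{n} f\bigl(x_S,\, x_C^{(i)}\bigr)
```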
In particular, the partial function above tells us, for a given value (or values) of the feature set S, the mean marginal effect on the prediction, where xC are the actual values in the dataset of the features we are not interested in, and n is the number of instances.
Looking at the picture, these are the steps:
- We start from x, the instance under analysis, and change one feature; in the example Age goes from 50 to 53, keeping the others fixed.
- On the right of the figure we can see the partial dependence plot for age. The lines represent the probability of high and low risk as age varies. We can see that age > 55, other things being equal, creates a jump between low and high risk: the probability of high risk grows a lot.
→ It is then a matter of introducing perturbations on the input values to understand how much each feature affects the prediction using the PDP → the input is changed one variable at a time.
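A quick sketch of computing a PDP by hand, directly following the partial function above (the dataset and the chosen feature are placeholders):

```python
# Manual partial dependence: average the prediction over the dataset while forcing one feature.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

feature = 2                                        # the "bmi" column in this toy dataset
grid = np.linspace(X[:, feature].min(), X[:, feature].max(), 20)
pdp = []
for v in grid:
    X_mod = X.copy()
    X_mod[:, feature] = v                          # force the feature of interest to value v
    pdp.append(model.predict(X_mod).mean())        # average over all instances (the x_C part)

for v, p in zip(grid[:5], pdp[:5]):
    print(f"feature value {v:.3f} -> mean prediction {p:.2f}")
```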
Open the black box! The professor invites us to take responsibility for the unwanted effects of automated decision-making processes; this would:
• improve industry standards for the development of AI-powered products, increasing the trust between companies and consumers.
To achieve these results, however, there are some open research questions:
• there is no agreement on what an explanation is, and there is still no formalism for explanations;
• there is no work that seriously addresses the problem of quantifying the degree of comprehensibility of an explanation for humans.
→ Python examples