Lecture 9: Machine Learning Algorithms
FACULTY OF ENGINEERING
COMPUTER ENGINEERING DEPARTMENT
Machine learning is a sub-field of artificial intelligence (AI) that provides systems the ability
to automatically learn and improve from experience without being explicitly programmed.
Machine learning is not about learning in general (like humans), but about learning a specific
task in a targeted manner (e.g., classifying nationality or predicting a stock price). The system
is intended to improve at this task over time.
A computer program is said to learn from experience E with respect to some class of tasks T and
performance measure P, if its performance at tasks in T, as measured by P, improves with
experience E. (Mitchell, 1997)
Hypothesis space:
A hypothesis that fits all the training examples is called a consistent hypothesis.
Examples:
Set of polynomials as hypothesis space for (a) and (b) and two consistent hypotheses.
How do we choose from multiple consistent hypotheses?
In the example above, consider case (c). It would be better to find a simple straight line that
isn't exactly consistent but that allows for meaningful predictions.
For non-deterministic functions there is an unavoidable trade-off between the complexity of the
hypothesis and its degree of agreement with the data.
Example (d) shows that the data from (c) are consistent with a simple function of the form a*x
+ b + c * sin x.
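As a small numerical illustration of consistency (the data points below are made up, not taken from the lecture's figures), the following Python sketch fits two hypotheses from a polynomial hypothesis space: a high-degree polynomial that is consistent with every training example, and a simpler straight line that is only approximately consistent.

import numpy as np

# Hypothetical training examples (invented for illustration).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.1, 0.9, 2.3, 2.8, 4.2, 4.9])

# Simple hypothesis: a straight line (degree-1 polynomial); not exactly consistent.
line = np.polyfit(x, y, deg=1)

# Consistent hypothesis: a degree-5 polynomial can pass through all 6 points exactly.
exact = np.polyfit(x, y, deg=5)

print("line residuals: ", y - np.polyval(line, x))   # small but non-zero errors
print("exact residuals:", y - np.polyval(exact, x))  # essentially zero on the training data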
Supervised Learning:
For this family of models, the researcher needs to have at hand a dataset with some observations
and the labels/classes of those observations. For example, the observations could be images of
animals and the labels the names of the animals (e.g., cat, dog, etc.).
These models learn from the labeled dataset and are then used to predict future events. During
training, the input is a known training dataset with its corresponding labels, and the learning
algorithm produces an inferred function that is used to make predictions for new, unseen
observations given to the model. After sufficient training, the model is able to provide targets
for any new input. The learning algorithm can also compare its output with the correct intended
output (the ground-truth label) and use the errors to modify itself accordingly (e.g., via
back-propagation).
Supervised models can be further grouped into regression and classification cases.
Some examples of models that belong to this family are the following: SVC, LDA, SVR,
regression, random forests, etc.
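As a minimal sketch of this supervised workflow (using SVC, which the list above mentions; the tiny dataset and its feature values are invented for illustration):

from sklearn.svm import SVC

# Hypothetical labeled dataset: each observation has two numeric features,
# and each observation comes with the label we want the model to predict.
X_train = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]]
y_train = ["cat", "cat", "dog", "dog"]

# Fit the supervised model on the observations together with their labels.
model = SVC()
model.fit(X_train, y_train)

# After training, the model can provide a target for a new, unseen observation.
print(model.predict([[1.2, 1.9]]))  # expected: ['cat']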
Unsupervised Learning:
For this family of models, the researcher needs to have at hand a dataset with some observations,
without also needing the labels/classes of those observations.
Unsupervised learning studies how systems can infer a function to describe a hidden structure
from unlabeled data. The system does not predict a "right" output; instead, it explores the data
and draws inferences that describe the hidden structure of the unlabeled dataset.
Some examples of models that belong to this family are the following: PCA, K-means,
DBSCAN, mixture models etc.
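As a minimal sketch (with made-up observations), K-means, which is listed above, groups unlabeled data into clusters without ever seeing class labels:

from sklearn.cluster import KMeans

# Unlabeled observations: only features are given, no classes.
X = [[1.0, 2.0], [1.2, 1.8], [8.0, 8.5], [8.3, 8.1]]

# Ask K-means to discover 2 groups (hidden structure) in the data.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels)                   # e.g. [0 0 1 1]: cluster assignments, not true classes
print(kmeans.cluster_centers_)  # the discovered group centers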
Reinforcement Learning:
This family of models consists of algorithms that use the estimated errors as rewards or penalties:
if the error is big, the penalty is high and the reward low; if the error is small, the penalty is
low and the reward high.
Trial-and-error search and delayed reward are the most relevant characteristics of reinforcement
learning. This family of models allows the automatic determination of the ideal behavior within
a specific context in order to maximize the desired performance.
Reward feedback is required for the model to learn which action is best and this is known as
“the reinforcement signal”.
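To make the reinforcement signal concrete, here is a minimal tabular Q-learning sketch on a hypothetical five-state corridor where only reaching the right end yields a reward. The environment, the reward values, and the learning constants are all invented for illustration; this is one simple reinforcement learning algorithm, not the only one.

import random

# Hypothetical corridor environment: states 0..4, actions 0 = move left, 1 = move right.
# Reaching state 4 ends an episode with reward +1; every other step gives reward 0.
N_STATES = 5
ACTIONS = (0, 1)
alpha, gamma, epsilon = 0.5, 0.9, 0.3

Q = [[0.0, 0.0] for _ in range(N_STATES)]  # Q-table: estimated return for each (state, action)

def step(state, action):
    nxt = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0  # the reinforcement signal
    return nxt, reward

for episode in range(200):
    s = 0
    while s != N_STATES - 1:
        # Trial and error: usually exploit the best known action, sometimes explore.
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[s][act])
        nxt, r = step(s, a)
        # Q-learning update: nudge the estimate toward reward + discounted future value.
        Q[s][a] += alpha * (r + gamma * max(Q[nxt]) - Q[s][a])
        s = nxt

print([round(max(q), 2) for q in Q])  # state values grow toward the rewarding end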
Normalization
Normalization is a technique often applied as part of data preparation for machine learning. The
goal of normalization is to change the values of numeric columns in the dataset to use a common
scale, without distorting differences in the ranges of values or losing information.
Normalization is also required for some algorithms to model the data correctly.
For example, assume your input dataset contains one column with values ranging from 0 to 1,
and another column with values ranging from 10,000 to 100,000. The great difference in
the scale of the numbers could cause problems when you attempt to combine the values as
features during modeling.
Normalization avoids these problems by creating new values that maintain the general
distribution and ratios in the source data, while keeping values within a scale applied across all
numeric columns used in the model.
• You can change all values to a 0-1 scale, or transform the values by representing
them as percentile ranks rather than absolute values.
• You can apply normalization to a single column, or to multiple columns in the
same dataset.
• If you need to repeat the experiment, or apply the same normalization steps to
other data, you can save the steps as a normalization transform, and apply it to
other datasets that have the same schema.
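A minimal sketch of 0-1 (min-max) scaling with NumPy, using a made-up dataset whose two columns live on very different scales, as in the example above:

import numpy as np

# Hypothetical dataset: one column in the 0-1 range, another in the 10,000-100,000 range.
data = np.array([[0.2, 25_000.0],
                 [0.7, 90_000.0],
                 [0.5, 10_000.0],
                 [0.9, 60_000.0]])

# Min-max normalization: rescale every column to the common 0-1 scale
# while preserving the general distribution and ratios within each column.
col_min = data.min(axis=0)
col_max = data.max(axis=0)
normalized = (data - col_min) / (col_max - col_min)

print(normalized)  # both columns now lie between 0 and 1

The pair (col_min, col_max) can be saved and reused as a normalization transform for other datasets that have the same schema.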
Underfitting and Overfitting
If the model is underfitting, it has failed to grasp the underlying logic of the data: it does not
know what to do with the data and gives inaccurate results. In the other case, if the model is
overfitting, it fits this particular dataset too closely and misses the general pattern.
Consider a regression problem that finds the price of houses according to their size.
In the first graph, we see a linear model. It predicts that the price will always increase as the
size increases. But when we look at the points on the chart, we see that the price can stay
stable as the size increases. In this case the model is underfitting and the error is quite high.
In the second graph, the model is a quadratic function. We can see that this function fits the
data well, and such a model can be trained successfully on these data.
In the final chart, the model appears to fit the data very well, but there are many ups and
downs. We can think of this as the model forcing itself to fit the data. This model fails because
it is overfitting: what we want is a model that does not just work well on this particular dataset.
What makes overfitting deceptive is that the training error is very close to zero, and sometimes
even zero. This is the point that misleads us: we think the model is very good, but it has
actually been overfitted.
In the first case the model is underfitting. The second case is the desirable one, although it
contains some error. The third case is overfitting: it only works well for this particular data
layout.
The graphs below show what kind of a problem an overfitted model will cause for different
datasets.
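The same effect can be checked numerically. The sketch below uses made-up house-size/price data (not the lecture's figures) and fits polynomials of degree 1, 2, and 9, comparing the error on the training points with the error on held-out points; the high-degree fit typically shows the lowest training error but a higher error on the unseen points.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: x is house size (in hundreds of square meters);
# the price grows with size but levels off, plus some noise.
x = np.linspace(0.5, 3.0, 30)
price = 100 * np.sqrt(x) + rng.normal(0, 5, x.shape)

train, test = np.arange(0, 30, 2), np.arange(1, 30, 2)  # every other point is held out

for degree in (1, 2, 9):
    coeffs = np.polyfit(x[train], price[train], degree)
    train_mse = np.mean((price[train] - np.polyval(coeffs, x[train])) ** 2)
    test_mse = np.mean((price[test] - np.polyval(coeffs, x[test])) ** 2)
    print(f"degree {degree}: train MSE {train_mse:7.2f}, test MSE {test_mse:7.2f}")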
Linear Separability and the Step Function
Truth table of the AND function (inputs a and b, output x):

a    b    x
0    0    0
0    1    0
1    0    0
1    1    1

Step function (with weights w_a, w_b and threshold t):
output = 1 if a*w_a + b*w_b > t, and 0 otherwise
Linearly separable problems: AND, OR, etc. Their output values can be separated from each other
by a straight line in a two-dimensional chart; in fact, more than one such line can separate them.
XOR is not a linearly separable problem. In such cases, we need hidden layers.
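A minimal sketch of the step function above with hand-picked weights (w_a = w_b = 1 and threshold t = 1.5, my own choice for illustration): it reproduces the AND truth table exactly, whereas no single choice of weights and threshold can reproduce XOR.

def step_unit(a, b, w_a, w_b, t):
    # Single threshold unit: outputs 1 only if the weighted sum exceeds the threshold t.
    return 1 if a * w_a + b * w_b > t else 0

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

# AND is linearly separable: these weights and threshold reproduce its truth table.
print([step_unit(a, b, 1, 1, 1.5) for a, b in inputs])  # [0, 0, 0, 1]

# XOR is not linearly separable: no w_a, w_b, t make a single unit output [0, 1, 1, 0];
# that is why hidden layers are needed.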
Decision Trees
A decision tree is a tree-shaped plan of checks we perform on the features of an object before
making a prediction. For example, here’s a tree for predicting if the day is good for playing
outside:
Each internal node inspects a feature and directs us to one of its sub-trees depending on the
feature's value, while the leaves output decisions. Every leaf contains the subset of training
objects that pass the checks on the path from the root to that leaf. Upon reaching a leaf, we
output the majority class or the average value of the leaf's set.
However, trees are unstable. Slight changes to the training set, such as the omission of a handful
of instances, can result in totally different trees after fitting. Further, trees can be inaccurate and
perform worse than other machine-learning models on many datasets.
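A minimal sketch using scikit-learn's DecisionTreeClassifier on a tiny made-up "play outside" dataset (the features and their values are invented for illustration):

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical dataset: [temperature in °C, is_raining (0/1)] -> is the day good for playing outside?
X = [[25, 0], [30, 0], [10, 1], [15, 1], [28, 1], [5, 0]]
y = ["yes", "yes", "no", "no", "no", "no"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# The fitted tree is a plan of feature checks from the root down to the leaves.
print(export_text(tree, feature_names=["temperature", "is_raining"]))
print(tree.predict([[27, 0]]))  # follow the checks for a new day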
Random Forests
A random forest is a collection of trees, all of which are trained independently and on different
subsets of instances and features. The rationale is that although a single tree may be inaccurate,
the collective decisions of a bunch of trees are likely to be right most of the time.
For example, let's imagine that our training set 𝒟 consists of 200 instances with four features:
𝐴, 𝐵, 𝐶, and 𝐷. To train a tree, we randomly draw a sample 𝒮 of instances from 𝒟. Then, we
randomly draw a sample of features. For instance, 𝐴 and 𝐶. Once we do that, we fit a tree to 𝒮
using only those two features. After that, we repeat the process choosing different samples of
data and features each time until we train the desired number of trees.
Forests are more robust and typically more accurate than a single tree. But, they are harder to
interpret since each classification decision or regression output has not one but multiple
decision paths. Also, training a group of 𝑚 trees will take 𝑚 times longer than fitting only one.
However, we can train the trees in parallel since they are by construction mutually independent.
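A minimal scikit-learn sketch with a made-up dataset of 200 instances and four features (A, B, C, D), echoing the example above. Note that scikit-learn's RandomForestClassifier samples a random subset of features at every split rather than once per tree, which is a common variant of the idea described here; the trees can be trained in parallel via n_jobs.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical dataset: 200 instances with four features (A, B, C, D) and two classes.
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 2] > 0).astype(int)  # the class depends mainly on features A and C

forest = RandomForestClassifier(
    n_estimators=50,   # number of trees in the forest
    max_features=2,    # each split considers a random subset of 2 features
    bootstrap=True,    # each tree is fit on a random sample of the instances
    n_jobs=-1,         # the trees are independent, so they can be trained in parallel
    random_state=0,
)
forest.fit(X, y)

# Each tree votes; the forest outputs the majority class.
print(forest.predict(X[:5]))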
Gradient Boosting Trees
In gradient boosting, we first fit a tree f_1(x) to predict y and then look at the residuals
y − f_1(x), as in the example below, so that a second tree f_2 can be fit to predict those residuals.

x      y     f_1(x)    y − f_1(x)
x_1    10    9          1
x_2    11    13        −2
x_3    13    15        −2
x_4    20    25        −5
x_5    22    31        −9
If we are satisfied, we stop here and use the sequence of trees f_1 and f_2 as our model. However,
if the residuals y − f_1(x) − f_2(x) indicate that the sequence f_1 + f_2 still has too high an
error, we fit another tree f_3 to predict y − f_1(x) − f_2(x) and use the sequence
f_1 + f_2 + f_3 as our model. We repeat the process until we reduce the errors to an acceptable
level or reach the maximum number of trees.
Gradient boosting trees can be more accurate than random forests. Because we train them to
correct each other’s errors, they’re capable of capturing complex patterns in the data.
However, if the data are noisy, the boosted trees may overfit and start modeling the noise.
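A minimal hand-rolled sketch of this residual-fitting idea using two shallow regression trees. The y values reuse the table above; the inputs x and all settings are invented for illustration, and real gradient boosting libraries add a learning rate and many more trees.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Targets y from the table above; the inputs are simple made-up 1-D values.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([10.0, 11.0, 13.0, 20.0, 22.0])

# Step 1: fit the first tree f1 to predict y.
f1 = DecisionTreeRegressor(max_depth=1).fit(X, y)
residuals = y - f1.predict(X)

# Step 2: fit the second tree f2 to predict the residuals of f1.
f2 = DecisionTreeRegressor(max_depth=1).fit(X, residuals)

# The boosted model is the sequence f1 + f2; its overall error shrinks.
prediction = f1.predict(X) + f2.predict(X)
print("residuals after f1:     ", residuals)
print("residuals after f1 + f2:", y - prediction)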
The main differences between Gradient Boosting Trees and Random Forests
There are two main differences between gradient boosting trees and random forests.
We train the former sequentially, one tree at a time, each to correct the errors of the
previous ones. In contrast, we construct the trees in a random forest independently.
Because of this, we can train a forest in parallel but not the gradient boosting trees.
The other principal difference is in how they output decisions. Since the trees in a
random forest are independent, they can determine their outputs in any order. Then, we
aggregate the individual predictions into a collective one: the majority class in
classification problems or the average value in regression. On the other hand, the
gradient boosting trees run in a fixed order, and that sequence cannot change. For that
reason, they admit only sequential evaluation.
Classification Performance Metrics

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
The accuracy value is calculated as the ratio of the observations that the model classifies
correctly to the total dataset. Model accuracy alone is not sufficient, especially on imbalanced
datasets that are not evenly distributed. For example, let's say we have a dataset of 100 patients
with and without cancer, and only 10 of them were diagnosed with cancer. In such a case, we do
not want cancer patients to go undiagnosed (False Negatives). Therefore, we should evaluate the
results of the other metrics together with accuracy.
\text{Precision} = \frac{TP}{TP + FP}
Precision (also called the positive predictive value), on the other hand, shows how many of the
values we predicted as Positive are actually Positive.
The precision value is especially important when the cost of a False Positive is high. For
example, if your model marks mails that should arrive in your inbox as spam (FP), you will not
see the important mails you need to receive and will suffer a loss. In this case, a high precision
value is an important criterion for model selection.
\text{Recall} = \frac{TP}{TP + FN}
Recall (or sensitivity) is a metric that shows how many of the observations that should be
predicted as Positive we actually predict as Positive.
The recall value helps us in situations where the cost of a False Negative is high, and it should
be as high as possible. For example, if a model built for fraud detection marks a fraudulent
transaction as non-fraud, the consequences of such a situation will pose a problem for a bank.
F_1\text{-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
The F1 score is the harmonic mean of the precision and recall values. The reason for using a
harmonic mean instead of a simple mean is that we should not ignore the extreme cases: with a
simple average, a model with a precision of 1 and a recall of 0 would have an F1 score of 0.5,
which would mislead us (the harmonic mean gives 0 instead).
The main reason for using the F1 score instead of accuracy is to avoid selecting the wrong model
on unevenly distributed datasets. In addition, the F1 score is important because we need a
measurement metric that accounts not only for False Negatives or False Positives alone but for
both error costs.
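A minimal sketch that computes the four metrics above from made-up confusion-matrix counts (the counts are invented to mimic an imbalanced dataset):

# Hypothetical confusion-matrix counts for a binary classifier on 100 cases.
TP, FP, TN, FN = 8, 5, 85, 2

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy  = {accuracy:.3f}")   # high even though the positive class is rare
print(f"precision = {precision:.3f}")
print(f"recall    = {recall:.3f}")
print(f"f1        = {f1:.3f}")         # balances precision and recall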
The MAE, MSE, and MAPE metrics are mainly used to evaluate the prediction error rates and
model performance in regression analysis.
The picture below is a graphical description of the MAE. The green line represents our
model’s predictions, and the blue points represent our data.
The MAE is also the most intuitive of the metrics since we’re just looking at the absolute
difference between the data and the model’s predictions. Because we use the absolute value of
the residual, the MAE does not indicate underperformance or overperformance of the model
(whether or not the model under or overshoots actual data). Each residual contributes
proportionally to the total amount of error, meaning that larger errors will contribute linearly to
the overall error. As noted above, a small MAE suggests the model is good at prediction, while a
large MAE suggests that your model may have trouble in certain areas. An MAE of 0 means that
your model is a perfect predictor of the outputs (but this will almost never happen).
While the MAE is easily interpretable, using the absolute value of the residual often is not as
desirable as squaring this difference. Depending on how you want your model to treat outliers,
or extreme values, in your data, you may want to bring more attention to these outliers or
downplay them. The issue of outliers can play a major role in which error metric you use.
The mean squared error (MSE) is similar to the MAE, but it squares each residual before
averaging, so it gives greater weight to predictions that differ greatly from the corresponding
actual values. This is to say that large differences between actual and predicted values are
punished more in MSE than in MAE. The following picture graphically demonstrates what an
individual residual in the MSE might look like.
Outliers will produce these much larger squared differences, and it is our job to judge how we
should approach them.
The mean absolute percentage error (MAPE) is the percentage equivalent of MAE. The
equation looks just like that of MAE, but with adjustments to convert everything into
percentages.
Just as MAE is the average magnitude of the errors produced by your model, MAPE measures how far
the model's predictions are off from the corresponding actual values on average, in percentage
terms. Like MAE, MAPE has a clear interpretation, since percentages are easy for people to
conceptualize. Compared with MSE, both MAPE and MAE are less affected by outliers because they
use the absolute value of the residual rather than its square.
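A minimal sketch computing the three regression metrics. For illustration only, the actual values and predictions reuse the y and f_1(x) columns from the gradient boosting table earlier in the lecture:

import numpy as np

# Hypothetical actual values and model predictions.
y_true = np.array([10.0, 11.0, 13.0, 20.0, 22.0])
y_pred = np.array([9.0, 13.0, 15.0, 25.0, 31.0])

residuals = y_true - y_pred

mae = np.mean(np.abs(residuals))                  # average absolute error
mse = np.mean(residuals ** 2)                     # squaring punishes large errors more
mape = np.mean(np.abs(residuals / y_true)) * 100  # average absolute error in percent

print(f"MAE  = {mae:.2f}")    # 3.80
print(f"MSE  = {mse:.2f}")    # 23.00
print(f"MAPE = {mape:.2f}%")  # about 21.9%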