Lecture 12: Distance Metrics

Distance Metrics in Machine Learning

 Distance metrics are a key part of machine learning algorithms.


 These distance metrics are used in both supervised and unsupervised learning, generally to
calculate the similarity between data points.
 An effective distance metric improves the performance of our machine learning model,
whether that’s for classification tasks or clustering.

 Typical applications of distance metrics are k-Means Clustering and the k-Nearest
Neighbours algorithm for classification or regression.
 How will you define the similarity between different observations here?
 How can we say that two points are similar to each other?
 Points will be similar if their features are similar, right?
 When we plot these points, they will be closer to each other in distance.
4 Types of Distance Metrics in Machine Learning
1. Euclidean Distance
2. Manhattan Distance
3. Minkowski Distance
4. Hamming Distance
 Euclidean Distance: represents the shortest, straight-line distance between two points. Many
machine learning algorithms, including K-Means, use this distance metric to measure the
similarity between observations; it is also widely used in hierarchical (agglomerative)
clustering.

So, the Euclidean Distance between two points A(x₁, y₁) and B(x₂, y₂) will be:

Here’s the formula for Euclidean Distance:

d(A, B) = √[(x₂ − x₁)² + (y₂ − y₁)²]

We use this formula when we are dealing with 2 dimensions. We can generalize this for
an n-dimensional space as:

d(p, q) = √[Σᵢ₌₁ⁿ (pᵢ − qᵢ)²]

Where,
n = number of dimensions
pᵢ, qᵢ = the i-th coordinates of points p and q
Example: for A = (1, 2, 3) and B = (4, 5, 6),
d(A, B) = √[(4 − 1)² + (5 − 2)² + (6 − 3)²] = √27 ≈ 5.196

Python Code for Euclidean Distance: the SciPy library contains pre-written functions for most
of the common distance metrics:
from scipy.spatial import distance
# defining the points
point_1 = (1, 2, 3)
point_2 = (4, 5, 6)
point_1, point_2

These are the two sample points which we will be using to calculate the different distance
functions. Let’s now calculate the Euclidean Distance between these two points:
# computing the euclidean distance
euclidean_distance = distance.euclidean(point_1, point_2)
print('Euclidean Distance b/w', point_1, 'and', point_2, 'is: ', euclidean_distance)
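As a sanity check, the same value can be computed directly from the formula without SciPy (a minimal sketch):

```python
import math

def euclidean(p, q):
    # straight-line distance: square root of the sum of squared differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean((1, 2, 3), (4, 5, 6)))  # √27 ≈ 5.196
```

This should agree with the result from distance.euclidean above.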

2. Manhattan Distance: Manhattan Distance is the sum of absolute differences between
points across all the dimensions.
We can represent the Manhattan Distance between two points A(x₁, y₁) and B(x₂, y₂) as the
length of the axis-aligned path between them.

Since the above representation is 2-dimensional, to calculate Manhattan Distance, we will take
the sum of absolute distances in both the x and y directions.

So, the Manhattan Distance in a 2-dimensional space is given as:

d(A, B) = |x₂ − x₁| + |y₂ − y₁|

And the generalized formula for an n-dimensional space is given as:

d(p, q) = Σᵢ₌₁ⁿ |pᵢ − qᵢ|

Where,
n = number of dimensions
pᵢ, qᵢ = the i-th coordinates of points p and q
Instead of taking the straight line as in the Euclidean Distance, we ‘walk’ through
available, pre-defined paths, much like navigating a city’s street grid. The Manhattan
Distance is often preferred when movement is restricted to grid-like paths. Note that it
always returns a non-negative value, and that value is an integer only when all the
coordinates are integers.

# computing the manhattan distance


manhattan_distance = distance.cityblock(point_1, point_2)
print('Manhattan Distance b/w', point_1, 'and', point_2, 'is: ', manhattan_distance)
Note: Manhattan Distance is also known as city block distance. SciPy has a function
called cityblock that returns the Manhattan Distance between two points.
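Equivalently, without SciPy (a minimal sketch):

```python
def manhattan(p, q):
    # city block distance: sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(p, q))

print(manhattan((1, 2, 3), (4, 5, 6)))  # 9
```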

3. Minkowski Distance: Minkowski Distance is the generalized form of the Euclidean and
Manhattan Distances.
The formula for Minkowski Distance is given as:

d(x, y) = (Σᵢ₌₁ⁿ |xᵢ − yᵢ|ᵖ)^(1/p)

Where,
n = number of dimensions
xᵢ, yᵢ = data points
p = order of the norm
 Minkowski Distance is useful when a machine learning algorithm should give the user
control over the type of distance measure through a single parameter.
 If it is unclear which distance metric an algorithm should use, the Minkowski Distance is
a good choice, since the order p can be tuned during model optimization.

Let’s calculate the Minkowski Distance of the order 3:


# computing the minkowski distance
minkowski_distance = distance.minkowski(point_1, point_2, p=3)
print('Minkowski Distance b/w', point_1, 'and', point_2, 'is: ', minkowski_distance)

The p parameter of SciPy’s Minkowski Distance metric represents the order of the norm.
When the order p is 1, it represents Manhattan Distance, and when the order is 2, it
represents Euclidean Distance.
# minkowski and manhattan distance
minkowski_distance_order_1 = distance.minkowski(point_1, point_2, p=1)
print('Minkowski Distance of order 1:', minkowski_distance_order_1,
      '\nManhattan Distance:', manhattan_distance)

Here, you can see that when the order is 1, both Minkowski and Manhattan Distance are the
same.
# minkowski and euclidean distance
minkowski_distance_order_2 = distance.minkowski(point_1, point_2, p=2)
print('Minkowski Distance of order 2:', minkowski_distance_order_2,
      '\nEuclidean Distance:', euclidean_distance)

When the order is 2, we can see that Minkowski and Euclidean distances are the same.
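To make this relationship concrete, a hand-rolled Minkowski function (a minimal sketch, not SciPy’s implementation) reproduces both special cases:

```python
def minkowski(p, q, order):
    # p-norm of the coordinate-wise differences
    return sum(abs(a - b) ** order for a, b in zip(p, q)) ** (1 / order)

point_1, point_2 = (1, 2, 3), (4, 5, 6)
print(minkowski(point_1, point_2, 1))  # order 1 -> Manhattan: 9.0
print(minkowski(point_1, point_2, 2))  # order 2 -> Euclidean: ≈ 5.196
```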
4. Hamming Distance: Hamming Distance measures the similarity between two strings of
the same length. The Hamming Distance between two strings of the same length is the number
of positions at which the corresponding characters are different.

Let’s understand the concept using an example. Let’s say we have two strings:
“euclidean” and “manhattan”
 Since the length of these strings is equal, we can calculate the Hamming Distance.
 We will go character by character and match the strings.
 The first character of both the strings (e and m respectively) is different.
 Similarly, the second character of both the strings (u and a) is different, and so on.
Hence, the Hamming Distance here will be 7.

Let’s compute the Hamming Distance of two strings in Python:


# defining two strings
string_1 = 'euclidean'
string_2 = 'manhattan'

# computing the hamming distance (SciPy returns the fraction of differing
# positions, so we multiply by the string length to get a count)
hamming_distance = distance.hamming(list(string_1), list(string_2))*len(string_1)
print('Hamming Distance b/w', string_1, 'and', string_2, 'is: ', hamming_distance)

As we saw in the example above, the Hamming Distance between “euclidean” and
“manhattan” is 7.
Note: Hamming Distance only works when we have strings of the same length.
Let’s see what happens when we have strings of different lengths:
# strings of different shapes
new_string_1 = 'data'
new_string_2 = 'science'
len(new_string_1), len(new_string_2)

You can see that the lengths of both the strings are different.

# computing the hamming distance
hamming_distance = distance.hamming(list(new_string_1), list(new_string_2))

This throws an error saying that the lengths of the arrays must be the same. Hence, Hamming
distance only works when we have strings or arrays of the same length.
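A from-scratch version (a minimal sketch) makes the equal-length requirement explicit:

```python
def hamming(s1, s2):
    # Hamming Distance is defined only for equal-length strings
    if len(s1) != len(s2):
        raise ValueError('strings must be of the same length')
    # count the positions at which the characters differ
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming('euclidean', 'manhattan'))  # 7
```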
Example: Classify the following data points into two classes using Euclidean distance.
X1 = (2, 3, 4), X2 = (1, 2, 3) and X3 = (0, -2, -5)
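One way to work this exercise (a sketch, assuming the two closest points form one class) is to compute all pairwise Euclidean distances:

```python
from math import sqrt
from itertools import combinations

points = {'X1': (2, 3, 4), 'X2': (1, 2, 3), 'X3': (0, -2, -5)}

def euclidean(p, q):
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# pairwise distances: the closest pair forms one class,
# and the remaining point forms the other
for (n1, p1), (n2, p2) in combinations(points.items(), 2):
    print(n1, n2, round(euclidean(p1, p2), 3))
```

Here d(X1, X2) = √3 ≈ 1.73, while d(X2, X3) = 9 and d(X1, X3) = √110 ≈ 10.49, so X1 and X2 fall into one class and X3 into the other.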

Example: Prove that points A (0, 4), B (6, 2), and C (9, 1) are collinear.

Solution: To prove it, the sum of the distances between two of the pairs of points must be equal to
the distance between the remaining pair.

AB = √[(6 − 0)² + (2 − 4)²] = √[36 + 4] = √40 = 2√10

BC = √[(9 − 6)² + (1 − 2)²] = √[9 + 1] = √10

CA = √[(0 − 9)² + (4 − 1)²] = √[81 + 9] = √90 = 3√10

Here, we can see that

AB + BC = CA

(since 2√10 + √10 = 3√10), so the points A, B, and C are collinear.
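The collinearity check can also be verified numerically (a small sketch):

```python
from math import hypot, isclose

A, B, C = (0, 4), (6, 2), (9, 1)

def dist(p, q):
    # 2-D Euclidean distance
    return hypot(p[0] - q[0], p[1] - q[1])

# the points are collinear if AB + BC equals AC
print(isclose(dist(A, B) + dist(B, C), dist(A, C)))  # True
```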


Example: Check whether points A(√3, 1), B(0, 0), and C(2, 0) are the vertices of an equilateral triangle.

Solution: Three points A, B, and C are vertices of an equilateral triangle if and only if AB = BC = CA.

Given:

A(√3, 1) = (x₁, y₁)

B(0, 0) = (x₂, y₂)

C(2, 0) = (x₃, y₃)

Using the Euclidean distance formula,

AB = √[(x₂ − x₁)² + (y₂ − y₁)²]

= √[(0 − √3)² + (0 − 1)²]

= √(3 + 1)

= √4

= 2

BC = √[(x₃ − x₂)² + (y₃ − y₂)²]

= √[(2 − 0)² + (0 − 0)²]

= √(4 + 0)

= √4

= 2

CA = √[(x₁ − x₃)² + (y₁ − y₃)²]

= √[(√3 − 2)² + (1 − 0)²]

= √(7 − 4√3 + 1)

= √(8 − 4√3)

≈ 1.04

Here AB = BC ≠ CA, so the triangle is isosceles but not equilateral.
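The same side-length check in code (a small sketch):

```python
from math import sqrt, hypot

A, B, C = (sqrt(3), 1), (0, 0), (2, 0)

def dist(p, q):
    return hypot(p[0] - q[0], p[1] - q[1])

# all three sides must be equal for an equilateral triangle
AB, BC, CA = dist(A, B), dist(B, C), dist(C, A)
print(round(AB, 3), round(BC, 3), round(CA, 3))  # 2.0 2.0 1.035
```

Since only two of the sides are equal, the triangle is isosceles, not equilateral.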
