Learning by Computing Distances (2):
Wrapping-up LwP, Nearest Neighbors
CS771: Introduction to Machine Learning
Piyush Rai
Learning with Prototypes (LwP)
The class prototypes are just the class means:
$$\boldsymbol{\mu}_{+} = \frac{1}{N_{+}} \sum_{n: y_n = +1} \mathbf{x}_n, \qquad \boldsymbol{\mu}_{-} = \frac{1}{N_{-}} \sum_{n: y_n = -1} \mathbf{x}_n$$

Prediction rule for LwP (for binary classification with Euclidean distance): with $\mathbf{w} = \boldsymbol{\mu}_{+} - \boldsymbol{\mu}_{-}$, predict +1 if $\mathbf{w}^\top \mathbf{x} + b > 0$, otherwise predict -1.

Decision boundary (if Euclidean distance is used): the perpendicular bisector of the line joining the class prototype vectors.

For LwP, the prototype vectors (or their difference) define the "model": $\boldsymbol{\mu}_{+}$ and $\boldsymbol{\mu}_{-}$ (or just $\mathbf{w}$ and $b$ in the Euclidean distance case) are the model parameters. We can throw away the training data after computing the prototypes and only need to keep the model parameters at test time in such "parametric" models.

Exercise: Show that for the binary classification case, the LwP score can be written as
$$f(\mathbf{x}) = \sum_{n=1}^{N} \alpha_n \langle \mathbf{x}_n, \mathbf{x} \rangle + b$$
So the "score" of a test point is a weighted sum of its similarities with each of the $N$ training inputs. Many supervised learning models have this form, as we will see later.

Note: Even though $f$ can be expressed in this form, if $N > D$ this form may be more expensive to compute (O(N) time) compared to evaluating $\mathbf{w}^\top \mathbf{x} + b$ directly (O(D) time). However, the form is still very useful, as we will see later when we discuss kernel methods.
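As a concrete illustration of the Euclidean-distance case, here is a minimal NumPy sketch. The function names and the explicit bias term $b = \frac{1}{2}(\lVert\boldsymbol{\mu}_-\rVert^2 - \lVert\boldsymbol{\mu}_+\rVert^2)$, which makes $\mathbf{w}^\top\mathbf{x} + b > 0$ equivalent to being closer to $\boldsymbol{\mu}_+$, are my own choices, not from the slides.

```python
import numpy as np

def lwp_fit(X, y):
    """Compute class prototypes and the induced linear model (Euclidean case).

    X: (N, D) training inputs, y: (N,) labels in {-1, +1}.
    """
    mu_pos = X[y == +1].mean(axis=0)   # prototype (mean) of the positive class
    mu_neg = X[y == -1].mean(axis=0)   # prototype (mean) of the negative class
    w = mu_pos - mu_neg                # normal vector of the decision boundary
    # Bias chosen so that w.x + b > 0  <=>  x is closer to mu_pos than to mu_neg
    b = 0.5 * (mu_neg @ mu_neg - mu_pos @ mu_pos)
    return w, b

def lwp_predict(w, b, X_test):
    """Predict +1/-1 for each row of X_test using the linear rule."""
    return np.sign(X_test @ w + b)

# Tiny usage example with made-up data
X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])
y = np.array([-1, -1, +1, +1])
w, b = lwp_fit(X, y)
print(lwp_predict(w, b, np.array([[0.5, 0.5], [5.5, 4.5]])))  # -> [-1.  1.]
```

Note that after `lwp_fit` returns, the training data is no longer needed: only `w` and `b` are used at test time, matching the "parametric" remark above.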
Improving LwP when classes are complex-shaped
Using a weighted Euclidean or Mahalanobis distance can sometimes help.

Weighted Euclidean distance:
$$d_{\mathbf{w}}(\boldsymbol{a}, \boldsymbol{b}) = \sqrt{\sum_{i=1}^{D} w_i (a_i - b_i)^2}$$
Use a smaller $w_i$ for the horizontal-axis feature in this example (figure: two classes with prototypes $\boldsymbol{\mu}_{+}$ and $\boldsymbol{\mu}_{-}$).

Mahalanobis distance:
$$d_{\mathbf{W}}(\boldsymbol{a}, \boldsymbol{b}) = \sqrt{(\boldsymbol{a} - \boldsymbol{b})^\top \mathbf{W} (\boldsymbol{a} - \boldsymbol{b})}$$
Here $\mathbf{W}$ will be a 2x2 symmetric matrix in this 2-D example (chosen by us or learned). A good $\mathbf{W}$ will help bring points from the same class closer and move different classes apart.

Note: The Mahalanobis distance also has the effect of rotating the axes, which helps.
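A minimal NumPy sketch of both distances (function names are mine; $\mathbf{W}$ is assumed to be symmetric positive semi-definite):

```python
import numpy as np

def weighted_euclidean(a, b, w):
    """d_w(a, b) = sqrt( sum_i w_i * (a_i - b_i)^2 ), with per-feature weights w_i >= 0."""
    diff = a - b
    return np.sqrt(np.sum(w * diff**2))

def mahalanobis(a, b, W):
    """d_W(a, b) = sqrt( (a - b)^T W (a - b) ) for a symmetric PSD matrix W."""
    diff = a - b
    return np.sqrt(diff @ W @ diff)

a, b = np.array([1.0, 2.0]), np.array([3.0, 1.0])
print(weighted_euclidean(a, b, w=np.array([0.1, 1.0])))          # down-weight the first feature
print(mahalanobis(a, b, W=np.array([[0.1, 0.0], [0.0, 1.0]])))   # same value: diagonal W reduces to the weighted case
```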
Improving LwP when classes are complex-shaped
Even with a weighted Euclidean or Mahalanobis distance, LwP is still a linear classifier.

Exercise: Prove the above fact. You may use the following hint:
The Mahalanobis distance can be written as $d_{\mathbf{W}}(\boldsymbol{a}, \boldsymbol{b}) = \sqrt{(\boldsymbol{a} - \boldsymbol{b})^\top \mathbf{W} (\boldsymbol{a} - \boldsymbol{b})}$, where $\mathbf{W}$ is a symmetric matrix and thus can be written as $\mathbf{W} = \mathbf{L}^\top \mathbf{L}$ for some matrix $\mathbf{L}$.
Showing it for the Mahalanobis case is enough; weighted Euclidean is a special case with a diagonal $\mathbf{W}$.
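To spell out what the hint gives (a standard identity; $\mathbf{L}$ is the matrix from the factorization in the hint):
$$d_{\mathbf{W}}(\boldsymbol{a}, \boldsymbol{b})^2 = (\boldsymbol{a} - \boldsymbol{b})^\top \mathbf{L}^\top \mathbf{L}\,(\boldsymbol{a} - \boldsymbol{b}) = \lVert \mathbf{L}\boldsymbol{a} - \mathbf{L}\boldsymbol{b} \rVert_2^2$$
That is, the Mahalanobis distance is just the Euclidean distance between the linearly transformed inputs $\mathbf{L}\boldsymbol{a}$ and $\mathbf{L}\boldsymbol{b}$, so LwP with $d_{\mathbf{W}}$ behaves like ordinary Euclidean LwP on transformed inputs.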
LwP can be extended to learn nonlinear decision boundaries if we use nonlinear distances/similarities (more on this when we talk about kernels).

Note: Modeling each class by not just a mean but by a probability distribution can also help in learning nonlinear decision boundaries. More on this when we discuss probabilistic models for classification.
LwP as a subroutine in other ML models
For data-clustering (unsupervised learning), K-means clustering is a popular algo
K-means also computes means/centres/prototypes of groups of unlabeled points
Harder than LwP since the labels are unknown. But we can do the following (will see K-means in detail later):
Guess the label of each point and compute the means using the guessed labels
Refine the labels using these means (assign each point to the currently closest mean)
Repeat until the means don't change anymore
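A rough NumPy sketch of this loop. All function and variable names are my own; here the initial "guess" is made by picking K random training points as the starting means rather than guessing labels directly.

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Very basic K-means sketch: alternate between assigning each point to its
    closest mean and recomputing each mean, until assignments stop changing."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=K, replace=False)].astype(float)  # initial prototypes
    labels = None
    for _ in range(max_iters):
        # Step 1: assign each point to the current closest mean (squared Euclidean distance)
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)  # shape (N, K)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments (and hence means) no longer change
        labels = new_labels
        # Step 2: recompute each mean from its currently assigned points (the LwP-style step)
        for k in range(K):
            if np.any(labels == k):
                means[k] = X[labels == k].mean(axis=0)
    return means, labels
```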
Many other models also use LwP as a subroutine
Supervised Learning
using
Nearest Neighbors
Nearest Neighbors
Another supervised learning technique based on computing distances
Very simple idea. Simply do the following at test time:
Compute the distance of the test point from all the training points
Sort the distances to find the "nearest" input(s) in the training data
Predict the label using the majority or average label of these inputs

"Wait. Did you say distance from ALL the training points? That's gonna be sooooo expensive!"
"Yes, but let's not worry about that at the moment. There are ways to speed up this step."

Can use Euclidean or other distances (e.g., Mahalanobis). The choice of distance is important, just like in LwP.
Unlike LwP, which does prototype-based comparison, the nearest neighbors method looks at the labels of individual training inputs to make a prediction.
Applicable to both classification as well as regression (LwP only works for classification).
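As a concrete (brute-force) sketch of this procedure for the single-nearest-neighbor case with Euclidean distance (function name mine); note the O(N·D) cost per test point that the aside above worries about:

```python
import numpy as np

def one_nn_predict(X_train, y_train, X_test):
    """For each test point, return the label of its single closest training input."""
    # Pairwise squared Euclidean distances, shape (num_test, num_train)
    dists = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = dists.argmin(axis=1)   # index of the closest training input for each test point
    return y_train[nearest]          # its label is the prediction
```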
Nearest Neighbors for Classification
Nearest Neighbor (or “One” Nearest Neighbor)
"Interesting. Even with Euclidean distances, it can learn nonlinear decision boundaries?"
"Indeed. And that's possible since it is a 'local' method (it looks at a local neighborhood of the test point to make the prediction)."

(Figure: the 1-NN decision boundary for a set of training points, with a test point shown.)

The nearest neighbour approach induces a Voronoi tessellation/partition of the input space (all test points falling in a cell will get the label of the training input in that cell).
K Nearest Neighbors (KNN)
In many cases, it helps to look at not one but more than one nearest neighbor.

(Figure: a test input and its K nearest neighbors.)

"How to pick the 'right' K value?"
"K is this model's 'hyperparameter'. One way to choose it is using 'cross-validation' (will see shortly). Also, K should ideally be an odd number to avoid ties."

Essentially, taking more votes helps!
It also leads to smoother decision boundaries (fewer chances of overfitting on the training data).
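A minimal unweighted KNN classifier along these lines (brute force, Euclidean distance; names are my own):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_test, K=3):
    """Predict each test label by a majority vote among the K closest training inputs."""
    dists = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    knn_idx = np.argsort(dists, axis=1)[:, :K]        # indices of the K nearest neighbors
    preds = []
    for idx in knn_idx:
        votes = Counter(y_train[idx])                 # count each label among the K neighbors
        preds.append(votes.most_common(1)[0][0])      # majority label (ties broken arbitrarily)
    return np.array(preds)
```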
ε-Ball Nearest Neighbors (ε-NN)
Rather than looking at a fixed number of neighbors, we can look inside a ball of a given radius ε around the test input.

(Figure: a test input with a ball of radius ε around it; the training points inside the ball vote.)

"So changing ε may change the prediction. How to pick the 'right' ε value?"
"Just like K, ε is also a 'hyperparameter'. One way to choose it is using 'cross-validation' (will see shortly)."
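A corresponding ε-ball sketch; the fallback to the single nearest neighbor when the ball is empty is my own choice, not something the slides specify:

```python
import numpy as np
from collections import Counter

def eps_nn_predict(X_train, y_train, x_test, eps=1.0):
    """Majority vote among all training inputs within Euclidean distance eps of x_test."""
    dists = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    inside = np.where(dists <= eps)[0]
    if len(inside) == 0:
        inside = np.array([dists.argmin()])   # fallback to the nearest point (an assumption)
    votes = Counter(y_train[inside])
    return votes.most_common(1)[0][0]
```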
Distance-weighted KNN and ε-NN
The standard KNN and ε-NN treat all nearest neighbors equally (all vote equally).

(Figure: K = 3; the test input's three nearest neighbors are one red training input very close by and two green training inputs farther away.)

An improvement: when voting, give more importance to closer training inputs.
Unweighted KNN prediction: each of the 3 neighbors votes with weight 1/3.
Weighted KNN prediction: the red input votes with weight 3/5 and each green input with weight 1/5. In the weighted approach, the single red training input is given 3 times more importance than the other two green inputs, since it is roughly "three times" closer to the test input than they are.
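A distance-weighted KNN sketch; the specific inverse-distance weighting (and the small constant guarding against division by zero) is one common choice, assumed here rather than prescribed by the slides:

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x_test, K=3):
    """Each of the K nearest neighbors votes with weight proportional to 1/distance."""
    dists = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    knn_idx = np.argsort(dists)[:K]
    weights = 1.0 / (dists[knn_idx] + 1e-12)          # closer neighbors get larger weights
    # Sum the weights per class label and predict the label with the largest total weight
    totals = {}
    for idx, w in zip(knn_idx, weights):
        totals[y_train[idx]] = totals.get(y_train[idx], 0.0) + w
    return max(totals, key=totals.get)
```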
ε-NN can also be made weighted likewise.
KNN/ε-NN for Other Supervised Learning Problems
We can apply KNN/ε-NN to other supervised learning problems as well, such as:
Multi-class classification
Regression
Tagging/multi-label learning
(We can also try the weighted versions for such problems, just like we did in the case of binary classification.)
For multi-class, simply use the same majority rule as in the binary classification case; the only difference is that now we have more than 2 classes.
For regression, simply compute the average of the outputs of nearest neighbors
For multi-label learning, each output is a binary vector (presence/absence of each tag).
Just compute the average of the binary vectors
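Both the regression and multi-label cases thus boil down to averaging the neighbors' outputs; a minimal sketch for the regression case (names mine):

```python
import numpy as np

def knn_regress(X_train, y_train, x_test, K=3):
    """KNN regression: predict the mean of the K nearest neighbors' real-valued outputs."""
    dists = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    return y_train[np.argsort(dists)[:K]].mean()
```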
KNN Prediction Rule: The Mathematical Form
Let's denote the set of K nearest neighbors of an input $\mathbf{x}$ by $N_K(\mathbf{x})$.

The unweighted KNN prediction for a test input $\mathbf{x}$ can be written as
$$\mathbf{y} = \frac{1}{K} \sum_{i \in N_K(\mathbf{x})} \mathbf{y}_i$$

Note: Assuming discrete labels with 5 possible values, the one-hot representation will be an all-zeros vector of size 5, except for a single 1 denoting the value of the discrete label, e.g., if label = 3 then the one-hot vector = [0,0,1,0,0].

This form directly makes sense for regression and for cases where each output is a vector (e.g., multi-class classification, where each output is a discrete value which can be represented as a one-hot vector, or tagging/multi-label classification, where each output is a binary vector).

For binary classification, assuming the labels are +1/-1, we predict $\hat{y} = \text{sign}\left(\frac{1}{K}\sum_{i \in N_K(\mathbf{x})} y_i\right)$.
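A sketch of this vector-output form, averaging the neighbors' output vectors (one-hot, binary, or real-valued); the argmax wrapper for recovering a class label is my own addition, consistent with the rule above:

```python
import numpy as np

def knn_average_predict(X_train, Y_train, x_test, K=3):
    """Return the average of the K nearest neighbors' output vectors.

    Y_train: (N, L) array of output vectors (one-hot for multi-class,
    binary for multi-label, real-valued for vector regression)."""
    dists = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    knn_idx = np.argsort(dists)[:K]
    return Y_train[knn_idx].mean(axis=0)   # y = (1/K) * sum of the neighbors' outputs

# For multi-class with one-hot Y_train, the predicted class is the argmax:
# predicted_class = knn_average_predict(X_train, Y_train, x_test).argmax()
```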
Nearest Neighbors: Some Comments
An old, classic but still very widely used algorithm
Can sometimes give deep neural networks a run for their money
Can work very well in practice with the right distance function
Comes with very nice theoretical guarantees
Also called a memory-based or instance-based or non-parametric method
No “model” is learned here (unlike LwP). Prediction step uses all the training data
Requires lots of storage (need to keep all the training data at test time)
Prediction step can be slow at test time
For each test point, need to compute its distance from all the training points
Clever data-structures or data-summarization techniques can provide speed-ups
Next Lecture
Hyperparameter/model selection via cross-validation
Learning with Decision Trees