© Ben Galili IDC
¡ After grades are published, you have one week to send an email with your appeal (for HW1, the one-week window starts now)
¡ You will get a response email with the appeal's decision
¡ The grade will be updated only in the Excel file (uploaded to the first section in Moodle)
¡ A family of learning algorithms that:
§ Don't build a model of the data (like the tree in Decision Tree)
§ Instead, compare a new instance with the instances seen in training
¡ Time complexity:
§ Fast learning (no learning, really…)
§ Potentially slow classification/prediction (O(n))
¡ Space complexity:
§ Store all training instances (O(n))
¡ Used in both Classification and Regression
¡ How to find the nearest? √
§ We know the possible distance methods & we use X-fold cross validation to choose the best one
¡ Slow query & large space √
§ We are now able to reduce space (drop irrelevant points) & accelerate query time (K-D tree, reducing calculation time)
¡ How to choose k? √
§ We use X-fold cross validation to choose the best one
¡ This assignment has 3 phases:
§ First
▪ Implement a feature scaler
§ Second
▪ Implement the kNN algorithm
▪ Use cross validation in order to find the best hyperparameters (k, p for the distance method, weighted / uniform majority)
§ Third
▪ Examine the influence of the number of folds on the running time of each fold and on the total running time
▪ Implement an efficient-distance kNN and see how it affects the running time
¡ Feature Scaling
§ 1 class: FeatureScaler
§ Should receive an Instances object and return a scaled Instances object
§ We'll use standardization for scaling in this assignment: $x' = \dfrac{x - \mu}{\sigma}$
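A minimal sketch of such a scaler in plain Java (illustrative only: the assignment's FeatureScaler receives and returns a Weka Instances object, while this sketch works on plain double arrays and uses made-up names):

```java
// Minimal standardization sketch: x' = (x - mean) / stdDev per attribute.
public class SimpleScaler {

    // Assumes data[row][col] holds numeric attribute values only.
    public static double[][] standardize(double[][] data) {
        int n = data.length, d = data[0].length;
        double[][] scaled = new double[n][d];
        for (int j = 0; j < d; j++) {
            double mean = 0;
            for (int i = 0; i < n; i++) mean += data[i][j];
            mean /= n;
            double var = 0;
            for (int i = 0; i < n; i++) var += (data[i][j] - mean) * (data[i][j] - mean);
            double std = Math.sqrt(var / n);
            for (int i = 0; i < n; i++) {
                // Guard against zero variance (constant attribute).
                scaled[i][j] = std == 0 ? 0 : (data[i][j] - mean) / std;
            }
        }
        return scaled;
    }
}
```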
¡ Implement kNN
§ 2 classes: MainHW3 & kNN
§ The kNN class is the algorithm object
▪ You need to think about which properties the class needs (hint: think about which parameters the kNN algorithm needs)
§ MainHW3 should find the best combination of k, p (the distance method) and the voting method – it should go over all combinations and select the one with the smallest error using cross validation (see the sketch below)
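A minimal sketch of the kNN idea in plain Java (illustrative: plain arrays instead of Weka's Instances, and the 1/d² weighting for the weighted vote is an assumption, not necessarily the assignment's required definition):

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustrative kNN classifier: l_p distance, k nearest, uniform or weighted vote.
public class SimpleKnn {
    private final double[][] trainX;
    private final int[] trainY;
    private final int k;
    private final double p;         // exponent of the l_p (Minkowski) distance
    private final boolean weighted; // weighted vs. uniform majority

    public SimpleKnn(double[][] trainX, int[] trainY, int k, double p, boolean weighted) {
        this.trainX = trainX; this.trainY = trainY;
        this.k = k; this.p = p; this.weighted = weighted;
    }

    private double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += Math.pow(Math.abs(a[i] - b[i]), p);
        return Math.pow(sum, 1.0 / p);
    }

    public int predict(double[] x, int numClasses) {
        int n = trainX.length;
        double[] dist = new double[n];
        Integer[] idx = new Integer[n];
        for (int i = 0; i < n; i++) { dist[i] = distance(trainX[i], x); idx[i] = i; }
        // Sort training indices by distance to x and take the k nearest.
        Arrays.sort(idx, Comparator.comparingDouble((Integer i) -> dist[i]));
        double[] votes = new double[numClasses];
        for (int j = 0; j < Math.min(k, n); j++) {
            int i = idx[j];
            double w = weighted ? 1.0 / (dist[i] * dist[i] + 1e-12) : 1.0; // weighted vote: 1/d^2
            votes[trainY[i]] += w;
        }
        int best = 0;
        for (int c = 1; c < numClasses; c++) if (votes[c] > votes[best]) best = c;
        return best;
    }
}
```

MainHW3 would then loop over every combination of k, p and voting method, estimate each combination's error with X-fold cross validation, and keep the combination with the smallest average error.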
¡ Efficient kNN: Efficient Distance Check
§ After you have found the best kNN parameters, implement the efficient kNN
§ You need to implement an efficient distance check:
$$d_p(x_1, x_2) = \left( \sum_{i=1}^{n} \left| x_{1,i} - x_{2,i} \right|^p \right)^{1/p}$$
§ Remember – the goal is to stop iterating once we are
above a desired threshold
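A minimal sketch of such an early-abandoning check in plain Java (illustrative names; the threshold would typically be the distance to the current k-th nearest neighbour):

```java
// Early-abandoning l_p distance: stop summing once the partial sum already
// exceeds threshold^p, since the final distance can only be larger.
public final class EfficientDistance {

    // Returns the l_p distance, or Double.POSITIVE_INFINITY as soon as the
    // partial sum proves the distance exceeds `threshold`.
    public static double lpDistance(double[] a, double[] b, double p, double threshold) {
        double limit = Math.pow(threshold, p);
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.pow(Math.abs(a[i] - b[i]), p);
            if (sum > limit) return Double.POSITIVE_INFINITY; // cannot be a k-nearest neighbour
        }
        return Math.pow(sum, 1.0 / p);
    }
}
```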
¡ Our previous models didn't use probability calculations (except maybe in the goodness-of-split measure)
¡ The most intuitive algorithm is to return the majority class, or in other words – return the most probable class according to the training set
¡ Today's agenda – probabilistic algorithms – algorithms that use probability techniques in order to predict the class of a new instance
¡ Sample space
¡ A sample space is the set of all possible outcomes:
§ For a coin toss this is the sample space: S = {H, T}
§ For rolling a die this is the sample space:
S = {1, 2, 3, 4, 5, 6}
§ For rolling two dice this is the sample space:
S = {(1,1), (1,2), (1,3), (1,4), …, (6,5), (6,6)}
¡ Events
¡ Any subset of the sample space is called an event
§ For rolling two dice, e.g. the event "the sum is 7":
E = {(1,6), (2,5), (3,4), (4,3), (5,2), (6,1)}
¡ Events
¡ Some basic operations on events:
§ Union
§ Intersection
§ Complement
¡ Random variable
¡ A random variable is a function of the outcome
¡ For example, the sum of two dice (not the two numbers that come up):
§ Let X be a random variable denoting the sum of
two dice rolls:
▪ $P(X = 1) = 0$
▪ $P(X = 2) = P(\{(1,1)\}) = \frac{1}{36}$
▪ $P(X = 4) = P(\{(1,3), (3,1), (2,2)\}) = \frac{3}{36}$
¡ Random variable
¡ We can now define the expected value of a random variable:
§ For a discrete variable:
$$E[X] = \sum_{x} x \, p(x)$$
* where p is the probability mass function (pmf)
§ For a continuous variable:
$$E[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx$$
* where f is the probability density function (pdf)
¡ Random variable
¡ The variance:
$$\sigma^2 = Var(X) = E\big[ (X - E[X])^2 \big]$$
¡ The standard deviation (= square root of the variance):
$$\sigma = \sqrt{Var(X)} = \sqrt{E\big[ (X - E[X])^2 \big]}$$
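As a quick sanity check of these definitions, for a single fair die roll:

$$E[X] = \sum_{x} x\,p(x) = \tfrac{1}{6}(1+2+3+4+5+6) = 3.5$$
$$Var(X) = E[X^2] - (E[X])^2 = \tfrac{91}{6} - 3.5^2 = \tfrac{35}{12} \approx 2.92$$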
¡ $P(A \cup B) = \,?$
$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$
¡ $P(A \cup B \cup C) = \,?$
$$\begin{aligned}
P(A \cup B \cup C) &= P(A \cup B) + P(C) - P((A \cup B) \cap C) \\
&= P(A) + P(B) - P(A \cap B) + P(C) - P((A \cap C) \cup (B \cap C)) \\
&= P(A) + P(B) - P(A \cap B) + P(C) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C) \\
&= P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C)
\end{aligned}$$
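A small sanity check with one die: let A be "the result is even" and B be "the result is greater than 3":

$$P(A \cup B) = P(A) + P(B) - P(A \cap B) = \tfrac{3}{6} + \tfrac{3}{6} - \tfrac{2}{6} = \tfrac{4}{6}$$

which matches counting the outcomes $\{2, 4, 5, 6\}$ directly.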
¡ As we said before, the simplest way is to ask which class has the higher probability in the training set
¡ What is the probability that you'll pass the exam?
§ We have training data with the previous year's results
§ There are 2 classes: 'Pass' or 'Fail'
§ 'Pass' probability is 90%, and 'Fail' is 10%
§ So everyone here has a 90% chance to pass the exam
§ This uses the prior probability, which in our context is the class distribution in the training set
¡ Conditional probability:
§ $P(A \mid B) = \dfrac{P(A \cap B)}{P(B)}$
§ $P(A \mid B_1) = \,?$
▪ $\dfrac{P(A \cap B_1)}{P(B_1)} = 1$
§ $P(A \mid B_2) = \,?$
▪ $\dfrac{P(A \cap B_2)}{P(B_2)} = 0.75$
§ $P(A \mid B_3) = \,?$
▪ $\dfrac{P(A \cap B_3)}{P(B_3)} = 0$
¡ Example:
§ $P(Pass) = 90\%$
§ $P(Fail) = 10\%$
¡ We also know that:
§ $P(\text{Learned for the test} \mid Pass) = 90\%$
§ $P(\text{Didn't learn} \mid Pass) = 10\%$
§ $P(\text{Learned for the test} \mid Fail) = 5\%$
§ $P(\text{Didn't learn} \mid Fail) = 95\%$
¡ What is the probability that you pass the test if you learned?
§ $P(Pass \cap \text{Learned for the test}) = P(Pass) \times P(\text{Learned for the test} \mid Pass) = 90\% \times 90\% = 81\%$
§ $P(Pass \cap \text{Didn't learn}) = P(Pass) \times P(\text{Didn't learn} \mid Pass) = 90\% \times 10\% = 9\%$
§ $P(Fail \cap \text{Learned for the test}) = P(Fail) \times P(\text{Learned for the test} \mid Fail) = 10\% \times 5\% = 0.5\%$
§ $P(Fail \cap \text{Didn't learn}) = P(Fail) \times P(\text{Didn't learn} \mid Fail) = 10\% \times 95\% = 9.5\%$
§ $P(\text{Learned for the test}) = P(Pass \cap \text{Learned for the test}) + P(Fail \cap \text{Learned for the test}) = 81\% + 0.5\% = 81.5\%$
§ $P(\text{Didn't learn}) = P(Pass \cap \text{Didn't learn}) + P(Fail \cap \text{Didn't learn}) = 9\% + 9.5\% = 18.5\%$
§ $P(Pass \mid \text{Learned for the test}) = \dfrac{P(Pass \cap \text{Learned for the test})}{P(\text{Learned for the test})} = \dfrac{81\%}{81.5\%} \approx 99\%$
§ $P(Fail \mid \text{Learned for the test}) = \dfrac{P(Fail \cap \text{Learned for the test})}{P(\text{Learned for the test})} = \dfrac{0.5\%}{81.5\%} \approx 1\%$
§ $P(Pass \mid \text{Didn't learn}) = \dfrac{P(Pass \cap \text{Didn't learn})}{P(\text{Didn't learn})} = \dfrac{9\%}{18.5\%} \approx 49\%$
§ $P(Fail \mid \text{Didn't learn}) = \dfrac{P(Fail \cap \text{Didn't learn})}{P(\text{Didn't learn})} = \dfrac{9.5\%}{18.5\%} \approx 51\%$
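The calculation above is easy to verify in code. Below is a minimal sketch (plain Java, class and variable names are illustrative) that reproduces the posteriors from the priors and likelihoods given on the previous slides:

```java
// Reproduces the pass/fail posterior calculation from the priors and
// class-conditional (likelihood) probabilities.
public class ExamBayes {
    public static void main(String[] args) {
        double pPass = 0.90, pFail = 0.10;
        double pLearnGivenPass = 0.90, pLearnGivenFail = 0.05;

        // Joint probabilities: P(class AND learned) = P(class) * P(learned | class)
        double jointPassLearn = pPass * pLearnGivenPass;   // 0.81
        double jointFailLearn = pFail * pLearnGivenFail;   // 0.005

        // Evidence: P(learned) = sum of the joints
        double pLearn = jointPassLearn + jointFailLearn;   // 0.815

        // Posteriors via Bayes rule: P(class | learned) = P(class AND learned) / P(learned)
        System.out.printf("P(Pass | learned) = %.3f%n", jointPassLearn / pLearn); // ~0.994
        System.out.printf("P(Fail | learned) = %.3f%n", jointFailLearn / pLearn); // ~0.006
    }
}
```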
¡ Independent events
§ If $P(A \cap B) = P(A)\,P(B)$ then A & B are independent
§ From conditional probability we get:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} \;\;\Rightarrow\;\; P(A \cap B) = P(A \mid B)\,P(B)$$
§ If A & B are independent:
$$P(A)\,P(B) = P(A \cap B) = P(A \mid B)\,P(B) \;\;\Rightarrow\;\; P(A) = P(A \mid B)$$
* And also $P(B) = P(B \mid A)$
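For example, with two fair dice, let A be "the first die shows 6" and B be "the second die shows 6":

$$P(A \cap B) = \tfrac{1}{36} = \tfrac{1}{6} \cdot \tfrac{1}{6} = P(A)\,P(B)$$

so the two events are independent, and indeed $P(A \mid B) = \tfrac{1/36}{1/6} = \tfrac{1}{6} = P(A)$.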
¡ The likelihood is the class-conditional probability – the probability of an instance, given the class
§ For an instance x, and 2 possible classes A, B:
[Figure: the two class-conditional densities P(x|A) and P(x|B) plotted over x]
If x = 12, we'll predict B,
because P(x|B) > P(x|A)
¡ If we return to the previous example (fail \ pass), the likelihood is like asking:
§ What is the probability that someone learned for the test, given that he passed the exam
¡ But we wanted to know the probability of passing \ failing the exam given that you learned
¡ So we need a way to go from the likelihood to the posterior probability
¡ Bayes rule:
$$P(A \mid x) = \frac{P(x \mid A)\,P(A)}{P(x)}$$
¡ With this rule we can convert the likelihood to the posterior probability, if we also have the prior probability
¡ A classifier that classifies A if P(A|x) > P(B|x) is a classifier that maximizes the posterior probability – MAP
¡ The classification with MAP depends on both the likelihood and the prior probabilities
¡ So if we want to classify according to MAP:
§ We will classify A if
$$P(A \mid x) = \frac{P(x \mid A)\,P(A)}{P(x)} > \frac{P(x \mid B)\,P(B)}{P(x)} = P(B \mid x)$$
$$P(x \mid A)\,P(A) > P(x \mid B)\,P(B)$$
§ Note that P(x) is removed from both sides' denominators simply because it is the same
¡ This classification rule minimizes the error:
§ If we classify B, then $P(error \mid x) = P(A \mid x)$
§ If we classify A, then $P(error \mid x) = P(B \mid x)$
¡ But we classify B only if $P(B \mid x) > P(A \mid x)$, and therefore the probability of error is minimal:
$$P(error \mid x) = \min\big[\, P(A \mid x),\; P(B \mid x) \,\big]$$
¡ We can define a loss measure for a wrong decision:
§ 0-1 loss (the simplest one):
$$\lambda_{ij} = \lambda(\text{choose } C_i \mid C_j) = \begin{cases} 1, & \text{if } i \neq j \\ 0, & \text{if } i = j \end{cases}$$
¡ After we have defined the loss we can define the risk, which is the expected loss (for k classes):
$$R(\text{choose } C_i \mid x) = \sum_{j=1}^{k} \lambda_{ij}\, P(C_j \mid x) = \sum_{j \neq i} P(C_j \mid x) = 1 - P(C_i \mid x)$$
¡ A classifier that wants to minimize the risk will choose $C_i$ such that:
$$P(C_i \mid x) > P(C_j \mid x) \quad \forall j \neq i$$
¡ We can use Bayes rule even for a multi-class problem:
$$g_i(x) = P(C_i \mid x) = \frac{P(x \mid C_i)\,P(C_i)}{\sum_{j=1}^{k} P(x \mid C_j)\,P(C_j)}$$
¡ The denominator is the same for all $g_i(x)$, so it can be dropped:
$$g_i(x) = P(x \mid C_i)\,P(C_i)$$
¡ In order to make the classification process more efficient we can use ln():
$$g_i(x) = \ln\big( P(x \mid C_i)\,P(C_i) \big) = \ln P(x \mid C_i) + \ln P(C_i)$$
¡ It helps avoid multiplying small numbers (0-1) and deals better with the normal distribution $e^{f(x)}$
¡ We can do this because ln() is monotonically increasing
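A minimal sketch of this log-space decision rule, assuming the log-likelihoods and log-priors are already available (class and method names are illustrative):

```java
// Log-space MAP decision: comparing ln P(x|C_i) + ln P(C_i) avoids
// underflow from multiplying many small probabilities.
public class LogMap {
    // logLikelihoods[i] = ln P(x | C_i), logPriors[i] = ln P(C_i)
    public static int argmaxPosterior(double[] logLikelihoods, double[] logPriors) {
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < logPriors.length; i++) {
            double score = logLikelihoods[i] + logPriors[i]; // g_i(x) in log space
            if (score > bestScore) { bestScore = score; best = i; }
        }
        return best;
    }
}
```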
¡ Going back to the regression task, we can define the hypothesis to be any function $h(x): X \to Y$ that belongs to the hypothesis space $h \in H$
¡ We want to find the most probable hypothesis
¡ This is a conditional probability problem – find the hypothesis that maximizes the posterior probability
$$P(h \mid D) \propto P(D \mid h)\,P(h)$$
¡ We will assume all $h \in H$ have the same prior probability, and we get that the most probable $h$ is found according to maximum likelihood:
$$h_{ML} = \operatorname*{argmax}_{h \in H} P(D \mid h)$$
¡ Assuming the instances are independent:
$$P(D \mid h) = \prod_{i} P(d_i \mid h)$$
¡ If the error has a normal distribution $\varepsilon_i \sim N(0, \sigma)$, then we can say that the probability that $h(x_i) = d_i$ is the same as the probability that $\varepsilon_i = 0$ according to the normal distribution of $\varepsilon_i$
¡ And we get:
$$h_{ML} = \operatorname*{argmax}_{h \in H} P(D \mid h) = \operatorname*{argmax}_{h \in H} \prod_{i} p(d_i \mid h)$$
$$= \operatorname*{argmax}_{h \in H} \prod_{i} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(d_i - \mu_i)^2}{2\sigma^2}}$$
$$= \operatorname*{argmax}_{h \in H} \prod_{i} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(d_i - h(x_i))^2}{2\sigma^2}}$$
$$h_{ML} = \operatorname*{argmax}_{h \in H} \prod_{i} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{h(x_i) - d_i}{\sigma}\right)^2}$$
$$h_{ML} = \operatorname*{argmax}_{h \in H} \ln \prod_{i} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{h(x_i) - d_i}{\sigma}\right)^2}$$
$$h_{ML} = \operatorname*{argmax}_{h \in H} \sum_{i} \left[ \ln \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2}\left(\frac{h(x_i) - d_i}{\sigma}\right)^2 \right]$$
$$= \operatorname*{argmax}_{h \in H} \sum_{i} -\frac{1}{2}\left(\frac{h(x_i) - d_i}{\sigma}\right)^2$$
$$= \operatorname*{argmax}_{h \in H} \sum_{i} -\big( h(x_i) - d_i \big)^2$$
$$= \operatorname*{argmin}_{h \in H} \sum_{i} \big( h(x_i) - d_i \big)^2$$
¡ Prior classifier: $P(A) > P(B)$
¡ ML classifier: $P(x \mid A) > P(x \mid B)$ – assuming $P(A) = P(B)$
¡ MAP classifier:
$$P(A \mid x) = P(x \mid A)\,P(A) > P(x \mid B)\,P(B) = P(B \mid x)$$
* Dropping $P(x)$ from the denominator
¡ Parametric models
§ If we know \ can guess the distribution type, we can estimate the parameters of the distribution
¡ Non-parametric models
§ Histogram (= count…)
§ Naïve Bayes
¡ For each class we will estimate the distribution parameters according to the training dataset
¡ If we're talking about normal distribution parameters, we need to estimate the mean and the variance:
$$\mu = \frac{1}{m} \sum_{i=1}^{m} x_i \qquad\qquad \sigma^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu)^2$$
¡ Now we can estimate the parameters of each likelihood probability, for each class:
$$\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x \qquad\qquad \sigma_i^2 = \frac{1}{|C_i|} \sum_{x \in C_i} (x - \mu_i)^2$$
¡ And then classify according to the largest probability given by the normal distribution formula:
$$P(x \mid C_i) = \frac{1}{\sqrt{2\pi\sigma_i^2}}\; e^{-\frac{(x - \mu_i)^2}{2\sigma_i^2}}$$
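A minimal sketch of this single-attribute Gaussian approach in plain Java (illustrative names; the per-class attribute values and the priors are assumed to be extracted from the training set beforehand):

```java
// One-attribute Gaussian classifier: estimate (mu, sigma^2) per class from
// the training values, then classify by the largest P(x | C_i) * P(C_i)
// using the normal density.
public class GaussianClassSketch {

    // Returns {mean, variance} of the values of one class.
    public static double[] estimate(double[] classValues) {
        double mu = 0;
        for (double v : classValues) mu += v;
        mu /= classValues.length;
        double var = 0;
        for (double v : classValues) var += (v - mu) * (v - mu);
        var /= classValues.length;
        return new double[]{mu, var};
    }

    public static double normalDensity(double x, double mu, double var) {
        return Math.exp(-(x - mu) * (x - mu) / (2 * var)) / Math.sqrt(2 * Math.PI * var);
    }

    // valuesPerClass[i] = training values of the attribute for class i,
    // priors[i] = P(C_i). Returns the MAP class for x.
    public static int classify(double x, double[][] valuesPerClass, double[] priors) {
        int best = 0;
        double bestScore = -1;
        for (int i = 0; i < valuesPerClass.length; i++) {
            double[] params = estimate(valuesPerClass[i]);
            double score = normalDensity(x, params[0], params[1]) * priors[i];
            if (score > bestScore) { bestScore = score; best = i; }
        }
        return best;
    }
}
```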
¡ But this was good only for 1 attribute
¡ What if we have more than 1?
¡ In this case each likelihood probability will be estimated according to a multivariate normal distribution
¡ For this we will need a mean vector (each entry is the mean of one attribute) and the covariance matrix
$$\Sigma = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1d} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{d1} & \sigma_{d2} & \cdots & \sigma_{dd} \end{bmatrix} = \begin{bmatrix} \sigma_{1}^{2} & \sigma_{12} & \cdots & \sigma_{1d} \\ \sigma_{21} & \sigma_{2}^{2} & \cdots & \sigma_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{d1} & \sigma_{d2} & \cdots & \sigma_{d}^{2} \end{bmatrix}$$
* the diagonal entries are the attribute variances
$|\Sigma|$ – the determinant of the covariance matrix
$\Sigma^{-1}$ – the inverse of the covariance matrix
¡ For each attribute we will find the mean and
the variance as before and we will create the
mean vector and the covariance matrix
¡ We will classify according to the multivariate
normal distribution:
$$P(\bar{x} \mid C_i) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}\; e^{-\frac{1}{2} (\bar{x} - \bar{\mu})^{T} \Sigma^{-1} (\bar{x} - \bar{\mu})}$$
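A minimal sketch of estimating the mean vector and covariance matrix for one class in plain Java (illustrative; evaluating the density itself also needs $|\Sigma|$ and $\Sigma^{-1}$, which would come from a linear-algebra library and are omitted here):

```java
// Estimate the mean vector and covariance matrix from one class's instances.
public class CovarianceSketch {

    public static double[] meanVector(double[][] x) {
        int n = x.length, d = x[0].length;
        double[] mu = new double[d];
        for (double[] row : x)
            for (int j = 0; j < d; j++) mu[j] += row[j] / n;
        return mu;
    }

    public static double[][] covarianceMatrix(double[][] x) {
        int n = x.length, d = x[0].length;
        double[] mu = meanVector(x);
        double[][] sigma = new double[d][d];
        for (double[] row : x)
            for (int j = 0; j < d; j++)
                for (int k = 0; k < d; k++)
                    sigma[j][k] += (row[j] - mu[j]) * (row[k] - mu[k]) / n;
        return sigma;
    }
}
```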
¡ What if we don't know the type of distribution?
¡ We need another way to estimate the probabilities $P(x \mid C_i)$ and $P(C_i)$
¡ The prior probability $P(C_i)$ can be estimated from the class frequencies in the training set
¡ But what about the likelihood?
¡ In order to estimate the likelihood for a given instance we need a huge dataset
¡ If we have d attributes, the number of possible terms in the likelihood $P(x_1, x_2, \ldots, x_d \mid C_j)$ is
$$k \cdot |V_1| \cdot |V_2| \cdots |V_d|$$
* where $|V_i|$ is the number of possible values of attribute $i$ and $k$ is the number of classes
¡ We need a way \ an assumption to overcome this problem
¡ If we assume that all attributes are independent given the class, we will get:
$$P(x_1, x_2, \ldots, x_d \mid C_j) = \prod_{i=1}^{d} P(x_i \mid C_j)$$
¡ And now we can find the MAP class:
$$C_{MAP} = \operatorname*{argmax}_{j} P(C_j) \prod_{i=1}^{d} P(x_i \mid C_j)$$
¡ With this assumption we lower the necessary size of the dataset to
$$k \cdot \sum_{i=1}^{d} |V_i|$$
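A minimal counting-based sketch of Naïve Bayes for categorical attributes in plain Java (illustrative names; the Laplace smoothing of the counts is an added assumption, not something stated on the slides):

```java
// Naive Bayes for categorical attributes: estimate P(C_j) and P(x_i | C_j)
// by counting, then pick argmax_j P(C_j) * prod_i P(x_i | C_j),
// computed in log space to avoid underflow.
public class NaiveBayesSketch {
    // x[r][i] = value index of attribute i in training row r, y[r] = class index.
    public static int classify(int[][] x, int[] y, int numClasses, int[] numValues, int[] query) {
        int n = x.length, d = query.length;
        int bestClass = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < numClasses; c++) {
            int classCount = 0;
            for (int r = 0; r < n; r++) if (y[r] == c) classCount++;
            // Log prior, with Laplace smoothing (an assumption added for robustness).
            double score = Math.log((classCount + 1.0) / (n + numClasses));
            for (int i = 0; i < d; i++) {
                int match = 0;
                for (int r = 0; r < n; r++) if (y[r] == c && x[r][i] == query[i]) match++;
                // Log likelihood of attribute i's value given class c, smoothed.
                score += Math.log((match + 1.0) / (classCount + numValues[i]));
            }
            if (score > bestScore) { bestScore = score; bestClass = c; }
        }
        return bestClass;
    }
}
```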