Compre
Naganand Y (04-04-00-10-12-16-1-13965)
Contents

1 Syllabus in brief
  1.1 Linear Algebra
  1.2 Probability and Statistics
  1.3 Computational Methods of Optimization
  1.4 Machine Learning
2 Linear Algebra
  2.1 Vector space
    2.1.1 Field
    2.1.2 Vector space: definition
    2.1.3 Basis of vector space
    2.1.4 Dimension theorem for vector spaces [7]
    2.1.5 The four fundamental subspaces
    2.1.6 Orthogonality
    2.1.7 Gram-Schmidt orthogonalisation process
    4.1.4 KKT conditions
    4.1.5 Support vectors
    4.1.6 Dual problem
    4.1.7 Soft margin SVM
    4.1.8 Kernel method
  4.2 Parameter estimation
    4.2.1 Expectation Maximisation
Chapter 1
Syllabus in brief
1.3 Computational Methods of Optimization
• Unconstrained problems: First-order and second-order necessary condi-
tions, Convex and concave functions, descent algorithms – global conver-
gence, line search, steepest descent method, Newton’s method, coordi-
nate descent methods, conjugate gradient method, Quasi-Newton meth-
ods, DFP update, BFGS update.
Chapter 2
Linear Algebra
2.1.1 Field

A field F is a set F together with two binary operations + : F × F → F
(addition) and · : F × F → F (multiplication), denoted F = (F, +, ·), that
satisfies the following field axioms ∀a, b, c ∈ F:

1. associativity of addition: (a + b) + c = a + (b + c)
2. additive identity: ∃ 0 ∈ F such that a + 0 = a
3. additive inverse: ∃ −a ∈ F such that a + (−a) = 0
4. commutativity of addition: a + b = b + a
5. distributivity of multiplication over addition: a · (b + c) = (a · b) + (a · c)
6. associativity of multiplication: (a · b) · c = a · (b · c)
7. multiplicative identity: ∃ 1 ∈ F such that a · 1 = a
8. multiplicative inverse: ∃ a⁻¹ ∈ F such that a · a⁻¹ = 1, provided a ≠ 0
9. commutativity of multiplication: a · b = b · a

e.g. rational numbers, real numbers, complex numbers, and constructible numbers
(through compass and straightedge/scale).
Points

• A semigroup S = (S, +) is required to satisfy item 1
  e.g. the set of strictly positive integers
• A monoid M = (M, +) is required to satisfy items 1 and 2
  e.g. the set of positive integers (with 0 included)
• A group G = (G, +) is required to satisfy items 1, 2, and 3
  e.g. permutations of a set of three elements
e.g. below are some examples of vector spaces over a field F = (F, +, ·) (for
example, the field of real numbers R)

• field: the field F itself under field addition and multiplication,
  notationally (F, F, +, ·)
• finite coordinate space: for a positive integer n, the space Fⁿ of all
  n-tuples of elements of F, called the coordinate space
• infinite coordinate space: the set

      F^∞ = {(x₁, x₂, x₃, · · · ) : |{i ∈ Z⁺ : xᵢ ≠ 0}| < ∞}

  of sequences with only finitely many non-zero entries
Why vector space? It is a useful and commonly known abstraction. All examples
of vector spaces share a common structure. So, once we prove something
is true for a finite dimensional vector space then we do not have to prove it for
polynomials, matrices, symmetric matrices, antisymmetric matrices, n-th order
homogeneous ODE solution sets, sets of finitely many operators, multivariate
polynomials, matrices of all of the above examples, tensor products of all of the
above examples, linear transformations to and from all of the above examples
etc.
c₁·v₁ + · · · + cₙ·vₙ = 0 =⇒ c₁ = · · · = cₙ = 0, with v₁ ≠ 0, · · · , vₙ ≠ 0.
• v ∈ W, c ∈ F =⇒ cv ∈ W
Linear span  Given a vector space (V, F, +, ·), the span of a set
S = {v₁, · · · , vₙ} of vectors is defined to be the intersection of all subspaces
of (V, F, +, ·) that contain S, i.e.

    span(S) = {∑_{i=1}^n cᵢvᵢ : cᵢ ∈ F, ∀i ∈ [n]}.
Basis

A set of vectors S = {v₁, · · · , vₙ} is called a basis of a vector space
(V, F, +, ·) if S is a linearly independent set of vectors that spans the set V
entirely. More formally,

    ∑_{i=1}^n cᵢvᵢ = 0 =⇒ cᵢ = 0, ∀i ∈ [n]   (linear independence)

and

    {∑_{i=1}^n cᵢvᵢ : cᵢ ∈ F, ∀i ∈ [n]} = V   (linear span).
e.g.

• The vectors e₁ = (1, 0)ᵀ and e₂ = (0, 1)ᵀ form a basis of the vector space R²,
  and in general, e₁, · · · , eₙ form a basis of Rⁿ
• The polynomials 1, x, x², · · · form a basis of the vector space of real
  polynomials
Case 1: |I| = ∞

• ∀bⱼ ∈ B, j ∈ J, ∃cᵢ,ⱼ ∈ F, i ∈ Sⱼ, such that bⱼ = ∑_{i∈Sⱼ} cᵢ,ⱼaᵢ, where Sⱼ is
  a finite subset of I.
• Let S = ⋃_{j∈J} Sⱼ. Since |I| > |J| and |S| < ∞, ∃ι ∈ I such that ι ∉ S.
• Since B is a basis, ∃dⱼ ∈ F, j ∈ J, with a_ι = ∑_{j∈J} dⱼbⱼ.
• Now, a_ι = ∑_{j∈J} dⱼ ∑_{i∈Sⱼ} cᵢ,ⱼaᵢ = ∑_{j∈J} ∑_{i∈Sⱼ} cᵢ,ⱼdⱼaᵢ, which
  expresses a_ι as a linear combination of {aᵢ : i ∈ S} with ι ∉ S, contradicting
  the linear independence of A.
Case 2: |I| < ∞

• Let I = [m], J = [n], and B₀ = A = {a₁, · · · , aₘ}
• Since A is a basis, span(B₀) = V and hence span({b₁} ∪ B₀) = V
• Now, ∃cᵢ ∈ F, i ∈ [m], such that b₁ = ∑_{i=1}^m cᵢaᵢ with c_{p₁} ≠ 0 for some
  p₁ ∈ [m]
• Rearranging the terms, a_{p₁} = (1/c_{p₁})b₁ + ∑_{i=1, i≠p₁}^m (−cᵢ/c_{p₁})aᵢ
The above proof uses only the facts that span(A) = V and B is a linearly
independent set. Using only the facts that span(B) = V and A is a linearly
independent set, we can mimic the above proof to show that m ≤ n and thus
m = n. The dimension of a vector space is defined to be this unique number of
vectors in a basis of the vector space.
Column space
The column space of the matrix A over a field F = (F, +, ·) is the linear span of
its column vectors.
e.g.

    col(Aeg) = {c₁(1, 0, 0)ᵀ + c₂(0, 1, 1)ᵀ : c₁, c₂ ∈ R}.

We note that c₁ = 1, c₂ = 1 can give the third column and c₁ = 6, c₂ = 0 the
fourth column. Hence the first two columns form a basis for col(Aeg).
Row space

The row space of the matrix A over F = (F, +, ·) is the linear span of its row
vectors. It is easy to see that the row space of the matrix A is the column
space of its transpose Aᵀ.

    row(A) = {Aᵀx : x = (x₁, · · · , xₘ) ∈ Fᵐ}.
e.g.

    row(Aeg) = {c₁(1, 0, 1, 6)ᵀ + c₂(0, 1, 1, 0)ᵀ : c₁, c₂ ∈ R}.

We note that c₁ = 0, c₂ = 1 can give the third row. Hence the first two rows
form a basis for row(Aeg).
Null space
The null space of the matrix A over F = (F, +, ·) is the set of all vectors x such
that Ax = 0.
null(A) = {x ∈ F n : Ax = 0}.
e.g.

    null(Aeg) = {c₁(−1, −1, 1, 0)ᵀ + c₂(−6, 0, 0, 1)ᵀ : c₁, c₂ ∈ R}.

We note that each of these two vectors satisfies Aeg x = 0 and that they are
linearly independent. Hence they form a basis for null(Aeg).
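The claims in these examples can be checked mechanically. Below is a minimal
pure-Python sketch; the matrix Aeg is an assumption reconstructed from the
column- and row-space displays above (the matrix itself is not printed
explicitly in this excerpt).

```python
# Assumed example matrix, reconstructed from the subspace examples above.
Aeg = [
    [1, 0, 1, 6],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]

def matvec(A, x):
    """Multiply matrix A (list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def is_in_null_space(A, x):
    """True if Ax = 0."""
    return all(v == 0 for v in matvec(A, x))

# Claimed null-space basis vectors: each must satisfy Aeg x = 0.
n1 = [-1, -1, 1, 0]
n2 = [-6, 0, 0, 1]

# Column-space check: the 3rd column is col1 + col2, the 4th is 6 * col1.
col = lambda j: [row[j] for row in Aeg]
third = [a + b for a, b in zip(col(0), col(1))]
fourth = [6 * a for a in col(0)]
```

Running these checks confirms the choices of c₁, c₂ stated in the text.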
2.1.6 Orthogonality

To discuss orthogonality, we briefly review inner product spaces, which provide
a means of defining orthogonality.
Why inner products? They allow the rigorous introduction of intuitive ge-
ometrical notions such as length of a vector or the angle between two vectors.
Definition  An inner product space is a quintuple (V, F, +, ·, ⟨·, ·⟩) which
consists of the vector space V = (V, F, +, ·) and an inner product,

    ⟨·, ·⟩ : V × V → F

satisfying, for all v ∈ V,

• ⟨v, v⟩ ≥ 0   (positive-definiteness)
• ⟨v, v⟩ = 0 ⇐⇒ v = 0   (positive-definiteness)
Orthogonal vectors

Two vectors, u and v, in an inner product space, are orthogonal if their inner
product, ⟨u, v⟩, is zero.

e.g. The vectors u = (1, 0, 2, −1)ᵀ and v = (1, 2, 3, 7)ᵀ are orthogonal because
⟨u, v⟩ = 1·1 + 0·2 + 2·3 + (−1)·7 = 0
Orthonormal vectors

Two vectors, u and v, in an inner product space V = (V, F, +, ·, ⟨·, ·⟩), are
orthonormal if ⟨u, v⟩ = 0 when u ≠ v and ⟨u, u⟩ = ⟨v, v⟩ = 1.

e.g. The vectors u = (1/√6)(1, 0, 2, −1)ᵀ and v = (1/√63)(1, 2, 3, 7)ᵀ are
orthonormal because

• ⟨u, v⟩ = (1/(√6·√63))(1·1 + 0·2 + 2·3 + (−1)·7) = 0
• ⟨u, u⟩ = (1/(√6·√6))(1·1 + 0·0 + 2·2 + (−1)·(−1)) = 6/6 = 1
• ⟨v, v⟩ = (1/(√63·√63))(1² + 2² + 3² + 7²) = 63/63 = 1
Orthogonal matrix

A square matrix Q of size n × n is an orthogonal matrix if QᵀQ = I. It is easy
to see that Q⁻¹ = Qᵀ. The columns of Q, q₁, · · · , qₙ, are pairwise orthonormal.
So ∀i, j ∈ [n],

    qᵢᵀqⱼ = 0 if i ≠ j, and 1 if i = j.
Equivalently, stacking the rows q₁ᵀ, · · · , qₙᵀ of Qᵀ against the columns of Q,

    QᵀQ = [q₁ᵀ; · · · ; qₙᵀ] [q₁ · · · qₙ] = Iₙ
e.g. (matrices written row by row, rows separated by semicolons)

• permutation matrix, for example Q = [0 0 1; 1 0 0; 0 1 0]
• reflection matrix [cos θ  sin θ; sin θ  −cos θ], for example
  Q = (1/√2)[1 1; 1 −1]
• Hadamard matrix, for example
  Q = (1/2)[1 1 1 1; 1 −1 1 −1; 1 1 −1 −1; 1 −1 −1 1]
• Q = (1/3)[1 −2 2; 2 −1 −2; 2 2 1]
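The defining property QᵀQ = I can be verified directly for these examples; a
minimal pure-Python sketch (using exact fractions to avoid floating-point
round-off in the 1/3-scaled example):

```python
from fractions import Fraction

def transpose(A):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    """Matrix product of A and B (lists of rows)."""
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

# Permutation matrix from the first example.
P = [[0, 0, 1], [1, 0, 0], [0, 1, 0]]

# The 3x3 example Q = (1/3)[1 -2 2; 2 -1 -2; 2 2 1], with exact fractions.
Q = [[Fraction(x, 3) for x in row] for row in [[1, -2, 2], [2, -1, -2], [2, 2, 1]]]
```

Both checks below come out to the identity, confirming orthogonality.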
Figure 2.1: The two vectors u and v are linearly independent. p is the
projection of v on u. Clearly, p = mu, where m is a scalar. e = v − p is the
error. The key fact here is that u and e are orthogonal, i.e. uᵀ(v − mu) = 0,
giving us m = uᵀv/(uᵀu). Thus, the projection is p = (uᵀv/(uᵀu))u. We can write
p = u(uᵀv/(uᵀu)) = Pv, where P = (1/(uᵀu))uuᵀ is the projection matrix.

e.g. given u = (1, 0, 2)ᵀ and v = (−2, −1, 3)ᵀ, the projection of v onto u is
p = (4/5)(1, 0, 2)ᵀ.
The projection matrix is P = (1/5)[1 0 2; 0 0 0; 2 0 4].

    col(P) = {cu : c ∈ R}.
Projection

Consider two vectors u and v in an inner product space V = (V, F, +, ·, ⟨·, ·⟩).
We write the inner product ⟨u, v⟩ as uᵀv. The projection of v onto u is Pv,
where P = (1/(uᵀu))uuᵀ is the projection matrix, as shown in figure 2.1.
Why project? Because Ax = b may have no solution (esp. with more equa-
tions than unknowns common in real-world applications). What do we do in
such a case? We solve the “closest” problem that can be solved. The problem
here is that Ax ∈ col(A) and b ∉ col(A). So choose the “closest” vector in the
column space: solve Ax̂ = p instead, where p is the projection of b onto the
column space. The key here is that b − Ax̂ is orthogonal to col(A), i.e.
Aᵀ(b − Ax̂) = 0. Rewriting,

    AᵀAx̂ = Aᵀb.
We observe that x̂ = (AᵀA)⁻¹Aᵀb is the solution (when AᵀA is invertible). We
interestingly note that the solution x̂ is the least squares fit to the
equations given by Ax = b.
Property check  (for a projection matrix of the form P = QQᵀ, where Q has
orthonormal columns so that QᵀQ = I; P = (1/(uᵀu))uuᵀ is the special case
Q = u/‖u‖)

• Pᵀ = P:  (QQᵀ)ᵀ = (Qᵀ)ᵀQᵀ = QQᵀ
• P² = P:  (QQᵀ)² = (QQᵀ)(QQᵀ) = Q(QᵀQ)Qᵀ = QQᵀ
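The normal-equations solution above can be sketched in a few lines. Here A is
taken to be the single column u from the figure 2.1 example, so x̂ reduces to
the scalar m = uᵀv/(uᵀu); exact fractions keep the arithmetic clean.

```python
from fractions import Fraction

u = [1, 0, 2]
v = [-2, -1, 3]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

# Normal equations for a single-column A: xhat = (u^T u)^{-1} u^T v.
m = Fraction(dot(u, v), dot(u, u))

# Projection p = m u, and the error e = v - p is orthogonal to u.
p = [m * a for a in u]
e = [a - b for a, b in zip(v, p)]
```

The orthogonality of u and e is exactly the condition Aᵀ(b − Ax̂) = 0 from the
text, specialised to one column.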
Two-vector case  We now describe the Gram-Schmidt process for two linearly
independent vectors u and v. The key step is to get orthogonal vectors U and
V from u and v. We set U = u and the key idea now is to project v onto u as
shown in figure 2.1.
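The two-vector step can be sketched as a minimal implementation (the input
vectors below reuse the projection example purely for illustration):

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def gram_schmidt_2(u, v):
    """Return orthonormal q1, q2 from two linearly independent vectors."""
    U = list(u)
    m = dot(u, v) / dot(u, u)
    V = [b - m * a for a, b in zip(u, v)]        # subtract projection onto u
    q1 = [a / math.sqrt(dot(U, U)) for a in U]   # normalise U
    q2 = [a / math.sqrt(dot(V, V)) for a in V]   # normalise V
    return q1, q2

q1, q2 = gram_schmidt_2([1.0, 0.0, 2.0], [-2.0, -1.0, 3.0])
```

By construction q1 and q2 are orthonormal, up to floating-point round-off.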
Chapter 3
Sample space
The sample space, Ω, of an experiment is the set of all possible outcomes of the
experiment.
e.g.
• For the experiment of tossing a single six-faced die, the sample space is
typically Ω = {1, 2, 3, 4, 5, 6}
• For tossing two coins, the sample space is typically Ω = {HH, HT, T H, T T }
Event
An event is a set of outcomes of an experiment. So, an event is a subset of the
sample space. A single outcome may be an element of many different events.
e.g.
• For tossing a die, an event could be the occurrence of an even number;
  {2, 4, 6} is the event
• For tossing two coins, an event could be at least one of the coins giving a
  head (H); the event is {HH, HT, TH}
We observe that
• since the sample space Ω always occurs, we would like to have Ω as an event
• if E is an event that occurs (say the occurrence of an even number on
  throwing a die) then it is reasonable to expect E^c to be an event as well
  (the occurrence of an odd number on throwing a die)
Algebra
A collection, F, of subsets of the sample space, Ω, i.e. a collection of events is
said to be an algebra if
• ∅∈F
• E ∈ F =⇒ E c ∈ F
• E1 , E2 ∈ F =⇒ E1 ∪ E2 ∈ F
e.g. toss a coin repeatedly until the occurrence of the first head. Here the
sample space is Ω = {H, TH, TTH, · · · }. We might be interested in the event of
an even number of tosses until the occurrence of the first head, i.e.
E = {TH, TTTH, TTTTTH, · · · }. The event E need not belong to an algebra,
because an algebra is only guaranteed to be closed under finite unions of
subsets, while E entails a countably infinite union. This motivates a σ-algebra.
σ-algebra
A collection, F, of subsets of the sample space, Ω, i.e. a collection of events is
said to be a σ-algebra if
• ∅∈F
• E ∈ F =⇒ E c ∈ F
• E₁, E₂, · · · ∈ F =⇒ ⋃_{i=1}^∞ Eᵢ ∈ F
Uncountable sample space [5]
Suppose we want to pick a real number at random from Ω = [0, 1], such that
each number in [0, 1] is “equally likely” to be picked. A simple strategy of
assigning probabilities to singleton subsets of [0, 1] such as {0}, {0.354}, {0.75},
etc. gets into difficulties quite quickly. Note that,
• assigning a positive probability to each singleton set (outcome) in [0, 1]
  would make P(Ω) unbounded
• assigning zero probability to each singleton set would, alone, not be suffi-
cient to make P(Ω) = 1
The main reason for the above two points is that probability measures are
not additive over uncountable disjoint unions (of singletons in this case). We
need a different approach to assign probabilities when the sample space, such as
Ω = [0, 1], is uncountable. Specifically, we need to assign probabilities to specific
subsets of Ω. This motivates a particular σ-algebra consisting of a collection of
certain subsets of an uncountable sample space.
3.1.3 Measure

We now give the basic necessary definitions pertaining to measures.
Measurable space
A measurable space is a pair (S, F) consisting of a nonempty set S and a
σ-algebra F of subsets of S.
Measure
Let (S, F) be a measurable space. A measure on (S, F) is a function
µ : F → [0, ∞] such that
• µ(∅) = 0
• If {Eᵢ : Eᵢ ∈ F}_{i=1}^∞ is a sequence of (pairwise) disjoint sets, then

      µ(⋃_{i=1}^∞ Eᵢ) = ∑_{i=1}^∞ µ(Eᵢ)
Probability measure
Given a measurable space (Ω, F), a probability measure is a measure, P,
satisfying the following Kolmogorov probability axioms.

• P(E) ≥ 0, ∀E ∈ F
• P(Ω) = 1
• If {Eᵢ : Eᵢ ∈ F}_{i=1}^∞ is a sequence of (pairwise) disjoint sets
  (synonymous with mutually exclusive events), then

      P(⋃_{i=1}^∞ Eᵢ) = ∑_{i=1}^∞ P(Eᵢ)
To summarise the axioms, given a measure space, (Ω, F, P), the measure P is a
probability measure if P(Ω) = 1.
e.g. for one flip of a fair coin, the outcome is either heads or tails. So,
Ω = {H, T}. The σ-algebra F = 2^Ω contains 4 events viz. {H}, {T}, ∅, and Ω.
In other words, F = {∅, {H}, {T}, Ω}. The probability measure is P(∅) = 0,
P({H}) = 0.5, P({T}) = 0.5, and P({H, T}) = 1.
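On this finite space the Kolmogorov axioms can be checked exhaustively; a small
sketch (countable additivity reduces to finite additivity here):

```python
from itertools import chain, combinations

omega = frozenset({"H", "T"})

# F = 2^Omega: all subsets of the sample space.
F = [frozenset(s) for s in chain.from_iterable(
    combinations(omega, r) for r in range(len(omega) + 1))]

P = {frozenset(): 0.0,
     frozenset({"H"}): 0.5,
     frozenset({"T"}): 0.5,
     omega: 1.0}

nonneg = all(P[E] >= 0 for E in F)       # first axiom
total = P[omega] == 1.0                  # second axiom
additive = all(                          # additivity over disjoint events
    P[E1 | E2] == P[E1] + P[E2]
    for E1 in F for E2 in F if not (E1 & E2))
```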
    X⁻¹(B) ∈ F, ∀B ∈ B

Alternative definition: It can be shown that the set {(−∞, x] : x ∈ R} can
generate the entire Borel σ-algebra, B, of real numbers, R. It can also be
shown that it suffices to check measurability on any generating set. A
real-valued function, X : Ω → R, on a probability space, (Ω, F, P), is hence a
real-valued random variable, X : (Ω, F) → (R, B), if

    {ω ∈ Ω : X(ω) ≤ x} ∈ F, ∀x ∈ R

We have used the fact that {ω ∈ Ω : X(ω) ≤ x} = X⁻¹((−∞, x]).
• We talk about probabilities of subsets of Ω, not elements
When the sample space Ω is countable, we can assign probabilities to
individual outcomes in Ω. However, when it is uncountable, this approach
is not viable and we expect that for the “uniform” probability in [0, 1],
the probability of a singleton set {x}, x ∈ R is 0 for every x ∈ R.
It is true that if the interval I is the disjoint union of two other intervals
J and K, then the length of I will be the sum of the lengths of J and K.
But [0, 1] is the disjoint union of the singleton sets of the form {x}, whose
length is 0. Nevertheless, the length of [0, 1] is not 0. For this reason, we
do not talk about the size or the probability of points in Ω. We talk about
the probabilities or size of subsets of Ω, given by the σ-algebra F or B.
• We know the size of certain sets (intervals)
Usually, we know the measure of certain subsets of B. For example, in the
case of the unit interval [0, 1], we usually take the size of an interval [a, b]
to be the value b − a.
• The sets in B are the “measurable” sets
  Based on the sizes of these simple sets, we can manage to extend our measure
  to other sets. It so happens that, given the constraints we want the measure
  to satisfy, it is not always possible to extend the measure to the whole
  family of subsets of R but only to some class B of subsets of R. So, the
  probability measure, P, is a function P : B → [0, 1].
• Given a probability space, (Ω, F, P), a random variable,
X : (Ω, F) → (R, B) transports the probability measure, P, from
one measurable space, (Ω, F), to another, (R, B)
  For example, suppose that Ω = {1, · · · , 6} represents a die, and we are
  gambling. Say, if the value of the die is odd, we lose 10 INR and if it is
  even, we win 10 INR. This function from Ω to {−10, 10} induces the random
  variable X : (Ω, F) → (R, B). Now, instead of talking about a probability
  measure, P, in the space (Ω, F, P), we can talk about the probability of, in
  one bet, winning or losing 10 INR. So there is another probability measure,
  PX, in the measurable space (R, B). We transported the probability in
  (Ω, F, P) to probability in (R, B, PX).
• We want X⁻¹(I) to be measurable
  Since we talk about a function from Ω to R, it might happen that we want the
  probabilities to be defined at least for the intervals. That is, given an
  interval, I ⊆ R, we want X⁻¹(I) to have a probability associated with it.
  Also, X⁻¹(I) will be measurable for every interval I ⊆ R exactly when
  X⁻¹((−∞, x]) is measurable for every x ∈ R.
Types and distribution of random variables: Overview
A real-valued random variable (r.v) can be either discrete or continuous. A
discrete r.v is a real-valued r.v that assumes at most a countable number of
values. A discrete r.v is characterised by the probability mass function (p.m.f).
The cumulative distribution function, c.d.f., is another method to describe the
distribution of an r.v. The advantage of c.d.f. is that it can be defined for any
kind of an r.v (discrete, continuous, and mixed). A continuous r.v is a
real-valued r.v whose c.d.f is continuous. The probability density function
(p.d.f) can also be used to characterise a continuous r.v. We now discuss these
in detail.
Definition: p.m.f

Let X : (Ω, F) → (R, B) be a discrete random variable on the probability space
(Ω, F, P) with P({ω ∈ Ω : X(ω) ∈ R}) = 1 for R = {x₁, x₂, x₃, · · · } ⊂ R. The
function PX : R → [0, 1] with

    PX(x) = P({ω ∈ Ω : X(ω) = x}) = P(X = x), ∀x ∈ R

is called the probability mass function (p.m.f) of X.

p.m.f

    PX(x) = p if x = 1, and q = 1 − p if x = 0

This can also be written as PX(x) = p^x (1 − p)^{1−x} for x ∈ {0, 1} and
PX(x) = 0 otherwise. The plot is as shown in figure 3.1.
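A minimal sketch of the two equivalent forms of the Bernoulli p.m.f:

```python
def bernoulli_pmf(x, p):
    """Piecewise form of the Bernoulli p.m.f."""
    if x == 1:
        return p
    if x == 0:
        return 1 - p
    return 0.0

def bernoulli_pmf_closed(x, p):
    """Closed form p^x (1-p)^(1-x), valid on the support {0, 1}."""
    return p ** x * (1 - p) ** (1 - x) if x in (0, 1) else 0.0
```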
Figure 3.1: The p.m.f of a Bernoulli r.v with p < 0.5.
e.g.
• a classical example of a Bernoulli r.v is a coin toss where 1 and 0 would
  represent “head” and “tail” (or vice versa), respectively; in particular,
  unfair coins would have p ≠ 0.5
e.g.
• it is known that a certain fraction (p) of products on a production line
are defective and say, products are inspected until the first defective is
encountered
• a certain percent of bits (p) transmitted through a digital transmission
are received in error and now say, bits are transmitted until the first error
Figure 3.2: The p.m.f of a geometric r.v with p = 0.3.
The plots for n = 10, p = 0.3 and n = 20, p = 0.6 are as shown in figure
3.3
e.g.
• number of heads in n coin flips
• number of disk drives that crashed in a cluster of, say, 1000 computers
• number of advertisements that are clicked when, say, 40,000 are served
Figure 3.3: The p.m.f of binomial r.vs with n = 10, p = 0.3 and n = 20, p = 0.6.
    PX(i)/PX(i + 1) = (i + 1)(1 − p)/((n − i)p).
The p.m.f is increasing when
PX (i) < PX (i + 1)
⇒ (i + 1)(1 − p) < (n − i)p
⇒ i + 1 − ip − p < np − ip
⇒ i + 1 < (n + 1)p.
The p.m.f is similarly decreasing when i + 1 > (n + 1)p. Now, for a given
number, n, of trials and the given parameter p, the largest integer less than
(n + 1)p corresponds to the peak value of the p.m.f. Hence the maximum value of
the p.m.f of X ∼ Binomial(n, p) is PX(ι) = ⁿCι p^ι (1 − p)^{n−ι} where
ι = ⌊(n + 1)p⌋.
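The peak formula ι = ⌊(n + 1)p⌋ can be checked numerically against a
brute-force argmax of the p.m.f (a sketch; the particular n and p values are
the ones plotted in figure 3.3):

```python
import math

def binom_pmf(n, p, i):
    """Binomial(n, p) p.m.f at i."""
    return math.comb(n, i) * p**i * (1 - p) ** (n - i)

def argmax_pmf(n, p):
    """Index of the largest p.m.f value, by exhaustive search."""
    return max(range(n + 1), key=lambda i: binom_pmf(n, p, i))

n, p = 10, 0.3
iota = math.floor((n + 1) * p)
```

For (n + 1)p not an integer, the brute-force mode matches ⌊(n + 1)p⌋; when
(n + 1)p is an integer, both (n + 1)p − 1 and (n + 1)p attain the peak.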
We make the following assumptions
• cars pass by with a constant rate λ
• number of cars that pass by in a unit time interval (any hour) is indepen-
dent of that in any other unit time interval (any other hour).
The first assumption may not hold if, for example, the hour is a rush hour. The
second assumption may not hold, if, for example, there was a traffic jam before
the current hour of investigation.
However, the assumptions are reasonable, and motivate us to think of X as a
binomial r.v with a Bernoulli trial done, say, every minute. The value X, then,
represents the number of successes in 60 trials (minutes). Each Bernoulli trial
is done every minute and involves asking the following yes-no question:
has at least one car passed by the point in the current minute?
Then, X ∼ Binomial(60, p) and PX(i) = ⁶⁰Cᵢ · p^i · (1 − p)^{60−i} where
p = λ/60.
The obvious downside of this approach is that it counts the number of
Bernoulli successes in 60 trials (minutes) rather than the actual number of cars
passing by in the given hour. What happens if more than one car passes by, in
a minute?
Intuitively, we need to get more granular: instead of dividing an hour into
60 minutes, we could divide it into 3600 seconds. Now, X ∼ Binomial(3600, p)
and PX(i) = ³⁶⁰⁰Cᵢ · p^i · (1 − p)^{3600−i} where p = λ/3600.
Catch Now, what if a couple of cars pass by, in half a second? Intuitively,
we need to divide an hour into infinitely many possible timesteps to avoid more
than one car passing by, in a timestep.
Key  With this division, we can safely assume that at most one car passes by
in a timestep (Bernoulli trial), and the number of Bernoulli successes is the
same as the number of cars passing by.
    lim_{n→∞} ⁿCₓ · pₙ^x · (1 − pₙ)^{n−x} = e^{−λ} λ^x / x!.
Figure 3.4: p.m.f of Poisson r.vs with λ = 1, 5, 10.
Proof  We use the facts that lim_{n→∞} (1 − λ/n)^n = e^{−λ} and
lim_{n→∞} (1 − λ/n)^{−x} = 1:

    lim_{n→∞} ⁿCₓ · pₙ^x · (1 − pₙ)^{n−x}
        = lim_{n→∞} [n(n − 1) · · · (n − x + 1)/x!] · (λ/n)^x · (1 − λ/n)^{n−x}
        = (λ^x/x!) e^{−λ} lim_{n→∞} 1 · (1 − 1/n) · · · (1 − (x − 1)/n)
        = e^{−λ} λ^x/x!.
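The limit can also be checked numerically; the sketch below evaluates
Binomial(n, λ/n) at x = 3 for λ = 2 and increasing n (these particular values
are chosen purely for illustration):

```python
import math

def binom_pmf(n, p, x):
    """Binomial(n, p) p.m.f at x."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)

def poisson_pmf(lam, x):
    """Poisson(lambda) p.m.f at x."""
    return math.exp(-lam) * lam**x / math.factorial(x)

lam, x = 2.0, 3
# Finer and finer subdivisions of the hour, as in the car-counting story.
approx = [binom_pmf(n, lam / n, x) for n in (60, 3600, 216000)]
exact = poisson_pmf(lam, x)
```

As n grows, the binomial value converges to the Poisson value.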
• two events cannot occur at exactly the same instant; instead, at each very
small sub-interval exactly one event either occurs or does not occur
• the probability of an event in a small sub-interval is proportional to the
length of the sub-interval
e.g.
• the number of car accidents in a site or in an area
    PX(i)/PX(i + 1) = (i + 1)/λ.

The p.m.f is increasing when

    PX(i) < PX(i + 1) ⇒ i + 1 < λ
p.m.f

    PX(x) = 1/n, x ∈ {a, a + 1, · · · , a + n − 1 = b}
Figure 3.5: The p.m.f of a discrete uniform r.v with n = 5 and b = a + n − 1.
e.g.

• for a simple example of throwing a fair die, the probability of each of the
  6 outcomes is 1/6

If two dice are thrown and their values added, the resulting distribution is no
longer uniform since not all sums have equal probability.
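The non-uniformity of the two-dice sum can be seen by enumerating all 36
equally likely outcomes (a sketch):

```python
from collections import Counter
from fractions import Fraction

# Count how many of the 36 ordered outcomes produce each sum.
counts = Counter(d1 + d2 for d1 in range(1, 7) for d2 in range(1, 7))
pmf = {s: Fraction(c, 36) for s, c in counts.items()}
```

The sum 7 is six times as likely as the sum 2, so the distribution is clearly
not uniform, while the probabilities still add up to 1.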
3.2.3 Cumulative distribution function

We recollect the definition of an r.v. A real-valued function X : Ω → R is a
real-valued r.v, X : (Ω, F) → (R, B), if

    X⁻¹((−∞, x]) = {ω ∈ Ω : X(ω) ≤ x} ∈ F, ∀x ∈ R.

We might be interested to know, and it is quite natural to ask, what the
probability of the event {ω ∈ Ω : X(ω) ≤ x}, x ∈ R, is. The cumulative
distribution function (c.d.f) gives the probability of this event of interest.
Definition: c.d.f

Let X : (Ω, F) → (R, B) be any real-valued random variable on the probability
space (Ω, F, P). The function, FX : R → [0, 1], with

    FX(x) = P({ω ∈ Ω : X(ω) ≤ x}) = P(X ≤ x), ∀x ∈ R

is called the cumulative distribution function (c.d.f) of X.
Properties of c.d.f

Every c.d.f FX

• is non-decreasing: x ≤ y =⇒ FX(x) ≤ FX(y)
• satisfies lim_{x→−∞} FX(x) = 0
• satisfies lim_{x→+∞} FX(x) = 1
• is right-continuous: lim_{δ→0⁺} FX(x + δ) = FX(x), ∀x ∈ R
Figure 3.6: The general form of the c.d.f of any discrete r.v with x₁ > −∞.
Every function with these four properties is a c.d.f, i.e., for every such function,
an r.v can be defined such that the function is the c.d.f of that r.v.
The function fX is called the probability density function (p.d.f) of the r.v X.
The p.d.f is a function whose value at any given point in the sample space can
be interpreted as providing a relative likelihood that the value of the r.v
would equal that point. More precisely, the p.d.f is used to specify the
probability of the r.v falling within a particular range of values, as opposed
to taking on any one value. We see that fX can be written as

    fX(x) = lim_{δ→0⁺} P({ω ∈ Ω : X(ω) ∈ (x, x + δ]})/δ
          = lim_{δ→0⁺} P(x < X ≤ x + δ)/δ
          = lim_{δ→0⁺} (FX(x + δ) − FX(x))/δ
          = dFX/dx (x),

whenever the limit exists, or equivalently wherever FX is differentiable at x.
p.d.f

    fX(x) = 1/(b − a) if x ∈ [a, b], and 0 otherwise

c.d.f

    FX(x) = 0 if x < a,  (x − a)/(b − a) if x ∈ [a, b],  and 1 if x > b
The p.d.f and c.d.f of a continuous uniform r.v are pictorially shown in figure
3.7.
e.g. the arrival time of a bus at a bus stop given that a bus comes by once
per hour and the current time is 3 p.m. is a continuous uniform r.v in [3, 4]
measured in hours.
Figure 3.7: The p.d.f and c.d.f of a continuous uniform r.v.
c.d.f

    FX(x) = 1 − e^{−λx} if x ≥ 0, and 0 otherwise
Chapter 4
4.1.1 Introduction
Let the training set be {(xᵢ, yᵢ)}_{i=1}^n with xᵢ ∈ R^m, yᵢ ∈ {+1, −1}.

    yᵢ(wᵀxᵢ + b) ≥ 1, ∀i ∈ [n].
Catch When the training set is separable, any separating hyperplane (w, b),
can be scaled to satisfy yi (wT xi + b) ≥ 1, ∀ i ∈ [n]. Therefore, there are neither
+ patterns nor - patterns between the two parallel hyperplanes wT x + b = 1
and wT x + b = −1 (dotted lines in fig. 4.1).
Figure 4.1: The nearest + and - patterns are much farther from the blue
hyperplane (left) than from the black one (right). The blue one on the left is
as far away from both + and - patterns as possible. The blue separating
hyperplane (left) intuitively looks like a “better” hyperplane than the black
one (right).
4.1.2 Intuition

w.l.o.g. assume a - pattern, x⁻, is on the hyperplane wᵀx + b = −1. Then the
distance of that pattern from the separating hyperplane is
|wᵀx⁻ + b|/‖w‖ = 1/‖w‖. It is now easy to see that the distance between the
parallel hyperplanes wᵀx + b = 1 and wᵀx + b = −1 is 2/‖w‖.
The distance between the two parallel hyperplanes is called the margin of the
separating hyperplane wᵀx + b = 0. Intuitively, the larger the margin, the
better the chance of correct classification of new patterns. The optimal
hyperplane, intuitively, is the separating hyperplane with the maximum margin.
The main intuition behind the SVM approach is that if a classifier is good
at the most challenging comparisons (patterns from different classes that are
close to each other), then the classifier will be even better at the easy compar-
isons (patterns from different classes that are far away from each other). SVMs
focus only on the points that are the most difficult to tell apart, whereas other
classifiers pay attention to all of the points.
    min_{w∈R^m, b∈R}  (1/2)wᵀw
    subject to  yᵢ(wᵀxᵢ + b) ≥ 1, ∀i ∈ [n].
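The scaling argument above can be sketched in code: given any separating
(w, b), rescale so that min_i yᵢ(wᵀxᵢ + b) = 1 and read off the margin 2/‖w‖.
The 1-D data set below is hypothetical, purely for illustration:

```python
import math

def canonical_margin(w, b, data):
    """Scale (w, b) so that min_i y_i(w.x_i + b) = 1; return margin 2/||w||."""
    s = min(y * (sum(wj * xj for wj, xj in zip(w, x)) + b) for x, y in data)
    assert s > 0, "hyperplane must separate the data"
    w_scaled = [wj / s for wj in w]
    return 2.0 / math.sqrt(sum(wj * wj for wj in w_scaled))

# Hypothetical toy set: + patterns at x >= 1, - patterns at x <= -1,
# separated by the hyperplane w = 1, b = 0.
data = [([1.0], 1), ([3.0], 1), ([-1.0], -1), ([-2.0], -1)]
margin = canonical_margin([1.0], 0.0, data)
```

Any positive rescaling of (w, b) gives the same margin, which is exactly why
the canonical form yᵢ(wᵀxᵢ + b) ≥ 1 loses no generality.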
Keys The optimisation problem is a convex optimisation problem with linear
inequality constraints and hence
• the Karush–Kuhn–Tucker (KKT) conditions are necessary and sufficient
• every local minimum is a global minimum
• ∂L/∂b = 0 =⇒ ∑_{i=1}^n µᵢ*yᵢ = 0   (stationarity)
• 1 − yᵢ((w*)ᵀxᵢ + b*) ≤ 0, ∀i ∈ [n]   (primal feasibility)
• µᵢ*(1 − yᵢ((w*)ᵀxᵢ + b*)) = 0, ∀i ∈ [n]   (complementary slackness)
4.1.6 Dual problem

The dual function is

    q(µ) = inf_{w,b} L(w, b; µ) = inf_{w,b} { (1/2)wᵀw + ∑_{i=1}^n µᵢ[1 − yᵢ(wᵀxᵢ + b)] }.

We note the presence of the term −∑_{i=1}^n µᵢyᵢb = −b ∑_{i=1}^n µᵢyᵢ. A simple
observation here is, if ∑_{i=1}^n µᵢyᵢ ≠ 0, then b can be chosen appropriately
so that q(µ) = −∞. Imposing ∑_{i=1}^n µᵢyᵢ = 0, the b term drops out and

    q(µ) = inf_w { (1/2)wᵀw + ∑_{i=1}^n µᵢ[1 − yᵢ(wᵀxᵢ)] }.
The infimum w.r.t. w is obtained similarly to the first stationarity condition
and is attained at w = ∑_{i=1}^n µᵢyᵢxᵢ. Thus,

    q(µ) = (1/2)(∑_{i=1}^n µᵢyᵢxᵢ)ᵀ(∑_{j=1}^n µⱼyⱼxⱼ) + ∑_{i=1}^n µᵢ − ∑_{i=1}^n µᵢyᵢ(∑_{j=1}^n µⱼyⱼxⱼ)ᵀxᵢ
         = (1/2)∑_{i=1}^n ∑_{j=1}^n µᵢyᵢµⱼyⱼxᵢᵀxⱼ + ∑_{i=1}^n µᵢ − ∑_{i=1}^n ∑_{j=1}^n µᵢyᵢµⱼyⱼxᵢᵀxⱼ
         = ∑_{i=1}^n µᵢ − (1/2)∑_{i=1}^n ∑_{j=1}^n µᵢyᵢµⱼyⱼxᵢᵀxⱼ.
    max_{(µ₁,··· ,µₙ)∈R^n}  ∑_{i=1}^n µᵢ − (1/2)∑_{i=1}^n ∑_{j=1}^n µᵢµⱼyᵢyⱼxᵢᵀxⱼ
    subject to  µᵢ ≥ 0, ∀i ∈ [n]
                ∑_{i=1}^n yᵢµᵢ = 0.
The SVM solution  We solve the above dual problem to get µ* = (µ₁*, · · · , µₙ*)
for the given training data set, {(xᵢ, yᵢ)}_{i=1}^n. The maximum-margin
separating hyperplane, (w*, b*), can then be found using

    w* = ∑_{i∈I} µᵢ*yᵢxᵢ   and   b* = yᵢ − (w*)ᵀxᵢ for any i ∈ I
Primal problem

    min_{w∈R^m, b∈R, ξ∈R^n}  (1/2)wᵀw + c∑_{i=1}^n ξᵢ
    subject to  1 − ξᵢ − yᵢ(wᵀxᵢ + b) ≤ 0, ∀i ∈ [n]
                −ξᵢ ≤ 0, ∀i ∈ [n].
Figure 4.2: Datapoint 1 (labeled +) is within the margin and on the same side
(+ side) of the separating hyperplane. Notice that it still contributes a
penalty term cξ₁ to the objective. So does datapoint 3 (labeled -). Datapoint 2
is on the separating hyperplane and contributes a term of c to the objective.
Datapoints 4 and 5 are wrongly classified and contribute the terms cξ₄ and cξ₅
to the objective respectively.
KKT conditions

The Lagrangian, with the Lagrange multipliers µᵢ for the constraints
1 − ξᵢ − yᵢ(wᵀxᵢ + b) ≤ 0, and λᵢ for the constraints −ξᵢ ≤ 0, ∀i ∈ [n], is
given by

    L(w, b, ξ; µ, λ) = (1/2)wᵀw + c∑_{i=1}^n ξᵢ + ∑_{i=1}^n µᵢ[1 − ξᵢ − yᵢ(wᵀxᵢ + b)] − ∑_{i=1}^n λᵢξᵢ.
2. ∂L/∂b = 0 =⇒ ∑_{i=1}^n µᵢ*yᵢ = 0   (stationarity)
3. ∂L/∂ξᵢ = 0 =⇒ µᵢ* + λᵢ* = c   (stationarity)
4. 1 − ξᵢ − yᵢ((w*)ᵀxᵢ + b*) ≤ 0   (primal feasibility)
5. −ξᵢ ≤ 0   (primal feasibility)
6. µᵢ* ≥ 0   (dual feasibility)
7. λᵢ* ≥ 0   (dual feasibility)
8. µᵢ*(1 − ξᵢ − yᵢ((w*)ᵀxᵢ + b*)) = 0   (complementary slackness)
9. λᵢ*ξᵢ = 0   (complementary slackness)

From conditions 3, 6, and 7, we can write 0 ≤ µᵢ* + λᵢ* = c, ∀i ∈ [n]. Define
the set of indices I = {i : 0 < µᵢ* < c}. It follows that λᵢ* > 0, ∀i ∈ I, and
also from condition 9 that ξᵢ = 0, ∀i ∈ I. Using these in condition 8, we get
1 − yᵢ((w*)ᵀxᵢ + b*) = 0, ∀i ∈ I and thus

    b* = yᵢ − (w*)ᵀxᵢ for any i ∈ I.
Dual problem

The dual function is q(µ, λ) = inf_{w,b,ξ} L(w, b, ξ; µ, λ).

Observation  In the above Lagrangian, we have the term ∑_{i=1}^n (c − µᵢ − λᵢ)ξᵢ.
For a given µ and λ, if c − µᵢ − λᵢ ≠ 0 for some i ∈ [n], then ξᵢ can be made
as large negative/positive as possible and hence q(µ, λ) = −∞. So, we need to
impose µᵢ + λᵢ = c, ∀i ∈ [n]. Further, we can write the two constraints
µᵢ + λᵢ = c and λᵢ ≥ 0 concisely as 0 ≤ µᵢ ≤ c, ∀i ∈ [n]. This lets us drop λ
as a variable and the dual function now becomes

    q(µ, λ) = q(µ) = inf_{w,b} { (1/2)wᵀw + ∑_{i=1}^n µᵢ[1 − yᵢ(wᵀxᵢ + b)] }.
Figure 4.3: Influence of c in SVM. The right figure shows the separating
hyperplane and the margins with a much higher value of c than the left. In an
SVM optimisation problem, we need the right balance between choosing a
hyperplane with as large a margin as possible (left) and choosing one that
correctly classifies as many data points as possible (right). The value of c
influences this balance and is a user-specified hyperparameter that is commonly
determined through cross-validation.
The above dual function is the same as the one in the hard-margin SVM case and
hence

    q(µ) = ∑_{i=1}^n µᵢ − (1/2)∑_{i=1}^n ∑_{j=1}^n µᵢyᵢµⱼyⱼxᵢᵀxⱼ.
    max_{(µ₁,··· ,µₙ)∈R^n}  ∑_{i=1}^n µᵢ − (1/2)∑_{i=1}^n ∑_{j=1}^n µᵢµⱼyᵢyⱼxᵢᵀxⱼ
    subject to  0 ≤ µᵢ ≤ c, ∀i ∈ [n]
                ∑_{i=1}^n yᵢµᵢ = 0.
The SVM solution  We solve the above dual problem to get µ* = (µ₁*, · · · , µₙ*)
for the given training data set, {(xᵢ, yᵢ)}_{i=1}^n. The maximum-margin
separating hyperplane, (w*, b*), can then be found using

    w* = ∑_{i∈I} µᵢ*yᵢxᵢ   and   b* = yᵢ − (w*)ᵀxᵢ for any i ∈ I
Observations  We observe that in the dual optimisation problem,

• the training vectors, xᵢ, i ∈ [n], appear only as pairwise inner products
• the objective is over R^n and not R^m, i.e. the dimensionality of the
  optimisation problem is the number of examples, n, and is independent of the
  number of features, m
• the cost function is quadratic and the constraints are linear.
These observations motivate kernel methods.
We can now solve the SVM optimisation problem by solving the dual, replacing
xᵢᵀxⱼ with zᵢᵀzⱼ, ∀i, j ∈ [n]. The key here is that the dimensionality of the
optimisation problem is n and is independent of m (or p). However, for a test
data point x_test, computing the prediction, y_test = (w*)ᵀz_test + b*, with
z_test = φ(x_test), is expensive for large values of p. To resolve the
computational bottleneck, we can use the kernel trick.
Kernel trick

Suppose we have a function k : R^p × R^p → R such that
k(xᵢ, xⱼ) = φ(xᵢ)ᵀφ(xⱼ) = zᵢᵀzⱼ, ∀i, j ∈ [n]. The kernel trick exploits the
fact that certain problems in machine learning have more structure than an
arbitrary weighting function k. The computation is made much simpler if the
kernel can be written in the above form. We replace the dot product xᵢᵀxⱼ in
the dual with k(xᵢ, xⱼ).
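A minimal sketch of this idea for the homogeneous quadratic kernel
k(x, z) = (xᵀz)² on R², whose explicit feature map lists all pairwise products
of coordinates (this particular kernel and feature map are illustrative
assumptions, not taken from the text):

```python
from itertools import product

def k(x, z):
    """Quadratic kernel: k(x, z) = (x^T z)^2, computed without feature maps."""
    return sum(a * b for a, b in zip(x, z)) ** 2

def phi(x):
    """Explicit feature map for k on R^2: all pairwise coordinate products."""
    return [a * b for a, b in product(x, repeat=2)]

def dot(x, z):
    return sum(a * b for a, b in zip(x, z))

x, z = [1, 2], [3, -1]
```

The kernel evaluates the R⁴ inner product ⟨φ(x), φ(z)⟩ with a single R² dot
product and a squaring, never forming φ explicitly.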
Figure 4.4: Maximum likelihood estimation. For each set of ten tosses, note the
head counts and tail counts for coins A and B separately. Then, use the counts
to estimate the biases θA and θB. This is parameter estimation for complete
data because we know the random variables xᵢ, the number of heads observed
during the ith set of tosses, and zᵢ, the identity of the coin used during the
ith set of tosses.
All we need is to store the (non-zero) Lagrange multipliers µᵢ*, and the support
vectors xᵢ, ∀i ∈ I = {i : 0 < µᵢ* < c}. We do not need to enter the R^p space!
The range space of φ can be infinite dimensional.
model depends on unobserved latent variables. We will motivate EM with the
help of a simple coin-flipping experiment [3].
Suppose we are given a pair of coins A and B of unknown biases, θA and
θB , respectively. On any given flip, coin A will land on heads with probability
θA ∈ (0, 1) and tails with probability 1 − θA and similarly for coin B.
Our goal is to estimate θ = (θA , θB ) by, say, repeating the following proce-
dure five times: randomly choose one of the two coins (with equal probability),
and perform ten independent coin tosses with the selected coin. Thus the entire
procedure involves a total of 50 coin tosses.
During the experiment, suppose that we carefully keep track of two vectors
x = (x1 , · · · , x5 ) and z = (z1 , · · · , z5 ), where xi ∈ {0, · · · , 10} is the number of
heads observed during the ith set of tosses, and zi ∈ {A, B} is the identity of
the coin used during the ith set of tosses. Now, a simple way to estimate θA and
θB is to compute the observed proportions of heads for each coin as shown for
an example in figure 4.4.
In the figure, θ̂A and θ̂B represent the estimates obtained from MLE. In
other words, if ln P (x, z ; θ) is the logarithm of the joint probability, the
log-likelihood, of obtaining any particular vector of observed head counts x and
coin types z, then the parameters θ̂ = (θ̂A , θ̂B ) are the ones that maximise the
log-likelihood, ln P (x, z ; θ).
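The complete-data MLE described here can be sketched directly: θ̂ for each coin
is simply its observed fraction of heads. The counts below are hypothetical,
not the ones shown in figure 4.4:

```python
from fractions import Fraction

def mle_biases(x, z, tosses_per_set=10):
    """Complete-data MLE: x[i] = # heads in set i, z[i] = coin used ('A'/'B')."""
    heads = {"A": 0, "B": 0}
    total = {"A": 0, "B": 0}
    for xi, zi in zip(x, z):
        heads[zi] += xi
        total[zi] += tosses_per_set
    # Observed proportion of heads per coin maximises the log-likelihood.
    return {coin: Fraction(heads[coin], total[coin]) for coin in heads}

# Hypothetical head counts for five sets of ten tosses each.
x = [5, 9, 8, 4, 7]
z = ["B", "A", "A", "B", "A"]
theta = mle_biases(x, z)
```

With z observed, the estimation decouples per coin; EM is needed precisely when
z is hidden.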
Bibliography

[5] Ravi Kolla, Aseem Sharma, Vishakh Hegde, and Krishna Jagannathan.
    Lecture 7: Borel sets and Lebesgue measure, 2015. [Online;
    http://nptel.ac.in/courses/108106083/lecture7_Borel%20Sets%20and%20Lebesgue%20Measure.pdf;
    accessed 2-June-2018].