4.2 Neural Style . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.3 Neural Conversation Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.4 NMT By Jointly Learning to Align & Translate . . . . . . . . . . . . . . . . . . . . . . . . 76
4.4.1 Detailed Model Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.5 Effective Approaches to Attention-Based NMT . . . . . . . . . . . . . . . . . . . . . . . . 79
4.6 Using Large Vocabularies for NMT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.7 Candidate Sampling – TensorFlow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.8 Attention Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.9 TextRank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.9.1 Keyword Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.9.2 Sentence Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.10 Simple Baseline for Sentence Embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.11 Survey of Text Clustering Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.11.1 Distance-based Clustering Algorithms . . . . . . . . . . . . . . . . . . . . . . . . 97
4.11.2 Probabilistic Document Clustering and Topic Models . . . . . . . . . . . . . . . . . 98
4.11.3 Online Clustering with Text Streams . . . . . . . . . . . . . . . . . . . . . . . . 100
4.12 Deep Sentence Embedding Using LSTMs . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.13 Clustering Massive Text Streams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.14 Supervised Universal Sentence Representations (InferSent) . . . . . . . . . . . . . . . . . . . 107
4.15 Dist. Rep. of Sentences from Unlabeled Data (FastSent) . . . . . . . . . . . . . . . . . . . . 108
4.16 Latent Dirichlet Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
4.17 Conditional Random Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
4.18 Attention Is All You Need . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
4.19 Hierarchical Attention Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
4.20 Joint Event Extraction via RNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
4.21 Event Extraction via Bidi-LSTM Tensor NNs . . . . . . . . . . . . . . . . . . . . . . . . . 125
4.22 Reasoning with Neural Tensor Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
4.23 Language to Logical Form with Neural Attention . . . . . . . . . . . . . . . . . . . . . . . 128
4.24 Seq2SQL: Generating Structured Queries from NL using RL . . . . . . . . . . . . . . . . . . 130
4.25 SLING: A Framework for Frame Semantic Parsing . . . . . . . . . . . . . . . . . . . . . . . 133
4.26 Poincaré Embeddings for Learning Hierarchical Representations . . . . . . . . . . . . . . . . . 135
4.27 Enriching Word Vectors with Subword Information (FastText) . . . . . . . . . . . . . . . . . 137
4.28 DeepWalk: Online Learning of Social Representations . . . . . . . . . . . . . . . . . . . . . 139
4.29 Review of Relational Machine Learning for Knowledge Graphs . . . . . . . . . . . . . . . . . 141
4.30 Fast Top-K Search in Knowledge Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
4.31 Dynamic Recurrent Acyclic Graphical Neural Networks (DRAGNN) . . . . . . . . . . . . . . . 146
4.31.1 More Detail: Arc-Standard Transition System . . . . . . . . . . . . . . . . . . . . 149
4.32 Neural Architecture Search with Reinforcement Learning . . . . . . . . . . . . . . . . . . . . 150
4.33 Joint Extraction of Events and Entities within a Document Context . . . . . . . . . . . . . . . 152
4.34 Globally Normalized Transition-Based Neural Networks . . . . . . . . . . . . . . . . . . . . 155
4.35 An Introduction to Conditional Random Fields . . . . . . . . . . . . . . . . . . . . . . . . 158
4.35.1 Inference (Sec. 4) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
4.35.2 Parameter Estimation (Sec. 5) . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
4.35.3 Related Work and Future Directions (Sec. 6) . . . . . . . . . . . . . . . . . . . . . 168
4.36 Co-sampling: Training Robust Networks for Extremely Noisy Supervision . . . . . . . . . . . . 169
4.37 Hidden-Unit Conditional Random Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
4.37.1 Detailed Derivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
4.38 Pre-training of Hidden-Unit CRFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
4.39 Structured Attention Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
4.40 Neural Conditional Random Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
4.41 Bidirectional LSTM-CRF Models for Sequence Tagging . . . . . . . . . . . . . . . . . . . . 183
4.42 Relation Extraction: A Survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
4.43 Neural Relation Extraction with Selective Attention over Instances . . . . . . . . . . . . . . . 187
4.44 On Herding and the Perceptron Cycling Theorem . . . . . . . . . . . . . . . . . . . . . . . 189
4.45 Non-Convex Optimization for Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . 191
4.45.1 Non-Convex Projected Gradient Descent (3) . . . . . . . . . . . . . . . . . . . . . 194
4.46 Improving Language Understanding by Generative Pre-Training . . . . . . . . . . . . . . . . . 195
4.47 Deep Contextualized Word Representations . . . . . . . . . . . . . . . . . . . . . . . . . . 196
4.48 Exploring the Limits of Language Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . 198
4.49 Connectionist Temporal Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
4.50 BERT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
4.51 Wasserstein is all you need . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
4.52 Noise Contrastive Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
4.52.1 Self-Normalized NCE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
4.53 Neural Ordinary Differential Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
4.54 On the Dimensionality of Word Embedding . . . . . . . . . . . . . . . . . . . . . . . . . . 212
4.55 Generative Adversarial Nets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
4.56 A Framework for Intelligence and Cortical Function . . . . . . . . . . . . . . . . . . . . . . 216
4.57 Large-Scale Study of Curiosity Driven Learning . . . . . . . . . . . . . . . . . . . . . . . . 217
4.58 Universal Language Model Fine-Tuning for Text Classification . . . . . . . . . . . . . . . . . 218
4.59 The Marginal Value of Adaptive Gradient Methods in Machine Learning . . . . . . . . . . . . . 220
4.60 A Theoretically Grounded Application of Dropout in Recurrent Neural Networks . . . . . . . . . 221
4.61 Improving Neural Language Models with a Continuous Cache . . . . . . . . . . . . . . . . . . 222
4.62 Protection Against Reconstruction and Its Applications in Private Federated Learning . . . . . . . 223
4.63 Context Dependent RNN Language Model . . . . . . . . . . . . . . . . . . . . . . . . . . 225
4.64 Strategies for Training Large Vocabulary Neural Language Models . . . . . . . . . . . . . . . . 226
4.65 Product quantization for nearest neighbor search . . . . . . . . . . . . . . . . . . . . . . . 228
4.66 Large Memory Layers with Product Keys . . . . . . . . . . . . . . . . . . . . . . . . . . 229
4.67 Show, Ask, Attend, and Answer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
4.68 Did the Model Understand the Question? . . . . . . . . . . . . . . . . . . . . . . . . . . 233
4.69 XLNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
4.70 Transformer-XL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
4.71 Efficient Softmax Approximation for GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . 237
4.72 Adaptive Input Representations for Neural Language Modeling . . . . . . . . . . . . . . . . . 238
4.73 Neural Module Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
4.74 Learning to Compose Neural Networks for QA . . . . . . . . . . . . . . . . . . . . . . . . 241
4.75 End-to-End Module Networks for VQA . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
4.76 Fast Multi-language LSTM-based Online Handwriting Recognition . . . . . . . . . . . . . . . 245
4.77 Multi-Language Online Handwriting Recognition . . . . . . . . . . . . . . . . . . . . . . . 246
4.78 Modular Generative Adversarial Networks . . . . . . . . . . . . . . . . . . . . . . . . . . 248
4.79 Transfer Learning from Speaker Verification to TTS . . . . . . . . . . . . . . . . . . . . . . 250
8.3.1 Data Compression and Typicality . . . . . . . . . . . . . . . . . . . . . . . . . . 333
8.3.2 Further Analysis and Q&A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
8.4 Monte Carlo Methods (Ch. 29) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
8.5 Variational Methods (Ch. 33) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
13 Blogs 387
13.1 Conv Nets: A Modular Perspective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 388
13.2 Understanding Convolutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
13.3 Deep Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391
13.4 Deep Learning for Chatbots (WildML) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393
13.5 Attentional Interfaces – Neural Perspective . . . . . . . . . . . . . . . . . . . . . . . . . . 395
14 Appendix 396
14.1 Common Distributions and Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
14.2 Math . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
14.3 Matrix Cookbook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408
14.4 Main Tasks in NLP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409
14.5 Misc. Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
14.5.1 BLEU Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
14.5.2 Connectionist Temporal Classification (CTC) . . . . . . . . . . . . . . . . . . . . . 413
14.5.3 Perplexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
14.5.4 Byte Pair Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
14.5.5 Grammars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
14.5.6 Bloom Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
14.5.7 Distributed Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
14.5.8 Traditional Language Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
Math and Machine Learning Basics
Math and Machine Learning Basics January 23, 2017
• For A−1 to exist, Ax = b must have exactly one solution for every value of b. (Unless stated
  otherwise, assume A ∈ Rm×n.) Determining whether a solution exists ∀b ∈ Rm amounts to requiring
  that the column space (range) of A be all of Rm. It is helpful to see Ax expanded out
  explicitly in this way:

      Ax = Σ_i xi A:,i = x1 A:,1 + · · · + xn A:,n        (2.27)
• A norm is any function f satisfying:

      f (x) = 0 ⇒ x = 0        (1)
      f (x + y) ≤ f (x) + f (y)        (2)
      ∀α ∈ R, f (αx) = |α| f (x)        (3)
• An orthogonal matrix is a square matrix whose rows are mutually orthonormal and whose columns
  are mutually orthonormal:

      AT A = AAT = I        (2.37)
      A−1 = AT        (2.38)

  Note that orthonormal columns imply orthonormal rows (if square). To prove this, consider the
  relationship between AT A and AAT .
• Suppose square matrix A ∈ Rn×n has n linearly independent eigenvectors {v (1) , . . . , v (n) }.
  The eigendecomposition of A is then given by A = QΛQT (assuming A is real symmetric, so that
  the eigenvectors can be taken orthonormal). Applying A to a vector x can then be read in three
  steps:
1) Change of basis: The vector QT x can be thought of as how x would appear in the basis of
   eigenvectors of A.
2) Scale: Next, we scale each component (QT x)i by an amount λi , yielding the new vector Λ(QT x).
   (A common convention is to sort the entries of Λ in descending order.)
3) Change of basis: Finally, we rotate this new vector back from the eigen-basis into its
   original basis, yielding the transformed result QΛQT x.
• Positive definite: all λ are positive; positive semidefinite: all λ are positive or zero.
  → PSD: ∀x, xT Ax ≥ 0
  → PD: xT Ax = 0 ⇒ x = 0.²
• Any real matrix A ∈ Rm×n has a singular value decomposition of the form

      A = UDV T        (10)

  where U ∈ Rm×m (7) and V ∈ Rn×n (9) are orthogonal matrices, and D ∈ Rm×n (8) is diagonal.
² I proved this and it made me happy inside. Check it out. Let A be positive definite. Then

      xT Ax = xT QΛQT x        (4)
            = Σ_i (QT x)i λi (QT x)i        (5)
            = Σ_i λi (QT x)i²        (6)

Since all terms in the summation are non-negative and all λi > 0, we have that xT Ax = 0 if and
only if (QT x)i = q (i) · x = 0 for all i. Since the set of eigenvectors {q (i) } forms an
orthonormal basis, x must be the zero vector.
1.1.1 Example: Principal Component Analysis
Task. Say we want to apply lossy compression (less memory, but may lose precision) to a
collection of m points {x(1) , . . . , x(m) }. We will do this by converting each x(i) ∈ Rn to
some c(i) ∈ Rl (l < n), i.e. finding functions f and g such that f (x) = c and x ≈ g(f (x)).
Decoding function (g). As is, we still have a rather general task to solve. PCA is defined
by choosing g(c) = Dc, with D ∈ Rn×l , where all columns of D are both (1) orthogonal and
(2) unit norm.
Encoding function (f ). Now we need a way of mapping x to c such that g(c) will give us back a
vector optimally close to x. We’ve already defined g, so this amounts to finding the optimal c∗
such that c∗ = arg min_c ||x − g(c)||²₂ .
Optimal D. It is important to notice that we’ve been able to determine the optimal c∗ for some x
because each x is allowed to have a different c∗. However, we use the same matrix D for all our
samples x(i) , and thus must optimize it over all points in our collection. With that out of the
way, we just do what we always do: minimize the L2 distance between points and their
reconstructions. Formally, we minimize the Frobenius norm of the matrix of errors:
      D∗ = arg min_D √( Σ_{i,j} ( x_j^(i) − r(x^(i))_j )² )   s.t.  DT D = I        (17)
Consider the case of l = 1, which means D = d ∈ Rn . In this case, after [insert math here], we
obtain

      d∗ = arg max_d Tr( dT X T X d )   s.t.  dT d = 1        (18)

where, as usual, X ∈ Rm×n . It should be clear that the optimal d is the eigenvector of X T X
with the largest eigenvalue.
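As a quick sanity check of eq. 18, here is a small sketch (my own example, assuming NumPy; all
variable names are mine) comparing the top eigenvector of X T X against a random direction in
terms of reconstruction error:

```python
import numpy as np

# Minimal sketch (my own example, not from the book): the optimal rank-1 PCA
# direction d* is the eigenvector of X^T X with the largest eigenvalue.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ np.diag([3.0, 1.0, 0.2])   # m=500 points in R^3

eigvals, eigvecs = np.linalg.eigh(X.T @ X)   # eigenvalues in ascending order
d_star = eigvecs[:, -1]                      # eigenvector with largest eigenvalue

def recon_error(d):
    # r(x) = d d^T x, i.e. encode with c = d^T x and decode with g(c) = d c
    R = X @ np.outer(d, d)
    return np.sum((X - R) ** 2)

d_rand = rng.normal(size=3)
d_rand /= np.linalg.norm(d_rand)
assert recon_error(d_star) <= recon_error(d_rand)   # d* minimizes eq. 17 for l = 1
```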
Math and Machine Learning Basics January 24
Expectation. For some function f (x), Ex∼P [f (x)] is the mean value that f takes on when x is
drawn from P . The formula for discrete and continuous variables, respectively, is as follows:

      Ex∼P [f (x)] = Σ_x P (x)f (x)        (3.9)
      Ex∼P [f (x)] = ∫ p(x)f (x) dx        (3.10)
Variance. A measure of how much the values of a function of a random variable x vary as we
sample different values of x from its distribution.

      Var [f (x)] = E[ (f (x) − E [f (x)])² ]        (3.11)
Covariance. Gives some sense of how much two values are linearly related to each other, as
well as the scale of these variables.
→ Large |Cov [f, g] | means the function values change a lot and both functions are far from
their means at the same time.
→ Correlation normalizes the contribution of each variable in order to measure only how
much the variables are related.
and if we want the “sample” covariance matrix taken over m data point samples, then

      Σ := (1/m) Σ_{k=1}^{m} (xk − x̄)(xk − x̄)T        (19)
Measure Theory.
• A set of points that is negligibly small is said to have measure zero. In practical terms,
  think of such a set as occupying no volume in the space we are measuring (interested in). For
  example, in R², a line has measure zero.
• A property that holds almost everywhere holds throughout all space except for on a set of
  measure zero.
Functions of RVs.
• Common mistake: Suppose y = g(x), and g is invertible/continuous/differentiable. It is NOT
  true that py (y) = px (g −1 (y)). This fails to account for the distortion of [probability]
  space introduced by g. Rather,

      px (x) = py (g(x)) |∂g(x)/∂x|        (3.47)
where log is always assumed to be the natural logarithm. We can quantify the amount of
uncertainty in an entire probability distribution using the Shannon entropy,

      H(x) = Ex∼P [ − log P (x) ]

which gives the expected amount of information in an event drawn from that distribution. Taking
it a step further, say we have two separate probability distributions P (x) and Q(x). We can
measure how different these distributions are with the Kullback-Leibler (KL) divergence:

      DKL (P ||Q) ≜ Ex∼P [ log (P (x)/Q(x)) ] = Ex∼P [log P (x) − log Q(x)]        (22)

Note that the expectation is taken over P , thus making DKL not symmetric (and thus not a true
distance measure), since DKL (P ||Q) ≠ DKL (Q||P ). Finally, a closely related quantity is the
cross-entropy, H(P, Q), defined as:

      H(P, Q) = H(P ) + DKL (P ||Q) = −Ex∼P [log Q(x)]
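A quick numerical check (my own example, NumPy) of the relationships above between entropy, KL
divergence, and cross-entropy:

```python
import numpy as np

# Verify: H(P,Q) = H(P) + D_KL(P||Q), and D_KL(P||Q) != D_KL(Q||P) in general.
P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.5, 0.3, 0.2])

def kl(p, q):
    return np.sum(p * (np.log(p) - np.log(q)))

H_P = -np.sum(P * np.log(P))            # Shannon entropy of P (nats)
cross_entropy = -np.sum(P * np.log(Q))  # H(P, Q)

assert np.isclose(cross_entropy, H_P + kl(P, Q))
assert not np.isclose(kl(P, Q), kl(Q, P))   # not symmetric
```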
Math and Machine Learning Basics January 24, 2017
Some terminology. Underflow is when numbers near zero are rounded to zero. Similarly,
overflow is when large [magnitude] numbers are approximated as ±∞. Conditioning refers
to how rapidly a function changes w.r.t. small changes in its inputs. Consider the function
f (x) = A−1 x. When A has an eigenvalue decomposition, its condition number is

      max_{i,j} | λi / λj |        (4.2)

which is the ratio of the magnitude of the largest and smallest eigenvalue. When this is large,
matrix inversion is sensitive to error in the input [of f (x)].
Gradient-based optimization. Recall from basic calculus that the directional derivative of f (x)
in direction û (a unit vector) is defined as the slope of the function f in direction û. By
definition of the derivative, this is given by (with v := x + αû)

      lim_{α→0} [ f (x + αû) − f (x) ] / α = ∂f (x + αû)/∂α |α=0        (25)
                                           = Σ_i ∂f (v)/∂vi · ∂vi/∂α |α=0        (26)
                                           = Σ_i (∇v f (v))i ûi |α=0        (27)

where it’s important to recognize the distinction between limα→0 and setting α to zero, which is
denoted by |α=0 . If we want to find the direction û such that this directional derivative is a
minimum, i.e. min_û ûT ∇x f (x), then writing ûT ∇x f (x) = ||û||₂ ||∇x f (x)||₂ cos θ and
dropping the factors that do not depend on û, this reduces to minimizing

      cos(θ)        (32)

which is achieved when û points in the direction opposite the gradient.
Jacobian and Hessian Matrices. For when we want partial derivatives of some function f whose
input and output are both vectors, f : Rm → Rn . The Jacobian matrix J ∈ Rn×m contains all such
partial derivatives, with Ji,j = ∂f (x)i /∂xj . Sometimes we want to know about second
derivatives too, since this tells us whether a gradient step will cause as much of an improvement
as we would expect based on the gradient alone. The Hessian matrix H(f )(x) is defined such that

      H(f )(x)i,j = ∂²f (x) / ∂xi ∂xj        (4.6)

(The Hessian is the Jacobian of the gradient.)
The second derivative in a specific direction d̂ is given by d̂T H d̂.³ It tells us how well we
can expect a gradient descent step to perform. How so? Well, it shows up in the second-order
approximation to the function f (x) about our current spot, which we can denote x(0) . The
standard gradient descent step will move us from x(0) → x(0) − εg, where g is the gradient
evaluated at x(0) . Plugging this in to the 2nd order approximation shows us how H can give
information related to how “good” of a step that really was. Mathematically,

      f (x) ≈ f (x(0) ) + (x − x(0) )T g + ½ (x − x(0) )T H (x − x(0) )        (4.8)
      f (x(0) − εg) ≈ f (x(0) ) − ε gT g + ½ ε² gT Hg        (4.9)

If gT Hg is positive, then we can easily solve for the optimal step size ε∗ that decreases the
Taylor series approximation the most:

      ε∗ = gT g / (gT Hg)        (4.10)
which can be as low as 1/λmax (the worst case), and as high as 1/λmin , with the λ being the
eigenvalues of the Hessian. (The condition number of the Hessian at a given point can give us an
idea of how much the second derivatives along different directions differ from each other.) The
best (and perhaps only) way to take what we learned about the “second derivative test” in
single-variable calculus and apply it to the multidimensional case with H is by using the
eigendecomposition of H. Why? Because we can examine the eigenvalues of the Hessian to determine
whether the critical point x(0) is a local maximum, local minimum, or saddle point⁴. If all
eigenvalues are positive (remember that this is equivalent to saying that the Hessian is positive
definite!), the point is a local minimum.

³ In the same manner that I derived equation 29, we can derive the second derivative in a
specified direction d̂:

      ∂²/∂α² f (x + αd̂) |α=0 = ∂/∂α [ d̂T ∇v f (v) ] |α=0        (33)
                              = Σ_i di ∂/∂α [ ∂f (v)/∂vi ] |α=0        (34)
                              = Σ_i di ∂/∂vi [ ∂f (v)/∂α ] |α=0        (35)
                              = Σ_i Σ_j di dj ∂²f (v)/∂vi ∂vj |α=0        (36)
                              = d̂T H d̂        (37)
⁴ Emphasis on “values” in “eigenvalues” because it’s important not to get tripped up here about
what the eigenvectors of the Hessian mean. The reason for the decomposition is that it gives us
an orthonormal basis (out of which we can get any direction) and therefore the magnitude of the
second derivative along each of these directions as the eigenvalues.
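A small sketch (my own example, NumPy) of the two results above: classifying a critical point by
the Hessian eigenvalues, and checking the optimal step size ε∗ = gT g / (gT Hg) from eq. 4.10 on
a quadratic whose Hessian is exact:

```python
import numpy as np

# f(x) = 0.5 x^T H x + b^T x has Hessian exactly H everywhere.
H = np.array([[3.0, 0.5],
              [0.5, 1.0]])          # symmetric, positive definite -> critical point is a minimum
b = np.array([1.0, -2.0])

def f(x):    return 0.5 * x @ H @ x + b @ x
def grad(x): return H @ x + b

eigvals = np.linalg.eigvalsh(H)
assert np.all(eigvals > 0)          # all eigenvalues positive -> local minimum (second derivative test)

x0 = np.array([2.0, 2.0])
g = grad(x0)
eps_star = (g @ g) / (g @ H @ g)    # eq. 4.10: optimal step size along -g for a quadratic

# eps* should beat nearby step sizes:
assert f(x0 - eps_star * g) <= min(f(x0 - 0.5 * eps_star * g), f(x0 - 1.5 * eps_star * g))
```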
Constrained optimization: minimizing/maximizing a function f (x) constrained to only values of
x in some set S. One way of approaching such a problem is to re-design the unconstrained
optimization problem such that the re-designed problem’s solution satisfies the constraints. For
example, to minimize f (x) for x ∈ R2 with constraint ||x||2 = 1, we can minimize
g(θ) = f ([cos θ, sin θ]T ) wrt θ, then return [cos θ, sin θ]T as the solution to the original
problem.

A more general approach is the Karush-Kuhn-Tucker (KKT) formulation:

1. Describe S in terms of functions g (i) and h(j) such that S = {x | ∀i, g (i) (x) = 0 and
   ∀j, h(j) (x) ≤ 0}. The equations involving g (i) are called the equality constraints and the
   inequalities involving h(j) are called the inequality constraints.

2. Introduce new variables λi (for the equality constraints) and αj (for the inequality
   constraints). These are called the KKT multipliers. The generalized Lagrangian is then
   defined as

      L(x, λ, α) = f (x) + Σ_i λi g (i) (x) + Σ_j αj h(j) (x)        (39)

   which has the same optimal objective function value and set of optimal points x as the
   original constrained problem, minx∈S f (x). Any time the constraints are satisfied, the
   expression maxλ maxα,α≥0 L(x, λ, α) evaluates to f (x), and any time a constraint is
   violated, the same expression evaluates to ∞.
Math and Machine Learning Basics January 25, 2017
Point Estimation: attempt to provide “best” prediction of some quantity, such as some
parameter or even a whole function. Formally, a point estimator or statistic is any function of
the data:
      θ̂m = g( x(1) , . . . , x(m) )        (5.19)
where, since the data is drawn from a random process, θ̂ is a random variable. Function
estimation is identical in form, where we want to estimate some f (x) with fˆ, a point estimator
in function space.
Bias. Defined below, where the expectation is taken over the data-generating distribution⁵.
Bias measures the expected deviation from the true value of the func/param.

      bias( θ̂m ) = E[ θ̂m ] − θ        (5.20)

TODO: Figure out how to derive E[ θ̂m² ] for the Gaussian distribution [helpful link].
⁵ May want to double-check this, but I’m fairly certain this is what the book meant when it said
“data,” based on later examples.
Bias-Variance Tradeoff.
→ Conceptual Info. Two sources of error for an estimator are (1) bias and (2) variance, which
  are both defined as deviations from a certain value. Bias gives deviation from the true value,
  while variance gives the [expected] deviation from the estimator’s expected value.
→ Summary of main formulas.

      bias( θ̂m ) = E[ θ̂m ] − θ        (41)
      Var[ θ̂m ] = E[ ( θ̂m − E[ θ̂m ] )² ]        (42)

Consistency. As the number of training data points increases, we want the estimators to converge
to the true values. Specifically, below are the definitions for weak and strong consistency,
respectively.

      plim_{m→∞} θ̂m = θ
      P( lim_{m→∞} θ̂m = θ ) = 1        (5.55)
⁶ Derivation:
1.4.2 Maximum Likelihood Estimation (5.5)
Consider a set of m examples X = {x(1) , . . . , x(m) } drawn independently from the true (but
unknown) pdata (x). Let pmodel (x; θ) be a parametric family of probability distributions over
the same space, indexed by θ. The maximum likelihood estimator for θ can be expressed as

      θML = arg max_θ Σ_i log pmodel (x(i) ; θ)

where we’ve chosen to express it with log for underflow/gradient reasons. One interpretation of
ML is to view it as minimizing the dissimilarity, as measured by the KL divergence⁷, between
p̂data and pmodel .
Thoughts: Let’s look at DKL in some more detail. First, I’ll rewrite it with the explicit
definition of Ex∼p̂data [log (p̂data (x))]:

      DKL (p̂data ||pmodel ) = Ex∼p̂data [log (p̂data (x)) − log (pmodel (x))]        (48)
                            = (1/N) Σ_{i=1}^{N} ( log (Counts(xi )) − log N ) − Ex∼p̂data [log (pmodel (x))]        (49)

Note also that our goal is to find parameters θ such that DKL is minimized. It is for this
reason, that we wish to optimize over θ, that minimizing DKL amounts to maximizing the quantity
Ex∼p̂data [log (pmodel (x))]. Sure, I can agree this is true, but why is our goal to minimize
DKL , as opposed to minimizing |DKL |? I’m assuming it is because optimizing w.r.t. an absolute
value is challenging numerically. (Also, DKL is always non-negative, so the two objectives
coincide anyway.)
where x(i) are fed as inputs to the model; this is why we can formulate MLE as a conditional
probability.

⁷ The KL divergence is given by

      DKL (p̂data ||pmodel ) = Ex∼p̂data [log p̂data (x) − log pmodel (x)]        (5.60)
1.4.3 Bayesian Statistics (5.6)
The prior. Before observing the data, we represent our knowledge of θ using the prior
probability distribution p(θ). (It is common to choose a high-entropy prior, e.g. uniform.)
Unlike maximum likelihood, which makes predictions using a point estimate of θ (a single value),
the Bayesian approach uses Bayes’ rule to make predictions using the full distribution over θ.
In other words, rather than focusing on the most accurate value estimate of θ, we instead focus
on pinning down a range of possible θ values and how likely we believe each of these values to
be.

So what happens to θ after we observe the data? We update it using Bayes’ rule⁸:

      p(θ | x(1) , . . . , x(m) ) = p(x(1) , . . . , x(m) | θ) p(θ) / p(x(1) , . . . , x(m) )

Note that we still haven’t mentioned how to actually make predictions. Since we no longer have
just one value for θ, but rather a posterior distribution p(θ | x(1) , . . . , x(m) ), we must
integrate over this to get the predicted likelihood of the next sample x(m+1) :

      p(x(m+1) | x(1) , . . . , x(m) ) = ∫ p(x(m+1) | θ) p(θ | x(1) , . . . , x(m) ) dθ        (51)
                                       = Eθ∼p(θ|x(1) ,...,x(m) ) [ p(x(m+1) | θ) ]        (52)
Linear Regression: MLE vs. Bayesian. Both want to model the conditional distribution p(y | x)
(the conditional likelihood). To derive the standard linear regression algorithm, we define

      p(y | x) = N ( y; ŷ(x; w), σ² )        (53)
      ŷ(x; w) = wT x        (54)

where σ² is assumed to be some fixed constant chosen by the user.

⁸ In practice, we typically compute the denominator by simply normalizing the probability
distribution, i.e. it is effectively the partition function.
• Maximum Likelihood Approach: We can use the definition above (and the i.i.d. assumption) to
  evaluate the conditional log-likelihood as

      Σ_{i=1}^{m} log p(y (i) | x(i) ; θ) = −m log σ − (m/2) log(2π) − Σ_{i=1}^{m} ||ŷ (i) − y (i) ||² / (2σ²)        (5.65)

  where only the last term has any dependence on w. Therefore, to obtain wML we take the
  derivative of the last term w.r.t. w, set that to zero, and solve for w. We see that finding
  the w that maximizes the conditional log-likelihood is equivalent to finding the w that
  minimizes the training MSE. (Recall that the training MSE is (1/m) Σ_{i=1}^{m} ||ŷ (i) − y (i) ||².)
• Bayesian Approach: Our conditional likelihood is already given in equation 53. Next, we must
  define a prior distribution over w. As is common, we choose a Gaussian prior to express our
  high degree of uncertainty about θ (implying we’ll choose a relatively large variance):

      p(w) := N (w; µ0 , Λ0 )        (55)

  where we typically assume Λ0 = diag(λ0 ). We can then compute [the unnormalized]
  p(w | X, y) ∝ p(y | X, w) p(w) [and then normalize it].
Maximum A Posteriori (MAP) Estimation. Often we either prefer a point estimate for θ, or we find
that computing the posterior distribution is intractable and a point estimate offers a tractable
approximation. The obvious way of obtaining this while still taking the Bayesian route is to just
argmax the posterior and use that as your point estimate:

      θMAP = arg max_θ p(θ | x) = arg max_θ [ log p(x | θ) + log p(θ) ]

where the second form shows how this is basically maximum likelihood with incorporation of the
prior. We don’t want just any θ that maximizes the likelihood of our data if there is virtually
no chance of that value of θ in the first place.
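To make the MLE-vs-MAP contrast concrete, here is a small sketch (my own example, NumPy; the
prior variance τ² is an assumption) fitting w for linear regression both ways; with a zero-mean
Gaussian prior on w, the MAP estimate reduces to ridge regression:

```python
import numpy as np

# For p(y|x) = N(y; w^T x, sigma^2) and prior p(w) = N(0, tau^2 I),
# the MAP estimate is ridge regression with alpha = sigma^2 / tau^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
sigma, tau = 0.5, 1.0
y = X @ w_true + sigma * rng.normal(size=50)

w_mle = np.linalg.solve(X.T @ X, X.T @ y)                      # minimizes training MSE
alpha = sigma**2 / tau**2
w_map = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)  # prior shrinks w toward 0

print(w_mle, w_map)   # the MAP weights have (slightly) smaller norm than the MLE weights
```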
1.4.4 Supervised Learning Algorithms (5.7)
Logistic Regression. We’ve already seen that linear regression corresponds to the family

      p(y | x) = N ( y; θ T x, I )        (5.80)

which we can generalize to the binary classification scenario by interpreting the output as the
probability of class 1. One way of doing this while ensuring the output is between 0 and 1 is to
use the logistic sigmoid function:

      p(y = 1 | x; θ) = σ(θ T x)        (5.81)

Equation 5.81 is the definition of logistic regression. Unfortunately, there is no closed-form
solution for θ, so we must search via maximizing the log-likelihood.
Support Vector Machines. The SVM decision function can be written in terms of a kernel,

      f (x) = b + Σ_i αi k(x, x(i) )

where the kernel [function] takes the general form k(x, x(i) ) = φ(x) · φ(x(i) ). A major
drawback to kernel machines (methods) in general is that the cost of evaluating the decision
function f (x) is linear in the number of training examples. SVMs, however, are able to mitigate
this by learning an α with mostly zeros. The training examples with nonzero αi are known as
support vectors.
Deep Networks: Modern Practices
Modern Practices January 26
Generalizing ReLUs to aid gradients when z < 0. Three such generalizations are based on using a
nonzero slope αi when zi < 0, i.e. hi = g(z, α)i = max(0, zi ) + αi min(0, zi ):
• Absolute value rectification fixes αi = −1, giving g(z) = |z|.
• Leaky ReLU fixes αi to a small value such as 0.01.
• Parametric ReLU (PReLU) treats αi as a learnable parameter.
Logistic sigmoid and hyperbolic tangent. Sigmoid activations on hidden units is a bad
idea, since they’re only sensitive to their inputs near zero, with small gradients everywhere else.
If sigmoid activations must be used, tanh is probably a better substitute, since it resembles
the identity (i.e. a linear function) near zero.
2.1.1 Back-Propagation (6.5)
The chain rule. Suppose z = f (y) where y = g(x), with x ∈ Rm , y ∈ Rn , g : Rm → Rn , and
f : Rn → R. Then⁹,

      ∂z/∂xi = Σ_{j=1}^{n} (∂z/∂yj )(∂yj /∂xi ) = Σ_{j=1}^{n} (∇y z)j (∂yj /∂xi ) = Σ_{j=1}^{n} (∇y z)j (∇x yj )i        (6.45)

      → ∇x z = ( ∂y/∂x )T ∇y z = J T_{y=g(x)} ∇y z        (6.46)

where ∂y/∂x is the n × m Jacobian matrix of g.

⁹ Note that we can view z = f (y) as a multi-variable function of the dimensions of y,
z = f (y1 , y2 , . . . , yn ).
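A quick numerical check (my own example, NumPy) of eq. 6.46: the gradient w.r.t. x equals the
Jacobian of g, transposed, times the gradient w.r.t. y:

```python
import numpy as np

# Check grad_x z = (dy/dx)^T grad_y z for y = g(x) = A x (Jacobian A) and z = f(y) = sum(y^2).
A = np.array([[1.0, 2.0],
              [0.0, -1.0],
              [3.0, 1.0]])            # g: R^2 -> R^3, Jacobian dy/dx = A
x = np.array([0.5, -2.0])
y = A @ x
grad_y = 2 * y                        # gradient of f(y) = sum(y_j^2)
grad_x = A.T @ grad_y                 # eq. 6.46

# Compare against central finite differences of z(x) = f(g(x)).
eps = 1e-6
fd = np.array([(np.sum((A @ (x + eps * e))**2) - np.sum((A @ (x - eps * e))**2)) / (2 * eps)
               for e in np.eye(2)])
assert np.allclose(grad_x, fd, atol=1e-4)
```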
Modern Practices January 12, 2017
Recall the definition of regularization: “any modification we make to a learning algorithm that
is intended to reduce its generalization error but not its training error.”
is intented to reduce its generalization error but not its training error.”
Limiting Model Capacity. Recall that Capacity [of a model] is the ability to fit a wide
variety of functions. Low cap models may struggle to fit training set, while high cap models
may overfit by simply memorizing the training set. We can limit model capacity by adding a
parameter norm penalty Ω(θ) to the objective function J:

      J̃(θ; X, y) = J(θ; X, y) + αΩ(θ)   where α ∈ [0, ∞)        (7.1)

where we typically choose Ω to only penalize the weights and leave biases unregularized.
L2-Regularization. Defined as setting Ω(θ) = ½ ||w||²₂ . Assume that J(w) is quadratic, with
minimum at w∗ . Since it is quadratic, we can approximate J with a second-order expansion about
w∗ :

      Ĵ(w) = J(w∗ ) + ½ (w − w∗ )T H (w − w∗ )        (7.6)
      ∇w Ĵ(w) = H (w − w∗ )        (7.7)

where Hij = ∂²J/∂wi ∂wj evaluated at w∗ . If we add in the [derivative of] the weight decay and
set to zero, we obtain the solution

      w̃ = (H + αI)−1 H w∗        (7.10)
        = Q (Λ + αI)−1 Λ QT w∗        (7.13)

which shows that the effect of regularization is to rescale w∗ along the i-th eigenvector of H
by λi /(λi + α). This means that eigenvectors with λi ≫ α are relatively unchanged, but the
eigenvectors with λi ≪ α are shrunk to nearly zero. In other words, only directions along which
the parameters contribute significantly to reducing the objective function are preserved
relatively intact.
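A small numerical check (my own example, NumPy) of eq. 7.13: solving the regularized quadratic
directly gives the same w̃ as rescaling w∗ in the eigenbasis of H:

```python
import numpy as np

# Quadratic J(w) = 0.5 (w - w*)^T H (w - w*); add weight decay 0.5*alpha*||w||^2 and solve.
H = np.array([[10.0, 0.0],
              [0.0, 0.1]])          # one large, one small eigenvalue
w_star = np.array([1.0, 1.0])
alpha = 1.0

w_tilde = np.linalg.solve(H + alpha * np.eye(2), H @ w_star)     # eq. 7.10

lam, Q = np.linalg.eigh(H)
w_tilde_eig = Q @ np.diag(lam / (lam + alpha)) @ Q.T @ w_star    # eq. 7.13
assert np.allclose(w_tilde, w_tilde_eig)
print(w_tilde)   # lambda=10 direction barely shrinks; lambda=0.1 direction is nearly zeroed
```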
Sparse Representations. Weight decay acts by placing a penalty directly on the model
parameters. Another strategy is to place a penalty on the activations of the units, encouraging
their activations to be sparse. It’s important to distinguish the difference between sparse
parameters and sparse representations. In the former, if we take the example of some y = Bh,
there are many zero entries in some parameter matrix B while, in the latter, there are many
zero entries in the representation vector h. The modification to the loss function, analogous
to 7.1, takes the form

      J̃(θ; X, y) = J(θ; X, y) + αΩ(h)   where α ∈ [0, ∞)        (7.48)
Adversarial Training. Even networks that perform at human-level accuracy can have a nearly 100
percent error rate on examples that are intentionally constructed by searching for an input x′
near a data point x such that the model output for x′ is very different from the output for x.
In many cases, x′ can be so similar to x that a human cannot tell the difference!

      x′ ←− x + ε · sign (∇x J(θ; x, y))        (58)

In the context of regularization, one can reduce the error rate on the original i.i.d. test set
via adversarial training – training on adversarially perturbed training examples.
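A minimal fast-gradient-sign sketch of eq. 58 (my own example; the quadratic “model” and loss are
placeholders just to make the gradient concrete, not a real network):

```python
import numpy as np

# FGSM-style perturbation (eq. 58) for a toy differentiable loss J(theta; x, y).
theta = np.array([1.0, -2.0, 0.5])

def loss(x, y):
    return 0.5 * (theta @ x - y) ** 2

def grad_x(x, y):
    return (theta @ x - y) * theta       # dJ/dx for the toy loss above

x, y = np.array([0.2, 0.1, -0.3]), 1.0
eps = 0.1
x_adv = x + eps * np.sign(grad_x(x, y))  # eq. 58

print(loss(x, y), loss(x_adv, y))        # the perturbed input increases the loss
```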
Modern Practices Feb 17, 2017
Empirical Risk Minimization. The ultimate goal of any machine learning algorithm is to reduce
the expected generalization error, also called the risk:

      J ∗ (θ) = E(x,y)∼pdata [L(f (x; θ), y)]        (59)

with emphasis that the risk is over the true underlying data distribution pdata . If we knew
pdata , this would be an optimization problem. Since we don’t, and only have a set of training
samples, it is a machine learning problem. However, we can still just minimize the empirical
risk, replacing pdata in the equation above with p̂data ¹⁰.

So, how is minimizing the empirical risk any different than familiar gradient descent
approaches? Aren’t they designed to do just that? Well, sort of, but it’s technically not the
same (ERM ≠ GD). When we say “minimize the empirical risk” in the context of optimization, we
mean this very literally. Gradient descent methods emphatically do not just go and set the
weights to values such that the empirical risk reaches its lowest possible value – that’s not
machine learning. Furthermore, many useful loss functions such as 0-1 loss¹¹ do not have useful
derivatives.
Surrogate Loss Functions and Early Stopping. In cases such as 0-1 loss, where mini-
mization is intractable, one typically optimizes a surrogate loss function instead, such as
the negative log-likelihood of the correct class. Also, an important difference between pure
optimization and our training algorithms is that the latter usually don’t halt at a local min-
imum. Instead, we usually must define some early stopping condition to terminate training
before overfitting begins to occur.
¹⁰ This amounts to a simple average of the loss function over the training points.
¹¹ The 0-1 loss function is defined as L(ŷ, y) = 0 if ŷ = y and 1 otherwise, i.e. it simply
counts misclassifications.
Batch and Minibatch Algorithms. Computing ∇θ J(θ) as an expectation over the entire training set
is expensive, so we typically compute the expectation over a small subset of the examples.
Recall that the standard deviation, or standard error SE(µm ), of the mean taken over some
subset of m ≤ n samples, µm = (1/m) Σ_{i∼Rand(0, n, size=m)} x(i) , is given by σ/√m, where σ is
the true [sample] standard deviation of the full n data samples. In other words, to reduce the
standard error of such a gradient estimate by a factor of 10 requires 100 times more
samples-per-batch (and thus 100 times more computation). For this reason, most optimization
algorithms actually converge much faster if they can rapidly compute approximate estimates of
the gradient (re: smaller batches) rather than slowly computing the exact gradient.

¹² Or even less, i.e. not using all of the training data.
Training algorithms. Below, I list some popular training algorithms and their update equations.

• SGD.
      g ← (1/m) Σ_i ∇θ L( f (x(i) ; θ), y (i) )        (62)
      θ ← θ − εg        (63)

• Momentum.
      g ← (1/m) Σ_i ∇θ L( f (x(i) ; θ), y (i) )        (64)
      v ← αv − εg        (65)
      θ ← θ + v        (66)

• AdaGrad. Different learning rate for each model parameter. Individually adapts the learning
  rates of all model parameters by scaling them inversely proportional to the square root of the
  sum of all historical squared values of the gradient. Empirically, can result in premature and
  excessive decrease in the effective learning rate.
      g ← (1/m) Σ_i ∇θ L( f (x(i) ; θ), y (i) )        (70)
      r ← r + g ⊙ g        (71)
      θ ← θ − ( ε / (δ + √r) ) ⊙ g        (72)
  where the gradient accumulation variable r is initialized to the zero vector, and the fraction
  and square root in the last equation are applied element-wise.

• RMSProp. Modifies AdaGrad by changing the gradient accumulation into an exponentially weighted
  moving average.
      g ← (1/m) Σ_i ∇θ L( f (x(i) ; θ), y (i) )        (73)
      r ← ρr + (1 − ρ) g ⊙ g        (74)
      θ ← θ − ( ε / √(δ + r) ) ⊙ g        (75)
  It is also common to modify RMSProp to use Nesterov momentum.
• Adam. So-named to mean “adaptive moments.” We now call r the 2nd moment (variance) variable,
  and introduce s as the 1st moment (mean) variable, where the moments are for the [true]
  gradient; the new variables act as estimates of the moments [since we estimate the gradient
  with a simple average over a minibatch]. Note that these moments are uncentered.

      g ← (1/m) Σ_i ∇θ L( f (x(i) ; θ), y (i) )        (76)
      s ← [ ρ1 s + (1 − ρ1 ) g ] / (1 − ρ1^t )        (77)
      r ← [ ρ2 r + (1 − ρ2 ) g ⊙ g ] / (1 − ρ2^t )        (78)
      θ ← θ − ( ε / (√r + δ) ) ⊙ s        (79)

  where the factors 1/(1 − ρ1^t ) and 1/(1 − ρ2^t ) serve to correct for bias in the moment
  estimators.
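To tie the updates together, here is a compact sketch (my own, NumPy) of a single Adam step as
written in eqs. 76-79; the hyperparameter values are the common defaults, chosen by me:

```python
import numpy as np

def adam_step(theta, grad, s, r, t, lr=1e-3, rho1=0.9, rho2=0.999, delta=1e-8):
    """One Adam update (eqs. 76-79). s, r are running 1st/2nd moment estimates; t starts at 1."""
    s = rho1 * s + (1 - rho1) * grad
    r = rho2 * r + (1 - rho2) * grad * grad
    s_hat = s / (1 - rho1 ** t)          # bias-corrected 1st moment
    r_hat = r / (1 - rho2 ** t)          # bias-corrected 2nd moment
    theta = theta - lr * s_hat / (np.sqrt(r_hat) + delta)
    return theta, s, r

# Toy usage: minimize f(theta) = ||theta||^2.
theta = np.array([1.0, -3.0])
s = np.zeros_like(theta)
r = np.zeros_like(theta)
for t in range(1, 2001):
    grad = 2 * theta
    theta, s, r = adam_step(theta, grad, s, r, t, lr=0.05)
print(theta)   # close to the minimizer [0, 0]
```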
Modern Practices January 24, 2017
We use a 2-D image I as our input (and therefore require a 2-D kernel K). Note that most neural
networks do not technically implement convolution¹³, but instead implement a related function
called the cross-correlation, defined as

      S(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(i + m, j + n) K(m, n)        (9.6)

• Parameter sharing.
• Equivariance (to translation). Changes in inputs [to a function] cause the output to change in
  the same way. Specifically, f is equivariant to g if f (g(x)) = g(f (x)). For convolution, g
  would be some function that translates the input.

¹³ Technically the convolution output is defined as

      S(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(m, n) K(i − m, j − n)        (9.4)
               = (K ∗ I)(i, j) = Σ_m Σ_n I(i − m, j − n) K(m, n)        (9.5)
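A direct translation of eq. 9.6 into code (my own sketch, NumPy; “valid” output size only, no
padding or stride):

```python
import numpy as np

def cross_correlate2d(I, K):
    """S(i, j) = sum_m sum_n I(i+m, j+n) K(m, n)  -- eq. 9.6, 'valid' region only."""
    kh, kw = K.shape
    out_h, out_w = I.shape[0] - kh + 1, I.shape[1] - kw + 1
    S = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            S[i, j] = np.sum(I[i:i + kh, j:j + kw] * K)
    return S

I = np.arange(25, dtype=float).reshape(5, 5)
K = np.array([[1.0, 0.0], [0.0, -1.0]])
print(cross_correlate2d(I, K))   # each entry equals I[i, j] - I[i+1, j+1], i.e. -6 here
```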
Pooling. Helps make the representation approximately invariant to small translations of the
input. The use of pooling can be viewed as adding an infinitely strong prior¹⁴ that the function
the layer learns must be invariant to small translations.

• Local Response Normalization¹⁶.

      b^i_{x,y} = a^i_{x,y} / ( k + α Σ_j (a^j_{x,y})² )^β        (80)

  where j runs from [i − n/2]+ to min(N − 1, i + n/2), and N is the total number of kernels in
  the given layer¹⁷. The authors used k = 2, n = 5, α = 10−4 , β = 0.75.
• Batch Normalization¹⁸. BN allows us to use much higher learning rates and be less careful
  about initialization. The algorithm is defined in the image below, where each element of the
  batch, xi ≡ x(k)_i (where we drop the k for notational simplicity), represents the kth
  activation output from the previous layer [for the ith sample in the batch] and about to be
  fed as input to the current layer.

  Note that one can model each layer’s activations as arising from some distribution. When we
  feed data to a network, we model the data as coming from some data-generating distribution.
  Similarly, we can model the activations that occur when feeding the data through as coming
  from some activation-layer-generating distribution. The problem with this model is that the
  process of updating our weights during training changes the distribution of activations for
  each layer, which can make the learning task more difficult. Batch normalization’s goal is to
  reduce this internal covariate shift, and is motivated by the practice of normalizing our data
  to have zero mean and unit variance.

¹⁴ Where the distribution of this prior is over all possible functions learnable by the model.
¹⁵ Collected on my own. In other words, not from the deep learning book, but rather a bunch of
disconnected resources over time.
¹⁶ From section 3.3 of Krizhevsky et al. (2012). AlexNet paper.
¹⁷ In other words, the summation is over the adjacent kernel maps, with [total] window size n
(manually chosen). The min/max just says n/2 to the left (right) unless that would be past the
leftmost (rightmost) kernel map.
¹⁸ From “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate
Shift” by Ioffe et al.
Modern Practices January 15
Notation/Architecture Used.
• U: input → hidden.
• W: hidden → hidden.
• V: hidden → output.
• Activations: tanh [hidden] and softmax [after output].
• Misc. Details: x(t) is a vector of inputs fed at time t (its shape is fixed, e.g. the vocab
  length). Recall that RNNs can be unfolded for any desired number of steps τ . For example, if
  τ = 3, the general functional representation of an RNN is
  s(3) = f (s(2) ; θ) = f (f (s(1) ; θ); θ). Typical RNNs read information out of the state h to
  make predictions. (In the figures, a black square on a recurrent connection denotes an
  interaction with a delay of a single time step.)
Forward Propagation & Loss. Specify initial state h(0) . Then, for each time step from t = 1 to
t = τ , feed input sequence x(t) and compute the output sequence o(t) . To determine the loss at
each time-step, L(t) , we compare softmax(o(t) ) with (one-hot) y (t) .

      h(t) = tanh(a(t) )   where a(t) = b + W h(t−1) + U x(t)        (10.9/8)
      ŷ (t) = softmax(o(t) )   where o(t) = c + V h(t)        (10.11/10)

Note that this is an example of an RNN that maps input seqs to output seqs of the same length¹⁹.
We can then compute, e.g., the log-likelihood loss L = Σ_t L(t) over all time steps as:

      L = − Σ_t log pmodel ( y (t) | {x(1) , . . . , x(t) } )        (10.12/13/14)

(Convince yourself this is identical to cross-entropy.)

¹⁹ Where “same length” is related to the number of timesteps (i.e. τ input steps means τ output
steps), not anything about the actual shapes/sizes of each individual input/output.
where y (t) is the ground-truth (one-hot vector) at time t, whose probability of occurring is
given by the corresponding element of ŷ (t) :

      ∂L(t) /∂o(t)_i = − ( 1_{i,y(t)} − (1 / Σ_j e^{o(t)_j}) · ∂/∂o(t)_i Σ_j e^{o(t)_j} )
                     = − ( 1_{i,y(t)} − e^{o(t)_i} / Σ_j e^{o(t)_j} )        (10.18)
                     = − ( 1_{i,y(t)} − ŷ(t)_i )
                     = ŷ(t)_i − 1_{i,y(t)}

so every entry of the gradient w.r.t. o(t) equals the predicted probability ŷ(t)_i , except the
entry corresponding to the true label, which equals ŷ(t)_i − 1 and is therefore negative. All
this means is: since we want to increase the probability of the true entry, driving its value up
will decrease the loss (hence the negative sign), and driving any other entry up will increase
the loss in proportion to its current estimated probability (driving up an [incorrect] entry
that is already high is “worse” than driving up a small [incorrect] entry).
• Hidden nodes. First, consider the simplest hidden node to take the gradient of, the last one,
  h(τ ) (simplest because only one descendant [path] reaches any loss node(s)).

      (∇h(τ ) L)_i = (∂L/∂L(τ ) ) Σ_{k=1}^{n_out} ( ∂L(τ ) /∂o(τ )_k ) ( ∂o(τ )_k /∂h(τ )_i )
                   = Σ_{k=1}^{n_out} (∇o(τ ) L)_k ∂o(τ )_k /∂h(τ )_i
                   = Σ_{k=1}^{n_out} (∇o(τ ) L)_k ∂/∂h(τ )_i ( ck + Σ_{j=1}^{n_hid} Vkj h(τ )_j )        (10.19)
                   = Σ_{k=1}^{n_out} (∇o(τ ) L)_k Vki
                   = Σ_{k=1}^{n_out} (V T )ik (∇o(τ ) L)_k
                   = ( V T ∇o(τ ) L )_i

  Before proceeding, notice the following useful pattern: If two nodes a and b, each containing
  na and nb neurons, are fully connected by parameter matrix W ∈ Rnb ×na and directed like
  a → b → L, then²⁰ ∇a L = W T ∇b L. Using this result, we can then iterate and take gradients
  back in time from t = τ − 1 to t = 1 as follows:

      ∇h(t) L = ( ∂h(t+1) /∂h(t) )T ∇h(t+1) L + ( ∂o(t) /∂h(t) )T ∇o(t) L        (10.20)
              = W T diag( 1 − tanh²(a(t+1) ) ) ∇h(t+1) L + V T ∇o(t) L        (10.21)
              = W T diag( 1 − (h(t+1) )² ) ∇h(t+1) L + V T ∇o(t) L

  (Recall d/dx tanh(x) = 1 − tanh²(x), and diag(a)ii ≜ ai .)
2. Parameter Gradients. Now we can compute the gradients for the parameter matrices/vectors,
   where it is crucial to remember that a given parameter matrix (e.g. U ) is shared across all
   time steps t. We can treat tensor derivatives in the same form as previously done with
   vectors after a quick abstraction: For any tensor X of arbitrary rank (e.g. if rank-4 then
   index like Xijkℓ ), use a single variable (e.g. i) to represent the complete tuple of
   indices²¹.

²⁰ More generally,

      ∇a L = ( ∂b/∂a )T ∇b L

which is a good example of how vector derivatives map into a matrix. For example, let a ∈ Rna
and b ∈ Rnb . Then ∂b/∂a ∈ Rnb ×na .
• Bias parameters [vectors]. These are nothing new, since they are just vectors.

      ∇c L = Σ_t ( ∂o(t) /∂c )T ∇o(t) L = Σ_t ∇o(t) L        (10.22)
      ∇b L = Σ_t ( ∂h(t) /∂b )T ∇h(t) L = Σ_t diag( 1 − (h(t) )² ) ∇h(t) L        (10.23)

• V (nout × nhid ).

      ∇V L = Σ_{t=1}^{τ} ∇V L(t)        (82a)
           = Σ_{t=1}^{τ} ∇V L(t) ( o(t)_1 , . . . , o(t)_{nout} )        (82b)
           = Σ_{t=1}^{τ} Σ_{i=1}^{nout} (∇o(t) L)_i ∇V o(t)_i        (82c)
           = Σ_{t=1}^{τ} Σ_{i=1}^{nout} (∇o(t) L)_i ∇V ( ci + Σ_{j=1}^{nhid} Vij h(t)_j )        (82d)
           = Σ_{t=1}^{τ} Σ_{i=1}^{nout} (∇o(t) L)_i · [ the matrix whose only nonzero row is the
             ith row, which is (h(t) )T ]        (82e)
           = Σ_{t=1}^{τ} ∇o(t) L (h(t) )T        (82f)

  (See footnote²² for more.)

²¹ More details on tensor derivatives: Consider the chain defined by Y = g(X) and z = f (Y),
where z is some vector. Then

      ∇X z = Σ_j (∇X Yj ) ∂z/∂Yj

²² The general lesson learned here is that, for some matrix W ∈ Ra×b and vector x ∈ Rb ,

      ∇W [ Σ_i (Wx)i ] = [ xT ; xT ; . . . ; xT ]        (83)

i.e. a matrix every row of which is xT .
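A finite-difference spot check (my own sketch, NumPy; all sizes and data are made up) of eq. 82f:
build a tiny vanilla-RNN forward pass, accumulate ∇V L = Σ_t ∇o(t) L (h(t))T, and compare one
entry against a numerical derivative:

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max()); return e / e.sum()

rng = np.random.default_rng(1)
n_in, n_hid, n_out, tau = 3, 4, 3, 5
U = rng.normal(scale=0.3, size=(n_hid, n_in))
W = rng.normal(scale=0.3, size=(n_hid, n_hid))
V = rng.normal(scale=0.3, size=(n_out, n_hid))
b, c = np.zeros(n_hid), np.zeros(n_out)
xs = rng.normal(size=(tau, n_in)); ys = rng.integers(0, n_out, size=tau)

def loss(Vmat):
    h, L = np.zeros(n_hid), 0.0
    for x, y in zip(xs, ys):
        h = np.tanh(b + W @ h + U @ x)
        L -= np.log(softmax(c + Vmat @ h)[y])
    return L

# Analytic gradient via eq. 82f, with grad wrt o(t) = yhat - onehot(y) per eq. 10.18.
h, grad_V = np.zeros(n_hid), np.zeros_like(V)
for x, y in zip(xs, ys):
    h = np.tanh(b + W @ h + U @ x)
    grad_o = softmax(c + V @ h); grad_o[y] -= 1.0
    grad_V += np.outer(grad_o, h)                     # eq. 82f

eps = 1e-6
V_pert = V.copy(); V_pert[0, 0] += eps
num = (loss(V_pert) - loss(V)) / eps
assert np.isclose(grad_V[0, 0], num, atol=1e-4)       # spot-check the (0, 0) entry
```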
2.5.2 RNNs as Directed Graphical Models
The advantage of RNNs is their efficient parameterization of the joint distribution over y (i)
via parameter sharing. This introduces a built-in assumption that we can model the effect of
y (i) in the distant past on the current y (t) via its effect on h. We are also assuming that the
conditional probability distribution over the variables at t + 1 given the variables at time t is
stationary. Next, we want to know how to draw samples from such a model. Specifically,
how to sample from the conditional distribution (y (t) given y (t−1) ) at each time step.
Say we want to model a sequence of scalar random variables Y ≜ {y (1) , . . . , y (τ ) } for some
sequence length τ . Without making independence assumptions just yet, we can parameterize the
joint distribution P (Y) with basic definitions of probability:

      P (Y) ≜ P (y (1) , . . . , y (τ ) ) = Π_{t=1}^{τ} P (y (t) | y (t−1) , . . . , y (1) )        (86)

If each value y could take on the same fixed set of k values, we would need to learn on the
order of k^τ parameters to represent the joint distribution P (Y). This is clearly inefficient,
since the number of parameters needed scales like O(k^τ ). If we relax the restriction that each
y (i) must depend directly on all past y (j) , we can considerably reduce the number of
parameters needed to compute the probability of some particular sequence.
We could include latent variables h at each timestep that capture the dependencies, reminiscent
of a classic RNN:
Since in the RNN case all factors P (h(t) | h(t−1) ) are deterministic, we don’t need any
additional parameters to compute this probability²⁴, other than the m² parameters needed to
convert any h(t) to the next h(t+1) (which are shared across all transitions). Now, the number
of parameters needed as a function of sequence length is constant, and as a function of k is
just O(k).
Finally, to view the RNN as a graphical model, we must describe how to sample from it, namely
how to sample a sequence y from P (Y), if parameterized by our graphical model above. In the
general case where we don’t know the value of τ for our sequence y, one approach is to have a
EOS symbol that, if found during sampling, means we should stop there. Also, in the typical
case where we actually want to model P (y | x) for input sequence x, we can reinterpret the
parameters θ of our graphical model as a function of x the input sequence. In other words, the
graphical model interpretation becomes a function of x, where x determines the exact values
of the probabilities the graphical model takes on – an “instance” of the graphical model.
²⁴ Don’t forget that, in a neural net, a variable y (t) is represented by a layer, which itself
is composed of k nodes, each associated with one of the k unique values that y (t) could be.
Encoder-Decoder Seq2Seq Architectures (10.4) Here we discuss how an RNN can be
trained to map an input sequence to output sequence which is not necessarily the same length.
(Not really much of a discussion...figure below says everything.)
Gradients propagated over many stages either vanish (usually) or explode. We saw how this
could occur when we took parameter gradients earlier, and for weight matrices W further
along from the loss node, the expression for ∇W L contained multiplicative Jacobian factors.
Consider the (linear activation) repeated function composition of an RNN’s hidden state in
10.36. We can rewrite it as a power method (10.37), and if W admits an eigendecomposition
(remember W is necessarily square here), we can further simplify as seen in 10.38.
      h(t) = W T h(t−1)        (10.36)
           = (W t )T h(0)        (10.37)
           = QT Λt Q h(0)        (10.38)

(Q: Explain the interpretation of multiplying h by Q, as opposed to the usual QT explained in
the linear algebra review.)

Any component of h(0) that isn’t aligned with the largest eigenvector will eventually be
discarded.²⁵
If, however, we have a non-recurrent network such that the state elements are repeatedly
multiplied by different w(t) at each time step, the situation is different. Suppose the
different w(t) are i.i.d. with mean 0 and variance v. The variance of the product is easily seen
to be O(v^n )²⁶. To obtain some desired variance v ∗ we may choose the individual weights with
variance v = ⁿ√v ∗ .

²⁵ Make sure to think about this from the right perspective. The largest value of t = τ in the
RNNs we’ve seen would correspond with either (1) the largest output sequence or (2) the largest
input sequence (if fixed-vector output). After we extract the output from a given forward pass,
we reset the clock and either back-propagate errors (if training) or get ready to feed another
sequence.
While leaky units have connection weights that are either manually chosen constants or are
trainable parameters, gated RNNs generalize this to connection weights that may change at each
time step. Furthermore, gated RNNs can learn to both accumulate and forget, while leaky units
are designed for just accumulation²⁷.

LSTM (10.10.1). The idea is we want self-loops to produce paths where the gradient can flow for
long durations. The self-loop weights are gated, meaning they are controlled by another hidden
unit, interpreted as being conditioned on context. Listed below are the main components of the
LSTM architecture. (The subscript i identifies the cell; the superscript t denotes the time.)

• Forget gate:         f (t)_i = σ( b^f_i + Σ_j U^f_{i,j} x(t)_j + Σ_j W^f_{i,j} h(t−1)_j )
• Internal state:      s(t)_i = f (t)_i s(t−1)_i + g (t)_i σ( bi + Σ_j Ui,j x(t)_j + Σ_j Wi,j h(t−1)_j )
• External input gate: g (t)_i = σ( b^g_i + Σ_j U^g_{i,j} x(t)_j + Σ_j W^g_{i,j} h(t−1)_j )
• Output gate:         q (t)_i = σ( b^o_i + Σ_j U^o_{i,j} x(t)_j + Σ_j W^o_{i,j} h(t−1)_j )
²⁶ Quick sketch of (my) proof:

      Var[ w(i) ] = v = E[ (w(i) )² ] − ( E[ w(i) ] )²        (87)
      Var[ Π_t^n w(t) ] = E[ ( Π_t^n w(t) )² ] − ( E[ Π_t^n w(t) ] )² = Π_t^n E[ (w(t) )² ] = v^n        (88)

(using that the w(t) are independent with mean zero)

²⁷ Q: Isn’t choosing to update with higher relative weight on the present the same as
forgetting? A: Sort of. It’s like “soft forgetting” and will inevitably erase more/less than
desired (smeary). In this context, “forget” means to set the weight of a specific past cell to
zero.
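A single-timestep sketch (my own, NumPy) of the gate equations above; the readout
h(t) = tanh(s(t)) · q(t) is the standard one and is my addition, since these notes do not list it
explicitly:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, s_prev, p):
    """One LSTM step following the gate equations above. p holds the per-gate U, W, b."""
    f = sigmoid(p["bf"] + p["Uf"] @ x + p["Wf"] @ h_prev)                # forget gate
    g = sigmoid(p["bg"] + p["Ug"] @ x + p["Wg"] @ h_prev)                # external input gate
    q = sigmoid(p["bo"] + p["Uo"] @ x + p["Wo"] @ h_prev)                # output gate
    s = f * s_prev + g * sigmoid(p["b"] + p["U"] @ x + p["W"] @ h_prev)  # internal state
    h = np.tanh(s) * q   # standard readout (my addition, not listed in the notes above)
    return h, s

n_in, n_hid = 3, 4
rng = np.random.default_rng(0)
p = {k: rng.normal(scale=0.1, size=(n_hid, n_in)) for k in ("Uf", "Ug", "Uo", "U")}
p.update({k: rng.normal(scale=0.1, size=(n_hid, n_hid)) for k in ("Wf", "Wg", "Wo", "W")})
p.update({k: np.zeros(n_hid) for k in ("bf", "bg", "bo", "b")})

h, s = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):     # run 5 timesteps
    h, s = lstm_step(x, h, s, p)
print(h)
```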
Modern Practices February 14
To define distributions over longer sequences, we can just use Bayes’ rule over the shorter
distributions, as usual. For example, say we want to find the [joint] distribution for some
τ -gram (τ > n), and we have access to an n-gram model and a [perhaps different] model for the
initial sequence P (x1 , . . . , xn−1 ). We compute the τ distribution simply as follows:

      P (x1 , . . . , xτ ) = P (x1 , . . . , xn−1 ) Π_{t=n}^{τ} P (xt | xt−1 , . . . , xt−(n−1) )        (12.5)

where it’s important to see that each factor in the product is a distribution over a length-n
sequence. Since we need that initial factor, it is common to train both an n-gram model and an
(n − 1)-gram model simultaneously.
• Derivation: We can use the built-in model assumption to derive the following formula.

      P (THE DOG RAN AWAY) = P (AWAY | THE DOG RAN) P3 (THE DOG RAN)
                           = P3 (AWAY | DOG RAN) P3 (THE DOG RAN)
                           = P3 (THE DOG RAN) · P3 (DOG RAN AWAY) / P2 (DOG RAN)        (12.7)
Limitations of n-gram models. The last example illustrates some potential problems one may
encounter that arise [if using MLE] when the full joint we seek is nonzero, but (a) some Pn
factor is zero, or (b) Pn−1 is zero. (Recall that, in MLE, the Pn and Pn−1 are usually
approximated via counting occurrences in the training set.) Some methods of dealing with this
are as follows.
• Smoothing: shifting probability mass from the observed tuples to unobserved ones that are
  similar.
• Back-off methods: look up the lower-order (lower values of n) n-grams if the frequency of the
  context xt−1 , . . . , xt−(n−1) is too small to use the higher-order model.
In addition, n-gram models are vulnerable to the curse of dimensionality, since most n-grams
won’t occur in the training set²⁸, even for modest n.
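A toy sketch (my own example, plain Python) of the MLE-by-counting estimate for a bigram model,
with add-one smoothing as one way to avoid the zero-count problems described above:

```python
from collections import Counter

# Toy corpus; estimate P2(w_t | w_{t-1}) by counting, with add-one (Laplace) smoothing.
corpus = "the dog ran away the dog sat the cat ran".split()
vocab = sorted(set(corpus))
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p2_mle(prev, word):
    return bigrams[(prev, word)] / unigrams[prev]            # zero for unseen bigrams

def p2_smoothed(prev, word):
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(vocab))

print(p2_mle("dog", "ran"), p2_mle("cat", "away"))           # 0.5, 0.0
print(p2_smoothed("cat", "away"))                            # nonzero: mass shifted to unseen pairs
```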
²⁸ For a given vocabulary, which usually has many more than n possible words, consider how many
possible sequences of length n there are.
²⁹ Ok, I tried re-wording that from the book’s confusing wording but that was also a bit
confusing. Let me break it down. Say you train on a thousand sentences, each of length 5. For a
given vocabulary of size VOCAB_SIZE, the number of possible sequences of length 5 is
(VOCAB_SIZE)⁵, which can be quite a lot more than a thousand (not to mention the possibility of
duplicate training examples). To the naive model, all points in this high-dimensional space are
basically the same. A neural language model, however, tries to arrange the space of
possibilities in a meaningful way, so that an unforeseen sample at test time can be considered
“similar” to some previously seen training example. It does this by embedding words/sentences
in a lower-dimensional feature space.
Deep Learning Research
Deep Learning Research January 12
Overview. Much research is in building a probabilistic model³⁰ of the input, pmodel (x). Why?
Because then we can perform inference to predict stuff about our environment given any of the
other variables. We call the other variables latent variables, h, with

      pmodel (x) = Σ_h Pr(h) Pr(x | h) = Eh [ pmodel (x | h) ]        (90)

So what? Well, the latent variables provide another means of data representation, which can be
useful. Linear factor models (LFM) are some of the simplest probabilistic models with latent
variables.

      A linear factor model is defined by the use of a stochastic linear decoder function
      that generates x by adding noise to a linear transformation of h:

            x = Wh + b + noise        (91)

Note that h is a vector of arbitrary size, where we assume p(h) is a factorial distribution:
p(h) = Π_i p(hi ). This roughly means we assume the elements of h are mutually independent³¹.

• Factor Analysis: defines the prior and noise as

      h ∼ N (h; 0, I)        (92)
      noise ∼ N (0, ψ ≡ diag(σ²))        (93)
      x ∼ N (x; b, WW T + ψ)        (94)
where the last relation can be shown by recalling that a linear combination of Gaussian
variables is itself Gaussian, and showing that Eh [x] = b, and Cov(x) = WW T + ψ.
³⁰ Whereas, before, we’ve been building functions of the input (deterministic).
³¹ Note that, technically, this assumption isn’t strictly the definition of mutual independence,
which requires that every subset (i.e. not just the full set) of {hi ∈ h} follow this factorial
property.
It is worth emphasizing the interpretation of ψ as the matrix of conditional variances σi². Huh?
Let’s take a step back. The fact that we were able to separate the distributions in the above
relations for h and noise is from a built-in assumption that Pr(xi | h, xj≠i ) = Pr(xi | h)³².
The latent variable h is a big deal because it captures the dependencies between
the elements of x. How do I know? Because of our assumption that the xi are
conditionally independent given h. If, once we specify h, all the elements of x
become independent, then any information about their interrelationship is hiding
somewhere in h.
Detailed walkthrough of Factor Analysis (a.k.a. me slowly reviewing, months after taking this
note):
– Goal. Analyze and understand the motivations behind how Factor Analysis defines the
  data-generation process under the framework of LFMs (defined in steps 1 and 2 earlier).
  Assume h has dimension n.
– Prior. Defines p(h) := N (h; 0, I), the unit-variance Gaussian. Explicitly,

      p(h) := (2π)^{−n/2} exp( −½ Σ_i^n hi² )

– Noise. Assumed to be drawn from a Gaussian with diagonal covariance matrix ψ := diag(σ²).
  Explicitly,

      p(noise = a) := (2π)^{−n/2} ( Π_i^n σi )^{−1} exp( −½ Σ_i^n ai²/σi² )

– Deriving the distribution of x. We use the fact that any linear combination of Gaussians is
  itself Gaussian. Thus, deriving p(x) is reduced to computing its mean and covariance matrix.

      µx = Eh [ Wh + b ]        (95)
         = ∫ p(h)(Wh + b) dh        (96)
         = b + (2π)^{−n/2} ∫ exp( −½ Σ_i^n hi² ) Wh dh        (97)
         = b        (98)

      Cov(x) = E[ (x − E [x])(x − E [x])T ]        (99)
             = E[ (Wh + noise)(hT W T + noiseT ) ]        (100)
             = E[ WhhT W T ] + ψ        (101)
             = WW T + ψ        (102)
48
– Thoughts. Not really seeing why this is useful/noteworthy. Feels very contrived (many
assumptions) and restrictive – it only applies if the dependencies between each xi can be
modeled with a random variable h sampled from a unit variance Gaussian.
• Probabilistic PCA: Just factor analysis with ψ = σ 2 I. So zero-mean spherical Gaus-
sian noise. It becomes regular PCA as σ → 0. Here we can use an iterative EM algorithm
for estimating the parameters W.
49
Deep Learning Research May 07
Introduction. An autoencoder learns to copy its input to its output, via an encoder function
h = f(x) and a decoder function r = g(h) (r for "reconstruction"). Modern autoencoders generalize this to allow for stochastic mappings p_encoder(h | x) and p_decoder(x | h).
50
Deep Learning Research May 07
In English: just apply the regular learning/training process for each layer/stage sequen-
tially and individually33 .
When this is complete, we can run fine-tuning: train all layers together (including any later
layers that could not be pretrained) with a supervised learning algorithm. Note that we do
indeed allow the pretrained encoding stages to be optimized here (i.e. not fixed).
33
In other words, you proceed one layer at a time in order. You don’t touch layer i until the weights in layer
i − 1 have been learned.
51
Deep Learning Research October 01, 2017
Directed Models. Also called belief networks or Bayesian networks. Formally, a directed graphical model defined on a set of variables {x} is defined by a DAG, G, whose vertices
are the random variables in the model, and a set of local conditional probability distributions, p(x_i | Pa_G(x_i)), where Pa_G(x_i) gives the parents of x_i in G. The probability distribution
over x is given by
p(x) = ∏_i p(x_i | Pa_G(x_i))   (108)
Undirected Graphical Models. Also called Markov Random Fields (MRFs) or Markov
Networks. Appropriate for situations where interactions do not have a well-defined direction.
Each clique C (any set of nodes that are all [maximally] connected) in G is associated with a
factor φ(C). The factor φ(C), also called a clique potential, is just a function (not necessarily
a probability) that outputs a number when given a possible set of values over the nodes in C.
The output number measures the affinity of the variables in that clique for being in the states
specified by the inputs. (Clique potentials are constrained to be nonnegative.) The set of all factors in G defines an unnormalized probability
distribution:
p̃(x) = ∏_{C∈G} φ(C)   (109)
34
Consider the common NLP case where our vector x contains n word tokens, each of which can take on any
symbol in our vocabulary of size v. If we assign n = 100 and v = 100,000, which are relatively common values
for this case, this amounts to (10^5)^{100} = 10^{500} parameters.
52
The Partition Function. To obtain a valid probability distribution, we must normalize p̃(x):
p(x) = (1/Z) p̃(x)   (110)
Z = ∫ p̃(x) dx   (111)
where the normalizing function Z = Z({φ}) is known as the partition function (physics shout-out). It is typically intractable to compute, so we resort to approximations. Note that
Z isn't even guaranteed to exist – it exists only for those definitions of the clique potentials that
cause the integral over p̃(x) to converge/be defined.
Energy-Based Models (EBMs). A convenient way to enforce ∀x, p̃(x) > 0 is to use EBMs, where
p̃(x) = exp(−E(x))   (112)
and E(x) is known as the energy function35. Many algorithms need to compute not p_model(x)
but only log p̃_model(x) (unnormalized log probabilities – logits!). For EBMs with latent variables
h, such algorithms are phrased in terms of the free energy:
F(x = x) = −log Σ_h exp(−E(x = x, h = h))   (113)
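To make the partition function of an EBM concrete, here is a small sketch (mine, not from the book) that brute-forces Z for a toy energy-based model over d binary variables; the quadratic energy and all parameter values are arbitrary choices.

import itertools
import numpy as np

rng = np.random.default_rng(0)
d = 4
U = np.triu(rng.normal(size=(d, d)), k=1)   # arbitrary pairwise weights (upper triangle)
b = rng.normal(size=d)                       # arbitrary biases

def energy(x):
    return -(x @ U @ x + b @ x)

states = [np.array(s) for s in itertools.product([0, 1], repeat=d)]
p_tilde = np.array([np.exp(-energy(s)) for s in states])   # unnormalized probabilities
Z = p_tilde.sum()                                          # partition function (tractable only for tiny d)
p = p_tilde / Z
print(Z, p.sum())   # p now sums to 1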
Separation and D-Separation. We want to know which subsets of variables are condi-
tionally independent from each other, given the values of other subsets of variables. A set
of variables A is separated (if undirected model)/d-separated (if directed model) from an-
other set of variables B given a third set of variables S if the graph structure implies that A is
independent from B given S.
• Separation. For undirected models. If variables a and b are connected by a path
involving only unobserved variables (an active path), then a and b are not separated.
Otherwise, they are separated. Any paths containing at least one observed variable are
called inactive.
• D-Separation36 . For directed models. Although there are rules that help determine
whether a path between a and b is d-separated, it is simplest to just determine whether
a is independent from b given any observed variables along the path.
35
Physics throwback: this mirrors the Boltzmann factor, exp(−ε/τ ), which is proportional to the probability
of the system being in quantum energy state ε.
36
The D stands for dependence.
53
3.4.1 Sampling from Graphical Models
For directed graphical models, we can do ancestral sampling to produce a sample x from
the joint distribution represented by the model. Just sort the variables x_i into a topological
ordering, such that parents come before children: ∀i, j: x_i ∈ Pa_G(x_j) ⟹ i < j. To produce the sample, just sequentially
sample from the beginning, x_1 ∼ P(x_1), x_2 ∼ P(x_2 | Pa_G(x_2)), etc.
For undirected graphical models, one simple approach is Gibbs sampling. Essentially, this
involves drawing a conditioned sample from xi ∼ p(xi | neighbors(xi )) for each xi . This process
is repeated many times, where each subsequent pass uses the previously sampled values in
neighbors(xi ) to obtain an asymptotically converging [to the correct distribution] estimate for
a sample from p(x).
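A minimal Gibbs sampler for a toy binary undirected model (my own sketch; it reuses the arbitrary upper-triangular pairwise energy from the snippet above, for which p(x_i = 1 | x_{−i}) has the logistic form derived later in these notes):

import numpy as np

rng = np.random.default_rng(0)
d = 4
U = np.triu(rng.normal(size=(d, d)), k=1)   # arbitrary couplings, zero diagonal
b = rng.normal(size=d)

def gibbs_sample(n_sweeps=5000, burn_in=1000):
    x = rng.integers(0, 2, size=d).astype(float)
    samples = []
    for sweep in range(n_sweeps):
        for i in range(d):
            # p(x_i = 1 | neighbors) reduces to a logistic function of the other units
            delta_e = U[i, :] @ x + U[:, i] @ x + b[i]
            x[i] = float(rng.random() < 1.0 / (1.0 + np.exp(-delta_e)))
        if sweep >= burn_in:
            samples.append(x.copy())
    return np.array(samples)

samples = gibbs_sample()
print(samples.mean(axis=0))   # asymptotically approaches the true marginals E[x_i]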
One of the main tasks with graphical models is predicting the values of some subset of variables
given another subset: inference. Although the graph structures we’ve discussed allow us to
represent complicated, high-dimensional distributions with a reasonable number of parameters,
the graphs used for deep learning are usually not restrictive enough to allow efficient inference.
Approximate inference for deep learning usually refers to variational inference, in which we
approximate the distribution p(h | v) by seeking an approximate distribution q(h | v) that is
as close to the true one as possible.
Here p̃(v, h) = exp(−E(v, h)) with the standard RBM energy E(v, h) = −b^T v − c^T h − v^T W h, where b, c, and W are unconstrained, real-valued, learnable parameters. One could interpret
the values of the bias parameters as the affinities for the associated variable being its given
value, and the value W_{i,j} as the affinity of v_i being its value and h_j being its value at the same
time37.
The restrictions on the RBM structure, namely the fact that there are no intra-layer connec-
tions, yields nice properties. Since pe(h, v) can be factored into clique potentials, we can say
37
More concretely, remember that v is a one-hot vector representing some state that can assume len(v)
unique values, and similarly for h. Then Wi,j gives the affinity for the state associated with v being its ith
value and the state associated with h being its jth value.
54
that:
p(h | v) = ∏_i p(h_i | v)   (115)
p(v | h) = ∏_i p(v_i | h)   (116)
Also, due to the restriction of binary variables, each of the conditionals is easy to compute,
and can be quickly derived as
p(h_i = 1 | v) = σ(c_i + v^T W_{:,i})   (117)
55
Deep Learning Research May 09
Monte Carlo Sampling (Basics). We can approximate the value of a (usually prohibitively
large) sum/integral by viewing it as an expectation under some distribution. We can then
approximate its value by taking samples from the corresponding probability distribution and
taking an empirical average. Mathematically, the basic idea is shown below:
s = ∫ p(x) f(x) dx = E_p[f(x)]  →  ŝ_n = (1/n) Σ_{i=1}^n f(x^{(i)}),  where x^{(i)} ∼ p   (118)
As we’ve seen before, the empirical average is an unbiased38 estimator. Furthermore, the
central limit theorem tells us that the distribution of ŝn converges to a normal distribution
with mean s and variance Var [f (x)] /n.
Importance Sampling. What if it’s not feasible for us to sample from p? We can approach
this a couple ways, both of which will exploit the following identity:
p(x) f(x) = q(x) · (p(x) f(x) / q(x))   (122)
38
Recall that expectations on such an average are still taken over the underlying (assumed) probability
distribution:
E_p[ŝ_n] = (1/n) Σ_{i=1}^n E_p[f(x^{(i)})]   (119)
         = (1/n) Σ_{i=1}^n s   (120)
         = s   (121)
You should think of the expectation E_p[f(x^{(i)})] as the expected value of the random sample from the underlying
distribution, which of course is s, because we defined it that way.
56
At first glance, it feels a little wonky, but recognize that we are sampling from q instead
of p (i.e. if this were an integral, it would be over q(x)dx). The catch is that, now, the
variance can be greatly sensitive to the choice of q:
Var[ŝ_q] = Var[p(x) f(x) / q(x)] / n   (124)
where p̃ and q̃ are the unnormalized forms of p and q, and the x^{(i)} samples are still drawn
from [the original/unknown] q. E[ŝ_BIS] ≠ s, except asymptotically as n → ∞.
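Here is a small numpy sketch (mine, with an arbitrary integrand and proposal) contrasting the plain Monte Carlo estimator of eq. (118) with the importance-sampled version: samples come from q, and each term is reweighted by p(x)/q(x).

import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Target p = N(0, 1), integrand f(x) = x^2, so the true value is E_p[x^2] = 1.
f = lambda x: x ** 2
p_pdf = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

# Plain Monte Carlo: sample directly from p.
x_p = rng.normal(size=n)
print(f(x_p).mean())                      # approx 1

# Importance sampling: sample from a broader proposal q = N(0, 2^2).
q_sigma = 2.0
q_pdf = lambda x: np.exp(-x**2 / (2 * q_sigma**2)) / (q_sigma * np.sqrt(2 * np.pi))
x_q = rng.normal(scale=q_sigma, size=n)
weights = p_pdf(x_q) / q_pdf(x_q)         # p(x)/q(x)
print((weights * f(x_q)).mean())          # also approx 1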
57
Deep Learning Research August 30, 2018
and explicitly learn an approximation, c, for − log Z(θ). Obviously MLE would just try jacking
up c to maximize this, so we adopt a surrogate supervised training problem: binary classifi-
cation that a given sample x belongs to the (true) data distribution pdata or to the noise
distribution pnoise . We introduce binary variable y to indicate whether the sample is in the
true data distribution (y=1) or the noise distribution (y=0). Our surrogate model is thus
defined by
p_joint(y=1) = 1/2   (128)
p_joint(x | y=1) = p_model(x)   (129)
p_joint(x | y=0) = p_noise(x)   (130)
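For intuition on the surrogate classifier, a tiny sketch (mine, with made-up 1-D densities): with the 50/50 prior of eq. (128), the Bayes-optimal posterior is p_model(x) / (p_model(x) + p_noise(x)), which is what the binary classification objective pushes the model toward.

import numpy as np

# Hypothetical densities just for illustration (not from the text).
p_model = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)        # N(0, 1)
p_noise = lambda x: np.exp(-(x - 1)**2 / 2) / np.sqrt(2 * np.pi)  # N(1, 1)

def posterior_y1(x):
    # p(y=1 | x) under the surrogate model, using p(y=1) = p(y=0) = 1/2
    return p_model(x) / (p_model(x) + p_noise(x))

print(posterior_y1(np.array([-1.0, 0.0, 0.5, 2.0])))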
58
Deep Learning Research Nov 15, 2017
Overview. Most graphical models with multiple layers of hidden variables have intractable
posterior distributions. This is typically because the partition function scales exponentially
with the number of units and/or due to marginalizing out latent variables. Many approximate
inference approaches make use of the observation that exact inference can be described as an
optimization problem.
Assume we have a probabilistic model consisting of observed variables v and latent variables
h. We want to compute log p(v; θ), but it’s too costly to marginalize out h. Instead, we
compute a lower bound L(v, θ, q) – often called the evidence lower bound (ELBO) or
negative variational free energy – on log p(v; θ)39. (Here q is an arbitrary probability distribution over h. Note that the book will write q when they really mean q(h | v).)
L(v, θ, q) = log p(v; θ) − D_KL(q(h | v) || p(h | v; θ))   (135)
           = E_{h∼q(h|v)}[log p(h, v)] + H(q(h | v))   (136)
where the second form is the more canonical definition40. Note that L(v, θ, q) is a lower-bound
on log p(v; θ) by definition, since D_KL(q(h | v) || p(h | v; θ)) ≥ 0,
with equality (to zero) iff q is the same distribution as p(h | v). In other words, L can
be viewed as a function parameterized by q that's maximized when q is p(h | v), and with
maximal value log p(v). Therefore, we can cast the inference problem of computing the (log)
probability of the observed data log p(v) into an optimization problem of maximizing L. Exact
inference can be done by searching over a family of functions that contains p(h | v).
39
Recall that D_KL(P || Q) = E_{x∼P(x)}[log(P(x)/Q(x))].
40
This can be derived easily from the first form. Hint:
log[q(h | v) / p(h | v)] = log[q(h | v) / (p(h, v; θ)/p(v; θ))]
59
Expectation Maximization (19.2). Technically not an approach to approximate inference,
but rather an approach to learning with an approximate posterior. Popular for training models
with latent variables. The EM algorithm consists of alternating between the following 2 steps
until convergence:
1. E-step. For each training example v^(i) (in current batch or full set), set q(h | v^(i)) := p(h | v^(i); θ^(0)),
where θ^(0) denotes the current parameter values of the model at the beginning of the
E-step. This can also be interpreted as maximizing L w.r.t. q.
2. M-step. Update the parameters θ by completely or partially finding
arg max_θ Σ_i L(v^(i), θ, q(h | v^(i); θ^(0)))   (138)
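As a concrete instance of the E/M alternation, here is my own sketch for a two-component 1-D Gaussian mixture with known unit variances and equal mixing weights (those simplifications are mine, not from the book): the E-step sets q(h | v) to the current posterior responsibilities, and the M-step re-estimates the means.

import numpy as np

rng = np.random.default_rng(0)
# Synthetic data from two components with true means -2 and +3.
v = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 1, 500)])

mu = np.array([-1.0, 1.0])      # initial parameter guesses
for _ in range(50):
    # E-step: q(h=k | v_i) = posterior responsibility under the current parameters.
    log_lik = -0.5 * (v[:, None] - mu[None, :]) ** 2   # up to a constant
    resp = np.exp(log_lik)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: maximize the expected complete-data log-likelihood w.r.t. the means.
    mu = (resp * v[:, None]).sum(axis=0) / resp.sum(axis=0)

print(mu)   # approaches [-2, 3]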
60
Deep Learning Research July 28, 2018
To prove this, it's easier to use the conventional definition where U is symmetric with zero diagonal, and we write E(x) as
E(x) = −Σ_{i=1}^d Σ_{j=i+1}^d x_i U_{i,j} x_j − Σ_{i=1}^d b_i x_i   (139)
where the difference is that we explicitly only sum over the upper triangle of U.
Intuitively, since p({x}_{j≠i}) = p_{i=on}(x) + p_{i=off}(x), our final formula for p_{i=on} should only contain terms involving the
parameters that interact with x_i, and only for those cases where x_i = 1. This motivates exploring the formula for ΔE_i(x) ≜
E_{i=off} − E_{i=on}, where I've dropped the explicit notation on x for simplicity/space. Before jumping in to deriving this, step back
and realize that ΔE_i will only contain summation terms where either the row or column index of U is i, plus the
bias element b_i. Since our summation is over the upper triangle of U, this means terms along the slices U_{i, i+1:d} and U_{1:i−1, i}.
Now there is no derivation needed and we can simply write
ΔE_i = Σ_{k=i+1}^d U_{i,k} x_k + Σ_{k=1}^{i−1} x_k U_{k,i} + b_i   (140)
The goal is to use this to get a logistic-regression-like formula for p_{i=on}, so we should now think about the relationship between
any given p(x) and the associated E(x). The critical observation is that E(x) = −ln p(x) − ln Z, which therefore means
ΔE_i = ln p_{i=on}(x) − ln p_{i=off}(x) = −ln[(1 − p_{i=on}(x)) / p_{i=on}(x)]   (141)
exp(−ΔE_i) = (1 − p_{i=on}(x)) / p_{i=on}(x) = 1/p_{i=on}(x) − 1   (142)
∴ p_{i=on}(x) = 1 / (1 + exp(−ΔE_i))   (143)
Since ΔE_i is a linear function of all other units, we have proven that p_{i=on}(x) for some state x reduces to logistic regression
over the other units.
41
Authors are being lazy because it’s assumed the reader is familiar (which is fair, I guess). i.e. they aren’t
mentioning that this formula implies that U is either lower or upper triangular, and the diagonal is zero.
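A brute-force numerical check of eq. (143) (my own sketch, with arbitrary parameters): enumerate the two settings of x_i for a tiny Boltzmann machine, compute p(x_i = 1 | x_{−i}) exactly, and compare against σ(ΔE_i).

import numpy as np

rng = np.random.default_rng(0)
d = 4
U = np.triu(rng.normal(size=(d, d)), k=1)   # upper-triangular, zero diagonal
b = rng.normal(size=d)

def p_tilde(x):
    return np.exp(x @ U @ x + b @ x)        # exp(-E(x)) with E from eq. (139)

i = 2
x_rest = np.array([1.0, 0.0, 0.0, 1.0])     # arbitrary values for the other units
x_on, x_off = x_rest.copy(), x_rest.copy()
x_on[i], x_off[i] = 1.0, 0.0

# Exact conditional by enumerating x_i = 1 vs x_i = 0
p_cond = p_tilde(x_on) / (p_tilde(x_on) + p_tilde(x_off))

# Logistic form, eq. (143)
delta_e = U[i, :] @ x_rest + U[:, i] @ x_rest + b[i]
print(np.isclose(p_cond, 1.0 / (1.0 + np.exp(-delta_e))))   # True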
61
Restricted Boltzmann Machines (20.2). A BM with variables partitioned into two sets:
hidden and observed. The graphical model is bipartite over the hidden and observed nodes,
as I’ve drawn in the example below.
[Figure: bipartite graph with observed units X1–X4 and hidden units H1–H3; no intra-layer connections.]
Although the joint distribution p(x, h) has a potentially intractable partition function, the
conditional distributions can be computed efficiently by exploiting independencies:
p(h | x) = ∏_{j=1}^{n_h} σ([2h − 1] ⊙ [c + W^T x])_j   (144)
p(x | h) = ∏_{i=1}^{n_x} σ([2x − 1] ⊙ [b + W h])_i   (145)
where b and c are the observed and hidden bias parameters, respectively.
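Eqs. (144)–(145) translate directly into block Gibbs sampling: sample all hidden units at once given x, then all visible units at once given h. A minimal numpy sketch (mine, with arbitrary small weights):

import numpy as np

rng = np.random.default_rng(0)
n_x, n_h = 6, 3
W = 0.1 * rng.normal(size=(n_x, n_h))   # arbitrary weights
b = np.zeros(n_x)                        # visible biases
c = np.zeros(n_h)                        # hidden biases

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

x = rng.integers(0, 2, size=n_x).astype(float)
for _ in range(1000):
    # p(h_j = 1 | x) = sigma(c_j + x^T W_{:,j}); all hidden units sampled in parallel
    h = (rng.random(n_h) < sigmoid(c + x @ W)).astype(float)
    # p(x_i = 1 | h) = sigma(b_i + W_{i,:} h); all visible units sampled in parallel
    x = (rng.random(n_x) < sigmoid(b + W @ h)).astype(float)

print(x, h)   # one (approximate) joint sample after many alternating updates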
Deep Belief Networks (20.3). Several layers of (usually binary) latent variables and a single
observed layer. The "deepest" (away from the observed) layer connections are undirected, and
all other layers are directed and pointing toward the data. I’ve drawn an example below.
[Figure: example DBN with visible units X1–X3 beneath layers of latent variables; the deepest layer is undirected, the rest directed toward the data.]
We can sample from a DBN via first Gibbs sampling on the undirected layer, then ancestral
sampling through the rest of the (directed) model to eventually obtain a sample from the
visible units.
62
Deep Boltzmann Machines (20.4). Same as DBNs, but now all layers are undirected. Note
that this is very close to the standard RBM, since we have a set of hidden and observed vari-
ables, except now we interpret certain subgroups of hidden units as being in a “layer”, thus
allowing for connections between hidden units in adjacent layers. What’s interesting is that
this still defines a bipartite graph, with odd-numbered layers on one side and even on the
other42 .
x ← g(z) = µ + Lz
p_x(x) = p_z(g^{−1}(x)) / |det(∂g/∂z)|   (146)
but it's usually far easier to use indirect means of learning g, rather than trying to maximize/evaluate
p_x(x) directly.
For case (b), the common approach is to train the generator net to emit conditional probabilities (Case b: interpret g(z) as emitting p(x | z)),
which can also support generating discrete data (case a cannot). The challenge in training
generator networks is that we often have a set of examples x, but the value of z for each x
is not fixed and known ahead of time. We’ll now look at some ways of training generator
nets given only training samples for x. Note that such a setting is very unlike unsupervised
learning, where we typically interpret x as inputs that we don’t have labels for, while here we
interpret x as outputs that we don’t know the associated inputs for.
42
Recall that this immediately implies that units in all odd layers are conditionally independent given the
even layers (and vice-versa for even to odd).
43
The [unique] Cholesky decomposition of a (real-symmetric) p.d. matrix A is a decomposition of the form
A = LLT , where L is lower triangular.
63
Variational Autoencoders (20.10.3). VAEs are directed models that use learned approx-
imate inference and can be trained purely with gradient-based methods. To generate a
sample x, the VAE first samples z from the code distribution p_model(z). This sample is
then fed through a differentiable generator network g(z). Finally, x is sampled from
p_model(x; g(z)) = p_model(x | z).
64
65
Papers and
Tutorials
Contents
4.1 WaveNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.2 Neural Style . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.3 Neural Conversation Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.4 NMT By Jointly Learning to Align & Translate . . . . . . . . . . . . . . . . . . . . . . . . 76
4.4.1 Detailed Model Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.5 Effective Approaches to Attention-Based NMT . . . . . . . . . . . . . . . . . . . . . . . . 79
4.6 Using Large Vocabularies for NMT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.7 Candidate Sampling – TensorFlow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.8 Attention Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.9 TextRank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.9.1 Keyword Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.9.2 Sentence Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.10 Simple Baseline for Sentence Embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.11 Survey of Text Clustering Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.11.1 Distance-based Clustering Algorithms . . . . . . . . . . . . . . . . . . . . . . . . 97
4.11.2 Probabilistic Document Clustering and Topic Models . . . . . . . . . . . . . . . . . 98
4.11.3 Online Clustering with Text Streams . . . . . . . . . . . . . . . . . . . . . . . . 100
4.12 Deep Sentence Embedding Using LSTMs . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.13 Clustering Massive Text Streams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.14 Supervised Universal Sentence Representations (InferSent) . . . . . . . . . . . . . . . . . . . 107
4.15 Dist. Rep. of Sentences from Unlabeled Data (FastSent) . . . . . . . . . . . . . . . . . . . . 108
4.16 Latent Dirichlet Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
4.17 Conditional Random Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
4.18 Attention Is All You Need . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
4.19 Hierarchical Attention Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
4.20 Joint Event Extraction via RNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
4.21 Event Extraction via Bidi-LSTM Tensor NNs . . . . . . . . . . . . . . . . . . . . . . . . . 125
4.22 Reasoning with Neural Tensor Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
4.23 Language to Logical Form with Neural Attention . . . . . . . . . . . . . . . . . . . . . . . 128
4.24 Seq2SQL: Generating Structured Queries from NL using RL . . . . . . . . . . . . . . . . . . 130
4.25 SLING: A Framework for Frame Semantic Parsing . . . . . . . . . . . . . . . . . . . . . . . 133
4.26 Poincaré Embeddings for Learning Hierarchical Representations . . . . . . . . . . . . . . . . . 135
66
4.27 Enriching Word Vectors with Subword Information (FastText) . . . . . . . . . . . . . . . . . 137
4.28 DeepWalk: Online Learning of Social Representations . . . . . . . . . . . . . . . . . . . . . 139
4.29 Review of Relational Machine Learning for Knowledge Graphs . . . . . . . . . . . . . . . . . 141
4.30 Fast Top-K Search in Knowledge Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
4.31 Dynamic Recurrent Acyclic Graphical Neural Networks (DRAGNN) . . . . . . . . . . . . . . . 146
4.31.1 More Detail: Arc-Standard Transition System . . . . . . . . . . . . . . . . . . . . . 149
4.32 Neural Architecture Search with Reinforcement Learning . . . . . . . . . . . . . . . . . . . . 150
4.33 Joint Extraction of Events and Entities within a Document Context . . . . . . . . . . . . . . . 152
4.34 Globally Normalized Transition-Based Neural Networks . . . . . . . . . . . . . . . . . . . . . 155
4.35 An Introduction to Conditional Random Fields . . . . . . . . . . . . . . . . . . . . . . . . 158
4.35.1 Inference (Sec. 4) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
4.35.2 Parameter Estimation (Sec. 5) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
4.35.3 Related Work and Future Directions (Sec. 6) . . . . . . . . . . . . . . . . . . . . . . 168
4.36 Co-sampling: Training Robust Networks for Extremely Noisy Supervision . . . . . . . . . . . . 169
4.37 Hidden-Unit Conditional Random Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
4.37.1 Detailed Derivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
4.38 Pre-training of Hidden-Unit CRFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
4.39 Structured Attention Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
4.40 Neural Conditional Random Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
4.41 Bidirectional LSTM-CRF Models for Sequence Tagging . . . . . . . . . . . . . . . . . . . . . 183
4.42 Relation Extraction: A Survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
4.43 Neural Relation Extraction with Selective Attention over Instances . . . . . . . . . . . . . . . 187
4.44 On Herding and the Perceptron Cycling Theorem . . . . . . . . . . . . . . . . . . . . . . . 189
4.45 Non-Convex Optimization for Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . 191
4.45.1 Non-Convex Projected Gradient Descent (3) . . . . . . . . . . . . . . . . . . . . . . 194
4.46 Improving Language Understanding by Generative Pre-Training . . . . . . . . . . . . . . . . . 195
4.47 Deep Contextualized Word Representations . . . . . . . . . . . . . . . . . . . . . . . . . . 196
4.48 Exploring the Limits of Language Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . 198
4.49 Connectionist Temporal Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
4.50 BERT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
4.51 Wasserstein is all you need . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
4.52 Noise Contrastive Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
4.52.1 Self-Normalized NCE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
4.53 Neural Ordinary Differential Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
4.54 On the Dimensionality of Word Embedding . . . . . . . . . . . . . . . . . . . . . . . . . . 212
4.55 Generative Adversarial Nets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
4.56 A Framework for Intelligence and Cortical Function . . . . . . . . . . . . . . . . . . . . . . 216
4.57 Large-Scale Study of Curiosity Driven Learning . . . . . . . . . . . . . . . . . . . . . . . . 217
4.58 Universal Language Model Fine-Tuning for Text Classification . . . . . . . . . . . . . . . . . . 218
4.59 The Marginal Value of Adaptive Gradient Methods in Machine Learning . . . . . . . . . . . . . 220
4.60 A Theoretically Grounded Application of Dropout in Recurrent Neural Networks . . . . . . . . . 221
4.61 Improving Neural Language Models with a Continuous Cache . . . . . . . . . . . . . . . . . . 222
4.62 Protection Against Reconstruction and Its Applications in Private Federated Learning . . . . . . . 223
4.63 Context Dependent RNN Language Model . . . . . . . . . . . . . . . . . . . . . . . . . . 225
67
4.64 Strategies for Training Large Vocabulary Neural Language Models . . . . . . . . . . . . . . . . 226
4.65 Product quantization for nearest neighbor search . . .. . . . . . . . . . . . . . . . . . . . 228
4.66 Large Memory Layers with Product Keys . . . . . .. . . . . . . . . . . . . . . . . . . . 229
4.67 Show, Ask, Attend, and Answer . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 231
4.68 Did the Model Understand the Question? . . . . . .. . . . . . . . . . . . . . . . . . . . 233
4.69 XLNet . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 234
4.70 Transformer-XL . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 236
4.71 Efficient Softmax Approximation for GPUs . . . . . .. . . . . . . . . . . . . . . . . . . . 237
4.72 Adaptive Input Representations for Neural Language Modeling . . . . . . . . . . . . . . . . . 238
4.73 Neural Module Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
4.74 Learning to Compose Neural Networks for QA . . . . . . . . . . . . . . . . . . . . . . . . 241
4.75 End-to-End Module Networks for VQA . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
4.76 Fast Multi-language LSTM-based Online Handwriting Recognition . . . . . . . . . . . . . . . 245
4.77 Multi-Language Online Handwriting Recognition . . . . . . . . . . . . . . . . . . . . . . . 246
4.78 Modular Generative Adversarial Networks . . . . . . . . . . . . . . . . . . . . . . . . . . 248
4.79 Transfer Learning from Speaker Verification to TTS . . . . . . . . . . . . . . . . . . . . . . 250
68
Papers and Tutorials January 15, 2017
WaveNet
Table of Contents Local Written by Brandon McKinzie
Introduction.
• Inspired by recent advances in neural autoregressive generative models, and based
on the PixelCNN architecture.
• Long-range dependencies dealt with via “dilated causal convolutions, which exhibit very
large receptive fields.”
The model outputs a categorical distribution over the next value xt with a softmax
layer and it is optimized to maximize the log-likelihood of the data w.r.t. the pa-
rameters.
Main ingredient of WaveNet is dilated causal convolutions, illustrated below. Note the absence
of recurrent connections, which makes them faster to train than RNNs, but at the cost of
requiring many layers, or large filters to increase the receptive field44 .
44
Loose interpretation of receptive fields here is that large fields can take into account more info (back in
time) as opposed to smaller fields, which can be said to be “short-sighted”
69
Excellent concise definition from paper:
f(x_t) = sign(x_t) · ln(1 + µ|x_t|) / ln(1 + µ)   (149)
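A small sketch of the µ-law transform of eq. (149) followed by uniform quantization (my own code; the paper quantizes to 256 values and µ = 255 is the usual companion choice – treat those constants as assumptions here).

import numpy as np

def mu_law_encode(x, mu=255, n_bins=256):
    # x assumed in [-1, 1]; returns integer bin indices in [0, n_bins - 1].
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)   # eq. (149)
    return np.digitize(compressed, np.linspace(-1, 1, n_bins + 1)[1:-1])

audio = np.sin(np.linspace(0, 20 * np.pi, 16000)) * 0.8   # fake waveform
print(mu_law_encode(audio)[:10])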
Gated activation and res/skip connections. Use the same gated activation unit as PixelCNN:
z = tanh(W_{f,k} ∗ x) ⊙ σ(W_{g,k} ∗ x)   (150)
where ∗ denotes the conv operator, ⊙ denotes elem-wise mult., k is the layer index, f, g denote filter/gate, and W is a learnable conv filter. This is illustrated below, along with the residual/skip
connections used to speed up convergence/enable training deeper models.
70
Conditional Wavenets. Can also model conditional distribution of x given some additional
h (e.g. speaker identity).
p(x | h) = ∏_{t=1}^T p(x_t | x_1, . . . , x_{t−1}, h)   (151)
→ Global conditioning. Single h that influences the output dist. across all times. Activation
becomes:
z = tanh(W_{f,k} ∗ x + V_{f,k}^T h) ⊙ σ(W_{g,k} ∗ x + V_{g,k}^T h)   (152)
→ Local conditioning (confusing). Have a second time-series h_t. They first transform this
h_t using a "transposed conv net" (learned upsampling) that maps it to a new time-series
y = f(h) with the same resolution as x.
Experiments.
• Text-to-Speech. Single-speaker datasets of 24.6 hours (English) and 34.8 hours (Chi-
nese) speech. Locally conditioned on linguistic features. Receptive field of 240 millisec-
onds. Outperformed both LSTM-RNN and HMM.
• Music. Trained the WaveNets to model two music datasets: (1) 200 hours of annotated
music audio, and (2) 60 hours of solo piano music from youtube. Larger receptive fields
sounded more musical.
• Speech Recognition. “With WaveNets we have shown that layers of dilated convolu-
tions allow the receptive field to grow longer in a much cheaper way than using LSTM
units.”
Conclusion (verbatim): “This paper has presented WaveNet, a deep generative model of audio
data that operates directly at the waveform level. WaveNets are autoregressive and combine
causal filters with dilated convolutions to allow their receptive fields to grow exponentially with
depth, which is important to model the long-range temporal dependencies in audio signals. We
have shown how WaveNets can be conditioned on other inputs in a global (e.g. speaker identity)
or local way (e.g. linguistic features). When applied to TTS, WaveNets produced samples that
outperform the current best TTS systems in subjective naturalness. Finally, WaveNets showed
very promising results when applied to music audio modeling and speech recognition.”
45
Speakers encoded as ID in form of a one-hot vector
71
Papers and Tutorials January 22
Neural Style
Table of Contents Local Written by Brandon McKinzie
Notation.
• Content image: p
• Filter responses: the matrix P^l ∈ R^{N_l × M_l} contains the activations of the filters in
layer l, where P_ij^l would give the activation of the ith filter at position j in layer l. N_l is the
number of feature maps, each of size M_l (height × width of the feature map)46.
• Reconstructed image: x (initially random noise). Denote its corresponding filter
response matrix at layer l as F^l.
Content Reconstruction.
1. Feed in content image p into pre-trained network, saving any desired filter responses
during the forward pass. These are interpreted as the various “encodings” of the image
done by the network. Think of them analogously to “ground-truth” labels.
2. Define x as the generated image, which we first initialize to random noise. We will be
changing the pixels of x via gradient descent updates.
3. Define the loss function. After each forward pass, evaluate with squared-error loss
between the two representations at the layer of interest:
L_content(p, x, l) = (1/2) Σ_{i,j} (F_ij^l − P_ij^l)²   (1)
∂L_content/∂F_ij^l = (F^l − P^l)_ij  if F_ij^l > 0,  and 0 if F_ij^l < 0   (2)
4. Compute iterative updates to x via gradient descent until it generates the same re-
sponse in a certain layer of the CNN as the original image p.
46
If not clear, Ml is a scalar, for any given value of l.
72
Style Representation. On top of the CNN responses in each layer, the authors built a style
representation that computes the correlations between the different [aforementioned] filter
responses. The correlation matrix for layer l is denoted in the standard way with a Gram
matrix Gl ∈ RNl ×Nl , with entries
G_ij^l = ⟨F_i^l, F_j^l⟩ = Σ_k F_ik^l F_jk^l   (3)
To generate a texture that matches the style of a given image, do the following.
1. Let a denote the original [style] image, with corresponding Al . Let x, initialized to
random noise, denote the generated [style] image, with corresponding Gl .
2. The contribution to the loss of layer l, denoted El , to the total loss, denoted Lstyle , is
given by
E_l = (1 / (4 N_l² M_l²)) Σ_{ij} (G_ij^l − A_ij^l)²   (4)
L_style(a, x) = Σ_{l=0}^L w_l E_l   (5)
∂E_l/∂F_ij^l = (1 / (N_l² M_l²)) ((F^l)^T (G^l − A^l))_ji  if F_ij^l > 0,  and 0 if F_ij^l < 0   (6)
where wl are [as of yet unspecified] weighting factors of the contribution of layer l to the
total loss.
Mixing content with style. Essentially a joint minimization that combines the previous
two main ideas.
1. Begin with the following images: white noise x, content image p, and style image a. The total loss is a weighted combination, L_total(p, a, x) = α L_content(p, x) + β L_style(a, x).
Note that we can choose which layers L_style uses by tweaking the layer weights w_l. For
example, the authors chose to set w_l = 1/5 for 'conv[1, 2, 4, 5]_1' and 0 otherwise. For
the ratio α/β, they explored 1 × 10^{−3} and 1 × 10^{−4}.
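A numpy sketch of eqs. (3)–(5) (mine; the filter responses are random stand-ins for real CNN activations): build Gram matrices for the generated and style images and accumulate the weighted layer losses.

import numpy as np

rng = np.random.default_rng(0)

def gram(F):
    # F has shape [N_l, M_l]: N_l feature maps, each flattened to M_l positions.
    return F @ F.T                                    # eq. (3)

def style_loss(F_layers, A_layers, w_layers):
    total = 0.0
    for F, A_resp, w in zip(F_layers, A_layers, w_layers):
        N_l, M_l = F.shape
        E_l = ((gram(F) - gram(A_resp)) ** 2).sum() / (4 * N_l**2 * M_l**2)   # eq. (4)
        total += w * E_l                              # eq. (5)
    return total

# Random stand-ins for activations of x (generated) and a (style) at two layers.
F_layers = [rng.normal(size=(64, 32 * 32)), rng.normal(size=(128, 16 * 16))]
A_layers = [rng.normal(size=(64, 32 * 32)), rng.normal(size=(128, 16 * 16))]
print(style_loss(F_layers, A_layers, w_layers=[0.5, 0.5]))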
73
Papers and Tutorials February 8
[Reminder: red text means I need to come back and explain what is meant, once I understand it.]
Abstract. This paper presents a simple approach for conversational modeling which uses the
sequence to sequence framework. It can be trained end-to-end, meaning fewer hand-crafted
rules. The lack of consistency is a common failure of our model.
Introduction. Major advantage of the seq2seq model is it requires little feature engineering
and domain specificity. Here, the model is tested on chat sessions from an IT helpdesk dataset
of conversations, as well as movie subtitles.
Related Work. The authors’ approach is based on the following (linked and saved) papers
on seq2seq:
• Kalchbrenner & Blunsom, 2013.
• Sutskever et al., 2014. (Describes Seq2Seq model)
• Bahdanau et al., 2014.
74
The thought vector is the hidden state of the model when it receives [as input] the end
of sequence symbol ⟨eos⟩, because it stores the info of the sentence, or thought, "ABC". The
authors acknowledge that this model will not be able to "solve" the problem of modeling
dialogue due to the objective function not capturing the actual objective achieved through
human communication, which is typically longer term and based on exchange of information
[rather than next step prediction]47. (Ponder: what would be a reasonable objective function & model for conversation?)
47
I’d imagine that, in order to model human conversation, one obvious element needed would be a memory.
Reminds me of DeepMind’s DNC. There would need to be some online filtering & output process to capture the
crucial aspects/info to store in memory for later, and also some method of retrieving them when needed later.
The method for retrieval would likely be some inference process where, given a sequence of inputs, the probability
of them being related to some portion of memory could be trained. This would allow for conversations that
stretch arbitrarily back in the past. Also, when storing the memories, I’d imagine a reasonable architecture
would be some encoder-decoder for a sparse distributed representation of memory.
75
Papers and Tutorials February 27
[Bahdanau et. al, 2014]. The primary motivation for me writing this is to better understand
the attention mechanism in my sequence to sequence chatbot implementation.
Abstract. The authors claim that using a fixed-length vector [in the vanilla encoder-decoder
for NMT] is a bottleneck. They propose allowing a model to (soft-)search for parts of a source
sentence that are relevant to predicting a target word, without having to form these parts as
a hard segment explicitly.
where the function eij is given by an alignment model which scores how well the
inputs around position j and the output at position i match.
• Encoder. It's just a bidirectional RNN. What they call "annotation h_j" is literally just
a concatenated vector of h_j^forward and h_j^backward.
48
By “align” the authors are referring to aligning the source-search to the relevant parts for prediction.
76
4.4.1 Detailed Model Architecture
Decoder Internals. It’s just a GRU. However, it will be helpful to detail how we format the
inputs (given we now have attention). Wherever we’d usually pass the previous decoder state
si−1 , we now pass a concatenated state, [si−1 , ci ], that also contains the ith context vector.
Below I go over the flow of information from GRU input to output:
1. Notation: yt is the loop-embedded output of the decoder (prediction) at time t, st is
the internal hidden state of the decoder at time t, and ct is the context vector at time t.
s̃t is the proposed/proposal state at time t.
2. Gates:
3. Proposal state:
4. Hidden state:
Alignment Model. All equations enumerated below are for some timestep t during the
decoding process.
1. Score: For all j ∈ [0, Lenc −1] where Lenc is the number of words in the encoder sequence,
compute:
77
Decoder Outputs. All below is for some timestep t during the decoding process. To find the
probability of some (one-hot) word y [at timestep t]:
Pr(y | s, c) ∝ e^{y^T W_o u}   (166)
u = [max{ũ_{2j−1}, ũ_{2j}}]^T_{j=1,...,ℓ}   (167)
ũ = U_o [s_{t−1}, c] + V_o y_{t−1}   (168)
N.B.: From reading other (and more recent) papers, these last few equations do not appear to
be the way it is usually done (thank the lord). See Luong’s work for a much better approach.
78
Papers and Tutorials May 11
Global Attention. Now I’ll describe in detail the processes involved in ht → at → ct → h̃t .
1. ht : Compute the hidden state ht in the normal way (not obvious if you’ve read e.g.
Bahdanau’s work...)
2. at :
(a) Compute the scores between ht and each source h̄s , where our options are:
score(h_t, h̄_s) = h_t^T h̄_s  (dot);  h_t^T W_a h̄_s  (general);  v_a^T tanh(W_a [h_t; h̄_s])  (concat)   (171)
(b) Compute the alignment vector at of length Lenc (number of words in the encoder
sequence):
79
3. ct : The weighted average over all source (encoder) hidden states49 :
c_t = Σ_{i=1}^{L_enc} a_t(i) h̄_i   (174)
4. h̃t : For convenience, I’ll copy the equation for h̃t again here:
Input-Feeding Approach. A copy of each output h̃t is sent forward and concatenated with
the inputs for the next timestep, i.e. the inputs go from xt+1 to [h̃t ; xt+1 ].
49
NOTE: Right after mentioning the context vector, the authors have the following cryptic footnote that
may be useful to ponder: For short sentences, we only use the top part of at and for long sentences, we ignore
words near the end.
80
Papers and Tutorials March 11
Paper information:
- Full title: On Using Very Large Target Vocabulary for Neural Machine Translation.
- Authors: Jean, Cho, Memisevic, Bengio.
- Date: 18 Mar 2015.
- [arXiv link]
where f is the function defined by the cell state (e.g. GRU/LSTM/etc.). Then the decoder
generates the output sequence y with the probability given below. (The functions q, g, and r are just placeholders – "some function of [inputs].")
y = (y_1, . . . , y_{T'})  [y_i ∈ Z]   (179)
Pr[y_t | y_{<t}, x] ∝ e^{q(y_{t−1}, z_t, c_t)}   (180)
z_t = g(y_{t−1}, z_{t−1}, c_t)  [decoder hidden?]   (181)
c_t = r(z_{t−1}, h_1, . . . , h_T)  [decoder inp?]   (182)
As usual, model is jointly trained to maximize the conditional log-likelihood of correct transla-
tion. For N training sample pairs (xn , y n ), and denoting the length of the n-th target sentence
as Tn , this can be written as,
θ* = arg max_θ Σ_{n=1}^N Σ_{t=1}^{T_n} log Pr[y_t^n | y_{<t}^n, x^n]   (183)
81
Model Details. Above is the general structure. Here I’ll summarize the specific model chosen
by the authors.
• Encoder. Bi-directional, which just means h_t = [h_t^backward; h_t^forward]. The chosen cell
state (the function f) is GRU.
• Decoder. At each timestep, computes the following:
→ Context vector c_t. (Here a is a standard single-hidden-layer NN.)
c_t = Σ_{i=1}^T α_i h_i   (184)
α_t = e^{a(h_t, z_{t−1})} / Σ_k e^{a(h_k, z_{t−1})}   (185)
→ Decoder hidden state zt . Also a GRU cell. Computed based on the previous hidden
state zt−1 , the previously generated symbol yt−1 , and also the computed context
vector ct .
• Next-word probability. They model equation 180 as50 (reminder: y_i is an integer token, while w_i is the target vector of length vocab size):
Pr[y_t | y_{<t}, x] = (1/Z) e^{w_t^T φ(y_{t−1}, z_t, c_t) + b_t}   (186)
Z = Σ_{k: y_k ∈ V} e^{w_k^T φ(y_{t−1}, z_t, c_t) + b_k}   (187)
50
Note: The formula for Z is correct. Notice that the only part of the RHS of Pr(yt ) with a t is as the
subscript of w. To be clear, wk is a full word vector and the sum is over all words in the output vocabulary, the
index k has absolutely nothing to do with timestep. They use the word target but make sure not to misinterpret
that as somehow meaning target words in the sentence or something.
51
NOTE TO SELF: After long and careful consideration, I’m concluding that the authors made a typo
when defining E(yj ), which they choose to subscript all parts of the RHS with j, but that is in direct contradiction
with a step-by-step derivation, which is why I have written it the way it is. I’m pretty sure my version is right,
but I know you’ll have to re-derive it yourself next time you see this. And you’ll somehow prove me wrong.
Actually, after reading on further, I doubt you’ll prove me wrong. Challenge accepted, me. Have fun!
82
The crux of the approach is interpreting the second term as EP [∇E(y)], where P denotes P r(y |
y<t , x). They approximate this expectation by taking it over a subset V 0 of the predefined
proposal distribution Q. So Q is a p.d.f. over the possible yi , and we sample from Q to
generate the elements of the subset V 0 .
E_P[∇E(y)] ≈ Σ_{k: y_k ∈ V'} (ω_k / Σ_{k': y_{k'} ∈ V'} ω_{k'}) ∇E(y_k)   (190)
Here is some math I did that was illuminating to me; I’m not sure why the authors didn’t
point out these relationships.
ω_k = e^{E(y_k)} / Q(y_k),  thus  p(y_k | y_{<t}, x) = (Q(y_k)/Z) ω_k   (192)
→ e^{E(y_k)} = Z · p(y_k | y_{<t}, x) = Q(y_k) · ω_k   (193)
Below are the exact and approximate formulas for EP [∇E(y)] written in a seductive
suggestive manner. Pay careful attention to subscripts and primes.
E_P[∇E(y)] = Σ_{k: y_k ∈ V} (ω_k · Q(y_k) / Σ_{k': y_{k'} ∈ V} ω_{k'} · Q(y_{k'})) ∇E(y_k)   (194)
E_P[∇E(y)] ≈ Σ_{k: y_k ∈ V'} (ω_k / Σ_{k': y_{k'} ∈ V'} ω_{k'}) ∇E(y_k)   (195)
They're almost the same! It's much easier to see why when written this way. I interpret
the difference as follows: in the exact case, we explicitly attach the probabilities Q(y_k)
and sum over all values in V. In the second case, by sampling a subset V' from Q, we
have encoded these probabilities implicitly as the relative frequency of elements y_k in V'.
“In practice, we partition the training corpus and define a subset V 0 of the target
vocabulary for each partition prior to training. Before training begins, we sequen-
tially examine each target sentence in the training corpus and accumulate unique
target words until the number of unique target words reaches the predefined thresh-
old τ . The accumulated vocabulary will be used for this partition of the corpus
during training. We repeat this until the end of the training set is reached. Let us
refer to the subset of target words used for the i-th partition by Vi0 .
83
Papers and Tutorials March 19
[Link to article]
What is Candidate Sampling The goal is to learn a compatibility function F (x, y) which
says something about the compatibility of a class y with a context x. Candidate sampling:
for each training example (xi , yi ), only need to evaluate F (x, y) for a small set of classes
{Ci } ⊂ {L}, where {L} is the set of all possible classes (vocab size number of elements). We
represent F (x, y) as a layer that is trained by back-prop from/within the loss function.
C.S. for Sampled Softmax. I’ll further narrow this down to my use case of having exactly
1 target class (word) at a given time. Any other classes are referred to as negative classes
(for that example).
Training task. We are given this set Ci and want to find out which element of Ci is the
target class yi . In other words, we want the posterior probability that any of the y in Ci are
the target class, given what we know about Ci and xi . We can evaluate and rearrange as usual
with Bayes’ rule to get:
Pr(y_i^true = y | C_i, x_i) = Pr(y_i^true = y | x_i) · Pr(C_i | y_i^true = y, x_i) / Pr(C_i | x_i)   (196)
                            = [Pr(y | x_i) / Q(y | x_i)] · [1 / K(x_i, C_i)]   (197)
K(x_i, C_i) ≜ Pr(C_i | x_i) / [∏_{y'∈C_i} Q(y' | x_i) · ∏_{y'∈(L−C_i)} (1 − Q(y' | x_i))]   (198)
84
Clarifications.
• The learning function F (x, y) is the input to our softmax. It is our neural network,
excluding the softmax function.
• After training our network, it should have learned the general form
Softmax(log(Pr(y | x)) + K(x)) = e^{log Pr(y|x) + K(x)} / Σ_{y'} e^{log Pr(y'|x) + K(x)}   (200)
                               = Pr(y | x)   (201)
Note that I’ve been a little sloppy here, since Pr(y | x) up until the last line actually
represented the (possibly) unnormalized/relative probabilities.
• [MAIN TAKEAWAY]. Time to bring it all together. Notice that we've only trained
F(x, y) to include part of what's needed to compute the probability of any y being the
target given x_i and C_i . . . equation 199 doesn't take into account C_i at all! Luckily we
know the form of the full equation because it is just the log of equation 197. We can
easily satisfy that by subtracting log(Q(y | x)) from F(x, y) right before feeding into the
softmax.
TL;DR. Train the network to learn F(x, y) before the softmax, but instead of feeding F(x, y)
to the softmax directly, feed F(x, y) − log(Q(y | x)).
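A minimal numpy sketch of the takeaway (mine, with random stand-in values): the network's logits F(x, y) for the sampled classes get log Q(y | x) subtracted before the softmax, so F is free to converge toward log Pr(y | x) + K(x) instead of having to absorb the sampling bias.

import numpy as np

rng = np.random.default_rng(0)
num_sampled = 8

F = rng.normal(size=num_sampled)            # F(x, y) for the true class + sampled negatives
Q = rng.dirichlet(np.ones(num_sampled))     # Q(y | x): sampling probabilities of those classes

corrected = F - np.log(Q)                    # subtract log Q(y | x) before the softmax
probs = np.exp(corrected - corrected.max())
probs /= probs.sum()
print(probs)                                 # the training loss (e.g. cross-entropy) uses these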
85
Papers and Tutorials April 04
Attention Terminology
Table of Contents Local Written by Brandon McKinzie
Generally useful info. Seems like there are a few notations floating around, and here I’ll at-
tempt to set the record straight. The order of notes here will loosely correspond with the order
that they’re encountered going from encoder output to decoder output.
Jargon. The people in the attention business love obscure names for things that don’t need
names at all. Terminology:
• Attentions keys/values: Encoder output sequence.
• Query: Decoder [cell] state. Typically the most recent one.
• Scores: Values of eij . For the Bahdanau version, in code this would be computed via
When someone lazily calls some layer output the “attention”, they are usually referring to the
layer just after the linear combination/map of encoder hidden states. You’ll often see this as
some vague function of the previous decoder state, context vector, and possibly even decoder
output (after project), like f (si−1 , yi−1 , ci ). In 99.9% of cases, this function is just a fully
connected layer (if even needed) to map back to the state size for decoder input. That is it.
From encoder to decoder. The path of information flow from encoder outputs to decoder
inputs, a non-trivial process that isn’t given the attention (heh) it deserves52
1. Encoder outputs. Tensor of shape [batch size, sequence length, state size]. The
state is typically some RNNCell state.
• Note: TensorFlow's AttentionMechanism classes will actually convert this to [batch
size, L_enc, attention size], and refer to it as the "memory". It is also what is returned when calling myAttentionMech.values.
52
For some reason, the literature favors explaining the path “backwards”, starting with the highly abstracted
“decoder inputs as a weighted sum of encoder states” and then breaking down what the weights are. Unfortu-
nately, the weights are computed via a multi-stage process so that becomes very confusing very quick.
86
2. Compute the scores. The attention scores are the computation described by Lu-
ong/Bahdanau techniques. They both take an inner product of sorts on copies of the
encoder outputs and decoder previous state (query). The main choices are:
score(h_t, h̄_s) = h_t^T h̄_s  (dot);  h_t^T W_a h̄_s  (general);  v_a^T tanh(W_a [h_t; h̄_s])  (concat)   (204)
(Synonyms: scores, unnormalized alignments.)
where the shapes are as follows (for single timestep during decoding process):
• h̄s : [batch size, 1, state size]
• ht : [batch size, 1, state size]
• Wa : [batch size, state size, state size]
• score(ht , h̄s ): [batch size]
3. Softmax the scores. In the vast majority of cases, the attention scores are next fed
through a softmax to convert them into a valid probability distribution. Most papers will
call this some vague probability function, when in reality they are using softmax only. (Synonyms: softmax outputs, attention dist., alignments.)
4. Compute the context vector. The inner product of the softmax outputs and the
raw encoder outputs. This will have shape [batch size, attention size] in TensorFlow,
where attention size is from the constructor for your AttentionMechanism. (Synonyms: context vector, attention.)
5. Combine context vector and decoder output: Typically with a concat. The result is
what people mean when they say “attention”. Luong et al. denotes this as h̃t , the decoder
output at timestep t. This is what TensorFlow means by “Luong-style mechanisms output
the attention.” And yes, these are used (at least for Luong) to compute the prediction:
87
Papers and Tutorials May 03
TextRank
Table of Contents Local Written by Brandon McKinzie
Introduction. A graph-based ranking algorithm is a way of deciding on the importance of a
vertex within a graph, by taking into account global information recursively computed from
the entire graph, rather than relying only on local vertex-specific information. TextRank is
a graph-based ranking model for graphs extracted from natural language texts. The authors
investigate/evaluate TextRank on unsupervised keyword and sentence extraction. (Semantic graph: one whose structure encodes meaning between the nodes, the semantic elements.)
The TextRank PageRank Model. In general [graph-based ranking], a vertex can be ranked
based on certain properties such as the number of vertices pointing to it (in-degree), how highly-
ranked those vertices are, etc. Formally, the authors [of PageRank] define the score of a vertex
Vi as follows:
S(V_i) = (1 − d) + d · Σ_{V_j ∈ In(V_i)} (1 / |Out(V_j)|) · S(V_j),  where d ∈ [0, 1]   (209)
(The factor d is usually set to 0.85.)
and the damping factor d is interpreted as the probability of jumping from a given vertex53 to
another random vertex in the graph. In practice, the algorithm is implemented through the
following steps:
(1) Initialize all vertices with arbitrary values.54
(2) Iterate over vertices, computing equation 4.9 until convergence [of the error rate] below
a predefined threshold. The error-rate, defined as the difference between the "true score"
and the score computed at iteration k, S k (Vi ), is approximated as:
53
Note that d is a single parameter for the graph, i.e. it is the same for all vertices.
54
The authors do not specify what they mean by arbitrary. What range? What sampling distribution?
Arbitrary as in uniformly random? EDIT: The authors claim that the vertex values upon completion are not
affected by the choice of initial value. Investigate!
88
Weighted Graphs. In contrast with the PageRank model, here we are concerned with natural
language texts, which may include multiple or partial links between the units (vertices). The
authors hypothesize that modifying equation 4.9 to incorporate weighted connections may be
useful for NLP applications.
WS(V_i) = (1 − d) + d · Σ_{j ∈ In(V_i)} [w_{ji} / Σ_{V_k ∈ Out(V_j)} w_{jk}] · WS(V_j)   (211)
(w_ij denotes the connection weight between vertices V_i and V_j.)
where I’ve shown the modified part in green. The authors mention they set all weights to
random values in the interval 0-10 (no explanation).
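A small Python sketch of the weighted update in eq. (211) (mine; the toy graph and its weights are arbitrary), iterated until scores change by less than a threshold:

import numpy as np

# Symmetric weighted adjacency matrix for a toy 4-vertex graph (w_ij = w_ji).
w = np.array([[0, 3, 1, 0],
              [3, 0, 2, 1],
              [1, 2, 0, 0],
              [0, 1, 0, 0]], dtype=float)
d = 0.85
scores = np.ones(len(w))                      # arbitrary initialization

for _ in range(100):
    out_strength = w.sum(axis=1)              # sum_k w_jk for each vertex j
    new = (1 - d) + d * (w / out_strength[:, None]).T @ scores   # eq. (211)
    if np.abs(new - scores).max() < 1e-4:
        break
    scores = new

print(scores)    # higher score = more "important" vertex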
89
4.9.1 Keyword Extraction
Graph. The authors apply TextRank to extract words/phrases that are representative for a
given document. The individual graph components are defined as follows:
- Vertex: sequence of one or more lexical units from the text.
– In addition, we can restrict which vertices are added to the graph with syntactic filters.
– Best filter [for the authors]: nouns and adjectives only.
- Edge: two vertices are connected if their corresponding lexical units co-occur within a
window of N words55. (Typically N ∈ Z[2, 10].)
Procedure:
(1) Pre-Processing: Tokenize and annotate [with POS] the text.
(2) Build the Graph: Add all [single] words to the graph that pass the syntactic filter, and
connect [undirected/unweighted] edges as defined earlier (co-occurrence).
(3) Run algorithm: Initialize all scores to 1. For a convergence threshold of 0.0001, usually
takes about 20-30 iterations.
(4) Post-Processing:
(i) Keep the top T vertices (by score), where the authors chose T = |V |/3.56 Remember
that vertices are still individual words.
(ii) From the new subset of T keywords, collapse any that were adjacent in the original
text in a single lexical unit.
Evaluation. The data set used is a collection of 500 abstracts, each with a set of keywords.
Results are evaluated using precision, recall, and F-measure57 . The best results were
obtained with a co-occurrence window of 2 [on an undirected graph], which yielded:
The authors found that larger window size corresponded with lower precision, and that directed
graphs performed worse than undirected graphs.
55
That is . . . simpler than expected. Can we do better?
56
Another approach is to have T be a fixed value, where typically 5 < T < 20.
57
Brief terminology review:
• Precision: fraction of keywords extracted that are in the "true" set of keywords.
• Recall: fraction of "true" keywords that are in the extracted set of keywords.
• F-score: combining precision and recall to get a single number for evaluation: F = 2pr / (p + r)
(A PR Curve plots precision as a function of recall.)
90
4.9.2 Sentence Extraction
Evaluation. The data set used is 567 news articles. For each article, TextRank generates a
100-word summary (i.e. they set T = 100). They evaluate with the Rouge evaluation toolkit
(Ngram statistics).
91
Papers and Tutorials June 12, 2017
Overview. It turns out that simply taking a weighted average of word vectors and doing
some PCA/SVD is a competitive way of getting unsupervised sentence embeddings. Apparently
it beats supervised learning with LSTMs (?!). The authors claim the theoretical explanation
for this method lies in a latent variable generative model for sentences (of course). (Discussion based on the paper by Arora et al., 2017.)
Algorithm.
1. Compute the weighted average of the word vectors in the sentence:
where wi is the word vector for the ith word in the sentence, a is a parameter, and p(wi )
is the (estimated) word frequency [over the entire corpus].
2. Remove the projections of the average vectors on their first principal component (“com-
mon component removal”) (y tho?).
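A numpy sketch of the two steps (mine; it assumes word vectors, estimated frequencies p(w), and the paper's weight a/(a + p(w)) – treat that weighting formula as my recollection of the paper rather than something stated above):

import numpy as np

def sif_embeddings(sentences, word_vecs, word_freq, a=1e-3):
    # sentences: list of token lists; word_vecs: dict word -> vector; word_freq: dict word -> p(w)
    dim = len(next(iter(word_vecs.values())))
    V = np.zeros((len(sentences), dim))
    for i, sent in enumerate(sentences):
        # Step 1: weighted average of word vectors, downweighting frequent words.
        weights = np.array([a / (a + word_freq[w]) for w in sent])
        vecs = np.array([word_vecs[w] for w in sent])
        V[i] = weights @ vecs / len(sent)
    # Step 2: remove projections onto the first principal component ("common component removal").
    u = np.linalg.svd(V, full_matrices=False)[2][0]    # first right singular vector
    return V - np.outer(V @ u, u)

The first singular vector of the stacked sentence matrix plays the role of the common discourse direction c_0 discussed in the theory below.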
92
Theory. Latent variable generative model. The model treats corpus generation as a dynamic
process, where the t-th word is produced at time step t, driven by the random walk of a
discourse vector c_t ∈ R^d (d is the size of the embedding dimension). The discourse vector is not
pointing to a specific word; rather, it describes what is being talked about. We can tell how
related (correlated) the discourse is to any word w and corresponding vector v_w by taking the
inner product c_t · v_w. Similarly, we model the probability of observing word w at time t, w_t, as:
Pr[w_t = w | c_t] ∝ exp(c_t · v_w)   (214)
• The Random Walk. If we assume that ct doesn’t change much over the words in
a single sentence, we can assume it stays at some cs . The authors claim that in their
previous paper they showed that the MAP58 estimate of cs is – up to multiplication by
a scalar – the average of the embeddings of the words in the sentence.
• Improvements/Modifications to 214.
1. Additive term αp(w) where α is a scalar. Allows words to occur even if ct · vw is
very small.
2. Common discourse vector c0 ∈ <d . Correction term for the most frequent discourse
that is often related to syntax.
• Model. Given the discourse vector cs for a sentence s, the probability that w is in the
sentence (at all (?)):
Pr[w | c_s] = α p(w) + (1 − α) e^{c̃_s · v_w} / Z_{c̃_s}   (215)
c̃_s = β c_0 + (1 − β) c_s   (216)
with c_0 ⊥ c_s, and Z_{c̃_s} is a normalization constant, taken over all w ∈ V.
58
Review of MAP:
θ_MAP = arg max_θ Σ_i log(p_X(x_i | θ) p(θ))
93
Papers and Tutorials July 03, 2017
Introduction. The unique characteristics for clustering text, as opposed to more traditional
(numeric) clustering, are (1) large dimensionality but highly sparse data, (2) words are typ-
ically highly correlated, meaning the number of principal components is much smaller than
the feature space, and (3) the number of words per document can vary, so we must normalize
appropriately.
94
– Sij ∈ (0, 1) is the similarity between doc i and j.
– dij is the distance between i and j after the term t is removed
– d¯ is the average distance between the documents after the term t is removed.
LSI-based Methods. Latent Semantic Indexing is based on dimensionality reduction where See ref 28. of paper for
more on LSI.
the new (transformed) features are a linear combination of the originals. This helps magnify
the semantic effects in the underlying data. LSI is quite similar to PCA59 , except that we use
an approximation of the covariance matrix C which is appropriate for the sparse and high-
dimensional nature of text data.
Let A ∈ Rn×d be term-document matrix, where Ai,j is the (normalized) frequency for term
j in document i. Then AT A = n · Σ is the (scaled) approximation to covariance matrix60 ,
assuming the data is mean-centered. Quick check/reminder:
where the expectation is technically over the underlying data distribution, which gives e.g.
P(a_i = x), the probability of the ith word in our vocabulary having frequency x. Apparently,
since the data is sparse, we don't have to worry much about it actually being mean-centered
(why?).
As usual, we use the eigenvectors of A^T A with the largest variance in order to represent the
text61. In addition:
One excellent characteristic of LSI is that the truncation of the dimensions removes the
noise effects of synonymy and polysemy, and the similarity computations are more closely
affected by the semantic concepts in the data.
[59] The difference between LSI and PCA is that PCA subtracts out the means, which destroys the sparseness of the design matrix.
[60] Approximation because it is based on our training data, not on true expectations over the underlying data distribution.
[61] In typical collections, only about 300 to 400 eigenvectors are required for the representation.
Non-negative Matrix Factorization. Another latent-space method (like LSI), but partic-
ularly suitable for clustering. The main characteristics of the NMF scheme:
• In LSI, the new basis system consists of a set of orthonormal vectors. This is not the
case for NMF.
• In NMF, the vectors in the basis system directly correspond to cluster topics. There-
fore, the cluster membership for a document may be determined by examining the largest
component of the document along any of the [basis] vectors.
Assume we want to create k clusters, using our n documents and vocabulary size d. The goal
of NMF is to find matrices U ∈ Rn×k and V ∈ Rd×k that minimize:
J = (1/2) ||A − U V^T||_F^2,   subject to u_ij ≥ 0, v_ij ≥ 0    (223)
  = (1/2) [ tr(A A^T) − 2 tr(A V U^T) + tr(U V^T V U^T) ]    (224)
Note that the columns of V provide the k basis vectors which correspond to the k different
clusters. We can interpret this as trying to factorize A ≈ UV T . For each row, a, of A
(document vector), this is
a ≈ u · V^T    (225)
  = Σ_{i=1}^{k} u_i V_i^T    (226)
Lagrange-multiplier stuff: Our optimization problem can be solved using the Lagrange method.
• Variables to optimize: All elements of both U = [uij ] and V = [vij ]
• Constraint: non-negativity, i.e. ∀i, j, uij ≥ 0 and vij ≥ 0.
• Multipliers: Denote as matrices α and β, with same dimensions as U and V, respectively.
• Lagrangian: I'll just show it here first, and then explain in footnote 62. (Any matrix multiplication written with a · is just a reminder to think of the matrices as column vectors.)
L = J + tr(α · U^T) + tr(β · V^T)    (227)
where tr(α · U^T) = Σ_{i=1}^{n} α_i · u_i = Σ_{i=1}^{n} Σ_{j=1}^{n} α_{ij} u_{ij}    (228)
You should think of α as a column vector of length n, and U T as a row vector of length n.
The reason we prefer L over just J is because now we have an unconstrained optimization
problem.
[62] Recall that in Lagrangian minimization, L takes the form of [the-function-to-be-minimized] + λ([constraint-function] − [expected-value-of-constraint-at-optimum]). So the second term is expected to tend toward zero (i.e. a critical point) at the optimal values. In our case, since our optimal value is sort-of (?) at 0 for any value of uij and/or vij, we just have a sum over [Lagrange-mult] × [variable].
• Optimization: Set the partials of L w.r.t. both U and V (separately) to zero (footnote 63):
∂L/∂U = −A · V + U · V^T · V + α = 0    (229)
∂L/∂V = −A^T · U + V · U^T · U + β = 0    (230)
Since, ultimately, these just say [some matrix] = 0, we can multiply both sides (element-wise) by a constant (x × 0 = 0). Using the Kuhn–Tucker conditions (footnote 64) α_ij · u_ij = 0 and β_ij · v_ij = 0, we get:
• Update rules:
u_ij = [ (A · V)_ij · u_ij ] / (U · V^T · V)_ij    (233)
v_ij = [ (A^T · U)_ij · v_ij ] / (V · U^T · U)_ij    (234)
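A minimal NumPy sketch of these multiplicative updates (eqns. 233–234), assuming a dense n×d term-document matrix A; the function name and the small eps for numerical stability are mine:

import numpy as np

def nmf_multiplicative(A, k, n_iters=200, eps=1e-10, seed=0):
    # Factorize A ≈ U V^T with nonnegative U (n x k) and V (d x k).
    rng = np.random.default_rng(seed)
    n, d = A.shape
    U = rng.random((n, k))
    V = rng.random((d, k))
    for _ in range(n_iters):
        U *= (A @ V) / (U @ (V.T @ V) + eps)        # eqn. 233
        V *= (A.T @ U) / (V @ (U.T @ U) + eps)      # eqn. 234
    return U, V

# Cluster assignment for document i: the largest component of row i of U.
# labels = U.argmax(axis=1)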
One challenge in clustering short segments of text (e.g., tweets) is that exact keyword matching may not work well. One general strategy for solving this problem is to expand the text representation by exploiting related text documents, which is related to smoothing of a document language model in information retrieval. (See ref. 66 in the paper for computing similarities of short text segments.)
[63] Recall that the Lagrangian consists entirely of traces (re: scalars). Derivatives of traces with respect to matrices output the same dimension as that matrix, and derivatives are taken element-wise as always.
[64] i.e. the equations that follow are not the KT conditions, they just use/exploit them. . .
[65] Here, transitivity of similarity means if A is similar to B, and B is similar to C, then A is similar to C. This is not guaranteed by any means for textual similarity, and so we can end up with A and Z in the same cluster, even though they aren't similar at all.
• Group-Average Linkage Clustering. Similarity between two clusters is the average similarity over all unique pairwise combinations of documents from one cluster to the other. One way to speed up this computation with an approximation is to just compute the similarity between the mean vectors of the two clusters.
• Complete Linkage Clustering. Similarity between two clusters is the worst-case
similarity between any pair of documents.
Overview. Primary assumptions in any topic modeling approach (from pg. 31/52 of the paper):
• The n documents in the corpus are assumed to each have a probability of belonging to one
of k topics. Denote the probability of document Di belonging to topic Tj as Pr [Tj | Di ].
This allows for soft cluster membership in terms of probabilities.
• Each topic is associated with a probability vector, which quantifies the probability of
the different terms in the lexicon for that topic. For example, consider a document that
belongs completely to topic Tj . We denote the probability of term tl occurring in that
document as Pr [tl | Tj ].
The two main methods for topic modeling are Probabilistic Latent Semantic Indexing (PLSI, often also called PLSA) and Latent Dirichlet Allocation (LDA).
PLSA. We note that the two aforementioned probabilities, Pr [Tj | Di ] and Pr [tl | Tj ] allow
us to calculate Pr [tl | Di ]: the probability that term tl occurs in some document Di :
Pr[t_l | D_i] = Σ_{j=1}^{k} Pr[t_l | T_j] · Pr[T_j | D_i]    (235)
Recall that we also have our n × d term-document matrix X, where X_{i,l} gives the number of times term l occurred in document D_i. This allows us to do maximum likelihood! Our negative log-likelihood, J, can be derived as follows:
(Interpret Pr[X] as the joint probability of observing the words in our data, with their associated frequencies.)
J = − log(Pr[X])    (236)
  = − log Π_{i,l} Pr[t_l | D_i]^{X_{i,l}}    (237)
  = − Σ_{i,l} X_{i,l} · log(Pr[t_l | D_i])    (238)
and we can plug in eqn. 235 to evaluate Pr[t_l | D_i]. We want to optimize the value of J, subject to the constraints:
(∀ T_j): Σ_l Pr[t_l | T_j] = 1,    (∀ D_i): Σ_j Pr[T_j | D_i] = 1    (239)
This can be solved with a Lagrangian method, similar to the process for NMF described earlier.
See page 33/52 of the paper for details.
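A minimal NumPy sketch of eqns. 235 and 238, assuming we already hold the two probability tables as matrices (the variable names are mine):

import numpy as np

def plsa_negative_log_likelihood(X, P_term_given_topic, P_topic_given_doc, eps=1e-12):
    # X: (n x d) term counts; P_term_given_topic: (k x d); P_topic_given_doc: (n x k).
    P_term_given_doc = P_topic_given_doc @ P_term_given_topic   # eqn. 235
    return -np.sum(X * np.log(P_term_given_doc + eps))          # eqn. 238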
Latent Dirichlet Allocation (LDA). The term-topic probabilities and topic-document probabilities are modeled with a Dirichlet distribution as a prior (footnote 67). Typically preferred over PLSI because PLSI is more prone to overfitting.
[66] This is actually pretty bad notation, and borderline incorrect. Pr[Tj | Di] is NOT a conditional probability! It is our prior! It is literally Pr[ClusterOf(Di) = Tj].
[67] LDA is the Bayesian version of PLSI.
4.11.3 Online Clustering with Text Streams
Reference List: [3]: A Framework for Clustering Massive Text and Categorical Data Streams; [112]: Efficient Streaming Text Clustering;
[48]: Bursty feature representation for clustering text streams; [61]: Clustering Text Data Streams (Liu et al.)
Overview. Maintaining text clusters in real time. One method is the Online Spherical K-Means Algorithm (OSKM) (footnote 68; see ref. 112 for more on OSKM).
Condensed Droplets Algorithm. I’m calling it that because they don’t call it anything –
it is the algorithm in [3].
• Fading function: f(t) = 2^{−λ·t}. A time-dependent weight for each data point (text stream). Monotonically decreasing; decays uniformly with time.
• Decay rate: λ = 1/t0 . Inverse of the half-life of the data stream.
When a cluster is created by a new point, it is allowed to remain as a trend-setting outlier
for at least one half-life. During that period, if at least one more data point arrives, then the
cluster becomes an active and mature cluster. If not, the trend-setting outlier is recognized as
a true anomaly and is removed from the list of current clusters (cluster death). Specifically, this happens when the (weighted) number of points in the [single-point] cluster falls to 0.5. The same
criterion is used to define the death of mature clusters. The statistics of the data points are
referred to as condensed droplets, which represent the word distributions within a cluster,
and can be used in order to compute the similarity of an incoming data point to the cluster.
Main idea of the algorithm is as follows (a rough sketch in code follows the list):
1. Initialize an empty set of clusters C = {}. As new data points arrive, unit clusters containing individual data points are created. Once a maximum number k of such clusters have been created, we can begin the process of online cluster maintenance.
2. For a new data point X, compute its similarity to each cluster C_j, denoted S(X, C_j).
   - If S(X, C_best) > thresh_outlier, or if there are no inactive clusters left (footnote 69), insert X into the cluster with maximum similarity.
   - Otherwise, a new cluster is created (footnote 70) containing the solitary data point X.
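A rough sketch of that maintenance loop under stated assumptions: similarity(), make_droplet(), add_to_droplet(), and weight() are hypothetical placeholders for the condensed-droplet statistics described above, droplets are dicts with a "last_update" field, and the 0.5 weighted-count death criterion is checked lazily when a new point arrives.

import time

def maintain_clusters(stream, k, thresh_outlier, similarity, make_droplet, add_to_droplet, weight):
    clusters = []                                    # at most k condensed droplets
    for X in stream:
        t = time.time()
        if len(clusters) < k:
            clusters.append(make_droplet(X, t))      # unit cluster
            continue
        sims = [similarity(X, c) for c in clusters]
        best = max(range(k), key=lambda j: sims[j])
        inactive = [j for j in range(k) if weight(clusters[j], t) <= 0.5]
        if sims[best] > thresh_outlier or not inactive:
            add_to_droplet(clusters[best], X, t)
        else:
            # cluster death: replace the least recently updated inactive cluster
            dead = min(inactive, key=lambda j: clusters[j]["last_update"])
            clusters[dead] = make_droplet(X, t)
    return clusters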
[68] Authors only provide a very brief description, which I'll just copy here: "This technique divides up the incoming stream into small segments, each of which can be processed effectively in main memory. A set of k-means iterations are applied to each such data segment in order to cluster them. The advantage of using a segment-wise approach for clustering is that since each segment can be held in main memory, we can process each data point multiple times as long as it is held in main memory. In addition, the centroids from the previous segment are used in the next iteration for clustering purposes. A decay factor is introduced in order to age-out the old documents, so that the new documents are considered more important from a clustering perspective."
[69] We specify some max allowed number of clusters k.
[70] The new cluster replaces the least recently updated inactive cluster.
Misc.
• Mandatory read: reference [61]. Details phrase extraction/topic signatures. The use of phrases instead of individual words is referred to as semantic smoothing.
• For dynamic (and more recent) topic modeling, see reference [107] of the paper, titled “A
probabilistic model for online document clustering with application to novelty detection.”
Semi-Supervised Clustering. Useful when we have any prior knowledge about the kinds of
clusters available in the underlying data. Some approaches:
• Incorporate this knowledge when seeding the cluster centroids for k-means clustering.
• Iterative EM approach: unlabeled documents are assigned labels using a naive Bayes
approach on the currently labeled documents. These newly labeled documents are then
again used for re-training a Bayes classifier. Iterate to convergence.
• Graph-based approach: graph nodes are documents and the edges are similarities between
the connected documents (nodes). We can incorporate prior knowledge by adding certain
edges between nodes that we know are similar. A normalized cut algorithm is then applied
to this graph in order to create the final clustering.
We can also use partially supervised methods in conjunction with pre-existing categorical
hierarchies.
Papers and Tutorials July 10, 2017
Palangi et al., “Deep Sentence Embeddings Using Long Short-Term Memory Networks: Analysis and Application
to Information Retrieval,” (2016).
Abstract. Sentence embeddings using LSTM cells, which automatically attenuate unimpor-
tant words and detect salient keywords. Main emphasis on applications for document retrieval
(matching a query to a document; footnote 71).
Introduction. Sentence embeddings are learned using a loss function defined on sentence
pairs. For example, the well-known Paragraph Vector72 is learned in an unsupervised manner
as a distributed representation of sentences and documents, which are then used for sentiment
analysis.
The authors appear to use a dataset of their own containing examples of (search-query, clicked-
title) for a search engine. Their training objective is to maximize the similarity between the
two vectors mapped by the LSTM-RNN from the query and the clicked document, respectively.
One very interesting claim to pay close attention to:
We further show that different cells in the learned model indeed correspond to different
topics, and the keywords associated with a similar topic activate the same cell unit in the
model.
[71] Note that this is similar to topic extraction.
[72] Q. V. Le and T. Mikolov, "Distributed representations of sentences and documents."
[73] Might want to look into this.
Basic RNN. The information flow (sequence of operations) is enumerated below.
1. Encode tth word [of the given sentence] in one-hot vector x(t).
2. Convert x(t) to a letter tri-gram vector l(t) using the fixed hashing operator H (footnote 74):
3. Compute the hidden state h(t), which is the sentence embedding for t = T , the length
of the sentence.
where U and W are the usual parameter matrices for the input/recurrent paths, respec-
tively.
LSTM. With peephole connections that expose the internal cell state s to the sigmoid com-
putations. I’ll rewrite the standard LSTM equations from my textbook notes, but with the
modifications for peephole connections:
f_i^(t) = σ( b_i^f + Σ_j U_{i,j}^f x_j^(t) + Σ_j W_{i,j}^f h_j^(t−1) + Σ_j P_{i,j}^f s_j^(t−1) )    (242)
s_i^(t) = f_i^(t) s_i^(t−1) + g_i^(t) σ( b_i + Σ_j U_{i,j} x_j^(t) + Σ_j W_{i,j} h_j^(t−1) )    (243)
g_i^(t) = σ( b_i^g + Σ_j U_{i,j}^g x_j^(t) + Σ_j W_{i,j}^g h_j^(t−1) + Σ_j P_{i,j}^g s_j^(t−1) )    (244)
q_i^(t) = σ( b_i^o + Σ_j U_{i,j}^o x_j^(t) + Σ_j W_{i,j}^o h_j^(t−1) + Σ_j P_{i,j}^o s_j^(t) )    (245)
[74] Details aside, the hashing operator serves to lower the dimensionality of the inputs a bit. In particular we use it to convert one-hot word vectors into their letter tri-grams. For example, the word "good" gets surrounded by hashes, '#good#', and then hashed from the one-hot vector to vectorized tri-grams: "#go", "goo", "ood", "od#".
Learning method. We want to maximize the likelihood of the clicked document given query,
which can be formulated as the following optimization problem:
L(Λ) = min_Λ { − log Π_{r=1}^{N} Pr[ D_r^+ | Q_r ] } = min_Λ Σ_{r=1}^{N} l_r(Λ)    (247)
l_r(Λ) = log( 1 + Σ_{j=1}^{n} e^{−γ·∆_{r,j}} )    (248)
where
• N is the number of (query, clicked-doc) pairs in the corpus, while n is the number of
negative samples used during training.
• D_r^+ is the clicked document for the rth query.
• ∆_{r,j} = R(Q_r, D_r^+) − R(Q_r, D_{r,j}^−), where R is just cosine similarity (footnote 75).
• Λ is all the parameter matrices (and biases) in the LSTM.
The authors then describe standard BPTT updates with momentum, which need not be de-
tailed here. See the “Algorithm 1” figure in the paper for extremely detailed pseudo-code of
the training procedure.
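A minimal NumPy sketch of the per-query loss in eqns. 247–248, assuming we already have cosine similarities between the query and the clicked/negative documents (the names and the γ value here are mine):

import numpy as np

def query_loss(R_pos, R_negs, gamma=10.0):
    # R_pos: R(Q, D+); R_negs: array of R(Q, D-_j) for the n negative documents.
    deltas = R_pos - np.asarray(R_negs)                 # each delta is in [-2, 2]
    return np.log1p(np.sum(np.exp(-gamma * deltas)))    # eqn. 248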
[75] Note that ∆_{r,j} ∈ [−2, 2]. We use γ as a scaling factor so as to expand this range.
Papers and Tutorials July 10, 2017
Aggarwal et al., “A Framework for Clustering Massive Text and Categorical Data Streams,” (2006).
Overview. Authors present an online approach for clustering massive text and categorical
data streams with the use of a statistical summarization methodology. First, we will go over the
process of storing and maintaining the data structures necessary for the clustering algorithm.
Then, we will discuss the differences which arise from using different kinds of data, and the
empirical results.
Maintaining Cluster Statistics. The data stream consists of d-dimensional records, where
each dimension corresponds to the numeric frequency of a given word in the vector space
representation. Each data point is weighted by the fading function f (t), a non-monotonic
decreasing function which decays uniformly with time t. The authors define the half-life of a
data point (e.g. a tweet) as:
1
t0 s.t. f (t0 ) = f (0) (249)
2
and, similarly, the decay-rate as its inverse, λ = 1/t0 . Thus we have f (t) = 2−λ·t .
To achieve greater accuracy in the clustering process, we require a high level of granularity in
the underlying data structures. To do this, we will use a process in which condensed clusters
of data points are maintained, referred to as cluster droplets. We define them differently for
the case of text and categorical data, beginning with categorical:
• Categorical. A cluster droplet D(t, C) for a set of categorical data points C at time t is
defined as the tuple:
D(t, C) := ( DF̄2, DF̄1, n, w(t), l )    (250)
where
– Entry k of the vector DF̄2 is the (weighted) number of points in cluster C where the ith dimension had value x and the jth dimension had value y. In other words, all pairwise combinations of values in the categorical vector (footnote 76). Σ_{i=1}^{d} Σ_{j≠i} v_i v_j entries total (footnote 77).
– Similarly, DF̄1 consists of the (weighted) counts that some dimension i took on the value x. Σ_{i=1}^{d} v_i entries total.
[76] This is intentionally written hand-wavy because I'm really concerned with text streams and don't want to give this much space.
[77] v_i is the number of values the ith categorical dimension can take on.
– w(t) is the sum of the weights of the data points at time t.
– l is the time stamp of the last time a data point was added to the cluster.
• Text. Can be viewed as an example of a sparse numeric data set. A cluster droplet
D(t, C) for a set of text data points C at time t is defined as the tuple:
D(t, C) := ( DF̄2, DF̄1, n, w(t), l )    (251)
where
– DF̄2 contains 3 · wb · (wb − 1)/2 entries, where wb is the number of distinct words in the cluster C.
– DF̄1 contains 2 · wb entries.
– n is the number of data points in the cluster C.
[78] In other words, the statistics for a cluster do not decay until a new point is added to it.
Papers and Tutorials July 12, 2017
Conneau et al., “Supervised Learning of Universal Sentence Representations from Natural Language Inference
Data,” Facebook AI Research (2017).
Overview. Authors claim universal sentence representations trained using the supervised
data of the Stanford Natural Language Inference (SNLI) dataset can consistently outperform
unsupervised methods like SkipThought on a wide range of transfer tasks. They emphasize
that training on NLI tasks in particular results in embeddings that perform well in transfer
tasks. Their best encoder is a Bi-LSTM architecture with max pooling, which they claim is
SOTA when trained on the SNLI data.
The Natural Language Inference Task. Also known as Recognizing Textual Entailment
(RTE). The SNLI data consists of sentence pairs labeled as one of entailment, contradiction,
or neutral. Below is a typical architecture for training on SNLI.
Note that the same sentence encoder is used for both u and v. To obtain a sentence vector
from a BiLSTM encoder, they experiment with (1) the average ht over all t (mean pooling),
and (2) selecting the max value over each dimension of the hidden units [over all timesteps]
(max pooling)79 .
[79] Since the authors have already mentioned that the BiLSTM did the best, I won't go over the other architectures they tried: self-attentive networks, hierarchical convnet, vanilla LSTM/GRU.
Papers and Tutorials July 13, 2017
Hill et al., “Learning Distributed Representations of Sentences from Unlabelled Data,” (2016).
s to predict wi+k+1 .
• Bottom-Up Methods. Train CBOW and Skip-Gram word embeddings on the Books
corpus.
Novel Text-Based Methods.
• Sequential (Denoising) Autoencoders. To avoid needing a coherent inter-sentence narrative, try this representation-learning objective based on DAEs. For a given sentence S and noise function N(S | p_o, p_x) (where p_o, p_x ∈ [0, 1]), the approach is as follows (a small sketch of N follows the list):
1. For each w ∈ S, N deletes w with probability p_o.
2. For each non-overlapping bigram w_i w_{i+1} ∈ S, N swaps w_i and w_{i+1} with probability p_x. (Authors recommend p_o = p_x = 0.1.)
We then train the same LSTM-based encoder-decoder architecture as NMT, but with the denoising objective to predict (as target) the original source sentence S given a corrupted version N(S | p_o, p_x) (as source).
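A small Python sketch of the noise function N (the function name and the fixed random seed are mine):

import random

def sdae_noise(words, p_o=0.1, p_x=0.1, rng=None):
    rng = rng or random.Random(0)
    # Step 1: delete each word with probability p_o.
    kept = [w for w in words if rng.random() >= p_o]
    # Step 2: swap each non-overlapping bigram with probability p_x.
    out = list(kept)
    for i in range(0, len(out) - 1, 2):
        if rng.random() < p_x:
            out[i], out[i + 1] = out[i + 1], out[i]
    return out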
Papers and Tutorials July 22, 2017
[80] Recall that for positive integers n, Γ(n) = (n − 1)!.
[81] The Dirichlet distribution is conjugate to the multinomial distribution. TODO: Review how to interpret this.
[82] Recall that LSI is basically PCA but without subtracting off the means.
Model. LDA assumes the following generative process for each document (word sequence) w:
1. N ∼ Poisson(λ): Sample N, the number of words (the length of w), from Poisson(λ) = e^{−λ} λ^n / n!. The parameter λ should represent the average number of words per document.
2. θ ∼ Dir(α): Sample k-dimensional vector θ from the Dirichlet distribution (eq. 252),
Dir(α). k is the number of topics (pre-defined by us). Recall that this means θ lies in the
(k-1) simplex. The Dirichlet distribution thus tells us the probability density of θ over
this simplex – it defines the probability of θ being at a given position on the simplex.
3. Do the following N times to generate the words for this document.
(a) zn ∼ Multinomial(θ). Sample a topic zn .
(b) wn ∼ Pr [wn | zn , β]: Sample a word wn from Pr [wn | zn , β], a “multinomial prob-
ability conditioned on topic zn .”83 The parameter β gives the distribution of words
given a topic:
[83] TODO: interpret the meaning of the multinomial distributions here. Seems a bit different than the standard interp...
[84] In other words, the meaning of θ_{m,i} = x is "x percent of document m is about topic i."
• Topic/Word Plate. z_mn is the topic for word n in doc m, and w_mn is the word. It is
shaded gray to indicate it is the only observed variable, while all others are latent
variables.
Theory. I’ll quickly summarize and interpret the main theoretical points. Without having
read all the details, this won’t be of much use (i.e. it is for someone who has read the paper
already).
• LDA and Exchangeability. We assume that each document is a bag of words (order
doesn't matter; frequency still does) and a bag of topics. In other words, a document of
N words is an unordered list of words and topics. De Finetti’s theorem tells us that we
can model the joint probability of the words and topics as if a random parameter θ were
drawn from some distribution and then the variables within w, z were conditionally
independent given θ. LDA posits that a good distribution to sample θ from is a
Dirichlet distribution.
• Geometric Interpretation: TODO
Inference and Parameter Estimation. As usual, we need to find a way to compute the
posterior distribution of the hidden variables given a document w:
Pr[θ, z | w, α, β] = Pr[θ, z, w | α, β] / Pr[w | α, β]    (257)
Papers and Tutorials July 30, 2017
Lafferty et al., “Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data,”
(2001).
Introduction. CRFs offer improvements to HMMs, MEMMs, and other discriminative Markov
models. MEMMs and other non-generative models share a weakness called the label bias
problem: the transitions leaving a given state compete only against each other, rather than
against all other transitions in the model. The key difference between CRFs and MEMMs is
that a CRF has a single exponential model for the joint probability of the entire sequence of labels
given the observation sequence.
The Label Bias Problem. Recall that MEMMs are run left-to-right. One way of interpreting
such a model is to consider how the probabilities (of state sequences) are distributed as we
continue through the sequence of observations. The issue with MEMMs is that there’s nothing
we can do if, somewhere along the way, we observe something that makes one of these state
paths extremely likely/unlikely; we can’t redistribute the probability mass amongst the various
allowed paths. The CRF solution:
Account for whole state sequences at once by letting some transitions “vote” more strongly
than others depending on the corresponding observations. This implies that score mass will
not be conserved, but instead individual transitions can “amplify” or “dampen” the mass
they receive.
Conditional Random Fields. Here we formalize the model and notation. Let X be a
random variable over data sequences to be labeled (e.g. over all words/sentences), and let Y be the random variable over corresponding label sequences (footnote 85). Formal definition:
Let G = (V, E) be a graph such that Y = (Yv )v∈V , so that Y is indexed by the vertices of
G. Then (X, Y ) is a CRF if, when conditioned on X, the random variables Yv obey the
Markov property with respect to the graph:
All this means is a CRF is a random field (discrete set of random-valued points in a space)
where all points (i.e. globally) are conditioned on X. If the graph G = (V, E) of Y is a tree,
its cliques86 are the edges and vertices. Take note that X is not a member of the vertices
[85] We assume all components Y_i can only take on values in some finite label set Y.
[86] A clique is a subset of vertices in an undirected graph such that every two distinct vertices in the clique are adjacent.
in G. G only contains vertices corresponding to elements of Y. Accordingly, when the au-
thors refer to cases where G is a “chain”, remember that they just mean the Y vertex sequence.
where y|S is the set of components of y associated with the vertices in subgraph S. We
assume the K feature [functions] fk and gk are given and fixed. Note that fk are the feature
functions over transitions yt−1 to yt , and gk are the feature functions over states yt and xt .
Our estimation problem is thus to determine parameters θ = (λ1 , λ2 , . . . ; µ1 , µ2 , . . .) from the
labeled training data.
Linear-Chain CRF. Let |Y| denote the number of possible labels. At each position t in the
observation sequence x, we define the |Y| × |Y| matrix random variable Mt (x)
where yt−1 := y 0 and yt := y. We can see that the individual elements correspond to specific
values of e and v in the double-summations of pθ (y | x) above. Then the normalization
(partition function) Zθ (x) is the (y0 , yT +1 ) entry (the fixed boundary states) of the product:
"T +1 #
Y
Zθ (x) = Mt (x) (262)
t=1 y0 ,yT +1
which includes all possible sequences y that start with the fixed y0 and end with the fixed
yT +1 . Now we can write the conditional probability as a function of just these matrices:
p_θ(y | x) = [ Π_{t=1}^{T+1} M_t(y_{t−1}, y_t | x) ] / [ Π_{t=1}^{T+1} M_t(x) ]_{y_0, y_{T+1}}    (263)
Parameter Estimation (for linear-chain CRFs). For each t in [0, T + 1], define the forward
vectors αt (x) with base case α0 (y | x) = 1 if y = y0 , else 0. Similarly, define the backward
vectors β_t(x) with base case β_{T+1}(y | x) = 1 if y = y_{T+1}, else 0 (footnote 87). Their recurrence relations are
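(my reconstruction from the paper; the recurrences themselves were dropped here):

α_t(x)^T = α_{t−1}(x)^T M_t(x),    β_t(x) = M_{t+1}(x) β_{t+1}(x)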
[87] Remember that y_0 and y_{T+1} are their own fixed symbolic constants representing a fixed start/stop state.
Papers and Tutorials September 04, 2017
Overview. Authors refer to sequence transduction models a lot – just a fancy way of refer-
ring to models that transform input sequences into output sequences. Authors propose new
architecture, the Transformer, based solely on attention mechanisms (no recurrence!).
Model Architecture.
• Encoder. N =6 identical layers, each with 2 sublayers: (1) a multi-head self-attention
mechanism and (2) a position-wise FC feed-forward network. They apply a residual
connection and layer norm such that each sublayer, instead of outputting Sublayer(x),
instead outputs LayerNorm(x + Sublayer(x)).
• Decoder. N =6 with 3 sublayers each. In addition to the two sublayers described for the
encoder, the decoder has a third sublayer, which performs multi-head attention over
the output of the encoder stack. Same residual connections and layer norm.
(Figure: encoder-decoder template layers. The actual model instantiates a chain of 6 encoder layers and 6 decoder layers. The decoder's self-attention masks embeddings at future timesteps to zero.)
Attention. An attention function can be described as a mapping:
Attn(query, {(k_1, v_1), . . .}) ⇒ Σ_i fn(query, k_i) · v_i    (266)
where the query, keys, values, and output are all vectors.
• Scaled Dot-Product Attention.
1. Inputs: queries q, keys k of dimension d_k, values v of dimension d_v. (Appears that d_q ≡ d_k.)
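From the paper, the vectorized form is Attention(Q, K, V) = softmax( Q K^T / √d_k ) V.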
First, let's explicitly show which indices are being normalized over, since it can get confusing when presented with the highly vectorized version above. For a given input sequence of length T, and for the self-attention version where K = Q = V ∈ R^{T×d_k}, the output attention vector for timestep t is explicitly (ignoring the √d_k for simplicity)

Attention(Q, K, V)_t = [ softmax(Q K^T) V ]_t    (271)
                     = Σ_{t'=1}^{T} [ e^{Q_t · K_{t'}} / Σ_{t''=1}^{T} e^{Q_t · K_{t''}} ] V_{t'}    (272)
[88] Assume that q and k are vectors in R^d whose components are independent RVs with E[q_i] = E[k_j] = 0 (∀ i, j) and Var[q_i] = Var[k_j] = 1 (∀ i, j). Then
E[q • k] = E[ Σ_{i=1}^{d} q_i k_i ] = Σ_i E[q_i k_i] = Σ_i E[q_i] E[k_i] = 0    (268)
Var[q • k] = Var[ Σ_{i=1}^{d} q_i k_i ] = Σ_i Var[q_i k_i] = Σ_i ( E[q_i^2 k_i^2] − E[q_i k_i]^2 )    (269)
           = Σ_i E[q_i^2] E[k_i^2] = Σ_i Var[q_i] Var[k_i] = d    (270)
See this S.O. answer and/or these useful formulas for more details.
Next, the gradient of the dth softmax output w.r.t. its inputs is
∂Softmax_d(x)/∂x_j = Softmax_d(x) ( δ_{dj} − Softmax_j(x) )    (273)
• Multi-Head Attention. Basically just doing some number h of parallel attention computations. Before each of these, the queries, keys, and values are linearly projected with different, learned linear projections to d_k, d_k and d_v dimensions respectively (and then fed to their respective attention function). The h outputs are then concatenated and once again projected, resulting in the final values.

MultiHead(Q, K, V) = Concat(head_1, . . . , head_h) W^O    (274)
where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)    (275)

with W_i^Q ∈ R^{d_model×d_k}, W_i^K ∈ R^{d_model×d_k}, W_i^V ∈ R^{d_model×d_v}, and W^O ∈ R^{h·d_v×d_model}. The authors employ h = 8, d_k = d_v = d_model/h = 64.
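A compact NumPy sketch of scaled dot-product attention and the multi-head wrapper above (a sketch only: single example, no masking, and the projection matrices are passed in rather than learned):

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (T_q x d_k), K: (T_k x d_k), V: (T_k x d_v)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V              # (T_q x d_v)

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    # W_q, W_k, W_v: lists of h per-head projection matrices; W_o: (h*d_v x d_model)
    heads = [scaled_dot_product_attention(Q @ wq, K @ wk, V @ wv)
             for wq, wk, wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=-1) @ W_o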
Other Components.
• Position-wise Feed-Forward Networks (FFN): each layer of the encoder and de-
coder contains a FC FFN, applied to each position separately and identically:
FFN(x) = max(0, x W_1 + b_1) W_2 + b_2    (276)
(The FFN is linear → ReLU → linear.)
• Embeddings and Softmax: use learned embeddings to convert input/output tokens to vectors of dimension d_model, and for the pre-softmax layer at the output of the decoder (footnote 89). (For inputs to the encoder/decoder, the embedding weights are multiplied by √d_model.)
• Positional Encoding: how the authors deal with the lack of recurrence (to make use of the sequence order). They add a sinusoid function of the position (timestep) pos and vector index i to the input embeddings for the encoder and decoder (footnote 90):
PE(pos, 2i) = sin( pos / 10000^{2i/d_model} )    (277)
PE(pos, 2i + 1) = cos( pos / 10000^{2i/d_model} )    (278)
The authors justify this choice:
[89] In other words, they use the same weight matrix for all three of (1) encoder input embedding, (2) decoder input embedding, and (3) (opposite direction) from decoder output to pre-softmax.
[90] Note that the positional encodings must necessarily be of dimension d_model to be summed with the input embeddings.
We chose this function because we hypothesized it would allow the model to easily learn
to attend by relative positions, since for any fixed offset k, PEpos+k can be represented
as a linear function of PEpos .
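A small NumPy sketch of eqns. 277–278 (the function name is mine; assumes d_model is even):

import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                    # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                 # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                         # eqn. 277
    pe[:, 1::2] = np.cos(angles)                         # eqn. 278
    return pe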
Summary of Add-ons. Below is a list of all the little bells and whistles they add to the
main components of the model that are easy to miss since they mention them throughout the
paper in a rather unorganized fashion.
• Shared weights for encoder inputs, decoder inputs, and final softmax projection outputs.
• Multiply the encoder and decoder input embedding [shared] weights by √d_model. TODO:
why? Also this must be highly correlated with their decision regarding weight initializa-
tion (mean/stddev/technique). Add whatever they use here if they mention it.
• Adam optimizer with β1 = 0.9, β2 = 0.98, ε = 10^{−9}.
• Learning rate schedule LR(s) = d_model^{−0.5} · min( s^{−0.5}, s · w^{−1.5} ) for global step s and warmup steps w = 4000.
• Dropout on sublayer outputs pre-layernorm-and-residual. Specifically, they actually re-
turn LayerNorm(x + Dropout(Sublayer(x))). Use Pdrop = 0.1.
• Dropout the summed embeddings+positional-encodings for both encoder and decoder
stacks.
• Dropout on softmax outputs. So do Dropout(Softmax(Q K^T)) V.
• Label smoothing with ls = 0.1.
Papers and Tutorials September 06, 2017
Overview. Authors introduce the Hierarchical Attention Network (HAN) that is designed
to capture insights regarding (1) the hierarchical structure of documents (words -> sentences
-> documents), and (2) the context dependence between words and sentences. The latter is
implemented by including two levels of attention mechanisms, one at the word level and one
at the sentence level.
Hierarchical Attention Networks. Below is an illustration of the network. The first stage
is familiar to sequence to sequence models - a bidirectional encoder for outputting sentence-
level representations of a sequence of words. The HAN goes a step further by feeding this
another bidirectional encoder for outputting document-level representations for sequences of
sentences.
The authors choose the GRU as their underlying RNN. For ease of reference, the defining
equations of the GRU are shown below:
ht = (1 − zt ) ht−1 + zt h̃t (279)
zt = σ (Wz xt + Uz ht−1 + bz ) (280)
h̃t = tanh (Wh xt + rt (Uh ht−1 ) + bh ) (281)
rt = σ (Wr xt + Ur ht−1 + br ) (282)
Hierarchical Attention. Here I’ll overview the main stages of information flow.
1. Word Encoder. Let the tth word in the ith sentence be denoted wit . They embed the
vectors with a word embedding matrix We , xit = We wit , and then feed xit through a
bidirectional GRU to ultimately obtain h_it := [ →h_it ; ←h_it ] (forward and backward hidden states concatenated).
2. Word Attention. Extracts words that are important to the meaning of the sentence
and aggregates the representation of these informative words to form a sentence vector.
u_it = tanh(W_w h_it + b_w)    (283)
α_it = exp(u_it^T u_w) / Σ_t exp(u_it^T u_w)    (284)
s_i = Σ_t α_it h_it    (285)
Note the context vector u_w, which is shared for all words (footnote 91) and randomly initialized and jointly learned during the training process.
3. Sentence Encoder. Similar to the word encoder, but uses the sentence vectors si as
the input for the ith sentence in the document. Note that the output of this encoder, hi
contains information from the neighboring sentences too (bidirectional) but focuses on
sentence i.
4. Sentence Attention. For rewarding sentences that are clues to correctly classify a
document. Similar to before, we now use a sentence level context vector us to measure
the importance of the sentences.
u_i = tanh(W_s h_i + b_s)    (286)
α_i = exp(u_i^T u_s) / Σ_i exp(u_i^T u_s)    (287)
v = Σ_i α_i h_i    (288)
where v is the document vector that summarizes all the information of sentences in a
document.
As usual, we convert v to a normalized probability vector by feeding through a softmax:
p = softmax(W_c v + b_c)    (289)
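A minimal NumPy sketch of the attention pooling in eqns. 283–285 (the sentence-level version in eqns. 286–288 is identical in form); the names are mine and a single sentence is processed at a time:

import numpy as np

def attention_pool(H, W, b, u_ctx):
    # H: (T x 2d) GRU outputs; W: (2d x a); b: (a,); u_ctx: (a,) learned context vector.
    U = np.tanh(H @ W + b)                  # eqn. 283
    scores = U @ u_ctx
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()             # eqn. 284
    return alpha @ H                        # eqn. 285: attention-weighted sum of states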
[91] To emphasize, there is only a single context vector u_w in the network, period. The subscript just tells us that it is the word-level context vector, to distinguish it from the sentence-level context vector in the later stage.
Configuration and Training. Quick overview of some parameters chosen by the authors:
• Tokenization: Stanford CoreNLP. Vocabulary consists of words occurring more than 5
times, all others are replaced with UNK token.
• Word Embeddings: train word2vec on the training and validation splits. Dimension
of 200.
• GRU. Dimension of 50 (so 100 because bidirectional).
• Context vectors. Both uw and us have dimension of 100.
• Training: batch size of 64, grouping documents of similar length into a batch. SGD
with momentum of 0.9.
Papers and Tutorials Oct 31, 2017
Nguyen, Cho, and Grishman, “Joint Event Extraction via Recurrent Neural Networks,” (2016).
Each event subtype has its own set of roles to be filled by the event arguments. For example,
the roles for the Die event subtype include Place, Victim, and Time.
Model.
- Sentence Encoding. Let wi denote the ith token in a sentence. It is transformed into a
real-valued vector xi , defined as
where “Embed” is an embedding we learn, and “DepVec” is the binary vector whose dimen-
sions correspond to the possible relations between words in the dependency trees.
- RNN. Bidirectional LSTM on the inputs xi .
- Prediction. Binary memory vector G_i^trg for triggers; binary memory matrices G_i^arg and G_i^{arg/trg} for arguments (at each timestep i). At each time step i, do the following in order:
1. Predict trigger ti for wi . First compute the feature representation vector Ritrig , defined
as:
R_i^trg := [ h_i ; L_i^trg ; G_{i−1}^trg ]    (291)
some predefined window size d. This is then fed to a fully-connected layer with softmax activation, F^trg, to compute the probability over possible trigger subtypes:
P_{i;t}^trg := F_t^trg( R_i^trg )    (292)
As usual, the predicted trigger type for w_i is computed as t_i = arg max_t P_{i;t}^trg. If w_i is not a trigger, t_i should predict "Other."
2. Argument role predictions, ai1 , . . . , aik , for all of the [already known] entity mentions
in the sentence, e1 , . . . , ek with respect to wi . aij denotes the argument role of ej with
respect to [the predicted trigger of] wi . If NOT(wi is trigger AND ej is one of its
arguments), then aij is set to Other. For example, if wi was the word “died” from our
example sentence, we’d hope that its predicted trigger would be ti = Die, and that the
entity associated with “cameraman” would get a predicted argument role of
Victim.
def getArgumentRoles(triggerType, entities):
    k = len(entities)
    if isOther(triggerType):
        return [Other] * k
    else:
        for e_j in entities:
            # compute the argument feature representation (eqn. 293 below) and classify e_j's role
            ...

R_{ij}^arg := [ h_i ; h_{ij} ; L_{ij}^arg ; B_{ij} ; G_{i−1}^arg[j] ; G_{i−1}^{arg/trg}[j] ]    (293)
Papers and Tutorials Oct 31, 2017
Y. Chen, S. Liu, S. He, K. Liu, and J. Zhao, “Event Extraction via Bidirectional Long Short-Term Memory
Tensor Neural Networks.”
Overview. The task/goal is the event extraction task as defined in Automatic Content Ex-
traction (ACE). Specifically, given a text document, our goal is to do the following in order
for each sentence:
1. Identify any event triggers in the sentence.
2. If triggers found, predict their subtype. For example, given the trigger “fired,” we may
classify it as having the Attack subtype.
3. If triggers found, identify their candidate argument(s). ACE defines an event argument
as “an entity mention, temporal expression, or value that is involved in an event.”
4. For each candidate argument, predict its role: “the relationship between an argument to
the event in which it participates.”
Context-aware Word Representation. Use pre-trained word embeddings for the input
word tokens, the predicted trigger, and the candidate argument. Note: we assume we already
have predictions for the event trigger t and are doing a pass for one of (possibly many) candidate
arguments a.
1. Embed each word in the sentence with pre-trained embeddings. Denote the embedding
for ith word as e(wi ).
2. Feed each e(wi ) through a bidirectional LSTM. Denote the ith output of the forward
LSTM as cl (wi+1 ) and the output of the backward LSTM at the same time step as
cr (wi−1 ). As usual, they take the general functional form:
c_l(w_i) = LSTM_fwd( c_l(w_{i−1}), e(w_{i−1}) )    (294)
c_r(w_i) = LSTM_bwd( c_r(w_{i+1}), e(w_{i+1}) )    (295)
3. Concatenate e(wi ), cl (wi ), cr (wi ) together along with the embedding of the candidate
argument e(a) and predicted trigger e(t). Also include the relative distance of wi to t or
(??) a, denoted as pi for position information, and the embedding of the predicted event
type pe of the trigger. Denote this massive concatenation result as xi :
Dynamic Multi-Pooling. This is easiest shown by example. Continue with our example
sentence:
In California, Peterson was arrested for the murder of his wife and unborn son.
where the colors are given for this specific case where murder is our predicted trigger and we
are considering the candidate argument Peterson 92 . Given our n outputs from the previous
stage, y (1) ∈ Rn×m , where n is the length of the sentence and m is the size of that huge
concatenation given in equation 297. We split our sentence by trigger and candidate argument,
then (confusingly) redefine our notation as (in this example, Peterson is the 3rd word and murder is the 8th word):

y_{1j}^{(1)} ← [ y_{1j}^{(1)}  y_{2j}^{(1)} ]    (298)
y_{2j}^{(1)} ← [ y_{3j}^{(1)}  · · ·  y_{7j}^{(1)} ]    (299)
y_{3j}^{(1)} ← [ y_{8j}^{(1)}  · · ·  y_{nj}^{(1)} ]    (300)

where it's important to see that, for some 1 ≤ j ≤ m, each new y_{ij}^{(1)} is a vector of length equal to the number of words in segment i. Finally, the dynamic multi-pooling layer, y^{(2)}, can be expressed as

y_{i,j}^{(2)} := max y_{i,j}^{(1)},    1 ≤ i ≤ 3, 1 ≤ j ≤ m    (301)

where the max is taken over each of the aforementioned vectors, leaving us with 3m values total. These are concatenated to form y^{(2)} ∈ R^{3m}.
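A small NumPy sketch of this dynamic multi-pooling (eqns. 298–301), assuming 0-based positions for the candidate argument and trigger; the function name and the empty-segment guard are mine:

import numpy as np

def dynamic_multi_pool(Y, arg_pos, trig_pos):
    # Y: (n x m) outputs of the previous stage. The sentence is split into three
    # segments (before the argument, argument up to the trigger, trigger onward),
    # each max-pooled separately, then concatenated into 3m values.
    lo, hi = sorted((arg_pos, trig_pos))
    segments = [Y[:lo], Y[lo:hi], Y[hi:]]
    pooled = [seg.max(axis=0) if len(seg) else np.zeros(Y.shape[1]) for seg in segments]
    return np.concatenate(pooled)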
Output. To predict each argument role [for the given argument candidate], y^{(2)} is fed through a dense softmax layer,
O = W2 y (2) + b2 (302)
where W2 ∈ Rn1 ×3m and n1 is the number of possible argument roles (including "None"). The
authors also use dropout on y (2) .
[92] Yes, arrested could be another predicted trigger, but the network considers each possibility at separate times/locations in the architecture.
Papers and Tutorials Nov 2, 2017
Socher et al., “Reasoning with Neural Tensor Networks for Knowledge Base Completion”
Overview. Reasoning over relationships between two entities. Goal: predict the likely truth
of additional facts based on existing facts in the KB. This paper contributes (1) the new NTN
and (2) a new way to represent entities in KBs. Each relation is associated with a distinct
model. Inputs to a given relation’s model are pairs of database entities, and the outputs score
how likely the pair has the relationship.
Intuitively, we can see each slice of the tensor as being responsible for one type of
entity pair or instantiation of a relation. . . Another way to interpret each tensor
slice is that it mediates the relationship between the two entity vectors differently.
Training Objective and Derivatives. All models are trained with contrastive max-
margin objective functions and minimize the following objective:
J(Ω) = Σ_{i=1}^{N} Σ_{c=1}^{C} max( 0, 1 − g(T^{(i)}) + g(T_c^{(i)}) ) + λ||Ω||_2^2    (304)

where c indexes the "corrupted" samples, T_c^{(i)} := ( e_1^{(i)}, R^{(i)}, e_c ). Notice that this function is minimized when the difference g(T^{(i)}) − g(T_c^{(i)}) is maximized. The authors used minibatched L-BFGS for optimization.
Papers and Tutorials Nov 6, 2017
Dong and Lapata, “Language to Logical Form with Neural Attention,” (2016)
where the authors have employed "parent-feeding": for a given subtree (logical form), at each
timestep, the hidden vector of the parent nonterminal is concatenated with the inputs and fed
into the LSTM (best understood via above illustration).
After encoding input q, the hierarchical tree decoder generates tokens at depth 1
of the subtree corresponding to parts of logical form a. If the predicted token is
< n >, decode the sequence by conditioning on the nonterminal’s hidden vector.
This process terminates when no more nonterminals are emitted.
Also note that the output posterior probability over the encoded input q is the product of
subtree posteriors. For example, consider the decoding example in the figure below:
The model is trained by minimizing log-likelihood over the training data, using RMSProp for
optimization. At inference time, greedy search or beam search is used to predict the most
probable output sequence.
Papers and Tutorials Nov 6, 2017
Zhong, Xiong, and Socher, “Seq2SQL: Generating Structured Queries From Natural Language Using Reinforce-
ment Learning”
Overview. Deep neural network for translating natural language questions to corresponding
SQL queries. Outperforms state-of-the-art semantic parser.
Seq2Tree and Pointer Baseline. Baseline model is the Seq2Tree model from the previous
note on Dong & Lapata’s (2016) paper. Authors here argue their output space is unnecessarily
large, and employ the idea of pointer networks with augmented inputs. The input sequence is
the concatenation of (1) the column names, (2) the limited vocabulary of the SQL language
such as SELECT, COUNT, etc., and (3) the question.
x := [ <col> ; x_{c_1} ; x_{c_2} ; . . . ; x_{c_N} ; <sql> ; x_s ; <question> ; x_q ],   x_{c_j} ∈ R^{T_j}    (306)
where we also insert special ("sentinel") tokens to demarcate the boundaries of each section.
The pointer network can then produce the SQL query by selecting exclusively from the input.
Let gs denote the sth decoder [hidden] state, and ys denote the output (index/pointer to input
query token).
[ptr net]   y_s = arg max_t α_{s,t}^ptr,   where   α_{s,t}^ptr = w^ptr · tanh( U^ptr g_s + V^ptr h_t )    (307)
Seq2SQL.
1. Aggregation Classifier. Our goal here is to predict which aggregation operation to use
out of COUNT, MIN, MAX, NULL, etc. This is done by projecting the attention-weighted aver-
age of encoder states, κagg , to RC where C denotes the number of unique aforementioned
aggregation operations. The sequence of computations is summarized as follows:
where β_i^agg gives the probability for the ith aggregation operation. We use cross entropy
loss Lagg for determining the aggregation operation. Note that this part isn’t really
a sequence-to-sequence architecture. It’s nothing more than an MLP applied to an
attention-weighted average of the encoder states.
2. Get Pointer to Column. A pointer network is used for identifying which column in
the input representation should be used in the query. Recall that xcj,t denotes the tth
word in column j. We use the last encoder state for a given column’s LSTM93 as its
representation; Tj denotes the number of words in the jth column.
e_{c_j} = h_{c_j, T_j}   where   h_{c_j,t} = LSTM( emb(x_{c_j,t}), h_{c_j,t−1} )    (313)
Similar to the aggregation, we train the SELECT network using cross entropy loss Lsel .
3. WHERE Clause Pointer Decoder. Recall from equation 307 that this is a model with
recurrent connections from its outputs leading back into its inputs, and thus a common
approach is to train it with teacher forcing94 . However, since the boolean expressions
within a WHERE clause can be swapped around while still yielding the same SQL query,
reinforcement learning (instead of cross entropy) is used to learn a policy to directly
optimize the expected correctness of the execution result. Note that this also implies that
we will be sampling from the output distribution at decoding step s to obtain the next
input for s + 1 [instead of teacher forcing].
[93] Yes, we encode each column with an LSTM separately.
[94] Teacher forcing is just a name for how we train the decoder portion of a sequence-to-sequence model, wherein we feed the ground-truth output y^(t) as input at time t + 1 during training.
R(q(y), q_g) = −2 if q(y) is not a valid SQL query;
               −1 if q(y) is a valid SQL query and executes to an incorrect result;
               +1 if q(y) is a valid SQL query and executes to the correct result    (316)

L_whe = −E_y[ R(q(y), q_g) ]    (317)
∇_Θ L_whe = −∇_Θ E_{y∼p_y}[ R(q(y), q_g) ]    (318)
          = −E_{y∼p_y}[ R(q(y), q_g) Σ_t ∇_Θ log p_y(y^t; Θ) ]    (319)
          ≈ −R(q(y), q_g) Σ_t ∇_Θ log p_y(y^t; Θ)    (320)
where
→ y = [ y^1, y^2, . . . , y^T ] denotes the sequence of generated tokens in the WHERE clause.
→ q(y) denotes the query generated by the model.
→ qg denotes the ground truth query corresponding to the question.
and the gradient has been approximated in the last line using a single Monte-Carlo sample
y.
Finally, the model is trained using gradient descent to minimize L = Lagg + Lsel + Lwhe .
Speculations for Event Extraction. I want to explore using this paper’s model for the
task of event extraction. Below, I’ve replaced some words (shown in green) from a sentence in
the paper in order to formalize this as event extraction.
Seq2Event takes as input a sentence and the possible event types of an ontol-
ogy. It generates the corresponding event annotation, which, during training, is
compared against an event template. The result of the comparison is utilized
to train the reinforcement learning algorithm95 .
[95] Original: Seq2SQL takes as input a question and the columns of a table. It generates the corresponding SQL query, which, during training, is executed against a database. The result of the execution is utilized as the reward to train the reinforcement learning algorithm.
Papers and Tutorials Nov 13, 2017
M. Ringgaard, R. Gupta, F. Pereira, “SLING: A framework for frame semantic parsing” (2017)
Model.
• Inputs. [words; affixes; shapes]
• Encoder.
1. Embed.
2. Bidirectional LSTM.
• Inputs to TBRU.
– BLSTM [forward and backward] hidden state for the current token in the parser state.
– Focus. Hidden layer activations corresponding to the transition steps that evoked/brought
into focus the top-k frames in the attention buffer.
– Attention. Recall that we maintain an attention buffer: an ordered list of frames,
where the order represents closeness to center of attention. The attention portion
of inputs for the TBRU looks at the top-k frames in the attention buffer, finds the
phrases in the text (if any) that evoked them. The activations from the BLSTM for
the last token of each of those phrases are included as TBRU inputs96
– History. Hidden layer activations from the previous k steps.
– Roles. Embeddings of (si , ri , ti ), where the frame at position si in the attention
buffer has a role (key) ri with frame at position ti as its value. Back-off features are
added for the source roles (si , ri ), target role (ri , ti ), and unlabeled roles (si , ti ).
• Decoder (TBRU). Outputs a softmax over possible transitions (actions).
[96] Okay, how is this attention at all? Seems misleading to call it attention.
Transition System. Below is the list of possible actions. Note that, since the system is
trained to predict the correct frame graph result, it isn’t directly told what order it should
take a given set of actions (footnote 97).
• SHIFT. Move to next input token.
• STOP. Signal that we’ve reached end of parse.
• EVOKE(type, num). New frame of type from next num tokens in the input; placed
at front of attention buffer.
• REFER(frame, num). New mention from next num tokens, evoking existing frame
from attention buffer. Places at front.
• CONNECT(source-frame, role, target-frame). Inserts (role, target-frame) slot
into source-frame, move source-frame to front.
• ASSIGN(source-frame, role, value). Same as CONNECT, but with primitive/con-
stant value.
• EMBED(target-frame, role, type). New frame of type, and inserts (role, target-
frame) slot. New frame placed to front.
• ELABORATE(source-frame, role, type). New frame of type. Inserts (role, new-
frame) slot to source-frame. New frame placed at front.
Evaluation. Need some way of comparing an annotated document with its gold-standard
annotation. This is done by constructing a virtual graph where the document is the start node.
It is then connected to the spans (which are presumably nodes themselves), and the spans are
connected to the frames they evoke. Frames that refer to other frames are given corresponding
edges between them. Quality is computed by aligning the golden and predicted graphs and
computing precision, recall, and F1. Specifically, these scores are computed separately for
spans, frames, frame types, roles linking to other frames (referred to here as just “roles”), and
roles that link to global constants (referred to here as just “labels”). Results are shown below.
[97] This is important to keep in mind, since more than one sequence of actions can result in a given predicted frame graph.
Papers and Tutorials Nov 16, 2017
M. Nickel and D. Kiela, “Poincaré Embeddings for Learning Hierarchical Representations” (2017)
Prerequisite Math. Recall that a hyperbola is a set of points, such that for any point P of
the set, the absolute difference of the distances |P F1 |, |P F2 | to two fixed points F1 , F2 (the
foci), is constant, usually denoted by 2a, a > 0. We can define a hyperbola by this set of points
or by its canonical form, which are both given, respectively, as:
where ∗ denotes the non-Euclidean inner product (subtracts the last term; same as Minkowski space-time). Notice that this is the defining equation for a hyperboloid of two sheets, and
Cannon et al. says “usually we deal only with one of the two sheets.” Hyperbolic spaces are
well-suited to model hierarchical data, since both circle length and disc area grow exponentially
with r.
[98] This begs the question: how useful would a Poincaré embedding be for situations where this assumption isn't valid?
Poincaré Embeddings. Let B d = {x ∈ Rd | ||x|| < 1} be the open d-dimensional unit ball.
The Poincaré ball model of hyperbolic space corresponds then to the Riemannian manifold99
(B d , gx ), where
g_x = ( 2 / (1 − ||x||^2) )^2 · g^E    (324)
is the Riemannian metric tensor, and g^E denotes the Euclidean metric tensor. The distance between two points u, v ∈ B^d is given as
d(u, v) = arccosh( 1 + 2 ||u − v||^2 / [ (1 − ||u||^2)(1 − ||v||^2) ] )    (325)
The boundary of the ball corresponds to the sphere S d−1 and is denoted by ∂B. Geodesics
in B d are then circles that are orthogonal to ∂B. To compute Poincaré embeddings for a set
of symbols S = {xi }ni=1 , we want to find embeddings Θ = {θi }ni=1 , where θi ∈ B d . Given
some loss function L that encourages semantically similar objects to be close as defined by the
Poincaré distance, our goal is to solve the optimization problem
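(my reconstruction from the paper of the equations dropped here): the problem is Θ' ← arg min_Θ L(Θ) subject to ∀ θ_i ∈ Θ: ||θ_i|| < 1, solved by Riemannian SGD with update

θ_{t+1} = R_{θ_t}( −η_t · ((1 − ||θ_t||^2)^2 / 4) · ∇_E L(θ_t) )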
where Rθt denotes the retraction onto B at θ and ηt denotes the learning rate at time t.
[99] All five analytic models of hyperbolic geometry in Cannon et al. are differentiable manifolds with a Riemannian metric. A Riemannian metric ds^2 on Euclidean space R^n is a function that assigns at each point p ∈ R^n a positive definite symmetric inner product on the tangent space at p, this inner product varying differentiably with the point p. If x_1, . . . , x_n are the standard coordinates in R^n, then ds^2 has the form Σ_{i,j} g_ij dx_i dx_j, and the matrix (g_ij) depends differentiably on x and is positive definite and symmetric.
Papers and Tutorials Nov 17, 2017
P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching Word Vectors with Subword Information”
(2017)
Overview. Based on the skipgram model, but where each word is represented as a bag of character n-grams. A vector representation is associated with each character n-gram; words are represented as the sum (footnote 100) of these representations.
Skipgram with Negative Sampling. Since this is based on skipgram, recall the objective
of skipgram which is to maximize:
Σ_{t=1}^{T} Σ_{c∈C_t} log Pr[w_c | w_t]    (328)
Pr[w_c | w_t] = e^{s(w_t, w_c)} / Σ_{j=1}^{W} e^{s(w_t, j)}    (329)
However, this implies that, given wt , we only predict one context word wc . Instead, we can
frame the problem as a set of independent binary classification tasks, and independently predict
the presence/absence of context words. Let ` : x 7→ log(1 + e−x ) denote the standard logistic
negative log-likelihood. Our objective is:
Σ_{t=1}^{T} Σ_{c∈C_t} [ ℓ(s(w_t, w_c)) + Σ_{n∈N_{t,c}} ℓ(−s(w_t, n)) ]    (330)
where Nt,c is a set of negative examples sampled from the vocabulary. A common scoring
function involves associating a distinct input vector uw and output vector vw for each word w.
Then the score is computed as s(wt , wc ) = uTwt vwc .
[100] It would be interesting to explore other aggregation operations than just summation.
FastText. Main contribution is a different scoring function s that utilizes subword information.
Each word w is represented as a bag of character n-grams. Special symbols < and > delimit
word boundaries, and the authors also insert the special sequence containing the full word (with
the delimiters) in its bag of n-grams. The word where is thus represented by first building its
bag of n-grams, for the choice of n = 3:
where → { <wh, whe, her, ere, re>, <where> }    (331)
Such a set of n-grams for a word w is denoted Gw . Each n-gram g for a word w has its own
vector zg , and the final vector representation of w is the sum of these. The scoring function
becomes
s(w, c) = Σ_{g∈G_w} z_g^T v_c    (332)
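A small Python sketch of eqns. 331–332 (function names are mine; z maps each n-gram to its vector z_g and context_vec is v_c):

import numpy as np

def char_ngrams(word, n=3):
    # Bag of character n-grams with boundary symbols, plus the full delimited word itself.
    w = "<" + word + ">"
    grams = {w[i:i + n] for i in range(len(w) - n + 1)}
    grams.add(w)
    return grams

def fasttext_score(word, context_vec, z, n=3):
    # eqn. 332: s(w, c) = sum over g in G_w of z_g . v_c
    return float(sum(np.dot(z[g], context_vec) for g in char_ngrams(word, n)))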
Papers and Tutorials Nov 17, 2017
B. Perozzi, R. Al-Rfou, and S. Skiena, “DeepWalk: Online Learning of Social Representations,” (2014).
Problem Definition. Classifying members of a social network into one or more categories.
Let G = (V, E), where V are the members of the network, and E be its edges,
E ⊆ (V × V ). Given a partially labeled social network GL = (V, E, X, Y ), with
attributes X ∈ R|V |×S where S is the size of the feature space for each attribute
vector, and Y ∈ R|V |×|Y| , Y is the set of labels.
In other words, the elements of our training dataset, (X, Y ), are the members of the social
network, and we want to label each member, represented by a vector in RS , with one or more
of the |Y| labels. We aim to learn features that capture the graph structure independent of
the labels’ distribution, and to do so in an unsupervised fashion.
Remember that, here, vj is the jth vertex visited in some given random walk.
where SkipGram(Φ, Wvi , w) performs SGD updates on Φ to minimize − log Pr [uk | Φ(vj )] for
each visited v_j, for each u_k in the "context" of v_j. Notice that a binary tree T is built from the set of vertices V (line 2) – this is done as preparation for computing each Pr[u_k | Φ(v_j)] via a
hierarchical softmax, to reduce computational burden of its partition function. Finally, a visual
overview of the DeepWalk algorithm is shown below. The authors use this algorithm, com-
bined with a one-vs-rest logistic regression implementation by LibLinear, for various multiclass
multilabel classification tasks.
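As a rough illustration (not the authors' code), here is a sketch of the truncated random-walk generation that produces the "corpus" fed to SkipGram; the graph representation, walk length, and walks-per-vertex values are arbitrary choices.

```python
import random

def random_walk(adj, start, walk_length):
    """Uniform random walk of fixed length starting at `start`.
    `adj` maps each vertex to a list of its neighbors."""
    walk = [start]
    while len(walk) < walk_length:
        neighbors = adj[walk[-1]]
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return walk

def build_corpus(adj, walks_per_vertex=10, walk_length=40, seed=0):
    """Each walk plays the role of a 'sentence' for a skip-gram model
    (trained with hierarchical softmax in the paper)."""
    random.seed(seed)
    corpus = []
    for _ in range(walks_per_vertex):
        vertices = list(adj)
        random.shuffle(vertices)        # a fresh random ordering per pass, as in the paper
        for v in vertices:
            corpus.append(random_walk(adj, v, walk_length))
    return corpus

# Toy graph: a 4-cycle.
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
walks = build_corpus(adj, walks_per_vertex=2, walk_length=5)
print(walks[:3])
```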
Papers and Tutorials Dec 6, 2017
Nickel, Murphy, Tresp, and Gabrilovich, “Review of Relational Machine Learning for Knowledge Graphs,”
(2015).
Introduction. Paper discusses latent feature models such as tensor factorization and multiway
neural networks, and mining observable patterns in the graph. In Statistical Relational
Learning (SRL), the representation of an object can contain its relationships to other objects.
The main goals of SRL include:
• Prediction of missing edges (relationships between entities).
• Prediction of properties of nodes.
• Clustering nodes based on their connectivity patterns.
We’ll be reviewing how SRL techniques can be applied to large-scale knowledge graphs
(KGs), i.e. graph structured knowledge bases (KBs) that store factual information in the form
of relationships between entities.
Probabilistic Knowledge Graphs. Let E = {e1 , . . . , eNe } be the set of all entities and
R = {r1 , . . . , rNr } be the set of all relation types in a KG. We model each possible triple
xijk = (ei , rk , ej ) as a binary random variable yijk ∈ {0, 1} that indicates its existence. The
full tensor Y ∈ {0, 1}Ne ×Ne ×Nr is called the adjacency tensor, where each possible realization
of Y can be interpreted as a possible world.
Clearly, Y will be large and sparse in most applications. Ideally, a relational model for large-
scale KGs should scale at most linearly with the data size, i.e., linearly in the number of
entities Ne , linearly in the number of relations Nr , and linearly in the number of observed
triples |D| = Nd .
Types of SRL Models. The presence or absence of certain triples in relational data is
correlated with (i.e. predictive of) the presence or absence of certain other triples. In other
words, the random variables yijk are correlated with each other. There are three main ways
to model these correlations:
1. Latent feature models: Assume all yijk are conditionally independent given latent
features associated with the subject, object and relation type and additional parameters.
2. Graph feature models: Assume all yijk are conditionally independent given observed
graph features and additional parameters.
3. Markov Random Fields: Assume all yijk have local interactions.
The first two model classes predict the existence of a triple $x_{ijk}$ via a score function $f(x_{ijk}; \Theta)$,
which represents the model's confidence that a triple exists given the parameters $\Theta$. The
conditional independence assumptions can be written as
\[
\Pr[\mathcal{Y} \mid \mathcal{D}, \Theta] = \prod_{i=1}^{N_e} \prod_{j=1}^{N_e} \prod_{k=1}^{N_r} \mathrm{Ber}\big(y_{ijk} \mid \sigma(f(x_{ijk}; \Theta))\big) \tag{335}
\]
where Ber is the Bernoulli distribution101 . Such models will be referred to as probabilistic
models. We will also discuss score-based models, which optimize $f(\cdot)$ by maximizing the margin
between existing and non-existing triples.
Latent Feature Models. We assume the variables yijk are conditionally independent given
a set of global latent features and parameters. All LFMs explain triples (observable facts)
via latent features of entities102 . One task of all LFMs is to infer these [latent] features
automatically from the data.
This is the bilinear (RESCAL) model, which scores triples as $f_{ijk} := e_i^\top W_k\, e_j$,
where the entity vectors $e_i \in \mathbb{R}^{H_e}$ and $H_e$ denotes the number of latent features in
the model. The parameters of the model are $\Theta = \{\{e_i\}_{i=1}^{N_e}, \{W_k\}_{k=1}^{N_r}\}$. Note that
entities have the same latent representation regardless of whether they occur as subjects
or objects in a relationship (shared representation), thus allowing the model to capture
global dependencies in the data. We can make a connection to tensor factorization
methods by seeing that the equation above can be written compactly as
\[
F_k = E\, W_k\, E^\top \tag{338}
\]
where $F_k \in \mathbb{R}^{N_e \times N_e}$ is the matrix holding all scores for the $k$-th relation, and the $i$th row
of $E \in \mathbb{R}^{N_e \times H_e}$ holds the latent representation of $e_i$.
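A small numpy sketch of this bilinear scoring (illustrative shapes and random parameters, not the paper's training code), checking that the per-relation score matrix equals $E W_k E^\top$ and pushing one score through the Bernoulli/sigmoid link of equation 335:

```python
import numpy as np

rng = np.random.default_rng(0)
Ne, He = 4, 3                       # number of entities, latent dimension (toy sizes)
E = rng.normal(size=(Ne, He))       # row i holds the latent representation e_i
Wk = rng.normal(size=(He, He))      # relation-specific interaction matrix W_k

def score(i, j, Wk, E):
    # Bilinear score f_ijk = e_i^T W_k e_j
    return E[i] @ Wk @ E[j]

Fk = E @ Wk @ E.T                   # equation (338): all scores for relation k at once
assert np.allclose(Fk[1, 2], score(1, 2, Wk, E))

# Probability that the triple (e_1, r_k, e_2) exists, via the sigmoid link of eq. (335):
prob = 1.0 / (1.0 + np.exp(-Fk[1, 2]))
print(prob)
```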
101
Notation used:
\[
\mathrm{Ber}(y \mid p) =
\begin{cases}
p & \text{if } y = 1 \\
1 - p & \text{if } y = 0
\end{cases} \tag{336}
\]
102
It appears that “latent” is being used here synonymously with "not directly observed in the data".
The authors extend this to what they call the E-MLP (E for entity) model:
\[
f^{\text{E-MLP}}_{ijk} := w_k^\top g(h^a_{ijk}) \tag{341}
\]
\[
h^a_{ijk} := A_k^\top \phi^{\text{E-MLP}}_{ij} \tag{342}
\]
\[
\phi^{\text{E-MLP}}_{ij} := [e_i ; e_j] \tag{343}
\]
Graph Feature Models. Here we assume that the existence of an edge can be predicted by
extracting features from the observed edges in the graph. In contrast to LFMs, this kind of
reasoning explains triples directly from the observed triples in the KG.
- Similarity measures for uni-relational data. Link prediction in graphs that consist only
of a single relation (e.g. (Bob, isFriendOf, Sally) for a social network). Various similarity
indices have been proposed to measure similarity of entities, of which there are three main
classes:
1. Local similarity indices: Common Neighbors, Adamic-Adar index, Preferential Attach-
ment derive entity similarities from number of common neighbors.
2. Global similarity indices: Katz index, Leicht-Holme-Newman index (ensembles of all
paths between entities); Hitting Time, Commute Time, PageRank (random walks).
3. Quasi-local similarity indices: Local Katz, Local Random Walks.
- Path Ranking Algorithm (PRA): extends the idea of using random walks of bounded
lengths for predicting links in multi-relational KGs. Let $\pi_L(i, j, k, t)$ denote a path of length
$L$ of the form $e_i \xrightarrow{r_1} e_2 \xrightarrow{r_2} e_3 \cdots \xrightarrow{r_L} e_j$, where $t$ represents the sequence of edge types
$t = (r_1, r_2, \dots, r_L)$. We also require there to be a direct arc $e_i \xrightarrow{r_k} e_j$, representing the
existence of a relationship of type $k$ from $e_i$ to $e_j$. Let $\Pi_L(i, j, k)$ represent the set of all such
paths of length $L$, ranging over path types $t$.
We can compute the probability of following a given path by assuming that at each step
we follow an outgoing link uniformly at random. Let $\Pr[\pi_L(i, j, k, t)]$ be the probability of
the path with type $t$. The key idea in PRA is to use these path probabilities as features
for predicting the probabilities of missing edges. More precisely, the feature vector and score
function (logistic regression) are as follows:
\[
\phi^{\mathrm{PRA}}_{ijk} = \big[\Pr[\pi] : \pi \in \Pi_L(i, j, k)\big] \tag{344}
\]
\[
f^{\mathrm{PRA}}_{ijk} := w_k^\top \phi^{\mathrm{PRA}}_{ijk} \tag{345}
\]
TODO: Finish...
Papers and Tutorials Dec 6, 2017
Introduction. Task: Given a knowledge graph G, scoring function F , and a graph query Q,
top-k subgraph search over G returns the k answers with the highest matching scores. An example
is searching for movie makers (directors) who worked with “Brad” and have won awards, illustrated
below:
Clearly, it would be extremely inefficient to enumerate all possible matches and then rank
them.
Preliminaries/Terminology.
• Queries. A query [graph] is defined as Q = (VQ , EQ ). Each query node in Q provides
information/constraints about an entity, and an edge between two nodes specifies the
relationship or the connectivity constraint posed on the two nodes. Q∗ denotes a star-
shaped query, which is basically a graph that looks like a star (central node with tree-like
structure radially outward).
• Subgraph Matching. Given a graph query Q and a knowledge graph G, a match φ(Q)
of Q in G is a subgraph of G, specified by a one-to-one matching function φ. It maps each
node u (edge e = (u, u0 )) in Q to a node match φ(u) (edge match φ(e) = (φ(u), φ(u0 )))
in φ(Q).
The matching score between query Q and its match φ(Q) is
\[
F(\phi(Q)) = \sum_{v \in V_Q} F_V(v, \phi(v)) + \sum_{e \in E_Q} F_E(e, \phi(e)) \tag{346}
\]
\[
F_V(v, \phi(v)) = \sum_i \alpha_i f_i(v, \phi(v)) \tag{347}
\]
\[
F_E(e, \phi(e)) = \sum_j \beta_j f_j(e, \phi(e)) \tag{348}
\]
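A tiny sketch of evaluating equations 346-348 for one candidate match; the node/edge feature functions, weights, and the match mapping φ are hypothetical placeholders (e.g., label and relation agreement), not the paper's actual scoring features.

```python
def match_score(query_nodes, query_edges, phi, node_feats, edge_feats, alphas, betas):
    """Equation (346): weighted node scores (347) plus weighted edge scores (348)
    for a candidate match `phi` (dict: query node -> graph node)."""
    total = 0.0
    for v in query_nodes:
        total += sum(a * f(v, phi[v]) for a, f in zip(alphas, node_feats))
    for (u, w) in query_edges:
        total += sum(b * f((u, w), (phi[u], phi[w])) for b, f in zip(betas, edge_feats))
    return total

# Hypothetical feature functions: label agreement on nodes, relation agreement on edges.
node_label = {"q_director": "director", "g_42": "director", "q_award": "award", "g_7": "award"}
def same_label(qv, gv):
    return 1.0 if node_label.get(qv) == node_label.get(gv) else 0.0

edge_rel = {("q_director", "q_award"): "won", ("g_42", "g_7"): "won"}
def same_relation(qe, ge):
    return 1.0 if edge_rel.get(qe) == edge_rel.get(ge) else 0.0

phi = {"q_director": "g_42", "q_award": "g_7"}
print(match_score(["q_director", "q_award"], [("q_director", "q_award")],
                  phi, [same_label], [same_relation], alphas=[1.0], betas=[0.5]))
```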
Papers and Tutorials Jan 19, 2018
Kong et al., “DRAGNN: A Transition-based Framework for Dynamically Connected Neural Networks,” (2017).
Transition Based Recurrent Networks. When combining transition systems with recur-
rent networks, we will refer to them as Transition Based Recurrent Units (TBRU), which
consist of:
• Transition system T .
• Input function $m(s) : \mathcal{S} \to \mathbb{R}^K$ that maps states to some fixed-size vector representation
(e.g. an embedding lookup operation).
• Recurrence function $r(s) : \mathcal{S} \to \mathcal{P}\{1, \dots, i-1\}$ that maps states to a set of previous time
steps, where $\mathcal{P}$ is the power set. Note that $|r(s)|$ may vary with $s$. We use $r$ to specify
state-dependent recurrent links in the unrolled computation graph.
• The RNN cell hs ← RNN(m(s), {hi | i ∈ r(s)}).
103
The authors state that we are only concerned with complete structures that have the same number of
decisions n(x) for the same input x.
Example: Parsey McParseface.
• Transition system: the arc-standard transition system, defined in the image below104 ;
the state contains all words and partially built trees (the stack) as well as unseen words
(the buffer).
• Input function: m(si ) is the concatenation of 52 feature embeddings extracted from
tokens based on their positions in the stack and the buffer.
• Recurrence function: r(si ) is empty, as this is a feed-forward network.
• RNN cell: a feed-forward MLP (so not an RNN...).
Inference with TBRUs. To predict the output sequence {d1 , . . . , dn } given input sequence
x = {x1 , . . . , xn }, do:
1. Initialize s1 = s† .
2. For i = 1, . . . , n:
(a) Compute hi = RNN(m(si ), {hj | j ∈ r(si )}).
(b) Update transition state:
NOTE: This defines a locally normalized training procedure, whereas Andor et al.
of Syntaxnet clearly conclude that their globally normalized model is the preferred
choice.
104
Image taken from “Transition-Based Parsing” by Joakim Nivre. Note that “right-headed” means “goes
from left to right” or “headed to the right”.
Combining multiple TBRUs. We connect multiple TBRUs with different transition systems
via r(s).
1. We execute a list of T TBRU components sequentially, so that each TBRU advances a
global step counter.
2. Each transition state, sτ , from the τ ’th component has access to the terminal states from
every prior transition system, and the recurrence function r(sτ ) for any given component
can pull hidden activations from every prior one as well.
Example: Multi-task bi-directional tagging. Say we want to do both POS and NER
tagging (indices start at 1).
• Left-to-right: $\mathcal{T}$ = shift-only, $m(s_i) = x_i$, $r(s_i) = \{i - 1\}$.
• Right-to-left: $\mathcal{T}$ = shift-only, $m(s_{n+i}) = x_{(n-i)+1}$, $r(s_{n+i}) = \{n + i - 1\}$.
• POS Tagger: $\mathcal{T}_{POS}$ = tagger, $m(s_{2n+i}) = \{\}$, $r(s_{2n+i}) = \{i, (2n - i) + 1\}$.
• NER Tagger: $\mathcal{T}_{NER}$ = tagger, $m(s_{3n+i}) = \{\}$, $r(s_{3n+i}) = \{i, (2n - i) + 1, 2n + i\}$,
which illustrates the most important aspect of the TBRU:
A TBRU can serve as both an encoder for downstream tasks and a decoder for its
own task simultaneously.
For this example, the POS Tagger served as both an encoder for the NER task as well as a
decoder for the POS task.
Training a DRAGNN. Assume training data consists of examples x along with gold de-
cision sequences for a given TBRU in the DRAGNN. Given decisions d1 , . . . , dN from prior
components $1, \dots, T-1$, the log-likelihood for training the $T$'th TBRU along its gold decision
sequence $d^\star_{N+1}, \dots, d^\star_{N+n}$ is then:
\[
L(x, d^\star_{N+1:N+n}; \theta) = \sum_i \log \Pr\big[d^\star_{N+i} \mid d_{1:N}, d^\star_{N+1:N+i-1}; \theta\big] \tag{351}
\]
During training, the entire input sequence is unrolled and backpropagation through struc-
ture is used for gradient computation.
4.31.1 More Detail: Arc-Standard Transition System
The arc-standard transition system is mentioned a lot, but with little detail. Here I’ll synthesize
what I find from external resources. The literature defines the states in a transition system
slightly differently than the DRAGNN paper. Here we’ll define them as a configuration
c = (Σ, B, A) triplet, where
• Σ is the stack of tokens in x that we’ve [partially] processed.
• B is the buffer of remaining tokens in x that we need to process.
• A is a set of arcs (wi , wj , `) that link wi to wj , and label the arc/link as `.
So, in the arc-standard transition system figure presented with Parsey McParseface earlier (a toy implementation follows this list),
• SHIFT just means “move the head element of the buffer onto the top of the stack”.
• Left-arc just means “make a link from the tail element of the stack to the element before
it. Remove the pointed-to element from the stack.”
• Right-arc just means “make a link from the element before the tail element in the stack
to the tail element. Remove the pointed-to element from the stack.”
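To make the three operations concrete, here is a minimal, unlabeled toy implementation of the arc-standard configuration and transitions. This is my own sketch (not from the DRAGNN paper or Nivre's materials); arcs are stored as (head, dependent) pairs and labels are ignored.

```python
def shift(stack, buffer, arcs):
    # SHIFT: move the head element of the buffer onto the top of the stack.
    stack.append(buffer.pop(0))

def left_arc(stack, buffer, arcs):
    # LEFT-ARC: top of stack becomes head of the element below it; remove the dependent.
    head, dep = stack[-1], stack[-2]
    arcs.append((head, dep))
    del stack[-2]

def right_arc(stack, buffer, arcs):
    # RIGHT-ARC: element below the top becomes head of the top; remove the dependent.
    head, dep = stack[-2], stack[-1]
    arcs.append((head, dep))
    del stack[-1]

# Parse the toy sentence "I saw her" with a hand-picked transition sequence:
stack, buffer, arcs = [], ["I", "saw", "her"], []
for action in (shift, shift, left_arc, shift, right_arc):
    action(stack, buffer, arcs)

print(stack, arcs)   # ['saw'] [('saw', 'I'), ('saw', 'her')]
```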
Papers and Tutorials April 01, 2018
B. Zoph and Q. Le, “Neural Architecture Search with Reinforcement Learning,” (2017).
Controller RNN. Generates architectures with a predefined number of layers, which is in-
creased manually as training progresses. At convergence, the validation accuracy of the generated
network is recorded. Then, the controller parameters θc are optimized to maximize the ex-
pected validation accuracy over a batch of generated architectures.
Reinforcement Learning is used to learn the controller parameters θc . Let a1:T denote the list of
actions taken by the controller105 , which defines a generated architecture. We denote the
resulting validation accuracy by R, which is the reward signal for our RL task. Concretely, we
want our controller to maximize its expected reward, $J(\theta_c) = \mathbb{E}_{P(a_{1:T}; \theta_c)}[R]$.
Since R is non-differentiable106 with respect to θc , we use a policy gradient method to itera-
tively update θc . All this means is that we instead compute gradients over the softmax outputs
(the action probabilities), and use the value of R as a simple weight factor.
\[
\nabla_{\theta_c} J(\theta_c) = \sum_{t=1}^{T} \mathbb{E}_{P(a_{1:T}; \theta_c)}\big[\nabla_{\theta_c} \log P(a_t \mid a_{1:t-1}; \theta_c)\, R\big] \tag{353}
\]
\[
\approx \frac{1}{m} \sum_{k=1}^{m} \sum_{t=1}^{T} \nabla_{\theta_c} \log P(a_t \mid a_{1:t-1}; \theta_c)\, R_k \tag{354}
\]
where the second equation is the empirical approximation (batch-average instead of expecta-
tion) over a batch of size m, an unbiased estimator for our gradient107 . Also note that we do
105
Note that T is not necessarily the number of layers, since a single generated layer can correspond to multiple
actions (e.g. stride height, stride width, num filters, etc.).
106
R is a function of the action sequence a1:T and the parameters θc , and implicitly depends on the samples
used for the validation set. Clearly, we do not have access to an analytical form of R, and computing numerical
gradients via small perturbations of θc is computationally intractable.
107
It is unbiased for the same reason that any average over samples x drawn from a distribution P (x) is an
unbiased estimator for EP [x].
have access to the distribution P (a1:T ; θc ) since it is defined to be the joint softmax probabili-
ties of our controller, given its parameter values θc (i.e. this is not a pdata vs pmodel situation).
The approximation 354 is an unbiased estimator of 353, but has high variance. To reduce the
variance of our estimator, the authors employ a baseline function b that does not depend on
the current action:
\[
\frac{1}{m} \sum_{k=1}^{m} \sum_{t=1}^{T} \nabla_{\theta_c} \log P(a_t \mid a_{1:t-1}; \theta_c)\,(R_k - b) \tag{355}
\]
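A toy sketch of the REINFORCE update in equations 354-355. This is my own illustration with a trivial "architecture" of T independent categorical choices rather than the paper's controller RNN, and the baseline is an exponential moving average of past rewards, which is one common choice.

```python
import numpy as np

rng = np.random.default_rng(0)
T, num_choices = 3, 4                     # T decisions, each from `num_choices` options
theta = np.zeros((T, num_choices))        # controller "parameters": per-step logits

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reward(actions):
    # Stand-in for validation accuracy of the generated architecture.
    return float(np.mean(actions == np.array([1, 2, 3])))

baseline, lr, m = 0.0, 0.5, 8
for step in range(200):
    grads, rewards = np.zeros_like(theta), []
    for _ in range(m):                    # batch of m sampled "architectures"
        probs = np.array([softmax(theta[t]) for t in range(T)])
        actions = np.array([rng.choice(num_choices, p=probs[t]) for t in range(T)])
        R = reward(actions)
        rewards.append(R)
        for t in range(T):                # grad of log-softmax w.r.t. logits: one-hot - probs
            g = -probs[t]
            g[actions[t]] += 1.0
            grads[t] += g * (R - baseline)      # eq. (355): weight by (R_k - b)
    theta += lr * grads / m               # gradient ascent on the expected reward
    baseline = 0.9 * baseline + 0.1 * np.mean(rewards)

print([int(np.argmax(theta[t])) for t in range(T)])   # should approach [1, 2, 3]
```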
Papers and Tutorials April 26, 2018
B. Yang and T. Mitchell, “Joint Extraction of Events and Entities within a Document Context,” (2016).
Introduction. Two main reasons that state-of-the-art event extraction systems have difficul-
ties:
1. They extract events and entities in separate stages.
2. They extract events independently from each individual sentence, ignoring the rest of
the document.
This paper proposes an approach that simultaneously extracts events and entities within a
document context. They do this by first decomposing the problem into 3 tractable subproblems:
1. Learning the dependencies between a single event [trigger] and all of its potential argu-
ments.
2. Learning the co-occurrence relations between events across the document.
3. Learning for entity extraction.
and then combine these learned models into a joint optimization framework.
Learning Within-Event Structures. For now, assume we have some document x, a set
of candidate event triggers T , and a set of candidate entities N . Denote the set of entity
candidates that are potential arguments for trigger candidate i as Ni . The joint distribution
over the possible trigger types, roles, and entities for those roles, is given by
\[
\Pr{}_\theta[t_i, r_i, a \mid i, N_i, x] \propto {} \tag{356}
\]
\[
\exp \sum_{j \in N_i} \Big[ \theta_1^\top f_1(t_i) + \theta_2^\top f_2(r_{ij}) + \theta_3^\top f_3(t_i, r_{ij}) + \theta_4^\top f_4(a_j) + \theta_5^\top f_5(r_{ij}, a_j) \Big] \tag{357}
\]
(Margin note: all $f_i$ also depend on $i, x$; in addition, all $f_i$ except $f_1$ depend on the current $j$ in the summation.)
where each fi is a feature function, and I’ve colored the unary feature functions green. The
unary features are tabulated in Table 1 of the original paper. They use simple indicator
functions 1t,r and 1r,a for the pairwise features. They train using maximum-likelihood estimates
with L2 regularization:
Learning Event-Event Relations. A pairwise model of event-event relations in a document.
Training data consists of all pairs of trigger candidates that co-occur in the same sentence or
are connected by a co-referent subject/object if they’re in different sentences. Given a trigger
candidate pair $(i, i')$, we estimate the probabilities for their event types $(t_i, t_{i'})$ as
\[
\Pr{}_\phi\big[t_i, t_{i'} \mid x, i, i'\big] \propto \exp\big(\phi^\top g(t_i, t_{i'}, x, i, i')\big) \tag{360}
\]
where g is a feature function that depends on the trigger candidate pair and their context. In
addition to re-using the trigger features in Table 1 of the paper, they also introduce relational
trigger features:
1. whether they’re connected by a conjunction dependency relation
2. whether they share a subject or an object
3. whether they have the same head word lemma
4. whether they share a semantic frame based on FrameNet.
As before, they use L-BFGS to compute the maximum-likelihood estimates of the parameters
φ.
Entity Extraction. Trained a standard linear-chain CRF using the BIO scheme. Their CRF
features:
1. current words and POS tags
2. context words in a window of size 2
3. word type such as all-capitalized, is-capitalized, all-digits
4. Gazetteer-based entity type if the current word matches an entry in the gazetteers col-
lected from Wikipedia.
5. pre-trained word2vec embeddings for each word
Joint Inference. Allows information flow among the 3 local models and finds globally-optimal
assignments of all variables. Define the following objective:
\[
\max_{t, r, a} \;\; \sum_{i \in T} E(t_i, r_i, a) + \sum_{i, i' \in T} R(t_i, t_{i'}) + \sum_{j \in N} D(a_j) \tag{361}
\]
where
• The first term is the sum of confidence scores for individual event mentions from the
within-event model.
\[
E(t_i, r_i, a) = \log p_\theta(t_i) + \sum_{j \in N_i} \big[\log p_\theta(t_i, r_{ij}) + \log p_\theta(r_{ij}, a_j)\big] \tag{362}
\]
• The second term is the sum of confidence scores for event relations based on the pairwise
event model.
• The third term is the sum of confidence scores for entity mentions, where
and pψ (aj | j, x) is the marginal probability derived from the linear-chain CRF.
The optimization is subject to agreement constraints that enforce the overlapping variables
among the 3 components to agree on their values. The joint inference problem can be formu-
lated as an integer linear program (ILP)108 . To solve it efficiently, they find solutions for
the relaxation of the problem using a dual decomposition algorithm, AD3 .
108
From Wikipedia: an integer linear program in canonical form: maximize $c^\top x$ subject to $Ax \le b$, $x \ge 0$, $x \in \mathbb{Z}^n$.
Papers and Tutorials April 27, 2018
Introduction. Authors demonstrate that simple FF neural networks can achieve comparable
or better accuracies than LSTMs, as long as they are globally normalized. They don’t use
any recurrence, but perform beam search for maintaining multiple hypotheses and introduce
global normalization with a CRF objective to overcome the label bias problem that locally
normalized models suffer from.
• Local. The per-decision partition function for the locally normalized model is
\[
Z_L(s_j; \theta) = \sum_{d' \in \mathcal{A}(s_j)} \exp \rho(s_j, d'; \theta) \tag{367}
\]
Beam search can be used to attempt to find the action sequence with highest probability.
• Global. In contrast, a CRF defines:
\[
\Pr{}_G[d_{1:n}] = \frac{\exp \sum_{j=1}^{n} \rho(s_j, d_j; \theta)}{Z_G(\theta)} \tag{368}
\]
\[
Z_G(\theta) = \sum_{d'_{1:n} \in \mathcal{D}_n} \exp \sum_{j=1}^{n} \rho(s'_j, d'_j; \theta) \tag{369}
\]
where Dn is the set of all valid sequences of decisions of length n. The inference problem
is now to find
\[
\arg\max_{d_{1:n} \in \mathcal{D}_n} \Pr{}_G[d_{1:n}] = \arg\max_{d_{1:n} \in \mathcal{D}_n} \sum_{j=1}^{n} \rho(s_j, d_j; \theta) \tag{370}
\]
and we can also use beam search to approximately find the argmax.
Training. SGD on the NLL of the data under the model. The NLL takes a different form
depending on whether we choose a locally normalized model vs a globally normalized model.
• Local.
• Global.
To make learning tractable for the globally normalized model, the authors use beam search
with early updates, defined as follows. Keep track of the location of the gold path109 in
the beam as the prediction sequence is being constructed. If the gold path is not found in the
beam after step j, run one step of SGD on the following objective:
\[
L_{\text{global-beam}}(d^*_{1:j}, \theta) = - \sum_{t=1}^{j} \rho(d^*_{1:t-1}, d^*_t; \theta) + \ln \sum_{d'_{1:j} \in B_j} \exp \sum_{t=1}^{j} \rho(d'_{1:t-1}, d'_t; \theta) \tag{375}
\]
where $B_j$ contains all paths in the beam at step $j$, together with the gold path prefix $d^*_{1:j}$. If
the gold path remains in the beam throughout decoding, a gradient step is performed us-
ing BT , the beam at the end of decoding. When training the global model, the authors
first pretrain110 using the local objective function, and then perform additional training steps
using the global objective function.
109
The gold path is the predicted sequence that matches the true labeled sequence, up to the current timestep.
The Label Bias Problem. Locally normalized models often have a very weak ability to
revise earlier decisions. Here we will prove that globally normalized models are strictly
more expressive than locally normalized models111 . Let PL denote the set of all possible
distributions pL (d1:n | x1:n ) under the local model as the scores ρ vary. Let PG be the same,
but for the global model.
Theorem 3.1. $\mathcal{P}_L$ is a strict subset112 of $\mathcal{P}_G$, that is $\mathcal{P}_L \subsetneq \mathcal{P}_G$. (Margin note: we are assuming
that both $\mathcal{P}_L$ and $\mathcal{P}_G$ consist of log-linear distributions over scoring functions $\rho(d_{1:t-1}, d_t, x_{1:t})$.)
In other words, a globally normalized model can model any distribution that a locally normalized
one can, but the converse is not true. I’ve worked through the proof below.
Proof: $\mathcal{P}_L \subsetneq \mathcal{P}_G$
Proof that $\mathcal{P}_L \subseteq \mathcal{P}_G$. For any locally normalized model with scores $\rho_L(d_{1:t-1}, d_t, x_{1:t})$, we can define a
corresponding $p_G$ over scores $\rho_G(d_{1:t-1}, d_t, x_{1:t}) := \log p_L(d_t \mid d_{1:t-1}, x_{1:t})$, so that $p_G = p_L$.
Proof that the inclusion is strict. Similar to a typical linear-chain CRF, let $\mathcal{T}$ denote the set of observed label transitions, and let $\mathcal{E}$ denote the set
of observed $(x_t, d_t)$ pairs. Let $\alpha$ be the single scalar parameter of this simple model, where
\[
\rho(d_{1:t-1}, d_t, x_{1:t}) = \alpha\big(\mathbb{1}_{(d_{t-1}, d_t) \in \mathcal{T}} + \mathbb{1}_{(x_t, d_t) \in \mathcal{E}}\big) \tag{379}
\]
for all $t$. This results in the following distributions $p_G$ and $p_L$, evaluating on an input sequence of length 3:
\[
p_G(d_{1:3} \mid x_{1:3}) = \frac{\exp\, \alpha \sum_{t=1}^{3} \big(\mathbb{1}_{(d_{t-1}, d_t) \in \mathcal{T}} + \mathbb{1}_{(x_t, d_t) \in \mathcal{E}}\big)}{Z_G(x_{1:3})} \tag{380}
\]
\[
p_L(d_{1:3} \mid x_{1:3}) = p_L(d_1 \mid x_1)\, p_L(d_2 \mid d_1, x_{1:2})\, p_L(d_3 \mid d_{1:2}, x_{1:3}) \tag{381}
\]
where I’ve written $p_L$ as a product over its local CPDs because it reveals the core observation that the proof
is based on: for any given subsequence $(d_{1:t-1}, x_{1:t})$, the local CPD is constrained to satisfy
$\sum_{d_t} p_L(d_t \mid d_{1:t-1}, x_{1:t}) = 1$. With this, the following comparison of $p_G$ and $p_L$ for large $\alpha$ completes the proof of $\mathcal{P}_G \not\subseteq \mathcal{P}_L$:
∴ $\mathcal{P}_L \subsetneq \mathcal{P}_G$.
111
This is for conditional models only.
112
Note that ⊂ and ⊊ are used here to mean the same thing. Matter of notational preference/being explicit/etc.
Papers and Tutorials April 30, 2018
Graphical Modeling (2.1). Some notation. Denote factors as $\psi_a(y_a)$ where $1 \le a \le A$ and
$A$ is the total number of factors. $y_a$ is an assignment to the subset $Y_a \subseteq Y$ of variables associated
with $\psi_a$. The value returned by $\psi_a$ is a non-negative scalar that can be thought of as a measure
of how compatible the values $y_a$ are with each other. Given a collection of subsets $\{Y_a\}_{a=1}^{A}$ of
$Y$, an undirected graphical model is the set of all distributions that can be written as
\[
p(y) = \frac{1}{Z} \prod_{a=1}^{A} \psi_a(y_a) \tag{384}
\]
\[
Z = \sum_{y} \prod_{a=1}^{A} \psi_a(y_a) \tag{385}
\]
for any choice of factors F = {ψa } that have ψa (ya ) ≥ 0 for all ya . The sum for the partition
function, Z, is over all possible assignments y of the variables Y . We’ll use the term ran-
dom field to refer to a particular distribution among those defined by an undirected model113 .
We can represent the factorization with a factor graph: a bipartite graph G = (V, F, E) in
which one set of nodes V = {1, 2, . . . , |Y |} indexes the RVs in the model, and the set of nodes
F = {1, 2, . . . , A} indexes the factors. A connection between a variable node Ys for s ∈ V to
some factor node ψa for a ∈ F means that Ys is one of the arguments of ψa . It is common to
draw the factor nodes as squares, and the variable nodes as circles.
Generative versus Discriminative Models (2.2). Naive Bayes is generative, while logistic
regression (a.k.a maximum entropy) is discriminative. Recall that Naive Bayes and logistic are
defined as, respectively,
\[
p(y, x) = p(y) \prod_{k=1}^{K} p(x_k \mid y) \tag{386}
\]
\[
p(y \mid x) = \frac{1}{Z(x)} \exp\left(\sum_{k=1}^{K} \theta_k f_k(y, x)\right) \tag{387}
\]
where the $f_k$ in the definition of logistic regression denote the feature functions. We could set
them, for example, as $f_{y',j}(y, x) = \mathbb{1}_{y'=y}\, x_j$.
113
i.e. a particular set of factors.
An example generative model for sequence prediction is the HMM. Recall that an HMM
defines
\[
p(y, x) = \prod_{t=1}^{T} p(y_t \mid y_{t-1})\, p(x_t \mid y_t) \tag{388}
\]
where we are using the dummy notation of assuming an initial state $y_0$, clamped to 0,
that begins every state sequence, so we can write the initial state distribution as $p(y_1 \mid y_0)$.
We see that the generative models, like naive Bayes and the HMM, define a family of joint
distributions that factorizes as p(y, x) = p(y)p(x | y). Discriminative models, like logistic re-
gression, define a family of conditional distributions p(y | x). The main conceptual difference
here is that a conditional distribution p(y | x) doesn’t include a model of p(x). The principal
advantage of discriminative modeling is that it’s better suited to include rich, overlapping fea-
tures. Discriminative models like CRFs make conditional independence assumptions both (1)
among y and (2) about how the y can depend on x, but do not make conditional independence
assumptions among x.
The difference between NB and LR is due only to the fact that NB is generative and LR is
discriminative. Any LR classifier can be converted into a NB classifier with the same decision
boundary, and vice versa. In other words, NB defines the same family as LR, if we interpret
NB generatively as
\[
p(y, x) = \frac{\exp\left(\sum_k \theta_k f_k(y, x)\right)}{\sum_{\tilde{y}, \tilde{x}} \exp\left(\sum_k \theta_k f_k(\tilde{y}, \tilde{x})\right)} \tag{389}
\]
and train it to maximize the conditional likelihood. Similarly, if the LR model is interpreted
as above, and trained to maximize the joint likelihood, then we recover the same classifier as
NB.
Linear-Chain CRFs (2.3). Key point: the conditional distribution p(y | x) that follows from
the joint distribution p(y, x) of an HMM is in fact a CRF with a particular choice of feature
functions. First, we rewrite the HMM joint in a form that’s more amenable to generalization:
\[
p(y, x) = \frac{1}{Z} \prod_{t=1}^{T} \exp\left\{\sum_{i,j \in S} \theta_{i,j}\, \mathbb{1}_{\{y_t = i,\, y_{t-1} = j\}} + \sum_{i \in S,\, o \in O} \mu_{o,i}\, \mathbb{1}_{\{y_t = i,\, x_t = o\}}\right\} \tag{390}
\]
\[
= \frac{1}{Z} \prod_{t=1}^{T} \exp\left(\sum_{k=1}^{K} \theta_k f_k(y_t, y_{t-1}, x_t)\right) \tag{391}
\]
and the latter provides the more compact notation114 . We can use Bayes rule to then write
p(y | x), which would give us a particular kind of linear-chain CRF that only includes features
for the current word’s identity. The general definition of linear-chain CRFs is given below:
Let $Y, X$ be random vectors, $\theta = \{\theta_k\} \in \mathbb{R}^K$ be a parameter vector, and $\mathcal{F} = \{f_k(y, y', x_t)\}_{k=1}^{K}$
be a set of real-valued feature functions. Then a linear-chain conditional random field
is a distribution $p(y \mid x)$ that takes the form:
\[
p(y \mid x) = \frac{1}{Z(x)} \prod_{t=1}^{T} \exp\left(\sum_k \theta_k f_k(y_t, y_{t-1}, x_t)\right) \tag{392}
\]
\[
Z(x) = \sum_{y} \prod_{t=1}^{T} \exp\left(\sum_k \theta_k f_k(y_t, y_{t-1}, x_t)\right) \tag{393}
\]
Notice that a linear chain CRF can be described as a factor graph over x and y, where each
local function (factor) ψt has the special log-linear form:
\[
\psi_t(y_t, y_{t-1}, x_t) = \exp\left(\sum_k \theta_k f_k(y_t, y_{t-1}, x_t)\right) \tag{394}
\]
General CRFs (2.4). Let G be a factor graph over X and Y . Then (X, Y ) is a conditional
random field if for any value x of X, the distribution p(y | x) factorizes according to G. If
F = {ψa } is the set of A factors in G, then the conditional distribution for a CRF is
\[
p(y \mid x) = \frac{1}{Z(x)} \prod_{a=1}^{A} \psi_a(y_a, x_a) \tag{395}
\]
It is often useful to require that the factors be log-linear over a prespecified set of feature
functions, which allows us to write the conditional distribution as
\[
p(y \mid x) = \frac{1}{Z(x)} \prod_{\psi_a \in F} \exp\left(\sum_{k=1}^{K(a)} \theta_{a,k} f_{a,k}(y_a, x_a)\right) \tag{396}
\]
114
Note how we collapsed the summations over i, j and i, o to simply k. This is purely notational. Each
value k can be mapped to/from a unique i, j or i, o in the first version. Also note that, necessarily, each feature
function fk in the latter version maps to a specific indicator function 1 in the first.
In addition, most models rely extensively on parameter tying115 . To denote this, we partition
the factors of $G$ into $\mathcal{C} = \{C_1, C_2, \dots, C_P\}$, where each $C_p$ is a clique template: a set of factors
sharing a set of feature functions $\{f_{p,k}(x_c, y_c)\}_{k=1}^{K(p)}$ and a corresponding set of parameters
$\theta_p \in \mathbb{R}^{K(p)}$. A CRF that uses clique templates can be written as
\[
p(y \mid x) = \frac{1}{Z(x)} \prod_{C_p \in \mathcal{C}} \prod_{\psi_c \in C_p} \psi_c(x_c, y_c; \theta_p) \tag{397}
\]
\[
= \frac{1}{Z(x)} \prod_{C_p \in \mathcal{C}} \prod_{c \in C_p} \exp\left(\sum_{k=1}^{K(p)} \theta_{p,k} f_{p,k}(x_c, y_c)\right) \tag{398}
\]
In a linear-chain CRF, typically one uses one clique template C0 = {ψt }Tt=1 . Again, each factor
in a given template shares the same feature functions and parameters, so the previous sentence
means that we reuse the set of features and parameters for each timestep.
Two example choices for the edge features are:
\[
\text{[edge-obs]} \quad f(y_t, y_{t-1}, x_t) = q_m(x_t)\, \mathbb{1}_{y_t = y,\, y_{t-1} = y'} \quad (\forall y, y' \in Y,\ \forall m) \tag{400}
\]
\[
\text{[node-obs]} \quad f(y_t, y_{t-1}, x_t) = \mathbb{1}_{y_t = y,\, y_{t-1} = y'} \quad (\forall y, y' \in Y) \tag{401}
\]
and both use the same $f(y_t, x_t) = q_m(x_t)\, \mathbb{1}_{y_t = y}$ $(\forall y \in Y, \forall m)$. Recall that $m$ is the index
into our set of observation features116 .
• Feature Induction. The model begins with a number of base features, and the training
procedure adds conjunctions of those features.
115
Note how, for CRFs, the actual parameters θ are tightly coupled with the feature functions f .
116
In CRFSuite, the observation features are all the attributes we define, and any features that use both label
and observation are defined within CRFSuite itself.
4.35.1 Inference (Sec. 4)
We see that we can save an exponential amount of work by caching the inner sums as we go.
Let M denote the number of possible states for the y variables. We define a set of T forward
variables αt , each of which is a vector of size M :
\[
\alpha_t(j) = \sum_{i \in S} \psi_t(j, i, x_t) \sum_{y_{\langle 1 \dots t-2\rangle}} \psi_{t-1}(i, y_{t-2}, x_{t-1}) \prod_{t'=1}^{t-2} \psi_{t'}(y_{t'}, y_{t'-1}, x_{t'}) \tag{410}
\]
\[
= \sum_{i \in S} \psi_t(j, i, x_t)\, \alpha_{t-1}(i) \tag{411}
\]
where $S$ is the set of $M$ possible states. By recognizing that $p(x) = \sum_{j \in S} \sum_{y_{\langle 1 \dots T-1\rangle}} p(x_{\langle 1 \dots T\rangle}, y_{\langle 1 \dots T-1\rangle}, j)$,
we can rewrite $p(x)$ as
\[
p(x) = \sum_{j \in S} \alpha_T(j) \tag{412}
\]
Notice how in the step from equation 407 to 408, we marginalized over all possible y sub-
sequences that could’ve been aligned with xh1...ti . We will repeat this pattern to derive the
backward recursion for βt , which is the same idea except now we go from T backward until
some t (instead of going from 1 forward until some t).
Similar to how we obtained equation 412, we can rewrite p(x) in terms of the β:
\[
p(x) = \beta_0(y_0) \triangleq \sum_{y_1} \psi_1(y_1, y_0, x_1)\, \beta_1(y_1) \tag{416}
\]
We can then combine the definition for α and β to compute marginals of the form p(yt−1 , yt |
x):
\[
p(y_{t-1}, y_t \mid x) = \frac{1}{p(x)}\, \alpha_{t-1}(y_{t-1})\, \psi_t(y_t, y_{t-1}, x_t)\, \beta_t(y_t) \tag{417}
\]
\[
\text{where } p(x) = \sum_{j \in S} \alpha_T(j) = \beta_0(y_0) \tag{418}
\]
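To keep the recursions concrete, here is a small numpy sketch of the forward/backward passes and the pairwise marginals (equations 411, 417, 418); the factor values ψ_t are random placeholders rather than trained CRF potentials.

```python
import numpy as np

rng = np.random.default_rng(0)
T, M = 4, 3                              # sequence length, number of states
# psi[t, j, i] plays the role of psi_t(y_t = j, y_{t-1} = i, x_t); random stand-ins here.
psi = rng.uniform(0.5, 1.5, size=(T, M, M))

alpha = np.zeros((T, M))
alpha[0] = psi[0, :, 0]                  # assume a fixed initial state y_0 = 0
for t in range(1, T):
    alpha[t] = psi[t] @ alpha[t - 1]     # eq. (411): alpha_t(j) = sum_i psi_t(j,i) alpha_{t-1}(i)

beta = np.zeros((T, M))
beta[T - 1] = 1.0
for t in range(T - 2, -1, -1):
    beta[t] = psi[t + 1].T @ beta[t + 1] # beta_t(i) = sum_j psi_{t+1}(j,i) beta_{t+1}(j)

Z = alpha[T - 1].sum()                   # eq. (412)/(418): p(x) = sum_j alpha_T(j)
assert np.isclose(Z, (psi[0, :, 0] * beta[0]).sum())   # = beta_0(y_0), eq. (418)

# Pairwise marginal p(y_{t-1}=i, y_t=j | x), eq. (417), indexed [j, i]:
t = 2
pairwise = (alpha[t - 1][None, :] * psi[t] * beta[t][:, None]) / Z
print(pairwise.sum())                    # the marginals sum to 1
```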
Viterbi algorithm (HMMs). For computing y ∗ = arg maxy p(y | x). The derivation is
nearly the same as how we derived the forward-backward algorithm, except now we’ve replaced
the summations in equation 406 with maximization. The analog of α for viterbi are defined
as:
\[
\delta_t(j) = \max_{y_{\langle 1 \dots t-1\rangle}} \psi_t(j, y_{t-1}, x_t) \prod_{t'=1}^{t-1} \psi_{t'}(y_{t'}, y_{t'-1}, x_{t'}) \tag{419}
\]
\[
= \max_{i \in S} \psi_t(j, i, x_t)\, \delta_{t-1}(i) \tag{420}
\]
and the maximizing assignment $y^*$ is then recovered by a backwards recursion over the stored argmaxes.
Computing the recursions for $\delta_t$ and $y_t^*$ together is the Viterbi algorithm.
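A matching numpy sketch of the Viterbi recursion in equations 419-420, reusing the same kind of placeholder factors ψ_t as the forward-backward sketch above:

```python
import numpy as np

rng = np.random.default_rng(0)
T, M = 4, 3
psi = rng.uniform(0.5, 1.5, size=(T, M, M))   # psi[t, j, i] ~ psi_t(y_t=j, y_{t-1}=i, x_t)

delta = np.zeros((T, M))
backptr = np.zeros((T, M), dtype=int)
delta[0] = psi[0, :, 0]                        # fixed initial state y_0 = 0
for t in range(1, T):
    cand = psi[t] * delta[t - 1][None, :]      # cand[j, i] = psi_t(j, i) * delta_{t-1}(i)
    backptr[t] = cand.argmax(axis=1)           # best previous state i for each j
    delta[t] = cand.max(axis=1)                # eq. (420)

# Backwards recursion to recover the maximizing assignment y*.
y_star = [int(delta[T - 1].argmax())]
for t in range(T - 1, 0, -1):
    y_star.append(int(backptr[t, y_star[-1]]))
y_star.reverse()
print(y_star)
```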
Note that the interpretation is also slightly different. The α, β, and δ variables should only be
interpreted with the factorization formulas, and not as probabilities. Specifically, use
\[
\alpha_t(j) = \sum_{y_{\langle 1 \dots t-1\rangle}} \exp\left\{\sum_k \theta_k f_k(j, y_{t-1}, x_t)\right\} \prod_{t'=1}^{t-1} \exp\left\{\sum_k \theta_k f_k(y_{t'}, y_{t'-1}, x_{t'})\right\} \tag{428}
\]
\[
\beta_t(i) = \sum_{y_{\langle t+1 \dots T\rangle}} \exp\left\{\sum_k \theta_k f_k(y_{t+1}, i, x_{t+1})\right\} \prod_{t'=t+2}^{T} \exp\left\{\sum_k \theta_k f_k(y_{t'}, y_{t'-1}, x_{t'})\right\} \tag{429}
\]
\[
\delta_t(j) = \max_{y_{\langle 1 \dots t-1\rangle}} \exp\left\{\sum_k \theta_k f_k(j, y_{t-1}, x_t)\right\} \prod_{t'=1}^{t-1} \exp\left\{\sum_k \theta_k f_k(y_{t'}, y_{t'-1}, x_{t'})\right\} \tag{430}
\]
Markov Chain Monte Carlo (MCMC). The two most popular classes of approximate infer-
ence algorithms are Monte Carlo algorithms and variational algorithms. In what follows,
we drop the CRF-specific notation and refer to the more general joint distribution
\[
p(y) = Z^{-1} \prod_{a \in F} \psi_a(y_a) \tag{431}
\]
that factorizes according to some factor graph G = (V, F ). MCMC methods construct a
Markov chain whose state space is the same as that of Y , and sample from this chain to
approximate, e.g., the expectation of some function f (y) over the distribution p(y). MCMC
algorithms aren’t commonly applied in the context of CRFs, since parameter estimation by
maximum likelihood requires calculating marginals many times.
Maximum Likelihood for Linear-Chain CRFs. Since we’re modeling the conditional
distribution with CRFs, we use the conditional log likelihood `(θ) with l2-regularization:
\[
\ell(\theta) = \sum_{i=1}^{N} \log p(y^{(i)} \mid x^{(i)}; \theta) - \frac{1}{2\sigma^2} \sum_{k=1}^{K} \theta_k^2 \tag{432}
\]
\[
= \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{k=1}^{K} \theta_k f_k(y^{(i)}_t, y^{(i)}_{t-1}, x^{(i)}_t) - \sum_{i=1}^{N} \log Z(x^{(i)}) - \frac{1}{2\sigma^2} \sum_{k=1}^{K} \theta_k^2 \tag{433}
\]
We can rewrite this in the form of expectations. For now, let $\tilde{p}(y, x)$ denote the empirical
distribution, and let $\hat{p}(y \mid x; \theta)\, \tilde{p}(x)$ denote the model distribution.
\[
\frac{\partial \ell}{\partial \theta_k} = \mathbb{E}_{x,y \sim \tilde{p}(y,x)}\left[\sum_t f_k(y_t, y_{t-1}, x_t)\right] - \mathbb{E}_{x,y \sim \hat{p}(y,x)}\left[\sum_t f_k(y_t, y_{t-1}, x_t)\right] \tag{440}
\]
Inference. We need to compute the log probabilities for each instance in the dataset, under the current
parameters. We will need them when evaluating $Z(x)$ and the marginals $p(y, y' \mid x)$ when computing
gradients.
1. Initialize $\alpha_1(j) = \exp\{\sum_k \theta_k f_k(j, y_0, x_1)\}$ ($y_0$ is the fixed initial state) and $\beta_T(i) = 1$.
2. For all $t$, compute:
\[
\alpha_t(j) = \sum_{i \in S} \exp\left\{\sum_k \theta_k f_k(j, i, x_t)\right\} \alpha_{t-1}(i) \tag{441}
\]
\[
\beta_t(j) = \sum_{i \in S} \exp\left\{\sum_k \theta_k f_k(i, j, x_{t+1})\right\} \beta_{t+1}(i) \tag{442}
\]
Since we observe y during training, what if we instead treat this as a CRF with both x and y
observed?
\[
p(z \mid y, x) = \frac{1}{Z(y, x)} \prod_t \psi_t(y_t, y_{t-1}, z_t, x_t) \tag{448}
\]
\[
Z(y, x) = \sum_z \prod_t \psi_t(y_t, y_{t-1}, z_t, x_t) \tag{449}
\]
We can use the same inference algorithms as usual to compute Z(y, x), and the key result is
that we can now write
\[
p(y \mid x) = \frac{1}{Z(x)} \sum_z \prod_t \psi_t(y_t, y_{t-1}, z_t, x_t) = \frac{Z(y, x)}{Z(x)} \tag{450}
\]
where I’ve assumed there are no connections between any zt and zt0 6=t .
Stochastic Gradient Methods. Now we compute gradients for a single example, and for a
linear-chain CRF:
\[
\frac{\partial \ell_i}{\partial \theta_k} = \sum_t f_k(y^{(i)}_t, y^{(i)}_{t-1}, x^{(i)}_t) - \sum_{t, y, y'} f_k(y, y', x^{(i)}_t)\, p(y, y' \mid x^{(i)}) - \frac{\theta_k}{N \sigma^2} \tag{453}
\]
which corresponds to parameter update (remember: we are using the LL, not the negative
LL):
117
This uses the trick $\frac{df}{d\theta} = f(\theta)\, \frac{d \log f}{d\theta}$.
4.35.3 Related Work and Future Directions (Sec. 6)
which has some important differences compared to the linear-chain CRF. Notice how maximum-
likelihood training of MEMMs does not require performing inference over full output se-
quences y, because Zt is a simple sum over the labels at a single position. MEMMs, however,
suffer from label bias, while CRFs do not.
Unfortunately, computing the exact integral (the expectation) is usually intractable, and we
must resort to approximate methods like MCMC.
Papers and Tutorials May 02, 2018
Han et al., “Co-sampling: Training Robust Networks for Extremely Noisy Supervision,” (2018).
Introduction. The authors state that current methodologies [for training networks under
noisy labels] involve estimating the noise transition matrix (which they don’t define).
Patrini et al. (2017) define the matrix as follows:
Denote by $T \in [0, 1]^{c \times c}$ the noise transition matrix specifying the probability of one label being flipped
to another, so that $\forall i, j$: $T_{ij} \triangleq \Pr[\tilde{y} = e_j \mid y = e_i]$. The matrix is row-stochastic118 and not necessarily
symmetric across the classes.
Algorithm. Authors propose a learning paradigm called Co-sampling. They maintain two
networks fw1 and fw2 simultaneously. For each mini-batch D̂, each network selects its R(T)
small-loss instances as a “clean” mini-batch, D̂1 and D̂2 respectively. Each of the two networks
then uses its selected clean mini-batch to update the parameters of its peer network (w2 is updated using D̂1 , and w1 using D̂2 ).
• Why small-loss instances? Because deep networks tend to fit clean instances first, then
noisy/harder instances progressively after.
• Why two networks? Because if we just trained a single network on clean instances,
we would not be robust in extremely high-noise rates, since the training error would
accumulate if the selected instances are not “fully clean.”
The Co-sampling paradigm algorithm is shown below.
118
Each row sums to 1.
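Since the algorithm listing itself is not reproduced here, below is a rough PyTorch-style sketch of the cross-update loop as I understand it; the function names, the keep-ratio schedule R(T), the toy networks, and the optimizer details are my own placeholders, not the paper's.

```python
import torch

def co_sampling_step(net1, net2, opt1, opt2, x, y, keep_ratio, loss_fn):
    """One Co-sampling-style update on a mini-batch (x, y).
    Each network picks its small-loss instances; its peer then trains on them."""
    n_keep = max(1, int(keep_ratio * len(y)))

    with torch.no_grad():
        loss1 = loss_fn(net1(x), y)               # per-example losses (reduction='none')
        loss2 = loss_fn(net2(x), y)
    idx1 = torch.topk(-loss1, n_keep).indices     # small-loss instances chosen by net1
    idx2 = torch.topk(-loss2, n_keep).indices     # small-loss instances chosen by net2

    # Cross update: net2 learns from net1's selection, and vice versa.
    opt2.zero_grad()
    loss_fn(net2(x[idx1]), y[idx1]).mean().backward()
    opt2.step()

    opt1.zero_grad()
    loss_fn(net1(x[idx2]), y[idx2]).mean().backward()
    opt1.step()

# Example wiring with tiny linear "networks" and a toy batch:
net1, net2 = torch.nn.Linear(5, 3), torch.nn.Linear(5, 3)
opt1 = torch.optim.SGD(net1.parameters(), lr=0.1)
opt2 = torch.optim.SGD(net2.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss(reduction="none")
x, y = torch.randn(16, 5), torch.randint(0, 3, (16,))
co_sampling_step(net1, net2, opt1, opt2, x, y, keep_ratio=0.7, loss_fn=loss_fn)
```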
Papers and Tutorials June 30, 2018
Hidden-Unit CRFs. At each time step t, the HUCRF employs H binary stochastic hidden
units zt . It models the conditional distribution as
\[
p(y \mid x) = \frac{1}{Z(x)} \sum_z \exp\big(E(x, z, y)\big) \tag{460}
\]
\[
E(x, z, y) = \sum_{t=2}^{T} y_{t-1}^\top A y_t + \sum_{t=1}^{T} \big( x_t^\top W z_t + z_t^\top V y_t + b^\top z_t + c^\top y_t \big) + y_1^\top \pi + y_T^\top \tau \tag{461}
\]
Since the hidden units are conditionally independent given the data and labels, the hidden
units can be marginalized out one-by-one. This, along with the nice property that the hidden
units have binary elements, allows us to write p(y | x) without writing any zt explicitly, as
119
See the introduction in my CRF notes for recap of label-bias.
shown below:
\[
p(y \mid x) = \frac{\exp\{y_1^\top \pi + y_T^\top \tau\}}{Z(x)} \prod_{t=1}^{T} \exp\{c^\top y_t + y_{t-1}^\top A y_t\}
\prod_{h=1}^{H} \sum_{z_h \in \{0,1\}} \exp\{z_h b_h + z_h w_h^\top x_t + z_h v_h^\top y_t\} \tag{462}
\]
\[
= \frac{\exp\{y_1^\top \pi + y_T^\top \tau\}}{Z(x)} \prod_{t=1}^{T} \exp\{c^\top y_t + y_{t-1}^\top A y_t\}
\prod_{h=1}^{H} \big(1 + \exp\{b_h + w_h^\top x_t + v_h^\top y_t\}\big) \tag{463}
\]
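As a quick numerical check of the marginalization step from equation 462 to 463, here is a toy sketch for one timestep with random parameters (shapes are arbitrary assumptions): explicitly summing over every binary configuration of z gives the same per-timestep factor as the closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
H, D, K = 3, 4, 2                 # hidden units, input dim, number of labels (toy sizes)
W = rng.normal(size=(D, H))       # x-to-z weights (columns are w_h)
V = rng.normal(size=(H, K))       # z-to-y weights (rows are v_h)
b = rng.normal(size=H)

x_t = rng.normal(size=D)
y_t = np.eye(K)[1]                # one-hot label at time t

o = b + W.T @ x_t + V @ y_t       # o_h = b_h + w_h^T x_t + v_h^T y_t

# Explicit sum over all 2^H configurations of z_t (the inner sums in eq. 462):
brute = 0.0
for bits in range(2 ** H):
    z = np.array([(bits >> h) & 1 for h in range(H)], dtype=float)
    brute += np.exp(z @ o)

closed_form = np.prod(1.0 + np.exp(o))   # eq. (463)
print(brute, closed_form)                # the two agree
```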
For inference, we’ll need an algorithm for computing the marginals p(yt | x) and p(yt , yt−1 | x).
The equations are essentially the same as the forward-backward formulas for the linear-chain
CRF, but with summations over z:
\[
p(y_t, y_{t-1} \mid x) \propto \alpha_{t-1}(y_{t-1}) \left[\sum_{z_t} \Psi_t(x_t, z_t, y_t, y_{t-1})\right] \beta_t(y_t) \tag{464}
\]
\[
p(y_t \mid x) \propto \alpha_t(y_t)\, \beta_t(y_t) \tag{465}
\]
\[
\alpha_t(j) = \sum_{i \in \mathcal{Y}} \sum_{z_t} \Psi_t(x_t, z_t, j, i)\, \alpha_{t-1}(i) \tag{467}
\]
\[
\beta_t(j) = \sum_{i \in \mathcal{Y}} \sum_{z_{t+1}} \Psi_{t+1}(x_{t+1}, z_{t+1}, i, j)\, \beta_{t+1}(i) \tag{468}
\]
Training. The conditional log likelihood for a single example (x, y) is (bias and initial-state
terms omitted)
\[
L = \log p(y \mid x) \tag{469}
\]
\[
= \sum_{t=1}^{T} \log\left(\sum_{z_t} \Psi_t(x_t, z_t, y_{t-1}, y_t)\right) - \log Z(x) \tag{470}
\]
\[
\text{where } \Psi_t := \exp\{y_{t-1}^\top A y_t + x_t^\top W z_t + z_t^\top V y_t\} \tag{471}
\]
Let $\Upsilon = \{W, V, b, c\}$ be the set of model parameters. The gradient w.r.t. the data-
dependent120 parameters $v \in \Upsilon$ is given by
\[
\frac{\partial L}{\partial v} = \sum_{t=1}^{T} \sum_{k \in \mathcal{Y}} \big(\mathbb{1}_{y_t = k} - p(y_t = k \mid x)\big) \left(\sum_{h=1}^{H} \sigma\big(o_{hk}(x_t)\big)\, \frac{\partial o_{hk}(x_t)}{\partial v}\right) \tag{472}
\]
\[
\text{where } o_{hk}(x_t) = b_h + c_k + V_{hk} + w_h^\top x_t \tag{473}
\]
Unfortunately, the negative CLL is non-convex, and so we are only guaranteed to converge
to a local maximum of the CLL.
120
The data-dependent parameters are each individual element of the elements of Υ. “Data” here means
(x, y). Notice that Υ does not include A, π, or τ.
4.37.1 Detailed Derivations
Unfortunately, the paper leaves out a lot of details regarding derivations and implementations.
I’m going to work through them here. First, a recap of the main equations, and with all
biases/initial states included. Not leaving anything out121
\[
p(y \mid x) = \frac{\exp\{y_1^\top \pi + y_T^\top \tau\}}{Z(x)} \prod_{t=1}^{T} \exp\{c^\top y_t + y_{t-1}^\top A y_t\} \prod_{h=1}^{H} \big(1 + \exp\{b_h + w_h^\top x_t + v_h^\top y_t\}\big) \tag{474}
\]
\[
NLL = -\sum_{i=1}^{N} \log p(y^{(i)} \mid x^{(i)}) \tag{475}
\]
\[
= -\sum_{i=1}^{N} \left[ \sum_{t=1}^{T} \log\left(\sum_{z_t} \psi_t(x_t, z_t, y_{t-1}, y_t)\right) - \log Z(x^{(i)}) \right] \tag{476}
\]
The above formula for $p(y \mid x)$ implies something that will be very useful:
\[
\sum_{z_t} \psi_t(x_t, z_t, y_{t-1}, y_t) = \exp\{c^\top y_t + y_{t-1}^\top A y_t\} \prod_{h=1}^{H} \big(1 + \exp\{b_h + w_h^\top x_t + v_h^\top y_t\}\big) \tag{477}
\]
Using the generalization of the product rule for derivatives over N products, we can derive that
" #" #
∂ Y Y X ∂o(h)
(1 + exp{o(h)}) = (1 + exp{o(h)}) σ (o (h)) (478)
∂v h h h
∂v
P
Which means the derivatives of z ψ for the data-dependent params vdat and transition params
vtr , are:
P " #
∂ zt ψt X ∂o(h, yt ) X
= σ (o (h, yt )) ψt (479)
∂vdat h
∂vdat zt
P
∂ ψt ∂ T
X
zt
= c yt + yt−1 Ayt ψt (480)
∂vtr ∂vtr zt
I’ll now proceed to derive the gradients of negative (conditional) log-likelihood for the main
parameters. We can save some time by getting the base formula for any of the gradients with
121
The equation for p(y | x) from the paper, and thus here, is technically incorrect. The term exp{cT yt +
yt−1 Ayt } should not be included in the product over t for t = 1.
respect to a specific single parameter v:
\[
\frac{\partial NLL}{\partial v} = -\sum_{i=1}^{N} \left[ \sum_{t=1}^{T} \frac{\partial}{\partial v} \log\left(\sum_{z_t} \psi_t(x_t^{(i)}, z_t, y_{t-1}^{(i)}, y_t^{(i)})\right) - \frac{\partial}{\partial v} \log Z(x^{(i)}) \right] \tag{483}
\]
\[
\frac{\partial \log Z(x^{(i)})}{\partial v} = \frac{1}{Z(x^{(i)})} \frac{\partial}{\partial v} \sum_{y_{\langle 1 \dots T\rangle}} \tilde{p}(y_1, \dots, y_T \mid x^{(i)}) \tag{484}
\]
\[
\frac{\partial}{\partial v} \tilde{p}(y_1, \dots, y_T \mid x^{(i)}) = \frac{\partial}{\partial v} \prod_t \sum_{z_t} \psi_t \tag{485}
\]
\[
= \left[\prod_t \sum_{z_t} \psi_t\right] \left[\sum_t \frac{\frac{\partial}{\partial v} \sum_{z_t} \psi_t}{\sum_{z_t} \psi_t}\right] \tag{486}
\]
where I’ve done some regrouping on the last line to be more gradient-friendly.
Data-dependent parameters
\[
\frac{\partial NLL}{\partial v_{dat}} = -\sum_{i=1}^{N} \left[ \sum_{t=1}^{T} \frac{\partial}{\partial v_{dat}} \log\left(\sum_{z_t} \psi_t\right) - \frac{\partial}{\partial v} \log Z(x^{(i)}) \right] \tag{487}
\]
\[
= -\sum_{i=1}^{N} \left[ \sum_t \sum_h \sigma\big(o(h, y_t)\big) \frac{\partial o(h, y_t)}{\partial v_{dat}} - \frac{1}{Z(x^{(i)})} \sum_{y_{\langle 1 \dots T\rangle}} \left[\prod_t \sum_{z_t} \psi_t\right] \left[\sum_t \frac{\frac{\partial}{\partial v} \sum_{z_t} \psi_t}{\sum_{z_t} \psi_t}\right] \right] \tag{488}
\]
\[
= -\sum_{i=1}^{N} \left[ \sum_t \sum_h \sigma\big(o(h, y_t)\big) \frac{\partial o(h, y_t)}{\partial v_{dat}} - \frac{1}{Z(x^{(i)})} \sum_{y_{\langle 1 \dots T\rangle}} \left[\prod_t \sum_{z_t} \psi_t\right] \left[\sum_t \sum_h \sigma\big(o(h, y_t)\big) \frac{\partial o(h, y_t)}{\partial v_{dat}}\right] \right] \tag{489}
\]
\[
= -\sum_{i=1}^{N} \left[ \sum_t \sum_h \sigma\big(o(h, y_t)\big) \frac{\partial o(h, y_t)}{\partial v_{dat}} - \sum_t \sum_y \sum_{y'} \left[\sum_h \sigma\big(o(h, y)\big) \frac{\partial o(h, y)}{\partial v_{dat}}\right] \xi_{t,y,y'} \right] \tag{490}
\]
\[
= -\sum_{i=1}^{N} \left[ \sum_t \sum_h \sigma\big(o(h, y_t)\big) \frac{\partial o(h, y_t)}{\partial v_{dat}} - \sum_t \sum_y \left[\sum_h \sigma\big(o(h, y)\big) \frac{\partial o(h, y)}{\partial v_{dat}}\right] \gamma_{t,y} \right] \tag{491}
\]
\[
= -\sum_{i=1}^{N} \sum_t \left[ \sum_h \sigma\big(o(h, y_t)\big) \frac{\partial o(h, y_t)}{\partial v_{dat}} - \sum_y \left(\sum_h \sigma\big(o(h, y)\big) \frac{\partial o(h, y)}{\partial v_{dat}}\right) \gamma_{t,y} \right] \tag{492}
\]
\[
= -\sum_{i=1}^{N} \sum_t \sum_y \big(\mathbb{1}_{y_t = y} - \gamma_{t,y}\big) \left(\sum_h \sigma\big(o(h, y)\big) \frac{\partial o(h, y)}{\partial v_{dat}}\right) \tag{493}
\]
NOTE: Although I haven’t thoroughly checked the last few steps, they are required to be true in order to match
the paper’s results.
Transition parameters
\[
\frac{\partial NLL}{\partial v_{tr}} = -\sum_{i=1}^{N} \left[ \sum_{t=1}^{T} \frac{\partial}{\partial v_{tr}} \log\left(\sum_{z_t} \psi_t\right) - \frac{\partial}{\partial v_{tr}} \log Z(x^{(i)}) \right] \tag{494}
\]
\[
= -\sum_{i} \left[ \sum_t \frac{\partial}{\partial v_{tr}} \big(c^\top y_t + y_{t-1}^\top A y_t\big) - \frac{\partial}{\partial v_{tr}} \log Z(x^{(i)}) \right] \tag{495}
\]
\[
= -\sum_{i=1}^{N} \left[ \sum_t \frac{\partial}{\partial v_{tr}} \big(c^\top y_t + y_{t-1}^\top A y_t\big) - \frac{1}{Z(x^{(i)})} \sum_{y_{\langle 1 \dots T\rangle}} \left[\prod_t \sum_{z_t} \psi_t\right] \left[\sum_t \frac{\partial}{\partial v_{tr}} \big(c^\top y_t + y_{t-1}^\top A y_t\big)\right] \right] \tag{496}
\]
\[
= -\sum_{i=1}^{N} \sum_t \sum_y \big(\mathbb{1}_{y_t = y} - \gamma_{t,y}\big)\, \frac{\partial}{\partial v_{tr}} \big(c^\top y_t + y_{t-1}^\top A y_t\big) \tag{497}
\]
Boundary parameters
\[
\frac{\partial NLL}{\partial \pi_\ell} = -\sum_{i=1}^{N} \big(\mathbb{1}_{y_1 = \ell} - \gamma_{1,\ell}\big) \tag{498}
\]
\[
\frac{\partial NLL}{\partial \tau_\ell} = -\sum_{i=1}^{N} \big(\mathbb{1}_{y_T = \ell} - \gamma_{T,\ell}\big) \tag{499}
\]
The results of each of the boxes above are summarized below, for the case of N = 1 to save
space.
\[
\frac{\partial NLL}{\partial v_{dat}} = -\sum_t \sum_y \big(\mathbb{1}_{y_t = y} - \gamma_{t,y}\big) \left(\sum_h \sigma\big(o(h, y)\big) \frac{\partial o(h, y)}{\partial v_{dat}}\right) \tag{500}
\]
\[
\frac{\partial NLL}{\partial v_{tr}} = -\sum_t \sum_y \big(\mathbb{1}_{y_t = y} - \gamma_{t,y}\big)\, \frac{\partial}{\partial v_{tr}} \big[c^\top y_t + y_{t-1}^\top A y_t\big] \tag{501}
\]
\[
\frac{\partial NLL}{\partial \pi_\ell} = -\sum_{i=1}^{N} \big[\mathbb{1}_{y_1 = \ell} - \gamma_{1,\ell}\big] \tag{502}
\]
\[
\frac{\partial NLL}{\partial \tau_\ell} = -\sum_{i=1}^{N} \big[\mathbb{1}_{y_T = \ell} - \gamma_{T,\ell}\big] \tag{503}
\]
Now I’ll further go through and show how the equations simplify for each type of data-
dependent parameter.
\[
\frac{\partial NLL}{\partial W_{c,h}} = -\sum_t \sum_y \big(\mathbb{1}_{y_t = y} - \gamma_{t,y}\big) \sum_{h'} \sigma\big(o(h', y)\big)\, \frac{\partial}{\partial W_{c,h}}\big(b_{h'} + c_y + V_{h',y} + w_{h'}^\top x_t\big) \tag{504}
\]
\[
= -\sum_t \sum_y \big(\mathbb{1}_{y_t = y} - \gamma_{t,y}\big) \sum_{h'} \sigma\big(o(h', y)\big)\, \mathbb{1}_{h = h'}\, \mathbb{1}_{c \in x_t} \tag{505}
\]
\[
= -\sum_t \sum_y \big(\mathbb{1}_{y_t = y} - \gamma_{t,y}\big)\, \sigma\big(o(h, y)\big)\, \mathbb{1}_{c \in x_t} \tag{506}
\]
\[
\frac{\partial NLL}{\partial V_{h,y}} = -\sum_t \big(\mathbb{1}_{y_t = y} - \gamma_{t,y}\big)\, \sigma\big(o(h, y)\big) \tag{507}
\]
\[
\frac{\partial NLL}{\partial b_h} = -\sum_t \sum_y \big(\mathbb{1}_{y_t = y} - \gamma_{t,y}\big)\, \sigma\big(o(h, y)\big) \tag{508}
\]
\[
\frac{\partial NLL}{\partial c_y} = -\sum_t \sum_h \big(\mathbb{1}_{y_t = y} - \gamma_{t,y}\big)\, \sigma\big(o(h, y)\big) \tag{509}
\]
Alternative Approach. The above was a bit more cumbersome than needed. I’ll now derive
it in an easier way.
\[
p(y \mid x) = \frac{\exp\{y_1^\top \pi + y_T^\top \tau\}}{Z(x)} \prod_{t=1}^{T} \exp\{c^\top y_t + y_{t-1}^\top A y_t\} \prod_{h=1}^{H} \big(1 + \exp\{b_h + w_h^\top x_t + v_h^\top y_t\}\big) \tag{511}
\]
\[
= \frac{\exp\{I + \mathcal{T}\}}{Z(x)} \prod_{t=1}^{T} \prod_{h=1}^{H} \big(1 + \exp\{b_h + w_h^\top x_t + v_h^\top y_t\}\big) \tag{512}
\]
(writing $I := y_1^\top \pi + y_T^\top \tau$ and $\mathcal{T} := \sum_t \big(c^\top y_t + y_{t-1}^\top A y_t\big)$ for brevity)
\[
NLL = -\sum_{i=1}^{N} \log p(y^{(i)} \mid x^{(i)}) \tag{513}
\]
\[
= -\sum_{i=1}^{N} \left[ I + \mathcal{T} + \sum_{t=1}^{T} \sum_{h=1}^{H} \log\big(1 + \exp\{b_h + w_h^\top x_t + v_h^\top y_t\}\big) - \log Z(x^{(i)}) \right] \tag{514}
\]
\[
\frac{\partial \log Z(x^{(i)})}{\partial v} = \frac{1}{Z(x^{(i)})} \sum_{y_{\langle 1 \dots T\rangle}} \frac{\partial}{\partial v} \tilde{p}(y \mid x^{(i)}) \tag{517}
\]
\[
= \frac{1}{Z(x^{(i)})} \sum_{y_{\langle 1 \dots T\rangle}} \tilde{p}(y \mid x^{(i)})\, \frac{\partial}{\partial v} \log \tilde{p}(y \mid x^{(i)}) \tag{518}
\]
as our base formula for partial derivatives.
Transition parameters
" #
∂Li ∂ X T T
(i)
= I +T + log 1 + exp bh + wh xt + vh yt − log Z(x ) (519)
∂Ai,j ∂Ai,j
t,h
" #
∂ X ∂ X
= [I + T ] − p(y | x(i) ) y1T π + yT
T
τ+ yt−1 Ayt (520)
∂Ai,j ∂Ai,j
yh1...T i t
T
X X 1 X
= 1 (i) 1 (i) − e(y | x(i) )
p 1yt−1 =i 1yt =j (521)
yt−1 =i yt =j Z(x(i) )
t yh1...T i t=1
X XX X X X 1
= 1 (i) 1 (i) − 1yt−1 =i 1yt =j e(y | x(i) )
p (522)
yt−1 =i yt =j Z(x(i) )
t t yt yt−1 yh1...t−2i yht+1...T i
Papers and Tutorials June 30, 2018
Model Definition. The Hidden-Unit CRF (HUCRF) accepts the usual observation sequence
x = x1 , . . . , xn , and associated label sequence y = y1 , . . . yn for training. The HUCRF also
has a hidden layer of binary-valued z = z1 . . . zn . It defines the joint probability
where
• Y(x, z) is the set of all possible label sequences for x and z.
• $\Phi(x, z) = \sum_{j=1}^{n} \phi(x, j, z_j)$
• $\Psi(z, y) = \sum_{j=1}^{n} \psi(z_j, y_{j-1}, y_j)$.
Also note that we model $(z_i \perp z_{j \ne i} \mid x, y)$.
Pre-training HUCRFs. Since the objective for HUCRFs is non-convex, we should choose a
better initialization method than random initialization. This is where pre-training comes in,
a simple 2-step approach:
1. Cluster observed tokens from M unlabeled sequences and treat the clusters as labels to
train an intermediate HUCRF. Let C(u(i) ) be the sequence of cluster assignments/labels
for the unlabeled sequence u(i) . We compute:
\[
(\theta_1, \gamma_1) \approx \arg\max_{\theta, \gamma} \sum_{i=1}^{M} \log p_{\theta,\gamma}\big(C(u^{(i)}) \mid u^{(i)}\big) \tag{524}
\]
2. Then train on the labeled data, initializing $\theta$ at $\theta_1$:
\[
(\theta_2, \gamma_2) \approx \arg\max_{\theta, \gamma} \sum_{i=1}^{N} \log p_{\theta,\gamma}\big(y^{(i)} \mid x^{(i)}\big) \tag{525}
\]
Note that pre-training only defines the initialization for θ, the parameters between x and z.
We still train γ, the parameters from z to y, from scratch.
Canonical Correlation Analysis (CCA). A general technique that we will need to under-
stand as a prerequisite for the multi-sense clustering approach (defined in the next section).
Given $n$ samples of the form $(x^{(i)}, y^{(i)})$, where each $x^{(i)} \in \{0,1\}^d$ and $y^{(i)} \in \{0,1\}^{d'}$, CCA
returns projection matrices $A \in \mathbb{R}^{d \times k}$ and $B \in \mathbb{R}^{d' \times k}$ that we can use to project the samples
to $k$ dimensions:
\[
x \longrightarrow A^\top x \tag{526}
\]
\[
y \longrightarrow B^\top y \tag{527}
\]
\[
D_{i,j} = \sum_{l=1}^{n} \mathbb{1}_{x^{(l)}_i = 1}\, \mathbb{1}_{y^{(l)}_j = 1} \tag{528}
\]
\[
u_i = \sum_{l=1}^{n} \mathbb{1}_{x^{(l)}_i = 1} \tag{529}
\]
\[
v_i = \sum_{l=1}^{n} \mathbb{1}_{y^{(l)}_i = 1} \tag{530}
\]
Multi-sense clustering. For each word type, use CCA to create a set of context embeddings
corresponding to all occurrences of that word type. Then, cluster these embeddings with k-
means. Set the number of word senses k to the number of label types occurring in the labeled
data.
Papers and Tutorials July 07, 2018
For comparisons later on with the traditional attention mechanism, here it is:
\[
c = \sum_{t=1}^{T} p(z = t \mid x, q)\, x_t \tag{531}
\]
\[
p(z = t \mid x, q) = \mathrm{softmax}(\theta_t) \tag{532}
\]
where usually x is the sequence of hidden state of the encoder RNN, q is the hidden state of
the decoder RNN at the most recent time step, z gives the source position to be attended to,
and θt = score(xt , q).
122
Also called the “alignments”. It is the output of the softmax layer of attention scores in the majority of
cases.
123
In all applications I’ve seen, f (x, z) = xz .
Example 1: Subsequence Selection. Let $m = T$, and let each $z_i \in \{0, 1\}$ be a binary R.V.
Let $f(x, z) = \sum_{t=1}^{T} f_t(x, z_t) = \sum_{t=1}^{T} x_t \mathbb{1}_{z_t = 1}$ 124 . This yields the context vector,
\[
c = \mathbb{E}_{z_1, \dots, z_T}\big[f(x, z)\big] = \sum_{t=1}^{T} p(z_t = 1 \mid x, q)\, x_t \tag{535}
\]
Although this looks similar to equation 531, we haven’t yet revealed the functional form for
$p(z \mid x, q)$. Two possible choices:
\[
\text{Linear-Chain CRF:} \quad p(z_1, \dots, z_T \mid x, q) = \frac{1}{Z(x, q)} \prod_{t=1}^{T} \psi_t(z_t, z_{t-1}) \tag{536}
\]
\[
\text{Bernoulli:} \quad p(z_1, \dots, z_T \mid x, q) = \prod_{t=1}^{T} p(z_t = 1 \mid x, q) = \prod_{t=1}^{T} \sigma\big(\psi_t(z_t)\big) \tag{537}
\]
(Margin note: the factor $\psi$ for the CRF is NOT the same as the factor for the Bernoulli!)
These show why equation 535 is fundamentally different than equation 531:
• It allows for multiple inputs (or no inputs) to be selected for a given query.
• We can incorporate structural dependencies across the zt ’s.
Also note that all methods can use potentials from the same neural network or RNN that takes
x and q as input. By this we mean, for example, that we can take the same parameters we’d use
when computing the scores in our attention layer, and reinterpret them as e.g. CRF parameters.
Then, we can compute the marginals p(zt | x) using the forward-backward algorithm125 .
124
Ok, so equivalently, $z^\top x$, i.e. the indicator function can just be replaced by $z_t$ here.
125
This is different than the simple softmax we usually use in an attention layer, which does not model any
interdependencies between the zt . The marginals we end up with when using the CRF originate from a joint
distribution over the entire sequence z1 , . . . zT . This seems potentially incredibly powerful. Need to analyze in
depth.
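As a rough sketch of the CRF choice above (my own illustration, not the paper's implementation), the following computes the marginals p(z_t = 1 | x, q) over a binary chain with forward-backward and uses them as the attention weights of equation 535; the pairwise potentials are random placeholders for scores that would normally be produced by the network that sees x and q.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 4
x = rng.normal(size=(T, d))                  # "encoder states" x_1..x_T
# Pairwise potentials psi_t(z_t, z_{t-1}) for the binary chain (random stand-ins).
psi = rng.uniform(0.5, 1.5, size=(T, 2, 2))  # psi[t, z_t, z_{t-1}]

alpha = np.zeros((T, 2))
alpha[0] = psi[0, :, 0]                      # dummy initial state z_0 = 0
for t in range(1, T):
    alpha[t] = psi[t] @ alpha[t - 1]

beta = np.ones((T, 2))
for t in range(T - 2, -1, -1):
    beta[t] = psi[t + 1].T @ beta[t + 1]

Z = alpha[-1].sum()
marginals = alpha * beta / Z                 # marginals[t, k] = p(z_t = k | x, q)

# Equation (535): context vector as the expectation of the selected subsequence.
c = (marginals[:, 1:2] * x).sum(axis=0)
print(marginals[:, 1], c)
```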
Papers and Tutorials July 08, 2018
Neural CRFs. Essentially, we feed the input sequence x through a feed-forward network
whose output layer has a linear activation function. The output layer is then connected with
the target variable sequence Y. In other words, instead of feeding instances x of the observation
variables X, we feed the hidden layer activations of the NN. This results in the conditional
probability
\[
p(y \mid x) \propto \prod_{c \in C} e^{-E_c(x, y_c, w)} = \prod_{c \in C} e^{\langle w_c^{y_c},\, \Phi_c(x, w_{NN})\rangle} \tag{538}
\]
(Margin note: we can set $\Phi_c = \Phi$ for a shared-weights approach.)
where
• $w_{NN}$ are the weights for the NN.
• $w_c^{y_c}$ are the weights (for clique c) for the CRF.
• $\Phi_c(x, w_{NN})$ is the output of the NN. It symbolizes the high-level feature representation
of the input x at clique c computed by the NN.
The authors refer to the linear output layer (containing the CRF weights) as the energy outputs.
For the sake of writing this in more familiar notation for the linear-chain CRF case, here is
the above equation translated for the case where each clique corresponds to a timestep t of the
input sequence and is either a label-label clique or a state-label clique.
\[
p(y \mid x) = \frac{1}{Z(x)} \prod_{t=1}^{T} \exp\{-E_t(x, y_t, y_{t-1}, w)\} \tag{539}
\]
\[
= \frac{1}{Z(x)} \prod_{t=1}^{T} \exp\{-E_{loc}(x, t, y_t, w) - E_{trans}(x, t, y_{t-1}, y_t, w)\} \tag{540}
\]
where the authors are using a blanket w to denote all model parameters126 .
126
Also note that the authors allow for utilizing the input sequence x in the transition energy function, Etrans ,
although usually we implement Etrans using only yt−1 and yt .
Initialization & Fine-Tuning. The hidden layers of the NN are initialized layer-by-layer in
an unsupervised manner using RBMs. It’s important to note that the hidden layers of the NN
consist of binary units. Then, using the pre-trained hidden layers, the CRF layer is initialized
by training it in the usual way, and keeping the pretrained NN weights fixed.
Papers and Tutorials July 08, 2018
BI-LSTM-CRF Networks. Consider the matrix of scores fθ (x) for input sentence x. The
element [fθ ]`,t gives the score for label ` with the t-th word. This is output by the LSTM
network parameterized by θ. We let [A]i,j denote the transition score from label i to label j
within the CRF. The total set of parameters is denoted $\tilde{\theta} = \theta \cup A$. The total score for input
sentence x and predicted label sequence y is then
\[
s(x, y, \tilde{\theta}) = \sum_{t=1}^{T} \Big( A_{y_{t-1}, y_t} + [f_\theta]_{y_t, t} \Big) \tag{546}
\]
EDIT: they may actually replace the one-hot encoded word features with the word embeddings.
Unclear.
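A small numpy sketch of the path score in equation 546, with random stand-ins for the LSTM emission scores $f_\theta$ and the transition matrix A; as a simplification I omit any special transition into the first position.

```python
import numpy as np

rng = np.random.default_rng(0)
L, T = 4, 6                            # number of labels, sentence length (toy sizes)
f_theta = rng.normal(size=(L, T))      # [f_theta]_{l, t}: score of label l at position t
A = rng.normal(size=(L, L))            # A[i, j]: transition score from label i to label j
y = rng.integers(0, L, size=T)         # a candidate label sequence

def path_score(y, f_theta, A):
    # Equation (546): sum of transition and emission scores along the path.
    s = f_theta[y[0], 0]               # no transition into the first position here
    for t in range(1, len(y)):
        s += A[y[t - 1], y[t]] + f_theta[y[t], t]
    return s

print(path_score(y, f_theta, A))
```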
Papers and Tutorials July 16, 2018
Feature-based Methods. For each entity pair (e1 , e2 ), generate a set of features and train
a classifier to predict the relation127 . Some useful features are shown in the figure below.
Authors found that SVMs outperform MaxEnt (logistic reg) classifiers for this task.
Kernel methods. Instead of explicit feature engineering, we can design kernel functions for
computing similarities between representations of two relation instances128 (a relation instance
is a triplet of the form (e1 , e2 )), and SVM for the classification.
127
If there are N unique relations for our data, it is common to train the classifier to predict 2N total relations,
to handle both possible orderings of relation arguments.
128
Recall that kernel methods are for the general optimization problem
\[
\min_{\alpha} \sum_{i=1}^{n} \mathrm{loss}\left(\sum_{j=1}^{n} \alpha_j\, \phi(x_j)^\top \phi(x_i),\; y_i\right) + \lambda \sum_{j=1}^{n} \sum_{k=1}^{n} \alpha_j \alpha_k\, \phi(x_j)^\top \phi(x_k) \tag{547}
\]
which changes the direct focus from feature engineering to “similarity” engineering.
One approach is the sequence kernel. We represent each relation instance as a sequence of
feature vectors:
(e1 , e2 ) → (f1 , . . . , fN )
where N might be e.g. the number of words between the two entities, and the dimension of
each f is the same, and could correspond to e.g. POS tag, NER tag, etc. More formally,
define the generalized subsequence kernel, Kn (s, t, λ), that computes some number of weighted
subsequences u such that
• There exist index sequences $\mathbf{i} := (i_1, \dots, i_n)$ and $\mathbf{j} := (j_1, \dots, j_n)$ of length $n$ such that
\[
u_i \in s_i \quad \forall i \in \mathbf{i} \tag{548}
\]
\[
u_j \in t_j \quad \forall j \in \mathbf{j} \tag{549}
\]
• The weight of $u$ is $\lambda^{l(\mathbf{i}) + l(\mathbf{j})}$, where $l(x) = \max(x) - \min(x)$ and $0 < \lambda < 1$. Sparser
(more spaced out) subsequences get lower weight.
The authors then provide the recursion formulas for K, and describe some extensions of se-
quence kernels for relation extraction.
Syntactic Tree Kernels. Structural properties of a sentence are encoded by its constituent
parse tree. The tree defines the syntax of the sentence in terms of constituents such as noun
phrases (NP), verb phrases (VP), prepositional phrases (PP), POS tags (NN, VB, IN, etc.)
as non-terminals and actual words as leaves. The syntax is usually governed by Context Free
Grammar (CFG). Constructing a constituent parse tree for a given sentence is called parsing.
The Convolution Parse Tree Kernel KT can be used for computing similarity between two
syntactic trees.
Dependency Tree Kernels. For grammatical relations between words in a sentence. Words
are the nodes and dependency relations are the edges (in the tree), typically from dependent
to parent. In the relation instance representation, we use the smallest subtree containing
the entity pair of a given sentence. Each node is augmented with additional features like POS,
chunk, entity level (name, nominal, pronoun), hypernyms, relation argument, etc. Formally,
an augmented dependency tree is defined as a tree T where
• Each node ti has features φ(ti ) = {v1 , . . . , vd }.
• Let ti [c] denote all children of ti , and let P a(ti ) denote its parent.
• For comparison of two nodes we use:
– Matching function m(ti , tj ): equal to 1 if some important features are shared
between ti and tj , else 0.
– Similarity function s(ti , tj ): returns a positive real similarity score, and defined
as
\[
s(t_i, t_j) = \sum_{v_q \in \phi(t_i)} \sum_{v_r \in \phi(t_j)} \mathrm{Compat}(v_q, v_r) \tag{551}
\]
Finally, we can define the overall dependency tree kernel K(T1 , T2 ) for similarity between trees
T1 and T2 as follows. Let ri denote the root node of tree Ti .
\[
K(T_1, T_2) =
\begin{cases}
0 & \text{if } m(r_1, r_2) = 0 \\
s(r_1, r_2) + K_c(r_1[c], r_2[c]) & \text{otherwise}
\end{cases} \tag{552}
\]
\[
K_c(t_i[c], t_j[c]) = \sum_{a, b,\; l(a) = l(b)} \lambda^{d(a) + d(b)}\, K(t_i[a], t_j[b]) \tag{553}
\]
(Margin notes: $a_1 \le a_2 \le \dots \le a_n$; $d(a) \triangleq a_n - a_1 + 1$; $0 < \lambda < 1$.)
The interpretation is that, whenever a pair of matching nodes is found, all possible matching
subsequences129 of their children are found. Two subsequences a and b are said to “match”
if $m(a_i, b_i) = 1$ $(\forall i < n)$. Similar to the sequence kernel seen earlier, λ is a decay factor that
penalizes sparser subsequences.
129
Note that a summation over subsequences of a sequence $a$, denoted here as $\sum_{a}$, expands over
$\{a_1, \dots, a_n, a_1 a_2, a_1 a_3, \dots, a_1 a_n, a_1 a_2 a_3, \dots, a_2 a_5 a_6, \dots\}$ and so on and so forth.
Papers and Tutorials July 16, 2018
Lin et al., “Neural Relation Extraction with Selective Attention over Instances,” (2016).
Input Representation. Each instance sentence x is tokenized into a sequence of words. Each
word wi is transformed into a concatenation wi ∈ Rd (d = dw + 2dpos ),
where dist(a, b) returns the [embedded] relative distance (num tokens) between a and b in the
given sentence (positive integer)130 .
\[
q_i = w_{i-\ell+1 : i} \quad (1 \le i \le T + \ell - 1) \tag{555}
\]
and let $Q \in \mathbb{R}^{(T+\ell-1) \times \ell \cdot d}$ be defined such that row $Q_i = q_i^\top$. It follows that, for convolution
matrix $W \in \mathbb{R}^{K \times (\ell \cdot d)}$, the output of the $k$th filter, and subsequent max-pooling, is
\[
p^k = \big[W Q^\top + b\big]_k \tag{556}
\]
\[
[x]_k = [\max(p^k_1); \max(p^k_2); \max(p^k_3)] \tag{557}
\]
where we’ve divided pk into three segments, corresponding to before entity 1, middle, and
after entity 2 of the given sentence. The sentence vector x ∈ R3K is the concatenation of all
of these, after feeding through a non-linear activation function like a ReLU.
130
More specifically, it is an embedded representation of the relative distance. To actually implement it, you’d
first shift the relative distances such that they begin at 0, and learn 2 * window_size + 1 embedding vectors
for each of the possible position offsets. Anything outside the window is embedded into the zero vector.
Selective Attention over Instances. An attention mechanism is employed over all n sen-
tence instances xi for some candidate entity pair (e1 , e2 ). The output is a set vector s, a
real-valued vector representation of the set of instances, where
s = \sum_i \alpha_i x_i    (558)
\alpha_i = \mathrm{softmax}(x_i A r)    (559)
is an attention-weighted sum over the instance embeddings. Note that they constrain A to be
diagonal. Finally, the predictive distribution is defined as
where S is the set of n sentences for the given entity pair (e1 , e2 ).
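A small sketch of the attention-weighted sum in equations 558-559, assuming a diagonal A (stored as a vector) and a relation query vector r; the tensor names are mine, not from the paper's implementation.

import torch

def selective_attention(X: torch.Tensor, A_diag: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """X: (n, d) instance embeddings, A_diag: (d,), r: (d,). Returns the set vector s: (d,)."""
    scores = (X * A_diag) @ r          # x_i A r for each instance, shape (n,)
    alpha = torch.softmax(scores, dim=0)
    return alpha @ X                   # s = sum_i alpha_i x_i

s = selective_attention(torch.randn(4, 6), torch.ones(6), torch.randn(6))
print(s.shape)  # torch.Size([6])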
Papers and Tutorials July 31, 2018
Gelfand et al., “On Herding and the Perceptron Cycling Theorem,” (2010).
Introduction. Begin with the familiar learning rule of Rosenblatt’s perceptron, after some
incorrect prediction \hat{y}_i = \mathrm{sgn}(w^T x_i),
which has the effect that a subsequent prediction on xi will (before taking the sign) be ||xi ||22
closer to the correct side of the hyperplane. The perceptron cycling theorem (PCT) states
that if the data is not linearly separable, the weights will still remain bounded and won’t
diverge to infinity. This paper shows that the PCT implies that certain moments are
conserved on average. Formally, their result says that, for some N number of iterations
over samples selected from training data (with replacement)131 ,
\frac{1}{N} \sum_i^N x_i y_i - \frac{1}{N} \sum_i^N x_i \hat{y}_i \; \sim \; O\!\left(\frac{1}{N}\right)    (562)
where it’s important to remember that yˆi here is the prediction for xi when it was encountered
at that training iteration. This result shows that perceptron learning generate predictions that
correlate with the input attributes the same way as the true labels do, and [the correlations]
converge to the sample mean with a rate of 1/N. This also hints at why averaged percep-
tron algorithms (using the average of weights across training) makes sense, as opposed to just
selected the best weights. This paper also shows that supervised perceptron algorithms and
unsupervised herding algorithms can all be derived from the PCT.
Below are some theorems that will be used throughout the paper. Let {wt } be a sequence of
vectors wt ∈ RD , each generated according to iterative updates wt+1 = wt + vt , where vt is
an element of a finite set V, and the norm of vt is bounded: max ||vt || = R < ∞.
131
Unclear whether this is only for samples that corresponded to an update, or just all samples during training.
Herding. Consider a fully observed Markov Random Field (MRF) over m variables, each of
which can take on an integer value in the range [1, K]. In herding, our energy function and
weight updates for observation x (over all m variables in X ),
What if we consider more complicated features that depend on the weights w? This situation
may arise in e.g. models with hidden units z, where our feature function would take the form
φ(x, z), and we always select z via
and therefore our feature function φ depends on weights w through z. In this case, our herding
update terms from above take the form
\bar{\phi} = E_{x^{(i)} \sim p_{data}}\left[ \phi(x^{(i)}, z(x^{(i)}, w)) \right]    (568)
x^*_t, z^*_t = \arg\max_{x, z} \; w_t^T \phi(x, z)    (569)
Conditional Herding. Main contribution of this paper. It’s basically identical to regular
herding, but now we decompose x into inputs and outputs (x, y) for interpreting in a discrim-
inative setting. In the paper, they express wT φ(x, y, z) identically as a discriminative RBM.
The parameter update for mini-batch Dt is given by
w_{t+1} = w_t + \frac{\eta}{|D_t|} \sum_{(x^{(i)}, y^{(i)}) \in D_t} \left[ \phi(x^{(i)}, y^{(i)}, z) - \phi(x^{(i)}, y^*, z^*) \right]    (570)
Papers and Tutorials August 12, 2018
Convex Set. Sets that contain all [points in] line segments that join any 2 points in the set.
A set C is called a convex set if ∀x, y ∈ C and λ ∈ [0, 1], we have that (1 − λ)x + λy ∈ C
as well.
After reading the definition of a convex set above, it seemed intuitive that any convex combination of points ∈ C
should also be in it as well (i.e. generalizing the pairwise definition). Let x, y, z ∈ C. How can we prove that
θ1 x + θ2 y + θ3 z (where θi satisfy the constraints of a convex comb.) is also in C? Here is how I ended up doing it:
• If we can prove that θ1 x + θ2 y = (1 − θ3 )v for some v ∈ C, then our work is done. This is pretty easy to
show via simple arithmetic.
• Case 1: assume \theta_3 < 1, so that we can divide both sides by 1 - \theta_3:

  v = \frac{\theta_1}{1 - \theta_3} x + \frac{\theta_2}{1 - \theta_3} y

  Clearly, the two coefficients here sum to 1 and satisfy the constraints of a convex combination, and therefore
  we know that v \in C, and this case is done.
• Case 2: assume θ3 = 1. Well, that means θ1 = θ2 = 0. Trivially, z ∈ C and this case is done.
Convex Function
A continuously differentiable function f : R^p \to R is a convex function if \forall x, y \in R^p,
we have that

f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle    (571)
While thinking about how to gain intuition for the above, I came across chapter 3 of “Convex
Optimization” which describes this in much more detail. It’s crucial to recognize that the RHS
of the inequality is the 1st-order Taylor expansion of the function f about x, evaluated at
y. In other words, the first-order Taylor approximation is a global underestimator of any
convex function f 132 .
132
Consider what this implies about all the 1st-order gradient-based optimizers we use.
Strongly Convex/Smooth Function. Informally, strong convexity ensures a convex func-
tion doesn’t grow too slow, while strong smoothness ensures a convex133 function doesn’t grow
too fast. Formally,
A continuously differentiable function is considered \alpha-strongly convex (SC) and \beta-strongly
smooth (SS) if \forall x, y \in R^p we have

\frac{\alpha}{2} \|x - y\|_2^2 \; \le \; f(y) - f(x) - \langle \nabla f(x), y - x \rangle \; \le \; \frac{\beta}{2} \|x - y\|_2^2    (572)
To see that strong smoothness alone does not imply convexity, we need a function whose value never exceeds its
linear approximation by more than \frac{\beta}{2} times the squared distance between the inputs. Intuitively, any concave
function satisfies this, since its linear approximation is a global overestimator of the true value. So, for example,
f(y) = -\|y\|_2^2 is 1-SS while being non-convex.
Lipschitz Function
A function f is B-Lipschitz if ∀x, y ∈ Rp ,
133
Strong smoothness alone does not imply convexity.
134
Notice that SC and SS are quadratic lower and upper bounds, respectively. This means that the allowed
deltas grow as a function of the distance between x and y, whereas things like Lipschitzness grow linearly.
135
It should be obvious that expectations are convex combinations.
Convex Projections (2.2). Given any closed set C \subseteq R^p, the projection operator \Pi_C(\cdot) is
defined as
If C is a convex set, then the above reduces to a convex optimization problem. Projections
onto convex sets have three particularly interesting properties. For each of them, the setup is:
“For any convex set C ⊂ Rp , and any z ∈ Rp , let ẑ := ΠC (z). Then ∀x ∈ C, . . . ”
• Property-O136 : ||ẑ − z||2 ≤ ||x − z||2 . Informally: “the projection results in the point
ẑ in C that is closest to the original z”. This basically just restates the optimization
problem.
• Property-I. hx − ẑ, z − ẑi ≤ 0 . Informally: “from the perspective of ẑ, all points x ∈ C
are in the (informally) opposite direction of z.”
• Property-II. ||ẑ − x||2 ≤ ||z − x||2 . Informally: “the projection brings the point closer
to all points in C compared to its original location.”
Proving Property-I
A proof by contradiction.
1. Assume that ∃x ∈ C s.t. hx − ẑ, z − ẑi > 0.
2. We know that ẑ is also in C, and since C is convex, then for any λ ∈ [0, 1],
xλ := λx + (1 − λ)ẑ (576)
must also be in C.
3. If we can show that some value of \lambda guarantees that \|z - x_\lambda\|_2 < \|z - \hat{z}\|_2, this would directly contradict
Property-O, implying \hat{z} is not the closest member of C to z. Expanding,
\|z - x_\lambda\|_2^2 = \|z - \hat{z} - \lambda(x - \hat{z})\|_2^2 = \|z - \hat{z}\|_2^2 - 2\lambda \langle x - \hat{z}, z - \hat{z} \rangle + \lambda^2 \|x - \hat{z}\|_2^2,
which is strictly smaller than \|z - \hat{z}\|_2^2 for any 0 < \lambda < 2\langle x - \hat{z}, z - \hat{z} \rangle / \|x - \hat{z}\|_2^2
(and such \lambda exist by the assumption in step 1), completing the contradiction.
136
In this case only, C need not be convex
4.45.1 Non-Convex Projected Gradient Descent (3)
Non-Convex Projections (3.1). Here we look at a few special cases where projecting onto
a non-convex set can still be carried out efficiently.
• Projecting into sparse vectors. The set of s-sparse vectors (vectors with at most s
nonzero elements) is denoted as
It turns out that \hat{z} := \Pi_{B_0(s)}(z) is obtained by setting all except the top-s elements of z
to zero.
• Projecting into low-rank matrices. The set of m × n matrices with rank at most r
is denoted as
This can be done efficiently via SVD on A and retaining the top r singular values and
vectors.
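A quick NumPy sketch of both projections described in the bullets above; the function names are mine and ties are broken arbitrarily.

import numpy as np

def project_sparse(z: np.ndarray, s: int) -> np.ndarray:
    """Project z onto the set of s-sparse vectors: keep the s largest-magnitude entries."""
    out = np.zeros_like(z)
    keep = np.argsort(np.abs(z))[-s:]
    out[keep] = z[keep]
    return out

def project_low_rank(A: np.ndarray, r: int) -> np.ndarray:
    """Project A onto rank-<=r matrices by truncating its SVD."""
    U, sing, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :r] * sing[:r]) @ Vt[:r, :]

print(project_sparse(np.array([0.1, -3.0, 2.0, 0.5]), s=2))                      # [ 0. -3.  2.  0.]
print(np.linalg.matrix_rank(project_low_rank(np.random.randn(5, 4), r=2)))       # 2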
and a similar rephrasing for restricted strong convexity (RSC) and restricted strong smoothness
(RSS).
Papers and Tutorials August 24, 2018
The authors use a Transformer decoder, i.e. literally just the decoder part of the Trans-
former in “Attention is all you need.”
Supervised Fine-Tuning (3.2). Now we have a labeled corpus C, where each instance consists
of a sequence of input tokens x^1, \ldots, x^m, along with a label y. They just feed the inputs
through the transformer until they obtain the final transformer block's activation h_l^m, and
linearly project it to output space:

P(y \mid x^1, \ldots, x^m) = \mathrm{softmax}(h_l^m W_y)    (583)
L_2(C) = \sum_{(x, y)} \log P(y \mid x^1, \ldots, x^m)    (584)
They also found that including a language modeling auxiliary objective helped learning,
Papers and Tutorials August 30, 2018
and it’s important to note the shared parameters Θx (token representation) and Θs (output
softmax).
where stask are softmax-normalized weights (so the combination is convex). The authors also
mention that, in some cases, it helped to apply layer normalization to each biLM layer
before weighting.
Using biLMs for Supervised NLP (3.3). Given a pretrained biLM and a supervised
architecture, we can learn the ELMo representations (jointly with the given supervised task)
as follows.
1. Freeze the weights of the [pretrained] biLM.
2. Concatenate the token representations (e.g. GloVe) with the ELMo representation.
3. Pass the concatenated representation into the supervised architecture.
The authors found it beneficial to apply some dropout to ELMo, and in some cases to add L2-regularization
on the ELMo weights.
Pretrained biLM Architecture (3.4). In addition to the biLM we introduced earlier, the
authors make the following changes/specifications for their pretrained biLMs:
• residual connections between LSTM layers137 .
• Halved all embedding and hidden dimensions from the CNN-BIG-LSTM model in Ex-
ploring the Limits of Language Modeling.
• The xk token representations use 2048 character n-gram convolutional filters followed by
two highway layers.
137
So the output of some layer, instead of being LSTM(x), becomes (x + LSTM(x))
Papers and Tutorials August 30, 2018
NCE and Importance Sampling (3.1). In this section, assume any p(w) is shorthand for
p(w | {wprev }).
• Noise Contrastive Estimation (NCE). Train a classifier to discriminate between true
data (from distribution pd ) or samples coming from some arbitrary noise distribution pn .
If these distributions were known, we could compute
P(Y{=}\text{true} \mid w) = \frac{p(w \mid \text{true}) \, p(\text{true})}{p(w)}    (589)
= \frac{p_d(w) \, p(\text{true})}{p(w, \text{true}) + p(w, \text{false})}    (590)
= \frac{p_d(w) \, p(\text{true})}{p_d(w) \, p(\text{true}) + p_n(w) \, p(\text{false})}    (591)
= \frac{p_d(w)}{p_d(w) + k \, p_n(w)}    (592)
where k is the number of negative samples per positive word. The idea is to train a
logistic classifier p(Y =true | w) = σ(log pmodel − log kpn (w)), then softmax(log pmodel ) is
a good approx of pd (w).
• Importance Sampling. Estimates the partition function. Consider that now we have a
set of k + 1 words W = {w1 , . . . , wk+1 }, where w1 is the word coming from the true data,
and the rest are from the noise distribution. We train a multinomial logistic regression
over k + 1 classes,
p(Y{=}i \mid W) = \frac{p_d(w_i)}{p_n(w_i)} \cdot \frac{1}{\sum_{i'=1}^{k+1} p_d(w_{i'}) / p_n(w_{i'})}    (593)
\propto \frac{p_d(w_i)}{p_n(w_i)}    (594)
and we end up seeing that IS is the same as NCE, except in the multiclass setting and
with cross entropy loss instead of logistic loss.
CNN Softmax (3.2). Typically the logit for word w is given by zw = hT ew , where h is often
the output state of an LSTM, and ew is a vector of parameters that could be interpreted as
the word embedding for w. Instead of this, the authors propose what they call CNN Softmax,
where we compute ew = CN N (charsw ). Although this makes the function mapping from w
to ew much smoother (due to the tied weights), it ends up having a hard time distinguishing
between similarly spelled words that may have entirely different meanings. The authors use a
correction factor, learned for each word, such that
where M projects low-dimensional corrw back up to the dimensionality of the LSTM state h.
Char LSTM Predictions (3.3). To reduce the computational burden of the partition func-
tion, the authors feed the word-level LSTM state h through a character-level LSTM that
predicts the target word one character at a time.
Papers and Tutorials October 28, 2018
Graves et al., “Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent
Neural Networks,” (2006).
Temporal Classification.
• Input space: Let X = (Rm )∗ be the set of all sequences of m dimensional real-valued
vectors.
• Output space: Let Z = L∗ be the set of all sequences of a finite vocabulary of L labels.
• Data distribution: Denote by DX ×Z the probability distribution over samples (x, z).
Let S denote a set of training examples drawn from this distribution.
From Network Outputs to Labellings (3.1). Let L' = L \cup \{\text{blank}\} denote the set of unique
labels augmented with the blank token. We refer to the alignment sequences of length T (same
length as x), i.e. elements of the set (L')^T, as paths and denote them \pi.
Now that we have alignment sequences π, we need to convert them to label sequences by
(1) merging repeated contiguous labels, and then (2) removing blank tokens. We denote this
procedure as a many-to-one map B : (L')^T \to L^{\le T}. We can then write the conditional posterior
over possible output sequences \ell:

p(\ell \mid x) = \sum_{\pi \in B^{-1}(\ell)} p(\pi \mid x)    (596)
Constructing the Classifier (3.2). There is no tractable algorithm for exact decoding, i.e.
computing
The CTC Forward-Backward Algorithm (4.1). Define the probability of obtaining the
first s output labels, `h1...si , at time t as
\alpha_t(s) \triangleq \sum_{\substack{\pi \in (L')^T \\ B(\pi_{\langle 1 \ldots t \rangle}) = \ell_{\langle 1 \ldots s \rangle}}} \; \prod_{t'=1}^{t} y^{t'}_{\pi_{t'}}    (598)
Note that the summation here could contain duplicate πh1...ti that differ only in their elements
beyond t.
We insert a blank token at the beginning and end of \ell and between each pair of labels, and call
this augmented sequence \ell'. We have the following rules for initializing \alpha at the first output
step t=1, followed by the recursion rule:

\alpha_1(s) = \begin{cases} y^1_{\text{blank}} & s = 1 \\ y^1_{\ell_1} & s = 2 \\ 0 & s > 2 \end{cases}    (599)

\alpha_t(s) = \begin{cases} \bar{\alpha}_t(s) \, y^t_{\ell'_s} & \text{if } \ell'_s = \text{blank or } \ell'_s = \ell'_{s-2} \\ \left( \bar{\alpha}_t(s) + \alpha_{t-1}(s-2) \right) y^t_{\ell'_s} & \text{otherwise} \end{cases}    (600)

where (as in the paper) \bar{\alpha}_t(s) \triangleq \alpha_{t-1}(s) + \alpha_{t-1}(s-1).
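A toy NumPy sketch of the forward recursion above (eqs. 599-600); it assumes y is the (T, |L'|) matrix of per-timestep RNN output probabilities and is illustrative only (a real implementation would work in log space or rescale for numerical stability).

import numpy as np

def ctc_forward(y: np.ndarray, labels: list, blank: int) -> float:
    # Build the augmented sequence l': a blank between/around every label.
    aug = [blank]
    for l in labels:
        aug += [l, blank]
    S, T = len(aug), y.shape[0]

    alpha = np.zeros((T, S))
    alpha[0, 0] = y[0, blank]       # s = 1 case of eq. 599
    alpha[0, 1] = y[0, aug[1]]      # s = 2 case of eq. 599
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s] + (alpha[t - 1, s - 1] if s > 0 else 0.0)   # alpha_bar
            if aug[s] != blank and s >= 2 and aug[s] != aug[s - 2]:
                a += alpha[t - 1, s - 2]                                    # "otherwise" case
            alpha[t, s] = a * y[t, aug[s]]
    # p(l | x): paths ending in the final label or the final blank.
    return alpha[T - 1, S - 1] + alpha[T - 1, S - 2]

y = np.full((4, 3), 1.0 / 3)        # toy uniform outputs, vocab = {0, 1, blank=2}
print(ctc_forward(y, labels=[0, 1], blank=2))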
It’s worth emphasizing how to interpret these, given we’ve imposed this weird augmented label
sequence. In as-verbose-as-possible terms,
αt (s) is the probability, after running our RNN for t time steps to pro-
duce the path πh1...ti , that B(πh1...ti )==`h1... s−1 i for which, after inserting
2
between all elements of `h1... s−1 i , we obtain the augmented labeling `0h1...si .
2
The way you should think about the different possible cases here is that, at time step t, in
order for there to be nonzero probability that we can merge the sequence of t RNN outputs
into the augmented label sequence `0h1...si , it must be true that:
Papers and Tutorials November 03, 2018
BERT
Devlin et al., “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” Google
AI Language (Oct 2018).
BERT. Instead of using the unidirectional transformer decoder, they use the bidirectional
encoder architecture for their pre-trained model138 .
The input representation is shown in the figure above. The input is a sentence pair, as com-
monly seen in tasks like QA/NLI.
138
Is this seriously paper-worthy?? I’m taking notes so I can easily refer back on popular approaches, but I
don’t see what’s so special here.
Pre-training Tasks (3.3).
1. Masked LM: Mask 15% of all tokens, and try to predict only those masked tokens139
Furthermore, at training time, the mask tokens are either fed through as (a) the special
[MASK] token 80% of the time, (b) a random word 10% of the time, and (c) the original
word unchanged 10% of the time. Now this is just hackery.
2. Next Sentence Prediction: Given two input sentences A and B, train a binary clas-
sifier to predict whether sentence B came directly after sentence A.
They do the pretraining jointly, using a loss function that’s the sum of the mean masked LM
likelihood and mean next sentence prediction likelihood.
139
This is the only “novel” idea I’ve seen in the paper. Seems hacky-ish but also reasonable.
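A toy sketch of the 80/10/10 corruption scheme described in item 1 above; the probabilities follow the paper, but the function name and toy vocabulary are made up for illustration.

import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15):
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if random.random() >= mask_prob:
            continue
        targets[i] = tok                         # only masked positions are predicted
        r = random.random()
        if r < 0.8:
            corrupted[i] = mask_token            # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = random.choice(vocab)  # 10%: random word
        # else: 10% keep the original token unchanged
    return corrupted, targets

print(mask_tokens("the cat sat on the mat".split(), vocab=["dog", "ran", "hat"]))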
Papers and Tutorials November 25, 2018
Singh et al., “Wasserstein is all you need,” EPFL Switzerland (August 2018).
where T is called the transportation matrix. Informally, the constraints are simply enforcing
bijection to/from µ and ν, in that “all the mass sent from element i must be exactly ai , and
all mass sent to element j must be exactly bj ”. A particular case called the p-Wasserstein
distance, where Ω = Rd and Mij is a distance metric over Rd , is defined as
W_p(\mu, \nu) \triangleq \mathrm{OT}(\mu, \nu; D_\Omega^p)^{1/p}    (603)
Distributional Estimate (4.1). Let C \triangleq \{c_i\} be the set of possible contexts, where each
context ci can be a word/phrase/sentence/etc. For a given word w and our set of observed
contexts for that word, we essentially want to embed its context histogram into a space Ω (where
typically Ω = Rd ). Let V denote a matrix of context embeddings, such that Vi,: = ci ∈ Rd , the
embedding for context ci in what the authors call the ground space. Combining the histogram
H w containing observed context counts for word w with V, the distributional estimate of
the word w is defined as
P_V^w \triangleq \sum_{c \in C} (H^w)_c \, \delta(v_c)    (604)
Also, the point estimate is just vw , i.e. the embedding of the word w when viewed as a context.
Distance (4.2). Given some distance metric DΩ in ground space Ω = Rd , the distance between
words wi and wj is the solution to the following OT problem140 :
\mathrm{OT}(P_V^{w_i}, P_V^{w_j}; D_\Omega^p) := W_p^{\lambda}(P_V^{w_i}, P_V^{w_j})^p    (605)
Concrete Framework (5). The authors make use of the shifted positive pointwise mu-
tual information (SPPMI), Ssα , for computing the word histograms:
(H^w)_c := \frac{S_s^\alpha(w, c)}{\sum_{c' \in C} S_s^\alpha(w, c')}    (606)

S_s^\alpha(w, c) \triangleq \max\left( \log \frac{\mathrm{Count}(w, c) \sum_{c'} \mathrm{Count}(c')^\alpha}{\mathrm{Count}(w) \, \mathrm{Count}(c)^\alpha} - \log(s), \; 0 \right)    (607)
140
I’m not sure whether the rightmost exponent of p is a typo in the paper, but that is how it is written.
Papers and Tutorials December 9, 2018
M. Gutman and A. Hyvärinen, “Noise contrastive estimation: A new estimation principle for unnormalized
statistical models,” University of Helsinki (2010).
J_T(\theta) = \frac{1}{2T} \sum_t \big( \ln[h(x_t)] + \ln[1 - h(y_t)] \big)    (608)
h(u) = \sigma(G(u))    (609)
G(u) = \ln p_m(u) - \ln p_n(u)    (610)
\ln p_m(\cdot; \theta) := \ln \tilde{p}_m(\cdot; \alpha) + c    (611)
where for compactness reasons I’ve removed the explicit dependence of all functions above
(except pn ) on θ. Notice how this fixes the issue of the model just setting c arbitrarily high to
obtain a high likelihood143 .
141
The implicit assumption here is that ∃α∗ such that pd (·) = pm (·; α∗ ).
142
This all assumes of course that pn is fully defined.
143
The primary reason why MLE is traditionally unable to parameterize the partition function.
Connection to supervised learning (2.2). The NCE objective can be interpreted as binary
logistic regression that discriminates whether a point belongs to the data (pd ) or to the noise
(pn ).
P(C{=}1 \mid u \in X \cup Y) = \frac{P(C{=}1) \, p(u \mid C{=}1)}{p(u)}    (612)
= \frac{p_m(u)}{p_m(u) + p_n(u)}    (613)
\equiv h(u; \theta)    (614)

where we model with a uniform prior: P(C{=}1) = P(C{=}0) = 1/2.
where we’re now using the union of X and Y , U := {u1 , . . . , u2T }. The log-likelihood of the
data under the parameters θ is
\ell(\theta) = \sum_t^{2T} \big[ C_t \ln P(C_t{=}1 \mid u_t; \theta) + (1 - C_t) \ln P(C_t{=}0 \mid u_t; \theta) \big]    (615)
= \sum_t^{2T} \big[ C_t \ln h(u_t; \theta) + (1 - C_t) \ln[1 - h(u_t; \theta)] \big]    (616)
= \sum_t^{T} \big[ \ln h(x_t; \theta) + \ln[1 - h(y_t; \theta)] \big]    (617)
4.52.1 Self-Normalized NCE
Notes from A. Mnih and Y. Teh, “A fast and simple algorithm for training neural probabilistic
language models” (2012).
Maximum likelihood learning. Let Pθh (w) denote the probability of observing word w given
context h. For neural LMs, we assume this is the softmax of a scoring function sθ (w, h) (logits).
In what follows, I'll drop the explicit θ and h subscript/superscript notation for brevity.

\frac{\partial \log P(w)}{\partial \theta} = \frac{\partial}{\partial \theta} s(w, h) - \frac{\partial}{\partial \theta} \log\left[ \sum_{w'} e^{s(w', h)} \right]    (620)
= \frac{\partial}{\partial \theta} s(w, h) - \sum_{w'} P(w') \frac{\partial}{\partial \theta} s(w', h)    (621)
= \frac{\partial}{\partial \theta} s(w, h) - E_{w \sim P_\theta^h}\left[ \frac{\partial}{\partial \theta} s(w, h) \right]    (622)
where the expectation term is expensive due to requiring s(w, h) evaluated for all words in
the vocabulary. One approach is importance sampling, where we sample a subset of k words
from the vocab and compute the probabilities from that approximation:

\frac{\partial \log P(w)}{\partial \theta} = \frac{\partial}{\partial \theta} s(w, h) - \sum_{w'} P(w') \frac{\partial}{\partial \theta} s(w', h)    (623)
\approx \frac{\partial}{\partial \theta} s(w, h) - \frac{1}{V} \sum_{j=1}^{k} v(x_j) \frac{\partial}{\partial \theta} s(x_j, h)    (624)
\text{where } v(x) = \frac{e^{s_\theta(x, h)}}{Q_h(w{=}x)}    (625)
and we refer to v as the importance weights. In NLP, we typically set Q to the Zipfian
distribution144
^{144} TensorFlow seems to define this as

P_{\mathrm{Zipf}}(w) = \frac{\log(w + 2) - \log(w + 1)}{\log(V + 1)}    (626)

where V is the vocabulary size. I can't seem to find this definition anywhere else though. A more common form
seems to be

P(w) = \frac{1 / w^s}{\sum_{w'} 1 / w'^s}    (627)

I plotted both on WolframAlpha (link here) and they do indeed look basically the same, especially for any
reasonably large V.
NCE. In NCE, we introduce [unigram] noise distribution Pn (w) and impose a prior that
noise samples are k times more frequent than data samples from Pdh (w), resulting in the joint
distribution,
Our goal is to learn the posterior distribution P^h(D{=}1 \mid w) (so we replace P_d with P_\theta); this is the NCE posterior:

P^h(D{=}1 \mid w, \theta) = \frac{P_\theta^h(w)}{P_\theta^h(w) + k P_n(w)}    (630)
where Pθh0 denotes the unnormalized distribution. It turns out that, in practice, we can im-
pose that exp(ch )=1 and use the unnormalized Pθh0 in place of the true probabilities in all
that follows. Critically, note that this means we rewrite equation 630 using the unnormalized
probabilities in place of Pθh . The NCE objective is to find146 as follows, where I’ve shown each
step of the derivation:
\propto k \, E_{P_n}\!\left[ \log P^h(D{=}0 \mid w) \right] + E_{P_d^h}\!\left[ \log P^h(D{=}1 \mid w) \right]    (638)

\frac{\partial}{\partial \theta} J^h(\theta) = \sum_w \frac{k P_n(w)}{P_\theta^h(w) + k P_n(w)} \left( P_d^h(w) - P_\theta^h(w) \right) \frac{\partial}{\partial \theta} \log P_\theta^h(w)    (639)
^{145} Reminder: the superscript h indicates that Z is a function of the context h.
146
I’ll drop off θ dependence wherever obvious for the sake of compactness.
Papers and Tutorials December 16, 2018
R. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud, “Neural Ordinary Differential Equations,” University
of Toronto (Oct 2018).
Introduction (1). Let T denote the number of layers of our network. In the limit of T → ∞
and small δh(t) between each “layer”, we can parameterize the dynamics via an ODE:
dh(t)
= f (h (t) , t, θ) (640)
dt
Benefits of defining models using ODE solvers:
• Memory. Constant as a function of depth, since we don’t need to store intermediate
values from the forward pass.
• Adaptive computation. Modern ODE solvers adapt evaluation strategy on the fly.
• Parameter efficiency. Nearby “layers” have shared parameters.
Review: ODE
Remember the basic idea with ODEs like the one shown above. Our goal is to solve for h(t).
which we can use to define the adjoint a(t):
a(t) \triangleq -\frac{\partial L}{\partial z(t)} = -\frac{\partial L}{\partial z(t + \varepsilon)} \frac{dz(t + \varepsilon)}{dz(t)}    (647)
= a(t + \varepsilon) \frac{\partial T(z(t), t)}{\partial z(t)}    (648)

\frac{da(t)}{dt} = -a(t)^T \frac{\partial f(z(t), t, \theta)}{\partial z}    (649)
where da(t)/dt can be derived using the limit definition of a derivative. We'll now outline the
algorithm for computing gradients. We use a black box ODE solver as a subroutine that solves
a first-order ODE initial value problem. As such, it accepts an initial state, its first derivative,
the start time, the stop time, and parameters θ as arguments.
Reverse-mode derivative

Given start time t_0, stop time t_1, final state z(t_1), parameters \theta, and gradient \frac{\partial L}{\partial z(t_1)}, compute
all gradients of L.

1. Compute the t_1 gradient:

   \frac{\partial L}{\partial t_1} = \frac{\partial L}{\partial z(t_1)}^T \frac{\partial z(t_1)}{\partial t_1} = \frac{\partial L}{\partial z(t_1)}^T f(z(t_1), t_1, \theta)    (650)

2. Initialize the augmented state:

   s_0 := \left[ z(t_1), \; a(t_1), \; 0, \; -\frac{\partial L}{\partial t_1} \right]    (651)

3. Define the augmented state dynamics:

   \frac{ds}{dt} \triangleq \left[ f(z(t), t, \theta), \; -a(t)^T \frac{\partial f}{\partial z}, \; -a(t)^T \frac{\partial f}{\partial \theta}, \; -a(t)^T \frac{\partial f}{\partial t} \right]    (652)

4. Solve the reverse-time^a ODE:

   \left[ z(t_0), \; \frac{\partial L}{\partial z(t_0)}, \; \frac{\partial L}{\partial \theta}, \; \frac{\partial L}{\partial t_0} \right] = \mathrm{ODESolve}\left( s_0, \frac{ds}{dt}, t_1, t_0, \theta \right)    (653)

5. Return \frac{\partial L}{\partial z(t_0)}, \frac{\partial L}{\partial \theta}, \frac{\partial L}{\partial t_0}, \frac{\partial L}{\partial t_1}.

^a Notice how our “initial state” actually corresponds to t_1, and we pass in t_1 and t_0 in the opposite
order we typically do.
Papers and Tutorials December 16, 2018
Z. Yin and Y. Shen, “On the Dimensionality of Word Embedding,” Stanford University (Dec 2018).
Unitary Invariance of Word Embeddings (2.1). The authors note that applying any unitary
transformation (e.g. a rotation) to an embedding matrix yields an embedding equivalent to the original.
PIP Loss (3). Given embedding matrix E ∈ RV ×d , define its Pairwise Inner Product
(PIP) matrix to be
PIP(E) \triangleq E E^T    (655)
Notice that PIP(E)i,j = hwi , wj i. Define the PIP loss for comparing two embeddings E1 and
E2 for a common vocab of V words:
\|PIP(E_1) - PIP(E_2)\|_F = \sqrt{ \sum_{i,j} \left( \langle w_i^{(1)}, w_j^{(1)} \rangle - \langle w_i^{(2)}, w_j^{(2)} \rangle \right)^2 }    (656)
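A quick NumPy sketch of the PIP loss for two embeddings of the same vocabulary; the second print also illustrates the unitary invariance from (2.1). Names are mine.

import numpy as np

def pip_loss(E1: np.ndarray, E2: np.ndarray) -> float:
    """E1: (V, d1), E2: (V, d2) embeddings of the same V words."""
    return np.linalg.norm(E1 @ E1.T - E2 @ E2.T, ord="fro")

rng = np.random.default_rng(0)
E = rng.normal(size=(100, 50))
Q = np.linalg.qr(rng.normal(size=(50, 50)))[0]   # a random unitary (orthogonal) matrix
print(pip_loss(E, E))        # 0.0: identical embeddings
print(pip_loss(E, E @ Q))    # ~0: unitary transform leaves the PIP matrix unchanged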
Papers and Tutorials December 26, 2018
147
Note that G outputs samples x, not probabilities. By doing this, it implicitly defines a probability distri-
bution pg (x). This is what the authors say.
Global Optimality of pg ≡ pdata (4.1).
Proposition 1
For fixed G, the optimal D is
D_G^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}    (658)
Derivation of D_G^*(x).
Aside: Law of the unconscious statistician (LotUS). The distribution pg (x) should be read as “the proba-
bility that the output of G yields the value x.” Take a step back and recognize that G is simply a function of a
random variable z. As such, we can apply familiar rules like
However, recall that functions of random variables can themselves be interpreted as random variables. In other
words, we can also use the interpretation that G evaluates to some output x with probability pg (x).
As this blog post details, this equivalence is NOT due to a change of variables, but rather by the Law of the
unconscious statistician.
LotUS allowed us to express V(G, D) as a continuous function over x. More importantly, it means we can
evaluate \partial V / \partial D and take the derivative inside the integral^a. Setting the derivative to zero and solving for D yields
D_G^*, the form that maximizes V.
^a Also remember that D(\cdot) \in [0, 1] since its output is a probability.
The authors use this proposition to define the virtual training criterion C(G) \triangleq V(G, D_G^*):

C(G) = E_{x \sim p_{data}}\left[ \log \frac{p_{data}(x)}{p_{data}(x) + p_g(x)} \right] + E_{x \sim p_g}\left[ \log \frac{p_g(x)}{p_{data}(x) + p_g(x)} \right]    (666)
Theorem 1.
The global minimum of C(G) is achieved IFF pg = pdata . At that point C(G) = − log 4.
Proof: Theorem 1
The authors subtract V(D_G^*, G; \, p_g = p_{data}) from both sides of 666, do some substitutions, and find that
where JSD is the Jensen-Shannon divergencea . Since 0 ≤ JSD(p||q) always, with equivalence only if p ≡ q,
this proves Theorem 1 above.
a
Recall that the JSD represents the divergence of each distribution from the mean of the two
Papers and Tutorials January 01, 2019
J. Hawkins et al., “A Framework for Intelligence and Cortical Function Based on Grid Cells in the Neocortex,”
Numenta Inc. (Oct 2018).
Introduction. Authors propose new framework based on location processing that provides
supporting evidence to the theory that all regions of the neocortex are fundamentally
the same. We’ve known that grid cells exist in the hippocampal complex of mammals, but
only recently have seen evidence that they may be present in the neocortex.
How Grid Cells Represent Location. Grid cells in the entorhinal cortex148 represent
space and location. The main concepts, in order such that they build on one another, are as
follows:
• A single grid cell is a neuron that fires [when the agent is] at [one of many] multiple
locations in a physical environment149 .
• A grid cell module is a set of grid cells that activate with the same lattice spacing and
orientation but at shifted locations within an environment.
• Multiple grid cell modules that differ in tile spacing and/or orientation can provide
unique location information150 .
Crucially, the number of unique locations that can be represented by a set of grid cell modules
scales exponentially with the number of modules. Every learned environment is associated
with a set of unique locations (firing patterns of the grid cells).
Grid Cells in the Neocortex. The authors propose that we learn the structures of objects
(like pencils, cups, etc) via grid cells in the neocortex. Specifically, they propose:
1. Every cortical column has neurons that perform a function similar to grid cells.
2. Cortical columns learn models of objects similarly to how grid/place cells learn models
of environments.
148
The entorhinal cortex is located in the medial temporal lobe and functions as a hub in a widespread network
for memory, navigation and the perception of time.
149
For example, there may be a grid cell in my brain that fires when I’m at certain locations inside my room.
Those locations tend to form a lattice of sorts.
150
A single module alone cannot, because it repeats periodically. In other words, it can only provide relative
location information.
Papers and Tutorials January 05, 2019
Burda et al., “Large-Scale Study of Curiosity Driven Learning,” OpenAI and UC Berkeley (Aug 2018).
An agent sees observation xt , takes action at , and transitions to the next state with observation
xt+1 . Goal: incentivize agent with reward rt relating to how informative the transition was.
The main components in what follows are:
• Observation embedding φ(x).
• Forward dynamics network for prediction p(\phi(x_{t+1}) \mid x_t, a_t).
• Exploration reward (surprisal):
rt = − log p (φ (xt+1 ) | xt , at ) (668)
The authors choose to model the next state embedding with a Gaussian,
Interpretation. It seems that this works because after a while, it is boring and predictable to
take actions that result in losing a game. The most surprising actions seem to be those that
advance us forward, to new and uncharted territory. However, these experiments are all on
games that have a very “linear”, uni-directional sequence of success. I wonder how successful
this would be in a game like Rocket League, where there is no tight coupling between direction
and success/novelty (unlike, e.g., moving forward in Mario Bros.).
Papers and Tutorials March 02, 2019
J. Howard and S. Ruder, “Universal Language Model Fine-Tuning for Text Classification,” (May 2018).
TL;DR. ULMFiT is a transfer learning method. They introduce techniques for fine-tuning
language models.
Universal Language Model Fine-tuning (3). Define the general inductive transfer
learning setting for NLP:

Given a source task T_S and a target task T_T \ne T_S, improve performance on T_T.
Target Task LM Fine-tuning (3.2). For step 2, the authors propose what they call dis-
criminative fine-tuning and slanted triangular learning rates.
\theta_t^\ell = \theta_{t-1}^\ell - \eta^\ell \cdot \nabla_{\theta^\ell} J(\theta)    (671)
• Slanted triangular learning rates. A learning-rate schedule that increases the LR linearly for a
short warm-up phase and then linearly decays it (a “slanted triangle” when plotted).
First, we define the following hyperparameters:
– T: total number of training iterations^{151}
– cfrac: fraction of T (in num iterations) where we're increasing the learning rate.
– cut: \lfloor T \cdot cfrac \rfloor. Iteration where we switch from increasing the LR to decreasing it.
– ratio: \eta_{max} / \eta_{min}. We of course must also define \eta_{max}.
We can now compute the learning rate for a given iteration t (suggested: cfrac = 0.1, ratio = 32, \eta_{max} = 0.01):

\eta_t = \eta_{max} \cdot \frac{1 + p \cdot (ratio - 1)}{ratio}    (672)

p = \begin{cases} t / cut & \text{if } t < cut \\ 1 - \frac{t - cut}{cut \cdot (1 / cfrac - 1)} & \text{otherwise} \end{cases}    (673)
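A small sketch of the schedule in eqs. 672-673 with the suggested hyperparameters; variable names mirror the note.

def stlr(t: int, T: int, cfrac: float = 0.1, ratio: float = 32.0, eta_max: float = 0.01) -> float:
    cut = int(T * cfrac)
    if t < cut:
        p = t / cut                                   # warm-up: linear increase
    else:
        p = 1 - (t - cut) / (cut * (1 / cfrac - 1))   # linear decay
    return eta_max * (1 + p * (ratio - 1)) / ratio

T = 1000
for t in (0, 50, 100, 500, 999):
    print(t, round(stlr(t, T), 5))   # rises until t = cut = 100, then decays linearly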
Target Task Classifier Fine-tuning (3.3). Augment the LM with two fully-connected layers.
The first with ReLU activation and the second with softmax. Each uses batch normalization
and dropout. The first is fed the output hidden state of the LM concatenated with the max-
and mean-pooled hidden states over all timesteps152 :
In addition to DF-T and STLR from above, they also employ the following techniques:
• Gradual Unfreezing: first unfreeze the last layer and fine-tune it alone with all other
layers frozen for one epoch. Then, unfreeze the next layer and fine-tune the last-two
layers only for the next epoch. Continue until the entire network is being trained, at
which time we just train until convergence.
• BPTT for Text Classification. Divide documents into fixed-length “batches”153 of
size b. They initialize the ith section with the final state of the model run on section
i − 1.
151
Steps-per-epoch times number of epochs.
152
Or just as much as we can fit into GPU memory.
153
Not a fan of how they overloaded this term here.
Papers and Tutorials March 10, 2019
Wilson et al., “The Marginal Value of Adaptive Gradient Methods in Machine Learning,” (May 2018).
TL;DR. For simple overparameterized problems, adaptive methods (a) often find drastically
different solutions than SGD, and (b) tend to give undue influence to spurious features that
have no effect on out-of-sample generalization. They also found that tuning the initial learning
rate and decay scheme for Adam yields significant improvements over its default settings in all
cases.
Background (2). The gradient updates for general stochastic gradient, stochastic momentum,
and adaptive gradient methods, respectively, can be formalized as follows154
where Hk is a p.d. matrix involving the entire sequence of iterates (w1 , . . . wk ). For example,
regular momentum would be γk =0, and Nesterov momentum would be γk =βk . In practice, we
basically always define Hk as the diagonal matrix:
H_k := \mathrm{diag}\left( \left( \sum_{i=1}^{k} \eta_i \, g_i \circ g_i \right)^{1/2} \right)    (678)
The Potential Perils of Adaptivity (3). Consider the binary least-squares classification
problem, where we aim to minimize
R_S[w] := \frac{1}{2} \|Xw - y\|_2^2    (679)

where X \in R^{n \times d} and y \in \{-1, 1\}^n.
Lemma 3.1
If there exists a scalar c s.t. Xsign(X T y) = cy, then (assuming w0 := 0) AdaGrad, Adam,
and RMSProp all converge to the unique solution w ∝ sign(X T y).
154
I’m defining ∆wk+1 , wk+1 − wk .
Papers and Tutorials March 17, 2019
Y. Gal and Z. Ghahramani, “A Theoretically Grounded Application of Dropout in Recurrent Neural Networks,”
University of Cambridge (Oct 2016).
Background (3). In Bayesian regression, we want to infer the parameters ω of some function
y = f ω (x). We define a prior, p(ω), and a likelihood,
p(y=d | x, ω) = Catd (softmax (f ω (x))) (680)
for a classification setting. Given a dataset X, Y, and some new point x∗ , we can predict its
output via

p(y^* \mid x^*, X, Y) = \int p(y^* \mid x^*, \omega) \, p(\omega \mid X, Y) \, d\omega    (681)
In a Bayesian neural network, we place the prior over the NNs weights (typically Gaussians).
The posterior p(ω | X, Y) is usually intractable, so we resort to variational inference to
approximate it. We define our approximating distribution q(ω) and aim to minimize the
KLD:
KL\big(q(\omega) \,\|\, p(\omega \mid X, Y)\big) \propto -\int q(\omega) \log p(Y \mid X, \omega) \, d\omega + KL(q(\omega) \,\|\, p(\omega))    (682)
= -\sum_{i=1}^{N} \int q(\omega) \log p(y_i \mid f^\omega(x_i)) \, d\omega + KL(q(\omega) \,\|\, p(\omega))    (683)
Variational Inference in RNNs (4). The authors use MC integration to approximate the
integral. They use only a single sample \hat{\omega}_i \sim q(\omega) for each of the N summands, resulting in
an unbiased estimator. Plugging this in, we obtain our objective:

L \approx -\sum_{i=1}^{N} \log p\left( y_i \mid f_y^{\hat{\omega}_i}\!\left( f_h^{\hat{\omega}_i}(x_{i,T}, h_{T-1}) \right) \right) + KL(q(\omega) \,\|\, p(\omega))    (684)
Papers and Tutorials March 24, 2019
E. Grave, A. Joulin, and N. Usunier, “Improving Neural Language Models with a Continuous Cache,” Facebook
AI Research (Dec 2016).
The cache stores pairs (ht , xt+1 ) of the final hidden-state representation at time t, along with
the word which was generated 155 based on this representation.
p_{vocab}(w \mid x_{\langle 1 \ldots t \rangle}) \propto \exp\!\left( h_t^T o_w \right)    (686)

p_{cache}(w \mid h_{\langle 1 \ldots t \rangle}, x_{\langle 1 \ldots t \rangle}) \propto \sum_{i=1}^{t-1} \mathbb{1}_{\{x_{i+1} = w\}} \exp\!\left( \theta \, h_t^T h_i \right)    (687)
= \sum_{\substack{(x, h) \in \text{cache} \\ \text{s.t. } x = w}} \exp\!\left( \theta \, h_t^T h \right)    (688)

p(w \mid h_{\langle 1 \ldots t \rangle}, x_{\langle 1 \ldots t \rangle}) = (1 - \lambda) \, p_{vocab}(w \mid h_t) + \lambda \, p_{cache}(w \mid h_{\langle 1 \ldots t \rangle}, x_{\langle 1 \ldots t \rangle})    (689)
where θ is a scalar parameter that controls the flatness of the cache distribution.
155
They say this, but everything else in the paper strongly suggests they mean the next gold-standard input
instead.
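A hedged PyTorch sketch of the cache-augmented prediction (eqs. 686-689); here the cache is just two tensors of stored hidden states and the words that followed them, and all names are mine.

import torch

def cache_lm_probs(logits_vocab, h_t, cache_h, cache_x, vocab_size, theta=0.3, lam=0.1):
    """logits_vocab: (V,), h_t: (d,), cache_h: (n, d) stored states, cache_x: (n,) next-word ids."""
    p_vocab = torch.softmax(logits_vocab, dim=-1)                    # eq. 686 (normalized)
    cache_scores = torch.softmax(theta * (cache_h @ h_t), dim=-1)    # eqs. 687-688 (normalized)
    p_cache = torch.zeros(vocab_size)
    p_cache.index_add_(0, cache_x, cache_scores)                     # sum cache mass per word
    return (1 - lam) * p_vocab + lam * p_cache                       # eq. 689

V, d, n = 50, 16, 10
probs = cache_lm_probs(torch.randn(V), torch.randn(d), torch.randn(n, d),
                       torch.randint(0, V, (n,)), V)
print(float(probs.sum()))   # ~1.0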
Papers and Tutorials April 26, 2019
A. Bhowmick et al., “Protection Against Reconstruction and Its Applications in Private Federated Learning,”
Apple, Inc. (Dec 2018).
Introduction (1). In many scenarios, it is possible to reconstruct model inputs x given just
∇θ `(θ; x, y). Differential privacy is one approach for obscuring the gradients such that guar-
antees can be made regarding risk of compromising user data x. Locally private algorithms,
however, are preferred to DP when the user wants to keep their data private even from the
data collector. The authors want to find a way to perform SGD while providing both local
privacy to individual data Xi and stronger guarantees on the global privacy of the output θ̂n
of their procedure.
Formally, say we have two users’ data x and x0 (both in X ) and some randomized mechanism
M : X 7→ Z. We say that M is ε-local differentially private if ∀x, x0 ∈ X and sets S ⊂ Z:
\frac{\Pr[M(x) \in S]}{\Pr[M(x') \in S]} \le e^\varepsilon    (690)
Clearly, the RHS will need to be pretty big for this to be achievable. The authors claim that
allowing ε >> 1 “may [still] provide meaningful privacy protections.”
Privacy Protections (2). The focus here is on the curious onlooker: an individual (e.g.
Apple PFL engineer) who can observe all updates to a model and communication from individ-
ual devices. Let X denote some user data. Let ∆W denote the weights difference after some
model update using the data X. Let Z be the result of the randomized mapping ∆W 7→ Z.
Our setting can be described with the Markov chain X → ∆W → Z. The onlooker observes
Z and wants to estimate some function f (X) on the private data.
Separated private mechanisms (2.2). The authors propose, instead of a simple mapping
∆W → Z, to split it up into two parts: Z1 = M1 (U ) and Z2 = M2 (R), where
U = \frac{\Delta W}{\|\Delta W\|_2}    (691)
R = \|\Delta W\|_2    (692)
Privatizing Unit `2 Vectors with High Accuracy (4.1). Given some vector u ∈ Sd−1156 ,
we want to generate an ε-differentially private vector Z such that
E[Z \mid u] = u \quad \forall u \in S^{d-1}    (693)

Let \alpha = \frac{d-1}{2}, \tau = \frac{1+\gamma}{2}, and

m = \frac{(1 - \gamma^2)^\alpha}{2^{d-2}(d-1)} \left[ \frac{p}{B(\alpha, \alpha) - B(\tau; \alpha, \alpha)} - \frac{1 - p}{B(\tau; \alpha, \alpha)} \right]    (695)

where B(x; \alpha, \beta) is the incomplete beta function (see paper pg 17 for details). Return Z = \frac{1}{m} \cdot V.
Privatizing the Magnitude (4.3). We also need to privatize the weight delta norms. We
want to return values r ∈ [0, rmax ] for some rmax < ∞.
^{156} Here, this denotes an n-sphere: S^n \triangleq \{x \in R^{n+1} : \|x\| = r\}.
Papers and Tutorials June 02, 2019
T. Mikolov and G. Zweig, “Context Dependent Recurrent Neural Network Language Model,” BRNO and Mi-
crosoft (2012).
Model Structure (2). Given one-hot input vector xt , output a probability distribution yt
for the next word. Incorporate a feature vector ft that will contain topic information.
LDA for Context Modeling (3). “Documents” fed to LDA here will be individual sentences.
The generative process assumed by LDA is compactly defined by the following sequence of
operations157 :
N \sim \mathrm{Poisson}(\xi)    (698)
\Theta \sim \mathrm{Dir}(\alpha)    (699)
z_n \sim \mathrm{Multinomial}(\Theta)    (700)
w_n \sim \Pr[w_n \mid z_n, \beta]    (701)

where N is the number of words, \Theta_i \equiv p(\text{topic } i), and z_n is the topic of word n.
where Pr [wn =a | zn =b] = βb,a , so we are really just sampling from row zn of β, where β ∈
[0, 1]Z×V (where Z is number of topics). The result of LDA is a learned value for α, and the
topic distributions β.
f_t = \frac{1}{Z} \prod_{i=0}^{K} t_{x_{t-i}}    (702)

f_t = \frac{1}{Z_t} \, f_{t-1}^{\gamma} \, t_{x_t}^{(1 - \gamma)}    (703)
157
α is a vector with number of elements equal to number of topics.
Papers and Tutorials July 27, 2019
Chen et al., “Strategies for Training Large Vocabulary Neural Language Models,” FAIR (Dec 2018). arXiv:1512.04906
Setup/Notation. Note that in everything below, the authors are using a rather primitive feed-
forward network as their language model. To predict wt it just concatenates the embeddings
of the previous n words and feeds it through a k-layer FF network. Then, layer k + 1 is the
dense projection and softmax:
Using cross-entropy loss, the derivative of \log p(w_t{=}i) wrt the jth element of the logits is:

\frac{\partial \log y_i}{\partial h_j^{k+1}} = \frac{\partial}{\partial h_j^{k+1}} \left[ h_i^{k+1} - \log Z \right]    (706)
= \delta_{ij} - y_j    (707)
When computing gradients of the cross-entropy loss, i here indexes the ground-truth token. Therefore,
to increase the probability of the correct token, we need to increase the logits element for that
index, and decrease the elements for the others. Note how this implies we must compute the
final activations for all words in the vocabulary.
Hierarchical Softmax (2.2). Group words into one of two clusters {c1 , c2 }, based on unigram
frequency158 . Then model p(wt | x) = p(ct | x)p(wt | ct ) where ct is the class that word wt was
assigned to.
158
For example, you could just put the top 50% in c1 and the rest in c2 .
NCE (2.5). Define Pnoise (w) by the unigram frequency distribution. For each real token wt
in the training set, sample K noise tokens \{n_k\}_{k=1}^K. NCE aims to minimize

L_{NCE}(\{w^{(1)}, \ldots, w^{(N)}\}) = \sum_{i=1}^{N} \sum_{t=1}^{\mathrm{len}(w^{(i)})} \left[ \log h(w_t^{(i)}) + \sum_{k=1}^{K} \log\!\left( 1 - h(n_k^{(i)}) \right) \right]    (708)

h(w_t) = \frac{P_{model}(w)}{P_{model}(w) + k P_{noise}(w)}    (709)
\approx \frac{\tilde{P}_{model}(w)}{\tilde{P}_{model}(w) + k P_{noise}(w)}    (710)
where the final approximation is what makes NCE less computationally expensive in practice
than standard softmax. This would seem to imply that NCE should approach standard softmax
(in terms of correctness) as k increases.
Takeaways.
• Hierarchical softmax is the fastest.
• NCE performs well on large-volume large-vocab datasets.
• Similar NCE values can result in very different validation perplexities.
• Sampled softmax shows good results if the number of negative samples is at 30% of the
vocab size or larger.
• Sampled softmax has a lower ppl reduction per step than others.
Papers and Tutorials August 03, 2019
Vector Quantization. Denote the index set I = [0..k − 1] and the set of reproduction values
(a.k.a. centroids) ci as C = {ci ∈ RD : i ∈ I}. We refer to C as the codebook of size k. A
quantizer is a function q : x 7→ q(x), where x ∈ RD and q(x) ∈ C. We typically evaluate the
quality of a quantizer with mean squared error of the reconstruction:
\mathrm{MSE}(q) = E_{x \sim p(x)}\left[ \|x - q(x)\|_2^2 \right]    (711)
In order for a quantizer to be optimal, it must satisfy the Lloyd optimality conditions:
Note that each subquantizer qj has its own index set Ij and codebook Cj . Therefore, the
final reproduction value of a product quantizer is identified by an element of the product set
I = I1 × · · · × Im . The associated final codebook is C = C1 × · · · × Cm .
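A toy NumPy sketch of a product quantizer with m subquantizers, each with its own codebook; a few crude k-means-style iterations stand in for proper codebook training, and all names here are mine.

import numpy as np

def train_codebooks(X: np.ndarray, m: int, k: int, iters: int = 10):
    """Split the D-dim vectors in X into m subvectors and learn a k-entry codebook for each."""
    books = []
    for Xs in np.split(X, m, axis=1):
        C = Xs[np.random.choice(len(Xs), k, replace=False)]          # random init
        for _ in range(iters):
            assign = np.argmin(((Xs[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
            for j in range(k):
                if np.any(assign == j):
                    C[j] = Xs[assign == j].mean(0)                    # centroid update
        books.append(C)
    return books

def encode(x: np.ndarray, books) -> np.ndarray:
    """Return one codebook index per subquantizer (an element of I_1 x ... x I_m)."""
    parts = np.split(x, len(books))
    return np.array([np.argmin(((C - p) ** 2).sum(-1)) for p, C in zip(parts, books)])

X = np.random.randn(1000, 8)
books = train_codebooks(X, m=2, k=16)
print(encode(X[0], books))      # e.g. two indices, one per subquantizer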
Papers and Tutorials August 03, 2019
Lample et al., “Large Memory Layers with Product Keys,” FAIR (July 2019). arXiv:1907.05242v1
where equation 715 is inefficient for large memory (key-value) stores. The authors propose
instead a structured set of keys that they call product keys. Spoiler alert: it’s just product
quantization with m=2 subvectors. Instead of using the flat key set K \triangleq \{k_1, \ldots, k_{|K|}\} with
each k_i \in R^{d_q} from earlier, we redefine it as

K \triangleq \{(c, c') \mid c \in C, c' \in C'\}    (718)
3. Run the standard algorithm using the new reduced key set K.
TODO: finish this note
Papers and Tutorials August 10, 2019
V. Kazemi and A. Elqursh, “Show, Ask, Attend, and Answer: A Strong Baseline For Visual Question Answering”
Google Research (April 2017). arXiv:1704.03162v2
TL;DR: Good for a high-level overview of the VQA task, but extremely vague with so many
details omitted it renders the paper fairly useless.
Method (3). Given a training set of image-question-answer triplets (I, q, a), learn to estimate
the most likely answer â out of the set of most frequent answers161 in the training data:
161
Same approach as how we define vocabulary in language modeling tasks.
162
Authors do a laughably poor job at describing this part in any detail, so I’m taking the liberty of filling in
the blanks. Blows my mind that papers this sloppy can even be published.
Dataset. Although, again, the authors are horribly vague/sloppy here, it seems like the data
they use actually provides K “correct” answers for each image-question pair. The model loss
is therefore an average NLL over the K true classes.
Papers and Tutorials August 10, 2019
Mudrakarta et al., “Did the Model Understand the Question?” Univ. Chicago & Google Brain (May 2018).
arXiv:1805.05492v1
TL;DR. Basically all QA-related networks are dumb and don’t learn what we think they learn.
• Networks tend to make predictions based on a tiny subset of the input words. Due to this,
altering the non-important words in ways that may drastically change the meaning of
the question can have virtually no impact on the network's prediction.
• Networks assign high-importance to words like “there”, “what”, “how”, etc. These are
actually low-information words that the network should not heavily rely on.
• Networks rely on the image far more than the question.
Integrated Gradients (IG) (3). Purpose: “isolate question words that a DL system uses to
produce an answer.”
where x0 is some baseline input we use to compute the relative attribution of input x. The
authors set x0 as the “empty question” (sequence of padding values)163 .
Given an input x and baseline x0 , the integrated gradient along the ith dimension is as
follows.
\mathrm{IG}_i(x, x') \triangleq (x_i - x'_i) \times \int_{\alpha=0}^{1} \frac{\partial F(x' + \alpha \times (x - x'))}{\partial x_i} \, d\alpha    (724)
Interpretation: seems like IG gives us a better idea of the total “attribution” of each input
dimension xi relative to baseline x0i along the line connecting xi and x0i , instead of just the im-
mediate derivative around xi . Although, the fact that infinitesimal contributions could cancel
each other out seems problematic (positive and negative gradients along the interpolation).
^{163} They use the same context though (e.g. the associated image for VQA). Only the question is changed.
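A Riemann-sum sketch of eq. 724 on a toy differentiable function; in the paper, F would be the model's answer probability and x' the empty-question baseline. Everything here is illustrative.

import torch

def integrated_gradients(F, x, x_baseline, steps: int = 50) -> torch.Tensor:
    total = torch.zeros_like(x)
    for k in range(1, steps + 1):
        alpha = k / steps
        xi = (x_baseline + alpha * (x - x_baseline)).detach().requires_grad_(True)
        F(xi).backward()                    # accumulate dF/dx along the interpolation path
        total += xi.grad
    return (x - x_baseline) * total / steps

F = lambda v: (v ** 2).sum()
x, x0 = torch.tensor([1.0, 2.0]), torch.zeros(2)
print(integrated_gradients(F, x, x0))       # ~[1., 4.]: attributions sum to F(x) - F(x0)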
Papers and Tutorials August 17, 2019
XLNet
Yang et al., “XLNet: Generalized Autoregressive Pretraining for Language Understanding” CMU & Google
Brain (June 2019).
TL;DR: Instead of minimizing the NLL using p(w1 , . . . , wT ), minimize over NLL’s using every
possible order of the given word sequence.
Background. Recall that BERT does denoising auto-encoding. Given text sequence x =
{x1 , . . . xT }, BERT constructs a corrupted version x̂ by randomly masking out some tokens.
Let x̄ denote the tokens that were masked. The BERT training objective is then
[BERT] \quad \max_\theta \; \log p_\theta(\bar{x} \mid \hat{x}) \approx \sum_{\bar{x} \in \bar{x}} \log p_\theta(\bar{x} \mid \hat{x})    (725)

p(\bar{x} \mid \hat{x}) = \mathrm{Softmax}\!\left( H_\theta(\hat{x})_t^T \, e(\bar{x}) \right)    (726)
where ZT is the set of all possible permutations of the length-T index sequence [1..T ]. To
implement this, the authors had to re-parameterize the next-token distribution to be target
position aware:
p_\theta(X_{z_t} = x \mid x_{\langle z_1 \ldots z_{t-1} \rangle}) = \mathrm{Softmax}\!\left( g_\theta(x_{\langle z_1 \ldots z_{t-1} \rangle}, z_t)^T \, e(x) \right)    (728)
They accomplish this via two-stream self-attention, a technique that utilizes two sets of
hidden representations (instead of one):
• Content representation: h_{z_t} \triangleq h_\theta(x_{\langle z_1 \ldots z_t \rangle}).
• Query representation: g_{z_t} \triangleq g_\theta(x_{\langle z_1 \ldots z_{t-1} \rangle}, z_t).
The query stream is initialized with some vector g_i^{(0)} = w, and the content stream is initialized
with the word embedding h_i^{(0)} = e(x_i). For the subsequent attention layers 1 \le m \le M, they are
computed respectively as follows:

g_{z_t}^{(m)} \leftarrow \mathrm{Attention}(Q = g_{z_t}^{(m-1)}, \; K = V = h_{z < t}^{(m-1)})    (729)
h_{z_t}^{(m)} \leftarrow \mathrm{Attention}(Q = h_{z_t}^{(m-1)}, \; K = V = h_{z \le t}^{(m-1)})    (730)
In practice, in order to speed up optimization, the authors do partial prediction: only train
to predict over xz>c targets rather than all of them.
Incorporating Ideas from Transformer-XL. Often times, sequences are too long to feed
all at once. The authors adopt relative positional encoding and segment-level recurrence from
Transformer-XL. To compute the attention update with memory on a given segment, we use
the content representations from the previous segment, \tilde{h}, along with those of the current segment,
h_{z \le t}, as follows:

h_{z_t}^{(m)} \leftarrow \mathrm{Attention}\!\left( Q = h_{z_t}^{(m-1)}, \; K = V = \left[ \tilde{h}^{(m-1)} ; \, h_{z \le t}^{(m-1)} \right] \right)    (731)
Papers and Tutorials August 24, 2019
Transformer-XL
Dai et al., “Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context” CMU & Google
Brain (Jan 2019).
Segment-Level Recurrence with State Reuse. Denote two consecutive segments of length
L as sτ = [xτ,1 , . . . , xτ,L ] and sτ +1 = [xτ +1,1 , . . . , xτ +1,L ]. Denote output of layer n given input
segment sτ as hnτ ∈ RL×d , where d is the hidden dimension. To obtain the output of layer n
given the next segment, sτ +1 , do:
where the concat in 734 is along the length (time) dimension. In other words, Q remains the
same, but K and V get the previous segment prepended. Ultimately this only changes the
inner dot products in the attention mechanism to attend over both segments. The L output
attention vectors are therefore each weighted sums over the previous 2L timesteps instead of
just L.
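A minimal single-head PyTorch sketch of the recurrence just described: queries come only from the current segment, while keys/values are the stop-gradient previous-segment hidden states concatenated with the current ones. It omits projections, masking, and the positional terms discussed next.

import torch

def segment_attention(h_prev: torch.Tensor, h_curr: torch.Tensor) -> torch.Tensor:
    """h_prev, h_curr: (L, d) hidden states of layer n-1 for segments tau and tau+1."""
    q = h_curr                                           # (L, d): Q stays the same
    kv = torch.cat([h_prev.detach(), h_curr], dim=0)     # (2L, d): no grad into the old segment
    attn = torch.softmax(q @ kv.T / kv.shape[-1] ** 0.5, dim=-1)
    return attn @ kv                                     # each output attends over 2L timesteps

L, d = 4, 8
print(segment_attention(torch.randn(L, d), torch.randn(L, d)).shape)  # torch.Size([4, 8])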
Relative Positional Encodings. Instead of absolute positional encodings (as regular transformers
do), only encode the relative positional information in the hidden states. Ignoring the
scale factor of 1/\sqrt{d_k}, we can write the score for query vector q_i = W_q(e_{x_i} + u_i) and key vector
k_j = W_k(e_{x_j} + u_j), for input embeddings e and positional encodings u, as follows. Below it we
show the authors' proposed re-parameterized relative encoding version.

A_{i,j}^{abs} = e_{x_i}^T W_q^T W_k e_{x_j} + e_{x_i}^T W_q^T W_k u_j + u_i^T W_q^T W_k e_{x_j} + u_i^T W_q^T W_k u_j    (735)

A_{i,j}^{rel} = \underbrace{e_{x_i}^T W_q^T W_{k,E} e_{x_j}}_{\text{content-based addressing}} + \underbrace{e_{x_i}^T W_q^T W_{k,R} \, r_{i-j}}_{\text{content-dependent positional bias}} + \underbrace{u^T W_{k,E} e_{x_j}}_{\text{global content bias}} + \underbrace{v^T W_{k,R} \, r_{i-j}}_{\text{global positional bias}}    (736)
The differences introduced by the relative version are the W_{k,R} \, r_{i-j} terms and the learned vectors
u and v that replace the u_i^T W_q^T factors. It appears that r_{i-j} is literally just u_{i-j}, but I guess
using new letters is cool. Note that they separate W_k into content-based W_{k,E} and location-based W_{k,R}.
Papers and Tutorials August 31, 2019
Grave et al., “Efficient Softmax Approximation for GPUs” FAIR (June 2017).
More generally, we can extend the above algorithm to N clusters (instead of 2). We can also
adapt the capacity of each cluster (varying their embedding size). The authors recommend,
for each successive tail cluster, reducing the output size by a factor of 4. Of course, this then
has to be followed by projecting back up to the number of words associated with the given
cluster.
TODO: detail out how cross entropy loss is computed under this setup.
Papers and Tutorials September 02, 2019
A. Baevski and M. Auli, “Adaptive Input Representations for Neural Language Modeling” FAIR (Feb 2019).
TL;DR: Literally just adaptive softmax but for the input embeddings. Official implementation
can be found here.
Adaptive Input Representations (3). Same as Grave et al., they partition the vocabulary
V into
V = V1 ∪ V2 ∪ . . . ∪ Vn (738)
where V1 is the head and the rest are the tails (ordered by decreasing frequency). They
reduce the capacity of each cluster by a factor of k=4 (also same as Grave et al.). Finally,
they add linear projections for each cluster’s embeddings in order to ensure they all result in
d-dimensional output embeddings (even V1 ).
Papers and Tutorials September 14, 2019
Andreas et al., “Deep Compositional Question Answering with Neural Module Networks” UC Berkeley (Nov
2015).
Answering natural language questions (4.3). They combine the results from the module
network with an LSTM, which is fed the question as input and outputs a predictive distribution
over the set of answers164 . The final prediction is a geometric average of the LSTM output
probabilities and the root module output probabilities.
164
This is the same distribution that the root module is trying to predict
Papers and Tutorials September 14, 2019
Andreas et al., “Learning to Compose Neural Networks for Question Answering” UC Berkeley (June 2016).
TL;DR. Improve initial NMN work (previous note) by (1) learning network predictor (P (w)
in previous paper) instead of manually specifying it, and (2) extending visual primitives from
previous work to reason over structured world representations.
Model (4). Training data consists of (world, question, answer) triplets (w, x, y). The model
is built around two distributions:
• layout model p(z | x; θ` ) which predicts a layout z for sentence x.
• execution model pz (y | w; θe ) which applies the network specified by z to the world
representation w.
Evaluating Modules (4.1). The execution model is defined as

p_z(y \mid w) = \left( \llbracket z \rrbracket_w \right)_y    (739)

where \llbracket z \rrbracket_w denotes the output of the network with layout z on input world w. The defining
equations for all modules are as follows (\sigma \equiv ReLU, sm \equiv softmax):
To train, maximize

\sum_{(w, y, z)} \log p_z(y \mid w; \theta_e)    (746)
Papers and Tutorials September 14, 2019
R. Hu, J. Andreas, et al., “Learning to Reason: End-to-End Module Networks for Visual Question Answering”
UC Berkeley, FAIR, BU (Sep 2017).
End-to-End Module Networks (3). High-level sequence of operations, given some input
question and image:
1. Layout policy predicts a coarse functional expression that describes the structure of the
computation.
2. Some subset of function applications within the expression receive parameter vectors
predicted from the text.
3. Network is assembled with the modules according to layout expression to output an
answer.
Note that, whereas the original NMN paper (see previous note) instantiated module types based
on words (e.g. describe[shape] vs describe[where]) and gave different instantiations different
parameters, this paper has a single module for each module type (no “instances” anymore).
To distinguish between cases where e.g. describe should describe a shape vs. describing a
location, the module incorporates a text feature x_{txt}^{(m)} computed separately/identically for each
module m:

x_{txt}^{(m)} = \sum_{i=1}^{T} \alpha_i^{(m)} w_i    (747)
Layout Policy with Seq2Seq RNN (3.2). TODO finish note
Papers and Tutorials October 01, 2019
Carbune et al., “Fast Multi-language LSTM-based Online Handwriting Recognition” Google AI Perception (Feb
2019).
Introduction (1). Task: given input strokes, i.e. sequences of points (x, y, t), output it in the
form of text.
165
Vizier is a program made by Google for black-box tuning
Papers and Tutorials October 02, 2019
166
Segmentation/cut point: a point at which another character my start. Segment: the (partial) strokes
between 2 consecutive segmentation points.
167
Character hypothesis: a set of one or more segments (not necessarily consecutive).
Language Models (6.1). They utilize two types of language models:
• Stupid-backoff entropy-pruned 9-gram character LM. This is their “main” LM. Depending
on the language, they use about 10M to 100M n-grams.
• Word-based probabilistic finite automaton. Created using the 100K most frequent words of
a language.
Search (6.2). Goal: obtain a recognition result by finding the best path from the source
node (no ink recognized) to the target node (all ink recognized). Algorithm: ink-aligned beam
search that starts at the start node and proceeds through the lattice in topological order.
Papers and Tutorials October 13, 2019
Zhao et al., “Modular Generative Adversarial Networks” UBC, Tencent AI Lab (April 2018).
Network Construction (3.2). Let x and y denote the input and target image, respectively,
wherever applicable. Let A = {A1 , A2 , · · · , An } denote an attribute set. Four types of
modules are used:
1. Initial module is task-dependent (below). Output is a feature map in R^{C \times H \times W}.
   • [translation] encoder E: x \mapsto E(x)
   • [generation] generator G: (z, a_0) \mapsto G(z, a_0), where z is random noise and a_0 is
     a condition vector representing auxiliary information.
2. transformer(s) T_i: E(x) \mapsto T_i(E(x), a_i). Modifies repr of attrib a_i in the FM.
3. reconstructor R: (T_i, T_j, \ldots) \mapsto y. Reconstructs image from an intermediate FM.
4. discriminator D_i: R \mapsto \{0, 1\} \times \mathrm{Val}(a_i). Predicts probability that R came from p_{true},
   and the [transformed] value of a_i.
The authors emphasize that the transformer module is their core module. Its architecture is
illustrated in the paper.
Loss Function (3.4).

L_D(D) = -\sum_{i=1}^{n} L_{adv_i} + \lambda_{cls} \sum_{i=1}^{n} L_{cls_i}^{r}    (748)

L_G(E, T, R) = \sum_{i=1}^{n} L_{adv_i} + \lambda_{cls} \sum_{i=1}^{n} L_{cls_i}^{f} + \lambda_{cyc} \left( L_{cyc}^{ER} + \sum_{i=1}^{n} L_{cyc}^{T_i} \right)    (749)

L_{adv_i}(E, T_i, R, D_i) = E_{y \sim p_{data}(y)}[\log D_i(y)] + E_{x \sim p_{data}(x)}[\log(1 - D_i(R(T_i(E(x)))))]    (750)

L_{cls_i}^{r} = -E_{x, c_i}[\log D_{cls_i}(c_i \mid x)]    (751)

L_{cls_i}^{f} = -E_{x, c_i}[\log D_{cls_i}(c_i \mid R(E(T_i(x))))]    (752)

L_{cyc}^{ER} = E_x\!\left[ \|R(E(x)) - x\|_1 \right]    (753)

L_{cyc}^{T_i} = E_x\!\left[ \|T_i(E(x)) - E(R(T_i(E(x))))\|_1 \right]    (754)
Papers and Tutorials October 13, 2019
Jia et al., “Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis” Google (Jan
2019).
TL;DR: TTS that’s able to generate speech in the voice of different speakers, including those
unseen during training.
NLP with Deep Learning
NLP with Deep Learning May 08
Meaning of a word. Common answer is to use a taxonomy like WordNet that has hypernyms
(is-a) relationships. Problems with this discrete representation: misses nuances, e.g. the words
in a set of synonyms can actually have quite different meanings/connotations. Furthermore,
viewing words as atomic symbols is a bit like using one-hot vectors of words in a vocabulary
space (inefficient).
Distributed representations. Want a way to encode word vectors such that two similar
words have a similar structure/representation. The distributional similarity-based168 ap-
proach represents words by means of its neighbors in the sentences in which it appears. You
end up with a dense “vector for each word type, chosen so that it is good at predicting other
words appearing in its context.”
The skip-gram training objective (negative log-likelihood) and the context-word probability it uses are, respectively,
$$J(\theta) = -\frac{1}{T}\sum_{t=1}^{T} \sum_{-m \le j \le m,\; j \ne 0} \log \Pr(w_{t+j} \mid w_t; \theta) \tag{755}$$
$$\Pr(o \mid c) = \frac{\exp(u_o^T v_c)}{\sum_{w=1}^{V} \exp(u_w^T v_c)} \tag{756}$$
where
• The params θ are the vector representation of the words (they are the only learnable
parameters here).
• m is our radius/window size.
• o and c are indices into our vocabulary (somewhat inconsistent notation).
• Yes, they are using different vector representations for u (context words) and v (center
words). I’m assuming one reason this is done is because it makes the model architecture
simpler/easier to build.
Some subtleties:
168
Note that this is distinct from the way “distributed” is meant in “distributed representation.” In contrast,
distributional similarity-based representations refers to the notion that you can describe the meaning of words
by understanding the context in which they appear.
252
• Looks like e.g. P r(wt+j | wt ) doesn’t really care what the value of j is, it is just modeling
the probability that it is somewhere in the context window. The wt are one-hot vectors
into the vocabulary. Standard tricks for simplifying the cross-entropy loss apply.
• Equation 756 should be interpreted as the probability that the oth word in our vocabulary
occurs in the context window of the cth word in our vocabulary.
• The model architecture is identical to an autoencoder. However, the (big) difference is
the training procedure and interpretation of the model parameters going “in” versus the
parameters going “out”.
Sentence Embeddings. It turns out that simply taking a weighted average of word vectors
and doing some PCA/SVD is a competitive way of getting unsupervised sentence embeddings.
Apparently it beats supervised learning with LSTMs (?!). The authors claim the theoretical
explanation for this method lies in a latent variable generative model for sentences (of course).
Discussion based on the paper by Arora et al. (2017). Approach:
1. Compute the weighted average of the word vectors in the sentence:
$$v_s = \frac{1}{|s|} \sum_{w_i \in s} \frac{a}{a + p(w_i)}\, w_i \tag{757}$$
where $w_i$ is the word vector for the $i$th word in the sentence, $a$ is a parameter, and $p(w_i)$
is the (estimated) word frequency [over the entire corpus].
2. Remove the projections of the average vectors on their first principal component ("common component removal") (y tho?).
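A minimal numpy sketch of the two steps above (the names `word_vecs`, `word_freq`, and the default `a` are illustrative assumptions, not taken from the paper's code):

```python
import numpy as np

def sif_embeddings(sentences, word_vecs, word_freq, a=1e-3):
    """sentences: list of token lists; word_vecs: dict token -> np.array(d);
    word_freq: dict token -> estimated unigram probability p(w)."""
    d = len(next(iter(word_vecs.values())))
    V = np.zeros((len(sentences), d))
    # Step 1: weighted average a / (a + p(w)) of the word vectors in each sentence.
    for i, sent in enumerate(sentences):
        tokens = [w for w in sent if w in word_vecs]
        if tokens:
            V[i] = np.mean([a / (a + word_freq[w]) * word_vecs[w] for w in tokens], axis=0)
    # Step 2: remove the projection onto the first principal component (common component removal).
    u = np.linalg.svd(V, full_matrices=False)[2][0]   # first right singular vector
    return V - np.outer(V @ u, u)
```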
253
Further Reading.
• Learning representations by back-propagating errors (Rumelhart et al., 1986)
• A Neural Probabilistic Language Model (Bengio et al., 2003)
• NLP (almost) from Scratch (Collobert & Weston, 2008)
• Word2Vec (Mikolov et al., 2013)
254
NLP with Deep Learning May 08
GloVe (Lec 3)
Table of Contents Local Written by Brandon McKinzie
GloVe (Global Vectors). Given some co-occurrence matrix we computed with previous methods, we can use the following GloVe loss function over all pairs of co-occurring words in our matrix:
$$J(\theta) = \sum_{i,j=1}^{W} f(P_{ij})\,\big(u_i^T v_j - \log P_{ij}\big)^2 \tag{758}$$
where $P_{ij}$ is computed simply from the counts of words $i$ and $j$ co-occurring (empirical probability) and $f$ is some squashing function that really isn't discussed in this lecture (TODO).
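A small numpy sketch of evaluating this loss for given parameters; the weighting function f used here is the clipped power function from the GloVe paper, which I'm assuming since the lecture doesn't specify it:

```python
import numpy as np

def glove_loss(U, V, P, x_max=100.0, alpha=0.75):
    """U, V: (W, d) arrays of context/center vectors; P: (W, W) co-occurrence counts."""
    f = np.minimum((P / x_max) ** alpha, 1.0)        # squashing/weighting function (assumed form)
    with np.errstate(divide="ignore"):
        logP = np.where(P > 0, np.log(P), 0.0)       # only pairs that actually co-occur contribute
    err = U @ V.T - logP                              # u_i^T v_j - log P_ij for all pairs
    return np.sum(f * np.where(P > 0, err, 0.0) ** 2)
```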
Woah that is pretty neat. The solution is xi = queen. xb − xa is the vector pointing
from man to woman, which encodes the type of similarity we are looking for with the
other pair. Therefore, we take the vector to “king” and add the aforementioned difference
vector – the resultant vector should point to “queen”. Neat!
255
Derivation. Based on the descriptions in the original paper169 We want to develop a model
for learning word vectors.
1. The authors argue that "the appropriate starting point for word vector learning should
be with ratios of co-occurrence probabilities rather than the probabilities themselves."
The most general such model takes the form,
$$F(w_i, w_j, \tilde{w}_k) = \frac{\Pr[\tilde{w}_k \mid w_i]}{\Pr[\tilde{w}_k \mid w_j]} \equiv \frac{P_{ik}}{P_{jk}} \qquad \text{where all } w \in \mathbb{R}^d \tag{760}$$
and the tilde in $\tilde{w}_k$ denotes that $\tilde{w}_k$ is a context word vector, which are given a distinct
space from the word vectors $w_i$ and $w_j$ being compared. We compute all $P_{ik}$ via frequency
counts over the corpus.
2. Now that we've specified the inputs and ratio of interest, we can start specifying some
desirable constraints on the function F that we're trying to find. The authors speculate
that, since vector spaces are linear structures, we should have F encode the information
of the ratio in the vector space via vector differences:
$$F(w_i - w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}} \tag{761}$$
which basically says "our representation of the word vectors should be s.t. the relative probability of some word $\tilde{w}_k$ occurring in the context of a word $w_i$ compared to it occurring in the context of a different word $w_j$ can be captured by $w_i - w_j$ and $\tilde{w}_k$ alone."
3. Next we notice that F maps arguments in $\mathbb{R}^d$ to a scalar in $\mathbb{R}$. The most straightforward
way of doing so while maintaining the linear structure we are trying to capture is via a
dot product:
$$F\big((w_i - w_j)^T \tilde{w}_k\big) = \frac{P_{ik}}{P_{jk}} \tag{762}$$
Note that now $F : \mathbb{R} \mapsto \mathbb{R}$.
4. We want our model to be invariant under the exchanges $w \leftrightarrow \tilde{w}$ and $X \leftrightarrow X^T$. We can
restore this symmetry by first requiring that F be a homomorphism^170 between the
groups $(\mathbb{R}, +)$ and $(\mathbb{R}_{>0}, \times)$ (in our case, negation and division would be better symbols,
but it's equivalent). This requires that the following relation hold^171
$$F\big(w_i^T \tilde{w}_k - w_j^T \tilde{w}_k\big) = \frac{F(w_i^T \tilde{w}_k)}{F(w_j^T \tilde{w}_k)} \tag{763}$$
where I've grouped terms on the LHS to emphasize how this is the definition of homomorphism. The solution for this equation is that $F(\cdot) \equiv \exp(\cdot)$. Combining this realization
with equations 762 and 763 yields,
$$F(w_i^T \tilde{w}_k) = e^{w_i^T \tilde{w}_k} = P_{ik} \tag{764}$$
$$\Rightarrow\; w_i^T \tilde{w}_k = \log(P_{ik}) = \log(X_{ik}) - \log(X_i) \tag{765}$$
169
Pennington et al., “GloVe: Global Vectors for Word Representation.”
170
In more detail, $F : (\mathbb{R}, +) \mapsto (\mathbb{R}_{>0}, \times)$, i.e. F maps the group of reals under addition to the group of positive reals under multiplication; being a homomorphism means $F(a + b) = F(a)\,F(b)$.
171
Note that we do not need to know anything about the RHS of the equations above to state this relation.
We write it by definition of homomorphism.
256
where $X_{ik}$ is the number of times word $k$ appears in the context of word $i$, and $X_i = \sum_k X_{ik}$ is the number of times any word appears in the context of $i$.
5. Restore symmetry under the exchanges $w \leftrightarrow \tilde{w}$ and $X \leftrightarrow X^T$. We absorb $\log X_i$ into a bias
$b_i$ for $w_i$ since it is independent of $k$.
$$w_i^T \tilde{w}_k + b_i + \tilde{b}_k = \log(X_{ik}) \tag{766}$$
6. A main drawback to this model is that it weighs all co-occurrences equally, even those
that happen rarely or never. The authors propose a new weighted least squares regression
model, introducing a weighting function $f(X_{ij})$ into the cost function of our model:
$$J = \sum_{i,j=1}^{V} f(X_{ij})\,\big(w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\big)^2 \tag{767}$$
257
Speech and
Language
Processing
Contents
258
Speech and Language Processing July 10, 2017
Overview. Going to rapidly jot down what seems most important from this chapter.
• Morphology: captures information about the shape and behavior of words in context
(Ch. 2/3).
• Syntax: knowledge needed to order and group words together.
• Lexical semantics: knowledge of the meanings of individual words.
• Compositional semantics: knowledge of how these components (words) combine to
form larger meanings.
• Pragmatics: the appropriate use of polite and indirect language.
• The knowledge of language needed to engage in complex language behavior can be sep-
arated into the following 6 distinct categories:
1. Phonetics and Phonology – The study of linguistic sounds.
2. Morphology – The study of the meaningful components of words.
3. Syntax – The study of the structural relationships between words.
4. Semantics – The study of meaning.
5. Pragmatics – The study of how language is used to accomplish goals.
6. Discourse – The study of linguistic units larger than a single utterance.
• Methods for resolving ambiguity: pos-tagging, word sense disambiguation, probabilis-
tic parsing, and speech act interpretation.
• Models and Algorithms. Among the most important are state space search and
dynamic programming algorithms.
259
Speech and Language Processing July 10, 2017
English Morphology. Morphology is the study of the way words are built up from smaller
meaning- bearing units, morphemes. A morpheme is often defined as the minimal meaning-
bearing unit in a language172 . The two classes of morphemes are stems (the “main” morpheme
of the word) and affixes (the “additional” meanings of various kinds).
Affixes are further divided into prefixes (precede stem), suffixes (follow stem), circumfixes (do
both), and infixes (inside the stem).
Two classes of ways to form words from morphemes: inflection and derivation. Inflection is
the combination of a word stem with a grammatical morpheme, usually resulting in a word of
the same class as the original stem, and usually filling some syntactic function like agreement.
Derivation is the combination of a word stem with a grammatical morpheme, usually resulting
in a word of a different class, often with a meaning hard to predict exactly.
172
Examples: “fox” is its own morpheme, while “cats” consists of the morpheme “cat” and the morpheme
“-s”.
260
Speech and Language Processing July 10, 2017
Counting Words. Most N -gram based systems deal with the wordform, meaning they treat
words like “cats” and “cat” as distinct. However, we may want to treat the two as instances
of a single abstract word, or lemma: a set of lexical forms having the same stem, the same
major part of speech, and the same word-sense.
where C(·) is the number of times the sequence, denoted as ·, occurred in the corpus.
Entropy. Denote the random variable of interest as X with probability function p(x). The
entropy of this random variable is:
$$H(X) = -\sum_{x} p(X = x)\, \log_2 p(X = x) \tag{770}$$
which should be thought of as a lower bound on the number of bits it would take to encode
a certain decision or piece of information in the optimal coding scheme. The value $2^H$ is the
perplexity, which can be interpreted as the weighted average number of choices a random
variable has to make.
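As a quick sanity check of these two definitions, here is a small sketch (the fair-die example at the end is just an illustration):

```python
import numpy as np

def entropy_and_perplexity(p):
    """p: 1-D array of probabilities summing to 1. Returns (entropy in bits, perplexity)."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    H = -np.sum(nz * np.log2(nz))
    return H, 2.0 ** H

# A fair 8-sided die: H = 3 bits, perplexity = 8 "choices".
print(entropy_and_perplexity(np.ones(8) / 8))
```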
261
Cross Entropy for Comparing Models. Useful when we don't know the actual probability distribution p that generated some data. Assume we have some model m that's an
approximation of p. The cross-entropy of m on p is defined by:
$$H(p, m) = \lim_{n \to \infty} -\frac{1}{n} \sum_{W \in L} p(w_1, \ldots, w_n)\, \log m(w_1, \ldots, w_n) \tag{771}$$
That is, we draw sequences according to the probability distribution p, but sum the log of their
probability according to m.
262
Speech and Language Processing June 21, 2017
Overview. This chapter is concerned with text categorization, the task of classifying an
entire text by assigning it a label drawn from some set of labels. Generative classifiers like naive
Bayes build a model of each class. Given an observation, they return the class most likely to
have generated the observation. Discriminative classifiers like logistic regression instead learn
what features from the input are most useful to discriminate between the different possible
classes. (Discriminative systems are often more accurate and hence more commonly used.)
Notation: we will assume we have a training set of N documents, each hand-labeled
with some class: $\{(d_1, c_1), \ldots, (d_N, c_N)\}$.
Naive Bayes. A multinomial173 classifier with a naive assumption about how the features
interact. We model a text document as a bag of words, meaning we store (1) the words
that occurred and (2) their frequencies. It’s a probabilistic classifier, meaning it estimates the
label/class of a document d as
Rapid review: multinomial distribution. Let $x = (x_1, \ldots, x_k)$ denote the result of an experiment with
$n$ independent trials ($n = \sum_{i=1}^{k} x_i$) and $k$ possible outcomes for any given trial, i.e. $x_i$ is the number of trials
that had outcome $i$ ($1 \le i \le k$). The pmf of this multinomial distribution, over all possible $x$ constrained by
$n = \sum_{i=1}^{k} x_i$, is:
$$\Pr[x = (x_1, \ldots, x_k); n] = \frac{n!}{x_1! \cdots x_k!}\, p_1^{x_1} \times \cdots \times p_k^{x_k} \tag{772}$$
where $p_i$ is the probability of outcome $i$ for any single trial.
263
• Naive Bayes Assumption: First, recall that d is typically modeled as a (random)
vector consisting of features f1 , . . . , fn , each of which has an associated probability dis-
tribution Pr [fi | c]. The NB assumption is that the features are mutually independent
given the class c:
Pr [f1 , . . . , fn | c] = Pr [f1 | c] · · · Pr [fn | c] (776)
$$c_{NB} = \arg\max_{c \in C} \Pr[c] \prod_{f \in F} \Pr[f \mid c] \tag{777}$$
where 777 is the final equation for the class chosen by the naive Bayes classifier.
In text classification we typically use the word at position i in the document as fi , and move
to log space to avoid underflow/increase speed:
$$c_{NB} = \arg\max_{c \in C} \left[\log \Pr[c] + \sum_{i=1}^{\text{len}(d)} \log \Pr[w_i \mid c]\right] \tag{778}$$
Classifiers that use a linear combination of the inputs to make a classification decision (e.g.
NB, logistic regression) are called linear classifiers.
Training the NB Classifier. No real "training" in my opinion, just simple counting from
the data:
$$\hat{P}[c] = \frac{N_c}{N_{docs}} \tag{779}$$
$$\hat{P}[w_i \mid c] = \frac{\text{count}(w_i, c) + 1}{\left(\sum_{w \in V} \text{count}(w, c)\right) + |V|} \tag{780}$$
The Laplace smoothing is added to avoid the occurrence of zero-probabilities in equation 777.
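A minimal sketch of the counting-based training in equations 779–780 and the decision rule in equation 778 (log space, add-one smoothing); the function and variable names are just illustrative:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """docs: list of token lists; labels: list of class labels."""
    n_docs = len(docs)
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)           # word_counts[c][w] = count(w, c)
    vocab = set()
    for tokens, c in zip(docs, labels):
        word_counts[c].update(tokens)
        vocab.update(tokens)
    log_prior = {c: math.log(n / n_docs) for c, n in class_counts.items()}          # eq. 779
    log_lik = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + len(vocab)                           # eq. 780 denominator
        log_lik[c] = {w: math.log((word_counts[c][w] + 1) / total) for w in vocab}
    return log_prior, log_lik, vocab

def predict_nb(tokens, log_prior, log_lik, vocab):
    scores = {c: log_prior[c] + sum(log_lik[c][w] for w in tokens if w in vocab)
              for c in log_prior}                                                   # eq. 778
    return max(scores, key=scores.get)
```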
264
Speech and Language Processing July 28, 2017
Overview. Here we will first go over the math behind HMMs: the Viterbi, Forward, and
Baum-Welch (EM) algorithms for unsupervised or semi-supervised learning. Recall that
a HMM is defined by specifying the set of N states Q, transition matrix A, sequence of T
observations O, sequence of observation likelihoods B, and the initial/final states. They can
be characterized by three fundamental problems:
1. Likelihood. Given an HMM λ = (A, B) and observation sequence, compute the likeli-
hood (prob. of the observations given the model). (Forward)
2. Decoding. Given an HMM λ = (A, B) and observation sequence, discover the best
hidden state sequence. (Viterbi)
3. Learning. Given an observation sequence and the set of states in the HMM, learn the
HMM parameters A and B. (Baum-Welch/Forward-Backward/EM)
The Forward Algorithm. For likelihood computation. We want to compute the probability
of some sequence of observations O, without knowing the sequence of hidden states (that
emitted the observations) Q. In general, this can be expressed by summing over all possible
hidden state sequences:
$$\Pr[O] = \sum_{Q} \Pr[Q]\, \Pr[O \mid Q] \tag{781}$$
However, for N hidden states and T observations, this summation involves $N^T$ terms, which
becomes intractable rather quickly. Instead, we can use the $O(N^2 T)$ Forward Algorithm.
The forward algorithm can be defined by initialization, recursion definition, and termination,
shown respectively as follows:
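A sketch of those three phases in code, under the standard notation ($\alpha_t(j)$ forward probabilities, A transitions, B emissions, $\pi$ initial distribution); the names here are illustrative assumptions:

```python
import numpy as np

def forward(A, B, pi, obs):
    """A: (N, N) transitions, B: (N, M) emission likelihoods, pi: (N,) initial probs,
    obs: list of observation symbol indices. Returns Pr[O | lambda]."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                  # initialization
    for t in range(1, T):                         # recursion: O(N^2 T) total
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha[-1].sum()                        # termination
```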
265
Viterbi Algorithm. For decoding. Want the most likely hidden state sequence given observa-
tions. Let vt (j) represent the probability that we are in state j after t observations and passing
through the most probable state sequence q0 , q1 , . . . , qt−1 . Similar to the forward algorithm,
we show the defining equations for the Viterbi algorithm below:
N.B.: At each step, the best path up to that point can be found by taking the argmax instead
of max.
We can use the forward and backward probabilities α and β to compute the transition prob-
abilities aij and observation probabilities bi (ot ) from an observation sequence. The derivation
steps are as follows:
266
1. Estimating $\hat{a}_{ij}$. Begin by defining quantities that will prove useful (remember, knowing the
observation sequence does NOT give us the sequence of hidden states):
$$\xi_t(i, j) \triangleq \Pr[q_t = i,\, q_{t+1} = j \mid O, \lambda] \tag{793}$$
$$\tilde{\xi}_t(i, j) \triangleq \Pr[q_t = i,\, q_{t+1} = j,\, O \mid \lambda] \tag{794}$$
$$= \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j) \tag{795}$$
where you should be able to derive eq. 795 in your head using just logic. If you cannot,
review before continuing. We can then derive $\xi_t(i, j)$ using basic definitions of conditional
probability, combined with eq. 791. Finally, we estimate $\hat{a}_{ij}$ as the expected number of
transitions $q_i \to q_j$ divided by the expected number of transitions from $q_i$ total:
$$\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \sum_{k=1}^{N} \xi_t(i, k)} \tag{796}$$
2. Estimating $\hat{b}_j(v_k)$. We define our estimate as the expected number of times we are in
$q_j$ and emit observation $v_k$, divided by the expected number of times we are in state $j$.
Similar to our approach for $\hat{a}_{ij}$ we define helper quantities for these values at a given
timestep, then sum over them (all $t$) to obtain our estimate.
Thus, we obtain $\hat{b}_j(v_k)$ by summing over all timesteps where $o_t = v_k$, denoted as the set
$T_{v_k}$, divided by the summation over all $t$ regardless of $o_t$:
$$\hat{b}_j(v_k) = \frac{\sum_{t \in T_{v_k}} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)} \tag{800}$$
267
Speech and Language Processing July 30, 2017
English Parts-of-Speech. POS are traditionally defined based on syntactic and morpholog-
ical function, grouping words that have similar neighboring words or take similar affixes.
Closed classes. POS with relatively fixed membership. Some of the most important in
English are:
Some subtleties: the particle resembles a preposition or an adverb and is used in combination
with a verb. An example case where “over” is a particle: “she turned the paper over.” When a
verb and a particle behave as a single syntactic and/or semantic unit, we call the combination
a phrasal verb. Phrasal verbs cause widespread problems with NLP because they often
behave as a semantic unit with a noncompositional meaning – one that is not predictable from
the distinct meanings of the verb and the particle. Thus, “turn down” means something like
“reject”, “rule out” means “eliminate”, “find out” is “discover”, and “go on” is “continue”.
268
HMM POS Tagging. Since we typically train on labeled data, we need only use the Viterbi
algorithm for decoding^174. In the POS case, we wish to find the sequence of n tags, $\hat{t}_1^n$, given
the observation sequence of n words $w_1^n$.
$$\hat{t}_1^n = \arg\max_{t_1^n} \Pr[t_1^n \mid w_1^n] = \arg\max_{t_1^n} \Pr[w_1^n \mid t_1^n]\, \Pr[t_1^n] \tag{801}$$
where we've dropped the denominator after using Bayes' rule (since the argmax is the same).
HMM taggers make two further simplifying assumptions:
$$\Pr[w_1^n \mid t_1^n] \approx \prod_{i=1}^{n} \Pr[w_i \mid t_i] \tag{802}$$
$$\Pr[t_1^n] \approx \prod_{i=1}^{n} \Pr[t_i \mid t_{i-1}] \tag{803}$$
We can thus plug these values into eq. 801 to obtain the equation for $\hat{t}_1^n$:
$$\hat{t}_1^n = \arg\max_{t_1^n} \prod_{i=1}^{n} \Pr[w_i \mid t_i]\, \Pr[t_i \mid t_{i-1}] \tag{804}$$
$$= \arg\max_{t_1^n} \prod_{i=1}^{n} b_i(w_i)\, a_{i-1,i} \tag{805}$$
where I’ve written a “translated” version on the second line using the familiar syntax from
the previous chapter. In practice, we can obtain quick estimates for the two probabilities on
the RHS by taking counts/averages over our tagged training data. We then run through the
Viterbi algorithm to find all the argmaxes over states for the most likely hidden state sequence.
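A sketch of that Viterbi decoding step with count-based estimates plugged in (log probabilities and backpointers; the array names are illustrative assumptions):

```python
import numpy as np

def viterbi(logA, logB, log_pi, obs):
    """logA: (N, N) log tag-transition probs, logB: (N, M) log emission probs,
    log_pi: (N,) log initial probs, obs: list of word indices. Returns best tag sequence."""
    N, T = logA.shape[0], len(obs)
    v = np.full((T, N), -np.inf)
    back = np.zeros((T, N), dtype=int)
    v[0] = log_pi + logB[:, obs[0]]
    for t in range(1, T):
        scores = v[t - 1][:, None] + logA          # scores[i, j]: best path ending in i, then i -> j
        back[t] = scores.argmax(axis=0)            # argmax (backpointer), per the N.B. above
        v[t] = scores.max(axis=0) + logB[:, obs[t]]
    path = [int(v[-1].argmax())]
    for t in range(T - 1, 0, -1):                  # follow backpointers from the best final state
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```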
Visually, we can think of the difference between HMMs and MEMMs via the direction of
arrows, as illustrated below.
174
Recall that decoding is the problem of finding the best hidden state sequence, given λ = (A, B) and
observation sequence O.
175
Because it is based on logistic regression, the MEMM is a discriminative sequence model. By contrast,
the HMM is a generative sequence model.
269
The top shows the HMM representation, while the bottom is MEMM.
The reason to use a discriminative sequence model is that discriminative models make it
easier to incorporate a much wider variety of features.
Bidirectionality. The one problem with the MEMM and HMM models as presented is that
they are exclusively run left-to-right. MEMMs176 have a weakness known as the label bias
problem. Consider the tagged fragment: “will/NN to/TO fight/VB 177 .” Even though the
word “will“ is followed by “to”, which strongly suggests “will” is a NN, a MEMM will incor-
rectly label “will” as MD (modal verb). The culprit lies in the fact that Pr [TO | to, twill ] is
essentially 1 regardless of twill ; i.e. the fact that “to” must have the tag TO has explained
away the presence of TO and so the model doesn’t learn the importance of the previous NN
tag for predicting TO.
One way to implement bidirectionality (and thus allowing e.g. the link between TO being
available when tagging the NN) is to use a Conditional Random Field (CRF) model.
However, CRFs are much more computationally expensive than MEMMs and don’t work better
for tagging.
176
And other non-generative finite-state models based on next-state classifiers
177
Note on the tag meanings: TO literally means “to”. MD means “modal” and refers to modal verbs such as
will, shall, etc.
270
Speech and Language Processing August 5, 2017
Constituency and CFGs. Discovering the inventory of constituents present in the language.
Groups of words like noun phrases or prepositional phrases can be thought of as single units
which can only be placed within certain parts of a sentence.
The most widely used formal system for modeling constituent structure in English is the
Context-Free Grammar178 . A CFG consists of a set of productions (rules), e.g.
The sequence of rule expansions going from left to right is called a derivation of the string of
words, commonly represented by a parse tree. The formal language defined by a CFG is the
set of strings that are derivable from the designated start symbol.
178
Also called Phrase-Structure Grammars. Equiv formalism as Backus-Naur Form (BNF)
271
Speech and Language Processing June 21, 2017
Words and Vectors. Vector models are generally based on a co-occurrence matrix, an example
of which is a term-document matrix: each row is identified by a word, and each column by
a document. A given cell value is the number of times the assoc. word occurred in the assoc.
document. Can also view each column as a document vector. (Information Retrieval: the task
of finding the document d, from D docs total, that best matches a query q.)
For individual word vectors, however, it is most common to instead use a term-term matrix179 ,
in which columns are also identified by individual words. Now, cell values are the number
of times the row (target) word and the column (context) word co-occur in some context in
some training corpus. The context is most commonly a window around the row/target word,
meaning a cell gives the number of times the column word occurs in a window of ±N words
from the row word.
• Q: What about the co-occurrence of a word with itself (row i, col i)?
– A: It is included, yes. Source: “The size of the window . . . is generally between 1
and 8 words on each side of the target word (for a total context of 3-17 words).”
• Q: Why is the size of each vector generally |V | (vocab size)? Shouldn’t this vary
substantially with window and corpus size?
– A: idk
(TODO: revisit end of page 5 in my pdf of this)
which can be applied for our specific use case as $\text{PMI}(w, c) = \ln \frac{P(w, c)}{P(w)\,P(c)}$. The interpretation is
simple: the denominator tells us the joint probability of the given target word w occurring with
context word c if they were independent of each other, while the numerator tells us how often
we observed the two words together (assuming we compute probability by using the MLE).
179
Also called the word-word or term-context matrix
272
Therefore, the ratio gives us an estimate of how much more the target and feature co-occur
than we expect by chance^180. Most people use Positive PMI, which is just max(0, PMI). We
can compute a PPMI matrix (to replace our co-occurrence matrix), where PPMIij gives the
PPMI value of word wi with context cj . The authors show a few formulas which is really
distracting since all we actually need is the counts fij = counts(wi , cj ), and from there we can
use basic probability and Bayes rule to get the PPMI formula.
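A small sketch of exactly that: going from the raw counts $f_{ij}$ to a PPMI matrix with basic probability (names are illustrative):

```python
import numpy as np

def ppmi(F):
    """F: (W, C) matrix of co-occurrence counts f_ij = count(w_i, c_j)."""
    total = F.sum()
    p_wc = F / total                        # joint MLE
    p_w = p_wc.sum(axis=1, keepdims=True)   # target-word marginals
    p_c = p_wc.sum(axis=0, keepdims=True)   # context-word marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0            # zero-count pairs contribute nothing
    return np.maximum(pmi, 0.0)             # positive PMI
```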
• Q: Explain why the following is true: very rare words tend to have very high PMI values.
– A: hi
• Q: What is the range of α used in PPMIα ? What is the intuition behind doing this?
– A: For reference:
$$\text{PPMI}_\alpha(w, c) = \max\left(\ln \frac{P(w, c)}{P(w)\,P_\alpha(c)},\; 0\right) \tag{812}$$
$$P_\alpha(c) = \frac{\text{count}(c)^\alpha}{\sum_{c'} \text{count}(c')^\alpha} \tag{813}$$
Although there are better methods than PPMI for weighted co-occurrence matrices, most
notably TF-IDF, things like tf-idf are not generally used for measuring word similarity. For
that, PPMI and significance-testing metrics like t-test and likelihood-ratio are more common.
The t-test statistic, like PMI, measures how much more frequent the association is than
chance.
$$t = \frac{\bar{x} - \mu}{\sqrt{s^2 / N}} \tag{814}$$
$$\text{t-test}(a, b) = \frac{P(a, b) - P(a)P(b)}{\sqrt{P(a)P(b)}} \tag{815}$$
where x̄ is the observed mean, while µ is the expected mean [under our null-hypothesis of
independence].
Measuring Similarity. By far the most common similarity metric is the cosine of the angle
between the vectors:
$$\text{cosine}(v, w) = \frac{v \cdot w}{|v|\,|w|} \tag{816}$$
Note that, since we’ve been defining vector elements as frequencies/PPMI values, they won’t
have negative elements, and thus our cosine similarities will be between 0 and 1 (not -1 and
1).
180
Computing PMI this way can be problematic for word pairs with small probability, especially if we have a
small corpus. Recognize that PMI should never really be negative, but in practice this happens for such cases
273
Alternatives to cosine:
• Jaccard measure: Described as "weighted number of overlapping features, normalized",
but looks like a silly hack in my opinion:
$$\text{sim}_{\text{Jac}}(v, w) = \frac{\sum_{i=1}^{N} \min(v_i, w_i)}{\sum_{i=1}^{N} \max(v_i, w_i)} \tag{817}$$
$$\text{sim}_{\text{JS}}(v \| w) = D\!\left(v \,\Big\|\, \frac{v + w}{2}\right) + D\!\left(w \,\Big\|\, \frac{v + w}{2}\right) \tag{820}$$
181
Idea: if two vectors, v and w, each express a probability distribution (their values sum to one), then
they are similar to the extent that these probability distributions are similar. The basis of comparing two
probability distributions P and Q is the Kullback-Leibler divergence or relative entropy, defined as:
$$D(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} \tag{819}$$
274
Speech and Language Processing June 21, 2017
Overview. This chapter introduces three methods for generating short, dense vectors: (1)
dimensionality reduction like SVD, (2) neural networks like skip-gram or CBOW, and (3)
Brown clustering.
Dense Vectors via SVD. Method for finding the more important dimensions of a dataset, "important" defined as dimensions wherein the data most varies. First applied (for language)
for generating embeddings from term-document matrices in a model called Latent Semantic
Analysis (LSA). LSA is just SVD on a $|V| \times c$ term-document matrix X, factorized into
$W \Sigma C^T$, with $W \in \mathbb{R}^{|V| \times m}$, $\Sigma \in \mathbb{R}^{m \times m}$, and $C^T \in \mathbb{R}^{m \times c}$. By using only the top $k < m$ dimensions of these three matrices, the product becomes a least-squares approx. to the original X. It also gives us the reduced $|V| \times k$ matrix $W_k$,
where each row (word) is a k-dimensional vector (embedding). Voilà, we have our dense vectors!
Note that LSA implementations typically use a particular weighting of each cell in the term-document matrix called the local and global weights.
Skip-Gram and CBOW. Neural models learn an embedding by starting with a random
vector and then iteratively shifting a word’s embeddings to be more like the embeddings of
neighboring words, and less like the embeddings of words that don’t occur nearby182 Word2vec,
for example, learns embeddings by training to predict neighboring words183 .
• Skip-Gram: Learns two embeddings for each word w: the word embedding v (within
matrix W) and context embedding c (within matrix C). Visually:
$$W = \begin{bmatrix} v_0^T \\ v_1^T \\ \vdots \\ v_{|V|}^T \end{bmatrix} \qquad C = \begin{bmatrix} c_0 & c_1 & \cdots & c_{|V|} \end{bmatrix} \tag{823}$$
182
Why? Why is this a sensible assumption? I see no reason a priori why it ought to be true.
183
Note that the prediction task is not the goal – it just happens to result in good word embeddings. Hacky.
275
For a context window of L = 2, and at a given word $v^{(t)}$ inside the corpus^184, our goal is
to predict the context [words] denoted as $[\,c^{(t-2)}, c^{(t-1)}, c^{(t+1)}, c^{(t+2)}\,]$.
– Example: Consider one of the context words, say $c^{(t+1)} \triangleq c_k$, where we also assume
it's the kth word in our vocab. Also assume that our target word $v^{(t)} \triangleq v_j$ is the
jth word in our vocab.
– Our task is to compute $\Pr[c_k \mid v_j]$. We do this with a softmax:
$$\Pr[c_k \mid v_j] = \frac{e^{c_k^T v_j}}{\sum_{i \in |V|} e^{c_i^T v_j}} \tag{824}$$
Example: Suppose we come along the following window (in “[]”) (L=2) in our corpus:
lemon, a [tablespoon of apricot preserves or] jam
Ultimately, we want dot products, ci · vector(“apricot”), to be high for all four of the context
words ci . We do negative sampling by sampling k random noise words according to their
[unigram] frequency. So here, for e.g. k = 2, this would amount to 8 noise words, 2 for each
context word. We want the dot products between “apricot” and these noise words to be low.
For a given single context-word pair (w, c), our training objective is to maximize:
$$\log \sigma(c \cdot w) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim p(w)}\big[\log \sigma(-w_i \cdot w)\big] \tag{825}$$
(In practice, it is common to use $p^{3/4}(w)$ instead of $p(w)$.)
Again, the above is for a single context-target word-pair and, accordingly, the summation is
only over k = 2 (for our example). Don’t try to split the expectation into a summation or any-
thing – just view it as an expected value. To iteratively shift parameters, we use an optimizer
like SGD.
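A sketch of evaluating the objective in equation 825 for one (target, context) pair, with noise words drawn from the smoothed unigram distribution mentioned in the margin note; all names here are illustrative assumptions:

```python
import numpy as np

def sgns_objective(w, c, noise_vecs):
    """w: target word vector, c: context word vector, noise_vecs: (k, d) sampled noise vectors."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    pos = np.log(sigmoid(c @ w))                   # log sigma(c . w)
    neg = np.log(sigmoid(-noise_vecs @ w)).sum()   # sum_i log sigma(-w_i . w)
    return pos + neg                               # maximize this (negate for an SGD loss)

def sample_noise(counts, k, rng=np.random.default_rng(0)):
    """counts: unigram counts per vocab word; sample k indices from p(w)^{3/4}."""
    p = counts ** 0.75
    p = p / p.sum()
    return rng.choice(len(counts), size=k, p=p)
```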
184
Note that, technically, the position t of v (t) is irrelevant for our computation; we are predicting those words
based on which word v (t) is in the vocabulary, not it’s position in the corpus.
276
The actual model architecture is a typical neural net, progressing as follows:
A naive and extremely inefficient version of Brown clustering, a hierarchical clustering algo-
rithm, is as follows:
1. Each word is initially assigned to its own cluster.
2. For each cluster pair (ci , cj6=i ), compute the value of eq 832 that would result from merging
ci and cj into a single cluster. The pair whose merger results in the smallest decrease in
eq 832 is merged.
3. Clustering proceeds until all words are in one big cluster.
This process builds a binary tree from the bottom-up, and the binary string corresp. to a
word’s traversal from leaf-to-root is its representation.
277
Speech and Language Processing July 27, 2017
Overview. The first step in most IE tasks is named entity recognition (NER). Next we
can do relation extraction: finding and classifying semantic relations among the entities,
e.g. “spouse-of.” Event extraction is finding the events in which the entities participate, and
event coreference for figuring out which event mentions actually refer to the same event. It’s
also common to extract dates/times (temporal expression) and perform temporal expression
normalization to map them onto specific calendar dates. Finally, we can do template
filling: finding recurring/stereotypical situations in documents and filling the template slots
with appropriate material.
Here we see a classifier determining the label for Corp. with a context window of size 2 and
various features shown in the boxed region. For evaluation of NER, we typically use the familiar
recall, precision, and F1 measure.
185
2n for B-<NE> and I-<NE>, with +1 for the blanket O tag (not any of our NEs)
278
Relation Extraction. The four main algorithm classes used are (1) hand-written patterns,
(2) supervised ML, (3) semi-supervised, and (4) unsupervised. Terminology:
• Infobox: structured tables associated with certain articles/topics/etc. For example, the
Wikipedia infobox for Stanford includes structured facts like state = ’California’.
• Resource Description Framework (RDF): a metalanguage of RDF triples, tuples
of (entity, relation, entity), called a subject-predicate-object expression. For example:
(Golden Gate Park, location, San Francisco).
• hypernym: the “is-a” relation.
• hyponym: the “kind-of” relation. Gelidium is a kind of red algae.
186
Here, hyponym(A, B) means “A is a kind-of (hyponym) of B.”
187
seed tuples are tuples of the general form (M1, M2) where M1 and M2 are each specific named entities we
know have the relation of interest R.
279
4. Unsupervised. The ReVerb system extracts a relation from a sentence s in 4 steps:
Event Extraction. An event mention is any expression denoting an event or state that can
be assigned to a particular point, or interval, in time. Note that this is quite different than the
colloquial usage of the word “event,” you should think of the two as distinct. Here, most event
mentions correspond to verbs, and most verbs introduce events. Event extraction is typically
modeled via ML, detecting events via sequence models with BIO tagging, and assigning event
classes/attributes with multi-class classifiers.
Template Filling. The task is creation of one template for each event in the input documents,
with the slots filled with text from the document. For example, an event could be “Fare-Raise
Attempt” with corresponding template (slots to be filled) “(<Lead Airline>, <Amount>, <Ef-
fective Date>, <Follower>)”. This is generally modeled by training two separate supervised
systems:
1. Template recognition. Trained to determine if template T is present in sentence S.
Here, “present” means there is a sequence within the sentence that could be used to fill
a slot within template T.
2. Role-filler extraction. Trained to detect each role (slot-name), e.g. “Lead Airline”.
280
Probabilistic
Graphical
Models
Contents
281
Probabilistic Graphical Models May 13, 2018
Foundations (Ch. 2)
Table of Contents Local Written by Brandon McKinzie
Paths and Trails. Definitions for longer-range connections in graphs. We use the notation
$X_i \rightleftharpoons X_j$ to denote that $X_i$ and $X_j$ are connected via some edge, whether directed (in any
direction) or undirected.
• Trail/Path. We say that $X_1, \ldots, X_k$ form a trail in the graph $K = (\mathcal{X}, E)$ if $\forall i =
1, \ldots, k-1$, we have that $X_i \rightleftharpoons X_{i+1}$. A path makes an additional restriction: either
$X_i \to X_{i+1}$ or $X_i$—$X_{i+1}$.
• Connected Graph. A graph is connected if ∀Xi , Xj there is a trail between Xi and Xj .
• Cycle. A cycle in K is a directed path X1 , . . . Xk where X1 = Xk .
• Loop. A loop in K is a trail where X1 = Xk . A graph is singly connected if it contains no
loops. A node in a singly connected graph is called a leaf if it has exactly one adjacent
node.
• Polytree/Forest. A singly connected graph is also called a polytree. A singly connected
undirected graph is called a forest; if a forest is also connected, it is called a tree.
– A directed graph is a forest if each node has at most one parent. A directed forest
is a tree if it is also connected.
• Chordal Graph. Let X1 —X2 — · · · —Xk —X1 be a loop in the graph. A chord in
the loop is an edge connecting Xi and Xj for two nonconsecutive nodes Xi , Xj . An
undirected graph H is said to be chordal if any loop X1 —X2 — · · · —Xk —X1 for k ≥ 4
has a chord.
188
BoundaryX , PaX ∪ NbX . For DAGs, this is simply X’s parents, and for undirected graphs X’s neighbors.
282
Probability. Some notational reminders for this book. Let Ω denote a space of possible
outcomes, and let S denote a set of measurable events α, each of which are a subset of Ω.
A probability distribution P over (Ω, S) is a mapping from events in S to real values that satisfy:
• P (α) ≥ 0 for all α ∈ S.
• P (Ω) = 1.
• If α, β ∈ S and α ∩ β = ∅, then P (α ∪ β) = P (α) + P (β).
Symmetry : (X ⊥ Y | Z) =⇒ (Y ⊥ X | Z) (833)
Decomposition : (X ⊥ (Y, W ) | Z) =⇒ (X ⊥ Y | Z) (834)
Weak Union : (X ⊥ (Y, W ) | Z) =⇒ (X ⊥ Y | Z, W ) (835)
Contraction : (X ⊥ W | Z, Y )&(X ⊥ Y | Z) =⇒ (X ⊥ Y, W | Z) (836)
I’ll be using the definition that (X ⊥ Y | Z) ⇔ P (X, Y | Z) = P (X | Z)P (Y | Z). Given this definition the proof
for the symmetry property is trivial. In what follows, I’ll assume the LHS of the given implication is true, and
then show that the RHS must hold as well.
Decomposition:
$$P(X, Y \mid Z) = \sum_{w} P(X, Y, w \mid Z) = \sum_{w} P(X \mid Z)\, P(Y, w \mid Z) = P(X \mid Z)\, P(Y \mid Z) \;\checkmark$$
Weak Union:
$$P(X, Y \mid Z, W) = \frac{P(X, Y, W \mid Z)}{P(W \mid Z)} \tag{837}$$
$$= \frac{P(X \mid Z)\, P(Y, W \mid Z)}{P(W \mid Z)} \tag{838}$$
$$= \frac{P(X \mid Z)\, P(W \mid Z)\, P(Y \mid Z, W)}{P(W \mid Z)} \tag{839}$$
$$= P(X \mid Z, W)\, P(Y \mid Z, W) \;\checkmark \tag{840}$$
Contraction:
$$P(X, Y, W \mid Z) = P(Y \mid Z)\, P(X, W \mid Z, Y) \tag{841}$$
$$= P(Y \mid Z)\, P(X \mid Z, Y)\, P(W \mid Z, Y) \tag{842}$$
$$= P(X \mid Z)\, [P(Y \mid Z)\, P(W \mid Z, Y)] \tag{843}$$
$$= P(X \mid Z)\, P(Y, W \mid Z) \;\checkmark \tag{844}$$
283
We now define what “positive distribution” means, and a useful property of such distributions.
A distribution P is said to be positive if for all events α ∈ S, such that α 6= ∅, we have that
P (α) > 0.
For positive distributions, and for mutually disjoint sets X, Y, Z, W, the intersection prop-
erty also holds:
7.1.1 Appendix
Figured this would be a good place to put some of the definitions in the Appendix, too.
Hp (X) is a lower bound for the expected number of bits required to encode instances
sampled from P (X). Another interpretation is that the entropy is a measure of our
uncertainty about the value of X.
• Conditional Entropy. The conditional entropy of X given Y is
$$H_P(X \mid Y) = H_P(X, Y) - H_P(Y) = \mathbb{E}_P\left[\lg \frac{1}{P(X \mid Y)}\right] \tag{848}$$
$$H_P(X \mid Y) \le H_P(X) \tag{849}$$
which captures the additional cost (in bits) of encoding X when we're already encoding
Y.
• Mutual Information. The mutual information between X and Y is
$$I_P(X; Y) = H_P(X) - H_P(X \mid Y) = \mathbb{E}_P\left[\lg \frac{P(X \mid Y)}{P(X)}\right] \tag{850}$$
which captures how many bits we save (on average) in the encoding of X if we know the
value Y.
• Distance Metric. A distance metric is any distance measure d evaluating the distance
between two distributions that satisfies all of the following properties:
– Positivity: d(P, Q) ≥ 0 and d(P, Q) = 0 if and only if P = Q.
284
– Symmetry: d(P, Q) = d(Q, P ).
– Triangle inequality: For any three distributions P, Q, R, we have that $d(P, R) \le d(P, Q) + d(Q, R)$.
• Relative Entropy (KL divergence).
$$D(P \| Q) = \mathbb{E}_P\left[\lg \frac{P(X_1, \ldots, X_n)}{Q(X_1, \ldots, X_n)}\right] \tag{852}$$
Note that this only satisfies the positivity property, and is thus not a true distance metric.
285
Combinatorial Optimization and Search (A.4). Below, I’ll outline some common search
algorithms. These are designed to address the following task:
Given initial candidate solution σcur , a score function score, and a set of search opera-
tors O, search for the optimal solution σbest that maximizes the value of score(σbest )
Repeat the following until didU pdate evaluates to f alse at the end of an iteration.
1. Initialize σbest := σcur .
2. Set didU pdate := f alse.
3. For each operator o ∈ O, do:
(a) Let σo := o(σbest ).
(b) If σo is legal solution, and score(σo ) > score(σbest ), reassign σbest := σo , and set
didU pdate := true.
4. If didU pdate == true, go back to step 2. Otherwise terminate and return σbest .
We are given a beam width K. Initialize our beam, the set of at most K solutions we are currently
tracking, to {σcur }. Repeat the following until terminationa :
1. Initial the set of successor states H := ∅.
2. For each solution σ ∈ Beam, and each operator o ∈ O, insert a candidate successor state o(σ)
into H.
3. Set Beam := KBestScore(H)b .
Once termination is reached, return the best solution σbest in Beam.
a
Termination condition could be e.g. an upper bound on number of iterations or on the improvement
achieved in the last iteration.
b
Notice that this implies an underlying assumption of beam search: all successor states σ ∈ H have
scores greater than any of the states in the current beam. We always assume improvement.
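A compact sketch of the beam-search loop just described (greedy hill climbing is the special case K = 1); `score` and `operators` stand in for whatever the task supplies:

```python
def beam_search(initial, score, operators, K=5, max_iters=100):
    """initial: starting solution; operators: list of functions mapping a solution to a candidate successor."""
    beam = [initial]
    for _ in range(max_iters):
        successors = [op(s) for s in beam for op in operators]
        if not successors:
            break
        # Keep the K best-scoring successor states (KBestScore in the pseudocode above).
        beam = sorted(successors, key=score, reverse=True)[:K]
    return max(beam, key=score)
```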
At risk of stating the obvious, g(η) is referred to as a “line” because it’s a function of the
form mx + b.
286
7.1.2 L-BFGS
Some notes from the textbook “Numerical Optimization” (chapters 8 and 9).
The BFGS Method (8.1). Begin the derivation by forming the following quadratic model
of the objective function f at the current iterate^189 $\theta_t$ (in $m_t(p)$, $p$ denotes the deviation at step t from the current parameters $\theta_t$):
$$m_t(p) = f_t + \nabla f_t^T p + \frac{1}{2} p^T B_t p \tag{856}$$
where $B_t$ is an $n \times n$ symmetric p.d. matrix that will be revised/updated every iteration (it is
not the Hessian!). The minimizer $p_t$ of this function can be written explicitly (using $\frac{\partial}{\partial p}\left[\frac{1}{2} p^T B_t p\right] = B_t p$):
$$p_t = -B_t^{-1} \nabla f_t \tag{857}$$
$$\theta_{t+1} \leftarrow \theta_t + \alpha_t p_t \tag{858}$$
where the step length αt is chosen to satisfy the Wolfe conditions190 . This is basically
Newton’s method with line search, except that we’re using the approximate Hessian Bt instead
of the true Hessian. It would be nice if we could somehow avoid recomputing Bt at each
step. One proposed method involves imposing conditions on Bt+1 based on the previous
step(s). Require that ∇mt+1 equal ∇f at the latest two iterates θt and θt+1 . Formally, the
two conditions can be written as
which is true only if st and yt satisfy the curvature condition, sTt yt > 0. The curvature
condition is guaranteed to hold if we impose the Wolfe conditions on the line search. As is, this
189
Recall that an “iterate” is just some variable that gets iteratively computed/updated. Fancy people with
fancy words.
190
The Wolfe conditions are the following sufficient decrease and curvature conditions for line search:
287
still has infinitely many solutions for Ht+1 . To determine it uniquely, we impose the additional
condition that Ht+1 is the closest of all possible solutions to the current Ht :
where $\|\cdot\|_W$ is the weighted Frobenius norm^191, and W can be any matrix satisfying $W s_t = y_t$.
For concreteness, assume that $W = \tilde{G}_t$, where
$$\tilde{G}_t = \int_0^1 \nabla^2 f(\theta_t + \tau \alpha_t p_t)\, d\tau \tag{870}$$
Algorithm 8.1 (BFGS Method). Given starting point θ0 , convergence tolerance > 0, and
inverse Hessian approximation H0 . Initialize t = 0. While ||∇ft || > do:
1. Compute search direction pt = −Ht ∇ft .
2. Set θt+1 = θt + αt pt , where αt is computed via line search to satisfy the Wolfe conditions.
3. Define st = θt+1 − θt and yt = ∇ft+1 − ∇ft .
4. Compute Ht by means of equation 8.16.
5. Increment t += 1 and go back to step 1.
191
288
L-BFGS. Modifies BFGS to store a modified version of Ht implicitly, by storing some number
m of vector pairs {si , yi }, corresponding to the m most recent time steps. We use a recursive
procedure to compute Ht ∇ft given the set of vectors.
Algorithm 9.1 (L-BFGS two-loop recursion). Subroutine of L-BFGS for computing $H_t \nabla f_t$.
We're given the current value of $\nabla f_t$, and we initialize q to this value. Below, $\rho_i \triangleq 1/(y_i^T s_i)$.
1. For i in the range [t − 1, t − m], compute
$$\alpha_i \leftarrow \rho_i s_i^T q \tag{871}$$
$$q \leftarrow q - \alpha_i y_i \tag{872}$$
2. Set $r \leftarrow H_t^0 q$.
3. For i in the range [t − m, t − 1], compute
$$\beta \leftarrow \rho_i y_i^T r \tag{873}$$
$$r \leftarrow r + s_i(\alpha_i - \beta) \tag{874}$$
The result r equals $H_t \nabla f_t$. A common choice for the initial matrix is $H_t^0 = \gamma_t I$ with
$$\gamma_t \triangleq \frac{s_{t-1}^T y_{t-1}}{y_{t-1}^T y_{t-1}}$$
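A direct transcription of the two-loop recursion into numpy (here `s_list`/`y_list` hold the m most recent $(s_i, y_i)$ pairs, oldest first):

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """Returns H_t @ grad via the two-loop recursion, using H_t^0 = gamma_t * I."""
    q = grad.copy()
    rhos = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    alphas = []
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):  # i = t-1 .. t-m
        a = rho * (s @ q)
        alphas.append(a)
        q -= a * y
    gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1])
    r = gamma * q                                                              # r = H_t^0 q
    for (s, y, rho), a in zip(zip(s_list, y_list, rhos), reversed(alphas)):    # i = t-m .. t-1
        beta = rho * (y @ r)
        r += s * (a - beta)
    return r   # equals H_t * grad; the search direction is -r
```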
289
7.1.3 Exercises
Exercise 2.4
Let α ∈ S be an event s.t. P (α) > 0. Show that P (· | α) satisfies the properties of a valid probability distribution.
• Show $P(\beta \mid \alpha) \ge 0$. By definition,
$$P(\beta \mid \alpha) = \frac{P(\alpha \cap \beta)}{P(\alpha)} \tag{875}$$
and since the full joint $P \ge 0$ and since $P(\alpha) > 0$, we have the desired result.
• Show $P(\Omega_\alpha) = 1$. Again, using just the definitions,
$$P(\Omega_\alpha) = \sum_{\beta \in S} P(\beta \mid \alpha) \tag{876}$$
$$= \frac{1}{P(\alpha)} \sum_{\beta \in S} P(\alpha \cap \beta) \tag{877}$$
$$= \frac{1}{P(\alpha)} \left(\sum_{\beta \in \alpha} P(\beta) + \sum_{\gamma \notin \alpha} P(\emptyset)\right) \tag{878}$$
$$= \frac{1}{P(\alpha)} \big(P(\alpha) + 0\big) \tag{879}$$
$$= 1 \tag{880}$$
• Show additivity for disjoint $\beta, \gamma \in S$:
$$P(\beta \cup \gamma \mid \alpha) = \frac{1}{P(\alpha)} P\big((\beta \cup \gamma) \cap \alpha\big) \tag{881}$$
$$= \frac{1}{P(\alpha)} P\big((\beta \cap \alpha) \cup (\gamma \cap \alpha)\big) \tag{882}$$
$$= \frac{1}{P(\alpha)} \big(P(\beta \cap \alpha) + P(\gamma \cap \alpha)\big) \tag{883}$$
$$= P(\beta \mid \alpha) + P(\gamma \mid \alpha) \tag{884}$$
290
Exercise 2.16: Jensen’s Inequality
• $H_P(X) \le \lg |Val(X)|$. With $f = \lg$ (concave) and $u = 1/P(X)$, Jensen's inequality gives
$$H_P(X) \triangleq \mathbb{E}_P\left[\lg \frac{1}{P(X)}\right] = \mathbb{E}_P[f(u)] \tag{886}$$
$$\le f(\mathbb{E}_P[u]) = f\left(\sum_x u(x)\, P(x)\right) = f(|Val(X)|) = \lg |Val(X)| \tag{887}$$
• $H_P(X) \ge 0$.
$$-H_P(X) = -\mathbb{E}_P\left[\lg \frac{1}{P(X)}\right] \tag{889}$$
$$= \mathbb{E}_P[\lg P(X)] \le 0 \tag{890}$$
291
Probabilistic Graphical Models May 06, 2018
Goal: represent a joint distribution P over some set of variables $\mathcal{X} = \{X_1, \ldots, X_n\}$. Consider
the case where each $X_i$ is binary-valued. A single joint distribution requires access to the
probability for each of the $2^n$ possible assignments for $\mathcal{X}$. The set of all such possible joint
distributions,
$$\left\{(p_1, \ldots, p_{2^n}) \in \mathbb{R}^{2^n} : \sum_{i=1}^{2^n} p_i = 1\right\} \tag{893}$$
is a $2^n - 1$ dimensional subspace of $\mathbb{R}^{2^n}$ (understanding the exponential blowup). Note that each $p_i$ represents the probability for
a unique instantiation of $\mathcal{X}$. Furthermore, in the general case, knowing $p_i$ tells you nearly
nothing about $p_{j \ne i}$ – i.e. you require an instantiation of $2^n - 1$ independent parameters to
specify a given joint distribution.
But it would be foolish to parameterize any joint distribution in this way, since we can often
take advantage of independencies. Consider the case where each $X_i$ gives the outcome (H or T)
of coin i being tossed. Then our distribution satisfies $(\mathbf{X} \perp \mathbf{Y})$ for any disjoint subsets of the
variables $\mathbf{X}$ and $\mathbf{Y}$. Let $\theta_i$ denote the probability that coin i lands heads. The key observation
is that you only need each of the n $\theta_i$ to specify a unique joint distribution over $\mathcal{X}$, reducing
the $2^n - 1$ dimensional subspace to an n dimensional manifold in $\mathbb{R}^{2^n}$.^192
The Naive Bayes Model. Say we want to determine the intelligence of an applicant based
on their grade G in some course, and their score S on the SAT. A naive bayes model can be
illustrated as below.
G S
192
TODO: Formally define this manifold using the same notation as in 893. Edit: Not sure how to actually
write it out, but intuitively it’s because each of the 2n pi values, when going from the general case to the case
of i.i.d, become functions of the n θi values. Whereas before they were independent free parameters.
193
We say that an event α is independent of event β in P with the notation P |= (α ⊥ β). (Def 2.2, pg 23 of
book)
292
In general, a naive bayes model assumes that instances fall into one of a number of mutually
exclusive and exhaustive classes, defined as the set of values that the top variable in the graph
can take on194 . The model also includes some number of features X1 , . . . Xk , whose values
are typically observed. The naive Bayes assumption is that the features are conditionally
independent given the instance’s class.
D I
G S
A Bayesian network structure G is a DAG whose nodes represent RVs $X_1, \ldots, X_n$. Let $Pa^G_{X_i}$
denote the parents of $X_i$ in G, and $NonDesc_{X_i}$ denote the variables that are NOT descendants
of $X_i$. Then G encodes the following set of local independencies:
$$\big(X_i \perp NonDesc_{X_i} \mid Pa^G_{X_i}\big) \quad \text{for all } i.$$
194
For the intelligence example, the classes are high intelligence and low intelligence.
293
Graphs and Distributions. Here we see that a distribution P satisfies the local indepen-
dencies associated with a graph G iff P is representable as a set of CPDs associated with
G.
• Local independencies. Let P be a distribution over $\mathcal{X}$. We define I(P) to be the set
of independence assertions of the form $(X \perp Y \mid Z)$ that hold in P. The statement
"P satisfies the local independencies associated with G" can thus be succinctly written: $\mathcal{I}_\ell(G) \subseteq \mathcal{I}(P)$.
[Figure: the four possible two-edge trails between X and Y through Z.]
From left-to-right: indirect causal effect, indirect evidential effect, common cause, common
effect. The first 3 satisfy (X ⊥ Y | Z), but the 4th does not. Another way of saying this is
that the first 3 trails are active195 IFF Z is not observed, while the 4th trail is active IFF Z
(or a descendent of Z) is observed.
195
When influence can flow from X to Y via Z, we say that the trail X Y Z is active.
294
General case:
To see the detailed algorithm for finding nodes reachable from X given Z via active trails, see
Algorithm 3.1 on pg. 75 of the book.
196
P is faithful to G if, whenever X ⊥ Y | Z ∈ I(P ), then d-sepG (X; Y | Z).
295
Probabilistic Graphical Models November 03, 2017
The Misconception Example. Consider a scenario where we have four students who get to-
gether in pairs to work on homework for a class. Only the following pairs meet: (Alice, Bob),
(Bob, Charles), (Charles, Debbie), (Debbie, Alice). The professor misspoke in class, giving
rise to a possible misconception among the students. We have four binary random variables,
{A, B, C, D}, representing whether the student has the misconception (1) or not (0) 197 . Intu-
itively, we want to model a distribution that satisfies (A ⊥ C | {B, D}), and (B ⊥ D | {A, C}),
but no other independencies198 . Note that the interactions between variables seem symmetric
here – students influence each other (out of the ones they have a pair with).
[Figure: the Markov network over A, B, C, D with edges A–B, B–C, C–D, D–A.]
The nodes in the graph of a Markov network represent the variables, and the edges corre-
spond to a notion of direct probabilistic interaction between the neighboring variables – an
interaction that is not mediated by any other variable in the network. So, how should we
parameterize our network? We want to capture the affinities between the related variables
(e.g. Alice and Bob are more likely to agree than disagree).
Let D be a set of random variables. We define a factor $\phi$ to be a function from
Val(D) to $\mathbb{R}$. A factor is nonnegative if all its entries are nonnegative. D is called
the scope of the factor, denoted Scope[$\phi$]. (We restrict our attention to nonnegative factors.)
The factors need not be normalized. Therefore, to interpret probabilities over factors, we must
197
A student might not have the misconception if e.g. they went home and figured out the problem via reading
the textbook instead.
198
These independences cannot be naturally captured in a Bayesian (i.e. directed) network.
296
normalize it with what we’ll call the partition function, Z:
$$\Pr[a, b, c, d] = \frac{1}{Z}\, \phi_1(a, b) \cdot \phi_2(b, c) \cdot \phi_3(c, d) \cdot \phi_4(d, a) \tag{899}$$
$$Z = \sum_{a, b, c, d} \phi_1(a, b) \cdot \phi_2(b, c) \cdot \phi_3(c, d) \cdot \phi_4(d, a) \tag{900}$$
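A tiny sketch of these two equations: enumerate all assignments, multiply the pairwise factor values, and normalize by Z. The factor tables below are made-up illustrative numbers, not the book's:

```python
import itertools

# phi[(x_i, x_j)] tables for the four pairwise factors (illustrative values only).
phi1 = {(0, 0): 30.0, (0, 1): 5.0, (1, 0): 1.0, (1, 1): 10.0}   # phi_1(a, b)
phi2 = phi3 = phi4 = {(0, 0): 10.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 10.0}

def unnormalized(a, b, c, d):
    return phi1[(a, b)] * phi2[(b, c)] * phi3[(c, d)] * phi4[(d, a)]

assignments = list(itertools.product([0, 1], repeat=4))
Z = sum(unnormalized(*x) for x in assignments)                   # eq. 900
joint = {x: unnormalized(*x) / Z for x in assignments}           # eq. 899
print(Z, joint[(1, 1, 0, 0)])
```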
Let X, Y, and Z be three disjoint sets of variables, and let $\phi_1(\mathbf{X}, \mathbf{Y})$ and $\phi_2(\mathbf{Y}, \mathbf{Z})$ be
two factors. We define the factor product $\phi_1 \times \phi_2$ to be a factor $\psi : Val(\mathbf{X}, \mathbf{Y}, \mathbf{Z}) \mapsto \mathbb{R}$ as follows: $\psi(\mathbf{X}, \mathbf{Y}, \mathbf{Z}) = \phi_1(\mathbf{X}, \mathbf{Y}) \cdot \phi_2(\mathbf{Y}, \mathbf{Z})$.
where the authors (Koller and Friedman) have made a point to emphasize: A factor is only
one contribution to the overall joint distribution. The distribution as a whole has
to take into consideration the contributions from all of the factors involved. Now,
we relate the parameterization of a Gibbs distribution to a graph structure.
The factors that parameterize a Markov network are often called clique potentials. Although
it can be used without loss of generality, the parameterization using maximal clique potentials
generally obscures structure that is present in the original set of factors. Below are some useful
definitions that we will use often.
199
A subgraph is complete if every two nodes in the subgraph are connected by some edge. The set of nodes
in such a subgraph is often called a clique. A clique X is maximal if for any superset of nodes Y ⊃ X, Y is
not a clique.
297
Factor reduction:
For $\mathbf{U} \not\subseteq \mathbf{Y}$, define $\phi[\mathbf{u}]$ only for the assignments in $\mathbf{u}$ to the variables in $\mathbf{U}' = \mathbf{U} \cap \mathbf{Y}$.
Note that if a Gibbs distribution $P_\Phi(\mathcal{X})$ factorizes over H, then $P_\Phi[\mathbf{u}]$ factorizes over $H[\mathbf{u}]$.
This is the separation criterion. Note that the definition of separation is monotonic in
Z: if it holds for Z, then it holds for any Z 0 ⊃ Z as well200 .
200
Which means that Markov networks are fundamentally incapable of representing nonmonotonic indepen-
dence relations!
298
• Soundness of the separation criterion for detecting independence properties in distri-
butions over H. In other words, we want to prove that
(P factorizes over H) =⇒ sepH (X; Y | Z) =⇒ P |= (X ⊥ Y | Z)
Proof
– Consider the case where X ∪ Y ∪ Z = X .
– Then, any clique in $\mathcal{X}$ is fully contained in either $\mathbf{X} \cup \mathbf{Z}$ or $\mathbf{Y} \cup \mathbf{Z}$. In other words,
$$P(\mathbf{X}, \mathbf{Y}, \mathbf{Z}) = \frac{1}{Z}\, f(\mathbf{X}, \mathbf{Z})\, g(\mathbf{Y}, \mathbf{Z}) \tag{907}$$
$$\text{which implies}\quad P(\mathbf{X}, \mathbf{Y} \mid \mathbf{Z}) = \frac{f(\mathbf{X}, \mathbf{Z})\, g(\mathbf{Y}, \mathbf{Z})}{\sum_{x, y} f(x, \mathbf{Z})\, g(y, \mathbf{Z})} \tag{908}$$
$$P(\mathbf{X}, \mathbf{Y} \mid \mathbf{Z}) = \frac{f(\mathbf{X}, \mathbf{Z})\, g(\mathbf{Y}, \mathbf{Z})}{P(\mathbf{Z})} \tag{909}$$
$$= \frac{f(\mathbf{X}, \mathbf{Z})\, g(\mathbf{Y}, \mathbf{Z})}{P(\mathbf{Z})} \cdot \frac{P(\mathbf{Z})}{P(\mathbf{Z})} \tag{910}$$
$$= \frac{f(\mathbf{X}, \mathbf{Z})\, g(\mathbf{Y}, \mathbf{Z})}{P(\mathbf{Z})} \cdot \frac{\sum_x f(x, \mathbf{Z}) \sum_y g(y, \mathbf{Z})}{P(\mathbf{Z})} \tag{911}$$
$$= \frac{f(\mathbf{X}, \mathbf{Z}) \sum_y g(y, \mathbf{Z})}{P(\mathbf{Z})} \cdot \frac{g(\mathbf{Y}, \mathbf{Z}) \sum_x f(x, \mathbf{Z})}{P(\mathbf{Z})} \tag{912}$$
$$= P(\mathbf{X} \mid \mathbf{Z})\, P(\mathbf{Y} \mid \mathbf{Z}) \tag{913}$$
– For the general case where X∪Y∪Z = X , let U = X −(X∪Y∪Z). Since we know that sepH (X; Y |
Z), we can partition U into two disjoint sets U1 and U2 such that sepH (X ∪ U1 ; Y ∪ U2 | Z).
Combining the previous result with the decomposition propertya give us the desired result that
P |= (X ⊥ Y | Z).
a
The decomposition property:
(X ⊥ (Y, W) | Z) =⇒ (X ⊥ Y | Z)
Pairwise Independencies:
$$\mathcal{I}_p(H) = \{(X \perp Y \mid \mathcal{X} - \{X, Y\}) : X\text{—}Y \notin H\} \tag{914}$$
which just says "X is indep. of Y given everything else if there's no edge between
X and Y."
299
Local Independencies:
For a given graph H, define the Markov blanket of X in H, denoted MBH (X),
to be the neighbors of X in H. We define the local independencies associated
with H:
I` (H) = { X ⊥ X − {X} − MBH (X) | MBH (X) : X ∈ X} (915)
which just says “X is indep. of the rest of the nodes in the graph given its immediate
neighbors.”
Log-Linear Models. Certain patterns involving particular values of variables for a given
factor can often be more easily seen by converting factors into log-space. More precisely, we
can rewrite a factor $\phi(\mathbf{D})$ as $\phi(\mathbf{D}) = \exp(-\epsilon(\mathbf{D}))$,
where $\epsilon(\mathbf{D})$ is often called an energy function. Note how $\epsilon$ can take on any value along the
real line (i.e. removes our nonnegativity constraint)^201. Also note that as the summation of energies
approaches 0, the probability approaches one.
This motivates introducing the notion of a feature, which is just a factor without the nonneg-
ativity requirement. A popular type of feature is the indicator feature that takes on value
1 for some values y ∈ V al(D) and 0 otherwise. We can now provide a more general definition
for our notion of log-linear models:
The log-linear model provides a much more compact representation for many distributions,
especially in situations where variables have large domains (such as text).
201
We seem to be implicitly assuming that the original factors are all positive (not just non-negative).
300
Box 4.C – Concept: Ising Models and Boltzmann Machines
The Ising model: Each atom is modeled as a binary RV $X_i \in \{+1, -1\}$ denoting its spin. Each pair of
neighboring atoms is associated with energy function $\epsilon_{i,j}(x_i, x_j) = w_{i,j} x_i x_j$. We also have individual energy
functions $u_i x_i$ for each atom. This defines our distribution:
$$P(\xi) = \frac{1}{Z} \exp\left(-\sum_{i < j} w_{i,j}\, x_i x_j - \sum_i u_i x_i\right) \tag{920}$$
The Boltzmann distribution: Now the variables are Xi ∈ {0, 1}. The distribution of each Xi given its
neighbors is
Consider the pairwise graph $X_1, \ldots, X_n$ in the context of sequence labeling. We want to assign each $X_i$ a label.
We also want adjacent nodes to prefer being similar to each other. We usually use the MAP objective, so our
goal will be to minimize the total energy over the parameters (which are given by the individual energy functions
$\epsilon_i$):
$$E(x_1, \ldots, x_n) = \sum_i \epsilon_i(x_i) + \sum_{(i,j) \in E} \epsilon_{i,j}(x_i, x_j) \tag{923}$$
The simplest place to start for preferring neighboring labels to take on similar values is to define $\epsilon_{i,j}$ to have low
energy when $x_i = x_j$ and some positive $\lambda_{i,j}$ otherwise. We want to have finer granularity for our similarities
between labels. To do this, we introduce the definition of a metric: a function $\mu : V \times V \mapsto [0, \infty)$ that satisfies
202
Meaning: for any PΦ , there are often infinitely many ways to choose its set of parameter values for a given
H.
203
And x is some assignment to some subset of X that also contains Z.
301
The canonical energy function for a clique D is defined below, as well as the associated
total $P(\xi)$ over a full network assignment (the sum is over all subsets of D, including D itself and ∅):
$$\epsilon^*_{\mathbf{D}}(\mathbf{d}) = \sum_{\mathbf{Z} \subseteq \mathbf{D}} \ell\big(\mathbf{d}_Z, \xi^*_{-Z}\big) \cdot (-1)^{|\mathbf{D} - \mathbf{Z}|} \tag{927}$$
$$P(\xi) = \exp\left(\sum_i \epsilon^*_{D_i}(\xi\langle D_i \rangle)\right) \tag{928}$$
Conditional Random Fields. So far, we’ve only described Markov network representation
as encoding a joint distribution over X . The same undirected graph representation and pa-
rameterization can also be used to encode a conditional distribution Pr [Y | X], where Y is a
set of target variables and X is a (disjoint) set of observed variables.
[Figure: a chain of target variables Y1, Y2, Y3, Y4, conditioned on the observed variables X.]
204
Also note how we never have to deal with a summation over all possible X, due to restricting ourselves to
Z(X).
302
Rapid Summary.
• Gibbs distribution: any probability distribution that can be written as a product of
factors divided by some partition function Z.
• Factorizes: A Gibbs distribution factorizes over H if the scope of each factor [in its product of
factors] is a clique of H.
7.3.1 Exercises
Exercise 4.1
Let H be the graph of binary variables below. Show that P does not factorize over H (Hint: proof by contradic-
tion).
[Figure: the loop X1—X2—X3—X4—X1 over the four binary variables.]
• Example 4.4 recap: P satisfies the global independencies w.r.t H. They showed this by manually checking
the two global indeps of H, (X1 ⊥ X3 | X2 , X4 ) and (X2 ⊥ X4 | X1 , X3 ), against the tabulated list of
possible assignments for P (given in example). Nothing fancy.
• P factorizes over H if it can be written as a product of clique potentials.
• My proof:
1. Assume that P does factorize over H.
2. Then P can be written as
$$P(X_1, X_2, X_3, X_4) = \frac{1}{Z}\, \phi_1(X_1, X_2)\, \phi_2(X_2, X_3)\, \phi_3(X_3, X_4)\, \phi_4(X_4, X_1) \tag{932}$$
Let the above assertion be denoted as C, and the statement that P factorizes according to H be
denoted simply as PH . Since PH ⇐⇒ C, if we can prove that C does not hold, then we’ve found
our contradiction, meaning PH also must not hold.
3. I know that the proof must take advantage of the fact that we know P is zero for certain assignments
to X . For example P (0100) = 0. Furthermore, by looking at the assignments where P is not zero,
I can see that all possible combinations of (X1 , X2 ) are present, which means φ1 never evaluates
to zero.
4. From the example, we know that P (1100) = 1/8 6= 0. However, since
and we also know that both the numerator and denominator of eq. 934 are positive, and thus we
have a contradiction.
303
Exercise 4.4
Prove theorem 4.7 for the case where H consists of a single clique. Theorem 4.7 is equation 928 in my notes.
For a single clique D, the question reduces to: Show that, for any assignment d to D:
P(d) = exp(−ε(d)) / Σ_{d′} exp(−ε(d′))    (935)
     = exp( ε*_D(d) )    (936)
and therefore
exp( ε*_D(d) ) = P(d) / P(d*)    (940)
which is clearly incorrect (???). TODO: figure out what's going on here. Either the book has a typo in its statement of theorem 4.7, or I'm absolutely insane.
Probabilistic Graphical Models May 27, 2018
Consider the example below, where the double-line notation on C means that C is a determin-
istic function of A and B. What new conditional dependencies do we have?
[Figure: A and B are parents of C (drawn with a double line, i.e. deterministic); D and E are children of C.]
Answer: (D ⊥ E | A, B), which would not be true by d-separation alone. It only holds because C is a deterministic function of A and B.
Probabilistic Graphical Models May 27, 2018
In what follows, we will build on an example where a vehicle tries to track its true location
(L) using various sensor readings: velocity (V), weather (W), failure of sensor (F), observed
location (O) of the noisy sensor.
[Figure: the 2-TBN for this example, with variables W, V, L, F, O at time t and W′, V′, L′, F′, O′ at time t + 1.]
Temporal Models. We discretize time into slices of interval ∆, and denote the ground
random variables at time t · ∆ by X (t) . We can simplify our formulation considerably by
assuming a Markovian system: a dynamic system over template variables X that satisfies
the Markov assumption:
(X (t+1) ⊥ X (0:(t−1)) | X (t) ) (942)
which allows us to define a more compact representation of the joint distribution from time 0
to T:
P(X^{(0:T)}) = P(X^{(0)}) ∏_{t=0}^{T−1} P(X^{(t+1)} | X^{(t)})    (943)
One last simplifying assumption, to avoid having unique transition probabilities for each time
t, is to assume a stationary205 Markovian dynamic system, defined s.t. P (X (t+1) | X (t) ) is
the same for all t.
205
Also called time invariant or homogeneous.
Dynamic Bayesian Networks (6.2.2). Above, I’ve drawn the 2-time-slice Bayesian net-
work (2-TBN) for our location example. A 2-TBN is a conditional BN over X 0 given XI ,
where XI ⊆ X is a set of interface variables206 . For each template variable Xi , the CPD
P(X_i′ | Pa_{X_i′}) is a template factor. We can use the notion of the 2-TBN to define the more general dynamic Bayesian network: a pair ⟨B_0, B_→⟩, where B_0 is a BN over the initial variables X^{(0)} and B_→ is a 2-TBN that gets unrolled over subsequent time slices.
State-Observation Models (6.2.3). Temporal models that, in addition to the Markov as-
sumption (eq. 942), model the observation variables at time t as conditionally independent of
the entire state sequence given the variables at time t:
O (t) ⊥ X (0:(t−1)) , X (t+1:∞) | X (t) (944)
So basically a 2-TBN with the constraint that observation variables are leaves and only have
parents in X 0 . We now view our probabilistic model as consisting of 2 components: the
transition model P (X 0 | X), and the observation model P (O | X). The two main architectures
for such models are as follows:
• Hidden Markov Models. Defined as having a single state variable S and a single
observation variable O. In practice, the transition model P (S 0 | S) is often assumed to
be sparse (many possible transitions having zero probability). In such cases, one usually
represents them visually as probabilistic finite-state automaton207 .
• Linear Dynamical Systems (LDS) represent a system of one or more real-valued
variables that evolve linearly over time, with some Gaussian noise. Such systems are
often called Kalman filters, after the algorithm used to perform tracking. They can be
viewed as a DBN with continuous variables and all dependencies are linear Gaussian208 .
A LDS is traditionally represented as a state-observation model, where both state and observation are vector-valued RVs, and the transition/observation models are encoded using matrices. More formally, for X^{(t)} ∈ R^n, O^{(t)} ∈ R^m, and some matrices A ∈ R^{n×n}, H ∈ R^{m×n}:
X^{(t+1)} = A X^{(t)} + w^{(t)},  w^{(t)} ∼ N(0, Q);      O^{(t)} = H X^{(t)} + v^{(t)},  v^{(t)} ∼ N(0, R)
206
Interface variables are those variables whose values at time t can have a direct effect on the variables at
time t + 1.
207
FSA use the graphical notation where the nodes are the individual possible values in V al(S), and the
directed edges from some a to b have weight equal to P (S 0 = b | S = a).
208
Linear Gaussian: some var Z pointing to X denotes that X = ΛZ + noise, where noise ∼ N (µx , Σx ).
Template Variables and Template Factors (6.3). It’s convenient to view the world as
being composed of a set of objects, which can be divided into a set of mutually exclusive and
exhaustive classes Q = Q1 , . . . , Qk . Template attributes have a tuple of arguments, each of
which is associated with a particular class of objects, which defines the set of objects that can
be used to instantiate the argument in a given domain. Template attributes thus provide us
with a “generator” for RVs in a given probability space. Formally,
An attribute A is a function A(U1 , . . . , Uk ), whose range is some set V al(A), and where
each argument Ui is a typed logical variable associated with a particular class Q[Ui ]. The
tuple U_1, …, U_k is called the argument signature of the attribute A, and denoted α(A).
Plate Models (6.4.1). The simplest example of a plate model is shown below. It describes
multiple RVs generated from the same distribution D.
[Plate model figure: a node θ_X outside the plate, pointing to X inside a plate labeled "Data" of size m.]
This could be a plate model for a set of coin tosses sampled from a single coin. We have a set
of m random variables X(d), where d ∈ D. Each X(d) is the random variable for the dth coin
toss. We also explicitly model the bias of the single coin used for all of the tosses as a random variable θ_X, which takes on values in [0, 1].
Probabilistic Graphical Models July 26, 2018
Multivariate Gaussians. Here I’ll give two forms of the familiar density function, followed
by some comments and terminology.
p(x) = ( 1 / ((2π)^{n/2} √(det Σ)) ) exp( −½ (x − µ)^T Σ^{−1} (x − µ) )    (947)
p(x) ∝ exp( −½ x^T J x + (Jµ)^T x ),   where J ≜ Σ^{−1}    (948)
Multivariate Gaussians are also special because we can easily determine whether two x_i and x_j are independent: x_i ⊥ x_j iff Σ_{i,j} = 0²¹⁰. For conditional independencies, the information matrix J is easier to work with: (x_i ⊥ x_j | {x_k}_{k ∉ {i,j}}) iff J_{i,j} = 0. This condition is also
how we defined pairwise independencies in a Markov network, which leads to the awesome
realization:
We can view the information matrix J as directly defining a minimal I-map Markov network
for [multivariate Gaussian] p, whereby any entry J_{i,j} ≠ 0 corresponds to an edge x_i–x_j in
the network.
209
Most definitions of p.d. also require that the matrix be symmetric. I also think it helps to view positive
definite from the operator perspective: a linear operator T is positive definite if T is self-adjoint and ⟨T(x), x⟩ > 0. In other words, “positive definite” means that the result of applying the matrix/operator to any nonzero x
will always have a positive component along the original direction x̂.
210
This is not true in general – just for multivariate Gaussians!
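As a quick, self-contained illustration of that realization (my own sketch, not from the book; the 4-variable precision matrix below is made up), we can build a sparse J, read the Markov-network edges off its nonzero off-diagonal entries, and note that Σ itself is generally dense:

import numpy as np

# A hypothetical 4-variable Gaussian whose precision (information) matrix J is
# tridiagonal, i.e. the minimal I-map Markov network is the chain x1 - x2 - x3 - x4.
J = np.array([[ 2.0, -0.8,  0.0,  0.0],
              [-0.8,  2.0, -0.8,  0.0],
              [ 0.0, -0.8,  2.0, -0.8],
              [ 0.0,  0.0, -0.8,  2.0]])
Sigma = np.linalg.inv(J)

# Edges of the I-map: nonzero off-diagonal entries of J.
edges = [(i, j) for i in range(4) for j in range(i + 1, 4) if abs(J[i, j]) > 1e-12]
print("edges:", edges)                      # [(0, 1), (1, 2), (2, 3)]
# Sigma is dense even though J is sparse: non-adjacent variables are still
# marginally correlated, just conditionally independent given the rest.
print("Sigma[0,3] =", round(Sigma[0, 3], 4))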
Probabilistic Graphical Models May 06, 2018
Analysis of Exact Inference. The focus of this chapter is the conditional probability query,
Pr[Y | E = e] = Pr[Y, e] / Pr[e]    (9.1)
and note that, by computing all instantiations for the numerator first, we can reuse them to
obtain the denominator. We can formulate the inference problem as a decision problem, which
we will call BN P rDP , defined as follows:
211
Apparently the time it takes to generate a guess is irrelevant.
212
This is true because P(x) can be decomposed as P(ξ) + Σ P(…) ≥ P(ξ).
Analysis of Approximate Inference. Consider a specific query P (y | e), where we focus
on a particular assignment y. Let ρ denote some approximate answer, whose accuracy we wish
to evaluate relative to the correct probability. We can use the relative error to estimate the
quality of the approximation: An estimate ρ has relative error ε if:
ρ / (1 + ε) ≤ P(y | e) ≤ ρ (1 + ε)    (949)
Unfortunately, the task of finding some approximation ρ with relative error ε is also N P-hard.
Furthermore, even if we relax this metric by using absolute error instead, we end up finding
that in the case where we have evidence, approximate inference is no easier than
exact inference, in the worst case.
For a simple chain network X_1 → X_2 → ⋯ → X_n (with k values per variable), we can push each summation inward and perform it n − 1 times, starting with i = 1, all the way up to i = n − 1, reusing the previous computation at each step, with total cost O(nk²). So for this simple network, even though the size of the joint is k^n (exponential in n), we can do inference in linear time.
First, we formalize some basic concepts before defining the algorithm.
Factor marginalization:
Let X be a set of variables, and Y ∉ X a variable. Let φ(X, Y) be a factor. Define the factor marginalization of Y in φ, denoted Σ_Y φ, to be a factor ψ over X such that:
ψ(X) = Σ_Y φ(X, Y)
The key observation that’s easy to miss is that we’re only summing entries in the table where
the values of X match up. One useful rule for exchanging factor product and summation: if X ∉ Scope[φ_1], then
Σ_X (φ_1 · φ_2) = φ_1 · Σ_X φ_2    (9.6)
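To make these two operations concrete, here is a minimal Python sketch (my own, not from the book); the Factor class, the variable names A/B/C, and the toy tables/cardinalities are invented for illustration:

import itertools

class Factor:
    """A factor: a table mapping assignments of its scope to nonnegative reals."""
    def __init__(self, scope, card, table):
        self.scope = list(scope)     # ordered list of variable names
        self.card = card             # dict: variable -> number of values
        self.table = table           # dict: assignment tuple -> value

    def marginalize(self, var):
        """Sum out `var`, returning a factor over the remaining scope."""
        idx = self.scope.index(var)
        new_scope = [v for v in self.scope if v != var]
        new_table = {}
        for assignment, val in self.table.items():
            reduced = assignment[:idx] + assignment[idx + 1:]
            new_table[reduced] = new_table.get(reduced, 0.0) + val
        return Factor(new_scope, self.card, new_table)

    def product(self, other):
        """Multiply two factors; only entries agreeing on shared variables combine."""
        new_scope = self.scope + [v for v in other.scope if v not in self.scope]
        card = {**other.card, **self.card}
        new_table = {}
        for assignment in itertools.product(*[range(card[v]) for v in new_scope]):
            a = dict(zip(new_scope, assignment))
            v1 = self.table[tuple(a[v] for v in self.scope)]
            v2 = other.table[tuple(a[v] for v in other.scope)]
            new_table[assignment] = v1 * v2
        return Factor(new_scope, card, new_table)

# Example: phi1(A,B) * phi2(B,C), then sum out B -- exactly the "entries whose
# B values match up" observation above.
card = {"A": 2, "B": 2, "C": 2}
phi1 = Factor(["A", "B"], card, {(a, b): 1.0 + a + 2 * b for a in range(2) for b in range(2)})
phi2 = Factor(["B", "C"], card, {(b, c): 0.5 + b * c for b in range(2) for c in range(2)})
tau = phi1.product(phi2).marginalize("B")
print(tau.scope, tau.table)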
So, when computing some marginal probability, the main idea is to group factors together and
compute expressions of the form
Σ_Z ∏_{φ ∈ Φ} φ    (951)
where Φ is the set of all factors φ for which Z ∈ Scope[φ]. This is commonly called the sum-
product inference task. The full algorithm for sum-product variable elimination, which is an
instantiation of the sum-product inference task, is illustrated below.
This is what we use to compute the marginal probability P (X) where X = X −Z. To compute
conditional queries of the form P (Y | E = e), simply replace all factors whose scope overlaps
with E with their reduced factor (see chapter 4 notes for definition) to get the unnormalized
φ*(Y) (the numerator of P(Y | e)). Then divide by Σ_y φ*(y) to obtain the final result.
[Figure: the extended student network with nodes Coherence, Difficulty, Intelligence, Grade, SAT, Letter, Job, Happy and edges C → D; D, I → G; I → S; G → L; L, S → J; G, J → H.]
Due to Happy being a child of Job, P (J) actually requires using all factors in the graph.
Below shows how VE with elimination ordering C, D, I, H, G, S, L progressively simplifies the
equation for computing P (J).
P(J) = Σ_{c,d,i,h,g,s,ℓ} P(J | s, ℓ) P(s | i) P(ℓ | g) P(g | d, i) P(d | c) P(h | J, g) P(c) P(i)    (952)
     = Σ_{c,d,i,h,g,s,ℓ} ( P(c) P(d | c) ) · P(g | d, i) P(i) P(s | i) P(h | J, g) P(ℓ | g) P(J | s, ℓ)    (953)
     = Σ_{d,i,h,g,s,ℓ} ( τ_1(d) P(g | d, i) ) · P(i) P(s | i) P(h | J, g) P(ℓ | g) P(J | s, ℓ)    (954)
     = Σ_{i,h,g,s,ℓ} ( τ_2(g, i) P(i) P(s | i) ) · P(h | J, g) P(ℓ | g) P(J | s, ℓ)    (955)
     = Σ_{h,g,s,ℓ} ( P(h | J, g) ) · τ_3(g, s) P(ℓ | g) P(J | s, ℓ)    (956)
     = Σ_{g,s,ℓ} ( τ_4(g, J) τ_3(g, s) P(ℓ | g) ) · P(J | s, ℓ)    (957)
     = Σ_{s,ℓ} τ_5(J, ℓ, s) · P(J | s, ℓ)    (958)
     = Σ_ℓ τ_6(J, ℓ)    (959)
where the parenthesized group of factors indicates the focus of the given step in the VE algorithm.
Probabilistic Graphical Models May 06, 2018
Cluster Graphs. A graphical flowchart of the factor-manipulation process that will be rel-
evant when we discuss message passing. Each node is a cluster, which is associated with a
subset of variables. Formally,
A cluster graph U for a set of factors Φ over X is an undirected graph. Each node
i is associated with a subset Ci ⊆ X . Each factor φ ∈ Φ must be associated with
a cluster Ci , denoted α(φ), such that Scope[φ] ⊆ Ci (family-preserving). Each
edge between a pair of clusters Ci and Cj is associated with a sepset Si,j ⊆ Ci ∩Cj .
Recall that each step of variable elimination involves creating a factor ψi by multiplying a
group of factors213. Then, denoting the variable we are eliminating at this step as Z, we obtain another factor τ_i that's the factor marginalization of Z in ψ_i (denoted Σ_Z ψ_i). An
execution of variable elimination defines a cluster graph: we have a cluster for each of the ψi ,
defined as Ci = Scope[ψi ]. We draw an edge between Ci and Cj if the message τi is used in
the computation of τj .
Consider when we applied variable elimination to the student graph network below, to compute
P (J). Elimination ordering C, D, I, H, G, S, L.
213
All whose scope contains the variable we are currently trying to eliminate.
[Figure: the student network alongside the cluster graph induced by this VE run, with clusters 1:{C,D}, 2:{D,I,G}, 3:{G,I,S}, 4:{G,H,J}, 5:{G,J,L,S}, 6:{J,L,S}, 7:{J,L}; the visible sepsets include D (between 1 and 2), {G,I} (between 2 and 3), {J,S,L} (between 5 and 6), and {J,L} (between 6 and 7).]
Clique Trees. Since VE uses each intermediate τi at most once, the cluster graph induced
by an execution of VE is necessarily a tree, and it also defines a directionality: the direction of
the message passing (left-to-right in the above illustration). All the messages flow toward a
single cluster where the final result is computed – the root of the tree; we say the messages
“flow up” to the root. Furthermore, for cluster trees induced by VE, the scope of each message
(edge) τi is exactly Ci ∩ Cj , not just a subset214 .
Let Φ be a set of factors over X . A cluster tree over Φ that satisfies the running
intersection property is called a clique tree. For clique trees, the clusters are also
called cliques.
214
This follows from the running intersection property, which is satisfied by any cluster tree that’s defined
by variable elimination. It’s defined as, if any variable X is in both cluster Ci and Cj , then X is also in every
cluster in the (unique) path in the tree between Ci and Cj .
δ_{i→j} = Σ_{C_i − S_{i,j}} ψ_i · ∏_{k ∈ (Nb_i − {j})} δ_{k→i}    (10.2)
where the summation is simply over the variables in Ci that aren’t passed along to Cj , and the
product is over all messages that Ci received. Stated even simpler, we multiply all the incoming
messages by our initial potential, then sum out all variables except those in Si,j . When the
root clique has received all messages, it multiplies them with its own initial potential, resulting
in a factor called the beliefs, βr (Cr ). It represents
P̃_Φ(C_r) = Σ_{X − C_r} ∏_φ φ    (960)
Below is a more compact summary of all of this, showing the procedure for computing all final
factors (belief) βi for some marginal probability query on the variables in Cr asynchronously.
Algorithm 10.2: Sum-Product Belief Propagation
1. For each clique C_i, compute its initial potential:
ψ_i(C_i) ← ∏_{φ_j : α(φ_j) = i} φ_j
2. While some clique C_i is ready to transmit to a neighbor C_j (i.e. it has received messages from all of its other neighbors), send the message δ_{i→j} of eq. 10.2.
3. Then, compute each belief factor β_i by multiplying the initial potential ψ_i by the incoming messages to C_i:
β_i ← ψ_i · ∏_{k ∈ Nb_{C_i}} δ_{k→i}
The SP Belief Propagation algorithm above is also called clique tree calibration. A clique
tree T is calibrated if all pairs of adjacent cliques are calibrated. A calibrated clique tree
satisfies the following property for what we’ll now call the clique beliefs, βi , and the sepset
beliefs, µi,j over Si,j :
µ_{i,j}(S_{i,j}) ≜ Σ_{C_i − S_{i,j}} β_i = Σ_{C_j − S_{i,j}} β_j    (962)        [so µ_{i,j} = P̃_Φ(S_{i,j})]
The main advantage of the clique tree algorithm is that it computes the posterior probability
of all variables in a graphical model using only twice the computation215 of the upward pass
in the same tree.
215
The algorithm is equivalent to doing one upward pass, one downward pass.
We can also show that µ_{i,j} = δ_{j→i} · δ_{i→j}, which then allows us to derive:
P̃_Φ(X) = ∏_{i ∈ V_T} β_i / ∏_{(i,j) ∈ E_T} µ_{i,j}    (10.10)
In other words, the clique and sepset beliefs provide a reparameterization of the unnormal-
ized measure, a property called the clique tree invariant.
Message Passing: Belief Update. We now discuss an alternative message passing approach that is mathematically equivalent to, but intuitively different from, the sum-product approach. First, we introduce some new definitions.
Factor Division:
Let X and Y be disjoint sets of variables, and let φ_1(X, Y) and φ_2(Y) be two factors. We define the division of φ_1 and φ_2 as the factor ψ with scope X, Y given by
ψ(X, Y) ≜ φ_1(X, Y) / φ_2(Y)        [where we define 0/0 = 0]
Looking back at equation 10.2, we can now see that another way to write δi→j is
δ_{i→j} = ( Σ_{C_i − S_{i,j}} β_i ) / δ_{j→i}    (10.13)
Now, consider the clique tree below for the simple Markov network A-B-C-D:
If we assigned C_2 as the root, then our previous approach would compute δ_{2→1} as Σ_C ψ_2 · δ_{3→2}.
Alternatively, we can use equation 10.13 to realize this is equivalent to dividing β2 by δ1→2
and marginalizing out C. This observation motivates the algorithm below, which allows us to
execute message passing in terms of the clique and sepset beliefs, without having to remember
the initial potentials ψi or explicitly compute the messages δi→j .
Algorithm 10.3: Belief-Update Message Passing
1. For each clique Ci , set its initial belief βi to its initial potential ψi . For each edge in ET ,
set µi,j = 1.
2. While there exists an uninformed216 clique in T, select any edge in E_T, and compute
216
A clique is informed once it has received informed messages from all of its neighbors. An informed message
is one that has been sent by taking into account information from all of the sending cliques’ neighbors (aside
from the receiving clique of that message, of course).
σ_{i→j} ← Σ_{C_i − S_{i,j}} β_i    (963)
β_j ← β_j · σ_{i→j} / µ_{i,j}    (964)
µ_{i,j} ← σ_{i→j}    (965)
3. Return the resulting set of informed beliefs {βi }.
At convergence, σi→j = µi,j = σj→i .
Probabilistic Graphical Models June 02, 2018
[Figure: the diamond Markov network A-B-C-D-A (left) and a cluster graph with clusters 1:{A,B}, 2:{B,C}, 3:{C,D}, 4:{A,D} (right).]
The clique tree for this network, which can be used for exact inference, has two cliques ABD
and BCD and messages are passed between them consisting of τ (B, D). Suppose that, instead,
we set up 4 clusters corresponding to each of the initial potentials, shown as the cluster graph
above on the right. We can still apply belief propagation here, but due to it now having loops
(as opposed to before when we only had trees), the process may not converge.
Probabilistic Graphical Models June 17, 2018
Maximum Likelihood Estimation (17.1). In this chapter, assume the network structure
is fixed and that our data set D consists of fully observed instances of the network variables:
D = {ξ[1], . . . , ξ[M ]}. We begin with the simplest learning problem: parameter learning for a
single variable. We want to estimate the probability, denoted via the parameter θ, with which
the flip of a thumbtack will land heads or tails. Define the likelihood function L(θ : x) as
the probability of observing some sequence of outcomes x under the parameter θ. In other
words, it is simply P (x : θ), but interpreted as a function of θ. For our simple case, where D
consists of M thumbtack flip outcomes, the likelihood is L(θ : D) = θ^{M[1]} (1 − θ)^{M[0]}, where M[1] denotes the number of outcomes in D that were heads (and M[0] the number of tails). Since it's easier to maximize a logarithm, and since it yields the same optimal θ̂, optimize the log-likelihood to obtain:
θ̂ = argmax_θ ℓ(θ : D) = argmax_θ [ M[1] log θ + M[0] log(1 − θ) ]    (967)
  = M[1] / (M[1] + M[0])    (968)
Note that MLE has the disadvantage that it can’t communicate confidence of an estimate217 .
217
We get the same result (0.3) if we get 3 heads out of 10 flips, as we do for getting 300 heads out of 1000
flips; yet, the latter experiment should include a higher degree of confidence
218
We also have the constraint that P (ξ; θ) must be a valid distribution (nonnegative and sums to 1 over all
possible ξ)
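A tiny numerical sketch of eqs. 967–968 (the flip data below is made up):

# MLE for the thumbtack bias from a sequence of observed flips (1 = heads).
flips = [1, 0, 0, 1, 1, 0, 1, 1, 0, 1]
M1 = sum(flips)                 # sufficient statistic M[1]
M0 = len(flips) - M1            # sufficient statistic M[0]
theta_hat = M1 / (M1 + M0)
print(f"M[1]={M1}, M[0]={M0}, theta_hat={theta_hat:.3f}")
# Note the confidence issue above: 6/10 and 600/1000 flips give the same estimate.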
We can often simplify the likelihood function to simpler terms, like our M [0] and M [1] values
in the thumbtack example. These are called the sufficient statistics, defined as functions of
the data that summarize the relevant information for computing the likelihood. Formally,
A function τ(ξ) : ξ → R^ℓ (for some ℓ) is a sufficient statistic if for any two data sets D and D′, we have that
Σ_{ξ[m] ∈ D} τ(ξ[m]) = Σ_{ξ′[m] ∈ D′} τ(ξ′[m])  ⟹  L(θ : D) = L(θ : D′)    (969)
We often informally refer to the tuple Σ_{ξ[m] ∈ D} τ(ξ[m]) as the sufficient statistics of the data set D.
MLE for Bayesian Networks – Simple Example. We now move on to estimating pa-
rameters θ for the simple BN X → Y for two binary RVs X and Y . Our parameters θ are the
individual probabilities of P (X) and P (Y | X) (6 total). Since BNs have the nice property that
their joint probability decomposes into a product of probabilities, just like how the likelihood
function is a product of probabilities, we can write the likelihood function as a product of the
individual local probabilities:
L(θ : D) = ∏_m P(x[m] : θ_X) · ∏_m P(y[m] | x[m] : θ_{Y|X})    (970)        [decomposability of the likelihood function]
which can be decomposed even further by, e.g., splitting each product over m according to the value of x[m] (x[m] = x⁰ vs. x[m] = x¹),
etc. Just as we used M [0] in the thumbtack example to count the number of instances with a
certain value, we can use the same idea for the general case.
Let Z be some set of RVs, and z be some instantiation to them. We define M [z] to be the
number of entries in data set D that have Z[m] = z:
M[z] = Σ_m 1{Z[m] = z}    (971)
Global Likelihood Decomposition. We now move to the more general case of computing
the likelihood for BN with structure G.
L(θ : D) = ∏_m P_G(ξ[m] : θ)    (972)        [global decomposition of the likelihood]
         = ∏_m ∏_i P(x_i[m] | Pa_{X_i}[m] : θ)    (973)
         = ∏_i ∏_m P(x_i[m] | Pa_{X_i}[m] : θ_{X_i | Pa_{X_i}})    (974)
         = ∏_i L_i(θ_{X_i | Pa_{X_i}} : D)    (975)
where L_i is the local likelihood function for X_i. Assuming the θ_{X_i | Pa_{X_i}} are disjoint sets of parameters from one another, each L_i can be maximized independently, which implies that θ̂ = ⟨θ̂_{X_1 | Pa_{X_1}}, …, θ̂_{X_n | Pa_{X_n}}⟩.
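Concretely, for fully observed discrete data each local MLE is just a normalized count, θ̂_{x | pa} = M[x, pa] / M[pa]. A small sketch under that assumption (the network X → Y and the toy samples are my own):

from collections import Counter

# Fully observed samples for the simple BN X -> Y (binary variables).
data = [(0, 0), (0, 1), (1, 1), (1, 1), (0, 0), (1, 0), (1, 1), (0, 1)]

# theta_X: MLE of P(X) from counts.
Mx = Counter(x for x, _ in data)
theta_X = {x: Mx[x] / len(data) for x in (0, 1)}

# theta_{Y|X}: MLE of P(Y | X) = M[x, y] / M[x], one distribution per parent value.
Mxy = Counter(data)
theta_Y_given_X = {x: {y: Mxy[(x, y)] / Mx[x] for y in (0, 1)} for x in (0, 1)}

print("P(X):", theta_X)
print("P(Y|X):", theta_Y_given_X)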
Probabilistic Graphical Models June 17, 2018
Likelihood of Data and Observation Models (19.1.1). Consider the simple example of
flipping a thumbtack, but occasionally the thumbtack rolls off the table. We choose to ignore
the tosses for which the thumbtack rolls off. Now, in addition to the random variable X giving
the flip outcome, we have the observation variable OX , which tells us whether we observed the
value of X.
[Plate model figure: θ → X and ψ → O_X, with X and O_X inside the plate of repeated flips.]
The illustration above is a plate model where we choose a thumbtack sampled with bias θ and
repeat some number of flips with that same thumbtack. We also sample the random variable
OX that has probability of observation sampled from ψ and fixed for all the experiments we
do. This leads to the following definition for the observability model.
Let X = {X_1, …, X_n} be some set of RVs, and let O_X = {O_{X_1}, …, O_{X_n}} be their observability variables. The observability model is a joint distribution
P_missing(X, O_X) = P(X) · P_missing(O_X | X)
so that P (X) is parameterized by θ and Pmissing (OX | X) is parameterized by ψ. We
define a new set of RVs Y = {Y1 , . . . Yn } where V al(Yi ) = V al(Xi ) ∪ {?}. The actual
observation Y is a deterministic function of X and OX :
Y_i = X_i  if O_{X_i} = o¹,   and   Y_i = ?  if O_{X_i} = o⁰    (976)
P (Y = 1) = θψ (977)
P (Y = 0) = (1 − θ)ψ (978)
P (Y =?) = (1 − ψ) (979)
L(θ, ψ; D) = θM [1] (1 − θ)M [0] ψ M [1]+M [0] (1 − ψ)M [?] (980)
The main takeaway is to understand that when we have missing data, the data-generation
process involves two steps: (1) generate data by sampling from the model, then
(2) determine which values we get to observe and which ones are hidden from us.
The Likelihood Function (19.1.3). Assume we have a BN network G over a set of variables
X. In general, each instance has a different set of observed variables. Denote by O[m] and
o[m] the observed vars and their values in the m’th instance, and by H[m] the missing (or
hidden) vars in the m’th instance.
Information Theory, Inference, and Learning Algorithms
Contents
Information Theory, Inference, and Learning Algorithms November 11, 2017
[Note: Skipping most of this chapter since it’s mostly introductory material.]
Preface. For ease of reference, some common quantities we will frequently be using:
• Binomial distribution. Let r denote the number of successful trials out of N total
trials. Let f denote the probability of success for a single trial.
Pr[r | f, N] = C(N, r) f^r (1 − f)^{N−r},     E[r] = N f,     Var[r] = N f (1 − f)    (981)
• Stirling’s Approximation.
x! ≃ x^x e^{−x} √(2πx)   ⟺   ln x! ≃ x ln x − x + ½ ln(2πx)    (982)        [recall that log_b x = log_a x / log_a b]
ln C(N, r) ≃ r ln(N/r) + (N − r) ln(N/(N − r))    (983)
• Binary Entropy Function and its relation with Stirling’s approximation.
H_2(x) ≜ x lg(1/x) + (1 − x) lg(1/(1 − x))    (984)
lg C(N, r) ≃ N H_2(r/N)    (985)
We see that we must make assumptions about the prior probability Pr [si ]. It is common
to assume all possible values of si (0 or 1 in this case) are equally probable. It is useful
to observe the likelihood ratio,
Pr[r_{i:i+N} | s_i = 1] / Pr[r_{i:i+N} | s_i = 0] = ∏_{n=i}^{i+N−1} Pr[r_n | t_n = 1] / Pr[r_n | t_n = 0] = ∏_{n=i}^{i+N−1} { γ if r_n = 1;  γ^{−1} if r_n = 0 }    (987)
where we’ve defined γ := (1 − f )/f , with f being the probability of a bit getting flipped
by the channel. We want to assign ŝi to the most likely hypothesis out of the possible
si . If the likelihood ratio [for the two hypotheses] is greater than 1, we choose ŝi = 1,
else we choose ŝi = 0.
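For the repetition code R_3 with f < 1/2 (so γ > 1), the likelihood-ratio rule above reduces to a majority vote on each received block. A small simulation sketch (my own; the channel parameter and message are made up):

import random

# R3 repetition code over a binary symmetric channel with flip probability f.
def encode_r3(bits):
    return [b for b in bits for _ in range(3)]

def bsc(bits, f, rng):
    return [b ^ (rng.random() < f) for b in bits]   # flip each bit with prob f

def decode_r3(received):
    # majority vote per block = MAP decoding when f < 0.5
    return [int(sum(received[i:i+3]) >= 2) for i in range(0, len(received), 3)]

rng = random.Random(0)
f = 0.1
message = [rng.randint(0, 1) for _ in range(1000)]
decoded = decode_r3(bsc(encode_r3(message), f, rng))
pb = sum(s != d for s, d in zip(message, decoded)) / len(message)
print(f"per-bit error rate: {pb:.4f} (theory: 3f^2(1-f)+f^3 = {3*f*f*(1-f)+f**3:.4f})")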
• Block Codes - the (7, 4) Hamming Code. Although increasing the number of repetitions per bit N in our R_N repetition code decreases the error-per-bit probability p_b, we incur a substantial decrease in the rate of information transfer – a factor of 1/N.
The (7, 4) Hamming Code tries to improve this by encoding blocks of bits at a time
(instead of per-bit). It is a linear block code – it encodes each 4-bit block into a 7-bit
block, where the additional 3 bits are linear functions of the original K = 4 bits. The
(7, 4) Hamming code has each of the extra 3 bits act as parity checks. It’s easiest to show
this with the illustration below:
Information Theory, Inference, and Learning Algorithms November 12, 2017
[Note: Skipping most of this chapter since it’s mostly introductory material.]
The Likelihood Principle. For a generative model on data d given parameters θ, Pr [d | θ],
and having observed a particular outcome d1 , all inferences and predictions should depend
only on the function Pr [d1 | θ].
They mention the example of the unigram probabilities for each character in a document.
For example, p(z) = 0.007, which has information content of 10.4 bits222 .
222
Intuition digression: Recall from CS how to compute the number of bits required to represent N unique
values (answer: lg(N ).). Similarly, a probability of e.g 1/8 can be interpreted as “one of 8 possible outcomes”,
meaning that lg(1/(1/8))=3 bits are needed to encode all possible outcomes. Similarly, one could interpret
p(z)=0.007 as “belonging to 7 of 1000 possible results”. I guess in some strange world you can then say that
there are 1000/7 ≈ 142.86 evenly-proportioned events like this (how do you even word this) and it would take
- Entropy of Ensemble X. Defined to be the average Shannon information content of an outcome (with the convention 0 × lg(1/0) ≜ 0):
H(X) ≜ Σ_{x ∈ A_X} Pr[x] lg(1/Pr[x])    (989)
H(X) ≤ lg(|A_X|),   with equality iff p_i = 1/|A_X| ∀i    (990)
where, in the words of the author, “Gibbs' inequality is probably the most important
inequality in this book”.
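A quick sketch of eq. 989 in code (the toy distribution is made up):

from math import log2

# Entropy of an ensemble: H(X) = sum_x p(x) * lg(1/p(x)), with 0*lg(1/0) := 0.
def entropy(probs):
    return sum(p * log2(1.0 / p) for p in probs if p > 0)

p = [0.5, 0.25, 0.125, 0.125]
print(entropy(p))            # 1.75 bits
print(log2(len(p)))          # upper bound lg|A_X| = 2 bits, attained in the uniform case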
- Convex functions and Jensen's Inequality. A function f(x) is convex over the interval [a, b] if every chord of the function lies above the function. That is, ∀ x_1, x_2 ∈ [a, b] and 0 ≤ λ ≤ 1:
f(λx_1 + (1 − λ)x_2) ≤ λ f(x_1) + (1 − λ) f(x_2)
and we say f is strictly convex if, ∀ x_1, x_2 ∈ [a, b], we get equality only for λ = 0 and λ = 1.
log(142.86)=10.4 bits to encode all of them. Low-probability events (such as a character being z) have high
information content.
8.2.1 More About Inference (Ch. 3 Summary)
A first inference problem. Author explores a particle decay problem of finding Pr [λ | {x}]
where λ is the characteristic decay length and {x} is our collection of observed decay distances.
Plotting the likelihood Pr [{x} | λ] as a function of λ for any given x ∈ {x} shows each has
a peak value. The kicker is the interpretation: if each measurement x ∈ {x} is independent,
the total likelihood is the product of all the individual likelihoods, which can be interpreted as
updating/narrowing the interval [λa , λb ] within which Pr [{x} | λ] (as a function of λ) peaks.
In the words of the author’s mentor:
what you know about λ after the data arrive is what you knew before (Pr [λ]) and
what the data told you (Pr [{x} | λ])
So, why did I still get the answer correct? The reason is because enumeration of prob-
abilities wasn’t necessary at all, you just needed to realize that the likelihood for the
two remaining hypotheses (H1 and H2 ) were the same – the probability of observing the
earthquake open door 3 and the prize not being revealed was the same for the case of
the prize being behind door 1 or door 2. So maybe the real lesson here is determine
whether calculations are even needed in order to solve the given problem,
which luckily I’ve had drilled in my head for years from studying physics.
(3.15) Another biased coin variant. One of the best examples I've seen for favoring Bayesian methods over frequentist methods. Also made use of the beta function:
∫_0^1 p^x (1 − p)^y dp = Γ(x+1) Γ(y+1) / Γ(x+y+2) = B(x+1, y+1)    (998)
223
Note that, while important to recognize and understand, I could’ve avoided this pitfall entirely by just
ignoring the evidence during calculations and normalizing after, since the evidence can be determined solely by
the normalization constraint.
where B is the beta function, which is defined by the LHS.
Information Theory, Inference, and Learning Algorithms November 23, 2017
The response to this reaction is: “yes, exactly, and we must use that information to be
able to discern in the future whether, e.g., a measurement of “left side heavier” means
the oddball is on the left and heavy, or if it’s on the right and light – it’s useful to know
that a given side of the scale does not contain the oddball before a measurement.
• More generally, it’s also not optimal to greedily search for solutions that eliminate the
highest number of possibilities in any given single step. Another way of thinking about
this is that it’s undesirable for the ith measurement outcome to cause any of the 3 possible
measurement outcomes to be impossible at the next stage.
• I focused a disproportionate amount of thought on handling the equal-weight measure-
ment outcome, for whatever reason. I probably would’ve arrived at the solution faster if
I’d actually thought about how my strategies would’ve handled some outcome being “left
heavier” and then considered what that strategy would put on the scale at the next step,
where the italics denote what would’ve illuminated the fatal flaw in all my approaches.
Guessing Games. What’s the smallest number of yes/no questions needed to identify an
integer x between 0 and 63? Although it was obvious to me that the solution is to succes-
sively halve the possible values of x, I found it interesting that you can write down the
list of questions independent of the answers at each step using a basic application of
modular arithmetic. In other words, you can specify the full decision tree of N nodes with
just lg N questions. Nice. Also, recognize that the Shannon information content for any single outcome is lg(1/0.5) = 1 bit, and thus the total Shannon information content (for our predefined 6 questions) is 6 bits, which is not coincidentally lg of the number of possible values (64) that x could take before we ask any questions.
In general, if an outcome x has Shannon information content h(x) bits, I like to interpret that as “learning the result x cuts the number of possibilities for the final result down by a factor of 2^{h(x)}.” The
battleship example follows this interpretation well. Stated another way (in the author’s words):
The Shannon information content can be intimately connected to the size of a file
that encodes the outcomes of a random experiment.
8.3.1 Data Compression and Typicality
Data Compression. A lossy compressor compresses some files, but maps some files to
the same encoding. We introduce a parameter δ that describes the risk (aggressiveness of our
compression) we are taking with a given compression method: δ is the probability that there
will be no name for an outcome225 x.
The smallest δ-sufficient subset
If Sδ is the smallest subset of AX satisfying
Pr [x ∈ Sδ ] ≥ 1 − δ (1001)
then Sδ is the smallest δ-sufficient subset. It can be constructed by ranking the elements
of AX in order of decreasing probability and adding successive elements starting from
the most probable elements until the total probability is ≥ (1 − δ).
• Raw bit content of X: H0 (X) , lg |AX |. A lower bound for the number of binary
questions that are always guaranteed to identify an outcome from the ensemble X – it
simply maps each outcome to a constant-length binary string.
• Essential bit content of X: Hδ (X) , lg |Sδ |. A compression code can be made by
assigning a binary string of Hδ bits to each element of the smallest sufficient subset.
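A small sketch of the greedy construction of S_δ and the essential bit content H_δ(X) described above (the toy distribution is made up):

from math import log2

# Smallest delta-sufficient subset: rank outcomes by decreasing probability and
# keep adding them until the retained probability is >= 1 - delta.
def essential_bit_content(probs, delta):
    ranked = sorted(probs, reverse=True)
    total, size = 0.0, 0
    for p in ranked:
        if total >= 1 - delta:
            break
        total += p
        size += 1
    return size, log2(size)

probs = [0.4, 0.3, 0.15, 0.1, 0.04, 0.01]
print("H_0(X) =", log2(len(probs)))                     # raw bit content
for delta in (0.0, 0.05, 0.2):
    size, H_delta = essential_bit_content(probs, delta)
    print(f"delta={delta}: |S_delta|={size}, H_delta={H_delta:.3f} bits")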
Finally, we can now state Shannon's source coding theorem: Let X be an ensemble for the random variable x with entropy H(X) = H bits, and let X^N denote a sequence of N identically distributed (but not necessarily independent226) random variables/ensembles, (X_1, X_2, …, X_N). Then, for any ε > 0 and 0 < δ < 1,
(∃ N_0 ∈ Z⁺)(∀ N > N_0) :   | (1/N) H_δ(X^N) − H | < ε    (1002)
which, in English, can be read: N i.i.d. random variables each with entropy H(X) can be
compressed into more than N H(X) bits with negligible risk of information loss, as N → ∞;
conversely if they’re compressed into fewer than N H(X) bits it is virtually certain that infor-
mation will be lost.
225
More specifically, if there is some subset, {a_i}, of unique values that x can take on but our compression method discards/ignores, then we say δ = Σ_i p(x = a_i).
226
Actually, before the actual theorem statement, the author mentions we are now concerned with “string of
N i.i.d. random variables from a single ensemble X.” It’s probably fair to assume this is true for the quantities
in the theorem, but I’m leaving this note here as a reminder.
Typicality. The reason that large N in equation 1002 corresponds to larger potential for
better compression is that the subset of likely results for a string of outcomes becomes more
and more concentrated relative to the number of possible sequences as N increases227 . I just
realized this is for the same fundamental reasons that entropy exists in thermodynamics –
there are just more ways to exist in a high entropy state than otherwise. The author showed
C(N, r) as a function of r (the number of 1s in the N-bit string). For large N, this becomes almost
comically concentrated near the center (like a delta function at N/2) – see footnote for more
details228 .
This motivates the notion of typicality for [a string of length N from] an arbitrary ensemble
X with alphabet AX . For large N , we expect to find pi N occurrences of the outcome x = ai .
Hence the probability of such a string, and its information content, are roughly229
Pr[x]_typ = Pr[x_1] Pr[x_2] ⋯ Pr[x_N] ≃ p_1^{(p_1 N)} p_2^{(p_2 N)} ⋯ p_I^{(p_I N)}    (1003)
h(x)_typ = lg(1/Pr[x]_typ) ≃ N Σ_{i=1}^{I} p_i lg(1/p_i) = N H(X)    (1004)
This motivates defining the typical set as the set of strings whose per-symbol information content is close to the entropy:
T_{Nβ} ≜ { x ∈ A_X^N :  | (1/N) lg(1/P(x)) − H | < β }    (1005)
It turns out that whatever value of β we choose, the TN β contains almost all the probability
as N increases.
227
The author gave an example for a sequence of bits with probability of any given bit being 1 as 0.1. He
showed how, although the average number of 1s in a sequence of N bits grew as O(N), the standard deviation of that average only grew as √N.
228
The probability of getting a string with r 1s follows a binomial distribution with mean N p_1 and standard deviation √(N p_1 (1 − p_1)). This results in an increasingly narrower distribution P(r) for larger N.
229
We appear to be assuming that each outcome x in the string x are i.i.d. (confirmed)
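A quick empirical sketch of this concentration (my own, using the p_1 = 0.1 example from the footnotes): sample Bernoulli strings and measure the fraction that land inside T_{Nβ} as N grows.

import random
from math import log2

p1 = 0.1
H = p1 * log2(1 / p1) + (1 - p1) * log2(1 / (1 - p1))   # entropy per symbol
beta = 0.05
rng = random.Random(0)

def in_typical_set(x):
    n1 = sum(x)
    info = n1 * log2(1 / p1) + (len(x) - n1) * log2(1 / (1 - p1))   # lg 1/P(x)
    return abs(info / len(x) - H) < beta

for N in (100, 1000, 10000):
    hits = sum(in_typical_set([rng.random() < p1 for _ in range(N)]) for _ in range(200))
    print(f"N={N}: fraction of sampled strings in T_N,beta = {hits/200:.2f}")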
8.3.2 Further Analysis and Q&A
Pr[ (x − h̄)² ≥ α ] ≤ σ_h² / (α N)    (1007)
which can be easily derived from Chebyshev's inequalities.
• Proving ‘asymptotic equipartition’ principle, i.e. that an outcome x is almost
certain to belong to the typical set, approaching probability 1 for large enough N . It is
a simple application of the WLLN to the random variable
(1/N) lg(1/Pr[x]) = (1/N) Σ_{n=1}^{N} lg(1/Pr[x_n]) = (1/N) Σ_{n=1}^{N} h(x_n)    (1008)
where E [h(xn )] = H(X) for all terms in the summation. Observe, then, that the defini-
tion of the typical set given in equation 1005 (squaring both sides) has the same form as
the definition for the WLLN. Plugging in and rearranging yields
Pr[x ∈ T_{Nβ}] ≥ 1 − σ²/(β² N)    (1009)
where σ² ≡ Var[ lg(1/P(x_n)) ]. This proves the asymptotic equipartition principle. It will also be useful to recognize that for any x in the typical set, we can rearrange equation 1005 to obtain
2^{−N(H+β)} ≤ P(x) ≤ 2^{−N(H−β)}    (1010)
230
Notice how the two inequalities are technically the same.
Questions & Answers. Collection of my questions as I read through the chapter, which I
answered upon completing it.
• Q: Why isn’t the essential bit content of a string of N i.i.d. variables N H when δ = 0?
– A: I’m not entirely sure how to answer this still, but it seems the question is con-
fused. First off, the essential bit content approaches the raw bit content as δ de-
creases to 0: Hδ → H0 as δ → 0. It’s important to notice that both Hδ and H0
define an entropy where all members (S_δ^N for H_δ; A_X^N for H_0) are equiprobable. I
remember asking this question wondering “what is the significance of Hδ (X N ) ap-
proaching N H(X) (not a typo!) for tiny δ”. The answer: for larger N , more of
the probability mass is concentrated in a relatively smaller region, with elements of
that region being roughly equiprobable. The last part is what I didn’t initially realize
– that allowing for tiny δ combined with large N essentially makes it so
that S_δ^N ≈ T_{Nβ}.
• Q: Why aren’t the most probable elements necessarily in the typical set?
– A: In the limit of N → ∞, they are, since in that limit, all elements are in the
typical set and they’re equiprobable. However, in essentially any real case, we can
imagine that some elements will be too unlikely to be found within the typical set,
which necessarily requires that there exist elements with probability too high to be
in the typical set. Remember that the typical set is basically defined such that all
elements have probability within the range given in equation 1010.
Information Theory, Inference, and Learning Algorithms November 24, 2017
Overview. The aims of Monte Carlo methods are to solve one or both of the following:
1. Generate samples {x^{(r)}}_{r=1}^{R} from a given probability distribution Pr[x].
2. Estimate expectations of functions under Pr[x], for example
Φ = E_{x∼Pr[x]}[φ(x)] ≡ ∫ d^N x Pr[x] φ(x)    (1011)
where it's assumed that Pr[x] is sufficiently complex that we can't evaluate such expectations by exact methods.
Note that we can concentrate on the first problem (sampling), since we can use it to solve the
second problem (estimating an expectation) by using the random samples {x^{(r)}}_{r=1}^{R} to give the estimator
Φ̂ ≡ (1/R) Σ_r φ(x^{(r)})    (1012)
where
E[Φ̂] = (1/R) Σ_r E_{x^{(r)}∼Pr[x]}[ φ(x^{(r)}) ] = Φ    (1013)
Var[Φ̂] = σ²/R,   where σ² ≡ ∫ d^N x Pr[x] (φ(x) − Φ)² can be estimated by Σ_r (φ(x^{(r)}) − Φ)² / (R − 1)    (1014)
Importance sampling (drawing the samples from a simpler proposal density Q*(x) instead, and reweighting) gives the estimator
Φ̂ = Σ_r w_r φ(x^{(r)}) / Σ_r w_r,   where   w_r ≡ P*(x^{(r)}) / Q*(x^{(r)})    (1015)
Rejection Sampling. In addition to the assumptions in importance sampling, we now also assume that we know the value of a constant c such that c Q*(x) ≥ P*(x) for all x.
1. Generate sample x from proposal density Q(x), and evaluate cQ∗ (x).
2. Generate a uniformly distributed random variable u from the interval [0, cQ∗ (x)].
3. If u > P ∗ (x), then reject x; else, accept x and add it to our set of samples {x(r) }.
Metropolis-Hastings Method. Proposal density Q now depends on the current state x(t) .
1. Sample tentative next state x0 from proposal density Q(x0 ; x(t) ).
2. Compute:
a = [ P*(x′) / P*(x^{(t)}) ] · [ Q(x^{(t)}; x′) / Q(x′; x^{(t)}) ]    (1017)
3. If a ≥ 1, the new state is accepted. Otherwise, the new state is accepted with probability
a.
4. If accepted, we set x(t+1) = x0 . If rejected, we set x(t+1) = x(t) .
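A minimal sketch of these four steps for a one-dimensional target (my own; the unnormalized target P*, the symmetric Gaussian proposal, and the step size are all made-up choices, and with a symmetric proposal the Q-ratio in eq. 1017 cancels):

import random
import math

def p_star(x):
    # unnormalized target: a mixture of two Gaussian bumps
    return math.exp(-0.5 * (x - 2.0) ** 2) + 0.5 * math.exp(-0.5 * (x + 2.0) ** 2)

def metropolis(n_samples, step=1.0, seed=0):
    rng = random.Random(seed)
    x = 0.0
    samples = []
    for _ in range(n_samples):
        x_prop = x + rng.gauss(0.0, step)       # 1. tentative next state from Q(x'; x)
        a = p_star(x_prop) / p_star(x)          # 2. acceptance ratio (Q terms cancel)
        if a >= 1 or rng.random() < a:          # 3. accept with probability min(1, a)
            x = x_prop                          # 4. accepted: move to x'
        samples.append(x)                       #    rejected: stay at x^{(t)}
    return samples

samples = metropolis(50_000)
print("sample mean:", sum(samples) / len(samples))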
Information Theory, Inference, and Learning Algorithms November 15, 2017
Pr[x | β, J] = e^{−βE(x;J)} / Z(β, J)    (1018)
E(x; J) = −(1/2) Σ_{m,n} J_{mn} x_m x_n − Σ_n h_n x_n    (1019)
Z(β, J) = Σ_x e^{−βE(x;J)}    (1020)
Note that evaluating E(x; J) for a given x takes polynomial time in the number of spins N , and
evaluating Z is O(2N ). Variational free energy minimization is a method for approximating
the complex distribution P (x) by a simpler ensemble Q(x; θ) parameterized by adjustable θ.
where F ≜ −ln Z(β, J) is the true free energy. It's not immediately clear why this approxi-
mation Q is more tractable than P – for that we turn to an example below.
231
Yay!
Variational Free Energy Minimization for Spin Systems. For the system with energy
function given in equation 1019, consider the separable approximating distribution,
Q(x; a) = e^{Σ_n a_n x_n} / Z_Q = ∏_n [ e^{a_n x_n} / Σ_{x′_n} e^{a_n x′_n} ]    (1024)
H_Q(x) = Σ_x Q(x; a) ln( 1 / Q(x; a) )    (1027)
       = Σ_n [ Q(x_n = 1; a_n) ln(1/Q(x_n = 1; a_n)) + Q(x_n = 0; a_n) ln(1/Q(x_n = 0; a_n)) ]    (1028)
       = Σ_n [ q_n ln(1/q_n) + (1 − q_n) ln(1/(1 − q_n)) ]    (1029)
Machine Learning: A Probabilistic Perspective
Contents
Machine Learning: A Probabilistic Perspective August 07, 2018
Probability (Ch. 2)
Define the cumulative distribution function (cdf) F(q) ≜ p(X ≤ q), and the probability density function (pdf) f(x) ≜ (d/dx) F(x).
If f is monotonic (and hence invertible), then we can derive the pdf p_y(y) from the pdf p_x(x) by taking derivatives as follows:
p_y(y) ≜ (d/dy) P_y(y) = (d/dy) P_x(f^{−1}(y)) = p_x(x) · (dx/dy)    (1038)
and, since we only work with integrals over densities (i.e. overall sign does not matter), it is convention to take the absolute value of dx/dy in the above formula. In the multivariate case, the Jacobian matrix [J_{y→x}]_{i,j} ≜ ∂x_i/∂y_j is used:
Central Limit Theorem (2.6.3). Consider N i.i.d. RVs each with arbitrary pdf p(x_i), mean µ, and variance σ². Let S_N = Σ_i X_i denote the sum over the N RVs. The central limit theorem states that
lim_{N→∞} S_N ∼ N(Nµ, Nσ²)    (1040)
or equivalently   lim_{N→∞} √N (X̄ − µ) ∼ N(0, σ²)    (1041)
9.1.1 Exercises
Exercise 2.1
Intuition approach: If X and Y are uncorrelated, it is intuitive that their sum should have variance equal to the
sum of the individual variances. If there is some linear dependence between the two, we’d expect it to either
increase (if positively correlated) or decrease (if negatively correlated) the variance of their sum.
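Making that intuition precise (a standard identity, not the book's worked solution):
Var[X + Y] = E[(X + Y)²] − (E[X + Y])²
           = E[X²] + 2 E[XY] + E[Y²] − (E[X])² − 2 E[X] E[Y] − (E[Y])²
           = Var[X] + Var[Y] + 2 Cov[X, Y]
so uncorrelated variables (Cov[X, Y] = 0) give Var[X + Y] = Var[X] + Var[Y], and the cross term raises or lowers the variance of the sum according to the sign of the correlation.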
Machine Learning: A Probabilistic Perspective August 07, 2018
Bayesian Concept Learning (3.1). The interesting notion of learning a concept C, such
as “all prime numbers”, by only seeing positive examples x ∈ C. How should we approach
predicting whether a new test case x̃ belongs to the concept C? Well, what we're actually
doing is as follows: Given an initial hypothesis space H of concepts, we collect data D
that gradually narrows down the subset of H consistent with our data. We also need
to address how we (humans) will weigh certain hypotheses differently even if they are both
entirely consistent with the evidence. The Bayesian approach can be summarized as follows
(no particular order):
• Likelihood. Probability of observing D given a particular hypothesis h. Consider the simple but illustrative case where the data is sampled from a uniform distribution over the extension232 of a concept (a.k.a. the strong sampling assumption). The probability here of sampling N items independently under hypothesis h is
p(D | h) = (1/|h|)^N    (1045)
which elucidates how the model favors the smallest hypothesis space consistent with the
data 233 .
• Prior. Using just the likelihood can mean we fall prey to contrived/overfitting hypotheses
that basically just enumerate the data. The prior p(h) allows us to specify properties
about how the learned hypothesis ought to look.
• Posterior. This is just the likelihood times the prior, normalized [to one]. In general,
as we collect more and more data, the posterior tends toward a Dirac measure peaked at
the MAP estimate:
232
The extension of a concept is just the set of numbers that belong to it.
233
A result commonly known as Occam’s razor or the size principle.
The Beta-Binomial Model (3.3). In the previous section we considered inferring some
discrete distribution over integers; now we will turn to a continuous version. Consider the
problem of inferring the probability that a coin shows up heads, given a series of observed coin
tosses.
• Likelihood. As should be familiar, we’ll model the outcome of each toss Xi ∈ {1, 0}
indicating heads or tails with Xi ∼ Ber(θ). Assuming the tosses are i.i.d, this gives us
p(D | θ) ∝ θN1 (1 − θ)N0 , where N1 and N0 are the sufficient statistics of the data,
since p(θ | D) can be entirely modeled as p(θ | N1 , N0 ).
• Prior. We technically just need a prior p(θ) with support over the interval [0, 1], but it
would be convenient if the prior had the same form as the likelihood, i.e. p(θ) ∝ θ^{γ_1} (1 − θ)^{γ_2} for some hyperparameters γ_1 and γ_2. This is satisfied by the Beta distribution234.
This would also result in the posterior having the same form as the prior, meaning the
prior is a conjugate prior for the corresponding likelihood.
• Posterior. As mentioned, the posterior corresponding to a Bernoulli/binomial likelihood with a beta prior is itself a beta distribution: p(θ | D) = Beta(θ | a + N_1, b + N_0).
By either reading off from a table or deriving via calculus, we can obtain the following
properties for our Beta posterior:
[mode]  θ̂_MAP = (a + N_1 − 1) / (a + b + N − 2)    (1049)
[mean]  θ̄ = (a + N_1) / (a + b + N) = λ m_1 + (1 − λ) θ̂_MLE    (1050)
where m_1 = a/(a+b) is the prior mean. The last form captures the notion that the posterior
is a compromise between what we previously believed and what the data is telling us.
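A short sketch of the conjugate update and the point estimates in eqs. 1049–1050 (the prior hyperparameters and counts are made up):

# Beta-Binomial conjugate update: prior Beta(a, b), data with N1 heads / N0 tails.
a, b = 2.0, 2.0
N1, N0 = 7, 3
N = N1 + N0

a_post, b_post = a + N1, b + N0                       # posterior is Beta(a + N1, b + N0)
theta_map = (a_post - 1) / (a_post + b_post - 2)      # eq. 1049
theta_mean = a_post / (a_post + b_post)               # eq. 1050
theta_mle = N1 / N
lam = (a + b) / (a + b + N)                           # prior weight in the convex combination
m1 = a / (a + b)                                      # prior mean
print(theta_map, theta_mean, theta_mle)
print("convex-combination check:", lam * m1 + (1 - lam) * theta_mle)  # equals theta_mean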
The Dirichlet-Multinomial Model (3.4). We now generalize further to inferring the prob-
ability that a die with K sides comes up as face k.
• Likelihood. As before, we assume a specific observed sequence D of N dice rolls.
Assuming the data is i.i.d., the likelihood has the form
p(D | θ) = ∏_{k=1}^{K} θ_k^{N_k}    (1051)
which is the likelihood for the multinomial model up to an irrelevant constant factor.
• Prior. We’d like to find a conjugate prior for our likelihood, and we need it to have sup-
port over the K-dimensional probability simplex 235 . The Dirichlet distribution satisfies
234
You may be wondering, why not the Bernoulli distribution? It trivially has the same form as the Bernoulli
distribution, eh? Then, pause and actually think about what you’re saying for five seconds. You want to model
a prior on θ with a Bernoulli distribution? You do realize that the support for a Bernoulli is in k ∈ {0, 1}.
It’s the opposite domain entirely. We want something that “looks” like a Bernoulli but is a distribution over θ,
NOT the value(s) of x.
235
The K-dimensional probability simplex is the (K −1)th dimensional simplex determined by the unit vectors
e_1, …, e_K ∈ R^K, i.e. the set of vectors x such that x_i ≥ 0 and Σ_i x_i = 1.
both of these and is defined as
Dir(θ | α) = (1/B(α)) ∏_{k=1}^{K} θ_k^{α_k − 1}    (1052)
To get θ̂M AP , we’d take derivatives w.r.t. λ and each θk , do some substitutions and solve.
Example: Language Models with Bag of Words
Given a sequence of words, predict the most likely next word. Assume that the ith word Xi ∈ {1, . . . , K} (where
K is the size of our vocab) is sampled indep from all others using a Cat(θ) (multinoulli) distribution. This is
the BoW model.
My attempt: We need to derive the form of the posterior predictive p(X_{i+1} | X_1, …, X_i), where θ ∈ S_K. First, I know that the posterior is Dir(θ | α_1 + N_1, …, α_K + N_K),
and so I can derive the posterior predictive in the usual way, while also exploiting the fact that all Xi ⊥ Xj ,
p(X_{i+1} = k | X_1, …, X_i) = ∫ p(X_{i+1} = k | θ) p(θ | X_1, …, X_i) dθ    (1055)
= ∫ θ_k p(θ_k, θ_{−k} | X_1, …, X_i) dθ_k dθ_{−k}    (1056)
= ∫ θ_k p(θ_k | X_1, …, X_i) dθ_k    (1057)
= E[θ_k | X_1, …, X_i]    (1058)
= (α_k + N_k) / Σ_j (α_j + N_j)    (1059)
Naive Bayes Classifiers (3.5). For classifying vectors of discrete-valued features, x ∈
{1, . . . , K}D . Assumes features are conditionally independent given the class label:
p(x | y = c, θ) = ∏_{j=1}^{D} p(x_j | y = c, θ_{j,c})    (1060)
with parameters θ ∈ RD×|Y|236 . We can model the individual class-conditional densities with
the multinoulli distribution. If we were modeling real-valued features, we could instead use a
Gaussian distribution.
• MLE fitting. Our log-likelihood is
log p(D | θ) = Σ_c N_c log π_c + Σ_{i=1}^{N} Σ_{j=1}^{D} log p(x_j^{(i)} | y^{(i)}, θ_{j,y^{(i)}})    (1061)
where πc = p(y = c) are the class priors237 . We see that we can optimize the class
priors separately from the others, and that they have the same form as the multinomial
likelihood we used in the last section. We already know that the MLE for these are
π̂c = Nc /N (remember this involves using a Lagrangian). Let’s assume next that we’re
working in the case of binary features (xj | c ∼ Ber(θj,c )). This results in MLE estimates
θ̂j,c = Nj,c /Nc .
236
You could also generalize this to having some number of params N associated with each pairwise (j, c).
It’s also important to recognize that this is the first model of this chapter where the input x is a vector, and
thus introduces pairwise parameters.
237
We only see the class prior parameters here because they appear in the likelihood of generative classifiers,
since p(x, y) = p(y)p(x | y). We don’t see the priors for θ that aren’t class priors because MLE is only concerned
with maximizing the likelihood, not the posterior (which would contain those priors).
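A compact sketch of those counting MLEs for binary features (the toy data below is made up; a tiny constant guards against log(0) at prediction time):

import numpy as np

# Bernoulli naive Bayes MLE: pi_c = N_c / N and theta_{j,c} = N_{j,c} / N_c.
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1],
              [1, 1, 1],
              [0, 0, 0],
              [0, 1, 0]])
y = np.array([0, 0, 0, 1, 1, 1])

classes = np.unique(y)
pi = np.array([np.mean(y == c) for c in classes])              # class priors
theta = np.array([X[y == c].mean(axis=0) for c in classes])    # theta[c, j] = N_{j,c}/N_c

def predict(x):
    # log p(y=c) + sum_j log p(x_j | y=c) under the Bernoulli model
    logp = np.log(pi) + np.log(theta + 1e-12) @ x + np.log(1 - theta + 1e-12) @ (1 - x)
    return classes[np.argmax(logp)]

print(pi, theta, predict(np.array([1, 0, 1])), sep="\n")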
9.2.1 Exercises
The density function for the uniform distribution centered on 0 with width 2a is
p(x) = (1/(2a)) · 1{x ∈ [−a, a]}
Remember this is for continuous x, and p(x) is interpreted as the derivative of the corresponding CDF P (x).
a. Given data x1 , . . . , xn , find âM LE . So there are a few ways of doing this. We can get the answer pretty quick
with intuition, and not-so-quick by grinding through the math. Quick-n-easy: If you were paying attention to
the chapter, you’d instantly remember that MLE is all about finding the smallest hypothesis space consistent
with the data. It should then be obvious that âM LE = max |xi |. Slightly more rigorous. We can also solve a
constrained optimization problem, with constraint that a ≥ max |xi |, since we must require our solution to assign
nonzero probability to any of our observations.
The rest is mechanical: (1) take deriv wrt λ, (2) deriv wrt a and set to zero, (3) solve for a as a function of λ, (4)
solve for λ by substituting the previous step's results into the constraint, yielding λ ≤ n/|x_max|, (5) plug the result for λ into the result from step 3 to obtain a ≥ |x_max|. In the limit of many samples, the first term becomes
more important in the optimization, which decreases as a increases, and so we choose the lowest value of a that
satisfies the constraints: a := xmax .
b. What probability would the model assign to a new data point xn+1 using â. We are obviously meant to
trivially answer that it will assign the density with a = â. However, I take objection to this question, since it
makes no sense to evaluate a density at a single point p(x).
c. Do you see any problem with this approach? Yes, it extremely overfits to the data. We’ll assign zero
probabilities to any points outside the interval of observations, and the same probability to anything in that
interval. We should do a more Bayesian approach instead.
Machine Learning: A Probabilistic Perspective August 11, 2018
Basics (4.1). I’ll be filling in the gaps that the book leaves out in its derivations, as a way of
reviewing the relevant linear algebra/calculus/etc. For notation’s sake, here how the author
writes the MVN in D dimensions:
N(x | µ, Σ) ≜ ( 1 / ((2π)^{D/2} |Σ|^{1/2}) ) exp( −½ (x − µ)^T Σ^{−1} (x − µ) )    (1063)
Writing the eigendecomposition Σ = U Λ U^T (with U orthonormal), U is invertible. Since Σ is p.d., its eigenvalues are all positive, and thus Λ is also invertible.
We can then apply the basic definition for an invertible matrix to write
Σ^{−1} = U Λ^{−1} U^T = Σ_{i=1}^{D} (1/λ_i) u_i u_i^T    (1064)        [remember, an orthonormal matrix satisfies U^T = U^{−1}]
where ui is the ith eigenvector and ith column of U. We can use this to rewrite
Rewriting in the form of an ellipse:
(x − µ)^T Σ^{−1} (x − µ) = (x − µ)^T ( Σ_{i=1}^{D} (1/λ_i) u_i u_i^T ) (x − µ)    (1065)
= Σ_{i=1}^{D} (1/λ_i) (x − µ)^T u_i u_i^T (x − µ)    (1066)
= Σ_{i=1}^{D} y_i² / λ_i    (1067)
238
We know this because all covariance matrices of any random vector X are symmetric p.s.d., and the
additional requirement that Σ−1 exists means that it is p.d.
where y_i ≜ u_i^T (x − µ). The fascinating insight is recalling that the equation for an ellipse in 2D is
y_1²/λ_1 + y_2²/λ_2 = 1    (1068)
Hence we see that the contours of equal probability density of a Gaussian lie along ellipses, as
illustrated above. The eigenvectors determine the orientation of the ellipse, and the eigenvalues
determine how elongated it is.
Maximum Entropy Derivation of the Gaussian (4.1.4). Recall that the exponential
family can be derived as the family of distributions p(x) that maximizes H(p) subject to
constraints that the moments of p match some set of empirical moments Fk specified by us. It
turns out that the MVN is the distribution with maximum entropy subject to having a specified
mean and covariance. Consider the zero-mean MVN and its entropy239 :
p(x) = (1/Z) exp( −½ x^T Σ^{−1} x )    (1069)
h(p) = ½ ln[ (2πe)^D det Σ ]    (1070)
Let p = N(0, Σ) above and let q(x) be any density satisfying ∫ q(x) x_i x_j dx = Σ_{ij}. The result
is based on the fact that h(q) must be less than or equal to h(p). This can be shown by
evaluating DKL (q||p) and recalling that DKL is always greater than or equal to zero.
239
For derivation, see this wonderful answer on stackexchange.
Gaussian Discriminant Analysis (4.2). With generative classifiers, it is common to define the class-conditional density as a MVN, p(x | y=c, θ) = N(x | µ_c, Σ_c). We can then classify a new test vector as follows:
ŷ(x) = argmax_c [ log p(y=c | π) + log p(x | y=c, θ) ]    (1072)
If we have a uniform prior over classes, the log p(y=c | π) term is constant and we simply pick the class with the highest class-conditional log-likelihood.
Linear Discriminant Analysis (LDA). The special case where all covariance matrices are
the same, Σc = Σ. Now the quadratic term xT Σ−1 x is independent of c and thus is not
important for classification. Instead we end up with the much simpler,
p(y=c | x, θ) = e^{β_c^T x + γ_c} / Σ_{c′} e^{β_{c′}^T x + γ_{c′}} = S(η)_c    (1074)
β_c := Σ^{−1} µ_c    (1075)
γ_c := −½ µ_c^T Σ^{−1} µ_c + log π_c    (1076)
where S is the familiar softmax. TODO: come back and compare/contrast LDA with multi-
nomial logistic regression after reading chapter 8.
240
Don’t confuse the word “discriminant” for “discriminative” – we are still in a generative model setting. See
8.6 for details on the distinction.
241
This is easy to show. If diagonal, then p(x | y) factorizes. We know the ith item in the product corresponds
to p(xi | c) by considering how simple it is to compute marginals for Gaussians with diagonal Σ.
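A small numpy sketch of eqs. 1074–1076 (the class means, shared covariance, and priors are made up):

import numpy as np

# LDA: shared covariance Sigma across classes, so the decision rule is linear in x.
mus = np.array([[0.0, 0.0], [2.0, 1.0], [-1.0, 2.0]])   # class means (C x D)
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])              # shared covariance
pis = np.array([0.5, 0.3, 0.2])                          # class priors

Sigma_inv = np.linalg.inv(Sigma)
betas = mus @ Sigma_inv                                  # rows are beta_c = Sigma^{-1} mu_c
gammas = -0.5 * np.sum(mus @ Sigma_inv * mus, axis=1) + np.log(pis)   # gamma_c

def posterior(x):
    eta = betas @ x + gammas         # linear scores beta_c^T x + gamma_c
    eta -= eta.max()                 # subtract max for numerical stability
    p = np.exp(eta)
    return p / p.sum()               # softmax S(eta)

x = np.array([1.5, 0.5])
print(posterior(x), posterior(x).argmax())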
p(x_1 | x_2) = N(x_1 | µ_{1|2}, Σ_{1|2})
where µ_{1|2} = µ_1 + Σ_{12} Σ_{22}^{−1} (x_2 − µ_2)    (4.69)
Σ_{1|2} = Σ_{11} − Σ_{12} Σ_{22}^{−1} Σ_{21} = Λ_{11}^{−1}
where Λ := Σ−1 . Notice that the conditional covariance is a constant matrix independent of
x2 . The proof here relies on Schur complements (see Appendix).
Information Form (4.3.3). Thus far, we’ve been working in terms of the moment parame-
ters µ and Σ. We can also rewrite the MVN in terms of its natural (canonical) parameters
Λ and ξ,
N_c(x | ξ, Λ) = (2π)^{−D/2} |Λ|^{1/2} exp( −½ [ x^T Λ x + ξ^T Λ^{−1} ξ − 2 x^T ξ ] )    (1080)
where Λ ≜ Σ^{−1} and ξ ≜ Σ^{−1} µ    (1081)
where Nc is how we’ll denote “in canonical form.” The marginals and conditionals in canonical
form are
p(x_2) = N_c(x_2 | ξ_2 − Λ_{21} Λ_{11}^{−1} ξ_1,  Λ_{22} − Λ_{21} Λ_{11}^{−1} Λ_{12})    (1082)
p(x_1 | x_2) = N_c(x_1 | ξ_1 − Λ_{12} x_2,  Λ_{11})    (1083)
and we see that marginalization is easy in moment form, while conditioning is easier
in information form.
Linear Gaussian Systems (4.4). Suppose we have a hidden variable x and a noisy observa-
tion of it y. Let’s assume we have the following prior and likelihood:
p(x) = N(x | µ_x, Σ_x),      p(y | x) = N(y | Ax + b, Σ_y)    (4.124)
The Wishart Distribution (4.5). We now dive into the distributions over the parameters Σ
and µ. First, we need some math prereqs out of the way. The Wishart is the generalization
of the Gamma to pd matrices:
Wi(Λ | S, ν) = (1/Z_{Wi}) |Λ|^{(ν−D−1)/2} exp( −½ tr(Λ S^{−1}) )    (1084)
Z_{Wi} = 2^{νD/2} Γ_D(ν/2) |S|^{ν/2}    (1085)
where
• ν: degrees of freedom
• S: scale matrix (a.k.a. scatter matrix). Basically empirical Σ.
• ΓD : multivariate gamma function
A nice property is that if Σ^{−1} ∼ Wi(S, ν), then Σ ∼ IW(S^{−1}, ν + D + 1), the inverse Wishart (the multi-dimensional generalization of the inverse Gamma).
Machine Learning: A Probabilistic Perspective September 13, 2018
MAP Estimation (5.2.1). The most popular point estimate for parameters θ is the pos-
terior mode, aka the MAP estimate. However, there are many drawbacks:
• No measure of uncertainty (true for any point estimate).
• Using θM AP for predictions is prone to overfitting.
• The mode is an atypical point.
• It’s not invariant to reparameterization. Say two possible parameterizations θ1 and
θ2 =f (θ1 ), where f is some deterministic function. In general, it is not the case that
θ̂2 = f (θ̂1 ) under MAP.
Bayesian Model Selection (5.3). A model selection technique where we compute the best
model m for data D using the formulas,
p(m | D) = p(D | m) p(m) / Σ_{m ∈ M} p(D | m) p(m)    (1086)
p(D | m) = ∫ p(D | θ) p(θ | m) dθ    (1087)
where the latter is the marginal likelihood242 for model m. Note that this isn’t anything
new; we’ve usually just denoted it simply as p(D), since typically m is specified beforehand.
Although large models with many parameters can achieve higher likelihoods under MLE/MAP,
p(D | θ̂m ), this is not necessarily the case with marginal likelihood, an effect known as the
Bayesian Occam’s razor. Below we give the marginal likelihoods for familiar models:
• Beta-Binomial:
  p(D | m=BetaBinom) = (N choose N1) · B(a + N1, b + N0) / B(a, b)
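A small sketch (mine, not from the book) of this marginal likelihood in log space, using only standard scipy special functions:

```python
# log p(D) = log C(N, N1) + log B(a + N1, b + N0) - log B(a, b)
from scipy.special import betaln, gammaln

def log_marglik_beta_binom(N1, N0, a=1.0, b=1.0):
    N = N1 + N0
    log_nchoosek = gammaln(N + 1) - gammaln(N1 + 1) - gammaln(N0 + 1)
    return log_nchoosek + betaln(a + N1, b + N0) - betaln(a, b)

# e.g. 7 heads and 3 tails under a uniform Beta(1,1) prior
print(log_marglik_beta_binom(7, 3))
```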
242 Also called the integrated likelihood or the evidence.
BIC Approximation (5.3.2.4). The integral involved in computing p(D | m) (henceforth
denoted simply as p(D)) can be intractable. The Bayesian information criterion (BIC) is
a popular approximation:
BIC := log p(D | θ̂) − (dof(θ̂)/2) · log N ≈ log p(D)   (1088)
where
• dof(θ̂) is the number of degrees of freedom in the model.
• θ̂ is the MLE for the model.
BIC is also closely related to the minimum description length (MDL) principle and the
Akaike information criterion (AIC).
Hierarchical Bayes (5.5). When defining our prior p(θ | η), we have to of course specify the
hyperparameters η required by our choice of prior. The Bayesian approach for doing this is
to put a prior on our prior! This situation can be represented as a directed graphical model,
illustrated below.
η → θ → D
Bayesian Decision Theory (5.7). Decision problems can be cast as games against nature, where nature selects a quantity y ∈ Y unknown to us, and then generates an observation
x ∈ X that we get to see. Our goal is to devise a decision procedure or policy δ : X 7→ A
for generating an action a ∈ A from observation x that’s deemed most compatible with the
hidden state y. We define “compatible” via a loss function L(y, a):
In this context, we call ρ the posterior expected loss, and δ(x) the Bayes estimator. Some
Bayes estimators for common loss functions are given below.
• 0-1 loss: L(y, a) = 1{y ≠ a}. Easy to show that δ(x) = arg maxy∈Y p(y | x).
Machine Learning: A Probabilistic Perspective August 19, 2018
Model Specification (7.2). The linear regression model is a model of the form
p(y | x, θ) = N (y | wT x, σ 2 ) (1091)
Maximum Likelihood Estimation (least squares) (7.3). Most commonly, we'll estimate the parameters by computing the MLE, ŵ := arg minw RSS(w), where RSS(w) is the residual sum of squares. Notice that θ := (w, σ2), but typically we're focused on estimating w243.
Derivation of the MLE (7.3.1). We’ll now denote the negative log likelihood as NLL(w)
and drop constant terms that are irrelevant for the optimization task.
NLL(w) = (1/2) ||y − Xw||22 = (1/2) wT(XTX)w − wT(XTy)   (1096)
where XTX = Σi=1..N xi xiT   (1097)
and XTy = Σi=1..N xi yi   (1098)
243
Since our goal is typically to make future predictions ŷ(x) = wT x, rather than sampling y ∼ p(y | x, θ),
we aren’t concerned with estimating σ 2 . We assume some variability, and the goal is focused on fitting the data
to a straight line.
And the optimal ŵOLS can be found by taking the gradient, setting it to zero, and solving for w as usual:
∇NLL(w) = XTXw − XTy   (1099)
XTXw = XTy   [normal eq.]   (1100)
ŵOLS = (XTX)−1 XTy   (1101)
The resulting prediction ŷ = XŵOLS is a linear combination of the D column vectors x̃i ∈ RN. Crucially, observe that this means ŷ ∈ span({x̃i}i=1..D) no matter what (a hard constraint by definition of our model). So, how do you minimize the residual norm ||y − ŷ|| given that ŷ is restricted to a particular subspace? You require y − ŷ to be orthogonal to that subspace, of course245! Formally, this means x̃jT(y − ŷ) = 0, for all 1 ≤ j ≤ D. Equivalently, XT(y − Xŵ) = 0.
Although this is neat, I'm left unsatisfied since there appears to be no intuition of what the column vectors of X really mean on a conceptual level.
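A tiny NumPy sketch (mine, assumed data) checking eq. (1101) against the library least-squares routine and the orthogonality of the residual:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)

w_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)    # eq. (1101)
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)    # numerically preferred route
assert np.allclose(w_normal_eq, w_lstsq)

# residual is orthogonal to the column space of X: X^T (y - X w) ≈ 0
print(np.abs(X.T @ (y - X @ w_normal_eq)).max())
```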
Robust Linear Regression (7.4). Gaussians are sensitive to outliers, since their log-likelihood penalizes deviations quadratically246. One way to achieve robustness to outliers is to instead use a distribution with heavy tails (i.e., replacing the Gaussian with a heavy-tailed dist.): outliers can then still receive reasonably high likelihood without the whole distribution shifting around to accommodate them. One popular choice is the Laplace distribution,
p(y | x, w, b) = Lap(y | wTx, b) := (1/(2b)) exp( −(1/b) |y − wTx| )   (1106)
NLL(w) = Σi |yi − wTxi|   (1107)
244
This means we have more rows than columns, which means our column space cannot span all of RN .
245
Consider that y − ŷ points from our prediction (which is in the constraint subspace) ŷ to the true y that
we want to get closest to. Intuitively that means ŷ is looking “straight out” at y, in a direction orthogonal to
the subspace that ŷ lives in. Formally, we can write y = (yk , y⊥ ), where yk is the component within Col(x).
The best we can do, then, is ŷ := yk .
246
In other words, outliers initially get huge loss values, and the distribution shifts toward them to minimize
loss (undesirably).
Goal: convert the NLL into a form that is easier to optimize (a linear program). Let ri := yi − wTxi be the i'th residual. The following steps show how we can convert this into a linear program:
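As a sketch of where this ends up (my own formulation, not copied from the book): minimize Σi ri subject to ri ≥ yi − wTxi and ri ≥ −(yi − wTxi), which a generic LP solver can handle.

```python
# Least-absolute-deviations regression as a linear program (variables [w, r]).
import numpy as np
from scipy.optimize import linprog

def lad_regression(X, y):
    N, D = X.shape
    c = np.concatenate([np.zeros(D), np.ones(N)])      # minimize sum of residual bounds
    # -x_i.w - r_i <= -y_i   and   x_i.w - r_i <= y_i
    A_ub = np.block([[-X, -np.eye(N)],
                     [ X, -np.eye(N)]])
    b_ub = np.concatenate([-y, y])
    bounds = [(None, None)] * D + [(0, None)] * N
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:D]

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.0, -1.0]) + rng.laplace(scale=0.1, size=50)
y[0] += 20.0                       # a gross outlier barely moves the LAD fit
print(lad_regression(X, y))
```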
Ridge Regression (7.5). We know that MLE can overfit by essentially memorizing the data.
If, for example, we model 21 points with a degree-14 polynomial247, we get many large positive and negative learned coefficients, which allow the curve to wiggle in just the right way to almost perfectly interpolate the data – this is why we often regularize weights to have low absolute value (doing MAP instead of MLE). This encourages smoother/less-wiggly curves. One way to do this is by using a zero-mean Gaussian prior on our weights:
p(w) = Πj N(wj | 0, τ2)   (1111)
This makes our MAP estimation problem and solution take the form (note that w0 is NOT regularized)
J(w) = (1/N) Σi ( yi − (w0 + wTxi) )2 + λ||w||22   (1112)
ŵridge = (XTX + λID)−1 XTy   (1113)
247 To review polynomials, search "Lagrange interpolation" in your CS 70 notes.
Machine Learning: A Probabilistic Perspective August 11, 2018
Kevin P. Murphy (2012). Generalized Linear Models and the Exponential Family.
Machine Learning: A Probabilistic Perspective.
The exponential family consists of densities of the form p(x | θ) = (1/Z(θ)) h(x) exp(θT φ(x)) = h(x) exp(θT φ(x) − A(θ)), where θ are the natural (canonical) parameters249, and φ(x) are the sufficient statistics.
Below are some (quick/condensed) examples showing the first couple steps in rewriting familiar
distributions in exponential family form:
248
Given certain regularity conditions.
249
We often generalize this with η(θ), which maps whatever params θ we’ve chosen to the canonical params
η(θ).
Log Partition Function (9.2.3). The derivatives of the log partition, A(θ), can be used to
generate cumulants250 for the sufficient statistics, φ(x).
∂A(θ)/∂θ = E[φ(x)]   (1119)
∇2A(θ) = cov[φ(x)]   (1120)
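A quick numerical check of this (my own toy example): for a Bernoulli in natural form, A(θ) = log(1 + e^θ), so A′(θ) should equal the mean sigmoid(θ) and A″(θ) the variance.

```python
import numpy as np

A = lambda t: np.log1p(np.exp(t))
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

theta, eps = 0.7, 1e-4
dA = (A(theta + eps) - A(theta - eps)) / (2 * eps)                 # ~ E[x]
d2A = (A(theta + eps) - 2 * A(theta) + A(theta - eps)) / eps**2    # ~ var(x)

print(dA, sigmoid(theta))                          # should match
print(d2A, sigmoid(theta) * (1 - sigmoid(theta)))  # should match
```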
MLE for the Exponential Family (9.2.4). The likelihood takes the form (it appears that we are denoting 1/Z with g now)
p(D | θ) = [ Πi=1..N h(x(i)) ] g(θ)N exp{ η(θ)T φ(D) }   (1121)
φ(D) = Σi=1..N φ(x(i)) = [ Σi φ1(x(i)) · · · Σi φK(x(i)) ]   (1122)
Consider a canonical252 exponential family which also sets h(·) = 1. The log-likelihood is then log p(D | θ) = θT φ(D) − N A(θ). Since −A(θ) is concave253 and the other term is linear in θ, the log-likelihood is concave and thus has a global maximum.
250
The first and second cumulants are mean and variance.
251
The wording is weird here. We mean “out of all families/distributions that already satisfy certain constraints
that must be met, the exponential family is the only...”. For example, the uniform distribution has finite statistics
and is not in the exponential family, but it does not meet the constraint that its support must be independent
of the parameters, so it’s outside the scope of the theorem.
252
Defined as those which satisfy η(θ) = θ.
253
We know −A is concave because A is convex. We know A is convex because ∇2 A is positive definite.
Remember that any twice-differentiable multivariate function f is convex IFF its Hessian is pd for all θ. See
sec 7.3.3 and 9.2.3 for more.
Maximum Entropy Derivation of the Exponential Family (9.2.6). Suppose we don’t
know which distribution p to use, but we do know the expected values of certain features or
functions:
Fk := Ex∼p(x)[fk(x)] = Σx fk(x) p(x)   (1124)
The principle of maximum entropy or maxent says we should pick the distribution with
maximum entropy, subject to the constraints that the moments of the distribution match the
empirical moments of the specified functions fk (x). Treating p as a fixed length vector (i.e.
assuming x is discrete), we can take the derivative of our Lagrangian (entropy in units of nats
with constraints) w.r.t. each “element” px = p(x) to find the optimal distribution.
J(p, λ) = H(p) + λ0 ( 1 − Σx p(x) ) + Σk λk ( Fk − Σx p(x) fk(x) )   (1125)
∂J/∂p(x) = −1 − log p(x) − λ0 − Σk λk fk(x)   (1126)
Setting this derivative to zero gives p(x) = (1/Z) exp( −Σk λk fk(x) ), with Z := e1+λ0 enforcing normalization. Thus the maxent distribution p(x) has the form of the exponential family, a.k.a. the Gibbs distribution.
Machine Learning: A Probabilistic Perspective August 26, 2018
Latent Variable Models (LVMs) (11.1). In this chapter, we explore directed GMs that have
hidden/latent variables. Advantages of LVMs:
1. Often have fewer params.
2. Hidden vars can serve as a bottleneck (representation learning).
Mixture Models (11.2). The simplest form of LVM is where the hidden variables zi ∈ {1, . . . , K} represent a discrete latent state. We use a discrete prior p(zi = k) = Cat(π)k = πk, and likelihood p(xi | zi = k) = pk(xi). A mixture model is defined by
p(xi | θ) = Σk=1..K πk pk(xi | θ)   (1128)
where we call pk the kth base distribution.
Below, I derive the expectation and covariance of x in this case254 .
E[x] = Σx x p(x)   (1129)
     = Σx x Σk=1..K πk pk(x)   (1130)
     = Σx x Σk πk Πj=1..D Ber(xj | µjk)   (1131)
     = Σk πk Σx1 · · · ΣxD x Πj=1..D Ber(xj | µjk)   (1132)
     = Σk πk Σx1 · · · ΣxD Ber(x1 | µ1k) · · · Ber(xD | µDk) x   (1133)
     = Σk πk µk   (1134)
Next we want to find cov(x) = E[xxT] − E[x]E[x]T. I think the insight that makes finding the first term easiest is realizing that you only need to handle the two cases E[xi2] and E[xi xj≠i] separately. The result is cov(x) = Σk πk (Σk + µk µkT) − E[x]E[x]T, where Σk = diag(µjk(1 − µjk)) is the covariance of x under pk. The fact that the mixture covariance matrix is now non-diagonal confirms that the mixture can capture correlations between variables xi, xj≠i, unlike a single product-of-Bernoullis model.
The EM Algorithm (11.4). In LVMs, it’s usually intractable to compute the MLE since
we have to marginalize over hidden variables while satisfying constraints like positive definite
covariance matrices, etc. The EM algorithm gets around these issues via a two-step process.
The first step involves taking an expectation, where the expectation is over z ∼ p(z | x, θt−1 )
for each individual observed x, where we use the current parameter estimates when sampling
z in the expectation. This gives us an auxiliary likelihood that’s a function of θ which
will serve as a stand-in (in the 2nd step) for what we typically use as the likelihood in MLE.
The second step is then just finding the optimal θ t over the auxiliary likelihood function from
the first step. This iterates until convergence or some stopping condition.
254
Shown in excruciating detail because I was unable to work through this in my head alone.
Procedure: EM Algorithm
where, again, the “expectation” serves the purpose of determining the expected value of z (i) for each
observed x(i) . It’s somewhat of a misnomer to denote the expectation like this, since each z (i) is innately
tied with its corresponding observation x(i) .
1. E-Step: Evaluate Q(θ | θ t−1 ) using (obviously) the previous parameters θ t−1 . This yields a
function of θ. This gives us the expected log-likelihood function of θ for the observed data D.
2. M-Step: Optimize Q w.r.t. θ to get θt := arg maxθ Q(θ | θt−1).
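To make the E/M split concrete, here is a minimal sketch (mine, not from the book) of EM for a two-component 1-D GMM with known, shared variance; all the numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])
pi, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), 1.0

for _ in range(50):
    # E-step: responsibilities r[i, k] = p(z_i = k | x_i, theta^{t-1})
    logp = -0.5 * ((x[:, None] - mu[None, :]) / sigma) ** 2 + np.log(pi)
    r = np.exp(logp - logp.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    # M-step: maximize the expected complete-data log-likelihood
    Nk = r.sum(axis=0)
    pi = Nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / Nk

print(pi, mu)   # roughly [0.3, 0.7] and [-2, 3]
```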
Machine Learning: A Probabilistic Perspective September 14, 2018
Factor Analysis (12.1). Whereas mixture models define p(z) = Cat(π) for a single hidden
variable z ∈ {1, . . . , K}, factor analysis begins by instead using a vector of real-valued latent
variables, z ∈ RL. The simplest prior is z ∼ N(µ0, Σ0). If x ∈ RD, we can define the likelihood p(x | z, θ) = N(x | Wz + µ, Ψ), where
• W ∈ RD×L : factor loading matrix.
• Ψ ∈ RD×D is a diagonal covariance matrix, since we want to “force” z to explain the
correlation255 . If Ψ = σ 2 I, we get probabilistic PCA.
Summaries of key points regarding FA:
• Low-rank parameterization of MVN. FA can be thought of as specifying p(x) using
a small number of parameters. [math] yields that
cov(x) = WW T + Ψ
which has O(LD) params (remember Ψ is diagonal) instead of the usual O(D2 ).
• Unidentifiability: The params of an FA model are unidentifiable.
Classical PCA (12.2.1). Goal: find an orthogonal set of L linear basis vectors wj ∈ RD , and
the scores zi ∈ RL , such that we minimize the average reconstruction error:
J(W, Z) = (1/N) Σi=1..N ||xi − x̂i||2,  where x̂i := Wzi   (1142)
        = ||X − WZT||2F   (1143)
where W ∈ RD×L is orthonormal. Solution: assign each column W:,ℓ to the eigenvector with ℓ'th largest eigenvalue of Σ̂ = (1/N) Σi xi xiT, assuming E[x] = 0. Then we compute ẑi := WT xi.
255
It’s easier to think of this graphically. Our model asserts that xi ⊥ xj6=i | z. Independence implies zero
correlation, and we cement this by constraining Ψ to be diagonal. See your related note on LFMs, chapter 13
of the DL book.
Proof: PCA
Case: L=1. Goal: Find the best 1-d solution, w ∈ RD, zi ∈ R, z ∈ RN. Remember that ||w||2 = 1.
J(w, z) = (1/N) Σi=1..N ||xi − zi w||2 = (1/N) Σi [ xiT xi − 2 zi wT xi + zi2 ]   (1144)
∂J/∂zi = 0 → zi = wT xi   (1145)
J(w) = (1/N) Σi xiT xi − (1/N) Σi zi2 = const − (1/N) Σi zi2   (1146)
∴ arg minw J(w) = arg maxw Var[z̃]   (1147)
This shows why PCA finds directions of maximal variance – aka the analysis view of PCA. Before finding the optimal w, don't forget the Lagrange multiplier for constraining unit norm,
J̃(w) = wT Σ̂ w + λ(wT w − 1)   (1148)
Singular Value Decomposition (SVD) (12.2.3). Any real N × D matrix X can be decomposed as
X = U S VT,  with U (N × N), S (N × D), and VT (D × D)   (1149)
where the columns of U are the left singular vectors, and the columns of V are the right singular vectors. Economy-sized SVD will shrink U to be N × D and S to be D × D (we're assuming N > D).
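A small sketch (mine) tying the two sections together: classical PCA via the economy SVD of the centered data matrix; the columns of V are the principal directions and the scores are Z = XV, matching the eigenvector solution in 12.2.1.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # N x D toy data
Xc = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)   # economy-sized SVD
L = 2
W = Vt[:L].T                 # D x L loading matrix (orthonormal columns)
Z = Xc @ W                   # N x L scores; same as U[:, :L] * s[:L]

# eigenvalues of the empirical covariance are s^2 / N
print(s**2 / len(X))
```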
Machine Learning: A Probabilistic Perspective August 11, 2018
Hidden Markov Models (17.3). HMMs model a joint distribution over a sequence of T
observations xh1...T i and hidden states zh1...T i ,
p(zh1...T i, xh1...T i) = p(zh1...T i) p(xh1...T i | zh1...T i) = [ p(z1) Πt=2..T p(zt | zt−1) ] [ Πt=1..T p(xt | zt) ]   (1150)
where each hidden state is discrete: zt ∈ {1, . . . , K}, while each observation xt can be discrete
or continuous.
The Forwards Algorithm (17.4.2). Goal: compute the filtered256 marginals p(zt | xh1...ti ).
1. Prediction step. Compute the one-step-ahead predictive density p(zt | xh1...t−1i ),
p(zt = j | xh1...t−1i) = Σi p(zt = j, zt−1 = i | xh1...t−1i)   (1151)
which will serve as our prior for time t (since it does not take into account observed data
at time t).
2. Update step. We “update” our beliefs by observing xt ,
Forwards Algorithm (Algorithm 17.1)
We are given transition matrix Ti,j = p(zt = j | zt−1 = i), evidence vectors ψt (j) = p(xt | zt =j), and
initial state distribution π(j) = p(z1 = j).
1. First, compute the initial [α1, Z1] = normalize(ψ1 ⊙ π).
2. For time 2 ≤ t ≤ T, compute [αt, Zt] = normalize(ψt ⊙ (TT αt−1)).
3. Return αh1...T i and log p(xh1...T i) = Σt log Zt.
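A direct NumPy translation of the box above (my own sketch; toy numbers assumed):

```python
import numpy as np

def forwards(T, psi, pi):
    """T[i,j]=p(z_t=j|z_{t-1}=i); psi[t,j]=p(x_t|z_t=j); pi[j]=p(z_1=j)."""
    Tlen, K = psi.shape
    alpha = np.zeros((Tlen, K))
    logZ = 0.0
    a = psi[0] * pi                        # step 1
    for t in range(Tlen):
        if t > 0:
            a = psi[t] * (T.T @ alpha[t - 1])   # step 2
        Z = a.sum()
        alpha[t] = a / Z
        logZ += np.log(Z)
    return alpha, logZ                     # filtered marginals and log p(x_{1:T})

T = np.array([[0.9, 0.1], [0.2, 0.8]])
psi = np.array([[0.5, 0.1], [0.4, 0.3], [0.1, 0.9]])
pi = np.array([0.5, 0.5])
alpha, logZ = forwards(T, psi, pi)
```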
Using the same notation as Algorithm 17.1 above, the matrix-vector form for β is
βt = T (ψt βt+1 ) (1166)
with base case βT = 1.
The Viterbi Algorithm (17.4.4). Denote the [probability for] the most probable path leading
to zt =j as
with initialization of δ1 (j) = πj ψ1 (j). We compute this until termination at zT∗ = arg maxi δT (i).
Note the arg max here instead of a max – we keep track of both for all time steps. We do this
so we can perform traceback to get the full most probable state sequence, starting at T and
ending at t = 1:
zt∗ = at+1(z∗t+1)   (1169)
where at (j), the most probable state at time t − 1 leading to state j at time t, is the same
formula as δt (j) but with an arg max.
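A short sketch (mine) of Viterbi with the traceback described above, reusing the same toy HMM as before:

```python
import numpy as np

def viterbi(T, psi, pi):
    Tlen, K = psi.shape
    delta = np.zeros((Tlen, K))
    a = np.zeros((Tlen, K), dtype=int)
    delta[0] = pi * psi[0]
    for t in range(1, Tlen):
        scores = delta[t - 1][:, None] * T      # scores[i, j] for prev state i
        a[t] = scores.argmax(axis=0)            # most probable predecessor of j
        delta[t] = psi[t] * scores.max(axis=0)
    z = np.zeros(Tlen, dtype=int)
    z[-1] = delta[-1].argmax()
    for t in range(Tlen - 2, -1, -1):           # traceback: z*_t = a_{t+1}(z*_{t+1})
        z[t] = a[t + 1, z[t + 1]]
    return z

T = np.array([[0.9, 0.1], [0.2, 0.8]])
psi = np.array([[0.5, 0.1], [0.4, 0.3], [0.1, 0.9]])
pi = np.array([0.5, 0.5])
print(viterbi(T, psi, pi))
```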
Machine Learning: A Probabilistic Perspective August 11, 2018
Learning (19.5). Consider a MRF in log-linear form over C cliques and its log-likelihood
(scaled by 1/N):
p(y | θ) = (1/Z(θ)) exp{ Σc θcT φc(y) }   (1170)
ℓ(θ) = (1/N) Σi [ Σc θcT φc(yi) − log Z(θ) ]   (1171)
We know from chapter 9 that this log-likelihood is concave in θ, and that ∂ log Z/∂θc = E[φc]. So the gradient of the LL is
∂ℓ/∂θc = (1/N) Σi [ φc(yi) − Ep(y|θ)[φc(y)] ]   (1172)
Note that the first (“clamped”) term only needs to be computed once for any yi , it is completely
independent of any parameters. It’s just evaluating feature functions, which e.g. for CRFs are
often all indicator functions.
CRF Training (19.6.3). For [linear-chain] CRFs, the equations change slightly (but importantly).
ℓ(w) := (1/N) Σi [ Σc wcT φc(yi, xi) − log Z(w, xi) ]   (1173)
∂ℓ/∂wc = (1/N) Σi [ φc(yi, xi) − Ep(y|xi)[φc(y, xi)] ]   (1174)
It’s important to recognize that the gradient of the log partition function now must be com-
puted for each instance xi .
Convex Optimization
Contents
Convex Optimization August 19, 2018
Affine sets
A set C ⊆ Rn is affine if the line (not just segment) through any two distinct points in C also lies in C. More generally, this implies that for any set of points {x1, . . . , xk}, with each xi ∈ C, all affine combinations,
Σi=1..k θi xi,  where Σi θi = 1   (1177)
are also in C. There's a lot of neat things to say here, but I only have space to state the results:
• If C is an affine set and x0 ∈ C, then the set
  V = C − x0 := {x − x0 | x ∈ C}   (1179)
  is a subspace.
• The solution set of a system of linear equations, C = {x | Ax = b}, is an affine set. The subspace associated with C is the nullspace of A.
Convex sets
A set C is convex if it contains all convex combinations,
Σi=1..k θi xi,  where Σi θi = 1 and θi ≥ 0   (1180)
Related terminology:
• The convex hull of a set C, denoted conv C, is the set of all convex combinations of points in C.
Cones
A set C is called a cone if (∀x ∈ C)(θ ≥ 0) we have θx ∈ C. A set C is called a convex
cone if it contains all conic combinations,
Σi=1..k θi xi,  where θi ≥ 0   (1181)
Bayesian Data Analysis
Contents
Bayesian Data Analysis November 22, 2017
The process of Bayesian Data Analysis can be divided into the following 3 steps:
1. Setting up a full probability model.
2. Conditioning on observed data. Calculating the posterior distribution over unobserved
quantities, given observed data.
3. Evaluating the fit of the model and implications of the posterior.
Notation: In general, we let θ denote unobservable vector quantities or population parameters of interest, and y denote collected data. This means our posterior takes the form p(θ | y), and our likelihood takes the form p(y | θ).
Proofs
E[u] = ∫ dv Pr[v] ∫ du u Pr[u | v]   (1184)
Transformation of Variables.
Prv[v] = |J| Pru[f−1(v)]   (1190),   where Ji,j = ∂ui/∂vj = ∂[f−1(v)]i/∂vj
where u and v have the same dimensionality, and |J| is the determinant of the Jacobian of the
transformation u = f −1 (v). When working with parameters defined on the open unit interval,
(0, 1), we often use the logistic transformation:
logit(u) = log( u / (1 − u) )   (1191)
logit−1(v) = ev / (1 + ev)   (1192)
Standard Probability Distributions258.
258 The gamma function, Γ(x), is defined as Γ(x) = ∫0∞ tx−1 e−t dt (1193), or simply as (x − 1)! if x ∈ Z+.
Bayesian Data Analysis November 22, 2017
Since we’ll be referring to the binomial model frequently, below are the main distributions for
reference:
p(y | θ) = Bin(y | n, θ) = (n choose y) θy (1 − θ)n−y   (1194)
p(θ | y) ∝ θy (1 − θ)n−y   (1195)
θ | y ∼ Beta(y + 1, n − y + 1)   (1196)
where y is the number of successes out of n trials. We assume a uniform prior over the interval
[0, 1].
Posterior as compromise between data and prior information. Intuitively, the prior
and posterior distributions over θ should have some general relationship showing how the
process of observing data y updates our distribution on θ. We can use the identities from the
previous chapter to see that
E[θ] = E[ E[θ | y] ]   (1197)
var(θ) = E[ var(θ | y) ] + var( E[θ | y] )   (1198)
where:
→ Equation 1197 states the obvious: our prior expectation for θ is the expectation, taken over
the distribution of possible data, of the posterior expectation.
→ Equation 1198 states: the posterior variance is on average smaller than the prior variance,
by an amount that depends on the variation [in posterior means] over the distribution of
possible data. Stated another way: we can reduce our uncertainty with regard to θ by
larger amounts with models whose [expected] posteriors are strongly informed by the data.
Informative Priors. We now discuss some of the issues that arise in assigning a prior distribution p(θ) that reflects substantive information. Instead of using a uniform prior for our binomial model, let's explore the prior θ ∼ Beta(α, β)259. Now our posterior takes the form
θ | y ∼ Beta(α + y, β + n − y)   (1199)
The property that the posterior follows the same parametric form as the prior is called con-
jugacy.
259
Assume we can select reasonable values for α and β.
Conjugacy, families, and sufficient statistics. Formally, if F is a class of sampling dis-
tributions {Pri [y | θ]}, and P is a class of prior distributions, {Prj [θ]}, then the class P is
conjugate for F if260
Probability distributions that belong to an exponential family have natural conjugate prior
distributions. The class F is an exponential family if all its members have the form,
Pr[yi | θ] = f(yi) g(θ) exp( φ(θ)T u(yi) )   (1201)
Pr[y | θ] = ( Πi=1..n f(yi) ) g(θ)n exp( φ(θ)T Σi=1..n u(yi) )   (1202)
          ∝ g(θ)n exp( φ(θ)T t(y) )   (1203)
where
• y = (y1 , . . . , yn ) denotes n iid observations.
• φ(θ) is called the natural parameter of F.
• t(y) = Σi=1..n u(yi) is said to be a sufficient statistic for θ, because the likelihood for θ depends on the data y only through t(y).
By definition, this implies that the posterior should also be normal. Indeed, after some basic
arithmetic/substitutions, we find
Pr[θ | y] ∝ exp( −(1/(2τ12)) (θ − µ1)2 )   (1207)   (writing our posterior precision and mean)
µ1 = ( µ0/τ02 + y/σ2 ) / ( 1/τ02 + 1/σ2 )   and   1/τ12 = 1/τ02 + 1/σ2   (1208)
260
In English: A class of prior distributions is conjugate for a class of sampling distributions if, for any
pair of sampling distribution and prior distribution [from those two respective classes], the associated posterior
distribution is also in the same class of prior distributions.
261 The following will be useful to remember:
∫ from −∞ to ∞ of exp(−ax2 + bx + c) dx = sqrt(π/a) · exp( b2/(4a) + c )   (1204)
where we see that the posterior precision (inverse of variance) equals the prior precision plus the data precision. We can see the posterior mean µ1 expressed as a weighted average of the prior mean and the observed value262 y, with weights proportional to the precisions.
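A tiny sketch (mine, toy numbers) of eq. (1208), showing the precision-weighted compromise:

```python
def normal_known_variance_update(mu0, tau0_sq, y, sigma_sq):
    # posterior precision = prior precision + data precision
    prec1 = 1.0 / tau0_sq + 1.0 / sigma_sq
    mu1 = (mu0 / tau0_sq + y / sigma_sq) / prec1
    return mu1, 1.0 / prec1          # posterior mean and variance

# vague prior -> posterior hugs the observation; tight prior -> it barely moves
print(normal_known_variance_update(mu0=0.0, tau0_sq=100.0, y=3.0, sigma_sq=1.0))
print(normal_known_variance_update(mu0=0.0, tau0_sq=0.01,  y=3.0, sigma_sq=1.0))
```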
Normal distribution with unknown variance. Now, we assume the mean θ is known, and
the variance σ 2 is unknown. The likelihood for a vector y = (y1 , . . . , yn ) of n iid observations
is
Pr[y | σ2] ∝ (σ2)−n/2 exp( −(n/(2σ2)) v )   (1209)   (computing our likelihood for n IID observations)
v := (1/n) Σi=1..n (yi − θ)2   (1210)
where v is the sufficient statistic. The corresponding conjugate prior density is the inverse-
gamma. This and our choice for parameterization (how we define α and β) are, respectively,
Pr[σ2] ∝ (σ2)−(α+1) exp(−β/σ2)   (1211)   (defining our conjugate prior)
Jeffreys' prior defines the noninformative prior density as Pr[θ] ∝ [J(θ)]1/2. We can work out that this model is indeed invariant to reparameterization263.
262 For now, we're considering the single data point case.
263 Evaluate J(φ) at θ = h−1(φ). You should find that J(φ)1/2 = J(θ)1/2 |dθ/dφ|.
Bayesian Data Analysis December 02, 2017
where θ̂ is the posterior mode. The remainder terms of higher order fade in importance relative
to the quadratic term when θ is close to θ̂ and n is large. We’d like to cast this into a normal
distribution. First, let
I(θ) := −(d2/dθ2) log Pr[θ | y]   (1218)
which we will refer to as the observed information. We can then rewrite our approximation
as264
264 I also found it helpful to explicitly write the substitution after raising eq. 1217 to the power of e (all logs are assumed natural):
Pr[θ | y] ≈ exp( log Pr[θ̂ | y] − (1/2)(θ − θ̂)T I(θ̂) (θ − θ̂) )   (1219)
          = Pr[θ̂ | y] · exp( −(1/2)(θ − θ̂)T I(θ̂) (θ − θ̂) )   (1220)
Example. Let y1 , . . . , yn be independent observations from N (µ, σ 2 ). Define θ := (µ, log σ)
as the parameters of interest, and assume a uniform prior265 Pr [θ]. Recall that (equation 3.2
in textbook)
Pr[θ = (µ, log σ) | y] ∝ σ−(n+2) exp( −(1/(2σ2)) Σi=1..n (yi − µ)2 )   (1223)
  = σ−(n+2) exp( −(1/(2σ2)) [ n(ȳ − µ)2 + Σi=1..n (yi − ȳ)2 ] )   (1224)   (note that Σi (yi − ȳ) = 0)
  = σ−(n+2) exp( −(1/(2σ2)) [ n(ȳ − µ)2 + (n − 1)s2 ] )   (1225)
where s2 = (1/(n−1)) Σi=1..n (yi − ȳ)2 is the sample variance of the yi's. The sufficient statistics are ȳ and s2. To construct the approximation, we need the second derivatives of the log posterior density266,
log Pr[µ, log σ | y] = const − n log σ − (1/(2σ2)) [ n(ȳ − µ)2 + (n − 1)s2 ]   (1226)
in order to compute I(θ̂). After computing first derivatives, we find that the posterior mode is
θ̂ = (µ̂, log σ̂) = ( ȳ, log( s √((n−1)/n) ) )   (1227)
We then compute second derivatives and evaluate at θ = θ̂ to obtain I(θ̂). Combining all this
into the final result:
where σ̂ 2 /n is the variance along the µ dimension, and 1/(2n) is the variance along the log σ
direction. This example was just meant to illustrate, with a simple case, how we work through
constructing the approximate normal distribution.
265 Recall from Ch 3.2 that the uniform prior on (µ, log σ) corresponds to
Pr[µ, σ2] ∝ (σ2)−1   (1222)
Large-Sample Theory. Asymptotic normality of the posterior distribution: as more data
arrives from the same underlying distribution f (y), the posterior distribution of the parameter
vector θ approaches multivariate normality, even if the true distribution of the data is not
within the parametric family under consideration.
Suppose the data are modeled by a parametric family, Pr [y | θ], with a prior distribution
Pr [θ]. If the true distribution, f (y), is included in the parametric family (i.e. if ∃θ0 : f (y) =
Pr [y | θ0 ]), then it’s also true that consistency holds267 : Pr [θ | y] converges to a point mass
at the true parameter value, θ0 as n → ∞.
267
So, what if the true f (y) is not included in the parametric family? In that case, there is no longer a true
value θ0 , but its role in the theoretical result is replaced by a value θ0 that makes the model distribution Pr [y | θ]
closest to the true distribution f (y), in a technical sense involving the Kullback-Leibler divergence.
Bayesian Data Analysis July 29, 2018
Since I like to begin by motivating what we’re going to talk about, and since BDA doesn’t
really do this, I’m going to start with an excerpt from chapter 15 of Kevin Murphy’s book:
In supervised learning, we observe some inputs xi and some outputs yi . We assume that
yi = f (xi ), for some unknown function f , possibly corrupted by noise. The optimal ap-
proach is to infer a distribution over functions given the data, p(f | X, y), and then to use
this to make predictions given new inputs, i.e., to compute
p(y∗ | x∗, X, y) = ∫ p(y∗ | f, x∗) p(f | X, y) df   (1230)
Gaussian Processes or GPs define a prior over functions p(f ) which can be converted
into a posterior over functions p(f | X, y) once we’ve seen some data.
Apparently, we only need to consider a finite (and arbitrary) set of points x1, . . . , xn when evaluating any given µ. The GP prior on µ is defined as
with mean m and covariance K 269 . The covariance function k specifies the covariance between
the process at any two points, with K an n × n covariance matrix with Kp,q = k(xp , xq ). The
covariance function controls the smoothness of realizations from the GP270 and the degree of
shrinkage towards the mean.
268
What if there are like, a whole space of different predictors, man? Like, what if there is an infinite sea of
predictor functions, all with their own unique traits and quirks? Woah.
269
Don’t confuse the notation – N uses covariance K as an argument, while GP uses covariance function k
as an argument.
270
In English: How similar we expect different samples of µ to look as a function of x. The reason this was
weird to think about at first is because I’m used to thinking about covariance/smoothness over x rather than
sampled functions of x. Meta.
381
A common choice is the squared exponential,
k(x, x′) = τ2 exp( −|x − x′|2 / (2ℓ2) )   (1232)
Gaussian Processes for Machine Learning
Contents
Gaussian Processes for Machine Learning July 29, 2018
Regression (Ch. 2)
Table of Contents Local Written by Brandon McKinzie
Rasmussen and Williams (2006). Regression. Gaussian Processes for Machine Learning.
Weight-space view (2.1). We review the standard probabilistic view of linear regression.
f(x) = xT w   (1233)
y = f(x) + ε,  where ε ∼ N(0, σn2)   (1234)
[likelihood]  y | X, w ∼ N(XT w, σn2 I),  with X ∈ Rd×n   (1235)
[prior]  w ∼ N(0, Σp)   (1236)
[posterior]  p(w | y, X) = p(y | w, X) p(w) / p(y | X)   (1237)
             ∼ N( (1/σn2) A−1 X y,  A−1 )   (1238)
where A = σn−2 X XT + Σp−1, and we often set Σp = I. When analyzing the contour plots for the likelihood, remember that it is not a probability distribution, but rather it's interpreted as a function of w, i.e. likelihood(w) := N(y; XT w, σn2 I).
Note that we often use Bayesian techniques without realizing it. For example, what does the
following remind you of?
ln p(w) ∝ −(1/2) wT Σp−1 w   (1239)
It’s the l2 penalty from ridge regression (where typically Σp = I). We can also project the
inputs to a higher-dimensional space, often referred to as the feature space, by passing them
through feature functions φ(x). As we’ll see later (ch 5), GPs actually tell us how to define
the basis functions hi (x) which define the value of φi (x), the ith element of the feature vector.
The author then proceeds to give an overview of the kernel trick.
Function-space view (2.2). Instead of inference over parameters w, we can equivalently
consider inference in function space with a Gaussian process (GP), formally defined as
A Gaussian process is a collection of random variables, any finite number of which have a
joint Gaussian distribution.
A GP is completely specified by its mean function and covariance function. Define the mean function m(x) and covariance function k(x, x′) of a real [Gaussian] process f(x) as
m(x) = E[f(x)],   k(x, x′) = E[(f(x) − m(x))(f(x′) − m(x′))]
In other words, the expectation is over the space of possible functions, each evaluated at point
x. Concretely, this is often an expectation over the parameters w, as is true for our linear
regression example. We can now write our Bayesian linear regression model (with feature
functions) as a GP.
A common covariance function is the squared exponential (a.k.a. the RBF kernel),
1
k(xp , xq ) = Cov [f (xp ), f (xq )] = exp(− |xp − xq |2 ) (1245)
2
where it's important to recognize that, whereas we've usually seen this as the RBF kernel for the purposes of kernel methods on inputs, we are now using it to specify the covariance of the outputs.
Ok, so how do we sample some functions and plot them? Below is an overview for our current
linear regression running example.
1. Choose a number of input points X∗ . For our linear regression example, we could set
this to np.arange(-5, 5.1, 0.1) to get evenly spaced x in [−5, 5] in intervals of 0.1.
2. Write out the covariance matrix defined by Kp,q = k(xp , xq ) using our squared exponen-
tial covariance function, for all pairs of inputs.
3. We can now generate samples of function f , represented as a random vector with size
equal to the number of inputs |X∗ |, by sampling from the GP prior
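Putting the three steps above together, here is a minimal sketch (mine, not from the book) of sampling functions from the zero-mean GP prior with the squared-exponential kernel:

```python
import numpy as np

def sq_exp_kernel(a, b, ell=1.0):
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / ell**2)

X_star = np.arange(-5, 5.1, 0.1)                  # step 1: test inputs
K = sq_exp_kernel(X_star, X_star)                 # step 2: covariance matrix
jitter = 1e-8 * np.eye(len(X_star))               # for numerical stability
rng = np.random.default_rng(0)
f_samples = rng.multivariate_normal(              # step 3: draw from N(0, K)
    mean=np.zeros(len(X_star)), cov=K + jitter, size=3)
# each row of f_samples is one "function" evaluated at X_star
```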
So far, we’ve only dealt with the GP prior. What do we do when we get labeled training
observations? How do we make predictions on unlabeled test data? Well, for the simple case
where our observations are noise free271 , that is we know {(xi , fi ) | i = 1, . . . , n}, the joint
271
For example, noise-free linear regression would mean we model y=f(x), implicitly defining ε = 0.
GP prior over the train set inputs X and test set inputs X∗ is defined exactly how we did
it earlier (zero mean, elementwise evaluation of k). In other words, our GP prior models the
train outputs f and test outputs f∗ as random vectors sampled via
[f, f∗] ∼ N( 0, [ K(X, X)  K(X, X∗) ;  K(X∗, X)  K(X∗, X∗) ] )   (1247)
You may be wondering: why are we talking about sampling from the prior on inputs? We
already know the outputs!, and you’d be correct. The way we obtain our posterior is by
restricting our joint prior to only those functions that agree with the observed training data
X, f, which we can do by simply conditioning on them. Our posterior for sampling test outputs
given test inputs X∗ is thus
f∗ | X∗, X, f ∼ N( K(X∗, X) K(X, X)−1 f,   K(X∗, X∗) − K(X∗, X) K(X, X)−1 K(X, X∗) )   (1248)
Blogs
Contents
Blogs December 21, 2016
The title is inspired by the following figure. Colah mentions how groups of neurons, like A,
that appear in multiple places are sometimes called modules, and networks that use them are
sometimes called modular neural networks. You can feed the output of one convolutional layer
into another. With each layer, the network can detect higher-level, more abstract features.
Blogs December 21, 2016
Understanding Convolutions
Table of Contents Local Written by Brandon McKinzie
Imagine we drop a ball from some height onto the ground, where it only has one dimension of
motion. How likely is it that a ball will go a distance c if you drop it and then drop it again from
above the point at which it landed?
From basic probability, we know the result is a sum over possible outcomes, constrained by
a + b = c. It turns out this is actually the definition of the convolution of f and g.
Pr(a + b = c) = Σa+b=c f(a) · g(b)   (1249)
(f ∗ g)(c) = Σa+b=c f(a) · g(b)   (1250)
           = Σa f(a) · g(c − a)   (1251)
Visualizing Convolutions. Keeping the same example in the back of our heads, consider a
few interesting facts.
• Flipping directions. If f (x) yields the probability of landing a distance x away from
where it was dropped, what about the probability that it was dropped a distance x from
where it landed? It is f (−x).
We can relate these ideas to image recognition. Below are two common kernels used to convolve
images with.
On the left is a kernel for blurring images, accomplished by taking simple averages. On the
right is a kernel for edge detection, accomplished by taking the difference between two pixels,
which will be largest at edges, and essentially zero for similar pixels.
Blogs December 23, 2016
Reinforcement Learning. Vulnerable to the credit assignment problem - i.e. unsure which
of the preceding actions was responsible for getting some reward and to what extent. Also
need to address the famous explore-exploit dilemma when deciding what strategies to use.
Markov Decision Process. Most common method for representing a reinforcement problem.
MDPs consist of states, actions, and rewards. Total reward is sum of current (includes previous)
and discounted future rewards:
Q - learning. Define function Q(s, a) to be best possible score at end of game after performing
action a in state s; the “quality” of an action from a given state. The recursive definition of Q
(for one transition) is given below in the Bellman equation.
Q(s, a) = r + γ maxa′ Q(s′, a′)
Deep Q Network. Deep learning can deal with issues related to prohibitively large state spaces. The implementation chosen by DeepMind was to represent the Q-function with a
neural network, with the states (pixels) as the input and Q-values as output, where the number
of output neurons is the number of possible actions from the input state. We can optimize
with simple squared loss:
2. Second forward pass from s0 and again compute maxa0 Q(s0 , a0 ).
3. Set target output for each action a0 from s0 . For the action corresponding to max
(from step 2) set its target as r + γ maxa0 Q(s0 , a0 ), and for all other actions set target to
same as originally returned from step 1, making the error 0 for those outputs. (Interpret
as update to our guess for the best Q-value, and keep the others the same.)
4. Update weights using backprop.
Experience Replay. This is the most important trick for helping convergence of Q-values when
approximating with non-linear functions. During gameplay all the experience < s, a, r, s0 > are
stored in a replay memory. When training the network, random minibatches from the replay
memory are used instead of the most recent transition.
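A minimal sketch (mine, not DeepMind's actual code) of such a replay memory as a data structure:

```python
# Store <s, a, r, s'> tuples and sample random minibatches instead of the
# most recent transition.
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # old transitions are evicted

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

memory = ReplayMemory()
for t in range(100):
    memory.push(s=t, a=t % 4, r=1.0, s_next=t + 1)
batch = memory.sample(32)
```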
Exploration. One could say that initializing the Q-values randomly and then picking the
max is essentially a form of exploitation. However, this type of exploration is greedy, which
can be tamed/fixed with ε-greedy exploration. This incorporates a degree of randomness
when choosing next action at all time-steps, determined by probability ε that we choose the
next action randomly. For example, DeepMind decreases ε over time from 1 to 0.1.
Blogs January 15, 2017
Overview.
• Model. Implementing a retrieval-based model. Input: conversation/context c. Output:
response r.
• Data. Ubuntu Dialog Corpus (UDC). 1 million examples of form (context, utterance,
label). The label can be 1 (utterance was actual response to the context) or a 0 (utterance
chosen randomly). Using NLTK, the data has been . . .
→ Tokenized: dividing strings into lists of substrings.
→ Stemmed. IDK
→ Lemmatized. IDK
The test/validation set consists of (context, ground-truth utterance, [9 distractors (incorrect utterances)]). The distractors are picked at random272.
Dual-Encoder LSTM.
1. Inputs. Both the context and the response text are split by words, and each word is
embedded into a vector and fed into the same RNN.
2. Prediction. Multiply the [vector representation ("meaning")] c with param matrix M
to predict some response r0 .
3. Evaluation. Measure similarity of predicted r0 to actual r via simple dot product. Feed
this into sigmoid to obtain a probability [of r0 being the correct response]. Use (binary)
cross-entropy for loss function:
L = −y · ln(y 0 ) − (1 − y) · ln(1 − y 0 ) (1253)
where y 0 is the predicted probability that r0 is correct response r, and y ∈ {0, 1} is the
true label for the context-response pair (c, r).
272
Better example/approach: Google’s Smart Reply uses clustering techniques to come up with a set of
possible responses.
Data Pre-Processing. Courtesy of WildML, we are given 3 files after preprocessing: train.tfrecords,
validation.tfrecords, and test.tfrecords, which use TensorFlow’s ’Example’ format. Each Ex-
ample consists of . . .
• context: Sequence of word ids.
• context_len: length of the aforementioned sequence.
• utterance: seq of word ids representing utterance (response).
• utterance_len.
• label: only in training data. 0 or 1.
• distractor_[N]: Only in test/validation. N ranges from 0 to 8. Seq of word ids reppin
the distractor utterance.
• distractor_[N]_len.
Blogs April 04, 2017
[Link to article]
For convenience, I’ll rewrite the familiar equations for computing quantities at some step i.
Now we can see just how simple this really is. Recall that Bahdanau et al., 2015 use the
wording: “eij is an alignment model which scores how well the inputs around position j and
the output at position i match.” But we can see an example implementation of an alignment
model above: the tanh function (that’s it).
Appendix
Contents
Common Distributions and Models
Table of Contents Local Written by Brandon McKinzie
Continuous Distributions. All of the following have support θ > 0. Recall that (∀n ∈ Z+) Γ(n) = (n − 1)!, and Γ(z) = ∫0∞ xz−1 e−x dx.
• Chi-square: p(θ) = (2−ν/2 / Γ(ν/2)) θν/2−1 e−θ/2,   θ ∼ χ2ν
• Gamma: p(θ) = (βα / Γ(α)) θα−1 e−βθ,   θ ∼ Gamma(α, β)
• Inverse-gamma: p(θ) = (βα / Γ(α)) θ−α−1 e−β/θ,   θ ∼ Inv-gamma(α, β)
• Inverse-chi-square: p(θ) = (2−ν/2 / Γ(ν/2)) θ−ν/2−1 e−1/(2θ),   θ ∼ Inv-χ2ν
Discrete Distributions.
• Bernoulli: p(x; θ) = θ1[x=1] (1 − θ)1[x=0],   X ∼ Ber(θ)
• Binomial: p(x; n, p) = (n choose x) px (1 − p)n−x,   x ∼ Bin(n, p)
• Multinomial: p(x1, . . . , xk; n) = ( n! / Πi xi! ) Πi pixi
Logistic Regression. Perhaps the simplest linear method273 for classification is logistic regression. Let K be the number of classes that y can take on. The model is defined as
Pr[y = k | x] = exp(θkT x) / ( 1 + Σℓ=1..K−1 exp(θℓT x) ),   for 1 ≤ k ≤ K − 1   (1258)
Pr[y = K | x] = 1 / ( 1 + Σℓ=1..K−1 exp(θℓT x) )   (1259)
and we often denote Pr[y = k | x] under the entire set of parameters θ simply as pk(x; θ) or just pk(x). The decision boundaries are the set of points in the domain of x for which some pk(x) = pj≠k(x). Equivalently, the model can be specified by K − 1 log-odds or logit
273 We say a classification method is linear if its decision boundary is linear.
transformations of the form
log( pi(x) / pK(x) ) = θiT x,   for 1 ≤ i ≤ K − 1   (1260)
Also note that θa − θb is orthogonal to the decision boundary between classes a and b (and since θK = 0 implicitly, each θi is orthogonal to the boundary between class i and class K). For any x, x0 on the decision boundary defined as the set of points {x : pa(x) = pb(x)}, we know that the vector x − x0 is parallel to the decision boundary (by definition), and can derive
pa(x)/pb(x) = 1 = exp(θaT x − θbT x)  =⇒  θaT x = θbT x   (1261)
∴ (θa − θb)T (x − x0) = 0   (1262)
and thus θa − θb is perpendicular to the decision boundary where pa(x) = pb(x).
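A concrete instance of eqs. (1258)–(1259), as a small sketch of mine (toy parameters assumed):

```python
import numpy as np

def class_probs(Theta, x):
    """Theta: (K-1, D) parameter vectors; returns a length-K probability vector.
    Class K is the reference class with an implicit theta_K = 0."""
    scores = Theta @ x                       # theta_k^T x for k = 1..K-1
    denom = 1.0 + np.exp(scores).sum()
    return np.append(np.exp(scores), 1.0) / denom

Theta = np.array([[1.0, -0.5],
                  [0.3,  0.8]])              # K = 3 classes, D = 2 features
x = np.array([0.2, 1.0])
p = class_probs(Theta, x)
print(p, p.sum())                            # sums to 1
```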
Math
Table of Contents Local Written by Brandon McKinzie
Fancy math definitions/concepts for fancy authors who require fancy terminology.
• Support. Sloppy definition you’ll see in most places: The set-theoretic support of a
real-valued function f : X 7→ R is defined as
supp(f ) , {x ∈ X : f (x) 6= 0}
Note that Wikipedia is surprisingly sloppy with how it defines and/or uses support in
various articles. After some digging, I finally found the formal definition for probability
and measure theory:
If X : Ω 7→ R is a random variable on (Ω, F, P ), then the support of X is the smallest
closed set RX ⊂ R such that P (X ∈ RX ) = 1.
• Infimum and Supremum. The fancy-math way of saying minimum and maximum.
Yes, I recognize that these are important in certain (usually rather abstract) settings,
but often in ML it is used when sup means exactly the same thing as max, but the
authors want to look sophisticated. Here I’ll give the formal definition for sup. You have
a partially ordered set274 P , and are for the moment checking out some subset S ⊆ P .
Someone asks you, “hey, give me an upper bound of S.” You just gotta find some b ∈ P
(the larger/global set) that is greater than or equal to every element in S. The person
then comes back and says “ok no, I need the supremum of S.” Now you need to find the
smallest value out of all the possible upper bounds.
Hopefully it is clear why this is only relevant in real-valued cases where the “edges” aren’t
well-defined.
• Probability Measure. Informal definition: a probability distribution275. Formal definition: a function µ : α 7→ [0, 1] from events to scalar values, where µ(α) = 1 if α = Ω (the full space) and µ(∅) = 0. Also µ must satisfy the countable additivity property: µ(∪i αi) = Σi µ(αi) for pairwise disjoint sets {αi}.
274 A partially ordered set (P, ≤) is a set P together with an order relation ≤ that is reflexive, antisymmetric, and transitive; not every pair of elements needs to be comparable.
275
See this great answer detailing how the difference between “measure” and “distribution” is basically just
context.
Linear Algebra. Feeling like I need a quick recap from my adv. linalg course and an area
where I can ramble my interpretations. In what follows, let V and W be vector spaces over
some field F .
Linear Transformation
A function T : V 7→ W is called a linear transformation from V to W if ∀x, y ∈ V and
∀c ∈ F :
• T (x + y) = T (x) + T (y).
• T (cx) = cT (x).
Now suppose that V and W have ordered bases β = {v1 , . . . , vn } and γ = {w1 , . . . , wm },
respectively. Then for each basis vector vj , there exist unique scalars aij ∈ F such that
T(vj) = Σi=1..m aij wi   (1263)
Remember that each vj and wi are members of a vector space (they are not scalars). And also
be careful to not associate the representation of any vector space element with its coordinate
vector relative to a specific ordered basis, which itself is a different linear transformation from
some V 7→ F n . Keep it abstract.
Matrix Representation of a Linear Transformation
We call the m × n matrix A defined by Aij = aij the matrix representation of T in the
ordered bases β and γ and write A = [T ]γβ . If V = W and β = γ, then A = [T ]β .
Given this definition, I think it’s wise to interpret matrix representations by the column-vector
point of view. Each column vector [T (vj )]γ , read as “the coordinate vector of T (vj ) relative to
ordered basis γ,” tells you how each basis vector vj in domain V gets mapped to a [coordinate]
vector in output space W [relative to a given ordered basis γ]. For some reason, my brain has
always had a hard time with the fact that the matrix row indices i correspond to output space,
while the column indices j represent the input space. The easiest way (I think) to help ground
this the right way is to remember that LA (x) , Ax, i.e. the operator view. At the same time,
notice how the effect of Ax is to take successive linear combinations over each element of x.
I just realized another reason why the interpretation felt backwards to my brain: when we are
taught matrix multiplication, we do the computations Ax in our heads “row by row” along A,
taking inner products of the row with x, so I’ve been taught to think of the rows as the main
“units” of the matrix. I’m not sure how to fix this, but notice that the matrix is basically just
a blueprint/roadmap/whatever you want to call it for taking coordinate vectors in one basis
to coordinate vectors in another basis. It’s really important to remember the coefficients of A
are intimately tied to the input/output bases.
AHA. I’ve been thinking about this all wrong. For the longest time, I’ve been trying to force
an interpretation of matrix multiplication that “feels” like scalar multiplication. I realize now
that this is going about it all the wrong way. Matrix multiplication need only be considered from
the lens of a linear transformation. After all, that’s exactly the purpose of matrices anyway!
It’s so glaringly obvious from the paragraphs above, but I guess I never took them seriously
enough. Matrices are simply convenient ways for us to write down linear transformations on
vectors in a given [ordered] basis. The jth column of the matrix defines how the original jth
basis vector is transformed. AHA (again). Now I see why I missed this crucial connection
– everything above focuses on the formal definition of input ordered basis β to output ordered
basis γ, but 99 percent of the time in real life we have either β ⊂ γ or γ ⊂ β (we usually are
mapping from Rn to Rm ). For example, let A ∈ M m×n and x ∈ Rn ; the following is always
true:
Ax = LA(x)   (1264)
   = [ LA(ê1)  LA(ê2)  · · ·  LA(ên) ] x   (1265)
   = Σi=1..n LA(êi) xi   (1266)
This viewpoint is painfully obvious to me now, but I guess I hadn’t thought deeply enough
about the implications of the definition of a linear transformation, and I definitely took the
representation of a matrix way too seriously, rather than focusing on its sole purpose:
provide a convenient way to write down linear transformations. For example, the above is
actually a direct consequence of the definition of a L.T. itself:
Time to really nail in the understanding. I also remember getting screwed up trying to think
about ok, so how do I conceptualize of the ith element of x after the transformation? It’s just
a bunch of summed up goobley-gook!. On one hand, yes that’s true, but focus on the following
before/after representations of x to make your life easier:
x = Σi xi êi   −−(Ax)−→   LA(x) = Σi xi LA(êi)   (1269)
Matrix Multiplication and Neural Networks. Let’s use the previous interpretations
in the context of neural networks. A basic feedforward network with one hidden layer will
compute outputs o given inputs x, each of which are vectors of possibly different dimension:
where W (o) and W (h) are the output and hidden parameter matrices, respectively. We already
know that we can interpret each columns of these matrices as how the input basis vectors get
mapped to the hidden or output space. However, since we usually think of the parameter
matrices as representing the weighted edges of a network, we often think in terms of individual
units. For example, the ith unit of the hidden layer vector h is given by hi = Σj=1..nin Wij xj = ⟨Wi,:, x⟩. One interesting interpretation is that the ith element of h is a projection276 of
the input x onto the ith row of W. This is of course true for any linear transformation; we
can always think of the elements of the transformed vector as the result of projections of the
original vector along a particular direction.
Determinants. det A is the volume of space that a unit [hyper] cube is mapped to. Starting
with the simplest non-trivial case, let A ∈ M 2×2 , and define A s.t. it simply scales the basis
vectors (zero rotation). In other words, Ai,j≠i := 0. In this case, det A = a11 a22, which is the area enclosed by the new scaled basis vectors. Skipping straight to the general case of [necessarily square] matrix A ∈ Mn×n, using Einstein summation notation and the Levi-Civita symbol277:
det A := εi1,...,in a1,i1 · · · an,in = εi1,...,in Πj=1..n aj,ij   (1273)
Consider that if det A = 0, then TA “squishes” the volume of space in such a way that we
essentially lose one or more dimensions. Notice how it only takes one lost dimension, since
the volume of any region in Rn is zero unless there is some amount in all dimensions (e.g. a
cube with zero width has zero volume, regardless of its height/depth). It’s also interesting to
consider the relationships here with the invertible matrix theorem (defined a few paragraphs
below). Having the intuition that determinants can be thought of as a change-in-volume
makes it much more obvious why the equivalence statements of the invertible matrix theorem
are indeed equivalent.
276
This is informally worded. See the footnotes in the dot products section to understand why the element
is technically just the result of a transformation (a projection would require re-mapping the scalar back to the
space that Wi,: lives (input space)).
277 Recall that εi1,...,in = +1 if (i1, . . . , in) is an even permutation of (1, . . . , n), −1 if it is an odd permutation, and 0 otherwise (1272). Note that this implies it equals zero whenever any of the indices are equal.
Dot Products and Projections. First, recall that a projection is defined to be a linear
transformation P that is idempotent (P n = P for n ≥ 1). Also, note that what you generally
think of as a projection is technically an orthogonal projection 278 .
Here we’ll show the intimate link between the dot product and [orthogonal] projection. Let
Pu define the [orthogonal] projection onto some unit vector u ∈ Rn (more generally, we
could project onto a subspace instead of a single vector279 ). We can re-cast this as a linear
transformation Tu : Rn 7→ R (technically not a projection, which would require re-mapping the
output scalar back to Rn ). We interpret the scalar output of Tu (x) as the coordinate along
the line spanned by {u} that input vector x gets mapped to. But wait, didn’t we just talk a
bunch about how to represent/conceptualize of the matrix representation of a transformation?
Yes, we did. Well then, what would the matrix representation of Tu look like? Don’t forget
that we’ve defined ||u|| = 1.
[Tu] (with input space Rn in the standard basis and output space R) = [ u1 · · · un ]   (1274)
Tu(x) = Σi=1..n ui xi   (1275)
      = x • u   (1276)
Furthermore, since linear transformations satisfy T (cx) = cT (x) by definition, the final result
is true even when u is not a unit vector.
278
The more general definition uses wording “projection of vector x along k onto m”, where the distinction
is shown in italics. An orthogonal projection implicitly defines k as its null space; for any α ∈ R, an orthogonal
projection satisfies P (αk) = 0
279
And technically, you don’t project onto a vector, but rather you project onto a line, which is itself technically
a subspace of 1 dimension. Yada yada yada.
Inverse of a partitioned matrix
Consider a general partitioned matrix,
M = [ E  F ;  G  H ]   (1278)
[ P(X, Y) = P(X) P(Y) ]  =⇒  [ E[XY] = E[X] E[Y] ],  but the converse does not hold   (1282)
Example: Uncorrelated ⇏ Independent
To get an intuition for this, I immediately try formalizing the possible cases where this is true. It
seems that symmetry is always present in such cases, and they do seem like edge cases. The simplest
and by far most common example is the case where we have x,y coordinates {(−1, 0), (0, −1), (0, 1), (1, 0)}.
It’s obvious that X ∗ Y equals zero for all of these points, and also that both X and Y are symmetric
about the origin, meaning that E [XY ] = 0 = E [X] E [Y ]. In other words, they are uncorrelated. The
key insight comes from understanding why this is so: Regardless of whether one variable is either
positive or negative the other is zero. I really want to nail this insight down, because it made me realize
I was thinking about correlation wrong – I was thinking about it more as independence in the first place,
and so looking back it’s no wonder I was confused about the difference. You simply cannot think about
correlation from the perspective of a single instance. For example, when I first read this, I thought “well
if I know X is 1, then I know automatically that Y is zero”, and although that is technically true, that is
not what correlation is about. Rather, correlation is about trends of multiple instances. A more correct
thought would be “Regardless of whether X is positive or negative, Y is zero, therefore positive values of
X are neither positively nor negatively correlated with values of Y.”
Now that we’ve got the thornier part (for me at least) out of the way, recognize that although X and Y
are uncorrelated, they are not independent. This should be fairly obvious, since given either X or Y , we
can say what the other’s value is with higher certainty than otherwise.
where finding the optimal linear function Ŷ (X) amounts to finding the optimal coeffi-
cients (a, b) over the dataset D. Unfortunately, it seems that the main way of “deriving”
the result is actually to just proven, given the result, that it does indeed minimize the
MSE. So, with that said, we begin with the result:
L[Y | X] = E[Y] + ( cov(X, Y) / Var(X) ) (X − E[X])   (1285)
Proof: Eq 1285 Minimizes the MSE
E[ (Y − L[Y | X])2 ] ≤ E[ (Y − Ŷ)2 ]   (1286)
Our next goal is to evaluate the term in red (spoiler alert: it is zero).
2. First, it is easy to show that E[Y − L[Y | X]] = 0 by simple substitution/arithmetic. We can also show that E[ (Y − L[Y | X]) (aX + b) ] = 0 for all aX + b ∈ L(X) as well.
3. Since L[Y | X] ∈ L(X), it is also true that ∀Ŷ ∈ L(X), we know (L[Y | X] − Ŷ ) ∈ L(X),
too. Therefore, the red term from step 1 equates to zero.
4. We now know that our formula from step 1 can be written
TODO: Figure out how this formulation is equivalent to the typical multivariate expres-
sion:
ŷ = ŵOLST x   (1290)
ŵOLS = (XTX)−1 XTy   (1291)
Questions.
• Q: In general, how can one tell if a matrix A has an eigenvalue decomposition? [insert
more conceptual matrix-related questions here . . . ]
• Q: Let A be real-symmetric. What can we say about A?
  – Proof that eigendecomposition A = QΛQT exists: Wow, this is apparently quite hard to prove according to many online sources. Guess I don't feel so bad now that it wasn't (and still isn't) obvious. (This is the principal axis theorem: if A is symmetric, then an orthonormal basis of eigenvectors exists.)
– Eigendecomposition not unique. This is apparently because two or more eigenvec-
tors may have same eigenvalue.
Stuff I Forget.
• Existence of eigenvalues/eigenvectors. Let A ∈ Rn×n. (Most info here comes from chapter 5 of your "Elementary Linear Algebra" textbook, around pg. 305.)
  – λ is an eigenvalue of A iff it satisfies det(λI − A) = 0. Why? Because it is an equivalent statement to requiring that (λI − A)x = 0 have a nonzero solution for x.
– The following statements are equivalent:
∗ A is diagonalizable.
∗ A has n linearly independent eigenvectors.
– The eigenspace of A corresponding to λ is the solution space of the homogeneous
system (λI − A)x = 0.
– A has at most n distinct eigenvalues.
• Diagonalizability notes from 5.2 of advanced linear alg. book (261). Recall that A is defined to be diagonalizable if and only if there exists an ordered basis β for the space consisting of eigenvectors of A. (Recall that a linear operator is a special case of a linear map where the input space is the same as the output space.)
  – If the standard way of finding eigenvalues leads to k distinct λi, then the corresponding set of k eigenvectors vi are guaranteed to be linearly independent (but might not span the full space).
– If A has n linearly independent eigenvectors, then A is diagonalizable.
– The characteristic polynomial of any diagonalizable linear operator splits (can be
factored into product of linear factors). The algebraic multiplicity of an eigen-
value λ is the largest positive integer k for which (t − λ)k is a factor of f (t).
• Expectation of a random vector. Defined as
E[x] = [ E[x1], . . . , E[xd] ]T   (1292)
You can work out that it separates like that (which is not intuitive/immediately obvious
imo) by considering e.g. the case where d = 2. You’ll end up finding that
E[x] = Σ_{x1} Σ_{x2} x · p(x = x)    (1293)
     = [ E[x1] ;  E_{x1}[ E_{x2∼p(x2|x1)}[x2 | x1] ] ]    (1294)
and since we know from CS70 that E [E [Y | X]] = E [Y ], we get the desired result.
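Here is the quick numeric check referenced above (my own sketch, not from either textbook): numpy confirms that a random real-symmetric matrix factors as A = QΛQ^T with orthonormal Q.

import numpy as np

# Check the principal axis theorem numerically for a random symmetric matrix.
rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))
A = (B + B.T) / 2                        # symmetrize to get a real-symmetric matrix

eigvals, Q = np.linalg.eigh(A)           # eigh handles symmetric/Hermitian matrices
Lam = np.diag(eigvals)

print(np.allclose(A, Q @ Lam @ Q.T))     # True: A = QΛQ^T
print(np.allclose(Q.T @ Q, np.eye(4)))   # True: columns of Q are orthonormal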
Matrix Cookbook
∂/∂X ||X||_F^2 = ∂/∂X Tr(X X^T) = 2X    (1295)

[chain rule]    ∂g(U)/∂X_ij = Tr[ (∂g(U)/∂U)^T (∂U/∂X_ij) ]    (1296)
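A small finite-difference sanity check of Eq 1295 (my own sketch; the shapes and tolerances are arbitrary):

import numpy as np

# Verify d/dX ||X||_F^2 = 2X entrywise with central finite differences.
rng = np.random.default_rng(1)
X = rng.normal(size=(3, 4))
eps = 1e-6

grad_fd = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        Xp, Xm = X.copy(), X.copy()
        Xp[i, j] += eps
        Xm[i, j] -= eps
        grad_fd[i, j] = (np.sum(Xp**2) - np.sum(Xm**2)) / (2 * eps)

print(np.allclose(grad_fd, 2 * X, atol=1e-4))  # True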
Main Tasks in NLP
Going to start compiling a list of the main tasks in NLP (alphabetized). Note that NLP-
Progress, a site dedicated to this purpose, is much more detailed. I'm going for short-and-
sweet here.
Constituency Parsing.
• Task: Generate parse tree of a sentence. Nodes are typically labeled by parts of speech
and/or chunks.
– A constituency parse tree breaks a text into sub-phrases, or constituents. Non-
terminals in the tree are types of phrases, the terminals are the words in the sentence.
Coreference Resolution.
• Task: clustering mentions in text that refer to the same underlying real world entities.
• SOTA: End-to-end Neural Coreference Resolution.
Dependency Parsing.
• Task: Given a sentence, generate its dependency tree (DT). A DT is a labeled directed
tree whose nodes are individual words, and whose edges are directed arcs labeled with
dependency types.
– A dependency parser analyzes the grammatical structure of a sentence, establishing
relationships between “head” words and words which modify those heads.
• SOTA: Deep Biaffine Attention for Neural Dependency Parsing.
Related Tasks:
– Constituency parsing. See this great wiki explanation of dependency vs constituency.
Information Extraction.
• Task: Given a (typically long) portion of raw text, recover information about pre-
specified relations, entities, events, etc.
Language Modeling.
• Task: Learning the probability distribution over text sequences. Often used for predict-
ing the next word in a sequence, given the K previous words.
• SOTA: ELMo.
Machine Translation.
Semantic Parsing.
• Task: Translating natural language into a formal meaning representation on which a
machine can act.
Sentiment Analysis.
• Task: Determining whether a piece of text is positive, negative, or neutral.
• SOTA: Biattentive classification network (BCN) from Learned in Translation: Contex-
tualized Word Vectors (the CoVe paper) with ELMo embeddings.
Summarization.
Textual Entailment.
• Task: Given a premise, determine whether a proposed hypothesis is true.
• SOTA: Enhanced Sequential Inference Model (ESIM) from Enhanced LSTM for Natural
Language Inference, with ELMo embeddings.
Footnote 280: The predicate of a sentence mostly corresponds to the main verb and any auxiliaries that
accompany the main verb; whereas the arguments of that predicate (e.g. the subject and object noun phrases)
are outside the predicate.
Topic Modeling.
Misc. Topics
It's important to emphasize how ridiculous this [naive unigram precision] really is. It literally means that we walk
along each word in the prediction and ask "is this word somewhere in any of the reference
translations?" and if that answer is "yes", we +1 to the numerator. Period.
• Modified unigram precision: actually considering how many times we've mentioned a
  given word w when incorporating it into the precision calculation. Now, when walking
  along a sentence, we add to the aforementioned question, ". . . and have I seen it less
  than N times already?" where N = max(count(ref, w) for ref in refs). This
  means our numerator can now be at most N for any given word.
• Generalize to n-grams. Below is the formula for the BLEU modified precision on n-grams only:
p_n(ŷ) = ( Σ_{ngram ∈ ŷ} Count_clip(ngram, refs) ) / ( Σ_{ngram ∈ ŷ} Count(ngram) )    (1298)
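A short sketch of Eq 1298 (my own code; the helper name and the toy sentences are made up for illustration): clip each candidate n-gram count by its maximum count in any single reference.

from collections import Counter

def modified_precision(candidate, references, n):
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand_counts = ngrams(candidate)
    # For every n-gram, the clip value is its max count over all references.
    max_ref_counts = Counter()
    for ref in references:
        for ng, c in ngrams(ref).items():
            max_ref_counts[ng] = max(max_ref_counts[ng], c)

    clipped = sum(min(c, max_ref_counts[ng]) for ng, c in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

# Classic degenerate example: naive unigram precision would be 7/7, but p_1 clips it to 2/7.
refs = [["the", "cat", "is", "on", "the", "mat"],
        ["there", "is", "a", "cat", "on", "the", "mat"]]
print(modified_precision(["the"] * 7, refs, n=1))  # 2/7 ≈ 0.286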
14.5.2 Connectionist Temporal Classification (CTC)
p(Y | X) = Σ_{A ∈ A_{X,Y}} ∏_{t=1}^{T} p_t(a_t | X)    (1301)
where A is one of the valid alignments from A_{X,Y}, the full set of valid alignments from X
to a given output sequence Y. The per-timestep probabilities p_t(a_t | X) can be given by, for
example, an RNN.
Number of Valid Alignments
Given X of length T and Y of length U ≤ T (and no repeating letters), how many valid align-
ments exist?
Stated even simpler, the alignments differ first and foremost by which elements of X are "extra" tokens, where
I'm using "extra" to mean either a blank or a repeat token. Given a set of T tokens, there are C(T, T − U) different
ways to assign T − U of them as "extra." The tricky part is that we can't just randomly decide to repeat or insert a
blank, since a sequence of one or more blanks is always followed by a transition to the next letter, by definition.
And remember, we have defined Y to have no repeated [contiguous] labels.
Apparently, the answer is C(T + U, T − U) total valid alignments.
Computing forward probabilities α_t(s), defined as the probability of arriving at [a prefix of]
the augmented label sequence ℓ′_{1..s} given unmerged alignments up to step t. There are two cases
to consider.
1. (1.1) The augmented label at step s, ℓ′_s, is the blank token ϵ. Remember, ϵ occurs at
   every other position in the augmented sequence ℓ′. At the previous RNN output (time
   t − 1), we could've emitted either a blank token or the previous token in the augmented
   label sequence, ℓ′_{s−1}. In other words,
   α_t(s) = (α_{t−1}(s) + α_{t−1}(s − 1)) · p_t(ℓ′_s | X)
2. (1.2) The augmented label at step s, ℓ′_s, is the same augmented label as at step s − 2.
   This occurs when the [not augmented] label sequence has repeated labels next to each
   other.
   In this situation, α_{t−1}(s) corresponds to us just emitting the same token as we did at
   t − 1 or emitting a blank token ϵ, and α_{t−1}(s − 1) corresponds to a transition to/from
   ϵ and a label. The recursion is the same as in case 1.1.
3. (2) The augmented label at step s − 1, ℓ′_{s−1}, is the blank token between unique characters.
   In addition to the two α_{t−1} terms from before, we now also must consider the possibility
   that our RNN emitted ℓ′_{s−2} at the previous time (t − 1) and then emitted ℓ′_s immediately
   after at time t:
   α_t(s) = (α_{t−1}(s) + α_{t−1}(s − 1) + α_{t−1}(s − 2)) · p_t(ℓ′_s | X)
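Putting the cases above together, here is a rough numpy sketch of the forward recursion (my own code, un-normalized and without the usual log-space tricks, so purely illustrative):

import numpy as np

# probs[t, k] plays the role of p_t(k | X) over the vocabulary plus a blank at index 0;
# `labels` is the target sequence Y (assumed non-empty) without blanks.
def ctc_forward(probs, labels, blank=0):
    T = probs.shape[0]
    # Augmented sequence l': a blank between every label and at both ends.
    aug = [blank]
    for l in labels:
        aug += [l, blank]
    S = len(aug)

    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, aug[0]]        # start in the leading blank...
    alpha[0, 1] = probs[0, aug[1]]        # ...or in the first label
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s > 0:
                a += alpha[t - 1, s - 1]
            # Case 2: may also skip the blank at s-1 when l'_s is a "new" label.
            if s > 1 and aug[s] != blank and aug[s] != aug[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * probs[t, aug[s]]

    # p(Y | X): finish in the final label or the trailing blank.
    return alpha[T - 1, S - 1] + alpha[T - 1, S - 2]

# Tiny usage example with random per-timestep distributions.
rng = np.random.default_rng(0)
probs = rng.random((5, 4))
probs /= probs.sum(axis=1, keepdims=True)
print(ctc_forward(probs, labels=[2, 3, 2]))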
14.5.3 Perplexity
Per Wikipedia:
In information theory, perplexity is a measurement of how well a probability distribution
or probability model predicts a sample.
For a discrete distribution p, the perplexity is defined as P(p) = 2^{H(p)}, where H is entropy (in
bits). It's important to note that, in practice, we are never able to use the theoretical version since
we don't know p exactly (we are usually trying to estimate it); instead of H(p) we thus usually think
in terms of H(p, q), the cross entropy (see footnote 281). The empirical definition applies when we
have N samples in some test set T and a model q that we want to use to approximate the true
distribution p, in which case we compute P = 2^{−(1/N) Σ_i lg q(x_i)}.
In NLP, it is more common to want the per-word perplexity of a language model. We typically
do this by flattening out a sequence of words in some test set containing M words total and
simply compute
P = 2^{−(1/M) lg p({w1, ..., wM})}    (1307)
  = 1 / p({w1, ..., wM})^{1/M}    (1308)
In other words, NLP nearly always defines perplexity as the inverse probability of the test
set, normalized by number of words. So, why is this valid? We are implicitly assuming that
language sources are ergodic:
Ergodic
A random process is ergodic if its (asymptotic) time average is the same as its expectation
value over all possible states (w.r.t the specified probability distribution).
Informally, this means that the system eventually reaches all states, and such that the
probability of observing it in state s is p(s), where p is the true generating distribution.
Footnote 281: Recall the relationship between entropy H(p) and cross entropy H(p, q): H(p, q) = H(p) + D_KL(p ‖ q) ≥ H(p).
In the per-word NLP case, this means we can assume that
lim_{m→∞} E[(1/m) lg p({w1, . . . wm})] = lim_{m→∞} (1/m) lg p({w1, . . . wm})    (1309)
where the sequence on the RHS is any sample from p (see footnote 282).
Intuition. Ok, now that we’ve got definitions out of the way, what does it actually mean?
First consider some limiting cases. If the distribution p is uniform over N possible outcomes,
then P(p) = 2^{lg N} = N. Since the uniform distribution has the highest possible entropy, N
is also the largest possible value for perplexity of a discrete distribution p over N possible
outcomes.
Consider the interpretation of the cross entropy loss as the negative log-likelihood:
NLL(p_data) = −(1/M) Σ_{i=1}^{M} log p(w = w_i) = E_{w∼p_data}[ log (1/p(w)) ]    (1310)
we see that N LL (and thus P= exp(N LL)) decreases as our model assigns higher probabilities
to samples drawn from pdata . Better models of pdata are less surprised by samples from pdata .
If we use the typical interpretation of entropy as the number of bits needed (on average) to
represent a sample from p, then the perplexity can be interpreted as the total number of pos-
sible results (on average) when drawing a sample from p.
In the case of language modeling, this represents the total number of reasonable next-word
predictions for wt+1 given some context w1 , . . . wt . As our model assigns higher probabilities
to the true samples in pdata , the number of bits required to specify each word, on average,
becomes smaller. Therefore, you can roughly think of per-word perplexity as telling you the
number of possible choices, on average, your model considers uniformly at random at a given
step. For example, P = 42 could be interpreted loosely as “to predict the next word out of
some vocabulary V , my model can narrow it down on average to about 42 choices, and chooses
uniformly at random from that subset”, where typically |V | >> 42.
Footnote 282: Something feels off here. I'm synthesizing what I'm reading from Wikipedia and this source from
Berkeley, but I can't fix the sloppy tone of the wording.
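For concreteness, a tiny sketch of per-word perplexity as in Eqs 1307–1308 (my own code; probs is assumed to hold the model's per-word probabilities on the test set):

import numpy as np

def per_word_perplexity(probs):
    probs = np.asarray(probs)
    M = len(probs)
    # 2 ** (-(1/M) * sum(lg q)) == (prod q) ** (-1/M)
    return 2.0 ** (-np.sum(np.log2(probs)) / M)

# Limiting case from the text: a uniform model over N outcomes has perplexity N.
N = 50
print(per_word_perplexity([1.0 / N] * 100))   # 50.0
print(per_word_perplexity([0.5, 0.25, 0.9]))  # ≈ 2.07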
14.5.4 Byte Pair Encoding
14.5.5 Grammars
In formal language theory, a formal grammar is a set of production rules for strings in a
formal language. The rules describe how to form strings from the language’s alphabet that
are valid according to the language’s syntax. A grammar does not describe the meaning of the
strings or what can be done with them in whatever context—only their form.
• Regular Grammar: no rule has more than one nonterminal in its right-hand side,
and each of these nonterminals is at the same end of the right-hand side. Every regular
grammar corresponds directly to a nondeterministic finite automaton.
• A context-free grammar (CFG) is a formal grammar that consists of:
– Terminal symbols: characters that appear in the strings generated by the grammar.
– Nonterminal symbols: placeholders for patterns of terminal symbols that can be generated from
  them.
– Productions: rules for replacing (or rewriting) nonterminal symbols (on the LHS) in a string with
other nonterminal or terminal symbols (on the RHS), which can be applied regardless of context.
– Start symbol: a special nonterminal symbol that appears in the initial string generated by the
grammar.
To generate a string of terminal symbols from a CFG, we:
1. Begin with a string consisting of the start symbol;
2. Apply one of the productions with the start symbol on the left hand side, replacing the start symbol
   with the right hand side of the production;
3. Repeat the process of selecting nonterminal symbols in the string, and replacing them with the
   right hand side of some corresponding production, until all nonterminals have been replaced by
   terminal symbols (see the short sketch after this list).
• A Probabilistic CFG extends CFGs the same way that HMMs extend regular grammars,
  by defining the set P of probabilities on production rules.
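Here is the short sketch referenced above (a toy grammar of my own invention): generate a string by repeatedly rewriting nonterminals with randomly chosen productions until only terminals remain. Weighting the choice of production by a probability would turn this into the PCFG case.

import random

# Productions: nonterminal -> list of possible right-hand sides (lists of symbols).
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["the", "N"]],
    "VP": [["V", "NP"], ["V"]],
    "N":  [["dog"], ["cat"]],
    "V":  [["sees"], ["sleeps"]],
}

def generate(symbol="S"):
    if symbol not in GRAMMAR:                       # terminal symbol: emit it
        return [symbol]
    production = random.choice(GRAMMAR[symbol])     # pick a production for this nonterminal
    out = []
    for sym in production:
        out.extend(generate(sym))
    return out

print(" ".join(generate()))  # e.g. "the dog sees the cat"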
14.5.6 Bloom Filter
Data structure for querying whether a data point is a member of some set. It returns either
“no” or “maybe”. It is implemented as a bit vector. Each member of the set is passed through
k hash functions. Each hash function maps an element to an integer index. For each member,
we set the k output indices of the hash functions to 1 in our bit vector. To answer if some
data point x is in the set, we pass x through the k hash functions, which gives us k indices. If
all k indices have their bit value set to 1, the answer is “maybe”, otherwise (if any bit value is
0) the answer is “no”.
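A minimal sketch of the data structure just described (my own code; the bit-vector size m, the number of hash functions k, and the salted-SHA-256 hashing scheme are arbitrary choices):

import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _indices(self, x):
        # Derive k indices by salting a single hash function with the probe number.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{x}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, x):
        for idx in self._indices(x):
            self.bits[idx] = 1

    def query(self, x):
        return "maybe" if all(self.bits[idx] for idx in self._indices(x)) else "no"

bf = BloomFilter()
bf.add("cat")
print(bf.query("cat"))  # "maybe"
print(bf.query("dog"))  # almost certainly "no" with these parameters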
Asynchronous SGD.
W_{i+1} = W_i − (α / N_x) Σ_{j=1}^{N_x} ∂L(x^{(j)}) / ∂W_i    (1311)

[SyncSGD]    W_{i+1} = W_i − λ Σ_{j=1}^{N_w} α Σ_{k=1}^{N_x^{(j)}} ∂L(x^{(k)}) / ∂W_i    (1312)
In asynchronous SGD, we just apply the gradient updates to a global version of the parameters
whenever they are available. In practice, this can result in stale gradients, which happens
when a worker takes a long time to compute some gradient step, while the master version of
the parameters has been updated many times. This results in the master computing an update
like W_{t+1} = W_t − λ∆W_{t−D} for larger-than-desired values of D (the delay in number of updates).
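To see the staleness effect concretely, here's a toy simulation (my own sketch on a 1-D quadratic, not a real distributed setup) where the applied gradient is always D updates old:

import numpy as np

def run(delay_d, steps=200, lr=0.05):
    w = np.array([5.0])                      # minimize L(w) = 0.5 * w^2, so grad = w
    history = [w.copy() for _ in range(delay_d + 1)]
    for _ in range(steps):
        stale_w = history[0]                 # parameter snapshot from D updates ago
        grad = stale_w                       # gradient of 0.5 * w^2 at the stale point
        w = w - lr * grad                    # master applies W_{t+1} = W_t - lr * dL(W_{t-D})
        history = history[1:] + [w.copy()]
    return float(w)

for d in [0, 5, 10]:
    print(d, run(d))   # larger delays converge more slowly (and can even oscillate)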
14.5.8 Traditional Language Modeling
Click this link to see a really good overview of the terms above and more.
Footnote 283: A good example is how a unigram model assigns a decent probability to "York", but a bigram model
could tell you that it is nearly certain that "New" preceded it. The backoff model would tend to overestimate
p(York), since it has no contextual information.
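Footnote 283 in code form (a toy corpus I made up): the unigram model gives "York" only its marginal probability, while the bigram model is certain once it has seen "New".

from collections import Counter

corpus = "new york is busy . i love new york . york is old .".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

p_york_unigram = unigrams["york"] / len(corpus)          # marginal probability of "york"
p_york_given_new = bigrams[("new", "york")] / unigrams["new"]  # conditional on "new"

print(p_york_unigram)     # 3/14 ≈ 0.21
print(p_york_given_new)   # 2/2 = 1.0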