
Contents

1 Math and Machine Learning Basics 7


1.1 Linear Algebra (Quick Review) (Ch. 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.1.1 Example: Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . . 10
1.2 Probability & Information Theory (Quick Review) (Ch. 3) . . . . . . . . . . . . . . . . . . . 12
1.3 Numerical Computation (Ch. 4) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.4 Machine Learning Basics (Ch. 5) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.4.1 Estimators, Bias and Variance (5.4) . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.4.2 Maximum Likelihood Estimation (5.5) . . . . . . . . . . . . . . . . . . . . . . . . 19
1.4.3 Bayesian Statistics (5.6) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.4.4 Supervised Learning Algorithms (5.7) . . . . . . . . . . . . . . . . . . . . . . . . 22

2 Deep Networks: Modern Practices 23


2.1 Deep Feedforward Networks (Ch. 6) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.1.1 Back-Propagation (6.5) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2 Regularization for Deep Learning (Ch. 7) . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3 Optimization for Training Deep Models (Ch. 8) . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4 Convolutional Neural Networks (Ch. 9) . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.5 Sequence Modeling (RNNs) (Ch. 10) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.5.1 Review: The Basics of RNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.5.2 RNNs as Directed Graphical Models . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.5.3 Challenge of Long-Term Deps. (10.7) . . . . . . . . . . . . . . . . . . . . . . . . 42
2.5.4 LSTMs and Other Gated RNNs (10.10) . . . . . . . . . . . . . . . . . . . . . . . 43
2.6 Applications (Ch. 12) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.6.1 Natural Language Processing (12.4) . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.6.2 Neural Language Models (12.4.2) . . . . . . . . . . . . . . . . . . . . . . . . . . 45

3 Deep Learning Research 46


3.1 Linear Factor Models (Ch. 13) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.2 Autoencoders (Ch. 14) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3 Representation Learning (Ch. 15) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.4 Structured Probabilistic Models for DL (Ch. 16) . . . . . . . . . . . . . . . . . . . . . . . 52
3.4.1 Sampling from Graphical Models . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.4.2 Inference and Approximate Inference . . . . . . . . . . . . . . . . . . . . . . . . 54
3.5 Monte Carlo Methods (Ch. 17) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.6 Confronting the Partition Function (Ch. 18) . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.7 Approximate Inference (Ch. 19) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.8 Deep Generative Models (Ch. 20) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

4 Papers and Tutorials 66


4.1 WaveNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

1
4.2 Neural Style . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.3 Neural Conversation Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.4 NMT By Jointly Learning to Align & Translate . . . . . . . . . . . . . . . . . . . . . . . . 76
4.4.1 Detailed Model Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.5 Effective Approaches to Attention-Based NMT . . . . . . . . . . . . . . . . . . . . . . . . 79
4.6 Using Large Vocabularies for NMT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.7 Candidate Sampling – TensorFlow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.8 Attention Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.9 TextRank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.9.1 Keyword Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.9.2 Sentence Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.10 Simple Baseline for Sentence Embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.11 Survey of Text Clustering Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.11.1 Distance-based Clustering Algorithms . . . . . . . . . . . . . . . . . . . . . . . . 97
4.11.2 Probabilistic Document Clustering and Topic Models . . . . . . . . . . . . . . . . . 98
4.11.3 Online Clustering with Text Streams . . . . . . . . . . . . . . . . . . . . . . . . 100
4.12 Deep Sentence Embedding Using LSTMs . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.13 Clustering Massive Text Streams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.14 Supervised Universal Sentence Representations (InferSent) . . . . . . . . . . . . . . . . . . . 107
4.15 Dist. Rep. of Sentences from Unlabeled Data (FastSent) . . . . . . . . . . . . . . . . . . . . 108
4.16 Latent Dirichlet Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
4.17 Conditional Random Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
4.18 Attention Is All You Need . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
4.19 Hierarchical Attention Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
4.20 Joint Event Extraction via RNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
4.21 Event Extraction via Bidi-LSTM Tensor NNs . . . . . . . . . . . . . . . . . . . . . . . . . 125
4.22 Reasoning with Neural Tensor Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
4.23 Language to Logical Form with Neural Attention . . . . . . . . . . . . . . . . . . . . . . . 128
4.24 Seq2SQL: Generating Structured Queries from NL using RL . . . . . . . . . . . . . . . . . . 130
4.25 SLING: A Framework for Frame Semantic Parsing . . . . . . . . . . . . . . . . . . . . . . . 133
4.26 Poincaré Embeddings for Learning Hierarchical Representations . . . . . . . . . . . . . . . . . 135
4.27 Enriching Word Vectors with Subword Information (FastText) . . . . . . . . . . . . . . . . . 137
4.28 DeepWalk: Online Learning of Social Representations . . . . . . . . . . . . . . . . . . . . . 139
4.29 Review of Relational Machine Learning for Knowledge Graphs . . . . . . . . . . . . . . . . . 141
4.30 Fast Top-K Search in Knowledge Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
4.31 Dynamic Recurrent Acyclic Graphical Neural Networks (DRAGNN) . . . . . . . . . . . . . . . 146
4.31.1 More Detail: Arc-Standard Transition System . . . . . . . . . . . . . . . . . . . . 149
4.32 Neural Architecture Search with Reinforcement Learning . . . . . . . . . . . . . . . . . . . . 150
4.33 Joint Extraction of Events and Entities within a Document Context . . . . . . . . . . . . . . . 152
4.34 Globally Normalized Transition-Based Neural Networks . . . . . . . . . . . . . . . . . . . . 155
4.35 An Introduction to Conditional Random Fields . . . . . . . . . . . . . . . . . . . . . . . . 158
4.35.1 Inference (Sec. 4) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
4.35.2 Parameter Estimation (Sec. 5) . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
4.35.3 Related Work and Future Directions (Sec. 6) . . . . . . . . . . . . . . . . . . . . . 168

2
4.36 Co-sampling: Training Robust Networks for Extremely Noisy Supervision . . . . . . . . . . . . 169
4.37 Hidden-Unit Conditional Random Fields . . . . . . . . . . . . . . . . . . . . . . . . . . 170
4.37.1 Detailed Derivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
4.38 Pre-training of Hidden-Unit CRFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
4.39 Structured Attention Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
4.40 Neural Conditional Random Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
4.41 Bidirectional LSTM-CRF Models for Sequence Tagging . . . . . . . . . . . . . . . . . . 183
4.42 Relation Extraction: A Survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
4.43 Neural Relation Extraction with Selective Attention over Instances . . . . . . . . . . . . . . . 187
4.44 On Herding and the Perceptron Cycling Theorem . . . . . . . . . . . . . . . . . . . . . . . 189
4.45 Non-Convex Optimization for Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . 191
4.45.1 Non-Convex Projected Gradient Descent (3) . . . . . . . . . . . . . . . . . . . . . 194
4.46 Improving Language Understanding by Generative Pre-Training . . . . . . . . . . . . . . . . . 195
4.47 Deep Contextualized Word Representations . . . . . . . . . . . . . . . . . . . . . . . . . . 196
4.48 Exploring the Limits of Language Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . 198
4.49 Connectionist Temporal Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
4.50 BERT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
4.51 Wasserstein is all you need . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
4.52 Noise Contrastive Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
4.52.1 Self-Normalized NCE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
4.53 Neural Ordinary Differential Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
4.54 On the Dimensionality of Word Embedding . . . . . . . . . . . . . . . . . . . . . . . . . . 212
4.55 Generative Adversarial Nets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
4.56 A Framework for Intelligence and Cortical Function . . . . . . . . . . . . . . . . . . . . . . 216
4.57 Large-Scale Study of Curiosity Driven Learning . . . . . . . . . . . . . . . . . . . . . . . . 217
4.58 Universal Language Model Fine-Tuning for Text Classification . . . . . . . . . . . . . . . . . 218
4.59 The Marginal Value of Adaptive Gradient Methods in Machine Learning . . . . . . . . . . . . . 220
4.60 A Theoretically Grounded Application of Dropout in Recurrent Neural Networks . . . . . . . . . 221
4.61 Improving Neural Language Models with a Continuous Cache . . . . . . . . . . . . . . . . . . 222
4.62 Protection Against Reconstruction and Its Applications in Private Federated Learning . . . . . . . 223
4.63 Context Dependent RNN Language Model . . . . . . . . . . . . . . . . . . . . . . . . . . 225
4.64 Strategies for Training Large Vocabulary Neural Language Models . . . . . . . . . . . . . . . . 226
4.65 Product quantization for nearest neighbor search . . . . . . . . . . . . . . . . . . . . . . . 228
4.66 Large Memory Layers with Product Keys . . . . . . . . . . . . . . . . . . . . . . . . . . 229
4.67 Show, Ask, Attend, and Answer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
4.68 Did the Model Understand the Question? . . . . . . . . . . . . . . . . . . . . . . . . . . 233
4.69 XLNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
4.70 Transformer-XL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
4.71 Efficient Softmax Approximation for GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . 237
4.72 Adaptive Input Representations for Neural Language Modeling . . . . . . . . . . . . . . . . . 238
4.73 Neural Module Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
4.74 Learning to Compose Neural Networks for QA . . . . . . . . . . . . . . . . . . . . . . . . 241
4.75 End-to-End Module Networks for VQA . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
4.76 Fast Multi-language LSTM-based Online Handwriting Recognition . . . . . . . . . . . . . . . 245

3
4.77 Multi-Language Online Handwriting Recognition . . . . . . . . . . . . . . . . . . . . . . . 246
4.78 Modular Generative Adversarial Networks . . . . . . . . . . . . . . . . . . . . . . . . . . 248
4.79 Transfer Learning from Speaker Verification to TTS . . . . . . . . . . . . . . . . . . . . . . 250

5 NLP with Deep Learning 251


5.1 Word Vector Representations (Lec 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
5.2 GloVe (Lec 3) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255

6 Speech and Language Processing 258


6.1 Introduction (Ch. 1 2nd Ed.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
6.2 Morphology (Ch. 3 2nd Ed.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
6.3 N-Grams (Ch. 6 2nd Ed.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
6.4 Naive Bayes and Sentiment (Ch. 6 3rd Ed.) . . . . . . . . . . . . . . . . . . . . . . . . . 263
6.5 Hidden Markov Models (Ch. 9 3rd Ed.) . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
6.6 POS Tagging (Ch. 10 3rd Ed.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
6.7 Formal Grammars (Ch. 11 3rd Ed.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
6.8 Vector Semantics (Ch. 15) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
6.9 Semantics with Dense Vectors (Ch. 16) . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
6.10 Information Extraction (Ch. 21 3rd Ed) . . . . . . . . . . . . . . . . . . . . . . . . . . . 278

7 Probabilistic Graphical Models 281


7.1 Foundations (Ch. 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
7.1.1 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
7.1.2 L-BFGS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
7.1.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
7.2 The Bayesian Network Representation (Ch. 3) . . . . . . . . . . . . . . . . . . . . . . . . 292
7.3 Undirected Graphical Models (Ch. 4) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
7.3.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
7.4 Local Probabilistic Models (Ch. 5) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
7.5 Template-Based Representations (Ch. 6) . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
7.6 Gaussian Network Models (Ch. 7) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
7.7 Variable Elimination (Ch. 9) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
7.8 Clique Trees (Ch. 10) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
7.9 Inference as Optimization (Ch. 11) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
7.10 Parameter Estimation (Ch. 17) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
7.11 Partially Observed Data (Ch. 19) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322

8 Information Theory, Inference, and Learning Algorithms 324


8.1 Introduction to Information Theory (Ch. 1) . . . . . . . . . . . . . . . . . . . . . . . . . 325
8.2 Probability, Entropy, and Inference (Ch. 2) . . . . . . . . . . . . . . . . . . . . . . . . . . 327
8.2.1 More About Inference (Ch. 3 Summary) . . . . . . . . . . . . . . . . . . . . . . . 329
8.3 The Source Coding Theorem (Ch. 4) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331

4
8.3.1 Data Compression and Typicality . . . . . . . . . . . . . . . . . . . . . . . . . . 333
8.3.2 Further Analysis and Q&A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
8.4 Monte Carlo Methods (Ch. 29) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
8.5 Variational Methods (Ch. 33) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339

9 Machine Learning: A Probabilistic Perspective 341


9.1 Probability (Ch. 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342
9.1.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
9.2 Generative Models for Discrete Data (Ch. 3) . . . . . . . . . . . . . . . . . . . . . . . . . 344
9.2.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
9.3 Gaussian Models (Ch. 4) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
9.4 Bayesian Statistics (Ch. 5) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
9.5 Linear Regression (Ch. 7) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
9.6 Generalized Linear Models and the Exponential Family (Ch. 9) . . . . . . . . . . . . . . . . . 358
9.7 Mixture Models and the EM Algorithm (Ch. 11) . . . . . . . . . . . . . . . . . . . . . . . 361
9.8 Latent Linear Models (Ch. 12) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
9.9 Markov and Hidden Markov Models (Ch. 17) . . . . . . . . . . . . . . . . . . . . . . . . . 366
9.10 Undirected Graphical Models (Ch. 19) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368

10 Convex Optimization 369


10.1 Convex Sets (Ch. 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370

11 Bayesian Data Analysis 372


11.1 Probability and Inference (Ch. 1). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
11.2 Single-Parameter Models (Ch. 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375
11.3 Asymptotics and Connections to Non-Bayesian Approaches (Ch. 4) . . . . . . . . . . . . . . . 378
11.4 Gaussian Process Models (Ch. 21) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381

12 Gaussian Processes for Machine Learning 383


12.1 Regression (Ch. 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384

13 Blogs 387
13.1 Conv Nets: A Modular Perspective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 388
13.2 Understanding Convolutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
13.3 Deep Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391
13.4 Deep Learning for Chatbots (WildML) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393
13.5 Attentional Interfaces – Neural Perspective . . . . . . . . . . . . . . . . . . . . . . . . . . 395

14 Appendix 396
14.1 Common Distributions and Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
14.2 Math . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
14.3 Matrix Cookbook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408

5
14.4 Main Tasks in NLP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409
14.5 Misc. Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
14.5.1 BLEU Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
14.5.2 Connectionist Temporal Classification (CTC) . . . . . . . . . . . . . . . . . . . . . 413
14.5.3 Perplexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
14.5.4 Byte Pair Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
14.5.5 Grammars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
14.5.6 Bloom Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
14.5.7 Distributed Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
14.5.8 Traditional Language Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . 419

6
Math and
Machine
Learning Basics
Contents

1.1 Linear Algebra (Quick Review) (Ch. 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . 8


1.1.1 Example: Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . . 10
1.2 Probability & Information Theory (Quick Review) (Ch. 3) . . . . . . . . . . . . . . . . . . . 12
1.3 Numerical Computation (Ch. 4) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.4 Machine Learning Basics (Ch. 5) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.4.1 Estimators, Bias and Variance (5.4) . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.4.2 Maximum Likelihood Estimation (5.5) . . . . . . . . . . . . . . . . . . . . . . . . 19
1.4.3 Bayesian Statistics (5.6) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.4.4 Supervised Learning Algorithms (5.7) . . . . . . . . . . . . . . . . . . . . . . . . 22

7
Math and Machine Learning Basics January 23, 2017

Linear Algebra (Quick Review) (Ch. 2)


Table of Contents Local Written by Brandon McKinzie

• For $A^{-1}$ to exist, $Ax = b$ must have exactly one solution for every value of b. (Unless stated otherwise, assume $A \in \mathbb{R}^{m\times n}$.) Determining whether a solution exists $\forall b \in \mathbb{R}^m$ means requiring that the column space (range) of A be all of $\mathbb{R}^m$. It is helpful to see Ax expanded out explicitly in this way:

$Ax = \sum_i x_i A_{:,i} = x_1 \begin{bmatrix} A_{1,1} \\ \vdots \\ A_{m,1} \end{bmatrix} + \cdots + x_n \begin{bmatrix} A_{1,n} \\ \vdots \\ A_{m,n} \end{bmatrix}$  (2.27)

→ Necessary: A must have at least m columns ($n \ge m$), i.e. A is "wide".
→ Necessary and sufficient: the matrix must contain at least one set of m linearly independent columns.
→ Invertibility: in addition to the above, the matrix must be square (re: at most m columns ∧ at least m columns).
• A square matrix with linearly dependent columns is known as singular. A (necessarily
square) matrix is singular if and only if one or more eigenvalues are zero.
• A norm is any function f that satisfies the following properties (a familiar example is the max norm, $||x||_\infty = \max_i |x_i|$):

$f(x) = 0 \Rightarrow x = 0$  (1)
$f(x + y) \le f(x) + f(y)$  (2)
$\forall \alpha \in \mathbb{R},\ f(\alpha x) = |\alpha| f(x)$  (3)

• An orthogonal matrix is a square matrix whose rows are mutually orthonormal and whose columns are mutually orthonormal. (Note that orthonormal columns imply orthonormal rows if the matrix is square; to prove this, consider the relationship between $A^T A$ and $A A^T$.)

$A^T A = A A^T = I$  (2.37)
$A^{-1} = A^T$  (2.38)

• Suppose square matrix $A \in \mathbb{R}^{n\times n}$ has n linearly independent eigenvectors $\{v^{(1)}, \ldots, v^{(n)}\}$. The eigendecomposition of A is then given by¹

$A = V \operatorname{diag}(\lambda)\, V^{-1}$  (2.40)

In the special case where A is real-symmetric, $A = Q \Lambda Q^T$. (All real-symmetric A have an eigendecomposition, but it might not be unique!) Interpretation: Ax can be decomposed into the following three steps:

¹ This appears to imply that unless the columns of V are also normalized, we can't guarantee that its inverse equals its transpose (since that is the only difference between it and an orthogonal matrix)?

8
1) Change of basis: the vector $Q^T x$ can be thought of as how x would appear in the basis of eigenvectors of A.
2) Scale: next, we scale each component $(Q^T x)_i$ by an amount $\lambda_i$, yielding the new vector $\Lambda (Q^T x)$. (A common convention is to sort the entries of Λ in descending order.)
3) Change of basis: finally, we rotate this new vector back from the eigen-basis into its original basis, yielding the transformed result $Q \Lambda Q^T x$.

• Positive definite: all λ are positive; positive semidefinite: all λ are positive or zero.
→ PSD: $\forall x,\ x^T A x \ge 0$
→ PD: $x^T A x = 0 \Rightarrow x = 0$.²

• Any real matrix $A \in \mathbb{R}^{m\times n}$ has a singular value decomposition of the form

$A = U D V^T$, with $U \in \mathbb{R}^{m\times m}$, $D \in \mathbb{R}^{m\times n}$, $V \in \mathbb{R}^{n\times n}$  (10)

where both U and V are orthogonal matrices, and D is diagonal.
– The singular values are the diagonal entries $D_{ii}$.
– The left(right)-singular vectors are the columns of U(V).
– Eigenvectors of $A A^T$ are the left-singular vectors. Eigenvectors of $A^T A$ are the right-singular vectors. The nonzero eigenvalues of both $A A^T$ and $A^T A$ are given by the singular values squared.

• The Moore-Penrose pseudoinverse, denoted $A^+$, enables us to find an "inverse" of sorts for a (possibly) non-square matrix A. Most algorithms compute $A^+$ via

$A^+ = V D^+ U^T$  (11)

($A^+$ is useful, e.g., when we want to solve Ax = y by left-multiplying each side to obtain x = By; it is far more likely for solution(s) to exist when A is wider than it is tall.)

• The determinant of a matrix is $\det(A) = \prod_i \lambda_i$. Conceptually, $|\det(A)|$ tells how much [multiplication by] A expands/contracts space. If $\det(A) = 1$, the transformation preserves volume.
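A quick numerical check of these facts (a minimal NumPy sketch; the matrix A below is an arbitrary example):

```python
import numpy as np

A = np.array([[1., 2., 0.],
              [0., 1., 3.]])          # a "wide" 2x3 matrix

U, s, Vt = np.linalg.svd(A)           # A = U D V^T; s holds the singular values

# Rebuild D and its pseudoinverse D^+ (reciprocals of the nonzero singular values).
D = np.zeros(A.shape)
D[:len(s), :len(s)] = np.diag(s)
D_plus = np.zeros(A.shape[::-1])
D_plus[:len(s), :len(s)] = np.diag(1.0 / s)

A_plus = Vt.T @ D_plus @ U.T          # eq. (11): A^+ = V D^+ U^T
assert np.allclose(A, U @ D @ Vt)
assert np.allclose(A_plus, np.linalg.pinv(A))

# Eigenvalues of A A^T are the singular values squared.
assert np.allclose(np.sort(np.linalg.eigvalsh(A @ A.T)), np.sort(s**2))
```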

² I proved this and it made me happy inside. Check it out. Let A be positive definite. Then

$x^T A x = x^T Q \Lambda Q^T x$  (4)
$= \sum_i (Q^T x)_i\, \lambda_i\, (Q^T x)_i$  (5)
$= \sum_i \lambda_i (Q^T x)_i^2$  (6)

Since all terms in the summation are non-negative and all $\lambda_i > 0$, we have that $x^T A x = 0$ if and only if $(Q^T x)_i = 0 = q^{(i)} \cdot x$ for all i. Since the set of eigenvectors $\{q^{(i)}\}$ forms an orthonormal basis, x must be the zero vector.

9
1.1.1 Example: Principal Component Analysis

Task. Say we want to apply lossy compression (less memory, but may lose precision) to a
collection of m points {x(1) , . . . , x(m) }. We will do this by converting each x(i) ∈ Rn to some
c(i) ∈ Rl (l < n), i.e. finding functions f and g such that:

f (x) = c and x ≈ g(f (x)) (12)

Decoding function (g). As is, we still have a rather general task to solve. PCA is defined
by choosing g(c) = Dc, with D ∈ Rn×l , where all columns of D are both (1) orthogonal and
(2) unit norm.

Encoding function (f). Now we need a way of mapping x to c such that g(c) will give us back a vector optimally close to x. We've already defined g, so this amounts to finding the optimal $c^*$ such that:

$c^* = \arg\min_c ||x - g(c)||_2^2$  (13)
$(x - g(c))^T (x - g(c)) = x^T x - 2 x^T g(c) + g(c)^T g(c)$  (14)
$c^* = \arg\min_c \left[ -2 x^T D c + c^T c \right]$  (15)
$\ \, = D^T x = f(x)$  (16)

which means the PCA reconstruction operation is defined as r(x) = DD T x.

Optimal D. It is important to notice that we've been able to determine e.g. the optimal $c^*$ for some x because each x has an (allowably) different $c^*$. However, we use the same matrix D for all our samples $x^{(i)}$, and thus must optimize it over all points in our collection. With that out of the way, we just do what we always do: minimize over the L2 distance between points and their reconstruction. Formally, we minimize the Frobenius norm of the matrix of errors:

$D^* = \arg\min_D \sqrt{\sum_{i,j} \left( x_j^{(i)} - r(x^{(i)})_j \right)^2} \quad \text{s.t.}\ D^T D = I$  (17)

10
Consider the case of l = 1, which means $D = d \in \mathbb{R}^n$. In this case, after [insert math here], we obtain

$d^* = \arg\max_d \operatorname{Tr}\left( d^T X^T X d \right) \quad \text{s.t.}\ d^T d = 1$  (18)

where, as usual, $X \in \mathbb{R}^{m\times n}$. It should be clear that the optimal d is just the eigenvector of $X^T X$ with the largest eigenvalue.
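A minimal NumPy sketch of this l = 1 case (assuming the rows of X are the centered data points): the top eigenvector of $X^T X$ gives d, and the reconstruction is $r(x) = d d^T x$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X -= X.mean(axis=0)                      # center the data

# Top eigenvector of X^T X = the first principal direction d.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
d = eigvecs[:, -1]                       # eigh returns eigenvalues in ascending order

codes = X @ d                            # f(x) = d^T x for every sample
recons = np.outer(codes, d)              # r(x) = d d^T x

# Same direction as the top right-singular vector of X (up to sign).
_, _, Vt = np.linalg.svd(X)
assert np.allclose(abs(Vt[0] @ d), 1.0)
```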

11
Math and Machine Learning Basics January 24

Probability & Information Theory (Quick Review) (Ch. 3)


Table of Contents Local Written by Brandon McKinzie

Expectation. For some function f (x), Ex∼P [f (x)] is the mean value that f takes on when x
is drawn from P . The formula for discrete and continuous variables, respectively is as follows:
$\mathbb{E}_{x\sim P}[f(x)] = \sum_x P(x)\, f(x)$  (3.9)
$\mathbb{E}_{x\sim p}[f(x)] = \int p(x)\, f(x)\, dx$  (3.10)

Variance. A measure of how much the values of a function of a random variable x vary as we
sample different values of x from its distribution.
$\mathrm{Var}[f(x)] = \mathbb{E}\left[ (f(x) - \mathbb{E}[f(x)])^2 \right]$  (3.11)

Covariance. Gives some sense of how much two values are linearly related to each other, as
well as the scale of these variables.

Cov [f (x), g(x)] = E [ (f (x) − E [f (x)]) (g(x) − E [g(x)]) ] (3.13)

→ Large |Cov [f, g] | means the function values change a lot and both functions are far from
their means at the same time.
→ Correlation normalizes the contribution of each variable in order to measure only how
much the variables are related.

Covariance Matrix of a random vector x ∈ Rn is an n × n matrix, such that

Cov [x]i,j = Cov [xi , xj ] (3.14)

and if we want the “sample” covariance matrix taken over m data point samples, then
$\Sigma := \frac{1}{m} \sum_{k=1}^{m} (x_k - \bar{x})(x_k - \bar{x})^T$  (19)

where m is the number of data points.

12
Measure Theory.
• A set of points that is negligibly small is said to have measure zero. In practical terms, think of such a set as occupying no volume in the space we are measuring (interested in). (In $\mathbb{R}^2$, a line has measure zero.)
• A property that holds almost everywhere holds throughout all space except for on a set of measure zero.

Functions of RVs.
• Common mistake: Suppose y = g(x), and g is invertible/continuous/differentiable.
It is NOT true that py (y) = px (g −1 (y)). This fails to account for the distortion of
[probability] space introduced by g. Rather,
$p_x(x) = p_y(g(x)) \left| \frac{\partial g(x)}{\partial x} \right|$  (3.47)

Information Theory. Denote the self-information of an event x = x to be

$I(x) \triangleq -\log P(x)$  (20)

where log is always assumed to be the natural logarithm. We can quantify the amount of uncertainty in an entire probability distribution using the Shannon entropy,

$H(x) = \mathbb{E}_{x\sim P}[I(x)] = -\mathbb{E}_{x\sim P}[\log P(x)]$  (21)

which gives the expected amount of information in an event drawn from that distribution. Taking it a step further, say we have two separate probability distributions P(x) and Q(x). We can measure how different these distributions are with the Kullback-Leibler (KL) divergence:

$D_{KL}(P\,||\,Q) \triangleq \mathbb{E}_{x\sim P}\left[ \log \frac{P(x)}{Q(x)} \right] = \mathbb{E}_{x\sim P}\left[ \log P(x) - \log Q(x) \right]$  (22)

Note that the expectation is taken over P, thus making $D_{KL}$ not symmetric (and thus not a true distance measure), since $D_{KL}(P\,||\,Q) \ne D_{KL}(Q\,||\,P)$. Finally, a closely related quantity is the cross-entropy, H(P, Q), defined as:

$H(P, Q) \triangleq H(P) + D_{KL}(P\,||\,Q)$  (23)
$\phantom{H(P, Q)} = -\mathbb{E}_{x\sim P}[\log Q(x)]$  (24)
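A tiny NumPy sketch over two arbitrary discrete distributions, confirming eq. (23) and the asymmetry of $D_{KL}$:

```python
import numpy as np

P = np.array([0.5, 0.25, 0.25])
Q = np.array([0.8, 0.1, 0.1])

H_P = -np.sum(P * np.log(P))                 # Shannon entropy, eq. (21)
kl_PQ = np.sum(P * (np.log(P) - np.log(Q)))  # D_KL(P||Q), eq. (22)
kl_QP = np.sum(Q * (np.log(Q) - np.log(P)))
H_PQ = -np.sum(P * np.log(Q))                # cross-entropy, eq. (24)

assert np.isclose(H_PQ, H_P + kl_PQ)         # eq. (23)
assert not np.isclose(kl_PQ, kl_QP)          # the KL divergence is not symmetric
```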

13
Math and Machine Learning Basics January 24, 2017

Numerical Computation (Ch. 4)


Table of Contents Local Written by Brandon McKinzie

Some terminology. Underflow is when numbers near zero are rounded to zero. Similarly,
overflow is when large [magnitude] numbers are approximated as ±∞. Conditioning refers
to how rapidly a function changes w.r.t. small changes in its inputs. Consider the function
f (x) = A−1 x. When A has an eigenvalue decomposition, its condition number is
$\max_{i,j} \left| \frac{\lambda_i}{\lambda_j} \right|$  (4.2)
which is the ratio of the magnitude of the largest and smallest eigenvalue. When this is large,
matrix inversion is sensitive to error in the input [of f(x)].

Gradient-based optimization. Recall from basic calculus that the directional derivative of f(x) in direction û (a unit vector) is defined as the slope of the function f in direction û. By definition of the derivative, this is given by (with $v := x + \alpha \hat u$)

$\lim_{\alpha \to 0} \frac{f(x + \alpha \hat u) - f(x)}{\alpha} = \left. \frac{\partial f(x + \alpha \hat u)}{\partial \alpha} \right|_{\alpha=0}$  (25)
$= \left. \sum_i \frac{\partial f(v)}{\partial v_i} \frac{\partial v_i}{\partial \alpha} \right|_{\alpha=0}$  (26)
$= \left. \sum_i (\nabla_v f(v))_i\, u_i \right|_{\alpha=0}$  (27)
$= \left. \hat u^T \nabla_v f(v) \right|_{\alpha=0}$  (28)
$= \hat u^T \nabla_x f(x)$  (29)

where it's important to recognize the distinction between $\lim_{\alpha\to 0}$ and setting α to zero, which is denoted by $|_{\alpha=0}$. If we want to find the direction û such that this directional derivative is a minimum, i.e.

$\hat u^* = \arg\min_{\hat u,\, \hat u^T \hat u = 1} \hat u^T \nabla_x f(x)$  (30)
$= \arg\min_{\hat u,\, \hat u^T \hat u = 1} ||\hat u||_2\, ||\nabla_x f(x)||_2 \cos(\theta)$  (31)
$= \arg\min_{\hat u} \cos(\theta)$  (32)

then we see that $\hat u^*$ points in the opposite direction of the gradient.
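A small NumPy check of eq. (29) (the function f and the point x below are arbitrary choices): the finite-difference slope matches $\hat u^T \nabla_x f(x)$, and the negative normalized gradient minimizes it over unit directions.

```python
import numpy as np

f = lambda x: x[0]**2 + 3.0 * x[0] * x[1] + np.sin(x[1])
grad_f = lambda x: np.array([2.0 * x[0] + 3.0 * x[1],
                             3.0 * x[0] + np.cos(x[1])])

x = np.array([1.0, -0.5])
u = np.array([1.0, 2.0]) / np.sqrt(5.0)        # an arbitrary unit direction

# Directional derivative via finite differences vs. eq. (29).
alpha = 1e-6
numeric = (f(x + alpha * u) - f(x)) / alpha
assert np.isclose(numeric, u @ grad_f(x), atol=1e-4)

# Steepest-descent direction: -grad/||grad|| minimizes u^T grad over unit vectors u.
u_star = -grad_f(x) / np.linalg.norm(grad_f(x))
assert u_star @ grad_f(x) <= u @ grad_f(x)
```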

14
Jacobian and Hessian Matrices. For when we want partial derivatives of some function f whose input and output are both vectors. The Jacobian matrix contains all such partial derivatives: for $f: \mathbb{R}^m \to \mathbb{R}^n$, $J \in \mathbb{R}^{n\times m}$ where $J_{i,j} = \frac{\partial}{\partial x_j} f(x)_i$. Sometimes we want to know about second derivatives too, since this tells us whether a gradient step will cause as much of an improvement as we would expect based on the gradient alone. The Hessian matrix H(f)(x) is defined such that (the Hessian is the Jacobian of the gradient)

$H(f)(x)_{i,j} = \frac{\partial^2}{\partial x_i\, \partial x_j} f(x)$  (4.6)

The second derivative in a specific direction $\hat d$ is given by $\hat d^T H \hat d$.³ It tells us how well we can expect a gradient descent step to perform. How so? Well, it shows up in the second-order approximation to the function f(x) about our current spot, which we can denote $x^{(0)}$. The standard gradient descent step will move us from $x^{(0)} \to x^{(0)} - \epsilon g$, where g is the gradient evaluated at $x^{(0)}$. Plugging this into the 2nd-order approximation shows us how H can give information related to how "good" of a step that really was. Mathematically,

$f(x) \approx f(x^{(0)}) + (x - x^{(0)})^T g + \frac{1}{2} (x - x^{(0)})^T H (x - x^{(0)})$  (4.8)
$f(x^{(0)} - \epsilon g) \approx f(x^{(0)}) - \epsilon\, g^T g + \frac{1}{2} \epsilon^2\, g^T H g$  (4.9)

If $g^T H g$ is positive, then we can easily solve for the optimal $\epsilon = \epsilon^*$ that decreases the Taylor series approximation as

$\epsilon^* = \frac{g^T g}{g^T H g}$  (4.10)

which can be as low as $1/\lambda_{\max}$ (the worst case), and as high as $1/\lambda_{\min}$, with the λ being the eigenvalues of the Hessian. The best (and perhaps only) way to take what we learned about the "second derivative test" in single-variable calculus and apply it to the multidimensional case with H is by using the eigendecomposition of H. Why? Because we can examine the eigenvalues of the Hessian to determine whether the critical point $x^{(0)}$ is a local maximum, local minimum, or saddle point.⁴ If all eigenvalues are positive (remember that this is equivalent to saying that the Hessian is positive definite!), the point is a local minimum. (The condition number of the Hessian at a given point can give us an idea about how much the second derivatives along different directions differ from each other.)

³ In the same manner that I derived equation 29, we can derive the second derivative in a specified direction $\hat d$:

$\left. \frac{\partial^2}{\partial \alpha^2} f(x + \alpha \hat d) \right|_{\alpha=0} = \left. \frac{\partial}{\partial \alpha}\, \hat d^T \nabla_v f(v) \right|_{\alpha=0}$  (33)
$= \left. \sum_i d_i\, \frac{\partial}{\partial \alpha} \frac{\partial f(v)}{\partial v_i} \right|_{\alpha=0}$  (34)
$= \left. \sum_i d_i\, \frac{\partial}{\partial v_i} \frac{\partial f(v)}{\partial \alpha} \right|_{\alpha=0}$  (35)
$= \left. \sum_i \sum_j d_i d_j\, \frac{\partial^2 f(v)}{\partial v_i\, \partial v_j} \right|_{\alpha=0}$  (36)
$= \hat d^T H \hat d$  (37)
⁴ Emphasis on "values" in "eigenvalues" because it's important not to get tripped up here about what the eigenvectors of the Hessian mean. The reason for the decomposition is that it gives us an orthonormal basis (out of which we can get any direction) and therefore the magnitude of the second derivative along each of these directions as the eigenvalues.
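A short NumPy illustration of eq. (4.10) for a purely quadratic f (an arbitrary positive definite H): the optimal step size lies in $[1/\lambda_{\max},\, 1/\lambda_{\min}]$ and equals $1/\lambda_{\max}$ when g is aligned with the top eigenvector.

```python
import numpy as np

H = np.array([[4.0, 1.0],
              [1.0, 2.0]])                       # an arbitrary positive definite Hessian
lams, Q = np.linalg.eigh(H)                      # eigenvalues in ascending order

def eps_star(g):
    return (g @ g) / (g @ H @ g)                 # eq. (4.10)

g = np.array([1.0, 1.0])                         # some gradient
assert 1.0 / lams[-1] <= eps_star(g) <= 1.0 / lams[0]

# Worst case: g aligned with the eigenvector of the largest eigenvalue.
assert np.isclose(eps_star(Q[:, -1]), 1.0 / lams[-1])
```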

15
Constrained optimization: minimizing/maximizing a function f (x) constrained to only
values of x in some set S. One way of approaching such a problem is to re-design the uncon-
strained optimization problem such that the re-designed problem’s solution satisfies the con-
straints. For example, to minimize f (x) for x ∈ R2 with constraint ||x||2 = 1, we can minimize
g(θ) = f ([cos θ, sin θ]T ) wrt θ, then return [cos θ, sin θ]T as the solution to the original problem.
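A minimal sketch of that re-parameterization trick (the objective f below is an arbitrary choice, and the derivative of the surrogate g(θ) is taken numerically):

```python
import numpy as np

f = lambda x: (x[0] - 2.0)**2 + (x[1] + 1.0)**2    # an arbitrary objective on R^2

def g(theta):                                       # unconstrained surrogate g(theta) = f([cos t, sin t])
    return f(np.array([np.cos(theta), np.sin(theta)]))

theta, lr = 0.0, 0.1
for _ in range(500):                                # plain gradient descent on theta
    grad = (g(theta + 1e-5) - g(theta - 1e-5)) / 2e-5
    theta -= lr * grad

x_star = np.array([np.cos(theta), np.sin(theta)])   # feasible by construction: ||x_star||_2 = 1
print(x_star, np.linalg.norm(x_star))               # points toward (2, -1), i.e. (2, -1)/sqrt(5)
```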

The Karush-Kuhn-Tucker (KKT) approach, a generalization of Lagrange multipliers,


provides a general approach for re-designing the optimization problem, with procedure as
follows:
1. Find m functions g (i) and n functions h(j) such that your set of allowed values S can be
written

S = {x | ∀i, g (i) (x) = 0 and ∀j, h(j) (x) ≤ 0} (38)

The equations involving g (i) are called the equality constraints and the inequalities
involving h(j) are called the inequality constraints.
2. Introduce new variables λi (for the equality constraints) and αj (for the inequality con-
straints). These are called the KKT multipliers. The generalized Lagrangian is then
defined as
$L(x, \lambda, \alpha) = f(x) + \sum_i \lambda_i\, g^{(i)}(x) + \sum_j \alpha_j\, h^{(j)}(x)$  (39)

3. Solve the re-designed unconstrained optimization problem:

$\min_x\ \max_\lambda\ \max_{\alpha,\, \alpha \ge 0}\ L(x, \lambda, \alpha)$  (40)

which has the same optimal objective function value and set of optimal points x as the
original constrained problem, minx∈S f (x). Any time the constraints are satisfied, the
expression maxλ maxα,α≥0 L(x, λ, α) evaluates to f (x), and any time a constraint is
violated, the same expression evaluates to ∞.


16
Math and Machine Learning Basics January 25, 2017

Machine Learning Basics (Ch. 5)


Table of Contents Local Written by Brandon McKinzie

Capacity, Overfitting, and Underfitting. Difference between ML and optimization is that,


in addition to wanting low training error, we want generalization error (test error) to be
low as well. The ideal model is an oracle that simply knows the true probability distribution
p(x, y) that generates the data. The error incurred by such an oracle, due to things like inherently stochastic mappings from x to y or other variables, is called the Bayes error. The no free
lunch theorem states that, averaged over all possible data-generating distributions, every
classification algorithm has the same error rate when classifying previously unobserved points.
Therefore, the goal of ML research is to understand what kinds of distributions are relevant
to the “real world” that an AI agent experiences, and what kinds of ML algorithms perform
well on data drawn from the relevant data-generating distributions.

1.4.1 Estimators, Bias and Variance (5.4)

Point Estimation: attempt to provide “best” prediction of some quantity, such as some
parameter or even a whole function. Formally, a point estimator or statistic is any function of
the data:
 
$\hat\theta_m = g\left(x^{(1)}, \ldots, x^{(m)}\right)$  (5.19)

where, since the data is drawn from a random process, θ̂ is a random variable. Function
estimation is identical in form, where we want to estimate some f (x) with fˆ, a point estimator
in function space.

Bias. Defined below, where the expectation is taken over the data-generating distribution5 .
Bias measures the expected deviation from the true value of the func/param.
$\mathrm{bias}[\hat\theta_m] = \mathbb{E}[\hat\theta_m] - \theta$  (5.20)

TODO: Figure out how to derive $\mathbb{E}[\hat\theta_m^2]$ for the Gaussian distribution [helpful link].

5
May want to double-check this, but I’m fairly certain this is what the book meant when it said “data,”
based on later examples.

17
Bias-Variance Tradeoff .
→ Conceptual Info. Two sources of error for an estimator are (1) bias and (2) variance,
which are both defined as deviations from a certain value. Bias gives deviation from the
true value, while variance gives the [expected] deviation from this expected value.
→ Summary of main formulas.
$\mathrm{bias}[\hat\theta_m] = \mathbb{E}[\hat\theta_m] - \theta$  (41)
$\mathrm{Var}[\hat\theta_m] = \mathbb{E}\left[ \left( \hat\theta_m - \mathbb{E}[\hat\theta_m] \right)^2 \right]$  (42)

→ MSE decomposition. The MSE of the estimates is given by⁶

$\mathrm{MSE} = \mathbb{E}\left[ (\hat\theta_m - \theta)^2 \right]$  (5.53)
$\phantom{\mathrm{MSE}} = \mathrm{Bias}(\hat\theta)^2 + \mathrm{Var}[\hat\theta_m]$  (5.54)

and desirable estimators are those with low MSE.
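A quick Monte Carlo check of (5.53)-(5.54) for an assumed setup (estimating the mean of a Gaussian with the sample mean, which is unbiased):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, m, trials = 3.0, 20, 100_000

# Each row is one dataset of m samples; the estimator is the sample mean.
estimates = rng.normal(theta, 1.0, size=(trials, m)).mean(axis=1)

bias = estimates.mean() - theta
var = estimates.var()
mse = np.mean((estimates - theta)**2)
print(bias, var, mse)                        # bias ~ 0, var ~ 1/m
assert np.isclose(mse, bias**2 + var, rtol=1e-3)   # eq. (5.54)
```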

Consistency. As the number of training data points increases, we want the estimators to con-
verge to the true values. Specifically, below are the definitions for weak and strong consistency,
respectively.

$\mathrm{plim}_{m\to\infty}\, \hat\theta_m = \theta$
$P\left( \lim_{m\to\infty} \hat\theta_m = \theta \right) = 1$  (5.55)

where the symbol plim denotes convergence in probability, i.e. $P(|\hat\theta_m - \theta| > \epsilon) \to 0$ as $m \to \infty$ for any $\epsilon > 0$.

⁶ Derivation:

$\mathrm{MSE} = \mathbb{E}\left[ \hat\theta^2 + \theta^2 - 2\theta\hat\theta \right]$  (43)
$= \mathbb{E}[\hat\theta^2] + \theta^2 - 2\theta\, \mathbb{E}[\hat\theta]$  (44)
$= \left( \mathbb{E}[\hat\theta^2] - \mathbb{E}[\hat\theta]^2 \right) + \mathbb{E}[\hat\theta]^2 + \theta^2 - 2\theta\, \mathbb{E}[\hat\theta]$  (45)
$= \left( \mathbb{E}[\hat\theta]^2 + \theta^2 - 2\theta\, \mathbb{E}[\hat\theta] \right) + \left( \mathbb{E}[\hat\theta^2] - \mathbb{E}[\hat\theta]^2 \right)$  (46)
$= \mathrm{Bias}(\hat\theta)^2 + \mathrm{Var}[\hat\theta_m]$  (47)

18
1.4.2 Maximum Likelihood Estimation (5.5)

Consider set of m examples X = {x(1) , . . . , x(m) } drawn independently from the true (but
unknown) pdata (x). Let pmodel (x; θ) be parametric family of probability distributions over the
same space indexed by θ. The maximum likelihood estimator for θ can be expressed as

$\theta_{ML} = \arg\max_\theta\ \mathbb{E}_{x\sim \hat p_{data}}\left[ \log p_{model}(x; \theta) \right]$  (5.59)

where we’ve chosen to express with log for underflow/gradient reasons. One interpretation of
ML is to view it as minimizing the dissimilarity, as measured by the KL divergence7 , between
p̂data and pmodel .

Any loss consisting of a negative log-likelihood is a cross-entropy between the


p̂data distribution and the pmodel distribution.

Thoughts: Let’s look at DKL in some more detail. First, I’ll rewrite it with the explicit
definition of Ex∼p̂data [log (p̂data (x))]:

$D_{KL}(\hat p_{data}\, ||\, p_{model}) = \mathbb{E}_{x\sim \hat p_{data}}\left[ \log(\hat p_{data}(x)) - \log(p_{model}(x)) \right]$  (48)
$= \left( \frac{1}{N} \sum_{i=1}^{N} \log(\mathrm{Counts}(x_i)) - \log N \right) - \mathbb{E}_{x\sim \hat p_{data}}\left[ \log(p_{model}(x)) \right]$  (49)

Note also that our goal is to find parameters θ such that DKL is minimized. It is for this
reason, that we wish to optimize over θ, that minimizing DKL amounts to maximizing the
quantity, Ex∼p̂data [log (pmodel (x))]. Sure, I can agree this is true, but why is our goal to
minimize DKL , as opposed to minimizing | DKL |? I’m assuming it is because optimizing
w.r.t. an absolute value is challenging numerically.

Conditional Log-Likelihood and MSE. We can readily generalize $\theta_{ML}$ to estimate a conditional probability $p(y \mid x; \theta)$ in order to predict y given x, since (assuming the examples are i.i.d. here)

$\theta_{ML} = \arg\max_\theta \sum_{i=1}^{m} \log P(y^{(i)} \mid x^{(i)}; \theta)$  (5.63)

where the $x^{(i)}$ are fed as inputs to the model; this is why we can formulate MLE as a conditional probability.

7
The KL divergence is given by

DKL (p̂data ||pmodel ) = Ex∼p̂data [log p̂data (x) − log pmodel (x)] (5.60)
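Returning to eq. (5.59), a minimal sketch for an assumed Bernoulli model: the grid value of θ that maximizes the empirical average log-likelihood coincides with the closed-form MLE (the sample mean).

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.binomial(1, 0.3, size=1000)              # data drawn from Bernoulli(0.3)

thetas = np.linspace(0.01, 0.99, 981)            # candidate parameters (step 0.001)
# Empirical expectation of log p_model(x; theta) for each candidate, eq. (5.59).
avg_ll = np.mean(X[:, None] * np.log(thetas)
                 + (1 - X[:, None]) * np.log(1 - thetas), axis=0)

theta_ml = thetas[np.argmax(avg_ll)]
print(theta_ml, X.mean())                        # grid maximizer ~ sample mean
assert abs(theta_ml - X.mean()) < 1e-3
```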

19
1.4.3 Bayesian Statistics (5.6)

Distinction between frequentist and bayesian approach:


• Frequentist: Estimate θ −→ make predictions thereafter based on this estimate.
• Bayesian: Consider all possible values of θ when making predictions.

The prior. Before observing the data, we represent our knowledge of θ using the prior probability distribution p(θ). (It is common to choose a high-entropy prior, e.g. uniform.) Unlike maximum likelihood, which makes predictions using a point estimate of θ (a single value), the Bayesian approach uses Bayes' rule to make predictions using the full distribution over θ. In other words, rather than focusing on the most accurate value estimate of θ, we instead focus on pinning down a range of possible θ values and how likely we believe each of these values to be.

So what happens to θ after we observe the data? We update it using Bayes' rule⁸:

$p(\theta \mid x^{(1)}, \ldots, x^{(m)}) = \frac{p(x^{(1)}, \ldots, x^{(m)} \mid \theta)\, p(\theta)}{p(x^{(1)}, \ldots, x^{(m)})}$  (50)

Note that we still haven’t mentioned how to actually make predictions. Since we no longer
have just one value for θ, but rather we have a posterior distribution p(θ | x(1) , . . . , x(m) ), we
must integrate over this to get the predicted likelihood of the next sample x(m+1) :
$p(x^{(m+1)} \mid x^{(1)}, \ldots, x^{(m)}) = \int p(x^{(m+1)} \mid \theta)\, p(\theta \mid x^{(1)}, \ldots, x^{(m)})\, d\theta$  (51)
$= \mathbb{E}_{\theta \sim p(\theta \mid x^{(1)}, \ldots, x^{(m)})}\left[ p(x^{(m+1)} \mid \theta) \right]$  (52)

Linear Regression: MLE vs. Bayesian. Both want to model the conditional distribution
p(y | x) (the conditional likelihood). To derive the standard linear regression algorithm, we
define
$p(y \mid x) = \mathcal{N}\left( y;\ \hat y(x; w),\ \sigma^2 \right)$  (53)
$\hat y(x; w) = w^T x$  (54)

(Assume $\sigma^2$ is some fixed constant chosen by the user.)

8
In practice, we typically compute the denominator by simply normalizing the probability distribution, i.e.
it is effectively the partition function.

20
• Maximum Likelihood Approach: We can use the definition above (and the i.i.d. assumption) to evaluate the conditional log-likelihood as

$\sum_{i=1}^{m} \log p(y^{(i)} \mid x^{(i)}; \theta) = -m \log\sigma - \frac{m}{2} \log(2\pi) - \sum_{i=1}^{m} \frac{||\hat y^{(i)} - y^{(i)}||^2}{2\sigma^2}$  (5.65)

where only the last term has any dependence on w. Therefore, to obtain $w_{ML}$ we take the derivative of the last term w.r.t. w, set that to zero, and solve for w. We see that finding the w that maximizes the conditional log-likelihood is equivalent to finding the w that minimizes the training MSE. (Recall that the training MSE is $\frac{1}{m} \sum_{i=1}^{m} ||\hat y^{(i)} - y^{(i)}||^2$.)

• Bayesian Approach: Our conditional likelihood is already given in equation 53. Next, we must define a prior distribution over w. As is common, we choose a Gaussian prior to express our high degree of uncertainty about θ (implying we'll choose a relatively large variance):

$p(w) := \mathcal{N}(w;\ \mu_0,\ \Lambda_0)$  (55)

(Typically we assume $\Lambda_0 = \operatorname{diag}(\lambda_0)$.) We can then compute [the unnormalized] $p(w \mid X, y) \propto p(y \mid X, w)\, p(w)$ [and then normalize it].

Maximum A Posteriori (MAP) Estimation. Often we either prefer a point estimate for
θ, or we find out that computing the posterior distribution is intractable and a point estimate
offers a tractable estimation. The obvious way of obtaining this while still taking the Bayesian
route is to just argmax the posterior and use that as your point estimate:

$\theta_{MAP} = \arg\max_\theta p(\theta \mid x) = \arg\max_\theta\ \log p(x \mid \theta) + \log p(\theta)$  (56)

where the second form shows how this is basically maximum likelihood with incorporation of
the prior. We don’t want just any θ that maximizes the likelihood of our data if there is
virtually no chance of that value of θ in the first place.
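A concrete sketch (an assumed setup, not from the text): with the Gaussian likelihood of eq. 53 and a zero-mean Gaussian prior $p(w) = \mathcal{N}(w; 0, \lambda_0 I)$, the MAP estimate of eq. 56 reduces to ridge regression, while plain MLE reduces to ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.5, size=50)

sigma2, lam0 = 0.25, 1.0                 # noise variance and prior variance (assumed known)

# MLE / least squares: argmax of sum_i log N(y_i; w^T x_i, sigma2).
w_ml = np.linalg.solve(X.T @ X, X.T @ y)

# MAP with prior N(0, lam0 * I): the log-prior adds an L2 penalty, giving the ridge solution.
alpha = sigma2 / lam0
w_map = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)

print(w_ml, w_map)                       # w_map is shrunk toward zero relative to w_ml
```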

21
1.4.4 Supervised Learning Algorithms (5.7)

Logistic Regression. We've already seen that linear regression corresponds to the family

$p(y \mid x) = \mathcal{N}\left( y;\ \theta^T x,\ I \right)$  (5.80)

which we can generalize to the binary classification scenario by interpreting the output as the probability of class 1. One way of doing this while ensuring the output is between 0 and 1 is to use the logistic sigmoid function:

$p(y = 1 \mid x; \theta) = \sigma(\theta^T x)$  (5.81)

(Equation 5.81 is the definition of logistic regression.) Unfortunately, there is no closed-form solution for θ, so we must search for it, e.g. by maximizing the log-likelihood with gradient-based methods.
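A minimal NumPy sketch of that search (assumed setup: plain gradient ascent on the average log-likelihood of synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

theta_true = np.array([2.0, -1.0])
X = rng.normal(size=(500, 2))
y = rng.binomial(1, sigmoid(X @ theta_true))       # labels drawn from the model

theta = np.zeros(2)
lr = 0.1
for _ in range(2000):
    p = sigmoid(X @ theta)
    grad = X.T @ (y - p) / len(y)                  # gradient of the average log-likelihood
    theta += lr * grad                             # ascend (maximize the log-likelihood)

print(theta)                                       # close to theta_true
```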

Support Vector Machines. Driven by a linear function $w^T x + b$ like logistic regression, but instead of outputting probabilities it outputs a class identity, which depends on the sign of $w^T x + b$. SVMs make use of the kernel trick, the "trick" being that we can rewrite $w^T x + b$ completely in terms of dot products between examples. The general form of our prediction function becomes

$f(x) = b + \sum_i \alpha_i\, k(x, x^{(i)})$  (5.83)

(If our kernel function is just $k(x, x^{(i)}) = x^T x^{(i)}$, then we've just rewritten w in the form $w \to X^T \alpha$.) Here the kernel [function] takes the general form $k(x, x^{(i)}) = \phi(x) \cdot \phi(x^{(i)})$. A major drawback to kernel machines (methods) in general is that the cost of evaluating the decision function f(x) is linear in the number of training examples. SVMs, however, are able to mitigate this by learning an α with mostly zeros. The training examples with nonzero $\alpha_i$ are known as support vectors.
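A small sketch of eq. (5.83) with an assumed Gaussian (RBF) kernel, just to show the shape of the computation; the support vectors, α, and b here are made up rather than learned:

```python
import numpy as np

def rbf_kernel(x, x_i, gamma=0.5):
    return np.exp(-gamma * np.sum((x - x_i)**2))

# Pretend these came out of SVM training: a few support vectors and their coefficients.
support_vecs = np.array([[0.0, 1.0], [2.0, -1.0], [-1.0, 0.5]])
alphas = np.array([0.7, -1.2, 0.4])
b = 0.1

def f(x):
    # eq. (5.83): b + sum_i alpha_i k(x, x^(i)); cost is linear in the number of support vectors.
    return b + sum(a * rbf_kernel(x, sv) for a, sv in zip(alphas, support_vecs))

x_new = np.array([0.5, 0.5])
print(f(x_new), np.sign(f(x_new)))      # decision value and predicted class identity
```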

22
Deep Networks:
Modern
Practices
Contents

2.1 Deep Feedforward Networks (Ch. 6) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24


2.1.1 Back-Propagation (6.5) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2 Regularization for Deep Learning (Ch. 7) . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3 Optimization for Training Deep Models (Ch. 8) . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4 Convolutional Neural Networks (Ch. 9) . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.5 Sequence Modeling (RNNs) (Ch. 10) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.5.1 Review: The Basics of RNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.5.2 RNNs as Directed Graphical Models . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.5.3 Challenge of Long-Term Deps. (10.7) . . . . . . . . . . . . . . . . . . . . . . . . 42
2.5.4 LSTMs and Other Gated RNNs (10.10) . . . . . . . . . . . . . . . . . . . . . . . 43
2.6 Applications (Ch. 12) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.6.1 Natural Language Processing (12.4) . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.6.2 Neural Language Models (12.4.2) . . . . . . . . . . . . . . . . . . . . . . . . . . 45

23
Modern Practices January 26

Deep Feedforward Networks (Ch. 6)


Table of Contents Local Written by Brandon McKinzie

The strategy/purpose of [feedforward] deep learning is to learn the set of features/representa-


tion describing x with a mapping φ before applying a linear model. In this approach, we have
a model
y = f (x; θ, w) = φ(x; θ)T w
with φ defining a hidden layer.

ReLUs and their generalizations. Some nice properties of ReLUs (recall the ReLU activation function: $g(z) = \max\{0, z\}$) are. . .

• Derivatives through a ReLU remain large and consistent whenever the unit is active.
• The second derivative is 0 a.e. ("almost everywhere") and the derivative is 1 everywhere the unit is active, meaning the gradient direction is more useful for learning than it would be with activation functions that introduce 2nd-order effects (see equation 4.9).

Generalizing to aid gradients when z < 0. Three such generalizations are based on using
a nonzero slope αi when zi < 0:

hi = g(z, α)i = max(0, zi ) + αi min(0, zi ) (57)

→ Absolute value rectification: fix αi = −1 to obtain g(z) = |z|.


→ Leaky ReLU: fix αi to a small value like 0.01.
→ Parametric ReLU (PReLU): treats αi like a learnable parameter.
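A one-line view of eq. (57) and its special cases (a minimal NumPy sketch; the α values are the ones named above):

```python
import numpy as np

def generalized_relu(z, alpha):
    # eq. (57): elementwise max(0, z) + alpha * min(0, z)
    return np.maximum(0, z) + alpha * np.minimum(0, z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(generalized_relu(z, 0.0))    # plain ReLU
print(generalized_relu(z, -1.0))   # absolute value rectification: |z|
print(generalized_relu(z, 0.01))   # leaky ReLU
# PReLU: same formula, but alpha is a learnable parameter updated by backprop.
```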

Logistic sigmoid and hyperbolic tangent. Sigmoid activations on hidden units is a bad
idea, since they’re only sensitive to their inputs near zero, with small gradients everywhere else.
If sigmoid activations must be used, tanh is probably a better substitute, since it resembles
the identity (i.e. a linear function) near zero.

24
2.1.1 Back-Propagation (6.5)

The chain rule. Suppose z = f(y) where y = g(x), with $x \in \mathbb{R}^m$, $y \in \mathbb{R}^n$, $g: \mathbb{R}^m \to \mathbb{R}^n$, and $f: \mathbb{R}^n \to \mathbb{R}$ (so z is a scalar). Then⁹,

$(\nabla_x z)_i = \frac{\partial z}{\partial x_i} = \sum_{j=1}^{n} \frac{\partial z}{\partial y_j} \frac{\partial y_j}{\partial x_i} = \sum_{j=1}^{n} (\nabla_y z)_j\, \frac{\partial y_j}{\partial x_i} = \sum_{j=1}^{n} (\nabla_y z)_j\, (\nabla_x y_j)_i$  (6.45)

$\rightarrow\ \nabla_x z = \left( \frac{\partial y}{\partial x} \right)^T \nabla_y z = J_{y=g(x)}^T\, \nabla_y z$  (6.46)

From this we see that the gradient of a variable x can be obtained by multiplying a [transposed] Jacobian matrix $\frac{\partial y}{\partial x}$ by a gradient $\nabla_y z$.

9
Note that we can view z = f (y) as a multi-variable function of the dimensions of y,

z = f (y1 , y2 , . . . , yn )
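A tiny NumPy check of eq. (6.46) for an assumed pair of functions (y = Ax and z = sum of squares): the gradient w.r.t. x is the transposed Jacobian of g applied to the gradient w.r.t. y.

```python
import numpy as np

# y = g(x) = A x (so the Jacobian dy/dx is just A), and z = f(y) = sum(y**2).
A = np.array([[1.0, 2.0],
              [0.0, -1.0],
              [3.0, 1.0]])            # g: R^2 -> R^3
x = np.array([0.5, -2.0])
y = A @ x

grad_y = 2.0 * y                      # nabla_y z for f(y) = sum(y_j^2)
grad_x = A.T @ grad_y                 # eq. (6.46): J^T (nabla_y z)

# Finite-difference check of nabla_x z.
def z_of_x(x):
    return np.sum((A @ x)**2)
eps = 1e-6
numeric = np.array([(z_of_x(x + eps * e) - z_of_x(x - eps * e)) / (2 * eps)
                    for e in np.eye(2)])
assert np.allclose(grad_x, numeric, atol=1e-4)
```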

25
Modern Practices January 12, 2017

Regularization for Deep Learning (Ch. 7)


Table of Contents Local Written by Brandon McKinzie

Recall the definition of regularization: "any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error."

Limiting Model Capacity. Recall that Capacity [of a model] is the ability to fit a wide
variety of functions. Low cap models may struggle to fit training set, while high cap models
may overfit by simply memorizing the training set. We can limit model capacity by adding a
parameter norm penalty Ω(θ) to the objective function J:

$\tilde J(\theta; X, y) = J(\theta; X, y) + \alpha\, \Omega(\theta)$, where $\alpha \in [0, \infty)$  (7.1)

where we typically choose Ω to only penalize the weights and leave biases unregularized.

L2-Regularization. Defined as setting $\Omega(\theta) = \frac{1}{2} ||w||_2^2$. Assume that J(w) is quadratic, with minimum at $w^*$. Since it is quadratic, we can approximate J with a second-order expansion about $w^*$:

$\hat J(w) = J(w^*) + \frac{1}{2} (w - w^*)^T H (w - w^*)$  (7.6)
$\nabla_w \hat J(w) = H(w - w^*)$  (7.7)

where $H_{ij} = \left. \frac{\partial^2 J}{\partial w_i\, \partial w_j} \right|_{w^*}$. If we add in the [derivative of] the weight decay and set to zero, we obtain the solution

$\tilde w = (H + \alpha I)^{-1} H w^*$  (7.10)
$\phantom{\tilde w} = Q(\Lambda + \alpha I)^{-1} \Lambda Q^T w^*$  (7.13)

which shows that the effect of regularization is to rescale the component of $w^*$ along the i-th eigenvector of H by $\frac{\lambda_i}{\lambda_i + \alpha}$. This means that components along eigenvectors with $\lambda_i \gg \alpha$ are relatively unchanged, but those along eigenvectors with $\lambda_i \ll \alpha$ are shrunk to nearly zero. In other words, only directions along which the parameters contribute significantly to reducing the objective function are preserved relatively intact.
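A small NumPy check of eqs. (7.10)/(7.13) for an arbitrary quadratic (H and $w^*$ are made up): the regularized solution equals $w^*$ with its components along H's eigenvectors shrunk by $\lambda_i / (\lambda_i + \alpha)$.

```python
import numpy as np

H = np.array([[5.0, 1.0],
              [1.0, 0.3]])                 # an arbitrary positive definite Hessian
w_star = np.array([1.0, 1.0])
alpha = 1.0

w_tilde = np.linalg.solve(H + alpha * np.eye(2), H @ w_star)      # eq. (7.10)

lam, Q = np.linalg.eigh(H)
shrink = lam / (lam + alpha)                                      # per-direction rescaling factors
w_eig = Q @ (shrink * (Q.T @ w_star))                             # eq. (7.13)

assert np.allclose(w_tilde, w_eig)
print(shrink)      # the small-lambda direction is shrunk far more than the large-lambda one
```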

26
Sparse Representations. Weight decay acts by placing a penalty directly on the model
parameters. Another strategy is to place a penalty on the activations of the units, encouraging
their activations to be sparse. It’s important to distinguish the difference between sparse
parameters and sparse representations. In the former, if we take the example of some y = Bh,
there are many zero entries in some parameter matrix B while, in the latter, there are many
zero entries in the representation vector h. The modification to the loss function, analogous
to 7.1, takes the form

$\tilde J(\theta; X, y) = J(\theta; X, y) + \alpha\, \Omega(h)$, where $\alpha \in [0, \infty)$  (7.48)

Adversarial Training. Even networks that perform at human-level accuracy have a nearly 100 percent error rate on examples that are intentionally constructed by searching for an input $x'$ near a data point x such that the model output for $x'$ is very different from the output for x. (In many cases, $x'$ can be so similar to x that a human cannot tell the difference!)

$x' \leftarrow x + \epsilon \cdot \operatorname{sign}\left( \nabla_x J(\theta; x, y) \right)$  (58)

In the context of regularization, one can reduce the error rate on the original i.i.d. test set via adversarial training – training on adversarially perturbed training examples.
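A minimal sketch of eq. (58) for an assumed logistic-regression model, where the gradient of the cross-entropy loss w.r.t. the input has the closed form $(p - y)\,\theta$:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

theta = np.array([1.5, -2.0, 0.5])       # a "trained" (here: made-up) linear classifier
x = np.array([0.3, 0.1, -0.4])
y = 1.0
eps = 0.1

# Cross-entropy loss J = -[y log p + (1-y) log(1-p)] with p = sigmoid(theta^T x);
# its gradient w.r.t. the input x is (p - y) * theta.
p = sigmoid(theta @ x)
grad_x = (p - y) * theta

x_adv = x + eps * np.sign(grad_x)        # eq. (58)
print(p, sigmoid(theta @ x_adv))         # the adversarial point is scored noticeably worse
```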

27
Modern Practices Feb 17, 2017

Optimization for Training Deep Models (Ch. 8)


Table of Contents Local Written by Brandon McKinzie

Empirical Risk Minimization. The ultimate goal of any machine learning algorithm is to
reduce the expected generalization error, also called the risk:
risk definitions
J ∗ (θ) = E(x,y)∼pdata [L(f (x; θ), y)] (59)

with emphasis that the risk is over the true underlying data distribution pdata . If we knew
pdata , this would be an optimization problem. Since we don’t, and only have a set of training
samples, it is a machine learning problem. However, we can still just minimize the empirical
risk, replacing pdata in the equation above with p̂data 10 .

So, how is minimizing the empirical risk any different than familiar gradient descent ap-
proaches? Aren't they designed to do just that? Well, sort of, but it's technically not the same (ERM ≠ GD). When we say "minimize the empirical risk" in the context of optimization, we mean this
very literally. Gradient descent methods emphatically do not just go and set the weights to val-
ues such that the empirical risk reaches its lowest possible value – that’s not machine learning.
Furthermore, many useful loss function such as 0-1 loss11 do not have useful derivatives.

Surrogate Loss Functions and Early Stopping. In cases such as 0-1 loss, where mini-
mization is intractable, one typically optimizes a surrogate loss function instead, such as
the negative log-likelihood of the correct class. Also, an important difference between pure
optimization and our training algorithms is that the latter usually don’t halt at a local min-
imum. Instead, we usually must define some early stopping condition to terminate training
before overfitting begins to occur.

We want to minimize the risk, but we don’t have access to pdata , so . . .


We want to minimize the empirical risk, but it’s prone to overfitting and our loss function’s
derivative may be zero/undefined, so . . .
We minimize a surrogate loss function iteratively over minibatches until early stopping is
triggered.

¹⁰ This amounts to a simple average over the loss function at each training point.
¹¹ The 0-1 loss function is defined as

$L(\hat y, y) = \mathbb{I}(\hat y \ne y)$  (60)

28
Batch and Minibatch Algorithms. Computing ∇θ J(θ) as an expectation over the entire
training set is expensive, so we typically compute the expectation over a small subset of the
examples. Recall that the standard deviation, or standard error SE(µm ), of the mean taken
1 P (i) √
over some subset of m ≤ n samples, µm = m i∼Rand(0, n, size=m) x , is given by σ/ m,
where σ is the true [sample] standard deviation of the full n data samples. In other words,
to improve such a gradient by a factor of 10 requires 100 times more samples-per-batch (and
thus 100 times more computation). For this reason, most optimization algorithms actually
converge much faster if they can rapidly compute approximate estimates of the gradient (re:
smaller batches) rather than slowly computing the exact gradient.

The key points to consider when choosing your batch size:


1. Larger batches = more accurate estimates of the gradient, but with less than linear
returns.
2. If examples in the batch are processed in parallel (as is typical), then memory roughly
scales with batch size.
3. Small batches can offer a regularizing effect. Generalization error is often best for a
batch size of 1. However, this requires a low learning rate to maintain stability and thus
a longer overall training runtime.
Also, note that online SGD, where we never reuse data points, but simply update parameters
as new data comes in, gives an unbiased estimator of the exact gradient of the generalization
error (the risk). Once data samples are reused (e.g. when training with multiple epochs),
the gradient estimates become biased. The interesting point here is that the availability of
increasingly massive datasets is making single-epoch¹² training more common. In such cases,
the main concerns are no longer overfitting, but rather underfitting and computational efficiency.
Ill-conditioning of the Hessian matrix H can cause SGD to get “stuck” in the sense that even
very small steps increase the cost function. Recall that a second-order Taylor series expansion
of the cost function predicts that an SGD step of −εg will add

    (1/2) ε² gᵀHg − ε gᵀg          (61)

to the cost. If H has a large condition number (i.e. if H is ill-conditioned), then gᵀHg can be
far larger than gᵀg (the Rayleigh quotient gᵀHg/gᵀg can range over [λ_min, λ_max]). In
particular, if (1/2)ε²gᵀHg exceeds εgᵀg, then the SGD step will increase the cost!

12
Or even less, i.e. not using all of the training data.

Training algorithms. Below, I list some popular training algorithms and their update equa-
tions.

• SGD.
      g ← (1/m) Σᵢ ∇_θ L( f(x^(i); θ), y^(i) )          (62)
      θ ← θ − ε g                                        (63)

• Momentum.
      g ← (1/m) Σᵢ ∇_θ L( f(x^(i); θ), y^(i) )          (64)
      v ← α v − ε g                                      (65)
      θ ← θ + v                                          (66)

• Nesterov Momentum. Gradient computations are instead evaluated after the current velocity
  is applied.

      g ← (1/m) Σᵢ ∇_θ L( f(x^(i); θ + αv), y^(i) )      (67)
      v ← α v − ε g                                      (68)
      θ ← θ + v                                          (69)

• AdaGrad. Different learning rate for each model parameter. Individually adapts the
learning rates of all model parameters by scaling them inversely proportional to the
square root of the sum of all historical squared values of the gradient. Empirically, can
result in premature and excessive decrease in the effective learning rate.
      g ← (1/m) Σᵢ ∇_θ L( f(x^(i); θ), y^(i) )           (70)
      r ← r + g ⊙ g                                      (71)
      θ ← θ − ( ε / (δ + √r) ) ⊙ g                        (72)

  where the gradient accumulation variable r is initialized to the zero vector, and the
  division and square root in the last equation are applied element-wise.
• RMSProp. Modifies AdaGrad by changing the gradient accumulation into an exponen-
tially weighted moving average.
      g ← (1/m) Σᵢ ∇_θ L( f(x^(i); θ), y^(i) )           (73)
      r ← ρ r + (1 − ρ) g ⊙ g                            (74)
      θ ← θ − ( ε / √(δ + r) ) ⊙ g                        (75)
It is also common to modify RMSProp to use Nesterov momentum.

• Adam. So-named to mean “adaptive moments.” We now call r the 2nd moment (vari-
ance) variable, and introduce s as the 1st moment (mean) variable, where the moments
are for the [true] gradient; the new variables act as estimates of the moments [since we
estimate the gradient with a simple average over a minibatch]. Note that these moments
are uncentered.
      g ← (1/m) Σᵢ ∇_θ L( f(x^(i); θ), y^(i) )           (76)
      s ← [ ρ₁ s + (1 − ρ₁) g ] / (1 − ρ₁ᵗ)              (77)
      r ← [ ρ₂ r + (1 − ρ₂) g ⊙ g ] / (1 − ρ₂ᵗ)          (78)
      θ ← θ − ( ε / (δ + √r) ) ⊙ s                        (79)

  where the 1/(1 − ρ₁ᵗ) and 1/(1 − ρ₂ᵗ) factors serve to correct for bias in the moment
  estimates (t is the current time step).
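As a concrete reference, here is a minimal NumPy sketch (my own, not from the book) of a few of the update rules above, written for a single parameter vector; the gradient g is assumed to be the minibatch gradient computed elsewhere:

```python
# Sketch (my own) of the update rules above, for a single parameter vector
# theta. The minibatch gradient g (eqs. 62/64/76) is computed elsewhere.
import numpy as np

def sgd_step(theta, g, eps=0.01):
    return theta - eps * g                                        # (63)

def momentum_step(theta, v, g, eps=0.01, alpha=0.9):
    v = alpha * v - eps * g                                       # (65)
    return theta + v, v                                           # (66)

def adam_step(theta, s, r, g, t, eps=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    s = rho1 * s + (1 - rho1) * g                                 # 1st-moment accumulator
    r = rho2 * r + (1 - rho2) * g * g                             # 2nd-moment accumulator
    s_hat = s / (1 - rho1 ** t)                                   # bias corrections
    r_hat = r / (1 - rho2 ** t)
    theta = theta - eps * s_hat / (np.sqrt(r_hat) + delta)        # (79)
    return theta, s, r

# Tiny usage example: minimize f(theta) = ||theta||^2 with Adam.
theta, s, r = np.ones(3), np.zeros(3), np.zeros(3)
for t in range(1, 501):
    g = 2 * theta                         # exact gradient of the toy objective
    theta, s, r = adam_step(theta, s, r, g, t, eps=0.05)
print(theta)                              # driven toward the minimum at 0
```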

Modern Practices January 24, 2017

Convolutional Neural Networks (Ch. 9)


Table of Contents Local Written by Brandon McKinzie

We use a 2-D image I as our input (and therefore require a 2-D kernel K). Note that most
neural networks do not technically implement convolution13 , but instead implement a related
function called the cross-correlation, defined as

    S(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(i + m, j + n) K(m, n)          (9.6)

Convolution leverages the following three important ideas:


• Sparse interactions[/connectivity/weights]. Individual input units only interact/-
connect with a subset of the output units. Accomplished by making the kernel smaller
than the input. It’s important to recognize that the receptive field of the units in the
deeper layers of a convolutional network is larger than the receptive field of the units in
the shallow layers, as seen below.

• Parameter sharing.
• Equivariance (to translation). Changes in inputs [to a function] cause output to change
in the same way. Specifically, f is equivariant to g if f (g(x)) = g(f (x)). For convolution,
g would be some function that translates the input.

¹³ Technically the convolution output is defined as

    S(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(m, n) K(i − m, j − n)          (9.4)
            = (K ∗ I)(i, j) = Σ_m Σ_n I(i − m, j − n) K(m, n)          (9.5)

where 9.5 can be asserted due to commutativity of convolution.
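A small sketch (my own, not from the book) that makes the distinction concrete: cross-correlation slides the kernel as-is (eq. 9.6), while convolution is the same operation with the kernel flipped along both axes (eqs. 9.4/9.5):

```python
# Sketch (my own): 2-D cross-correlation (what most "conv" layers implement,
# eq. 9.6) versus true convolution (eq. 9.4), which flips the kernel.
import numpy as np

def cross_correlate2d(I, K):
    kh, kw = K.shape
    H, W = I.shape[0] - kh + 1, I.shape[1] - kw + 1   # "valid" output size
    S = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            S[i, j] = np.sum(I[i:i + kh, j:j + kw] * K)
    return S

def convolve2d(I, K):
    # Convolution = cross-correlation with the kernel flipped in both axes
    # (up to a shift of the output indices).
    return cross_correlate2d(I, np.flip(K))

I = np.arange(25, dtype=float).reshape(5, 5)
K = np.array([[1., 0.], [0., -1.]])
print(cross_correlate2d(I, K))
print(convolve2d(I, K))   # differs unless K is symmetric under the flip
```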

Pooling. Helps make the representation approximately invariant to small translations of the
input. The use of pooling can be viewed as adding an infinitely strong prior14 that the function
the layer learns must be invariant to small translations.

Additional common tricks15 .


• Local Response Normalization (LRN)¹⁶. The purpose is to aid generalization.
  Let a^i_{x,y} denote the activity of a neuron computed by applying kernel i at position (x, y)
  and then applying the ReLU nonlinearity. The response-normalized activity b^i_{x,y} is given
  by the expression

      b^i_{x,y} = a^i_{x,y} / ( k + α Σ_j (a^j_{x,y})² )^β          (80)

  where j runs from max(0, i − n/2) to min(N − 1, i + n/2), and N is the total number of kernels
  in the given layer¹⁷. The authors used k = 2, n = 5, α = 10⁻⁴, β = 0.75.
• Batch Normalization¹⁸. BN allows us to use much higher learning rates and be less
  careful about initialization. The algorithm is defined in the image below, where each element of
  the batch, x_i ≡ x_i^(k) (where we drop the k for notational simplicity), represents the kth
  activation output from the previous layer [for the ith sample in the batch], about to
  be fed as input to the current layer.

Note that one can model each layer’s activations as arising from some distribution. When
we feed data to a network, we model the data as coming from some data-generating
14
Where the distribution of this prior is over all possible functions learnable by the model.
15
Collected on my own. In other words, not from the deep learning book, but rather a bunch of disconnected
resources over time.
16
From section 3.3 of Krizhevsky et al. (2012). AlexNet paper.
17
In other words, the summation is over the adjacent kernel maps, with [total] window size n (manually
chosen). The min/max just says n/2 to the left (right) unless that would be past the leftmost (rightmost)
kernel map.
18
From “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”
by Ioffe et al.

distribution. Similarly, we can model the activations that occur when feeding the data
through as coming from some activation-layer-generating distribution. The problem with
this model is that the process of updating our weights during training changes the dis-
tribution of activations for each layer, which can make the learning task more difficult.
Batch normalization’s goal is to reduce this internal covariate shift, and is motivated by
the practice of normalizing our data to have zero mean and unit variance.
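A minimal sketch (my own, following the per-feature normalization summarized above) of the batch-norm forward pass for one layer; gamma and beta are the learned scale and shift:

```python
# Sketch (my own): batch-normalize one layer's pre-activations for a minibatch,
# then apply the learned scale/shift (gamma, beta).
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """x: (batch_size, num_features) activations from the previous layer."""
    mu = x.mean(axis=0)                    # per-feature minibatch mean
    var = x.var(axis=0)                    # per-feature minibatch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance per feature
    return gamma * x_hat + beta            # restore representational capacity

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 10))
y = batch_norm_forward(x, gamma=np.ones(10), beta=np.zeros(10))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # ~0 and ~1 per feature
```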

Intuition of some math.


• Q: How to intuitively understand the commutativity of convolution?
– A: You must first realize that, independent of which formula we’re thinking of, as
one index into either the image or kernel increases (decreases), the other decreases
(increases); they increment in opposite directions. The difference between the two
formulas (9.4 and 9.5 in textbook), is just that we start at different ends of the
summations (when you actually substitute in the numbers). Note that this doesn’t
require any symmetry on the boundaries of the summation about zero; it truly is a
property of the convolution.
• Q: What do the authors mean by “we have flipped the kernel”?
– A: Not much, and it’s poor wording. They didn’t do anything, that is just part of
the definition of the convolution. They literally just mean that the convolution has
the property that as one index increases, the other decreases (see previous answer).
The cross-correlation, however, has the property that as one index increases, the
other increases, too.

Modern Practices January 15

Sequence Modeling (RNNs) (Ch. 10)


Table of Contents Local Written by Brandon McKinzie

2.5.1 Review: The Basics of RNNs

Notation/Architecture Used.
• U: input → hidden.
• W: hidden → hidden.
• V: hidden → output.
• Activations: tanh [hidden] and softmax [after output].
• Misc. Details: x(t) is a vector of inputs fed at time t. Recall that RNNs can be unfolded for any
desired number of steps τ . For example, if τ = 3, the general functional representation output of an RNN is
s(3) = f (s(2) ; θ) = f (f (s(1) ; θ); θ). Typical RNNs read information out of the state h to make predictions.
Shape of x(t) fixed, e.g.
vocab length.

Black square on
recurrent connection ≡
interaction w/delay of a
single time step.

Forward Propagation & Loss. Specify initial state h(0) . Then, for each time step from
t = 1 to t = τ , feed input sequence x(t) and compute the output sequence o(t) . To determine
the loss at each time-step, L(t) , we compare softmax(o(t) ) with (one-hot) y (t) .
h(t) = tanh(a(t) ) where a(t) = b + W h(t−1) + U x(t) (10.9/8)
(t) (t) (t) (t)
ŷ = softmax(o ) where o =c+Vh (10.11/10)
Note that this is an example of an RNN that maps input seqs to output seqs of the same
length¹⁹. We can then compute, e.g., the log-likelihood loss L = Σ_t L^(t) over all time steps as:

    L = − Σ_t log p_model( y^(t) | {x^(1), . . . , x^(t)} )          (10.12/13/14)

(Convince yourself this is identical to cross-entropy.)
19
Where “same length” is related to the number of timesteps (i.e. τ input steps means τ output steps), not
anything about the actual shapes/sizes of each individual input/output.

where y^(t) is the ground-truth (one-hot vector) at time t, whose probability of occurring is
given by the corresponding element of ŷ^(t).

Back-Propagation Through Time.

1. Internal-Node Gradients. In what follows, when considering what is included in the
   chain rule(s) for gradients with respect to a node N, we just need to consider paths from it
   [through its descendants] to the loss node(s).
• Output nodes. For any given time t, the node o^(t) has only one direct descendant,
  the loss node L^(t). Since no other loss nodes can be reached from o^(t), it is the only
  one we need consider in the gradient. (Note: the ground-truth y^(t) here is a scalar, interpreted
  as the index of the correct label in the output vector, and 1_{i,y^(t)} denotes the indicator that
  is 1 if y^(t) = i and 0 otherwise.)

      (∇_{o^(t)} L)_i = ∂L/∂o_i^(t)
          = (∂L/∂L^(t)) · (∂L^(t)/∂o_i^(t))
          = (1) · ∂L^(t)/∂o_i^(t)
          = ∂/∂o_i^(t) { − log ŷ^(t)_{y^(t)} }                                               (81)
          = − ∂/∂o_i^(t) log( exp(o^(t)_{y^(t)}) / Σ_j exp(o_j^(t)) )
          = − ∂/∂o_i^(t) [ o^(t)_{y^(t)} − log Σ_j exp(o_j^(t)) ]
          = − [ 1_{i,y^(t)} − ∂/∂o_i^(t) log Σ_j exp(o_j^(t)) ]
          = − [ 1_{i,y^(t)} − (1 / Σ_j exp(o_j^(t))) · ∂/∂o_i^(t) Σ_j exp(o_j^(t)) ]
          = − [ 1_{i,y^(t)} − exp(o_i^(t)) / Σ_j exp(o_j^(t)) ]                              (10.18)
          = − [ 1_{i,y^(t)} − ŷ_i^(t) ]
          = ŷ_i^(t) − 1_{i,y^(t)}

which leaves every entry of the gradient equal to ŷ^(t), except the entry corresponding to the
true label, which is reduced by 1 (and hence typically negative). All this means is: since we
want to increase the probability of the correct entry, driving its value up will decrease
the loss (hence the negative sign), while driving any other entry up will increase the loss in
proportion to its current estimated probability (driving up an [incorrect] entry
that is already high is "worse" than driving up a small [incorrect] entry).
• Hidden nodes. First, consider the simplest hidden node to take the gradient of,
  the last one, h^(τ) (simplest because only one descendant [path] reaching any loss
  node(s)).

      (∇_{h^(τ)} L)_i = Σ_{k=1}^{n_out} (∂L/∂L^(τ)) (∂L^(τ)/∂o_k^(τ)) (∂o_k^(τ)/∂h_i^(τ))
          = Σ_{k=1}^{n_out} (∇_{o^(τ)} L)_k · ∂o_k^(τ)/∂h_i^(τ)
          = Σ_{k=1}^{n_out} (∇_{o^(τ)} L)_k · ∂/∂h_i^(τ) [ c_k + Σ_{j=1}^{n_hid} V_kj h_j^(τ) ]          (10.19)
          = Σ_{k=1}^{n_out} (∇_{o^(τ)} L)_k V_ki
          = Σ_{k=1}^{n_out} (V^T)_{ik} (∇_{o^(τ)} L)_k
          = ( V^T ∇_{o^(τ)} L )_i

Before proceeding, notice the following useful pattern: If two nodes a and b,
each containing n_a and n_b neurons, are fully connected by parameter matrix W ∈ R^{n_b × n_a}
and directed like a → b → L, then²⁰ ∇_a L = W^T ∇_b L. Using this result, we can
then iterate and take gradients back in time from t = τ − 1 to t = 1 as follows:

    ∇_{h^(t)} L = (∂h^(t+1)/∂h^(t))^T (∇_{h^(t+1)} L) + (∂o^(t)/∂h^(t))^T (∇_{o^(t)} L)          (10.20)
               = W^T diag(1 − tanh²(a^(t+1))) (∇_{h^(t+1)} L) + V^T (∇_{o^(t)} L)
               = W^T diag(1 − (h^(t+1))²) (∇_{h^(t+1)} L) + V^T (∇_{o^(t)} L)                     (10.21)

(Recall that d tanh(x)/dx = 1 − tanh²(x), and (diag(a))_{ii} ≜ a_i.)

2. Parameter Gradients. Now we can compute the gradients for the parameter matri-
ces/vectors, where it is crucial to remember that a given parameter matrix (e.g. U )
is shared across all time steps t. We can treat tensor derivatives in the same form as
²⁰ More generally,

    ∇_a L = (∂b/∂a)^T ∇_b L

which is a good example of how vector derivatives map into a matrix. For example, let a ∈ R^{n_a} and b ∈ R^{n_b}.
Then

    ∂b/∂a ∈ R^{n_b × n_a}
previously done with vectors after a quick abstraction: For any tensor X of arbitrary
rank (e.g. if rank-4 then index like X_{ijkl}), use a single variable (e.g. i) to represent the
complete tuple of indices²¹.
• Bias parameters [vectors]. These are nothing new, since they are just vectors.

      ∇_c L = Σ_t (∂o^(t)/∂c)^T (∇_{o^(t)} L)
            = Σ_t ∇_{o^(t)} L                                                   (10.22)

      ∇_b L = Σ_t (∂h^(t)/∂b)^T (∇_{h^(t)} L)
            = Σ_t diag(1 − (h^(t))²) (∇_{h^(t)} L)                               (10.23)

• V (n_out × n_hid).

      ∇_V L = Σ_{t=1}^{τ} ∇_V L^(t)                                                          (82a)
            = Σ_t ∇_V L^(t)( o_1^(t), . . . , o_{n_out}^(t) )                                 (82b)
            = Σ_t Σ_{i=1}^{n_out} (∇_{o^(t)} L)_i ∇_V o_i^(t)                                 (82c)
            = Σ_t Σ_i (∇_{o^(t)} L)_i ∇_V [ c_i + Σ_{j=1}^{n_hid} V_ij h_j^(t) ]              (82d)
            = Σ_t Σ_i (∇_{o^(t)} L)_i (matrix that is all zeros except row i, which is (h^(t))^T)   (82e)
            = Σ_t (∇_{o^(t)} L) (h^(t))^T                                                     (82f)

²¹ More details on tensor derivatives: Consider the chain defined by Y = g(X) and z = f(Y), where z is
a scalar. Then

    ∇_X z = Σ_j (∇_X Y_j) ∂z/∂Y_j
where, in 82e, the only nonzero row of the matrix is the ith row, which is (h^(t))^T; if this
confuses you, see footnote²².

• W (n_hid × n_hid). This one is a bit odd, since W is, in a sense, even more "shared"
  across time steps than V²³. The authors here define/choose, when evaluating
  ∇_W h_i^(t), to only concern themselves with W := W^(t), i.e. the direct connections
  to h at time t.

      ∇_W L = Σ_{t=1}^{τ} ∇_W L^(t)                                                           (84a)
            = Σ_t Σ_{i=1}^{n_hid} (∇_{h^(t)} L)_i ∇_{W^(t)} h_i^(t)                            (10.25)
            = Σ_t Σ_i (∇_{h^(t)} L)_i (1 − (h_i^(t))²) (matrix that is all zeros except row i, which is (h^(t−1))^T)   (84b)
            = Σ_t diag(1 − (h^(t))²) (∇_{h^(t)} L) (h^(t−1))^T                                 (10.26)

• U (n_hid × n_in). Very similar to the previous calculation.

      ∇_U L = Σ_{t=1}^{τ} ∇_U L^(t)                                                           (85a)
            = Σ_t Σ_{i=1}^{n_hid} (∇_{h^(t)} L)_i ∇_{U^(t)} h_i^(t)                            (10.27)
            = Σ_t diag(1 − (h^(t))²) (∇_{h^(t)} L) (x^(t))^T                                   (10.28)

²² The general lesson learned here is that, for some matrix W ∈ R^{a×b} and vector x ∈ R^b,

    Σ_i ∇_W [(Wx)_i] = the a × b matrix whose every row is x^T          (83)

where, of course, the output has the same dimensions as W.


²³ Specifically, h^(t) is both
(a) an explicit function of the parameter matrix W^(t) directly feeding into it, and
(b) an implicit function of all other W^(t=i) that came before.
This is different from before, where o^(t) did not implicitly depend on earlier V^(t=i). In other words, h^(t)
is a descendant of all earlier (and current) W.
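To sanity-check the parameter-gradient equations above (10.22, 10.23, 82f, 10.26, 10.28), here is a small NumPy implementation (my own, not from the book) of the forward pass of eqs. 10.8–10.12 plus BPTT, compared against finite differences:

```python
# Sketch (my own) of the BPTT equations derived above for the vanilla RNN of
# eqs. 10.8-10.12, checked against a finite-difference gradient. Shapes:
# U (nh, nx), W (nh, nh), V (ny, nh), b (nh,), c (ny,).
import numpy as np

rng = np.random.default_rng(0)
nx, nh, ny, tau = 3, 4, 5, 6
params = {"U": rng.normal(size=(nh, nx)) * 0.5,
          "W": rng.normal(size=(nh, nh)) * 0.5,
          "V": rng.normal(size=(ny, nh)) * 0.5,
          "b": np.zeros(nh), "c": np.zeros(ny)}
xs = rng.normal(size=(tau, nx))
ys = rng.integers(0, ny, size=tau)          # ground-truth class index per step

def forward(p):
    h = np.zeros(nh); hs, yhats, loss = [h], [], 0.0
    for t in range(tau):
        h = np.tanh(p["b"] + p["W"] @ h + p["U"] @ xs[t])        # (10.8/10.9)
        o = p["c"] + p["V"] @ h                                   # (10.10)
        yhat = np.exp(o - o.max()); yhat /= yhat.sum()            # softmax
        loss -= np.log(yhat[ys[t]])                               # (10.12)
        hs.append(h); yhats.append(yhat)
    return loss, hs, yhats

def bptt(p):
    loss, hs, yhats = forward(p)
    grads = {k: np.zeros_like(v) for k, v in p.items()}
    dh_next = np.zeros(nh)
    for t in reversed(range(tau)):
        do = yhats[t].copy(); do[ys[t]] -= 1                      # (10.18)
        h, h_prev = hs[t + 1], hs[t]
        dh = p["V"].T @ do + dh_next                              # (10.20/10.21)
        grads["V"] += np.outer(do, h); grads["c"] += do           # (82f), (10.22)
        da = (1 - h ** 2) * dh                                    # through tanh
        grads["W"] += np.outer(da, h_prev)                        # (10.26)
        grads["U"] += np.outer(da, xs[t])                         # (10.28)
        grads["b"] += da                                          # (10.23)
        dh_next = p["W"].T @ da                                   # pass back in time
    return grads

def numerical_grad(p, key, eps=1e-5):
    num = np.zeros_like(p[key])
    for idx in np.ndindex(p[key].shape):
        p[key][idx] += eps; lp = forward(p)[0]
        p[key][idx] -= 2 * eps; lm = forward(p)[0]
        p[key][idx] += eps
        num[idx] = (lp - lm) / (2 * eps)
    return num

grads = bptt(params)
for k in params:
    err = np.max(np.abs(grads[k] - numerical_grad(params, k)))
    print(k, f"max abs diff vs finite differences: {err:.2e}")
```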

2.5.2 RNNs as Directed Graphical Models

The advantage of RNNs is their efficient parameterization of the joint distribution over y (i)
via parameter sharing. This introduces a built-in assumption that we can model the effect of
y (i) in the distant past on the current y (t) via its effect on h. We are also assuming that the
conditional probability distribution over the variables at t + 1 given the variables at time t is
stationary. Next, we want to know how to draw samples from such a model. Specifically,
how to sample from the conditional distribution (y (t) given y (t−1) ) at each time step.
Say we want to model a sequence of scalar random variables Y ≜ {y^(1), . . . , y^(τ)} for some
sequence length τ. Without making independence assumptions just yet, we can parameterize
the joint distribution P(Y) with basic definitions of probability:

    P(Y) ≜ P(y^(1), . . . , y^(τ)) = Π_{t=1}^{τ} P(y^(t) | y^(t−1), . . . , y^(1))          (86)

where I’ve drawn an example of the complete graph for τ = 4 below.

(The complete graph can represent the direct dependencies between any pair of y values.)

[Figure: complete directed graph over y^(1), y^(2), y^(3), y^(4).]

If each value y could take on the same fixed set of k values, we would need to learn k⁴ pa-
rameters to represent the joint distribution P(Y). This is clearly inefficient, since the number
of parameters needed scales like O(k^τ). If we relax the restriction that each y^(i) must depend
directly on all past y^(j), we can considerably reduce the number of parameters needed to com-
pute the probability of some particular sequence.

We could include latent variables h at each timestep that capture the dependencies, reminiscent
of a classic RNN:

[Figure: chain of latent variables h^(1), h^(2), h^(3) mediating the dependencies among y^(1), . . . , y^(4).]

Since in the RNN case all factors P(h^(t) | h^(t−1)) are deterministic, we don't need any addi-
tional parameters to compute this probability²⁴, other than the single set of m² parameters needed
to convert any h^(t) to the next h^(t+1) (which is shared across all transitions). Now, the number
of parameters needed as a function of sequence length is constant, and as a function of k is
just O(k).

Finally, to view the RNN as a graphical model, we must describe how to sample from it, namely
how to sample a sequence y from P (Y), if parameterized by our graphical model above. In the
general case where we don't know the value of τ for our sequence y, one approach is to have an
EOS symbol that, if found during sampling, means we should stop there.
case where we actually want to model P (y | x) for input sequence x, we can reinterpret the
parameters θ of our graphical model as a function of x the input sequence. In other words, the
graphical model interpretation becomes a function of x, where x determines the exact values
of the probabilities the graphical model takes on – an “instance” of the graphical model.

Bidirectional RNNs. In many applications, it is desirable to output a prediction of y (t) that


may depend on the whole sequence. For example, in speech recognition, the interpretation of
words/sentences can also depend on what is about to be said. Below is a typical bidirectional
RNN, where the inputs x^(t) are fed both to a "forward" RNN (h) and a "backward" RNN
(g).

Notice how the output


units o(t) have the nice
property of depending on
both the past and future
while being most
sensitive to input values
around time t.

24
Don’t forget that, in a neural net, a variable y (t) is represented by a layer, which itself is composed of k
nodes, each associated with one of the k unique values that y (t) could be.

Encoder-Decoder Seq2Seq Architectures (10.4). Here we discuss how an RNN can be
trained to map an input sequence to an output sequence that is not necessarily the same length.
(Not really much of a discussion... the figure below says everything.)

2.5.3 Challenge of Long-Term Deps. (10.7)

Gradients propagated over many stages either vanish (usually) or explode. We saw how this
could occur when we took parameter gradients earlier, and for weight matrices W further
along from the loss node, the expression for ∇W L contained multiplicative Jacobian factors.
Consider the (linear activation) repeated function composition of an RNN’s hidden state in
10.36. We can rewrite it as a power method (10.37), and if W admits an eigendecomposition
(remember W is necessarily square here), we can further simplify as seen in 10.38.
    h^(t) = W^T h^(t−1)          (10.36)
          = (W^t)^T h^(0)        (10.37)
          = Q^T Λ^t Q h^(0)      (10.38)

(Q: Explain the interpretation of multiplying h by Q here, as opposed to the usual Q^T explained
in the linear algebra review.)

Any component of h(0) that isn’t aligned with the largest eigenvector
will eventually be discarded.25

If, however, we have a non-recurrent network such that the state elements are repeatedly
multiplied by different w(t) at each time step, the situation is different. Suppose the different
w(t) are i.i.d. with mean 0 and variance v. The variance of the product is easily seen to

25
Make sure to think about this from the right perspective. The largest value of t = τ in the RNNs we’ve seen
would correspond with either (1) the largest output sequence or (2) the largest input sequence (if fixed-vector
output). After we extract the output from a given forward pass, we reset the clock and either back-propagate
errors (if training) or get ready to feed another sequence.

be O(vⁿ)²⁶. To obtain some desired variance v* we may choose the individual weights with
variance v = ⁿ√(v*).

2.5.4 LSTMs and Other Gated RNNs (10.10)

While leaky units have connection weights that are either manually chosen constants or are
trainable parameters, gated RNNs generalize this to connection weights that may change at
each time step. Furthermore, gated RNNs can learn to both accumulate and forget, while leaky
units are designed just for accumulation²⁷.

LSTM (10.10.1). The idea is we want self-loops to produce paths where the gradient can
flow for long durations. The self-loop weights are gated, meaning they are controlled by
another hidden unit, interpreted as being conditioned on context. Listed below are the main
components of the LSTM architecture. (The subscript i identifies the cell; the superscript t
denotes the time step.)

• Forget gate: f_i^(t) = σ( b_i^f + Σ_j U_{i,j}^f x_j^(t) + Σ_j W_{i,j}^f h_j^(t−1) ).
• Internal state: s_i^(t) = f_i^(t) s_i^(t−1) + g_i^(t) σ( b_i + Σ_j U_{i,j} x_j^(t) + Σ_j W_{i,j} h_j^(t−1) ).
• External input gate: g_i^(t) = σ( b_i^g + Σ_j U_{i,j}^g x_j^(t) + Σ_j W_{i,j}^g h_j^(t−1) ).
• Output gate: q_i^(t) = σ( b_i^o + Σ_j U_{i,j}^o x_j^(t) + Σ_j W_{i,j}^o h_j^(t−1) ).

The final hidden state can then be computed via

    h_i^(t) = tanh(s_i^(t)) q_i^(t)          (89)

²⁶ Quick sketch of (my) proof:

    Var[w^(i)] = v = E[(w^(i))²] − (E[w^(i)])² = E[(w^(i))²]   (since the mean is 0)          (87)

    Var[ Π_{t=1}^{n} w^(t) ] = E[ ( Π_t w^(t) )² ] = Π_t E[ (w^(t))² ] = vⁿ   (by independence)          (88)

27
Q: Isn’t choosing to update with higher relative weight on the present the same as forgetting? A: Sort of.
It’s like “soft forgetting” and will inevitably erase more/less than desired (smeary). In this context, “forget”
means to set the weight of a specific past cell to zero.
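A minimal sketch (my own, not from the book) of a single LSTM step using the component equations above (note: following the notes, the candidate update uses σ rather than the more common tanh); all weight matrices and biases are randomly initialized placeholders:

```python
# Sketch (my own) of a single LSTM step following the component equations
# above; sigma is the logistic sigmoid, and all gates see x(t) and h(t-1).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, s_prev, p):
    """p holds weight matrices U*, W* and biases b* for each gate."""
    f = sigmoid(p["bf"] + p["Uf"] @ x + p["Wf"] @ h_prev)   # forget gate
    g = sigmoid(p["bg"] + p["Ug"] @ x + p["Wg"] @ h_prev)   # external input gate
    q = sigmoid(p["bo"] + p["Uo"] @ x + p["Wo"] @ h_prev)   # output gate
    s = f * s_prev + g * sigmoid(p["b"] + p["U"] @ x + p["W"] @ h_prev)  # internal state
    h = np.tanh(s) * q                                       # eq. (89)
    return h, s

rng = np.random.default_rng(0)
nx, nh = 3, 4
p = {k: rng.normal(size=(nh, nx)) * 0.3 for k in ["Uf", "Ug", "Uo", "U"]}
p.update({k: rng.normal(size=(nh, nh)) * 0.3 for k in ["Wf", "Wg", "Wo", "W"]})
p.update({k: np.zeros(nh) for k in ["bf", "bg", "bo", "b"]})
h, s = np.zeros(nh), np.zeros(nh)
for t in range(5):                     # run a short input sequence
    h, s = lstm_step(rng.normal(size=nx), h, s, p)
print(h.round(3))
```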

Modern Practices February 14

Applications (Ch. 12)


Table of Contents Local Written by Brandon McKinzie

2.6.1 Natural Language Processing (12.4)


Begins on pg. 448

n-grams. A language model defines a probability distribution over sequences of [discrete]


tokens (words/characters/etc). Early models were based on the n-gram: a [fixed-length] se-
quence of n tokens. Such models define the conditional distribution for the nth token, given
the (n − 1) previous tokens:
P (xt | xt−(n−1) , . . . , xt−1 )
where xi denotes the token at step/index/position i in the sequence.

To define distributions over longer sequences, we can just use Bayes rule over the shorter
distributions, as usual. For example, say we want to find the [joint] distribution for some
τ -gram (τ > n), and we have access to an n-gram model and a [perhaps different] model for
the initial sequence P(x_1, . . . , x_{n−1}). We compute the distribution over the length-τ sequence
simply as follows:

    P(x_1, . . . , x_τ) = P(x_1, . . . , x_{n−1}) Π_{t=n}^{τ} P(x_t | x_{t−1}, . . . , x_{t−(n−2)}, x_{t−(n−1)})          (12.5)

where it’s important to see that each factor in the product is a distribution over a length-n
sequence. Since we need that initial factor, it is common to train both an n-gram model and
an (n − 1)-gram model simultaneously.

Let’s do a specific example for a trigram (n = 3).


• Assumptions [for this trigram model example]:
– For any n ≥ 3, P (xn | x1 , . . . , xn−1 ) = P (xn | xn−2 , xn−1 ).
– When we get to computing the full joint distribution over some sequence of arbitrary
length, we assume we have access to both P3 and P2 , the joint distributions over all
subsequences of length 3 and 2, respectively.
• Example sequence: We want to know how to use a trigram model on the sequence
[’THE’, ’DOG’, ’RAN’, ’AWAY’].

• Derivation: We can use the built-in model assumption to derive the following formula.

    P(THE DOG RAN AWAY) = P3(AWAY | THE DOG RAN) P3(THE DOG RAN)
                        = P3(AWAY | DOG RAN) P3(THE DOG RAN)
                        = [ P3(DOG RAN AWAY) / P2(DOG RAN) ] P3(THE DOG RAN)
                        = P3(THE DOG RAN) P3(DOG RAN AWAY) / P2(DOG RAN)          (12.7)
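A tiny counting example (my own, with a made-up toy corpus) of how P3 and P2 would be estimated by MLE and combined via eq. 12.7:

```python
# Sketch (my own): maximum-likelihood trigram/bigram estimates from raw counts
# on a toy corpus, applied to the example sentence via eq. 12.7.
from collections import Counter

corpus = "THE DOG RAN AWAY . THE DOG RAN HOME . THE CAT RAN AWAY .".split()
c3 = Counter(zip(corpus, corpus[1:], corpus[2:]))   # 3-gram counts
c2 = Counter(zip(corpus, corpus[1:]))               # 2-gram counts

# Conditional from counts: P(AWAY | DOG RAN) = count(DOG RAN AWAY) / count(DOG RAN)
p_away_given = c3[("DOG", "RAN", "AWAY")] / c2[("DOG", "RAN")]

# Joint 3-gram and 2-gram probabilities (MLE: count / total n-gram count)
P3 = lambda *w: c3[w] / sum(c3.values())
P2 = lambda *w: c2[w] / sum(c2.values())

# Eq. 12.7: P(THE DOG RAN AWAY) = P3(THE DOG RAN) * P3(DOG RAN AWAY) / P2(DOG RAN)
print(p_away_given)
print(P3("THE", "DOG", "RAN") * P3("DOG", "RAN", "AWAY") / P2("DOG", "RAN"))
```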

Limitations of n-gram. The last example illustrates some potential problems one may
encounter that arise [if using MLE] when the full joint we seek is nonzero, but (a) some
factor is zero, or (b) P_{n−1} is zero. Some methods of dealing with this are as follows.
(Recall that, in MLE, the P_n and P_{n−1} are usually approximated by counting occurrences in
the training set.)
• Smoothing: shifting probability mass from the observed tuples to unobserved ones that
are similar.
• Back-off methods: look up the lower-order (lower values of n) n-grams if the frequency
of the context xt−1 , . . . , xt−(n−1) is too small to use the higher-order model.
In addition, n-gram models are vulnerable to the curse of dimensionality, since most n-grams
won’t occur in the training set28 , even for modest n.

2.6.2 Neural Language Models (12.4.2)

Designed to overcome curse of dimensionality by using a distributed representation of words.


Recognize that any model trained on sentences of length n and then told to generalize to new
sentences [also of length n] must deal with a space29 of possible sentences that is exponential in
n. Such word representations (i.e. viewing words as existing in some high-dimensional space)
are often called word embeddings. The idea is to map the words (or sentences) from the raw
high-dimensional [vocab sized] space to a smaller feature space, where similar words are closer
to one another. Distributed representations may also be used with graphical models
(think Bayes' nets) in the form of multiple latent variables.

28
For a given vocabulary, which usually has much more than n possible words, consider how many possible
sequences of length n.
29
Ok, I tried re-wording that from the book's confusing wording, but that was also a bit confusing. Let
me break it down. Say you train on a thousand sentences, each of length 5. For a given vocabulary of size
VOCAB_SIZE, the number of possible sequences of length 5 is (VOCAB_SIZE)⁵, which can be quite a lot
more than a thousand (not to mention the possibility of duplicate training examples). To the naive model, all
points in this high-dimensional space are basically the same. A neural language model, however, tries to arrange
the space of possibilities in a meaningful way, so that an unforeseen sample at test time can be judged "similar"
to some previously seen training example. It does this by embedding words/sentences in a lower-dimensional
feature space.

Deep Learning Research

Contents

3.1 Linear Factor Models (Ch. 13) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47


3.2 Autoencoders (Ch. 14) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3 Representation Learning (Ch. 15) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.4 Structured Probabilistic Models for DL (Ch. 16) . . . . . . . . . . . . . . . . . . . . . . . 52
3.4.1 Sampling from Graphical Models . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.4.2 Inference and Approximate Inference . . . . . . . . . . . . . . . . . . . . . . . . 54
3.5 Monte Carlo Methods (Ch. 17) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.6 Confronting the Partition Function (Ch. 18) . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.7 Approximate Inference (Ch. 19) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.8 Deep Generative Models (Ch. 20) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

Deep Learning Research January 12

Linear Factor Models (Ch. 13)


Table of Contents Local Written by Brandon McKinzie

Overview. Much research is in building a probabilistic model 30 of the input, pmodel (x). Why?
Because then we can perform inference to predict stuff about our environment given any of
the other variables. We call the other variables latent variables, h, with
    p_model(x) = Σ_h Pr(h) Pr(x | h) = E_h[ p_model(x | h) ]          (90)

So what? Well, the latent variables provide another means of data representation, which can
be useful. Linear factor models (LFM) are some of the simplest probabilistic models with
latent variables.

A linear factor model is defined by the use of a stochastic linear decoder function
that generates x by adding noise to a linear transformation of h.

Note that h is a vector of arbitrary size, where we assume p(h) is a factorial distribution:
p(h) = Π_i p(h_i). This roughly means we assume the elements of h are mutually independent³¹.

The LFM describes the data-generation process as follows:


1. Sample the explanatory factors: h ∼ p(h).
2. Sample the real-valued observable variables given the factors:
x = Wh + b + noise (91)

Probabilistic PCA and Factor Analysis.


• Factor analysis:

h ∼ N (h; 0, I) (92)
2
noise ∼ N (0, ψ ≡ diag(σ )) (93)
x ∼ N (x; b, WW T + ψ) (94)

where the last relation can be shown by recalling that a linear combination of Gaussian
variables is itself Gaussian, and showing that Eh [x] = b, and Cov(x) = WW T + ψ.

30
Whereas, before, we’ve been building functions of the input (deterministic).
31
Note that, technically, this assumption isn’t strictly the definition of mutual independence, which requires
that every subset (i.e. not just the full set) of {hi ∈ h} follow this factorial property.

It is worth emphasizing the interpretation of ψ as the matrix of conditional vari-
ances σi2 . Huh? Let’s take a step back. The fact that we were able to separate the
distributions in the above relations for h and noise is from a built-in assumption that
Pr(x_i | h, x_{j≠i}) = Pr(x_i | h)³².

The Big Idea

The latent variable h is a big deal because it captures the dependencies between
the elements of x. How do I know? Because of our assumption that the xi are
conditionally independent given h. If, once we specify h, all the elements of x
become independent, then any information about their interrelationship is hiding
somewhere in h.

Detailed walkthrough of Factor Analysis (a.k.a me slowly reviewing, months after taking
this note):
– Goal. Analyze and understand the motivations behind how Factor Analysis defines the
data-generation process under the framework of LFMs (defined in steps 1 and 2 earlier).
Assume h has dimension n.
– Prior. Defines p(h) := N(h; 0, I), the unit-variance Gaussian. Explicitly,

      p(h) := (1 / (2π)^{n/2}) exp( −½ Σ_{i}^{n} h_i² )

– Noise. Assumed to be drawn from a Gaussian with diagonal covariance matrix ψ := diag(σ²).
  Explicitly,

      p(noise = a) := (1 / ((2π)^{n/2} Π_{i}^{n} σ_i)) exp( −½ Σ_{i}^{n} a_i²/σ_i² )
– Deriving distribution of x. We use the fact that any linear combination of Gaussians
  is itself Gaussian. Thus, deriving p(x) is reduced to computing its mean and covariance
  matrix.

      μ_x = E_h[Wh + b]                                                    (95)
          = ∫ p(h)(Wh + b) dh                                              (96)
          = b + (1/(2π)^{n/2}) ∫ exp(−½ Σ_i h_i²) Wh dh                     (97)
          = b                                                              (98)

      Cov(x) = E[(x − E[x])(x − E[x])^T]                                   (99)
             = E[(Wh + noise)(h^T W^T + noise^T)]                          (100)
             = E[W h h^T W^T] + ψ                                          (101)
             = W W^T + ψ                                                   (102)

  where we compute the expectation of x over h because x is defined as a function of h, and
  the noise always has expectation zero.
³² This conditional-independence assumption introduces a constraint: knowing the value of some element x_j doesn't alter the
probability Pr(x_i = W_i · h + b_i + noise). Given how we've defined the variable h, this means that knowing noise_j
provides no clues about noise_i. Mathematically, the noise must have a diagonal covariance matrix.

– Thoughts. Not really seeing why this is useful/noteworthy. Feels very contrived (many
assumptions) and restrictive – it only applies if the dependencies between each xi can be
modeled with a random variable h sampled from a unit variance Gaussian.
• Probabilistic PCA: Just factor analysis with ψ = σ 2 I. So zero-mean spherical Gaus-
sian noise. It becomes regular PCA as σ → 0. Here we can use an iterative EM algorithm
for estimating the parameters W.
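A quick numerical check (my own, not from the book) of the factor-analysis generative process: sampling x = Wh + b + noise and confirming that the empirical covariance approaches WWᵀ + ψ:

```python
# Sketch (my own): sample from the factor-analysis generative process
# x = Wh + b + noise and confirm empirically that Cov(x) ~= W W^T + psi.
import numpy as np

rng = np.random.default_rng(0)
n_h, n_x, n_samples = 2, 4, 200_000
W = rng.normal(size=(n_x, n_h))
b = rng.normal(size=n_x)
sigma2 = np.array([0.1, 0.2, 0.3, 0.4])             # conditional variances
psi = np.diag(sigma2)

h = rng.normal(size=(n_samples, n_h))                # h ~ N(0, I)
noise = rng.normal(size=(n_samples, n_x)) * np.sqrt(sigma2)
x = h @ W.T + b + noise                              # eq. (91)

print(np.max(np.abs(np.cov(x, rowvar=False) - (W @ W.T + psi))))   # small
```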

Deep Learning Research May 07

Autoencoders (Ch. 14)


Table of Contents Local Written by Brandon McKinzie

Introduction. An autoencoder learns to copy its input to its output, via an encoder function
h = f (x) and a decoder function r = g(h). Modern autoencoders generalize this to allow for r for “reconstruction”

stochastic mappings pencoder (h | x) and pdecoder (x | h).

Undercomplete Autoencoders. Constrain dimension of h to be smaller than that of x.


The learning process minimizes some L(x, g(f (x))), where the loss function could be e.g. mean
squared error. Be careful not to have too many learnable parameters in the functions g and
f (thus increasing model capacity), since that defeats the purpose of using an undercomplete
autoencoder in the first place.

Regularized Autoencoders. We can remove the undercomplete constraint/necessity by


modifying our loss function. For example, a sparse autoencoder is one that adds a penalty
Ω(h) to the loss function that encourages the activations on (not the connections to/from) the
hidden layer to be sparse. One way to achieve actual zeros in h is to use rectified linear units
for the activations.

Deep Learning Research May 07

Representation Learning (Ch. 15)


Table of Contents Local Written by Brandon McKinzie

Greedy Layer-Wise Unsupervised Pretraining. Given:


- Unsupervised learning algorithm L which accepts as input a training set of examples X, and
outputs an encoder/feature function f .
- f (i) (X̃) denotes the output of the ith layer of f , given as immediate input the (possibly
transformed) set of examples X̃.
- Let m denote the number of layers (“stages”) in the encoder function (note that each lay-
er/stage here must use a representation learning algorithm for its L e.g. an RBM, autoen-
coder, sparse coding model, etc.)
The procedure is as follows:
1. Initialize.

f (·) ← I(·) (103)


X̃ = X (104)

2. For each layer (stage) k in range(m), do:

       f^(k) = L(X̃)                    (105)
       f(·) ← f^(k)(f(·))               (106)
       X̃ ← f^(k)(X̃)                    (107)

In English: just apply the regular learning/training process for each layer/stage sequen-
tially and individually33 .
When this is complete, we can run fine-tuning: train all layers together (including any later
layers that could not be pretrained) with a supervised learning algorithm. Note that we do
indeed allow the pretrained encoding stages to be optimized here (i.e. not fixed).

33
In other words, you proceed one layer at a time in order. You don’t touch layer i until the weights in layer
i − 1 have been learned.
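A runnable sketch (my own, not from the book) of the procedure above; as a stand-in for the unsupervised algorithm L, each stage simply fits a PCA projection (a real setup would use an RBM, autoencoder, sparse coding model, etc.):

```python
# Sketch (my own) of greedy layer-wise pretraining, with PCA as a toy stand-in
# for the unsupervised learner L at each stage.
import numpy as np

def pca_learner(n_components):
    def fit(X):
        mu = X.mean(axis=0)
        _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
        components = Vt[:n_components]
        return lambda Z: (Z - mu) @ components.T       # this stage's f^(k)
    return fit

def greedy_layerwise_pretrain(X, stage_fits):
    encoders, X_tilde = [], X                 # (103)-(104)
    for fit in stage_fits:                    # (105)-(107), one stage at a time
        f_k = fit(X_tilde)                    # train stage k on current features
        encoders.append(f_k)
        X_tilde = f_k(X_tilde)                # re-encode data for the next stage
    def f(inputs):                            # the composed encoder f
        for enc in encoders:
            inputs = enc(inputs)
        return inputs
    return f

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
f = greedy_layerwise_pretrain(X, [pca_learner(10), pca_learner(5)])
print(f(X).shape)     # (500, 5): features from the stacked encoder
```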

Deep Learning Research October 01, 2017

Structured Probabilistic Models for DL (Ch. 16)


Table of Contents Local Written by Brandon McKinzie

Motivation. In addition to classification, we can ask probabilistic models to perform other


tasks such as density estimation (x → p(x)), denoising, missing value imputation, or sampling.
What these [other] tasks have in common is they require a complete understanding of the input.
Let's start with the most naive approach to modeling p(x), where x contains n elements, each
of which can take on k distinct values: we store a lookup table of all possible x and the
corresponding probability value p(x). This requires kⁿ parameters³⁴. Instead, we use graphs
to describe model structure (direct/indirect interactions) to drastically reduce the number of
parameters.

Directed Models. Also called belief networks or Bayesian networks. Formally, a di-
rected graphical model defined on a set of variables {x} is defined by a DAG, G, whose vertices
are the random variables in the model, and a set of local conditional probability distribu-
tions, p(xi | P aG (xi )), where P aG (xi ) gives the parents of xi in G. The probability distribution
over x is given by

    p(x) = Π_i p(x_i | Pa_G(x_i))          (108)

Undirected Graphical Models. Also called Markov Random Fields (MRFs) or Markov
Networks. Appropriate for situations where interactions do not have a well-defined direction.
Each clique C (any set of nodes that are all [maximally] connected) in G is associated with a
factor φ(C). The factor φ(C), also called a clique potential, is just a function (not necessarily
a probability) that outputs a number when given a possible set of values over the nodes in C.
The output number measures the affinity of the variables in that clique for being in the states
specified by the inputs. (Clique potentials are constrained to be nonnegative.) The set of all
factors in G defines an unnormalized probability distribution:

    p̃(x) = Π_{C∈G} φ(C)          (109)

³⁴ Consider the common NLP case where our vector x contains n word tokens, each of which can take on any
symbol in our vocabulary of size v. If we assign n = 100 and v = 100,000, which are relatively common values
for this case, this amounts to (10⁵)¹⁰⁰ = 10⁵⁰⁰ parameters.

The Partition Function. To obtain a valid probability distribution, we must normalize the
unnormalized distribution:

    p(x) = (1/Z) p̃(x)          (110)

    Z = ∫ p̃(x) dx              (111)

where the normalizing function Z = Z({φ}) is known as the partition function (physicz
sh0ut0uT). It is typically intractable to compute, so we resort to approximations. Note that
Z isn't even guaranteed to exist – it exists only for those definitions of the clique potentials that
cause the integral over p̃(x) to converge/be defined.

Energy-Based Models (EBMs). A convenient way to enforce ∀x, pe(x) > 0 is to use EBMs,
where

pe(x) , exp (−E(x)) (112)

and E(x) is known as the energy function35 . Many algorithms need to compute not pmodel (x)
but only log pemodel (x) (unnormalized log probabilities - logits!). For EBMs with latent variables
h, such algorithms are phrased in terms of the free energy:

    F(x = x) = − log Σ_h exp( −E(x = x, h = h) )          (113)

where we sum over all possible assignments of the latent variables.

Separation and D-Separation. We want to know which subsets of variables are condi-
tionally independent from each other, given the values of other subsets of variables. A set
of variables A is separated (if undirected model)/d-separated (if directed model) from an-
other set of variables B given a third set of variables S if the graph structure implies that A is
independent from B given S.
• Separation. For undirected models. If variables a and b are connected by a path
involving only unobserved variables (an active path), then a and b are not separated.
Otherwise, they are separated. Any paths containing at least one observed variable are
called inactive.
• D-Separation36 . For directed models. Although there are rules that help determine
whether a path between a and b is d-separated, it is simplest to just determine whether
a is independent from b given any observed variables along the path.

35
Physics throwback: this mirrors the Boltzmann factor, exp(−ε/τ ), which is proportional to the probability
of the system being in quantum energy state ε.
36
The D stands for dependence.

3.4.1 Sampling from Graphical Models

For directed graphical models, we can do ancestral sampling to produce a sample x from
the joint distribution represented by the model. Just sort the variables x_i into a topological
ordering such that x_i ∈ Pa_G(x_j) implies i < j. To produce the sample, just sequentially
sample from the beginning: x_1 ∼ P(x_1), x_2 ∼ P(x_2 | Pa_G(x_2)), etc.
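A small sketch (my own, not from the book) of ancestral sampling on a toy chain a → b → c of binary variables, comparing the empirical marginal of c against the exact value:

```python
# Sketch (my own): ancestral sampling from a tiny directed model a -> b -> c
# with binary variables, using the topological order (a, b, c).
import numpy as np

rng = np.random.default_rng(0)
p_a = 0.3                                    # P(a=1)
p_b_given_a = {0: 0.9, 1: 0.2}               # P(b=1 | a)
p_c_given_b = {0: 0.1, 1: 0.7}               # P(c=1 | b)

def ancestral_sample():
    a = rng.random() < p_a
    b = rng.random() < p_b_given_a[int(a)]    # parents of b already sampled
    c = rng.random() < p_c_given_b[int(b)]
    return int(a), int(b), int(c)

samples = [ancestral_sample() for _ in range(100_000)]
print(np.mean([s[2] for s in samples]))       # empirical P(c=1)
# Exact P(c=1) = sum_{a,b} P(a) P(b|a) P(c=1|b), for comparison:
print(sum((p_a if a else 1 - p_a) * (p_b_given_a[a] if b else 1 - p_b_given_a[a])
          * p_c_given_b[b] for a in (0, 1) for b in (0, 1)))
```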

For undirected graphical models, one simple approach is Gibbs sampling. Essentially, this
involves drawing a conditioned sample from xi ∼ p(xi | neighbors(xi )) for each xi . This process
is repeated many times, where each subsequent pass uses the previously sampled values in
neighbors(xi ) to obtain an asymptotically converging [to the correct distribution] estimate for
a sample from p(x).

3.4.2 Inference and Approximate Inference

One of the main tasks with graphical models is predicting the values of some subset of variables
given another subset: inference. Although the graph structures we’ve discussed allow us to
represent complicated, high-dimensional distributions with a reasonable number of parameters,
the graphs used for deep learning are usually not restrictive enough to allow efficient inference.
Approximate inference for deep learning usually refers to variational inference, in which we
approximate the distribution p(h | v) by seeking an approximate distribution q(h | v) that is
as close to the true one as possible.

Example: Restricted Boltzmann Machine. The quintessential example of how graphical


models are used for deep learning. The canonical RBM is an energy-based model with binary
visible and hidden units. Its energy function is

E(v, h) = −bT v − cT h − v T Wh (114)

where b, c, and W are unconstrained, real-valued, learnable parameters. One could interpret
the values of the bias parameters as the affinities for the associated variable being its given
value, and the value Wi,j as the affinity of vi being its value and hj being its value at the same
time37 .

The restrictions on the RBM structure, namely the fact that there are no intra-layer connec-
tions, yields nice properties. Since pe(h, v) can be factored into clique potentials, we can say

37
More concretely, remember that v is a one-hot vector representing some state that can assume len(v)
unique values, and similarly for h. Then Wi,j gives the affinity for the state associated with v being its ith
value and the state associated with h being its jth value.

that:

    p(h | v) = Π_i p(h_i | v)          (115)
    p(v | h) = Π_i p(v_i | h)          (116)

Also, due to the restriction of binary variables, each of the conditionals is easy to compute,
and can be quickly derived as

    p(h_i = 1 | v) = σ( c_i + v^T W_{:,i} )          (117)

allowing for efficient block Gibbs sampling.
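A minimal sketch (my own, not from the book) of block Gibbs sampling in a small binary RBM using the conditionals above (and the analogous p(v_i = 1 | h) = σ(b_i + W_{i,:} h)):

```python
# Sketch (my own): block Gibbs sampling in a small binary RBM, updating all
# hidden units at once given v, then all visible units at once given h.
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 6, 3
W = rng.normal(scale=0.5, size=(n_v, n_h))
b, c = np.zeros(n_v), np.zeros(n_h)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

v = rng.integers(0, 2, size=n_v).astype(float)     # arbitrary start state
for step in range(1000):
    p_h = sigmoid(c + v @ W)                        # eq. (117), all hiddens at once
    h = (rng.random(n_h) < p_h).astype(float)
    p_v = sigmoid(b + W @ h)                        # analogous visible conditional
    v = (rng.random(n_v) < p_v).astype(float)
print(v, h)
```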

Deep Learning Research May 09

Monte Carlo Methods (Ch. 17)


Table of Contents Local Written by Brandon McKinzie

Monte Carlo Sampling (Basics). We can approximate the value of a (usually prohibitively
large) sum/integral by viewing it as an expectation under some distribution. We can then
approximate its value by taking samples from the corresponding probability distribution and
taking an empirical average. Mathematically, the basic idea is show below:
n
1
Z X
s= p(x)f (x)dx = Ep [f (x)] → ŝn = f (x(i) ) (118)
n
i=1, x(i) ∼p

As we’ve seen before, the empirical average is an unbiased38 estimator. Furthermore, the
central limit theorem tells us that the distribution of ŝn converges to a normal distribution
with mean s and variance Var [f (x)] /n.
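A quick illustration (my own, not from the book) of eq. 118 with p = N(0, 1) and f(x) = x², for which s = 1 exactly:

```python
# Sketch (my own): Monte Carlo estimate of s = E_p[f(x)] with p = N(0,1) and
# f(x) = x^2, so s = 1 exactly; the error shrinks like 1/sqrt(n).
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2

for n in [10, 1000, 100_000]:
    x = rng.normal(size=n)              # x^(i) ~ p
    s_hat = f(x).mean()                 # eq. (118)
    print(f"n={n:7d}  s_hat={s_hat:.4f}  |error|={abs(s_hat - 1.0):.4f}")
```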

Importance Sampling. What if it's not feasible for us to sample from p? We can approach
this a couple of ways, both of which will exploit the following identity:

    p(x) f(x) = q(x) [ p(x) f(x) / q(x) ]          (122)

• Optimal importance sampling. We can use the aforementioned identity/decomposi-


tion to find the optimal q∗ – optimal in terms of number of samples required to achieve
a given level of accuracy. First, we rewrite our estimator ŝp (they now use subscript to
denote the sampling distribution) as ŝq :
    ŝ_q = (1/n) Σ_{i=1, x^(i)∼q}^{n} [ p(x^(i)) f(x^(i)) / q(x^(i)) ]          (123)

³⁸ Recall that expectations of such an average are still taken over the underlying (assumed) probability
distribution:

    E_p[ŝ_n] = (1/n) Σ_{i=1}^{n} E_p[ f(x^(i)) ]          (119)
             = (1/n) Σ_{i=1}^{n} s                         (120)
             = s                                           (121)

You should think of the expectation E_p[f(x^(i))] as the expected value of the random sample from the underlying
distribution, which of course is s, because we defined it that way.

At first glance, it feels a little wonky, but recognize that we are sampling from q instead
of p (i.e. if this were an integral, it would be over q(x)dx). The catch is that, now, the
variance can be greatly sensitive to the choice of q:

p(x)f (x)
 
Var [ŝq ] = Var /n (124)
q(x)

with the optimal (minimum-variance) choice of q being:

    q* = p(x) |f(x)| / Z          (125)

• Biased importance sampling. Computing the optimal value of q can be as chal-


lenging/infeasible as sampling from p. Biased sampling does not require us to find a
normalization constant for p or q. Instead, we compute:

    ŝ_BIS = [ Σ_{i=1}^{n} (p̃(x^(i))/q̃(x^(i))) f(x^(i)) ] / [ Σ_{i=1}^{n} p̃(x^(i))/q̃(x^(i)) ]          (126)

where p̃ and q̃ are the unnormalized forms of p and q, and the x^(i) samples are still drawn
from q. E[ŝ_BIS] ≠ s, except asymptotically as n → ∞.

Deep Learning Research August 30, 2018

Confronting the Partition Function (Ch. 18)


Table of Contents Local Written by Brandon McKinzie

Noise Contrastive Estimation (NCE) (18.6). We now estimate

log pmodel (x) = log p̃model (x; θ) + c (127)

and explicitly learn an approximation, c, for − log Z(θ). Obviously MLE would just try jacking
up c to maximize this, so we adopt a surrogate supervised training problem: binary classifi-
cation of whether a given sample x belongs to the (true) data distribution pdata or to the noise
distribution pnoise . We introduce binary variable y to indicate whether the sample is in the
true data distribution (y=1) or the noise distribution (y=0). Our surrogate model is thus
defined by
1
pjoint (y=1) = (128)
2
pjoint (x | y=1) = pmodel (x) (129)
pjoint (x | y=0) = pnoise (x) (130)

We can now use MLE on the optimization problem

    θ, c = argmax_{θ,c} E_{x,y∼p_train}[ log p_joint(y | x) ]          (131)

where

    p_joint(y=1 | x) = p_model(x) / ( p_model(x) + p_noise(x) )          (132)
                     = 1 / ( 1 + p_noise(x)/p_model(x) )                 (133)
                     = σ( log p_model(x) − log p_noise(x) )              (134)

Deep Learning Research Nov 15, 2017

Approximate Inference (Ch. 19)


Table of Contents Local Written by Brandon McKinzie

Overview. Most graphical models with multiple layers of hidden variables have intractable
posterior distributions. This is typically because the partition function scales exponentially
with the number of units and/or due to marginalizing out latent variables. Many approximate
inference approaches make use of the observation that exact inference can be described as an
optimization problem.

Assume we have a probabilistic model consisting of observed variables v and latent variables
h. We want to compute log p(v; θ), but it’s too costly to marginalize out h. Instead, we
compute a lower bound L(v, θ, q) – often called the evidence lower bound (ELBO) or
negative variational free energy – on log p(v; θ)39 :
    L(v, θ, q) = log p(v; θ) − D_KL( q(h | v) || p(h | v; θ) )          (135)
               = E_{h∼q(h|v)}[ log p(h, v) ] + H(q(h | v))              (136)

(Here q is an arbitrary probability distribution over h. Note that the book will write q when
they really mean q(h | v).)
where the second form is the more canonical definition40 . Note that L(v, θ, q) is a lower-bound
on log p(v; θ) by definition, since

log p(v; θ) − L(v, θ, q) = DKL (q(h | v)||p(h | v; θ)) ≥ 0

With equality (to zero) iff q is the same distribution as p(h | v). In other words, L can
be viewed as a function parameterized by q that’s maximized when q is p(h | v), and with
maximal value log p(v). Therefore, we can cast the inference problem of computing the (log)
probability of the observed data log p(v) into an optimization problem of maximizing L. Exact
inference can be done by searching over a family of functions that contains p(h | v).

³⁹ Recall that D_KL(P || Q) = E_{x∼P(x)}[ log( P(x)/Q(x) ) ].
⁴⁰ This can be derived easily from the first form. Hint:

    log( q(h | v) / p(h | v) ) = log( q(h | v) / ( p(h, v; θ) / p(v; θ) ) )
Expectation Maximization (19.2). Technically not an approach to approximate inference,
but rather an approach to learning with an approximate posterior. Popular for training models
with latent variables. The EM algorithm consists of alternating between the following two steps
until convergence:
1. E-step. For each training example v (i) (in current batch or full set), set

q(h | v (i) ) = p(h | v (i) ; θ (0) ) (137)

where θ (0) denotes the current parameter values of the model at the beginning of the
E-step. This can also be interpreted as maximizing L w.r.t. q.
2. M-step. Update the parameters θ by completely or partially finding

       argmax_θ Σ_i L( v^(i), θ, q(h | v^(i); θ^(0)) )          (138)

Deep Learning Research July 28, 2018

Deep Generative Models (Ch. 20)


Table of Contents Local Written by Brandon McKinzie

Boltzmann Machines (20.1). An energy-based model over a d-dimensional binary random


vector x ∈ {0, 1}d . The energy function is simply E(x) = −xT Ux − bT x, i.e. parameters
between all pairs of xi , xj , and bias parameters for each xi 41 . In settings where our data
consists of samples of fully observed x, this is clearly limited to very simple cases, since e.g.
the probability of some xi being on is given by logistic regression from the values of the other
units.
Proof: prob of xi being on is logistic regression on other units
It’s important to be as specific as possible here, since the task stated as-is is ambiguous. We want to prove that the probability
of some fully observed state x that has its ith element clamped to 1, which I’ll denote as pi=on (x), is logistic regression over the
other units.

To prove this, it’s easier to use the conventional definition where U is symmetric with zero diagonal, and we write E(x) as

    E(x) = − Σ_{i=1}^{d} Σ_{j=i+1}^{d} x_i U_{i,j} x_j − Σ_{i=1}^{d} b_i x_i          (139)

where the difference is that we explicitly only sum over the upper triangle of U.

Intuitively, since p({x}_{j≠i}) = p_{i=on}(x) + p_{i=off}(x), our final formula for p_{i=on} should only contain terms involving the
parameters that interact with x_i, and only for those cases where x_i = 1. This motivates exploring the formula for ΔE_i(x) ≜
E_{i=off} − E_{i=on}, where I've dropped the explicit notation on x for simplicity/space. Before jumping in to derive this, step back
and realize that ΔE_i will only contain summation terms where either the row or column index of U is i, plus the
bias element b_i. Since our summation is over the upper triangle of U, this means terms along the slices U_{i,i+1:d} and U_{1:i−1,i}.
Now there is no derivation needed and we can simply write

    ΔE_i = Σ_{k=i+1}^{d} U_{i,k} x_k + Σ_{k=1}^{i−1} x_k U_{k,i} + b_i          (140)

The goal is to use this to get a logistic-regression-like formula for p_{i=on}, so we should now think about the relationship between
any given p(x) and the associated E(x). The critical observation is that E(x) = − ln p(x) − ln Z, which therefore means
(interpreting p_{i=on} and p_{i=off} here as probabilities conditioned on the other units, so that p_{i=on} + p_{i=off} = 1):

    ΔE_i = ln p_{i=on}(x) − ln p_{i=off}(x) = − ln( (1 − p_{i=on}(x)) / p_{i=on}(x) )          (141)

    exp(−ΔE_i) = (1 − p_{i=on}(x)) / p_{i=on}(x) = 1/p_{i=on}(x) − 1                           (142)

    ∴ p_{i=on}(x) = 1 / (1 + exp(−ΔE_i))                                                       (143)

Since ΔE_i is a linear function of all the other units, we have proven that p_{i=on}(x) for some state x reduces to logistic regression
over the other units.
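A brute-force numerical check (my own, not from the book) of this result on a tiny Boltzmann machine: enumerate all states, compute the exact conditional, and compare against σ(ΔE_i):

```python
# Sketch (my own): verify p(x_i = 1 | x_{-i}) = sigmoid(Delta E_i) on a tiny
# Boltzmann machine by exact enumeration of all 2^d states.
import itertools
import numpy as np

rng = np.random.default_rng(0)
d = 4
U = np.triu(rng.normal(scale=0.5, size=(d, d)), k=1)   # upper triangular, zero diag
b = rng.normal(scale=0.5, size=d)

def energy(x):
    return -x @ U @ x - b @ x                           # eq. (139)

states = np.array(list(itertools.product([0, 1], repeat=d)), dtype=float)
p = np.exp([-energy(s) for s in states]); p /= p.sum()  # exact joint via enumeration

i = 2
x_rest = np.array([1., 0., 1.])                          # fixed values of x_{j != i}
on, off = np.insert(x_rest, i, 1.0), np.insert(x_rest, i, 0.0)
idx = lambda s: int("".join(str(int(v)) for v in s), 2)  # state -> row of `states`
p_on_cond = p[idx(on)] / (p[idx(on)] + p[idx(off)])      # exact conditional

delta_E = energy(off) - energy(on)                       # Delta E_i by definition
print(p_on_cond, 1.0 / (1.0 + np.exp(-delta_E)))         # should match (eq. 143)
```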

41
Authors are being lazy because it’s assumed the reader is familiar (which is fair, I guess). i.e. they aren’t
mentioning that this formula implies that U is either lower or upper triangular, and the diagonal is zero.

Restricted Boltzmann Machines (20.2). A BM with variables partitioned into two sets:
hidden and observed. The graphical model is bipartite over the hidden and observed nodes,
as I’ve drawn in the example below.

[Figure: bipartite graph with visible units X1 . . . X4 each connected to hidden units H1 . . . H3.]

Although the joint distribution p(x, h) has a potentially intractable partition function, the
conditional distributions can be computed efficiently by exploiting independencies:

    p(h | x) = Π_{j=1}^{n_h} σ( [2h − 1] ⊙ [c + W^T x] )_j          (144)
    p(x | h) = Π_{i=1}^{n_x} σ( [2x − 1] ⊙ [b + W h] )_i            (145)

where b and c are the observed and hidden bias parameters, respectively.

Deep Belief Networks (20.3). Several layers of (usually binary) latent variables and a single
observed layer. The "deepest" (away from the observed) layer connections are undirected, and
all other layers are directed and pointing toward the data. I’ve drawn an example below.

[Figure: DBN with undirected top layer H^(2) = (H1^(2), H2^(2), H3^(2)), a layer H^(1) = (H1^(1), . . . , H4^(1)),
and visible units X1, X2, X3, with the lower connections directed toward the data.]

We can sample from a DBN via first Gibbs sampling on the undirected layer, then ancestral
sampling through the rest of the (directed) model to eventually obtain a sample from the
visible units.

Deep Boltzmann Machines (20.4). Same as DBNs, but now all layers are undirected. Note
that this is very close to the standard RBM, since we have a set of hidden and observed vari-
ables, except now we interpret certain subgroups of hidden units as being in a “layer”, thus
allowing for connections between hidden units in adjacent layers. What’s interesting is that
this still defines a bipartite graph, with odd-numbered layers on one side and even on the
other42 .

Differentiable Generator Networks (20.10.2). Use a differentiable function g(z; θ (g) ) to


transform samples of latent variables z to either (a) samples x, or (b) distributions over samples
x. For an example of case (a), the standard procedure for sampling from N(µ, Σ) is to first
sample z from N(0, I) and feed it into a generator network consisting of a single affine layer:
(Case a: interpret g(z) as emitting x directly.)

    x ← g(z) = µ + Lz

where L is the Cholesky decomposition⁴³ of Σ. In general, we think of the generator function


g as providing a change of variables that transforms the distribution over z into the desired
distribution x. Of course, there is an exact formula for doing this,

pz (g −1 (x))
px (x) = (146)
∂g
det ∂z

but it’s usually far easier to use indirect means of learning g, rather than trying to maximize/e-
valuation px (x) directly.
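A concrete sketch (my own, not from the book) of this single-affine-layer generator: draw z ~ N(0, I), output x = µ + Lz, and confirm the sample mean and covariance:

```python
# Sketch (my own): the single-affine-layer "generator" for N(mu, Sigma) --
# sample z ~ N(0, I) and output x = mu + L z, with L the Cholesky factor.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
L = np.linalg.cholesky(Sigma)                       # Sigma = L L^T, L lower triangular

z = rng.normal(size=(100_000, 2))                   # latent samples
x = mu + z @ L.T                                    # g(z) = mu + L z, batched

print(x.mean(axis=0).round(3))                      # ~ mu
print(np.cov(x, rowvar=False).round(3))             # ~ Sigma
```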

For case (b), the common approach is to train the generator net to emit conditional probabilities:
(Case b: interpret g(z) as emitting p(x | z).)

    p(x_i | z) = g(z)_i ,    p(x) = E_z[ p(x | z) ]          (147)

which can also support generating discrete data (case a cannot). The challenge in training
generator networks is that we often have a set of examples x, but the value of z for each x
is not fixed and known ahead of time. We’ll now look at some ways of training generator
nets given only training samples for x. Note that such a setting is very unlike unsupervised
learning, where we typically interpret x as inputs that we don’t have labels for, while here we
interpret x as outputs that we don’t know the associated inputs for.

42
Recall that this immediately implies that units in all odd layers are conditionally independent given the
even layers (and vice-versa for even to odd).
43
The [unique] Cholesky decomposition of a (real-symmetric) p.d. matrix A is a decomposition of the form
A = LLT , where L is lower triangular.

Variational Autoencoders (20.10.3). VAEs are directed models that use learned approx-
imate inference and can be trained purely with gradient-based methods. To generate a
sample x, the VAE first samples z from the code distribution p_model(z). This sample is
then fed through a differentiable generator network g(z). Finally, x is sampled from
p_model(x; g(z)) = p_model(x | z).

Papers and Tutorials

Contents

4.1 WaveNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.2 Neural Style . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.3 Neural Conversation Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.4 NMT By Jointly Learning to Align & Translate . . . . . . . . . . . . . . . . . . . . . . . . 76
4.4.1 Detailed Model Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.5 Effective Approaches to Attention-Based NMT . . . . . . . . . . . . . . . . . . . . . . . . 79
4.6 Using Large Vocabularies for NMT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.7 Candidate Sampling – TensorFlow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.8 Attention Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.9 TextRank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.9.1 Keyword Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.9.2 Sentence Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.10 Simple Baseline for Sentence Embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.11 Survey of Text Clustering Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.11.1 Distance-based Clustering Algorithms . . . . . . . . . . . . . . . . . . . . . . . . 97
4.11.2 Probabilistic Document Clustering and Topic Models . . . . . . . . . . . . . . . . . 98
4.11.3 Online Clustering with Text Streams . . . . . . . . . . . . . . . . . . . . . . . . 100
4.12 Deep Sentence Embedding Using LSTMs . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.13 Clustering Massive Text Streams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.14 Supervised Universal Sentence Representations (InferSent) . . . . . . . . . . . . . . . . . . . 107
4.15 Dist. Rep. of Sentences from Unlabeled Data (FastSent) . . . . . . . . . . . . . . . . . . . . 108
4.16 Latent Dirichlet Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
4.17 Conditional Random Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
4.18 Attention Is All You Need . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
4.19 Hierarchical Attention Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
4.20 Joint Event Extraction via RNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
4.21 Event Extraction via Bidi-LSTM Tensor NNs . . . . . . . . . . . . . . . . . . . . . . . . . 125
4.22 Reasoning with Neural Tensor Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
4.23 Language to Logical Form with Neural Attention . . . . . . . . . . . . . . . . . . . . . . . 128
4.24 Seq2SQL: Generating Structured Queries from NL using RL . . . . . . . . . . . . . . . . . . 130
4.25 SLING: A Framework for Frame Semantic Parsing . . . . . . . . . . . . . . . . . . . . . . . 133
4.26 Poincaré Embeddings for Learning Hierarchical Representations . . . . . . . . . . . . . . . . . 135

66
4.27 Enriching Word Vectors with Subword Information (FastText) . . . . . . . . . . . . . . . 137
4.28 DeepWalk: Online Learning of Social Representations . . . . . . . . . . . . . . . . . . . . 139
4.29 Review of Relational Machine Learning for Knowledge Graphs . . . . . . . . . . . . . . . . 141
4.30 Fast Top-K Search in Knowledge Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . 144
4.31 Dynamic Recurrent Acyclic Graphical Neural Networks (DRAGNN) . . . . . . . . . . . . . . 146
4.31.1 More Detail: Arc-Standard Transition System . . . . . . . . . . . . . . . . . . . . 149
4.32 Neural Architecture Search with Reinforcement Learning . . . . . . . . . . . . . . . . . . . 150
4.33 Joint Extraction of Events and Entities within a Document Context . . . . . . . . . . . . . . 152
4.34 Globally Normalized Transition-Based Neural Networks . . . . . . . . . . . . . . . . . . . 155
4.35 An Introduction to Conditional Random Fields . . . . . . . . . . . . . . . . . . . . . . . 158
4.35.1 Inference (Sec. 4) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
4.35.2 Parameter Estimation (Sec. 5) . . . . . . . . . . . . . . . . . . . . . . . . . . 165
4.35.3 Related Work and Future Directions (Sec. 6) . . . . . . . . . . . . . . . . . . . . 168
4.36 Co-sampling: Training Robust Networks for Extremely Noisy Supervision . . . . . . . . . . . 169
4.37 Hidden-Unit Conditional Random Fields . . . . . . . . . . . . . . . . . . . . . . . . . . 170
4.37.1 Detailed Derivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
4.38 Pre-training of Hidden-Unit CRFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
4.39 Structured Attention Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
4.40 Neural Conditional Random Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
4.41 Bidirectional LSTM-CRF Models for Sequence Tagging . . . . . . . . . . . . . . . . . . . 183
4.42 Relation Extraction: A Survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
4.43 Neural Relation Extraction with Selective Attention over Instances . . . . . . . . . . . . . . 187
4.44 On Herding and the Perceptron Cycling Theorem . . . . . . . . . . . . . . . . . . . . . . 189
4.45 Non-Convex Optimization for Machine Learning . . . . . . . . . . . . . . . . . . . . . . 191
4.45.1 Non-Convex Projected Gradient Descent (3) . . . . . . . . . . . . . . . . . . . . 194
4.46 Improving Language Understanding by Generative Pre-Training . . . . . . . . . . . . . . . 195
4.47 Deep Contextualized Word Representations . . . . . . . . . . . . . . . . . . . . . . . . . 196
4.48 Exploring the Limits of Language Modeling . . . . . . . . . . . . . . . . . . . . . . . . 198
4.49 Connectionist Temporal Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
4.50 BERT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
4.51 Wasserstein is all you need . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
4.52 Noise Contrastive Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
4.52.1 Self-Normalized NCE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
4.53 Neural Ordinary Differential Equations . . . . . . . . . . . . . . . . . . . . . . . . . . 210
4.54 On the Dimensionality of Word Embedding . . . . . . . . . . . . . . . . . . . . . . . . 212
4.55 Generative Adversarial Nets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
4.56 A Framework for Intelligence and Cortical Function . . . . . . . . . . . . . . . . . . . . . 216
4.57 Large-Scale Study of Curiosity Driven Learning . . . . . . . . . . . . . . . . . . . . . . . 217
4.58 Universal Language Model Fine-Tuning for Text Classification . . . . . . . . . . . . . . . . 218
4.59 The Marginal Value of Adaptive Gradient Methods in Machine Learning . . . . . . . . . . . 220
4.60 A Theoretically Grounded Application of Dropout in Recurrent Neural Networks . . . . . . . . . 221
4.61 Improving Neural Language Models with a Continuous Cache . . . . . . . . . . . . . . . . . . 222
4.62 Protection Against Reconstruction and Its Applications in Private Federated Learning . . . . . . . 223
4.63 Context Dependent RNN Language Model . . . . . . . . . . . . . . . . . . . . . . . . . . 225

67
4.64 Strategies for Training Large Vocabulary Neural Language Models . . . . . . . . . . . . . . . . 226
4.65 Product quantization for nearest neighbor search . . . . . . . . . . . . . . . . . . . . . . 228
4.66 Large Memory Layers with Product Keys . . . . . . . . . . . . . . . . . . . . . . . . . 229
4.67 Show, Ask, Attend, and Answer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
4.68 Did the Model Understand the Question? . . . . . . . . . . . . . . . . . . . . . . . . . 233
4.69 XLNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
4.70 Transformer-XL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
4.71 Efficient Softmax Approximation for GPUs . . . . . . . . . . . . . . . . . . . . . . . . . 237
4.72 Adaptive Input Representations for Neural Language Modeling . . . . . . . . . . . . . . . . . 238
4.73 Neural Module Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
4.74 Learning to Compose Neural Networks for QA . . . . . . . . . . . . . . . . . . . . . . . . 241
4.75 End-to-End Module Networks for VQA . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
4.76 Fast Multi-language LSTM-based Online Handwriting Recognition . . . . . . . . . . . . . . . 245
4.77 Multi-Language Online Handwriting Recognition . . . . . . . . . . . . . . . . . . . . . . . 246
4.78 Modular Generative Adversarial Networks . . . . . . . . . . . . . . . . . . . . . . . . . . 248
4.79 Transfer Learning from Speaker Verification to TTS . . . . . . . . . . . . . . . . . . . . . . 250

68
Papers and Tutorials January 15, 2017

WaveNet
Table of Contents Local Written by Brandon McKinzie

Introduction.
• Inspired by recent advances in neural autoregressive generative models, and based
on the PixelCNN architecture.
• Long-range dependencies dealt with via “dilated causal convolutions, which exhibit very
large receptive fields.”

WaveNet. The joint probability of a waveform x = {x_1, . . . , x_T} is factorised as a product of conditional probabilities,

p(x) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})    (148)

which are modeled by a stack of convolutional layers (no pooling).

The model outputs a categorical distribution over the next value xt with a softmax
layer and it is optimized to maximize the log-likelihood of the data w.r.t. the pa-
rameters.

Main ingredient of WaveNet is dilated causal convolutions, illustrated below. Note the absence
of recurrent connections, which makes them faster to train than RNNs, but at the cost of
requiring many layers, or large filters to increase the receptive field44 .

44
Loose interpretation of receptive fields here is that large fields can take into account more info (back in
time) as opposed to smaller fields, which can be said to be “short-sighted”

69
Excellent concise definition from paper:

A dilated convolution (a convolution with holes) is a convolution where the filter


is applied over an area larger than its length by skipping input values with a
certain step. It is equivalent to a convolution with a larger filter derived from the
original filter by dilating it with zeros, but is significantly more efficient. A dilated
convolution effectively allows the network to operate on a coarser scale than with
a normal convolution. This is similar to pooling or strided convolutions, but here
the output has the same size as the input. As a special case, dilated convolution
with dilation 1 yields the standard convolution.
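To make the receptive-field arithmetic concrete, here is a minimal NumPy sketch of a 1-D causal convolution with a dilation factor (the filter taps, input, and dilation schedule are made-up illustrations, not the paper's implementation):

```python
import numpy as np

def causal_dilated_conv1d(x, w, dilation=1):
    """Causal 1-D convolution: output[t] depends only on x[t], x[t-d], x[t-2d], ...

    x: input signal, shape (T,).  w: filter taps, shape (k,).
    Output has the same length as the input (samples before t=0 are treated as zero).
    """
    T, k = len(x), len(w)
    y = np.zeros(T)
    for t in range(T):
        for i in range(k):
            j = t - i * dilation          # look back i*dilation steps
            if j >= 0:
                y[t] += w[i] * x[j]
    return y

x = np.random.randn(16)
w = np.array([0.5, 0.3])                  # length-2 filter
# Stacking layers with dilations 1, 2, 4, 8 gives a receptive field of
# 1 + (2-1)*(1+2+4+8) = 16 samples while keeping output length equal to input length.
y = x
for d in [1, 2, 4, 8]:
    y = causal_dilated_conv1d(y, w, dilation=d)
print(y.shape)   # (16,)
```

The point of the quote above is visible here: the output never shrinks, yet each doubling of the dilation doubles how far back in time the stack can see.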

Softmax distributions. Chose to model the conditional distributions p(x_t | x_1, . . . , x_{t−1}) with a softmax layer. To deal with the fact that there are 2^16 possible values [16-bit audio], first apply a “µ-law companding transformation” to the data, and then quantize it to 256 possible values:

f(x_t) = \mathrm{sign}(x_t) \frac{\ln(1 + \mu |x_t|)}{\ln(1 + \mu)}    (149)

which (after plotting in Wolfram) looks identical to the sigmoid function.
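As a quick sanity check of equation 149, a small NumPy sketch of µ-law companding followed by quantization to 256 levels (µ = 255 and the bin mapping are my assumptions; the notes above only give the companding formula):

```python
import numpy as np

def mu_law_encode(x, mu=255, channels=256):
    """Compand waveform x in [-1, 1] with the mu-law transform, then quantize to bins."""
    x = np.clip(x, -1.0, 1.0)
    fx = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)   # eq. (149)
    # Map [-1, 1] -> integer bins {0, ..., channels-1}
    return np.round((fx + 1.0) / 2.0 * (channels - 1)).astype(np.int64)

def mu_law_decode(q, mu=255, channels=256):
    """Invert the quantization and companding (approximately)."""
    fx = 2.0 * q / (channels - 1) - 1.0
    return np.sign(fx) * ((1 + mu) ** np.abs(fx) - 1) / mu

wave = 0.8 * np.sin(np.linspace(0, 4 * np.pi, 1000))
q = mu_law_encode(wave)
recon = mu_law_decode(q)
print(q.min(), q.max(), np.max(np.abs(wave - recon)))  # bins in 0..255, small error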

Gated activation and res/skip connections. Use the same gated activation unit as PixelCNN:

z = \tanh(W_{f,k} * x) \odot \sigma(W_{g,k} * x)    (150)

where * denotes the convolution operator, ⊙ denotes element-wise multiplication, k is the layer index, f, g denote filter/gate, and W is a learnable convolution filter. This is illustrated below, along with the residual/skip connections used to speed up convergence/enable training deeper models.

70
Conditional WaveNets. Can also model the conditional distribution of x given some additional input h (e.g. speaker identity).

p(x \mid h) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1}, h)    (151)

→ Global conditioning. A single h that influences the output distribution across all timesteps. The activation becomes:

z = \tanh(W_{f,k} * x + V_{f,k}^{T} h) \odot \sigma(W_{g,k} * x + V_{g,k}^{T} h)    (152)

→ Local conditioning (confusing). Have a second time series h_t. They first transform this h_t using a “transposed conv net (learned upsampling)” that maps it to a new time series y = f(h) with the same resolution as x.

Experiments.

• Multi-Speaker Speech Generation. Dataset: multi-speaker corpus of 44 hours of


data from 109 different speakers45 . Receptive field of 300 milliseconds.

• Text-to-Speech. Single-speaker datasets of 24.6 hours (English) and 34.8 hours (Chi-
nese) speech. Locally conditioned on linguistic features. Receptive field of 240 millisec-
onds. Outperformed both LSTM-RNN and HMM.

• Music. Trained the WaveNets to model two music datasets: (1) 200 hours of annotated
music audio, and (2) 60 hours of solo piano music from youtube. Larger receptive fields
sounded more musical.

• Speech Recognition. “With WaveNets we have shown that layers of dilated convolu-
tions allow the receptive field to grow longer in a much cheaper way than using LSTM
units.”

Conclusion (verbatim): “This paper has presented WaveNet, a deep generative model of audio
data that operates directly at the waveform level. WaveNets are autoregressive and combine
causal filters with dilated convolutions to allow their receptive fields to grow exponentially with
depth, which is important to model the long-range temporal dependencies in audio signals. We
have shown how WaveNets can be conditioned on other inputs in a global (e.g. speaker identity)
or local way (e.g. linguistic features). When applied to TTS, WaveNets produced samples that
outperform the current best TTS systems in subjective naturalness. Finally, WaveNets showed
very promising results when applied to music audio modeling and speech recognition.”

45
Speakers encoded as ID in form of a one-hot vector

71
Papers and Tutorials January 22

Neural Style
Table of Contents Local Written by Brandon McKinzie

Notation.
• Content image: p
• Filter responses: the matrix P l ∈ RNl ×Ml contains the activations of the filters in
layer l, where Pijl would give the activation of the ith filter at position j in layer l. Nl is
number of feature maps, each of size Ml (height × width of the feature map)46 .
• Reconstructed image: x (initially random noise). Denote its corresponding filter response matrix at layer l as F^l.

Content Reconstruction.

1. Feed in content image p into pre-trained network, saving any desired filter responses
during the forward pass. These are interpreted as the various “encodings” of the image
done by the network. Think of them analogously to “ground-truth” labels.

2. Define x as the generated image, which we first initialize to random noise. We will be
changing the pixels of x via gradient descent updates.

3. Define the loss function. After each forward pass, evaluate with squared-error loss
between the two representations at the layer of interest:
L_{content}(p, x, l) = \frac{1}{2} \sum_{i,j} (F_{ij}^{l} - P_{ij}^{l})^2    (1)

\frac{\partial L_{content}}{\partial F_{ij}^{l}} = \begin{cases} (F^l - P^l)_{ij} & F_{ij}^{l} > 0 \\ 0 & F_{ij}^{l} < 0 \end{cases}    (2)

where it appears we are assuming ReLU activations (?).

4. Compute iterative updates to x via gradient descent until it generates the same re-
sponse in a certain layer of the CNN as the original image p.

46
If not clear, Ml is a scalar, for any given value of l.

72
Style Representation. On top of the CNN responses in each layer, the authors built a style
representation that computes the correlations between the different [aforementioned] filter
responses. The correlation matrix for layer l is denoted in the standard way with a Gram
matrix G^l \in \mathbb{R}^{N_l \times N_l}, with entries

G_{ij}^{l} = \langle F_i^l, F_j^l \rangle = \sum_{k} F_{ik}^{l} F_{jk}^{l}    (3)

To generate a texture that matches the style of a given image, do the following.

1. Let a denote the original [style] image, with corresponding Al . Let x, initialized to
random noise, denote the generated [style] image, with corresponding Gl .

2. The contribution E_l of layer l to the total loss L_{style} is given by

E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{ij} (G_{ij}^{l} - A_{ij}^{l})^2    (4)

L_{style}(a, x) = \sum_{l=0}^{L} w_l E_l    (5)

\frac{\partial E_l}{\partial F_{ij}^{l}} = \begin{cases} \frac{1}{N_l^2 M_l^2} \left( (F^l)^T (G^l - A^l) \right)_{ji} & F_{ij}^{l} > 0 \\ 0 & F_{ij}^{l} < 0 \end{cases}    (6)

where wl are [as of yet unspecified] weighting factors of the contribution of layer l to the
total loss.

Mixing content with style. Essentially a joint minimization that combines the previous
two main ideas.

1. Begin with the following images: white noise x, content image p, and style image a.

2. The loss function to minimize is a linear combination of 1 and 5:

Ltotal (p, a, x, l) = αLcontent (p, x, l) + βLstyle (a, x) (7)

Note that we can choose which layers L_{style} uses by tweaking the layer weights w_l. For example, the authors chose to set w_l = 1/5 for ’conv[1, 2, 4, 5]_1’ and 0 otherwise. For the ratio α/β, they explored 1 × 10^{-3} and 1 × 10^{-4}.
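To keep the notation straight, a toy NumPy sketch of the content loss, Gram matrix, and single-layer style loss; the random arrays stand in for the pretrained-network feature maps, and the α, β values are arbitrary choices here:

```python
import numpy as np

N_l, M_l = 64, 32 * 32              # number of feature maps and their flattened size

P = np.random.randn(N_l, M_l)       # content-image responses at layer l
F = np.random.randn(N_l, M_l)       # generated-image responses at layer l
A_resp = np.random.randn(N_l, M_l)  # style-image responses at layer l

def gram(F):
    # G_ij = sum_k F_ik F_jk, eq. (3)
    return F @ F.T

content_loss = 0.5 * np.sum((F - P) ** 2)                     # eq. (1)
G, A = gram(F), gram(A_resp)
style_loss_l = np.sum((G - A) ** 2) / (4 * N_l**2 * M_l**2)   # eq. (4), one layer

alpha, beta = 1.0, 1e3              # content/style trade-off, illustrative only
total = alpha * content_loss + beta * style_loss_l            # eq. (7), single layer
print(content_loss, style_loss_l, total)
```

In the real procedure these losses are backpropagated to the pixels of x and x is updated by gradient descent (or L-BFGS); here the arrays are static just to show the bookkeeping.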

73
Papers and Tutorials February 8

Neural Conversation Model


Table of Contents Local Written by Brandon McKinzie

[Reminder: red text means I need to come back and explain what is meant, once I understand it.]

Abstract. This paper presents a simple approach for conversational modeling which uses the
sequence to sequence framework. It can be trained end-to-end, meaning fewer hand-crafted
rules. The lack of consistency is a common failure of our model.

Introduction. Major advantage of the seq2seq model is it requires little feature engineering
and domain specificity. Here, the model is tested on chat sessions from an IT helpdesk dataset
of conversations, as well as movie subtitles.

Related Work. The authors’ approach is based on the following (linked and saved) papers
on seq2seq:
• Kalchbrenner & Blunsom, 2013.
• Sutskever et al., 2014. (Describes Seq2Seq model)
• Bahdanau et al., 2014.

Model. Succinctly described by the authors:


The model reads the input sequence one token at a time, and predicts the output sequence, also one token at a time. During training, the true output sequence is given to the model, so learning can be done by backpropagation. The model is trained to maximize the cross entropy of the correct sequence given its context. During inference, in which the true output sequence is not observed, we simply feed the predicted output token as input to predict the next output. This is a “greedy” inference approach.

[Margin note: an example of a less greedy approach is beam search.]

74
The thought vector is the hidden state of the model when it receives [as input] the end of sequence symbol ⟨eos⟩, because it stores the info of the sentence, or thought, “ABC”. The authors acknowledge that this model will not be able to “solve” the problem of modeling dialogue due to the objective function not capturing the actual objective achieved through human communication, which is typically longer term and based on exchange of information [rather than next step prediction]47.

[Margin note: ponder — what would be a reasonable objective function & model for conversation?]

IT Data & Experiment.

[Margin note: reminder — check out this git repo.]
• Data Description: Customers talking to IT support, where typical interactions are 400
words long and turn-taking is clearly signaled.
• Training Data: 30M tokens, 3M of which are used as validation. They built a vocabu-
lary of the most common 20K words, and introduced special tokens indicating turn-taking
and actor.
• Model: A single-layer LSTM with 1024 memory cells.
• Optimization: SGD with gradient clipping.
• Perplexity: At convergence, achieved perplexity of 8, whereas an n-gram model
achieved 18.

47
I’d imagine that, in order to model human conversation, one obvious element needed would be a memory.
Reminds me of DeepMind’s DNC. There would need to be some online filtering & output process to capture the
crucial aspects/info to store in memory for later, and also some method of retrieving them when needed later.
The method for retrieval would likely be some inference process where, given a sequence of inputs, the probability
of them being related to some portion of memory could be trained. This would allow for conversations that
stretch arbitrarily back in the past. Also, when storing the memories, I’d imagine a reasonable architecture
would be some encoder-decoder for a sparse distributed representation of memory.

75
Papers and Tutorials February 27

NMT By Jointly Learning to Align & Translate


Table of Contents Local Written by Brandon McKinzie

[Bahdanau et al., 2014]. The primary motivation for me writing this is to better understand
the attention mechanism in my sequence to sequence chatbot implementation.

Abstract. The authors claim that using a fixed-length vector [in the vanilla encoder-decoder
for NMT] is a bottleneck. They propose allowing a model to (soft-)search for parts of a source
sentence that are relevant to predicting a target word, without having to form these parts as
a hard segment explicitly.

Learning to Align48 and translate.

• Decoder. Their decoder defines the conditional output distribution as

p(yi | y1 , . . . , yi−1 , x) = g(yi−1 , si , ci ) (153)


si = f (si−1 , yi−1 , ci ) (154)

where si is the RNN [decoder] hidden state at time i.


– NOTE: ci is not the ith element of the standard context vector; rather, it is itself a
distinct context vector that depends on a sequence of annotations (h1 , . . . , hTx ). It
seems that each annotation hi is a hidden (encoder) state “that contains information
about the whole input sequence with a strong focus on the parts surrounding the
i-th word of the input sequence.”
– The context vector ci is computed as follows:
c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j    (155)

\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}    (156)

e_{ij} = a(s_{i-1}, h_j)    (157)

where the function eij is given by an alignment model which scores how well the
inputs around position j and the output at position i match.
• Encoder. It’s just a bidirectional RNN. What they call “annotation h_j” is literally just the concatenated vector [h_j^{forward}; h_j^{backward}].

48
By “align” the authors are referring to aligning the source-search to the relevant parts for prediction.

76
4.4.1 Detailed Model Architecture

(Appendix A). Explained with the TensorFlow user in mind.

Decoder Internals. It’s just a GRU. However, it will be helpful to detail how we format the
inputs (given we now have attention). Wherever we’d usually pass the previous decoder state
si−1 , we now pass a concatenated state, [si−1 , ci ], that also contains the ith context vector.
Below I go over the flow of information from GRU input to output:
1. Notation: yt is the loop-embedded output of the decoder (prediction) at time t, st is
the internal hidden state of the decoder at time t, and ct is the context vector at time t.
s̃t is the proposed/proposal state at time t.
2. Gates:

zt = σ (Wz yt−1 + Uz [st−1 , ct ]) [update gate] (158)


rt = σ (Wr yt−1 + Ur [st−1 , ct ]) [reset gate] (159)

3. Proposal state:

s̃t = tanh (W yt−1 + U [rt ◦ st−1 , ct ]) (161)

4. Hidden state:

st = (1 − zt ) ◦ st−1 + zt ◦ s̃t (162)

Alignment Model. All equations enumerated below are for some timestep t during the
decoding process.
1. Score: For all j ∈ [0, L_{enc} − 1], where L_{enc} is the number of words in the encoder sequence, compute:

a_j = a(s_{t-1}, h_j) = v_a^T \tanh(W_a s_{t-1} + U_a h_j)    (163)

2. Alignments: Feed the unnormalized alignments (scores) through a softmax so they represent a valid probability distribution.

a_j \leftarrow \frac{e^{a_j}}{\sum_{k=0}^{L_{enc}-1} e^{a_k}}    (164)

3. Context: The context vector input for our decoder at this timestep:

c = \sum_{j=1}^{L_{enc}} a_j h_j    (165)
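A minimal NumPy sketch of one decoding step of this alignment model; the dimensions and random parameter matrices below are my own placeholders:

```python
import numpy as np

L_enc, enc_dim, dec_dim, attn_dim = 7, 10, 12, 8
h = np.random.randn(L_enc, enc_dim)       # encoder annotations h_1..h_Lenc
s_prev = np.random.randn(dec_dim)         # previous decoder state s_{t-1}

W_a = np.random.randn(attn_dim, dec_dim)  # applied to the decoder state
U_a = np.random.randn(attn_dim, enc_dim)  # applied to each annotation
v_a = np.random.randn(attn_dim)

# eq. (163): unnormalized scores a_j = v_a^T tanh(W_a s_{t-1} + U_a h_j)
scores = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h[j]) for j in range(L_enc)])

# eq. (164): softmax over the encoder positions
alignments = np.exp(scores - scores.max())
alignments /= alignments.sum()

# eq. (165): context vector is the alignment-weighted sum of annotations
c = alignments @ h                        # shape (enc_dim,)
print(alignments.round(3), c.shape)
```

The concatenation [s_{t−1}, c] is then what gets fed into the GRU gates above in place of the plain previous state.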

77
Decoder Outputs. All below is for some timestep t during the decoding process. To find the
probability of some (one-hot) word y [at timestep t]:
\Pr(y \mid s, c) \propto e^{y^T W_o u}    (166)

u = [\max\{\tilde{u}_{2j-1}, \tilde{u}_{2j}\}]_{j=1,\dots,\ell}^{T}    (167)

\tilde{u} = U_o [s_{t-1}, c] + V_o y_{t-1}    (168)

N.B.: From reading other (and more recent) papers, these last few equations do not appear to
be the way it is usually done (thank the lord). See Luong’s work for a much better approach.

78
Papers and Tutorials May 11

Effective Approaches to Attention-Based NMT


Table of Contents Local Written by Brandon McKinzie

[Luong et al., 2015]


Attention-Based Models. For attention especially, the devil is in the details, so I’m going
to go into somewhat excruciating detail here to ensure no ambiguities remain. For both global
and local attention, the following information holds true:
• “At each time step t in the decoding phase, both approaches first take as input the hidden
state ht at the top layer of a stacking LSTM.”
• Then, they derive [with different methods] a context vector ct to capture source-side info.
• Given ht and ct , they both compute the attentional hidden state as:
h̃t = tanh (Wc [ct ; ht ]) (169)
• Finally, the predictive distribution (decoder outputs) is given by feeding this through a
softmax:
 
p(y_t \mid y_{<t}, x) = \mathrm{softmax}(W_s \tilde{h}_t)    (170)

Global Attention. Now I’ll describe in detail the processes involved in ht → at → ct → h̃t .
1. ht : Compute the hidden state ht in the normal way (not obvious if you’ve read e.g.
Bahdanau’s work...)
2. a_t:
(a) Compute the scores between h_t and each source h̄_s, where our options are:

score(h_t, \bar{h}_s) = \begin{cases} h_t^T \bar{h}_s & \text{dot} \\ h_t^T W_a \bar{h}_s & \text{general} \\ v_a^T \tanh(W_a [h_t; \bar{h}_s]) & \text{concat} \end{cases}    (171)

(b) Compute the alignment vector at of length Lenc (number of words in the encoder
sequence):

a_t(s) = \mathrm{align}(h_t, \bar{h}_s)    (172)

= \frac{\exp(\mathrm{score}(h_t, \bar{h}_s))}{\sum_{s'} \exp(\mathrm{score}(h_t, \bar{h}_{s'}))}    (173)

79
3. c_t: The weighted average over all source (encoder) hidden states49:

c_t = \sum_{i=1}^{L_{enc}} a_t(i) \bar{h}_i    (174)

4. h̃t : For convenience, I’ll copy the equation for h̃t again here:

h̃t = tanh (Wc [ct ; ht ]) (175)

Input-Feeding Approach. A copy of each output h̃t is sent forward and concatenated with
the inputs for the next timestep, i.e. the inputs go from xt+1 to [h̃t ; xt+1 ].

49
NOTE: Right after mentioning the context vector, the authors have the following cryptic footnote that
may be useful to ponder: For short sentences, we only use the top part of at and for long sentences, we ignore
words near the end.

80
Papers and Tutorials March 11

Using Large Vocabularies for NMT


Table of Contents Local Written by Brandon McKinzie

Paper information:
- Full title: On Using Very Large Target Vocabulary for Neural Machine Translation.
- Authors: Jean, Cho, Memisevic, Bengio.
- Date: 18 Mar 2015.
- [arXiv link]

NMT Overview. Typical implementation is encoder-decoder network. Notation for inputs


& encoder:

x = (x1 , . . . , xT ) [source sentence] (176)


h = (h1 , . . . , hT ) [encoder state seq] (177)
ht = f (xt , ht−1 ) (178)

where f is the function defined by the cell state (e.g. GRU/LSTM/etc.). Then the decoder generates the output sequence y, with probability given below:

y = (y_1, \dots, y_{T'}) \quad [y_i \in \mathbb{Z}]    (179)

\Pr[y_t \mid y_{<t}, x] \propto e^{q(y_{t-1}, z_t, c_t)}    (180)

z_t = g(y_{t-1}, z_{t-1}, c_t) \quad [\text{decoder hidden?}]    (181)

c_t = r(z_{t-1}, h_1, \dots, h_T) \quad [\text{decoder inp?}]    (182)

[Margin note: the functions q, g, and r are just placeholders – “some function of [inputs].”]

As usual, the model is jointly trained to maximize the conditional log-likelihood of the correct translation. For N training sample pairs (x^n, y^n), and denoting the length of the n-th target sentence as T_n, this can be written as

\theta^* = \arg\max_{\theta} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \log\left(\Pr[y_t^n \mid y_{<t}^n, x^n]\right)    (183)

81
Model Details. Above is the general structure. Here I’ll summarize the specific model chosen
by the authors.
• Encoder. Bi-directional, which just means h_t = [h_t^{backward}; h_t^{forward}]. The chosen cell state (the function f) is a GRU.
• Decoder. At each timestep, computes the following:
→ Context vector c_t.

c_t = \sum_{i=1}^{T} \alpha_i h_i    (184)

\alpha_t = \frac{e^{a(h_t, z_{t-1})}}{\sum_k e^{a(h_k, z_{t-1})}}    (185)

[Margin note: a is a standard single-hidden-layer NN.]

→ Decoder hidden state zt . Also a GRU cell. Computed based on the previous hidden
state zt−1 , the previously generated symbol yt−1 , and also the computed context
vector ct .
• Next-word probability. They model equation 180 as50:

\Pr[y_t \mid y_{<t}, x] = \frac{1}{Z} e^{w_t^T \phi(y_{t-1}, z_t, c_t) + b_t}    (186)

Z = \sum_{k:\, y_k \in V} e^{w_k^T \phi(y_{t-1}, z_t, c_t) + b_k}    (187)

where φ is an affine transformation followed by a nonlinear activation, w_t and b_t are the target word vector and bias, and V is the set of all target vocabulary words.

[Margin note: reminder — y_i is an integer token, while w_i is the target vector of length vocab size.]

Approximate Learning Approach. Main idea:


“In order to avoid the growing complexity of computing the normalization constant, we propose
here to use only a small subset V 0 of the target vocabulary at each update.”

Consider the gradient of the log-likelihood51 , written in terms of the energy E.


\nabla \log\left(\Pr[y_t \mid y_{<t}, x]\right) = \nabla E(y_t) - \sum_{k:\, y_k \in V} \Pr[y_k \mid y_{<t}, x]\, \nabla E(y_k)    (188)

E(y_j) = w_j^T \phi(y_{t-1}, z_t, c_t) + b_j    (189)

50
Note: The formula for Z is correct. Notice that the only part of the RHS of Pr(yt ) with a t is as the
subscript of w. To be clear, wk is a full word vector and the sum is over all words in the output vocabulary, the
index k has absolutely nothing to do with timestep. They use the word target but make sure not to misinterpret
that as somehow meaning target words in the sentence or something.
51
NOTE TO SELF: After long and careful consideration, I’m concluding that the authors made a typo
when defining E(yj ), which they choose to subscript all parts of the RHS with j, but that is in direct contradiction
with a step-by-step derivation, which is why I have written it the way it is. I’m pretty sure my version is right,
but I know you’ll have to re-derive it yourself next time you see this. And you’ll somehow prove me wrong.
Actually, after reading on further, I doubt you’ll prove me wrong. Challenge accepted, me. Have fun!

82
The crux of the approach is interpreting the second term as EP [∇E(y)], where P denotes P r(y |
y<t , x). They approximate this expectation by taking it over a subset V 0 of the predefined
proposal distribution Q. So Q is a p.d.f. over the possible yi , and we sample from Q to
generate the elements of the subset V 0 .
E_P[\nabla E(y)] \approx \sum_{k:\, y_k \in V'} \frac{\omega_k}{\sum_{k':\, y_{k'} \in V'} \omega_{k'}}\, \nabla E(y_k)    (190)

\omega_k = e^{E(y_k) - \log Q(y_k)}    (191)

Here is some math I did that was illuminating to me; I’m not sure why the authors didn’t
point out these relationships.

\omega_k = \frac{e^{E(y_k)}}{Q(y_k)} \quad \text{thus} \quad p(y_k \mid y_{<t}, x) = \frac{Q(y_k)}{Z}\, \omega_k    (192)

\rightarrow\; e^{E(y_k)} = Z \cdot p(y_k \mid y_{<t}, x) = Q(y_k) \cdot \omega_k    (193)

Now check this out

Below are the exact and approximate formulas for E_P[\nabla E(y)], written in a suggestive manner. Pay careful attention to subscripts and primes.

E_P[\nabla E(y)] = \sum_{k:\, y_k \in V} \frac{\omega_k \cdot Q(y_k)}{\sum_{k':\, y_{k'} \in V} \omega_{k'} \cdot Q(y_{k'})}\, \nabla E(y_k)    (194)

E_P[\nabla E(y)] \approx \sum_{k:\, y_k \in V'} \frac{\omega_k}{\sum_{k':\, y_{k'} \in V'} \omega_{k'}}\, \nabla E(y_k)    (195)

They’re almost the same! It’s much easier to see why when written this way. I interpret
the difference as follows: in the exact case, we explicitly attach the probabilities Q(yk )
and sum over all values in V . In the second case, by sampling a subset V 0 from Q, we
have encoded these probabilities implicitly as the relative frequency of elements yk in V 0

How to do in practice (very important).

“In practice, we partition the training corpus and define a subset V 0 of the target
vocabulary for each partition prior to training. Before training begins, we sequen-
tially examine each target sentence in the training corpus and accumulate unique
target words until the number of unique target words reaches the predefined thresh-
old τ . The accumulated vocabulary will be used for this partition of the corpus
during training. We repeat this until the end of the training set is reached. Let us
refer to the subset of target words used for the i-th partition by Vi0 .
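A rough sketch of how I read that partitioning scheme (the toy corpus and threshold τ are mine; the real setup runs over the full training corpus):

```python
def partition_vocabularies(target_sentences, tau):
    """Sequentially accumulate unique target words until tau is reached,
    then start a new partition with its own shortlist V'_i."""
    partitions, shortlists = [], []
    current_sents, current_vocab = [], set()
    for sent in target_sentences:
        current_sents.append(sent)
        current_vocab.update(sent.split())
        if len(current_vocab) >= tau:
            partitions.append(current_sents)
            shortlists.append(current_vocab)
            current_sents, current_vocab = [], set()
    if current_sents:                      # last (possibly smaller) partition
        partitions.append(current_sents)
        shortlists.append(current_vocab)
    return partitions, shortlists

corpus = ["the cat sat", "a dog ran", "the dog sat down", "cats and dogs"]
parts, vocabs = partition_vocabularies(corpus, tau=6)
print([len(v) for v in vocabs])
```

During training, the normalization for examples in partition i is then computed only over the shortlist V'_i instead of the full target vocabulary.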

83
Papers and Tutorials March 19

Candidate Sampling – TensorFlow


Table of Contents Local Written by Brandon McKinzie

[Link to article]

What is Candidate Sampling? The goal is to learn a compatibility function F (x, y) which
says something about the compatibility of a class y with a context x. Candidate sampling:
for each training example (xi , yi ), only need to evaluate F (x, y) for a small set of classes
{Ci } ⊂ {L}, where {L} is the set of all possible classes (vocab size number of elements). We
represent F (x, y) as a layer that is trained by back-prop from/within the loss function.

C.S. for Sampled Softmax. I’ll further narrow this down to my use case of having exactly
1 target class (word) at a given time. Any other classes are referred to as negative classes
(for that example).

Sampling algorithm. For each training example (xi , yi ), do:


• Sample the subset Si ⊂ L. How? By sampling from Q(y|x) which gives the probability
of any particular y being included in Si .
• Create the set of candidates, which is just Ci := Si ∪ yi .

Training task. We are given this set Ci and want to find out which element of Ci is the
target class yi . In other words, we want the posterior probability that any of the y in Ci are
the target class, given what we know about Ci and xi . We can evaluate and rearrange as usual
with Bayes’ rule to get:

\Pr\left(y_i^{true} = y \mid C_i, x_i\right) = \frac{\Pr\left(y_i^{true} = y \mid x_i\right) \cdot \Pr\left(C_i \mid y_i^{true} = y, x_i\right)}{\Pr(C_i \mid x_i)}    (196)

= \frac{\Pr(y \mid x_i)}{Q(y \mid x_i)} \cdot \frac{1}{K(x_i, C_i)}    (197)

where they’ve just defined

K(x_i, C_i) \triangleq \frac{\Pr(C_i \mid x_i)}{\prod_{y' \in C_i} Q(y' \mid x_i) \cdot \prod_{y' \in (L - C_i)} (1 - Q(y' \mid x_i))}    (198)

84
Clarifications.

• The learning function F (x, y) is the input to our softmax. It is our neural network,
excluding the softmax function.

• After training our network, it should have learned the general form

F (x, y) = log(Pr(y | x)) + K(x) (199)

which is the general form because

\mathrm{Softmax}(\log(\Pr(y \mid x)) + K(x)) = \frac{e^{\log(\Pr(y \mid x)) + K(x)}}{\sum_{y'} e^{\log(\Pr(y' \mid x)) + K(x)}}    (200)

= \Pr(y \mid x)    (201)

Note that I’ve been a little sloppy here, since Pr(y | x) up until the last line actually
represented the (possibly) unnormalized/relative probabilities.

• [MAIN TAKEAWAY]. Time to bring it all together. Notice that we’ve only trained
F (x, y) to include part of what’s needed to compute the probability of any y being the
target given xi and Ci . . . equation 199 doesn’t take into account Ci at all! Luckily we
know the form of the full equation because it is just the log of equation 197. We can
easily satisfy that by subtracting log(Q(y | x)) from F (x, y) right before feeding into the
softmax.

TL;DR. Train network to learn F (x, y) before softmax, but instead of feeding F (x, y)
to softmax directly, feed

Softmax Input: F (x, y) − log(Q(y | x)) (202)

instead. That’s it.
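A toy NumPy sketch of that takeaway: the train-time logits over the candidate set are F(x, y) − log Q(y | x). The proposal distribution, logits, and vocabulary size are all made up for illustration:

```python
import numpy as np

vocab_size, num_sampled = 1000, 5
rng = np.random.default_rng(0)

# Q(y | x): here just a fixed, unigram-style proposal distribution over the vocabulary.
Q = rng.random(vocab_size)
Q /= Q.sum()

true_class = 42
sampled = rng.choice(vocab_size, size=num_sampled, replace=False, p=Q)
sampled = sampled[sampled != true_class]          # keep the true class only once
candidates = np.concatenate([[true_class], sampled])   # C_i = {y_i} ∪ S_i

# F(x, y): whatever the network produces before the softmax, one value per candidate.
F = rng.normal(size=candidates.shape)

# The correction: feed F(x, y) - log Q(y | x) to the softmax, eq. (202).
logits = F - np.log(Q[candidates])
probs = np.exp(logits - logits.max())
probs /= probs.sum()
loss = -np.log(probs[0])          # cross-entropy against the true class (index 0)
print(loss)
```

At inference time the correction is unnecessary: you score the full vocabulary with F(x, y) directly.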

85
Papers and Tutorials April 04

Attention Terminology
Table of Contents Local Written by Brandon McKinzie

Generally useful info. Seems like there are a few notations floating around, and here I’ll at-
tempt to set the record straight. The order of notes here will loosely correspond with the order
that they’re encountered going from encoder output to decoder output.

Jargon. The people in the attention business love obscure names for things that don’t need
names at all. Terminology:
• Attentions keys/values: Encoder output sequence.
• Query: Decoder [cell] state. Typically the most recent one.
• Scores: Values of eij . For the Bahdanau version, in code this would be computed via

ei = v T tanh(FC(si−1 ) + FC(h)) (203)

where we’d have FC be tf.layers.fully_connected with num_outputs equal to our attention size (up to us). Note that v is a vector.
• Alignments: output of the softmax layer on the attention scores.
• Memory: The α matrix in the equation c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j.

When someone lazily calls some layer output the “attention”, they are usually referring to the
layer just after the linear combination/map of encoder hidden states. You’ll often see this as
some vague function of the previous decoder state, context vector, and possibly even decoder
output (after projection), like f(s_{i−1}, y_{i−1}, c_i). In 99.9% of cases, this function is just a fully
connected layer (if even needed) to map back to the state size for decoder input. That is it.
From encoder to decoder. The path of information flow from encoder outputs to decoder
inputs, a non-trivial process that isn’t given the attention (heh) it deserves52

1. Encoder outputs. Tensor of shape [batch size, sequence length, state size]. The
state is typically some RNNCell state.
• Note: TensorFlow’s AttentionMechanism classes will actually convert this to [batch
size, Lenc , attention size], and refer to it as the “memory”. It is also what is re-
turned when calling myAttentionMech.values.

52
For some reason, the literature favors explaining the path “backwards”, starting with the highly abstracted
“decoder inputs as a weighted sum of encoder states” and then breaking down what the weights are. Unfortu-
nately, the weights are computed via a multi-stage process so that becomes very confusing very quick.

86
2. Compute the scores. The attention scores are the computation described by Lu-
ong/Bahdanau techniques. They both take an inner product of sorts on copies of the
encoder outputs and decoder previous state (query). The main choices are:

score(h_t, \bar{h}_s) = \begin{cases} h_t^T \bar{h}_s & \text{dot} \\ h_t^T W_a \bar{h}_s & \text{general} \\ v_a^T \tanh(W_a [h_t; \bar{h}_s]) & \text{concat} \end{cases}    (204)

[Margin synonyms: scores; unnormalized alignments.]
where the shapes are as follows (for single timestep during decoding process):
• h̄s : [batch size, 1, state size]
• ht : [batch size, 1, state size]
• Wa : [batch size, state size, state size]
• score(ht , h̄s ): [batch size]
3. Softmax the scores. In the vast majority of cases, the attention scores are next fed through a softmax to convert them into a valid probability distribution. Most papers will call this some vague probability function, when in reality they are using softmax only.

[Margin synonyms: softmax outputs; attention dist.; alignments.]

a_t(s) = \mathrm{align}(h_t, \bar{h}_s)    (205)

= \frac{\exp(\mathrm{score}(h_t, \bar{h}_s))}{\sum_{s'} \exp(\mathrm{score}(h_t, \bar{h}_{s'}))}    (206)

where the alignment vector at has shape [batch size, Lenc ]

4. Compute the context vector. The inner product of the softmax outputs and the raw encoder outputs. This will have shape [batch size, attention size] in TensorFlow, where attention size is from the constructor for your AttentionMechanism.

[Margin synonyms: context vector; attention.]

5. Combine context vector and decoder output: Typically with a concat. The result is
what people mean when they say “attention”. Luong et al. denotes this as h̃t , the decoder
output at timestep t. This is what TensorFlow means by “Luong-style mechanisms output
the attention.” And yes, these are used (at least for Luong) to compute the prediction:

h̃t = tanh (Wc [ct , ht ]) (207)


p(yt | y<t , x) = softmax(Ws h̃t ) (208)
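Putting steps 1–5 together, a small batched NumPy sketch of the path from encoder outputs to the “attention” h̃_t using the dot score; shapes follow the bracketed conventions above and the parameter matrix is a random stand-in:

```python
import numpy as np

batch, L_enc, state = 4, 9, 16
enc_outputs = np.random.randn(batch, L_enc, state)   # step 1: the "memory"
h_t = np.random.randn(batch, state)                  # current decoder state (query)

# Step 2: dot scores, one per encoder position -> [batch, L_enc]
scores = np.einsum('bs,bls->bl', h_t, enc_outputs)

# Step 3: softmax over encoder positions -> alignments [batch, L_enc]
a_t = np.exp(scores - scores.max(axis=1, keepdims=True))
a_t /= a_t.sum(axis=1, keepdims=True)

# Step 4: context vector = alignment-weighted sum of encoder outputs -> [batch, state]
c_t = np.einsum('bl,bls->bs', a_t, enc_outputs)

# Step 5: concat and project, eq. (207) -> the "attention" h~_t
W_c = np.random.randn(state, 2 * state)
h_tilde = np.tanh(np.concatenate([c_t, h_t], axis=1) @ W_c.T)
print(h_tilde.shape)   # (batch, state)
```

Swapping in the "general" or "concat" score from equation 204 only changes step 2; everything downstream is identical.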

87
Papers and Tutorials May 03

TextRank
Table of Contents Local Written by Brandon McKinzie

Introduction. A graph-based ranking algorithm is a way of deciding on the importance of a vertex within a graph, by taking into account global information recursively computed from the entire graph, rather than relying only on local vertex-specific information. TextRank is a graph-based ranking model for graphs extracted from natural language texts. The authors investigate/evaluate TextRank on unsupervised keyword and sentence extraction.

[Margin note: a semantic graph is one whose structure encodes meaning between the nodes (semantic elements).]

The PageRank Model. In general [graph-based ranking], a vertex can be ranked based on certain properties such as the number of vertices pointing to it (in-degree), how highly-ranked those vertices are, etc. Formally, the authors [of PageRank] define the score of a vertex V_i as follows:

S(V_i) = (1 - d) + d \cdot \sum_{V_j \in In(V_i)} \frac{1}{|Out(V_j)|} S(V_j), \quad d \in [0, 1]    (209)

[Margin note: the factor d is usually set to 0.85.]

and the damping factor d is interpreted as the probability of jumping from a given vertex53 to
another random vertex in the graph. In practice, the algorithm is implemented through the
following steps:
(1) Initialize all vertices with arbitrary values.54
(2) Iterate over vertices, computing equation 4.9 until convergence [of the error rate] below
a predefined threshold. The error-rate, defined as the difference between the "true score"
and the score computed at iteration k, S k (Vi ), is approximated as:

Errork (Vi ) ≈ S k (Vi ) − S k−1 (Vi ) (210)

53
Note that d is a single parameter for the graph, i.e. it is the same for all vertices.
54
The authors do not specify what they mean by arbitrary. What range? What sampling distribution?
Arbitrary as in uniformly random? EDIT: The authors claim that the vertex values upon completion are not
affected by the choice of initial value. Investigate!
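For concreteness, a small Python sketch of the unweighted update in equation 209 run to convergence on a toy directed graph (the graph, d, and threshold are illustrative choices, not the paper's data):

```python
def textrank_scores(out_links, d=0.85, tol=1e-4, max_iter=100):
    """out_links: dict mapping each vertex to the list of vertices it points to."""
    vertices = list(out_links)
    in_links = {v: [u for u in vertices if v in out_links[u]] for v in vertices}
    S = {v: 1.0 for v in vertices}                 # arbitrary initialization
    for _ in range(max_iter):
        max_err = 0.0
        for v in vertices:
            new = (1 - d) + d * sum(S[u] / len(out_links[u]) for u in in_links[v])
            max_err = max(max_err, abs(new - S[v]))   # error approximation, eq. (210)
            S[v] = new
        if max_err < tol:
            break
    return S

graph = {'A': ['B', 'C'], 'B': ['C'], 'C': ['A'], 'D': ['C']}
print(textrank_scores(graph))
```

The weighted variant in the next section only changes the inner term, replacing 1/|Out(V_j)| with the normalized edge weight.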

88
Weighted Graphs. In contrast with the PageRank model, here we are concerned with natural
language texts, which may include multiple or partial links between the units (vertices). The
authors hypothesize that modifying equation 4.9 to incorporate weighted connections may be
useful for NLP applications.
WS(V_i) = (1 - d) + d \cdot \sum_{j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}}\, WS(V_j)    (211)

[Margin note: w_{ij} denotes the connection between vertices V_i and V_j.]

where I’ve shown the modified part in green. The authors mention they set all weights to
random values in the interval 0-10 (no explanation).

Text as a Graph. In general, the application of graph-based ranking algorithms to natural


language texts consists of the following main steps:
(1) Identify text units that best define the task at hand, and add them as vertices in the
graph.
(2) Identify relations that connect such text units in order to draw edges between vertices in
the graph. Edges can be directed or undirected, weighted or unweighted.
(3) Iterate the algorithm until convergence.
(4) Sort [in reversed order] vertices based on final score. Use the values attached to each
vertex for ranking/selection decisions.

89
4.9.1 Keyword Extraction

Graph. The authors apply TextRank to extract words/phrases that are representative for a
given document. The individual graph components are defined as follows:
- Vertex: sequence of one or more lexical units from the text.
– In addition, we can restrict which vertices are added to the graph with syntactic filters.
– Best filter [for the authors]: nouns and adjectives only.
- Edge: two vertices are connected if their corresponding lexical units co-occur within a window of N words55. [Margin note: typically N is an integer in [2, 10].]

Procedure:
(1) Pre-Processing: Tokenize and annotate [with POS] the text.
(2) Build the Graph: Add all [single] words to the graph that pass the syntactic filter, and
connect [undirected/unweighted] edges as defined earlier (co-occurrence).
(3) Run algorithm: Initialize all scores to 1. For a convergence threshold of 0.0001, usually
takes about 20-30 iterations.
(4) Post-Processing:
(i) Keep the top T vertices (by score), where the authors chose T = |V |/3.56 Remember
that vertices are still individual words.
(ii) From the new subset of T keywords, collapse any that were adjacent in the original
text in a single lexical unit.

Evaluation. The data set used is a collection of 500 abstracts, each with a set of keywords.
Results are evaluated using precision, recall, and F-measure57 . The best results were
obtained with a co-occurrence window of 2 [on an undirected graph], which yielded:

Precision: 31.2% Recall: 43.1% F-measure: 36.2

The authors found that larger window size corresponded with lower precision, and that directed
graphs performed worse than undirected graphs.

55
That is . . . simpler than expected. Can we do better?
56
Another approach is to have T be a fixed value, where typically 5 < T < 20.
57
Brief terminology review:
• Precision: fraction of keywords extracted that are in the "true" set of keywords.
• Recall: fraction of "true" keywords that are in the extracted set of keywords.
• F-score: combining precision and recall to get a single number for evaluation: F = \frac{2pr}{p + r}
[Margin note: a PR curve plots precision as a function of recall.]

90
4.9.2 Sentence Extraction

Graph. Now we move to “sentence extraction for automatic summarization.”


• Vertex: a vertex is added to the graph for each sentence in the text.
• Edge: each weighted edge represents the similarity between two sentences. The authors
use the following similarity measure between two sentences Si and Sj :
\mathrm{Similarity}(S_i, S_j) = \frac{|S_i \cap S_j|}{\log(|S_i|) + \log(|S_j|)}    (212)
where the numerator is the number of words that occur in both Si and Sj .
The procedure is identical to the algorithm described for keyword extraction, except we run
it on full sentences.
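A quick sketch of that similarity measure (whitespace tokenization is my simplification; one-word sentences would need special handling because of the log in the denominator):

```python
import math

def sentence_similarity(s_i, s_j):
    """Eq. (212): word overlap normalized by the log sentence lengths."""
    w_i, w_j = s_i.lower().split(), s_j.lower().split()
    overlap = len(set(w_i) & set(w_j))
    return overlap / (math.log(len(w_i)) + math.log(len(w_j)))

print(sentence_similarity("the cat sat on the mat", "the dog sat on a log"))
```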

Evaluation. The data set used is 567 news articles. For each article, TextRank generates a
100-word summary (i.e. they set T = 100). They evaluate with the Rouge evaluation toolkit
(Ngram statistics).

91
Papers and Tutorials June 12, 2017

Simple Baseline for Sentence Embeddings


Table of Contents Local Written by Brandon McKinzie

Overview. It turns out that simply taking a weighted average of word vectors and doing some PCA/SVD is a competitive way of getting unsupervised sentence embeddings. Apparently it beats supervised learning with LSTMs (?!). The authors claim the theoretical explanation for this method lies in a latent variable generative model for sentences (of course).

[Margin note: discussion based on the paper by Arora et al. (2017).]

Algorithm.
1. Compute the weighted average of the word vectors in the sentence:

N The authors call their


1 X a weighted average the
wi (213) Smooth Inverse
N i a + p(wi ) Frequency (SIF).

where wi is the word vector for the ith word in the sentence, a is a parameter, and p(wi )
is the (estimated) word frequency [over the entire corpus].
2. Remove the projections of the average vectors on their first principal component (“com-
mon component removal”) (y tho?).
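A compact NumPy sketch of both steps; the word vectors, frequencies, and the parameter a are placeholders, and the common-component removal is done with the first right singular vector of the stacked sentence embeddings:

```python
import numpy as np

def sif_embeddings(sentences, vecs, p, a=1e-3):
    """sentences: list of token lists; vecs: word -> vector; p: word -> corpus frequency."""
    V = np.stack([
        np.mean([a / (a + p[w]) * vecs[w] for w in sent], axis=0)   # eq. (213)
        for sent in sentences
    ])
    # Common component removal: subtract each row's projection onto the first PC.
    _, _, Vt = np.linalg.svd(V, full_matrices=False)
    u = Vt[0]
    return V - np.outer(V @ u, u)

d = 5
vecs = {w: np.random.randn(d) for w in ["the", "cat", "dog", "sat", "ran"]}
p = {w: 0.1 for w in vecs}
emb = sif_embeddings([["the", "cat", "sat"], ["the", "dog", "ran"]], vecs, p)
print(emb.shape)   # (2, 5)
```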

92
Theory. Latent variable generative model. The model treats corpus generation as a dynamic process, where the t-th word is produced at time step t, driven by the random walk of a discourse vector c_t \in \mathbb{R}^d (d is the size of the embedding dimension). The discourse vector is not pointing to a specific word; rather, it describes what is being talked about. We can tell how related (correlated) the discourse is to any word w and corresponding vector v_w by taking the inner product c_t \cdot v_w. Similarly, we model the probability of observing word w at time t, w_t, as:

\Pr[w_t \mid c_t] \propto e^{c_t \cdot v_w}    (214)

• The Random Walk. If we assume that ct doesn’t change much over the words in
a single sentence, we can assume it stays at some cs . The authors claim that in their
previous paper they showed that the MAP58 estimate of cs is – up to multiplication by
a scalar – the average of the embeddings of the words in the sentence.
• Improvements/Modifications to 214.
1. Additive term αp(w) where α is a scalar. Allows words to occur even if ct · vw is
very small.
2. Common discourse vector c_0 \in \mathbb{R}^d. Correction term for the most frequent discourse, which is often related to syntax.
• Model. Given the discourse vector c_s for a sentence s, the probability that w is in the sentence (at all (?)):

\Pr[w \mid c_s] = \alpha p(w) + (1 - \alpha) \frac{e^{\tilde{c}_s \cdot v_w}}{Z_{\tilde{c}_s}}    (215)

\tilde{c}_s = \beta c_0 + (1 - \beta) c_s    (216)

with c_0 \perp c_s, and Z_{\tilde{c}_s} a normalization constant taken over all w \in V.

58
Review of MAP: \theta_{MAP} = \arg\max_{\theta} \sum_{i} \log\left(p_X(x_i \mid \theta)\, p(\theta)\right)

93
Papers and Tutorials July 03, 2017

Survey of Text Clustering Algorithms


Table of Contents Local Written by Brandon McKinzie

Aggarwal et al., “A Survey of Text Clustering Algorithms,” (2012).

Introduction. The unique characteristics for clustering text, as opposed to more traditional
(numeric) clustering, are (1) large dimensionality but highly sparse data, (2) words are typ-
ically highly correlated, meaning the number of principal components is much smaller than
the feature space, and (3) the number of words per document can vary, so we must normalize
appropriately.

Common types of clustering algorithms include agglomerative clustering algorithms, partition-


ing algorithms, and standard parametric modeling based methods such as the EM-algorithm.
Feature Selection.
• Document Frequency-Based. Using document frequency to filter out irrelevant fea-
tures. Dealing with certain words, like “the”, should probably be taken a step further
and simply removed (stop words).
• Term Strength. A more aggressive technique for stop-word removal. It’s used to
measure how informative a word/term t is for identifying two related documents, x and
y. Denoted s(t), it is defined as: [margin note: see ref 94 of the paper for more]

s(t) = \Pr[t \in y \mid t \in x]    (217)

So, how do we know x and y are related to begin with? One way is a user-defined cosine similarity threshold. Say we gather a set of such pairs and randomly identify one of the pair as the “first” document of the pair; then we can approximate s(t) as

s(t) = \frac{\text{Num pairs in which } t \text{ occurs in both}}{\text{Num pairs in which } t \text{ occurs in the first of the pair}}    (218)
In order to prune features, the term strength may be compared to the expected strength
of a term which is randomly distributed in the training documents with the same fre-
quency. If the term strength of t is not at least two standard deviations greater than
that of the random word, then it is removed from the collection.
• Entropy-Based Ranking. The quality of a term is measured by the entropy reduction
when it is removed [from the collection]. The entropy E(t) of term t in a collection of n
documents is:
E(t) = -\sum_{i=1}^{n} \sum_{j=1}^{n} \left( S_{ij} \cdot \log(S_{ij}) + (1 - S_{ij}) \cdot \log(1 - S_{ij}) \right)    (219)

S_{ij} = 2^{-d_{ij}/\bar{d}}    (220)
where

94
– Sij ∈ (0, 1) is the similarity between doc i and j.
– dij is the distance between i and j after the term t is removed
– d¯ is the average distance between the documents after the term t is removed.

LSI-based Methods. Latent Semantic Indexing is based on dimensionality reduction where the new (transformed) features are a linear combination of the originals. This helps magnify the semantic effects in the underlying data. LSI is quite similar to PCA59, except that we use an approximation of the covariance matrix C which is appropriate for the sparse and high-dimensional nature of text data. [Margin note: see ref. 28 of the paper for more on LSI.]

Let A ∈ Rn×d be term-document matrix, where Ai,j is the (normalized) frequency for term
j in document i. Then AT A = n · Σ is the (scaled) approximation to covariance matrix60 ,
assuming the data is mean-centered. Quick check/reminder:

(A^T A)_{ij} = A_{:,i}^T A_{:,j} \triangleq a_i^T a_j    (221)

\approx n \cdot \mathbb{E}[a_i a_j]    (222)

where the expectation is technically over the underlying data distribution, which gives e.g.
P (ai = x), the probability the ith word in our vocabulary having frequency x. Apparently,
since the data is sparse, we don’t have to worry much about it actually being mean-centered
(why?).

As usual, we use the eigenvectors of A^T A with the largest variance in order to represent the text61. In addition:

One excellent characteristic of LSI is that the truncation of the dimensions removes the
noise effects of synonymy and polysemy, and the similarity computations are more closely
affected by the semantic concepts in the data.
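A minimal sketch of LSI as a truncated SVD of a synthetic term-document matrix (the counts and number of components here are arbitrary; see the footnote below for typical sizes):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.poisson(0.3, size=(100, 500)).astype(float)   # n docs x d terms, sparse-ish counts

k = 10                                                # number of latent dimensions
U, s, Vt = np.linalg.svd(A, full_matrices=False)
docs_lsi = U[:, :k] * s[:k]                           # documents in the latent space
terms_lsi = Vt[:k].T                                  # terms in the latent space

# Cosine similarity between two documents in the reduced space:
a, b = docs_lsi[0], docs_lsi[1]
print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```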

59
The difference between LSI and PCA is that PCA subtracts out the means, which destroys the sparseness
of the design matrix.
60
Approximation because it is based on our training data, not on true expectations over the underlying
data-distribution.
61
In typical collections, only about 300 to 400 eigenvectors are required for the representation.

95
Non-negative Matrix Factorization. Another latent-space method (like LSI), but partic-
ularly suitable for clustering. The main characteristics of the NMF scheme:
• In LSI, the new basis system consists of a set of orthonormal vectors. This is not the
case for NMF.
• In NMF, the vectors in the basis system directly correspond to cluster topics. There-
fore, the cluster membership for a document may be determined by examining the largest
component of the document along any of the [basis] vectors.
Assume we want to create k clusters, using our n documents and vocabulary size d. The goal
of NMF is to find matrices U ∈ Rn×k and V ∈ Rd×k that minimize:

J = \frac{1}{2} \| A - UV^T \|_F^2    (223)

= \frac{1}{2} \left( tr(AA^T) - 2\, tr(AVU^T) + tr(UV^T VU^T) \right)    (224)

subject to u_{ij} \ge 0 and v_{ij} \ge 0.

Note that the columns of V provide the k basis vectors which correspond to the k different
clusters. We can interpret this as trying to factorize A ≈ UV T . For each row, a, of A
(document vector), this is

a \approx u \cdot V^T    (225)

= \sum_{i=1}^{k} u_i V_i^T    (226)

Therefore, the document vector a can be rewritten as an approximate linear (non-negative) combination of the basis vectors, which correspond to the k columns of V^T.

Lagrange-multiplier stuff: Our optimization problem can be solved using the Lagrange method.
• Variables to optimize: All elements of both U = [uij ] and V = [vij ]
• Constraint: non-negativity, i.e. ∀i, j, uij ≥ 0 and vij ≥ 0.
• Multipliers: Denote as matrices α and β, with same dimensions as U and V, respectively.
• Lagrangian: I’ll just show it here first, and then explain in this footnote62:

L = J + tr(\alpha \cdot U^T) + tr(\beta \cdot V^T)    (227)

\text{where} \quad tr(\alpha \cdot U^T) = \sum_{i=1}^{n} \alpha_i \cdot u_i = \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_{ij} u_{ij}    (228)

[Margin note: any matrix multiplication with a · is just a reminder to think of the matrices as column vectors.]

You should think of α as a column vector of length n, and U T as a row vector of length n.
The reason we prefer L over just J is because now we have an unconstrained optimization
problem.
62
Recall that in Lagrangian minimization, L takes the form of [the-function-to-be-minimized] + λ
([constraint-function] - [expected-value-of-constraint-at-optimum]). So the second term is expected to tend
toward zero (i.e. critical point) at the optimal values. In our case, since our optimal value is sort-of (?) at 0 for
any value of u_{ij} and/or v_{ij}, we just have a sum over [lagrange-mult] × [variable].

96
• Optimization: Set the partials of L w.r.t. both U and V (separately) to zero63:

\frac{\partial L}{\partial U} = -A \cdot V + U \cdot V^T \cdot V + \alpha = 0    (229)

\frac{\partial L}{\partial V} = -A^T \cdot U + V \cdot U^T \cdot U + \beta = 0    (230)

Since, ultimately, these just say [some matrix] = 0, we can multiply both sides (element-wise) by a constant (x × 0 = 0). Using64 the Kuhn-Tucker conditions \alpha_{ij} \cdot u_{ij} = 0 and \beta_{ij} \cdot v_{ij} = 0, we get:

(A \cdot V)_{ij} \cdot u_{ij} - (U \cdot V^T \cdot V)_{ij} \cdot u_{ij} = 0    (231)

(A^T \cdot U)_{ij} \cdot v_{ij} - (V \cdot U^T \cdot U)_{ij} \cdot v_{ij} = 0    (232)

• Update rules:

u_{ij} = \frac{(A \cdot V)_{ij} \cdot u_{ij}}{(U \cdot V^T \cdot V)_{ij}}    (233)

v_{ij} = \frac{(A^T \cdot U)_{ij} \cdot v_{ij}}{(V \cdot U^T \cdot U)_{ij}}    (234)
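These multiplicative updates are easy to sanity-check in NumPy; the random data, small epsilon, and fixed iteration count are my own choices:

```python
import numpy as np

def nmf(A, k, n_iter=200, eps=1e-9):
    """Multiplicative updates for A ≈ U V^T, eqs. (233)-(234)."""
    n, d = A.shape
    rng = np.random.default_rng(0)
    U = rng.random((n, k))
    V = rng.random((d, k))
    for _ in range(n_iter):
        U *= (A @ V) / (U @ V.T @ V + eps)       # eq. (233), element-wise
        V *= (A.T @ U) / (V @ U.T @ U + eps)     # eq. (234), element-wise
    return U, V

A = np.abs(np.random.randn(20, 30))              # non-negative "term-document" matrix
U, V = nmf(A, k=4)
print(np.linalg.norm(A - U @ V.T, 'fro') / np.linalg.norm(A, 'fro'))
```

The relative reconstruction error should drop as n_iter grows, and the largest entry of each row of U can be read off as that document's cluster assignment.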

4.11.1 Distance-based Clustering Algorithms

One challenge in clustering short segments of text (e.g., tweets) is that exact keyword matching may not work well. One general strategy for solving this problem is to expand the text representation by exploiting related text documents, which is related to smoothing of a document language model in information retrieval. [Margin note: see ref. 66 in the paper for computing similarities of short text segments.]

Agglomerative and Hierarchical Clustering. “Agglomerative” refers to the process of


bottom-up clustering to build a tree – at the bottom are leaves (documents) and internal
nodes correspond to the merged groups of clusters. The different methods for merging groups
of documents for the different agglomerative methods are as follows:
• Single Linkage Clustering. Defines similarity between two groups (clusters) of doc-
uments as the largest similarity between any pair of documents from these two groups.
First, (1) compute all similarity pairs [between documents; ignore cluster labels], then
(2) sort in decreasing order, and (3) walk through the list in that order, merging clusters
if the pair belong to different clusters. One drawback is chaining: the resulting clusters
assume transitivity of similarity65 .

63
Recall that the Lagrangian consists entirely of traces (re: scalars). Derivatives of traces with respect to
matrices output the same dimension as that matrix, and derivatives are taken element-wise as always.
64
i.e. the equations that follow are not the KT conditions, they just use/exploit them. . .
65
Here, transitivity of similarity means if A is similar to B, and B is similar to C, then A is similar to C.
This is not guaranteed by any means for textual similarity, and so we can end up with A and Z in the same
cluster, even though they aren’t similar at all.

97
• Group-Average Linkage Clustering. Similarity between two clusters is the average
similarity over all unique pairwise combinations of documents from one cluster to the
other. One way to speed up this computation with an approximation is to just compute
the similarity between the mean vector of either cluster.
• Complete Linkage Clustering. Similarity between two clusters is the worst-case
similarity between any pair of documents.

Distance-Based Partitioning Algorithms.


• K-Medoid Clustering. Use a set of points from training data as anchors (medoids)
around which the clusters are built. Key idea is we are using an optimal set of repre-
sentative documents from the original corpus. The set of k reps is successively improved via randomized inter-changes. In each iteration, we replace a randomly picked rep in the current set of medoids with a randomly picked rep from the collection, if it improves the clustering objective function. This approach is applied until convergence is achieved. [Margin note: K-Medoid isn’t great for clustering text, especially short texts.]
• K-Means Clustering. Successively (1) assigning points to the nearest cluster centroid
and then (2) re-computing the centroid of each cluster. Repeat until convergence. Re-
quires typically few iterations (about 5 for many large data sets). Disadvantage: sensitive
to initial set of seeds (initial cluster centroids). One method for improving the initial set
of seeds is to use some supervision - e.g. initialize with k pre-defined topic vectors (see
ref. 4 in paper for more).

Hybrid Approach: Scatter-Gather Method. Use a hierarchical clustering algorithm on a sample of the corpus in order to find a robust initial set of seeds. This robust set of seeds is used in conjunction with a standard k-means clustering algorithm in order to determine good clusters. [Margin note: Scatter-Gather is discussed in detail in ref. 25 of the paper.] TODO: resume note-taking; page 19/52 of PDF.

4.11.2 Probabilistic Document Clustering and Topic Models

Overview. Primary assumptions in any topic modeling approach: [margin note: from pg. 31/52 of paper]

• The n documents in the corpus are assumed to each have a probability of belonging to one
of k topics. Denote the probability of document Di belonging to topic Tj as Pr [Tj | Di ].
This allows for soft cluster membership in terms of probabilities.
• Each topic is associated with a probability vector, which quantifies the probability of
the different terms in the lexicon for that topic. For example, consider a document that
belongs completely to topic Tj . We denote the probability of term tl occurring in that
document as Pr [tl | Tj ].
The two main methods for topic modeling are Probabilistic Latent Semantic Indexing
(PLSA) and Latent Dirichlet Allocation (LDA).

98
PLSA. We note that the two aforementioned probabilities, Pr [Tj | Di ] and Pr [tl | Tj ] allow
us to calculate Pr [tl | Di ]: the probability that term tl occurs in some document Di :
\Pr[t_l \mid D_i] = \sum_{j=1}^{k} \Pr[t_l \mid T_j] \cdot \Pr[T_j \mid D_i]    (235)

which should be interpreted as a weighted average66 . From here, we can generate a n × d


matrix of probabilities.

Recall that we also have our n × d term-document matrix X, where Xi,l gives the number
of times term l occurred in document Di . This allows us to do maximum likelihood! Our
negative log-likelihood, J can be derived as follows:
J = -\log(\Pr[X])    (236)

= -\log\left( \prod_{i,l} \Pr[t_l \mid D_i]^{X_{i,l}} \right)    (237)

= -\sum_{i,l} X_{i,l} \cdot \log(\Pr[t_l \mid D_i])    (238)

[Margin note: interpret Pr[X] as the joint probability of observing the words in our data and with their assoc. frequencies.]

and we can plug-in eqn 235 to for evaluating Pr [tl | Di ]. We want to optimize the value of J,
subject to the constraints:
X X
(∀Tj ) : Pr [tl | Tj ] = 1 (∀Di ) : Pr [Tj | Di ] = 1 (239)
l j

This can be solved with a Lagrangian method, similar to the process for NMF described earlier.
See page 33/52 of the paper for details.
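A minimal numpy sketch (mine, not the paper's) of evaluating eqns. 235-238 given candidate parameter matrices; the names P_t_given_T (k × d) and P_T_given_D (n × k) are assumptions for illustration:

import numpy as np

def plsa_negative_log_likelihood(X, P_t_given_T, P_T_given_D, eps=1e-12):
    """X: (n, d) term counts; P_t_given_T: (k, d), rows sum to 1; P_T_given_D: (n, k), rows sum to 1."""
    # Eqn. 235: Pr[t_l | D_i] = sum_j Pr[t_l | T_j] Pr[T_j | D_i], as an (n, d) matrix.
    P_t_given_D = P_T_given_D @ P_t_given_T
    # Eqn. 238: J = - sum_{i,l} X_{i,l} log Pr[t_l | D_i].
    return -np.sum(X * np.log(P_t_given_D + eps))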

Latent Dirichlet Allocation (LDA). The term-topic probabilities and topic-document prob-
abilities are modeled with a Dirichlet distribution as a prior67 . Typically preferred over PLSI
because PLSI is more prone to overfitting.

66
This is actually pretty bad notation, and borderline incorrect. Pr [Tj | Di ] is NOT a conditional probability!
It is our prior! It is literally Pr [ClusterOf(Di ) = Tj ].
67
LDA is the Bayesian version of PLSI

4.11.3 Online Clustering with Text Streams

Reference List: [3]: A Framework for Clustering Massive Text and Categorical Data Streams; [112]: Efficient Streaming Text Clustering;
[48]: Bursty feature representation for clustering text streams; [61]: Clustering Text Data Streams (Liu et al.)

Overview. Maintaining text clusters in real time. One method is the Online Spherical
K-Means Algorithm (OSKM)68. (Note: see ref. 112 for more on OSKM.)

Condensed Droplets Algorithm. I’m calling it that because they don’t call it anything –
it is the algorithm in [3].
• Fading function: f(t) = 2^{−λ·t}. A time-dependent weight for each data point (text
  stream). Monotonically decreasing; decays uniformly with time.
• Decay rate: λ = 1/t_0. Inverse of the half-life of the data stream.
When a cluster is created by a new point, it is allowed to remain as a trend-setting outlier
for at least one half-life. During that period, if at least one more data point arrives, then the
cluster becomes an active and mature cluster. If not, the trend-setting outlier is recognized as
a true anomaly and is removed from the list of current clusters (cluster death). Specifically, this
happens when the (weighted) number of points in the [single-point] cluster is 0.5. The same
criterion is used to define the death of mature clusters. The statistics of the data points are
referred to as condensed droplets, which represent the word distributions within a cluster,
and can be used in order to compute the similarity of an incoming data point to the cluster.
Main idea of algorithm is as follows:
1. Initialize empty set of clusters C = {}. As new data points arrive, unit clusters containing
individual data points are created. Once a maximum number k of such clusters have been
created, we can begin the process of online cluster maintenance.
2. For a new data point X, compute its similarity to each cluster Cj , denoted as S(X, Cj ).
- If S(X, Cbest ) > threshoutlier , or if there are no inactive clusters left69 , insert X to the
cluster with maximum similarity.
- Otherwise, a new cluster is created70 containing the solitary data point X.

68
Authors only provide a very brief description, which I’ll just copy here:
This technique divides up the incoming stream into small segments, each of which can be processed effectively in main memory.
A set of k-means iterations are applied to each such data segment in order to cluster them. The advantage of using a segment-
wise approach for clustering is that since each segment can be held in main memory, we can process each data point multiple
times as long as it is held in main memory. In addition, the centroids from the previous segment are used in the next iteration
for clustering purposes. A decay factor is introduced in order to age-out the old documents, so that the new documents are
considered more important from a clustering perspective.

69
We specify some max allowed number of clusters k.
70
The new cluster replaces the least recently updated inactive cluster.

Misc.
• Mandatory read: reference [61]. Details phrase extraction/topic signatures. The use
  of phrases instead of individual words is referred to as semantic smoothing.
• For dynamic (and more recent) topic modeling, see reference [107] of the paper, titled “A
probabilistic model for online document clustering with application to novelty detection.”

Semi-Supervised Clustering. Useful when we have any prior knowledge about the kinds of
clusters available in the underlying data. Some approaches:
• Incorporate this knowledge when seeding the cluster centroids for k-means clustering.
• Iterative EM approach: unlabeled documents are assigned labels using a naive Bayes
approach on the currently labeled documents. These newly labeled documents are then
again used for re-training a Bayes classifier. Iterate to convergence.
• Graph-based approach: graph nodes are documents and the edges are similarities between
the connected documents (nodes). We can incorporate prior knowledge by adding certain
edges between nodes that we know are similar. A normalized cut algorithm is then applied
to this graph in order to create the final clustering.
We can also use partially supervised methods in conjunction with pre-existing categorical
hierarchies.

Papers and Tutorials July 10, 2017

Deep Sentence Embedding Using LSTMs


Table of Contents Local Written by Brandon McKinzie

Palangi et al., “Deep Sentence Embeddings Using Long Short-Term Memory Networks: Analysis and Application
to Information Retrieval,” (2016).

Abstract. Sentence embeddings using LSTM cells, which automatically attenuate unimpor-
tant words and detect salient keywords. Main emphasis on applications for document retrieval
(matching a query to a document71 ).

Introduction. Sentence embeddings are learned using a loss function defined on sentence
pairs. For example, the well-known Paragraph Vector72 is learned in an unsupervised manner
as a distributed representation of sentences and documents, which are then used for sentiment
analysis.

The authors appear to use a dataset of their own containing examples of (search-query, clicked-
title) for a search engine. Their training objective is to maximize the similarity between the
two vectors mapped by the LSTM-RNN from the query and the clicked document, respectively.
One very interesting claim to pay close attention to:
We further show that different cells in the learned model indeed correspond to different
topics, and the keywords associated with a similar topic activate the same cell unit in the
model.

Related Work. (Identified by reference number)


• [2] Good for sentiment, but doesn’t capture fine-grained sentence structure.
• [6] Unsupervised embedding method trained on the BookCorpus [7]. Not good for
document retrieval task.
• [9] Semi-supervised Recursive Autoencoder (RAE) for sentiment prediction.
• [3] DSSM (uses bag-of-words) and [10] CLSM (uses bag of n-grams) models for IR and
also sentence embeddings.
• [12] Dynamic CNN for sentence embeddings. Good for sentiment prediction and ques-
tion type classification. In [13], a CNN is proposed for sentence matching73

71
Note that this is similar to topic extraction.
72
Q. V. Le and T. Mikolov, “Distributed representations of sentences and documents.”
73
Might want to look into this.

Basic RNN. The information flow (sequence of operations) is enumerated below.
1. Encode tth word [of the given sentence] in one-hot vector x(t).
2. Convert x(t) to a letter tri-gram vector l(t) using fixed hashing operator74 H:

l(t) = Hx(t) (240)

3. Compute the hidden state h(t), which is the sentence embedding for t = T , the length
of the sentence.

h(t) = tanh (Ul(t) + Wh(t − 1) + b) (241)

where U and W are the usual parameter matrices for the input/recurrent paths, respec-
tively.
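A minimal sketch (mine) of the letter tri-gram conversion that the hashing operator H implements conceptually (see footnote 74); the function name is illustrative:

def letter_trigrams(word):
    """'good' -> ['#go', 'goo', 'ood', 'od#'] after padding with boundary hashes."""
    padded = f"#{word}#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

print(letter_trigrams("good"))  # ['#go', 'goo', 'ood', 'od#']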

LSTM. With peephole connections that expose the internal cell state s to the sigmoid com-
putations. I’ll rewrite the standard LSTM equations from my textbook notes, but with the
modifications for peephole connections:
 
f_i^(t) = σ( b_i^f + ∑_j U_{i,j}^f x_j^(t) + ∑_j W_{i,j}^f h_j^(t−1) + ∑_j P_{i,j}^f s_j^(t−1) )        (242)

s_i^(t) = f_i^(t) s_i^(t−1) + g_i^(t) σ( b_i + ∑_j U_{i,j} x_j^(t) + ∑_j W_{i,j} h_j^(t−1) )        (243)

g_i^(t) = σ( b_i^g + ∑_j U_{i,j}^g x_j^(t) + ∑_j W_{i,j}^g h_j^(t−1) + ∑_j P_{i,j}^g s_j^(t−1) )        (244)

q_i^(t) = σ( b_i^o + ∑_j U_{i,j}^o x_j^(t) + ∑_j W_{i,j}^o h_j^(t−1) + ∑_j P_{i,j}^o s_j^(t) )        (245)

The final hidden state can then be computed via

h_i^(t) = tanh(s_i^(t)) q_i^(t)        (246)
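A minimal vectorized numpy sketch (mine) of one peephole-LSTM step following eqns. 242-246; the parameter names (Uf, Wf, Pf, ...) are assumptions, and the sigmoid on the candidate in eq. 243 is kept as written:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def peephole_lstm_step(x, h_prev, s_prev, p):
    """p is a dict of parameter matrices/vectors, e.g. p['Uf'], p['Wf'], p['Pf'], p['bf'], ..."""
    f = sigmoid(p['bf'] + p['Uf'] @ x + p['Wf'] @ h_prev + p['Pf'] @ s_prev)   # eq. 242
    g = sigmoid(p['bg'] + p['Ug'] @ x + p['Wg'] @ h_prev + p['Pg'] @ s_prev)   # eq. 244
    s = f * s_prev + g * sigmoid(p['b'] + p['U'] @ x + p['W'] @ h_prev)        # eq. 243
    q = sigmoid(p['bo'] + p['Uo'] @ x + p['Wo'] @ h_prev + p['Po'] @ s)        # eq. 245 (peeks at new state)
    h = np.tanh(s) * q                                                         # eq. 246
    return h, s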

74
Details aside, the hashing operator serves to lower the dimensionality of the inputs a bit. In particular we
use it to convert one-hot word vectors into their letter tri-grams. For example, the word “good” gets surrounded
by hashes, ‘#good#‘, and then hashed from the one-hot vector to vectorized tri-grams, “#go”, “goo”, “ood”,
“od#”.

Learning method. We want to maximize the likelihood of the clicked document given query,
which can be formulated as the following optimization problem:
L(Λ) = min_Λ { − log ∏_{r=1}^{N} Pr[D_r^+ | Q_r] } = min_Λ ∑_{r=1}^{N} l_r(Λ)        (247)

l_r(Λ) = log( 1 + ∑_{j=1}^{n} e^{−γ·Δ_{r,j}} )        (248)

where
• N is the number of (query, clicked-doc) pairs in the corpus, while n is the number of
negative samples used during training.
• Dr+ is the clicked document for rth query.

• ∆r,j = R(Qr , Dr+ ) − R(Qr , Dr,j ) (R is just cosine similarity)75 .
• Λ is all the parameter matrices (and biases) in the LSTM.
The authors then describe standard BPTT updates with momentum, which need not be de-
tailed here. See the “Algorithm 1” figure in the paper for extremely detailed pseudo-code of
the training procedure.
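A small numpy sketch (mine) of the per-query loss in eqns. 247-248, assuming the cosine similarities R(Q_r, ·) have already been computed; names are illustrative:

import numpy as np

def query_loss(sim_pos, sim_negs, gamma=10.0):
    """l_r = log(1 + sum_j exp(-gamma * (R(Q, D+) - R(Q, D_j)))). (eq. 248)"""
    deltas = sim_pos - np.asarray(sim_negs)          # Delta_{r,j}, each in [-2, 2]
    return np.log1p(np.sum(np.exp(-gamma * deltas)))

# Total objective (eq. 247): sum of query_loss over all (query, clicked-doc) pairs.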

75
Note that ∆r,j ∈ [−2, 2]. We use γ as a scaling factor so as to expand this range.

Papers and Tutorials July 10, 2017

Clustering Massive Text Streams


Table of Contents Local Written by Brandon McKinzie

Aggarwal et al., “A Framework for Clustering Massive Text and Categorical Data Streams,” (2006).

Overview. Authors present an online approach for clustering massive text and categorical
data streams with the use of a statistical summarization methodology. First, we will go over the
process of storing and maintaining the data structures necessary for the clustering algorithm.
Then, we will discuss the differences which arise from using different kinds of data, and the
empirical results.

Maintaining Cluster Statistics. The data stream consists of d-dimensional records, where
each dimension corresponds to the numeric frequency of a given word in the vector space
representation. Each data point is weighted by the fading function f (t), a non-monotonic
decreasing function which decays uniformly with time t. The authors define the half-life of a
data point (e.g. a tweet) as:
1
t0 s.t. f (t0 ) = f (0) (249)
2
and, similarly, the decay-rate as its inverse, λ = 1/t0 . Thus we have f (t) = 2−λ·t .

To achieve greater accuracy in the clustering process, we require a high level of granularity in
the underlying data structures. To do this, we will use a process in which condensed clusters
of data points are maintained, referred to as cluster droplets. We define them differently for
the case of text and categorical data, beginning with categorical:
• Categorical. A cluster droplet D(t, C) for a set of categorical data points C at time t is
defined as the tuple:
D(t, C) := (DF̄2, DF̄1, n, w(t), l)        (250)

where
  – Entry k of the vector DF̄2 is the (weighted) number of points in cluster C where
    the ith dimension had value x and the jth dimension had value y. In other words, all
    pairwise combinations of values in the categorical vector76. ∑_{i=1}^{d} ∑_{j≠i} v_i v_j entries total77.
  – Similarly, DF̄1 consists of the (weighted) counts that some dimension i took on the
    value x. ∑_{i=1}^{d} v_i entries total.
76
This is intentionally written hand-wavy because I’m really concerned with text streams and don’t want to
give this much space.
77
vi is the number of values the ith categorical dimension can take on.

– w(t) is the sum of the weights of the data points at time t.
– l is the time stamp of the last time a data point was added to the cluster.
• Text. Can be viewed as an example of a sparse numeric data set. A cluster droplet
D(t, C) for a set of text data points C at time t is defined as the tuple:
D(t, C) := (DF̄2, DF̄1, n, w(t), l)        (251)

where
  – DF̄2 contains 3 · wb · (wb − 1)/2 entries, where wb is the number of distinct words
    in the cluster C.
  – DF̄1 contains 2 · wb entries.
– n is the number of data points in the cluster C.

Cluster Droplet Maintenance.


1. We first start off with k trivial clusters (the first k data points that arrived).
2. When a new point X̄ arrives, the cosine similarity to each cluster's DF̄1 is computed.
3. X̄ is inserted into the cluster for which this is a maximum, so long as the associated
   S(X̄, DF̄1) > thresh, a predefined threshold. If not above the threshold and some
inactive cluster exists, a new cluster is created containing the solitary point X̄, which
replaces the inactive cluster. If not above threshold but no inactive clusters, then we just
insert it into the max similarity cluster anyway.
4. If X̄ was inserted (i.e. didn’t replace an inactive cluster), then we need to:
(a) Update the statistics to reflect the decay of the data points at the current moment
in time78 . This is done by multiplying entries in the droplet vectors by 2−λ·(t−l) .
(b) Add the statistics for each newly arriving data point to the cluster statistics.

78
In other words, the statistics for a cluster do not decay, until a new point is added to it.
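A rough sketch (mine, simplified to the first-order statistics DF̄1 only) of the decay-then-add update in step 4; the Droplet class and its fields are assumptions for illustration:

import numpy as np

class Droplet:
    def __init__(self, x, t):
        self.df1 = x.astype(float)   # weighted word counts (DF-bar-1, simplified)
        self.w = 1.0                 # sum of point weights
        self.n = 1                   # number of points
        self.l = t                   # time of last insertion

    def add_point(self, x, t, lam):
        decay = 2.0 ** (-lam * (t - self.l))   # statistics decay only when a point is added
        self.df1 *= decay
        self.w *= decay
        self.df1 += x
        self.w += 1.0
        self.n += 1
        self.l = t

def cosine(a, b, eps=1e-12):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)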

Papers and Tutorials July 12, 2017

Supervised Universal Sentence Representations (InferSent)


Table of Contents Local Written by Brandon McKinzie

Conneau et al., “Supervised Learning of Universal Sentence Representations from Natural Language Inference
Data,” Facebook AI Research (2017).

Overview. Authors claim universal sentence representations trained using the supervised
data of the Stanford Natural Language Inference (SNLI) dataset can consistently outperform
unsupervised methods like SkipThought on a wide range of transfer tasks. They emphasize
that training on NLI tasks in particular results in embeddings that perform well in transfer
tasks. Their best encoder is a Bi-LSTM architecture with max pooling, which they claim is
SOTA when trained on the SNLI data.

The Natural Language Inference Task. Also known as Recognizing Textual Entailment
(RTE). The SNLI data consists of sentence pairs labeled as one of entailment, contradiction,
or neutral. Below is a typical architecture for training on SNLI.

Note that the same sentence encoder is used for both u and v. To obtain a sentence vector
from a BiLSTM encoder, they experiment with (1) the average ht over all t (mean pooling),
and (2) selecting the max value over each dimension of the hidden units [over all timesteps]
(max pooling)79 .

79
Since the authors have already mentioned that BiLSTM did the best, I won’t go over the other architectures
they tried: self-attentive networks, hierarchical convnet, vanilla LSTM/GRU.

Papers and Tutorials July 13, 2017

Dist. Rep. of Sentences from Unlabeled Data (FastSent)


Table of Contents Local Written by Brandon McKinzie

Hill et al., “Learning Distributed Representations of Sentences from Unlabelled Data,” (2016).

Overview. A systematic comparison of models that learn distributed representations of
phrases/sentences from unlabeled data. Deeper, more complex models are preferable for
representations to be used in supervised systems, but shallow log-linear models work best for
building representation spaces that can be decoded with simple spatial distance metrics.

Authors propose two new phrase/sentence representation learning objectives:


1. Sequential Denoising Autoencoders (SDAEs)
2. FastSent: a sentence-level log-linear BOW model.

Distributed Sentence Representations. Existing models trained on text:


• SkipThought Vectors (Kiros et al., 2015). Predict target sentences Si±1 given source
sentence Si . Sequence-to-sequence model.
• ParagraphVector (Le and Mikolov, 2014). Defines 2 log-linear models:
    1. DBOW: learns a vector s for every sentence S in the training corpus which, together
       with word embeddings vw, defines a softmax distribution to predict words w ∈ S
       given S.
    2. DM: select k-grams of consecutive words {wi · · · wi+k ∈ S} and the sentence vector
       s to predict wi+k+1.
    (Note: the authors use gensim to implement ParagraphVector.)
• Bottom-Up Methods. Train CBOW and Skip-Gram word embeddings on the Books
corpus.

Models trained on structured (and freely-available) resources:


• DictRep (Hill et al., 2015a). Map dictionary definitions to pre-trained word embeddings,
using either BOW or RNN-LSTM encoding.
• NMT. Consider sentence representations learned by sequence-to-sequence NMT models.

Novel Text-Based Methods.
• Sequential (Denoising) Autoencoders. To avoid needing coherent inter-sentence
narrative, try this representation-learning objective based on DAEs. For a given sentence
S and noise function N(S | po, px) (where po, px ∈ [0, 1]), the approach is as follows:
  1. For each w ∈ S, N deletes w with probability po.
  2. For each non-overlapping bigram wi wi+1 ∈ S, N swaps wi and wi+1 with probability
     px. (Note: the authors recommend po = px = 0.1; a small sketch of N follows this list.)
We then train the same LSTM-based encoder-decoder architecture as NMT, but with the
denoising objective to predict (as target) the original source sentence S given a corrupted
version N(S | po, px) (as source).

• FastSent. Designed to be a more efficient/quicker to train version of SkipThought.
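A minimal sketch (mine) of the SDAE noise function N(S | po, px) described in the first bullet above; tokenized input is assumed:

import random

def sdae_noise(tokens, po=0.1, px=0.1, seed=None):
    """Delete each token w.p. po, then swap non-overlapping bigrams w.p. px."""
    rng = random.Random(seed)
    kept = [w for w in tokens if rng.random() >= po]
    out, i = [], 0
    while i < len(kept):
        if i + 1 < len(kept) and rng.random() < px:
            out.extend([kept[i + 1], kept[i]])   # swap this non-overlapping bigram
            i += 2
        else:
            out.append(kept[i])
            i += 1
    return out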

Papers and Tutorials July 22, 2017

Latent Dirichlet Allocation


Table of Contents Local Written by Brandon McKinzie

Blei et al., “Latent Dirichlet Allocation,” (2003).

Introduction. At minimum, one should be familiar with generative probabilistic models,


mixture models, and the notion of latent variables before continuing. The “Dirichlet” in
LDA of course refers to the Dirichlet distribution, which is a generalization of the beta
distribution, B. Its PDF is defined as80 81:

Dir(x; α) = (1 / B(α)) ∏_{i=1}^{K} x_i^{α_i − 1}    where    B(α) = ∏_{i=1}^{K} Γ(α_i) / Γ( ∑_{i=1}^{K} α_i )        (252)

subject to ∑_{i=1}^{K} x_i = 1 and (∀i ∈ [1, K]): x_i ≥ 0.


Main things to remember about LDA:
- Generative probabilistic model for collections of discrete data such as text corpora.
- Three-level hierarchical Bayesian model. Each document is a mixture of topics, each
topic is an infinite mixture over a set of topic probabilities.
Condensed comparisons/history of related models leading up to LDA:
- TF-IDF. Design matrix X ∈ RV ×M , where M is the number of docs, and Xi,j gives the
TF-IDF value for ith word in vocabulary and corresp. to document j.
- LSI:82 Performs SVD on the TF-IDF design matrix X to identify a linear subspace in the
space of tf-idf features that captures most of the variance in the collection.
- pLSI: TODO
pLSI is incomplete in that it provides no probabilistic model at the level of documents. In
pLSI, each document is represented as a list of numbers (the mixing proportions for topics),
and there is no generative probabilistic model for these numbers.

80
Recall that for positive integers n, Γ(n) = (n − 1)!.
81
The Dirichlet distribution is conjugate to the multinomial distribution. TODO: Review how to interpret
this.
82
Recall that LSI is basically PCA but without subtracting off the means

Model. LDA assumes the following generative process for each document (word sequence) w:
1. N ∼ Poisson(λ): Sample N, the number of words (length of w), from Poisson(λ) =
   e^{−λ} λ^N / N!. The parameter λ should represent the average number of words per document.
2. θ ∼ Dir(α): Sample k-dimensional vector θ from the Dirichlet distribution (eq. 252),
Dir(α). k is the number of topics (pre-defined by us). Recall that this means θ lies in the
(k-1) simplex. The Dirichlet distribution thus tells us the probability density of θ over
this simplex – it defines the probability of θ being at a given position on the simplex.
3. Do the following N times to generate the words for this document.
(a) zn ∼ Multinomial(θ). Sample a topic zn .
(b) wn ∼ Pr [wn | zn , β]: Sample a word wn from Pr [wn | zn , β], a “multinomial prob-
ability conditioned on topic zn .”83 The parameter β gives the distribution of words
given a topic:

βij = Pr [wj | zi ] (253)

In other words, we really sample wn ∼ βi,:
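A minimal numpy sketch (mine) of this generative process for a single document, assuming α (length k), β (a k × V matrix of per-topic word distributions), and a mean document length lam:

import numpy as np

def generate_document(alpha, beta, lam, rng=np.random.default_rng()):
    """Sample one document (list of word indices) from the LDA generative process."""
    N = rng.poisson(lam)                          # step 1: document length
    theta = rng.dirichlet(alpha)                  # step 2: topic proportions for this document
    words = []
    for _ in range(N):                            # step 3
        z = rng.choice(len(alpha), p=theta)       # 3(a): sample topic z_n ~ Multinomial(theta)
        w = rng.choice(beta.shape[1], p=beta[z])  # 3(b): sample word w_n ~ beta[z, :]
        words.append(w)
    return words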


The defining equations for LDA are thus:
Pr[θ, z, w | α, β] = Pr[θ | α] ∏_{n=1}^{N} Pr[z_n | θ] Pr[w_n | z_n, β]        (254)

Pr[w | α, β] = ∫ Pr[θ′ | α] ( ∏_{n=1}^{N} ∑_{z_n′} Pr[z_n′ | θ′] Pr[w_n | z_n′, β] ) dθ′        (255)

Pr[D = {w^(1), . . . , w^(M)} | α, β] = ∏_{d=1}^{M} Pr[w^(d) | α, β]        (256)

Below is the plate notation for LDA, followed by an interpretation:

• Outermost Variables: α and β. Both represent a (Dirichlet) prior distribution: α


parameterizes the probability of a given topic, while β a given word.
• Document Plate. M is the number of documents, θm gives the true distribution of topics
for document m84 .

83
TODO: interpret meaning of the multinomial distributions here. Seems a bit different than standard
interp...
84
In other words, the meaning of θm,i = x is “x percent of document m is about topic i.”

• Topic/Word Plate. zmn is the topic for word n in doc m, and wmn is the word. It is
shaded gray to indicate it is the only observed variable, while all others are latent
variables.

Theory. I’ll quickly summarize and interpret the main theoretical points. Without having
read all the details, this won’t be of much use (i.e. it is for someone who has read the paper
already).
• LDA and Exchangeability. We assume that each document is a bag of words (order
doesn't matter; frequency still does) and a bag of topics. In other words, a document of
N words is an unordered list of words and topics. De Finetti’s theorem tells us that we
can model the joint probability of the words and topics as if a random parameter θ were
drawn from some distribution and then the variables within w, z were conditionally
independent given θ. LDA posits that a good distribution to sample θ from is a
Dirichlet distribution.
• Geometric Interpretation: TODO

Inference and Parameter Estimation. As usual, we need to find a way to compute the
posterior distribution of the hidden variables given a document w:

Pr[θ, z | w, α, β] = Pr[θ, z, w | α, β] / Pr[w | α, β]        (257)

Computing the denominator exactly is intractable. Common approximate inference algorithms


for LDA include Laplace approximation, variational approximation, and Markov Chain Monte
Carlo.

Papers and Tutorials July 30, 2017

Conditional Random Fields


Table of Contents Local Written by Brandon McKinzie

Lafferty et al., “Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data,”
(2001).

Introduction. CRFs offer improvements to HMMs, MEMMs, and other discriminative Markov
models. MEMMs and other non-generative models share a weakness called the label bias
problem: the transitions leaving a given state compete only against each other, rather than
against all other transitions in the model. The key difference between CRFs and MEMMs is
that a CRF has a single exponential model for the joint probability of the entire sequence of labels
given the observation sequence.

The Label Bias Problem. Recall that MEMMs are run left-to-right. One way of interpreting
such a model is to consider how the probabilities (of state sequences) are distributed as we
continue through the sequence of observations. The issue with MEMMs is that there’s nothing
we can do if, somewhere along the way, we observe something that makes one of these state
paths extremely likely/unlikely; we can’t redistribute the probability mass amongst the various
allowed paths. The CRF solution:
Account for whole state sequences at once by letting some transitions “vote” more strongly
than others depending on the corresponding observations. This implies that score mass will
not be conserved, but instead individual transitions can “amplify” or “dampen” the mass
they receive.

Conditional Random Fields. Here we formalize the model and notation. Let X be a
random variable over data sequences to be labeled (e.g. over all words/sentences), and let Y
the random variable over corresponding label sequences85 . Formal definition:
Let G = (V, E) be a graph such that Y = (Yv )v∈V , so that Y is indexed by the vertices of
G. Then (X, Y ) is a CRF if, when conditioned on X, the random variables Yv obey the
Markov property with respect to the graph:

Pr[Yv | X, Yw, w ≠ v] = Pr[Yv | X, Yw, w ∼ v]        (258)

where w ∼ v means that w and v are neighbors in G.

All this means is a CRF is a random field (discrete set of random-valued points in a space)
where all points (i.e. globally) are conditioned on X. If the graph G = (V, E) of Y is a tree,
its cliques86 are the edges and vertices. Take note that X is not a member of the vertices
85
We assume all components Yi can only take on values in some finite label set Y.
86
A clique is a subset of vertices in an undirected graph such that every two distinct vertices in the clique
are adjacent

in G. G only contains vertices corresponding to elements of Y. Accordingly, when the au-
thors refer to cases where G is a “chain”, remember that they just mean the Y vertex sequence.

By the fundamental theorem of random fields:


 
p_θ(y | x) ∝ exp( ∑_{e∈E,k} λ_k f_k(e, y|_e, x) + ∑_{v∈V,k} μ_k g_k(v, y|_v, x) )        (259)

where y|S is the set of components of y associated with the vertices in subgraph S. We
assume the K feature [functions] fk and gk are given and fixed. Note that fk are the feature
functions over transitions yt−1 to yt , and gk are the feature functions over states yt and xt .
Our estimation problem is thus to determine parameters θ = (λ1 , λ2 , . . . ; µ1 , µ2 , . . .) from the
labeled training data.

Linear-Chain CRF. Let |Y| denote the number of possible labels. At each position t in the
observation sequence x, we define the |Y| × |Y| matrix random variable Mt (x)

M_t(y′, y | x) = exp(Λ_t(y′, y | x))        (260)

Λ_t(y′, y | x) = ∑_k λ_k f_k(y′, y, x) + ∑_k μ_k g_k(y, x)        (261)

where y_{t−1} := y′ and y_t := y. We can see that the individual elements correspond to specific
values of e and v in the double-summations of p_θ(y | x) above. Then the normalization
(partition function) Z_θ(x) is the (y_0, y_{T+1}) entry (the fixed boundary states) of the product:

Z_θ(x) = [ ∏_{t=1}^{T+1} M_t(x) ]_{y_0, y_{T+1}}        (262)

which includes all possible sequences y that start with the fixed y_0 and end with the fixed
y_{T+1}. Now we can write the conditional probability as a function of just these matrices:

p_θ(y | x) = ∏_{t=1}^{T+1} M_t(y_{t−1}, y_t | x) / [ ∏_{t=1}^{T+1} M_t(x) ]_{y_0, y_{T+1}}        (263)

Parameter Estimation (for linear-chain CRFs). For each t in [0, T + 1], define the forward
vectors αt (x) with base case α0 (y | x) = 1 if y = y0 , else 0. Similarly, define the backward
vectors βt (x) with base case βT +1 (y | x) = 1 if y = yT +1 else 087 . Their recurrence relations
are

α_t(x) = α_{t−1}(x) M_t(x)        (264)

β_t(x)^T = M_{t+1}(x) β_{t+1}(x)        (265)
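A minimal numpy sketch (mine) of computing Z_θ(x) for a linear-chain CRF from the matrices M_t (eq. 262) using the forward recursion of eq. 264; Ms is assumed to be a list of (|Y|, |Y|) matrices including the boundary positions:

import numpy as np

def partition_function(Ms, y0=0, yT1=0):
    """Z_theta(x) = [prod_t M_t(x)]_{y0, yT+1}, computed with the forward recursion."""
    num_labels = Ms[0].shape[0]
    alpha = np.zeros(num_labels)
    alpha[y0] = 1.0                 # base case: alpha_0(y) = 1 iff y = y0
    for M in Ms:                    # alpha_t = alpha_{t-1} M_t
        alpha = alpha @ M
    return alpha[yT1]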

87
Remember that y0 and yT +1 are their own fixed symbolic constants representing a fixed start/stop state.

Papers and Tutorials September 04, 2017

Attention Is All You Need


Table of Contents Local Written by Brandon McKinzie

Vaswani et al., “Attention Is All You Need,” (2017)

Overview. Authors refer to sequence transduction models a lot – just a fancy way of refer-
ring to models that transform input sequences into output sequences. Authors propose new
architecture, the Transformer, based solely on attention mechanisms (no recurrence!).

Model Architecture.
• Encoder. N =6 identical layers, each with 2 sublayers: (1) a multi-head self-attention
mechanism and (2) a position-wise FC feed-forward network. They apply a residual
connection and layer norm such that each sublayer, instead of outputting Sublayer(x),
instead outputs LayerNorm(x + Sublayer(x)).
• Decoder. N =6 with 3 sublayers each. In addition to the two sublayers described for the
encoder, the decoder has a third sublayer, which performs multi-head attention over
the output of the encoder stack. Same residual connections and layer norm.

(Figure: shows the encoder-decoder template layers. The actual model instantiates a chain of
6 encoder layers and 6 decoder layers. The decoder's self-attention masks embeddings at
future timesteps to zero.)

Attention. An attention function can be described as a mapping:
Attn(query, {(k_1, v_1), . . .}) ⇒ ∑_i fn(query, k_i) v_i        (266)

where the query, keys, values, and output are all vectors.
• Scaled Dot-Product Attention.
1. Inputs: queries q, keys k of dimension d_k, values v of dimension d_v. (Note: it appears that d_q ≡ d_k.)
2. Dot Products: Compute ∀k: (q · k)/√d_k.
3. Softmax: on each dot product above. This gives the weights on the values shown
   earlier.
In practice, this is done simultaneously for all queries in a set via the following matrix
equation:

Attention(Q, K, V) = softmax( QK^T / √d_k ) V        (267)
Note that this is identical to the standard dot-product attention mechanism, except for
the scaling factor (hence the name) of 1/√d_k. The scaling factor is motivated by the fact
that additive attention outperforms dot-product attention for large dk and the authors
stipulate this is due to the softmax having small gradients in this case, due to the large
dot products88 .

First, let’s explicitly show which indices are being normalized over, since it can get
confusing when presented with the highly vectorized version above. For a given input
sequence of length T, and for the self-attention version where K = Q = V ∈ R^{T×d_k}, the
output attention vector for timestep t is explicitly (ignoring the √d_k for simplicity)

Attention(Q, K, V)_t = [ softmax(QK^T) V ]_t        (271)

                     = ∑_{t′=1}^{T} ( e^{Q_t · K_{t′}} / ∑_{t″=1}^{T} e^{Q_t · K_{t″}} ) V_{t′}        (272)

88
Assume that q and k are vectors in Rd whose components are independent RVs with E [qi ] = E [kj ] = 0
(∀i, j), and Var [qi ] = Var [kj ] = 1 (∀i, j). Then
" d
# d d
X X X
E [q • k] = E q i ki = E [qi ki ] = E [qi ] E [ki ] = 0 (268)
i i i
" d # d d
X X X
E qi2 ki2 − 
 
Var [q • k] = Var qi ki = Var [qi ki ] = i k
E [q i] (269)
i i i
d d
X  2  2 X
= E qi E ki = Var [qi ] Var [ki ] = d (270)
i i

See this S.O answer and/or these useful formulas for more details.

Next, the gradient of the dth softmax output w.r.t. its inputs is

∂Softmax_d(x) / ∂x_j = Softmax_d(x) ( δ_{dj} − Softmax_j(x) )        (273)
• Multi-Head Attention. Basically just doing some number h of parallel attention
computations. Before each of these, the queries, keys, and values are linearly projected
with different, learned linear projections to dk , dk and dv dimensions respectively (and
then fed to their respective attention function). The h outputs are then concatenated
and once again projected, resulting in the final values.
MultiHead(Q, K, V) = Concat(head_1, . . . , head_h) W^O        (274)
where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)        (275)

with W_i^Q ∈ R^{d_model×d_k}, W_i^K ∈ R^{d_model×d_k}, W_i^V ∈ R^{d_model×d_v}, and W^O ∈ R^{h·d_v×d_model}.
The authors employ h = 8, d_k = d_v = d_model/h = 64.
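A minimal numpy sketch (mine) of eqns. 267 and 274-275; the per-head projection matrices are packed into lists for brevity and all names are illustrative:

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention (eq. 267): softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head(Q, K, V, WQ, WK, WV, WO):
    """Eqns. 274-275: project, attend per head, concatenate, project again."""
    heads = [attention(Q @ wq, K @ wk, V @ wv) for wq, wk, wv in zip(WQ, WK, WV)]
    return np.concatenate(heads, axis=-1) @ WO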

The Transformer uses multi-headed attention in 3 ways:


1. Encoder-decoder attention: the normal kind. Queries are previous decoder layer,
and memory keys and values come from output of the [final layer of] encoder.
2. Encoder self-attention: all of the keys, values, and queries come from the previous
layer in the encoder. Each position in the encoder can attend to all positions in the
previous layer of the encoder.
3. Decoder self-attention: Similarly, self-attention layers in the decoder allow each po-
sition in the decoder to attend to all positions in the decoder up to and including that
position (timestep). The masking is done on the inputs to the softmax, setting all inputs
beyond the current timestep to −∞.

Other Components.
• Position-wise Feed-Forward Networks (FFN): each layer of the encoder and de-
coder contains a FC FFN, applied to each position separately and identically:
  FFN(x) = max(0, xW_1 + b_1) W_2 + b_2        (276)

  (Note: the FFN is linear → ReLU → linear.)
• Embeddings and Softmax: use learned embeddings to convert input/output tokens to
  vectors of dimension d_model, and for the pre-softmax layer at the output of the decoder89.
  (Note: for inputs to the encoder/decoder, the embedding weights are multiplied by √d_model.)
• Positional Encoding: how the authors deal with the lack of recurrence (to make use
  of the sequence order). They add a sinusoid function of the position (timestep) pos and
  vector index i to the input embeddings for the encoder and decoder90:

  PE(pos, 2i) = sin( pos / 10000^{2i/d_model} )        (277)
  PE(pos, 2i + 1) = cos( pos / 10000^{2i/d_model} )        (278)
The authors justify this choice:
89
In other words, they use the same weight matrix for all three of (1) encoder input embedding, (2) decoder
input embedding, and (3) (opposite direction) from decoder output to pre-softmax.
90
Note that the positional encodings must necessarily be of dimension dmodel to be summed with the input
embeddings.

We chose this function because we hypothesized it would allow the model to easily learn
to attend by relative positions, since for any fixed offset k, PEpos+k can be represented
as a linear function of PEpos .
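A minimal numpy sketch (mine) of eqns. 277-278, building the full (max_len × d_model) positional-encoding matrix:

import numpy as np

def positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...). Assumes even d_model."""
    pos = np.arange(max_len)[:, None]                    # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                 # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)    # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe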

Summary of Add-ons. Below is a list of all the little bells and whistles they add to the
main components of the model that are easy to miss since they mention them throughout the
paper in a rather unorganized fashion.
• Shared weights for encoder inputs, decoder inputs, and final softmax projection outputs.
• Multiply the encoder and decoder input embedding [shared] weights by √d_model. TODO:
  why? Also this must be highly correlated with their decision regarding weight initializa-
  tion (mean/stddev/technique). Add whatever they use here if they mention it.
• Adam optimizer with β1 = 0.9, β2 = 0.98, ε = 10^{−9}.
• Learning rate schedule LR(s) = d_model^{−0.5} · min(s^{−0.5}, s · w^{−1.5}) for global step s and warmup
  steps w = 4000.
• Dropout on sublayer outputs pre-layernorm-and-residual. Specifically, they actually re-
turn LayerNorm(x + Dropout(Sublayer(x))). Use Pdrop = 0.1.
• Dropout the summed embeddings+positional-encodings for both encoder and decoder
stacks.
• Dropout on softmax outputs. So do Dropout(Softmax(QK))V.
• Label smoothing with ls = 0.1.

Papers and Tutorials September 06, 2017

Hierarchical Attention Networks


Table of Contents Local Written by Brandon McKinzie

Yang et al., “Hierarchical Attention Networks for Document Classification.”

Overview. Authors introduce the Hierarchical Attention Network (HAN) that is designed
to capture insights regarding (1) the hierarchical structure of documents (words -> sentences
-> documents), and (2) the context dependence between words and sentences. The latter is
implemented by including two levels of attention mechanisms, one at the word level and one
at the sentence level.

Hierarchical Attention Networks. Below is an illustration of the network. The first stage
is familiar to sequence to sequence models - a bidirectional encoder for outputting sentence-
level representations of a sequence of words. The HAN goes a step further by feeding this
another bidirectional encoder for outputting document-level representations for sequences of
sentences.

The authors choose the GRU as their underlying RNN. For ease of reference, the defining
equations of the GRU are shown below:
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t        (279)
z_t = σ(W_z x_t + U_z h_{t−1} + b_z)        (280)
h̃_t = tanh(W_h x_t + r_t ⊙ (U_h h_{t−1}) + b_h)        (281)
r_t = σ(W_r x_t + U_r h_{t−1} + b_r)        (282)

Hierarchical Attention. Here I’ll overview the main stages of information flow.
1. Word Encoder. Let the tth word in the ith sentence be denoted wit . They embed the
   vectors with a word embedding matrix We, xit = We wit, and then feed xit through a
   bidirectional GRU to ultimately obtain hit := [→h_it ; ←h_it].
2. Word Attention. Extracts words that are important to the meaning of the sentence
and aggregates the representation of these informative words to form a sentence vector.
   u_it = tanh(W_w h_it + b_w)        (283)
   α_it = exp(u_it^T u_w) / ∑_t exp(u_it^T u_w)        (284)
   s_i = ∑_t α_it h_it        (285)

   Note the context vector u_w, which is shared for all words91 and randomly initialized and
   jointly learned during the training process. (A small sketch of this attention pooling follows
   at the end of this subsection.)
3. Sentence Encoder. Similar to the word encoder, but uses the sentence vectors si as
the input for the ith sentence in the document. Note that the output of this encoder, hi
contains information from the neighboring sentences too (bidirectional) but focuses on
sentence i.
4. Sentence Attention. For rewarding sentences that are clues to correctly classify a
document. Similar to before, we now use a sentence level context vector us to measure
the importance of the sentences.
   u_i = tanh(W_s h_i + b_s)        (286)
   α_i = exp(u_i^T u_s) / ∑_i exp(u_i^T u_s)        (287)
   v = ∑_i α_i h_i        (288)

where v is the document vector that summarizes all the information of sentences in a
document.
As usual, we convert v to a normalized probability vector by feeding through a softmax:

p = softmax(Wc v + bc ) (289)

91
To emphasize, there is only a single context vector uw in the network, period. The subscript just tells us
that it is the word-level context vector, to distinguish it from the sentence-level context vector in the later stage.
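A minimal numpy sketch (mine) of the attention pooling used at both levels (eqns. 283-285 and 286-288); H is the matrix of encoder outputs and the parameter names are illustrative:

import numpy as np

def attention_pool(H, W, b, context):
    """H: (T, 2d) encoder outputs; context: (2d,) learned context vector (u_w or u_s)."""
    u = np.tanh(H @ W + b)                    # eq. 283 / 286
    scores = u @ context
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()               # eq. 284 / 287
    return alpha @ H                          # eq. 285 / 288: weighted sum of the rows of H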

Configuration and Training. Quick overview of some parameters chosen by the authors:
• Tokenization: Stanford CoreNLP. Vocabulary consists of words occurring more than 5
times, all others are replaced with UNK token.
• Word Embeddings: train word2vec on the training and validation splits. Dimension
of 200.
• GRU. Dimension of 50 (so 100 because bidirectional).
• Context vectors. Both uw and us have dimension of 100.
• Training: batch size of 64, grouping documents of similar length into a batch. SGD
with momentum of 0.9.

Papers and Tutorials Oct 31, 2017

Joint Event Extraction via RNNs


Table of Contents Local Written by Brandon McKinzie

Nguyen, Cho, and Grishman, “Joint Event Extraction via Recurrent Neural Networks,” (2016).

Event Extraction Task. Automatic Context Extraction (ACE) evaluation. Terminology:


• Event: something that happens or leads to some change of state.
• Mention: phrase or sentence in which an event occurs, including one trigger and an
arbitrary number of arguments.
• Trigger: main word that most clearly expresses an event occurrence.
• Argument: an entity mention, temporal expression, or value that serves as a partici-
pant/attribute with a specific role in an event mention.
Example:

  In Baghdad, a cameraman died{Die} when an American tank fired{Attack} on the Palestine hotel.

(Braces mark the triggers and their event subtypes; the highlighted entity mentions in the
original figure are the arguments.)

Each event subtype has its own set of roles to be filled by the event arguments. For example,
the roles for the Die event subtype include Place, Victim, and Time.

Model.
- Sentence Encoding. Let wi denote the ith token in a sentence. It is transformed into a
real-valued vector xi , defined as

xi := [GloVe(wi ); Embed(EntityType(wi )); DepVec(wi )] (290)

where “Embed” is an embedding we learn, and “DepVec” is the binary vector whose dimen-
sions correspond to the possible relations between words in the dependency trees.
- RNN. Bidirectional LSTM on the inputs xi .
- Prediction. Binary memory vector G_i^{trg} for triggers; binary memory matrices G_i^{arg} and
  G_i^{arg/trg} for arguments (at each timestep i). At each time step i, do the following in order:
1. Predict trigger ti for wi . First compute the feature representation vector Ritrig , defined
as:
   R_i^{trg} := [ h_i ; L_i^{trg} ; G_{i−1}^{trg} ]        (291)

   where h_i is the RNN output, L_i^{trg} is the local context vector for w_i, and G_{i−1}^{trg} is the
   memory vector from the previous step. L_i^{trg} := [GloVe(w_{i−d}); . . . ; GloVe(w_{i+d})] for

some predefined window size d. This is then fed to a fully-connected layer with softmax
activation, F trg , to compute the probability over possible trigger subtypes:
   P_{i;t}^{trg} := F_t^{trg}(R_i^{trg})        (292)

   As usual, the predicted trigger type for w_i is computed as t_i = arg max_t P_{i;t}^{trg}. If w_i
is not a trigger, ti should predict “Other.”
2. Argument role predictions, ai1 , . . . , aik , for all of the [already known] entity mentions
in the sentence, e1 , . . . , ek with respect to wi . aij denotes the argument role of ej with
respect to [the predicted trigger of] wi . If NOT(wi is trigger AND ej is one of its
arguments), then aij is set to Other. For example, if wi was the word “died” from our
example sentence, we’d hope that its predicted trigger would be ti = Die, and that the
entity associated with “cameraman” would get a predicted argument role of
Victim.
   def getArgumentRoles(triggerType=t, entities=e):
       k = len(entities)
       if isOther(triggerType):
           return [Other] * k
       else:
           for e_j in entities:
               # compute R_ij^arg (eq. 293 below) and predict the role of e_j w.r.t. w_i
               ...

   where, for entity e_j at step i,

   R_ij^{arg} := [ h_i ; h_{ij} ; L_ij^{arg} ; B_ij ; G_{i−1}^{arg}[j] ; G_{i−1}^{arg/trg}[j] ]        (293)

3. Update memory. TO BE CONTINUED...(moving onto another paper because this


model is getting a bit too contrived for my tastes. Also not a fan of the reliance on a
dependency parse.)

Papers and Tutorials Oct 31, 2017

Event Extraction via Bidi-LSTM Tensor NNs


Table of Contents Local Written by Brandon McKinzie

Y. Chen, S. Liu, S. He, K. Liu, and J. Zhao, “Event Extraction via Bidirectional Long Short-Term Memory
Tensor Neural Networks.”

Overview. The task/goal is the event extraction task as defined in Automatic Content Ex-
traction (ACE). Specifically, given a text document, our goal is to do the following in order
for each sentence:
1. Identify any event triggers in the sentence.
2. If triggers found, predict their subtype. For example, given the trigger “fired,” we may
classify it as having the Attack subtype.
3. If triggers found, identify their candidate argument(s). ACE defines an event argument
as “an entity mention, temporal expression, or value that is involved in an event.”
4. For each candidate argument, predict its role: “the relationship between an argument to
the event in which it participates.”

Context-aware Word Representation. Use pre-trained word embeddings for the input
word tokens, the predicted trigger, and the candidate argument. Note: we assume we already
have predictions for the event trigger t and are doing a pass for one of (possibly many) candidate
arguments a.
1. Embed each word in the sentence with pre-trained embeddings. Denote the embedding
for ith word as e(wi ).
2. Feed each e(wi ) through a bidirectional LSTM. Denote the ith output of the forward
LSTM as cl (wi+1 ) and the output of the backward LSTM at the same time step as
cr (wi−1 ). As usual, they take the general functional form:
   cl(wi) = →LSTM(cl(wi−1), e(wi−1))        (294)
   cr(wi) = ←LSTM(cr(wi+1), e(wi+1))        (295)

3. Concatenate e(wi ), cl (wi ), cr (wi ) together along with the embedding of the candidate
argument e(a) and predicted trigger e(t). Also include the relative distance of wi to t or
(??) a, denoted as pi for position information, and the embedding of the predicted event
type pe of the trigger. Denote this massive concatenation result as xi :

xi := cl (wi ) ⊕ e(wi ) ⊕ cr (wi ) ⊕ pi ⊕ pe ⊕ e(a) ⊕ e(t) (297)

Dynamic Multi-Pooling. This is easiest shown by example. Continue with our example
sentence:

In California, Peterson was arrested for the murder of his wife and unborn son.

where the colors are given for this specific case where murder is our predicted trigger and we
are considering the candidate argument Peterson 92 . Given our n outputs from the previous
stage, y (1) ∈ Rn×m , where n is the length of the sentence and m is the size of that huge
concatenation given in equation 297. We split our sentence by trigger and candidate argument,
then (confusingly) redefine our notation as
y_{1j}^{(1)} ← [ y_{1j}^{(1)}  y_{2j}^{(1)} ]        (298)
y_{2j}^{(1)} ← [ y_{3j}^{(1)}  · · ·  y_{7j}^{(1)} ]        (299)
y_{3j}^{(1)} ← [ y_{8j}^{(1)}  · · ·  y_{nj}^{(1)} ]        (300)

(Note: Peterson is the 3rd word, and murder is the 8th word.)

where it's important to see that, for some 1 ≤ j ≤ m, each new y_{ij}^{(1)} is a vector of length equal
to the number of words in segment i. Finally, the dynamic multi-pooling layer, y^{(2)}, can be
expressed as

y_{i,j}^{(2)} := max( y_{i,j}^{(1)} )        1 ≤ i ≤ 3,  1 ≤ j ≤ m        (301)

where the max is taken over each of the aforementioned vectors, leaving us with 3m values
total. These are concatenated to form y (2) ∈ R3m .
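A minimal numpy sketch (mine) of dynamic multi-pooling; y1 is the (n × m) matrix of concatenated representations, and the split positions are the 0-indexed locations of the candidate argument and the trigger, matching the segments of eqns. 298-300:

import numpy as np

def dynamic_multi_pooling(y1, arg_pos, trig_pos):
    """y1: (n, m). Split rows at the two (0-indexed) positions, max-pool each segment (eq. 301)."""
    a, b = sorted((arg_pos, trig_pos))          # running example: a = 2 (Peterson), b = 7 (murder)
    segments = [y1[:a], y1[a:b], y1[b:]]        # matches the three segments in eqns. 298-300
    pooled = [seg.max(axis=0) for seg in segments]
    return np.concatenate(pooled)               # y^(2) in R^{3m}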

Output. To predict each argument role [for the given argument candidate], y^{(2)} is fed
through a dense softmax layer,

O = W2 y (2) + b2 (302)

where W2 ∈ Rn1 ×3m and n1 is the number of possible argument roles (including "None"). The
authors also use dropout on y (2) .

92
Yes, arrested could be another predicted trigger, but the network considers each possibility at separate
times/locations in the architecture.

Papers and Tutorials Nov 2, 2017

Reasoning with Neural Tensor Networks


Table of Contents Local Written by Brandon McKinzie

Socher et al., “Reasoning with Neural Tensor Networks for Knowledge Base Completion”

Overview. Reasoning over relationships between two entities. Goal: predict the likely truth
of additional facts based on existing facts in the KB. This paper contributes (1) the new NTN
and (2) a new way to represent entities in KBs. Each relation is associated with a distinct
model. Inputs to a given relation’s model are pairs of database entities, and the outputs score
how likely the pair has the relationship.

Neural Tensor Networks for Relation Classification. Let e1 , e2 ∈ Rd be the vector


representations of the two entities, and let R denote the relation (and thus model) of interest.
The NTN computes a score of how likely it is that e1 and e2 are related by R via:
" # !
[1:k] e1 [1:k]
∈ Rd×d×k
g(e1 , R, e2 ) = uTR tanh eT1 WR e2 + VR + bR (303) WR
e2
VR ∈ Rk×2d
[1:k]
where the bilinear tensor product eT1 WR e2 results in a vector h ∈ Rk with each entry
computed by one slice i = 1, . . . , k of the tensor.

Intuitively, we can see each slice of the tensor as being responsible for one type of
entity pair or instantiation of a relation. . . Another way to interpret each tensor
slice is that it mediates the relationship between the two entity vectors differently.
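A minimal numpy sketch (mine) of the scoring function in eq. 303; W has shape (k, d, d) so that W[i] is the ith tensor slice, and the remaining parameter names are illustrative:

import numpy as np

def ntn_score(e1, e2, W, V, b, u):
    """g(e1, R, e2) = u^T tanh(e1^T W^[1:k] e2 + V [e1; e2] + b)."""
    bilinear = np.array([e1 @ W_i @ e2 for W_i in W])   # one entry per tensor slice, h in R^k
    return float(u @ np.tanh(bilinear + V @ np.concatenate([e1, e2]) + b))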

Training Objective and Derivatives. All models are trained with contrastive max-
margin objective functions and minimize the following objective:
J(Ω) = ∑_{i=1}^{N} ∑_{c=1}^{C} max( 0, 1 − g(T^{(i)}) + g(T_c^{(i)}) ) + λ||Ω||_2^2        (304)

where c is for "corrupted" samples, T_c^{(i)} := (e_1^{(i)}, R^{(i)}, e_c^{(i)}). Notice that this function is
minimized when the difference, g(T^{(i)}) − g(T_c^{(i)}), is maximized. The authors used minibatched
L-BFGS for optimization.

Papers and Tutorials Nov 6, 2017

Language to Logical Form with Neural Attention


Table of Contents Local Written by Brandon McKinzie

Dong and Lapata, “Language to Logical Form with Neural Attention,” (2016)

Sequence-to-Tree Model. Variant of Seq2Seq that is more faithful to the compositional


nature of meaning representations. Its schematic is shown below. The authors define a
“nonterminal” < n > token which indicates [the root of] a subtree.

where the authors have employed "parent-feeding": for a given subtree (logical form), at each
timestep, the hidden vector of the parent nonterminal is concatenated with the inputs and fed
into the LSTM (best understood via above illustration).

After encoding input q, the hierarchical tree decoder generates tokens at depth 1
of the subtree corresponding to parts of logical form a. If the predicted token is
< n >, decode the sequence by conditioning on the nonterminal’s hidden vector.
This process terminates when no more nonterminals are emitted.

Also note that the output posterior probability over the encoded input q is the product of
subtree posteriors. For example, consider the decoding example in the figure below:

We would compute the output posterior as:

p(a | q) = p(y1 y2 y3 y4 | q)p(y5 y6 | y≤3 , q) (305)

The model is trained by minimizing log-likelihood over the training data, using RMSProp for
optimization. At inference time, greedy search or beam search is used to predict the most
probable output sequence.

Papers and Tutorials Nov 6, 2017

Seq2SQL: Generating Structured Queries from NL using RL


Table of Contents Local Written by Brandon McKinzie

Zhong, Xiong, and Socher, “Seq2SQL: Generating Structured Queries From Natural Language Using Reinforce-
ment Learning”

Overview. Deep neural network for translating natural language questions to corresponding
SQL queries. Outperforms state-of-the-art semantic parser.

Seq2Tree and Pointer Baseline. Baseline model is the Seq2Tree model from the previous
note on Dong & Lapata’s (2016) paper. Authors here argue their output space is unnecessarily
large, and employ the idea of pointer networks with augmented inputs. The input sequence is
the concatenation of (1) the column names, (2) the limited vocabulary of the SQL language
such as SELECT, COUNT, etc., and (3) the question.

x := [<col>; x^c_1; x^c_2; . . . ; x^c_N; <sql>; x^s; <question>; x^q]        (306)

(Note: x^c_j ∈ R^{T_j}.)

where we also insert special (“sentinel”) tokens to demarcate the boundaries of each section.
The pointer network can then produce the SQL query by selecting exclusively from the input.
Let gs denote the sth decoder [hidden] state, and ys denote the output (index/pointer to input
query token).
   
[ptr net]    y_s = arg max_t( α_s^{ptr} )    where    α_{s,t}^{ptr} = w^{ptr} · tanh( U^{ptr} g_s + V^{ptr} h_t )        (307)

Seq2SQL.

1. Aggregation Classifier. Our goal here is to predict which aggregation operation to use
out of COUNT, MIN, MAX, NULL, etc. This is done by projecting the attention-weighted aver-
age of encoder states, κagg , to RC where C denotes the number of unique aforementioned
aggregation operations. The sequence of computations is summarized as follows:

   α_t^{inp} = w^{inp} · h_t^{enc}        (308)
   β^{inp} = softmax(α^{inp})        (309)
   κ^{agg} = ∑_t β_t^{inp} h_t^{enc}        (310)
   α^{agg} = W^{agg} tanh(V^{agg} κ^{agg} + b^{agg}) + c^{agg}        (311)
   β^{agg} = softmax(α^{agg})        (312)

   (Note: W^{agg} ∈ R^{C×T}, β^{agg} ∈ R^C.)

where βiagg gives the probability for the ith aggregation operation. We use cross entropy
loss Lagg for determining the aggregation operation. Note that this part isn’t really
a sequence-to-sequence architecture. It’s nothing more than an MLP applied to an
attention-weighted average of the encoder states.

2. Get Pointer to Column. A pointer network is used for identifying which column in
the input representation should be used in the query. Recall that xcj,t denotes the tth
word in column j. We use the last encoder state for a given column’s LSTM93 as its
representation; Tj denotes the number of words in the jth column.
 
   e^c_j = h^c_{j,T_j}    where    h^c_{j,t} = LSTM( emb(x^c_{j,t}), h^c_{j,t−1} )        (313)

   To construct a representation for the question, compute another input representation
   κ^{sel} using the same architecture (but distinct weights) as for κ^{agg}. As usual, we compute
   the scores for each column j via:

   α^{sel}_j = W^{sel} tanh( V^{sel} κ^{sel} + V^c e^c_j )        (314)
   β^{sel} = softmax( α^{sel} )        (315)

Similar to the aggregation, we train the SELECT network using cross entropy loss Lsel .

3. WHERE Clause Pointer Decoder. Recall from equation 307 that this is a model with
recurrent connections from its outputs leading back into its inputs, and thus a common
approach is to train it with teacher forcing94 . However, since the boolean expressions
within a WHERE clause can be swapped around while still yielding the same SQL query,
reinforcement learning (instead of cross entropy) is used to learn a policy to directly
optimize the expected correctness of the execution result. Note that this also implies that
we will be sampling from the output distribution at decoding step s to obtain the next
input for s + 1 [instead of teacher forcing].

93
Yes, we encode each column with an LSTM separately.
94
Teacher forcing is just a name for how we train the decoder portion of a sequence-to-sequence model,
wherein we feed the ground-truth output y (t) as input at time t + 1 during training.


R(q(y), q_g) = { −2  if q(y) is not a valid SQL query
                 −1  if q(y) is a valid SQL query and executes to an incorrect result
                 +1  if q(y) is a valid SQL query and executes to the correct result        (316)

L^{whe} = −E_y[ R(q(y), q_g) ]        (317)

∇_Θ L^{whe} = −∇_Θ E_{y∼p_y}[ R(q(y), q_g) ]        (318)
            = −E_{y∼p_y}[ R(q(y), q_g) ∑_t ∇_Θ log p_y(y_t; Θ) ]        (319)
            ≈ −R(q(y), q_g) ∑_t ∇_Θ log p_y(y_t; Θ)        (320)

where
→ y = [y 1 , y 2 , . . . , y T ] denotes the sequences of generated tokens in the WHERE clause.
→ q(y) denotes the query generated by the model.
→ qg denotes the ground truth query corresponding to the question.
and the gradient has been approximated in the last line using a single Monte-Carlo sample
y.

Finally, the model is trained using gradient descent to minimize L = Lagg + Lsel + Lwhe .
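A minimal sketch (mine) of the reward in eq. 316 and the single-sample REINFORCE-style loss in eq. 320, assuming a hypothetical execute(query, db) helper and per-token log-probabilities from the decoder:

def reward(pred_query, gold_query, db):
    """Eq. 316. execute() is a hypothetical helper that runs a query and returns its result."""
    try:
        pred_result = execute(pred_query, db)
    except Exception:
        return -2.0                              # not a valid SQL query
    return 1.0 if pred_result == execute(gold_query, db) else -1.0

def where_clause_loss(token_log_probs, r):
    """Single Monte-Carlo estimate of L^whe (eq. 320): -R * sum_t log p(y_t)."""
    return -r * sum(token_log_probs)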

Speculations for Event Extraction. I want to explore using this paper’s model for the
task of event extraction. Below, I’ve replaced some words (shown in green) from a sentence in
the paper in order to formalize this as event extraction.

Seq2Event takes as input a sentence and the possible event types of an ontol-
ogy. It generates the corresponding event annotation, which, during training, is
compared against an event template. The result of the comparison is utilized
to train the reinforcement learning algorithm95 .

95
Original: Seq2SQL takes as input a question and the columns of a table. It generates the corresponding
SQL query, which, during training, is executed against a database. The result of the execution is utilized as the
reward to train the reinforcement learning algorithm.

Papers and Tutorials Nov 13, 2017

SLING: A Framework for Frame Semantic Parsing


Table of Contents Local Written by Brandon McKinzie

M. Ringgaard, R. Gupta, F. Pereira, “SLING: A framework for frame semantic parsing” (2017)

Model.
• Inputs. [words; affixes; shapes]
• Encoder.
1. Embed.
2. Bidirectional LSTM.
• Inputs to TBRU.
– BLSTM [forward and backward] hidden state for the current token in the parser state.
– Focus. Hidden layer activations corresponding to the transition steps that evoked/brought
into focus the top-k frames in the attention buffer.
– Attention. Recall that we maintain an attention buffer: an ordered list of frames,
where the order represents closeness to center of attention. The attention portion
of inputs for the TBRU looks at the top-k frames in the attention buffer, finds the
phrases in the text (if any) that evoked them. The activations from the BLSTM for
the last token of each of those phrases are included as TBRU inputs96
– History. Hidden layer activations from the previous k steps.
– Roles. Embeddings of (si , ri , ti ), where the frame at position si in the attention
buffer has a role (key) ri with frame at position ti as its value. Back-off features are
added for the source roles (si , ri ), target role (ri , ti ), and unlabeled roles (si , ti ).
• Decoder (TBRU). Outputs a softmax over possible transitions (actions).

96
Okay, how is this attention at all? Seems misleading to call it attention.

Transition System. Below is the list of possible actions. Note that, since the system is
trained to predict the correct frame graph result, it isn’t directly told what order it should
take a given set of actions97 .
• SHIFT. Move to next input token.
• STOP. Signal that we’ve reached end of parse.
• EVOKE(type, num). New frame of type from next num tokens in the input; placed
at front of attention buffer.
• REFER(frame, num). New mention from next num tokens, evoking existing frame
from attention buffer. Places at front.
• CONNECT(source-frame, role, target-frame). Inserts (role, target-frame) slot
into source-frame, move source-frame to front.
• ASSIGN(source-frame, role, value). Same as CONNECT, but with primitive/con-
stant value.
• EMBED(target-frame, role, type). New frame of type, and inserts (role, target-
frame) slot. New frame placed to front.
• ELABORATE(source-frame, role, type). New frame of type. Inserts (role, new-
frame) slot to source-frame. New frame placed at front.

Evaluation. Need some way of comparing an annotated document with its gold-standard
annotation. This is done by constructing a virtual graph where the document is the start node.
It is then connected to the spans (which are presumably nodes themselves), and the spans are
connected to the frames they evoke. Frames that refer to other frames are given corresponding
edges between them. Quality is computed by aligning the golden and predicted graphs and
computing precision, recall, and F1. Specifically, these scores are computed separately for
spans, frames, frame types, roles linking to other frames (referred to here as just “roles”), and
roles that link to global constants (referred to here as just “labels”). Results are shown below.

97
This is important to keep in mind, since more than one sequence of actions can result in a given predicted
frame graph.

Papers and Tutorials Nov 16, 2017

Poincaré Embeddings for Learning Hierarchical Representations


Table of Contents Local Written by Brandon McKinzie

M. Nickel and D. Kiela, “Poincaré Embeddings for Learning Hierarchical Representations” (2017)

Introduction. Dimensionality of embeddings can become prohibitively large when needed
for complex data. Authors focus on mitigating this problem for large datasets whose objects
can be organized according to a latent hierarchy.98 They propose to compute embeddings in
a particular model of hyperbolic space, the Poincaré ball model, claiming it is well-suited
for gradient-based optimization (they make use of Riemannian optimization).

Prerequisite Math. Recall that a hyperbola is a set of points, such that for any point P of
the set, the absolute difference of the distances |P F1 |, |P F2 | to two fixed points F1 , F2 (the
foci), is constant, usually denoted by 2a, a > 0. We can define a hyperbola by this set of points
or by its canonical form, which are both given, respectively, as:

H = {P | ||P F2 | − |P F1 || = 2a} (321)


x^2/a^2 − y^2/b^2 = 1     (322)

where b^2 := c^2 − a^2, (±a, 0) are the two vertices, and (±c, 0) are the two foci. Cannon et al.
define n-dimensional hyperbolic space by the formula

H^n = {x ∈ R^{n+1} : x ∗ x = −1}     (323)

where ∗ denotes the non-Euclidean inner product (it subtracts the last term; the same as in Minkowski
space-time). Notice that this is the defining equation for a hyperboloid of two sheets, and
Cannon et al. say “usually we deal only with one of the two sheets.” Hyperbolic spaces are
well-suited to model hierarchical data, since both circle length and disc area grow exponentially
with r.

98. This begs the question: how useful would a Poincaré embedding be for situations where this assumption isn’t valid?

Poincaré Embeddings. Let B^d = {x ∈ R^d | ||x|| < 1} be the open d-dimensional unit ball.
The Poincaré ball model of hyperbolic space then corresponds to the Riemannian manifold99
(B^d, g_x), where

g_x = ( 2 / (1 − ||x||^2) )^2 g^E     (324)

is the Riemannian metric tensor, and g^E denotes the Euclidean metric tensor. The distance
between two points u, v ∈ B^d is given as

d(u, v) = arccosh( 1 + 2 ||u − v||^2 / ((1 − ||u||^2)(1 − ||v||^2)) )     (325)

The boundary of the ball corresponds to the sphere S^{d−1} and is denoted by ∂B. Geodesics
in B^d are then circles that are orthogonal to ∂B. To compute Poincaré embeddings for a set
of symbols S = {x_i}_{i=1}^n, we want to find embeddings Θ = {θ_i}_{i=1}^n, where θ_i ∈ B^d. Given
some loss function L that encourages semantically similar objects to be close as defined by the
Poincaré distance, our goal is to solve the optimization problem

Θ′ ← argmin_Θ L(Θ)   s.t.   ∀θ_i ∈ Θ : ||θ_i|| < 1     (326)

Optimization. Let T_θ B denote the tangent space of a point θ ∈ B^d. Let ∇_R ∈ T_θ B denote
the Riemannian gradient of L(θ), and ∇_E the Euclidean gradient of L(θ). Using RSGD,
parameter updates take the form

θ_{t+1} = R_{θ_t}( −η_t ∇_R L(θ_t) )     (327)

where R_{θ_t} denotes the retraction onto B^d at θ_t and η_t denotes the learning rate at time t.
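Below is a small numpy sketch of the distance (325) and one RSGD step (327). The explicit gradient rescaling ∇_R = ((1 − ||θ||^2)^2 / 4) ∇_E (the inverse of the metric in (324)) and the simple retraction R_θ(v) = θ + v followed by a projection back into the open ball are assumptions about how the update is instantiated; the notes above don't spell them out.

```python
import numpy as np

def poincare_dist(u, v, eps=1e-9):
    """Eq. (325): hyperbolic distance between two points of the open unit ball."""
    num = np.dot(u - v, u - v)
    den = (1.0 - np.dot(u, u)) * (1.0 - np.dot(v, v)) + eps
    return np.arccosh(1.0 + 2.0 * num / den)

def rsgd_step(theta, euclid_grad, lr=0.01, eps=1e-5):
    """One RSGD update (Eq. 327): rescale the Euclidean gradient by the inverse metric,
    take a step, and project back into B^d if the iterate leaves the ball."""
    scale = ((1.0 - np.dot(theta, theta)) ** 2) / 4.0   # assumed metric-based rescaling
    theta_new = theta - lr * scale * euclid_grad
    norm = np.linalg.norm(theta_new)
    if norm >= 1.0:                                     # assumed projection back into the ball
        theta_new = theta_new / norm * (1.0 - eps)
    return theta_new

# toy usage: pull a point slightly toward the origin
theta = np.array([0.3, 0.4])
theta = rsgd_step(theta, euclid_grad=theta.copy())
```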

99. All five analytic models of hyperbolic geometry in Cannon et al. are differentiable manifolds with a Riemannian metric. A Riemannian metric ds^2 on Euclidean space R^n is a function that assigns at each point p ∈ R^n a positive definite symmetric inner product on the tangent space at p, this inner product varying differentiably with the point p. If x_1, . . . , x_n are the standard coordinates in R^n, then ds^2 has the form ∑_{i,j} g_{ij} dx_i dx_j, and the matrix (g_{ij}) depends differentiably on x and is positive definite and symmetric.

Papers and Tutorials Nov 17, 2017

Enriching Word Vectors with Subword Information (FastText)



P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching Word Vectors with Subword Information”
(2017)

Overview. Based on the skipgram model, but where each word is represented as a bag of
character n-grams. A vector representation is associated with each character n-gram; a word is
represented as the sum100 of these representations.

Skipgram with Negative Sampling. Since this is based on skipgram, recall the objective
of skipgram, which is to maximize:

∑_{t=1}^T ∑_{c∈C_t} log Pr[w_c | w_t]     (328)

for a sequence of words w_1, . . . , w_T. One way of parameterizing Pr[w_c | w_t] is by computing a
softmax over a scoring function s : (w_t, w_c) → R,

Pr[w_c | w_t] = e^{s(w_t, w_c)} / ∑_{j=1}^W e^{s(w_t, j)}     (329)

However, this implies that, given w_t, we only predict one context word w_c. Instead, we can
frame the problem as a set of independent binary classification tasks, and independently predict
the presence/absence of context words. Let ℓ : x ↦ log(1 + e^{−x}) denote the standard logistic
negative log-likelihood. Our objective is:

∑_{t=1}^T [ ∑_{c∈C_t} ℓ(s(w_t, w_c)) + ∑_{n∈N_{t,c}} ℓ(−s(w_t, n)) ]     (330)

where N_{t,c} is a set of negative examples sampled from the vocabulary. A common scoring
function involves associating a distinct input vector u_w and output vector v_w with each word w.
Then the score is computed as s(w_t, w_c) = u_{w_t}^T v_{w_c}.

100. It would be interesting to explore other aggregation operations than just summation.

FastText. Main contribution is a different scoring function s that utilizes subword information.
Each word w is represented as a bag of character n-grams. Special symbols < and > delimit
word boundaries, and the authors also insert the special sequence containing the full word (with
the delimiters) in its bag of n-grams. The word where is thus represented by first building its
bag of n-grams, for the choice of n = 3:

where −→ {< wh, whe, her, ere, re >, < where >} (331)

Such a set of n-grams for a word w is denoted G_w. Each n-gram g for a word w has its own
vector z_g, and the final vector representation of w is the sum of these. The scoring function
becomes

s(w, c) = ∑_{g∈G_w} z_g^T v_c     (332)
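A minimal Python sketch of the bag-of-n-grams construction (331) and the score (332). The embedding tables here are random stand-ins populated on first use, and the hashing trick used by the released fastText implementation is omitted.

```python
import numpy as np

def char_ngrams(word, n=3):
    """Eq. (331): character n-grams of `<word>`, plus the full delimited word itself."""
    padded = f"<{word}>"
    grams = {padded[i:i + n] for i in range(len(padded) - n + 1)}
    grams.add(padded)
    return grams

def score(word, context, z, v, dim=10):
    """Eq. (332): s(w, c) = sum over g in G_w of z_g^T v_c."""
    vc = v.setdefault(context, np.random.randn(dim))
    return sum(z.setdefault(g, np.random.randn(dim)) @ vc for g in char_ngrams(word))

z, v = {}, {}            # n-gram vectors z_g and output word vectors v_c
print(score("where", "is", z, v))
```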

Papers and Tutorials Nov 17, 2017

DeepWalk: Online Learning of Social Representations



B. Perozzi, R. Al-Rfou, and S. Skiena, “DeepWalk: Online Learning of Social Representations,” (2014).

Problem Definition. Classifying members of a social network into one or more categories.
Let G = (V, E), where V are the members of the network, and E be its edges,
E ⊆ (V × V ). Given a partially labeled social network GL = (V, E, X, Y ), with
attributes X ∈ R|V |×S where S is the size of the feature space for each attribute
vector, and Y ∈ R|V |×|Y| , Y is the set of labels.
In other words, the elements of our training dataset, (X, Y ), are the members of the social
network, and we want to label each member, represented by a vector in RS , with one or more
of the |Y| labels. We aim to learn features that capture the graph structure independent of
the labels’ distribution, and to do so in an unsupervised fashion.

Learning Social Representations. We want the representations to be adaptable, community-


aware, low-dimensional, and continuous. The authors’ method learns representations for ver-
tices from a stream of short random walks, optimized with techniques from language modeling.
• Random Walks. Denote a random walk rooted at vertex vi as Wvi , where the kth
visited vertex is chosen at random from the neighbors of the (k − 1)th visited vertex, and
so on. Motivation for their use here is that they’re “the foundation of a class of output
sensitive algorithms which use them to compute local community structure information
in time sublinear to the size of the input graph.”
• Language Modeling. Authors present a generalization of language modeling, which
traditionally aims to maximize Pr [wn | w0 , . . . , wn−1 ] over all words in a training corpus.
The motivation of the generalization is to explore the graph through a stream of short
random walks. The walks are thought of as short sentences/phrases in a special language,
and we want to estimate the probability of observing vertex vi given all previous vertices
so far in the random walk. Since we want to learn a latent social representation of each
vertex, and not simply a probability distribution over node co-occurrences, we condition
on the embeddings of visited nodes in this latent space (rather than the nodes themselves
directly)
Pr [vi | Φ(v1 ), Φ(v2 ), . . . , Φ(vi−1 )] (333)
where, in practice, the mapping function Φ is represented by a |V | × d matrix of free
parameters (an embedding matrix). Since this becomes infeasible to compute as the walk
length grows, the authors opt for an approach resembling CBOW: minimizing the NLL
of vertices in the context of a given vertex.

min_Φ − log Pr[ v_{i−w}, . . . , v_{i−1}, v_{i+1}, . . . , v_{i+w} | Φ(v_i) ]     (334)

Remember that, here, vj is the jth vertex visited in some given random walk.

DeepWalk Algorithm. Below is a conceptual summary of the procedure, followed by a
figure/illustration of the formal algorithm definition.
1. Inputs. Graph G(V, E), window size w, embedding size d, walks per vertex γ, walk
length t.
2. Random Walk. For each vertex v_i, compute W_{v_i} := RandomWalk(G, v_i, t).
3. Updates. Upon finishing a walk W_{v_i}, run skipgram on the sequence of walked vertices
to update the embedding matrix Φ.
4. Outputs. The embedding matrix Φ ∈ R^{|V|×d}.

where SkipGram(Φ, W_{v_i}, w) performs SGD updates on Φ to minimize − log Pr[u_k | Φ(v_j)] for
each visited v_j, for each u_k in the “context” of v_j. Notice that a binary tree T is built from the
set of vertices V (line 2) – this is done as preparation for computing each Pr[u_k | Φ(v_j)] via a
hierarchical softmax, to reduce the computational burden of its partition function. Finally, a visual
overview of the DeepWalk algorithm is shown below; a minimal code sketch of the walk-then-skipgram
loop is given as well. The authors use this algorithm, combined with a one-vs-rest logistic regression
implementation by LibLinear, for various multiclass multilabel classification tasks.
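A minimal Python sketch of steps 1–3: generate γ truncated random walks per vertex and hand the resulting "sentences" of vertex ids to a skipgram model (with hierarchical softmax) to update Φ. The graph is a plain adjacency dict; the skipgram update itself is left to whatever word2vec implementation is at hand.

```python
import random

def random_walk(graph, start, length):
    """W_{v_i}: walk of `length` vertices rooted at `start`; each step picks a uniformly random neighbor."""
    walk = [start]
    while len(walk) < length:
        neighbors = graph[walk[-1]]
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return walk

def deepwalk_corpus(graph, walks_per_vertex=10, walk_length=40):
    """Gamma passes over the (shuffled) vertices, one rooted walk per vertex per pass."""
    corpus = []
    for _ in range(walks_per_vertex):
        vertices = list(graph)
        random.shuffle(vertices)
        corpus.extend(random_walk(graph, v, walk_length) for v in vertices)
    return corpus

# toy usage; the walks would then be fed to SkipGram(Phi, walk, window) to update Phi
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
walks = deepwalk_corpus(graph, walks_per_vertex=2, walk_length=5)
```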

Papers and Tutorials Dec 6, 2017

Review of Relational Machine Learning for Knowledge Graphs



Nickel, Murphy, Tresp, and Gabrilovich, “Review of Relational Machine Learning for Knowledge Graphs,”
(2015).

Introduction. Paper discusses latent feature models such as tensor factorization and multiway
neural networks, and mining observable patterns in the graph. In Statistical Relational
Learning (SRL), the representation of an object can contain its relationships to other objects.
The main goals of SRL include:
• Prediction of missing edges (relationships between entities).
• Prediction of properties of nodes.
• Clustering nodes based on their connectivity patterns.
We’ll be reviewing how SRL techniques can be applied to large-scale knowledge graphs
(KGs), i.e. graph structured knowledge bases (KBs) that store factual information in the form
of relationships between entities.

Probabilistic Knowledge Graphs. Let E = {e1 , . . . , eNe } be the set of all entities and
R = {r1 , . . . , rNr } be the set of all relation types in a KG. We model each possible triple
xijk = (ei , rk , ej ) as a binary random variable yijk ∈ {0, 1} that indicates its existence. The
full tensor Y ∈ {0, 1}Ne ×Ne ×Nr is called the adjacency tensor, where each possible realization
of Y can be interpreted as a possible world.

Clearly, Y will be large and sparse in most applications. Ideally, a relational model for large-
scale KGs should scale at most linearly with the data size, i.e., linearly in the number of
entities Ne , linearly in the number of relations Nr , and linearly in the number of observed
triples |D| = Nd .

Types of SRL Models. The presence or absence of certain triples in relational data is
correlated with (i.e. predictive of) the presence or absence of certain other triples. In other
words, the random variables yijk are correlated with each other. There are three main ways
to model these correlations:
1. Latent feature models: Assume all yijk are conditionally independent given latent
features associated with the subject, object and relation type and additional parameters.
2. Graph feature models: Assume all yijk are conditionally independent given observed
graph features and additional parameters.
3. Markov Random Fields: Assume all yijk have local interactions.
The first two model classes predict the existence of a triple xijk via a score function f (xijk ; Θ)

which represents the model’s confidence that a triple exists given the parameters Θ. The
conditional independence assumptions can be written as
Pr[Y | D, Θ] = ∏_{i=1}^{N_e} ∏_{j=1}^{N_e} ∏_{k=1}^{N_r} Ber( y_{ijk} | σ(f(x_{ijk}; Θ)) )     (335)

where Ber is the Bernoulli distribution101 . Such models will be referred to as probabilistic
models. We will also discuss score-based models, which optimize f (·) via maximizing the margin
between existing and non-existing triples.

Latent Feature Models. We assume the variables yijk are conditionally independent given
a set of global latent features and parameters. All LFMs explain triples (observable facts)
via latent features of entities102 . One task of all LFMs is to infer these [latent] features
automatically from the data.

• RESCAL: a bilinear model (a small numpy sketch of this score is given after this list of
models). It models the score of a triple x_{ijk} as

f_{ijk}^{RESCAL} := e_i^T W_k e_j = ∑_{a=1}^{H_e} ∑_{b=1}^{H_e} w_{abk} e_{ia} e_{jb}     (337)

where the entity vectors e_i ∈ R^{H_e} and H_e denotes the number of latent features in
the model. The parameters of the model are Θ = { {e_i}_{i=1}^{N_e}, {W_k}_{k=1}^{N_r} }. Note that

entities have the same latent representation regardless of whether they occur as subjects
or objects in a relationship (shared representation), thus allowing the model to capture
global dependencies in the data. We can make a connection to tensor factorization
methods by seeing that the equation above can be written compactly as

F_k = E W_k E^T     (338)

where F_k ∈ R^{N_e×N_e} is the matrix holding all scores for the k-th relation, and the ith row
of E ∈ R^{N_e×H_e} holds the latent representation of e_i.

• Multi-layer perceptrons. We can rewrite RESCAL as

f_{ijk}^{RESCAL} := w_k^T φ_{i,j}^{RESCAL}     (339)
φ_{i,j}^{RESCAL} := e_j ⊗ e_i     (340)

where wk = vec(Wk ) (vector of size He2 obtained by stacking columns of Wk ). The

101. Notation used:

Ber(y | p) = p if y = 1;  1 − p if y = 0.     (336)

102. It appears that “latent” is being used here synonymously with “not directly observed in the data”.

authors extend this to what they call the E-MLP (E for entity) model:
f_{ijk}^{E-MLP} := w_k^T g(h_{ijk}^a)     (341)
h_{ijk}^a := A_k^T φ_{ij}^{E-MLP}     (342)
φ_{ij}^{E-MLP} := [e_i; e_j]     (343)
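To ground the bilinear form, here is a tiny numpy sketch of the RESCAL score (337) and its matrix form (338), with randomly initialized entity and relation parameters standing in for learned ones.

```python
import numpy as np

num_entities, num_relations, He = 5, 2, 4
E = np.random.randn(num_entities, He)            # row i holds the latent vector e_i
W = np.random.randn(num_relations, He, He)       # one H_e x H_e matrix W_k per relation

def rescal_score(i, j, k):
    """Eq. (337): f_ijk = e_i^T W_k e_j."""
    return E[i] @ W[k] @ E[j]

def rescal_scores(k):
    """Eq. (338): F_k = E W_k E^T, all N_e x N_e scores for relation k."""
    return E @ W[k] @ E.T

assert np.isclose(rescal_score(0, 3, 1), rescal_scores(1)[0, 3])
```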

Graph Feature Models. Here we assume that the existence of an edge can be predicted by
extracting features from the observed edges in the graph. In contrast to LFMs, this kind of
reasoning explains triples directly from the observed triples in the KG.
- Similarity measures for uni-relational data. Link prediction in graphs that consist only
of a single relation (e.g. (Bob, isFriendOf, Sally) for a social network). Various similarity
indices have been proposed to measure similarity of entities, of which there are three main
classes:
1. Local similarity indices: Common Neighbors, Adamic-Adar index, Preferential Attach-
ment derive entity similarities from number of common neighbors.
2. Global similarity indices: Katz index, Leicht-Holme-Newman index (ensembles of all
paths between entities). Hitting Time, Commute Time, PageRank (random walks).
3. Quasi-local similarity indices: Local Katz, Local Random Walks.
- Path Ranking Algorithm (PRA): extends the idea of using random walks of bounded
lengths for predicting links in multi-relational KGs. Let π_L(i, j, k, t) denote a path of length
L of the form e_i −r_1→ e_2 −r_2→ e_3 · · · −r_L→ e_j, where t represents the sequence of edge types
t = (r_1, r_2, . . . , r_L). We also require there to be a direct arc e_i −r_k→ e_j, representing the
existence of a relationship of type k from e_i to e_j. Let Π_L(i, j, k) represent the set of all such
paths of length L, ranging over path types t.

We can compute the probability of following a given path by assuming that at each step
we follow an outgoing link uniformly at random. Let Pr [πL (i, j, k, t)] be the probability of
the path with type t. The key idea in PRA is to use these path probabilities as features
for predicting the probabilities of missing edges. More precisely, the feature vector and score
function (logistic regression) are as follows:
φ_{ijk}^{PRA} = [ Pr[π] : π ∈ Π_L(i, j, k) ]     (344)
f_{ijk}^{PRA} := w_k^T φ_{ijk}^{PRA}     (345)

TODO: Finish...

Papers and Tutorials Dec 6, 2017

Fast Top-K Search in Knowledge Graphs



S. Yang, F. Han, Y. Wu, X. Yan, “Fast Top-K Search in Knowledge Graphs.”

Introduction. Task: Given a knowledge graph G, a scoring function F, and a graph query Q,
top-k subgraph search over G returns the k answers with the highest matching scores. An example
is searching for movie makers (directors) who worked with “Brad” and have won awards, illustrated
below:

Clearly, it would be extremely inefficient to enumerate all possible matches and then rank
them.

Preliminaries/Terminology.
• Queries. A query [graph] is defined as Q = (VQ , EQ ). Each query node in Q provides
information/constraints about an entity, and an edge between two nodes specifies the
relationship or the connectivity constraint posed on the two nodes. Q∗ denotes a star-
shaped query, which is basically a graph that looks like a star (central node with tree-like
structure radially outward).
• Subgraph Matching. Given a graph query Q and a knowledge graph G, a match φ(Q)
of Q in G is a subgraph of G, specified by a one-to-one matching function φ. It maps each
node u (edge e = (u, u0 )) in Q to a node match φ(u) (edge match φ(e) = (φ(u), φ(u0 )))
in φ(Q).

The matching score between query Q and its match φ(Q) is

F(φ(Q)) = ∑_{v∈V_Q} F_V(v, φ(v)) + ∑_{e∈E_Q} F_E(e, φ(e))     (346)
F_V(v, φ(v)) = ∑_i α_i f_i(v, φ(v))     (347)
F_E(e, φ(e)) = ∑_j β_j f_j(e, φ(e))     (348)

Star-Based Top-K Matching.


1. Query decomposition: Given query Q, STAR decomposes Q to a set of star queries
Q. A star query contains a pivot node and a set of leaves as its neighbors in Q.
2. Star querying engine: Generate a set of top matches for each star query Q.
3. Top-k rank join. The top matches for multiple star queries are collected and joined to
produce top-k complete matches of Q.

Papers and Tutorials Jan 19, 2018

Dynamic Recurrent Acyclic Graphical Neural Networks (DRAGNN)



Kong et al., “DRAGNN: A Transition-based Framework for Dynamically Connected Neural Networks,” (2017).

Transition Systems. Define a transition system T ≜ {S, A, t}, where


• S = S(x) is a set of states, where that set depends on the input sequence x.
• A special start state s† ∈ S(x).
• A set of allowed decisions A(s, x) ∀s ∈ S(x).
• A transition function t(s, d, x) returning a new state s0 for any decision d ∈ A(s, x).
The authors then define a complete structure as a sequence of state/decision pairs (s1 , d1 ) . . . (sn , dn )
such that s1 = s† , di ∈ A(si ) for i = 1, . . . , n and si+1 = t(si , di ), where n = n(x) is the number
of decisions for input x103 . We’ll use transition systems to map inputs x into a sequence of
output symbols d1 , . . . , dn .

Transition Based Recurrent Networks. When combining transition systems with recur-
rent networks, we will refer to them as Transition Based Recurrent Units (TBRU), which
consist of:
• Transition system T .
• Input function m(s) : S 7→ RK that maps states to some fixed-size vector representation
(e.g. an embedding lookup operation).
• Recurrence function r(s) : S 7→ P{1, . . . , i − 1} that maps states to a set of previous time
steps, where P is the power set. Note that |r(s)| may vary with s. We use r to specify
state-dependent recurrent links in the unrolled computation graph.
• The RNN cell hs ← RNN(m(s), {hi | i ∈ r(s)}).

Example: Sequential tagging RNN. Let x = {x1 , . . . , xn } be a sequence of input tokens.


Let the ith output, di , be a tag from some predefined set of tags A. Then our model can be
defined as:
• Transition system: T = { si =S(xi )={1, . . . , di−1 }, A, t(si , di , xi )=si+1 = si +{di } }.
• Input function: m(si ) = embed(xi ).
• Recurrence function: r(si ) = {i − 1} to connect the network to the previous state.
• RNN cell: hi ← LST M (m(si ) | r(si ) = {i − 1}).

103
The authors state that we are only concerned with complete structures that have the same number of
decisions n(x) for the same input x.

Example: Parsey McParseface.
• Transition system: the arc-standard transition system, defined in image below104 .

so the state contains all words and partially built trees (stack) as well as unseen words
(buffer).
• Input function: m(si ) is the concatenation of 52 feature embeddings extracted from
tokens based on their positions in the stack and the buffer.
• Recurrence function: r(si ) is empty, as this is a feed-forward network.
• RNN cell: a feed-forward MLP (so not an RNN...).

Inference with TBRUs. To predict the output sequence {d1 , . . . , dn } given input sequence
x = {x1 , . . . , xn }, do:
1. Initialize s1 = s† .
2. For i = 1, . . . , n:
(a) Compute hi = RNN(m(si ), {hj | j ∈ r(si )}).
(b) Update transition state:

d_i ← argmax_{d∈A(s_i)} w_d^T h_i     (349)
s_{i+1} ← t(s_i, d_i)     (350)

NOTE: This defines a locally normalized training procedure, whereas Andor et al.
of Syntaxnet clearly conclude that their globally normalized model is the preferred
choice.
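Here is a schematic Python sketch of the greedy inference loop above. The transition system, input function m, recurrence function r, and RNN cell are passed in as callables, so this shows the shape of the algorithm rather than DRAGNN's actual implementation (which, per the note, one would anyway want to train with a globally normalized objective).

```python
def tbru_decode(x, start_state, allowed, transition, m, r, rnn_cell, weights):
    """Greedy TBRU inference: h_i from the cell, d_i = argmax_d w_d^T h_i over allowed
    decisions (Eq. 349), then advance the transition state (Eq. 350)."""
    s = start_state(x)
    hidden, decisions = [], []
    for _ in range(len(x)):                            # assumes n(x) = len(x) decisions
        recurrent_inputs = [hidden[j] for j in r(s)]   # state-dependent recurrent links
        h = rnn_cell(m(s), recurrent_inputs)
        hidden.append(h)
        d = max(allowed(s, x), key=lambda a: weights[a] @ h)
        decisions.append(d)
        s = transition(s, d, x)
    return decisions
```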

104. Image taken from “Transition-Based Parsing” by Joakim Nivre. Note that “right-headed” means “goes from left to right” or “headed to the right”.

Combining multiple TBRUs. We connect multiple TBRUs with different transition systems
via r(s).
1. We execute a list of T TBRU components sequentially, so that each TBRU advances a
global step counter.
2. Each transition state, s_τ, from the τ’th component has access to the terminal states from
every prior transition system, and the recurrence function r(s_τ) for any given component
can pull hidden activations from every prior one as well.

Example: Multi-task bi-directional tagging. Say we want to do both POS and NER
tagging (indices start at 1).
• Left-to-right: T = shift-only, m(si ) = xi , r(si ) = {i − 1}.
• Right-to-left: T = shift-only, m(sn+i ) = x(n−i)+1 , r(sn+i ) = {n + i − 1}.
• POS Tagger: TP OS = tagger, m(s2n+i ) = {}, r(s2n+i ) = {i, (2n − i) + 1}
• NER Tagger: TN ER = tagger, m(s3n+i ) = {}, r(s3n+i ) = {i, (2n − i) + 1, 2n + i}
which illustrates the most important aspect of the TBRU:
A TBRU can serve as both an encoder for downstream tasks and a decoder for its
own task simultaneously.

For this example, the POS Tagger served as both an encoder for the NER task as well as a
decoder for the POS task.

Training a DRAGNN. Assume training data consists of examples x along with gold de-
cision sequences for a given TBRU in the DRAGNN. Given decisions d_1, . . . , d_N from prior
components 1, . . . , T − 1, the log-likelihood for training the T’th TBRU along its gold decision
sequence d*_{N+1}, . . . , d*_{N+n} is then:

L(x, d*_{N+1:N+n}; θ) = ∑_i log Pr[ d*_{N+i} | d_{1:N}, d*_{N+1:N+i−1}; θ ]     (351)

During training, the entire input sequence is unrolled and backpropagation through struc-
ture is used for gradient computation.

4.31.1 More Detail: Arc-Standard Transition System

The arc-standard transition system is mentioned a lot, but with little detail. Here I’ll synthesize
what I find from external resources. The literature defines the states in a transition system
slightly differently than the DRAGNN paper. Here we’ll define them as a configuration
c = (Σ, B, A) triplet, where
• Σ is the stack of tokens in x that we’ve [partially] processed.
• B is the buffer of remaining tokens in x that we need to process.
• A is a set of arcs (wi , wj , `) that link wi to wj , and label the arc/link as `.
So, in the arc-standard transition system figure presented with Parsey McParseface earlier
(a small code sketch follows this list),
• SHIFT just means “push the head element of the buffer onto the tail (top) of the stack”.
• Left-arc just means “make a link from the tail element of the stack to the element before
it. Remove the pointed-to element from the stack.”
• Right-arc just means “make a link from the element before the tail element in the stack
to the tail element. Remove the pointed-to element from the stack.”
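A small Python sketch of a configuration c = (Σ, B, A) and the three actions as described above; labels are plain strings and precondition checks (e.g. requiring two items on the stack) are omitted.

```python
def shift(stack, buffer, arcs):
    """SHIFT: push the head of the buffer onto the stack."""
    stack.append(buffer.pop(0))

def left_arc(stack, buffer, arcs, label):
    """LEFT-ARC: arc from the top of the stack to the element below it; remove the dependent."""
    arcs.append((stack[-1], stack[-2], label))
    del stack[-2]

def right_arc(stack, buffer, arcs, label):
    """RIGHT-ARC: arc from the element below the top to the top; remove the dependent."""
    arcs.append((stack[-2], stack[-1], label))
    del stack[-1]

# toy usage on "John saw Mary"
stack, buffer, arcs = [], ["John", "saw", "Mary"], []
shift(stack, buffer, arcs); shift(stack, buffer, arcs)
left_arc(stack, buffer, arcs, "nsubj")          # saw -> John
shift(stack, buffer, arcs)
right_arc(stack, buffer, arcs, "dobj")          # saw -> Mary
```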

Papers and Tutorials April 01, 2018

Neural Architecture Search with Reinforcement Learning



B. Zoph and Q. Le, “Neural Architecture Search with Reinforcement Learning,” (2017).

Controller RNN. Generates architectures with a predefined number of layers, which is in-
creased manually as training progresses. At convergence, the validation accuracy of the generated
network is recorded. Then, the controller parameters θ_c are optimized to maximize the ex-
pected validation accuracy over a batch of generated architectures.

Reinforcement Learning to learn the controller parameters θc . Let a1:T denote a list of
actions taken by the controller105 , which defines a generated architecture. We denote the
resulting validation accuracy by R, which is the reward signal for our RL task. Concretely, we
want our controller to maximize its expected reward, J(θc ):

J(θc ) = EP (a1:T ; θc ) [R] (352)

Since R is non-differentiable with respect to θ_c,106 we use a policy gradient method to itera-
tively update θ_c. All this means is that we instead compute gradients over the softmax outputs
(the action probabilities), and use the value of R as a simple weight factor.

∇_{θ_c} J(θ_c) = ∑_{t=1}^T E_{P(a_{1:T}; θ_c)}[ ∇_{θ_c} log P(a_t | a_{1:t−1}; θ_c) · R ]     (353)
             ≈ (1/m) ∑_{k=1}^m ∑_{t=1}^T ∇_{θ_c} log P(a_t | a_{1:t−1}; θ_c) · R_k     (354)

where the second equation is the empirical approximation (batch-average instead of expecta-
tion) over a batch of size m, an unbiased estimator for our gradient107 . Also note that we do
105. Note that T is not necessarily the number of layers, since a single generated layer can correspond to multiple actions (e.g. stride height, stride width, num filters, etc.).
106. R is a function of the action sequence a_{1:T} and the parameters θ_c, and implicitly depends on the samples used for the validation set. Clearly, we do not have access to an analytical form of R, and computing numerical gradients via small perturbations of θ_c is computationally intractable.
107. It is unbiased for the same reason that any average over samples x drawn from a distribution P(x) is an unbiased estimator for E_P[x].

have access to the distribution P(a_{1:T}; θ_c) since it is defined to be the joint softmax probabili-
ties of our controller, given its parameter values θ_c (i.e. this is not a p_data vs p_model situation).
The approximation (354) is an unbiased estimator of (353), but has high variance. To reduce the
variance of our estimator, the authors employ a baseline function b that does not depend on
the current action:

(1/m) ∑_{k=1}^m ∑_{t=1}^T ∇_{θ_c} log P(a_t | a_{1:t−1}; θ_c) (R_k − b)     (355)
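To make the estimator concrete, here is a small numpy sketch of (354)/(355) for a controller reduced to independent softmaxes over per-step logits (rather than an RNN), with a moving-average baseline; the reward function is a toy stand-in for validation accuracy.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sample_architecture(logits):
    """Sample a_{1:T} and the per-step gradient of log P(a_t | .) w.r.t. the logits."""
    probs = [softmax(l) for l in logits]
    actions = [int(np.random.choice(len(p), p=p)) for p in probs]
    grads = [np.eye(len(p))[a] - p for p, a in zip(probs, actions)]   # onehot(a) - probs
    return actions, grads

def reinforce_update(logits, reward_fn, baseline, batch=8, lr=0.1):
    """Empirical policy-gradient step, Eq. (355): average (R_k - b) * grad log P over a batch."""
    total = [np.zeros_like(l) for l in logits]
    for _ in range(batch):
        actions, grads = sample_architecture(logits)
        R = reward_fn(actions)                        # stand-in for validation accuracy
        for t, g in enumerate(grads):
            total[t] += (R - baseline) * g
        baseline = 0.9 * baseline + 0.1 * R           # moving-average baseline b
    for t, l in enumerate(logits):
        l += lr * total[t] / batch                    # ascend J(theta_c)
    return baseline

# toy usage: 3 decisions with 4 choices each; reward = fraction of decisions equal to 0
logits = [np.zeros(4) for _ in range(3)]
b = 0.0
for _ in range(50):
    b = reinforce_update(logits, lambda a: a.count(0) / 3.0, b)
```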

Papers and Tutorials April 26, 2018

Joint Extraction of Events and Entities within a Document Context



B. Yang and T. Mitchell, “Joint Extraction of Events and Entities within a Document Context,” (2016).

Introduction. Two main reasons that state-of-the-art event extraction systems have difficul-
ties:
1. They extract events and entities in separate stages.
2. They extract events independently from each individual sentence, ignoring the rest of
the document.
This paper proposes an approach that simultaneously extracts events and entities within a
document context. They do this by first decomposing the problem into 3 tractable subproblems:
1. Learning the dependencies between a single event [trigger] and all of its potential argu-
ments.
2. Learning the co-occurrence relations between events across the document.
3. Learning for entity extraction.
and then combine these learned models into a joint optimization framework.

Learning Within-Event Structures. For now, assume we have some document x, a set
of candidate event triggers T , and a set of candidate entities N . Denote the set of entity
candidates that are potential arguments for trigger candidate i as Ni . The joint distribution
over the possible trigger types, roles, and entities for those roles, is given by

Pr_θ[ t_i, r_i, a | i, N_i, x ] ∝     (356)
exp( ∑_{j∈N_i} [ θ_1^T f_1(t_i) + θ_2^T f_2(r_{ij}) + θ_3^T f_3(t_i, r_{ij}) + θ_4^T f_4(a_j) + θ_5^T f_5(r_{ij}, a_j) ] )     (357)

[Margin note: all f_i also depend on i, x. In addition, all f_i except f_1 depend on the current j in the summation.]

where each fi is a feature function, and I’ve colored the unary feature functions green. The
unary features are tabulated in Table 1 of the original paper. They use simple indicator
functions 1t,r and 1r,a for the pairwise features. They train using maximum-likelihood estimates
with L2 regularization:

θ* = argmax_θ L(θ) − λ||θ||_2^2     (358)
L(θ) = ∑_i log( Pr_θ[t_i, r_i, a | i, N_i, x] )     (359)

and use L-BFGS to optimize the training objective.

Learning Event-Event Relations. A pairwise model of event-event relations in a document.
Training data consists of all pairs of trigger candidates that co-occur in the same sentence or
are connected by a co-referent subject/object if they’re in different sentences. Given a trigger
candidate pair (i, i′), we estimate the probabilities for their event types (t_i, t_{i′}) as

Pr_φ[ t_i, t_{i′} | x, i, i′ ] ∝ exp( φ^T g(t_i, t_{i′}, x, i, i′) )     (360)

where g is a feature function that depends on the trigger candidate pair and their context. In
addition to re-using the trigger features in Table 1 of the paper, they also introduce relational
trigger features:
1. whether they’re connected by a conjunction dependency relation
2. whether they share a subject or an object
3. whether they have the same head word lemma
4. whether they share a semantic frame based on FrameNet.
As before, they use L-BFGS to compute the maximum-likelihood estimates of the parameters
φ.

Entity Extraction. Trained a standard linear-chain CRF using the BIO scheme. Their CRF
features:
1. current words and POS tags
2. context words in a window of size 2
3. word type such as all-capitalized, is-capitalized, all-digits
4. Gazetteer-based entity type if the current word matches an entry in the gazetteers col-
lected from Wikipedia.
5. pre-trained word2vec embeddings for each word

Joint Inference. Allows information flow among the 3 local models and finds globally-optimal
assignments of all variables. Define the following objective:
max_{t,r,a} ∑_{i∈T} E(t_i, r_i, a) + ∑_{i,i′∈T} R(t_i, t_{i′}) + ∑_{j∈N} D(a_j)     (361)

where
• The first term is the sum of confidence scores for individual event mentions from the
within-event model.
E(t_i, r_i, a) = log p_θ(t_i) + ∑_{j∈N_i} [ log p_θ(t_i, r_{ij}) + log p_θ(r_{ij}, a_j) ]     (362)

• The second term is the sum of confidence scores for event relations based on the pairwise
event model.

R(ti , ti0 ) = log pφ (ti , ti0 | i, i0 , x) (363)

• The third term is sum of confidence scores for entity mentions, where

D(aj ) = log pψ (aj | j, x) (364)

and pψ (aj | j, x) is the marginal probability derived from the linear-chain CRF.
The optimization is subject to agreement constraints that enforce the overlapping variables
among the 3 components to agree on their values. The joint inference problem can be formu-
lated as an integer linear program (ILP).108 To solve it efficiently, they find solutions for
the relaxation of the problem using a dual decomposition algorithm, AD3.

108. From Wikipedia: An integer linear program in canonical form: maximize c^T x subject to Ax ≤ b, x ≥ 0, x ∈ Z^n.

Papers and Tutorials April 27, 2018

Globally Normalized Transition-Based Neural Networks



D. Andor et al., “Globally Normalized Transition-Based Neural Networks,” (2016).

Introduction. Authors demonstrate that simple FF neural networks can achieve comparable
or better accuracies than LSTMs, as long as they are globally normalized. They don’t use
any recurrence, but perform beam search for maintaining multiple hypotheses and introduce
global normalization with a CRF objective to overcome the label bias problem that locally
normalized models suffer from.

Transition System. Given an input sequence x, define:


• Set of states S(x).
• Start state s† ∈ S(x).
• Set of decisions A(s, x) for all s ∈ S(x).
• Transition function t(s, d, x) returning new state s0 for any decision d ∈ A(s, x).
The scoring function ρ(s, d; θ), which gives the score for decision d in state s, will be defined:
ρ(s, d; θ) = φ(s; θ(l) ) · θ(d) (365)
which is just the familiar logits computation for decision d. θ(l) are the parameters of the
network excluding the parameters at the final layer, θ(d) . φ(s; θ(l) ) gives the representation for
state s computed by the neural network under parameters θ(l) .

Global vs. Local Normalization.


• Local. Conditional probabilities Pr [dj | sj ; θ] are normalized locally over the scores for
each possible action dj from the current state sj .
Pr_L[d_{1:n}] = ∏_{j=1}^n Pr[d_j | s_j; θ] = exp( ∑_{j=1}^n ρ(s_j, d_j; θ) ) / ∏_{j=1}^n Z_L(s_j; θ)     (366)

Z_L(s_j; θ) = ∑_{d′∈A(s_j)} exp( ρ(s_j, d′; θ) )     (367)

Beam search can be used to attempt to find the action sequence with highest probability.
• Global. In contrast, a CRF defines:
Pr_G[d_{1:n}] = exp( ∑_{j=1}^n ρ(s_j, d_j; θ) ) / Z_G(θ)     (368)

Z_G(θ) = ∑_{d′_{1:n}∈D_n} exp( ∑_{j=1}^n ρ(s′_j, d′_j; θ) )     (369)

where D_n is the set of all valid sequences of decisions of length n. The inference problem
is now to find

argmax_{d_{1:n}∈D_n} Pr_G[d_{1:n}] = argmax_{d_{1:n}∈D_n} ∑_{j=1}^n ρ(s_j, d_j; θ)     (370)

and we can also use beam search to approximately find the argmax.

Training. SGD on the NLL of the data under the model. The NLL takes a different form
depending on whether we choose a locally normalized model vs a globally normalized model.
• Local.

L_local(d*_{1:n}; θ) = − ln Pr_L[d*_{1:n}; θ]     (371)
                    = − ∑_{j=1}^n ρ(s*_j, d*_j; θ) + ∑_{j=1}^n ln Z_L(s*_j; θ)     (372)

• Global.

L_global(d*_{1:n}; θ) = − ln Pr_G[d*_{1:n}; θ]     (373)
                     = − ∑_{j=1}^n ρ(s*_j, d*_j; θ) + ln Z_G(θ)     (374)

To make learning tractable for the globally normalized model, the authors use beam search
with early updates, defined as follows. Keep track of the location of the gold path109 in
the beam as the prediction sequence is being constructed. If the gold path is not found in the
beam after step j, run one step of SGD on the following objective:
 
j j
Lglobal−beam (d∗1:j , θ) = − ρ(d∗1:t−1 , d∗t ; θ) − ln exp  ρ(d0 0
X X X
1:t−1 , dt ; θ) (375)

t=1 d01:j ∈Bj t=1

where B_j contains all paths in the beam at step j, together with the gold path prefix d*_{1:j}. If
the gold path remains in the beam throughout decoding, a gradient step is performed using
B_T, the beam at the end of decoding. When training the global model, the authors
first pretrain110 using the local objective function, and then perform additional training steps
using the global objective function.

109
The gold path is the predicted sequence that matches the true labeled sequence, up to the current timestep.

The Label Bias Problem. Locally normalized models often have a very weak ability to
revise earlier decisions. Here we will prove that globally normalized models are strictly
more expressive than locally normalized models111 . Let PL denote the set of all possible
distributions pL (d1:n | x1:n ) under the local model as the scores ρ vary. Let PG be the same,
but for the global model.
Theorem 3.1. P_L is a strict subset112 of P_G, that is, P_L ⊊ P_G. [Margin note: we are assuming
that both P_L and P_G consist of log-linear distributions over scoring functions ρ(d_{1:t−1}, d_t, x_{1:t}).]

In other words, a globally normalized model can model any distribution that a locally normal-
ized one can, but the converse is not true. I’ve worked through the proof below.
ized one can, but the converse is not true. I’ve worked through the proof below.
Proof: P_L ⊊ P_G

Proof that PL ⊆ PG . For any locally normalized model with scores ρL (d1:t−1 , dt , x1:t ), we can define a
corresponding pG over scores

ρG (d1:t−1 , dt , x1:t ) = ln pL (dt | d1:t−1 , x1:t ) (376)

By definition, this means that pG (d1:t | x1:t ) = pL (d1:t | x1:t ).


Proof that P_G ⊈ P_L. A proof by example. Consider a dataset consisting entirely of one of the following tagged
sequences:

x = abc, d = ABC (377)


x = abe, d = ADE (378)

Similar to a typical linear-chain CRF, let T denote the set of observed label transitions, and let E denote the set
of observed (xt , dt ) pairs. Let α be the single scalar parameter of this simple model, where

ρ(d_{1:t−1}, d_t, x_{1:t}) = α ( 1_{(d_{t−1},d_t)∈T} + 1_{(x_t,d_t)∈E} )     (379)

for all t. This results in the following distributions p_G and p_L, evaluating on an input sequence of length 3:

p_G(d_{1:3} | x_{1:3}) = exp( α ∑_{t=1}^3 ( 1_{(d_{t−1},d_t)∈T} + 1_{(x_t,d_t)∈E} ) ) / Z_G(x_{1:3})     (380)
p_L(d_{1:3} | x_{1:3}) = p_L(d_1 | x_1) p_L(d_2 | d_1, x_{1:2}) p_L(d_3 | d_{1:2}, x_{1:3})     (381)

where I’ve written p_L as a product over its local CPDs because it reveals the core observation that the proof
is based on: for any given subsequence (d_{1:t−1}, x_{1:t}), the local CPD is constrained to satisfy
∑_{d_t} p_L(d_t | d_{1:t−1}, x_{1:t}) = 1. With this, the following comparison of p_G and p_L for large α completes the proof of P_G ⊈ P_L:

lim_{α→∞} p_G(ABC | abc) = lim_{α→∞} p_G(ADE | abe) = 1     (382)
p_L(ABC | abc) + p_L(ADE | abe) ≤ 1   (∀α)     (383)

∴ P_L ⊊ P_G.

111. This is for conditional models only.
112. Note that ⊂ and ⊊ mean the same thing here. Matter of notational preference/being explicit/etc.

Papers and Tutorials April 30, 2018

An Introduction to Conditional Random Fields



Sutton et al., “An Introduction to Conditional Random Fields,” (2012).

Graphical Modeling (2.1). Some notation. Denote factors as ψ_a(y_a) where 1 ≤ a ≤ A and
A is the total number of factors. y_a is an assignment to the subset Y_a ⊆ Y of variables associated
with ψ_a. The value returned by ψ_a is a non-negative scalar that can be thought of as a measure
of how compatible the values y_a are with each other. Given a collection of subsets {Y_a}_{a=1}^A of
Y, an undirected graphical model is the set of all distributions that can be written as

p(y) = (1/Z) ∏_{a=1}^A ψ_a(y_a)     (384)
Z = ∑_y ∏_{a=1}^A ψ_a(y_a)     (385)

for any choice of factors F = {ψa } that have ψa (ya ) ≥ 0 for all ya . The sum for the partition
function, Z, is over all possible assignments y of the variables Y . We’ll use the term ran-
dom field to refer to a particular distribution among those defined by an undirected model113 .

We can represent the factorization with a factor graph: a bipartite graph G = (V, F, E) in
which one set of nodes V = {1, 2, . . . , |Y |} indexes the RVs in the model, and the set of nodes
F = {1, 2, . . . , A} indexes the factors. A connection between a variable node Ys for s ∈ V to
some factor node ψa for a ∈ F means that Ys is one of the arguments of ψa . It is common to
draw the factor nodes as squares, and the variable nodes as circles.

Generative versus Discriminative Models (2.2). Naive Bayes is generative, while logistic
regression (a.k.a maximum entropy) is discriminative. Recall that Naive Bayes and logistic are
defined as, respectively,
p(y, x) = p(y) ∏_{k=1}^K p(x_k | y)     (386)
p(y | x) = (1/Z(x)) exp( ∑_{k=1}^K θ_k f_k(y, x) )     (387)

where the f_k in the definition of logistic regression denote the feature functions. We could set
them, for example, as f_{y′,j}(y, x) = 1_{y′=y} x_j.

113
i.e. a particular set of factors.

An example generative model for sequence prediction is the HMM. Recall that an HMM
defines

p(y, x) = ∏_{t=1}^T p(y_t | y_{t−1}) p(x_t | y_t)     (388)

where we are using the dummy notation of assuming an initial state y_0, clamped to 0, that
begins every state sequence, so we can write the initial state distribution as p(y_1 | y_0).

We see that the generative models, like naive Bayes and the HMM, define a family of joint
distributions that factorizes as p(y, x) = p(y)p(x | y). Discriminative models, like logistic re-
gression, define a family of conditional distributions p(y | x). The main conceptual difference
here is that a conditional distribution p(y | x) doesn’t include a model of p(x). The principal
advantage of discriminative modeling is that it’s better suited to include rich, overlapping fea-
tures. Discriminative models like CRFs make conditional independence assumptions both (1)
among y and (2) about how the y can depend on x, but do not make conditional independence
assumptions among x.

The difference between NB and LR is due only to the fact that NB is generative and LR is
discriminative. Any LR classifier can be converted into a NB classifier with the same decision
boundary, and vice versa. In other words, NB defines the same family as LR, if we interpret
NB generatively as
p(y, x) = exp( ∑_k θ_k f_k(y, x) ) / ∑_{ỹ,x̃} exp( ∑_k θ_k f_k(ỹ, x̃) )     (389)

and train it to maximize the conditional likelihood. Similarly, if the LR model is interpreted
as above, and trained to maximize the joint likelihood, then we recover the same classifier as
NB.

Linear-Chain CRFs (2.3). Key point: the conditional distribution p(y | x) that follows from
the joint distribution p(y, x) of an HMM is in fact a CRF with a particular choice of feature
functions. First, we rewrite the HMM joint in a form that’s more amenable to generalization:

p(y, x) = (1/Z) ∏_{t=1}^T exp( ∑_{i,j∈S} θ_{i,j} 1_{y_t=i, y_{t−1}=j} + ∑_{i∈S, o∈O} µ_{o,i} 1_{y_t=i, x_t=o} )     (390)   [HMM joint distribution]
        = (1/Z) ∏_{t=1}^T exp( ∑_{k=1}^K θ_k f_k(y_t, y_{t−1}, x_t) )     (391)

and the latter provides the more compact notation114 . We can use Bayes rule to then write
p(y | x), which would give us a particular kind of linear-chain CRF that only includes features
for the current word’s identity. The general definition of linear-chain CRFs is given below:
Let Y, X be random vectors, θ = {θ_k} ∈ R^K be a parameter vector, and F = {f_k(y, y′, x_t)}_{k=1}^K
be a set of real-valued feature functions. Then a linear-chain conditional random field
is a distribution p(y | x) that takes the form:

p(y | x) = (1/Z(x)) ∏_{t=1}^T exp( ∑_k θ_k f_k(y_t, y_{t−1}, x_t) )     (392)
Z(x) = ∑_y ∏_{t=1}^T exp( ∑_k θ_k f_k(y_t, y_{t−1}, x_t) )     (393)

Notice that a linear chain CRF can be described as a factor graph over x and y, where each
local function (factor) ψt has the special log-linear form:
ψ_t(y_t, y_{t−1}, x_t) = exp( ∑_k θ_k f_k(y_t, y_{t−1}, x_t) )     (394)

General CRFs (2.4). Let G be a factor graph over X and Y . Then (X, Y ) is a conditional
random field if for any value x of X, the distribution p(y | x) factorizes according to G. If
F = {ψa } is the set of A factors in G, then the conditional distribution for a CRF is
p(y | x) = (1/Z(x)) ∏_{a=1}^A ψ_a(y_a, x_a)     (395)

It is often useful to require that the factors be log-linear over a prespecified set of feature
functions, which allows us to write the conditional distribution as

p(y | x) = (1/Z(x)) ∏_{ψ_a∈F} exp( ∑_{k=1}^{K(a)} θ_{a,k} f_{a,k}(y_a, x_a) )     (396)

114
Note how we collapsed the summations over i, j and i, o to simply k. This is purely notational. Each
value k can be mapped to/from a unique i, j or i, o in the first version. Also note that, necessarily, each feature
function fk in the latter version maps to a specific indicator function 1 in the first.

In addition, most models rely extensively on parameter tying.115 To denote this, we partition
the factors of G into C = {C_1, C_2, . . . , C_P}, where each C_p is a clique template: a set of factors
sharing a set of feature functions {f_{p,k}(x_c, y_c)}_{k=1}^{K(p)} and a corresponding set of parameters
θ_p ∈ R^{K(p)}. A CRF that uses clique templates can be written as

p(y | x) = (1/Z(x)) ∏_{C_p∈C} ∏_{ψ_c∈C_p} ψ_c(x_c, y_c; θ_p)     (397)
         = (1/Z(x)) ∏_{C_p∈C} ∏_{ψ_c∈C_p} exp( ∑_{k=1}^{K(p)} θ_{p,k} f_{p,k}(x_c, y_c) )     (398)

In a linear-chain CRF, typically one uses one clique template C0 = {ψt }Tt=1 . Again, each factor
in a given template shares the same feature functions and parameters, so the previous sentence
means that we reuse the set of features and parameters for each timestep.

Feature engineering (2.5).


• Label-observation features. When our label variables are discrete, the features fp,k
of a clique template Cp are ordinarily chosen to have a particular form:

f_{p,k}(y_c, x_c) = 1_{y_c = ỹ_c} q_{p,k}(x_c)     (399)

and we refer to the functions qp,k (xc ) as observation functions.


• Unsupported features. Many observation-label pairs may never occur in our training
data (e.g. having the word “with” being associated with label “CITY”). Such features
are called unsupported features, and can be useful since often their weights will be driven
negative, which can help prevent the model from making predictions in the future that
are far from what was seen in the training data.
• Edge-Observation and Node-Observation features: the two most common types
of label-observation features. Edge-observation features are for the transition factors,
while node-observation features are the form introduced for label-observation features
above.

[edge-obs]   f(y_t, y_{t−1}, x_t) = q_m(x_t) 1_{y_t=y, y_{t−1}=y′}   (∀y, y′ ∈ Y, ∀m)     (400)
[node-obs]   f(y_t, y_{t−1}, x_t) = 1_{y_t=y, y_{t−1}=y′}   (∀y, y′ ∈ Y)     (401)

and both use the same f(y_t, x_t) = q_m(x_t) 1_{y_t=y}   (∀y ∈ Y, ∀m). Recall that m is the index
into our set of observation features116 .
• Feature Induction. The model begins with a number of base features, and the training
procedure adds conjunctions of those features.

115
Note how, for CRFs, the actual parameters θ are tightly coupled with the feature functions f .
116
In CRFSuite, the observation features are all the attributes we define, and any features that use both label
and observation are defined within CRFSuite itself.

4.35.1 Inference (Sec. 4)

There are two inference problems that arise:


1. Wanting to predict the labels of a new input x using the most likely labeling y ∗ =
arg maxy p(y | x).
2. Computing marginal distributions (during parameter estimation, for example) such as
node marginals p(yt | x) and edge marginals p(yt , yt−1 | x).
For linear-chain CRFs, the forward-backward algorithm is used for computing marginals,
and the Viterbi algorithm for computing the most probable assignment. We’ll first derive
these for the case of HMMs, and then generalize to the linear-chain CRF case.

Forward-backward algorithm (HMMs). An efficient technique for computing marginals.


We begin by writing out p(x), and using the distributive law to convert the sum of products
to a product of sums:
p(x) = ∑_y p(x, y)     (402)        [margin note: we define ψ_t as p(y_t | y_{t−1}) p(x_t | y_t)]
     = ∑_y ∏_{t=1}^T ψ_t(y_t, y_{t−1}, x_t)     (403)
     = ∑_{y_1} p(y_1)p(x_1 | y_1) ∑_{y_2} p(y_2 | y_1)p(x_2 | y_2) ∑_{y_3} · · · ∑_{y_T} p(y_T | y_{T−1})p(x_T | y_T)     (404)
     = ∑_{y_1} ψ_1(y_1, x_1) ∑_{y_2} · · · ∑_{y_T} ψ_T(y_T, y_{T−1}, x_T)     (405)
     = ∑_{y_T} ∑_{y_{T−1}} ψ_T(y_T, y_{T−1}, x_T) ∑_{y_{T−2}} · · · ∑_{y_1} ψ_1(y_1, x_1)     (406)

We see that we can save an exponential amount of work by caching the inner sums as we go.
Let M denote the number of possible states for the y variables. We define a set of T forward
variables αt , each of which is a vector of size M :

α_t(j) ≜ p(x_{⟨1...t⟩}, y_t = j)     (407)
       = ∑_{y_{⟨1...t−1⟩}} p(x_{⟨1...t⟩}, y_{⟨1...t−1⟩}, y_t = j)     (408)
       = ∑_{y_{⟨1...t−1⟩}} ψ_t(j, y_{t−1}, x_t) ∏_{t′=1}^{t−1} ψ_{t′}(y_{t′}, y_{t′−1}, x_{t′})     (409)
       = ∑_{i∈S} ψ_t(j, i, x_t) ∑_{y_{⟨1...t−2⟩}} ψ_{t−1}(y_{t−1} = i, y_{t−2}, x_{t−1}) ∏_{t′=1}^{t−2} ψ_{t′}(y_{t′}, y_{t′−1}, x_{t′})     (410)
       = ∑_{i∈S} ψ_t(j, i, x_t) α_{t−1}(i)     (411)

where S is the set of M possible states. By recognizing that p(x) = ∑_{j∈S} ∑_{y_{⟨1...T−1⟩}} p(x_{⟨1...T⟩}, y_{⟨1...T−1⟩}, y_T = j),
we can rewrite p(x) as

p(x) = ∑_{j∈S} α_T(j)     (412)

Notice how in the step from equation 407 to 408, we marginalized over all possible y sub-
sequences that could’ve been aligned with xh1...ti . We will repeat this pattern to derive the
backward recursion for βt , which is the same idea except now we go from T backward until
some t (instead of going from 1 forward until some t).

β_t(i) ≜ p(x_{⟨t+1...T⟩} | y_t = i)     (413)        [we initialize β_T(i) = 1]
       = ∑_{y_{⟨t+1...T⟩}} ψ_{t+1}(y_{t+1}, i, x_{t+1}) ∏_{t′=t+2}^T ψ_{t′}(y_{t′}, y_{t′−1}, x_{t′})     (414)
       = ∑_{y_{t+1}} ψ_{t+1}(y_{t+1}, i, x_{t+1}) β_{t+1}(y_{t+1})     (415)

Similar to how we obtained equation 412, we can rewrite p(x) in terms of the β:

p(x) = β_0(y_0) ≜ ∑_{y_1} ψ_1(y_1, y_0, x_1) β_1(y_1)     (416)

We can then combine the definition for α and β to compute marginals of the form p(y_{t−1}, y_t | x):

p(y_{t−1}, y_t | x) = (1/p(x)) α_{t−1}(y_{t−1}) ψ_t(y_t, y_{t−1}, x_t) β_t(y_t)     (417)
where p(x) = ∑_{j∈S} α_T(j) = β_0(y_0)     (418)

In summary, the forward-backward algorithm consists of the following steps:

1. Compute α_t for all t using the recursion in equation 411.
2. Compute β_t for all t using the recursion in equation 415.
3. Return the marginal distributions computed from equation 417.

Viterbi algorithm (HMMs). For computing y* = argmax_y p(y | x). The derivation is
nearly the same as how we derived the forward-backward algorithm, except now we’ve replaced
the summations in equation 406 with maximization. The analog of α for Viterbi is defined
as:

δ_t(j) = max_{y_{⟨1...t−1⟩}} ψ_t(j, y_{t−1}, x_t) ∏_{t′=1}^{t−1} ψ_{t′}(y_{t′}, y_{t′−1}, x_{t′})     (419)
       = max_{i∈S} ψ_t(j, i, x_t) δ_{t−1}(i)     (420)

and the maximizing assignment is computed by a backwards recursion,

y*_T = argmax_{i∈S} δ_T(i)     (421)
y*_t = argmax_{i∈S} ψ_{t+1}(y*_{t+1}, i, x_{t+1}) δ_t(i)   for t < T     (422)

Computing the recursions for δ_t and y*_t together is the Viterbi algorithm.
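A numpy sketch of the recursions (419)–(422), working in log space and using back-pointers (equivalent to the backwards re-maximization in (422)); the potentials are passed in as a dense table log_psi[t, j, i] = log ψ_t(y_t = j, y_{t−1} = i, x_t).

```python
import numpy as np

def viterbi(log_psi):
    """log_psi: shape (T, M, M); log_psi[t, j, i] = log psi_t(y_t=j, y_{t-1}=i, x_t).
    At t = 0 the i axis refers to the fixed dummy initial state, so only i = 0 is used."""
    T, M, _ = log_psi.shape
    delta = np.zeros((T, M))
    back = np.zeros((T, M), dtype=int)
    delta[0] = log_psi[0, :, 0]                       # psi_1(y_1, y_0, x_1)
    for t in range(1, T):
        scores = log_psi[t] + delta[t - 1][None, :]   # [j, i]: psi_t(j, i, x_t) * delta_{t-1}(i), in log space
        back[t] = scores.argmax(axis=1)
        delta[t] = scores.max(axis=1)
    y = np.zeros(T, dtype=int)
    y[-1] = delta[-1].argmax()                        # Eq. (421)
    for t in range(T - 2, -1, -1):                    # Eq. (422), via back-pointers
        y[t] = back[t + 1][y[t + 1]]
    return y

# toy usage: 4 timesteps, 3 states, random potentials
print(viterbi(np.random.randn(4, 3, 3)))
```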

Forward-backward and Viterbi for linear-chain CRF. Generalizing to the linear-chain
CRF, where now

p(y | x) = (1/Z(x)) ∏_{t=1}^T ψ_t(y_t, y_{t−1}, x_t)     (423)
where ψ_t(y_t, y_{t−1}, x_t) = exp{ ∑_k θ_k f_k(y_t, y_{t−1}, x_t) }     (424)
and the results for the forward-backward algorithm become

p(y_{t−1}, y_t | x) = (1/Z(x)) α_{t−1}(y_{t−1}) ψ_t(y_t, y_{t−1}, x_t) β_t(y_t)     (425)
p(y_t | x) = (1/Z(x)) α_t(y_t) β_t(y_t)     (426)
where Z(x) = ∑_{j∈S} α_T(j) = β_0(y_0)     (427)

Note that the interpretation is also slightly different. The α, β, and δ variables should only be
interpreted with the factorization formulas, and not as probabilities. Specifically, use
α_t(j) = ∑_{y_{⟨1...t−1⟩}} exp{ ∑_k θ_k f_k(j, y_{t−1}, x_t) } ∏_{t′=1}^{t−1} exp{ ∑_k θ_k f_k(y_{t′}, y_{t′−1}, x_{t′}) }     (428)
β_t(i) = ∑_{y_{⟨t+1...T⟩}} exp{ ∑_k θ_k f_k(y_{t+1}, i, x_{t+1}) } ∏_{t′=t+2}^T exp{ ∑_k θ_k f_k(y_{t′}, y_{t′−1}, x_{t′}) }     (429)
δ_t(j) = max_{y_{⟨1...t−1⟩}} exp{ ∑_k θ_k f_k(j, y_{t−1}, x_t) } ∏_{t′=1}^{t−1} exp{ ∑_k θ_k f_k(y_{t′}, y_{t′−1}, x_{t′}) }     (430)

Markov Chain Monte Carlo (MCMC). The two most popular classes of approximate infer-
ence algorithms are Monte Carlo algorithms and variational algorithms. In what follows,
we drop the CRF-specific notation and refer to the more general joint distribution

p(y) = Z^{−1} ∏_{a∈F} ψ_a(y_a)     (431)

that factorizes according to some factor graph G = (V, F ). MCMC methods construct a
Markov chain whose state space is the same as that of Y , and sample from this chain to
approximate, e.g., the expectation of some function f (y) over the distribution p(y). MCMC
algorithms aren’t commonly applied in the context of CRFs, since parameter estimation by
maximum likelihood requires calculating marginals many times.

4.35.2 Parameter Estimation (Sec. 5)

Maximum Likelihood for Linear-Chain CRFs. Since we’re modeling the conditional
distribution with CRFs, we use the conditional log likelihood ℓ(θ) with l2-regularization:

ℓ(θ) = ∑_{i=1}^N log p(y^{(i)} | x^{(i)}; θ) − (1/2σ^2) ∑_{k=1}^K θ_k^2     (432)
     = ∑_{i=1}^N ∑_{t=1}^T ∑_{k=1}^K θ_k f_k(y_t^{(i)}, y_{t−1}^{(i)}, x_t^{(i)}) − ∑_{i=1}^N log Z(x^{(i)}) − (1/2σ^2) ∑_{k=1}^K θ_k^2     (433)

with regularization parameter 1/2σ^2. The partial derivatives are

∂ℓ/∂θ_k = ∑_{i,t} f_k(y_t^{(i)}, y_{t−1}^{(i)}, x_t^{(i)}) − ∑_{i,t,y,y′} f_k(y, y′, x_t^{(i)}) p(y, y′ | x^{(i)}) − θ_k/σ^2     (434)

and the derivation for the partial derivative of log(Z(x)) is

∂ log Z(x)/∂θ_k = (1/Z(x)) ∂/∂θ_k [ ∑_{y_{⟨1...T⟩}} ∏_{t=1}^T exp{ ∑_k θ_k f_k(y_t, y_{t−1}, x_t) } ]     (435)
                = (1/Z(x)) ∑_{y_{⟨1...T⟩}} ∂/∂θ_k exp{ ∑_t ∑_k θ_k f_k(y_t, y_{t−1}, x_t) }     (436)
                = (1/Z(x)) ∑_{y_{⟨1...T⟩}} [ ∑_t f_k(y_t, y_{t−1}, x_t) ] exp{ ∑_t ∑_k θ_k f_k(y_t, y_{t−1}, x_t) }     (437)
                = ∑_t ∑_{y_t} ∑_{y_{t−1}} f_k(y_t, y_{t−1}, x_t) [ ∑_{y_{⟨1...t−2⟩}} ∑_{y_{⟨t+1...T⟩}} (1/Z(x)) exp{ ∑_t ∑_k θ_k f_k(y_t, y_{t−1}, x_t) } ]     (438)
                = ∑_t ∑_y ∑_{y′} f_k(y, y′, x_t) p(y_t = y, y_{t−1} = y′ | x)     (439)

We can rewrite this in the form of expectations. For now, let p̃(y, x) denote the empirical
distribution, and let p̂(y | x; θ) p̃(x) denote the model distribution.

∂ℓ/∂θ_k = E_{x,y∼p̃(y,x)}[ ∑_t f_k(y_t, y_{t−1}, x_t) ] − E_{x,y∼p̂(y,x)}[ ∑_t f_k(y_t, y_{t−1}, x_t) ]     (440)

Procedure: Training Linear-Chain CRFs

Here I summarize the main steps involved during a parameter update.

Inference. We need to compute the log probabilities for each instance in the dataset, under the current
parameters. We will need them when evaluating Z(x) and the marginals p(y, y′ | x) when computing
gradients.
1. Initialize α_1(j) = exp{ ∑_k θ_k f_k(j, y_0, x_1) } (y_0 is the fixed initial state) and β_T(i) = 1.
2. For all t, compute:

α_t(j) = ∑_{i∈S} exp{ ∑_k θ_k f_k(j, i, x_t) } α_{t−1}(i)     (441)
β_t(j) = ∑_{i∈S} exp{ ∑_k θ_k f_k(i, j, x_{t+1}) } β_{t+1}(i)     (442)

3. Compute the marginals:

p(y_t, y_{t−1} | x) = (1/Z(x)) α_{t−1}(y_{t−1}) ψ_t(y_t, y_{t−1}, x_t) β_t(y_t)     (443)
p(y_t | x) = (1/Z(x)) α_t(y_t) β_t(y_t)     (444)
where Z(x) = ∑_{j∈S} α_T(j) = β_0(y_0)     (445)

Gradients. For each parameter θ_k, compute:

∂ℓ/∂θ_k = ∑_{i,t} f_k(y_t^{(i)}, y_{t−1}^{(i)}, x_t^{(i)}) − ∑_{i,t,y,y′} f_k(y, y′, x_t^{(i)}) p(y, y′ | x^{(i)}) − θ_k/σ^2     (446)
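A numpy sketch of the inference pass above for one sequence, with the factors given as a dense table log_psi[t, j, i] = ∑_k θ_k f_k(y_t = j, y_{t−1} = i, x_t); it returns Z(x) and the edge marginals that feed the gradient. Exponentiating directly, as here, will overflow on long sequences; a real implementation would work in log space with log-sum-exp.

```python
import numpy as np

def forward_backward(log_psi):
    """log_psi[t, j, i] = sum_k theta_k f_k(y_t=j, y_{t-1}=i, x_t); t = 0 uses the dummy y_0 (i = 0)."""
    T, M, _ = log_psi.shape
    psi = np.exp(log_psi)
    alpha = np.zeros((T, M))
    beta = np.ones((T, M))
    alpha[0] = psi[0, :, 0]                               # initialization with the fixed y_0
    for t in range(1, T):
        alpha[t] = psi[t] @ alpha[t - 1]                  # Eq. (441)
    for t in range(T - 2, -1, -1):
        beta[t] = psi[t + 1].T @ beta[t + 1]              # Eq. (442)
    Z = alpha[-1].sum()                                   # Eq. (445)
    edge = np.zeros_like(psi)
    edge[0, :, 0] = alpha[0] * beta[0] / Z                # at t = 0 this is the node marginal p(y_1 | x)
    for t in range(1, T):
        edge[t] = psi[t] * np.outer(beta[t], alpha[t - 1]) / Z   # Eq. (443): p(y_t=j, y_{t-1}=i | x)
    return Z, edge

# The gradient (Eq. 446) then contrasts observed feature counts with the counts
# expected under `edge`, minus the regularization term theta_k / sigma^2.
```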

CRF with latent variables. Now we have additional latent variables z:

p(y, z | x) = (1/Z(x)) ∏_t ψ_t(y_t, y_{t−1}, z_t, x_t)     (447)

Since we observe y during training, what if we instead treat this as a CRF with both x and y
observed?

p(z | y, x) = (1/Z(y, x)) ∏_t ψ_t(y_t, y_{t−1}, z_t, x_t)     (448)
Z(y, x) = ∑_z ∏_t ψ_t(y_t, y_{t−1}, z_t, x_t)     (449)

We can use the same inference algorithms as usual to compute Z(y, x), and the key result is
that we can now write

p(y | x) = (1/Z(x)) ∑_z ∏_t ψ_t(y_t, y_{t−1}, z_t, x_t) = Z(y, x) / Z(x)     (450)

Finally, we can write the gradient as117

∂ℓ/∂θ_k = ∑_z p(z | y, x) ∂/∂θ_k [ log p(y, z | x) ]     (451)
        = ∑_t ∑_{z_t} [ p(z_t | y, x) f_k(y_t, y_{t−1}, z_t, x_t) − ∑_{y,y′} p(z_t, y, y′ | x_t) f_k(y_t, y_{t−1}, z_t, x_t) ]     (452)

where I’ve assumed there are no connections between any z_t and z_{t′≠t}.

Stochastic Gradient Methods. Now we compute gradients for a single example, and for a
linear-chain CRF:

∂ℓ_i/∂θ_k = ∑_t f_k(y_t^{(i)}, y_{t−1}^{(i)}, x_t^{(i)}) − ∑_{t,y,y′} f_k(y, y′, x_t^{(i)}) p(y, y′ | x^{(i)}) − θ_k/(Nσ^2)     (453)

which corresponds to the parameter update (remember: we are using the LL, not the negative
LL):

θ^{(m)} = θ^{(m−1)} + α_m ∇ℓ_i(θ^{(m−1)})     (454)

where m denotes that this is the mth update of the training process.

117. This uses the trick df/dθ = f(θ) · d(log f)/dθ.

4.35.3 Related Work and Future Directions (Sec. 6)

MEMMs. Maximum-entropy Markov models. Essentially a Markov model in which the
transition probabilities are given by logistic regression. Formally, a MEMM is defined by

p_MEMM(y | x) = ∏_{t=1}^T p(y_t | y_{t−1}, x)     (455)
p(y_t | y_{t−1}, x) = exp( ∑_{k=1}^K θ_k f_k(y_t, y_{t−1}, x_t) ) / Z_t(y_{t−1}, x)     (456)
Z_t(y_{t−1}, x) = ∑_{y′} exp( ∑_{k=1}^K θ_k f_k(y′, y_{t−1}, x_t) )     (457)

which has some important differences compared to the linear-chain CRF. Notice how maximum-
likelihood training of MEMMs does not require performing inference over full output se-
quences y, because Z_t is a simple sum over the labels at a single position. MEMMs, however,
suffer from label bias, while CRFs do not.

Bayesian CRFs. Instead of predicting the optimal labeling y*_{ML} for input sequence x with
maximum likelihood (ML), we can instead use a fully Bayesian (B) approach; both are
shown below for comparison:

y_{ML} ← argmax_y p(y | x; θ̂)     (458)
y_B ← argmax_y E_{θ∼p(θ|x)}[ p(y | x; θ) ]     (459)

Unfortunately, computing the exact integral (the expectation) is usually intractable, and we
must resort to approximate methods like MCMC.

Papers and Tutorials May 02, 2018

Co-sampling: Training Robust Networks for Extremely Noisy Supervision



Han et al., “Co-sampling: Training Robust Networks for Extremely Noisy Supervision,” (2018).

Introduction. The authors state that current methodologies [for training networks under
noisy labels] involve estimating the noise transition matrix (which they don’t define).
Patrini et al. (2017) define the matrix as follows:

Denote by T ∈ [0, 1]^{c×c} the noise transition matrix specifying the probability of one label being flipped
to another, so that ∀i, j:  T_{ij} ≜ Pr[ ỹ = e_j | y = e_i ]. The matrix is row-stochastic118 and not necessarily
symmetric across the classes.

Algorithm. Authors propose a learning paradigm called Co-sampling. They maintain two
networks f_{w_1} and f_{w_2} simultaneously. For each mini-batch D̂, each network selects R(T)
small-loss instances as a “clean” mini-batch, D̂_1 and D̂_2 respectively. Each of the two networks
then uses the clean mini-batch data to update the parameters w_2 (w_1) of its peer network.
• Why small-loss instances? Because deep networks tend to fit clean instances first, then
noisy/harder instances progressively after.
• Why two networks? Because if we just trained a single network on clean instances,
we would not be robust in extremely high-noise rates, since the training error would
accumulate if the selected instances are not “fully clean.”
The Co-sampling paradigm algorithm is shown below.
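A small Python sketch of the core selection-and-exchange step, assuming per-example losses from each network on the same mini-batch: each network keeps its R(T) fraction of smallest-loss examples and hands those indices to its peer for the parameter update. The R(T) schedule and the actual SGD updates are left abstract.

```python
import numpy as np

def small_loss_indices(losses, keep_fraction):
    """Indices of the `keep_fraction` smallest-loss examples (treated as 'clean')."""
    k = max(1, int(round(keep_fraction * len(losses))))
    return np.argsort(losses)[:k]

def cosampling_step(loss_f1, loss_f2, keep_fraction):
    """One exchange: f1's clean picks are used to update w2, and vice versa."""
    clean_for_f2 = small_loss_indices(loss_f1, keep_fraction)   # D-hat_1, consumed by the peer
    clean_for_f1 = small_loss_indices(loss_f2, keep_fraction)   # D-hat_2
    return clean_for_f1, clean_for_f2

# toy usage with random per-example losses on a mini-batch of 16
idx1, idx2 = cosampling_step(np.random.rand(16), np.random.rand(16), keep_fraction=0.7)
```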

118
Each row sums to 1.

Papers and Tutorials June 30, 2018

Hidden-Unit Conditional Random Fields



Maaten et al., “Hidden-Unit Conditional Random Fields,” (2011).

Introduction. Three key advantages of CRFs over HMMs:


1. CRFs don’t assume that the observations are conditionally independent given the hidden
(or target if linear-chain CRF) states.
2. CRFs don't suffer from the label-bias problem of models that do local probability normalization[119].
3. For certain choices of factors, the negative conditional L.L. is convex.
The hidden-unit CRF (HUCRF), similar to discriminative restricted Boltzmann machines
(RBMs), has binary stochastic hidden units that are conditionally independent given the data
and the label sequence. By exploiting the conditional independence properties, we can effi-
ciently compute:
1. The exact gradient of the C.L.L.
2. The most likely label sequence.
3. The marginal distributions over label sequences.

Hidden-Unit CRFs. At each time step t, the HUCRF employs H binary stochastic hidden
units zt . It models the conditional distribution as
$$p(y \mid x) = \frac{1}{Z(x)} \sum_z \exp\left( E(x, z, y) \right) \tag{460}$$
$$E(x, z, y) = \sum_{t=2}^{T} y_{t-1}^T A\, y_t + \sum_{t=1}^{T} \left[ x_t^T W z_t + z_t^T V y_t + b^T z_t + c^T y_t \right] + y_1^T \pi + y_T^T \tau \tag{461}$$

Since the hidden units are conditionally independent given the data and labels, the hidden
units can be marginalized out one-by-one. This, along with the nice property that the hidden
units have binary elements, allows us to write p(y | x) without writing any zt explicitly, as

[119] See the introduction in my CRF notes for a recap of label bias.

shown below:
$$p(y \mid x) = \frac{\exp\{y_1^T \pi + y_T^T \tau\}}{Z(x)} \prod_{t=1}^{T} \exp\{c^T y_t + y_{t-1}^T A y_t\} \prod_{h=1}^{H} \sum_{z_h \in \{0,1\}} \exp\{z_h b_h + z_h w_h^T x_t + z_h v_h^T y_t\} \tag{462}$$
$$= \frac{\exp\{y_1^T \pi + y_T^T \tau\}}{Z(x)} \prod_{t=1}^{T} \exp\{c^T y_t + y_{t-1}^T A y_t\} \prod_{h=1}^{H} \left( 1 + \exp\{b_h + w_h^T x_t + v_h^T y_t\} \right) \tag{463}$$

For inference, we’ll need an algorithm for computing the marginals p(yt | x) and p(yt , yt−1 | x).
The equations are essentially the same as the forward-backward formulas for the linear-chain
CRF, but with summations over z:
$$p(y_t, y_{t-1} \mid x) \propto \alpha_{t-1}(y_{t-1}) \Big( \sum_{z_t} \Psi_t(x_t, z_t, y_t, y_{t-1}) \Big) \beta_t(y_t) \tag{464}$$
$$p(y_t \mid x) \propto \alpha_t(y_t)\, \beta_t(y_t) \tag{465}$$
$$\alpha_t(j) = \sum_{i \in \mathcal{Y}} \sum_{z_t} \Psi_t(x_t, z_t, j, i)\, \alpha_{t-1}(i) \tag{467}$$
$$\beta_t(j) = \sum_{i \in \mathcal{Y}} \sum_{z_{t+1}} \Psi_{t+1}(x_{t+1}, z_{t+1}, i, j)\, \beta_{t+1}(i) \tag{468}$$

Training. The conditional log likelihood for a single example (x, y) is (bias and initial-state
terms omitted)
$$\mathcal{L} = \log p(y \mid x) \tag{469}$$
$$= \sum_{t=1}^{T} \log\Big( \sum_{z_t} \Psi_t(x_t, z_t, y_{t-1}, y_t) \Big) - \log Z(x) \tag{470}$$
$$\text{where } \Psi_t := \exp\{y_{t-1}^T A y_t + x_t^T W z_t + z_t^T V y_t\} \tag{471}$$
Let $\Upsilon = \{W, V, b, c\}$ be the set of model parameters. The gradient w.r.t. the data-dependent[120] parameters $v \in \Upsilon$ is given by
$$\frac{\partial \mathcal{L}}{\partial v} = \sum_{t=1}^{T} \sum_{k \in \mathcal{Y}} \sum_{h=1}^{H} \left( \mathbb{1}_{y_t = k} - p(y_t = k \mid x) \right) \sigma\big(o_{hk}(x_t)\big) \frac{\partial o_{hk}(x_t)}{\partial v} \tag{472}$$
$$\text{where } o_{hk}(x_t) = b_h + c_k + V_{hk} + w_h^T x_t \tag{473}$$
Unfortunately, the negative CLL is non-convex, and so we are only guaranteed to converge
to a local maximum of the CLL.
[120] The data-dependent parameters are the individual elements of the members of $\Upsilon$. “Data” here means $(x, y)$. Notice that $\Upsilon$ does not include $A$, $\pi$, or $\tau$.

4.37.1 Detailed Derivations

Unfortunately, the paper leaves out a lot of details regarding derivations and implementations. I'm going to work through them here. First, a recap of the main equations, with all biases/initial states included and nothing left out[121]:
$$p(y \mid x) = \frac{\exp\left\{y_1^T \pi + y_T^T \tau\right\}}{Z(x)} \prod_{t=1}^{T} \exp\{c^T y_t + y_{t-1}^T A y_t\} \prod_{h=1}^{H} \left( 1 + \exp\{b_h + w_h^T x_t + v_h^T y_t\} \right) \tag{474}$$
$$NLL = -\sum_{i=1}^{N} \log p(y^{(i)} \mid x^{(i)}) \tag{475}$$
$$= -\sum_{i=1}^{N} \left[ \sum_{t=1}^{T} \log\Big( \sum_{z_t} \psi_t(x_t, z_t, y_{t-1}, y_t) \Big) - \log Z(x^{(i)}) \right] \tag{476}$$

The above formula for p(y | x) implies something that will be very useful:

$$\sum_{z_t} \psi_t(x_t, z_t, y_{t-1}, y_t) = \exp\{c^T y_t + y_{t-1}^T A y_t\} \prod_{h=1}^{H} \left( 1 + \exp\{b_h + w_h^T x_t + v_h^T y_t\} \right) \tag{477}$$

Using the generalization of the product rule for derivatives over N products, we can derive that
$$\frac{\partial}{\partial v} \prod_h \left( 1 + \exp\{o(h)\} \right) = \left[ \prod_h \left( 1 + \exp\{o(h)\} \right) \right] \left[ \sum_h \sigma\big(o(h)\big) \frac{\partial o(h)}{\partial v} \right] \tag{478}$$
This means the derivatives of $\sum_{z_t} \psi_t$ with respect to the data-dependent params $v_{dat}$ and transition params $v_{tr}$ are:
$$\frac{\partial \sum_{z_t} \psi_t}{\partial v_{dat}} = \left[ \sum_h \sigma\big(o(h, y_t)\big) \frac{\partial o(h, y_t)}{\partial v_{dat}} \right] \sum_{z_t} \psi_t \tag{479}$$
$$\frac{\partial \sum_{z_t} \psi_t}{\partial v_{tr}} = \left[ \frac{\partial}{\partial v_{tr}} \left( c^T y_t + y_{t-1}^T A y_t \right) \right] \sum_{z_t} \psi_t \tag{480}$$

which also conveniently means that
$$\frac{\partial}{\partial v_{dat}} \log\Big( \sum_{z_t} \psi_t \Big) = \sum_h \sigma\big(o(h, y_t)\big) \frac{\partial o(h, y_t)}{\partial v_{dat}} \tag{481}$$
$$\frac{\partial}{\partial v_{tr}} \log\Big( \sum_{z_t} \psi_t \Big) = \frac{\partial}{\partial v_{tr}} \left( c^T y_t + y_{t-1}^T A y_t \right) \tag{482}$$

[121] The equation for $p(y \mid x)$ from the paper, and thus here, is technically incorrect: the term $\exp\{c^T y_t + y_{t-1}^T A y_t\}$ should not be included in the product over $t$ for $t = 1$.

I'll now proceed to derive the gradients of the negative (conditional) log-likelihood for the main parameters. We can save some time by getting the base formula for any of the gradients with respect to a specific single parameter $v$:
$$\frac{\partial NLL}{\partial v} = -\sum_{i=1}^{N} \left[ \sum_{t=1}^{T} \frac{\partial}{\partial v} \log\Big( \sum_{z_t} \psi_t(x_t^{(i)}, z_t, y_{t-1}^{(i)}, y_t^{(i)}) \Big) - \frac{\partial}{\partial v} \log Z(x^{(i)}) \right] \tag{483}$$
$$\frac{\partial \log Z(x^{(i)})}{\partial v} = \frac{1}{Z(x^{(i)})} \frac{\partial}{\partial v} \sum_{y_{\langle 1...T \rangle}} \tilde{p}(y_1, \ldots, y_T \mid x^{(i)}) \tag{484}$$
$$\frac{\partial}{\partial v} \tilde{p}(y_1, \ldots, y_T \mid x^{(i)}) = \frac{\partial}{\partial v} \prod_t \sum_{z_t} \psi_t \tag{485}$$
$$= \left[ \prod_t \sum_{z_t} \psi_t \right] \left[ \sum_t \frac{\frac{\partial}{\partial v} \sum_{z_t} \psi_t}{\sum_{z_t} \psi_t} \right] \tag{486}$$
where I’ve done some regrouping on the last line to be more gradient-friendly.
Data-dependent parameters

All params v that are not transition params.

$$\frac{\partial NLL}{\partial v_{dat}} = -\sum_{i=1}^{N} \left[ \sum_{t=1}^{T} \frac{\partial}{\partial v_{dat}} \log\Big( \sum_{z_t} \psi_t \Big) - \frac{\partial}{\partial v_{dat}} \log Z(x^{(i)}) \right] \tag{487}$$
$$= -\sum_{i=1}^{N} \left[ \sum_t \sum_h \sigma\big(o(h, y_t)\big) \frac{\partial o(h, y_t)}{\partial v_{dat}} - \frac{1}{Z(x^{(i)})} \sum_{y_{\langle 1...T \rangle}} \left[ \prod_t \sum_{z_t} \psi_t \right] \left[ \sum_t \frac{\frac{\partial}{\partial v_{dat}} \sum_{z_t} \psi_t}{\sum_{z_t} \psi_t} \right] \right] \tag{488}$$
$$= -\sum_{i=1}^{N} \left[ \sum_t \sum_h \sigma\big(o(h, y_t)\big) \frac{\partial o(h, y_t)}{\partial v_{dat}} - \frac{1}{Z(x^{(i)})} \sum_{y_{\langle 1...T \rangle}} \left[ \prod_t \sum_{z_t} \psi_t \right] \left[ \sum_t \sum_h \sigma\big(o(h, y_t)\big) \frac{\partial o(h, y_t)}{\partial v_{dat}} \right] \right] \tag{489}$$
$$= -\sum_{i=1}^{N} \left[ \sum_t \sum_h \sigma\big(o(h, y_t)\big) \frac{\partial o(h, y_t)}{\partial v_{dat}} - \sum_t \sum_y \sum_{y'} \sum_h \sigma\big(o(h, y_t)\big) \frac{\partial o(h, y_t)}{\partial v_{dat}} \, \xi_{t,y,y'} \right] \tag{490}$$
$$= -\sum_{i=1}^{N} \left[ \sum_t \sum_h \sigma\big(o(h, y_t)\big) \frac{\partial o(h, y_t)}{\partial v_{dat}} - \sum_t \sum_y \sum_h \sigma\big(o(h, y_t)\big) \frac{\partial o(h, y_t)}{\partial v_{dat}} \, \gamma_{t,y} \right] \tag{491}$$
$$= -\sum_{i=1}^{N} \sum_t \left[ \sum_h \sigma\big(o(h, y_t)\big) \frac{\partial o(h, y_t)}{\partial v_{dat}} - \sum_y \left( \sum_h \sigma\big(o(h, y_t)\big) \frac{\partial o(h, y_t)}{\partial v_{dat}} \right) \gamma_{t,y} \right] \tag{492}$$
$$= -\sum_{i=1}^{N} \sum_t \sum_y \left( \mathbb{1}_{y_t = y} - \gamma_{t,y} \right) \left( \sum_h \sigma\big(o(h, y_t)\big) \frac{\partial o(h, y_t)}{\partial v_{dat}} \right) \tag{493}$$

NOTE: Although I haven’t thoroughly checked the last few steps, they are required to be true in order to match
the paper’s results.

Transition parameters

$$\frac{\partial NLL}{\partial v_{tr}} = -\sum_{i=1}^{N} \left[ \sum_{t=1}^{T} \frac{\partial}{\partial v_{tr}} \log\Big( \sum_{z_t} \psi_t \Big) - \frac{\partial}{\partial v_{tr}} \log Z(x^{(i)}) \right] \tag{494}$$
$$= -\sum_{i} \left[ \sum_t \frac{\partial}{\partial v_{tr}} \left( c^T y_t + y_{t-1}^T A y_t \right) - \frac{\partial}{\partial v_{tr}} \log Z(x^{(i)}) \right] \tag{495}$$
$$= -\sum_{i=1}^{N} \left[ \sum_t \frac{\partial}{\partial v_{tr}} \left( c^T y_t + y_{t-1}^T A y_t \right) - \frac{1}{Z(x^{(i)})} \sum_{y_{\langle 1...T \rangle}} \left[ \prod_t \sum_{z_t} \psi_t \right] \left[ \sum_t \frac{\partial}{\partial v_{tr}} \left( c^T y_t + y_{t-1}^T A y_t \right) \right] \right] \tag{496}$$
$$= -\sum_{i=1}^{N} \sum_t \sum_y \left( \mathbb{1}_{y_t = y} - \gamma_{t,y} \right) \frac{\partial}{\partial v_{tr}} \left( c^T y_t + y_{t-1}^T A y_t \right) \tag{497}$$

Boundary parameters

$$\frac{\partial NLL}{\partial \pi_\ell} = -\sum_{i=1}^{N} \left( \mathbb{1}_{y_1 = \ell} - \gamma_{1,\ell} \right) \tag{498}$$
$$\frac{\partial NLL}{\partial \tau_\ell} = -\sum_{i=1}^{N} \left( \mathbb{1}_{y_T = \ell} - \gamma_{T,\ell} \right) \tag{499}$$

The results of each of the boxes above are summarized below, for the case of N = 1 to save space.
$$\frac{\partial NLL}{\partial v_{dat}} = -\sum_t \sum_y \left( \mathbb{1}_{y_t = y} - \gamma_{t,y} \right) \left( \sum_h \sigma\big(o(h, y)\big) \frac{\partial o(h, y)}{\partial v_{dat}} \right) \tag{500}$$
$$\frac{\partial NLL}{\partial v_{tr}} = -\sum_t \sum_y \left( \mathbb{1}_{y_t = y} - \gamma_{t,y} \right) \frac{\partial}{\partial v_{tr}} \left[ c^T y_t + y_{t-1}^T A y_t \right] \tag{501}$$
$$\frac{\partial NLL}{\partial \pi_\ell} = -\left[ \mathbb{1}_{y_1 = \ell} - \gamma_{1,\ell} \right] \tag{502}$$
$$\frac{\partial NLL}{\partial \tau_\ell} = -\left[ \mathbb{1}_{y_T = \ell} - \gamma_{T,\ell} \right] \tag{503}$$

Now I'll further go through and show how the equations simplify for each type of data-dependent parameter.
$$\frac{\partial NLL}{\partial W_{c,h}} = -\sum_t \sum_y \left( \mathbb{1}_{y_t = y} - \gamma_{t,y} \right) \sum_{h'} \sigma\big(o(h', y)\big) \frac{\partial}{\partial W_{c,h}} \left( b_{h'} + c_y + V_{h',y} + w_{h'}^T x_t \right) \tag{504}$$
$$= -\sum_t \sum_y \left( \mathbb{1}_{y_t = y} - \gamma_{t,y} \right) \sum_{h'} \sigma\big(o(h, y)\big)\, \mathbb{1}_{h = h'}\, \mathbb{1}_{c \in x_t} \tag{505}$$
$$= -\sum_t \sum_y \left( \mathbb{1}_{y_t = y} - \gamma_{t,y} \right) \sigma\big(o(h, y)\big)\, \mathbb{1}_{c \in x_t} \tag{506}$$
$$\frac{\partial NLL}{\partial V_{h,y}} = -\sum_t \left( \mathbb{1}_{y_t = y} - \gamma_{t,y} \right) \sigma\big(o(h, y)\big) \tag{507}$$
$$\frac{\partial NLL}{\partial b_h} = -\sum_t \sum_y \left( \mathbb{1}_{y_t = y} - \gamma_{t,y} \right) \sigma\big(o(h, y)\big) \tag{508}$$
$$\frac{\partial NLL}{\partial c_y} = -\sum_t \sum_h \left( \mathbb{1}_{y_t = y} - \gamma_{t,y} \right) \sigma\big(o(h, y)\big) \tag{509}$$

Alternative Approach. The above was a bit more cumbersome than needed. I'll now derive it in an easier way (writing $\mathcal{I}$ for the boundary terms and $\mathcal{T}$ for the label-transition terms).
$$p(y \mid x) = \frac{\exp\left\{y_1^T \pi + y_T^T \tau\right\}}{Z(x)} \prod_{t=1}^{T} \exp\{c^T y_t + y_{t-1}^T A y_t\} \prod_{h=1}^{H} \left( 1 + \exp\{b_h + w_h^T x_t + v_h^T y_t\} \right) \tag{511}$$
$$= \frac{\exp\{\mathcal{I} + \mathcal{T}\}}{Z(x)} \prod_{t=1}^{T} \prod_{h=1}^{H} \left( 1 + \exp\{b_h + w_h^T x_t + v_h^T y_t\} \right) \tag{512}$$
$$NLL = -\sum_{i=1}^{N} \log p(y^{(i)} \mid x^{(i)}) \tag{513}$$
$$= -\sum_{i=1}^{N} \left[ \mathcal{I} + \mathcal{T} + \sum_{t=1}^{T} \sum_{h=1}^{H} \log\left( 1 + \exp\{b_h + w_h^T x_t + v_h^T y_t\} \right) - \log Z(x^{(i)}) \right] \tag{514}$$

Now, focusing on the regular log-likelihood for a single example, we have


$$\frac{\partial \mathcal{L}_i}{\partial v} = \frac{\partial}{\partial v} \log p(y^{(i)}, x^{(i)}) \tag{515}$$
$$= \frac{\partial}{\partial v} \left[ \mathcal{I} + \mathcal{T} + \sum_{t,h} \log\left( 1 + \exp\left\{ b_h + w_h^T x_t + v_h^T y_t \right\} \right) - \log Z(x^{(i)}) \right] \tag{516}$$
$$\frac{\partial \log Z(x^{(i)})}{\partial v} = \frac{1}{Z(x^{(i)})} \sum_{y_{\langle 1...T \rangle}} \frac{\partial}{\partial v} \tilde{p}(y \mid x^{(i)}) \tag{517}$$
$$= \frac{1}{Z(x^{(i)})} \sum_{y_{\langle 1...T \rangle}} \tilde{p}(y \mid x^{(i)}) \frac{\partial}{\partial v} \log \tilde{p}(y \mid x^{(i)}) \tag{518}$$

as our base formula for partial derivatives.
Transition parameters

" #
$$\frac{\partial \mathcal{L}_i}{\partial A_{i,j}} = \frac{\partial}{\partial A_{i,j}} \left[ \mathcal{I} + \mathcal{T} + \sum_{t,h} \log\left( 1 + \exp\left\{ b_h + w_h^T x_t + v_h^T y_t \right\} \right) - \log Z(x^{(i)}) \right] \tag{519}$$
$$= \frac{\partial}{\partial A_{i,j}} \left[ \mathcal{I} + \mathcal{T} \right] - \sum_{y_{\langle 1...T \rangle}} p(y \mid x^{(i)}) \frac{\partial}{\partial A_{i,j}} \left[ y_1^T \pi + y_T^T \tau + \sum_t y_{t-1}^T A y_t \right] \tag{520}$$
$$= \sum_t \mathbb{1}_{y_{t-1}^{(i)} = i}\, \mathbb{1}_{y_t^{(i)} = j} - \frac{1}{Z(x^{(i)})} \sum_{y_{\langle 1...T \rangle}} \tilde{p}(y \mid x^{(i)}) \sum_{t=1}^{T} \mathbb{1}_{y_{t-1} = i}\, \mathbb{1}_{y_t = j} \tag{521}$$
$$= \sum_t \mathbb{1}_{y_{t-1}^{(i)} = i}\, \mathbb{1}_{y_t^{(i)} = j} - \sum_t \sum_{y_t} \sum_{y_{t-1}} \mathbb{1}_{y_{t-1} = i}\, \mathbb{1}_{y_t = j} \sum_{y_{\langle 1...t-2 \rangle}} \sum_{y_{\langle t+1...T \rangle}} \frac{1}{Z(x^{(i)})} \tilde{p}(y \mid x^{(i)}) \tag{522}$$

Papers and Tutorials June 30, 2018

Pre-training of Hidden-Unit CRFs



Kim et al., “Pre-training of Hidden-Unit CRFs,” (2018).

Model Definition. The Hidden-Unit CRF (HUCRF) accepts the usual observation sequence $x = x_1, \ldots, x_n$ and associated label sequence $y = y_1, \ldots, y_n$ for training. The HUCRF also has a hidden layer of binary-valued units $z = z_1, \ldots, z_n$. It defines the joint probability
$$p_{\theta,\gamma}(y, z \mid x) = \frac{\exp\left( \theta^T \Phi(x, z) + \gamma^T \Psi(z, y) \right)}{\sum_{z'} \sum_{y' \in \mathcal{Y}(x, z')} \exp\left( \theta^T \Phi(x, z') + \gamma^T \Psi(z', y') \right)} \tag{523}$$
where
• $\mathcal{Y}(x, z)$ is the set of all possible label sequences for $x$ and $z$.
• $\Phi(x, z) = \sum_{j=1}^{n} \phi(x, j, z_j)$
• $\Psi(z, y) = \sum_{j=1}^{n} \psi(z_j, y_{j-1}, y_j)$.
Also note that we model $(z_i \perp z_{j \neq i} \mid x, y)$.

Pre-training HUCRFs. Since the objective for HUCRFs is non-convex, we should choose a
better initialization method than random initialization. This is where pre-training comes in,
a simple 2-step approach:
1. Cluster observed tokens from M unlabeled sequences and treat the clusters as labels to train an intermediate HUCRF. Let $C(u^{(i)})$ be the sequence of cluster assignments/labels for the unlabeled sequence $u^{(i)}$. We compute:
$$(\theta_1, \gamma_1) \approx \arg\max_{\theta, \gamma} \sum_{i=1}^{M} \log p_{\theta,\gamma}\big(C(u^{(i)}) \mid u^{(i)}\big) \tag{524}$$
2. Train a final model on the labeled data $\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$, using $\theta_1$ as an initialization point:
$$(\theta_2, \gamma_2) \approx \arg\max_{\theta, \gamma} \sum_{i=1}^{N} \log p_{\theta,\gamma}\big(y^{(i)} \mid x^{(i)}\big) \tag{525}$$

Note that pre-training only defines the initialization for θ, the parameters between x and z.
We still train γ, the parameters from z to y, from scratch.

Canonical Correlation Analysis (CCA). A general technique that we will need to understand as a prerequisite for the multi-sense clustering approach (defined in the next section). Given $n$ samples of the form $(x^{(i)}, y^{(i)})$, where each $x^{(i)} \in \{0,1\}^d$ and $y^{(i)} \in \{0,1\}^{d'}$, CCA returns projection matrices $A \in \mathbb{R}^{d \times k}$ and $B \in \mathbb{R}^{d' \times k}$ that we can use to project the samples to $k$ dimensions:
$$x \longrightarrow A^T x \tag{526}$$
$$y \longrightarrow B^T y \tag{527}$$
The CCA algorithm is outlined below.


Algorithm: CCA
1. Calculate $D \in \mathbb{R}^{d \times d'}$, $u \in \mathbb{R}^{d}$, and $v \in \mathbb{R}^{d'}$ as follows:
$$D_{i,j} = \sum_{l=1}^{n} \mathbb{1}_{x_i^{(l)} = 1}\, \mathbb{1}_{y_j^{(l)} = 1} \tag{528}$$
$$u_i = \sum_{l=1}^{n} \mathbb{1}_{x_i^{(l)} = 1} \tag{529}$$
$$v_i = \sum_{l=1}^{n} \mathbb{1}_{y_i^{(l)} = 1} \tag{530}$$
2. Define $\hat{\Omega} = \mathrm{diag}(u)^{-1/2}\, D\, \mathrm{diag}(v)^{-1/2}$.
3. Calculate the rank-$k$ SVD of $\hat{\Omega}$. Let $U \in \mathbb{R}^{d \times k}$ and $V \in \mathbb{R}^{d' \times k}$ contain the left and right singular vectors, respectively, for the largest $k$ singular values.
4. Return $A = \mathrm{diag}(u)^{-1/2} U$ and $B = \mathrm{diag}(v)^{-1/2} V$.
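A minimal numpy sketch of the algorithm above, for intuition only (dense SVD rather than a truncated one, and a small constant added to avoid division by zero):

```python
import numpy as np

def cca_projections(X, Y, k):
    """X: (n, d) and Y: (n, d') binary indicator matrices; returns A (d, k), B (d', k)."""
    D = X.T @ Y                       # co-occurrence counts D_{ij}, eq. (528)
    u = X.sum(axis=0)                 # eq. (529)
    v = Y.sum(axis=0)                 # eq. (530)
    inv_sqrt_u = np.diag(1.0 / np.sqrt(np.maximum(u, 1e-8)))
    inv_sqrt_v = np.diag(1.0 / np.sqrt(np.maximum(v, 1e-8)))
    omega = inv_sqrt_u @ D @ inv_sqrt_v            # step 2
    U, _, Vt = np.linalg.svd(omega, full_matrices=False)
    U_k, V_k = U[:, :k], Vt[:k].T                  # step 3: top-k singular vectors
    return inv_sqrt_u @ U_k, inv_sqrt_v @ V_k      # step 4
```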

Multi-sense clustering. For each word type, use CCA to create a set of context embeddings
corresponding to all occurrences of that word type. Then, cluster these embeddings with k-
means. Set the number of word senses k to the number of label types occurring in the labeled
data.

TODO: finish this note

Papers and Tutorials July 07, 2018

Structured Attention Networks



Kim et al., “Structured Attention Networks,” (2017).

Background: Attention Networks. Goal: produce a context c based on input sequence


x and query q. We assume we have an attention distribution[122] $z \sim p(z \mid x, q)$. Interpret $z$ as a categorical latent variable over $T$ categories, where $T = \mathrm{len}(x)$ is the length of the input sequence. We can then compute the context, $c = \mathbb{E}_{z \sim p(z \mid x, q)}[f(x, z)]$, where $f(x, z)$ is an annotation function[123].

We interpret the attention mechanism as taking the expectation of an annotation function


f (x, z) with respect to a latent variable z ∼ p, where p is parameterized to be a function of
x and q.

For comparisons later on with the traditional attention mechanism, here it is:
$$c = \sum_{t}^{T} p(z = t \mid x, q)\, x_t \tag{531}$$
$$p(z = t \mid x, q) = \mathrm{softmax}(\theta_t) \tag{532}$$
where usually $x$ is the sequence of hidden states of the encoder RNN, $q$ is the hidden state of the decoder RNN at the most recent time step, $z$ gives the source position to be attended to, and $\theta_t = \mathrm{score}(x_t, q)$.
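For reference, a tiny numpy sketch of this standard softmax attention, using dot-product scores as the score function (an assumption; the score function is left unspecified above):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def attention_context(x, q):
    """x: (T, d) encoder states, q: (d,) decoder query; returns the context c (d,)."""
    theta = x @ q               # theta_t = score(x_t, q), here a dot product
    p = softmax(theta)          # p(z = t | x, q), eq. (532)
    return p @ x                # c = sum_t p(z = t | x, q) x_t, eq. (531)
```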

Structured Attention. In a structured attention model, z is now a vector of discrete


random variables $z_1, \ldots, z_m$ and the attention distribution $p(z \mid x, q)$ is now modeled as a conditional random field, specifying the structure of the $z$ variables. We also assume now that the annotation function $f$ factors into clique annotation functions $f(x, z) = \sum_C f_C(x, z_C)$, where the summation is over the cliques $C$ corresponding to the factors $\psi_C$ of the CRF. Our context vector takes the form:
$$c = \mathbb{E}_{z \sim p(z \mid x, q)}[f(x, z)] = \sum_C \mathbb{E}_{z_C \sim p(z_C \mid x, q)}[f_C(x, z_C)] \tag{533}$$
$$p(z \mid x, q) = \frac{1}{Z(x, q)} \prod_C \psi_C(z_C) \tag{534}$$

[122] Also called the “alignments”. It is the output of the softmax layer of attention scores in the majority of cases.
[123] In all applications I've seen, $f(x, z) = x_z$.

Example 1: Subsequence Selection. Let $m = T$, and let each $z_i \in \{0, 1\}$ be a binary R.V. Let $f(x, z) = \sum_t^T f_t(x, z_t) = \sum_t^T x_t \mathbb{1}_{z_t = 1}$[124]. This yields the context vector,
$$c = \mathbb{E}_{z_1, \ldots, z_T}[f(x, z)] = \sum_t^T p(z_t = 1 \mid x, q)\, x_t \tag{535}$$

Although this looks similar to equation 531, we haven't yet revealed the functional form for $p(z \mid x, q)$. Two possible choices:
$$\text{Linear-Chain CRF:}\quad p(z_1, \ldots, z_T \mid x, q) = \frac{1}{Z(x, q)} \prod_t^T \psi_t(z_t, z_{t-1}) \tag{536}$$
$$\text{Bernoulli:}\quad p(z_1, \ldots, z_T \mid x, q) = \prod_t^T p(z_t = 1 \mid x, q) = \prod_t^T \sigma(\psi_t(z_t)) \tag{537}$$
(Margin note: the factor $\psi$ for the CRF is NOT the same as the factor for the Bernoulli!)

These show why equation 535 is fundamentally different than equation 531:
• It allows for multiple inputs (or no inputs) to be selected for a given query.
• We can incorporate structural dependencies across the zt ’s.
Also note that all methods can use potentials from the same neural network or RNN that takes
x and q as input. By this we mean, for example, that we can take the same parameters we’d use
when computing the scores in our attention layer, and reinterpret them as e.g. CRF parameters.
Then, we can compute the marginals p(zt | x) using the forward-backward algorithm125 .

Crucially this generalization from vector softmax to forward-backward is just a series of


differentiable steps, and we can compute gradients of its output (marginals) with respect
to its input (potentials), allowing the structured attention model to be trained end-to-end
as part of a deep model.

[124] Equivalently, $z^T x$; i.e. the indicator function can just be replaced by $z_t$ here.
[125] This is different than the simple softmax we usually use in an attention layer, which does not model any interdependencies between the $z_t$. The marginals we end up with when using the CRF originate from a joint distribution over the entire sequence $z_1, \ldots, z_T$. This seems potentially incredibly powerful. Need to analyze in depth.

Papers and Tutorials July 08, 2018

Neural Conditional Random Fields



Do and Artieres, “Neural Conditional Random Fields,” (2010).

Neural CRFs. Essentially, we feed the input sequence x through a feed-forward network
whose output layer has a linear activation function. The output layer is then connected with
the target variable sequence Y. In other words, instead of feeding instances x of the observation
variables X, we feed the hidden layer activations of the NN. This results in the conditional
probability
$$p(y \mid x) \propto \prod_{c \in C} e^{-E_c(x, y_c, w)} = \prod_{c \in C} e^{\langle w_c^{y_c},\, \Phi_c(x, w_{NN}) \rangle} \tag{538}$$
(Margin note: we can set $\Phi_c = \Phi$ for a shared-weights approach.)

where
• $w_{NN}$ are the weights for the NN.
• $w_c^{y_c}$ are the weights (for clique $c$) for the CRF.
• $\Phi_c(x, w_{NN})$ is the output of the NN. It symbolizes the high-level feature representation of the input $x$ at clique $c$ computed by the NN.
The authors refer to the linear output layer (containing the CRF weights) as the energy outputs.
For the sake of writing this in more familiar notation for the linear-chain CRF case, here is the above equation translated for the case where each clique corresponds to a timestep $t$ of the input sequence and is either a label-label clique or a state-label clique.
$$p(y \mid x) = \frac{1}{Z(x)} \prod_t^T \exp\left\{ -E_t(x, y_t, y_{t-1}, w) \right\} \tag{539}$$
$$= \frac{1}{Z(x)} \prod_t^T \exp\left\{ -E_{loc}(x, t, y_t, w) - E_{trans}(x, t, y_{t-1}, y_t, w) \right\} \tag{540}$$

where the authors are using a blanket w to denote all model parameters126 .

[126] Also note that the authors allow for utilizing the input sequence $x$ in the transition energy function $E_{trans}$, although usually we implement $E_{trans}$ using only $y_{t-1}$ and $y_t$.

Initialization & Fine-Tuning. The hidden layers of the NN are initialized layer-by-layer in
an unsupervised manner using RBMs. It’s important to note that the hidden layers of the NN
consist of binary units. Then, using the pre-trained hidden layers, the CRF layer is initialized
by training it in the usual way, and keeping the pretrained NN weights fixed.

Next, fine-tuning is used to learn all parameters globally.
$$\frac{\partial L(w)}{\partial w} = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial L_i(w)}{\partial w} \tag{542}$$
$$\frac{\partial L_i(w)}{\partial w} = \frac{\partial L_i(w)}{\partial E(x^{(i)})} \frac{\partial E(x^{(i)})}{\partial w} \tag{543}$$
$$E(x^{(i)}) = \Big[ E_{loc}(x, t, y_t, w) + E_{trans}(x, t, y_{t-1}, y_t, w) \Big]_t \tag{544}$$
where $\frac{\partial E_i}{\partial w}$ is the Jacobian matrix of the NN outputs for input sequence $x^{(i)}$ w.r.t. the weights $w$. By setting $\frac{\partial L_i(w)}{\partial E_i}$ as the backprop errors of the NN output units, we can backpropagate and get $\frac{\partial L_i(w)}{\partial w}$ using the chain rule over the hidden layers.

Papers and Tutorials July 08, 2018

Bidirectional LSTM-CRF Models for Sequence Tagging



Huang et al., “Bidirectional LSTM-CRF Models for Sequence Tagging,” (2015).

BI-LSTM-CRF Networks. Consider the matrix of scores $f_\theta(x)$ for input sentence $x$. The element $[f_\theta]_{\ell, t}$ gives the score for label $\ell$ at the $t$-th word. This is output by the LSTM network parameterized by $\theta$. We let $[A]_{i,j}$ denote the transition score from label $i$ to label $j$ within the CRF. The total set of parameters is denoted $\tilde{\theta} = \theta \cup \{A\}$. The total score for input sentence $x$ and predicted label sequence $y$ is then
$$s(x, y, \tilde{\theta}) = \sum_t^T \left( A_{y_{t-1}, y_t} + [f_\theta]_{y_t, t} \right) \tag{546}$$

Features. The authors incorporate 3 distinct types of input features:


• Spelling features. Various standard lexical/syntactical features. One-hot encoded as
usual.
• Context features. Unigram, bigram, and sometimes tri-gram features. One-hot en-
coded.
• Word embeddings. Distinct from the word features. Use a pretrained embedding for
each word.
Although it’s not entirely clear, it appears they concatenate all of the aforementioned features
together as input to the BI-LSTM. This necessarily means they are learning an embedding for
the one-hot encoded spelling and word features. They also add direct connections from the
input to the CRF for the spelling and word features.

EDIT: they may actually replace the one-hot encoded word features with the word embeddings.
Unclear.

Papers and Tutorials July 16, 2018

Relation Extraction: A Survey



Pawar et al., “Relation Extraction: A Survey,” (December 2017).

Feature-based Methods. For each entity pair $(e_1, e_2)$, generate a set of features and train a classifier to predict the relation[127]. Some useful features are shown in the figure below.

The authors found that SVMs outperform MaxEnt (logistic regression) classifiers for this task.

Kernel methods. Instead of explicit feature engineering, we can design kernel functions for computing similarities between representations of two relation instances[128] (a relation instance is of the form $(e_1, e_2)$), and an SVM for the classification.

[127] If there are N unique relations for our data, it is common to train the classifier to predict 2N total relations, to handle both possible orderings of relation arguments.
[128] Recall that kernel methods are for the general optimization problem
$$\min_w \sum_{i=1}^{n} \mathrm{loss}\left( \sum_{j=1}^{n} \alpha_j \phi(x_j)^T \phi(x_i),\; y_i \right) + \lambda \sum_{j=1}^{n} \sum_{k=1}^{n} \alpha_j \alpha_k \phi(x_j)^T \phi(x_k) \tag{547}$$
which changes the direct focus from feature engineering to “similarity” engineering.

One approach is the sequence kernel. We represent each relation instance as a sequence of
feature vectors:
(e1 , e2 ) → (f1 , . . . , fN )
where N might be e.g. the number of words between the two entities, and the dimension of
each f is the same, and could correspond to e.g. POS tag, NER tag, etc. More formally,
define the generalized subsequence kernel, $K_n(s, t, \lambda)$, that computes some number of weighted subsequences $u$ such that
• There exist index sequences $\mathbf{i} := (i_1, \ldots, i_n)$ and $\mathbf{j} := (j_1, \ldots, j_n)$ of length $n$ such that
$$u_i \in s_i \;\; \forall i \in \mathbf{i} \tag{548}$$
$$u_j \in t_j \;\; \forall j \in \mathbf{j} \tag{549}$$
• The weight of $u$ is $\lambda^{l(\mathbf{i}) + l(\mathbf{j})}$, where $l(\mathbf{x}) = \max(\mathbf{x}) - \min(\mathbf{x})$ and $0 < \lambda < 1$. Sparser (more spaced out) subsequences get lower weight.
The authors then provide the recursion formulas for K, and describe some extensions of se-
quence kernels for relation extraction.

Syntactic Tree Kernels. Structural properties of a sentence are encoded by its constituent parse tree. The tree defines the syntax of the sentence in terms of constituents such as noun phrases (NP), verb phrases (VP), prepositional phrases (PP), and POS tags (NN, VB, IN, etc.) as non-terminals, with actual words as leaves. The syntax is usually governed by a Context Free Grammar (CFG). Constructing a constituent parse tree for a given sentence is called parsing. The Convolution Parse Tree Kernel $K_T$ can be used for computing similarity between two syntactic trees.

Dependency Tree Kernels. These capture the grammatical relations between words in a sentence. Words are the nodes and dependency relations are the edges (in the tree), typically from dependent to parent. In the relation instance representation, we use the smallest subtree containing the entity pair of a given sentence. Each node is augmented with additional features like POS, chunk, entity level (name, nominal, pronoun), hypernyms, relation argument, etc. Formally, an augmented dependency tree is defined as a tree T where
• Each node $t_i$ has features $\phi(t_i) = \{v_1, \ldots, v_d\}$.
• Let $t_i[c]$ denote all children of $t_i$, and let $Pa(t_i)$ denote its parent.
• For comparison of two nodes we use:
  – Matching function $m(t_i, t_j)$: equal to 1 if some important features are shared between $t_i$ and $t_j$, else 0.
  – Similarity function $s(t_i, t_j)$: returns a positive real similarity score, and is defined as
$$s(t_i, t_j) = \sum_{v_q \in \phi(t_i)} \sum_{v_r \in \phi(t_j)} \mathrm{Compat}(v_q, v_r) \tag{551}$$
    over some compatibility function between two feature values.

Finally, we can define the overall dependency tree kernel $K(T_1, T_2)$ for similarity between trees $T_1$ and $T_2$ as follows. Let $r_i$ denote the root node of tree $T_i$.
$$K(T_1, T_2) = \begin{cases} 0 & \text{if } m(r_1, r_2) = 0 \\ s(r_1, r_2) + K_c(r_1[c], r_2[c]) & \text{otherwise} \end{cases} \tag{552}$$
$$K_c(t_i[c], t_j[c]) = \sum_{\mathbf{a}, \mathbf{b},\; l(\mathbf{a}) = l(\mathbf{b})} \lambda^{d(\mathbf{a}) + d(\mathbf{b})} K(t_i[\mathbf{a}], t_j[\mathbf{b}]) \tag{553}$$
where the index sequences satisfy $a_1 \leq a_2 \leq \ldots \leq a_n$, $d(\mathbf{a}) \triangleq a_n - a_1 + 1$, and $0 < \lambda < 1$.
The interpretation is that, whenever a pair of matching nodes is found, all possible matching subsequences[129] of their children are found. Two subsequences $\mathbf{a}$ and $\mathbf{b}$ are said to “match” if $m(a_i, b_i) = 1\;(\forall i < n)$. Similar to the sequence kernel seen earlier, $\lambda$ is a decay factor that penalizes sparser subsequences.

[129] Note that a summation over subsequences of a sequence $\mathbf{a}$, denoted here as $\sum_{\mathbf{a}}$, expands over $\{a_1, \ldots, a_n, a_1 a_2, a_1 a_3, \ldots, a_1 a_n, a_1 a_2 a_3, \ldots, a_2 a_5 a_6, \ldots\}$ and so on and so forth.

Papers and Tutorials July 16, 2018

Neural Relation Extraction with Selective Attention over Instances



Lin et al., “Neural Relation Extraction with Selective Attention over Instances,” (2016).

Introduction. A common distant supervision approach for RE is aligning a KB with text.


For any (e1 , r, e2 ) in the KB, it assumes that all text mentions of (e1 , e2 ) express the relation r.
Of course, this assumption will often not be true. This motivates the notion of multi-instance
learning, wherein we predict whether a set of instances {x1 , . . . xn } (each of which contain
mention(s) of (e1 , e2 )) imply the existence of (e1 , r, e2 ) being true.

Input Representation. Each instance sentence $x$ is tokenized into a sequence of words. Each word $w_i$ is transformed into a concatenation $\mathbf{w}_i \in \mathbb{R}^d$ ($d = d_w + 2 d_{pos}$),
$$\mathbf{w}_i := [\mathrm{word2vec}(w_i);\; \mathrm{dist}(w_i, e_1);\; \mathrm{dist}(w_i, e_2)] \tag{554}$$
where $\mathrm{dist}(a, b)$ returns the [embedded] relative distance (in tokens, a positive integer) between $a$ and $b$ in the given sentence[130].

Convolutional Network. We use a CNN to encode a sentence of embeddings $\{\mathbf{w}_1, \ldots, \mathbf{w}_T\}$ into a single sentence vector representation $\mathbf{x}$. Denote the kernel/filter/window size as $\ell$ and the number of words in the given sentence as $T$. Let $\mathbf{q}_i \in \mathbb{R}^{\ell \cdot d}$ denote the vector for the $i$th window,
$$\mathbf{q}_i = \mathbf{w}_{i - \ell + 1 : i} \qquad (1 \leq i \leq T + \ell - 1) \tag{555}$$
and let $Q \in \mathbb{R}^{(T + \ell - 1) \times \ell \cdot d}$ be defined such that row $Q_i = \mathbf{q}_i^T$. It follows that, for convolution matrix $W \in \mathbb{R}^{K \times (\ell \cdot d)}$, the output of the $k$th filter, and the subsequent max-pooling, is
$$\mathbf{p}^k = \left[ W Q^T + \mathbf{b} \right]_k \tag{556}$$
$$[\mathbf{x}]_k = [\max(\mathbf{p}^k_1);\; \max(\mathbf{p}^k_2);\; \max(\mathbf{p}^k_3)] \tag{557}$$
where we've divided $\mathbf{p}^k$ into three segments, corresponding to before entity 1, the middle, and after entity 2 of the given sentence. The sentence vector $\mathbf{x} \in \mathbb{R}^{3K}$ is the concatenation of all of these, after feeding through a non-linear activation function like a ReLU.
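A rough numpy sketch of this piecewise max-pooling encoder; the names and the exact windowing/padding convention are my own simplifications, not the paper's:

```python
import numpy as np

def pcnn_encode(W_emb, conv_W, conv_b, e1_idx, e2_idx, ell):
    """W_emb: (T, d) token embeddings; conv_W: (K, ell*d); conv_b: (K,).
    Returns a (3K,) sentence vector via piecewise max-pooling around e1/e2."""
    T, d = W_emb.shape
    padded = np.vstack([np.zeros((ell - 1, d)), W_emb])          # left-pad so each window exists
    Q = np.stack([padded[i:i + ell].ravel() for i in range(T)])  # (T, ell*d) windows q_i
    P = Q @ conv_W.T + conv_b                                    # (T, K) filter responses
    segments = [P[:e1_idx + 1], P[e1_idx + 1:e2_idx + 1], P[e2_idx + 1:]]
    pooled = [seg.max(axis=0) if len(seg) else np.zeros(P.shape[1]) for seg in segments]
    return np.maximum(np.concatenate(pooled), 0.0)               # ReLU over the 3K-dim vector
```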

[130] More specifically, it is an embedded representation of the relative distance. To actually implement it, you'd first shift the relative distances such that they begin at 0, and learn 2 * window_size + 1 embedding vectors for the possible position offsets. Anything outside the window is embedded into the zero vector.

Selective Attention over Instances. An attention mechanism is employed over all $n$ sentence instances $\mathbf{x}_i$ for some candidate entity pair $(e_1, e_2)$. The output is a set vector $\mathbf{s}$, a real-valued vector representation of the set of instances, where
$$\mathbf{s} = \sum_i \alpha_i \mathbf{x}_i \tag{558}$$
$$\alpha_i = \mathrm{softmax}(\mathbf{x}_i A \mathbf{r}) \tag{559}$$
is an attention-weighted sum over the instance embeddings. Note that they constrain $A$ to be diagonal. Finally, the predictive distribution is defined as
$$p(r \mid S, (e_1, e_2); \theta) = \mathrm{softmax}\left( M \mathbf{s} + \mathbf{d} \right)_r \tag{560}$$
where $S$ is the set of $n$ sentences for the given entity pair $(e_1, e_2)$.
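A compact numpy sketch of eqs. (558)-(560), with a diagonal attention matrix as stated (all names are illustrative, not the paper's):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def bag_prediction(X, a_diag, r_query, M, d):
    """X: (n, 3K) instance vectors for one entity pair; a_diag: (3K,) diagonal of A;
    r_query: (3K,) relation query vector; M: (R, 3K), d: (R,)."""
    scores = X @ (a_diag * r_query)      # x_i A r with A diagonal, eq. (559)
    alpha = softmax(scores)
    s = alpha @ X                        # eq. (558)
    return softmax(M @ s + d)            # p(r | S, (e1, e2); theta), eq. (560)
```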

Papers and Tutorials July 31, 2018

On Herding and the Perceptron Cycling Theorem



Gelfand et al., “On Herding and the Perceptron Cycling Theorem,” (2010).

Introduction. Begin with the familiar learning rule of Rosenblatt's perceptron, after some incorrect prediction $\hat{y}_i = \mathrm{sgn}(w^T x_i)$,
$$w \leftarrow w + x_i (y_i - \hat{y}_i) \tag{561}$$
which has the effect that a subsequent prediction on $x_i$ will (before taking the sign) be $||x_i||_2^2$ closer to the correct side of the hyperplane. The perceptron cycling theorem (PCT) states that if the data is not linearly separable, the weights will still remain bounded and won't diverge to infinity. This paper shows that the PCT implies that certain moments are conserved on average. Formally, their result says that, for some number N of iterations over samples selected from the training data (with replacement)[131],
$$\frac{1}{N} \sum_i^N x_i y_i - \frac{1}{N} \sum_i^N x_i \hat{y}_i \sim O\!\left( \frac{1}{N} \right) \tag{562}$$

where it's important to remember that $\hat{y}_i$ here is the prediction for $x_i$ when it was encountered at that training iteration. This result shows that perceptron learning generates predictions that correlate with the input attributes the same way the true labels do, and [the correlations] converge to the sample mean at a rate of 1/N. This also hints at why averaged perceptron algorithms (using the average of the weights across training) make sense, as opposed to just selecting the best weights. This paper also shows that supervised perceptron algorithms and unsupervised herding algorithms can all be derived from the PCT.

Below are some theorems that will be used throughout the paper. Let $\{w_t\}$ be a sequence of vectors $w_t \in \mathbb{R}^D$, each generated according to iterative updates $w_{t+1} = w_t + v_t$, where $v_t$ is an element of a finite set $\mathcal{V}$, and the norm of $v_t$ is bounded: $\max ||v_t|| = R < \infty$.

PCT: $\forall t \geq 0$: If $w_t^T v_t \leq 0$, then $\exists M > 0$ s.t. $||w_t - w_0|| < M$.

Convergence Thm: If the PCT holds, then $\left|\left| \frac{1}{T} \sum_{t=1}^{T} v_t \right|\right| \sim O\!\left( \frac{1}{T} \right)$.

[131] Unclear whether this is only for samples that corresponded to an update, or just all samples during training.

Herding. Consider a fully observed Markov Random Field (MRF) over $m$ variables, each of which can take on an integer value in the range $[1, K]$. In herding, our energy function and weight updates for observation $x$ (over all $m$ variables in $\mathcal{X}$) are
$$E(x) = -w^T \phi(x) \tag{563}$$
$$w_{t+1} = w_t + \bar{\phi} - \phi(x^*_t) \tag{564}$$
$$\text{where } \bar{\phi} = \mathbb{E}_{x^{(i)} \sim p_{data}}\left[ \phi(x^{(i)}) \right] \tag{565}$$
$$\text{and } x^*_t = \arg\max_x w_t^T \phi(x) \tag{566}$$

What if we consider more complicated features that depend on the weights $w$? This situation may arise in e.g. models with hidden units $z$, where our feature function would take the form $\phi(x, z)$, and we always select $z$ via
$$z(x, w) = \arg\max_{z'} w^T \phi(x, z') \tag{567}$$
and therefore our feature function $\phi$ depends on the weights $w$ through $z$. In this case, our herding update terms from above take the form
$$\bar{\phi} = \mathbb{E}_{x^{(i)} \sim p_{data}}\left[ \phi\big(x^{(i)}, z(x^{(i)}, w)\big) \right] \tag{568}$$
$$x^*_t, z^*_t = \arg\max_{x, z} w_t^T \phi(x, z) \tag{569}$$

Conditional Herding. The main contribution of this paper. It's basically identical to regular herding, but now we decompose $x$ into inputs and outputs $(x, y)$ for interpretation in a discriminative setting. In the paper, they express $w^T \phi(x, y, z)$ identically to a discriminative RBM. The parameter update for mini-batch $\mathcal{D}_t$ is given by
$$w_{t+1} = w_t + \frac{\eta}{|\mathcal{D}_t|} \sum_{(x^{(i)}, y^{(i)}) \in \mathcal{D}_t} \left[ \phi(x^{(i)}, y^{(i)}, z) - \phi(x^{(i)}, y^*, z^*) \right] \tag{570}$$

Papers and Tutorials August 12, 2018

Non-Convex Optimization for Machine Learning



P. Jain and P. Kar, “Non-convex Optimization for Machine Learning,” (2017).

Convex Analysis (2.1). First, let’s summarize some definitions.


Convex Combination
A convex combination of a set of $n$ vectors $x_i \in \mathbb{R}^p$, $i = 1 \ldots n$, is a vector $x_\theta := \sum_{i=1}^{n} \theta_i x_i$, where $\theta_i \geq 0$ and $\sum_{i=1}^{n} \theta_i = 1$.
My interp: a weighted average where the weights can be interpreted as probability mass associated with each vector.

Convex Set. Sets that contain all [points in] line segments that join any 2 points in the set.
A set C is called a convex set if ∀x, y ∈ C and λ ∈ [0, 1], we have that (1 − λ)x + λy ∈ C
as well.

Proving conv. comb. of 3 vectors ∈ C too.

After reading the definition of a convex set above, it seemed intuitive that any convex combination of points ∈ C
should also be in it as well (i.e. generalizing the pairwise definition). Let x, y, z ∈ C. How can we prove that
θ1 x + θ2 y + θ3 z (where θi satisfy the constraints of a convex comb.) is also in C? Here is how I ended up doing it:
• If we can prove that θ1 x + θ2 y = (1 − θ3 )v for some v ∈ C, then our work is done. This is pretty easy to
show via simple arithmetic.
• Case 1: assume $\theta_3 < 1$, so that we can divide both sides by $1 - \theta_3$:
$$v = \frac{\theta_1}{1 - \theta_3}\, x + \frac{\theta_2}{1 - \theta_3}\, y$$
Clearly, the two coefficients here sum to 1 and satisfy the constraints of a convex combination, and therefore we know that $v \in C$, and this case is done.
• Case 2: assume θ3 = 1. Well, that means θ1 = θ2 = 0. Trivially, z ∈ C and this case is done.

Convex Function
A continuously differentiable function f : Rp 7→ R is a convex function if ∀x, y ∈ Rp ,
we have that
f (y) ≥ f (x) + h∇f (x), y − xi (571)

While thinking about how to gain intuition for the above, I came across chapter 3 of “Convex
Optimization” which describes this in much more detail. It’s crucial to recognize that the RHS
of the inequality is the 1st-order Taylor expansion of the function f about x, evaluated at
y. In other words, the first-order Taylor approximation is a global underestimator of any
convex function f 132 .

132
Consider what this implies about all the 1st-order gradient-based optimizers we use.

191
Strongly Convex/Smooth Function. Informally, strong convexity ensures a convex function doesn't grow too slow, while strong smoothness ensures a convex[133] function doesn't grow too fast. Formally,
A continuously differentiable function is considered $\alpha$-strongly convex (SC) and $\beta$-strongly smooth (SS) if $\forall x, y \in \mathbb{R}^p$ we have
$$\frac{\alpha}{2} ||x - y||_2^2 \;\leq\; f(y) - f(x) - \langle \nabla f(x), y - x \rangle \;\leq\; \frac{\beta}{2} ||x - y||_2^2 \tag{572}$$

Considering the aforementioned 1st-order Taylor approximation interpretation, we see that α


determines just how much larger f (y) must be compared to its linear approximation. Con-
versely, β determines the upper bound for how large this discrepancy is allowed to be134 .
Exercise 2.1: SS does not imply convexity

Construct a non-convex function f : Rp 7→ R that is 1-SS.

We need to find a function whose true value never exceeds its linear approximation by more than $\frac{1}{2}||x - y||_2^2$. Intuitively, I'd expect any concave function to satisfy this, since its linear approximation is a global overestimator of the true value. So, for example, $f(y) = -||y||_2^2$ would satisfy 1-SS while being non-convex.

Lipschitz Function
A function f is B-Lipschitz if ∀x, y ∈ Rp ,

|f (x) − f (y)| ≤ B · ||x − y||2 (573)

Jensen’s Inequality. Generalizes behavior of convex functions on convex combinations135 .

If X is a R.V. taking values in the domain of a convex function f , then

E [f (X)] ≥ f (E [X]) (574)

[133] Strong smoothness alone does not imply convexity.
[134] Notice that SC and SS are quadratic lower and upper bounds, respectively. This means that the allowed deltas grow as a function of the distance between x and y, whereas things like Lipschitzness grow linearly.
[135] It should be obvious that expectations are convex combinations.

Convex Projections (2.2). Given any closed set $C \subseteq \mathbb{R}^p$, the projection operator $\Pi_C(\cdot)$ is defined as
$$\Pi_C(z) := \arg\min_{x \in C} ||x - z||_2 \tag{575}$$

If C is a convex set, then the above reduces to a convex optimization problem. Projections
onto convex sets have three particularly interesting properties. For each of them, the setup is:
“For any convex set C ⊂ Rp , and any z ∈ Rp , let ẑ := ΠC (z). Then ∀x ∈ C, . . . ”
• Property-O136 : ||ẑ − z||2 ≤ ||x − z||2 . Informally: “the projection results in the point
ẑ in C that is closest to the original z”. This basically just restates the optimization
problem.
• Property-I. hx − ẑ, z − ẑi ≤ 0 . Informally: “from the perspective of ẑ, all points x ∈ C
are in the (informally) opposite direction of z.”
• Property-II. ||ẑ − x||2 ≤ ||z − x||2 . Informally: “the projection brings the point closer
to all points in C compared to its original location.”
Proving Property-I

A proof by contradiction.
1. Assume that ∃x ∈ C s.t. hx − ẑ, z − ẑi > 0.
2. We know that ẑ is also in C, and since C is convex, then for any λ ∈ [0, 1],

xλ := λx + (1 − λ)ẑ (576)

must also be in C.
3. If we can show that some value of λ guarantees that ||z − xλ ||2 < ||z − ẑ||2 , this would directly contradict
property-O, implying ẑ is not the closest member of C to z. I’m not sure how to actually derive the range
of λ values that satisfy this, though.

(Convex) Projected Gradient Descent (2.3). Our optimization problem is

$$\min_{x \in \mathbb{R}^p} f(x) \quad \text{s.t.} \quad x \in C \tag{577}$$

where C ⊂ Rp is a convex constraint set, and f : Rp 7→ R is a convex objective function.


Projected gradient descent iteratively updates the value of x that minimizes f as usual, but
additionally projects the current iterate (value of best x) onto C at the end of each iteration.
That’s the only difference.
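A minimal sketch of the loop just described, parameterized by a projection function; the gradient and projection oracles are assumptions supplied by the caller:

```python
import numpy as np

def projected_gradient_descent(x0, grad_f, project, lr=0.1, n_iters=100):
    """Iterate x <- project(x - lr * grad_f(x)); project maps onto the constraint set C."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = project(x - lr * grad_f(x))
    return x
```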

[136] In this case only, C need not be convex.

4.45.1 Non-Convex Projected Gradient Descent (3)

Non-Convex Projections (3.1). Here we look at a few special cases where projecting onto
a non-convex set can still be carried out efficiently.
• Projecting onto sparse vectors. The set of s-sparse vectors (vectors with at most $s$ nonzero elements) is denoted as
$$B_0(s) \triangleq \{x \in \mathbb{R}^p \;|\; ||x||_0 \leq s\} \tag{578}$$
It turns out that $\hat{z} := \Pi_{B_0(s)}(z)$ is obtained by setting all except the top-$s$ elements of $z$ to zero.
• Projecting onto low-rank matrices. The set of $m \times n$ matrices with rank at most $r$ is denoted as
$$B_{rank}(r) \triangleq \{A \in \mathbb{R}^{m \times n} \;|\; \mathrm{rank}(A) \leq r\} \tag{579}$$
and we want to project some matrix $A$ onto this set,
$$\Pi_{B_{rank}(r)}(A) := \arg\min_{X \in B_{rank}(r)} ||A - X||_F \tag{580}$$
This can be done efficiently via SVD on $A$, retaining the top $r$ singular values and vectors (a small sketch of both projections follows this list).
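A minimal numpy sketch of these two projections, under the interpretation above (top-s magnitude hard-thresholding and a truncated SVD):

```python
import numpy as np

def project_sparse(z, s):
    """Keep the s largest-magnitude entries of z, zero out the rest."""
    out = np.zeros_like(z)
    keep = np.argsort(np.abs(z))[-s:]
    out[keep] = z[keep]
    return out

def project_low_rank(A, r):
    """Best rank-r approximation of A in Frobenius norm via truncated SVD."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :r] * S[:r]) @ Vt[:r]
```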

Restricted Strong Convexity and Smoothness (3.2).


Restricted Convexity
A continuously differentiable function f : Rp 7→ R is said to satisfy restricted convexity
over a (possibly non-convex) region C ⊆ Rp if ∀x, y ∈ C, we have that

f (y) ≥ f (x) + h∇f (x), y − xi (581)

and a similar rephrasing for restricted strong convexity (RSC) and restricted strong smoothness
(RSS).

Papers and Tutorials August 24, 2018

Improving Language Understanding by Generative Pre-Training



Radford et al., “Improving Language Understanding by Generative Pre-Training,” (2018).

Unsupervised Pre-Training (3.1). Given an unsupervised corpus of tokens $\mathcal{U} = \{u_1, \ldots, u_n\}$, train with a standard LM objective:
$$L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta) \tag{582}$$

The authors use a Transformer decoder, i.e. literally just the decoder part of the Trans-
former in “Attention is all you need.”

Supervised Fine-Tuning (3.2). Now we have a labeled corpus $\mathcal{C}$, where each instance consists of a sequence of input tokens $x^1, \ldots, x^m$, along with a label $y$. They just feed the inputs through the transformer until they obtain the final transformer block's activation $h_l^m$, and linearly project it to the output space:
$$P(y \mid x^1, \ldots, x^m) = \mathrm{softmax}(h_l^m W_y) \tag{583}$$
$$L_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x^1, \ldots, x^m) \tag{584}$$

They also found that including a language modeling auxiliary objective helped learning,

L3 (C) = L2 (C) + λL1 (C) (585)

. . . that’s it. Extremely simple, yet somehow effective.

Papers and Tutorials August 30, 2018

Deep Contextualized Word Representations



Peters et al., “Deep Contextualized Word Representations,” (2018).

Bidirectional Language Models (3.1). Given a sequence of N tokens, a forward LM computes the probability of the sequence via
$$p(t_1, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_1, \ldots, t_{k-1}) \tag{586}$$
A common approach is learning context-independent token representations $x_k$ and passing these through $L$ layers of forward LSTMs. The top-layer LSTM output at step $k$, $\overrightarrow{h}_{k,L}$, is used to predict $t_{k+1}$ with a softmax layer. The authors' biLM combines a forward and backward LM to jointly maximize
$$\sum_{k=1}^{N} \Big( \log p(t_k \mid t_1, \ldots, t_{k-1}; \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s) + \log p(t_k \mid t_{k+1}, \ldots, t_N; \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) \Big) \tag{587}$$

and it’s important to note the shared parameters Θx (token representation) and Θs (output
softmax).

ELMo (3.2). A task-specific linear combination of the intermediate representations.
$$\mathrm{ELMo}_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task}\, h_{k,j}^{LM} \tag{588}$$

where stask are softmax-normalized weights (so the combination is convex). The authors also
mention that, in some cases, it helped to apply layer normalization to each biLM layer
before weighting.
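A small numpy sketch of eq. (588), assuming the frozen biLM layer activations for one token are already stacked into a matrix (names are illustrative):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def elmo_token(h_layers, s_logits, gamma):
    """h_layers: (L+1, dim) biLM representations h_{k,j} for one token k;
    s_logits: (L+1,) unnormalized task weights; gamma: scalar task scale."""
    s = softmax(s_logits)           # softmax-normalized weights s_j^task
    return gamma * (s @ h_layers)   # eq. (588)
```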

Using biLMs for Supervised NLP (3.3). Given a pretrained biLM and a supervised
architecture, we can learn the ELMo representations (jointly with the given supervised task)
as follows.
1. Freeze the weights of the [pretrained] biLM.
2. Concatenate the token representations (e.g. GloVe) with the ELMo representation.
3. Pass the concatenated representation into the supervised architecture.
The authors found it beneficial to apply some dropout to ELMo, and in some cases to add L2 regularization on the ELMo weights.

Pretrained biLM Architecture (3.4). In addition to the biLM we introduced earlier, the
authors make the following changes/specifications for their pretrained biLMs:
• residual connections between LSTM layers137 .
• Halved all embedding and hidden dimensions from the CNN-BIG-LSTM model in Ex-
ploring the Limits of Language Modeling.
• The xk token representations use 2048 character n-gram convolutional filters followed by
two highway layers.

137
So the output of some layer, instead of being LSTM(x), becomes (x + LSTM(x))

Papers and Tutorials August 30, 2018

Exploring the Limits of Language Modeling



Jozefowicz et al., “Exploring the Limits of Language Modeling,” (2016).

NCE and Importance Sampling (3.1). In this section, assume any $p(w)$ is shorthand for $p(w \mid \{w_{prev}\})$.
• Noise Contrastive Estimation (NCE). Train a classifier to discriminate between true data (from distribution $p_d$) and samples coming from some arbitrary noise distribution $p_n$. If these distributions were known, we could compute
$$p(Y{=}\text{true} \mid w) = \frac{p(w \mid \text{true})\, p(\text{true})}{p(w)} \tag{589}$$
$$= \frac{p_d(w)\, p(\text{true})}{p(w, \text{true}) + p(w, \text{false})} \tag{590}$$
$$= \frac{p_d(w)\, p(\text{true})}{p_d(w)\, p(\text{true}) + p_n(w)\, p(\text{false})} \tag{591}$$
$$= \frac{p_d(w)}{p_d(w) + k\, p_n(w)} \tag{592}$$
where $k$ is the number of negative samples per positive word. The idea is to train a logistic classifier $p(Y{=}\text{true} \mid w) = \sigma(\log p_{model} - \log k p_n(w))$; then $\mathrm{softmax}(\log p_{model})$ is a good approximation of $p_d(w)$.
• Importance Sampling. Estimates the partition function. Consider that now we have a set of $k+1$ words $W = \{w_1, \ldots, w_{k+1}\}$, where $w_1$ is the word coming from the true data and the rest are from the noise distribution. We train a multinomial logistic regression over $k+1$ classes,
$$p(Y{=}i \mid W) = \frac{p_d(w_i)}{p_n(w_i)} \cdot \frac{1}{\sum_{i'=1}^{k+1} p_d(w_{i'}) / p_n(w_{i'})} \tag{593}$$
$$\propto \frac{p_d(w_i)}{p_n(w_i)} \tag{594}$$
and we end up seeing that IS is the same as NCE, except in the multiclass setting and with cross-entropy loss instead of logistic loss.

CNN Softmax (3.2). Typically the logit for word w is given by zw = hT ew , where h is often
the output state of an LSTM, and ew is a vector of parameters that could be interpreted as
the word embedding for w. Instead of this, the authors propose what they call CNN Softmax,
where we compute ew = CN N (charsw ). Although this makes the function mapping from w
to ew much smoother (due to the tied weights), it ends up having a hard time distinguishing
between similarly spelled words that may have entirely different meanings. The authors use a
correction factor, learned for each word, such that

zw = hT CN N (charsw ) + hT M corrw (595)

where M projects low-dimensional corrw back up to the dimensionality of the LSTM state h.

Char LSTM Predictions (3.3). To reduce the computational burden of the partition func-
tion, the authors feed the word-level LSTM state h through a character-level LSTM that
predicts the target word one character at a time.

Papers and Tutorials October 28, 2018

Connectionist Temporal Classification



Graves et al., “Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent
Neural Networks,” (2006).

Temporal Classification.
• Input space: Let X = (Rm )∗ be the set of all sequences of m dimensional real-valued
vectors.
• Output space: Let Z = L∗ be the set of all sequences of a finite vocabulary of L labels.
• Data distribution: Denote by DX ×Z the probability distribution over samples (x, z).
Let S denote a set of training examples drawn from this distribution.

From Network Outputs to Labellings (3.1). Let $L' = L \cup \{b\}$ denote the set of unique labels combined with the blank token $b$. We refer to the alignment sequences of length $T$ (the same length as $x$), i.e. elements of the set $(L')^T$, as paths and denote them $\pi$.

Now that we have alignment sequences $\pi$, we need to convert them to label sequences by (1) merging repeated contiguous labels, and then (2) removing blank tokens. We denote this procedure as a many-to-one map $\mathcal{B} : L'^T \mapsto L^{\leq T}$. We can then write the conditional posterior over possible output sequences $\ell$:
$$p(\ell \mid x) = \sum_{\pi \in \mathcal{B}^{-1}(\ell)} p(\pi \mid x) \tag{596}$$

Constructing the Classifier (3.2). There is no tractable algorithm for exact decoding, i.e. computing
$$h(x) \triangleq \arg\max_{\ell \in L^{\leq T}} p(\ell \mid x) \tag{597}$$
However, the following two approximate methods work well in practice:
1. Best Path Decoding. $h(x) \approx \mathcal{B}(\pi^*)$ where $\pi^* = \arg\max_{\pi \in L'^T} p(\pi \mid x)$.
2. Prefix Search Decoding.

The CTC Forward-Backward Algorithm (4.1). Define the probability of obtaining the first $s$ output labels, $\ell_{\langle 1...s \rangle}$, at time $t$ as
$$\alpha_t(s) \triangleq \sum_{\substack{\pi \in L'^T \\ \mathcal{B}(\pi_{\langle 1...t \rangle}) = \ell_{\langle 1...s \rangle}}} \prod_{t'=1}^{t} y^{t'}_{\pi_{t'}} \tag{598}$$
Note that the summation here could contain duplicate $\pi_{\langle 1...t \rangle}$ that differ only in their elements beyond $t$.

We insert a blank token at the beginning and end of $\ell$ and between each pair of labels, and call this augmented sequence $\ell'$. We have the following rules for initializing $\alpha$ at the first output step $t=1$, followed by the recursion rule:
$$\alpha_1(s) = \begin{cases} y^1_b & s = 1 \\ y^1_{\ell_1} & s = 2 \\ 0 & s > 2 \end{cases} \tag{599}$$
$$\alpha_t(s) = \begin{cases} \bar{\alpha}_t(s)\, y^t_{\ell'_s} & \ell'_s = b \text{ or } \ell'_s = \ell'_{s-2} \\ \left( \bar{\alpha}_t(s) + \alpha_{t-1}(s-2) \right) y^t_{\ell'_s} & \text{otherwise} \end{cases} \tag{600}$$
$$\bar{\alpha}_t(s) \triangleq \alpha_{t-1}(s) + \alpha_{t-1}(s-1) \tag{601}$$

It's worth emphasizing how to interpret these, given we've imposed this weird augmented label sequence. In as-verbose-as-possible terms,

$\alpha_t(s)$ is the probability, after running our RNN for $t$ time steps to produce the path $\pi_{\langle 1...t \rangle}$, that $\mathcal{B}(\pi_{\langle 1...t \rangle})$ equals the prefix of $\ell$ (its first $\lfloor s/2 \rfloor$ labels) which, after inserting the blank $b$ between all of its elements and at the boundary, yields the augmented labeling $\ell'_{\langle 1...s \rangle}$.

The way you should think about the different possible cases here is that, at time step $t$, in order for there to be nonzero probability that we can merge the sequence of $t$ RNN outputs into the augmented label sequence $\ell'_{\langle 1...s \rangle}$, it must be true that:
• We emit the token $\ell'_s$ at time $t$ from the RNN.
• At the previous timestep, $t-1$, we emitted a token consistent with our rules for merging, combined with the fact that we've inserted the blank $b$ between every pair of tokens in the final output labeling $\ell$ in order to produce $\ell'$.
The weird case (in my opinion) to consider is realizing that we can emit, for example, one label at time $t-1$ and a different label at time $t$, and this still gets mapped to a portion of the augmented label sequence that has a blank between the two labels.
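A small numpy sketch of the $\alpha$ recursion in eqs. (599)-(601); the per-step output probabilities `y` are assumed given as a $T \times |L'|$ matrix, and no log-space stability tricks are used:

```python
import numpy as np

def ctc_alpha(y, label_seq, blank=0):
    """y: (T, num_symbols) per-step output probabilities; label_seq: target labels (no blanks).
    Returns alpha of shape (T, 2*len(label_seq)+1) over the blank-augmented sequence l'."""
    T = y.shape[0]
    l_prime = [blank]
    for lab in label_seq:
        l_prime += [lab, blank]
    S = len(l_prime)
    alpha = np.zeros((T, S))
    alpha[0, 0] = y[0, blank]                      # eq. (599), s = 1
    if S > 1:
        alpha[0, 1] = y[0, l_prime[1]]             # eq. (599), s = 2
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s] + (alpha[t - 1, s - 1] if s >= 1 else 0.0)  # eq. (601)
            if l_prime[s] != blank and s >= 2 and l_prime[s] != l_prime[s - 2]:
                a += alpha[t - 1, s - 2]           # eq. (600), "otherwise" branch
            alpha[t, s] = a * y[t, l_prime[s]]
    return alpha
```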

Papers and Tutorials November 03, 2018

BERT

Devlin et al., “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” Google
AI Language (Oct 2018).

TL;DR. Bidirectional Encoder Representations from Transformers. Pretrained by jointly


conditioning on left and right context, and can be fine-tuned with one additional non-task-
specific output layer. Authors claim the following contributions:
• Demonstrate importance of bidirectional pre-training for language representations. Ok,
congrats.
• Show that pre-trained representations eliminate needs of task-specific architectures. We
already knew this. Seriously, how is this news?
• Advances SOTA for eleven NLP tasks.

BERT. Instead of using the unidirectional transformer decoder, they use the bidirectional
encoder architecture for their pre-trained model138 .

The input representation is shown in the figure above. The input is a sentence pair, as com-
monly seen in tasks like QA/NLI.

[138] Is this seriously paper-worthy?? I'm taking notes so I can easily refer back on popular approaches, but I don't see what's so special here.

Pre-training Tasks (3.3).
1. Masked LM: Mask 15% of all tokens, and try to predict only those masked tokens139
Furthermore, at training time, the mask tokens are either fed through as (a) the special
[MASK] token 80% of the time, (b) a random word 10% of the time, and (c) the original
word unchanged 10% of the time. Now this is just hackery.
2. Next Sentence Prediction: Given two input sentences A and B, train a binary clas-
sifier to predict whether sentence B came directly after sentence A.
They do the pretraining jointly, using a loss function that’s the sum of the mean masked LM
likelihood and mean next sentence prediction likelihood.

[139] This is the only “novel” idea I've seen in the paper. Seems hacky-ish but also reasonable.

Papers and Tutorials November 25, 2018

Wasserstein is all you need



Singh et al., “Wasserstein is all you need,” EPFL Switzerland (August 2018).

TL;DR. Unsupervised representations of objects/entities via distributional + point estimate.


Made possible by optimal transport.

Optimal Transport (3). First, notation. Let...
• $\Omega$ denote a space of possible outcomes.
• $\mu$ denote an empirical probability measure, defined as some convex combination $\mu(x) = \sum_{i=1}^{n} a_i \delta(x_i)$, where $x_i \in \Omega$.
• $\nu$ denote a similar measure, also a convex combination, $\nu(y) = \sum_{j}^{m} b_j \delta(y_j)$.
• $M_{ij}$ denote the ground cost of moving from point $x_i$ to $y_j$.
Intuition break: recognize that $\mu$ and $\nu$ are just a formal description of probability densities via normalized “counts” $a_i$ and/or $b_j$; those weights are basically probability mass. The Optimal Transport distance between $\mu$ and $\nu$ is the following linear program:
$$\mathrm{OT}(\mu, \nu; M) = \min_{T \in \mathbb{R}_+^{n \times m}} \sum_{ij} T_{ij} M_{ij} \quad \text{s.t.} \quad (\forall i)\; \sum_j T_{ij} = a_i, \quad (\forall j)\; \sum_i T_{ij} = b_j \tag{602}$$

where $T$ is called the transportation matrix. Informally, the constraints are simply enforcing a bijection to/from $\mu$ and $\nu$, in that “all the mass sent from element $i$ must be exactly $a_i$, and all mass sent to element $j$ must be exactly $b_j$”. A particular case called the p-Wasserstein distance, where $\Omega = \mathbb{R}^d$ and $M_{ij}$ is a distance metric over $\mathbb{R}^d$, is defined as
$$W_p(\mu, \nu) \triangleq \mathrm{OT}(\mu, \nu; D_\Omega^p)^{1/p} \tag{603}$$
where $D$ is just a distance metric, e.g. for $p = 2$ it could be euclidean distance.
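A minimal sketch of eq. (602) solved directly as a linear program with scipy; this is fine for tiny n and m only, and real implementations use specialized OT solvers:

```python
import numpy as np
from scipy.optimize import linprog

def ot_distance(a, b, M):
    """a: (n,) and b: (m,) histograms summing to 1; M: (n, m) ground cost matrix."""
    n, m = M.shape
    # Row-sum constraints: sum_j T_ij = a_i; column-sum constraints: sum_i T_ij = b_j.
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):
        A_eq[n + j, j::m] = 1.0
    res = linprog(c=M.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
                  bounds=(0, None), method="highs")
    return res.fun  # sum_ij T_ij M_ij at the optimum
```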

Distributional Estimate (4.1). Let $\mathcal{C} \triangleq \{c_i\}$ be the set of possible contexts, where each context $c_i$ can be a word/phrase/sentence/etc. For a given word $w$ and our set of observed contexts for that word, we essentially want to embed its context histogram into a space $\Omega$ (where typically $\Omega = \mathbb{R}^d$). Let $V$ denote a matrix of context embeddings, such that $V_{i,:} = \vec{c}_i \in \mathbb{R}^d$, the embedding for context $c_i$ in what the authors call the ground space. Combining the histogram $H^w$ containing observed context counts for word $w$ with $V$, the distributional estimate of the word $w$ is defined as
$$P_V^w \triangleq \sum_{c \in \mathcal{C}} (H^w)_c\, \delta(v_c) \tag{604}$$

Also, the point estimate is just vw , i.e. the embedding of the word w when viewed as a context.

Distance (4.2). Given some distance metric $D_\Omega$ in ground space $\Omega = \mathbb{R}^d$, the distance between words $w_i$ and $w_j$ is the solution to the following OT problem[140]:
$$\mathrm{OT}(P_V^{w_i}, P_V^{w_j}; D_\Omega^p) := W_p^\lambda(P_V^{w_i}, P_V^{w_j})^p \tag{605}$$

Concrete Framework (5). The authors make use of the shifted positive pointwise mutual information (SPPMI), $S_s^\alpha$, for computing the word histograms:
$$(H^w)_c := \frac{S_s^\alpha(w, c)}{\sum_{c' \in \mathcal{C}} S_s^\alpha(w, c')} \tag{606}$$
$$S_s^\alpha(w, c) \triangleq \max\left( \log\left( \frac{\mathrm{Count}(w, c) \sum_{c'} \mathrm{Count}(c')^\alpha}{\mathrm{Count}(w)\, \mathrm{Count}(c)^\alpha} \right) - \log(s),\; 0 \right) \tag{607}$$

[140] I'm not sure whether the rightmost exponent of p is a typo in the paper, but that is how it is written.

Papers and Tutorials December 9, 2018

Noise Contrastive Estimation



M. Gutmann and A. Hyvärinen, “Noise-contrastive estimation: A new estimation principle for unnormalized statistical models,” University of Helsinki (2010).

TL;DR: A few ways of thinking about NCE:
• Instead of directly modeling a normalized word distribution $p_d(w)$, we can just model the unnormalized distribution $\tilde{p}_d$ (and an additional parameter $c = -\ln Z$) by training our model to distinguish between true samples from $p_d$ and noise samples from some distribution $p_n$ that we choose.
• Instead of modeling $p(w \mid c)$, we model $p(D \mid w, c)$, where $D$ is a binary RV indicating whether $(w, c)$ are from the true data distribution $p_d$ or the noise distribution $p_n$.

Introduction. Setup & notation:
• We observe $x \sim p_d(\cdot)$ but $p_d(\cdot)$ itself is unknown.
• We model $p_d$ by some model $p_m(\cdot; \alpha)$ parameterized by $\alpha$[141].
So, can we get away with modeling the unnormalized density $\tilde{p}_m$ instead of requiring the normalization constraint to be baked in to our optimization problem? Similar to approaches like contrastive divergence and score matching, noise-contrastive estimation (NCE) aims to address this question.

Noise-contrastive estimation (2.1). Let $c$ be an estimator for $-\ln Z(\alpha)$, and let $\theta = \{\alpha, c\}$ denote all of our parameters. Given observed data $X = \{x_1, \ldots, x_T\}$ and noise $Y = \{y_1, \ldots, y_T\}$, we seek parameters $\hat{\theta}_T$ that maximize $J_T(\theta)$[142]:
$$J_T(\theta) = \frac{1}{2T} \sum_t \ln[h(x_t)] + \ln[1 - h(y_t)] \tag{608}$$
$$h(u) = \sigma(G(u)) \tag{609}$$
$$G(u) = \ln p_m(u) - \ln p_n(u) \tag{610}$$
$$\ln p_m(\cdot; \theta) := \ln \tilde{p}_m(\cdot; \alpha) + c \tag{611}$$
where for compactness reasons I've removed the explicit dependence of all functions above (except $p_n$) on $\theta$. Notice how this fixes the issue of the model just setting $c$ arbitrarily high to obtain a high likelihood[143].

[141] The implicit assumption here is that $\exists \alpha^*$ such that $p_d(\cdot) = p_m(\cdot; \alpha^*)$.
[142] This all assumes, of course, that $p_n$ is fully defined.
[143] The primary reason why MLE is traditionally unable to parameterize the partition function.

Connection to supervised learning (2.2). The NCE objective can be interpreted as binary logistic regression that discriminates whether a point belongs to the data ($p_d$) or to the noise ($p_n$). We model with a uniform prior: $P(C{=}1) = P(C{=}0) = 1/2$.
$$P(C{=}1 \mid u \in X \cup Y) = \frac{P(C{=}1)\, p(u \mid C{=}1)}{p(u)} \tag{612}$$
$$= \frac{p_m(u)}{p_m(u) + p_n(u)} \tag{613}$$
$$\equiv h(u; \theta) \tag{614}$$
where we're now using the union of $X$ and $Y$, $U := \{u_1, \ldots, u_{2T}\}$. The log-likelihood of the data under the parameters $\theta$ is
$$\ell(\theta) = \sum_t^{2T} \left[ C_t \ln P(C_t{=}1 \mid u_t; \theta) + (1 - C_t) \ln P(C_t{=}0 \mid u_t; \theta) \right] \tag{615}$$
$$= \sum_t^{2T} \left[ C_t \ln h(u_t; \theta) + (1 - C_t) \ln[1 - h(u_t; \theta)] \right] \tag{616}$$
$$= \sum_t^{T} \left[ \ln h(x_t; \theta) + \ln[1 - h(y_t; \theta)] \right] \tag{617}$$

which is (up to a constant factor) the same as our NCE objective.
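As a concrete illustration, here is a tiny numpy sketch of the objective in eqs. (608)-(611) for one batch of data and noise samples; the unnormalized log-model and the noise log-density are vectorized callables supplied by the caller (assumptions, not part of the paper):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def nce_objective(x, y_noise, log_p_model_unnorm, log_p_noise, c):
    """x, y_noise: arrays of data / noise samples (same length T);
    log_p_model_unnorm(u) -> unnormalized log-model; c estimates -log Z."""
    def h(u):
        G = (log_p_model_unnorm(u) + c) - log_p_noise(u)   # eqs. (610)-(611)
        return sigmoid(G)                                  # eq. (609)
    T = len(x)
    return (np.log(h(x)) + np.log(1.0 - h(y_noise))).sum() / (2 * T)  # eq. (608)
```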

Properties of the estimator (2.3). As $T \to \infty$, $J_T(\theta) \to J(\theta)$, where
$$J(\theta) \triangleq \lim_{T \to \infty} J_T(\theta) = \frac{1}{2} \mathbb{E}\left[ \ln h(x; \theta) + \ln[1 - h(y; \theta)] \right] \tag{618}$$
$$\tilde{J}(f) \triangleq \frac{1}{2} \mathbb{E}\left[ \ln \sigma\big(f(x) - \ln p_n(x)\big) + \ln\left( 1 - \sigma\big(f(y) - \ln p_n(y)\big) \right) \right] \tag{619}$$
Theorem 1
$\tilde{J}$ attains exactly one maximum, located at $f(\cdot) = \ln p_d(\cdot)$, provided $p_d(\cdot) \neq 0 \implies p_n(\cdot) \neq 0$.
4.52.1 Self-Normalized NCE

Notes from A. Mnih and Y. Teh, “A fast and simple algorithm for training neural probabilistic
language models” (2012).

Maximum likelihood learning. Let $P_\theta^h(w)$ denote the probability of observing word $w$ given context $h$. For neural LMs, we assume this is the softmax of a scoring function $s_\theta(w, h)$ (the logits). In what follows, I'll drop the explicit $\theta$ and $h$ subscript/superscript notation for brevity.
$$\frac{\partial \log P(w)}{\partial \theta} = \frac{\partial}{\partial \theta} s(w, h) - \frac{\partial}{\partial \theta} \log \sum_{w'} e^{s(w', h)} \tag{620}$$
$$= \frac{\partial}{\partial \theta} s(w, h) - \sum_{w'} P(w') \frac{\partial}{\partial \theta} s(w', h) \tag{621}$$
$$= \frac{\partial}{\partial \theta} s(w, h) - \mathbb{E}_{w \sim P_\theta^h}\left[ \frac{\partial}{\partial \theta} s(w, h) \right] \tag{622}$$
where the expectation term is expensive because it requires evaluating $s(w, h)$ for all words in the vocabulary. One approach is importance sampling, where we sample a subset of $k$ words from the vocab and compute the probabilities from that approximation:

$$\frac{\partial \log P(w)}{\partial \theta} = \frac{\partial}{\partial \theta} s(w, h) - \sum_{w'} P(w') \frac{\partial}{\partial \theta} s(w', h) \tag{623}$$
$$\approx \frac{\partial}{\partial \theta} s(w, h) - \frac{1}{V} \sum_{j=1}^{k} v(x_j) \frac{\partial}{\partial \theta} s(x_j, h) \tag{624}$$
$$\text{where } v(x) = \frac{e^{s_\theta(x, h)}}{Q_h(w = x)} \tag{625}$$

and we refer to v as the importance weights. In NLP, we typically set Q to the Zipfian
distribution144

[144] TensorFlow seems to define this as
$$P_{Zipf}(w) = \frac{\log(w + 2) - \log(w + 1)}{\log(V + 1)} \tag{626}$$
where $V$ is the vocabulary size. I can't seem to find this definition anywhere else though. A more common form seems to be
$$P(w) = \frac{1/w^s}{\sum_{w'} 1/w'^s} \tag{627}$$
I plotted both on WolframAlpha (link here) and they do indeed look basically the same, especially for any reasonably large $V$.

NCE. In NCE, we introduce a [unigram] noise distribution $P_n(w)$ and impose a prior that noise samples are $k$ times more frequent than data samples from $P_d^h(w)$, resulting in the joint distribution
$$P^h(D, w) = P(D{=}1) P_d^h(w) + P(D{=}0) P_n(w) \tag{628}$$
$$= \frac{1}{k+1} P_d^h(w) + \frac{k}{k+1} P_n(w) \tag{629}$$
Our goal is to learn the posterior distribution $P^h(D{=}1 \mid w)$ (so we replace $P_d$ with $P_\theta$):
$$P^h(D{=}1 \mid w, \theta) = \frac{P_\theta^h(w)}{P_\theta^h(w) + k P_n(w)} \tag{630}$$

In NCE, we re-parameterize $P_\theta$ by treating $-\log Z$ as a parameter itself, $c^h$[145].
$$P_\theta^h(w) := P_{\theta_0}^h(w) \exp(c^h) \tag{631}$$
where $P_{\theta_0}^h$ denotes the unnormalized distribution. It turns out that, in practice, we can impose that $\exp(c^h) = 1$ and use the unnormalized $P_{\theta_0}^h$ in place of the true probabilities in all that follows. Critically, note that this means we rewrite equation 630 using the unnormalized probabilities in place of $P_\theta^h$. The NCE objective is the following[146], where I've shown each step of the derivation:
step of the derivation:

θ∗ = arg max J h (θ) (632)


θ
J h (θ) = E(D,w)∼P h [log P (D | w, θ)] (633)
1
X X
= P h (D, w) log P h (D | w, θ) (634)
D=0 w
X X
= P h (0, w) log P h (0 | w) + P h (1, w) log P h (1 | w) (635)
w w
1 X
kPn (w) log P h (0 | w) + Pdh (w) log P h (1 | w)

= (636)
k+1 w
1 h i
kEPn log P h (0 | w) + EP h log P h (1 | w)
  
= (637)
k+1 d

h
∝ kEPn log P (0 | w) + EP h log P h (1 | w)
   
(638)
d

The gradient of the NCE objective is thus

\frac{\partial}{\partial \theta} J^h(\theta) = \sum_w \frac{k P_n(w)}{P_\theta^h(w) + k P_n(w)} \left( P_d^h(w) - P_\theta^h(w) \right) \frac{\partial}{\partial \theta} \log P_\theta^h(w)   (639)

TODO: incorporate more info from Chris Dyer’s excellent notes.

145
Reminder that the superscript h indicates that Z is a function of the context h.
146
I’ll drop the θ dependence wherever obvious for the sake of compactness.

Papers and Tutorials December 16, 2018

Neural Ordinary Differential Equations


Table of Contents Local Written by Brandon McKinzie

R. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud, “Neural Ordinary Differential Equations,” University
of Toronto (Oct 2018).

Introduction (1). Let T denote the number of layers of our network. In the limit of T → ∞
and small δh(t) between each “layer”, we can parameterize the dynamics via an ODE:

\frac{dh(t)}{dt} = f(h(t), t, \theta)   (640)
Benefits of defining models using ODE solvers:
• Memory. Constant as a function of depth, since we don’t need to store intermediate
values from the forward pass.
• Adaptive computation. Modern ODE solvers adapt evaluation strategy on the fly.
• Parameter efficiency. Nearby “layers” have shared parameters.
Review: ODE
Remember the basic idea with ODEs like the one shown above. Our goal is to solve for h(t).

dh(t) = f(h(t), t, \theta)\, dt   (641)
\int dh(t) = \int f(h(t), t, \theta)\, dt   (642)
h(t) + c_1 = \int f(h(t), t, \theta)\, dt   (643)

and so the solution of an ODE is often represented as an integral.

Reverse-mode automatic differentiation of ODE solutions (2). Our goal is to optimize


L(z(t_1)) = L\left( \int_{t_0}^{t_1} f(z(t), t, \theta)\, dt \right)   (645)

Given our starting definition (eq 640), we can say


z(t + \varepsilon) = z(t) + \int_t^{t+\varepsilon} f(z(t), t, \theta)\, dt := T_\varepsilon(z(t), t)   (646)
which we can use to define the adjoint a(t):

a(t) \triangleq -\frac{\partial L}{\partial z(t)} = -\frac{\partial L}{\partial z(t+\varepsilon)} \frac{dz(t+\varepsilon)}{dz(t)}   (647)
 = a(t+\varepsilon) \frac{\partial T_\varepsilon(z(t), t)}{\partial z(t)}   (648)
\frac{da(t)}{dt} = -a(t)^T \frac{\partial f(z(t), t, \theta)}{\partial z}   (649)

where da(t)/dt can be derived using the limit definition of a derivative. We’ll now outline the
algorithm for computing gradients. We use a black box ODE solver as a subroutine that solves
a first-order ODE initial value problem. As such, it accepts an initial state, its first derivative,
the start time, the stop time, and parameters θ as arguments.
Reverse-mode derivative
Given start time t_0, stop time t_1, final state z(t_1), parameters θ, and gradient ∂L/∂z(t_1), compute all gradients of L.

1. Compute the t_1 gradient:
   \frac{\partial L}{\partial t_1} = \frac{\partial L}{\partial z(t_1)}^T \frac{\partial z(t_1)}{\partial t_1} = \frac{\partial L}{\partial z(t_1)}^T f(z(t_1), t_1, \theta)   (650)
2. Initialize the augmented state:
   s_0 := \left[ z(t_1),\; a(t_1),\; 0,\; -\frac{\partial L}{\partial t_1} \right]   (651)
3. Define the augmented state dynamics:
   \frac{ds}{dt} \triangleq \left[ f(z(t), t, \theta),\; -a(t)^T \frac{\partial f}{\partial z},\; -a(t)^T \frac{\partial f}{\partial \theta},\; -a(t)^T \frac{\partial f}{\partial t} \right]   (652)
4. Solve the reverse-time^a ODE:
   \left[ z(t_0),\; \frac{\partial L}{\partial z(t_0)},\; \frac{\partial L}{\partial \theta},\; \frac{\partial L}{\partial t_0} \right] = \mathrm{ODESolve}\left( s_0,\; \frac{ds}{dt},\; t_1,\; t_0,\; \theta \right)   (653)
5. Return \frac{\partial L}{\partial z(t_0)}, \frac{\partial L}{\partial \theta}, \frac{\partial L}{\partial t_0}, \frac{\partial L}{\partial t_1}.

^a Notice how our “initial state” actually corresponds to t_1, and we pass in t_1 and t_0 in the opposite order we typically do.
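As a minimal illustration of the “black box ODE solver” viewpoint for the forward pass (eqs. 645–646), here is a scipy sketch with made-up linear dynamics and a hypothetical parameter matrix theta (not the authors’ code):

import numpy as np
from scipy.integrate import solve_ivp

theta = np.array([[0.0, -1.0],
                  [1.0,  0.0]])      # hypothetical parameters of the dynamics

def f(t, h):
    # dh/dt = f(h(t), t, theta); here just a simple linear vector field
    return theta @ h

z_t0 = np.array([1.0, 0.0])          # initial state z(t0)
sol = solve_ivp(f, t_span=(0.0, 1.0), y0=z_t0, rtol=1e-6)
z_t1 = sol.y[:, -1]                  # z(t1), the state fed into the loss L(z(t1))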

Papers and Tutorials December 16, 2018

On the Dimensionality of Word Embedding


Table of Contents Local Written by Brandon McKinzie

Z. Yin and Y. Shen, “On the Dimensionality of Word Embedding,” Stanford University (Dec 2018).

Unitary Invariance of Word Embeddings (2.1). The authors interpret this result as: applying any unitary transformation (e.g. a rotation) to the embedding matrix yields an embedding equivalent to the original.

Word Embeddings from Explicit Matrix Factorization (2.2). Let M be the V × V co-occurrence counts matrix. One way of getting embeddings is doing a truncated SVD on M = U D V^T. If we want k-dimensional embedding vectors, we can do

E = U_{1:k}\, D_{1:k,1:k}^{\alpha}   (654)

for some α ∈ [0, 1].

PIP Loss (3). Given embedding matrix E ∈ R^{V×d}, define its Pairwise Inner Product (PIP) matrix to be

PIP(E) \triangleq E E^T   (655)

Notice that PIP(E)_{i,j} = ⟨w_i, w_j⟩. Define the PIP loss for comparing two embeddings E_1 and E_2 for a common vocab of V words:

||PIP(E_1) - PIP(E_2)||_F = \sqrt{ \sum_{i,j} \left( \langle w_i^{(1)}, w_j^{(1)} \rangle - \langle w_i^{(2)}, w_j^{(2)} \rangle \right)^2 }   (656)
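A small numpy sketch of the PIP matrix and PIP loss (my own illustration; the sizes below are arbitrary):

import numpy as np

def pip(E):
    return E @ E.T                       # PIP(E)_{ij} = <w_i, w_j>

def pip_loss(E1, E2):
    return np.linalg.norm(pip(E1) - pip(E2), ord="fro")

V, d1, d2 = 100, 50, 60                  # same vocab, different embedding dims
E1, E2 = np.random.randn(V, d1), np.random.randn(V, d2)
print(pip_loss(E1, E2))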

Papers and Tutorials December 26, 2018

Generative Adversarial Nets


Table of Contents Local Written by Brandon McKinzie

Goodfellow et al., “Generative Adversarial Nets,” (June 2014)

TL;DR. The abstract is actually quite good:


. . . we simultaneously train two models: a generative model G that captures the data dis-
tribution, and a discriminative model D that estimates the probability that a sample came
from the training data rather than G. The training procedure for G is to maximize the
probability of D making a mistake. This framework corresponds to a minimax two-player
game.

Adversarial Nets (3). As usual, first go over notation:


• Generator produces data samples147 , x := G(z; θg ), where z ∼ pn (noise distribution
prior).
• Discriminator, D(x; θd ), outputs probability that x came from (true) pdata instead of G.
Our two-player minimax optimization problem can be written as:

\min_G \max_D V(D, G) = E_{x \sim p_{data}}[\log D(x)] + E_{z \sim p_n}[\log(1 - D(G(z)))]   (657)

Theoretical Results (4). Below is the training algorithm.

SGD with GANs


Repeat the following for each training iteration.
1. Train D . For k steps, repeat:
(a) Sample m noise samples {z1 , . . . , zm } from noise prior pn .
(b) Sample m data samples {x1 , . . . , xm } from data distribution pdata .
(c) Update discriminator by ascending ∇θd V (D , G).
2. Train G: Sample another m noise samples {z1 , . . . , zm } and descend on ∇θg V (D , G).

147
Note that G outputs samples x, not probabilities. By doing this, it implicitly defines a probability distri-
bution pg (x). This is what the authors say.

213
Global Optimality of p_g ≡ p_data (4.1).
Proposition 1
For fixed G, the optimal D is

D^*_G(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}   (658)


Derivation of D^*_G(x).

Aside: Law of the unconscious statistician (LotUS). The distribution pg (x) should be read as “the proba-
bility that the output of G yields the value x.” Take a step back and recognize that G is simply a function of a
random variable z. As such, we can apply familiar rules like

E[G(z)] = E_{z \sim p_n}[G(z)]   (659)
        = \int_z p_n(z)\, G(z)\, dz   (660)

However, recall that functions of random variables can themselves be interpreted as random variables. In other
words, we can also use the interpretation that G evaluates to some output x with probability pg (x).

E[G] = E_{x \sim p_g}[x]   (661)
     = \int_x p_g(x)\, x\, dx   (662)

As this blog post details, this equivalence is NOT due to a change of variables, but rather by the Law of the
unconscious statistician.

The Proof: We can directly use LotUS to rewrite V(G, D):

V(G, D) = E_{x \sim p_{data}}[\log D(x)] + E_{z \sim p_n}[\log(1 - D(G(z)))]   (663)
        = E_{x \sim p_{data}}[\log D(x)] + E_{x \sim p_g}[\log(1 - D(x))]   (664)
        = \int_x \left[ p_{data}(x) \log D(x) + p_g(x) \log(1 - D(x)) \right] dx   (665)

LotUS allowed us to express V(G, D) as a continuous function over x. More importantly, it means we can evaluate ∂V/∂D and take the derivative inside the integral^a. Setting the derivative to zero and solving for D yields D^*_G, the form that maximizes V.
^a Also remember that D(·) ∈ [0, 1] since it outputs a probability.

The authors use this proposition to define the virtual training criterion C(G) \triangleq V(G, D^*_G):

C(G) = E_{x \sim p_{data}}\left[ \log \frac{p_{data}(x)}{p_{data}(x) + p_g(x)} \right] + E_{x \sim p_g}\left[ \log \frac{p_g(x)}{p_{data}(x) + p_g(x)} \right]   (666)

Theorem 1.
The global minimum of C(G) is achieved iff p_g = p_data. At that point C(G) = −log 4.

Proof: Theorem 1
The authors subtract V(D^*_G, G; p_g = p_data) from both sides of 666, do some substitutions, and find that

C(G) = 2 \cdot JSD(p_{data}\,||\,p_g) - \log 4   (667)

where JSD is the Jensen–Shannon divergence^a. Since 0 ≤ JSD(p||q) always, with equality only if p ≡ q, this proves Theorem 1 above.
^a Recall that the JSD represents the divergence of each distribution from the mean of the two.
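A quick numerical sanity check of eq. 667 (my own, not from the paper), using two toy discrete distributions:

import numpy as np

p_data = np.array([0.1, 0.4, 0.5])
p_g    = np.array([0.3, 0.3, 0.4])

def kl(p, q):
    return np.sum(p * np.log(p / q))

m = 0.5 * (p_data + p_g)
jsd = 0.5 * kl(p_data, m) + 0.5 * kl(p_g, m)

# C(G) as in eq. 666 for discrete x
C = np.sum(p_data * np.log(p_data / (p_data + p_g))) \
  + np.sum(p_g    * np.log(p_g    / (p_data + p_g)))

print(C, 2 * jsd - np.log(4))   # the two numbers agree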

Papers and Tutorials January 01, 2019

A Framework for Intelligence and Cortical Function


Table of Contents Local Written by Brandon McKinzie

J. Hawkins et al., “A Framework for Intelligence and Cortical Function Based on Grid Cells in the Neocortex,” Numenta Inc. (Oct 2018).

Introduction. Authors propose new framework based on location processing that provides
supporting evidence to the theory that all regions of the neocortex are fundamentally
the same. We’ve known that grid cells exist in the hippocampal complex of mammals, but
only recently have seen evidence that they may be present in the neocortex.

How Grid Cells Represent Location. Grid cells in the entorhinal cortex148 represent
space and location. The main concepts, in order such that they build on one another, are as
follows:
• A single grid cell is a neuron that fires [when the agent is] at [one of many] multiple
locations in a physical environment149 .
• A grid cell module is a set of grid cells that activate with the same lattice spacing and
orientation but at shifted locations within an environment.
• Multiple grid cell modules that differ in tile spacing and/or orientation can provide
unique location information150 .
Crucially, the number of unique locations that can be represented by a set of grid cell modules
scales exponentially with the number of modules. Every learned environment is associated
with a set of unique locations (firing patterns of the grid cells).

Grid Cells in the Neocortex. The authors propose that we learn the structures of objects
(like pencils, cups, etc) via grid cells in the neocortex. Specifically, they propose:
1. Every cortical column has neurons that perform a function similar to grid cells.
2. Cortical columns learn models of objects similarly to how grid/place cells learn models
of environments.

148
The entorhinal cortex is located in the medial temporal lobe and functions as a hub in a widespread network
for memory, navigation and the perception of time.
149
For example, there may be a grid cell in my brain that fires when I’m at certain locations inside my room.
Those locations tend to form a lattice of sorts.
150
A single module alone cannot, because it repeats periodically. In other words, it can only provide relative
location information.

Papers and Tutorials January 05, 2019

Large-Scale Study of Curiosity Driven Learning


Table of Contents Local Written by Brandon McKinzie

Burda et al., “Large-Scale Study of Curiosity Driven Learning,” OpenAI and UC Berkeley (Aug 2018).

An agent sees observation xt , takes action at , and transitions to the next state with observation
xt+1 . Goal: incentivize agent with reward rt relating to how informative the transition was.
The main components in what follows are:
• Observation embedding φ(x).
• Forward dynamics network for the prediction p(φ(x_{t+1}) | x_t, a_t).
• Exploration reward (surprisal):

  r_t = -\log p(\phi(x_{t+1}) \mid x_t, a_t)   (668)

The authors choose to model the next state embedding with a Gaussian with fixed covariance, so the surprisal reduces (up to constants) to a squared error:

\phi(x_{t+1}) \mid x_t, a_t \sim N(f(x_t, a_t), \Sigma)   (669)
r_t = ||f(x_t, a_t) - \phi(x_{t+1})||_2^2   (670)

where f is the learned dynamics model.

Feature spaces (2.1). Some possible ways to define φ:


• Pixels: φ(x) = x.
• Random Features: Literally feeding φ(x) = ConvN et(x) where ConvNet is randomly
initialized and fixed.
• VAE: Use the mapping to the mean [of the approximated distribution] as the embedding
network φ.

Interpretation. It seems that this works because after awhile, it is boring and predictable to
take actions that result in losing a game. The most surprising actions seem to be those that
advance us forward, to new and uncharted territory. However, these experiments are all on
games that have a very "linear" uni-directional-like sequence of success. I wonder how successful
this would be in a game like rocket league, where there is no tight coupling of direction with
success and novelties (e.g. moving forward in mario bros).

Papers and Tutorials March 02, 2019

Universal Language Model Fine-Tuning for Text Classification


Table of Contents Local Written by Brandon McKinzie

J. Howard and S. Ruder, “Universal Language Model Fine-Tuning for Text Classification,” (May 2018).

TL;DR. ULMFiT is a transfer learning method. They introduce techniques for fine-tuning
language models.

Universal Language Model Fine-tuning (3). Define the general inductive transfer learning setting for NLP:
Given a source task T_S and a target task T_T ≠ T_S, improve performance on T_T.

ULMFiT is defined by the following three steps:


1. General-domain LM pretraining.
2. Target task LM fine-tuning.
3. Target task classifier fine-tuning.

Target Task LM Fine-tuning (3.2). For step 2, the authors propose what they call dis-
criminative fine-tuning and slanted triangular learning rates.

• Discriminative fine-tuning. Tune each layer with different learning rates:

\theta_t^\ell = \theta_{t-1}^\ell - \eta^\ell \cdot \nabla_{\theta^\ell} J(\theta)   (671)

The authors suggest setting η `−1 = η ` /2.6.

• Slanted triangular learning rates. A learning rate schedule that first increases linearly to a peak and then linearly decays (the “slanted triangle” shape); a sketch of the schedule follows the definitions below.
  First, we define the following hyperparameters:
  – T: total number of training iterations^{151}
  – cut_frac: fraction of T (in iterations) during which we increase the learning rate.
  – cut: ⌊T · cut_frac⌋, the iteration where we switch from increasing the LR to decreasing it.
  – ratio: η_max/η_min. We of course must also define η_max.
  We can now compute the learning rate for a given iteration t (suggested: cut_frac = 0.1, ratio = 32, η_max = 0.01):

  \eta_t = \eta_{max} \cdot \frac{1 + p \cdot (ratio - 1)}{ratio}   (672)
  p = \begin{cases} t/cut & \text{if } t < cut \\ 1 - \frac{t - cut}{cut \cdot (1/cut\_frac - 1)} & \text{otherwise} \end{cases}   (673)
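A direct transcription of eqs. 672–673 as a Python function (parameter names mirror the paper; the defaults are the suggested values):

def stlr(t, T, cut_frac=0.1, ratio=32, eta_max=0.01):
    # slanted triangular learning rate at iteration t out of T total iterations
    cut = int(T * cut_frac)
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return eta_max * (1 + p * (ratio - 1)) / ratio

lrs = [stlr(t, T=1000) for t in range(1000)]  # rises for 100 steps, then decays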

Target Task Classifier Fine-tuning (3.3). Augment the LM with two fully-connected layers.
The first with ReLU activation and the second with softmax. Each uses batch normalization
and dropout. The first is fed the output hidden state of the LM concatenated with the max-
and mean-pooled hidden states over all timesteps152 :

hc = [ht , maxpool(H), meanpool(H)] (674)

In addition to DF-T and STLR from above, they also employ the following techniques:
• Gradual Unfreezing: first unfreeze the last layer and fine-tune it alone with all other
layers frozen for one epoch. Then, unfreeze the next layer and fine-tune the last-two
layers only for the next epoch. Continue until the entire network is being trained, at
which time we just train until convergence.
• BPTT for Text Classification. Divide documents into fixed-length “batches”153 of
size b. They initialize the ith section with the final state of the model run on section
i − 1.

151
Steps-per-epoch times number of epochs.
152
Or just as much as we can fit into GPU memory.
153
Not a fan of how they overloaded this term here.

Papers and Tutorials March 10, 2019

The Marginal Value of Adaptive Gradient Methods in Machine Learning


Table of Contents Local Written by Brandon McKinzie

Wilson et al., “The Marginal Value of Adaptive Gradient Methods in Machine Learning,” (May 2018).

TL;DR. For simple overparameterized problems, adaptive methods (a) often find drastically different solutions than SGD, and (b) tend to give undue influence to spurious features that have no effect on out-of-sample generalization. They also found that tuning the initial learning rate and decay scheme for Adam yields significant improvements over its default settings in all cases.

Background (2). The gradient updates for general stochastic gradient, stochastic momentum, and adaptive gradient methods, respectively, can be formalized as follows^{154}

[regular]   \Delta w_{k+1} = -\alpha_k \nabla \tilde{f}(w_k)   (675)
[momentum]  \Delta w_{k+1} = -\alpha_k \nabla \tilde{f}(w_k + \gamma_k \Delta w_k) + \beta_k \Delta w_k   (676)
[adaptive]  \Delta w_{k+1} = -\alpha_k H_k^{-1} \nabla \tilde{f}(w_k + \gamma_k \Delta w_k) + \beta_k H_k^{-1} H_{k-1} \Delta w_k   (677)

where H_k is a p.d. matrix involving the entire sequence of iterates (w_1, ..., w_k). For example, regular momentum would be γ_k = 0, and Nesterov momentum would be γ_k = β_k. In practice, we basically always define H_k as the diagonal matrix:

H_k := \mathrm{diag}\left( \left( \sum_{i=1}^{k} \eta_i\, g_i \circ g_i \right)^{1/2} \right)   (678)

The Potential Perils of Adaptivity (3). Consider the binary least-squares classification
problem, where we aim to minimize
R_S[w] := \frac{1}{2} ||Xw - y||_2^2   (679)
where X ∈ Rn×d and y ∈ {−1, 1}n .
Lemma 3.1
If there exists a scalar c s.t. Xsign(X T y) = cy, then (assuming w0 := 0) AdaGrad, Adam,
and RMSProp all converge to the unique solution w ∝ sign(X T y).

154
I’m defining ∆wk+1 , wk+1 − wk .

Papers and Tutorials March 17, 2019

A Theoretically Grounded Application of Dropout in Recurrent Neural Networks


Table of Contents Local Written by Brandon McKinzie

Y. Gal and Z. Ghahramani, “A Theoretically Grounded Application of Dropout in Recurrent Neural Networks,”
University of Cambridge (Oct 2016).

Background (3). In Bayesian regression, we want to infer the parameters ω of some function
y = f ω (x). We define a prior, p(ω), and a likelihood,
p(y=d | x, ω) = Catd (softmax (f ω (x))) (680)
for a classification setting. Given a dataset X, Y, and some new point x∗ , we can predict its
output via
p(y^* \mid x^*, X, Y) = \int p(y^* \mid x^*, \omega)\, p(\omega \mid X, Y)\, d\omega   (681)

In a Bayesian neural network, we place the prior over the NN’s weights (typically Gaussians). The posterior p(ω | X, Y) is usually intractable, so we resort to variational inference to approximate it. We define our approximating distribution q(ω) and aim to minimize the KLD:

KL(q(\omega)\,||\,p(\omega \mid X, Y)) \propto -\int q(\omega) \log p(Y \mid X, \omega)\, d\omega + KL(q(\omega)\,||\,p(\omega))   (682)
 = -\sum_{i=1}^{N} \int q(\omega) \log p(y_i \mid f^\omega(x_i))\, d\omega + KL(q(\omega)\,||\,p(\omega))   (683)

Variational Inference in RNNs (4). The authors use MC integration to approximate the integral. They use only a single sample ω̂ ∼ q(ω) for each of the N summands, resulting in an unbiased estimator. Plugging this in, we obtain our objective:

L \approx -\sum_{i=1}^{N} \log p\left( y_i \mid f_y^{\hat{\omega}_i}\left( f_h^{\hat{\omega}_i}(x_{i,T}, h_{T-1}) \right) \right) + KL(q(\omega)\,||\,p(\omega))   (684)

The crucial observations here are:


• For each sequence xi , we sample a new realization ω̂i .
• For each of the T symbols in xi , we use that same realization.
We define our approximating distribution to factorize over the weight matrices and their rows
in ω. For each weight matrix row wk , we have

q(w_k) \triangleq p\, N(w_k; 0, \sigma^2 I) + (1 - p)\, N(w_k; m_k, \sigma^2 I)   (685)

with variational parameter m_k (a row vector).

Papers and Tutorials March 24, 2019

Improving Neural Language Models with a Continuous Cache


Table of Contents Local Written by Brandon McKinzie

E. Grave, A. Joulin, and N. Usunier, “Improving Neural Language Models with a Continuous Cache,” Facebook
AI Research (Dec 2016).

The cache stores pairs (ht , xt+1 ) of the final hidden-state representation at time t, along with
the word which was generated 155 based on this representation.
 
p_{vocab}(w \mid x_{1...t}) \propto \exp\left( h_t^T o_w \right)   (686)
p_{cache}(w \mid h_{1...t}, x_{1...t}) \propto \sum_{i=1}^{t-1} \mathbb{1}_{x_{i+1}=w} \exp\left( \theta\, h_t^T h_i \right)   (687)
 = \sum_{(x,h) \in cache,\; x=w} \exp\left( \theta\, h_t^T h \right)   (688)
p(w \mid h_{1...t}, x_{1...t}) = (1 - \lambda)\, p_{vocab}(w \mid h_t) + \lambda\, p_{cache}(w \mid h_{1...t}, x_{1...t})   (689)

where θ is a scalar parameter that controls the flatness of the cache distribution.

155
They say this, but everything else in the paper strongly suggests they mean the next gold-standard input
instead.
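A minimal numpy sketch of eqs. 686–689 (toy sizes and random values, purely illustrative):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

V, d, theta, lam = 10, 4, 0.5, 0.2
O = np.random.randn(V, d)                     # output word embeddings o_w
h_t = np.random.randn(d)
# cache of (h_i, x_{i+1}) pairs
cache = [(np.random.randn(d), np.random.randint(V)) for _ in range(6)]

p_vocab = softmax(O @ h_t)                    # eq. 686
p_cache = np.zeros(V)
for h_i, w in cache:                          # eq. 688: sum exp(theta h_t.h) per word
    p_cache[w] += np.exp(theta * h_t @ h_i)
p_cache /= p_cache.sum()

p = (1 - lam) * p_vocab + lam * p_cache       # eq. 689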

Papers and Tutorials April 26, 2019

Protection Against Reconstruction and Its Applications in Private Federated Learning


Table of Contents Local Written by Brandon McKinzie

A. Bhowmick et al., “Protection Against Reconstruction and Its Applications in Private Federated Learning,”
Apple, Inc. (Dec 2018).

Introduction (1). In many scenarios, it is possible to reconstruct model inputs x given just
∇θ `(θ; x, y). Differential privacy is one approach for obscuring the gradients such that guar-
antees can be made regarding risk of compromising user data x. Locally private algorithms,
however, are preferred to DP when the user wants to keep their data private even from the
data collector. The authors want to find a way to perform SGD while providing both local
privacy to individual data Xi and stronger guarantees on the global privacy of the output θ̂n
of their procedure.

Formally, say we have two users’ data x and x0 (both in X ) and some randomized mechanism
M : X 7→ Z. We say that M is ε-local differentially private if ∀x, x0 ∈ X and sets S ⊂ Z:

\frac{\Pr[M(x) \in S]}{\Pr[M(x') \in S]} \le e^{\varepsilon}   (690)

Clearly, the RHS will need to be pretty big for this to be achievable. The authors claim that
allowing ε >> 1 “may [still] provide meaningful privacy protections.”

Privacy Protections (2). The focus here is on the curious onlooker: an individual (e.g.
Apple PFL engineer) who can observe all updates to a model and communication from individ-
ual devices. Let X denote some user data. Let ∆W denote the weights difference after some
model update using the data X. Let Z be the result of the randomized mapping ∆W 7→ Z.
Our setting can be described with the Markov chain X → ∆W → Z. The onlooker observes
Z and wants to estimate some function f (X) on the private data.

Separated private mechanisms (2.2). The authors propose, instead of a simple mapping
∆W → Z, to split it up into two parts: Z1 = M1 (U ) and Z2 = M2 (R), where
U = \frac{\Delta W}{||\Delta W||_2}   (691)
R = ||\Delta W||_2   (692)

Separated Differential Privacy


A pair of mechanisms M1 , M2 mapping from U ×R to Z1 ×Z2 is (ε1 , ε2 )-separated differ-
entially private if M1 is ε1 -locally differentially private and M2 is ε2 -locally differentially
private.

Privatizing Unit ℓ2 Vectors with High Accuracy (4.1). Given some vector u ∈ S^{d−1},^{156} we want to generate an ε-differentially private vector Z such that

E[Z \mid u] = u \quad \forall u \in S^{d-1}   (693)

Privatized Unit Vector: PrivUnit2

Sample a random vector V:

V \sim \begin{cases} U\left( \{v \in S^{d-1} \mid \langle v, u \rangle \ge \gamma\} \right) & \text{with probability } p \\ U\left( \{v \in S^{d-1} \mid \langle v, u \rangle < \gamma\} \right) & \text{otherwise} \end{cases}   (694)

where γ ∈ [0, 1] and p ≥ 1/2 together control accuracy and privacy.

Let α = (d−1)/2, τ = (1+γ)/2, and

m = \frac{(1 - \gamma^2)^{\alpha}}{2^{d-2}(d-1)} \left[ \frac{p}{B(\alpha, \alpha) - B(\tau; \alpha, \alpha)} - \frac{1 - p}{B(\tau; \alpha, \alpha)} \right]   (695)

where B(x; α, β) is the incomplete beta function (see paper pg 17 for details).

Return Z = \frac{1}{m} \cdot V

Privatizing the Magnitude (4.3). We also need to privatize the weight delta norms. We
want to return values r ∈ [0, rmax ] for some rmax < ∞.

156
Here, this denotes an n-sphere: S^n \triangleq \{x \in R^{n+1} : ||x|| = r\}

Papers and Tutorials June 02, 2019

Context Dependent RNN Language Model


Table of Contents Local Written by Brandon McKinzie

T. Mikolov and G. Zweig, “Context Dependent Recurrent Neural Network Language Model,” BRNO and Mi-
crosoft (2012).

Model Structure (2). Given one-hot input vector xt , output a probability distribution yt
for the next word. Incorporate a feature vector ft that will contain topic information.

y_t = \mathrm{Softmax}(V h_t + G f_t)   (696)
h_t = \sigma(U x_t + W h_{t-1} + F f_t)   (697)

LDA for Context Modeling (3). “Documents” fed to LDA here will be individual sentences.
The generative process assumed by LDA is compactly defined by the following sequence of
operations157 :
N \sim \mathrm{Poisson}(\xi)   (698)
\Theta \sim \mathrm{Dir}(\alpha)   (699)
z_n \sim \mathrm{Multinomial}(\Theta)   (700)
w_n \sim \Pr[w_n \mid z_n, \beta]   (701)

(Here N is the number of words, Θ_i ≡ p(topic[i]), and z_n is the topic of word n.)

where \Pr[w_n = a \mid z_n = b] = \beta_{b,a}, so we are really just sampling from row z_n of β, where β ∈ [0, 1]^{Z×V} (where Z is the number of topics). The result of LDA is a learned value for α, and the topic distributions β.

f_t = \frac{1}{Z} \prod_{i=0}^{K} t_{x_{t-i}}   (702)
f_t = \frac{1}{Z}\, f_{t-1}^{1-\gamma}\, t_{x_t}^{\gamma}   (703)

157
α is a vector with number of elements equal to number of topics.

Papers and Tutorials July 27, 2019

Strategies for Training Large Vocabulary Neural Language Models


Table of Contents Local Written by Brandon McKinzie

Chen et al., “Strategies for Training Large Vocabulary Neural Language Models,” FAIR (Dec 2018). arXiv:1512.04906

Setup/Notation. Note that in everything below, the authors are using a rather primitive feed-
forward network as their language model. To predict wt it just concatenates the embeddings
of the previous n words and feeds it through a k-layer FF network. Then, layer k + 1 is the
dense projection and softmax:

h^{k+1} = W^{k+1} h^k + b^{k+1} \in R^V   (704)
y = \frac{1}{Z} \exp\{h^{k+1}\}   (705)

Using cross-entropy loss, the derivative of log p(wt =i) wrt the jth element of the logits is:

\frac{\partial \log y_i}{\partial h_j^{k+1}} = \frac{\partial}{\partial h_j^{k+1}} \left[ h_i^{k+1} - \log Z \right]   (706)
 = \delta_{ij} - y_j   (707)

When computing gradients of the cross-entropy loss, yi here is the ground truth. Therefore,
to increase the probability of the correct token, we need to increase the logits element for that
index, and decrease the elements for the others. Note how this implies we must compute the
final activations for all words in the vocabulary.

Hierarchical Softmax (2.2). Group words into one of two clusters {c1 , c2 }, based on unigram
frequency158 . Then model p(wt | x) = p(ct | x)p(wt | ct ) where ct is the class that word wt was
assigned to.

158
For example, you could just put the top 50% in c1 and the rest in c2 .

NCE (2.5). Define Pnoise (w) by the unigram frequency distribution. For each real token wt
in the training set, sample K noise tokens \{n_k\}_{k=1}^K. NCE aims to minimize

L_{NCE}(\{w^{(1)}, \ldots, w^{(N)}\}) = \sum_{i=1}^{N} \sum_{t=1}^{\mathrm{len}(w^{(i)})} \left[ \log h(w_t^{(i)}) + \sum_{k=1}^{K} \log(1 - h(n_k^{(i)})) \right]   (708)
h(w_t) = \frac{P_{model}(w)}{P_{model}(w) + k P_{noise}(w)}   (709)
 \approx \frac{\tilde{P}_{model}(w)}{\tilde{P}_{model}(w) + k P_{noise}(w)}   (710)

where the final approximation is what makes NCE less computationally expensive in practice
than standard softmax. This would seem to imply that NCE should approach standard softmax
(in terms of correctness) as k increases.

Takeaways.
• Hierarchical softmax is the fastest.
• NCE performs well on large-volume large-vocab datasets.
• Similar NCE values can result in very different validation perplexities.
• Sampled softmax shows good results if the number of negative samples is at 30% of the
vocab size or larger.
• Sampled softmax has a lower ppl reduction per step than others.

Papers and Tutorials August 03, 2019

Product quantization for nearest neighbor search


Table of Contents Local Written by Brandon McKinzie

Jégou, “Product quantization for nearest neighbor search,” (2011)

Vector Quantization. Denote the index set I = [0..k − 1] and the set of reproduction values
(a.k.a. centroids) ci as C = {ci ∈ RD : i ∈ I}. We refer to C as the codebook of size k. A
quantizer is a function q : x 7→ q(x), where x ∈ RD and q(x) ∈ C. We typically evaluate the
quality of a quantizer with the mean squared error of the reconstruction:

MSE(q) = E_{x \sim p(x)}\left[ ||x - q(x)||_2^2 \right]   (711)

In order for a quantizer to be optimal, it must satisfy the Lloyd optimality conditions:

(1)  q(x) = \arg\min_{c_i \in C} ||x - c_i||_2   (712)
(2)  c_i = E_x[x \mid i] \triangleq \int_{V_i} x\, p(x)\, dx   (713)

a.k.a. literally just K-means.

Product Quantization. Input vector x ∈ RD is split into m distinct subvectors uj ∈ RD/m ,


where j ∈ [1..m]. Note that D must be an integer multiple of m (i.e. D = am for some a ∈ Z).

x = \underbrace{x_{\langle 1 \ldots D^* \rangle}}_{u_1(x)}, \ldots, \underbrace{x_{\langle D-D^*+1 \ldots D \rangle}}_{u_m(x)} \;\rightarrow\; q_1(u_1(x)), \ldots, q_m(u_m(x))   (714)

(where D^* = D/m is the subvector dimensionality)

Note that each subquantizer qj has its own index set Ij and codebook Cj . Therefore, the
final reproduction value of a product quantizer is identified by an element of the product set
I = I1 × · · · × Im . The associated final codebook is C = C1 × · · · × Cm .
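A toy numpy sketch of a product quantizer’s encode/decode; the codebooks here are random rather than learned with K-means, purely for illustration:

import numpy as np

D, m, k_star = 8, 2, 4                        # D must be a multiple of m
codebooks = [np.random.randn(k_star, D // m) for _ in range(m)]

def pq_encode(x):
    # for each subvector u_j, pick the index of the nearest centroid in C_j
    codes = []
    for j, C in enumerate(codebooks):
        u_j = x[j * (D // m):(j + 1) * (D // m)]
        codes.append(int(np.argmin(((C - u_j) ** 2).sum(axis=1))))
    return codes                              # element of I_1 x ... x I_m

def pq_decode(codes):
    # reproduction value: concatenation of the chosen sub-centroids
    return np.concatenate([codebooks[j][c] for j, c in enumerate(codes)])

x = np.random.randn(D)
print(pq_encode(x), np.linalg.norm(x - pq_decode(pq_encode(x))))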

Papers and Tutorials August 03, 2019

Large Memory Layers with Product Keys


Table of Contents Local Written by Brandon McKinzie

Lample et al., “Large Memory Layers with Product Keys,” FAIR (July 2019). arXiv:1907.05242v1

Memory Design (3.1). The high-level structure (sequence of ops) is as follows:


1. Query network computes some query vector q.
2. Compare q with each product key via some scoring function.
3. Select the k highest scoring product keys.
4. Compute output m(x) as weighted sum over the values associated with each of the top
k keys from the previous step.
The query network is usually just a dense layer^{159}. Since they want it to output query vectors with good coverage over the key space, they put a batch normalization layer before the query network^{160}.

The standard way of doing key assignment/weighting is as follows:

[KNN]        I \triangleq \mathrm{TopK}_{1 \le i \le K}\left( q(x)^T k_i \right)   (715)
[Normalize]  w = \mathrm{Softmax}(I)   (716)
[Aggregate]  m(x) = \sum_{i \in I} w_i v_i   (717)

where equation 715 is inefficient for large memory (key-value) stores. The authors propose
instead a structured set of keys that they call product keys. Spoiler alert: it’s just product
quantization with m=2 subvectors. Instead of using the flat key set K , {k1 , . . . , k|K| } with
each ki ∈ Rdq from earlier, we redefine it as

K , {(c, c0 ) | c ∈ C, c0 ∈ C 0 } (718)

where both C and C 0 are sets of sub-keys ki ∈ Rdq /2 .


Then. . .
1. Just run each of subvectors q1 and q2 through the standard TopK. You’ll have k sub-keys
for both, defined by their index into their respective codebook.
2. Let K := \{(c_i, c'_j) \mid i \in I_C,\; j \in I_{C'}\}. This new reduced-size key set K has only k × k entries.
159
They choose dq = 512 as the output dimensionality of their query network.
160
Recall that batch norm just normalizes the batch inputs to have 0 mean and unit standard deviation,
followed by a scaling and bias factor.

3. Run the standard algorithm using the new reduced key set K.
TODO: finish this note

Papers and Tutorials August 10, 2019

Show, Ask, Attend, and Answer


Table of Contents Local Written by Brandon McKinzie

V. Kazemi and A. Elqursh, “Show, Ask, Attend, and Answer: A Strong Baseline For Visual Question Answering”
Google Research (April 2017). arXiv:1704.03162v2

TL;DR: Good for a high-level overview of the VQA task, but extremely vague with so many
details omitted it renders the paper fairly useless.

Method (3). Given a training set of image-question-answer triplets (I, q, a), learn to estimate
the most likely answer â out of the set of most frequent answers161 in the training data:

\hat{a} = \arg\max_a \Pr[a \mid I, q]   (719)

The method utilizes the following architectural components:


• Image Embedding (3.1). Extracts features φ=CNN(I).
• Question Embedding (3.2). Encode question q as the final state of LSTM: s=LSTM(Embed(q)).
• Stacked Attention (3.3).^{162} Seems like they feed Concat[s, φ] through two layers of convolution to produce an output F_c for c ∈ [1..C] (meaning they do C such convolutions separately and in parallel, like multi-head attention). This represents the scores for the attention function. The attention output, as usual, is computed as

  x_c = \sum_\ell \alpha_{c,\ell}\, \phi_\ell   (720)
  \alpha_{c,\ell} \propto \exp F_c(s, \phi_\ell)   (721)

where ` appears to be over all [flattened] spatial indices of φ.


• Classifier (3.4). Concat the image glimpses xc with the LSTM output s and feed
through a couple FC layers to eventually obtain softmax probabilities over each possible
answer ai , i ∈ [1..M ].

161
Same approach as how we define vocabulary in language modeling tasks.
162
Authors do a laughably poor job at describing this part in any detail, so I’m taking the liberty of filling in
the blanks. Blows my mind that papers this sloppy can even be published.

Dataset. Although, again, the authors are horribly vague/sloppy here, it seems like the data
they use actually provides K “correct” answers for each image-question pair. The model loss
is therefore an average NLL over the K true classes.

Papers and Tutorials August 10, 2019

Did the Model Understand the Question?


Table of Contents Local Written by Brandon McKinzie

Mudrakarta et al., “Did the Model Understand the Question?” Univ. Chicago & Google Brain (May 2018).
arXiv:1805.05492v1

TL;DR. Basically all QA-related networks are dumb and don’t learn what we think they learn.
• Networks tend to make predictions based on a tiny subset of the input words. Due to this,
altering the non-important words in ways that may drastically change the meaning of
the question can have virtually no impact on the network’s prediction.
• Networks assign high-importance to words like “there”, “what”, “how”, etc. These are
actually low-information words that the network should not heavily rely on.
• Networks rely on the image far more than the question.

Integrated Gradients (IG) (3). Purpose: “isolate question words that a DL system uses to
produce an answer.”

F(x = \{x_1, \ldots, x_n\}) \in [0, 1]   (722)
A_F(x, x') = \{a_1, \ldots, a_n\} \in R^n   (723)

where x0 is some baseline input we use to compute the relative attribution of input x. The
authors set x0 as the “empty question” (sequence of padding values)163 .

Given an input x and baseline x0 , the integrated gradient along the ith dimension is as
follows.
IG_i(x, x') \triangleq (x_i - x'_i) \times \int_{\alpha=0}^{1} \frac{\partial F(x' + \alpha \times (x - x'))}{\partial x_i}\, d\alpha   (724)

Interpretation: seems like IG gives us a better idea of the total “attribution” of each input
dimension xi relative to baseline x0i along the line connecting xi and x0i , instead of just the im-
mediate derivative around xi . Although, the fact that infinitesimal contributions could cancel
each other out seems problematic (positive and negative gradients along the interpolation).

163
They use the same context though (e.g. the associated image for VQA). Only the question is changed.
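A Riemann-sum sketch of eq. 724 with a toy differentiable F (my own illustration, not the authors’ implementation):

import numpy as np

def integrated_gradients(F, grad_F, x, x_base, steps=100):
    # approximate the path integral with an average of gradients along the line
    alphas = np.linspace(0.0, 1.0, steps)
    grads = np.stack([grad_F(x_base + a * (x - x_base)) for a in alphas])
    return (x - x_base) * grads.mean(axis=0)

# Toy example: F(x) = sigmoid(w . x)
w = np.array([0.5, -1.0, 2.0])
F = lambda x: 1 / (1 + np.exp(-w @ x))
grad_F = lambda x: F(x) * (1 - F(x)) * w

x, x_base = np.array([1.0, 2.0, 3.0]), np.zeros(3)
ig = integrated_gradients(F, grad_F, x, x_base)
print(ig, ig.sum(), F(x) - F(x_base))   # completeness: sum(IG) ~ F(x) - F(x')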

Papers and Tutorials August 17, 2019

XLNet
Table of Contents Local Written by Brandon McKinzie

Yang et al., “XLNet: Generalized Autoregressive Pretraining for Language Understanding” CMU & Google
Brain (May 2018).

TL;DR: Instead of minimizing the NLL using p(w1 , . . . , wT ), minimize over NLL’s using every
possible order of the given word sequence.

Background. Recall that BERT does denoising auto-encoding. Given text sequence x =
{x1 , . . . xT }, BERT constructs a corrupted version x̂ by randomly masking out some tokens.
Let x̄ denote the tokens that were masked. The BERT training objective is then
[BERT]  \max_\theta \log p_\theta(\bar{x} \mid \hat{x}) \approx \sum_{\bar{x}_t \in \bar{x}} \log p_\theta(\bar{x}_t \mid \hat{x})   (725)
p(\bar{x}_t \mid \hat{x}) = \mathrm{Softmax}\left( H_\theta(\hat{x})_t^T\, e(\bar{x}_t) \right)   (726)

Objective & Architecture. Their proposed permutation language modeling objective is:

\max_\theta\; E_{z \sim Z_T}\left[ \sum_{t=1}^{T} \log p_\theta(x_{z_t} \mid x_{\langle z_1 \ldots z_{t-1} \rangle}) \right]   (727)

where ZT is the set of all possible permutations of the length-T index sequence [1..T ]. To
implement this, the authors had to re-parameterize the next-token distribution to be target
position aware:
p_\theta(X_{z_t} = x \mid x_{\langle z_1 \ldots z_{t-1} \rangle}) = \mathrm{Softmax}\left( g_\theta(x_{\langle z_1 \ldots z_{t-1} \rangle}, z_t)^T e(x) \right)   (728)

They accomplish this via two-stream self-attention, a technique that utilizes two sets of
hidden representations (instead of one):
• Content representation: hzt , hθ (xhz1 ...zt i ).
• Query representation: gzt , gθ (xhz1 ...zt−1 i , zt ).

234
The query stream is initialized with some vector g_i^{(0)} = w, and the content stream is initialized with the word embedding h_i^{(0)} = e(x_i). For the subsequent attention layers 1 ≤ m ≤ M, they are computed respectively as follows:

g_{z_t}^{(m)} \leftarrow \mathrm{Attention}(Q = g_{z_t}^{(m-1)},\; K = V = h_{z_{<t}}^{(m-1)})   (729)
h_{z_t}^{(m)} \leftarrow \mathrm{Attention}(Q = h_{z_t}^{(m-1)},\; K = V = h_{z_{\le t}}^{(m-1)})   (730)

In practice, in order to speed up optimization, the authors do partial prediction: only train
to predict over xz>c targets rather than all of them.

Incorporating Ideas from Transformer-XL. Often times, sequences are too long to feed
all at once. The authors adopt relative positional encoding and segment-level recurrence from
Transformer-XL. To compute the attention update with memory on a given segment, we use
the content representations from the previous segment, h̃, along with the current segment, h_{z≤t}, as follows:

h_{z_t}^{(m)} \leftarrow \mathrm{Attention}\left( Q = h_{z_t}^{(m-1)},\; K = V = [\tilde{h}^{(m-1)}; h_{z_{\le t}}^{(m-1)}] \right)   (731)

Papers and Tutorials August 24, 2019

Transformer-XL
Table of Contents Local Written by Brandon McKinzie

Dai et al., “Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context” CMU & Google
Brain (Jan 2019).

Segment-Level Recurrence with State Reuse. Denote two consecutive segments of length
L as sτ = [xτ,1 , . . . , xτ,L ] and sτ +1 = [xτ +1,1 , . . . , xτ +1,L ]. Denote output of layer n given input
segment sτ as hnτ ∈ RL×d , where d is the hidden dimension. To obtain the output of layer n
given the next segment, sτ +1 , do:

h_{\tau+1}^n = \mathrm{TransformerLayer}(q_{\tau+1}^n, k_{\tau+1}^n, v_{\tau+1}^n)   (732)
            = \mathrm{TransformerLayer}(h_{\tau+1}^{n-1} W_q^T,\; \tilde{h}_{\tau+1}^{n-1} W_k^T,\; \tilde{h}_{\tau+1}^{n-1} W_v^T)   (733)
\tilde{h}_{\tau+1}^{n-1} = \left[ \mathrm{SG}(h_\tau^{n-1});\; h_{\tau+1}^{n-1} \right]   (734)

where the concat in 734 is along the length (time) dimension. In other words, Q remains the
same, but K and V get the previous segment prepended. Ultimately this only changes the
inner dot products in the attention mechanism to attend over both segments. The L output
attention vectors are therefore each weighted sums over the previous 2L timesteps instead of
just L.

Relative Positional Encodings. Instead of absolute positional encodings (as regular transformers do), only encode the relative positional information in the hidden states. Ignoring the scale factor of 1/√d_k, we can write the score for query vector q_i = W_q(e_{x_i} + u_i) and key vector k_j = W_k(e_{x_j} + u_j), for input embeddings e and positional encodings u, as follows. Below it we show the authors’ proposed re-parameterized relative encoding version.

A^{abs}_{i,j} = e_{x_i}^T W_q^T W_k e_{x_j} + e_{x_i}^T W_q^T W_k u_j + u_i^T W_q^T W_k e_{x_j} + u_i^T W_q^T W_k u_j   (735)
A^{rel}_{i,j} = \underbrace{e_{x_i}^T W_q^T W_{k,E} e_{x_j}}_{\text{content-based addressing}} + \underbrace{e_{x_i}^T W_q^T W_{k,R} r_{i-j}}_{\text{content-dependent positional bias}} + \underbrace{u^T W_{k,E} e_{x_j}}_{\text{global content bias}} + \underbrace{v^T W_{k,R} r_{i-j}}_{\text{global positional bias}}   (736)

The differences introduced by the relative version are the replacement of the absolute encodings u_j, u_i with r_{i−j} and the learned vectors u, v. It appears that r_{i−j} is literally just u_{i−j} but I guess using new letters is cool. Note that they also separate W_k into content-based W_{k,E} and location-based W_{k,R}.

Papers and Tutorials August 31, 2019

Efficient Softmax Approximation for GPUs


Table of Contents Local Written by Brandon McKinzie

Grave et al., “Efficient Softmax Approximation for GPUs” FAIR (June 2017).

Adaptive Softmax: Two-Level

Partition the vocabulary V into two clusters Vh and Vt , where


• Vh denotes the head, consisting of the most frequent words.
• Vt denotes the tail, associated with a large number of rare words.
• |Vh | << |Vt | and P (Vh ) >> P (Vt ).
To compute the probability of some word w given context h, do:

\Pr[w \mid h] = \begin{cases} P_{V_h}(w \mid h) & \text{if } w \in V_h \\ P_{V_t}(w \mid h)\, P_{V_h}(\text{tail} \mid h) & \text{otherwise} \end{cases}   (737)
where both PVh and PVt are modeled with a softmax over the words in their respective clusters (PVh
also includes the special “tail” token).

More generally, we can extend the above algorithm to N clusters (instead of 2). We can also
adapt the capacity of each cluster (varying their embedding size). The authors recommend,
for each successive tail cluster, reducing the output size by a factor of 4. Of course, this then
has to be followed by projecting back up to the number of words associated with the given
cluster.
TODO: detail out how cross entropy loss is computed under this setup.
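In the meantime, here is a minimal sketch of how Pr[w | h] (and hence the cross-entropy −log Pr[target | h]) could be computed under the plain two-cluster setup of eq. 737; the weight shapes and sizes below are made up for illustration, not the paper’s implementation:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

d, Vh, Vt = 8, 5, 20
W_head = np.random.randn(Vh + 1, d)   # last row scores the special "tail" token
W_tail = np.random.randn(Vt, d)

def log_prob(w, h):
    p_head = softmax(W_head @ h)
    if w < Vh:                         # frequent word: head softmax only
        return np.log(p_head[w])
    p_tail = softmax(W_tail @ h)       # rare word: tail softmax, computed only when needed
    return np.log(p_head[-1]) + np.log(p_tail[w - Vh])

h = np.random.randn(d)
loss = -log_prob(Vh + 3, h)            # cross-entropy for a tail-cluster target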

Papers and Tutorials September 02, 2019

Adaptive Input Representations for Neural Language Modeling


Table of Contents Local Written by Brandon McKinzie

A. Baevski and M. Auli, “Adaptive Input Representations for Neural Language Modeling” FAIR (Feb 2019).

TL;DR: Literally just adaptive softmax but for the input embeddings. Official implementation
can be found here.

Adaptive Input Representations (3). Same as Grave et al., they partition the vocabulary
V into

V = V1 ∪ V2 ∪ . . . ∪ Vn (738)

where V1 is the head and the rest are the tails (ordered by decreasing frequency). They
reduce the capacity of each cluster by a factor of k=4 (also same as Grave et al.). Finally,
they add linear projections for each cluster’s embeddings in order to ensure they all result in
d-dimensional output embeddings (even V1 ).

Papers and Tutorials September 14, 2019

Neural Module Networks


Table of Contents Local Written by Brandon McKinzie

Andreas et al., “Deep Compositional Question Answering with Neural Module Networks” UC Berkeley (Nov
2015).

NMNs for Visual QA (4). Model and task overview:


• Data: 3-tuples (w, x, y) containing the question, image, and answer, respectively.
• Model: fully specified by a collection of modules {m}. Each module m has parameters
θm and a network layout predictor P (w) that maps from strings to networks. The
high-level procedure is, for each (w, x, y), do:
1. Instantiate a network based on P (w).
2. Pass the image x (and possibly w again) as inputs to the network.
3. Obtain network outputs encoding p(y | w, x; θ).
• Modules:

From strings to networks (4.2).


1. Parse question w with the Stanford Parser to obtain universal dependency representation.
2. Filter dependencies to those connected to the wh-word in the question. Some examples:
• what is standing in the field 7→ what(stand)
• what color is the truck 7→ color(truck)
• is there a circle next to a square 7→ is(circle, next-to(square))
3. Assign identities of modules (already have full network structure).
• Leaves become attend modules.
• Internal nodes become re-attend or combine modules.
• Root nodes become measure (y/n questions) or classify (everything else) modules.

Answering natural language questions (4.3). They combine the results from the module
network with an LSTM, which is fed the question as input and outputs a predictive distribution
over the set of answers164 . The final prediction is a geometric average of the LSTM output
probabilities and the root module output probabilities.

164
This is the same distribution that the root module is trying to predict

Papers and Tutorials September 14, 2019

Learning to Compose Neural Networks for QA


Table of Contents Local Written by Brandon McKinzie

Andreas et al., “Learning to Compose Neural Networks for Question Answering” UC Berkeley (June 2016).

TL;DR. Improve initial NMN work (previous note) by (1) learning network predictor (P (w)
in previous paper) instead of manually specifying it, and (2) extending visual primitives from
previous work to reason over structured world representations.

Model (4). Training data consists of (world, question, answer) triplets (w, x, y). The model
is built around two distributions:
• layout model p(z | x; θ` ) which predicts a layout z for sentence x.
• execution model pz (y | w; θe ) which applies the network specified by z to the world
representation w.

Evaluating Modules (4.1). The execution model is defined as

p_z(y \mid w) = (\llbracket z \rrbracket_w)_y   (739)

where ⟦z⟧_w denotes the output of the network with layout z on input world w. The defining equations for all modules are as follows (σ ≡ ReLU, sm ≡ softmax, and \bar{w}(h) \triangleq \sum_k h_k w(k)):

\llbracket \mathrm{lookup}[i] \rrbracket = e_{f(i)}   (740)
\llbracket \mathrm{find}[i] \rrbracket = \mathrm{sm}(a \odot \sigma(B v^i \oplus C W \oplus d))   (741)
\llbracket \mathrm{relate}[i](h) \rrbracket = \mathrm{sm}(a \odot \sigma(B v^i \oplus C W \oplus D \bar{w}(h) \oplus e))   (742)
\llbracket \mathrm{and}(h^1, h^2, \ldots) \rrbracket = h^1 \odot h^2 \odot \cdots   (743)
\llbracket \mathrm{describe}[i](h) \rrbracket = \mathrm{sm}(A \sigma(B \bar{w}(h)) + v^i)   (744)
\llbracket \mathrm{exists}(h) \rrbracket = \mathrm{sm}\left( \left( \max_k h_k \right) a + b \right)   (745)

To train, maximize

\sum_{(w,y,z)} \log p_z(y \mid w; \theta_e)   (746)

Assembling Networks (4.2). TODO: finish note

Papers and Tutorials September 14, 2019

End-to-End Module Networks for VQA


Table of Contents Local Written by Brandon McKinzie

R. Hu, J. Andreas, et al., “Learning to Reason: End-to-End Module Networks for Visual Question Answering”
UC Berkeley, FAIR, BU (Sep 2017).

End-to-End Module Networks (3). High-level sequence of operations, given some input
question and image:
1. Layout policy predicts a coarse functional expression that describes the structure of the
computation.
2. Some subset of function applications within the expression receive parameter vectors
predicted from the text.
3. Network is assembled with the modules according to layout expression to output an
answer.

Attentional Neural Modules (3.1). A neural module m is a parameterized function y =


fm (a1 , a2 , . . . ; xvis , xtxt , θm ), where the ai are image attention maps and the output y is either
an image attention map or a probability distribution over answers. The full set of modules
used by the authors, along with their inputs/outputs, is tabulated below.

Note that, whereas the original NMN paper (see previous note) instantiated module types based
on words (e.g. describe[shape] vs describe[where]) and gave different instantiations different
parameters, this paper has a single module for each module type (no“instances” anymore).
To distinguish between cases where e.g. describe should describe a shape vs. describe a location, the module incorporates a text feature x_{txt}^{(m)} computed separately (but identically in form) for each module m:

x_{txt}^{(m)} = \sum_{i=1}^{T} \alpha_i^{(m)} w_i   (747)

Layout Policy with Seq2Seq RNN (3.2). TODO finish note

Papers and Tutorials October 01, 2019

Fast Multi-language LSTM-based Online Handwriting Recognition


Table of Contents Local Written by Brandon McKinzie

Carbune et al., “Fast Multi-language LSTM-based Online Handwriting Recognition” Google AI Perception (Feb
2019).

Introduction (1). Task: given input strokes, i.e. sequences of points (x, y, t), output it in the
form of text.

Model Architecture (2). The high-level sequences of operations is:


1. Input time series (v1 , . . . , vT ) encoding user input. The authors experiment with two
representations:
(a) Raw touch points: sequence of 5-dimensional points (xi , yi , ti , pi , ni ), where ti is
seconds since first touch point in current observation, pi is binary-valued equal to 0
if pen-up, else 1 if pen-down, and ni is binary on start-of-new-stroke (1 if True).
(b) Bézier curves: TL;DR is that they model x, y, t each as a cubic polynomial over a
new variable s ∈ [0, 1]. Ultimately this means solving for some coefficients Ω of a
linear system of equations: V T Z = V T V Ω.
2. Several BiLSTM layers for contextual character embedding.
3. Softmax layer providing character probabilities at each time step.
4. CTC decoding with beam search. They also incorporate feature functions into the
output logits to help with decoding. They use the following 3 feature functions:
(a) Character language models. A 7-gram LM over Unicode codepoints using Stupid
back-off.
(b) Word language models. 3-grams pruned to between 1.25M and 1.5M entries.
(c) Character classes. Scoring heuristic which boosts the score of characters from the
LM’s alphabet.

Training (3). Training happens in two stages, each on a different dataset:


1. Training neural network model with CTC loss on large dataset.
2. Decoder tuning using Bayesian optimization through Gaussian Processes in Vizier165 .

165
Vizier is a program made by Google for black-box tuning

Papers and Tutorials October 02, 2019

Multi-Language Online Handwriting Recognition


Table of Contents Local Written by Brandon McKinzie

Keysers et al., “Multi-Language Online Handwriting Recognition” Google (June 2017).

System Architecture (3). Segment-and-decode approach consisting of the following steps:


• Preprocessing (4).
1. Resampling.
2. Slope correction.
• Segmentation166 and search lattice creation (5).
1. Segmentation goal: obtain high recall of all actual character boundaries. Accom-
plished via a heuristic which creates a set of potential cut points and then a neural
net which assigns a score to each.
2. Segmentation lattice: a graph (V, E) of ink segments. Each segment is identified by
a unique integer index.
– Nodes (in V ) define the path of ink segments up to that point (e.g. {1, 0, 2})
(i.e. a character hypothesis)
– Edges (in E) from a given node v indicate the ink segments which are grouped
in a character hypothesis. For example, if v={i, j} has some edge k, then that
edge will have node {i, j, k} on the other end, and {i, j, k} is a valid character
hypothesis.
It appears that each node (assign from the empty start node) is passed to the next
stage as a character hypothesis to be scored/classified.
• Generation & scoring of character hypotheses167 (5.3). Goal: determine the characters
most likely to have been written.
1. Feature extraction: they make a fixed-length dense feature vector containing point-
wise and character-global features.
2. Classification: single hidden layer NN with tanh activation followed by softmax.
3. Create a labeled latice which will later be decoded to find the final recognition
result.
• Best path search in the resulting lattice using additional knowledge sources (6).

166
Segmentation/cut point: a point at which another character may start. Segment: the (partial) strokes
between 2 consecutive segmentation points.
167
Character hypothesis: a set of one or more segments (not necessarily consecutive).

Language Models (6.1). They utilize two types of language models:
• Stupid-backoff entropy-pruned 9-gram character LM. This is their “main” LM. Depending
on the language, they use about 10M to 100M n-grams.
• Word-based probabilistic finite automaton. Created using the 100K most frequent words of
a language.

Search (6.2). Goal: obtain a recognition result by finding the best path from the source
node (no ink recognized) to the target node (all ink recognized). Algorithm: ink-aligned beam
search that starts at the start node and proceeds through the lattice in topological order.

Papers and Tutorials October 13, 2019

Modular Generative Adversarial Networks


Table of Contents Local Written by Brandon McKinzie

Zhao et al., “Modular Generative Adversarial Networks” UBC, Tencent AI Lab (April 2018).

TL;DR. Task(s): multi-domain image generation and image-to-image translation.

Network Construction (3.2). Let x and y denote the input and target image, respectively,
wherever applicable. Let A = {A1 , A2 , · · · , An } denote an attribute set. Four types of
modules are used:
1. Initial module is task-dependent (below). Output is feature map in RC×H×W .
• [translation] encoder E: x 7→ E(x)
• [generation] generator G: (z, a0 ) 7→ G(z, a0 ) where z is random noise and a0 is
a condition vector representing auxiliary information.
2. transformer(s) Ti : E(x) 7→ Ti (E(x), ai ). Modifies repr of attrib ai in the FM.
3. reconstructor R : (Ti , Tj , . . .) 7→ y. Reconstructs image from an intermediate FM.
4. discriminator Di : R 7→ {0, 1} × Val(ai ). Predicts probability that R came from ptrue ,
and the [transformed] value of ai .
The authors emphasize that the transformer module is their core module. Its architecture is illustrated in the paper (figure not reproduced in these notes).

Loss Function (3.4).

L_D(D) = -\sum_{i=1}^{n} L_{adv_i} + \lambda_{cls} \sum_{i=1}^{n} L^r_{cls_i}   (748)
L_G(E, T, R) = \sum_{i=1}^{n} L_{adv_i} + \lambda_{cls} \sum_{i=1}^{n} L^f_{cls_i} + \lambda_{cyc} \left( L^{ER}_{cyc} + \sum_{i=1}^{n} L^{T_i}_{cyc} \right)   (749)
L_{adv_i}(E, T_i, R, D_i) = E_{y \sim p_{data}(y)}[\log D_i(y)] + E_{x \sim p_{data}(x)}[\log(1 - D_i(R(T_i(E(x)))))]   (750)
L^r_{cls_i} = -E_{x,c_i}[\log D_{cls_i}(c_i \mid x)]   (751)
L^f_{cls_i} = -E_{x,c_i}[\log D_{cls_i}(c_i \mid R(E(T_i(x))))]   (752)
L^{ER}_{cyc} = E_x[||R(E(x)) - x||_1]   (753)
L^{T_i}_{cyc} = E_x[||T_i(E(x)) - E(R(T_i(E(x))))||_1]   (754)

where n is the total number of controllable attributes.

Papers and Tutorials October 13, 2019

Transfer Learning from Speaker Verification to TTS


Table of Contents Local Written by Brandon McKinzie

Jia et al., “Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis” Google (Jan
2019).

TL;DR: TTS that’s able to generate speech in the voice of different speakers, including those
unseen during training.

Multispeaker Speech Synthesis Model (2). System is composed of three independently


trained NNs:
1. Speaker Encoder. Computes a fixed-dimensional vector from a speech signal.
2. Synthesizer. Predicts a mel spectrogram from a sequence of grapheme or phoneme
inputs, conditioned on the speaker vector. Extension of Tacotron 2 to support multiple
speakers.
3. Vocoder. Autoregressive WaveNet, which converts the spectrogram into time domain
waveforms.

NLP with Deep Learning

Contents

5.1 Word Vector Representations (Lec 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252


5.2 GloVe (Lec 3) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255

NLP with Deep Learning May 08

Word Vector Representations (Lec 2)


Table of Contents Local Written by Brandon McKinzie

Meaning of a word. Common answer is to use a taxonomy like WordNet that has hypernyms
(is-a) relationships. Problems with this discrete representation: misses nuances, e.g. the words
in a set of synonyms can actually have quite different meanings/connotations. Furthermore,
viewing words as atomic symbols is a bit like using one-hot vectors of words in a vocabulary
space (inefficient).

Distributed representations. Want a way to encode word vectors such that two similar
words have a similar structure/representation. The distributional similarity-based168 ap-
proach represents words by means of its neighbors in the sentences in which it appears. You
end up with a dense “vector for each word type, chosen so that it is good at predicting other
words appearing in its context.”

Skip-gram prediction. Given a word wt at position t in a sentence, learn to predict [proba-


bility of] some number of surrounding words, given wt . Standard minimization with negative
log-likelihood:
J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m} \log \Pr(w_{t+j} \mid w_t)   (755)
\Pr(o \mid c) = \frac{e^{u_o^T v_c}}{\sum_{w=1}^{\text{vocab size}} e^{u_w^T v_c}}   (756)

(Margin note: “I cannot believe this actually works.”)

where
• The params θ are the vector representation of the words (they are the only learnable
parameters here).
• m is our radius/window size.
• o and c are indices into our vocabulary (somewhat inconsistent notation).
• Yes, they are using different vector representations for u (context words) and v (center
words). I’m assuming one reason this is done is because it makes the model architecture
simpler/easier to build.
Some subtleties:

168
Note that this is distinct from the way “distributed” is meant in “distributed representation.” In contrast,
distributional similarity-based representations refers to the notion that you can describe the meaning of words
by understanding the context in which they appear.

• Looks like e.g. P r(wt+j | wt ) doesn’t really care what the value of j is, it is just modeling
the probability that it is somewhere in the context window. The wt are one-hot vectors
into the vocabulary. Standard tricks for simplifying the cross-entropy loss apply.
• Equation 756 should be interpreted as the probability that the oth word in our vocabulary
occurs in the context window of the cth word in our vocabulary.
• The model architecture is identical to an autoencoder. However, the (big) difference is
the training procedure and interpretation of the model parameters going “in” versus the
parameters going “out”.

Sentence Embeddings. (Discussion based on the paper by Arora et al., 2017.) It turns out that simply taking a weighted average of word vectors and doing some PCA/SVD is a competitive way of getting unsupervised sentence embeddings. Apparently it beats supervised learning with LSTMs (?!). The authors claim the theoretical explanation for this method lies in a latent variable generative model for sentences (of course). Approach:
1. Compute the weighted average of the word vectors in the sentence (the authors call their weighted average the Smooth Inverse Frequency (SIF)):

   \frac{1}{N} \sum_i^N \frac{a}{a + p(w_i)}\, w_i   (757)

   where w_i is the word vector for the ith word in the sentence, a is a parameter, and p(w_i) is the (estimated) word frequency [over the entire corpus].
2. Remove the projections of the average vectors on their first principal component (“common component removal”) (y tho?).
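A small sketch of the SIF procedure, assuming hypothetical word_vecs and word_freq lookups (a ≈ 1e-3 in the paper):

import numpy as np

def sif_embeddings(sentences, word_vecs, word_freq, a=1e-3):
    # step 1: frequency-weighted average of word vectors per sentence
    embs = []
    for sent in sentences:
        vs = np.stack([a / (a + word_freq[w]) * word_vecs[w] for w in sent])
        embs.append(vs.mean(axis=0))
    X = np.stack(embs)
    # step 2: common component removal (subtract projection onto the first
    # principal direction of the sentence-embedding matrix)
    u = np.linalg.svd(X, full_matrices=False)[2][0]
    return X - np.outer(X @ u, u)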

Further Reading.
• Learning representations by back-propagating errors (Rumelhart et al., 1986)
• A Neural Probabilistic Language Model (Bengio et al., 2003)
• NLP (almost) from Scratch (Collobert & Weston, 2008)
• Word2Vec (Mikolov et al., 2013)

NLP with Deep Learning May 08

GloVe (Lec 3)
Table of Contents Local Written by Brandon McKinzie

Skip-gram and negative sampling. Main idea:


• Split the loss function from last lecture into two (additive) terms corresponding to the
numerator and denominator respectively (you’ve done this a trillion times). To sample the negative
• The second term is an expectation over all the words in your vocab space. That is huge, samples, draw from
P (w) = U (w)3/4 /Z,
so instead we only use a subsample of size k (the negative samples). where U is the unigram
• Interpretation: First term is maximizing Pr(o | c), the probability that the true outside distribution.
word (given by index o) occurs given context (index) c. Second term is minimizing the
probability of random words (the negative samples) occurring around the center (context)
word given by c.

GloVe (Global Vectors). Given some co-occurrence matrix we computed with previous meth-
ods, we can use the following GloVe loss function over all pairs of co-occurring words in our
matrix:
J(\theta) = \sum_{i,j=1}^{W} f(P_{ij})\left( u_i^T v_j - \log P_{ij} \right)^2   (758)

where Pij is computed simply from the counts of words i and j co-occurring (empirical proba-
bility) and f is some squashing function that really isn’t discussed in this lecture (TODO).
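The lecture leaves f unspecified; the GloVe paper itself uses a capped power function. A one-line sketch with the paper’s default x_max = 100 and α = 3/4 (shown here in terms of raw co-occurrence counts):

def f(x, x_max=100.0, alpha=0.75):
    # down-weight rare co-occurrences, cap the weight of very frequent ones
    return (x / x_max) ** alpha if x < x_max else 1.0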

Evaluating word vectors.


• Word Vector Analogies: Basically, determining if we can do standard analogy fill-in-
the-blank problems: “man [a] is to woman [b] as king [c] is to <blank>” (if you answered
“queen”, you’d make a good AI). We can determine this using a standard cosine distance
measure:

   d = \arg\max_i \frac{(x_b - x_a + x_c)^T x_i}{\|x_b - x_a + x_c\|}    (759)

Woah that is pretty neat. The solution is xi = queen. xb − xa is the vector pointing
from man to woman, which encodes the type of similarity we are looking for with the
other pair. Therefore, we take the vector to “king” and add the aforementioned difference
vector – the resultant vector should point to “queen”. Neat!
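A tiny numpy sketch of this analogy test (eq. 759); the vocabulary list and embedding matrix are assumed inputs, and excluding the query words is my own convention:

```python
import numpy as np

def analogy(a, b, c, words, emb):
    """Return d such that 'a is to b as c is to d'.
    words: list of vocab words; emb: (|V|, dim) array of unit-normalized vectors."""
    idx = {w: i for i, w in enumerate(words)}
    target = emb[idx[b]] - emb[idx[a]] + emb[idx[c]]
    scores = emb @ (target / np.linalg.norm(target))   # cosine similarities
    for w in (a, b, c):                                 # don't return a query word
        scores[idx[w]] = -np.inf
    return words[int(np.argmax(scores))]
```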

255
Derivation. Based on the descriptions in the original paper.^169 We want to develop a model
for learning word vectors.
1. The authors argue that “the appropriate starting point for word vector learning should
be with ratios of co-occurrence probabilities rather than the probabilities themselves.”
The most general such model takes the form,

   F(w_i, w_j, \tilde{w}_k) = \frac{\Pr[\tilde{w}_k \mid w_i]}{\Pr[\tilde{w}_k \mid w_j]} \equiv \frac{P_{ik}}{P_{jk}}, \quad \text{where all } w \in \mathbb{R}^d    (760)

and the tilde in \tilde{w}_k denotes that \tilde{w}_k is a context word vector, which is given a distinct
space from the word vectors w_i and w_j being compared. We compute all P_{ik} via frequency
counts over the corpus.
2. Now that we’ve specified the inputs and ratio of interest, we can start specifying some
desirable constraints on the function F that we’re trying to find. The authors speculate
that, since vector spaces are linear structures, we should have F encode the information
of the ratio in the vector space via vector differences:

   F(w_i - w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}    (761)

which basically says "our representation of the word vectors should be s.t. the relative
probability of some word \tilde{w}_k occurring in the context of a word w_i compared to it
occurring in the context of a different word w_j can be captured by w_i - w_j and \tilde{w}_k alone."
3. Next we notice that F maps arguments in \mathbb{R}^d to a scalar in \mathbb{R}. The most straightforward
way of doing so while maintaining the linear structure we are trying to capture is via a
dot product:

   F((w_i - w_j)^T \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}    (762)

Note that now F : \mathbb{R} \mapsto \mathbb{R}.
4. We want our model to be invariant under the exchanges w ↔ \tilde{w} and X ↔ X^T. We can
restore this symmetry by first requiring that F be a homomorphism^170 between the
groups (\mathbb{R}, +) and (\mathbb{R}_{>0}, \times) (in our case, negation and division would be better symbols,
but it's equivalent). This requires that the following relation hold:^171

   F(w_i^T \tilde{w}_k - w_j^T \tilde{w}_k) = \frac{F(w_i^T \tilde{w}_k)}{F(w_j^T \tilde{w}_k)}    (763)

where I've grouped terms on the LHS to emphasize how this is the definition of homomorphism.
The solution for this equation is that F(\cdot) \equiv \exp(\cdot). Combining this realization
with equations 762 and 763 yields,

   F(w_i^T \tilde{w}_k) = e^{w_i^T \tilde{w}_k} = P_{ik}    (764)
   \Rightarrow\ w_i^T \tilde{w}_k = \log(P_{ik}) = \log(X_{ik}) - \log(X_i)    (765)
^169 Pennington et al., "GloVe: Global Vectors for Word Representation."
^170 In more detail, F : (\mathbb{R}, +) \mapsto (\mathbb{R}_{>0}, \times), which reads "the function F maps elements in \mathbb{R} and any summation of elements in \mathbb{R} to elements in \mathbb{R} that are greater than zero or any product of positive elements in \mathbb{R}."
^171 Note that we do not need to know anything about the RHS of the equations above to state this relation. We write it by definition of homomorphism.

256
where X_{ik} is the number of times word k appears in the context of word i, and X_i = \sum_k X_{ik}
is the number of times any word appears in the context of i.
5. Restore symmetry under the exchanges w ↔ \tilde{w} and X ↔ X^T. We absorb \log(X_i) into a bias
b_i for w_i since it is independent of k, and add an analogous bias \tilde{b}_k for \tilde{w}_k:

   w_i^T \tilde{w}_k + b_i + \tilde{b}_k = \log(X_{ik})    (766)

6. A main drawback to this model is that it weighs all co-occurrences equally, even those
that happen rarely or never. The authors propose a new weighted least squares regression
model, introducing a weighting function f(X_{ij}) into the cost function of our model:

   J = \sum_{i,j=1}^{V} f(X_{ij}) (w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij})^2    (767)
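As a concrete sanity check, here is a minimal numpy sketch of eq. 767. The weighting function shown is the one from the original GloVe paper, f(x) = (x/x_max)^α for x < x_max and 1 otherwise (with x_max = 100, α = 3/4); the array names are my own, not from the paper:

```python
import numpy as np

def glove_cost(W, W_ctx, b, b_ctx, X, x_max=100.0, alpha=0.75):
    """Eq. (767). W, W_ctx: (V, d) word/context embeddings; b, b_ctx: (V,) biases;
    X: (V, V) co-occurrence counts. Only nonzero X_ij contribute."""
    J = 0.0
    for i, j in zip(*np.nonzero(X)):
        # Weighting from the paper: downweight rare pairs, cap frequent ones.
        f = (X[i, j] / x_max) ** alpha if X[i, j] < x_max else 1.0
        err = W[i] @ W_ctx[j] + b[i] + b_ctx[j] - np.log(X[i, j])
        J += f * err ** 2
    return J
```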

257
Speech and
Language
Processing
Contents

6.1 Introduction (Ch. 1 2nd Ed.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259


6.2 Morphology (Ch. 3 2nd Ed.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
6.3 N-Grams (Ch. 6 2nd Ed.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
6.4 Naive Bayes and Sentiment (Ch. 6 3rd Ed.) . . . . . . . . . . . . . . . . . . . . . . . . . 263
6.5 Hidden Markov Models (Ch. 9 3rd Ed.) . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
6.6 POS Tagging (Ch. 10 3rd Ed.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
6.7 Formal Grammars (Ch. 11 3rd Ed.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
6.8 Vector Semantics (Ch. 15) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
6.9 Semantics with Dense Vectors (Ch. 16) . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
6.10 Information Extraction (Ch. 21 3rd Ed) . . . . . . . . . . . . . . . . . . . . . . . . . . . 278

258
Speech and Language Processing July 10, 2017

Introduction (Ch. 1 2nd Ed.)


Table of Contents Local Written by Brandon McKinzie

Overview. Going to rapidly jot down what seems most important from this chapter.
• Morphology: captures information about the shape and behavior of words in context
(Ch. 2/3).
• Syntax: knowledge needed to order and group words together.
• Lexical semantics: knowledge of the meanings of individual words.
• Compositional semantics: knowledge of how these components (words) combine to
form larger meanings.
• Pragmatics: the appropriate use of polite and indirect language.
• The knowledge of language needed to engage in complex language behavior can be sep-
arated into the following 6 distinct categories:
1. Phonetics and Phonology – The study of linguistic sounds.
2. Morphology – The study of the meaningful components of words.
3. Syntax – The study of the structural relationships between words.
4. Semantics – The study of meaning.
5. Pragmatics – The study of how language is used to accomplish goals.
6. Discourse – The study of linguistic units larger than a single utterance.
• Methods for resolving ambiguity: pos-tagging, word sense disambiguation, probabilis-
tic parsing, and speech act interpretation.
• Models and Algorithms. Among the most important are state space search and
dynamic programming algorithms.

259
Speech and Language Processing July 10, 2017

Morphology (Ch. 3 2nd Ed.)


Table of Contents Local Written by Brandon McKinzie

English Morphology. Morphology is the study of the way words are built up from smaller
meaning-bearing units, morphemes. A morpheme is often defined as the minimal meaning-bearing
unit in a language.^172 The two classes of morphemes are stems (the “main” morpheme
of the word) and affixes (the “additional” meanings of various kinds).

Affixes are further divided into prefixes (precede stem), suffixes (follow stem), circumfixes (do
both), and infixes (inside the stem).

Two classes of ways to form words from morphemes: inflection and derivation. Inflection is
the combination of a word stem with a grammatical morpheme, usually resulting in a word of
the same class as the original stem, and usually filling some syntactic function like agreement.
Derivation is the combination of a word stem with a grammatical morpheme, usually resulting
in a word of a different class, often with a meaning hard to predict exactly.

172
Examples: “fox” is its own morpheme, while “cats” consists of the morpheme “cat” and the morpheme
“-s”.

260
Speech and Language Processing July 10, 2017

N-Grams (Ch. 6 2nd Ed.)


Table of Contents Local Written by Brandon McKinzie

Counting Words. Most N -gram based systems deal with the wordform, meaning they treat
words like “cats” and “cat” as distinct. However, we may want to treat the two as instances
of a single abstract word, or lemma: a set of lexical forms having the same stem, the same
major part of speech, and the same word-sense.

Simple (Unsmoothed) N-Grams. An N-gram is an (N−1)th-order Markov model (because it
looks N−1 steps into the past). Notation: the authors use the convention that w_1^n \triangleq w_1, w_2, \ldots, w_n
denotes a sequence of n words. Given this, we can write the general equation for the N-gram
approximation for the probability of the nth word (n > N) in a sentence:

   \Pr[w_n \mid w_1^{n-1}] \approx \Pr[w_n \mid w_{n-N+1}^{n-1}]    (768)

for N ≥ 2. We can compute these probabilities by simply counting:

   \Pr[w_n \mid w_1^{n-1}] = \frac{C(w_{n-N+1}^{n-1}\, w_n)}{C(w_{n-N+1}^{n-1})}    (769)

where C(·) is the number of times the sequence, denoted as ·, occurred in the corpus.
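A minimal sketch of these count-based estimates (eq. 769) for the bigram case (N = 2); the toy corpus, sentence markers, and function names are my own:

```python
from collections import Counter

def bigram_probs(corpus):
    """MLE bigram model: Pr[w_n | w_{n-1}] = C(w_{n-1} w_n) / C(w_{n-1})."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:                       # corpus: list of token lists
        tokens = ["<s>"] + sentence + ["</s>"]
        unigrams.update(tokens[:-1])              # counts of the context word
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return {(prev, w): c / unigrams[prev] for (prev, w), c in bigrams.items()}

# Example usage:
# probs = bigram_probs([["i", "like", "cats"], ["i", "like", "dogs"]])
# probs[("like", "cats")]  -> 0.5
```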

Entropy. Denote the random variable of interest as x with probability function p(x). The
entropy of this random variable is:

   H(x) = -\sum_{x} p(x) \log_2 p(x)    (770)

which should be thought of as a lower bound on the number of bits it would take to encode
a certain decision or piece of information in the optimal coding scheme. The value 2H is the
perplexity, which can be interpreted as the weighted average number of choices a random
variable has to make.

261
Cross Entropy for Comparing Models. Useful when we don’t know the actual prob-
ability distribution p that generated some data. Assume we have some model m that’s an
approximation of p. The cross-entropy of m on p is defined by:
   H(p, m) = \lim_{n \to \infty} -\frac{1}{n} \sum_{W \in L} p(w_1, \ldots, w_n) \log m(w_1, \ldots, w_n)    (771)

That is, we draw sequences according to the probability distribution p, but sum the log of their
probability according to m.

262
Speech and Language Processing June 21, 2017

Naive Bayes and Sentiment (Ch. 6 3rd Ed.)


Table of Contents Local Written by Brandon McKinzie

[3rd Ed.] [Quick Review]

Overview. This chapter is concerned with text categorization, the task of classifying an
entire text by assigning it a label drawn from some set of labels. Generative classifiers like naive
Bayes build a model of each class. Given an observation, they return the class most likely to
have generated the observation. Discriminative classifiers like logistic regression instead learn
what features from the input are most useful to discriminate between the different possible
classes. (Discriminative systems are often more accurate and hence more commonly used.)
Notation: we will assume we have a training set of N documents, each hand-labeled
with some class: {(d_1, c_1), \ldots, (d_N, c_N)}.

Naive Bayes. A multinomial173 classifier with a naive assumption about how the features
interact. We model a text document as a bag of words, meaning we store (1) the words
that occurred and (2) their frequencies. It’s a probabilistic classifier, meaning it estimates the
label/class of a document d as

   \hat{c} = \arg\max_{c \in C} \Pr[c \mid d]    (773)
          = \arg\max_{c \in C} \frac{\Pr[d \mid c] \Pr[c]}{\Pr[d]}    (774)
          = \arg\max_{c \in C} \Pr[d \mid c] \Pr[c]    (775)

Computing the class-conditional distribution (the likelihood) Pr [d | c] over all possible d ∈ D


is typically intractable; we must introduce some simplifying assumptions and use an approxi-
mation of it. Our assumptions in this section will be:
• Bag-of-Words: Assume that word position within a document is irrelevant. (Counts
still matter)
^173 Rapid review: multinomial distribution. Let x = (x_1, \ldots, x_k) denote the result of an experiment with n independent trials (n = \sum_{i=1}^{k} x_i) and k possible outcomes for any given trial, i.e. x_i is the number of trials that had outcome i (1 ≤ i ≤ k). The pmf of this multinomial distribution, over all possible x constrained by n = \sum_i x_i, is:

   \Pr[x = (x_1, \ldots, x_k); n] = \frac{n!}{x_1! \cdots x_k!}\, p_1^{x_1} \times \cdots \times p_k^{x_k}    (772)

where p_i is the probability of outcome i for any single trial.

263
• Naive Bayes Assumption: First, recall that d is typically modeled as a (random)
vector consisting of features f1 , . . . , fn , each of which has an associated probability dis-
tribution Pr [fi | c]. The NB assumption is that the features are mutually independent
given the class c:
   \Pr[f_1, \ldots, f_n \mid c] = \Pr[f_1 \mid c] \cdots \Pr[f_n \mid c]    (776)
   c_{NB} = \arg\max_{c \in C} \Pr[c] \prod_{f \in F} \Pr[f \mid c]    (777)

where 777 is the final equation for the class chosen by the naive Bayes classifier.
In text classification we typically use the word at position i in the document as fi , and move
to log space to avoid underflow/increase speed:
   c_{NB} = \arg\max_{c \in C} \log \Pr[c] + \sum_{i}^{\mathrm{len}(d)} \log \Pr[w_i \mid c]    (778)

Classifiers that use a linear combination of the inputs to make a classification decision (e.g.
NB, logistic regression) are called linear classifiers.

Training the NB Classifier. No real "training" in my opinion, just simple counting from
the data:
   \hat{P}[c] = \frac{N_c}{N_{docs}}    (779)
   \hat{P}[w_i \mid c] = \frac{\mathrm{count}(w_i, c) + 1}{\left(\sum_{w \in V} \mathrm{count}(w, c)\right) + |V|}    (780)

The Laplace smoothing is added to avoid the occurrence of zero-probabilities in equation 777.
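A minimal sketch of training and prediction with these counts (eqs. 778–780); the data layout and function names are my own:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """docs: list of token lists; labels: list of class labels. Returns log-priors
    and a Laplace-smoothed log-likelihood function (eqs. 779-780)."""
    vocab = {w for d in docs for w in d}
    word_counts = defaultdict(Counter)
    for d, c in zip(docs, labels):
        word_counts[c].update(d)
    log_prior = {c: math.log(labels.count(c) / len(labels)) for c in set(labels)}
    totals = {c: sum(word_counts[c].values()) for c in log_prior}

    def log_likelihood(w, c):                       # eq. 780 (add-1 smoothing)
        return math.log((word_counts[c][w] + 1) / (totals[c] + len(vocab)))

    return log_prior, log_likelihood, vocab

def predict_nb(doc, log_prior, log_likelihood, vocab):
    """Eq. 778: argmax_c of log-prior plus summed log-likelihoods of the doc's words."""
    return max(log_prior, key=lambda c: log_prior[c]
               + sum(log_likelihood(w, c) for w in doc if w in vocab))
```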

Optimizing for Sentiment Analysis.


• It often improves performance [for sentiment] to clip word counts in each document to 1
(“binary NB”).
• deal with negations in some way.
• Use sentiment lexicons, lists of words that are pre-annotated with positive or negative
sentiment.

264
Speech and Language Processing July 28, 2017

Hidden Markov Models (Ch. 9 3rd Ed.)


Table of Contents Local Written by Brandon McKinzie

Overview. Here we will first go over the math behind HMMs: the Viterbi, Forward, and
Baum-Welch (EM) algorithms for unsupervised or semi-supervised learning. Recall that
a HMM is defined by specifying the set of N states Q, transition matrix A, sequence of T
observations O, sequence of observation likelihoods B, and the initial/final states. They can
be characterized by three fundamental problems:
1. Likelihood. Given an HMM λ = (A, B) and observation sequence, compute the likeli-
hood (prob. of the observations given the model). (Forward)
2. Decoding. Given an HMM λ = (A, B) and observation sequence, discover the best
hidden state sequence. (Viterbi)
3. Learning. Given an observation sequence and the set of states in the HMM, learn the
HMM parameters A and B. (Baum-Welch/Forward-Backward/EM)

The Forward Algorithm. For likelihood computation. We want to compute the probability
of some sequence of observations O, without knowing the sequence of hidden states (that
emitted the observations) Q. In general, this can be expressed by summing over all possible
hidden state sequences:

   \Pr[O] = \sum_{Q} \Pr[Q] \Pr[O \mid Q]    (781)

However, for N hidden states and T observations, this summation involves N T terms, which
becomes intractable rather quickly. Instead, we can use the O(N 2 T ) Forward Algorithm.
The forward algorithm can be defined by initialization, recursion definition, and termination,
shown respectively as follows:

   \alpha_1(j) = a_{0j} b_j(o_1), \quad 1 \le j \le N    (782)
   \alpha_t(j) = \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, b_j(o_t), \quad 1 \le j \le N,\ 1 \le t \le T    (783)
   \Pr[O \mid \lambda] = \alpha_T(q_F) = \sum_{i=1}^{N} \alpha_T(i)\, a_{iF}    (784)
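A small numpy sketch of the forward recursion (eqs. 782–784). It uses an explicit initial distribution pi instead of the dedicated start/final states in the a_{0j}/q_F notation above, which is an equivalent but slightly different convention:

```python
import numpy as np

def forward(A, B, pi, obs):
    """Likelihood Pr[O | lambda] for an HMM with N states.
    A: (N, N) transition probs; B: (N, M) emission probs; pi: (N,) initial probs;
    obs: list of observation indices."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # eq. 782
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]  # eq. 783
    return alpha[-1].sum()                            # eq. 784 (no explicit final state)
```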

265
Viterbi Algorithm. For decoding. Want the most likely hidden state sequence given observa-
tions. Let vt (j) represent the probability that we are in state j after t observations and passing
through the most probable state sequence q0 , q1 , . . . , qt−1 . Similar to the forward algorithm,
we show the defining equations for the Viterbi algorithm below:

   v_1(j) = a_{0j} b_j(o_1), \quad 1 \le j \le N    (785)
   v_t(j) = \max_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t)    (786)
   P^* = v_T(q_F) = \max_{i=1}^{N} v_T(i)\, a_{iF}    (787)

N.B.: At each step, the best path up to that point can be found by taking the argmax instead
of max.
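And the corresponding Viterbi sketch (eqs. 785–787), again with an explicit initial distribution; the backpointer array implements the N.B. about taking the argmax at each step:

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Most likely hidden state sequence. Same conventions as the forward() sketch."""
    N, T = A.shape[0], len(obs)
    v = np.zeros((T, N))
    back = np.zeros((T, N), dtype=int)
    v[0] = pi * B[:, obs[0]]                              # eq. 785
    for t in range(1, T):
        scores = v[t - 1][:, None] * A                    # scores[i, j] = v_{t-1}(i) a_ij
        back[t] = scores.argmax(axis=0)
        v[t] = scores.max(axis=0) * B[:, obs[t]]          # eq. 786
    path = [int(v[-1].argmax())]                          # best final state (eq. 787)
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return list(reversed(path)), v[-1].max()
```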

Baum-Welch Algorithm. AKA forward-backward algorithm, a special case of the EM


algorithm. Given an observation sequence O and the set of best possible states in the HMM,
learn the HMM parameters A and B. First, we must define some new notation. The backward
probability β is defined as (remember λ ≡ (A, B)):

   \beta_t(i) \triangleq \Pr[o_{t+1}, \ldots, o_T \mid q_t = i, \lambda]    (788)

As usual, we can compute its values inductively as follows:

   \beta_T(i) = a_{iF}, \quad 1 \le i \le N    (789)
   \beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \quad 1 \le i \le N,\ 1 \le t \le T    (790)
   \Pr[O \mid \lambda] = \alpha_T(q_F) = \beta_1(q_0)    (791)
                       = \sum_{j=1}^{N} a_{0j}\, b_j(o_1)\, \beta_1(j)    (792)

We can use the forward and backward probabilities α and β to compute the transition prob-
abilities aij and observation probabilities bi (ot ) from an observation sequence. The derivation
steps are as follows:

266
1. Estimating \hat{a}_{ij}. Begin by defining quantities that will prove useful (remember, knowing
   the observation sequence does NOT give us the sequence of hidden states):

   \xi_t(i, j) \triangleq \Pr[q_t = i, q_{t+1} = j \mid O, \lambda]    (793)
   \tilde{\xi}_t(i, j) \triangleq \Pr[q_t = i, q_{t+1} = j, O \mid \lambda]    (794)
                       = \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)    (795)

where you should be able to derive eq. 795 in your head using just logic. If you cannot,
review before continuing. We can then derive ξt (i, j) using basic definitions of conditional
probability, combined with eq. 791. Finally, we estimate âij as the expected number of
transitions qi → qj divided by the expected number of transitions from qi total:
   \hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \sum_{k=1}^{N} \xi_t(i, k)}    (796)

2. Estimating b̂j (vk ). We define our estimate as the expected number of times we are in
qj and emit observation vk , divided by the expected number of times we are in state j.
Similar to our approach for âij we define helper quantities for these values at a given
timestep, then sum over them (all t) to obtain our estimate.

   \gamma_t(j) \triangleq \Pr[q_t = j \mid O, \lambda]    (797)
               = \frac{\Pr[q_t = j, O \mid \lambda]}{\Pr[O \mid \lambda]}    (798)
               = \frac{\alpha_t(j)\, \beta_t(j)}{\Pr[O \mid \lambda]}    (799)

Thus, we obtain \hat{b}_j(v_k) by summing over all timesteps where o_t = v_k, denoted as the set
T_{v_k}, divided by the summation over all t regardless of o_t:

   \hat{b}_j(v_k) = \frac{\sum_{t \in T_{v_k}} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}    (800)

At last we can finally define the Forward-Backward Algorithm as follows:


1. Initialize A and B.
2. E-step. Compute γt (j) (∀t, j), and compute ξt (i, j) (∀t, i, j).
3. M-step. Update all âij and b̂j (vk ).
4. Upon convergence, return A and B.
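A rough sketch of the E-step quantities (eqs. 793–799), reusing the conventions from the earlier forward/Viterbi sketches (explicit initial distribution, no dedicated final state, so beta is assumed to be computed with beta_T(i) = 1). This is illustrative only, not a full Baum-Welch implementation:

```python
import numpy as np

def hmm_e_step(A, B, obs, alpha, beta):
    """Compute gamma (eq. 799) and xi (eq. 795, normalized by Pr[O | lambda])
    from precomputed forward/backward tables alpha, beta of shape (T, N)."""
    T, N = alpha.shape
    likelihood = alpha[-1].sum()                 # Pr[O | lambda]
    gamma = alpha * beta / likelihood            # gamma[t, j] = alpha_t(j) beta_t(j) / Pr[O]
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        # xi[t, i, j] = alpha_t(i) a_ij b_j(o_{t+1}) beta_{t+1}(j) / Pr[O]
        xi[t] = (alpha[t][:, None] * A * B[:, obs[t + 1]][None, :]
                 * beta[t + 1][None, :]) / likelihood
    return gamma, xi
```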

267
Speech and Language Processing July 30, 2017

POS Tagging (Ch. 10 3rd Ed.)


Table of Contents Local Written by Brandon McKinzie

English Parts-of-Speech. POS are traditionally defined based on syntactic and morpholog-
ical function, grouping words that have similar neighboring words or take similar affixes.

Open classes (part of speech – definition – properties):
• noun – people, places, things – occur with determiners, take possessives.
• verb – actions, processes – 3rd-person-sg, progressive, past participle.
• adjective – properties, qualities.
• adverb – modify verbs, adverbs, verb phrases – directional, locative, degree, manner, temporal.
Common nouns can be divided into count (e.g. goat/goats) and mass (e.g. snow) nouns.

Closed classes. POS with relatively fixed membership. Some of the most important in
English are:

Some subtleties: the particle resembles a preposition or an adverb and is used in combination
with a verb. An example case where “over” is a particle: “she turned the paper over.” When a
verb and a particle behave as a single syntactic and/or semantic unit, we call the combination
a phrasal verb. Phrasal verbs cause widespread problems with NLP because they often
behave as a semantic unit with a noncompositional meaning – one that is not predictable from
the distinct meanings of the verb and the particle. Thus, “turn down” means something like
“reject”, “rule out” means “eliminate”, “find out” is “discover”, and “go on” is “continue”.

268
HMM POS Tagging. Since we typically train on labeled data, we need only use the Viterbi
algorithm for decoding174 . In the POS case, we wish to find the sequence of n tags, t̂n1 , given
the observation sequence of n words w1n .

   \hat{t}_1^n = \arg\max_{t_1^n} \Pr[t_1^n \mid w_1^n] = \arg\max_{t_1^n} \Pr[w_1^n \mid t_1^n] \Pr[t_1^n]    (801)

where we’ve dropped the denominator after using Bayes’ rule (since argmax is the same).
HMM taggers make two further simplifying assumptions:

   \Pr[w_1^n \mid t_1^n] \approx \prod_{i=1}^{n} \Pr[w_i \mid t_i]    (802)
   \Pr[t_1^n] \approx \prod_{i=1}^{n} \Pr[t_i \mid t_{i-1}]    (803)

We can thus plug these values into eq. 801 to obtain the equation for \hat{t}_1^n:

   \hat{t}_1^n = \arg\max_{t_1^n} \prod_{i=1}^{n} \Pr[w_i \mid t_i] \Pr[t_i \mid t_{i-1}]    (804)
              = \arg\max_{t_1^n} \prod_{i=1}^{n} b_i(w_i)\, a_{i-1,i}    (805)

where I’ve written a “translated” version on the second line using the familiar syntax from
the previous chapter. In practice, we can obtain quick estimates for the two probabilities on
the RHS by taking counts/averages over our tagged training data. We then run through the
Viterbi algorithm to find all the argmaxes over states for the most likely hidden state sequence.

Maximum Entropy Markov Models (MEMMs). A sequence model adaptation of the


MaxEnt (multinomial logistic regression) classifier175 . Since HMMs are generative models,
they decompose Pr [T | W ] into Pr [W | T ] Pr [T ] when computing the best tag sequence T̂ .
Since MEMMs are discriminative, they compute/model Pr [T | W ] directly:
   \hat{T} = \arg\max_{T} \Pr[T \mid W] = \arg\max_{T} \prod_{i} \Pr[t_i \mid w_i, t_{i-1}]    (806)

Visually, we can think of the difference between HMMs and MEMMs via the direction of
arrows, as illustrated below.

174
Recall that decoding is the problem of finding the best hidden state sequence, given λ = (A, B) and
observation sequence O.
175
Because it is based on logistic regression, the MEMM is a discriminative sequence model. By contrast,
the HMM is a generative sequence model.

269
[Figure not shown: the top is the HMM representation, while the bottom is the MEMM.]
The reason to use a discriminative sequence model is that discriminative models make it
easier to incorporate a much wider variety of features.

Bidirectionality. The one problem with the MEMM and HMM models as presented is that
they are exclusively run left-to-right. MEMMs176 have a weakness known as the label bias
problem. Consider the tagged fragment: “will/NN to/TO fight/VB 177 .” Even though the
word “will“ is followed by “to”, which strongly suggests “will” is a NN, a MEMM will incor-
rectly label “will” as MD (modal verb). The culprit lies in the fact that Pr [TO | to, twill ] is
essentially 1 regardless of twill ; i.e. the fact that “to” must have the tag TO has explained
away the presence of TO and so the model doesn’t learn the importance of the previous NN
tag for predicting TO.

One way to implement bidirectionality (and thus allowing e.g. the link between TO being
available when tagging the NN) is to use a Conditional Random Field (CRF) model.
However, CRFs are much more computationally expensive than MEMMs and don’t work better
for tagging.

176
And other non-generative finite-state models based on next-state classifiers
177
Note on the tag meanings: TO literally means “to”. MD means “modal” and refers to modal verbs such as
will, shall, etc.

270
Speech and Language Processing August 5, 2017

Formal Grammars (Ch. 11 3rd Ed.)


Table of Contents Local Written by Brandon McKinzie

Constituency and CFGs. Discovering the inventory of constituents present in the language.
Groups of words like noun phrases or prepositional phrases can be thought of as single units
which can only be placed within certain parts of a sentence.

The most widely used formal system for modeling constituent structure in English is the
Context-Free Grammar178 . A CFG consists of a set of productions (rules), e.g.

NP −→ Det Nominal (807)


NP −→ ProperNoun (808)
Nominal −→ Noun | Nominal Noun (809)

where the arrow is to be read “is composed of” or “consists of.”

The sequence of rule expansions going from left to right is called a derivation of the string of
words, commonly represented by a parse tree. The formal language defined by a CFG is the
set of strings that are derivable from the designated start symbol.

178
Also called Phrase-Structure Grammars. Equiv formalism as Backus-Naur Form (BNF)

271
Speech and Language Processing June 21, 2017

Vector Semantics (Ch. 15)


Table of Contents Local Written by Brandon McKinzie

Words and Vectors. Vector models are generally based on a co-occurrence matrix, an example
of which is a term-document matrix: each row is identified by a word, and each column a
document. A given cell value is the number of times the associated word occurred in the associated
document. We can also view each column as a document vector. (Information Retrieval: the
task of finding the document d, from D docs total, that best matches a query q.)

For individual word vectors, however, it is most common to instead use a term-term matrix179 ,
in which columns are also identified by individual words. Now, cell values are the number
of times the row (target) word and the column (context) word co-occur in some context in
some training corpus. The context is most commonly a window around the row/target word,
meaning a cell gives the number of times the column word occurs in a window of ±N words
from the row word.
• Q: What about the co-occurrence of a word with itself (row i, col i)?
– A: It is included, yes. Source: “The size of the window . . . is generally between 1
and 8 words on each side of the target word (for a total context of 3-17 words).”
• Q: Why is the size of each vector generally |V | (vocab size)? Shouldn’t this vary
substantially with window and corpus size?
– A: idk
(TODO: revisit end of page 5 in my pdf of this)

Pointwise Mutual Information (PMI). Motivation: raw frequencies in a co-occurrence


matrix aren’t that great, and words like “the” (which aren’t useful and occur everywhere) can
really skew things. The best weighting or measure of association between words should tell us
how much more often than chance the two words co-occur. PMI is such a measure.
   [Mutual Information]\quad I(X, Y) = \sum_{x} \sum_{y} P(X = x, Y = y)\, \mathrm{PMI}(x, y)    (810)
   [PMI]\quad \mathrm{PMI}(x, y) = \ln \frac{P(x, y)}{P(x)\, P(y)}    (811)

which can be applied for our specific use case as \mathrm{PMI}(w, c) = \ln \frac{P(w, c)}{P(w) P(c)}. The interpretation is
simple: the denominator tells us the joint probability of the given target word w occurring with
context word c if they were independent of each other, while the numerator tells us how often
we observed the two words together (assuming we compute probability by using the MLE).
179
Also called the word-word or term-context matrix

272
Therefore, the ratio gives us an estimate of how much more the target and feature co-occur
than we expect by chance.^180 Most people use Positive PMI, which is just max(0, PMI). We
can compute a PPMI matrix (to replace our co-occurrence matrix), where PPMI_{ij} gives the
PPMI value of word w_i with context c_j. The authors show a few formulas, which is really
distracting since all we actually need is the counts f_{ij} = \mathrm{count}(w_i, c_j); from there we can
use basic probability and Bayes rule to get the PPMI formula.
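A minimal sketch of computing PPMI directly from a raw count matrix (the function and variable names are mine):

```python
import numpy as np

def ppmi(counts):
    """counts: (num_words, num_contexts) co-occurrence count matrix f_ij.
    Returns PPMI_ij = max(0, ln[ P(w_i, c_j) / (P(w_i) P(c_j)) ])."""
    p_wc = counts / counts.sum()                  # joint P(w, c) via MLE
    p_w = p_wc.sum(axis=1, keepdims=True)         # marginal P(w)
    p_c = p_wc.sum(axis=0, keepdims=True)         # marginal P(c)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0                  # zero counts -> PPMI of 0
    return np.maximum(pmi, 0.0)
```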
• Q: Explain why the following is true: very rare words tend to have very high PMI values.
– A: hi
• Q: What is the range of α used in PPMIα ? What is the intuition behind doing this?
– A: For reference:
   \mathrm{PPMI}_\alpha(w, c) = \max\left( \ln \frac{P(w, c)}{P(w)\, P_\alpha(c)},\ 0 \right)    (812)
   P_\alpha(c) = \frac{\mathrm{count}(c)^\alpha}{\sum_{c'} \mathrm{count}(c')^\alpha}    (813)

Although there are better methods than PPMI for weighted co-occurrence matrices, most
notably TF-IDF, things like tf-idf are not generally used for measuring word similarity. For
that, PPMI and significance-testing metrics like t-test and likelihood-ratio are more common.
The t-test statistic, like PMI, measures how much more frequent the association is than
chance.
   t = \frac{\bar{x} - \mu}{\sqrt{s^2 / N}}    (814)
   \text{t-test}(a, b) = \frac{P(a, b) - P(a) P(b)}{\sqrt{P(a) P(b)}}    (815)

where x̄ is the observed mean, while µ is the expected mean [under our null-hypothesis of
independence].

Measuring Similarity. By far the most common similarity metric is the cosine of the angle
between the vectors:
   \mathrm{cosine}(v, w) = \frac{v \cdot w}{|v|\,|w|}    (816)

Note that, since we’ve been defining vector elements as frequencies/PPMI values, they won’t
have negative elements, and thus our cosine similarities will be between 0 and 1 (not -1 and
1).

180
Computing PMI this way can be problematic for word pairs with small probability, especially if we have a
small corpus. Recognize that PMI should never really be negative, but in practice this happens for such cases

273
Alternatives to cosine:
• Jaccard measure: Described as "weighted number of overlapping features, normalized",
but looks like a silly hack in my opinion:
   \mathrm{sim}_{Jac}(v, w) = \frac{\sum_{i=1}^{N} \min(v_i, w_i)}{\sum_{i=1}^{N} \max(v_i, w_i)}    (817)

• Dice measure: Another hack. This displeases me.


   \mathrm{sim}_{Dice}(v, w) = \frac{2 \times \sum_{i=1}^{N} \min(v_i, w_i)}{\sum_{i=1}^{N} (v_i + w_i)}    (818)

• Jensen-Shannon Divergence: An alternative to the KL-divergence181 , which repre-


sents the divergence of each distribution from the mean of the two:

   \mathrm{sim}_{JS}(v \| w) = D\left(v \,\Big\|\, \frac{v + w}{2}\right) + D\left(w \,\Big\|\, \frac{v + w}{2}\right)    (820)

^181 Idea: if two vectors, v and w, each express a probability distribution (their values sum to one), then they are similar to the extent that these probability distributions are similar. The basis of comparing two probability distributions P and Q is the Kullback-Leibler divergence or relative entropy, defined as:

   D(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}    (819)

274
Speech and Language Processing June 21, 2017

Semantics with Dense Vectors (Ch. 16)


Table of Contents Local Written by Brandon McKinzie

Overview. This chapter introduces three methods for generating short, dense vectors: (1)
dimensionality reduction like SVD, (2) neural networks like skip-gram or CBOW, and (3)
Brown clustering.

Dense Vectors via SVD. Method for finding the more important dimensions of a dataset, "important"
defined as dimensions wherein the data most varies. First applied (for language) for generating
embeddings from term-document matrices in a model called Latent Semantic Analysis (LSA). LSA is
just SVD on a |V| × c term-document matrix X, factorized into W \Sigma C^T, where W \in \mathbb{R}^{|V| \times m},
\Sigma \in \mathbb{R}^{m \times m}, and C^T \in \mathbb{R}^{m \times c}. By using only the top k < m dimensions of these three matrices,
the product becomes a least-squares approximation to the original X. It also gives us the reduced
|V| × k matrix W_k, where each row (word) is a k-dimensional vector (embedding). Voilà, we have our
dense vectors!

Note that LSA implementations typically use a particular weighting of each cell in the term-document
matrix, called the local and global weights (here f(i, j) is the raw frequency of word i in context j,
and D is the number of docs):

   [local]\quad \log f(i, j) + 1    (821)
   [global]\quad 1 + \frac{\sum_j p(i, j) \log p(i, j)}{\log D}    (822)

For the case of a word-word matrix, it is common to use PPMI weighting.

Skip-Gram and CBOW. Neural models learn an embedding by starting with a random
vector and then iteratively shifting a word’s embeddings to be more like the embeddings of
neighboring words, and less like the embeddings of words that don’t occur nearby182 Word2vec,
for example, learns embeddings by training to predict neighboring words183 .
• Skip-Gram: Learns two embeddings for each word w: the word embedding v (within
matrix W) and context embedding c (within matrix C). Visually:

   W = \begin{bmatrix} v_0^T \\ v_1^T \\ \vdots \\ v_{|V|}^T \end{bmatrix}, \qquad C = \begin{bmatrix} c_0 & c_1 & \cdots & c_{|V|} \end{bmatrix}    (823)

182
Why? Why is this a sensible assumption? I see no reason a priori why it ought to be true.
183
Note that the prediction task is not the goal – it just happens to result in good word embeddings. Hacky.

275
For a context window of L = 2, and at a given word v (t) inside the corpus184 , our goal is
to predict the context [words] denoted as [ c(t−2) , c(t−1) , c(t+1) , c(t+2) ].
– Example: Consider one of the context words, say c(t+1) , ck , where we also assume
it’s the kth word in our vocab. Also assume that our target word v (t) , vj is the
jth word in our vocab.
– Our task is to compute Pr [ck | vj ]. We do this with a softmax:
   \Pr[c_k \mid v_j] = \frac{e^{c_k^T v_j}}{\sum_{i \in |V|} e^{c_i^T v_j}}    (824)

• CBOW: Continuous bag of words. Basically the mirror-image of skip-gram. Goal is to


predict current word v (t) from the context window of 2L words [ c(t−2) , c(t−1) , c(t+1) , c(t+2) ].
As usual, the denominator of the softmax is computationally expensive, and usually we ap-
proximate it with negative sampling.
In the training phase, the algorithm walks through the corpus, at each target word choosing the
surrounding context words as positive examples, and for each positive example also choosing k
noise samples or negative samples: non-neighbor words. The goal will be to move the embeddings
toward the neighbor words and away from the noise words.

Example: Suppose we come along the following window (in “[]”) (L=2) in our corpus:
lemon, a [tablespoon of apricot preserves or] jam

Ultimately, we want dot products, ci · vector(“apricot”), to be high for all four of the context
words ci . We do negative sampling by sampling k random noise words according to their
[unigram] frequency. So here, for e.g. k = 2, this would amount to 8 noise words, 2 for each
context word. We want the dot products between “apricot” and these noise words to be low.
For a given single context-word pair (w, c), our training objective is to maximize (in practice, it
is common to use p^{3/4}(w) instead of p(w)):

   \log \sigma(c \cdot w) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim p(w)} [\log \sigma(-w_i \cdot w)]    (825)

Again, the above is for a single context-target word-pair and, accordingly, the summation is
only over k = 2 (for our example). Don’t try to split the expectation into a summation or any-
thing – just view it as an expected value. To iteratively shift parameters, we use an optimizer
like SGD.
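A tiny numpy sketch of this objective for one (target, context) pair with k sampled noise words; the sampling of `noise_vecs` from p^{3/4}(w) is assumed to happen elsewhere, and the names are mine:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_objective(w_vec, c_vec, noise_vecs):
    """Eq. (825) for a single (target, context) pair.
    w_vec: target word vector; c_vec: context word vector;
    noise_vecs: (k, d) array of negative-sample vectors."""
    positive = np.log(sigmoid(c_vec @ w_vec))
    negative = np.log(sigmoid(-noise_vecs @ w_vec)).sum()
    return positive + negative      # maximize this (e.g. SGD on its negation)
```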

184
Note that, technically, the position t of v (t) is irrelevant for our computation; we are predicting those words
based on which word v (t) is in the vocabulary, not it’s position in the corpus.

276
The actual model architecture is a typical neural net, progressing as follows:

   \text{“apricot”} \to w^{\text{one-hot}} = [\, 0\ \ 0\ \cdots\ 1\ \cdots\ 0 \,]    (826)
   \to h = W^T w^{\text{one-hot}}    (827)
   \to o = C^T h = [\, c_0^T h,\ c_1^T h,\ \cdots,\ c_{|V|}^T h \,]^T    (828)
   \to y = \mathrm{softmax}(o) = [\, \Pr[c_0 \mid h],\ \Pr[c_1 \mid h],\ \cdots,\ \Pr[c_{|V|} \mid h] \,]^T    (829)

Brown Clustering. An agglomerative clustering algorithm for deriving vector representations


of words by clustering words based on their associations with the preceding or following words.
Makes use of the class-based language model (CBLM), wherein each word w belongs to
some class c ∈ C via the probability P (w | c). CBLMs define

   P(w_i \mid w_{i-1}) = P(c_i \mid c_{i-1})\, P(w_i \mid c_i)    (831)
   P(\text{corpus} \mid C) = \prod_{i=1}^{n} P(w_i \mid w_{i-1})    (832)

A naive and extremely inefficient version of Brown clustering, a hierarchical clustering algo-
rithm, is as follows:
1. Each word is initially assigned to its own cluster.
2. For each cluster pair (ci , cj6=i ), compute the value of eq 832 that would result from merging
ci and cj into a single cluster. The pair whose merger results in the smallest decrease in
eq 832 is merged.
3. Clustering proceeds until all words are in one big cluster.
This process builds a binary tree from the bottom-up, and the binary string corresp. to a
word’s traversal from leaf-to-root is its representation.

277
Speech and Language Processing July 27, 2017

Information Extraction (Ch. 21 3rd Ed)


Table of Contents Local Written by Brandon McKinzie

Overview. The first step in most IE tasks is named entity recognition (NER). Next we
can do relation extraction: finding and classifying semantic relations among the entities,
e.g. “spouse-of.” Event extraction is finding the events in which the entities participate, and
event coreference for figuring out which event mentions actually refer to the same event. It’s
also common to extract dates/times (temporal expression) and perform temporal expression
normalization to map them onto specific calendar dates. Finally, we can do template
filling: finding recurring/stereotypical situations in documents and filling the template slots
with appropriate material.

Named Entity Recognition. Standard algorithm is word-by-word sequence labeling task


by a MEMM or CRF, trained to label tokens with tags indicating the presence of particular
kinds of NEs. It is common to label with BIO tagging, for beginning, inside, and outside
of entities. If we have n unique entity types, then we’d have 2n + 1 BIO tags185 . A helpful
illustration is shown below:

Here we see a classifier determining the label for Corp. with a context window of size 2 and
various features shown in the boxed region. For evaluation of NER, we typically use the familiar
recall, precision, and F1 measure.

185
2n for B-<NE> and I-<NE>, with +1 for the blanket O tag (not any of our NEs)

278
Relation Extraction. The four main algorithm classes used are (1) hand-written patterns,
(2) supervised ML, (3) semi-supervised, and (4) unsupervised. Terminology:
• Infobox: structured tables associated with certain articles/topics/etc. For example, the
Wikipedia infobox for Stanford includes structured facts like state = ’California’.
• Resource Description Framework (RDF): a metalanguage of RDF triples, tuples
of (entity, relation, entity), called a subject-predicate-object expression. For example:
(Golden Gate Park, location, San Francisco).
• hypernym: the “is-a” relation.
• hyponym: the “kind-of” relation. Gelidium is a kind of red algae.

Overview of the four algorithm classes:


1. Patterns. Consider a sentence that has the following form:
   NP_0 such as NP_1 \{, NP_2, \ldots, (and|or)\ NP_i\},\ i \ge 1
also known as a lexico-syntactic pattern, which implies \forall NP_i, i \ge 1, \mathrm{hyponym}(NP_i, NP_0).^186
Patterns typically have high precision but low-recall.
2. Supervised. The general approach for finding relations in a given sequence of words is
the following:
(a) Find all pairs of named entities in the sequence (typically a single sentence).
(b) For each pair, use a trained binary classifier to predict whether or not the entities
in the pair are indeed related.
(c) If related, use a classifier trained to predict the relation given the entity-pair.
As with NER, the most important step in this process is to identify useful surface fea-
tures that will be useful for relation classification, including word features, NER features,
syntactic paths (chunk seqs, constituent paths, dependency-tree paths), and more.
3. Semi-supervised via bootstrapping. Suppose we have a few high-precision seed pat-
terns (or seed tuples)187 . Bootstrapping proceeds by taking the entities in the seed
pair, and then finding sentences (on the web, or whatever dataset we are using) that
contain both entities. From all such sentences, we extract and generalize the context
around the entities to learn new patterns.

186
Here, hyponym(A, B) means “A is a kind-of (hyponym) of B.”
187
seed tuples are tuples of the general form (M1, M2) where M1 and M2 are each specific named entities we
know have the relation of interest R.

279
4. Unsupervised. The ReVerb system extracts a relation from a sentence s in 4 steps:

Event Extraction. An event mention is any expression denoting an event or state that can
be assigned to a particular point, or interval, in time. Note that this is quite different than the
colloquial usage of the word “event,” you should think of the two as distinct. Here, most event
mentions correspond to verbs, and most verbs introduce events. Event extraction is typically
modeled via ML, detecting events via sequence models with BIO tagging, and assigning event
classes/attributes with multi-class classifiers.

Template Filling. The task is creation of one template for each event in the input documents,
with the slots filled with text from the document. For example, an event could be “Fare-Raise
Attempt” with corresponding template (slots to be filled) “(<Lead Airline>, <Amount>, <Ef-
fective Date>, <Follower>)”. This is generally modeled by training two separate supervised
systems:
1. Template recognition. Trained to determine if template T is present in sentence S.
Here, “present” means there is a sequence within the sentence that could be used to fill
a slot within template T.
2. Role-filler extraction. Trained to detect each role (slot-name), e.g. “Lead Airline”.

280
Probabilistic
Graphical
Models
Contents

7.1 Foundations (Ch. 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282


7.1.1 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
7.1.2 L-BFGS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
7.1.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
7.2 The Bayesian Network Representation (Ch. 3) . . . . . . . . . . . . . . . . . . . . . . . . 292
7.3 Undirected Graphical Models (Ch. 4) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
7.3.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
7.4 Local Probabilistic Models (Ch. 5) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
7.5 Template-Based Representations (Ch. 6) . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
7.6 Gaussian Network Models (Ch. 7) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
7.7 Variable Elimination (Ch. 9) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
7.8 Clique Trees (Ch. 10) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
7.9 Inference as Optimization (Ch. 11) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
7.10 Parameter Estimation (Ch. 17) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
7.11 Partially Observed Data (Ch. 19) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322

281
Probabilistic Graphical Models May 13, 2018

Foundations (Ch. 2)
Table of Contents Local Written by Brandon McKinzie

(Brief summary of the book’s notation, for ease of reference.)

Graphs. Authors denote directed graphs as G and undirected graphs as H.


• Induced Subgraph. Let K = (X , E) and X ⊂ X . Define the induced subgraph K[X]
to be the graph (X, E 0 ) where E 0 are all edges X Y such that X, Y ∈ X.
• Complete Subgraph. A subgraph over X is complete if every two nodes in X are
connected by some edge. The set X is often called a clique; we say that a clique X is
maximal if for any superset of nodes Y ⊃ X, Y is not a clique.
• Upward Closure. We say that a subset of nodes X ∈ X is upwardly closed in K if
∀X ∈ X, we have that BoundaryX ⊂ X 188 . We define the upward closure of X to be
the minimally upwardly closed subset Y that contains X. We define the upwardly closed
subgraph of X, denoted K+ [X], to be the induced subgraph over Y, K[Y].

Paths and Trails. Definitions for longer-range connections in graphs. We use the notation
X_i \rightleftharpoons X_j to denote that X_i and X_j are connected via some edge, whether directed (in any
direction) or undirected.
• Trail/Path. We say that X_1, \ldots, X_k form a trail in the graph K = (\mathcal{X}, \mathcal{E}) if \forall i =
1, \ldots, k-1, we have that X_i \rightleftharpoons X_{i+1}. A path makes an additional restriction: either
X_i \to X_{i+1} or X_i — X_{i+1}.
• Connected Graph. A graph is connected if ∀Xi , Xj there is a trail between Xi and Xj .
• Cycle. A cycle in K is a directed path X1 , . . . Xk where X1 = Xk .
• Loop. A loop in K is a trail where X1 = Xk . A graph is singly connected if it contains no
loops. A node in a singly connected graph is called a leaf if it has exactly one adjacent
node.
• Polytree/Forest. A singly connected graph is also called a polytree. A singly connected
undirected graph is called a forest; if a forest is also connected, it is called a tree.
– A directed graph is a forest if each node has at most one parent. A directed forest
is a tree if it is also connected.
• Chordal Graph. Let X1 —X2 — · · · —Xk —X1 be a loop in the graph. A chord in
the loop is an edge connecting Xi and Xj for two nonconsecutive nodes Xi , Xj . An
undirected graph H is said to be chordal if any loop X1 —X2 — · · · —Xk —X1 for k ≥ 4
has a chord.

188
BoundaryX , PaX ∪ NbX . For DAGs, this is simply X’s parents, and for undirected graphs X’s neighbors.

282
Probability. Some notational reminders for this book. Let Ω denote a space of possible
outcomes, and let S denote a set of measurable events α, each of which are a subset of Ω.
A probability distribution P over (Ω, S) is a mapping from events in S to real values that satisfy:
• P (α) ≥ 0 for all α ∈ S.
• P (Ω) = 1.
• If α, β ∈ S and α ∩ β = ∅, then P (α ∪ β) = P (α) + P (β).

Some useful independence properties:

Symmetry : (X ⊥ Y | Z) =⇒ (Y ⊥ X | Z) (833)
Decomposition : (X ⊥ (Y, W ) | Z) =⇒ (X ⊥ Y | Z) (834)
Weak Union : (X ⊥ (Y, W ) | Z) =⇒ (X ⊥ Y | Z, W ) (835)
Contraction : (X ⊥ W | Z, Y )&(X ⊥ Y | Z) =⇒ (X ⊥ Y, W | Z) (836)

My Proofs: Independence Properties

I’ll be using the definition that (X ⊥ Y | Z) ⇔ P (X, Y | Z) = P (X | Z)P (Y | Z). Given this definition the proof
for the symmetry property is trivial. In what follows, I’ll assume the LHS of the given implication is true, and
then show that the RHS must hold as well.

Decomposition:

   P(X, Y \mid Z) = \sum_{w} P(X, Y, w \mid Z) = \sum_{w} P(X \mid Z) P(Y, w \mid Z) = P(X \mid Z) P(Y \mid Z) \checkmark

Weak Union:

   P(X, Y \mid Z, W) = \frac{P(X, Y, W \mid Z)}{P(W \mid Z)}    (837)
                     = \frac{P(X \mid Z) P(Y, W \mid Z)}{P(W \mid Z)}    (838)
                     = \frac{P(X \mid Z) P(W \mid Z) P(Y \mid Z, W)}{P(W \mid Z)}    (839)
                     = P(X \mid Z, W) P(Y \mid Z, W) \checkmark    (840)

Contraction:

P (X, Y, W | Z) = P (Y | Z)P (X, W | Z, Y ) (841)

= P (Y | Z)P (X | Z, Y )P (W | Z, Y ) (842)
= P (X | Z) [P (Y | Z)P (W | Z, Y )] (843)
= P (X | Z)P (Y, W | Z) X (844)

283
We now define what “positive distribution” means, and a useful property of such distributions.

A distribution P is said to be positive if for all events α ∈ S, such that α 6= ∅, we have that
P (α) > 0.

For positive distributions, and for mutually disjoint sets X, Y, Z, W, the intersection prop-
erty also holds:

Intersection : (X ⊥ Y | Z, W)&(X ⊥ W | Z, Y) =⇒ (X ⊥ Y, W | Z) (845)

7.1.1 Appendix

Figured this would be a good place to put some of the definitions in the Appendix, too.

Information Theory (A.1).


• Entropy. Let P(X) be a distribution over a random variable X. The entropy of X is
  defined as (we treat 0 \log 0 = 0):

   H_P(X) = \mathbb{E}_P\left[ \lg \frac{1}{P(X)} \right] = \sum_{x} P(X = x) \lg \frac{1}{P(X = x)}    (846)
   0 \le H_P(X) \le \lg |Val(X)|    (847)

Hp (X) is a lower bound for the expected number of bits required to encode instances
sampled from P (X). Another interpretation is that the entropy is a measure of our
uncertainty about the value of X.
• Conditional Entropy. The conditional entropy of X given Y is
   H_P(X \mid Y) = H_P(X, Y) - H_P(Y) = \mathbb{E}_P\left[ \lg \frac{1}{P(X \mid Y)} \right]    (848)
   H_P(X \mid Y) \le H_P(X)    (849)

which captures the additional cost (in bits) of encoding X when we’re already encoding
Y.
• Mutual Information. The mutual information between X and Y is
   I_P(X; Y) = H_P(X) - H_P(X \mid Y) = \mathbb{E}_P\left[ \lg \frac{P(X \mid Y)}{P(X)} \right]    (850)

which captures how many bits we save (on average) in the encoding of X if we know the
value Y .
• Distance Metric. A distance metric is any distance measure d evaluating the distance
between two distributions that satisfies all of the following properties:
– Positivity: d(P, Q) ≥ 0 and d(P, Q) = 0 if and only if P = Q.

284
– Symmetry: d(P, Q) = d(Q, P ).
– Triangle inequality: For any three distributions P , Q, R, we have that

d(P, R) ≤ d(P, Q) + d(Q, R) (851)

• Kullback-Leibler Divergence. Let P and Q be two distributions over random variables
  X_1, \ldots, X_n. The relative entropy, or KL-divergence, of P and Q is

   D(P \| Q) = \mathbb{E}_P\left[ \lg \frac{P(X_1, \ldots, X_n)}{Q(X_1, \ldots, X_n)} \right]    (852)

Note that this only satisfies the positivity property, and is thus not a true distance metric.

Algorithms and Algorithmic Complexity (A.3).


• Decision Problems. A decision problem Π is a task that accepts an input (instance)
ω and decides whether it satisfies a certain condition or not. The SAT problem accepts
a formula in propositional logic and decides whether there is an assignment to the vari-
ables in the formula such that it evaluates to true. 3-SAT restricts this to accepting
only formulas in conjunctive normal form (CNF), and further restricted s.t. each clause
contains at most 3 literals.
• P. A decision problem is in the class P if there exists a deterministic algorithm that
takes an instance ω and determines whether ω ∈ LΠ (the set of instances for which a
correct algorithm must return true), in polynomial time in the size of the input ω.
• NP. A non-deterministic algorithm takes the general form: (1) nondeterministically
guess some assignment γ to the variables of ω, (2) deterministically verify whether γ sat-
isfies the condition of the problem. The algorithm will repeat these steps until it produces
a γ that satisfies the problem. A decision problem Π is in the class N P if there exists a
nondeterministic algorithm that accepts ω if and only if ω ∈ LΠ , and if the verification
stage can be executed in polynomial time in the length of ω.
• NP-hard. Π is N P-hard if for every decision problem Π' ∈ N P, there is a polynomial-time
transformation of inputs such that an input for Π' belongs to L_{Π'} if and only if the transformed
instance belongs to L_Π. (The SAT problem is N P-hard.) Note that N P-hard is a superset of N P.
• NP-complete. A problem Π is said to be N P-complete if it is both N P-hard and in
N P.

285
Combinatorial Optimization and Search (A.4). Below, I’ll outline some common search
algorithms. These are designed to address the following task:
Given initial candidate solution σcur , a score function score, and a set of search opera-
tors O, search for the optimal solution σbest that maximizes the value of score(σbest )

Greedy local search (Algorithm A.5)

Repeat the following until didU pdate evaluates to f alse at the end of an iteration.
1. Initialize σbest := σcur .
2. Set didU pdate := f alse.
3. For each operator o ∈ O, do:
(a) Let σo := o(σbest ).
(b) If σo is legal solution, and score(σo ) > score(σbest ), reassign σbest := σo , and set
didU pdate := true.
4. If didU pdate == true, go back to step 2. Otherwise terminate and return σbest .

Beam search (Algorithm A.7)

We are given a beam width K. Initialize our beam, the set of at most K solutions we are currently
tracking, to {σcur }. Repeat the following until terminationa :
1. Initialize the set of successor states H := ∅.
2. For each solution σ ∈ Beam, and each operator o ∈ O, insert a candidate successor state o(σ)
into H.
3. Set Beam := KBestScore(H)b .
Once termination is reached, return the best solution σbest in Beam.
a
Termination condition could be e.g. an upper bound on number of iterations or on the improvement
achieved in the last iteration.
b
Notice that this implies an underlying assumption of beam search: all successor states σ ∈ H have
scores greater than any of the states in the current beam. We always assume improvement.
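A small Python sketch of this beam search loop, under the box's stated assumption that successor states improve on the current beam; the `score` function, operator set, and iteration cap are placeholders of my own:

```python
def beam_search(sigma_init, operators, score, K, max_iters=100):
    """Generic beam search (Algorithm A.7).
    operators: list of functions sigma -> candidate successor;
    score: function sigma -> float; K: beam width."""
    beam = [sigma_init]
    for _ in range(max_iters):                      # termination: iteration cap
        successors = [op(s) for s in beam for op in operators]
        if not successors:
            break
        # Keep the K best-scoring successor states as the new beam.
        beam = sorted(successors, key=score, reverse=True)[:K]
    return max(beam, key=score)
```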

Continuous Optimization (A.5).


• Line Search. Method for adaptively choosing the step size (learning rate) η at each
training step. Assuming we are doing gradient ascent. We’d usually set the parameters
θ at step t + 1 to θ(t) + η∇f (θ(t) ). Line search modifies this by instead defining the “line”
g(η) below, and searching for the optimal value of η along that line.


   g(\eta) = \theta^{(t)} + \eta \nabla f(\theta^{(t)})    (853)
   \eta^{(t)} = \arg\max_{\eta} f(g(\eta))    (854)
   \theta^{(t+1)} \leftarrow \theta^{(t)} + \eta^{(t)} \nabla f(\theta^{(t)})    (855)

At risk of stating the obvious, g(η) is referred to as a “line” because it’s a function of the
form mx + b.

286
7.1.2 L-BFGS

Some notes from the textbook “Numerical Optimization” (chapters 8 and 9).

The BFGS Method (8.1). Begin the derivation by forming the following quadratic model
of the objective function f at the current iterate^189 θ_t (in m_t(p), p denotes the deviation at
step t from the current parameters θ_t):

   m_t(p) = f_t + \nabla f_t^T p + \frac{1}{2} p^T B_t p    (856)

where B_t is an n × n symmetric p.d. matrix that will be revised/updated every iteration (it is
not the Hessian!). The minimizer p_t of this function can be written explicitly (using
\frac{\partial}{\partial p} \frac{1}{2} p^T B_t p = B_t p):

   p_t = -B_t^{-1} \nabla f_t    (857)

is used as the search direction, and the new iterate is

θt+1 ← θt + αt pt (858)

where the step length αt is chosen to satisfy the Wolfe conditions190 . This is basically
Newton’s method with line search, except that we’re using the approximate Hessian Bt instead
of the true Hessian. It would be nice if we could somehow avoid recomputing Bt at each
step. One proposed method involves imposing conditions on Bt+1 based on the previous
step(s). Require that ∇mt+1 equal ∇f at the latest two iterates θt and θt+1 . Formally, the
two conditions can be written as

∇mt+1 (−αt pt ) = ∇ft+1 − αt Bt+1 pt = ∇ft (861)


∇mt+1 (0) = ∇ft+1 (862)

We can rearrange the first condition to obtain the secant equation:

Ht+1 yt = st where (863)


−1
Ht+1 , Bt+1 (864)
st , θt+1 − θt (865)
yt , ∇ft+1 − ∇ft (866)

which is true only if st and yt satisfy the curvature condition, sTt yt > 0. The curvature
condition is guaranteed to hold if we impose the Wolfe conditions on the line search. As is, this
189
Recall that an “iterate” is just some variable that gets iteratively computed/updated. Fancy people with
fancy words.
^190 The Wolfe conditions are the following sufficient decrease and curvature conditions for line search:

   f(\theta_t + \alpha_t p_t) \le f(\theta_t) + c_1 \alpha_t \nabla f_t^T p_t    (859)
   \nabla f(\theta_t + \alpha_t p_t)^T p_t \ge c_2 \nabla f_t^T p_t    (860)

for some constants c_1 \in (0, 1) and c_2 \in (c_1, 1).

287
still has infinitely many solutions for Ht+1 . To determine it uniquely, we impose the additional
condition that Ht+1 is the closest of all possible solutions to the current Ht :

   \min_{H} \|H - H_t\|_W \quad \text{s.t.} \quad H = H^T,\ H y_t = s_t    (867)

where \|\cdot\|_W is the weighted Frobenius norm,^191 and W can be any matrix satisfying W s_t = y_t.
For concreteness, assume that W = \tilde{G}_t, where

   \tilde{G}_t = \int_0^1 \nabla^2 f(\theta_t + \tau \alpha_t p_t)\, d\tau    (870)

The unique solution to Ht+1 is given by

(BFGS) Ht+1 = (I − ρt st ytT )Ht (I − ρt yt sTt ) + ρt st sTt (8.16)

where ρt = 1/(ytT st ). The BFGS is summarized in algorithm 8.1 below.

Algorithm 8.1 (BFGS Method). Given starting point θ0 , convergence tolerance  > 0, and
inverse Hessian approximation H0 . Initialize t = 0. While ||∇ft || >  do:
1. Compute search direction pt = −Ht ∇ft .
2. Set θt+1 = θt + αt pt , where αt is computed via line search to satisfy the Wolfe conditions.
3. Define st = θt+1 − θt and yt = ∇ft+1 − ∇ft .
4. Compute Ht by means of equation 8.16.
5. Increment t += 1 and go back to step 1.

^191
   \|H\|_W \triangleq \|W^{1/2} H W^{1/2}\|_F    (868)
   \|C\|_F^2 \triangleq \sum_{i,j} c_{ij}^2    (869)

288
L-BFGS. Modifies BFGS to store a modified version of Ht implicitly, by storing some number
m of vector pairs {si , yi }, corresponding to the m most recent time steps. We use a recursive
procedure to compute Ht ∇ft given the set of vectors.

Algorithm 9.1 (L-BFGS two-loop recursion) Subroutine of L-BFGS for computing Ht ∇ft .
We’re given the current value of ∇ft , and we initialize q to this value.
1. For i in the range [t − 1, t − m], compute

αi ← ρi sTi q (871)
q ← q − αi yi (872)

2. Set r ← H_t^0 q.
3. For i in the range [t − m, t − 1], compute

β ← ρi yiT r (873)
r ← r + si (αi − β) (874)

4. Return result Ht ∇ft = r.
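A direct numpy transcription of Algorithm 9.1 (a sketch; I assume `s_list`/`y_list` hold the m most recent (s_i, y_i) pairs, oldest first, and `H0` is the initial matrix H_t^0, e.g. γ_t I):

```python
import numpy as np

def two_loop_recursion(grad, s_list, y_list, H0):
    """Compute H_t @ grad via the L-BFGS two-loop recursion (Algorithm 9.1)."""
    q = grad.copy()
    rho = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    alphas = [0.0] * len(s_list)
    for i in reversed(range(len(s_list))):        # i = t-1, ..., t-m
        alphas[i] = rho[i] * (s_list[i] @ q)      # eq. 871
        q = q - alphas[i] * y_list[i]             # eq. 872
    r = H0 @ q
    for i in range(len(s_list)):                  # i = t-m, ..., t-1
        beta = rho[i] * (y_list[i] @ r)           # eq. 873
        r = r + s_list[i] * (alphas[i] - beta)    # eq. 874
    return r                                      # = H_t grad (negate for a descent step)
```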


Algorithm 9.2 (L-BFGS). Given starting point θ0 , integer m > 0, and initial t = 0. Repeat
the following until convergence.
1. Choose H_t^0. A popular choice is H_t^0 := \gamma_t I, where

   \gamma_t \triangleq \frac{s_{t-1}^T y_{t-1}}{y_{t-1}^T y_{t-1}}

2. Compute pt ← −Ht ∇ft from Algorithm 9.1.


3. Compute θt+1 ← θt + αt pt , where αt is chosen to satisfy the Wolfe conditions.
4. if t > m, discard {st−m , yt−m }. Compute and save st and yt .
5. Increment t += 1 and go back to step 1.

289
7.1.3 Exercises

Going through all the problems with a star for review.

Exercise 2.4
Let α ∈ S be an event s.t. P (α) > 0. Show that P (· | α) satisfies the properties of a valid probability distribution.

• Show P (β | α) ≥ 0 for all β ∈ S. By definition,

   P(\beta \mid \alpha) = \frac{1}{P(\alpha)} P(\alpha \cap \beta)    (875)

and since the full joint P ≥ 0 and since P (α) > 0, we have the desired result.
• Show P (Ωα ) = 1. Again, using just the definitions,
   P(\Omega_\alpha) = \sum_{\beta \in S} P(\beta \mid \alpha)    (876)
                    = \frac{1}{P(\alpha)} \sum_{\beta \in S} P(\alpha \cap \beta)    (877)
                    = \frac{1}{P(\alpha)} \left( \sum_{\beta \in \alpha} P(\beta) + \sum_{\gamma \notin \alpha} P(\emptyset) \right)    (878)
                    = \frac{1}{P(\alpha)} (P(\alpha) + 0)    (879)
                    = 1    (880)

• Show, for any β, γ ∈ S, where β ∩ γ = ∅, that P (β ∪ γ | α) = P (β | α) + P (γ | α).

   P(\beta \cup \gamma \mid \alpha) = \frac{1}{P(\alpha)} P((\beta \cup \gamma) \cap \alpha)    (881)
                                    = \frac{1}{P(\alpha)} P((\beta \cap \alpha) \cup (\gamma \cap \alpha))    (882)
                                    = \frac{1}{P(\alpha)} \left( P(\beta \cap \alpha) + P(\gamma \cap \alpha) \right)    (883)
                                    = P(\beta \mid \alpha) + P(\gamma \mid \alpha)    (884)

290
Exercise 2.16: Jensen’s Inequality

Let f be a concave function and P a distribution over a random variable X. Then

EP [f (X)] ≤ f (EP [X]) (885)

Use this inequality to prove the following 3 properties.

• H_P(X) \le \log |Val(X)|. Let f(u) := \lg(u) be our concave function, with u(x) := 1/P(x).

   H_P(X) \triangleq \mathbb{E}_P\left[ \lg \frac{1}{P(X)} \right] = \mathbb{E}_P[f(u)]    (886)
            \le f(\mathbb{E}_P[u]) = f\left( \sum_{x} u(x) P(x) \right) = f(|Val(X)|) = \lg |Val(X)|    (887)

• HP (X) ≥ 0.

   -H_P(X) = -\mathbb{E}_P\left[ \lg \frac{1}{P(X)} \right]    (889)
            = \mathbb{E}_P[\lg P(X)] \le 0    (890)

since 0 ≤ P (X = x) ≤ 1 ∀x (any term in the expectation where P (X = x) = 0 is equal to 0, by definition).


• D(P ||Q) ≥ 0. Use the same idea as in the first proof, but let u(x) = Q(x)/P (x).

   D(P \| Q) \triangleq \mathbb{E}_P[\lg(P(X)/Q(X))] = -\mathbb{E}_P[f(u)]    (891)
   -\mathbb{E}_P[f(u)] \ge -f\left( \sum_{x} u(x) P(x) \right) = -f(1) = 0    (892)

291
Probabilistic Graphical Models May 06, 2018

The Bayesian Network Representation (Ch. 3)


Table of Contents Local Written by Brandon McKinzie

Koller and Friedman (2009). The Bayesian Network Representation.


Probabilistic Graphical Models: Principles and Techniques.

Goal: represent a joint distribution P over some set of variables X = {X1 , . . . , Xn }. Consider
the case where each Xi is binary-valued. A single joint distribution requires access to the
probability for each of the 2n possible assignments for X . The set of all such possible joint
distributions,

    { (p_1, . . . , p_{2^n}) ∈ R^{2^n} : Σ_{i=1}^{2^n} p_i = 1 }                      (893)

is a (2^n − 1)-dimensional subspace of R^{2^n}. [margin: Understanding the exponential blowup.] Note that each p_i represents the probability for a unique instantiation of X . Furthermore, in the general case, knowing p_i tells you nearly nothing about p_{j≠i} – i.e. you require an instantiation of 2^n − 1 independent parameters to specify a given joint distribution.
But it would be foolish to parameterize any joint distribution in this way, since we can often take advantage of independencies. [margin: Taking advantage of independencies.] Consider the case where each X_i gives the outcome (H or T) of coin i being tossed. Then our distribution satisfies (X ⊥ Y) for any disjoint subsets of the variables X and Y. Let θ_i denote the probability that coin i lands heads. The key observation is that you only need the n values θ_i to specify a unique joint distribution over X , reducing the (2^n − 1)-dimensional subspace to an n-dimensional manifold in R^{2^n}.^{192}

The Naive Bayes Model. Say we want to determine the intelligence of an applicant based on their grade G in some course, and their score S on the SAT. A naive Bayes model can be illustrated as below.

[Figure: class variable I with directed edges to G and S]

This encodes our general assumption that^{193} P |= (S ⊥ G | I).

192
TODO: Formally define this manifold using the same notation as in 893. Edit: Not sure how to actually write it out, but intuitively it's because each of the 2^n p_i values, when going from the general case to the i.i.d. case, becomes a function of the n θ_i values, whereas before they were independent free parameters.
193
We say that an event α is independent of event β in P with the notation P |= (α ⊥ β). (Def 2.2, pg 23 of the book)

In general, a naive Bayes model assumes that instances fall into one of a number of mutually exclusive and exhaustive classes, defined as the set of values that the top variable in the graph can take on.^{194} The model also includes some number of features X_1, . . . , X_k, whose values are typically observed. The naive Bayes assumption is that the features are conditionally independent given the instance's class.

Bayesian Networks. A Bayesian network B is defined by a network structure together with


its set of CPDs. Causal reasoning (or prediction) refers to computing the downstream effects
of various factors (such as intelligence). Evidential reasoning (or explanation) is the reverse
case, where we reason from effects to causes. Finally, intercausal reasoning (or explaining
away) is when different causes of the same effect can interact. For our student example, we
could be trying to determine Pr[I | G], the intelligence of a student given his/her grade in a class. In addition to intelligence being a cause for the grade, we could have another causal variable D for the difficulty of the class:

[Figure: the student network, with edges D → G, I → G, and I → S]

An example of intercausal reasoning would be observing D, so that we now want Pr[I | G, d]; the difficulty of the course can help explain away good/bad grades, thus changing our value for the probability of intelligence based on the grade alone.

A Bayesian network structure G is a DAG whose nodes represent RVs X_1, . . . , X_n. Let Pa_{X_i}^G denote the parents of X_i in G, and NonDesc_{X_i} denote the variables that are NOT descendants of X_i. Then G encodes the following set of local independencies,

    I_ℓ(G) ≜ { (X_i ⊥ NonDesc_{X_i} | Pa_{X_i}^G) : ∀ X_i }                           (894)

194
For the intelligence example, the classes are high intelligence and low intelligence.

Graphs and Distributions. Here we see that a distribution P satisfies the local indepen-
dencies associated with a graph G iff P is representable as a set of CPDs associated with
G.
• Local independencies. Let P be a distribution over X . We define I(P ) to be the set
of independence assertions of the form (X ⊥ Y | Z) that hold in P . The statement
“P satisfies the local independencies associated with G” can thus be succinctly written:

I` (G) ⊆ I(P ) (895)

and we’d say that G is an I-map (independency map) for P .


• I-maps. More generally, let K be any graph object associated with a set of independen-
cies I(K). We say that K is an I-map for a set of independencies I if I(K) ⊆ I. Note
that the complete graph (every two nodes connected) is an I-map for any distribution.
• Factorization. Let G be a BN graph over X1 , . . . , Xn . We say that a distribution P
over the same space factorizes according to G if P can be expressed as
    P(X_1, . . . , X_n) = ∏_{i=1}^{n} P(X_i | Pa_{X_i}^G)                             (896)

For BNs, the following equivalence holds:

G is an I-map for P ⇐⇒ P factorizes according to G
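A tiny sketch of equation 896 for a toy network D → G ← I with binary variables (the CPD numbers below are made up for illustration, not from the book):

# CPDs stored as dicts: P(D), P(I), and P(G = 1 | D, I).
P_D = {0: 0.6, 1: 0.4}
P_I = {0: 0.7, 1: 0.3}
P_G1_given_DI = {(0, 0): 0.3, (0, 1): 0.9,
                 (1, 0): 0.1, (1, 1): 0.5}

def joint(d, i, g):
    """P(D=d, I=i, G=g) via the factorization over the graph D -> G <- I."""
    p_g1 = P_G1_given_DI[(d, i)]
    p_g = p_g1 if g == 1 else 1.0 - p_g1
    return P_D[d] * P_I[i] * p_g

# Sanity check: the joint sums to 1 over all 2^3 assignments.
total = sum(joint(d, i, g) for d in (0, 1) for i in (0, 1) for g in (0, 1))
print(total)   # 1.0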

D-separation. We want to understand when we can guarantee that an independence (X ⊥


Y | Z) holds in a distribution associated with a BN structure G. Consider a 3-node network
consisting of X, Y, and Z, where X and Y are not directly connected. There are four such
cases, which I’ve drawn below.
[Figure: the four three-node trails, left to right: X → Z → Y, X ← Z ← Y, X ← Z → Y, and X → Z ← Y. The fourth trail is called a v-structure.]

From left-to-right: indirect causal effect, indirect evidential effect, common cause, common
effect. The first 3 satisfy (X ⊥ Y | Z), but the 4th does not. Another way of saying this is
that the first 3 trails are active195 IFF Z is not observed, while the 4th trail is active IFF Z
(or a descendant of Z) is observed.

195
When influence can flow from X to Y via Z, we say that the trail from X to Y through Z is active.

General case:

Let G be a BN structure, and X_1 ⇌ · · · ⇌ X_n a trail in G. Let Z be a subset of observed variables. The trail is active given Z if
• Any v-structure Xi−1 → Xi ← Xi+1 has Xi or one of its descendants in Z.
• No other node along the trail is in Z.

D-separation: (Directed separation)

Let X, Y, Z be three sets of nodes in G. We say that X and Y are d-separated in


G, denoted d-sepG (X; Y | Z), if there is no active trail between any node X ∈ X and
Y ∈ Y given Z. We use I(G) to denote the set of independencies that correspond
to d-separation,

I(G) = { X ⊥ Y | Z : d-sepG (X; Y | Z)} (897)

also called the set of global Markov independencies.

Soundness and Completeness of d-separation as a method for determining independence.


• Soundness (Thm 3.3). If a distribution P factorizes according to G, then I(G) ⊆ I(P ).
• Completeness. For almost all distributions P that factorize over G (all but a measure-zero set of parameterizations), P is faithful^{196} to G:

    (X ⊥ Y | Z) ∈ I(P ) =⇒ d-sep_G(X; Y | Z)                                          (898)

To see the detailed algorithm for finding nodes reachable from X given Z via active trails, see
Algorithm 3.1 on pg. 75 of the book.
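As a sanity check of these rules, here is my own brute-force sketch over all simple trails on the tiny student graph (not the book's linear-time Algorithm 3.1):

# Toy DAG from the student example; parents of each node.
parents = {"D": set(), "I": set(), "G": {"D", "I"}, "S": {"I"}}

def children(v):
    return {c for c, ps in parents.items() if v in ps}

def descendants(v):
    out, stack = set(), [v]
    while stack:
        for c in children(stack.pop()):
            if c not in out:
                out.add(c)
                stack.append(c)
    return out

def trail_active(trail, Z):
    """General-case rule: active given Z iff every v-structure on the trail has its
    middle node (or a descendant) in Z, and no other node on the trail is in Z."""
    for a, b, c in zip(trail, trail[1:], trail[2:]):
        if a in parents[b] and c in parents[b]:        # b is a v-structure
            if not ({b} | descendants(b)) & Z:
                return False
        elif b in Z:                                   # observed non-collider blocks
            return False
    return True

def d_separated(X, Y, Z):
    """True iff no active trail connects X and Y given Z (brute force over simple trails)."""
    def trails(cur, seen):
        if cur == Y:
            yield [cur]
            return
        for nxt in parents[cur] | children(cur):
            if nxt not in seen:
                for rest in trails(nxt, seen | {nxt}):
                    yield [cur] + rest
    return not any(trail_active(t, Z) for t in trails(X, {X}))

print(d_separated("D", "S", set()))     # True: the only trail D-G-I-S is blocked at the v-structure G
print(d_separated("D", "S", {"G"}))     # False: observing G activates the v-structure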

196
P is faithful to G if, whenever X ⊥ Y | Z ∈ I(P ), then d-sepG (X; Y | Z).

Probabilistic Graphical Models November 03, 2017

Undirected Graphical Models (Ch. 4)


Table of Contents Local Written by Brandon McKinzie

Koller and Friedman (2009). Undirected Graphical Models.


Probabilistic Graphical Models: Principles and Techniques.

The Misconception Example. Consider a scenario where we have four students who get to-
gether in pairs to work on homework for a class. Only the following pairs meet: (Alice, Bob),
(Bob, Charles), (Charles, Debbie), (Debbie, Alice). The professor misspoke in class, giving
rise to a possible misconception among the students. We have four binary random variables,
{A, B, C, D}, representing whether the student has the misconception (1) or not (0) 197 . Intu-
itively, we want to model a distribution that satisfies (A ⊥ C | {B, D}), and (B ⊥ D | {A, C}),
but no other independencies198 . Note that the interactions between variables seem symmetric
here – students influence each other (out of the ones they have a pair with).

[Figure: the four-node loop A—B—C—D—A]

The nodes in the graph of a Markov network represent the variables, and the edges corre-
spond to a notion of direct probabilistic interaction between the neighboring variables – an
interaction that is not mediated by any other variable in the network. So, how should we
parameterize our network? We want to capture the affinities between the related variables
(e.g. Alice and Bob are more likely to agree than disagree).
Let D be a set of random variables. We define a factor φ to be a function from Val(D) to R. A factor is nonnegative if all its entries are nonnegative. D is called the scope of the factor, denoted Scope[φ]. [margin: We restrict our attention to nonnegative factors.]

The factors need not be normalized. Therefore, to interpret probabilities over factors, we must

197
A student might not have the misconception if e.g. they went home and figured out the problem via reading
the textbook instead.
198
These independences cannot be naturally captured in a Bayesian (i.e. directed) network.

normalize it with what we’ll call the partition function, Z:
    Pr[a, b, c, d] = (1/Z) φ_1(a, b) · φ_2(b, c) · φ_3(c, d) · φ_4(d, a)              (899)
    Z = Σ_{a,b,c,d} φ_1(a, b) · φ_2(b, c) · φ_3(c, d) · φ_4(d, a)                     (900)
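A small sketch of equations 899–900 (the affinity numbers below are made up; they merely favor agreement between study partners):

import itertools

phi1 = {(0, 0): 30, (0, 1): 5, (1, 0): 1, (1, 1): 10}     # phi1(A, B)
phi2 = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}   # phi2(B, C)
phi3 = {(0, 0): 1, (0, 1): 100, (1, 0): 100, (1, 1): 1}   # phi3(C, D)
phi4 = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}   # phi4(D, A)

def unnorm(a, b, c, d):
    return phi1[a, b] * phi2[b, c] * phi3[c, d] * phi4[d, a]

Z = sum(unnorm(*x) for x in itertools.product((0, 1), repeat=4))   # partition function, eq. 900

def prob(a, b, c, d):                                               # eq. 899
    return unnorm(a, b, c, d) / Z

print(Z, prob(0, 0, 0, 0))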

Parameterization. Associating the graph structure with a set of parameters. We parameter-


ize undirected graphs by associating a set of factors with it. First, we introduce the definition
of factor product:

Let X, Y, and Z be three disjoint sets of variables, and let φ1 (X, Y) and φ2 (Y, Z) be
two factors. We define the factor product φ1 ×φ2 to be a factor ψ : V al(X, Y, Z) 7→
R as follows:

ψ(X, Y, Z) = φ1 (X, Y) · φ2 (Y, Z) (901)

We use this to define an undirected parameterization of a distribution:


A distribution P_Φ is a Gibbs distribution parameterized by a set of factors Φ = {φ_1(D_1), . . . , φ_K(D_K)} if it is defined as follows:

    P_Φ(X_1, . . . , X_n) = (1/Z) P̃_Φ(X_1, . . . , X_n)                               (902)
    P̃_Φ(X_1, . . . , X_n) = φ_1(D_1) × φ_2(D_2) × · · · × φ_K(D_K)                    (903)

where the authors (Koller and Friedman) have made a point to emphasize: A factor is only
one contribution to the overall joint distribution. The distribution as a whole has
to take into consideration the contributions from all of the factors involved. Now,
we relate the parameterization of a Gibbs distribution to a graph structure.

A distribution P_Φ with Φ = {φ_1(D_1), . . . , φ_K(D_K)} factorizes over a Markov network H if each D_k (k = 1, . . . , K) is a complete subgraph^{199} of H.

The factors that parameterize a Markov network are often called clique potentials. Although
it can be used without loss of generality, the parameterization using maximal clique potentials
generally obscures structure that is present in the original set of factors. Below are some useful
definitions that we will use often.

199
A subgraph is complete if every two nodes in the subgraph are connected by some edge. The set of nodes
in such a subgraph is often called a clique. A clique X is maximal if for any superset of nodes Y ⊃ X, Y is
not a clique.

Factor reduction:

Let φ(Y) be a factor, and U = u an assignment for U ⊆ Y. Define the reduction


of the factor φ to the context U = u, denoted φ[u], to be a factor over the scope
Y 0 = Y − U, such that

φ[u](y 0 ) = φ(y 0 , u) (904)

For U ⊈ Y, define φ[u] using only the assignment in u to the variables in U′ = U ∩ Y.

Reduced Gibbs distribution:

Let PΦ (X) be a Gibbs distribution parameterized by Φ = {φ1 , . . . , φK } and let u


be a context. The reduced Gibbs distribution PΦ [u] is the Gibbs distribution
defined by the set of factors Φ[u] = {φ1 [u], . . . , φK [u]}. More formally:

PΦ [u] = PΦ (W | u) where W =X−U (905)

Reduced Markov Network:


Let H be a Markov network over X and U = u a context. The reduced Markov
network H[u] is a Markov network over the nodes W = X − U, where we have
an edge X—Y if there’s an edge X—Y in H.

Note that if a Gibbs distribution PΦ (X) factorizes over H, then PΦ [u] factorizes over H[u].

Markov Network Independencies. A formal presentation of the undirected graph as a


representation of independence assertions.
• Active Path. Let H be a Markov network structure, and let X1 — · · · —Xk be a path
in H. Let Z ⊆ X be a set of observed variables. The path is active given Z if none of
the Xi ’s is in Z.
• Separation. A set of nodes Z separates X and Y in H, denoted sepH (X; Y | Z), if
there is no active path between any node X ∈ X and Y ∈ Y given Z.
• Global Independencies. The global independencies associated with H are defined as

I(H) = {(X ⊥ Y | Z) : sepH (X; Y | Z)} (906)

This is the separation criterion. Note that the definition of separation is monotonic in
Z: if it holds for Z, then it holds for any Z 0 ⊃ Z as well200 .

200
Which means that Markov networks are fundamentally incapable of representing nonmonotonic indepen-
dence relations!

• Soundness of the separation criterion for detecting independence properties in distri-
butions over H. In other words, we want to prove that
 
(P factorizes over H) =⇒ sepH (X; Y | Z) =⇒ P |= (X ⊥ Y | Z)

where the portion in brackets can equivalently be said as “H is an I-map for P ”.

Proof
– Consider the case where X ∪ Y ∪ Z = X .
– Then, any clique in X is fully contained in either X ∪ Z or Y ∪ Z. In other words,

    P(X, Y, Z) = (1/Z) f(X, Z) g(Y, Z)                                                (907)
    which implies  P(X, Y | Z) = f(X, Z) g(Y, Z) / Σ_{x,y} f(x, Z) g(y, Z)            (908)

– We can prove that P |= (X ⊥ Y | Z) by showing that P (X, Y | Z) = P (X | Z)P (Y | Z).

    P(X, Y | Z) = f(X, Z) g(Y, Z) / P(Z)                                              (909)
                = ( f(X, Z) g(Y, Z) / P(Z) ) · ( P(Z) / P(Z) )                        (910)
                = ( f(X, Z) g(Y, Z) / P(Z) ) · ( Σ_x f(x, Z) Σ_y g(y, Z) / P(Z) )     (911)
                = ( f(X, Z) Σ_y g(y, Z) / P(Z) ) · ( g(Y, Z) Σ_x f(x, Z) / P(Z) )     (912)
                = P(X | Z) P(Y | Z)                                                   (913)

– For the general case where X ∪ Y ∪ Z ≠ X , let U = X − (X ∪ Y ∪ Z). Since we know that sep_H(X; Y | Z), we can partition U into two disjoint sets U_1 and U_2 such that sep_H(X ∪ U_1; Y ∪ U_2 | Z). Combining the previous result with the decomposition property^a gives us the desired result that P |= (X ⊥ Y | Z).
a
The decomposition property:

(X ⊥ (Y, W) | Z) =⇒ (X ⊥ Y | Z)

Pairwise Independencies:

Let H be a Markov network. We define the pairwise independencies assoc. with


H:

    I_p(H) = { (X ⊥ Y | X − {X, Y }) : X—Y ∉ H }                                      (914)

which just says “X is indep. of Y given everything else if there’s no edge between
X and Y.”

Local Independencies:

For a given graph H, define the Markov blanket of X in H, denoted MBH (X),
to be the neighbors of X in H. We define the local independencies associated
with H:

I` (H) = { X ⊥ X − {X} − MBH (X) | MBH (X) : X ∈ X} (915)

which just says “X is indep. of the rest of the nodes in the graph given its immediate
neighbors.”

Log-Linear Models. Certain patterns involving particular values of variables for a given
factor can often be more easily seen by converting factors into log-space. More precisely, we
can rewrite a factor φ(D) as

    φ(D) = exp( −ε(D) )                                                               (916)
    ε(D) ≜ −ln φ(D)                                                                   (917)
    Pr[X_1, . . . , X_n] ∝ exp( −Σ_i ε(D_i) )                                         (918)

where ε(D) is often called an energy function. Note how ε can take on any value along the real line (i.e. this removes our nonnegativity constraint).^{201} Also note that as the sum of energies approaches 0, the unnormalized measure approaches one.

This motivates introducing the notion of a feature, which is just a factor without the nonneg-
ativity requirement. A popular type of feature is the indicator feature that takes on value
1 for some values y ∈ V al(D) and 0 otherwise. We can now provide a more general definition
for our notion of log-linear models:

A distribution P is a log-linear model over a Markov network H if it is associated


with:
• A set of features F = {f_1(D_1), . . . , f_k(D_k)}, where each D_i is a complete subgraph (i.e. a clique) in H.
• A set of weights w_1, . . . , w_k.
such that

    Pr[X_1, . . . , X_n] = (1/Z) exp( − Σ_{i=1}^{k} w_i f_i(D_i) )                    (919)

The log-linear model provides a much more compact representation for many distributions,
especially in situations where variables have large domains (such as text).
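A minimal sketch of equation 919 for a chain X_1—X_2—X_3 with indicator "agreement" features (the features and weights are my own toy choices, not from the book):

import itertools
import math

# Binary chain X1 - X2 - X3 with one agreement indicator per edge.
features = [lambda x: float(x[0] == x[1]),     # f1 over (X1, X2)
            lambda x: float(x[1] == x[2])]     # f2 over (X2, X3)
weights = [-2.0, -0.5]                         # negative weight => agreement is favored

def unnorm(x):                                 # exp(-sum_i w_i f_i), eq. 919 numerator
    return math.exp(-sum(w * f(x) for w, f in zip(weights, features)))

Z = sum(unnorm(x) for x in itertools.product((0, 1), repeat=3))
probs = {x: unnorm(x) / Z for x in itertools.product((0, 1), repeat=3)}
print(probs[(0, 0, 0)], sum(probs.values()))   # most likely assignments agree; probs sum to 1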

201
We seem to be implicitly assuming that the original factors are all positive (not just non-negative).

Box 4.C – Concept: Ising Models and Boltzmann Machines

The Ising model: Each atom is modeled as a binary RV Xi ∈ {+1, −1} denoting its spin. Each pair of
neighboring atoms is associated with energy function i,j (xi , xj ) = wi,j xi xj . We also have individual energy
functions ui xi for each atom. This defines our distribution:

    P(ξ) = (1/Z) exp( − Σ_{i<j} w_{i,j} x_i x_j − Σ_i u_i x_i )                       (920)

The Boltzmann distribution: Now the variables are Xi ∈ {0, 1}. The distribution of each Xi given its
neighbors is

    P(X_i = 1 | Nb(X_i)) = sigmoid(z)                                                 (921)
    z = −( Σ_j w_{i,j} x_j ) − u_i                                                     (922)

Box 4.D – Concept: Metric MRFs

Consider the pairwise graph X_1, . . . , X_n in the context of sequence labeling. We want to assign each X_i a label. We also want adjacent nodes to prefer being similar to each other. We usually use the MAP objective, so our goal will be to minimize the total energy over the parameters (which are given by the individual energy functions ε_i).

    E(x_1, . . . , x_n) = Σ_i ε_i(x_i) + Σ_{(i,j)∈E} ε_{i,j}(x_i, x_j)                (923)

The simplest place to start for preferring neighboring labels to take on similar values is to define ε_{i,j} to have low energy when x_i = x_j and some positive λ_{i,j} otherwise. We want to have finer granularity for our similarities
between labels. To do this, we introduce the definition of a metric: a function µ : V × V 7→ [0, ∞) that satisfies

[reflexivity] µ(vk , vl ) = 0 IFF k=l (924)


[symmetry] µ(vk , vl ) = µ(vl , vk ) (925)
[triangle inequality] µ(vk , vl ) + µ(vl , vm ) ≥ µ(vk , vm ) (926)

and we can now let ε_{i,j}(v_k, v_l) := µ(v_k, v_l).

Canonical Parameterization (4.4.2.1). Markov networks are generally overparameter-


ized202 . The canonical parameterization, which requires that P be positive, avoids this.
First, some notation and requirements:
• P must be positive.
• Let ξ ∗ = (x∗1 , . . . , x∗n ) denote some fixed assignment to the network variables X .
• Define x_Z ≜ x⟨Z⟩ as the assignment of variables in some subset Z.^{203}
• Define ξ*_{−Z} ≜ ξ*⟨X − Z⟩ to be our fixed assignment for the variables outside Z.
• Let `(ξ) denote ln P (ξ).

202
Meaning: for any PΦ , there are often infinitely many ways to choose its set of parameter values for a given
H.
203
And x is some assignment to some subset of X that also contains Z.

The canonical energy function for a clique D is defined below, as well as the associated
total P (ξ) over a full network assignment:
    ε*_D(d) = Σ_{Z⊆D} ℓ(d_Z, ξ*_{−Z}) · (−1)^{|D−Z|}                                  (927)
    P(ξ) = exp( Σ_i ε*_{D_i}(ξ⟨D_i⟩) )                                                (928)

[margin: The sum in (927) is over all subsets of D, including D itself and ∅.]

Conditional Random Fields. So far, we’ve only described Markov network representation
as encoding a joint distribution over X . The same undirected graph representation and pa-
rameterization can also be used to encode a conditional distribution Pr [Y | X], where Y is a
set of target variables and X is a (disjoint) set of observed variables.

[Figure: a CRF with target variables Y_1, . . . , Y_4 and observed variables X_1, . . . , X_4; note how there are no connections between any of the Xs.]

Formally, a CRF is an undirected graph H whose nodes correspond to Y ∪ X. Since we want


to avoid representing^{204} a probabilistic model over X, we disallow potentials that involve only variables in X; our set of factors is φ_1(D_1), . . . , φ_m(D_m), such that each D_i ⊈ X. The network encodes a conditional distribution as follows:

    P(Y | X) = (1/Z(X)) P̃(Y, X)                                                      (929)
    P̃(Y, X) = ∏_{i=1}^{m} φ_i(D_i)                                                    (930)
    Z(X) = Σ_Y P̃(Y, X)                                                                (931)

where now the partition function is a function of the assignment x to X.

204
Also note how we never have to deal with a summation over all possible X, due to restricting ourselves to
Z(X).

Rapid Summary.
• Gibbs distribution: any probability distribution that can be written as a product of
factors divided by some partition function Z.
• Factorizes: A Gibbs distribution factorizes over H if the scope of each factor [in its product of factors] is a clique (complete subgraph) of H.

7.3.1 Exercises

Exercise 4.1
Let H be the graph of binary variables below. Show that P does not factorize over H (Hint: proof by contradic-
tion).

[Figure: the four-node loop X_1—X_2—X_3—X_4—X_1, all variables binary]

• Example 4.4 recap: P satisfies the global independencies w.r.t H. They showed this by manually checking
the two global indeps of H, (X1 ⊥ X3 | X2 , X4 ) and (X2 ⊥ X4 | X1 , X3 ), against the tabulated list of
possible assignments for P (given in example). Nothing fancy.
• P factorizes over H if it can be written as a product of clique potentials.
• My proof:
1. Assume that P does factorize over H.
2. Then P can be written as

    P(X_1, X_2, X_3, X_4) = (1/Z) φ_1(X_1, X_2) φ_2(X_2, X_3) φ_3(X_3, X_4) φ_4(X_4, X_1)      (932)

Let the above assertion be denoted as C, and the statement that P factorizes according to H be
denoted simply as PH . Since PH ⇐⇒ C, if we can prove that C does not hold, then we’ve found
our contradiction, meaning PH also must not hold.
3. I know that the proof must take advantage of the fact that we know P is zero for certain assignments
to X . For example P (0100) = 0. Furthermore, by looking at the assignments where P is not zero,
I can see that all possible combinations of (X1 , X2 ) are present, which means φ1 never evaluates
to zero.
4. From the example, we know that P(1100) = 1/8 ≠ 0. However, since

    0 = P(0100) / P(1100) = φ_1(0, 1) φ_4(0, 0) / ( φ_1(1, 1) φ_4(0, 1) )             (933)
      = φ_4(0, 0) / φ_4(0, 1)        (using φ_1 > 0)                                  (934)

and we also know that both the numerator and denominator of eq. 934 are positive, and thus we have a contradiction.

Exercise 4.4
Prove theorem 4.7 for the case where H consists of a single clique. Theorem 4.7 is equation 928 in my notes.
For a single clique D, the question reduces to: Show that, for any assignment d to D:
 
exp − (d)
P (d) = P   (935)
d0
exp − (d0 )
 
= exp ∗D (d) (936)

Consider the case where |D| = 1, i.e. d = d is a single variable. Then

    ε*_D(d) = (−1)^{|D|} ℓ(ξ*_D) + (−1)^{|D−D|} ℓ(d)                                  (937)
            = −ℓ(ξ*_D) + ℓ(d)                                                         (938)
            = −ln P(d*) + ln P(d)                                                     (939)

and therefore

    exp( ε*_D(d) ) = P(d) / P(d*)                                                     (940)

which is clearly incorrect (???) TODO: figure out what's going on here. Either the book has a typo in its statement of theorem 4.7, or I'm absolutely insane.

Probabilistic Graphical Models May 27, 2018

Local Probabilistic Models (Ch. 5)


Table of Contents Local Written by Brandon McKinzie

Koller and Friedman (2009). Local Probabilistic Models.


Probabilistic Graphical Models: Principles and Techniques.

Deterministic CPDs. When X is a deterministic function of its parents Pa_X:


    P(x | Pa_X) = 1 if x = f(Pa_X), and 0 otherwise.                                  (941)

Consider the example below, where the double-line notation on C means that C is a determin-
istic function of A and B. What new conditional dependencies do we have?

[Figure: A → C ← B, with C drawn with double lines (deterministic) and with children D and E]

Answer: (D ⊥ E | A, B), which would not be true by d-separation alone. It only holds because C is a deterministic function of A and B.

Probabilistic Graphical Models May 27, 2018

Template-Based Representations (Ch. 6)


Table of Contents Local Written by Brandon McKinzie

Koller and Friedman (2009). Template-Based Representations.


Probabilistic Graphical Models: Principles and Techniques.

In what follows, we will build on an example where a vehicle tries to track its true location
(L) using various sensor readings: velocity (V), weather (W), failure of sensor (F), observed
location (O) of the noisy sensor.

[Figure: the 2-TBN for the vehicle example, with variables W, V, L, F at time t and W′, V′, L′, F′, O′ at time t + 1]

Temporal Models. We discretize time into slices of interval ∆, and denote the ground
random variables at time t · ∆ by X (t) . We can simplify our formulation considerably by
assuming a Markovian system: a dynamic system over template variables X that satisfies
the Markov assumption:
(X (t+1) ⊥ X (0:(t−1)) | X (t) ) (942)
which allows us to define a more compact representation of the joint distribution from time 0
to T:
    P(X^{(0:T)}) = P(X^{(0)}) ∏_{t=0}^{T−1} P(X^{(t+1)} | X^{(t)})                    (943)
One last simplifying assumption, to avoid having unique transition probabilities for each time
t, is to assume a stationary205 Markovian dynamic system, defined s.t. P (X (t+1) | X (t) ) is
the same for all t.
205
Also called time invariant or homogeneous.

Dynamic Bayesian Networks (6.2.2). Above, I’ve drawn the 2-time-slice Bayesian net-
work (2-TBN) for our location example. A 2-TBN is a conditional BN over X 0 given XI ,
where XI ⊆ X is a set of interface variables206 . For each template variable Xi , the CPD
P (Xi0 | P aXi0 ) is a template factor. We can use the notion of the 2-TBN to define the more
general dynamic Bayesian network:

A dynamic Bayesian network (DBN) is a pair hβ0 , β→ i, where β0 is a Bayesian network


over X (0) , representing the initial distribution over states, and β→ is a 2-TBN for the
process. For any T ≥ 0, the unrolled Bayesian network is defined such that
• p(X_i^{(0)} | Pa_{X_i^{(0)}}) is the same as the CPD for the corresponding X_i in β_0.
• p(X_i^{(t)} | Pa_{X_i^{(t)}}) (for t > 0) is the same as the CPD for the corresponding X_i′ in β_→.

State-Observation Models (6.2.3). Temporal models that, in addition to the Markov as-
sumption (eq. 942), model the observation variables at time t as conditionally independent of
the entire state sequence given the variables at time t:
 
O (t) ⊥ X (0:(t−1)) , X (t+1:∞) | X (t) (944)

So basically a 2-TBN with the constraint that observation variables are leaves and only have
parents in X 0 . We now view our probabilistic model as consisting of 2 components: the
transition model P (X 0 | X), and the observation model P (O | X). The two main architectures
for such models are as follows:
• Hidden Markov Models. Defined as having a single state variable S and a single
observation variable O. In practice, the transition model P (S 0 | S) is often assumed to
be sparse (many possible transitions having zero probability). In such cases, one usually
represents them visually as probabilistic finite-state automaton207 .
• Linear Dynamical Systems (LDS) represent a system of one or more real-valued
variables that evolve linearly over time, with some Gaussian noise. Such systems are
often called Kalman filters, after the algorithm used to perform tracking. They can be
viewed as a DBN with continuous variables and all dependencies are linear Gaussian208 .
A LDS is traditionally represented as a state-observation model, where both state and
observation are vector-valued RVs, and the transition/observation models are encoded
using matrices. More formally, for X (t) ∈ Rn , O ∈ Rm :

    P(X^{(t)} | X^{(t−1)}) = N(A X^{(t−1)}; Q),    A ∈ R^{n×n}, Q ∈ R^{n×n}           (945)
    P(O^{(t)} | X^{(t)}) = N(H X^{(t)}; R),        H ∈ R^{m×n}, R ∈ R^{m×m}           (946)

206
Interface variables are those variables whose values at time t can have a direct effect on the variables at
time t + 1.
207
FSA use the graphical notation where the nodes are the individual possible values in V al(S), and the
directed edges from some a to b have weight equal to P (S 0 = b | S = a).
208
Linear Gaussian: some var Z pointing to X denotes that X = ΛZ + noise, where noise ∼ N (µx , Σx ).

Template Variables and Template Factors (6.3). It’s convenient to view the world as
being composed of a set of objects, which can be divided into a set of mutually exclusive and
exhaustive classes Q = Q1 , . . . , Qk . Template attributes have a tuple of arguments, each of
which is associated with a particular class of objects, which defines the set of objects that can
be used to instantiate the argument in a given domain. Template attributes thus provide us
with a “generator” for RVs in a given probability space. Formally,

An attribute A is a function A(U_1, . . . , U_k), whose range is some set Val(A), and where each argument U_i is a typed logical variable associated with a particular class Q[U_i]. The tuple U_1, . . . , U_k is called the argument signature of the attribute A, and denoted α(A).

Plate Models (6.4.1). The simplest example of a plate model is shown below. It describes
multiple RVs generated from the same distribution D.

[Figure: plate model with θ_X outside a plate labeled Data (m), and X inside the plate]

This could be a plate model for a set of coin tosses sampled from a single coin. We have a set
of m random variables X(d), where d ∈ D. Each X(d) is the random variable for the dth coin
toss. We also explicitly model that the single coin for which the tosses are used is sampled
from a distribution θX , which takes on values [0, 1] and denotes the bias of the coin.

Probabilistic Graphical Models July 26, 2018

Gaussian Network Models (Ch. 7)


Table of Contents Local Written by Brandon McKinzie

Koller and Friedman (2009). Gaussian Network Models.


Probabilistic Graphical Models: Principles and Techniques.

Multivariate Gaussians. Here I’ll give two forms of the familiar density function, followed
by some comments and terminology.
    p(x) = ( 1 / ((2π)^{n/2} √(det Σ)) ) exp( −(1/2) (x − µ)^T Σ^{−1} (x − µ) )       (947)
    p(x) ∝ exp( −(1/2) x^T J x + (Jµ)^T x ),   where J ≜ Σ^{−1}                       (948)

• The standard Gaussian is defined as N (0, I).


• Σ must be positive definite:^{209} ∀x ≠ 0, x^T Σ x > 0. Recall that Σ_{i,j} = Cov[x_i, x_j] = E[x_i x_j] − µ_i µ_j.
The two operations we usually want to perform on a Gaussian are (1) computing marginals, and (2) conditioning the distribution on an assignment of some subset of the variables. For (1), it's easier to use the standard form of p(x), whereas for (2) it is easier to use the information form (the one using J).

Multivariate Gaussians are also special because we can easily determine whether two x_i and x_j are independent: x_i ⊥ x_j IFF Σ_{i,j} = 0.^{210} For conditional independencies, the information matrix J is easier to work with: (x_i ⊥ x_j | {x_k}_{k∉{i,j}}) IFF J_{i,j} = 0. This condition is also how we defined pairwise independencies in a Markov network, which leads to the awesome realization:

We can view the information matrix J as directly defining a minimal I-map Markov network
for [multivariate Gaussian] p, whereby any entry J_{i,j} ≠ 0 corresponds to an edge x_i—x_j in
the network.
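A quick numerical illustration of that realization (my own example): for a chain-structured Gaussian x_1—x_2—x_3, the information matrix has J_{1,3} = 0 even though Σ_{1,3} ≠ 0.

import numpy as np

# Precision matrix of a Gaussian Markov chain x1 - x2 - x3:
# nonzero off-diagonals only on the chain edges (1,2) and (2,3).
J = np.array([[ 2.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  2.0]])

Sigma = np.linalg.inv(J)
print(np.round(Sigma, 3))   # Sigma[0, 2] != 0: x1 and x3 are marginally dependent
print(J[0, 2])              # 0.0: x1 is independent of x3 given x2, so no edge x1 - x3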

209
Most definitions of p.d. also require that the matrix be symmetric. I also think it helps to view positive
definite from the operator perspective: A linear operator T is positive definite if T is self-adjoint and hT (x), xi >
0. In other words, “positive definite” means that the result of applying the matrix/operator to any nonzero x
will always have a positive component along the original direction x̂.
210
This is not true in general – just for multivariate Gaussians!

Probabilistic Graphical Models May 06, 2018

Variable Elimination (Ch. 9)


Table of Contents Local Written by Brandon McKinzie

Koller and Friedman (2009). Variable Elimination.


Probabilistic Graphical Models: Principles and Techniques.

Analysis of Exact Inference. The focus of this chapter is the conditional probability query,

    Pr[Y | E = e] = Pr[Y, e] / Pr[e]                                                  (9.1)

Ideally, we want to obtain all instantiations Pr [y | e] of equation 9.1. Let W = X − Y − E be


the RVs that are neither query nor evidence. Then
    Pr[y | e] = Σ_w Pr[y, e, w] / Σ_y Pr[y, e]                                        (9.3)

and note that, by computing all instantiations for the numerator first, we can reuse them to
obtain the denominator. We can formulate the inference problem as a decision problem, which
we will call BN P rDP , defined as follows:

Given: Bayesian network B over X , a variable X ∈ X , and a value x ∈ V al(X).


Decide: whether PrB [X = x] > 0. BN P rDP

Thm 9.1: The decision problem BN P rDP is N P-complete. Proof :


1. BNPrDP is in N P: Guess a full assignment ξ to the network.^{211} If the guess is successful, where success is defined as (X = x) ∈ ξ and P(ξ) > 0, then we know that P(X = x) > 0.^{212} Computing P(ξ) is linear in the number of factors for a BN, since we just multiply them together.
2. BNPrDP is N P-hard. We show this by proving that we can solve 3-SAT (which is N P-hard) by
transforming inputs to 3-SAT to inputs of BNPrDP in polynomial time. Given any 3-SAT formula φ,
we can create a BN Bφ with some special variable X s.t. φ is satisfiable IFF PBφ (X = x1 ) > 0. You
can easily build such a network by having a node Qi for each binary RV qi , and a node Ci for each
of the clauses that’s a deterministic function of its parents (up to 3 Q nodes). Then, the node X is a
deterministic function of its parents, which are chains of AND gates along the Ci . Since each node has
at most 3 parents, we can ensure that construction is bounded by polynomial time in the length of φ.

211
Apparently the time it takes to generate a guess is irrelevant.
212
This is true because P(x) can be decomposed as a sum P(ξ) + Σ P(· · ·) ≥ P(ξ).

Analysis of Approximate Inference. Consider a specific query P (y | e), where we focus
on a particular assignment y. Let ρ denote some approximate answer, whose accuracy we wish
to evaluate relative to the correct probability. We can use the relative error to estimate the
quality of the approximation: an estimate ρ has relative error ε if:

    ρ / (1 + ε) ≤ P(y | e) ≤ ρ (1 + ε)                                                (949)

Unfortunately, the task of finding some approximation ρ with relative error ε is also N P-hard.
Furthermore, even if we relax this metric by using absolute error instead, we end up finding
that in the case where we have evidence, approximate inference is no easier than
exact inference, in the worst case.

Variable Elimination. Basically, a dynamic programming approach for performing exact


inference. Consider the simple BN X1 → · · · → Xn , where each variable can take on k possible
values. The dynamic programming approach for computing P (Xn ) involves computing
    P(X_{i+1}) = Σ_{x_i} P(X_{i+1} | x_i) P(x_i)                                      (950)

n − 1 times, starting with i = 1, all the way up to i = n − 1, reusing the previous computation
at each step, with total cost O(nk 2 ). So for this simple network, even though the size of the
joint is k n (exponential in n), we can do inference in linear time.
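A sketch of this forward pass (equation 950 applied n − 1 times) on a random chain of binary variables (the transition tables below are synthetic):

import numpy as np

k, n = 2, 5
rng = np.random.default_rng(0)

p = rng.dirichlet(np.ones(k))                                       # P(X1)
trans = [rng.dirichlet(np.ones(k), size=k) for _ in range(n - 1)]   # T[x_i, x_{i+1}] = P(x_{i+1} | x_i)

for T in trans:                    # equation 950, reusing the previous marginal each step
    p = p @ T                      # P(X_{i+1}) = sum_{x_i} P(x_{i+1} | x_i) P(x_i)
print(p, p.sum())                  # marginal P(X_n); sums to 1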
First, we formalize some basic concepts before defining the algorithm.
Factor marginalization:
Let X be a set of variables, and Y ∉ X a variable. Let φ(X, Y) be a factor. Define the factor marginalization of Y in φ, denoted Σ_Y φ, to be a factor ψ over X such that:

    ψ(X) = Σ_Y φ(X, Y)

The key observation that’s easy to miss is that we’re only summing entries in the table where
the values of X match up. One useful rule for exchanging factor product and summation: If
X ∉ Scope[φ_1], then

    Σ_X (φ_1 · φ_2) = φ_1 · Σ_X φ_2                                                   (9.6)

So, when computing some marginal probability, the main idea is to group factors together and
compute expressions of the form
    Σ_Z ∏_{φ∈Φ} φ                                                                     (951)

where Φ is the set of all factors φ for which Z ∈ Scope[φ]. This is commonly called the sum-
product inference task. The full algorithm for sum-product variable elimination, which is an
instantiation of the sum-product inference task, is illustrated below.

This is what we use to compute the marginal probability P (X) where X = X −Z. To compute
conditional queries of the form P (Y | E = e), simply replace all factors whose scope overlaps
with E with their reduced factor (see chapter 4 notes for definition) to get the unnormalized
φ*(Y) (the numerator of P(Y | e)). Then divide by Σ_y φ*(y) to obtain the final result.

Example. We will work through computing P (Job) for the BN below.

[Figure: the extended student network, with edges Coherence → Difficulty, Difficulty → Grade, Intelligence → Grade, Intelligence → SAT, Grade → Letter, Letter → Job, SAT → Job, Grade → Happy, and Job → Happy]

Due to Happy being a child of Job, P (J) actually requires using all factors in the graph.
Below shows how VE with elimination ordering C, D, I, H, G, S, L progressively simplifies the
equation for computing P (J).
    P(J) = Σ_{c,d,i,h,g,s,ℓ} P(J | s, ℓ) P(s | i) P(ℓ | g) P(g | d, i) P(d | c) P(h | J, g) P(c) P(i)       (952)
         = Σ_{c,d,i,h,g,s,ℓ} ( P(c) P(d | c) ) · P(g | d, i) P(i) P(s | i) P(h | J, g) P(ℓ | g) P(J | s, ℓ)  (953)
         = Σ_{d,i,h,g,s,ℓ} ( τ_1(d) P(g | d, i) ) · P(i) P(s | i) P(h | J, g) P(ℓ | g) P(J | s, ℓ)           (954)
         = Σ_{i,h,g,s,ℓ} ( τ_2(g, i) P(i) P(s | i) ) · P(h | J, g) P(ℓ | g) P(J | s, ℓ)                      (955)
         = Σ_{h,g,s,ℓ} ( P(h | J, g) ) · τ_3(g, s) P(ℓ | g) P(J | s, ℓ)                                      (956)
         = Σ_{g,s,ℓ} ( τ_4(g, J) τ_3(g, s) P(ℓ | g) ) · P(J | s, ℓ)                                          (957)
         = Σ_{s,ℓ} τ_5(J, ℓ, s) · P(J | s, ℓ)                                                                (958)
         = Σ_ℓ τ_6(J, ℓ)                                                                                     (959)

where the parenthesized factors (highlighted in red in the original notes) are the ones multiplied together and summed out at the given step of the VE algorithm.

Probabilistic Graphical Models May 06, 2018

Clique Trees (Ch. 10)


Table of Contents Local Written by Brandon McKinzie

Koller and Friedman (2009). Clique Trees.


Probabilistic Graphical Models: Principles and Techniques.

Cluster Graphs. A graphical flowchart of the factor-manipulation process that will be rel-
evant when we discuss message passing. Each node is a cluster, which is associated with a
subset of variables. Formally,

A cluster graph U for a set of factors Φ over X is an undirected graph. Each node
i is associated with a subset Ci ⊆ X . Each factor φ ∈ Φ must be associated with
a cluster Ci , denoted α(φ), such that Scope[φ] ⊆ Ci (family-preserving). Each
edge between a pair of clusters Ci and Cj is associated with a sepset Si,j ⊆ Ci ∩Cj .

Recall that each step of variable elimination involves creating a factor ψi by multiplying a
group of factors.^{213} Then, denoting the variable we are eliminating at this step as Z, we obtain another factor τ_i that's the factor marginalization of Z in ψ_i (denoted Σ_Z ψ_i). An
execution of variable elimination defines a cluster graph: we have a cluster for each of the ψi ,
defined as Ci = Scope[ψi ]. We draw an edge between Ci and Cj if the message τi is used in
the computation of τj .

Consider when we applied variable elimination to the student graph network below, to compute
P (J). Elimination ordering C, D, I, H, G, S, L.

213
All whose scope contains the variable we are currently trying to eliminate.

[Figure: the student network again, and the cluster graph induced by the VE run above: 1: C,D — 2: D,I,G — 3: G,I,S — 5: G,J,L,S — 6: J,L,S — 7: J,L, with 4: G,H,J also feeding into 5; the labeled sepsets include D, {G,I}, {J,S,L}, and {J,L}.]

Clique Trees. Since VE uses each intermediate τi at most once, the cluster graph induced
by an execution VE is necessarily a tree, and it also defines a directionality: the direction of
the message passing (left-to-right in the above illustration). All the messages flow toward a
single cluster where the final result is computed – the root of the tree; we say the messages
“flow up” to the root. Furthermore, for cluster trees induced by VE, the scope of each message
(edge) τi is exactly Ci ∩ Cj , not just a subset214 .

Let Φ be a set of factors over X . A cluster tree over Φ that satisfies the running
intersection property is called a clique tree. For clique trees, the clusters are also
called cliques.

Message Passing: Sum Product. Previously we saw how an execution of VE can be


illustrated with a clique tree. We now go the other direction – given a clique tree, we show
how it can be used for variable elimination. Given a clique tree representation of some BN,
we can use it to guide us along an execution of VE to compute any marginal we’d like. First,
before any run, we generate the set of initial potentials ψi associated with each clique Ci in
the tree, defined as just the multiplication of the initial factors associated with the clique. We
define the root of the tree as any clique containing the variable whose marginal we want to
compute (we pick arbitrarily). Starting from the leaves and moving toward the root, we pass
messages along from clique to clique. A clique is ready to send a message when it has received
a message from all of its downstream neighbors. The message from Ci to [a neighbor] Cj is
computed using the sum-product message passing computation:

214
This follows from the running intersection property, which is satisfied by any cluster tree that’s defined
by variable elimination. It’s defined as, if any variable X is in both cluster Ci and Cj , then X is also in every
cluster in the (unique) path in the tree between Ci and Cj .

    δ_{i→j} = Σ_{C_i − S_{i,j}} ψ_i · ∏_{k ∈ (Nb_i − {j})} δ_{k→i}                    (10.2)

where the summation is simply over the variables in Ci that aren’t passed along to Cj , and the
product is over all messages that Ci received. Stated even simpler, we multiply all the incoming
messages by our initial potential, then sum out all variables except those in Si,j . When the
root clique has received all messages, it multiplies them with its own initial potential, resulting
in a factor called the beliefs, βr (Cr ). It represents
    P̃_Φ(C_r) = Σ_{X − C_r} ∏_φ φ                                                      (960)

where, to be clear, the product is over all φ in the graph.

Below is a more compact summary of all of this, showing the procedure for computing all final
factors (belief) βi for some marginal probability query on the variables in Cr asynchronously.
Algorithm 10.2: Sum-Product Belief Propagation
1. For each clique C_i, compute its initial potential:

    ψ_i(C_i) ← ∏_{φ_j : α(φ_j) = i} φ_j

2. While ∃ i, j such that i is ready to transmit to j, compute:

    δ_{i→j} ← Σ_{C_i − S_{i,j}} ψ_i · ∏_{k ∈ (Nb_i − {j})} δ_{k→i}

3. Then, compute each belief factor β_i by multiplying the initial potential ψ_i by the incoming messages to C_i:

    β_i ← ψ_i · ∏_{k ∈ Nb_{C_i}} δ_{k→i}

4. Return the set of beliefs {β_i}, where

    β_i = Σ_{X − C_i} P̃_Φ(X) = P̃_Φ(C_i)                                              (961)

The SP Belief Propagation algorithm above is also called clique tree calibration. A clique
tree T is calibrated if all pairs of adjacent cliques are calibrated. A calibrated clique tree
satisfies the following property for what we’ll now call the clique beliefs, βi , and the sepset
beliefs, µi,j over Si,j :
X X
µi,j (Si,j ) , βi = βj (962) µi,j = P
eΦ (Si,j )
Ci −Si,j Cj −Si,j

The main advantage of the clique tree algorithm is that it computes the posterior probability
of all variables in a graphical model using only twice the computation215 of the upward pass
in the same tree.
215
The algorithm is equivalent to doing one upward pass, one downward pass.

We can also show that µi,j = δj→i δi→j , which then allows us to derive:
    P̃_Φ(X) = ∏_{i∈V_T} β_i / ∏_{(i–j)∈E_T} µ_{i,j}                                    (10.10)

In other words, the clique and sepset beliefs provide a reparameterization of the unnormal-
ized measure, a property called the clique tree invariant.

Message Passing: Belief Update. We now discuss an alternative message-passing approach that is mathematically equivalent to, but intuitively different from, the sum-product approach. First, we introduce some new definitions.
Factor Division:
Let X and Y be disjoint sets of variables, and let φ1 (X, Y) and φ2 (Y) be two factors.
We define the division of φ1 and φ2 as a factor ψ with scope X, Y as follows: Define 0/0 = 0

    ψ(X, Y) ≜ φ_1(X, Y) / φ_2(Y)

Looking back at equation 10.2, we can now see that another way to write δi→j is
    δ_{i→j} = ( Σ_{C_i − S_{i,j}} β_i ) / δ_{j→i}                                     (10.13)

Now, consider the clique tree below for the simple Markov network A-B-C-D:

1: A,B 2: B,C 3: C,D

If we assigned C_2 as the root, then our previous approach would compute δ_{2→1} as Σ_C ψ_2 · δ_{3→2}.

Alternatively, we can use equation 10.13 to realize this is equivalent to dividing β2 by δ1→2
and marginalizing out C. This observation motivates the algorithm below, which allows us to
execute message passing in terms of the clique and sepset beliefs, without having to remember
the initial potentials ψi or explicitly compute the messages δi→j .
Algorithm 10.3: Belief-Update Message Passing
1. For each clique Ci , set its initial belief βi to its initial potential ψi . For each edge in ET ,
set µi,j = 1.
2. While there exists an uninformed^{216} clique in T, select any edge in E_T, and compute

216
A clique is informed once it has received informed messages from all of its neighbors. An informed message
is one that has been sent by taking into account information from all of the sending cliques’ neighbors (aside
from the receiving clique of that message, of course).

317
    σ_{i→j} ← Σ_{C_i − S_{i,j}} β_i                                                   (963)
    β_j ← β_j · σ_{i→j} / µ_{i,j}                                                     (964)
    µ_{i,j} ← σ_{i→j}                                                                 (965)
3. Return the resulting set of informed beliefs {βi }.
At convergence, σi→j = µi,j = σj→i .

Probabilistic Graphical Models June 02, 2018

Inference as Optimization (Ch. 11)


Table of Contents Local Written by Brandon McKinzie

Koller and Friedman (2009). Inference as Optimization.


Probabilistic Graphical Models: Principles and Techniques.

Propagation-Based Approximation. We can use a general-purpose cluster graph rather


than the more restrictive clique tree (needed to guarantee exact inference) for approximate
inference methods. Consider the simple Markov network below on the left.

[Figure: (left) the loop Markov network A—B—C—D—A; (right) a cluster graph with one cluster per initial potential: 1: A,B, 2: B,C, 3: C,D, 4: A,D]

The clique tree for this network, which can be used for exact inference, has two cliques ABD
and BCD and messages are passed between them consisting of τ (B, D). Suppose that, instead,
we set up 4 clusters corresponding to each of the initial potentials, shown as the cluster graph
above on the right. We can still apply belief propagation here, but due to it now having loops
(as opposed to before when we only had trees), the process may not converge.

Probabilistic Graphical Models June 17, 2018

Parameter Estimation (Ch. 17)


Table of Contents Local Written by Brandon McKinzie

Koller and Friedman (2009). Parameter Estimation.


Probabilistic Graphical Models: Principles and Techniques.

Maximum Likelihood Estimation (17.1). In this chapter, assume the network structure
is fixed and that our data set D consists of fully observed instances of the network variables:
D = {ξ[1], . . . , ξ[M ]}. We begin with the simplest learning problem: parameter learning for a
single variable. We want to estimate the probability, denoted via the parameter θ, with which
the flip of a thumbtack will land heads or tails. Define the likelihood function L(θ : x) as
the probability of observing some sequence of outcomes x under the parameter θ. In other
words, it is simply P (x : θ), but interpreted as a function of θ. For our simple case, where D
consists of M thumbtack flip outcomes,

    L(θ : D) = θ^{M[1]} (1 − θ)^{M[0]}                                                (966)

where M [1] denotes the number of outcomes in D that were heads. Since it’s easier to maximize
a logarithm, and since it yields the same optimal θ̂, optimize the log-likelihood to obtain:

    θ̂ = argmax_θ ℓ(θ : D) = argmax_θ ( M[1] log θ + M[0] log(1 − θ) )                (967)
      = M[1] / (M[1] + M[0])                                                          (968)

Note that MLE has the disadvantage that it can’t communicate confidence of an estimate217 .
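A quick numeric check of equations 967–968 on synthetic flips (my own example):

import numpy as np

flips = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])    # synthetic thumbtack outcomes
M1, M0 = flips.sum(), (1 - flips).sum()
theta_hat = M1 / (M1 + M0)                           # closed-form MLE, eq. 968

# Verify against a grid search over the log-likelihood, eq. 967.
grid = np.linspace(1e-3, 1 - 1e-3, 999)
loglik = M1 * np.log(grid) + M0 * np.log(1 - grid)
print(theta_hat, grid[np.argmax(loglik)])            # both ~0.7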

We now provide the more general formal definitions for MLE.


• We are given a training set D containing M (IID) instances of a set of random variables
X , where the samples of X are drawn from some unknown distribution P ∗ (X ).
• We are given a parametric model, defined by a function P (ξ; θ), where ξ is an instance
of X , and we want to estimate its parameters θ 218 . The model also defines the space of
legal parameter values Θ, the parameter space.
• We then define the likelihood function L(θ : D) = ∏_m P(ξ[m] : θ).

217
We get the same result (0.3) if we get 3 heads out of 10 flips, as we do for getting 300 heads out of 1000
flips; yet, the latter experiment should include a higher degree of confidence
218
We also have the constraint that P (ξ; θ) must be a valid distribution (nonnegative and sums to 1 over all
possible ξ)

We can often simplify the likelihood function to simpler terms, like our M [0] and M [1] values
in the thumbtack example. These are called the sufficient statistics, defined as functions of
the data that summarize the relevant information for computing the likelihood. Formally,
A function τ(ξ), mapping instances ξ to R^ℓ (for some ℓ), is a sufficient statistic if for any two data sets D and D′, we have that

    Σ_{ξ[m]∈D} τ(ξ[m]) = Σ_{ξ′[m]∈D′} τ(ξ′[m])   =⇒   L(θ : D) = L(θ : D′)            (969)

We often informally refer to the tuple Σ_{ξ[m]∈D} τ(ξ[m]) as the sufficient statistics of the data set D.

MLE for Bayesian Networks – Simple Example. We now move on to estimating pa-
rameters θ for the simple BN X → Y for two binary RVs X and Y . Our parameters θ are the
individual probabilities of P (X) and P (Y | X) (6 total). Since BNs have the nice property that
their joint probability decomposes into a product of probabilities, just like how the likelihood
function is a product of probabilities, we can write the likelihood function as a product of the
individual local probabilities:
    L(θ : D) = ( ∏_m P(x[m] : θ_X) ) ( ∏_m P(y[m] | x[m] : θ_{Y|X}) )                 (970)    [margin: decomposability of the likelihood function]

which can be decomposed even further by e.g. differentiating products over x[m] : x[m] = x0
etc. Just as we used M [0] in the thumbtack example to count the number of instances with a
certain value, we can use the same idea for the general case.
Let Z be some set of RVs, and z be some instantiation to them. We define M [z] to be the
number of entries in data set D that have Z[m] = z:
    M[z] = Σ_m 1{ Z[m] = z }                                                          (971)

Global Likelihood Decomposition. We now move to the more general case of computing
the likelihood for BN with structure G.
    L(θ : D) = ∏_m P_G(ξ[m] : θ)                                                      (972)    [margin: global decomposition of the likelihood]
              = ∏_m ∏_i P(x_i[m] | Pa_{X_i}[m] : θ)                                   (973)
              = ∏_i ( ∏_m P(x_i[m] | Pa_{X_i}[m] : θ_{X_i | Pa_{X_i}}) )              (974)
              = ∏_i L_i(θ_{X_i | Pa_{X_i}} : D)                                       (975)

where L_i is the local likelihood function for X_i. Assuming these are each disjoint sets of parameters from one another, each L_i can be maximized independently, which implies that θ̂ = ⟨θ̂_{X_1|Pa_{X_1}}, . . . , θ̂_{X_n|Pa_{X_n}}⟩.

Probabilistic Graphical Models June 17, 2018

Partially Observed Data (Ch. 19)


Table of Contents Local Written by Brandon McKinzie

Koller and Friedman (2009). Partially Observed Data.


Probabilistic Graphical Models: Principles and Techniques.

Likelihood of Data and Observation Models (19.1.1). Consider the simple example of
flipping a thumbtack, but occasionally the thumbtack rolls off the table. We choose to ignore
the tosses for which the thumbtack rolls off. Now, in addition to the random variable X giving
the flip outcome, we have the observation variable OX , which tells us whether we observed the
value of X.

[Figure: plate model with the parameters θ and ψ outside the plate, and X and O_X inside the plate over the repeated tosses]

The illustration above is a plate model where we choose a thumbtack sampled with bias θ and
repeat some number of flips with that same thumbtack. We also sample the random variable
OX that has probability of observation sampled from ψ and fixed for all the experiments we
do. This leads to the following definition for the observability model.
Let X = {X_1, . . . , X_n} be some set of RVs, and let O_X = {O_{X_1}, . . . , O_{X_n}} be their observability variables. The observability model is a joint distribution

    P_missing(X, O_X) = P(X) · P_missing(O_X | X)

so that P(X) is parameterized by θ and P_missing(O_X | X) is parameterized by ψ. We define a new set of RVs Y = {Y_1, . . . , Y_n} where Val(Y_i) = Val(X_i) ∪ {?}. The actual observation Y is a deterministic function of X and O_X:

    Y_i = X_i if O_{X_i} = o^1, and Y_i = ? if O_{X_i} = o^0                          (976)

For our simple model above, we have

P (Y = 1) = θψ (977)
P (Y = 0) = (1 − θ)ψ (978)
P (Y =?) = (1 − ψ) (979)
    L(θ, ψ; D) = θ^{M[1]} (1 − θ)^{M[0]} ψ^{M[1]+M[0]} (1 − ψ)^{M[?]}                 (980)

The main takeaway is to understand that when we have missing data, the data-generation
process involves two steps: (1) generate data by sampling from the model, then
(2) determine which values we get to observe and which ones are hidden from us.

The Likelihood Function (19.1.3). Assume we have a BN network G over a set of variables
X. In general, each instance has a different set of observed variables. Denote by O[m] and
o[m] the observed vars and their values in the m’th instance, and by H[m] the missing (or
hidden) vars in the m’th instance.

Information Theory, Inference, and Learning Algorithms
Contents

8.1 Introduction to Information Theory (Ch. 1) . . . . . . . . . . . . . . . . . . . . . . . . . 325


8.2 Probability, Entropy, and Inference (Ch. 2) . . . . . . . . . . . . . . . . . . . . . . . . . 327
8.2.1 More About Inference (Ch. 3 Summary) . . . . . . . . . . . . . . . . . . . . . . . 329
8.3 The Source Coding Theorem (Ch. 4) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
8.3.1 Data Compression and Typicality . . . . . . . . . . . . . . . . . . . . . . . . . . 333
8.3.2 Further Analysis and Q&A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
8.4 Monte Carlo Methods (Ch. 29) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
8.5 Variational Methods (Ch. 33) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339

Information Theory, Inference, and Learning Algorithms November 11, 2017

Introduction to Information Theory (Ch. 1)


Table of Contents Local Written by Brandon McKinzie

[Note: Skipping most of this chapter since it’s mostly introductory material.]

Preface. For ease of reference, some common quantities we will frequently be using:
• Binomial distribution. Let r denote the number of successful trials out of N total
trials. Let f denote the probability of success for a single trial.
    Pr[r | f, N] = (N choose r) f^r (1 − f)^{N−r},    E[r] = N f,    Var[r] = N f (1 − f)      (981)

• Stirling's Approximation. [margin: Recall that log_b x = log_a x / log_a b.]

    x! ≃ x^x e^{−x} √(2πx)   ⇔   ln x! ≃ x ln x − x + (1/2) ln(2πx)                   (982)
    ln (N choose r) ≃ r ln(N/r) + (N − r) ln(N/(N − r))                               (983)
• Binary Entropy Function and its relation with Stirling’s approximation.
    H_2(x) ≜ x lg(1/x) + (1 − x) lg(1/(1 − x))                                        (984)
    lg (N choose r) ≃ N H_2(r/N)                                                      (985)
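A quick numeric check of equation 985 (my own sketch):

import math

N, r = 1000, 300

def H2(x):                                    # binary entropy in bits, eq. 984
    return x * math.log2(1 / x) + (1 - x) * math.log2(1 / (1 - x))

exact = math.log2(math.comb(N, r))            # lg C(N, r)
approx = N * H2(r / N)                        # eq. 985
print(exact, approx)                          # ~876 vs ~881 bits: the approximation captures the leading term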

Perfect communication over an imperfect, noisy communication channel (1.1). We


want to make an encoder-decoder architecture, of the general form in the figure below, to
achieve reliable communication over a noisy channel.
Information theory is concerned with the theoretical limitations and potentials of such
systems. Let’s explore some examples for the case of the binary symmetric channel219 :
• Repetition codes. Let RN denote the repetition code that repeats each bit in the
message N times220 . We model the channel as “adding221 ” a sparse noise vector n to
the encoded message t, so r := n + t. What is the optimal way of decoding r?
    ŝ_i ← argmax_{s_i} Pr[s_i | r_{i:i+N}] = argmax_{s_i} Pr[r_{i:i+N} | s_i] Pr[s_i]         (986)
219
A binary symmetric channel transmits each bit correctly with probability (1 - f) and incorrectly with
probability f, where f is assumed to be small.
220
So R2 would encode 101 as 110011.
221
We add in modulo 2, which is NOT the same as binary arithmetic (no carry). Addition modulo 2 is the
same as doing XOR.

We see that we must make assumptions about the prior probability Pr [si ]. It is common
to assume all possible values of si (0 or 1 in this case) are equally probable. It is useful
to observe the likelihood ratio,

    Pr[r_{i:i+N} | s_i = 1] / Pr[r_{i:i+N} | s_i = 0]
        = ∏_{n=i}^{i+N−1} Pr[r_n | t_n = 1] / Pr[r_n | t_n = 0]
        = ∏_{n=i}^{i+N−1} { γ if r_n = 1;  γ^{−1} if r_n = 0 }                        (987)

where we’ve defined γ := (1 − f )/f , with f being the probability of a bit getting flipped
by the channel. We want to assign ŝi to the most likely hypothesis out of the possible
si . If the likelihood ratio [for the two hypotheses] is greater than 1, we choose ŝi = 1,
else we choose ŝi = 0.
• Block Codes – the (7, 4) Hamming Code. Although increasing the number of repetitions per bit N for our R_N repetition code can decrease the error-per-bit probability p_b, we incur a substantial decrease in the rate of information transfer – a factor of 1/N.
The (7, 4) Hamming Code tries to improve this by encoding blocks of bits at a time
(instead of per-bit). It is a linear block code – it encodes each 4-bit block into a 7-bit
block, where the additional 3 bits are linear functions of the original K = 4 bits. The
(7, 4) Hamming code has each of the extra 3 bits act as parity checks. It's easiest to see with the book's three-circle parity-check illustration, which encodes 1000 as 1000101.

Information Theory, Inference, and Learning Algorithms November 12, 2017

Probability, Entropy, and Inference (Ch. 2)


Table of Contents Local Written by Brandon McKinzie

[Note: Skipping most of this chapter since it’s mostly introductory material.]

Notation. Some of the notation this author seems to use a lot.


• Ensemble X: a triple (x, AX , PX ), where the outcome x is the value of a R.V. which
can take on one of a set of possible values (“alphabet”), AX = {a1 , . . . , ai , . . . , aI },
having probabilities PX = {p1 , . . . , pi , . . . pI }. Note that this doesn’t appear technically
consistent with how the author actually uses the term ensemble – in practice, he actually
means “X is the set of all possible triples (x, A_X, P_X), ∀x ∈ A_X”, or something like that.
He uses it to casually refer to the space of possibilities.

Forward Probabilities and Inverse Probabilities. Both of these involve a generative


model of the data. In a forward probability problem, we want to find the PDF, expectation,
or some other function of a quantity that depends on the data/is produced by the generative
process. For example, we can model a series of N coin flips as “producing” the quantity
nH , denoting the number of heads. In an inverse probability problem, we want to compute
conditional probabilities on one or more of the unobserved variables in the process, given the
observed variables.

The Likelihood Principle. For a generative model on data d given parameters θ, Pr [d | θ],
and having observed a particular outcome d1 , all inferences and predictions should depend
only on the function Pr [d1 | θ].

Entropy and Related Quantities.


- Shannon information content of an outcome x:
    h(x) ≜ lg( 1 / P(x) )                                                             (988)

They mention the example of the unigram probabilities for each character in a document. For example, p(z) = 0.0007, which has information content of lg(1/0.0007) ≈ 10.4 bits.^{222}

222
Intuition digression: Recall from CS how to compute the number of bits required to represent N unique values (answer: lg(N)). Similarly, a probability of e.g. 1/8 can be interpreted as “one of 8 possible outcomes”, meaning that lg(1/(1/8)) = 3 bits are needed to encode all possible outcomes. Similarly, one could interpret p(z) = 0.0007 as “belonging to 7 of 10000 possible results”. I guess in some strange world you can then say that there are 10000/7 ≈ 1428.6 evenly-proportioned events like this (how do you even word this) and it would take
- Entropy of Ensemble X. Defined to be the average Shannon information content of an
outcome.
(using the convention 0 × lg(1/0) ≜ 0)

H(X) ≜ ∑_{x ∈ A_X} Pr[x] lg( 1 / Pr[x] )    (989)
H(X) ≤ lg(|A_X|), with equality iff p_i = 1/|A_X| ∀i    (990)

- Decomposability of the Entropy. For any probability distribution p = {p1 , p2 , . . . , pI }


and m (where 1 ≤ m ≤ I):

H(p) = H(Σ_{1:m}, Σ_{m+1:I})
       + Σ_{1:m} H( p_1/Σ_{1:m}, ..., p_m/Σ_{1:m} )
       + Σ_{m+1:I} H( p_{m+1}/Σ_{m+1:I}, ..., p_I/Σ_{m+1:I} )    (991–993)

where I've let Σ_{1:m} := ∑_{i=1}^m p_i (and similarly for Σ_{m+1:I}).
- Kullback-Leibler Divergence between two probability distributions P (X) and Q(X) that
are defined over the same alphabet AX :
D_KL(P || Q) = ∑_x P(x) lg( P(x) / Q(x) )    (994)
D_KL(P || Q) ≥ 0    [Gibbs' inequality]    (995)

where, in the words of the author, "Gibbs' inequality is probably the most important inequality in this book" (a quick numeric check of these definitions appears after this list).
- Convex functions and Jensen’s Inequality. A function f (x) is convex over the interval
[x = a, x = b] if every chord of the function lies above the function. That is, ∀x1 , x2 ∈ [a, b]
and 0 ≤ λ ≤ 1:

f (λx1 + (1 − λ)x2 ) ≤ λf (x1 ) + (1 − λ)f (x2 ) (996)


f (E [x]) ≤ E [f (x)] [Jensen’s Inequality] (997)

and we say f is strictly convex if, ∀x1 , x2 ∈ [a, b], we get equality for only λ = 0 and λ = 1.
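As promised above, here is a minimal numeric check of the entropy and KL definitions (my own toy distribution and the uniform comparison are made up, not from the book):

import numpy as np

def entropy(p):
    """H(X) = sum_x p(x) lg(1/p(x)), with the convention 0 lg(1/0) := 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x p(x) lg(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return np.sum(p[nz] * np.log2(p[nz] / q[nz]))

p = np.array([0.5, 0.25, 0.125, 0.125])
u = np.ones(4) / 4
print(entropy(p))            # 1.75 bits, less than lg(4) = 2
print(kl_divergence(p, u))   # >= 0, consistent with Gibbs' inequality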

8.2.1 More About Inference (Ch. 3 Summary)

Because this is too short to have as its own chapter...

A first inference problem. Author explores a particle decay problem of finding Pr [λ | {x}]
where λ is the characteristic decay length and {x} is our collection of observed decay distances.
Plotting the likelihood Pr [{x} | λ] as a function of λ for any given x ∈ {x} shows each has
a peak value. The kicker is the interpretation: if each measurement x ∈ {x} is independent,
the total likelihood is the product of all the individual likelihoods, which can be interpreted as
updating/narrowing the interval [λa , λb ] within which Pr [{x} | λ] (as a function of λ) peaks.
In the words of the author’s mentor:

what you know about λ after the data arrive is what you knew before (Pr [λ]) and
what the data told you (Pr [{x} | λ])

We update our beliefs regarding the distribution of λ as we collect data.

Lessons learned from problems.


(3.8) The classic Monty Hall problem. Be careful defining probabilities after collecting
data. My blunder: when using Bayes' theorem to get the probability of the prize being
behind door i ∈ {1, 2} after the host opens door 3, I failed to take into account that we
chose door 1 while I was computing the evidence (the marginal probability that the host
opened door 3)223 . Quite embarrassing.
(3.9) Monty Hall problem, but an earthquake opens door 3. Although I correctly answered
that, in this case, both hypotheses (the prize being behind door 1 or door 2) are equiprob-
able, I still failed to account for the subtle fact that the earthquake could’ve opened mul-
tiple doors. The lesson here is always write down the probability of everything,
which just so happens to be suggested by the solution for this problem, too.

So, why did I still get the answer correct? Because enumeration of probabilities wasn't necessary at all; you just needed to realize that the likelihoods for the
two remaining hypotheses (H1 and H2 ) were the same – the probability of observing the
earthquake open door 3 and the prize not being revealed was the same for the case of
the prize being behind door 1 or door 2. So maybe the real lesson here is determine
whether calculations are even needed in order to solve the given problem,
which luckily I’ve had drilled in my head for years from studying physics.
(3.15) Another biased coin variant. One of the best examples I've seen for favoring Bayesian methods over frequentist methods. Also, made use of the beta function:

∫_0^1 p^x (1 − p)^y dp = Γ(x + 1)Γ(y + 1) / Γ(x + y + 2) = B(x + 1, y + 1)    (998)
223
Note that, while important to recognize and understand, I could’ve avoided this pitfall entirely by just
ignoring the evidence during calculations and normalizing after, since the evidence can be determined solely by
the normalization constraint.

where B is the beta function, which is defined by the LHS.

Information Theory, Inference, and Learning Algorithms November 23, 2017

The Source Coding Theorem (Ch. 4)


Table of Contents Local Written by Brandon McKinzie

Overview. We will be examining the two following assertions:


1. The Shannon information content is a sensible measure of the information content of a
given outcome x = ai :
h(x = a_i) ≜ lg( 1 / p_i )    (999)

2. The entropy of an ensemble X is a sensible measure of the ensemble's average information content.

H(X) = ∑_i p_i lg( 1 / p_i )    (1000)

The weighing problem. We are instructed to ponder the following problem.


You’re given 12 balls, all equal in weight except for one that is either heavier or lighter. You’re
also given a two-pan balance. Your task is to determine which ball is the odd ball, and in as few
uses of the balance as possible. Note: each use of the balance must place the same number of balls
on either side.

An interesting observation is to consider the number of possible outcomes of the weighing


process. Each outcome can be one of three possibilities: equal weight, left heavier, or left
lighter. After N such weighings, the number of unique possible weighing result sequences is
3N . Note that there are 12 × 2 = 24 unique final answers for our task (identifying which is
the odd ball, and whether it is heavier or lighter). Therefore, since we are seeking a procedure
to identify which of the 24 options is the correct option with 100% accuracy, we require our
weighing procedure to take on at least 24 unique possible results. Since N = 3 weighings
corresponds to 3N = 27 possible outcomes, N = 3 is a lower bound on the number of weighings
our approach will involve. It is impossible to guarantee a correct answer for N < 3 weighings224 .

Things I didn’t consider until reading the solution:


• It’s actually not optimal, upon observing both sides equal, to subsequently use only the
balls not involved in that measurement. My initial reaction to this was “why? we already
know the oddball is not any of the balls just measured, since the outcome was equal.”
224
Finally, it should be clear that, regardless of our approach, the final weighing will involve 2 balls, since we
have to identify which is the oddball AND whether it is heavier/lighter AND the number of balls on the left of
the scale must be the same as the right of the scale for every weighing.

The response to this reaction is: “yes, exactly, and we must use that information to be
able to discern in the future whether, e.g., a measurement of “left side heavier” means
the oddball is on the left and heavy, or if it’s on the right and light – it’s useful to know
that a given side of the scale does not contain the oddball before a measurement.
• More generally, it’s also not optimal to greedily search for solutions that eliminate the
highest number of possibilities in any given single step. Another way of thinking about
this is that it’s undesirable for the ith measurement outcome to cause any of the 3 possible
measurement outcomes to be impossible at the next stage.
• I focused a disproportionate amount of thought on handling the equal-weight measure-
ment outcome, for whatever reason. I probably would’ve arrived at the solution faster if
I’d actually thought about how my strategies would’ve handled some outcome being “left
heavier” and then considered what that strategy would put on the scale at the next step,
where the italics denote what would’ve illuminated the fatal flaw in all my approaches.

Guessing Games. What’s the smallest number of yes/no questions needed to identify an
integer x between 0 and 63? Although it was obvious to me that the solution is to succes-
sively halve the possible values of x, I found it interesting that you can write down the
list of questions independent of the answers at each step using a basic application of
modular arithmetic. In other words, you can specify the full decision tree of N nodes with
just lg N questions. Nice. Also, recognize that the Shannon information content of any single outcome is lg(1/0.5) = 1 bit, and thus the total Shannon information content (for our predefined 6 questions) is 6 bits – not coincidentally, lg of the 64 possible values that x could take before we ask any questions.

In general, if an outcome x has Shannon information content h(x) bits, I like to interpret that as "learning the result x cuts the number of remaining possibilities down by a factor of 2^{h(x)}." The
battleship example follows this interpretation well. Stated another way (in the author’s words):

The Shannon information content can be intimately connected to the size of a file
that encodes the outcomes of a random experiment.

8.3.1 Data Compression and Typicality

Data Compression. A lossy compressor compresses some files, but maps some files to
the same encoding. We introduce a parameter δ that describes the risk (aggressiveness of our
compression) we are taking with a given compression method: δ is the probability that there
will be no name for an outcome225 x.
The smallest δ-sufficient subset
If Sδ is the smallest subset of AX satisfying

Pr [x ∈ Sδ ] ≥ 1 − δ (1001)

then Sδ is the smallest δ-sufficient subset. It can be constructed by ranking the elements
of AX in order of decreasing probability and adding successive elements starting from
the most probable elements until the total probability is ≥ (1 − δ).

• Raw bit content of X: H0 (X) , lg |AX |. A lower bound for the number of binary
questions that are always guaranteed to identify an outcome from the ensemble X – it
simply maps each outcome to a constant-length binary string.
• Essential bit content of X: Hδ (X) , lg |Sδ |. A compression code can be made by
assigning a binary string of Hδ bits to each element of the smallest sufficient subset.
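A minimal sketch of how Sδ and the essential bit content Hδ(X) could be computed for a small ensemble (my own illustration; the example probabilities are made up):

import numpy as np

def essential_bit_content(probs, delta):
    """Greedily build the smallest delta-sufficient subset S_delta and return
    (|S_delta|, H_delta = lg|S_delta|), ranking outcomes by decreasing probability."""
    p = np.sort(np.asarray(probs, dtype=float))[::-1]
    cumulative = np.cumsum(p)
    # Smallest prefix whose total probability is >= 1 - delta.
    size = int(np.searchsorted(cumulative, 1.0 - delta) + 1)
    return size, np.log2(size)

probs = [0.5, 0.25, 0.125, 0.0625, 0.0625]
print(essential_bit_content(probs, delta=0.0))   # (5, lg 5): the raw bit content H_0
print(essential_bit_content(probs, delta=0.1))   # a smaller subset suffices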
Finally, we can now state Shannon's source coding theorem: Let X be an ensemble for the random variable x with entropy H(X) = H bits, and let X^N denote a sequence (X1, X2, ..., XN) of identically distributed (but not necessarily independent^226) random variables/ensembles.

(∃N0 ∈ Z+)(∀N > N0):  | (1/N) Hδ(X^N) − H | < ε    (0 < δ < 1) (ε > 0)    (1002)

which, in English, can be read: N i.i.d. random variables each with entropy H(X) can be
compressed into more than N H(X) bits with negligible risk of information loss, as N → ∞;
conversely if they’re compressed into fewer than N H(X) bits it is virtually certain that infor-
mation will be lost.

225
More specifically, if there is some subset, {a_i}, of unique values that x can take on but our compression method discards/ignores, then we say δ = ∑_i p(x = a_i).
226
Actually, before the actual theorem statement, the author mentions we are now concerned with “string of
N i.i.d. random variables from a single ensemble X.” It’s probably fair to assume this is true for the quantities
in the theorem, but I’m leaving this note here as a reminder.

Typicality. The reason that large N in equation 1002 corresponds to larger potential for
better compression is that the subset of likely results for a string of outcomes becomes more
and more concentrated relative to the number of possible sequences as N increases227 . I just
realized this is for the same fundamental reasons that entropy exists in thermodynamics –
there are just more ways to exist in a high entropy state than otherwise. The author showed the binomial coefficient C(N, r) as a function of r (the number of 1s in the N-bit string). For large N, this becomes almost comically concentrated near the center (like a delta function at N/2) – see footnote for more details^228.

This motivates the notion of typicality for [a string of length N from] an arbitrary ensemble
X with alphabet AX . For large N , we expect to find pi N occurrences of the outcome x = ai .
Hence the probability of such a string, and its information content, is roughly229
Pr[x]_typ = Pr[x1] Pr[x2] ··· Pr[xN] ≈ p1^{p1 N} p2^{p2 N} ··· pI^{pI N}    (1003)
h(x)_typ = lg( 1 / Pr[x]_typ ) ≈ N ∑_{i=1}^I p_i lg( 1 / p_i ) = N H(X)    (1004)

Accordingly, we define the typical elements (strings of length N) of A_X^N to be those elements that have probability close to 2^{−NH}. We introduce a parameter β that defines what we mean by "close," and define the set of typical elements as the typical set T_{Nβ}:

T_{Nβ} ≜ { x ∈ A_X^N : | (1/N) lg(1/P(x)) − H | < β }    (1005)

It turns out that whatever value of β we choose, the typical set T_{Nβ} contains almost all of the probability as N increases.

227
The author gave an example for a sequence of bits with the probability of any given bit being 1 equal to 0.1. He showed how, although the average number of 1s in a sequence of N bits grew as O(N), the standard deviation of that average only grew as √N.
228
The probability of getting a string with r 1s follows a binomial distribution with mean N p1 and standard deviation √(N p1 (1 − p1)). This results in an increasingly narrower distribution P(r) for larger N.
229
We appear to be assuming that the outcomes x_n in the string x are i.i.d. (confirmed)

8.3.2 Further Analysis and Q&A

Proving the Source Coding Theorem.


• Setup. We will make use of the following:
– Chebyshev’s Inequalities:

Pr[x ≥ α] ≤ E[x]/α    and    Pr[(x − E[x])² ≥ α] ≤ Var[x]/α    (1006)
where α is a positive real number, and x is assumed non-negative in the first in-
equality230 .
– Weak Law of Large Numbers (WLLN): Consider a sample h1, ..., hN of N independent RVs all with common mean h̄ and common variance σh². Let x = (1/N) ∑_{n=1}^N h_n be their average. Then

Pr[ (x − h̄)² ≥ α ] ≤ σh² / (αN)    (1007)
which can be easily derived from Chebyshev’s inequalities.
• Proving ‘asymptotic equipartition’ principle, i.e. that an outcome x is almost
certain to belong to the typical set, approaching probability 1 for large enough N . It is
a simple application of the WLLN to the random variable
(1/N) lg( 1 / Pr[x] ) = (1/N) ∑_{n=1}^N lg( 1 / Pr[x_n] ) = (1/N) ∑_{n=1}^N h(x_n)    (1008)

where E [h(xn )] = H(X) for all terms in the summation. Observe, then, that the defini-
tion of the typical set given in equation 1005 (squaring both sides) has the same form as
the definition for the WLLN. Plugging in and rearranging yields

Pr[x ∈ T_{Nβ}] ≥ 1 − σ² / (β²N)    (1009)

where σ² ≡ Var[ lg(1/P(x_n)) ]. This proves the asymptotic equipartition principle. It will
also be useful to recognize that for any x in the typical set, we can rearrange equation
1005 to obtain

2−N (H+β) < Pr [x] < 2−N (H−β) (1010)

• Proof of SCT Part I. Want to show that (1/N) Hδ(X^N) < H + ε. TODO

• Proof of SCT Part II. Want to show that (1/N) Hδ(X^N) > H − ε. TODO

230
Notice how the two inequalities are technically the same.

Questions & Answers. Collection of my questions as I read through the chapter, which I
answered upon completing it.
• Q: Why isn’t the essential bit content of a string of N i.i.d. variables N H when δ = 0?
– A: I’m not entirely sure how to answer this still, but it seems the question is con-
fused. First off, the essential bit content approaches the raw bit content as δ de-
creases to 0: Hδ → H0 as δ → 0. It’s important to notice that both Hδ and H0
define an entropy where all members (S_δ^N for Hδ; A_X^N for H0) are equiprobable. I
remember asking this question wondering “what is the significance of Hδ (X N ) ap-
proaching N H(X) (not a typo!) for tiny δ”. The answer: for larger N , more of
the probability mass is concentrated in a relatively smaller region, with elements of
that region being roughly equiprobable. The last part is what I didn’t initially realize
– that allowing for tiny δ combined with large N essentially makes it so
that SδN ≈ TN β .
• Q: Why aren’t the most probable elements necessarily in the typical set?
– A: In the limit of N → ∞, they are, since in that limit, all elements are in the
typical set and they’re equiprobable. However, in essentially any real case, we can
imagine that some elements will be too unlikely to be found within the typical set,
which necessarily requires that there exist elements with probability too high to be
in the typical set. Remember that the typical set is basically defined such that all
elements have probability within the range given in equation 1010.

Information Theory, Inference, and Learning Algorithms November 24, 2017

Monte Carlo Methods (Ch. 29)


Table of Contents Local Written by Brandon McKinzie

Overview. The aims of Monte Carlo methods are to solve one or both of the following:
1. Generate samples {x^(r)}_{r=1}^R from a given probability distribution Pr[x].
2. Estimate expectations of functions under Pr[x], for example

Φ = E_{x∼Pr[x]}[φ(x)] ≡ ∫ d^N x Pr[x] φ(x)    (1011)

where it’s assumed that Pr [x] is sufficiently complex that we can’t evaluate such expec-
tations by exact methods.
Note that we can concentrate on the first problem (sampling), since we can use it to solve the second problem (estimating an expectation) by using the random samples {x^(r)}_{r=1}^R to give the estimator

Φ̂ ≡ (1/R) ∑_r φ(x^(r))    (1012)

where

E[Φ̂] = (1/R) ∑_r E_{x^(r)∼Pr[x]}[φ(x^(r))] = Φ    (1013)
Var[Φ̂] = σ²/R,  where σ² ≡ ∫ d^N x Pr[x] (φ(x) − Φ)² can be estimated by ∑_r (φ(x^(r)) − Φ)² / (R − 1) as R → ∞    (1014)

Importance Sampling. A generalization of [naively] uniformly sampling x in order to ap-


proximate equation 1011. We assume henceforth that we are able to evaluate (for now, a 1-D)
Pr [x] at any point x at least within a multiplicative constant; thus we can evaluate a function
P*(x) such that P(x) = P*(x)/Z. We assume we have some simpler Q(x), called the sampler density, from which we can generate samples and which we can evaluate up to a multiplicative constant, Q(x) = Q*(x)/Z_Q. We construct an approximation for our estimator in equation 1012
via sampling from Q(x) and computing:

Φ̂ = ∑_r w_r φ(x^(r)) / ∑_r w_r,    where w_r ≡ P*(x^(r)) / Q*(x^(r))    (1015)
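A minimal sketch of this self-normalized importance-sampling estimator (my own toy example: a standard Gaussian target known only up to a constant, with a wide Gaussian sampler density):

import numpy as np

rng = np.random.default_rng(0)

def p_star(x):
    """Unnormalized target P*(x) (standard normal without its 1/sqrt(2*pi))."""
    return np.exp(-0.5 * x**2)

def q_star(x):
    """Unnormalized sampler density Q*(x) (normal with std 3, unnormalized)."""
    return np.exp(-0.5 * (x / 3.0)**2)

R = 100_000
x = rng.normal(0.0, 3.0, size=R)    # samples from Q
w = p_star(x) / q_star(x)           # importance weights w_r
phi = x**2                          # estimate Phi = E_P[x^2] (should be ~1)
phi_hat = np.sum(w * phi) / np.sum(w)
print(phi_hat)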

Rejection Sampling. In addition to the assumptions in importance sampling, we now also
assume that we know the value of a constant c such that

∀x : cQ∗ (x) > P ∗ (x) (1016)

1. Generate sample x from proposal density Q(x), and evaluate cQ∗ (x).
2. Generate a uniformly distributed random variable u from the interval [0, cQ∗ (x)].
3. If u > P ∗ (x), then reject x; else, accept x and add it to our set of samples {x(r) }.

Metropolis-Hastings Method. Proposal density Q now depends on the current state x(t) .
1. Sample tentative next state x0 from proposal density Q(x0 ; x(t) ).
2. Compute:

a = [ P*(x′) / P*(x^(t)) ] · [ Q(x^(t); x′) / Q(x′; x^(t)) ]    (1017)

3. If a ≥ 1, the new state is accepted. Otherwise, the new state is accepted with probability
a.
4. If accepted, we set x(t+1) = x0 . If rejected, we set x(t+1) = x(t) .
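A compact sketch of this accept/reject loop (my own example: a symmetric Gaussian random-walk proposal, so the Q-ratio in equation 1017 cancels):

import numpy as np

rng = np.random.default_rng(0)

def p_star(x):
    """Unnormalized target P*(x): a two-component Gaussian mixture."""
    return np.exp(-0.5 * (x - 2.0)**2) + 0.5 * np.exp(-0.5 * (x + 2.0)**2)

def metropolis(n_steps=50_000, step_size=1.0):
    x = 0.0
    chain = []
    for _ in range(n_steps):
        x_prop = x + step_size * rng.normal()   # symmetric proposal Q(x'; x)
        a = p_star(x_prop) / p_star(x)          # Q-ratio cancels for symmetric Q
        if a >= 1 or rng.uniform() < a:         # accept with probability min(1, a)
            x = x_prop
        chain.append(x)                         # on rejection, repeat current state
    return np.array(chain)

chain = metropolis()
print(chain.mean())   # sits between the two modes, closer to +2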

Information Theory, Inference, and Learning Algorithms November 15, 2017

Variational Methods (Ch. 33)


Table of Contents Local Written by Brandon McKinzie

Probability Distributions in Statistical Physics231 . Consider the common situation


below, where the state vector x ∈ {−1, +1}N :

Pr[x | β, J] = e^{−βE(x;J)} / Z(β, J)    (1018)
E(x; J) = −(1/2) ∑_{m,n} J_{mn} x_m x_n − ∑_n h_n x_n    (1019)
Z(β, J) = ∑_x e^{−βE(x;J)}    (1020)

Note that evaluating E(x; J) for a given x takes polynomial time in the number of spins N , and
evaluating Z is O(2^N). Variational free energy minimization is a method for approximating
the complex distribution P (x) by a simpler ensemble Q(x; θ) parameterized by adjustable θ.

Variational Free Energy. How do we find/evaluate Q? The objective function chosen to


measure the quality of the approximation is the variational free energy, Fe (θ):
βF̃(θ) = ∑_x Q(x; θ) ln[ Q(x; θ) / e^{−βE(x;J)} ]    (1021)
       = β E_{x∼Q}[E(x; J)] − H(Q)    (1022)
       = D_KL(Q || P) + βF    (1023)

where the true free energy F is defined via βF ≜ −ln Z(β, J). It's not immediately clear why this approxi-
mation Q is more tractable than P – for that we turn to an example below.

231
Yay!

Variational Free Energy Minimization for Spin Systems. For the system with energy
function given in equation 1019, consider the separable approximating distribution,
Q(x; a) = exp( ∑_n a_n x_n ) / Z_Q = ∏_n e^{a_n x_n} / ∑_{x′} ∏_{n′} e^{a_{n′} x′_{n′}} = ∏_n [ e^{a_n x_n} / ∑_{x′_n} e^{a_n x′_n} ]    (1024)

To compute Fe , we must compute the mean energy and entropy under Q.


→ Mean Energy. Given our definition of Q, and the fact that each x_n = ±1, the mean value of any x_n is x̄_n = tanh(a_n) = 2q_n − 1, where q_n ≡ Q(x_n = 1).

E_{x∼Q}[E(x; J)] = ∑_x Q(x; a) [ −(1/2) ∑_{m,n} J_{mn} x_m x_n − ∑_n h_n x_n ]    (1025)
                 = −(1/2) ∑_{m,n} J_{mn} x̄_m x̄_n − ∑_n h_n x̄_n    (1026)

→ Entropy. Recall that if Q(x; a) = ∏_n Q(x_n; a_n) (i.e. Q is separable), then H(x) = ∑_n H(x_n), so we have

H_Q(x) = ∑_x Q(x; a) ln( 1 / Q(x; a) )    (1027)
       = ∑_n [ Q(x_n = 1; a_n) ln(1/Q(x_n = 1; a_n)) + Q(x_n = −1; a_n) ln(1/Q(x_n = −1; a_n)) ]    (1028)
       = ∑_n [ q_n ln(1/q_n) + (1 − q_n) ln(1/(1 − q_n)) ]    (1029)

So the variational free energy is given by

βF̃(a) = β E_{x∼Q}[E(x; J)] − H_Q(x)    (1030)
       = β( −(1/2) ∑_{m,n} J_{mn} x̄_m x̄_n − ∑_n h_n x̄_n ) − ∑_n [ q_n ln(1/q_n) + (1 − q_n) ln(1/(1 − q_n)) ]    (1031)

Remember, our goal is to find parameters a that minimize F̃(a):

β ∂F̃/∂a_m = 2 (∂q_m/∂a_m) [ −β( ∑_n J_{mn} x̄_n + h_m ) + a_m ]    (1032)

which, when set to zero, yields the update equations

a_m ← β ( ∑_n J_{mn} x̄_n + h_m )    (1033)
x̄_n = tanh(a_n)    (1034)
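A small sketch of iterating these two updates on a random coupling matrix (my own toy setup; β, J, and h are made up, and I simply sweep the updates until the magnetizations stop changing):

import numpy as np

rng = np.random.default_rng(0)

N, beta = 20, 0.5
J = rng.normal(0, 1.0 / np.sqrt(N), size=(N, N))
J = (J + J.T) / 2
np.fill_diagonal(J, 0.0)           # no self-coupling
h = rng.normal(0, 0.1, size=N)

a = np.zeros(N)                    # variational parameters
for _ in range(200):
    x_bar = np.tanh(a)             # eq. (1034)
    a_new = beta * (J @ x_bar + h) # eq. (1033)
    if np.max(np.abs(a_new - a)) < 1e-8:
        break
    a = a_new

print(np.tanh(a))                  # mean-field magnetizations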

Machine Learning: A Probabilistic Perspective

Contents

9.1 Probability (Ch. 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342


9.1.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
9.2 Generative Models for Discrete Data (Ch. 3) . . . . . . . . . . . . . . . . . . . . . . . . . 344
9.2.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
9.3 Gaussian Models (Ch. 4) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
9.4 Bayesian Statistics (Ch. 5) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
9.5 Linear Regression (Ch. 7) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
9.6 Generalized Linear Models and the Exponential Family (Ch. 9) . . . . . . . . . . . . . . . . . 358
9.7 Mixture Models and the EM Algorithm (Ch. 11) . . . . . . . . . . . . . . . . . . . . . . . 361
9.8 Latent Linear Models (Ch. 12) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
9.9 Markov and Hidden Markov Models (Ch. 17) . . . . . . . . . . . . . . . . . . . . . . . . . 366
9.10 Undirected Graphical Models (Ch. 19) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368

Machine Learning: A Probabilistic Perspective August 07, 2018

Probability (Ch. 2)
Table of Contents Local Written by Brandon McKinzie

Kevin P. Murphy (2012). Probability.


Machine Learning: A Probabilistic Perspective.

Continuous Random Variables (2.2.5). Let X be a continuous RV. We usually want to


know the probability that X lies in the interval a ≤ X ≤ b, which is given by

p(a < X ≤ b) = p(X ≤ b) − p(X ≤ a) (1035)

Define the cumulative distribution function (cdf) F(q) ≜ p(X ≤ q), and the probability density function (pdf) f(x) ≜ (d/dx) F(x).

Transformation of Random Variables (2.6). In what follows, we have some x ∼ px and


y = f (x). How should we think about py (y)? For discrete RV x, we just sum up the probability
mass for all x such that f (x) = y,
X
py (y) = px (x) (1036)
x:f (x)=y

If x is continuous, we must instead work with the cdf,

Py (y) , P (Y ≤ y) = P (f (X) ≤ y) = P (X ∈ {x | f (x) ≤ y}) (1037)

If f is monotonic (and hence invertible), then we can derive the pdf py (y) from the pdf px (x)
by taking derivatives as follows:
p_y(y) ≜ (d/dy) P_y(y) = (d/dy) P_x(f^{−1}(y)) = p_x(x) (dx/dy)    (1038)

and, since we only work with integrals over densities (i.e. overall sign does not matter), it is convention to take the absolute value of dx/dy in the above formula. In the multivariate case, the Jacobian matrix [J_{y→x}]_{i,j} ≜ ∂x_i/∂y_j is used:

p_y(y) = p_x(x) |det J_{y→x}|    (1039)

Central Limit Theorem (2.6.3). Consider N i.i.d. RVs each with arbitrary pdf p(x_i), mean µ, and variance σ². Let S_N = ∑_i X_i denote the sum over the N RVs. The central limit theorem states that

lim_{N→∞} S_N ∼ N(Nµ, Nσ²)    (1040)
or equivalently lim_{N→∞} √N (X̄ − µ) ∼ N(0, σ²)    (1041)

9.1.1 Exercises

Exercise 2.1

(a) [correct] P(oneIsGirl | hasAtLeastOneBoy) = 2/ 3. Use muh intuition.


(b) [correct] P(otherIsGirl | weSawOneIsABoy) = P(childIsGirl) = 1/2. The other child being a girl/boy is independent of the fact that the child we saw is a boy. All about how it is phrased, yo.

Exercise 2.2 - Variance of a sum


Show that Var [X + Y ] = Var [X] + Var [Y ] + 2Cov [X, Y ].
[correct] Math approach:

Var[X + Y] ≜ E[ ((X + Y) − E[X] − E[Y])² ]    (1042)
           = E[ ((X − E[X]) + (Y − E[Y]))² ]    (1043)
           = Var[X] + Var[Y] + 2 Cov[X, Y]    (1044)

Intuition approach: If X and Y are uncorrelated, it is intuitive that their sum should have variance equal to the
sum of the individual variances. If there is some linear dependence between the two, we’d expect it to either
increase (if positively correlated) or decrease (if negatively correlated) the variance of their sum.

Machine Learning: A Probabilistic Perspective August 07, 2018

Generative Models for Discrete Data (Ch. 3)


Table of Contents Local Written by Brandon McKinzie

Kevin P. Murphy (2012). Generative Models for Discrete Data.


Machine Learning: A Probabilistic Perspective.

Bayesian Concept Learning (3.1). The interesting notion of learning a concept C, such
as “all prime numbers”, by only seeing positive examples x ∈ C. How should we approach
predicting whether a new test case x̃ belongs to the concept C? Well, what we're actually doing is as follows: Given an initial hypothesis space H of concepts, we collect data D that gradually narrows down the subset of H consistent with our data. We also need
to address how we (humans) will weigh certain hypotheses differently even if they are both
entirely consistent with the evidence. The Bayesian approach can be summarized as follows
(no particular order):
• Likelihood. Probability of observing D given a particular hypothesis h. Consider the simple but illustrative case where the data is sampled from a uniform distribution over the extension^232 of a concept (a.k.a. the strong sampling assumption). The probability of sampling N items independently under hypothesis h is then

p(D | h) = (1/|h|)^N    (1045)

which elucidates how the model favors the smallest hypothesis space consistent with the
data 233 .
• Prior. Using just the likelihood can mean we fall prey to contrived/overfitting hypotheses
that basically just enumerate the data. The prior p(h) allows us to specify properties
about how the learned hypothesis ought to look.
• Posterior. This is just the likelihood times the prior, normalized [to one]. In general,
as we collect more and more data, the posterior tends toward a Dirac measure peaked at
the MAP estimate:

p(h | D) → δ_{ĥ_MAP}(h),    where    (1046)
ĥ_MAP = argmax_h p(h | D) = argmax_h p(D | h) p(h) = argmax_h [ log p(D | h) + log p(h) ]    (1047)

232
The extension of a concept is just the set of numbers that belong to it.
233
A result commonly known as Occam’s razor or the size principle.

The Beta-Binomial Model (3.3). In the previous section we considered inferring some
discrete distribution over integers; now we will turn to a continuous version. Consider the
problem of inferring the probability that a coin shows up heads, given a series of observed coin
tosses.
• Likelihood. As should be familiar, we’ll model the outcome of each toss Xi ∈ {1, 0}
indicating heads or tails with Xi ∼ Ber(θ). Assuming the tosses are i.i.d, this gives us
p(D | θ) ∝ θN1 (1 − θ)N0 , where N1 and N0 are the sufficient statistics of the data,
since p(θ | D) can be entirely modeled as p(θ | N1 , N0 ).
• Prior. We technically just need a prior p(θ) with support over the interval [0, 1], but it
would be convenient if the prior had the same form as the likelihood, i.e. p(θ) ∝ θ^{γ1}(1 − θ)^{γ2} for some hyperparameters γ1 and γ2. This is satisfied by the Beta distribution^234. This would also result in the posterior having the same form as the prior, meaning the
prior is a conjugate prior for the corresponding likelihood.
• Posterior. As mentioned, the posterior corresponding to a Bernoulli/binomial likelihood
with a beta prior is itself a beta distribution:

p(θ | D) ∝ Bin(N1 | θ, N0 + N1 )Beta(θ | a, b) ∝ Beta(θ | N1 + a, N0 + b) (1048)

By either reading off from a table or deriving via calculus, we can obtain the following
properties for our Beta posterior:
[mode]  θ̂_MAP = (a + N1 − 1) / (a + b + N − 2)    (1049)
[mean]  θ̄ = (a + N1) / (a + b + N) = λ m1 + (1 − λ) θ̂_MLE    (1050)

where m1 = a/(a + b) is the prior mean and λ = (a + b)/(a + b + N). The last form captures the notion that the posterior is a compromise between what we previously believed and what the data is telling us.
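A tiny sketch of this update (my own numbers for the prior and the coin flips):

import numpy as np

a, b = 2.0, 2.0                              # Beta(a, b) prior
flips = np.array([1, 1, 0, 1, 1, 1, 0, 1])    # observed tosses (1 = heads)
N1, N0 = int(flips.sum()), int((1 - flips).sum())

a_post, b_post = a + N1, b + N0               # Beta posterior parameters
N = N1 + N0

theta_mle = N1 / N
theta_map = (a_post - 1) / (a_post + b_post - 2)
theta_mean = a_post / (a_post + b_post)

lam = (a + b) / (a + b + N)   # posterior mean = lam * prior mean + (1 - lam) * MLE
print(theta_mle, theta_map, theta_mean, lam * (a / (a + b)) + (1 - lam) * theta_mle)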

The Dirichlet-Multinomial Model (3.4). We now generalize further to inferring the prob-
ability that a die with K sides comes up as face k.
• Likelihood. As before, we assume a specific observed sequence D of N dice rolls. Assuming the data is i.i.d., the likelihood has the form

p(D | θ) = ∏_{k=1}^K θ_k^{N_k}    (1051)

which is the likelihood for the multinomial model up to an irrelevant constant factor.
• Prior. We’d like to find a conjugate prior for our likelihood, and we need it to have sup-
port over the K-dimensional probability simplex 235 . The Dirichlet distribution satisfies
234
You may be wondering, why not the Bernoulli distribution? It trivially has the same form as the Bernoulli
distribution, eh? Then, pause and actually think about what you’re saying for five seconds. You want to model
a prior on θ with a Bernoulli distribution? You do realize that the support for a Bernoulli is in k ∈ {0, 1}.
It’s the opposite domain entirely. We want something that “looks” like a Bernoulli but is a distribution over θ,
NOT the value(s) of x.
235
The K-dimensional probability simplex is the (K −1)th dimensional simplex determined by the unit vectors
e1 , . . . , eK ∈ RK , i.e. the set of vectors x such that xi ≥ 0 and
P
i
xi = 1.

both of these and is defined as

Dir(θ | α) = (1/B(α)) ∏_{k=1}^K θ_k^{α_k − 1}    (1052)

where θ ∈ S_K is a built-in assumption.


• Posterior. By construction, this will also be Dirichlet. Note that to derive the MAP estimator we must enforce the constraint that ∑_k θ_k = 1, which can be done with a Lagrange multiplier. The constrained objective (the Lagrangian) is

ℓ(θ, λ) = ∑_k N_k log θ_k + ∑_k (α_k − 1) log θ_k + λ( 1 − ∑_k θ_k )    (1053)

To get θ̂M AP , we’d take derivatives w.r.t. λ and each θk , do some substitutions and solve.
Example: Language Models with Bag of Words

Given a sequence of words, predict the most likely next word. Assume that the ith word Xi ∈ {1, . . . , K} (where
K is the size of our vocab) is sampled indep from all others using a Cat(θ) (multinoulli) distribution. This is
the BoW model.

My attempt: We need to derive the form of posterior predictive p(Xi+1 | X1 , . . . , Xi ) where θ ∈ SK . First, I
know that the posterior is

p(θ | X1 , . . . , Xi ) ∝ Dir(θ | α)P (X1 , . . . , Xi | θ) = Dir(θ | α1 + N1 , . . . , αK + NK ) (1054)

and so I can derive the posterior predictive in the usual way, while also exploiting the fact that all Xi ⊥ Xj ,
p(X_{i+1} = k | X1, ..., Xi) = ∫ p(X_{i+1} = k | θ) p(θ | X1, ..., Xi) dθ    (1055)
                             = ∫ θ_k p(θ_k, θ_{−k} | X1, ..., Xi) dθ_k dθ_{−k}    (1056)
                             = ∫ θ_k p(θ_k | X1, ..., Xi) dθ_k    (1057)
                             = E[θ_k | X1, ..., Xi]    (1058)
                             = (α_k + N_k) / ∑_j (α_j + N_j)    (1059)

which also shows another nice example of Bayesian smoothing.
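A short sketch of this posterior-predictive rule on a toy corpus (my own example; the vocabulary and the symmetric α are made up):

import numpy as np

vocab = ["the", "cat", "sat", "mat", "dog"]
alpha = np.ones(len(vocab))        # symmetric Dirichlet prior, alpha_k = 1

observed = ["the", "cat", "sat", "the", "mat"]
counts = np.array([observed.count(w) for w in vocab], dtype=float)

# Posterior predictive p(X_{i+1} = k | X_1..X_i) = (alpha_k + N_k) / sum_j (alpha_j + N_j)
pred = (alpha + counts) / (alpha + counts).sum()
for w, p in zip(vocab, pred):
    print(f"{w}: {p:.3f}")         # "dog" is unseen but still gets nonzero probability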

Naive Bayes Classifiers (3.5). For classifying vectors of discrete-valued features, x ∈
{1, . . . , K}D . Assumes features are conditionally independent given the class label:
D
Y
p(x | y = c, θ) = p(xj | y = c, θj,c ) (1060)
j=1

with parameters θ ∈ RD×|Y|236 . We can model the individual class-conditional densities with
the multinoulli distribution. If we were modeling real-valued features, we could instead use a
Gaussian distribution.
• MLE fitting. Our log-likelihood is
log p(D | θ) = ∑_c N_c log π_c + ∑_{i=1}^N ∑_{j=1}^D log p(x_j^(i) | y^(i), θ_{j,y^(i)})    (1061)

where πc = p(y = c) are the class priors237 . We see that we can optimize the class
priors separately from the others, and that they have the same form as the multinomial
likelihood we used in the last section. We already know that the MLE for these are
π̂c = Nc /N (remember this involves using a Lagrangian). Let’s assume next that we’re
working in the case of binary features (xj | c ∼ Ber(θj,c )). This results in MLE estimates
θ̂j,c = Nj,c /Nc .
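A minimal count-based sketch of these MLE formulas for binary features (my own tiny data set; no smoothing, so zero counts would give zero estimates):

import numpy as np

# Toy binary feature matrix (N x D) and class labels (N,)
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 1]])
y = np.array([0, 0, 1, 1])

classes = np.unique(y)
pi_hat = np.array([(y == c).mean() for c in classes])            # pi_c = N_c / N
theta_hat = np.array([X[y == c].mean(axis=0) for c in classes])  # theta_jc = N_jc / N_c

print(pi_hat)      # class priors
print(theta_hat)   # per-class Bernoulli parameters, shape (|Y|, D)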

236
You could also generalize this to having some number of params N associated with each pairwise (j, c).
It’s also important to recognize that this is the first model of this chapter where the input x is a vector, and
thus introduces pairwise parameters.
237
We only see the class prior parameters here because they appear in the likelihood of generative classifiers,
since p(x, y) = p(y)p(x | y). We don’t see the priors for θ that aren’t class priors because MLE is only concerned
with maximizing the likelihood, not the posterior (which would contain those priors).

9.2.1 Exercises

MLE for uniform distribution (3.8)

The density function for the uniform distribution centered on 0 with width 2a is

p(x) = (1/2a) · 1{x ∈ [−a, a]}

Remember this is for continuous x, and p(x) is interpreted as the derivative of the corresponding CDF P (x).

a. Given data x1 , . . . , xn , find âM LE . So there are a few ways of doing this. We can get the answer pretty quick
with intuition, and not-so-quick by grinding through the math. Quick-n-easy: If you were paying attention to
the chapter, you’d instantly remember that MLE is all about finding the smallest hypothesis space consistent
with the data. It should then be obvious that âM LE = max |xi |. Slightly more rigorous. We can also solve a
constrained optimization problem, with constraint that a ≥ max |xi |, since we must require our solution to assign
nonzero probability to any of our observations.

âM LE = arg max log p(x1 , . . . , xn | a) + λ(a − max |xi |) (1062)


a

The rest is mechanical: (1) take deriv wrt λ, (2) deriv wrt a and set to zero, (3) solve for a as a function of λ, (4)
solve for λ by substituting previous step’s results into contstraint, yielding that λ ≤ n/|xmax |, (5) plug result for
λ into result from step 3 to obtain result that a ≥ xmax . In the limit of many samples, the first term becomes
more important in the optimization, which decreases as a increases, and so we choose the lowest value of a that
satisfies the constraints: a := xmax .
b. What probability would the model assign to a new data point xn+1 using â. We are obviously meant to
trivially answer that it will assign the density with a = â. However, I take objection to this question, since it
makes no sense to evaluate a density at a single point p(x).
c. Do you see any problem with this approach? Yes, it extremely overfits to the data. We’ll assign zero
probabilities to any points outside the interval of observations, and the same probability to anything in that
interval. We should do a more Bayesian approach instead.

Machine Learning: A Probabilistic Perspective August 11, 2018

Gaussian Models (Ch. 4)


Table of Contents Local Written by Brandon McKinzie

Kevin P. Murphy (2012). Gaussian Models.


Machine Learning: A Probabilistic Perspective.

Basics (4.1). I’ll be filling in the gaps that the book leaves out in its derivations, as a way of
reviewing the relevant linear algebra/calculus/etc. For notation’s sake, here how the author
writes the MVN in D dimensions:
N(x | µ, Σ) ≜ (1 / ((2π)^{D/2} |Σ|^{1/2})) exp( −(1/2) (x − µ)^T Σ^{−1} (x − µ) )    (1063)

To get a better understanding, let's inspect the eigendecomposition of Σ. We know that Σ is positive definite^238, and therefore the eigendecomposition Σ = UΛU^T exists, where U is an orthonormal matrix of eigenvectors (remember, an orthonormal matrix satisfies U^T = U^{−1}). By the invertible matrix theorem, we therefore know that U is invertible. Since Σ is p.d., its eigenvalues are all positive, and thus Λ is also invertible. We can then apply the basic definition for an invertible matrix to write

Σ^{−1} = U Λ^{−1} U^T = ∑_{i=1}^D (1/λ_i) u_i u_i^T    (1064)

where u_i is the ith eigenvector and ith column of U. We can use this to rewrite the quadratic form in the form of an ellipse:

(x − µ)^T Σ^{−1} (x − µ) = (x − µ)^T ( ∑_{i=1}^D (1/λ_i) u_i u_i^T ) (x − µ)    (1065)
                         = ∑_{i=1}^D (1/λ_i) (x − µ)^T u_i u_i^T (x − µ)    (1066)
                         = ∑_{i=1}^D y_i² / λ_i    (1067)

238
We know this because all covariance matrices of any random vector X are symmetric p.s.d., and the
additional requirement that Σ−1 exists means that it is p.d.

where y_i ≜ u_i^T (x − µ). The fascinating insight is recalling that the equation for an ellipse in 2D is

y_1²/λ_1 + y_2²/λ_2 = 1    (1068)

Hence we see that the contours of equal probability density of a Gaussian lie along ellipses. The eigenvectors determine the orientation of the ellipse, and the eigenvalues determine how elongated it is.

Maximum Entropy Derivation of the Gaussian (4.1.4). Recall that the exponential
family can be derived as the family of distributions p(x) that maximizes H(p) subject to
constraints that the moments of p match some set of empirical moments Fk specified by us. It
turns out that the MVN is the distribution with maximum entropy subject to having a specified
mean and covariance. Consider the zero-mean MVN and its entropy239 :
p(x) = (1/Z) exp( −(1/2) x^T Σ^{−1} x )    (1069)
h(p) = (1/2) ln[ (2πe)^D det Σ ]    (1070)

Let p = N(0, Σ) above and let q(x) be any density satisfying ∫ q(x) x_i x_j dx = Σ_ij. The result is based on the fact that h(q) must be less than or equal to h(p). This can be shown by evaluating DKL(q||p) and recalling that DKL is always greater than or equal to zero.

239
For derivation, see this wonderful answer on stackexchange.

Gaussian Discriminant Analysis (4.2). With generative classifiers, it is common to define
the class-conditional density as a MVN,

p(x | y=c, θ) = N (x | µc , Σc ) (1071)

which results in a technique called (Gaussian) discriminant240 analysis. If Σc is diagonal,


this is just a form of naive Bayes241 . We classify some feature vector x using the decision rule:

ŷ(x) = arg max [log p(y=c | π) + log p(x | y=c, θ)] (1072)
c

If we have uniform prior over classes, we can classify a new test vector as follows:

ŷ(x) = argmin_c (x − µ_c)^T Σ_c^{−1} (x − µ_c)    (1073)

Linear Discriminant Analysis (LDA). The special case where all covariance matrices are
the same, Σc = Σ. Now the quadratic term xT Σ−1 x is independent of c and thus is not
important for classification. Instead we end up with the much simpler,
p(y=c | x, θ) = e^{β_c^T x + γ_c} / ∑_{c′} e^{β_{c′}^T x + γ_{c′}} = S(η)_c    (1074)
β_c := Σ^{−1} µ_c    (1075)
γ_c := −(1/2) µ_c^T Σ^{−1} µ_c + log π_c    (1076)
where S is the familiar softmax. TODO: come back and compare/contrast LDA with multi-
nomial logistic regression after reading chapter 8.
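A compact sketch of these LDA formulas (my own synthetic two-class data; the shared covariance is estimated by pooling the per-class covariances):

import numpy as np

rng = np.random.default_rng(0)

# Two Gaussian classes with a shared covariance.
mu_true = {0: np.array([-2.0, 0.0]), 1: np.array([2.0, 1.0])}
X = np.vstack([rng.multivariate_normal(mu_true[c], np.eye(2), 100) for c in (0, 1)])
y = np.repeat([0, 1], 100)

# Fit: class means, pooled covariance, class priors.
mus = np.array([X[y == c].mean(axis=0) for c in (0, 1)])
pooled = sum(np.cov(X[y == c].T, bias=True) * (y == c).sum() for c in (0, 1)) / len(y)
pis = np.array([(y == c).mean() for c in (0, 1)])

Sigma_inv = np.linalg.inv(pooled)
betas = mus @ Sigma_inv                 # rows are beta_c = Sigma^{-1} mu_c
gammas = -0.5 * np.einsum('cd,de,ce->c', mus, Sigma_inv, mus) + np.log(pis)

def predict_proba(x):
    logits = betas @ x + gammas
    e = np.exp(logits - logits.max())
    return e / e.sum()                  # softmax over classes

print(predict_proba(np.array([1.5, 0.5])))   # should favor class 1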

Inference in Jointly Gaussian Distributions (4.3). Suppose x = (x1 , x2 ) is jointly Gaus-


sian with parameters
! !
µ1 Σ11 Σ12
µ= , Σ= (1077)
µ2 Σ21 Σ22

Then, the marginals are given by

p(x1 ) = N (x1 | µ1 , Σ11 ) (1078)


p(x2 ) = N (x2 | µ2 , Σ22 ) (1079)

and the posterior conditional is given by

240
Don’t confuse the word “discriminant” for “discriminative” – we are still in a generative model setting. See
8.6 for details on the distinction.
241
This is easy to show. If diagonal, then p(x | y) factorizes. We know the ith item in the product corresponds
to p(xi | c) by considering how simple it is to compute marginals for Gaussians with diagonal Σ.

p(x_1 | x_2) = N(x_1 | µ_{1|2}, Σ_{1|2})
where µ_{1|2} = µ_1 + Σ_12 Σ_22^{−1} (x_2 − µ_2)    (4.69)
      Σ_{1|2} = Σ_11 − Σ_12 Σ_22^{−1} Σ_21 = Λ_11^{−1}

where Λ := Σ−1 . Notice that the conditional covariance is a constant matrix independent of
x2 . The proof here relies on Schur complements (see Appendix).
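A small sketch that checks equation 4.69 numerically against samples (my own toy 2-D Gaussian; the partition is x1 = first coordinate, x2 = second):

import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# Conditional p(x1 | x2 = 0.5) from the closed-form partitioned-Gaussian formulas.
x2 = 0.5
mu_cond = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x2 - mu[1])
var_cond = Sigma[0, 0] - Sigma[0, 1] * Sigma[0, 1] / Sigma[1, 1]

# Monte Carlo check: sample jointly, keep samples with x2 near 0.5.
samples = rng.multivariate_normal(mu, Sigma, size=500_000)
near = samples[np.abs(samples[:, 1] - x2) < 0.02, 0]
print(mu_cond, var_cond)
print(near.mean(), near.var())   # should roughly match the line above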

Information Form (4.3.3). Thus far, we’ve been working in terms of the moment parame-
ters µ and Σ. We can also rewrite the MVN in terms of its natural (canonical) parameters
Λ and ξ,
N_c(x | ξ, Λ) = (2π)^{−D/2} |Λ|^{1/2} exp( −(1/2) [ x^T Λ x + ξ^T Λ^{−1} ξ − 2 x^T ξ ] )    (1080)
where Λ ≜ Σ^{−1} and ξ ≜ Σ^{−1} µ    (1081)
where Nc is how we’ll denote “in canonical form.” The marginals and conditionals in canonical
form are
p(x_2) = N_c(x_2 | ξ_2 − Λ_21 Λ_11^{−1} ξ_1, Λ_22 − Λ_21 Λ_11^{−1} Λ_12)    (1082)
p(x_1 | x_2) = N_c(x_1 | ξ_1 − Λ_12 x_2, Λ_11)    (1083)
and we see that marginalization is easy in moment form, while conditioning is easier
in information form.

Linear Gaussian Systems (4.4). Suppose we have a hidden variable x and a noisy observa-
tion of it y. Let’s assume we have the following prior and likelihood:

p(x) = N (x | µx , Σx )
(4.124)
p(y | x) = N (y | Ax + b, Σy )

which is an example of a linear Gaussian system x → y.

The Wishart Distribution (4.5). We now dive into the distributions over the parameters Σ
and µ. First, we need some math prereqs out of the way. The Wishart is the generalization
of the Gamma to pd matrices:
Wi(Λ | S, ν) = (1/Z_Wi) |Λ|^{(ν−D−1)/2} exp( −(1/2) tr(Λ S^{−1}) )    (1084)
Z_Wi = 2^{νD/2} Γ_D(ν/2) |S|^{ν/2}    (1085)
where
• ν: degrees of freedom
• S: scale matrix (a.k.a. scatter matrix). Basically empirical Σ.
• ΓD : multivariate gamma function
A nice property is that if Σ^{−1} ∼ Wi(S, ν), then Σ ∼ IW(S^{−1}, ν + D + 1), the inverse Wishart (the multi-dimensional generalization of the inverse Gamma).

Machine Learning: A Probabilistic Perspective September 13, 2018

Bayesian Statistics (Ch. 5)


Table of Contents Local Written by Brandon McKinzie

Kevin P. Murphy (2012). Bayesian Statistics.


Machine Learning: A Probabilistic Perspective.

MAP Estimation (5.2.1). The most popular point estimate for parameters θ is the pos-
terior mode, aka the MAP estimate. However, there are many drawbacks:
• No measure of uncertainty (true for any point estimate).
• Using θM AP for predictions is prone to overfitting.
• The mode is an atypical point.
• It’s not invariant to reparameterization. Say two possible parameterizations θ1 and
θ2 =f (θ1 ), where f is some deterministic function. In general, it is not the case that
θ̂2 = f (θ̂1 ) under MAP.

Bayesian Model Selection (5.3). A model selection technique where we compute the best
model m for data D using the formulas,

p(m | D) = p(D | m) p(m) / ∑_{m∈M} p(D | m) p(m)    (1086)
p(D | m) = ∫ p(D | θ) p(θ | m) dθ    (1087)

where the latter is the marginal likelihood242 for model m. Note that this isn’t anything
new; we’ve usually just denoted it simply as p(D), since typically m is specified beforehand.
Although large models with many parameters can achieve higher likelihoods under MLE/MAP,
p(D | θ̂m ), this is not necessarily the case with marginal likelihood, an effect known as the
Bayesian Occam’s razor. Below we give the marginal likelihoods for familiar models:
• Beta-Binomial:

p(D | m=BetaBinom) = C(N, N1) · B(a + N1, b + N0) / B(a, b)

where B is the Beta function and C(N, N1) is the binomial coefficient.

• Dirichlet-Multinoulli:

p(D) = B(N + α) / B(α)

242
Also called the integrated likelihood or the evidence.

BIC Approximation (5.3.2.4). The integral involved in computing p(D | m) (henceforth
denoted simply as p(D)) can be intractable. The Bayesian information criterion (BIC) is
a popular approximation:
BIC ≜ log p(D | θ̂) − (dof(θ̂)/2) log N ≈ log p(D)    (1088)

where
• dof(θ̂) is the number of degrees of freedom in the model.
• θ̂ is the MLE for the model.
BIC is also closely related to the minimum description length (MDL) principle and the
Akaike information criterion (AIC).

Hierarchical Bayes (5.5). When defining our prior p(θ | η), we have to of course specify the
hyperparameters η required by our choice of prior. The Bayesian approach for doing this is
to put a prior on our prior! This situation can be represented as a directed graphical model,
illustrated below.

η → θ → D

This is an example of a hierarchical Bayesian model, also called a multi-level model.

Bayesian Decision Theory (5.7). Decision problems can be cast as games against nature,
where natures selects a quantity y ∈ Y unknown to us, and then generates an observation
x ∈ X that we get to see. Our goal is to devise a decision procedure or policy δ : X 7→ A
for generating an action a ∈ A from observation x that’s deemed most compatible with the
hidden state y. We define “compatible” via a loss function L(y, a):

δ(x) = argmin_{a∈A} ρ(a | x)    (1089)
where ρ(a | x) = E_{p(y|x)}[L(y, a)]    (1090)

In this context, we call ρ the posterior expected loss, and δ(x) the Bayes estimator. Some
Bayes estimators for common loss functions are given below.
• 0-1 loss: L(y, a) = 1{y ≠ a}. Easy to show that δ(x) = argmax_{y∈Y} p(y | x).

Machine Learning: A Probabilistic Perspective August 19, 2018

Linear Regression (Ch. 7)


Table of Contents Local Written by Brandon McKinzie

Kevin P. Murphy (2012). Linear Regression.


Machine Learning: A Probabilistic Perspective.

Model Specification (7.2). The linear regression model is a model of the form

p(y | x, θ) = N (y | wT x, σ 2 ) (1091)

Maximum Likelihood Estimation (least squares) (7.3). Most commonly, we’ll estimate
the parameters by computing the MLE, defined by

θ̂ ≜ argmax_θ log p(D | θ)    (1092)
   = argmin_θ [ −log p(D | θ) ]    (1093)

where the log-likelihood is

log p(D | θ) = −(1/(2σ²)) RSS(w) − (N/2) log(2πσ²)    (1094)
RSS(w) ≜ ∑_{i=1}^N (y_i − w^T x_i)²    (1095)

where RSS(w) is the residual sum of squares. Notice that θ := (w, σ 2 ), but typically we’re
focused on estimating w 243 .

Derivation of the MLE (7.3.1). We’ll now denote the negative log likelihood as NLL(w)
and drop constant terms that are irrelevant for the optimization task.
NLL(w) = (1/2) ||y − Xw||_2² = (1/2) w^T (X^T X) w − w^T (X^T y)    (1096)
where X^T X = ∑_{i=1}^N x_i x_i^T    (1097)
      X^T y = ∑_{i=1}^N x_i y_i    (1098)

243
Since our goal is typically to make future predictions ŷ(x) = wT x, rather than sampling y ∼ p(y | x, θ),
we aren’t concerned with estimating σ 2 . We assume some variability, and the goal is focused on fitting the data
to a straight line.

And the optimal ŵ_OLS can be found by taking the gradient, setting it to zero, and solving for w as usual:

∇NLL(w) = X^T X w − X^T y    (1099)
X^T X w = X^T y    [normal eq.]    (1100)
ŵ_OLS = (X^T X)^{−1} X^T y    (1101)
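A quick sketch checking the normal-equation solution against a least-squares routine (my own random data):

import numpy as np

rng = np.random.default_rng(0)

N, D = 100, 3
X = rng.normal(size=(N, D))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=N)

w_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)    # solve (X^T X) w = X^T y
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)    # same answer via an SVD-based solver

print(w_normal_eq)
print(w_lstsq)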

Geometric Interpretation (7.3.2). Use the column-vector representation of X ∈ RN ×D ,


where we assume N > D244 . Our prediction can then be written

ŷ = Xw = w1 x̃1 + · · · + wD x̃D (1102)

i.e. a linear combination of the D column vectors x̃_i ∈ R^N. Crucially, observe that this means ŷ ∈ span({x̃_i}_{i=1}^D) no matter what (a hard constraint by definition of our model). So, how do
you minimize the residual norm ||y − ŷ|| given that ŷ is restricted to a particular subspace?
You require y − ŷ to be orthogonal to that subspace, of course245 ! Formally, this means
x̃Tj (y − ŷ) = 0, for all 1 ≤ j ≤ D. Equivalently,

X^T (y − Xw) = 0  ⟹  ŵ = (X^T X)^{−1} X^T y    (1103)
ŷ = X ŵ = P y    (1104)
where P ≜ X (X^T X)^{−1} X^T    (1105)

Although this neat, I’m left unsatisfied since there appears to be no intuition of what the
column vectors of X really mean on a conceptual level.

Robust Linear Regression (7.4). Gaussians are sensitive to outliers, since their log-likelihood penalizes deviations quadratically^246. One way to achieve robustness to outliers is to instead use a distribution with heavy tails (replacing the Gaussian with a heavy-tailed distribution), so that outliers still receive reasonably high likelihood and the fit doesn't need to shift the whole distribution around to accommodate them. One popular choice is the Laplace distribution,

p(y | x, w, b) = Lap(y | w^T x, b) ≜ (1/2b) exp( −(1/b) |y − w^T x| )    (1106)
NLL(w) = ∑_i |y_i − w^T x_i|    (1107)

244
This means we have more rows than columns, which means our column space cannot span all of RN .
245
Consider that y − ŷ points from our prediction (which is in the constraint subspace) ŷ to the true y that
we want to get closest to. Intuitively that means ŷ is looking “straight out” at y, in a direction orthogonal to
the subspace that ŷ lives in. Formally, we can write y = (yk , y⊥ ), where yk is the component within Col(x).
The best we can do, then, is ŷ := yk .
246
In other words, outliers initially get huge loss values, and the distribution shifts toward them to minimize
loss (undesirably).

Goal: convert the NLL to a form that's easier to optimize (linear). Let r_i ≜ y_i − w^T x_i be the i'th residual. The following steps show how we can convert this into a linear program:

r_i ≜ r_i⁺ − r_i⁻,    (r_i⁺ ≥ 0) (r_i⁻ ≥ 0)    (1108)
min_{w, r⁺, r⁻} ∑_i (r_i⁺ + r_i⁻)    s.t.    w^T x_i + r_i⁺ − r_i⁻ = y_i    (1109)
min_θ f^T θ    s.t.    Aθ ≤ b,  A_eq θ = b_eq,  l ≤ θ ≤ u    (1110)

where the last equation is the standard form of an LP.

Ridge Regression (7.5). We know that MLE can overfit by essentially memorizing the data. If, for example, we model 21 points with a degree-14 polynomial^247, we get many large positive and negative numbers for our learned coefficients, which allow the curve to wiggle in just the right way to almost perfectly interpolate the data – this is why we often regularize weights to have low absolute value. This encourages smoother/less-wiggly curves. One way to do this is by using a zero-mean Gaussian prior on our weights (i.e. doing MAP instead of MLE):

p(w) = ∏_j N(w_j | 0, τ²)    (1111)

This makes our MAP estimation problem and solution take the form (note that w_0 is NOT regularized):

J(w) = (1/N) ∑_{i=1}^N ( y_i − (w_0 + w^T x_i) )² + λ||w||_2²    (1112)
ŵ_ridge = (X^T X + λI_D)^{−1} X^T y    (1113)
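A brief sketch of the closed-form ridge solution (my own data; here y is centered so the offset is handled separately and only w is shrunk, in line with the note that w_0 is not regularized):

import numpy as np

rng = np.random.default_rng(0)

N, D, lam = 50, 10, 1.0
X = rng.normal(size=(N, D))
y = X[:, 0] - 2 * X[:, 1] + 0.5 * rng.normal(size=N) + 3.0   # true offset of 3

# Center y (X columns are already roughly zero-mean) so w0 is fit separately.
w0 = y.mean()
yc = y - w0

w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ yc)
print(w0)        # estimated (unregularized) offset
print(w_ridge)   # shrunken weights; first two entries near +1 and -2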

247
To review polynomials, search “Lagrange interpolation” in your CS 70 notes.

Machine Learning: A Probabilistic Perspective August 11, 2018

Generalized Linear Models and the Exponential Family (Ch. 9)


Table of Contents Local Written by Brandon McKinzie

Kevin P. Murphy (2012). Generalized Linear Models and the Exponential Family.
Machine Learning: A Probabilistic Perspective.

The Exponential Family (9.2). Why is the exponential family important?


• It’s the only family with finite-sized sufficient statistics248 .
• It’s the only family for which conjugate priors exist.
• It makes the least set of assumptions subject to some user-chosen constraints.
• It’s at the core of GLMs and variational inference.

A pdf or pmf p(x | θ), for x ∈ X^m and θ ∈ Θ ⊆ R^d, is said to be in the exponential family if it's of the form

p(x | θ) = (1/Z(θ)) h(x) exp{θ^T φ(x)} = h(x) exp{θ^T φ(x) − A(θ)}    (1114)
Z(θ) = ∫_{X^m} h(x) exp{θ^T φ(x)} dx    (1115)
A(θ) = log Z(θ)    (1116)

(h(x) is a scaling constant, often 1.)

where θ are the natural (canonical) parameters249 , and φ(x) are the sufficient
statistics.

Below are some (quick/condensed) examples showing the first couple steps in rewriting familiar
distributions in exponential family form:

[Bernoulli]  Ber(x | µ) = µ^x (1 − µ)^{1−x} = exp{ x log µ + (1 − x) log(1 − µ) }    (1117)
[Multinoulli]  Cat(x | µ) = ∏_k µ_k^{x_k} = exp{ ∑_{k=1}^{K−1} x_k log(µ_k/µ_K) + log µ_K }    (1118)

248
Given certain regularity conditions.
249
We often generalize this with η(θ), which maps whatever params θ we’ve chosen to the canonical params
η(θ).

Log Partition Function (9.2.3). The derivatives of the log partition, A(θ), can be used to
generate cumulants250 for the sufficient statistics, φ(x).

∂A(θ)
= E [φ(x)] (1119)
∂θ
∇2 A(θ) = cov [φ(x)] (1120)

MLE for the Exponential Family (9.2.4). The likelihood takes the form

p(D | θ) = [ ∏_{i=1}^N h(x^(i)) ] g(θ)^N exp{ η(θ)^T φ(D) }    (1121)
φ(D) = ∑_{i=1}^N φ(x^(i)) = [ ∑_i φ_1(x^(i)), ..., ∑_i φ_K(x^(i)) ]    (1122)

(It appears that we are denoting 1/Z with g now.) Here I've denoted ∑_{i=1}^N φ_k(x^(i)) as simply ∑_i φ_k. The Pitman-Koopman-Darmois theo-
rem states, given certain regularity conditions/constraints251 , that the exponential family is
the only family with finite sufficient statistics (dimensionality independent of the size of the
data set). For example, in the above formula, we have K + 1 sufficient statistics (+1 since we
need to know the value of N ).

Consider a canonical252 exponential family which also sets h(·) = 1. The log-likelihood is

log p(D | θ) = θ^T φ(D) − N A(θ)    (1123)

Since −A(θ) is concave253 and the other term is linear in θ, the log-likelihood is concave and
thus has a global maximum.

250
The first and second cumulants are mean and variance.
251
The wording is weird here. We mean “out of all families/distributions that already satisfy certain constraints
that must be met, the exponential family is the only...”. For example, the uniform distribution has finite statistics
and is not in the exponential family, but it does not meet the constraint that its support must be independent
of the parameters, so it’s outside the scope of the theorem.
252
Defined as those which satisfy η(θ) = θ.
253
We know −A is concave because A is convex. We know A is convex because ∇2 A is positive definite.
Remember that any twice-differentiable multivariate function f is convex IFF its Hessian is pd for all θ. See
sec 7.3.3 and 9.2.3 for more.

Maximum Entropy Derivation of the Exponential Family (9.2.6). Suppose we don’t
know which distribution p to use, but we do know the expected values of certain features or
functions:
X
Fk , Ex∼p(x) [fk (x)] = fk (x)p(x) (1124)
x

The principle of maximum entropy or maxent says we should pick the distribution with
maximum entropy, subject to the constraints that the moments of the distribution match the
empirical moments of the specified functions fk (x). Treating p as a fixed length vector (i.e.
assuming x is discrete), we can take the derivative of our Lagrangian (entropy in units of nats
with constraints) w.r.t. each “element” px = p(x) to find the optimal distribution.
J(p, λ) = H(p) + λ_0 ( 1 − ∑_x p(x) ) + ∑_k λ_k ( F_k − ∑_x p(x) f_k(x) )    (1125)
∂J/∂p(x) = −1 − log p(x) − λ_0 − ∑_k λ_k f_k(x)    (1126)

Setting this derivative to zero yields

p(x) = (1/Z) exp{ −∑_k λ_k f_k(x) }    (1127)

Thus the maxent distribution p(x) has the form of the exponential family, a.k.a. the Gibbs
distribution.

Machine Learning: A Probabilistic Perspective August 26, 2018

Mixture Models and the EM Algorithm (Ch. 11)


Table of Contents Local Written by Brandon McKinzie

Kevin P. Murphy (2012). Mixture Models and the EM Algorithm.


Machine Learning: A Probabilistic Perspective.

Latent Variable Models (LVMs) (11.1). In this chapter, we explore directed GMs that have
hidden/latent variables. Advantages of LVMs:
1. Often have fewer params.
2. Hidden vars can serve as a bottleneck (representation learning).

Mixture Models (11.2). The simplest form of LVM is where the hidden variables z_i ∈ {1, ..., K} represent a discrete latent state. We use a discrete prior p(z_i = k) = π_k (i.e. z_i ∼ Cat(π)), and likelihood p(x_i | z=k) = p_k(x_i), where we call p_k the kth base distribution. A mixture model is defined by

p(x_i | θ) = ∑_{k=1}^K π_k p_k(x_i | θ)    (1128)

Some popular mixture models:


• Mixture of Gaussians (11.2.1). Each p_k = N(µ_k, Σ_k). Given large enough K, a GMM can approximate any density defined on R^D.
• Mixture of Multinoullis (11.2.2). Let x ∈ {0, 1}^D. Each p_k(x) = ∏_{j=1}^D Ber(x_j | µ_jk).
Below, I derive the expectation and covariance of x in this case^254.

E[x] = ∑_x x p(x)    (1129)
     = ∑_x x ∑_{k=1}^K π_k p_k(x)    (1130)
     = ∑_x x ∑_{k=1}^K π_k ∏_{j=1}^D Ber(x_j | µ_jk)    (1131)
     = ∑_{k=1}^K π_k ∑_{x_1} ··· ∑_{x_D} x ∏_{j=1}^D Ber(x_j | µ_jk)    (1132)
     = ∑_{k=1}^K π_k ∑_{x_1} ··· ∑_{x_D} Ber(x_1 | µ_1k) ··· Ber(x_D | µ_Dk) x    (1133)
     = ∑_{k=1}^K π_k µ_k    (1134)

Next we want to find cov(x) = E[xx^T] − E[x] E[x]^T. I think the insight that makes finding the first term easiest is realizing that you only need to find the two cases, E[x_i²] and E[x_i x_{j≠i}], where in this case

E[x_i²] = ∑_k π_k µ_ik    (1135)
E[x_i x_{j≠i}] = ∑_k π_k µ_ik µ_jk    (1136)
∴ E[xx^T] = ∑_k π_k ( Σ_k + µ_k µ_k^T )    (1137)

where Σ_k = diag(µ_jk (1 − µ_jk)) is the covariance of x under p_k. The fact that the mixture covariance matrix is now non-diagonal confirms that the mixture can capture correlations between variables x_i, x_{j≠i}, unlike a single product-of-Bernoullis model.

The EM Algorithm (11.4). In LVMs, it's usually intractable to compute the MLE, since
we have to marginalize over hidden variables while satisfying constraints like positive definite
covariance matrices, etc. The EM algorithm gets around these issues via a two-step process.
The first step takes an expectation over z ∼ p(z | x, θ^{t−1}) for each individual observed x,
using the current parameter estimates for the distribution of z. This gives us an auxiliary
likelihood that's a function of θ, which will serve as a stand-in (in the 2nd step) for what we
typically use as the likelihood in MLE. The second step is then just finding the optimal θ^t
of the auxiliary likelihood function from the first step. This iterates until convergence or some
stopping condition.

254
Shown in excruciating detail because I was unable to work through this in my head alone.

362
Procedure: EM Algorithm

First, let’s define our auxiliary function Q as

    Q(θ | θ^{t−1}) = E_{p(z|x,θ^{t−1})}[ ℓ_c(θ) | D ]    (1138)

    where ℓ_c(θ) = Σ_{i=1}^N log p(x^{(i)}, z^{(i)} | θ)    (1139)

where, again, the “expectation” serves the purpose of determining the expected value of z (i) for each
observed x(i) . It’s somewhat of a misnomer to denote the expectation like this, since each z (i) is innately
tied with its corresponding observation x(i) .
1. E-Step: Evaluate Q(θ | θ^{t−1}) using (obviously) the previous parameters θ^{t−1}. This yields a
function of θ: the expected [complete-data] log-likelihood for the observed data D.
2. M-Step: Optimize Q w.r.t. θ to get θ^t:

    θ^t = argmax_θ Q(θ | θ^{t−1})    (1140)
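To make the two steps concrete, here's a minimal NumPy sketch of EM for the mixture-of-Bernoullis model from 11.2.2 (my own code and naming, not from the book; no convergence check, just a fixed number of iterations):

    import numpy as np

    def em_bernoulli_mixture(X, K, n_iters=50, seed=0):
        """EM for a mixture of product-of-Bernoullis. X: (N, D) binary data.
        Returns mixing weights pi (K,) and component means mu (K, D)."""
        rng = np.random.default_rng(seed)
        N, D = X.shape
        pi = np.full(K, 1.0 / K)
        mu = rng.uniform(0.25, 0.75, size=(K, D))        # random init away from {0, 1}

        for _ in range(n_iters):
            # E-step: responsibilities r[i, k] = p(z_i = k | x_i, theta^{t-1})
            log_lik = X @ np.log(mu).T + (1 - X) @ np.log(1 - mu).T   # (N, K)
            log_r = np.log(pi) + log_lik
            log_r -= log_r.max(axis=1, keepdims=True)    # numerical stability
            r = np.exp(log_r)
            r /= r.sum(axis=1, keepdims=True)

            # M-step: maximize the expected complete-data log-likelihood Q
            Nk = r.sum(axis=0)                           # effective counts per component
            pi = Nk / N
            mu = (r.T @ X) / Nk[:, None]
            mu = np.clip(mu, 1e-6, 1 - 1e-6)             # avoid log(0) in the next E-step

        return pi, mu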

363
Machine Learning: A Probabilistic Perspective September 14, 2018

Latent Linear Models (Ch. 12)


Table of Contents Local Written by Brandon McKinzie

Kevin P. Murphy (2012). Latent Linear Models.


Machine Learning: A Probabilistic Perspective.

Factor Analysis (12.1). Whereas mixture models define p(z) = Cat(π) for a single hidden
variable z ∈ {1, . . . , K}, factor analysis begins by instead using a vector of real-valued latent
variables, z ∈ RL . The simplest prior is z ∼ N (µ0 , Σ0 ). If x ∈ RD , we can define

p(xi | zi , θ) = N (Wzi + µ, Ψ) (1141)

where
• W ∈ RD×L : factor loading matrix.
• Ψ ∈ RD×D is a diagonal covariance matrix, since we want to “force” z to explain the
correlation255 . If Ψ = σ 2 I, we get probabilistic PCA.
Summaries of key points regarding FA:
• Low-rank parameterization of MVN. FA can be thought of as specifying p(x) using
a small number of parameters. [math] yields that

cov(x) = WW T + Ψ

which has O(LD) params (remember Ψ is diagonal) instead of the usual O(D2 ).
• Unidentifiability: The params of an FA model are unidentifiable.

Classical PCA (12.2.1). Goal: find an orthogonal set of L linear basis vectors w_j ∈ R^D, and
the scores z_i ∈ R^L, such that we minimize the average reconstruction error:

    J(W, Z) = (1/N) Σ_{i=1}^N ||x_i − x̂_i||²,  where x̂_i := W z_i    (1142)
            = ||X − W Z^T||²_F    (1143)

where W ∈ R^{D×L} is orthonormal. Solution: assign each column W_{:,ℓ} to the eigenvector
with the ℓth largest eigenvalue of Σ̂ = (1/N) Σ_i x_i x_i^T, assuming E[x] = 0. Then we compute
ẑ_i := W^T x_i.

255
It's easier to think of this graphically. Our model asserts that x_i ⊥ x_{j≠i} | z. Independence implies zero
correlation, and we cement this by constraining Ψ to be diagonal. See your related note on LFMs, chapter 13
of the DL book.

364
Proof: PCA
Case: L=1. Goal: Find the best 1-d solution, w ∈ R^D, z_i ∈ R, z ∈ R^N. Remember that ||w||² = 1.

    J(w, z) = (1/N) Σ_{i=1}^N ||x_i − z_i w||² = (1/N) Σ_i (x_i^T x_i − 2 z_i w^T x_i + z_i²)    (1144)
    ∂J/∂z_i = 0  →  z_i = w^T x_i    (1145)
    J(w) = (1/N) Σ_i (x_i^T x_i − z_i²) = const − (1/N) Σ_i z_i²    (1146)
    ∴ argmin_w J(w) = argmax_w Var[z̃]    (1147)

This shows why PCA finds directions of maximal variance – aka the analysis view of PCA. Before finding the
optimal w, don't forget the Lagrange multiplier for constraining unit norm,

    J̃(w) = w^T Σ̂ w + λ(w^T w − 1)    (1148)
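A minimal NumPy sketch of the eigenvector solution above (my own code and naming; assumes the data are already centered):

    import numpy as np

    def pca(X, L):
        """Classical PCA as in eq 1142. X: (N, D), centered.
        Returns W (D, L) with orthonormal columns and scores Z (N, L)."""
        N, D = X.shape
        Sigma_hat = (X.T @ X) / N                  # (1/N) sum_i x_i x_i^T
        evals, evecs = np.linalg.eigh(Sigma_hat)   # ascending eigenvalues
        order = np.argsort(evals)[::-1][:L]        # keep the L largest
        W = evecs[:, order]                        # columns = principal directions
        Z = X @ W                                  # z_i = W^T x_i
        return W, Z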

Singular Value Decomposition (SVD) (12.2.3). Any real N × D matrix X can be decomposed as

    X = U S V^T,   with U: N × N,  S: N × D,  V^T: D × D    (1149)

where the columns of U are the left singular vectors, and the columns of V are the right
singular vectors. Economy-sized SVD will shrink U to be N × D and S to be D × D (we're
assuming N > D).
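A quick NumPy check of the shapes (economy-sized SVD via full_matrices=False); my own snippet, not from the book:

    import numpy as np

    X = np.random.default_rng(0).normal(size=(100, 5))    # N = 100 > D = 5
    U, s, Vt = np.linalg.svd(X, full_matrices=False)       # economy-sized SVD
    print(U.shape, s.shape, Vt.shape)                      # (100, 5) (5,) (5, 5)
    print(np.allclose(X, U @ np.diag(s) @ Vt))             # True: X = U S V^T
    # For centered X, the rows of Vt are the PCA directions, since
    # X^T X / N = V (S^2 / N) V^T is exactly the eigendecomposition of Sigma_hat.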

365
Machine Learning: A Probabilistic Perspective August 11, 2018

Markov and Hidden Markov Models (Ch. 17)


Table of Contents Local Written by Brandon McKinzie

Kevin P. Murphy (2012). Markov and Hidden Markov Models.


Machine Learning: A Probabilistic Perspective.

Hidden Markov Models (17.3). HMMs model a joint distribution over a sequence of T
observations x_{⟨1...T⟩} and hidden states z_{⟨1...T⟩},

    p(z_{⟨1...T⟩}, x_{⟨1...T⟩}) = p(z_{⟨1...T⟩}) p(x_{⟨1...T⟩} | z_{⟨1...T⟩}) = [p(z_1) Π_{t=2}^T p(z_t | z_{t−1})] [Π_{t=1}^T p(x_t | z_t)]    (1150)

where each hidden state is discrete: z_t ∈ {1, . . . , K}, while each observation x_t can be discrete
or continuous.

The Forwards Algorithm (17.4.2). Goal: compute the filtered256 marginals p(z_t | x_{⟨1...t⟩}).
1. Prediction step. Compute the one-step-ahead predictive density p(z_t | x_{⟨1...t−1⟩}),

    p(z_t = j | x_{⟨1...t−1⟩}) = Σ_i p(z_t = j, z_{t−1} = i | x_{⟨1...t−1⟩})    (1151)

which will serve as our prior for time t (since it does not take into account observed data
at time t).
2. Update step. We "update" our beliefs by observing x_t,

    α_t(j) ≜ p(z_t = j | x_{⟨1...t⟩})    (1152)
           = p(z_t = j, x_t, x_{⟨1...t−1⟩}) / p(x_{⟨1...t⟩})    (1153)
           = [p(x_{⟨1...t−1⟩}) p(z_t = j | x_{⟨1...t−1⟩}) p(x_t | z_t = j)] / [p(x_{⟨1...t−1⟩}) p(x_t | x_{⟨1...t−1⟩})]    (1154)
           = (1/Z_t) p(z_t = j | x_{⟨1...t−1⟩}) p(x_t | z_t = j)    (1155)

where in (1154) the common factor p(x_{⟨1...t−1⟩}) cancels and we used x_t ⊥ x_{⟨1...t−1⟩} | z_t to drop
the extra conditioning. Notice that we can also use the values of Z_t to compute the log probability of the evidence:

    log p(x_{⟨1...T⟩} | θ) = Σ_{t=1}^T log p(x_t | x_{⟨1...t−1⟩}) = Σ_{t=1}^T log Z_t    (1156)
t=1 t=1
256
They're called "filtered" because they use all observations x_{⟨1...t⟩} instead of just x_t, which reduces/filters
out the noise more.

366
Forwards Algorithm (Algorithm 17.1)

We are given transition matrix T_{i,j} = p(z_t = j | z_{t−1} = i), evidence vectors ψ_t(j) = p(x_t | z_t = j), and
initial state distribution π(j) = p(z_1 = j).
1. First, compute the initial [α_1, Z_1] = normalize(ψ_1 ⊙ π).
2. For time 2 ≤ t ≤ T, compute [α_t, Z_t] = normalize(ψ_t ⊙ (T^T α_{t−1})).
3. Return α_{⟨1...T⟩} and log p(x_{⟨1...T⟩}) = Σ_t log Z_t.
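A direct NumPy transcription of Algorithm 17.1 (my own sketch; T, psi, pi as defined above):

    import numpy as np

    def forwards(T, psi, pi):
        """T: (K, K) with T[i, j] = p(z_t=j | z_{t-1}=i); psi: (S, K) with
        psi[t, j] = p(x_t | z_t=j); pi: (K,) initial distribution."""
        S, K = psi.shape
        alpha = np.zeros((S, K))
        a = psi[0] * pi
        alpha[0] = a / a.sum()
        log_Z = np.log(a.sum())
        for t in range(1, S):
            a = psi[t] * (T.T @ alpha[t - 1])   # predict, then update
            alpha[t] = a / a.sum()
            log_Z += np.log(a.sum())            # sum_t log Z_t = log p(x_{1:T})
        return alpha, log_Z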

The Forwards-Backwards Algorithm (17.4.3). Goal: compute the smoothed marginals,
p(z_t = j | x_{⟨1...T⟩}).

    γ_t(j) ≜ p(z_t = j | x_{⟨1...T⟩})    (1157)
           ∝ p(z_t = j | x_{⟨1...t⟩}) p(x_{⟨t+1...T⟩} | z_t = j)    (1158)
           = α_t(j) β_t(j)    (1159)
    where β_t(j) ≜ p(x_{⟨t+1...T⟩} | z_t = j)    (1160)
           = p(x_{t+1}, x_{⟨t+2...T⟩} | z_t = j)    (1161)
           = Σ_i p(z_{t+1} = i, x_{t+1}, x_{⟨t+2...T⟩} | z_t = j)    (1162)
           = Σ_i p(z_{t+1} = i, x_{t+1} | z_t = j) p(x_{⟨t+2...T⟩} | z_{t+1} = i)    (1163)
           = Σ_i p(z_{t+1} = i | z_t = j) p(x_{t+1} | z_{t+1} = i) p(x_{⟨t+2...T⟩} | z_{t+1} = i)    (1164)
           = Σ_i p(z_{t+1} = i | z_t = j) p(x_{t+1} | z_{t+1} = i) β_{t+1}(i)    (1165)

where in (1163) we dropped the conditioning on z_t = j and x_{t+1} by conditional independence given z_{t+1}.
Using the same notation as Algorithm 17.1 above, the matrix-vector form for β is

    β_t = T (ψ_{t+1} ⊙ β_{t+1})    (1166)

with base case β_T = 1.

The Viterbi Algorithm (17.4.4). Denote the [probability of] the most probable path leading
to z_t = j as

    δ_t(j) ≜ max_{z_{⟨1...t−1⟩}} p(z_{⟨1...t−1⟩}, z_t = j | x_{⟨1...t⟩})    (1167)
           = max_i δ_{t−1}(i) · T_{i,j} · ψ_t(j)    (1168)

(it is common to work in the log-domain when computing δ), with initialization δ_1(j) = π_j ψ_1(j).
We compute this until termination at z*_T = argmax_i δ_T(i).
Note the argmax here instead of a max – we keep track of both for all time steps. We do this
so we can perform traceback to get the full most probable state sequence, starting at T and
ending at t = 1:

    z*_t = a_{t+1}(z*_{t+1})    (1169)

where a_t(j), the most probable state at time t − 1 leading to state j at time t, is the same
formula as δ_t(j) but with an argmax.
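A log-domain NumPy sketch of Viterbi with traceback (my own code, same conventions as the forwards sketch above):

    import numpy as np

    def viterbi(T, psi, pi):
        """Most probable state sequence. T: (K, K), psi: (S, K), pi: (K,)."""
        S, K = psi.shape
        log_T, log_psi = np.log(T), np.log(psi)
        delta = np.zeros((S, K))
        a = np.zeros((S, K), dtype=int)              # a[t, j] = best previous state
        delta[0] = np.log(pi) + log_psi[0]
        for t in range(1, S):
            scores = delta[t - 1][:, None] + log_T   # (K, K): entry [i, j] = delta_{t-1}(i) + log T[i, j]
            a[t] = scores.argmax(axis=0)
            delta[t] = scores.max(axis=0) + log_psi[t]
        z = np.zeros(S, dtype=int)                   # traceback from z*_T = argmax_i delta_T(i)
        z[-1] = delta[-1].argmax()
        for t in range(S - 2, -1, -1):
            z[t] = a[t + 1][z[t + 1]]
        return z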

367
Machine Learning: A Probabilistic Perspective August 11, 2018

Undirected Graphical Models (Ch. 19)


Table of Contents Local Written by Brandon McKinzie

Kevin P. Murphy (2012). Undirected Graphical Models.


Machine Learning: A Probabilistic Perspective.

Learning (19.5). Consider an MRF in log-linear form over C cliques and its log-likelihood
(scaled by 1/N):

    p(y | θ) = (1/Z(θ)) exp{ Σ_c θ_c^T φ_c(y) }    (1170)

    ℓ(θ) = (1/N) Σ_i [ Σ_c θ_c^T φ_c(y_i) − log Z(θ) ]    (1171)

We know from chapter 9 that this log-likelihood is concave in θ, and that ∂/∂θ_c log Z = E[φ_c].
So the gradient of the LL is

    ∂ℓ/∂θ_c = (1/N) Σ_i [ φ_c(y_i) − E_{p(y|θ)}[φ_c(y)] ]    (1172)

Note that the first ("clamped") term only needs to be computed once for each y_i; it is completely
independent of the parameters. It's just evaluating feature functions, which e.g. for CRFs are
often all indicator functions.
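A tiny NumPy sketch of eq 1172 by brute-force enumeration of the state space (my own code; only sensible for toy models where Z(θ) can be computed exactly):

    import numpy as np

    def mrf_loglik_grad(theta, all_feats, data_feats):
        """all_feats:  (S, C) features phi(y) for every joint configuration y.
        data_feats: (N, C) features phi(y_i) of the observed data."""
        logits = all_feats @ theta                 # unnormalized log-potentials
        p = np.exp(logits - logits.max())
        p /= p.sum()                               # p(y | theta)
        clamped = data_feats.mean(axis=0)          # empirical expectation (computed once)
        unclamped = p @ all_feats                  # model expectation E_{p(y|theta)}[phi(y)]
        return clamped - unclamped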

CRF Training (19.6.3). For [linear-chain] CRFs, the equations change slightly (but importantly):

    ℓ(w) ≜ (1/N) Σ_i [ Σ_c w_c^T φ_c(y_i, x_i) − log Z(w, x_i) ]    (1173)

    ∂ℓ/∂w_c = (1/N) Σ_i [ φ_c(y_i, x_i) − E_{p(y|x_i)}[φ_c(y, x_i)] ]    (1174)

It's important to recognize that the gradient of the log partition function now must be computed
for each instance x_i.

368
Convex
Optimization
Contents

10.1 Convex Sets (Ch. 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370

369
Convex Optimization August 19, 2018

Convex Sets (Ch. 2)


Table of Contents Local Written by Brandon McKinzie

Boyd and Vandenberghe (2004). Convex Sets.


Convex Optimization.

Lines and line segments


Viewed as a function of θ ∈ R, we can express the equation for a line in Rn in the following
two ways:

y = θx1 + (1 − θ)x2 (1175)


y = x2 + θ(x1 − x2 ) (1176)

for some x_1, x_2 ≠ x_1 ∈ R^n. If we restrict θ ∈ [0, 1], we have a line segment.

Affine sets
A set C ⊆ Rn is affine if the line (not just segment) through any two distinct points in
C also lies in C. More generally, this implies that for any set of points {x1 , . . . , xk }, with
each xi ∈ C, all affine combinations,
    Σ_{i=1}^k θ_i x_i,  where Σ_i θ_i = 1    (1177)

are in C, too. Related terminology:


• affine hull of any set C ⊆ Rn , denoted aff C, is the set of all affine combinations of
points in C.
• affine dimension of a set C is the dimension of its affine hull.
• relative interior of a set C, denoted relintC, is its interior257 relative to aff C,

relintC , {x ∈ C | B(x, r) ∩ aff C ⊆ C for some r > 0} (1178)

There’s a lot of neat things to say here, but I only have space to state the results:
• If C is an affine set and x0 ∈ C, then

V = C − x0 , {x − x0 | x ∈ C} (1179)

is a subspace, i.e. closed under sums and scalar multiplication.


257
The interior of a set C ⊆ R^n, denoted int C, is the set of all points interior to C. A point x ∈ C is
interior to C if ∃ε > 0 for which all points in the set

    {y ∈ R^n | ||y − x||_2 ≤ ε}

are also in C.

370
• The solution set of a system of linear equations, C = {x | Ax = b}, is an affine set. The
subspace associated with C is the nullspace of A.
Convex sets
A set C is convex if it contains all convex combinations,
    Σ_{i=1}^k θ_i x_i,  where Σ_i θ_i = 1 and θ_i ≥ 0    (1180)

Related terminology:
• convex hull a set C, denoted convC, is the set of all convex combinations of points
in C.

Cones
A set C is called a cone if (∀x ∈ C)(θ ≥ 0) we have θx ∈ C. A set C is called a convex
cone if it contains all conic combinations,
    Σ_{i=1}^k θ_i x_i,  where θ_i ≥ 0    (1181)

of points in C. Related terminology:


• conic hull of a set C is the set of all conic combinations of points in C.

371
Bayesian Data
Analysis
Contents

11.1 Probability and Inference (Ch. 1). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373


11.2 Single-Parameter Models (Ch. 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375
11.3 Asymptotics and Connections to Non-Bayesian Approaches (Ch. 4) . . . . . . . . . . . . . . . 378
11.4 Gaussian Process Models (Ch. 21) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381

372
Bayesian Data Analysis November 22, 2017

Probability and Inference (Ch. 1)


Table of Contents Local Written by Brandon McKinzie

The process of Bayesian Data Analysis can be divided into the following 3 steps:
1. Setting up a full probability model.
2. Conditioning on observed data. Calculating the posterior distribution over unobserved
quantities, given observed data.
3. Evaluating the fit of the model and implications of the posterior.
Notation: In general, we let θ denote unobservable vector quantities or population parameters
of interest, and y as collected data. This means our posterior takes the form p(θ | y), and our
likelihood takes the form p(y | θ).

Means and Variances of Conditional Distributions.

E [u] = Ev [Eu [u | v]] (1182)


Var [u] = Ev [Var [u | v]] + Var [Eu [u | v]] (1183)

Proofs

    E[u] = ∫ dv Pr[v] ∫ du u Pr[u | v]    (1184)
         = E_v[E_u[u | v]]    (1185)

    Var[u] = E[u²] − E[u]²    (1186)
           = E_v[E_u[u² | v]] − (E_v[E_u[u | v]])²    (1187)
           = E_v[E_u[u² | v] − E_u[u | v]²] + E_v[E_u[u | v]²] − (E_v[E_u[u | v]])²    (1188)
           = E_v[Var[u | v]] + Var[E_u[u | v]]    (1189)
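A quick Monte Carlo sanity check of eq 1183 (my own toy example: u | v ∼ N(v, 4) with v Poisson(5), so Var[u] should be 4 + Var[v] ≈ 9):

    import numpy as np

    rng = np.random.default_rng(0)
    v = rng.poisson(5.0, size=500_000)        # Var[v] = 5
    u = rng.normal(loc=v, scale=2.0)          # u | v ~ N(v, 4): Var[u|v] = 4, E[u|v] = v

    print(u.var())                            # ~9.0 (law of total variance)
    print(4.0 + v.var())                      # ~9.0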

Transformation of Variables.

    Pr_v[v] = |J| Pr_u[f^{−1}(v)],  where J_{i,j} = ∂u/∂v = ∂f^{−1}(v)/∂v    (1190)

where u and v have the same dimensionality, and |J| is the determinant of the Jacobian of the
transformation u = f^{−1}(v). When working with parameters defined on the open unit interval,
(0, 1), we often use the logistic transformation:

    logit(u) = log( u / (1 − u) )    (1191)
    logit^{−1}(v) = e^v / (1 + e^v)    (1192)

373
Standard Probability Distributions258.

Distribution           Notation            Density Function
Beta                   Beta(α, β)          p(θ) = [Γ(α+β)/(Γ(α)Γ(β))] θ^{α−1} (1 − θ)^{β−1}
Inverse-gamma          Inv-gamma(α, β)     p(θ) = [β^α/Γ(α)] θ^{−(α+1)} e^{−β/θ}
Normal (univariate)    N(µ, σ²)            p(θ) = [1/(√(2π) σ)] exp( −(θ − µ)²/(2σ²) )
Scaled inverse-χ²      Inv-χ²(ν, s²)       θ ∼ Inv-gamma(ν/2, (ν/2)s²)

258
The gamma function, Γ(x), is defined as

    Γ(x) = ∫_0^∞ t^{x−1} e^{−t} dt    (1193)

or simply as (x − 1)! if x ∈ Z^+.

374
Bayesian Data Analysis November 22, 2017

Single-Parameter Models (Ch. 2)


Table of Contents Local Written by Brandon McKinzie

Since we'll be referring to the binomial model frequently, below are the main distributions for
reference:

    p(y | θ) = Bin(y | n, θ) = C(n, y) θ^y (1 − θ)^{n−y}    (1194)
    p(θ | y) ∝ θ^y (1 − θ)^{n−y}    (1195)
    θ | y ∼ Beta(y + 1, n − y + 1)    (1196)

where y is the number of successes out of n trials. We assume a uniform prior over the interval
[0, 1].
Posterior as compromise between data and prior information. Intuitively, the prior
and posterior distributions over θ should have some general relationship showing how the
process of observing data y updates our distribution on θ. We can use the identities from the
previous chapter to see that

E [θ] = Ey [Eθ [θ | y]] (1197)


Var [θ] = Ey [Var [θ | y]] + Var [Eθ [θ | y]] (1198)

where:
→ Equation 1197 states the obvious: our prior expectation for θ is the expectation, taken over
the distribution of possible data, of the posterior expectation.
→ Equation 1198 states: the posterior variance is on average smaller than the prior variance,
by an amount that depends on the variation [in posterior means] over the distribution of
possible data. Stated another way: we can reduce our uncertainty with regard to θ by
larger amounts with models whose [expected] posteriors are strongly informed by the data.
Informative Priors. We now discuss some of the issues that arise in assigning a prior distribution
p(θ) that reflects substantive information. Instead of using a uniform prior for our
binomial model, let's explore the prior θ ∼ Beta(α, β)259. Now our posterior takes the form

    θ | y ∼ Beta(α + y, β + n − y)    (1199)

The property that the posterior follows the same parametric form as the prior is called conjugacy.

259
Assume we can select reasonable values for α and β.

375
Conjugacy, families, and sufficient statistics. Formally, if F is a class of sampling distributions
{Pr_i[y | θ]}, and P is a class of prior distributions, {Pr_j[θ]}, then the class P is
conjugate for F if260

    ∀ Pr[· | θ] ∈ F, Pr[·] ∈ P :  Pr[θ | y] ∈ P    (1200)

Probability distributions that belong to an exponential family have natural conjugate prior
distributions. The class F is an exponential family if all its members have the form,

    Pr[y_i | θ] = f(y_i) g(θ) e^{φ(θ)^T u(y_i)}    (1201)
    Pr[y | θ] = (Π_{i=1}^n f(y_i)) g(θ)^n exp( φ(θ)^T Σ_{i=1}^n u(y_i) )    (1202)
              ∝ g(θ)^n exp( φ(θ)^T t(y) )    (1203)

where
• y = (y_1, . . . , y_n) denotes n iid observations.
• φ(θ) is called the natural parameter of F.
• t(y) = Σ_{i=1}^n u(y_i) is said to be a sufficient statistic for θ, because the likelihood for θ
depends on the data y only through the value of t(y).

Normal distribution with known variance261. Consider a single scalar observation y
drawn from N(θ, σ²), where we assume σ² is known. The family of conjugate prior densities
for the Gaussian likelihood as well as our choice of parameterization are, respectively,

    p(θ) ∝ e^{Aθ² + Bθ + C}    (1205)    [defining our conjugate prior]
    p(θ) ∝ e^{ −(θ − µ_0)²/(2τ_0²) }    (1206)

By definition, this implies that the posterior should also be normal. Indeed, after some basic
arithmetic/substitutions, we find

    Pr[θ | y] ∝ exp( −(θ − µ_1)²/(2τ_1²) )    (1207)    [writing our posterior precision and mean]

    µ_1 = ( µ_0/τ_0² + y/σ² ) / ( 1/τ_0² + 1/σ² )   and   1/τ_1² = 1/τ_0² + 1/σ²    (1208)

260
In English: A class of prior distributions is conjugate for a class of sampling distributions if, for any
pair of sampling distribution and prior distribution [from those two respective classes], the associated posterior
distribution is also in the same class of prior distributions.
261
The following will be useful to remember:
    ∫_{−∞}^{∞} e^{−ax² + bx + c} dx = √(π/a) e^{b²/(4a) + c}    (1204)

376
where we see that the posterior precision (inverse of variance) equals the prior precision
plus the data precision. We can see the posterior mean µ_1 expressed as a weighted average of
the prior mean and the observed value262 y, with weights proportional to the precisions.
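A one-function NumPy sketch of eq 1208 for a single observation (my own code/naming):

    def normal_known_var_posterior(y, sigma2, mu0, tau0_sq):
        """Posterior (mu1, tau1^2) for y ~ N(theta, sigma2) with prior theta ~ N(mu0, tau0_sq)."""
        prec = 1.0 / tau0_sq + 1.0 / sigma2               # posterior precision
        mu1 = (mu0 / tau0_sq + y / sigma2) / prec         # precision-weighted average
        return mu1, 1.0 / prec

    # a vague prior barely moves the data, a tight prior dominates:
    print(normal_known_var_posterior(y=2.0, sigma2=1.0, mu0=0.0, tau0_sq=100.0))
    print(normal_known_var_posterior(y=2.0, sigma2=1.0, mu0=0.0, tau0_sq=0.01))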

Normal distribution with unknown variance. Now, we assume the mean θ is known, and
the variance σ² is unknown. The likelihood for a vector y = (y_1, . . . , y_n) of n iid observations
is

    Pr[y | σ²] ∝ (σ²)^{−n/2} exp( −n v / (2σ²) )    (1209)    [likelihood for n IID observations]

    v := (1/n) Σ_{i=1}^n (y_i − θ)²    (1210)

where v is the sufficient statistic. The corresponding conjugate prior density is the inverse-gamma.
This and our choice for parameterization (how we define α and β) are, respectively,

    Pr[σ²] ∝ (σ²)^{−(α+1)} e^{−β/σ²}    (1211)    [defining our conjugate prior]

    σ² ∼ Inv-χ²(ν_0, σ_0²) = Inv-gamma(ν_0/2, (ν_0/2)σ_0²)    (1212)

All that's left is computing our posterior,

    p(σ² | y) ∝ p(σ²) p(y | σ²)    (1213)

    σ² | y ∼ Inv-χ²( ν_0 + n,  (ν_0 σ_0² + n v) / (ν_0 + n) )    (1214)

Jeffreys' Invariance Principle. An approach for defining noninformative prior distributions.
Let φ = h(θ), where the function h is one-to-one. By transformation of variables,

    p(φ) = p(θ) |dθ/dφ| = p(θ) |h′(θ)|^{−1}    (1215)

Jeffreys' principle is to basically take the above equation as a true equivalence – there is no
difference between finding p(θ) and applying the equation above [to get p(φ)], and directly
finding p(φ).

Let J(θ) denote the Fisher information for θ, defined as

    J(θ) = E[ (d log Pr[y | θ] / dθ)² | θ ] = −E[ d² log Pr[y | θ] / dθ² | θ ]    (1216)

Jeffreys' prior defines the noninformative prior density as Pr[θ] ∝ [J(θ)]^{1/2}. We can
work out that this model is indeed invariant to parameterization263.
262
For now, we’re considering the single data point case.
263
Evaluate J(φ) at θ = h^{−1}(φ). You should find that J(φ)^{1/2} = J(θ)^{1/2} |dθ/dφ|.
|

377
Bayesian Data Analysis December 02, 2017

Asymptotics and Connections to Non-Bayesian Approaches (Ch. 4)


Table of Contents Local Written by Brandon McKinzie

Normal Approximations to the Posterior Distribution. If the posterior Pr[θ | y] is
unimodal and roughly symmetric, it can be convenient to approximate it by a normal distribution.
Here we'll consider a quadratic approximation via the Taylor series expansion up to
second order,

    log Pr[θ | y] = log Pr[θ̂ | y] + (1/2)(θ − θ̂)^T [ d²/dθ² log Pr[θ | y] ]_{θ=θ̂} (θ − θ̂)    (1217)

where θ̂ is the posterior mode. The remainder terms of higher order fade in importance relative
to the quadratic term when θ is close to θ̂ and n is large. We'd like to cast this into a normal
distribution. First, let

    I(θ) ≜ −d²/dθ² log Pr[θ | y]    (1218)

which we will refer to as the observed information. We can then rewrite our approximation
as264

    Pr[θ | y] ≈ N(θ̂, [I(θ̂)]^{−1})    (1221)

Under the normal approximation, the posterior distribution is summarized by its


mode, θ̂, and the curvature of the posterior density, I(θ̂); that is, asymptotically,
these are sufficient statistics.

264
I also found it helpful to explicitly write the substitution after exponentiating eq 1217 (all logs are
assumed natural logs):

    Pr[θ | y] = exp( log Pr[θ̂ | y] − (1/2)(θ − θ̂)^T [I(θ̂)] (θ − θ̂) )    (1219)
              = Pr[θ̂ | y] e^{ −(1/2)(θ − θ̂)^T [I(θ̂)] (θ − θ̂) }    (1220)

378
Example. Let y_1, . . . , y_n be independent observations from N(µ, σ²). Define θ := (µ, log σ)
as the parameters of interest, and assume a uniform prior265 Pr[θ]. Recall that (equation 3.2
in the textbook)

    Pr[θ = (µ, log σ) | y] ∝ σ^{−(n+2)} exp( −(1/(2σ²)) Σ_{i=1}^n (y_i − µ)² )    (1223)
    = σ^{−(n+2)} exp( −(1/(2σ²)) [ n(ȳ − µ)² + Σ_{i=1}^n (y_i − ȳ)² ] )    (1224)
    = σ^{−(n+2)} exp( −(1/(2σ²)) [ n(ȳ − µ)² + (n − 1)s² ] )    (1225)

(note that Σ_{i=1}^n (y_i − ȳ) = 0), where s² = (1/(n−1)) Σ_{i=1}^n (y_i − ȳ)² is the sample variance of the y_i's. The sufficient statistics are
ȳ and s². To construct the approximation, we need the second derivatives of the log posterior
density266,

    log Pr[µ, log σ | y] = const − n log σ − (1/(2σ²)) [ n(ȳ − µ)² + (n − 1)s² ]    (1226)

in order to compute I(θ̂). After computing first derivatives, we find that the posterior mode is

    θ̂ = (µ̂, log σ̂) = ( ȳ,  log( √((n−1)/n) s ) )    (1227)

We then compute second derivatives and evaluate at θ = θ̂ to obtain I(θ̂). Combining all this
into the final result:

    Pr[µ, log σ | y] ≈ N(θ̂, [I(θ̂)]^{−1})    (1228)
    = N( (ȳ, log σ̂),  diag(σ̂²/n, 1/(2n)) )    (1229)

where σ̂²/n is the variance along the µ dimension, and 1/(2n) is the variance along the log σ
direction. This example was just meant to illustrate, with a simple case, how we work through
constructing the approximate normal distribution.

265
Recall from Ch 3.2 that the uniform prior on (µ, log σ) is

    Pr[µ, σ²] ∝ (σ²)^{−1}    (1222)

where we continue to assume µ is uniform in [0, 1] for some reason.


266
I’m not sure why n + 2 has seemingly turned into n.

379
Large-Sample Theory. Asymptotic normality of the posterior distribution: as more data
arrives from the same underlying distribution f (y), the posterior distribution of the parameter
vector θ approaches multivariate normality, even if the true distribution of the data is not
within the parametric family under consideration.

Suppose the data are modeled by a parametric family, Pr [y | θ], with a prior distribution
Pr [θ]. If the true distribution, f (y), is included in the parametric family (i.e. if ∃θ0 : f (y) =
Pr [y | θ0 ]), then it’s also true that consistency holds267 : Pr [θ | y] converges to a point mass
at the true parameter value, θ0 as n → ∞.

267
So, what if the true f (y) is not included in the parametric family? In that case, there is no longer a true
value θ0 , but its role in the theoretical result is replaced by a value θ0 that makes the model distribution Pr [y | θ]
closest to the true distribution f (y), in a technical sense involving the Kullback-Leibler divergence.

380
Bayesian Data Analysis July 29, 2018

Gaussian Process Models (Ch. 21)


Table of Contents Local Written by Brandon McKinzie

Since I like to begin by motivating what we’re going to talk about, and since BDA doesn’t
really do this, I’m going to start with an excerpt from chapter 15 of Kevin Murphy’s book:

In supervised learning, we observe some inputs xi and some outputs yi . We assume that
yi = f (xi ), for some unknown function f , possibly corrupted by noise. The optimal ap-
proach is to infer a distribution over functions given the data, p(f | X, y), and then to use
this to make predictions given new inputs, i.e., to compute
    p(y_* | x_*, X, y) = ∫ p(y_* | f, x_*) p(f | X, y) df    (1230)

Gaussian Processes or GPs define a prior over functions p(f ) which can be converted
into a posterior over functions p(f | X, y) once we’ve seen some data.

Gaussian Process Regression (20.1). We write a GP as µ ∼ GP(m, k) (µ is now taking


the place of f from Kevin Murphy’s notation), parameterized in terms of a mean function m
and a covariance function k. Remember, µ is supposed to represent the predictor function for
obtaining output predictions given inputs, y = µ(x). Instead of making the usual assump-
tion that there is some “best” function µ∗ and trying to learn it via fitting a parameterized
µ̂(θ), we are going full meta 268 (i.e. full Bayesian) and learning the distribution over predictors.

Apparently, we only need to consider a finite (and arbitrary) set of points x1 , . . . , xn to consider
when evaluating any given µ. The GP prior on µ is defined as

µ(x1 ), . . . , µ(xn ) ∼ N ({m(x1 ), . . . , m(xn )} , K(x1 , . . . , xn )) (1231)

with mean m and covariance K 269 . The covariance function k specifies the covariance between
the process at any two points, with K an n × n covariance matrix with Kp,q = k(xp , xq ). The
covariance function controls the smoothness of realizations from the GP270 and the degree of
shrinkage towards the mean.

268
What if there are like, a whole space of different predictors, man? Like, what if there is an infinite sea of
predictor functions, all with their own unique traits and quirks? Woah.
269
Don’t confuse the notation – N uses covariance K as an argument, while GP uses covariance function k
as an argument.
270
In English: How similar we expect different samples of µ to look as a function of x. The reason this was
weird to think about at first is because I’m used to thinking about covariance/smoothness over x rather than
sampled functions of x. Meta.

381
A common choice is the squared exponential,

    k(x, x′) = τ² exp( −|x − x′|² / (2ℓ²) )    (1232)

where τ controls the magnitude and ` the smoothness of the function.

382
Gaussian
Processes for
Machine
Learning
Contents

12.1 Regression (Ch. 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384

383
Gaussian Processes for Machine Learning July 29, 2018

Regression (Ch. 2)
Table of Contents Local Written by Brandon McKinzie

Rasmussen and Williams (2006). Regression. Gaussian Processes for Machine Learning.

Weight-space view (2.1). We review the standard probabilistic view of linear regression.

    f(x) = x^T w    (1233)
    y = f(x) + ε,  where ε ∼ N(0, σ_n²)    (1234)
    [likelihood]  y | X, w ∼ N(X^T w, σ_n² I),  with X ∈ R^{d×n}    (1235)
    [prior]  w ∼ N(0, Σ_p)    (1236)
    [posterior]  p(w | y, X) = p(y | w, X) p(w) / p(y | X)    (1237)
                 ∼ N( (1/σ_n²) A^{−1} X y,  A^{−1} )    (1238)

where A = σ_n^{−2} X X^T + Σ_p^{−1}, and we often set Σ_p = I. When analyzing the contour plots for
the likelihood, remember that it is not a probability distribution [over w], but rather it's interpreted
as a function of w, i.e. likelihood(w) := N(y; X^T w, σ_n² I).

Note that we often use Bayesian techniques without realizing it. For example, what does the
following remind you of?

    −ln p(w) ∝ (1/2) w^T Σ_p^{−1} w    (1239)

It's the ℓ2 penalty from ridge regression (where typically Σ_p = I). We can also project the
inputs into a higher-dimensional space, often referred to as the feature space, by passing them
through feature functions φ(x). As we'll see later (ch 5), GPs actually tell us how to define
the basis functions h_i(x) which define the value of φ_i(x), the ith element of the feature vector.
The author then proceeds to give an overview of the kernel trick.

384
Function-space view (2.2). Instead of inference over parameters w, we can equivalently
consider inference in function space with a Gaussian process (GP), formally defined as
A Gaussian process is a collection of random variables, any finite number of which have a
joint Gaussian distribution.

A GP is completely specified by its mean function and covariance function. Define the mean
function m(x) and covariance function k(x, x′) of a real [Gaussian] process f(x) as

    m(x) = E_f[f(x)]    (1240)
    k(x, x′) = E_f[ (f(x) − m(x))(f(x′) − m(x′)) ]    (1241)
    f(x) ∼ GP( m(x), k(x, x′) )    (1242)

Note that the expectations are over the random variable f (x) for any given (non-random) x.
In other words, the expectation is over the space of possible functions, each evaluated at point
x. Concretely, this is often an expectation over the parameters w, as is true for our linear
regression example. We can now write our Bayesian linear regression model (with feature
functions) as a GP.

    E_{w∼N(0,Σ_p)}[f(x; w)] = φ(x)^T E[w] = 0    (1243)
    E_{w∼N(0,Σ_p)}[f(x) f(x′)] = φ(x)^T E[w w^T] φ(x′) = φ(x)^T Σ_p φ(x′)    (1244)

A common covariance function is the squared exponential (a.k.a. the RBF kernel),

    k(x_p, x_q) = Cov[f(x_p), f(x_q)] = exp( −(1/2) |x_p − x_q|² )    (1245)
where it’s important to recognize that, whereas we’ve usually seen this is the RBF kernel for
the purposes of kernel methods on inputs, we are now using it specify the covariance of the
outputs.

Ok, so how do we sample some functions and plot them? Below is an overview for our current
linear regression running example.
1. Choose a number of input points X∗ . For our linear regression example, we could set
this to np.arange(-5, 5.1, 0.1) to get evenly spaced x in [−5, 5] in intervals of 0.1.
2. Write out the covariance matrix defined by Kp,q = k(xp , xq ) using our squared exponen-
tial covariance function, for all pairs of inputs.
3. We can now generate samples of function f , represented as a random vector with size
equal to the number of inputs |X∗ |, by sampling from the GP prior

f∗ ∼ N (0, K(X∗ , X∗ )) (1246)

So far, we’ve only dealt with the GP prior. What do we do when we get labeled training
observations? How do we make predictions on unlabeled test data? Well, for the simple case
where our observations are noise free271 , that is we know {(xi , fi ) | i = 1, . . . , n}, the joint
271
For example, noise-free linear regression would mean we model y=f(x), implicitly defining ε = 0.

385
GP prior over the train set inputs X and test set inputs X∗ is defined exactly how we did
it earlier (zero mean, elementwise evaluation of k). In other words, our GP prior models the
train outputs f and test outputs f∗ as random vectors sampled via
" # " #!
f K(X, X) K(X, X∗ )
∼N 0, (1247)
f∗ K(X∗ , X) K(X∗ , X∗ )

You may be wondering: why are we talking about sampling from the prior on inputs? We
already know the outputs!, and you’d be correct. The way we obtain our posterior is by
restricting our joint prior to only those functions that agree with the observed training data
X, f, which we can do by simply conditioning on them. Our posterior for sampling test outputs
given test inputs X∗ is thus

    f_* | X_*, X, f ∼ N( K(X_*, X) K(X, X)^{−1} f,  K(X_*, X_*) − K(X_*, X) K(X, X)^{−1} K(X, X_*) )    (1248)
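Putting the recipe together, a NumPy sketch of sampling from the prior (1246) and the noise-free posterior (1248); my own code, with a small jitter added for numerical stability:

    import numpy as np

    def sq_exp_kernel(A, B, ell=1.0):
        """Squared-exponential covariance: K[p, q] = exp(-|a_p - b_q|^2 / (2 ell^2))."""
        d2 = (A[:, None] - B[None, :]) ** 2
        return np.exp(-0.5 * d2 / ell**2)

    rng = np.random.default_rng(0)
    X_star = np.arange(-5, 5.1, 0.1)                  # test inputs
    X = np.array([-4.0, -1.5, 0.0, 2.0])              # noise-free training inputs
    f = np.sin(X)                                     # observed training outputs

    # prior draws: f_* ~ N(0, K(X_*, X_*))   (eq 1246)
    K_ss = sq_exp_kernel(X_star, X_star) + 1e-10 * np.eye(len(X_star))
    prior_draws = rng.multivariate_normal(np.zeros(len(X_star)), K_ss, size=3)

    # posterior draws (eq 1248): condition the joint prior on the observed (X, f)
    K = sq_exp_kernel(X, X) + 1e-10 * np.eye(len(X))
    K_s = sq_exp_kernel(X_star, X)
    mean = K_s @ np.linalg.solve(K, f)
    cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)
    post_draws = rng.multivariate_normal(mean, cov, size=3)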

386
Blogs
Contents

13.1 Conv Nets: A Modular Perspective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 388


13.2 Understanding Convolutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
13.3 Deep Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391
13.4 Deep Learning for Chatbots (WildML) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393
13.5 Attentional Interfaces – Neural Perspective . . . . . . . . . . . . . . . . . . . . . . . . . . 395

387
Blogs December 21, 2016

Conv Nets: A Modular Perspective


Table of Contents Local Written by Brandon McKinzie

From this post on Colah’s Blog.

The title is inspired by the following figure. Colah mentions how groups of neurons, like A,
that appear in multiple places are sometimes called modules, and networks that use them are
sometimes called modular neural networks. You can feed the output of one convolutional layer
into another. With each layer, the network can detect higher-level, more abstract features.

−→ Function of the A neurons: compute certain features.


−→ Max pooling layers: kind of “zoom out”. They allow later convolutional layers to work on
larger sections of the data. They also make us invariant to some very small transformations
of the data.

388
Blogs December 21, 2016

Understanding Convolutions
Table of Contents Local Written by Brandon McKinzie

From Colah’s Blog.


Ball-Dropping Example. The posed problem:

Imagine we drop a ball from some height onto the ground, where it only has one dimension of
motion. How likely is it that a ball will go a distance c if you drop it and then drop it again from
above the point at which it landed?

From basic probability, we know the result is a sum over possible outcomes, constrained by
a + b = c. It turns out this is actually the definition of the convolution of f and g.

    Pr(a + b = c) = Σ_{a+b=c} f(a) · g(b)    (1249)
    (f ∗ g)(c) = Σ_{a+b=c} f(a) · g(b)    (1250)
               = Σ_a f(a) · g(c − a)    (1251)
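A quick numerical check (my own example) that np.convolve computes exactly eq 1251 for two small displacement distributions:

    import numpy as np

    f = np.array([0.2, 0.5, 0.3])       # p(a) for a = 0, 1, 2
    g = np.array([0.6, 0.1, 0.3])       # p(b) for b = 0, 1, 2
    print(np.convolve(f, g))            # p(c) for c = 0..4, i.e. the distribution of a + b
    print(np.convolve(f, g).sum())      # 1.0, still a valid distribution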

Visualizing Convolutions. Keeping the same example in the back of our heads, consider a
few interesting facts.

• Flipping directions. If f (x) yields the probability of landing a distance x away from
where it was dropped, what about the probability that it was dropped a distance x from
where it landed? It is f (−x).

• Above is a visualization of one term in the summation of (f ∗ g)(c). It is meant to


show how we can move the bottom around to think about evaluating the convolution for
different c values.

389
We can relate these ideas to image recognition. Below are two common kernels used to convolve
images with.
On the left is a kernel for blurring images, accomplished by taking simple averages. On the
right is a kernel for edge detection, accomplished by taking the difference between two pixels,
which will be largest at edges, and essentially zero for similar pixels.

390
Blogs December 23, 2016

Deep Reinforcement Learning


Table of Contents Local Written by Brandon McKinzie

Link to tutorial – Part I of “Demystifying deep reinforcement learning.”

Reinforcement Learning. Vulnerable to the credit assignment problem - i.e. unsure which
of the preceding actions was responsible for getting some reward and to what extent. Also
need to address the famous explore-exploit dilemma when deciding what strategies to use.

Markov Decision Process. The most common method for representing a reinforcement learning problem.
MDPs consist of states, actions, and rewards. The total reward (return) at time t is the current reward
plus discounted future rewards:

    R_t = r_t + γ(r_{t+1} + γ(r_{t+2} + . . .)) = r_t + γ R_{t+1}    (1252)

Q-learning. Define the function Q(s, a) to be the best possible score at the end of the game after performing
action a in state s; the "quality" of an action from a given state. The recursive definition of Q
(for one transition) is given below in the Bellman equation.

    Q(s, a) = r + γ max_{a′} Q(s′, a′)

and updates are computed with a learning rate α as

    Q(s_t, a_t) ← (1 − α) · Q(s_t, a_t) + α · ( r + γ max_{a′} Q(s_{t+1}, a′) )
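In tabular form (before any neural network enters the picture), the update is a couple of lines of NumPy (my own sketch):

    import numpy as np

    def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
        """One tabular Q-learning step; Q is an (n_states, n_actions) array."""
        target = r + gamma * Q[s_next].max()
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
        return Q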

Deep Q Network. Deep learning can deal with issues related to prohibitively large
state spaces. The implementation chosen by DeepMind was to represent the Q-function with a
neural network, with the states (pixels) as the input and Q-values as output, where the number
of output neurons is the number of possible actions from the input state. We can optimize
with a simple squared loss between the target and the network's current estimate,

    L = (1/2) [ r + γ max_{a′} Q(s′, a′) − Q(s, a) ]²

and our algorithm from some state s becomes


1. First forward pass from s to get all predicted Q-values for each possible action. Choose
action corresponding to max output, leading to next s0 .

391
2. Second forward pass from s′ and again compute max_{a′} Q(s′, a′).
3. Set the target output for each action a′ from s′. For the action corresponding to the max
(from step 2), set its target as r + γ max_{a′} Q(s′, a′); for all other actions, set the target to the
same value originally returned from step 1, making the error 0 for those outputs. (Interpret
as an update to our guess for the best Q-value, keeping the others the same.)
4. Update weights using backprop.

Experience Replay. This is the most important trick for helping convergence of Q-values when
approximating with non-linear functions. During gameplay all the experiences ⟨s, a, r, s′⟩ are
stored in a replay memory. When training the network, random minibatches from the replay
memory are used instead of the most recent transition.

Exploration. One could say that initializing the Q-values randomly and then picking the
max is essentially a form of exploration. However, this type of exploration is greedy, which
can be tamed/fixed with ε-greedy exploration. This incorporates a degree of randomness
when choosing the next action at all time-steps, determined by a probability ε that we choose the
next action randomly. For example, DeepMind decreases ε over time from 1 to 0.1.

Deep Q-Learning Algorithm.

392
Blogs January 15, 2017

Deep Learning for Chatbots (WildML)


Table of Contents Local Written by Brandon McKinzie

Overview.
• Model. Implementing a retrieval-based model. Input: conversation/context c. Output:
response r.
• Data. Ubuntu Dialog Corpus (UDC). 1 million examples of form (context, utterance,
label). The label can be 1 (utterance was actual response to the context) or a 0 (utterance
chosen randomly). Using NLTK, the data has been . . .
→ Tokenized: dividing strings into lists of substrings.
→ Stemmed. IDK
→ Lemmatized. IDK
The test/validation set consists of (context, ground-truth utterance, [9 distractors (incorrect
utterances)]) tuples. The distractors are picked at random272

Dual-Encoder LSTM.

1. Inputs. Both the context and the response text are split by words, and each word is
embedded into a vector and fed into the same RNN.
2. Prediction. Multiply the [vector representation ("meaning")] c with param matrix M
to predict some response r0 .
3. Evaluation. Measure similarity of predicted r0 to actual r via simple dot product. Feed
this into sigmoid to obtain a probability [of r0 being the correct response]. Use (binary)
cross-entropy for loss function:
L = −y · ln(y 0 ) − (1 − y) · ln(1 − y 0 ) (1253)
where y 0 is the predicted probability that r0 is correct response r, and y ∈ {0, 1} is the
true label for the context-response pair (c, r).

272
Better example/approach: Google’s Smart Reply uses clustering techniques to come up with a set of
possible responses.

393
Data Pre-Processing. Courtesy of WildML, we are given 3 files after preprocessing: train.tfrecords,
validation.tfrecords, and test.tfrecords, which use TensorFlow’s ’Example’ format. Each Ex-
ample consists of . . .
• context: Sequence of word ids.
• context_len: length of the aforementioned sequence.
• utterance: seq of word ids representing utterance (response).
• utterance_len.
• label: only in training data. 0 or 1.
• distractor_[N]: Only in test/validation. N ranges from 0 to 8. Seq of word ids reppin
the distractor utterance.
• distractor_[N]_len.

394
Blogs April 04, 2017

Attentional Interfaces – Neural Perspective


Table of Contents Local Written by Brandon McKinzie

[Link to article]

Attention Mechanism. Below is a close-up view/diagram of an attention layer. Technically,


it only corresponds to a single time step i; we are using the previous decoder state si−1 to
compute the ith context vector ci which will be fed as an input to the decoder for step i.

For convenience, I'll rewrite the familiar equations for computing quantities at some step i.

    [decoder state]       s_i = f(s_{i−1}, y_{i−1}, c_i)    (1254)
    [context vect]        c_i = Σ_{j=1}^{T_x} α_{ij} h_j    (1255)
    [annotation weights]  α_{ij} = exp(e_{ij}) / Σ_{k=1}^{T_x} exp(e_{ik})    (1256)
                          e_{ij} = a(s_{i−1}, h_j)    (1257)

Now we can see just how simple this really is. Recall that Bahdanau et al., 2015 use the
wording: “eij is an alignment model which scores how well the inputs around position j and
the output at position i match.” But we can see an example implementation of an alignment
model above: the tanh function (that’s it).
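A minimal NumPy sketch of one attention step, using a tanh alignment model as described (my own code; W_s, W_h, v are the alignment-model parameters and the names are mine):

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def attention_step(s_prev, H, W_s, W_h, v):
        """s_prev: (d_s,) previous decoder state; H: (T_x, d_h) encoder annotations.
        Returns the context vector c_i and the annotation weights alpha_i."""
        e = np.tanh(s_prev @ W_s + H @ W_h) @ v   # e_{ij} = a(s_{i-1}, h_j), shape (T_x,)
        alpha = softmax(e)                        # annotation weights, eq 1256
        c = alpha @ H                             # context vector, eq 1255
        return c, alpha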

395
Appendix
Contents

14.1 Common Distributions and Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397


14.2 Math . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
14.3 Matrix Cookbook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408
14.4 Main Tasks in NLP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409
14.5 Misc. Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
14.5.1 BLEU Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
14.5.2 Connectionist Temporal Classification (CTC) . . . . . . . . . . . . . . . . . . . . . 413
14.5.3 Perplexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
14.5.4 Byte Pair Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
14.5.5 Grammars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
14.5.6 Bloom Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
14.5.7 Distributed Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
14.5.8 Traditional Language Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . 419

396
Common Distributions and Models
Table of Contents Local Written by Brandon McKinzie

Continuous Distributions.
Distributions with support θ > 0 (recall Γ(z) = ∫_0^∞ x^{z−1} e^{−x} dx, and Γ(n) = (n − 1)! for n ∈ Z^+):

Distribution          Density Function                                         Notation
Chi-square            p(θ) = [2^{−ν/2}/Γ(ν/2)] θ^{ν/2−1} e^{−θ/2}              θ ∼ χ²_ν
Gamma                 p(θ) = [β^α/Γ(α)] θ^{α−1} e^{−βθ}                        θ ∼ Gamma(α, β)
Inverse-gamma         p(θ) = [β^α/Γ(α)] θ^{−α−1} e^{−β/θ}                      θ ∼ Inv-gamma(α, β)
Inverse-chi-square    p(θ) = [2^{−ν/2}/Γ(ν/2)] θ^{−ν/2−1} e^{−1/(2θ)}          θ ∼ Inv-χ²_ν

Distributions with support θ ∈ [0, 1]:

Distribution   Density Function                                                         Notation
Beta           p(θ) = [Γ(α+β)/(Γ(α)Γ(β))] θ^{α−1} (1 − θ)^{β−1}                          θ ∼ Beta(α, β)
Dirichlet      p(θ) = [Γ(Σ_k α_k)/Π_k Γ(α_k)] Π_k θ_k^{α_k−1},  with Σ_k θ_k = 1         θ ∼ Dirichlet(α_1, . . . , α_K)

Discrete Distributions.

Distribution   Density Function                                       Notation
Bernoulli      p(x; θ) = θ^{1[x=1]} (1 − θ)^{1[x=0]}                  X ∼ Ber(θ)
Binomial       p(x; n, p) = C(n, x) p^x (1 − p)^{n−x}                 x ∼ Bin(n, p)
Multinomial    p(x_1, . . . , x_k; n) = [n!/(Π_i x_i!)] Π_i p_i^{x_i}

Logistic Regression. Perhaps the simplest linear method273 for classification is logistic
regression. Let K be the number of classes that y can take on. The model is defined as

    Pr[y = k | x] = exp(θ_k^T x) / ( 1 + Σ_{ℓ=1}^{K−1} exp(θ_ℓ^T x) ),  for 1 ≤ k ≤ K − 1    (1258)
    Pr[y = K | x] = 1 / ( 1 + Σ_{ℓ=1}^{K−1} exp(θ_ℓ^T x) )    (1259)

and we often denote Pr[y = k | x] under the entire set of parameters θ simply as p_k(x; θ)
or just p_k(x). The decision boundaries are the set of points in the domain of x for which
some p_k(x) = p_{j≠k}(x). Equivalently, the model can be specified by K − 1 log-odds or logit

273
We say a classification method is linear if its decision boundary is linear.

397
transformations of the form

    log( p_i(x) / p_K(x) ) = θ_i^T x,  for 1 ≤ i ≤ K − 1    (1260)

Also note that the difference of parameter vectors, θ_a − θ_b, is orthogonal to the corresponding decision
boundary. For any x, x′ on the decision boundary defined as the set of points {x : p_a(x) = p_b(x)}, we know that
the vector x − x′ is parallel to the decision boundary (by definition), and can derive

    p_a(x) / p_b(x) = 1 = exp(θ_a^T x − θ_b^T x)  ⟹  θ_a^T x = θ_b^T x    (1261)
    ∴ (θ_a − θ_b)^T (x − x′) = 0    (1262)

and thus θ_a − θ_b is perpendicular to the decision boundary where p_a(x) = p_b(x).
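A small NumPy sketch of eqs 1258–1259 (my own code), exploiting the fact that pinning θ_K = 0 turns them into an ordinary softmax:

    import numpy as np

    def class_probs(theta, x):
        """p_k(x; theta). theta: (K-1, d); class K is the reference class with zero weights."""
        scores = np.concatenate([theta @ x, [0.0]])   # theta_K pinned to zero
        e = np.exp(scores - scores.max())
        return e / e.sum()                            # identical to eqs 1258-1259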

398
Math
Table of Contents Local Written by Brandon McKinzie

Fancy math definitions/concepts for fancy authors who require fancy terminology.
• Support. Sloppy definition you’ll see in most places: The set-theoretic support of a
real-valued function f : X 7→ R is defined as

supp(f) ≜ {x ∈ X : f(x) ≠ 0}

Note that Wikipedia is surprisingly sloppy with how it defines and/or uses support in
various articles. After some digging, I finally found the formal definition for probability
and measure theory:
If X : Ω 7→ R is a random variable on (Ω, F, P ), then the support of X is the smallest
closed set RX ⊂ R such that P (X ∈ RX ) = 1.
• Infimum and Supremum. The fancy-math way of saying minimum and maximum.
Yes, I recognize that these are important in certain (usually rather abstract) settings,
but often in ML it is used when sup means exactly the same thing as max, but the
authors want to look sophisticated. Here I’ll give the formal definition for sup. You have
a partially ordered set274 P , and are for the moment checking out some subset S ⊆ P .
Someone asks you, “hey, give me an upper bound of S.” You just gotta find some b ∈ P
(the larger/global set) that is greater than or equal to every element in S. The person
then comes back and says “ok no, I need the supremum of S.” Now you need to find the
smallest value out of all the possible upper bounds.

Hopefully it is clear why this is only relevant in real-valued cases where the “edges” aren’t
well-defined.
• Probability Measure. Informal definition: a probability distribution275. Formal definition:
a function µ from events to values in [0, 1], where µ(α) = 1 if α = Ω
(the full space) and µ(∅) = 0. Also µ must satisfy the countable additivity property:
µ(∪_i α_i) = Σ_i µ(α_i) for pairwise disjoint sets {α_i}.

274
A partially ordered set (P, ≤) is a set equipped with a relation ≤ that is reflexive, antisymmetric, and transitive; not every pair of elements need be comparable.
275
See this great answer detailing how the difference between “measure” and “distribution” is basically just
context.

399
Linear Algebra. Feeling like I need a quick recap from my adv. linalg course and an area
where I can ramble my interpretations. In what follows, let V and W be vector spaces over
some field F .
Linear Transformation
A function T : V 7→ W is called a linear transformation from V to W if ∀x, y ∈ V and
∀c ∈ F :
• T (x + y) = T (x) + T (y).
• T (cx) = cT (x).

Now suppose that V and W have ordered bases β = {v1 , . . . , vn } and γ = {w1 , . . . , wm },
respectively. Then for each basis vector vj , there exist unique scalars aij ∈ F such that
m
X
T (vj ) = aij wi (1263)
i=1

Remember that each vj and wi are members of a vector space (they are not scalars). And also
be careful to not associate the representation of any vector space element with its coordinate
vector relative to a specific ordered basis, which itself is a different linear transformation from
some V 7→ F n . Keep it abstract.
Matrix Representation of a Linear Transformation
We call the m × n matrix A defined by Aij = aij the matrix representation of T in the
ordered bases β and γ and write A = [T ]γβ . If V = W and β = γ, then A = [T ]β .

Given this definition, I think it’s wise to interpret matrix representations by the column-vector
point of view. Each column vector [T (vj )]γ , read as “the coordinate vector of T (vj ) relative to
ordered basis γ,” tells you how each basis vector vj in domain V gets mapped to a [coordinate]
vector in output space W [relative to a given ordered basis γ]. For some reason, my brain has
always had a hard time with the fact that the matrix row indices i correspond to output space,
while the column indices j represent the input space. The easiest way (I think) to help ground
this the right way is to remember that LA (x) , Ax, i.e. the operator view. At the same time,
notice how the effect of Ax is to take successive linear combinations over each element of x.

I just realized another reason why the interpretation felt backwards to my brain: when we are
taught matrix multiplication, we do the computations Ax in our heads “row by row” along A,
taking inner products of the row with x, so I’ve been taught to think of the rows as the main
“units” of the matrix. I’m not sure how to fix this, but notice that the matrix is basically just
a blueprint/roadmap/whatever you want to call it for taking coordinate vectors in one basis
to coordinate vectors in another basis. It’s really important to remember the coefficients of A
are intimately tied to the input/output bases.

AHA. I’ve been thinking about this all wrong. For the longest time, I’ve been trying to force
an interpretation of matrix multiplication that “feels” like scalar multiplication. I realize now
that this is going about it all the wrong way. Matrix multiplication need only be considered from

400
the lens of a linear transformation. After all, that’s exactly the purpose of matrices anyway!
It’s so glaringly obvious from the paragraphs above, but I guess I never took them seriously
enough. Matrices are simply convenient ways for us to write down linear transformations on
vectors in a given [ordered] basis. The jth column of the matrix defines how the original jth
basis vector is transformed. AHA (again). Now I see why I missed this crucial connection
– everything above focuses on the formal definition of input ordered basis β to output ordered
basis γ, but 99 percent of the time in real life we have either β ⊂ γ or γ ⊂ β (we usually are
mapping from Rn to Rm ). For example, let A ∈ M m×n and x ∈ Rn ; the following is always
true:

Ax = LA (x) (1264)
h i
= LA (ê1 ) LA (ê2 ) · · · LA (ên ) x (1265)
n
X
= LA (êi )xi (1266)
i=1

This viewpoint is painfully obvious to me now, but I guess I hadn’t thought deeply enough
about the implications of the definition of a linear transformation, and I definitely took the
representation of a matrix way too seriously, rather than focusing on its sole purpose:
provide a convenient way to write down linear transformations. For example, the above is
actually a direct consequence of the definition of a L.T. itself:

LA (x) = LA (x1 ê1 + · · · + xn ên ) (1267)


= x1 LA (ê1 ) + · · · + xn LA (ên ) (1268)

Time to really nail in the understanding. I also remember getting screwed up trying to think
about ok, so how do I conceptualize of the ith element of x after the transformation? It’s just
a bunch of summed up goobley-gook!. On one hand, yes that’s true, but focus on the following
before/after representations of x to make your life easier:
    x ≜ Σ_i^n x_i ê_i   --(Ax)-->   L_A(x) ≜ Σ_i^n x_i L_A(ê_i)    (1269)

Matrix-Matrix Multiplication. Continuing with the viewpoint that a matrix is nothing


more than a convenient way to represent a linear transformation, recognize that any matrix
multiplication AB represents a linear transformation itself, defined as T := TA ◦ TB , the
composition of A and B.

401
Matrix Multiplication and Neural Networks. Let’s use the previous interpretations
in the context of neural networks. A basic feedforward network with one hidden layer will
compute outputs o given inputs x, each of which are vectors of possibly different dimension:

o(x) = W (o) φ(x) (1270)


φ(x) = W (h) x (1271)

where W (o) and W (h) are the output and hidden parameter matrices, respectively. We already
know that we can interpret each columns of these matrices as how the input basis vectors get
mapped to the hidden or output space. However, since we usually think of the parameter
matrices as representing the weighted edges of a network, we often think in terms of individual
units. For example, the ith unit of the hidden layer vector h is given by h_i = Σ_{j=1}^{n_in} W_{ij} x_j = ⟨W_{i,:}, x⟩.
One interesting interpretation is that the ith element of h is a projection276 of
the input x onto the ith row of W. This is of course true for any linear transformation; we
can always think of the elements of the transformed vector as the result of projections of the
original vector along a particular direction.

Determinants. det A is the volume of space that a unit [hyper]cube is mapped to. Starting
with the simplest non-trivial case, let A ∈ M^{2×2}, and define A s.t. it simply scales the basis
vectors (zero rotation). In other words, A_{i,j≠i} := 0. In this case, det A = a_{11} a_{22}, which is
the area enclosed by the new scaled basis vectors. Skipping straight to the general case of a
[necessarily square] matrix A ∈ M^{n×n}, using Einstein summation notation and the Levi-Civita
symbol277:

    det A ≜ ε_{i_1,...,i_n} a_{1,i_1} · · · a_{n,i_n} = ε_{i_1,...,i_n} Π_{j=1}^n a_{j,i_j}    (1273)

Consider that if det A = 0, then TA “squishes” the volume of space in such a way that we
essentially lose one or more dimensions. Notice how it only takes one lost dimension, since
the volume of any region in Rn is zero unless there is some amount in all dimensions (e.g. a
cube with zero width has zero volume, regardless of its height/depth). It’s also interesting to
consider the relationships here with the invertible matrix theorem (defined a few paragraphs
below). Having the intuition that determinants can be thought of as a change-in-volume
makes it much more obvious why the equivalence statements of the invertible matrix theorem
are indeed equivalent.

276
This is informally worded. See the footnotes in the dot products section to understand why the element
is technically just the result of a transformation (a projection would require re-mapping the scalar back to the
space that Wi,: lives (input space)).
277
Recall that

    ε_{i_1,...,i_n} ≜ +1 if (i_1, . . . , i_n) is an even perm of (1, . . . , n);  −1 if an odd perm;  0 otherwise    (1272)

Note that this implies equal-to-zero if any of the indices are equal.

402
Dot Products and Projections. First, recall that a projection is defined to be a linear
transformation P that is idempotent (P n = P for n ≥ 1). Also, note that what you generally
think of as a projection is technically an orthogonal projection 278 .

Here we’ll show the intimate link between the dot product and [orthogonal] projection. Let
Pu define the [orthogonal] projection onto some unit vector u ∈ Rn (more generally, we
could project onto a subspace instead of a single vector279 ). We can re-cast this as a linear
transformation Tu : Rn 7→ R (technically not a projection, which would require re-mapping the
output scalar back to Rn ). We interpret the scalar output of Tu (x) as the coordinate along
the line spanned by {u} that input vector x gets mapped to. But wait, didn’t we just talk a
bunch about how to represent/conceptualize of the matrix representation of a transformation?
Yes, we did. Well then, what would the matrix representation of Tu look like? Don’t forget
that we’ve defined ||u|| = 1.
h i
[Tu ]R
Rn = u1 · · · un (1274)
n
X
Tu (x) = ui xi (1275)
i
−→ =x•u (1276)

Furthermore, since linear transformations satisfy T (cx) = cT (x) by definition, the final result
is true even when u is not a unit vector.

Invertibility and Isomorphisms. A function is invertible IFF it is both one-to-one and


onto (i.e. bijective). Recall that rank(T ) is the dimensionality of the range of T , which is the
subspace of W that T maps to.

Let T : V 7→ W be a linear transformation, where V and W are finite-dimensional spaces


of equal dimension. Then T is invertible IFF rank(T ) = dim(V ).

For any invertible functions T and U :


• (T U )−1 = U −1 T −1 . One easy way to show this is

(T U )U −1 T −1 (x) = T (U U −1 )T −1 (x) = T T −1 (x) = x (1277)

• (T −1 )−1 = T . The inverse of T is itself invertible.

278
The more general definition uses wording “projection of vector x along k onto m”, where the distinction
is shown in italics. An orthogonal projection implicitly defines k as its null space; for any α ∈ R, an orthogonal
projection satisfies P (αk) = 0
279
And technically, you don’t project onto a vector, but rather you project onto a line, which is itself technically
a subspace of 1 dimension. Yada yada yada.

403
Inverse of a partitioned matrix
Consider a general partitioned matrix,

    M = ( E  F ; G  H )    (1278)

where E and H are invertible. Then

    M^{−1} = ( (M/H)^{−1}            −(M/H)^{−1} F H^{−1} ;
               −(M/E)^{−1} G E^{−1}    (M/E)^{−1}           )    (1279)

    where M/H ≜ E − F H^{−1} G    (1280)
          M/E ≜ H − G E^{−1} F    (1281)

where M/H denotes the Schur complement of M w.r.t. H. (Note also the identity
(M/H)^{−1} = E^{−1} + E^{−1} F (M/E)^{−1} G E^{−1}.)

Invertible Matrix Theorem


Let A be a square n × n matrix over some field K. The following statements are equivalent
(I’ll group statements that are nearly identical, too):

• The ones I consider useful:


1. A is invertible.
2. A is row-equivalent (and thus column-equivalent) to In .
3. det A 6= 0.
4. rank(A) = n.
5. The columns of A are linearly independent. They span K n . Col(A) = K n .
6. The transformation T (x) = Ax is a bijection from K n to K n .
• The rest:
1. A has n pivot positions.
2. A can be expressed as a finite product of elementary matrices.

Understanding correlation vs independence.


• Two events A and B are independent iff P(A ∩ B) = P(A)P(B).
• Although Indep(X, Y) ⟹ cov(X, Y) = 0, the converse is not true. It's useful to see
that statement more explicitly:

    [ E[XY] = E[X]E[Y] ]  ⇏  [ P(X, Y) = P(X)P(Y) ]    (1282)

404
Example: Uncorrelated ⇏ Independent


To get an intuition for this, I immediately try formalizing the possible cases where this is true. It
seems that symmetry is always present in such cases, and they do seem like edge cases. The simplest
and by far most common example is the case where we have (x, y) coordinates {(−1, 0), (0, −1), (0, 1), (1, 0)}.

It’s obvious that XY equals zero for all of these points, and also that both X and Y are symmetric
about the origin, meaning that E[XY] = 0 = E[X]E[Y]. In other words, they are uncorrelated. The
key insight comes from understanding why this is so: regardless of whether one variable is positive
or negative, the other is zero. I really want to nail this insight down, because it made me realize
I was thinking about correlation wrong – I was thinking about it more as independence in the first place,
and so looking back it’s no wonder I was confused about the difference. You simply cannot think about
correlation from the perspective of a single instance. For example, when I first read this, I thought “well,
if I know X is 1, then I know automatically that Y is zero”, and although that is technically true, that is
not what correlation is about. Rather, correlation is about trends across multiple instances. A more correct
thought would be “Regardless of whether X is positive or negative, Y is zero; therefore positive values of
X are neither positively nor negatively correlated with values of Y.”

Now that we’ve got the thornier part (for me at least) out of the way, recognize that although X and Y
are uncorrelated, they are not independent. This should be fairly obvious, since given either X or Y , we
can say what the other’s value is with higher certainty than otherwise.
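A two-line check of that example (a sketch; numpy assumed):

# The four symmetric points from above: covariance is exactly zero, yet X and Y
# are clearly dependent (knowing one tells you a lot about the other).
import numpy as np

pts = np.array([(-1, 0), (0, -1), (0, 1), (1, 0)], dtype=float)
X, Y = pts[:, 0], pts[:, 1]

print(np.mean(X * Y) - np.mean(X) * np.mean(Y))      # 0.0 -> uncorrelated
print((Y[X == 1] == 0).mean(), (Y == 0).mean())      # 1.0 vs 0.5 -> not independent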

Least-Squares Regression. Note how I emphasized least-squares, since in this section we


measure how good an estimator is based on least-squares loss. Recall that least-squares arises
naturally as a result of modeling Y ∼ N(f(X), ε²) and conducting MLE on the log probability
of the data.
• Linear Least Squares Estimate (LLSE). The LLSE of response Y to X, which we’ll
denote as L[Y | X], is defined as
$L[Y \mid X] \;\triangleq\; \arg\min\; \mathbb{E}_{(X,Y)\sim\mathcal{D}}\big[(Y - \hat{Y}(X))^2\big]$   (1283)

where $\hat{Y}(X) := a + bX$   (1284)

where finding the optimal linear function Ŷ(X) amounts to finding the optimal coeffi-
cients (a, b) over the dataset D. Unfortunately, it seems that the main way of “deriving”
the result is actually to just prove, given the result, that it does indeed minimize the
MSE. So, with that said, we begin with the result:
$L[Y \mid X] = \mathbb{E}[Y] + \dfrac{\mathrm{cov}(X, Y)}{\mathrm{Var}(X)}\,(X - \mathbb{E}[X])$   (1285)

Proof: Eq 1285 Minimizes the MSE

Formally, let $\mathcal{L}(X) \triangleq \{aX + b \mid a, b \in \mathbb{R}\}$. Prove that $\forall\, \hat{Y} \in \mathcal{L}(X)$,

$\mathbb{E}\big[(Y - L[Y \mid X])^2\big] \;\le\; \mathbb{E}\big[(Y - \hat{Y})^2\big]$   (1286)

1. Expand the general form of

$\mathbb{E}\big[(Y - aX - b)^2\big] = \mathbb{E}\Big[\big((Y - L[Y \mid X]) + (L[Y \mid X] - aX - b)\big)^2\Big]$   (1287)

$= \mathbb{E}\big[(Y - L[Y \mid X])^2\big] + 2\,\mathbb{E}\big[(Y - L[Y \mid X])(L[Y \mid X] - aX - b)\big] + \mathbb{E}\big[(L[Y \mid X] - aX - b)^2\big]$   (1288)

Our next goal is to evaluate the middle (cross) term (spoiler alert: it is zero).
2. First, it is easy to show that E[Y − L[Y | X]] = 0 by simple substitution/arithmetic. We
can also show thatᵃ, ∀ aX + b ∈ L(X),

$\mathbb{E}\big[(Y - L[Y \mid X])(aX + b)\big] = 0$

as well.
3. Since L[Y | X] ∈ L(X), it is also true that ∀ Ŷ ∈ L(X), we know (L[Y | X] − Ŷ) ∈ L(X),
too. Therefore, the cross term from step 1 equals zero.
4. We now know that our formula from step 1 can be written

$\mathbb{E}\big[(Y - aX - b)^2\big] = \mathbb{E}\big[(Y - L[Y \mid X])^2\big] + \mathbb{E}\big[(L[Y \mid X] - aX - b)^2\big]$   (1289)

Clearly, this is minimized when aX + b = L[Y | X].


a: Also via simple substitution and using cov(x, y) = E[xy] − E[x]E[y].

TODO: Figure out how this formulation is equivalent to the typical multivariate expres-
sion:
$\hat{y} = \hat{w}_{OLS}^{\,T}\, x$   (1290)
$\hat{w}_{OLS} = (X^T X)^{-1} X^T y$   (1291)
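Not a derivation, but a quick numerical check (a sketch; numpy assumed) that for a single feature plus an intercept column, the normal-equation solution of Eq. 1291 reproduces the coefficients implied by Eq. 1285:

# LLSE (Eq. 1285) vs. OLS normal equations (Eq. 1291) for one feature + intercept.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=500)

# Eq. 1285: slope = cov(X, Y) / Var(X), intercept = E[Y] - slope * E[X]
slope = np.cov(x, y, bias=True)[0, 1] / np.var(x)
intercept = y.mean() - slope * x.mean()

# Eq. 1291 with an explicit column of ones for the bias term
X = np.column_stack([np.ones_like(x), x])
w = np.linalg.solve(X.T @ X, X.T @ y)

print(np.allclose([intercept, slope], w))   # True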

Questions.
• Q: In general, how can one tell if a matrix A has an eigenvalue decomposition? [insert
more conceptual matrix-related questions here . . . ]
• Q: Let A be real-symmetric. What can we say about A?
– Proof that eigendecomposition A = QΛQ^T exists: Wow, this is apparently quite
hard to prove according to many online sources. Guess I don’t feel so bad now that
it wasn’t (and still isn’t) obvious. [Margin note: This is the principal axis theorem:
if A is symmetric, then an orthonormal basis of eigenvectors exists.]
– Eigendecomposition not unique. This is apparently because two or more eigenvec-
tors may have the same eigenvalue.

Stuff I Forget.
• Existence of eigenvalues/eigenvectors. Let A ∈ R^{n×n}. [Margin note: Most info here comes
from chapter 5 of your “Elementary Linear Algebra” textbook (around pg 305).]
– λ is an eigenvalue of A iff it satisfies det(λI − A) = 0. Why? Because it is
equivalent to requiring that (λI − A)x = 0 have a nonzero solution for x.
– The following statements are equivalent:
∗ A is diagonalizable.
∗ A has n linearly independent eigenvectors.
– The eigenspace of A corresponding to λ is the solution space of the homogeneous
system (λI − A)x = 0.
– A has at most n distinct eigenvalues.
• Diagonalizability notes from 5.2 of advanced linear alg. book (261). Recall that A is
defined to be diagonalizable if and only if there exists an ordered basis β for the space
consisting of eigenvectors of A. [Margin note: Recall that a linear operator is a special case
of a linear map where the input space is the same as the output space.]
– If the standard way of finding eigenvalues leads to k distinct λ_i, then the corre-
sponding set of k eigenvectors v_i are guaranteed to be linearly independent (but
might not span the full space).
– If A has n linearly independent eigenvectors, then A is diagonalizable.
– The characteristic polynomial of any diagonalizable linear operator splits (can be
factored into a product of linear factors). The algebraic multiplicity of an eigen-
value λ is the largest positive integer k for which (t − λ)^k is a factor of f(t).
• Expectation of a random vector. Defined as

$\mathbb{E}[x] = \begin{bmatrix} \mathbb{E}[x_1] \\ \vdots \\ \mathbb{E}[x_d] \end{bmatrix}$   (1292)

You can work out that it separates like that (which is not intuitive/immediately obvious
imo) by considering e.g. the case where d = 2. You’ll end up finding that
$\mathbb{E}[x] = \sum_{x_1}\sum_{x_2} x\, p(\mathbf{x} = x)$   (1293)

$= \begin{bmatrix} \mathbb{E}[x_1] \\ \mathbb{E}_{x_1}\big[\mathbb{E}_{x_2 \sim p(x_2 \mid x_1)}[x_2 \mid x_1]\big] \end{bmatrix}$   (1294)

and since we know from CS70 that E [E [Y | X]] = E [Y ], we get the desired result.

Matrix Cookbook
Table of Contents Local Written by Brandon McKinzie

Compiling a list of the most useful equations from the Matrix Cookbook.

$\dfrac{\partial}{\partial X}\,\|X\|_F^2 = \dfrac{\partial}{\partial X}\,\mathrm{Tr}(XX^T) = 2X$   (1295)

[chain rule]  $\dfrac{\partial g(U)}{\partial X_{ij}} = \mathrm{Tr}\!\left[\left(\dfrac{\partial g(U)}{\partial U}\right)^{T} \dfrac{\partial U}{\partial X_{ij}}\right]$   (1296)
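A finite-difference spot check of Eq. 1295 (sketch only; numpy assumed):

# Numerically verify d/dX ||X||_F^2 = 2X (Eq. 1295) by central differences.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(3, 4))
eps = 1e-6

num_grad = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        Xp, Xm = X.copy(), X.copy()
        Xp[i, j] += eps
        Xm[i, j] -= eps
        num_grad[i, j] = (np.sum(Xp ** 2) - np.sum(Xm ** 2)) / (2 * eps)

print(np.allclose(num_grad, 2 * X, atol=1e-5))   # True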

Main Tasks in NLP
Table of Contents Local Written by Brandon McKinzie

Going to start compiling a list of the main tasks in NLP (alphabetized). Note that NLP-
Progress, a site dedicated to this purpose, is much more detailed. I’m going for short-and-
sweet here.

Constituency Parsing.
• Task: Generate parse tree of a sentence. Nodes are typically labeled by parts of speech
and/or chunks.
– A constituency parse tree breaks a text into sub-phrases, or constituents. Non-
terminals in the tree are types of phrases, the terminals are the words in the sentence.

Coreference Resolution.
• Task: clustering mentions in text that refer to the same underlying real world entities.
• SOTA: End-to-end Neural Coreference Resolution.

Dependency Parsing.
• Task: Given a sentence, generate its dependency tree (DT). A DT is a labeled directed
tree whose nodes are individual words, and whose edges are directed arcs labeled with
dependency types.
– A dependency parser analyzes the grammatical structure of a sentence, establishing
relationships between “head” words and words which modify those heads.
• SOTA: Deep Biaffine Attention for Neural Dependency Parsing.
Related Tasks:
– Constituency parsing. See this great wiki explanation of dependency vs constituency.

Information Extraction.
• Task: Given a (typically long) portion of raw text, recover information about pre-
specified relations, entities, events, etc.

Language Modeling.
• Task: Learning the probability distribution over text sequences. Often used for predict-
ing the next word in a sequence, given the K previous words.
• SOTA: ELMo.

Machine Translation.

Semantic Parsing.
• Task: Translating natural language into a formal meaning representation on which a
machine can act.

Semantic Role Labeling.


• Task: Given a sentence, extract the predicates280 and their respective arguments.
• Historical Approaches.
– CCG Semantic Parsing. Zettlemoyer & Collins 2005, 2007.
– Seq2seq. Dong & Lapata, 2016.

Sentiment Analysis.
• Task: Determining whether a piece of text is positive, negative, or neutral.
• SOTA: Biattentive classification network (BCN) from Learned in Translation: Contex-
tualized Word Vectors (the CoVe paper) with ELMo embeddings.

Summarization.

Textual Entailment.
• Task: Given a premise, determine whether a proposed hypothesis is true.
• SOTA: Enhanced Sequential Inference (ESIM) model from Enhanced LSTM for Nat-
ural Language Inference with ELMo embeddings.

280: The predicate of a sentence mostly corresponds to the main verb and any auxiliaries that accompany the
main verb, whereas the arguments of that predicate (e.g. the subject and object noun phrases) are outside the
predicate.

Topic Modeling.

Question Answering. Also called machine reading comprehension.


• Task: Given a paragraph of text, “read” it and then answer questions pertaining to the
text.
• Dataset: The main benchmark dataset is the Stanford Question Answering Dataset
(SQuAD). Each sample has the form ({question, paragraph}, answer), where the
answer is a subsequence found somewhere in the paragraph (i.e. this is an extractive
task, not abstractive).
• SOTA: Improved versions of the Bidirectional Attention Flow (BiDAF) model, with
ELMo embeddings.
• Related tasks:

Word Sense Disambiguation.


• Task: Associating words in context with their most suitable entry in a pre-defined sense
inventory.

Misc. Topics
Table of Contents Local Written by Brandon McKinzie

14.5.1 BLEU Score

BiLingual Evaluation Understudy. For scoring machine-generated translations when we have


access to one or more reference translations.
• Unigram precision: Really naive and basically useless version:
$P = \dfrac{\text{num pred words that also appear somewhere in ref words}}{\text{total num pred words}}$   (1297)

It’s important to emphasize how ridiculous this really is. It literally means that we walk
along each word in the prediction and ask “is this word somewhere in any of the reference
translations?” and if that answer is “yes”, we +1 to the numerator. Period.
• Modified unigram precision: actually considering how many times we’ve mentioned a
given word w when incorporating it into the precision calculation. Now, when walking
along a sentence, we add to the aforementioned question, “. . . and have I seen it less
than N times already?” where N = [max(count(sent, w)) for sent in refs]. This
means our numerator can now be at most N for any given word.
• Generalize to n-grams. Below is the formula for the BLEU score on n-grams only:

$p_n(\hat{y}) = \dfrac{\sum_{\text{ngram} \in \hat{y}} \mathrm{Count}_{\mathrm{clip}}(\text{ngram}, \text{refs})}{\sum_{\text{ngram} \in \hat{y}} \mathrm{Count}(\text{ngram})}$   (1298)

• Combined BLEU score.

$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left\{\dfrac{1}{4}\sum_{n=1}^{4} \log p_n\right\}$   (1299)

$\mathrm{BP} = \begin{cases} 1 & \text{len(pred)} > \text{len(ref)} \\ \exp\{1 - \text{len(ref)}/\text{len(pred)}\} & \text{otherwise} \end{cases}$   (1300)

where BP is the brevity penalty.
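A toy implementation of the pieces above (a rough sketch only; real implementations such as sacrebleu or nltk handle multiple references, length bookkeeping, smoothing, and tokenization more carefully):

from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(pred, refs, n):
    # Eq. 1298: clipped n-gram counts over the prediction's n-grams
    pred_counts = Counter(ngrams(pred, n))
    if not pred_counts:
        return 0.0
    max_ref = Counter()
    for ref in refs:
        for g, c in Counter(ngrams(ref, n)).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in pred_counts.items())
    return clipped / sum(pred_counts.values())

def bleu(pred, refs, N=4):
    # Eqs. 1299-1300 (shortest reference length used for BP, simplified)
    ps = [modified_precision(pred, refs, n) for n in range(1, N + 1)]
    if min(ps) == 0:
        return 0.0
    ref_len = min(len(r) for r in refs)
    bp = 1.0 if len(pred) > ref_len else exp(1 - ref_len / len(pred))
    return bp * exp(sum(log(p) for p in ps) / N)

pred = "the cat sat on the mat".split()
refs = ["the cat sat on the red mat".split()]
print(bleu(pred, refs))   # roughly 0.67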

14.5.2 Connectionist Temporal Classification (CTC)

Approach for mapping input sequences X = {x1 , . . . , xT } to label sequences Y = {y1 , . . . , yU },


where the lengths may vary (T ≠ U). First, we need to define the meaning of an alignment
between input sequence X and label sequence Y. Most generally, an alignment is a composition
of one or more functions that accepts input X and ultimately maps to output Y. Take, for
example, the label sequence Y = {h, e, l, l, o} and some input sequence (e.g. raw audio)
X = {x1 , . . . , x12 }. CTC places the following constraints on the [first function of the] alignment
sequence:
1. It must be the same length as the input sequence X.
2. It has the same vocabulary as Y, plus an additional token ϵ to denote blanks.
3. At position i, it either (a) repeats the aligned token at i − 1, (b) assigns the empty token
ϵ, or (c) assigns the next letter of the label sequence.
For our example, we could have an aligned sequence A = {h, h, e, ϵ, ϵ, l, l, l, ϵ, l, l, o}. Then we
apply the following two steps (can interpret as functions) to map from A to Y :
1. Merge any repeated [contiguous] characters.
2. Remove any ϵ tokens.
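Those two steps in code (a minimal sketch; "eps" stands in for the blank token ϵ):

def ctc_collapse(alignment, blank="eps"):
    # Merge repeated contiguous tokens, then drop blanks (both handled in one pass:
    # a token is kept only if it differs from its predecessor and is not the blank).
    out, prev = [], None
    for a in alignment:
        if a != prev and a != blank:
            out.append(a)
        prev = a
    return out

A = ["h", "h", "e", "eps", "eps", "l", "l", "l", "eps", "l", "l", "o"]
print(ctc_collapse(A))   # ['h', 'e', 'l', 'l', 'o']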
When you hear someone say “the CTC loss,” they usually mean “MLE using a CTC posterior.”
In other words, there is no “CTC loss” function, but rather there is the standard maximum
likelihood objective, but we use a particular form for the posterior p(Y | X) over possible
output labels Y given raw input sequence X:

$p(Y \mid X) = \sum_{A \in \mathcal{A}_{X,Y}} \prod_{t=1}^{T} p_t(a_t \mid X)$   (1301)

where A is one of the valid alignments from $\mathcal{A}_{X,Y}$, the full set of valid alignments from X
to a given output sequence Y. The per-timestep probabilities p_t(a_t | X) can be given by, for
example, an RNN.

Number of Valid Alignments

Given X of length T and Y of length U ≤ T (and no repeating letters), how many valid align-
ments exist?

The differences between alignments fall under two categories:


1. Indices where we transition from one label to the next.
2. Indices where we insert the blank token, .

Stated even simpler, the alignments differ first and foremost by which elements of X are “extra” tokens, where
I’m using “extra” to mean either blank or repeat token. Given a set of T tokens, there are $\binom{T}{T-U}$ different ways
to assign T − U of them as “extra.” The tricky part is that we can’t just randomly decide to repeat or insert a
blank, since a sequence of one or more blanks is always followed by a transition to the next letter, by definition.
And remember, we have defined Y to have no repeated [contiguous] labels.

Apparently, the answer is $\binom{T+U}{T-U}$ total valid alignments.

Computing forward probabilities α_t(s), defined as the probability of arriving at [a prefix of the]
augmented label sequence ℓ'_{⟨1...s⟩} given unmerged alignments up to step t. There are two cases
to consider.
1. (1.1) The augmented label at step s, ℓ'_s, is the blank token ϵ. Remember, ϵ occurs at
every other position in the augmented sequence ℓ'. At the previous RNN output (time
t − 1), we could’ve emitted either a blank token ϵ or the previous token in the augmented
label sequence, ℓ'_{s−1}. In other words,

$\alpha_t(s) = y^{t}_{\ell'_s}\,\big(\alpha_{t-1}(s) + \alpha_{t-1}(s-1)\big)$   (1302)

2. (1.2) The augmented label at step s, ℓ'_s, is the same augmented label as at step s − 2.
This occurs when the [not augmented] label sequence has repeated labels next to each
other.

$\alpha_t(s) = y^{t}_{\ell'_s}\,\big(\alpha_{t-1}(s) + \alpha_{t-1}(s-1)\big), \qquad \ell'_s = \ell'_{s-2}$   (1303)

In this situation, α_{t−1}(s) corresponds to us just emitting the same token as we did at
t − 1 or emitting a blank token ϵ, and α_{t−1}(s − 1) corresponds to a transition to/from ϵ
and a label.
3. (2) The augmented label at step s − 1, ℓ'_{s−1}, is the blank token ϵ between unique characters.
In addition to the two α_{t−1} terms from before, we now also must consider the possibility
that our RNN emitted ℓ'_{s−2} at the previous time (t − 1) and then emitted ℓ'_s immediately
after at time t.
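Putting the cases together, the full forward recursion (this is the standard recursion from the CTC literature, stated here for reference rather than derived above) is:

$\alpha_t(s) = \begin{cases} y^{t}_{\ell'_s}\,\big(\alpha_{t-1}(s) + \alpha_{t-1}(s-1)\big) & \text{if } \ell'_s = \epsilon \text{ or } \ell'_s = \ell'_{s-2} \\ y^{t}_{\ell'_s}\,\big(\alpha_{t-1}(s) + \alpha_{t-1}(s-1) + \alpha_{t-1}(s-2)\big) & \text{otherwise} \end{cases}$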

14.5.3 Perplexity

Per Wikipedia:
In information theory, perplexity is a measurement of how well a probability distribution
or probability model predicts a sample.

The perplexity, 𝒫, of a discrete probability distribution p over word sequences W = {w_1, . . . , w_T} =
w_{⟨1...T⟩} of length T is defined as:

$\mathcal{P}(p) = 2^{-\mathbb{E}_{x\sim p}[\lg p(x)]} = 2^{H(p)}$   (1304)

$= 2^{-\sum_{w_{\langle 1\ldots T\rangle}} p(w_{\langle 1\ldots T\rangle}) \lg p(w_{\langle 1\ldots T\rangle})}$   [theory]   (1305)

$\approx 2^{-\frac{1}{N}\sum_{w_{\langle 1\ldots T\rangle} \in \mathcal{T}} \lg q(w_{\langle 1\ldots T\rangle})}$   [empirical]   (1306)

where H is entropy (in bits). It’s important to note that, in practice, we are never able to use
the theoretical version since we don’t know p exactly (we are usually trying to estimate it) –
instead of H(p) we thus usually think in terms of H(p, q), the cross entropy (see footnote 281). The empirical
definition applies when we have N samples in some test set 𝒯, and a model q that we want to use to
approximate the true distribution p.

In NLP, it is more common to want the per-word perplexity of a language model. We typically
do this by flattening out a sequence of words in some test set containing M words total and
simply compute

$\mathcal{P} = 2^{-\frac{1}{M} \lg p(\{w_1, \ldots, w_M\})}$   (1307)

$= p(\{w_1, \ldots, w_M\})^{-\frac{1}{M}}$   (1308)

In other words, NLP nearly always defines perplexity as the inverse probability of the test
set, normalized by number of words. So, why is this valid? We are implicitly assuming that
language sources are ergodic:
Ergodic
A random process is ergodic if its (asymptotic) time average is the same as its expectation
value over all possible states (w.r.t the specified probability distribution).
Informally, this means that the system eventually reaches all states, and such that the
probability of observing it in state s is p(s), where p is the true generating distribution.

281: Recall the relationship between entropy H(p) and cross entropy H(p, q): H(p, q) = H(p) + D_KL(p‖q).

In the per-word NLP case, this means we can assume that

$\lim_{m\to\infty} \frac{1}{m}\,\mathbb{E}\big[\lg p(\{w_1, \ldots, w_m\})\big] = \lim_{m\to\infty} \frac{1}{m}\,\lg p(\{w_1, \ldots, w_m\})$   (1309)

where the sequence on the RHS is any sample from p (see footnote 282).

Intuition. Ok, now that we’ve got definitions out of the way, what does it actually mean?
First consider some limiting cases. If the distribution p is uniform over N possible outcomes,
then 𝒫(p) = 2^{lg N} = N. Since the uniform distribution has the highest possible entropy, N
is also the largest possible value for perplexity of a discrete distribution p over N possible
outcomes.

Consider the interpretation of the cross entropy loss as the negative log-likelihood:

$NLL(p_{\text{data}}) = -\frac{1}{M}\sum_{i=1}^{M} \log p(w = w_i) = \mathbb{E}_{w_i \sim p_{\text{data}}}\left[\log \frac{1}{p(w)}\right]$   (1310)

we see that NLL (and thus 𝒫 = 2^{NLL} with base-2 logs, or exp(NLL) with natural logs) decreases as our
model assigns higher probabilities to samples drawn from p_data. Better models of p_data are less surprised by samples from p_data.
If we use the typical interpretation of entropy as the number of bits needed (on average) to
represent a sample from p, then the perplexity can be interpreted as the total number of pos-
sible results (on average) when drawing a sample from p.

In the case of language modeling, this represents the total number of reasonable next-word
predictions for wt+1 given some context w1 , . . . wt . As our model assigns higher probabilities
to the true samples in pdata , the number of bits required to specify each word, on average,
becomes smaller. Therefore, you can roughly think of per-word perplexity as telling you the
number of possible choices, on average, your model considers uniformly at random at a given
step. For example, P = 42 could be interpreted loosely as “to predict the next word out of
some vocabulary V , my model can narrow it down on average to about 42 choices, and chooses
uniformly at random from that subset”, where typically |V | >> 42.
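A tiny numerical illustration of that reading of per-word perplexity (sketch only; numpy assumed, and the vocabulary sizes are made up):

import numpy as np

def per_word_perplexity(word_log2_probs):
    return 2 ** (-np.mean(word_log2_probs))

vocab_size = 1000
n_test_words = 10_000

# A uniform model over the whole vocab is maximally "perplexed": P = |V|
uniform_lp = np.full(n_test_words, np.log2(1 / vocab_size))
print(per_word_perplexity(uniform_lp))   # 1000.0

# A model that has learned only 50 word types actually occur (uniform over those)
better_lp = np.full(n_test_words, np.log2(1 / 50))
print(per_word_perplexity(better_lp))    # 50.0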

282: Something feels off here. I’m synthesizing what I’m reading from Wikipedia and this source from Berkeley,
but I can’t fix the sloppy tone of the wording.

14.5.4 Byte Pair Encoding

14.5.5 Grammars

In formal language theory, a formal grammar is a set of production rules for strings in a
formal language. The rules describe how to form strings from the language’s alphabet that
are valid according to the language’s syntax. A grammar does not describe the meaning of the
strings or what can be done with them in whatever context—only their form.
• Regular Grammar: no rule has more than one nonterminal in its right-hand side,
and each of these nonterminals is at the same end of the right-hand side. Every regular
grammar corresponds directly to a nondeterministic finite automaton.
• A context-free grammar (CFG) is a formal grammar that consists of:
– Terminal symbols: characters that appear in the strings generated by the grammar.
– Nonterminal symbols: placeholders for patterns of terminal symbols that can be generated by
the nonterminal symbols.
– Productions: rules for replacing (or rewriting) nonterminal symbols (on the LHS) in a string with
other nonterminal or terminal symbols (on the RHS), which can be applied regardless of context.
– Start symbol: a special nonterminal symbol that appears in the initial string generated by the
grammar.
To generate a string of terminal symbols from a CFG, we:
1. Begin with a string consisting of the start symbol;
2. Apply one of the productions with the start symbol on the left hand side, replacing the start symbol
with the right hand side of the production;
3. Repeat the process of selecting nonterminal symbols in the string, and replacing them with the
right hand side of some corresponding production, until all nonterminals have been replaced by
terminal symbols.
• A Probabilistic CFG extends CFGs the same way as HMMs extend regular grammars,
by defining the set P of probabilities on production rules.
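A toy grammar and the generation procedure just described (a rough sketch; the grammar itself is made up):

import random

# Nonterminals are the dict keys; anything not in the dict is a terminal.
grammar = {
    "S":  [["NP", "VP"]],
    "NP": [["the", "N"]],
    "VP": [["V", "NP"]],
    "N":  [["dog"], ["cat"]],
    "V":  [["chased"], ["saw"]],
}

def generate(symbol="S"):
    if symbol not in grammar:                    # terminal: emit as-is
        return [symbol]
    production = random.choice(grammar[symbol])  # rewrite the nonterminal via one of its productions
    out = []
    for sym in production:
        out.extend(generate(sym))
    return out

print(" ".join(generate()))   # e.g. "the dog chased the cat"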

14.5.6 Bloom Filter

Data structure for querying whether a data point is a member of some set. It returns either
“no” or “maybe”. It is implemented as a bit vector. Each member of the set is passed through
k hash functions. Each hash function maps an element to an integer index. For each member,
we set the k output indices of the hash functions to 1 in our bit vector. To answer if some
data point x is in the set, we pass x through the k hash functions, which gives us k indices. If
all k indices have their bit value set to 1, the answer is “maybe”, otherwise (if any bit value is
0) the answer is “no”.
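A minimal sketch of that description (the salted-sha256 "hash functions" are just for illustration; real implementations pick m and k from the desired false-positive rate):

import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _indices(self, item):
        # k "hash functions" = one hash with k different salts
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx] = 1

    def query(self, item):
        # "maybe" only if all k bits are set; any 0 bit means a definite "no"
        return "maybe" if all(self.bits[i] for i in self._indices(item)) else "no"

bf = BloomFilter()
bf.add("alice")
print(bf.query("alice"), bf.query("bob"))   # maybe no   (false positives are possible)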

14.5.7 Distributed Training

Asynchronous SGD.
$W_{i+1} = W_i - \frac{\alpha}{N_x}\sum_{j=1}^{N_x} \frac{\partial L(x^{(j)})}{\partial W_i}$   (1311)

[SyncSGD]  $W_{i+1} = W_i - \lambda \sum_{j=1}^{N_w} \alpha \sum_{k=1}^{N_x^{(j)}} \frac{\partial L(x^{(k)})}{\partial W_i}$   (1312)

where N_x is the number of data samples (and, in the synchronous case, N_w is the number of workers and N_x^{(j)} the number of samples processed by worker j).

In asynchronous SGD, we just apply the gradient updates to a global version of the parameters
whenever they are available. In practice, this can result in stale gradients, which happens
when a worker takes a long time to compute some gradient step, while the master version of
the parameters has been updated many times. This results in the master computing an update
like $W_{t+1} = W_t - \lambda \Delta W_{t-D}$ for larger-than-desired values of D (the delay in number of updates).

14.5.8 Traditional Language Modeling

Some quick terminology/definitions since my brain forgets things.


• Backoff : when we want to estimate e.g. p(w1 , w2 , w3 ) but we’ve never seen the sequence
w1 , w2 , w3 . We can instead backoff to bigrams if we have seen e.g. the sequences w1 , w2
and w2 , w3 . More generally, if we have many N-gram LMs with different values of N, we
backoff to the highest order LM that contains the N -gram we are querying for.
– Example failure mode of backoff: we typically have to backoff for unusual sequences
of words (by definition). The lower order N gram model could drastically over-
estimate the backoff probabilities283 .
• Interpolation: Using a combination of N-gram LMs with different values of N as an
attempt to utilize the strengths of each and mitigate their weaknesses.
• Kneser-Ney: (link)

$P_{KN}(w_t \mid w_{t-1}) = \dfrac{\max\big(c(w_{t-1}, w_t) - \delta,\; 0\big)}{c(w_{t-1})} \;+\; \lambda\,\dfrac{|\{w_{t-1} : c(w_{t-1}, w_t) > 0\}|}{|\{(w_{t'-1}, w_{t'}) : c(w_{t'-1}, w_{t'}) > 0\}|}$   (1313)

Click this link to see a really good overview of the terms above and more.
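A tiny sketch of interpolation on the "New York" example from the footnote below (the corpus and the λ value are made up; in practice λ is tuned on held-out data):

from collections import Counter

corpus = "new york is a city . new york is big .".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_unigram(w):
    return unigrams[w] / len(corpus)

def p_bigram(w, prev):
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

def p_interp(w, prev, lam=0.7):
    # interpolate bigram and unigram estimates
    return lam * p_bigram(w, prev) + (1 - lam) * p_unigram(w)

print(p_unigram("york"))         # ~0.18: decent on its own
print(p_interp("york", "new"))   # ~0.75: much higher given the context "new"
print(p_interp("york", "city"))  # ~0.05: a context that never precedes "york"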

283: A good example is how a unigram model assigns a decent probability to "York", but a human (or a bigram
model) could tell you that it is nearly certain that "New" preceded it. The backoff model would tend to
overestimate p(York) since it has no contextual information.
