
Polynomial Curve Fitting and Linear Regression — Bishop Style

Problem Setup
Let the training dataset consist of $ N $ data points $ \mathcal{D} = \{ (x_n, t_n) \}_{n=1}^N $,
where:
$ x_n \in \mathbb{R} $ is the input variable (scalar, but can be generalized to vectors).
$ t_n \in \mathbb{R} $ is the target variable.
We seek a model (function) $ y(x, \mathbf{w}) $ that predicts $ t $ given $ x $:

$$ y(x, \mathbf{w}) = \sum_{j=0}^{M} w_j x^j = w_0 + w_1 x + \cdots + w_M x^M $$

where $ M $ is the polynomial degree, and $ \mathbf{w} = [w_0, w_1, \ldots, w_M]^T $ is the vector of coefficients. [1] [2]

Design Matrix and Linear Regression


Note: Although nonlinear in $ x $, this model is linear in the coefficients $ \mathbf{w} $. For all $ N $ data points, define the design (Vandermonde) matrix $ \Phi \in \mathbb{R}^{N \times (M+1)} $:

$$ \Phi_{nj} = x_n^j, \qquad n = 1, \ldots, N, \quad j = 0, \ldots, M $$

Thus,

$$ \mathbf{y} = \Phi \mathbf{w} $$

where $ \mathbf{y} \in \mathbb{R}^N $ is the vector of model predictions at the input points $ x_n $, $ \mathbf{y} = [y(x_1, \mathbf{w}), \ldots, y(x_N, \mathbf{w})]^T $. [3]

Least Squares Estimation


Assume the target values are noisy:

$$ t_n = y(x_n, \mathbf{w}) + \epsilon_n $$

where $ \epsilon_n \sim \mathcal{N}(0, \sigma^2) $ is i.i.d. Gaussian noise.

Objective: Minimize the sum-of-squares error:

$$ E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \big( y(x_n, \mathbf{w}) - t_n \big)^2 = \frac{1}{2} \| \Phi \mathbf{w} - \mathbf{t} \|^2 $$

Setting the derivative to zero and solving for $ \mathbf{w} $ gives the normal equations:

$$ \mathbf{w}_{ML} = (\Phi^\top \Phi)^{-1} \Phi^\top \mathbf{t} $$

where $ \mathbf{w}_{ML} $ is the maximum likelihood solution under the Gaussian noise model. [2] [3]
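As a concrete illustration, the design matrix and the maximum likelihood solution take only a few lines of NumPy. This is a minimal sketch; the data-generating function, noise level, and degree are illustrative assumptions, not from the text:

```python
import numpy as np

# Illustrative synthetic data: noisy samples of sin(2*pi*x)
rng = np.random.default_rng(0)
N, M = 10, 3                               # sample size and polynomial degree (assumed)
x = np.sort(rng.uniform(0, 1, N))
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, N)

# Design (Vandermonde) matrix: Phi[n, j] = x_n^j
Phi = np.vander(x, M + 1, increasing=True)

# Maximum likelihood (least squares) solution: w_ML = (Phi^T Phi)^{-1} Phi^T t
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

y = Phi @ w_ml                             # predictions y = Phi w
print("w_ML =", w_ml)
print("training SSE =", 0.5 * np.sum((y - t) ** 2))
```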

Overfitting, Underfitting, and Model Selection


Underfitting: $ M $ too small ($ M = 0, 1 $); insufficient model capacity.
Overfitting: $ M $ too large ($ M \gg 1 $); fits noise, poor generalization.
Optimal: Intermediate $ M $; balances bias and variance.
To assess generalization, use a separate test set. The mean squared error on test data reveals
generalization ability. [4] [1]

Regularization (Ridge Regression)


To prevent overfitting, add a regularization (weight decay) term:

$$ E_\lambda(\mathbf{w}) = \frac{1}{2} \| \Phi \mathbf{w} - \mathbf{t} \|^2 + \frac{\lambda}{2} \| \mathbf{w} \|^2 $$

The solution:

$$ \mathbf{w}^* = (\Phi^\top \Phi + \lambda \mathbf{I})^{-1} \Phi^\top \mathbf{t} $$

can be interpreted as the maximum a posteriori (MAP) estimate under a Gaussian prior on the weights. [3]
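The closed-form ridge solution is a one-line change to the least squares fit above; a sketch, reusing `Phi` and `t` from the previous snippet (the value of `lam` is an illustrative assumption):

```python
lam = 1e-3                                  # regularization strength (illustrative value)
I = np.eye(Phi.shape[1])
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * I, Phi.T @ t)
print("w_ridge =", w_ridge)                 # coefficients are shrunk relative to w_ML
```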

Polynomial Curve Fitting vs. Linear Regression


Linear Regression conventionally means fitting $ y = w_0 + w_1 x $ (an order-1 polynomial).
Polynomial Regression is also linear regression in disguise, since it is linear in the weights $ \mathbf{w} $, just with nonlinear features $ x^j $.
Feature mapping:

$$ \boldsymbol{\phi}(x) = (1, x, x^2, \ldots, x^M)^\top, \qquad y(x, \mathbf{w}) = \mathbf{w}^\top \boldsymbol{\phi}(x) $$

All the machinery of linear regression applies. [3]

Summary Table: Key Mathematical Objects


| Symbol | Description |
|---|---|
| $ \mathbf{x} $ | Input data (vector) |
| $ \mathbf{t} $ | Target data (vector) |
| $ \mathbf{w} $ | Polynomial coefficients |
| $ \Phi $ | Design matrix (Vandermonde) |
| $ y(x, \mathbf{w}) $ | Model prediction |
| $ E(\mathbf{w}) $ | Sum-of-squares error |
| $ E_\lambda(\mathbf{w}) $ | Regularized error |
| $ \lambda $ | Regularization parameter |

Remarks and Further Study


Model selection (choice of $ M $ and $ \lambda $) can be performed via cross-validation.
Bayesian approaches extend this framework to incorporate weight uncertainties.
For multivariate $ x $: include all monomials, e.g. $ x_1^2 x_2 $.

References:
(Style and notation closely follow Bishop’s "Pattern Recognition and Machine Learning") [4] [1] [2]
[3]

1. [Link]
2. [Link]
3. [Link]
4. [Link] (RL_Codes/blob/master/Machine_Learning/Bishop/Polynomial Curve)
Full Mathematical Formulation of Overfitting & Regularized Regression

Overfitting: Mathematical Characterization


Overfitting occurs when a model fits the training data—including its noise—too precisely, rather
than extracting the underlying relationship. Mathematically, this typically happens when the
model has too many parameters relative to the size and complexity of the data, leading to high
variance and low generalization ability.
Given data points $ \mathcal{D} = \{ (x_n, t_n) \}_{n=1}^N $, suppose we fit a polynomial of degree $ M $, with prediction:

$$ y(x_n, \mathbf{w}) = \sum_{j=0}^{M} w_j x_n^j $$

The sum-of-squares error to minimize is:

$$ E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \big( y(x_n, \mathbf{w}) - t_n \big)^2 $$

If $ M $ is large (as $ M + 1 \to N $), the polynomial can interpolate the training points so that $ E(\mathbf{w}) = 0 $, but such a solution typically "memorizes" the noise, resulting in poor test/validation performance. [1] [2] [3]

Regularized Regression: Complete Formulation


To prevent overfitting, we add a regularization term to penalize complex/large coefficients, leading to the regularized objective function:

$$ E_\lambda(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \big( y(x_n, \mathbf{w}) - t_n \big)^2 + \frac{\lambda}{2} \| \mathbf{w} \|^2 $$

or, using matrix notation:

$$ E_\lambda(\mathbf{w}) = \frac{1}{2} \| \Phi \mathbf{w} - \mathbf{t} \|^2 + \frac{\lambda}{2} \| \mathbf{w} \|^2 $$

where:
$ \Phi $ is the design (Vandermonde) matrix, $ \Phi_{nj} = x_n^j $
$ \lambda \geq 0 $ is the regularization parameter (strength) controlling the balance between fit and complexity.
Solution (closed form):

$$ \mathbf{w}^* = (\Phi^\top \Phi + \lambda \mathbf{I})^{-1} \Phi^\top \mathbf{t} $$

This is sometimes called ridge regression (L2 regularization). The penalty $ \lambda \| \mathbf{w} \|^2 $ shrinks the magnitude of the coefficients, discouraging overly complex models. [4] [5] [2]

Key features:
If $ \lambda = 0 $: Standard least-squares regression, possible overfitting.
If $ \lambda \to \infty $: Coefficients shrink towards zero, possible underfitting.
Optimizing $ \lambda $: Cross-validation typically used for selection.
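A minimal sketch of selecting $ \lambda $ by cross-validation with scikit-learn (the toy data, polynomial degree, and search grid are illustrative assumptions; sklearn calls $\lambda$ `alpha`):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Illustrative data (assumed, not from the text)
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, (30, 1))
t = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.2, 30)

# Degree-9 polynomial features + ridge; pick lambda (alpha) by 5-fold CV
model = make_pipeline(PolynomialFeatures(degree=9), Ridge())
search = GridSearchCV(model, {"ridge__alpha": np.logspace(-8, 2, 11)}, cv=5)
search.fit(x, t)
print("best alpha:", search.best_params_["ridge__alpha"])
```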

Interplay: Bias-Variance Tradeoff


Regularization induces bias (simplifies the model) but reduces variance (model's sensitivity to
data fluctuations). The optimal regime balances these: [2] [6] [7]
Overfitting: Low bias, high variance (complex model, fits noise).
Underfitting: High bias, low variance (too simple, misses structure).
Good generalization: Appropriate balance.

Summary Table

| Phenomenon | Formulation | Implication |
|---|---|---|
| Overfitting | $ E(\mathbf{w}) = \frac{1}{2} \| \Phi \mathbf{w} - \mathbf{t} \|^2 $, high $ M $ | Fits train/noise, poor test error |
| Regularization | $ E_\lambda(\mathbf{w}) = \frac{1}{2} \| \Phi \mathbf{w} - \mathbf{t} \|^2 + \frac{\lambda}{2} \| \mathbf{w} \|^2 $ | Penalizes complexity; curbs overfitting |

Reference Notation
The notation and structure closely follow Bishop's "Pattern Recognition and Machine Learning" and standard ML literature. [4] [8] [2]
Notes:
L1 regularization (lasso): $ \lambda \sum_j |w_j| $ — promotes sparsity.
L2 regularization (ridge): $ \lambda \sum_j w_j^2 $ — shrinks all weights. [5] [2]
Extensions: Elastic Net, Bayesian Ridge.
Conclusion:
Regularized regression mathematically augments the error function with a complexity penalty.
The balance of loss and regularization parameter $ \lambda $ determines the model's
generalization capacity and prevents overfitting by controlling coefficient values.

1. [Link]
2. [Link]
3. [Link]
4. [Link]
5. [Link]
6. [Link]
7. [Link]
8. [Link]
Curse of Dimensionality
The curse of dimensionality refers to the host of problems that arise when working with data in
very high-dimensional feature spaces. [1] [2] [3]
As the number of dimensions (features) increases, data points become exponentially more
sparse, which makes generalization and pattern discovery difficult for learning algorithms.
Key impacts:
Exponential growth of sample requirements: To represent all possible combinations of
features well, you need exponentially more samples.
Overfitting risk: Models become complex and start to memorize training data noise
rather than general patterns, harming test performance.
Computational inefficiency: Training and processing require much more time and
resources.
Distance metrics break down: In high dimensions, all data points tend to be nearly
equidistant, impeding methods like k-NN and clustering.
Mitigation: Use dimensionality reduction (e.g., PCA), feature selection, and regularization to
reduce the impact and make modeling feasible. [2] [3] [1]
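The claim that distance metrics break down can be checked numerically: as dimension grows, the relative gap between the nearest and farthest neighbor shrinks. A minimal sketch (dimensions and sample size are illustrative choices):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(200, d))   # 200 random points in the unit hypercube
    dist = pdist(X)                  # all pairwise Euclidean distances
    # As d grows, the relative spread of distances shrinks:
    # nearest and farthest neighbors become nearly equidistant.
    print(f"d={d:5d}  relative contrast = {(dist.max() - dist.min()) / dist.min():.3f}")
```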

MAP Hypothesis in Model Selection


MAP (Maximum A Posteriori) hypothesis in model selection is a Bayesian framework for
choosing models based on the posterior probability of a hypothesis (model) given the observed
data.
Given models $ M_i $, data $ D $, and prior information, Bayes' theorem gives the posterior

$$ P(M_i \mid D) = \frac{P(D \mid M_i)\, P(M_i)}{P(D)} $$

This incorporates both the likelihood of the data given the model and the prior probability of the model, thus:

$$ M_{MAP} = \arg\max_{M_i} \; P(D \mid M_i)\, P(M_i) $$

In practice, MAP model selection often leads to minimizing an objective that combines the fit (e.g., sum-of-squares error) with a complexity penalty derived from the prior. This results in a regularized regression form, where more complex models are penalized. [4]
Regularization as prior: In linear regression, the MAP estimate corresponds to ridge
regression when using a Gaussian prior over weights. This shrinks parameters, prevents
overfitting, and thus affects model selection. [5]
MDL (Minimum Description Length) Principle
Minimum Description Length (MDL) is an information-theoretic approach to model selection.
The central idea is:
The best model is the one that gives the shortest overall message encoding both the
model and the data given the model. [6] [7] [8]
Mathematically, it seeks to minimize:

$$ L(model) + L(data \mid model) $$

Where:
$ L(model) $: length (in bits) to describe the model complexity,
$ L(data \mid model) $: length to encode the data given the model's predictions.
MDL is a formalization of Occam's Razor: balance model simplicity (short description)
against data fit.
MDL penalizes overly complex models (which are hard to describe and unlikely to
generalize) and avoids underfitting (models that cannot compress data well due to poor fit).
[7] [8] [6]

Model Selection Concepts: Contrast and Application


| Concept | Key Idea | Effect on Model Selection |
|---|---|---|
| Curse of Dimensionality | High dimensions increase sparsity, risk of overfitting, resource use | Favors simpler models, requires dimensionality reduction |
| MAP Hypothesis | Select model with highest posterior probability (Bayesian) | Combines prior and data fit, regularizes model selection |
| MDL Principle | Select model with shortest encoding (model + data) | Balances complexity and fit, penalizes overfitting |

Context in Polynomial Regression/Linear Regression


Curse of dimensionality: Including many polynomial terms (high-degree) or many features
can make the model overfit and struggle to generalize.
MAP hypothesis: Applying regularization (ridge, lasso) to penalize excessive complexity,
interpreted as placing priors over coefficients.
MDL: Prefer models that succinctly capture signal, penalize unnecessary complexity,
aligning with regularization approaches.

Summary:
Efficient model selection in machine learning relies on understanding these concepts. High-
dimensional data can render simple models ineffective, but overly complex ones are likely to
overfit. MAP hypothesis and MDL principle provide rigorous mathematical frameworks for
balancing fit and complexity, both incorporating penalties against complex models to promote
generalizability. [8] [1] [6]

1. [Link]
2. [Link]
3. [Link]
4. [Link] (tion-in-Gaussian-regression/10.1214/[Link])
5. [Link]
6. [Link]
7. [Link]
8. [Link]
Linear Models for Regression — Basis Function Models (Bishop Style)

1. Concept: Linear in Parameters, Flexible in Inputs


A linear model for regression predicts a target variable as:

$$ y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \cdots + w_D x_D $$

where $ \mathbf{x} = (x_1, \ldots, x_D)^\top $ and $ \mathbf{w} = (w_0, w_1, \ldots, w_D)^\top $. This is linear in both the parameters and the input variables.

Limitation
Such models only capture linear trends in input space. To model more complex, nonlinear
relationships, we introduce basis functions.

2. Basis Function Models


Instead of using raw inputs directly, map inputs through fixed nonlinear transformations (basis functions) $ \phi_j(\mathbf{x}) $:

$$ y(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(\mathbf{x}) $$

or, by introducing a "dummy" basis $ \phi_0(\mathbf{x}) = 1 $,

$$ y(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}) $$

with $ \boldsymbol{\phi}(\mathbf{x}) = (\phi_0(\mathbf{x}), \ldots, \phi_{M-1}(\mathbf{x}))^\top $.
Linear in $ \mathbf{w} $ (parameters)
Possibly nonlinear in $ \mathbf{x} $ (inputs), due to the $ \phi_j $ [1] [2] [3]

Why "Linear" Models?


Despite possible nonlinear transformations, the model remains linear in the coefficients $ \mathbf{w} $. This ensures tractable, closed-form solutions for parameter estimation.

3. Examples of Basis Functions
Polynomial: $ \phi_j(x) = x^j $ (classic polynomial regression)
Gaussian: $ \phi_j(x) = \exp\!\big( -\frac{(x - \mu_j)^2}{2 s^2} \big) $
Sigmoid: $ \phi_j(x) = \sigma\!\big( \frac{x - \mu_j}{s} \big) $, with $ \sigma(a) = \frac{1}{1 + e^{-a}} $
Fourier/cosine/sine and other wavelet bases [4] [3]
The choice of basis functions determines the expressiveness and flexibility of the regression model.

4. Matrix Notation and Estimation


Let the data be $ \{ (\mathbf{x}_n, t_n) \}_{n=1}^N $, with $ \Phi_{nj} = \phi_j(\mathbf{x}_n) $:

$ \Phi \in \mathbb{R}^{N \times M} $: design matrix.
$ \mathbf{y} = \Phi \mathbf{w} $: predictions.
Learning: Fit $ \mathbf{w} $ by minimizing the sum-of-squares error (or via maximum likelihood under Gaussian noise):

$$ \mathbf{w}_{ML} = (\Phi^\top \Phi)^{-1} \Phi^\top \mathbf{t} $$

Optionally, add regularization (ridge regression).
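A minimal sketch of a basis function model with Gaussian bases; the centers, width, and data are illustrative assumptions:

```python
import numpy as np

def gaussian_design(x, centers, s):
    """Phi[n, j] = exp(-(x_n - mu_j)^2 / (2 s^2)), plus a constant dummy basis."""
    Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s**2))
    return np.column_stack([np.ones_like(x), Phi])

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 25))
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 25)

centers = np.linspace(0, 1, 9)             # 9 Gaussian basis centers (assumed)
Phi = gaussian_design(x, centers, s=0.1)
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)  # linear estimation of a nonlinear fit
print("fitted weights:", np.round(w, 2))
```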

5. Advantages
Expressivity: Model nonlinear relations with simple linear estimation.
Tractable optimization: Still allows efficient analytical or numerical training.
Extends to generalized linear models (GLMs) and Bayesian approaches.

6. Typical Workflow
1. Choose basis functions $ \phi_j $ (domain knowledge, cross-validation, etc.).
2. Construct $ \Phi $ (evaluate the basis functions on the data).
3. Solve for $ \mathbf{w} $ (least squares; add regularization as needed).
4. Make predictions: $ y(\mathbf{x}) = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}) $.
7. Summary Table

| Model Form | Expression | Notes |
|---|---|---|
| Classic Linear | $ y = w_0 + \sum_j w_j x_j $ | Only linear trends |
| Basis Function | $ y = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}) $ | Nonlinear in $ \mathbf{x} $, linear in $ \mathbf{w} $ |
| Prediction | $ \mathbf{y} = \Phi \mathbf{w} $ | Vectorized |
| Least Squares Solution | $ \mathbf{w}_{ML} = (\Phi^\top \Phi)^{-1} \Phi^\top \mathbf{t} $ | |
8. Key Points
Choice of basis critically affects performance and capacity.
Still "linear models" in the mathematical sense; enables simplicity and power. [2] [3] [1]
Includes classical polynomial regression as a special case.

These notations and structure follow Bishop's Pattern Recognition and Machine Learning.

References
See Bishop's "Pattern Recognition and Machine Learning" and the referenced PDFs for full derivations and detailed worked examples. [3] [1] [2]

1. [Link]
2. [Link] (df)
3. [Link]
4. [Link]
Linear Models for Classification — Least Squares Method (Bishop Style)

Overview
Linear models for classification use a linear function of the input features to assign inputs to
classes. While such models are commonly learned by methods like logistic regression or Fisher's discriminant, an
alternative approach applies least squares regression to classification targets. This method
treats classification as a regression problem, fitting linear functions to numerical encodings of
class labels.

Model Setup
Consider a classification problem with $ K $ classes. Use a one-hot (1-of-K) encoding of the target vector $ \mathbf{t} \in \{0,1\}^K $, where the element $ t_k = 1 $ for the correct class and 0 otherwise.
For input $ \mathbf{x} \in \mathbb{R}^D $, define the classifier output vector:

$$ \mathbf{y}(\mathbf{x}) = \tilde{\mathbf{W}}^\top \tilde{\mathbf{x}} $$

where
$ \tilde{\mathbf{x}} = (1, \mathbf{x}^\top)^\top \in \mathbb{R}^{D+1} $ is the augmented input vector (including the bias term),
$ \tilde{\mathbf{W}} \in \mathbb{R}^{(D+1) \times K} $ is the weight matrix, with the $ k $-th column containing the parameters $ \tilde{\mathbf{w}}_k $ for class $ k $.
The element-wise output $ y_k(\mathbf{x}) = \tilde{\mathbf{w}}_k^\top \tilde{\mathbf{x}} $ is a linear discriminant function for class $ k $.

Learning by Least Squares


Given training data $ \{ \mathbf{x}_n, \mathbf{t}_n \}_{n=1}^N $, define:
Design matrix $ \tilde{\mathbf{X}} \in \mathbb{R}^{N \times (D+1)} $, where each row is $ \tilde{\mathbf{x}}_n^\top $,
Target matrix $ \mathbf{T} \in \mathbb{R}^{N \times K} $, where each row is $ \mathbf{t}_n^\top $ (one-hot vectors).
The sum-of-squares error to minimize is

$$ E_D(\tilde{\mathbf{W}}) = \frac{1}{2} \operatorname{Tr}\!\left\{ (\tilde{\mathbf{X}} \tilde{\mathbf{W}} - \mathbf{T})^\top (\tilde{\mathbf{X}} \tilde{\mathbf{W}} - \mathbf{T}) \right\} $$

Setting the derivative of $ E_D $ w.r.t. $ \tilde{\mathbf{W}} $ to zero yields the closed-form solution:

$$ \tilde{\mathbf{W}} = (\tilde{\mathbf{X}}^\top \tilde{\mathbf{X}})^{-1} \tilde{\mathbf{X}}^\top \mathbf{T} = \tilde{\mathbf{X}}^\dagger \mathbf{T} $$

where $ \tilde{\mathbf{X}}^\dagger $ is the Moore-Penrose pseudoinverse of $ \tilde{\mathbf{X}} $.

Classification Rule
The predicted class for a new point $ \mathbf{x} $ is:

$$ \hat{k} = \arg\max_{k} \; y_k(\mathbf{x}) = \arg\max_{k} \; \tilde{\mathbf{w}}_k^\top \tilde{\mathbf{x}} $$
Conceptual Interpretation
Least squares for classification approximates the conditional expectation $ \mathbb{E}
[\mathbf{t} | \mathbf{x}] $ of the target vector given input.
This corresponds roughly to approximating class posterior probabilities, but the outputs $
y_k(\mathbf{x}) $ are not guaranteed to be probabilities (they can be negative or not sum
to 1).
The solution corresponds to maximum likelihood estimation under a Gaussian noise
assumption on the target vectors, which is not strictly valid for discrete class labels.

Limitations of Least Squares Classification


Sensitivity to outliers: Because least squares minimizes squared errors, points far from
decision boundaries can disproportionately influence the decision boundaries, unlike more
robust methods such as logistic regression.
Poor probability interpretation: Outputs are not bounded or normalized probabilities.
Masking effect in multi-class: Some classes may be "masked" where the least squares
solution tends to assign very few or no regions to certain classes.
Linear decision boundaries: The method only learns linear decision surfaces, which may be
insufficient for complex data.
Comparison with Other Linear Classifiers

| Method | Key Idea | Advantages | Disadvantages |
|---|---|---|---|
| Least Squares | Fit linear functions to one-hot labels | Closed-form solution, simple | Poor probability modeling, sensitive to outliers |
| Logistic Regression | Models conditional class probabilities | Probabilistic, robust | Requires iterative optimization |
| Fisher's Linear Discriminant | Maximizes class separation in projection | Effective dimensionality reduction | Assumes homoscedastic Gaussian classes |

Summary
The least squares approach to linear classification provides a simple, closed-form solution by
treating class labels as numerical targets. It defines linear discriminant functions per class and
assigns labels by taking the maximal output. While elegant for computational efficiency and
mathematical simplicity, it lacks robustness and probabilistic rigor compared to methods like
logistic regression.

This explanation and notation closely follow Bishop’s Pattern Recognition and Machine Learning,
chapter 4.1.3, and standard machine learning literature on linear classification. [1] [2] [3]

1. [Link] (hapter_04_part.pdf?cache=nocache)
2. [Link]
3. [Link]
Fisher's Linear Discriminant: Full Formulation and Derivation
Fisher's Linear Discriminant (FLD) is a classical method for linear classification and dimensionality
reduction that seeks a linear projection of data maximizing class separability. It is specifically
designed for the two-class problem but can be generalized.

Problem Setup
Suppose we have two classes $ C_1 $ and $ C_2 $ with data points:
$ \{ \mathbf{x}_i \} $, $ i = 1, \dots, N_1 $ from class $ C_1 $,
$ \{ \mathbf{x}_j \} $, $ j = 1, \dots, N_2 $ from class $ C_2 $,
each point $ \mathbf{x} \in \mathbb{R}^D $.
Define:
Class means:

$$ \mathbf{m}_1 = \frac{1}{N_1} \sum_{n \in C_1} \mathbf{x}_n, \qquad \mathbf{m}_2 = \frac{1}{N_2} \sum_{n \in C_2} \mathbf{x}_n $$

Goal
Find a projection vector $ \mathbf{w} \in \mathbb{R}^D $ such that the projection of the data onto $ \mathbf{w} $:

$$ y = \mathbf{w}^\top \mathbf{x} $$

maximizes between-class separation while minimizing within-class scatter in the projected space.

Quantities in Projected Space

Projected class means:

$$ m_k' = \mathbf{w}^\top \mathbf{m}_k, \qquad k = 1, 2 $$

Projected scatter (variance) within each class:

$$ s_k^2 = \sum_{n \in C_k} (y_n - m_k')^2, \qquad y_n = \mathbf{w}^\top \mathbf{x}_n $$

Fisher Criterion to Maximize
Fisher defined a criterion function (also called the Fisher ratio):

$$ J(\mathbf{w}) = \frac{(m_2' - m_1')^2}{s_1^2 + s_2^2} $$

Maximizing $ J(\mathbf{w}) $ finds a projection that maximizes the distance between the class means while keeping within-class variance small, leading to better class separability.

Reformulation with Scatter Matrices

Define scatter matrices as:
Within-class scatter matrix $ \mathbf{S}_W $:

$$ \mathbf{S}_W = \sum_{n \in C_1} (\mathbf{x}_n - \mathbf{m}_1)(\mathbf{x}_n - \mathbf{m}_1)^\top + \sum_{n \in C_2} (\mathbf{x}_n - \mathbf{m}_2)(\mathbf{x}_n - \mathbf{m}_2)^\top $$

Between-class scatter matrix $ \mathbf{S}_B $:

$$ \mathbf{S}_B = (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^\top $$

Using these, the Fisher criterion becomes:

$$ J(\mathbf{w}) = \frac{\mathbf{w}^\top \mathbf{S}_B \mathbf{w}}{\mathbf{w}^\top \mathbf{S}_W \mathbf{w}} $$

Optimization Problem
Maximize $ J(\mathbf{w}) $ with respect to $ \mathbf{w} $. This is a generalized Rayleigh quotient maximization:

$$ \mathbf{w}^* = \arg\max_{\mathbf{w}} \; \frac{\mathbf{w}^\top \mathbf{S}_B \mathbf{w}}{\mathbf{w}^\top \mathbf{S}_W \mathbf{w}} $$
Derivation of Optimal $ \mathbf{w} $


To solve, we set the derivative of $ J(\mathbf{w}) $ w.r.t. $ \mathbf{w} $ to zero under a normalization constraint.
This leads to the generalized eigenvalue problem:

$$ \mathbf{S}_B \mathbf{w} = \lambda \mathbf{S}_W \mathbf{w} $$

For the two-class case, since $ \mathbf{S}_B $ is rank 1 (an outer product of the mean difference), the solution reduces to:

$$ \mathbf{w}^* \propto \mathbf{S}_W^{-1} (\mathbf{m}_2 - \mathbf{m}_1) $$

Here, the optimal projection vector is proportional to the inverse of the within-class scatter matrix multiplied by the difference in class means.

Interpretation
$ \mathbf{S}_W^{-1} $ weights the mean difference by the inverse of the variability within
classes, emphasizing directions of small within-class variation.
Projecting onto $ \mathbf{w} $ maximizes class separation normalized by within-class
spread.

Classification Rule
Having found $ \mathbf{w}^* $, classify a new point $ \mathbf{x} $ by projecting:

$$ y = \mathbf{w}^{*\top} \mathbf{x} $$

then choosing the class whose projected mean $ m_k' = \mathbf{w}^{*\top} \mathbf{m}_k $ is closest to $ y $, or equivalently by applying a threshold between $ m_1' $ and $ m_2' $.
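A minimal sketch of the two-class computation in NumPy (the toy data and the midpoint threshold are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal([0, 0], 1.0, (50, 2))        # class C1 samples (assumed)
X2 = rng.normal([3, 2], 1.0, (50, 2))        # class C2 samples (assumed)

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)    # class means
# Within-class scatter S_W (sum of per-class scatter matrices)
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

w = np.linalg.solve(Sw, m2 - m1)             # w* proportional to S_W^{-1}(m2 - m1)
threshold = 0.5 * (w @ m1 + w @ m2)          # midpoint between projected means

y_new = X2[0] @ w                            # project a point and classify
print("class:", "C2" if y_new > threshold else "C1")
```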

Extension to Multiple Classes


For $ K > 2 $ classes, Fisher’s method generalizes by defining:
Total within-class scatter matrix $ \mathbf{S}_W $ as the sum over all classes,
Between-class scatter matrix $ \mathbf{S}_B $ capturing scatter of class means around the
global mean.
The objective becomes finding $ \mathbf{W} \in \mathbb{R}^{D \times (K-1)} $ maximizing:

$$ J(\mathbf{W}) = \operatorname{Tr}\!\left\{ (\mathbf{W}^\top \mathbf{S}_W \mathbf{W})^{-1} (\mathbf{W}^\top \mathbf{S}_B \mathbf{W}) \right\} $$

which is solved via eigendecomposition of $ \mathbf{S}_W^{-1} \mathbf{S}_B $.


Summary Table

| Quantity | Definition/Expression |
|---|---|
| Class means | $ \mathbf{m}_k = \frac{1}{N_k} \sum_{i \in C_k} \mathbf{x}_i $ |
| Within-class scatter matrix | $ \mathbf{S}_W = \sum_k \sum_{i \in C_k} (\mathbf{x}_i - \mathbf{m}_k)(\mathbf{x}_i - \mathbf{m}_k)^\top $ |
| Between-class scatter matrix | $ \mathbf{S}_B = \sum_k N_k (\mathbf{m}_k - \mathbf{m})(\mathbf{m}_k - \mathbf{m})^\top $ (general case) |
| Fisher criterion | $ J(\mathbf{w}) = \frac{\mathbf{w}^\top \mathbf{S}_B \mathbf{w}}{\mathbf{w}^\top \mathbf{S}_W \mathbf{w}} $ |
| Optimal projection vector | $ \mathbf{w}^* \propto \mathbf{S}_W^{-1} (\mathbf{m}_1 - \mathbf{m}_2) $ |

This derivation and formulation closely follow classical presentations found in standard machine
learning texts such as Bishop’s Pattern Recognition and Machine Learning and related
academic resources.


Perceptron Algorithm: Full Formulation and Explanation
The Perceptron algorithm is one of the earliest and simplest algorithms for supervised learning
of binary classifiers. It belongs to the family of linear classifiers and forms the basis for many
neural network architectures.

Model Setup
Inputs: A feature vector $ \mathbf{x} = (x_1, x_2, \ldots, x_D)^\top \in \mathbb{R}^D $.
Parameters: A weight vector $ \mathbf{w} = (w_1, w_2, \ldots, w_D)^\top $ and a bias term $ b $.
Output: A predicted class label in $ \{0, 1\} $ (or sometimes $ \{-1, +1\} $).
The Perceptron computes a linear combination of the inputs plus bias:

$$ a = \mathbf{w}^\top \mathbf{x} + b $$

and then applies an activation function, typically a step (Heaviside) function:

$$ h(a) = \begin{cases} 1 & \text{if } a \geq 0 \\ 0 & \text{otherwise} \end{cases} $$

This results in a linear decision boundary defined by the hyperplane $ \mathbf{w}^\top \mathbf{x} + b = 0 $ that separates the two classes.

Learning and Weight Update Rule


Given a labeled training dataset $ \{ (\mathbf{x}_n, t_n) \}_{n=1}^N $, the Perceptron seeks to find weights that correctly classify all samples (if possible).
Initialization:
Weights and bias are initialized, often to zero or small random values.
For each training example:
1. Compute the prediction:

$$ \hat{y}_n = h(\mathbf{w}^\top \mathbf{x}_n + b) $$

2. Compute the error:

$$ e_n = t_n - \hat{y}_n $$

3. Update weights and bias only if misclassified ($ e_n \neq 0 $):

$$ \mathbf{w} \leftarrow \mathbf{w} + \eta \, e_n \, \mathbf{x}_n, \qquad b \leftarrow b + \eta \, e_n $$

where $ \eta > 0 $ is the learning rate controlling step sizes.
This iterative update causes the model to slowly shift its decision boundary toward correctly classifying the inputs.

Geometric Interpretation of Updates


If a positive example ($ t_n = 1 $) is misclassified as 0, the weight vector moves closer to $ \mathbf{x}_n $.
If a negative example ($ t_n = 0 $) is misclassified as 1, the weight vector moves away from $ \mathbf{x}_n $.
Each update pushes the weight vector in a direction that reduces future misclassification of that sample.

Perceptron Convergence Theorem


If the training data is linearly separable, the Perceptron algorithm is guaranteed to find a weight
vector that perfectly separates the classes in a finite number of updates.

Summary: Perceptron Algorithm Steps

1. Initialize $ \mathbf{w} $, $ b $, and the learning rate $ \eta $.
2. For each training point $ (\mathbf{x}_n, t_n) $:
Compute $ \hat{y}_n = h(\mathbf{w}^\top \mathbf{x}_n + b) $.
Update $ \mathbf{w} \leftarrow \mathbf{w} + \eta (t_n - \hat{y}_n) \mathbf{x}_n $ and $ b \leftarrow b + \eta (t_n - \hat{y}_n) $ if $ \hat{y}_n \neq t_n $.
3. Repeat over the dataset for multiple epochs until convergence or maximum iterations.

Mathematical Formulation Recap

$$ \hat{y} = h(\mathbf{w}^\top \mathbf{x} + b), \qquad \mathbf{w} \leftarrow \mathbf{w} + \eta (t_n - \hat{y}_n)\, \mathbf{x}_n $$
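A minimal sketch of the training loop (the toy AND dataset and hyperparameters are illustrative assumptions):

```python
import numpy as np

def train_perceptron(X, t, eta=0.1, epochs=100):
    """Perceptron learning with 0/1 labels and a Heaviside step activation."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        errors = 0
        for x_n, t_n in zip(X, t):
            y_hat = 1 if (w @ x_n + b) >= 0 else 0
            if y_hat != t_n:                 # update only on misclassification
                w += eta * (t_n - y_hat) * x_n
                b += eta * (t_n - y_hat)
                errors += 1
        if errors == 0:                      # converged: all points classified
            break
    return w, b

# Linearly separable toy problem: the AND function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, t)
print(w, b, [(1 if w @ x + b >= 0 else 0) for x in X])
```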

Limitations
Only linearly separable data can be perfectly classified.
Does not provide probabilistic outputs.
Sensitive to the choice of learning rate and initial weights.
Cannot solve problems like XOR that are not linearly separable.

The Perceptron algorithm is a foundational concept in machine learning and neural networks,
providing insight into linear classification and learning rules.

This explanation closely follows classical references and standard formulations as presented in
sources such as Bishop's Pattern Recognition and Machine Learning and standard machine
learning literature. [1] [2] [3] [4]

1. [Link]
2. [Link] (work/)
3. [Link]
4. [Link]
Probabilistic Generative Classifiers — Naive Bayes Classifier
A probabilistic generative classifier models how data is generated by each class and uses
Bayes' theorem to invert this to classify new instances. The Naive Bayes classifier is one of the
simplest and most popular examples in this category, known for its computational efficiency and
decent performance despite a strong independence assumption.

Core Idea of Probabilistic Generative Classification


We model the joint probability distribution of data and class labels:

$$ p(\mathbf{x}, y) = p(\mathbf{x} \mid y)\, p(y) $$

where
$ \mathbf{x} $ is the feature vector,
$ y $ is the class label.
Classification is done by applying Bayes' theorem to compute the posterior probabilities of classes given features:

$$ p(y \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid y)\, p(y)}{p(\mathbf{x})} $$

The predicted class is the one with the highest posterior (MAP decision rule):

$$ \hat{y} = \arg\max_{y} \; p(\mathbf{x} \mid y)\, p(y) $$

(Since $ p(\mathbf{x}) $ is constant for all classes.)

Naive Bayes: The Naive Independence Assumption


The hallmark of Naive Bayes is the assumption that all features are conditionally independent given the class:

$$ p(\mathbf{x} \mid y) = \prod_{i=1}^{D} p(x_i \mid y) $$

This simplifies the potentially complex joint distribution into a product of simpler one-dimensional distributions.
Hence, the posterior simplifies to:

$$ p(y \mid \mathbf{x}) \propto p(y) \prod_{i=1}^{D} p(x_i \mid y) $$

This assumption is rarely true in practice but dramatically reduces the number of parameters to estimate, making training efficient and feasible even with high-dimensional data.

Model Components
1. Prior $ p(y) $: The class prior probability, typically estimated as the relative frequency of each class in the training data.
2. Likelihoods $ p(x_i \mid y) $: Class-conditional distributions for each feature. These are estimated separately for each feature and class, using methods appropriate to the type of data:
For discrete features, typically categorical or multinomial distributions.
For continuous features, typical choices include:
Gaussian (normal) distribution for numeric data.
Kernel density estimates or other parametric distributions as suitable.

Training
The parameters for $ p(y) $ and $ p(x_i \mid y) $ are estimated from data by maximum likelihood estimation (MLE) or Bayesian methods.
Prior:

$$ \hat{p}(y = c) = \frac{N_c}{N} $$

Likelihoods (for discrete features, by counting):

$$ \hat{p}(x_i = v \mid y = c) = \frac{\#\{ n : x_{ni} = v, \; y_n = c \}}{N_c} $$

or estimated from parametric distributions using class-specific statistics.

Classification Rule and Prediction


Given a new data point $ \mathbf{x} $, compute for each class $ c $:

$$ \log p(y = c) + \sum_{i=1}^{D} \log p(x_i \mid y = c) $$

and assign the class with the largest score. Logs are used for numerical stability and to convert products into sums.

Advantages and Limitations


Advantages:
Simple and fast to train and predict.
Requires relatively little training data, since parameters are estimated independently per feature.
Works well in many real-world scenarios such as text classification (spam detection,
sentiment analysis), medical diagnosis, and more.
Robust to irrelevant features due to the independence assumption.
Limitations:
The conditional independence assumption is rarely true in reality, which can degrade
probability estimates though often classification accuracy remains good.
It tends to be a poor estimator of calibrated probabilities (overconfident or
underconfident predictions).
Performance can be suboptimal compared to more complex models like logistic
regression or discriminative classifiers.

Conceptual Summary Table


| Aspect | Description |
|---|---|
| Generative model | Models $ p(\mathbf{x}, y) = p(\mathbf{x} \mid y)\, p(y) $ |
| Key assumption | Conditional independence of features given class |
| Classification decision | $ \hat{y} = \arg\max_y p(y) \prod_i p(x_i \mid y) $ |
| Parameter estimation | MLE of priors and likelihoods (separately per feature) |
| Strength / use cases | High-dimensional data, categorical and text data |
| Main limitation | Unrealistic independence assumption; poor probability calibration |

Connection to Other Models


Naive Bayes is a generative classifier since it models the data generation process per class.
It contrasts with discriminative classifiers (e.g., logistic regression) that directly model $ p(y \mid \mathbf{x}) $.
Despite its simplicity, Naive Bayes classifiers can achieve surprisingly effective classification
performance.

This comprehensive overview follows standard presentations as found in authoritative texts like
Bishop's Pattern Recognition and Machine Learning and other key machine learning literature. [1]
[2] [3]


1. [Link]
2. [Link]
3. [Link]
Naive Bayes and Gaussian Naive Bayes: Full Mathematical Derivations

1. Naive Bayes Classifier: Derivation


Naive Bayes is a probabilistic generative classifier based on Bayes’ theorem with the simplifying
assumption that features are conditionally independent given the class label. This strong
independence assumption leads to a highly simplified model and efficient learning.

Problem Setup
Given:
Feature vector $ \mathbf{x} = (x_1, x_2, \ldots, x_d) $,
Class variable $ C \in \{ C_1, C_2, \ldots, C_K \} $.
Goal: Assign a class label $ \hat{C} $ to a new instance by maximizing the posterior probability:

$$ \hat{C} = \arg\max_{k} \; P(C_k \mid \mathbf{x}) $$

By Bayes' theorem:

$$ P(C_k \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid C_k)\, P(C_k)}{P(\mathbf{x})} $$

Since $ P(\mathbf{x}) $ is constant across classes:

$$ \hat{C} = \arg\max_{k} \; P(\mathbf{x} \mid C_k)\, P(C_k) $$

Naive Independence Assumption

Assuming conditional independence of features given the class:

$$ P(\mathbf{x} \mid C_k) = \prod_{j=1}^{d} P(x_j \mid C_k) $$

Thus, the classification rule becomes:

$$ \hat{C} = \arg\max_{k} \; P(C_k) \prod_{j=1}^{d} P(x_j \mid C_k) $$

Parameter Estimation (Training)
Given labeled data $ \{ (\mathbf{x}_n, C^{(n)}) \}_{n=1}^N $, estimate:
Prior probability for each class:

$$ \hat{P}(C_k) = \frac{N_k}{N} $$

where $ N_k $ is the number of examples in class $ C_k $.

Class-conditional probabilities $ \hat{P}(x_j \mid C_k) $:

Depending on the feature types, $ P(x_j \mid C_k) $ can be modeled differently, e.g., categorical distributions for discrete features or parametric distributions (like Gaussian) for continuous features.

2. Gaussian Naive Bayes: Derivation for Continuous Features


For continuous input features, Gaussian Naive Bayes assumes that each feature given the class follows a univariate Gaussian distribution. The likelihood for feature $ x_j $ under class $ C_k $ is:

$$ P(x_j \mid C_k) = \mathcal{N}(x_j \mid \mu_{jk}, \sigma_{jk}^2) = \frac{1}{\sqrt{2\pi \sigma_{jk}^2}} \exp\!\left( -\frac{(x_j - \mu_{jk})^2}{2 \sigma_{jk}^2} \right) $$

Full Class-Conditional Likelihood

Under the naive assumption, the joint conditional distribution is the product of univariate Gaussian likelihoods:

$$ P(\mathbf{x} \mid C_k) = \prod_{j=1}^{d} \mathcal{N}(x_j \mid \mu_{jk}, \sigma_{jk}^2) $$

Parameter Estimation
Given the training data points $ \{ \mathbf{x}_n \}_{n=1}^{N_k} $ for class $ C_k $, estimate the parameters:
Mean for each feature:

$$ \hat{\mu}_{jk} = \frac{1}{N_k} \sum_{n \in C_k} x_{nj} $$

Variance for each feature:

$$ \hat{\sigma}_{jk}^2 = \frac{1}{N_k} \sum_{n \in C_k} (x_{nj} - \hat{\mu}_{jk})^2 $$

Classification Rule
A new data point $ \mathbf{x} $ is classified according to:

$$ \hat{C} = \arg\max_{k} \; P(C_k) \prod_{j=1}^{d} \mathcal{N}(x_j \mid \mu_{jk}, \sigma_{jk}^2) $$

For numerical stability and computational convenience, work in the log domain:

$$ \hat{C} = \arg\max_{k} \left[ \log P(C_k) + \sum_{j=1}^{d} \log \mathcal{N}(x_j \mid \mu_{jk}, \sigma_{jk}^2) \right] $$
3. Summary of Naive Bayes vs Gaussian Naive Bayes

| Aspect | Naive Bayes | Gaussian Naive Bayes |
|---|---|---|
| Feature type | Discrete, categorical, or continuous | Continuous features |
| Likelihood per feature | $ P(x_j \mid C_k) $ (empirical or parametric) | Gaussian $ \mathcal{N}(x_j \mid \mu_{jk}, \sigma_{jk}^2) $ |
| Independence assumption | Conditional independence of features given class | Same |
| Parameter estimation | Class prior and feature-conditional probabilities | Class prior, feature means and variances |
| Decision rule | $ \hat{C} = \arg\max_k P(C_k) \prod_j P(x_j \mid C_k) $ | $ \hat{C} = \arg\max_k P(C_k) \prod_j \mathcal{N}(x_j \mid \mu_{jk}, \sigma_{jk}^2) $ |

4. Interpretation
Naive Bayes simplifies joint likelihood estimation by assuming independence, making
parameter estimation tractable.
Gaussian Naive Bayes models continuous features using Gaussian distributions, providing a
parametric family for likelihood estimation.
Both methods are efficient, scalable, and often perform surprisingly well even when
independence assumptions are violated.
This rigorous formulation and derivation closely follow classical probabilistic pattern recognition literature, including Bishop's Pattern Recognition and Machine Learning.
Bayesian Belief Networks
Bayesian Belief Networks (BBNs), also known simply as Bayesian Networks, are a type of
probabilistic graphical model that represent a set of variables and their conditional
dependencies using a directed acyclic graph (DAG). They provide a powerful framework for
reasoning and making decisions under uncertainty by encoding joint probability distributions
efficiently.

Key Concepts of Bayesian Belief Networks


Graphical Structure:
Nodes represent random variables, which can be observable data, latent variables, or
hypotheses.
Directed edges represent probabilistic dependencies (often causal relationships)
between variables.
The graph is acyclic, meaning it does not contain directed loops.
Conditional Probabilities:
Each node has an associated Conditional Probability Table (CPT) that quantifies the
probability of the node’s states given the states of its parent nodes.
Nodes without parents are assigned prior probabilities.
The network factorizes the joint probability distribution as a product of these local conditional probabilities:

$$ P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P\big( X_i \mid \mathrm{Parents}(X_i) \big) $$

This decomposition dramatically reduces the complexity of representing joint distributions by leveraging the conditional independence properties encoded in the graph (see the sketch after this list).
Inference:
Given observed evidence variables, BBNs compute posterior probabilities of other
variables using Bayes’ theorem.
This probabilistic inference allows updating beliefs dynamically and reasoning under
uncertainty.
Exact inference methods include variable elimination and clique tree propagation;
approximate methods include sampling and variational approaches.
Learning:
BBNs can be constructed using expert knowledge or learned from data.
Learning involves determining the DAG structure (structure learning) and estimating
CPT parameters (parameter learning).
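A minimal sketch of the factorization and inference-by-enumeration for a tiny two-node network, Rain → WetGrass; the structure and CPT numbers are illustrative assumptions:

```python
# Joint probability via the DAG factorization: P(R, W) = P(R) * P(W | R)
P_rain = {True: 0.2, False: 0.8}                       # prior (node with no parents)
P_wet_given_rain = {True: {True: 0.9, False: 0.1},     # CPT: P(W | R)
                    False: {True: 0.2, False: 0.8}}

def joint(rain: bool, wet: bool) -> float:
    return P_rain[rain] * P_wet_given_rain[rain][wet]

# Inference by enumeration: P(Rain = True | WetGrass = True) via Bayes' theorem
evidence = joint(True, True) + joint(False, True)      # P(W = True)
print(joint(True, True) / evidence)                    # ~= 0.529
```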

Applications and Benefits


BBNs are widely used in fields like medical diagnosis, finance, robotics, and environmental
modeling.
They help model complex uncertain systems, support decision making, handle missing data,
and provide interpretable graphical representations of dependencies.
The modularity and causal interpretation facilitate understanding and managing complex
relationships in large systems.

Summary Table

| Aspect | Description |
|---|---|
| Model Type | Probabilistic graphical model (DAG + CPTs) |
| Nodes | Random variables |
| Edges | Directed dependencies, often causal |
| Joint Probability | Factorization as a product of conditional probabilities |
| Inference | Posterior computation given evidence |
| Learning | Structure and parameter estimation from data |
| Applications | Reasoning under uncertainty, decision support |

Bayesian Belief Networks thus combine graph theory and probability theory to represent and
manipulate uncertain knowledge efficiently, enabling sophisticated probabilistic reasoning and
decision-making.
This explanation follows standard presentations in machine learning literature and authoritative
sources on Bayesian networks. [1] [2] [3] [4]

1. [Link]
2. [Link]
3. [Link]
4. [Link]
Probabilistic Discriminative Classifiers — Logistic Regression

Overview
Logistic regression is a foundational probabilistic discriminative classifier used primarily for
binary classification problems. It models the conditional probability of the class label given the
input features directly, without attempting to model the data generation process (unlike
generative models such as Naive Bayes).

Model Formulation
Given:
Input feature vector $ \mathbf{x} \in \mathbb{R}^D $
Binary target $ t \in \{0, 1\} $
Logistic regression models the posterior probability of the positive class as a logistic sigmoid function applied to a linear function of the inputs:

$$ p(t = 1 \mid \mathbf{x}) = y(\mathbf{x}, \mathbf{w}) = \sigma\big( \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}) \big) $$

where

$$ \sigma(a) = \frac{1}{1 + e^{-a}} $$

and
$ \mathbf{w} $ is the weight vector,
$ \boldsymbol{\phi}(\mathbf{x}) $ is the basis function vector of input features (which can include nonlinear transformations),
$ \sigma(\cdot) $ is the sigmoid function.
The probability of the negative class is $ 1 - y(\mathbf{x}, \mathbf{w}) $.

Likelihood and Objective Function


For a training set $ \{ (\mathbf{x}_n, t_n) \}_{n=1}^N $, assuming independent samples, the likelihood of the observed targets is:

$$ p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n} $$

where $ y_n = y(\mathbf{x}_n, \mathbf{w}) $.
The negative log-likelihood (also called the cross-entropy error function) to minimize is:

$$ E(\mathbf{w}) = -\sum_{n=1}^{N} \big\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big\} $$
Parameter Estimation
The objective function $ E(\mathbf{w}) $ is convex but nonlinear in $ \mathbf{w} $.
Parameters are typically estimated by maximum likelihood using iterative optimization
methods such as:
Gradient descent
Iteratively Reweighted Least Squares (IRLS)
Newton-Raphson method

Multiclass Extension
For classification with $ K > 2 $ classes, logistic regression generalizes to multinomial logistic regression (also known as softmax regression):

$$ p(C_k \mid \mathbf{x}) = \frac{\exp\big( \mathbf{w}_k^\top \boldsymbol{\phi}(\mathbf{x}) \big)}{\sum_{j=1}^{K} \exp\big( \mathbf{w}_j^\top \boldsymbol{\phi}(\mathbf{x}) \big)} $$

where each class has its own weight vector $ \mathbf{w}_k $, collected as columns in the weight matrix $ \mathbf{W} $.

Characteristics and Advantages


Logistic regression directly models the posterior probabilities $ p(t|\mathbf{x}) $, making
it a discriminative model.
It does not assume any distribution on input features $ \mathbf{x} $, unlike generative
models.
It allows easy incorporation of features via basis functions while maintaining convex
optimization.
Provides probabilistic outputs useful for calibrated decision making.
It tends to have better predictive performance than generative classifiers when model
assumptions on data distribution are violated.

Summary Table

| Aspect | Description |
|---|---|
| Model Type | Probabilistic discriminative classifier |
| Output | $ p(t = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})) $ |
| Loss Function | Cross-entropy (negative log-likelihood) |
| Parameter Estimation | Maximum likelihood via iterative numerical optimization |
| Extension to Multiclass | Softmax function over linear scores |
| Advantages | Probabilistic, convex loss, flexible features, often better performance |

This explanation and notation align closely with Bishop's Pattern Recognition and Machine
Learning as well as standard machine learning literature. Logistic regression remains a
fundamental tool for classification due to its simplicity, interpretability, and probabilistic
foundation.

Logistic Regression: Full Mathematical Derivation

1. Problem Setup
Given a binary classification problem with:
Input feature vector $ \mathbf{x} \in \mathbb{R}^D $,
Target label $ t \in \{0, 1\} $.
We want to model the conditional probability of the class label given the input:

$$ p(t = 1 \mid \mathbf{x}) $$

using a parametric form that enables learning from data.

2. Model Formulation: Sigmoid Function


Logistic regression models the posterior probability using a sigmoid (logistic) function applied to a linear function of the inputs:

$$ y(\mathbf{x}, \mathbf{w}) = p(t = 1 \mid \mathbf{x}) = \sigma(a), \qquad a = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}) $$

where

$$ \sigma(a) = \frac{1}{1 + e^{-a}} $$

and
$ \boldsymbol{\phi}(\mathbf{x}) $ is the basis function vector (often $ \boldsymbol{\phi}(\mathbf{x}) = (1, x_1, \ldots, x_D)^\top $, including a bias),
$ \mathbf{w} $ is the parameter vector to be estimated.
The probability of the other class $ t = 0 $ is:

$$ p(t = 0 \mid \mathbf{x}) = 1 - y(\mathbf{x}, \mathbf{w}) $$
3. Likelihood Function
Assuming independent samples $ \{ (\mathbf{x}_n, t_n) \}_{n=1}^N $, the likelihood of the observed targets given the parameters is:

$$ p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n} $$

where

$$ y_n = \sigma\big( \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n) \big) $$
4. Log-Likelihood and Objective Function


Taking the logarithm gives the log-likelihood:

$$ \ln p(\mathbf{t} \mid \mathbf{w}) = \sum_{n=1}^{N} \big\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big\} $$

We typically maximize this log-likelihood, or equivalently minimize the negative log-likelihood (cross-entropy loss):

$$ E(\mathbf{w}) = -\sum_{n=1}^{N} \big\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big\} $$
5. Gradient of the Loss Function


To find the optimal parameters $ \mathbf{w} $, we differentiate $ E(\mathbf{w}) $ with respect to $ \mathbf{w} $.
Since

$$ y_n = \sigma(a_n), \qquad a_n = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n), $$

we use the derivative of the sigmoid:

$$ \frac{d\sigma}{da} = \sigma(a)\,\big(1 - \sigma(a)\big) $$

By the chain rule, each term of the loss contributes $ (y_n - t_n)\, \boldsymbol{\phi}(\mathbf{x}_n) $, so:

$$ \nabla_{\mathbf{w}} E(\mathbf{w}) = \sum_{n=1}^{N} (y_n - t_n)\, \boldsymbol{\phi}(\mathbf{x}_n) $$

In matrix form, if $ \Phi $ is the design matrix with rows $ \boldsymbol{\phi}(\mathbf{x}_n)^\top $, and the vectors are $ \mathbf{y} = (y_1, \ldots, y_N)^\top $, $ \mathbf{t} = (t_1, \ldots, t_N)^\top $, then:

$$ \nabla_{\mathbf{w}} E(\mathbf{w}) = \Phi^\top (\mathbf{y} - \mathbf{t}) $$

6. Parameter Estimation: Iterative Optimization (Newton-Raphson/IRLS)


Since there is no closed-form solution (due to the nonlinearity of the sigmoid), we use iterative methods for optimization:
Newton-Raphson update:

$$ \mathbf{w}^{(new)} = \mathbf{w}^{(old)} - \mathbf{H}^{-1} \nabla_{\mathbf{w}} E(\mathbf{w}) $$

where $ \mathbf{H} $ is the Hessian matrix (second derivative),

$$ \mathbf{H} = \Phi^\top \mathbf{R} \Phi $$

with

$$ \mathbf{R} = \operatorname{diag}\big( y_1 (1 - y_1), \ldots, y_N (1 - y_N) \big) $$

a diagonal matrix of weights.

This update corresponds to Iteratively Reweighted Least Squares (IRLS).

7. Summary of Algorithm
1. Initialize $ \mathbf{w} $ (e.g., the zero vector).
2. For each iteration:
Compute $ y_n = \sigma(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n)) $.
Form $ \mathbf{R} $.
Compute the gradient $ \nabla_{\mathbf{w}} E(\mathbf{w}) = \Phi^\top (\mathbf{y} - \mathbf{t}) $.
Compute the Hessian $ \mathbf{H} = \Phi^\top \mathbf{R} \Phi $.
Update:

$$ \mathbf{w} \leftarrow \mathbf{w} - (\Phi^\top \mathbf{R} \Phi)^{-1} \Phi^\top (\mathbf{y} - \mathbf{t}) $$

3. Repeat until convergence.
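A minimal sketch of the IRLS loop (the toy data, iteration count, and the small ridge term added for numerical invertibility are illustrative assumptions):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls_logistic(Phi, t, n_iter=20, ridge=1e-8):
    """Newton-Raphson / IRLS for logistic regression."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        R = y * (1 - y)                       # diagonal of R
        grad = Phi.T @ (y - t)                # Phi^T (y - t)
        H = Phi.T @ (R[:, None] * Phi)        # Phi^T R Phi
        H += ridge * np.eye(H.shape[0])       # tiny ridge for invertibility (assumed)
        w -= np.linalg.solve(H, grad)         # w <- w - H^{-1} grad
    return w

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
t = np.repeat([0.0, 1.0], 50)
Phi = np.column_stack([np.ones(len(X)), X])   # bias + raw features as basis
w = irls_logistic(Phi, t)
print("accuracy:", ((sigmoid(Phi @ w) > 0.5) == t).mean())
```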


8. Extension to Multiclass Logistic Regression (Softmax)
For $ K > 2 $ classes, use softmax regression:

$$ p(C_k \mid \mathbf{x}) = \frac{\exp\big( \mathbf{w}_k^\top \boldsymbol{\phi}(\mathbf{x}) \big)}{\sum_{j=1}^{K} \exp\big( \mathbf{w}_j^\top \boldsymbol{\phi}(\mathbf{x}) \big)} $$

Parameter estimation is done by maximizing the multinomial log-likelihood, using similar iterative optimization.

This full derivation follows the principles and notation set out in Bishop's Pattern Recognition and
Machine Learning and is foundational in modern machine learning for classification tasks.
Probit Regression: Full Formulation and Derivation

Overview
Probit regression is a type of probabilistic classification model used for binary outcomes, where
the dependent variable can take only two values (e.g., 0 or 1). It models the probability of the
positive class as the cumulative distribution function (CDF) of a standard normal distribution
applied to a linear combination of input features. This is in contrast to logistic regression, which
uses the logistic sigmoid function.

Model Formulation
Given:
Input vector $ \mathbf{x} \in \mathbb{R}^D $,
Target label $ y \in \{0, 1\} $,
the Probit regression model assumes:

$$ P(y = 1 \mid \mathbf{x}) = \Phi\big( \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}) \big) $$

where:
$ \boldsymbol{\phi}(\mathbf{x}) $ is a vector of basis functions of the input (possibly including an intercept term),
$ \mathbf{w} $ is the parameter (weight) vector,
$ \Phi(\cdot) $ is the CDF of the standard normal distribution:

$$ \Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} e^{-u^2/2} \, du $$

The function $ \Phi(z) $ maps the linear predictor $ z = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}) $ to a probability lying in $ (0, 1) $.

Latent Variable Interpretation


Probit regression can be derived from a latent variable model. Suppose there is an unobserved continuous variable $ y^* $ such that

$$ y^* = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}) + \varepsilon $$

where $ \varepsilon \sim \mathcal{N}(0, 1) $ is Gaussian noise.
Then, the observed binary outcome is

$$ y = \begin{cases} 1 & \text{if } y^* > 0 \\ 0 & \text{otherwise} \end{cases} $$

Thus,

$$ P(y = 1 \mid \mathbf{x}) = P\big( \varepsilon > -\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}) \big) = \Phi\big( \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}) \big) $$
Likelihood Function
Given a dataset $ \{ (\mathbf{x}_n, y_n) \}_{n=1}^N $, the likelihood of the observed data under the Probit model is:

$$ L(\mathbf{w}) = \prod_{n=1}^{N} \Phi(z_n)^{y_n} \big( 1 - \Phi(z_n) \big)^{1 - y_n} $$

where $ z_n = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n) $.

Log-Likelihood and Parameter Estimation


Taking the log of the likelihood gives the log-likelihood:

$$ \ell(\mathbf{w}) = \sum_{n=1}^{N} \big[ y_n \ln \Phi(z_n) + (1 - y_n) \ln\big(1 - \Phi(z_n)\big) \big] $$

Parameters $ \mathbf{w} $ are estimated by maximizing $ \ell(\mathbf{w}) $ with respect to $ \mathbf{w} $.
Since $ \Phi(z) $ is nonlinear and does not have a closed-form inverse, Newton-Raphson or other iterative numerical optimization methods (e.g., gradient ascent, quasi-Newton) are used for maximization.

Gradient and Hessian Computation


For optimization, the gradient of the log-likelihood w.r.t. $ \mathbf{w} $ can be written using the probability density function (PDF) of the standard normal, $ \phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2} $, as:

$$ \nabla_{\mathbf{w}} \ell(\mathbf{w}) = \sum_{n=1}^{N} \frac{y_n - \Phi(z_n)}{\Phi(z_n)\big(1 - \Phi(z_n)\big)} \, \phi(z_n) \, \boldsymbol{\phi}(\mathbf{x}_n) $$

The Hessian matrix can also be derived, allowing second-order optimization techniques like Newton-Raphson.
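A minimal sketch of maximizing the probit log-likelihood by gradient ascent, using `scipy.stats.norm` for $ \Phi $ and $ \phi $ (the toy data, step size, and CDF clipping are illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm

def probit_gradient_ascent(Phi_mat, y, lr=0.1, n_iter=500):
    """Gradient ascent on the probit log-likelihood."""
    w = np.zeros(Phi_mat.shape[1])
    for _ in range(n_iter):
        z = Phi_mat @ w
        cdf = np.clip(norm.cdf(z), 1e-10, 1 - 1e-10)  # avoid division by zero
        pdf = norm.pdf(z)
        grad = Phi_mat.T @ ((y - cdf) / (cdf * (1 - cdf)) * pdf)
        w += lr * grad / len(y)
    return w

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 1)), rng.normal(1, 1, (50, 1))])
y = np.repeat([0.0, 1.0], 50)
Phi_mat = np.column_stack([np.ones(len(X)), X])       # intercept + feature
w = probit_gradient_ascent(Phi_mat, y)
print("accuracy:", ((norm.cdf(Phi_mat @ w) > 0.5) == y).mean())
```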
Comparison with Logistic Regression
Logistic regression uses the logistic sigmoid function as the link function:

$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$

Probit regression uses the cumulative normal $ \Phi(\cdot) $.

Both model probabilities between 0 and 1 and belong to generalized linear models with
different link functions.
Probit regression assumes Gaussian-distributed errors in the latent variable, logistic
regression assumes logistic-distributed errors.
The choice between the two often results in very similar fits in practice; Probit is more
common in econometrics, logistic in machine learning.

Summary Table

| Aspect | Probit Regression | Logistic Regression |
|---|---|---|
| Link function | Standard normal CDF $ \Phi(z) $ | Logistic sigmoid $ \sigma(z) $ |
| Latent variable noise | Gaussian noise $ \varepsilon \sim \mathcal{N}(0,1) $ | Logistic noise |
| Output range | $ P(y=1 \mid \mathbf{x}) \in (0,1) $ | $ P(y=1 \mid \mathbf{x}) \in (0,1) $ |
| Parameter estimation | Maximum likelihood, numerical optimization | Maximum likelihood, numerical optimization |
| Interpretability | Coefficients interpreted as effects on the latent variable's z-score | Coefficients interpreted as log-odds ratios |

This full formulation and derivation of Probit regression follow classical statistical and
econometric treatments of the model, closely aligned with presentations in canonical texts and
standard machine learning resources. It provides a principled, mathematically rigorous approach
for modeling binary outcomes using a latent Gaussian model and the standard normal cumulative
distribution function as the link.

Instance-Based Learning — K-Nearest Neighbors (K-NN)

Overview
Instance-based learning is a family of learning algorithms that work by storing training
instances and making predictions for new queries by comparing them directly to these stored
examples. Unlike model-based methods that explicitly learn a global generalization model from
the training data, instance-based learners postpone the induction or generalization until a query
is made. This is why they are also called lazy learners.
The most fundamental and popular instance-based algorithm is the K-Nearest Neighbors (K-
NN) algorithm.

K-Nearest Neighbors (K-NN) Algorithm


K-NN is a non-parametric, supervised learning method used for classification and
regression:
Classification: Predict the class label by majority vote of the "k" nearest training points.
Regression: Predict the target output as the average (or weighted average) of the target
values of the k nearest neighbors.

How K-NN Works


1. Store all training data. K-NN does not build an explicit model but keeps all examples.
2. Define a distance metric to measure the similarity between instances. Usually, Euclidean distance is used for numeric features:

$$ d(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{ \sum_{r=1}^{D} \big( a_r(\mathbf{x}_i) - a_r(\mathbf{x}_j) \big)^2 } $$

where $ a_r(\mathbf{x}) $ is the value of the $ r $-th attribute for instance $ \mathbf{x} $.
3. For a new query instance, compute its distance to all stored training instances.
4. Select the k closest neighbors based on the smallest distances.
5. Prediction:
Classification: Assign the class which is the majority among the k neighbors. The
simplest form is unweighted voting.
Regression: Predict the average target value among the k neighbors.
Optionally, weighted voting or averaging can be applied where neighbors closer to the query
have more influence (e.g., weighted by the inverse of distance).
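A minimal sketch of k-NN classification with unweighted voting (the toy data and value of k are illustrative assumptions):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5):
    """Classify x_query by majority vote among its k nearest training points."""
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))  # Euclidean distances
    nearest = np.argsort(dists)[:k]                          # indices of k closest
    return Counter(y_train[nearest]).most_common(1)[0][0]    # majority vote

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(3, 1, (30, 2))])
y = np.repeat([0, 1], 30)
print(knn_predict(X, y, np.array([2.5, 2.5]), k=5))          # likely class 1
```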

Choosing k
Small values of k (e.g., 1) can lead to noisy decisions and high variance (overfitting).
Large values of k smooth decisions but may lead to underfitting (high bias).
Typically, an odd number is chosen to avoid ties in classification.
The optimal value of k can be determined via cross-validation or methods such as the elbow
method.

Characteristics of K-NN

| Feature | Description |
|---|---|
| Type | Instance-based, lazy learning |
| Parametric/Non-parametric | Non-parametric (no fixed form of the function assumed) |
| Training phase | Essentially none (just store data) |
| Prediction phase | Compute distances, find neighbors, output prediction |
| Distance metric | Usually Euclidean, but others like Manhattan or Minkowski can be used |
| Complexity | Prediction can be expensive for large datasets (search cost) |
| Sensitive to | Feature scaling, irrelevant/noisy features |
| Handles | Both classification and regression |

Advantages
Simple to understand and implement.
Makes no assumptions about data distribution.
Naturally handles multi-class problems.
Flexible decision boundaries that adapt to data shape.
Effective with large training data when good indexing/search structures are used.

Limitations
Computational cost high at prediction time (scales with dataset size).
Sensitive to irrelevant features and feature scaling.
Performance deteriorates in high-dimensional spaces (curse of dimensionality).
Capacity depends on quality and representativeness of stored instances.
Sensitive to noise and outliers (can be mitigated by weighting or data cleaning).

Summary
Instance-based learning methods like K-NN classify or regress for new samples by comparing
directly to stored data points. K-NN predicts based on the "k" most similar neighbors, making it
intuitive and effective for many practical tasks, especially when the relationship between input
and output is locally smooth but complex globally.

This explanation follows standard machine learning literature on instance-based methods and K-
NN algorithms. [1] [2] [3]

1. [Link]
2. [Link]
3. [Link]
Formulation of Nonlinear Models — Decision Trees
Decision trees are a fundamental class of nonlinear models used for both classification and
regression tasks. Unlike linear models that assume a linear relationship between features and the
output, decision trees model complex, nonlinear patterns by recursively partitioning the input
feature space into distinct regions and making simple predictions within each region.

Core Idea and Model Structure


A decision tree recursively splits the feature space based on feature values, creating a
tree-like structure of decision rules.
Each internal node corresponds to a test on one feature (e.g., "Is feature $ x_j $ less than threshold $ \theta $?").
Branches correspond to the outcome of the test (usually binary: yes/no or true/false).
Leaf nodes (terminal nodes) represent predictions—for classification, the predicted class
probabilities or labels; for regression, a constant value (like the average of training targets in
that region).

Mathematical Formulation
1. Data and Notation:
Given training data:

$$ \mathcal{D} = \{ (\mathbf{x}_n, y_n) \}_{n=1}^{N}, \qquad \mathbf{x}_n \in \mathbb{R}^D $$

Here, $ y_n $ is a discrete label for classification, or a continuous value for regression.

2. Partitioning the Input Space:
The tree partitions the feature space into disjoint regions:

$$ R_1, R_2, \ldots, R_M, \qquad R_m \cap R_{m'} = \emptyset \;\; \text{for } m \neq m' $$

where each region corresponds to a leaf node.

3. Prediction Function:
The prediction for an input $ \mathbf{x} $ is:

$$ f(\mathbf{x}) = \sum_{m=1}^{M} c_m \, \mathbb{1}[\mathbf{x} \in R_m] $$

where $ \mathbb{1}[\cdot] $ is the indicator function and $ c_m $ is the prediction assigned to region $ R_m $.

For regression: $ c_m $ is typically the mean of the target values of the training points in $ R_m $.
For classification: $ c_m $ is usually the majority class or a class probability distribution in $ R_m $.
4. Splitting Criteria:
At each internal node $ t $, a split $ (j, \theta) $ is chosen, where:
$ j $ is the feature index,
$ \theta $ is a threshold.
The data at node $ t $ is split into:

$$ \mathcal{D}_L = \{ (\mathbf{x}, y) : x_j \leq \theta \}, \qquad \mathcal{D}_R = \{ (\mathbf{x}, y) : x_j > \theta \} $$

5. Impurity / Loss Function $ Q(t) $:

For classification, common impurity measures are Gini impurity, entropy, or misclassification error:

$$ G(t) = \sum_{k} p_k (1 - p_k), \qquad H(t) = -\sum_{k} p_k \log p_k $$

where $ p_k $ is the proportion of class $ k $ in node $ t $.

For regression, the squared error or variance is often used:

$$ Q(t) = \frac{1}{N_t} \sum_{n \in t} (y_n - \bar{y}_t)^2 $$

where $ \bar{y}_t $ is the mean target in node $ t $.

6. Optimizing Splits:
The split at node $ t $ is chosen to minimize the weighted impurity of the children (a concrete search over candidate splits is sketched after the stopping criteria below):

$$ (j^*, \theta^*) = \arg\min_{j, \theta} \; \frac{N_L}{N_t} Q(L) + \frac{N_R}{N_t} Q(R) $$
7. Tree Growth and Stopping:


This splitting process is repeated recursively until some stopping criteria are met, such as:
Maximum depth reached,
Minimum number of samples in a node,
No more improvement in impurity.
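A minimal sketch of the exhaustive split search with Gini impurity, for a single level (the toy data is an illustrative assumption):

```python
import numpy as np

def gini(y):
    """Gini impurity: sum_k p_k (1 - p_k) = 1 - sum_k p_k^2."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Search (feature j, threshold theta) minimizing the weighted child impurity."""
    best_j, best_theta, best_score = None, None, np.inf
    for j in range(X.shape[1]):
        for theta in np.unique(X[:, j])[:-1]:   # last value would leave an empty right child
            left = X[:, j] <= theta
            score = (left.sum() * gini(y[left]) +
                     (~left).sum() * gini(y[~left])) / len(y)
            if score < best_score:
                best_j, best_theta, best_score = j, theta, score
    return best_j, best_theta, best_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(3, 1, (30, 2))])
y = np.repeat([0, 1], 30)
print(best_split(X, y))   # (feature index, threshold, weighted Gini of children)
```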

Nonlinearity and Flexibility


Decision trees define piecewise constant (or simple) functions over regions of the input space.
The regions are formed by axis-aligned splits (e.g., hyperplanes $ x_j = \theta $), allowing for highly nonlinear and non-smooth decision boundaries.
This recursive binary partitioning adaptively captures complex relationships and interactions
between features without requiring explicit feature transformations.
Summary Table of Decision Tree Formulation

| Component | Description |
|---|---|
| Input Data | $ \mathcal{D} = \{ (\mathbf{x}_n, y_n) \}_{n=1}^N $, $ \mathbf{x}_n \in \mathbb{R}^D $ |
| Partition | Split into disjoint regions $ R_m $ via recursive binary splits |
| Prediction Function | $ f(\mathbf{x}) = \sum_m c_m \, \mathbb{1}[\mathbf{x} \in R_m] $ |
| Split Parameters | Feature $ j $, threshold $ \theta $; split data into left/right subsets |
| Impurity / Loss | Gini, entropy, variance, or other impurity functions to measure split quality |
| Split Optimization | Minimize weighted impurity over candidate splits |
| Stopping Criteria | Depth limit, minimum samples, or no impurity decrease |
| Nonlinearity | Model is piecewise constant, nonlinear, and flexible |

Intuition and Use


Decision trees are inherently nonlinear models because their prediction function is piecewise
defined based on input regions created by hierarchical binary splits. This allows them to model
complex patterns, handle feature interactions, and provide interpretable decision rules.
They serve as base learners for more powerful ensemble methods such as Random Forests and
Gradient Boosted Trees.

This explanation captures the standard mathematical formulation and interpretation of decision
trees as nonlinear models widely used in machine learning literature and practical applications. [1]
[2] [3]

1. [Link]
2. [Link]
3. [Link]
Ensemble Classifiers — Mathematical Formulations of Bagging and Boosting
Ensemble classifiers combine multiple base models to improve prediction accuracy, robustness,
and generalization performance relative to individual models. Two widely used ensemble
techniques are Bagging (Bootstrap Aggregating) and Boosting. Below is a detailed
mathematical formulation of each.

1. Bagging (Bootstrap Aggregating)


Bagging aims to reduce variance and overfitting by training base classifiers independently on
different random subsets of the training data and aggregating their predictions.

Setup:
Original training dataset: $ \mathcal{D} = \{ (\mathbf{x}_n, y_n) \}_{n=1}^N $.
Number of base learners: $ B $.
Each base learner $ b $ is trained on a bootstrap sample $ \mathcal{D}_b $ created by sampling $ N $ points from $ \mathcal{D} $ with replacement.
The base learner's prediction function is $ h_b(\mathbf{x}) $.

Mathematical Formulation:
1. For $ b = 1, \ldots, B $:
Sample $ \mathcal{D}_b $ by bootstrap sampling from $ \mathcal{D} $.
Train base classifier $ h_b $ on $ \mathcal{D}_b $.
2. To predict the class label for a new instance $ \mathbf{x} $:
Classification: Use majority voting over the base classifiers:

$$ \hat{y}(\mathbf{x}) = \arg\max_{c} \sum_{b=1}^{B} \mathbb{1}\big[ h_b(\mathbf{x}) = c \big] $$

where $ \mathbb{1}[\cdot] $ is the indicator function for class $ c $.

Regression: Use averaging of the outputs:

$$ \hat{y}(\mathbf{x}) = \frac{1}{B} \sum_{b=1}^{B} h_b(\mathbf{x}) $$
Key Points:
Each base learner is trained independently and in parallel.
Diversity among base classifiers arises from training on different bootstrap samples.
Bagging reduces variance by averaging out errors caused by high-variance base models.
Commonly used with base learners that have high variance, such as decision trees.
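A minimal sketch of bagging with decision trees as base learners (the toy data and number of learners are illustrative assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.repeat([0, 1], 50)

B, N = 25, len(X)
learners = []
for _ in range(B):                               # train each h_b on a bootstrap sample
    idx = rng.integers(0, N, N)                  # sample N indices with replacement
    learners.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Majority vote over the B base classifiers
votes = np.stack([h.predict(X) for h in learners])      # shape (B, N)
y_hat = (votes.mean(axis=0) > 0.5).astype(int)          # vote for binary labels
print("training accuracy:", (y_hat == y).mean())
```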

2. Boosting
Boosting sequentially trains base learners, where each learner focuses on the mistakes of the
previous ones to form a strong composite model by weighted combination of weak learners.

Setup:
Original training dataset: $ \mathcal{D} = \{ (\mathbf{x}_n, y_n) \}_{n=1}^N $, with labels $ y_n \in \{-1, +1\} $.
Number of base learners: $ M $.
Weight distribution over training samples at iteration $ m $: $ w_n^{(m)} $.
The base learner $ h_m $ at iteration $ m $ is trained on the weighted data.
The learner's weight in the final model: $ \alpha_m $.

Mathematical Formulation (Generalized):

1. Initialize sample weights equally:

$$ w_n^{(1)} = \frac{1}{N}, \qquad n = 1, \ldots, N $$

2. For $ m = 1, \ldots, M $:
Train base classifier $ h_m $ using the weighted dataset with weights $ w_n^{(m)} $.
Compute the weighted error rate:

$$ \epsilon_m = \sum_{n=1}^{N} w_n^{(m)} \, \mathbb{1}\big[ h_m(\mathbf{x}_n) \neq y_n \big] $$

Compute the base learner weight $ \alpha_m $, for instance in AdaBoost:

$$ \alpha_m = \frac{1}{2} \ln \frac{1 - \epsilon_m}{\epsilon_m} $$

Update sample weights to emphasize misclassified points:

$$ w_n^{(m+1)} \propto w_n^{(m)} \exp\big( -\alpha_m \, y_n \, h_m(\mathbf{x}_n) \big) $$

followed by normalization to maintain $ \sum_n w_n^{(m+1)} = 1 $.

3. Final strong classifier prediction:

$$ H(\mathbf{x}) = \sum_{m=1}^{M} \alpha_m h_m(\mathbf{x}) $$

For binary classification with labels $ \{-1, +1\} $, the prediction is often:

$$ \hat{y}(\mathbf{x}) = \operatorname{sign}\!\left( \sum_{m=1}^{M} \alpha_m h_m(\mathbf{x}) \right) $$

Key Points:
Base learners are trained sequentially, each focusing more on training instances
misclassified by previous classifiers.
Instance weights are updated dynamically to emphasize difficult cases.
Model weights reflect each learner’s accuracy; more accurate learners contribute more.
Boosting combines weak learners to create a strong learner, often with reduced bias and
variance.

Summary Table

| Aspect | Bagging | Boosting |
|---|---|---|
| Training process | Train base models independently on bootstrap samples | Train base models sequentially with weighted data |
| Data sampling | Bootstrap sampling with replacement | Original dataset with updated weights on instances |
| Base learner emphasis | Equal weight on all instances | Weights increased on misclassified instances |
| Combination | Majority voting or averaging | Weighted sum of base learners |
| Parallelization | Can be parallelized | Sequential, dependent training |
| Primary effect | Reduces variance | Reduces bias and variance |
| Typical base learners | High variance models (e.g., unpruned decision trees) | Weak learners (e.g., decision stumps) |

These mathematical formulations capture the essence and differences of Bagging and Boosting
ensemble methods widely used for classification and regression tasks. They follow the principles
outlined in standard machine learning literature and align with classical sources.

Lagrange Multipliers: Full Derivation and Formulation

Background and Problem Setup


Suppose we want to optimize a function $ f(\mathbf{x}) $ with respect to variables $ \mathbf{x} = (x_1, \ldots, x_n) $, subject to one or more equality constraints:

$$ g_i(\mathbf{x}) = 0, \qquad i = 1, \ldots, m $$

where each $ g_i $ is a continuously differentiable function.

Intuition of Lagrange Multipliers


At an optimum point subject to the constraints, the level surfaces of the objective $ f $ must be tangent to the constraint surfaces defined by the $ g_i $. This leads to the condition that the gradient of the objective can be expressed as a linear combination of the gradients of the constraints:

$$ \nabla f(\mathbf{x}^*) = \sum_{i=1}^{m} \lambda_i \nabla g_i(\mathbf{x}^*) $$

where the $ \lambda_i $ are scalar multipliers known as Lagrange multipliers.

Formulation Using the Lagrangian Function


Define the Lagrangian:

where .
The constrained optimization problem becomes finding stationary points of with respect to
both and .
Necessary Conditions (Equality-Constrained)
Solve the system:

$ \nabla_{\mathbf{x}} L = \nabla f(\mathbf{x}) + \sum_{i=1}^{m} \lambda_i \nabla g_i(\mathbf{x}) = \mathbf{0}, \qquad \frac{\partial L}{\partial \lambda_i} = g_i(\mathbf{x}) = 0, \quad i = 1, \dots, m. $

These yield candidate points for maxima or minima, turning a constrained problem into solving a system of equations. A small symbolic example follows.
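
A worked example with SymPy (the choice of $ f $ and $ g $ is illustrative, not from the text): minimize $ f(x, y) = x^2 + y^2 $ subject to $ g(x, y) = x + y - 1 = 0 $.

```python
# Solve the stationarity system of the Lagrangian symbolically.
import sympy as sp

x, y, lam = sp.symbols('x y lambda_')
f = x**2 + y**2
g = x + y - 1

L = f + lam * g                                   # Lagrangian L = f + lambda * g
stationary = sp.solve([sp.diff(L, v) for v in (x, y, lam)], [x, y, lam], dict=True)
print(stationary)   # [{x: 1/2, y: 1/2, lambda_: -1}] -- the constrained minimum
```

Setting all partial derivatives of $ L $ to zero recovers both the stationarity condition and the constraint itself, confirming the candidate point $ (1/2, 1/2) $ with multiplier $ \lambda = -1 $.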

Karush-Kuhn-Tucker (KKT) Conditions: Generalization of Lagrange Multipliers


KKT conditions extend the method of Lagrange multipliers to problems with both equality and inequality constraints:

$ \min_{\mathbf{x}} f(\mathbf{x}) \quad \text{subject to} \quad g_i(\mathbf{x}) \le 0, \; i = 1, \dots, m, \qquad h_j(\mathbf{x}) = 0, \; j = 1, \dots, p. $

Lagrangian for KKT


Define the Lagrangian with dual variables (Lagrange multipliers):

$ L(\mathbf{x}, \boldsymbol{\mu}, \boldsymbol{\lambda}) = f(\mathbf{x}) + \sum_{i=1}^{m} \mu_i g_i(\mathbf{x}) + \sum_{j=1}^{p} \lambda_j h_j(\mathbf{x}), $

where
$ \mu_i \ge 0 $ are multipliers for inequality constraints,
$ \lambda_j $ are multipliers for equality constraints.

KKT Necessary Conditions for Optimality


At a local minimum $ \mathbf{x}^* $ (under suitable regularity conditions), there exist $ \boldsymbol{\mu}, \boldsymbol{\lambda} $ such that:
1. Stationarity: $ \nabla f(\mathbf{x}^*) + \sum_{i=1}^{m} \mu_i \nabla g_i(\mathbf{x}^*) + \sum_{j=1}^{p} \lambda_j \nabla h_j(\mathbf{x}^*) = \mathbf{0} $
2. Primal feasibility: $ g_i(\mathbf{x}^*) \le 0 $ for all $ i $, and $ h_j(\mathbf{x}^*) = 0 $ for all $ j $
3. Dual feasibility: $ \mu_i \ge 0 $ for all $ i $
4. Complementary slackness: $ \mu_i \, g_i(\mathbf{x}^*) = 0 $ for all $ i $
Interpretation of Complementary Slackness


This means that for each inequality constraint, either the constraint is inactive ($ g_i(\mathbf{x}^*) < 0 $) and the corresponding multiplier is zero ($ \mu_i = 0 $), or the constraint is active ($ g_i(\mathbf{x}^*) = 0 $) and the multiplier can be positive.

Summary Table of KKT Conditions


| Condition | Mathematical Expression | Meaning |
| --- | --- | --- |
| Stationarity | $ \nabla_{\mathbf{x}} L(\mathbf{x}^*, \boldsymbol{\mu}, \boldsymbol{\lambda}) = \mathbf{0} $ | Gradient of Lagrangian zero |
| Primal feasibility | $ g_i(\mathbf{x}^*) \le 0 $, $ h_j(\mathbf{x}^*) = 0 $ | Constraints satisfied |
| Dual feasibility | $ \mu_i \ge 0 $ | Multipliers non-negative |
| Complementary slackness | $ \mu_i \, g_i(\mathbf{x}^*) = 0 $ | Constraint-multiplier pairing condition |

When Are KKT Conditions Sufficient?


If the objective function and the inequality constraints are convex and the equality constraints are affine, then satisfying the KKT conditions is both necessary and sufficient for global optimality. A numerical illustration follows.
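
A hedged numerical sketch verifying the KKT conditions with SciPy's SLSQP solver; the problem itself (projecting a point onto the unit disc) is an illustrative choice, not from the text:

```python
# Minimize f(x) = (x1 - 2)^2 + (x2 - 1)^2 subject to g(x) = x1^2 + x2^2 - 1 <= 0.
import numpy as np
from scipy.optimize import minimize

f = lambda x: (x[0] - 2)**2 + (x[1] - 1)**2
g = lambda x: x[0]**2 + x[1]**2 - 1              # constraint written as g(x) <= 0

# SciPy's 'ineq' convention is fun(x) >= 0, hence the sign flip on g.
res = minimize(f, x0=[0.0, 0.0], method='SLSQP',
               constraints=[{'type': 'ineq', 'fun': lambda x: -g(x)}])
x_star = res.x

# The constraint is active at the optimum (g(x*) ~ 0), so complementary slackness
# permits mu > 0, and stationarity requires grad f(x*) + mu * grad g(x*) = 0.
grad_f = 2 * (x_star - np.array([2.0, 1.0]))
grad_g = 2 * x_star
mu = -grad_f[0] / grad_g[0]                      # read mu off the first component
print(x_star, g(x_star), mu, grad_f + mu * grad_g)   # stationarity residual should be ~0
```

Here the optimum lies on the boundary of the feasible disc, so the recovered multiplier $ \mu \approx \sqrt{5} - 1 > 0 $, and all four KKT conditions can be checked directly.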

Detailed Derivation Approach (Outline)


1. Form a Lagrangian with multipliers for equality and inequality constraints.
2. Use directional derivatives and constraint qualification conditions to analyze optimality.
3. Enforce stationarity by setting the gradient of the Lagrangian to zero.
4. Enforce feasibility of constraints.
5. Enforce non-negativity of multipliers for inequality constraints.
6. Enforce complementary slackness to handle inequality constraints properly.

Summary
The Lagrange multiplier method handles equality constraints by introducing multipliers to form a Lagrangian; optimizing this reduces to solving a system of equations.
The KKT conditions generalize this to inequality constraints, introducing extra conditions
(dual feasibility and complementary slackness) alongside stationarity and primal feasibility.
These conditions are a fundamental tool in constrained nonlinear optimization, widely used in
economics, engineering, machine learning, and operations research.

This presentation follows classical derivations and explanations as found in standard optimization textbooks and literature, such as Bishop's Pattern Recognition and Machine Learning for basic Lagrange multipliers and advanced optimization literature for KKT conditions. [1] [2] [3] [4] [5]

1. [Link]d003b3_MIT18_02SC_notes_22.pdf
2. [Link]
3. [Link]
4. [Link]
5. [Link]
Support Vector Machines (SVM): Formulations, Dual Method, Soft Margin and
Hard Margin — Full Derivations

1. Hard Margin SVM Formulation


SVM aims to find the optimal separating hyperplane that maximizes the margin between two
classes in a linearly separable dataset.
Given training data: $ \{(\mathbf{x}_n, t_n)\}_{n=1}^{N} $, where $ \mathbf{x}_n \in \mathbb{R}^D $ and $ t_n \in \{-1, +1\} $.
Goal: Find the hyperplane defined by weight vector $ \mathbf{w} $ and bias $ b $ such that

$ t_n (\mathbf{w}^T \mathbf{x}_n + b) \ge 1, \quad n = 1, \dots, N, $

and the margin $ 2 / \lVert \mathbf{w} \rVert $ is maximized.

This translates to the optimization problem:

$ \min_{\mathbf{w}, b} \; \frac{1}{2} \lVert \mathbf{w} \rVert^2 $

subject to

$ t_n (\mathbf{w}^T \mathbf{x}_n + b) \ge 1, \quad n = 1, \dots, N. $
This is a convex quadratic optimization problem with linear constraints.

2. Lagrangian and Dual Formulation (Hard Margin)


To solve, introduce Lagrange multipliers $ a_n \ge 0 $ for each inequality constraint and form the Lagrangian:

$ L(\mathbf{w}, b, \mathbf{a}) = \frac{1}{2} \lVert \mathbf{w} \rVert^2 - \sum_{n=1}^{N} a_n \left\{ t_n (\mathbf{w}^T \mathbf{x}_n + b) - 1 \right\}. $

KKT Conditions guide the solution:

Stationarity: $ \mathbf{w} = \sum_{n=1}^{N} a_n t_n \mathbf{x}_n, \qquad \sum_{n=1}^{N} a_n t_n = 0. $
Substitute back into $ L $ to derive the dual optimization problem:

$ \max_{\mathbf{a}} \; \tilde{L}(\mathbf{a}) = \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m t_n t_m \, \mathbf{x}_n^T \mathbf{x}_m $

subject to

$ a_n \ge 0, \quad n = 1, \dots, N, \qquad \sum_{n=1}^{N} a_n t_n = 0. $

This is a quadratic programming problem with the $ a_n $ as variables.

The support vectors correspond to non-zero $ a_n $. The optimal classifier is:

$ y(\mathbf{x}) = \operatorname{sign}\left( \sum_{n=1}^{N} a_n t_n \, \mathbf{x}_n^T \mathbf{x} + b \right). $

3. Soft Margin SVM Formulation


Hard margin SVM requires perfect linear separability, which is often not realistic. Soft margin SVM introduces slack variables $ \xi_n \ge 0 $ to allow misclassification or margin violation, controlling the trade-off between margin maximization and classification error.
Optimization problem:

$ \min_{\mathbf{w}, b, \boldsymbol{\xi}} \; \frac{1}{2} \lVert \mathbf{w} \rVert^2 + C \sum_{n=1}^{N} \xi_n $

subject to

$ t_n (\mathbf{w}^T \mathbf{x}_n + b) \ge 1 - \xi_n, \qquad \xi_n \ge 0, \quad n = 1, \dots, N, $

where $ C > 0 $ is the regularization parameter controlling the penalty for margin violations.

4. Soft Margin SVM Dual Formulation


Construct the Lagrangian with multipliers $ a_n \ge 0 $ for the first inequality and $ \mu_n \ge 0 $ for the slack constraints:

$ L = \frac{1}{2} \lVert \mathbf{w} \rVert^2 + C \sum_{n=1}^{N} \xi_n - \sum_{n=1}^{N} a_n \left\{ t_n (\mathbf{w}^T \mathbf{x}_n + b) - 1 + \xi_n \right\} - \sum_{n=1}^{N} \mu_n \xi_n. $

The dual problem (after optimizing out the primal variables) is:

$ \max_{\mathbf{a}} \; \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m t_n t_m \, \mathbf{x}_n^T \mathbf{x}_m $

subject to

$ 0 \le a_n \le C, \quad n = 1, \dots, N, \qquad \sum_{n=1}^{N} a_n t_n = 0. $

Note the key difference from the hard margin: the $ a_n $ are upper bounded by $ C $, reflecting the soft margin slack penalties.

5. Summary of Differences Between Hard and Soft Margin


| Aspect | Hard Margin | Soft Margin |
| --- | --- | --- |
| Assumption | Linearly separable data required | Allows misclassification and margin violations |
| Objective | Minimize $ \frac{1}{2} \lVert \mathbf{w} \rVert^2 $ subject to no violations | Minimize $ \frac{1}{2} \lVert \mathbf{w} \rVert^2 + C \sum_n \xi_n $, balancing margin and error |
| Constraints | $ t_n(\mathbf{w}^T \mathbf{x}_n + b) \ge 1 $ | $ t_n(\mathbf{w}^T \mathbf{x}_n + b) \ge 1 - \xi_n $, $ \xi_n \ge 0 $ |
| Dual constraints | $ a_n \ge 0 $ with no upper bound | $ 0 \le a_n \le C $, reflecting slack penalties |

6. Interpretation
The SVM solution depends only on support vectors (training points with nonzero $ a_n $).
The dual formulation enables the kernel trick, allowing implicit mapping to high-dimensional feature spaces via kernel functions $ k(\mathbf{x}, \mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^T \boldsymbol{\phi}(\mathbf{x}') $.
Soft margin SVM is robust to noise and non-separable data via the slack variables $ \xi_n $ and the regularization parameter $ C $. A brief code sketch follows.
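
A brief scikit-learn sketch contrasting hard-margin-like and soft-margin behaviour via the $ C $ parameter; the data and the specific $ C $ values are illustrative choices (`SVC` solves the dual QP internally):

```python
# Soft vs. (approximately) hard margin with a linear-kernel SVM.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2, 1, size=(50, 2)), rng.normal(2, 1, size=(50, 2))])
t = np.array([-1] * 50 + [1] * 50)

hardish = SVC(kernel='linear', C=1e6).fit(X, t)   # very large C approximates hard margin
soft = SVC(kernel='linear', C=0.1).fit(X, t)      # small C tolerates margin violations

# Support vectors are the points with nonzero dual variables a_n.
print(len(hardish.support_), len(soft.support_))  # soft margin typically keeps more SVs
# Kernel trick: swapping kernel='linear' for kernel='rbf' gives nonlinear boundaries.
```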

This comprehensive summary captures the formulation, derivation, and core mathematical
foundations of SVMs including primal and dual form, and differences between soft and hard
margin methods. The derivations align with classical treatments in machine learning theory and
optimization literature.

K-Means Algorithm — Derivation Using EM and Gaussian Mixture Models (GMM)

Overview
The K-Means clustering algorithm can be rigorously derived as a special case of the
Expectation-Maximization (EM) algorithm applied to a Gaussian Mixture Model (GMM) with
simplifying assumptions. Understanding this connection helps explain K-Means as a hard
clustering method approximating the soft probabilistic clustering in GMMs.

Gaussian Mixture Model (GMM) Recap


A GMM models the data distribution as a weighted sum of Gaussian components:

$ p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), $

where:
$ K $ is the number of clusters,
$ \pi_k $ are mixing coefficients ($ \sum_k \pi_k = 1 $, $ \pi_k \ge 0 $),
$ \boldsymbol{\mu}_k $ are means,
$ \boldsymbol{\Sigma}_k $ are covariance matrices.
The EM algorithm for GMM alternates between:
E-step: Compute the soft responsibilities:

$ \gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}, $

where $ z_{nk} $ is the latent cluster assignment.

M-step: Update parameters using the responsibilities, e.g. $ \boldsymbol{\mu}_k = \dfrac{\sum_n \gamma(z_{nk}) \mathbf{x}_n}{\sum_n \gamma(z_{nk})}. $
K-Means as a Limit Case of EM on GMM
K-Means emerges as a special case of EM on GMM under the assumptions:
1. Spherical covariance matrices: $ \boldsymbol{\Sigma}_k = \sigma^2 \mathbf{I} $, identical and isotropic for all clusters.
2. Vanishing variance $ \sigma^2 \to 0 $: the Gaussians become very peaked.
3. Equal mixing weights: $ \pi_k = 1/K $ (uniform cluster priors).
Under these assumptions, the responsibilities become binary hard assignments, since the Gaussian likelihood of the closest mean dominates:

$ \gamma(z_{nk}) \to 1 \quad \text{if } k = \arg\min_j \lVert \mathbf{x}_n - \boldsymbol{\mu}_j \rVert^2, $

and zero otherwise, meaning each data point is assigned to the nearest centroid.

EM Steps Reduced to K-Means Steps


E-step (Assignment): Assign each data point $ \mathbf{x}_n $ to the nearest cluster centroid $ \boldsymbol{\mu}_k $:

$ r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \lVert \mathbf{x}_n - \boldsymbol{\mu}_j \rVert^2 \\ 0 & \text{otherwise} \end{cases} $

M-step (Centroid update): Given these hard assignments, update each cluster center to the mean of the assigned points:

$ \boldsymbol{\mu}_k = \frac{\sum_{n} r_{nk} \mathbf{x}_n}{\sum_{n} r_{nk}}. $
K-Means Objective Function


K-Means minimizes the within-cluster sum of squares (distortion):

$ J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \lVert \mathbf{x}_n - \boldsymbol{\mu}_k \rVert^2, $

where $ r_{nk} = 1 $ if point $ n $ is assigned to cluster $ k $, else 0.

The alternating minimization of this objective with respect to $ r_{nk} $ and $ \boldsymbol{\mu}_k $ corresponds exactly to the two EM steps above under the simplifying GMM assumptions; a NumPy sketch follows.
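
A minimal NumPy K-Means sketch making the two EM-style steps explicit; the initialization and stopping rule are simple illustrative choices, and empty clusters are not handled:

```python
# Alternating E-step (hard assignment) and M-step (centroid update).
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]    # initial centroids
    for _ in range(n_iter):
        # E-step: squared distances to each centroid, then nearest-centroid assignment
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # shape (N, K)
        r = d2.argmin(axis=1)
        # M-step: recompute each centroid as the mean of its assigned points
        # (a cluster left empty would produce NaN in this simple sketch)
        new_mu = np.array([X[r == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    distortion = d2[np.arange(len(X)), r].sum()          # objective J
    return mu, r, distortion
```

Each iteration first minimizes $ J $ over the assignments $ r_{nk} $ with the centroids fixed, then over the centroids $ \boldsymbol{\mu}_k $ with the assignments fixed, mirroring the E-step and M-step above.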
Summary Table
| Aspect | K-Means | GMM + EM |
| --- | --- | --- |
| Model | Hard assignments | Soft probabilistic assignments |
| Cluster assignment | Nearest mean (hard) assignment | Posterior responsibilities (soft) |
| Cluster shape assumption | Spherical and equal variance | Full covariance, learned from data |
| Objective | Minimize sum of squared distances | Maximize data log-likelihood |
| Algorithm | Alternating minimization steps | EM algorithm with E-step and M-step |
| Limiting assumptions | Bridging via $ \sigma^2 \to 0 $ | Full probabilistic treatment |

Intuition
K-Means corresponds to a simplified, hard clustering version of GMM where clusters are
spherical with equal variance.
The probabilistic EM soft assignments collapse to hard assignments in K-Means.
EM algorithm optimizes the likelihood in GMM, while K-Means optimizes a distortion function.

This derivation and interpretation help bridge the understanding of K-Means from a heuristic
clustering method to a probabilistically principled algorithm derived from Gaussian mixture
modeling and expectation-maximization.

This explanation is based on classical EM and GMM theory, as found in machine learning
literature such as Bishop's Pattern Recognition and Machine Learning and supporting academic
sources. [1] [2] [3] [4] [5] [6]

1. [Link]
2. [Link]
3. [Link]
4. [Link]
5. [Link]
6. [Link]
Dimension Reduction — Principal Component Analysis (PCA): Full Derivation and
Steps, Including SVD-Based Derivation

Overview
Principal Component Analysis (PCA) is a fundamental technique for dimensionality reduction. It
finds a new set of orthogonal variables (principal components) that capture the maximal
variance in the data in descending order, reducing dimensionality while preserving as much
information as possible.

Problem Setup
Consider a dataset with $ N $ observations and $ D $ features arranged in a data matrix:

$ \mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_N]^T \in \mathbb{R}^{N \times D}. $

Assume the data is mean-centered:

$ \bar{\mathbf{x}} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n = \mathbf{0}. $

The goal is to find a projection matrix $ \mathbf{W} \in \mathbb{R}^{D \times K} $ (with $ K < D $) to reduce dimensionality, such that the projected data:

$ \mathbf{Z} = \mathbf{X} \mathbf{W} \in \mathbb{R}^{N \times K} $

retains maximal variance.

Step 1: Variance Maximization Formulation


Each principal component is represented by a direction vector $ \mathbf{u}_1 \in \mathbb{R}^D $ (a unit vector). The first principal component seeks the $ \mathbf{u}_1 $ that maximizes the variance of the data points projected onto $ \mathbf{u}_1 $:

$ \max_{\mathbf{u}_1} \; \mathbf{u}_1^T \mathbf{S} \mathbf{u}_1, $

where $ \mathbf{S} $ is the sample covariance matrix:

$ \mathbf{S} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n \mathbf{x}_n^T = \frac{1}{N} \mathbf{X}^T \mathbf{X}. $

By constraining $ \mathbf{u}_1^T \mathbf{u}_1 = 1 $ (to avoid trivial scaling), it becomes a constrained optimization problem.

Step 2: Solving via Eigenvalue Decomposition


Form the Lagrangian:

$ L(\mathbf{u}_1, \lambda_1) = \mathbf{u}_1^T \mathbf{S} \mathbf{u}_1 + \lambda_1 (1 - \mathbf{u}_1^T \mathbf{u}_1). $

Taking the derivative w.r.t. $ \mathbf{u}_1 $ and setting it to zero:

$ \mathbf{S} \mathbf{u}_1 = \lambda_1 \mathbf{u}_1. $

Thus, $ \mathbf{u}_1 $ must be an eigenvector of the covariance matrix $ \mathbf{S} $, with $ \lambda_1 $ its corresponding eigenvalue. Since the projected variance equals $ \mathbf{u}_1^T \mathbf{S} \mathbf{u}_1 = \lambda_1 $, the maximal variance is obtained by choosing the eigenvector corresponding to the largest eigenvalue.
The subsequent principal components $ \mathbf{u}_2, \dots, \mathbf{u}_K $ are the eigenvectors corresponding to the next largest eigenvalues, with the added orthogonality constraint $ \mathbf{u}_i^T \mathbf{u}_j = 0 $ for $ i \ne j $.

Step 3: Constructing the Projection Matrix


Let the eigenvalues and eigenvectors of $ \mathbf{S} $ be:

$ \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_D, \qquad \mathbf{u}_1, \mathbf{u}_2, \dots, \mathbf{u}_D. $

Select the top $ K $ eigenvectors $ \mathbf{u}_1, \dots, \mathbf{u}_K $. The projection matrix is:

$ \mathbf{W} = [\mathbf{u}_1, \dots, \mathbf{u}_K] \in \mathbb{R}^{D \times K}. $

Dimensionality reduction maps each sample as:

$ \mathbf{z}_n = \mathbf{W}^T \mathbf{x}_n \in \mathbb{R}^K. $
Step 4: Reconstruction Error Minimization (Equivalent Formulation)


PCA can equivalently be derived by minimizing the reconstruction error between the original data and its projection back into the original space:

$ J = \frac{1}{N} \sum_{n=1}^{N} \lVert \mathbf{x}_n - \mathbf{W} \mathbf{W}^T \mathbf{x}_n \rVert^2. $

Minimizing this over orthonormal $ \mathbf{W} $ also leads to $ \mathbf{W} $ containing the top eigenvectors of the covariance matrix $ \mathbf{S} $.

PCA Derivation Using Singular Value Decomposition (SVD)


Given the data matrix $ \mathbf{X} \in \mathbb{R}^{N \times D} $, mean-centered:
1. Compute the Singular Value Decomposition of $ \mathbf{X} $:

$ \mathbf{X} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^T, $

where:
$ \mathbf{U} \in \mathbb{R}^{N \times N} $ is orthogonal,
$ \boldsymbol{\Sigma} \in \mathbb{R}^{N \times D} $ is diagonal (rectangular) with singular values $ \sigma_1 \ge \sigma_2 \ge \dots \ge 0 $,
$ \mathbf{V} \in \mathbb{R}^{D \times D} $ is orthogonal; its columns are the right singular vectors.
2. The covariance matrix can be expressed as:

$ \mathbf{S} = \frac{1}{N} \mathbf{X}^T \mathbf{X} = \frac{1}{N} \mathbf{V} \boldsymbol{\Sigma}^T \boldsymbol{\Sigma} \mathbf{V}^T. $

3. The eigenvectors of $ \mathbf{S} $ are the columns of $ \mathbf{V} $, and the eigenvalues are $ \lambda_i = \sigma_i^2 / N $.

4. Thus, the principal components correspond to the right singular vectors $ \mathbf{v}_i $, and the projected data is:

$ \mathbf{Z} = \mathbf{X} \mathbf{V}_K = \mathbf{U}_K \boldsymbol{\Sigma}_K, $

where $ \mathbf{V}_K $, $ \mathbf{U}_K $, $ \boldsymbol{\Sigma}_K $ keep only the top $ K $ components.
PCA: Step-By-Step Summary


1. Mean-center the data: Subtract the mean of each feature to center the data at zero.
2. Compute the covariance matrix: $ \mathbf{S} = \frac{1}{N} \mathbf{X}^T \mathbf{X} $.
3. Eigen-decomposition: Find the eigenvalues and eigenvectors of $ \mathbf{S} $.
4. Select principal components: Choose the top $ K $ eigenvectors with the largest eigenvalues.
5. Project data: Compute $ \mathbf{Z} = \mathbf{X} \mathbf{W} $.
6. (Optional) Reconstruction: Approximate $ \hat{\mathbf{X}} = \mathbf{Z} \mathbf{W}^T $.
Or alternatively, using the SVD of $ \mathbf{X} $:
1. Compute $ \mathbf{X} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^T $.
2. Select the top $ K $ columns of $ \mathbf{V} $ as the principal directions.
3. Project data: $ \mathbf{Z} = \mathbf{X} \mathbf{V}_K = \mathbf{U}_K \boldsymbol{\Sigma}_K $. A NumPy sketch follows.
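
A hedged NumPy sketch of PCA via SVD, mirroring the steps above; the random correlated data and the choice $ K = 2 $ are illustrative:

```python
# PCA via the thin SVD of the mean-centered data matrix.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))  # correlated features

Xc = X - X.mean(axis=0)                 # 1. mean-center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)        # 2. thin SVD: Xc = U S V^T

K = 2
W = Vt[:K].T                            # top-K right singular vectors = principal directions
Z = Xc @ W                              # 3. project data (equivalently U[:, :K] * S[:K])
X_hat = Z @ W.T + X.mean(axis=0)        # optional reconstruction

eigvals = S**2 / len(X)                 # eigenvalues of the covariance matrix
print(eigvals[:K] / eigvals.sum())      # explained variance ratio of the kept components
```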
Key Insights
PCA finds orthogonal directions capturing maximal variance.
It is optimal in minimizing reconstruction error with a low-dimensional linear subspace.
The equivalence of the eigen-decomposition and SVD-based methods allows flexible computation optimized for data size (SVD is often preferred numerically).
Dimensionality reduction can mitigate the curse of dimensionality, noise, and redundancy.

Summary Table
| Step | Description | Formulation/Operation |
| --- | --- | --- |
| Data centering | Subtract mean vector from data | $ \mathbf{x}_n \leftarrow \mathbf{x}_n - \bar{\mathbf{x}} $ |
| Covariance computation | Calculate covariance matrix | $ \mathbf{S} = \frac{1}{N} \mathbf{X}^T \mathbf{X} $ |
| Eigen-decomposition | Find eigenvalues/eigenvectors of covariance | $ \mathbf{S} \mathbf{u}_i = \lambda_i \mathbf{u}_i $ |
| Principal components | Choose top $ K $ eigenvectors by eigenvalue magnitude | $ \mathbf{W} = [\mathbf{u}_1, \dots, \mathbf{u}_K] $ |
| Projection | Map data onto principal components | $ \mathbf{Z} = \mathbf{X} \mathbf{W} $ |
| SVD alternative | Decompose $ \mathbf{X} $ directly | $ \mathbf{X} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^T $ |
| Relation between SVD & PCA | PCA directions = right singular vectors $ \mathbf{V} $ | $ \lambda_i = \sigma_i^2 / N $ |

This comprehensive derivation and explanation follow classical PCA theory and the numerical linear algebra foundations presented in standard references, including Bishop's Pattern Recognition and Machine Learning and authoritative machine learning and statistics literature. [1] [2] [3] [4]



1. [Link]on-for-pca
2. [Link]
3. [Link]
4. [Link]
Artificial Neural Networks (ANNs): Full Mathematical Formulation
Artificial Neural Networks (ANNs) are computational models inspired by the biological neural
networks of animal brains. They consist of interconnected layers of simple processing units
called neurons, arranged in layers: an input layer, one or more hidden layers, and an output layer.
ANNs approximate complex functions mapping input data to outputs through compositions of
linear transformations and nonlinear activation functions.

1. Basic Building Block: Artificial Neuron


Each neuron computes a weighted sum of its inputs plus a bias term, followed by an activation function:

$ z_j = \sum_{i} w_{ji} x_i + b_j, \qquad a_j = h(z_j), $

where:
$ x_i $ are inputs to the neuron (either original inputs or outputs from previous-layer neurons),
$ w_{ji} $ are weights connecting input $ i $ to neuron $ j $,
$ b_j $ is the bias of neuron $ j $,
$ h(\cdot) $ is an element-wise nonlinear activation function (e.g., sigmoid, ReLU, tanh),
$ z_j $ is the pre-activation input to the activation function,
$ a_j $ is the neuron's output (activation).

2. Feedforward Computation in Multi-Layer Networks


Consider a network with $ L $ layers, indexed by $ l = 1, \dots, L $:
The input layer ($ l = 0 $) has input vector $ \mathbf{a}^{(0)} = \mathbf{x} $,
Each hidden layer $ l $ has neurons producing activations $ \mathbf{a}^{(l)} $,
The output layer yields output $ \mathbf{y} = \mathbf{a}^{(L)} $.
The feedforward operation through the layers is:

$ \mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}, \qquad \mathbf{a}^{(l)} = h^{(l)}(\mathbf{z}^{(l)}), $

with the convention $ \mathbf{a}^{(0)} = \mathbf{x} $. Here:
$ \mathbf{W}^{(l)} $ is the weight matrix for layer $ l $,
$ \mathbf{b}^{(l)} $ is the bias vector,
$ h^{(l)} $ is the activation applied element-wise, typically nonlinear for hidden layers and appropriate for the output layer (e.g., softmax for classification). A NumPy forward-pass sketch follows.
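
A minimal NumPy forward pass for a fully connected network; the layer sizes, initialization scale, and activation choices are illustrative assumptions:

```python
# Feedforward pass: affine transform + nonlinearity per layer.
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max())             # shift for numerical stability
    return e / e.sum()

rng = np.random.default_rng(4)
sizes = [4, 8, 3]                       # input dim, one hidden layer, output dim
W = [rng.normal(scale=0.1, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(m) for m in sizes[1:]]

def forward(x):
    a = x                               # a^(0) = x
    for l in range(len(W) - 1):
        a = relu(W[l] @ a + b[l])       # z^(l) = W^(l) a^(l-1) + b^(l); a^(l) = h(z^(l))
    return softmax(W[-1] @ a + b[-1])   # softmax output layer for classification

y = forward(rng.normal(size=4))
print(y, y.sum())                       # a probability vector summing to 1
```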

3. Activation Functions
Common choices include:
Sigmoid (logistic): $ \sigma(z) = \dfrac{1}{1 + e^{-z}} $
Hyperbolic tangent (tanh): $ \tanh(z) = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}} $
Rectified Linear Unit (ReLU): $ \mathrm{ReLU}(z) = \max(0, z) $
Softmax (output layer for multiclass classification): $ \mathrm{softmax}(\mathbf{z})_k = \dfrac{e^{z_k}}{\sum_j e^{z_j}} $

4. Loss Function and Training Objective


ANNs are typically trained to minimize a loss function $ \mathcal{L} $ over training data $ \{(\mathbf{x}_n, \mathbf{t}_n)\}_{n=1}^{N} $, where $ \mathbf{t}_n $ is the target output:

$ \mathcal{L}(\Theta) = \frac{1}{N} \sum_{n=1}^{N} \ell(\mathbf{y}_n, \mathbf{t}_n), $

where $ \mathbf{y}_n = F(\mathbf{x}_n; \Theta) $ is the ANN output for input $ \mathbf{x}_n $ with parameters $ \Theta $, and $ \ell $ is a per-example loss function, such as:

Mean squared error for regression: $ \ell(\mathbf{y}, \mathbf{t}) = \lVert \mathbf{y} - \mathbf{t} \rVert^2 $

Cross-entropy loss for classification: $ \ell(\mathbf{y}, \mathbf{t}) = -\sum_{k} t_k \ln y_k $ (a small code sketch of both losses follows)

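A small NumPy sketch of the two losses above (one-hot targets are assumed for the cross-entropy, and the `eps` guard is an illustrative numerical safeguard):

```python
# Per-batch mean squared error and cross-entropy losses.
import numpy as np

def mse(y, t):
    return np.mean(np.sum((y - t) ** 2, axis=-1))          # mean squared error

def cross_entropy(y, t, eps=1e-12):
    return -np.mean(np.sum(t * np.log(y + eps), axis=-1))  # eps guards against log(0)

y = np.array([[0.7, 0.2, 0.1]])
t = np.array([[1.0, 0.0, 0.0]])
print(mse(y, t), cross_entropy(y, t))
```
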

5. Backpropagation and Parameter Updates
Training uses gradient-based optimization to update weights and biases:
Compute gradients via backpropagation,
Update parameters using gradient descent:

$ \Theta \leftarrow \Theta - \eta \, \nabla_{\Theta} \mathcal{L}(\Theta), $

where $ \eta $ is the learning rate.

6. Summary of the ANN Functional Form


Given input $ \mathbf{x} $, the ANN function is a composition of affine transformations and nonlinearities:

$ F(\mathbf{x}; \Theta) = h^{(L)}\!\left( \mathbf{W}^{(L)} \, h^{(L-1)}\!\left( \cdots h^{(1)}\!\left( \mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)} \right) \cdots \right) + \mathbf{b}^{(L)} \right). $
7. Notes on Network Architecture


The number of hidden layers and the number of neurons per layer affect the model's capacity, expressivity, and training complexity.
Common architectures include:
Feedforward networks (Multilayer Perceptron)
Convolutional Neural Networks (CNNs) for image data (involving convolutional layers)
Recurrent Neural Networks (RNNs) for sequential data
Regularization techniques like weight decay, dropout, and batch normalization improve
generalization.

This mathematical formulation captures the core building blocks and function of artificial neural
networks, consistent with rigorous treatments found in seminal machine learning literature.

This explanation synthesizes standard mathematical formulations from authoritative sources on neural networks and machine learning.

Backpropagation: Full Derivation and Step-by-Step Explanation
Backpropagation is the fundamental algorithm used to train artificial neural networks by
efficiently computing gradients of the loss function with respect to all weights and biases. It
applies the chain rule of calculus to propagate errors backward through the network, allowing
gradient-based optimization methods like gradient descent.

1. Setting: Feedforward Neural Network


Consider a feedforward network with $ L $ layers. For layer $ l = 1, \dots, L $:
Inputs to layer $ l $ are the activations $ \mathbf{a}^{(l-1)} $.
Linear combination before activation:

$ \mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}, $

where $ \mathbf{W}^{(l)} $ is the weight matrix and $ \mathbf{b}^{(l)} $ the bias vector.

Activations after applying the nonlinear activation function $ h $:

$ \mathbf{a}^{(l)} = h(\mathbf{z}^{(l)}). $

Input layer activations: $ \mathbf{a}^{(0)} = \mathbf{x} $.

Output of the network:

$ \mathbf{y} = \mathbf{a}^{(L)}. $

2. Objective: Loss Function


Given a training example with true target $ \mathbf{t} $, the loss function $ \mathcal{L}(\mathbf{y}, \mathbf{t}) $ measures the prediction error, for instance mean squared error or cross-entropy.
The goal is to minimize the total loss over the dataset by updating weights and biases via gradient descent, using the derivatives (gradients) of the loss w.r.t. all parameters: $ \partial \mathcal{L} / \partial \mathbf{W}^{(l)} $ and $ \partial \mathcal{L} / \partial \mathbf{b}^{(l)} $.
3. Backpropagation Key Idea
Backpropagation computes the gradient of the loss w.r.t. each layer's parameters by
propagating an error signal backward starting from the output layer and using the chain rule.
Define the error at layer $ l $ (also called the delta vector)

$ \boldsymbol{\delta}^{(l)} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(l)}}, $

which quantifies how the loss changes with changes in the pre-activation vector $ \mathbf{z}^{(l)} $.

4. Step-by-Step Derivation
(a) Output layer error:
For the output layer $ L $, assuming an element-wise activation,

$ \boldsymbol{\delta}^{(L)} = \nabla_{\mathbf{a}^{(L)}} \mathcal{L} \odot h'(\mathbf{z}^{(L)}), $

where
$ \nabla_{\mathbf{a}^{(L)}} \mathcal{L} $ is the gradient of the loss w.r.t. the outputs,
$ h'(\mathbf{z}^{(L)}) $ is the element-wise derivative of the activation function,
$ \odot $ denotes the element-wise (Hadamard) product.

(b) Backpropagation to hidden layers:


For each hidden layer $ l = L-1, \dots, 1 $, the error is:

$ \boldsymbol{\delta}^{(l)} = \left( (\mathbf{W}^{(l+1)})^T \boldsymbol{\delta}^{(l+1)} \right) \odot h'(\mathbf{z}^{(l)}). $

This expresses the error at layer $ l $ in terms of the error at layer $ l+1 $, the weights connecting them, and the derivatives of the activation.

(c) Gradient with respect to parameters:


Once $ \boldsymbol{\delta}^{(l)} $ is obtained:
Gradient w.r.t. weights:

$ \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} = \boldsymbol{\delta}^{(l)} (\mathbf{a}^{(l-1)})^T $

Gradient w.r.t. biases:

$ \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)} $

5. Algorithmic Steps of Backpropagation
Given one training instance $ (\mathbf{x}, \mathbf{t}) $:
1. Forward pass:
Compute $ \mathbf{z}^{(l)} $ and $ \mathbf{a}^{(l)} $ for all layers $ l = 1 $ to $ L $.
2. Compute output error:
Compute $ \boldsymbol{\delta}^{(L)} $ using the loss gradient and the output activation derivative.
3. Backpropagate errors:
For $ l = L-1 $ down to 1, compute $ \boldsymbol{\delta}^{(l)} $ as above.
4. Compute gradients:
Calculate $ \partial \mathcal{L} / \partial \mathbf{W}^{(l)} $ and $ \partial \mathcal{L} / \partial \mathbf{b}^{(l)} $ for each layer.
5. Update parameters (e.g., gradient descent):

$ \mathbf{W}^{(l)} \leftarrow \mathbf{W}^{(l)} - \eta \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}}, \qquad \mathbf{b}^{(l)} \leftarrow \mathbf{b}^{(l)} - \eta \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}}, $

where $ \eta $ is the learning rate. A NumPy sketch of one such update follows.
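
A hedged NumPy sketch of one backpropagation update for a single hidden layer with sigmoid units and a squared-error loss; the layer sizes, random data, and learning rate are illustrative choices:

```python
# One forward/backward pass and gradient-descent update for a 3-4-2 network.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(5)
x, t = rng.normal(size=3), np.array([1.0, 0.0])
W1, b1 = rng.normal(scale=0.5, size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(scale=0.5, size=(2, 4)), np.zeros(2)
eta = 0.1

# 1. Forward pass: cache z^(l) and a^(l) for every layer.
z1 = W1 @ x + b1; a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)

# 2. Output error: delta^(L) = dL/da (.) h'(z^(L)); for L = 1/2 ||a - t||^2, dL/da = a - t.
delta2 = (a2 - t) * a2 * (1 - a2)        # sigmoid'(z) = a(1 - a)

# 3. Backpropagate: delta^(l) = (W^(l+1)^T delta^(l+1)) (.) h'(z^(l)).
delta1 = (W2.T @ delta2) * a1 * (1 - a1)

# 4. Gradients: dL/dW^(l) = delta^(l) a^(l-1)^T, dL/db^(l) = delta^(l).
# 5. Gradient-descent update.
W2 -= eta * np.outer(delta2, a1); b2 -= eta * delta2
W1 -= eta * np.outer(delta1, x);  b1 -= eta * delta1
```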

6. Summary: Compact Mathematical Flow

$ \boldsymbol{\delta}^{(L)} = \nabla_{\mathbf{a}} \mathcal{L} \odot h'(\mathbf{z}^{(L)}) \;\longrightarrow\; \boldsymbol{\delta}^{(l)} = \left( (\mathbf{W}^{(l+1)})^T \boldsymbol{\delta}^{(l+1)} \right) \odot h'(\mathbf{z}^{(l)}) \;\longrightarrow\; \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} = \boldsymbol{\delta}^{(l)} (\mathbf{a}^{(l-1)})^T, \quad \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)}. $
7. Intuition
Errors at output propagate backward, scaled by weights and element-wise modulated by
activation derivatives.
The gradient for each weight reflects how much a small change in that weight affects the
final error.
Backpropagation provides an efficient way to compute all gradients simultaneously,
avoiding redundant calculations inherent in naive differentiation.

This derivation and explanation follow the standard approach used in the neural network training literature, such as Bishop's Pattern Recognition and Machine Learning, and are foundational to modern deep learning.
