ML - Perplexity
Problem Setup
Let the training dataset consist of $ N $ data points $ \mathcal{D} = \{ (x_n, t_n) \}_{n=1}^N $,
where:
$ x_n \in \mathbb{R} $ is the input variable (scalar, but can be generalized to vectors).
$ t_n \in \mathbb{R} $ is the target variable.
We seek a model (function) $ y(x, \mathbf{w}) $ that predicts $ t $ given $ x $:

$$ y(x, \mathbf{w}) = \sum_{j=0}^{M} w_j x^j = w_0 + w_1 x + \cdots + w_M x^M, $$

where $ M $ is the polynomial degree and $ \mathbf{w} = [w_0, w_1, \dots, w_M]^\top $ is the vector of coefficients. [1] [2]
Thus, fitting minimizes the sum-of-squares error

$$ E(\mathbf{w}) = \frac{1}{2} \| \mathbf{y} - \mathbf{t} \|^2 = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2, $$

where $ \mathbf{y} \in \mathbb{R}^N $ is the vector of model predictions at the input points $ x_n $, $ \mathbf{y} = [y(x_1, \mathbf{w}), \dots, y(x_N, \mathbf{w})]^\top $. [3]

Minimizing $ E(\mathbf{w}) $ gives

$$ \mathbf{w}_{ML} = (\Phi^\top \Phi)^{-1} \Phi^\top \mathbf{t}, $$

where $ \mathbf{w}_{ML} $ is the maximum likelihood solution under the Gaussian noise model. [2] [3]
The solution:

$$ \mathbf{w}^* = (\Phi^\top \Phi + \lambda \mathbf{I})^{-1} \Phi^\top \mathbf{t}, $$

obtained when a quadratic penalty $ \frac{\lambda}{2} \| \mathbf{w} \|^2 $ is added to $ E(\mathbf{w}) $, is interpreted as the maximum a posteriori (MAP) estimate under a Gaussian prior on weights. [3]
References:
(Style and notation closely follow Bishop’s "Pattern Recognition and Machine Learning".) [4] [1] [2] [3]
1. [Link]
2. [Link]
3. [Link]
4. [Link] (RL_Codes/blob/master/Machine_Learning/Bishop/Polynomial Curve)
Full Mathematical Formulation of Overfitting & Regularized Regression
If $ M $ is large (e.g., $ M = N - 1 $), the polynomial can interpolate the training points exactly, giving $ E(\mathbf{w}) = 0 $, but such a solution typically "memorizes" the noise—resulting in poor test/validation performance. [1] [2] [3]
Regularization augments the error with a penalty term:

$$ \widetilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{\lambda}{2} \| \mathbf{w} \|^2, $$

where:
$ \Phi $ is the design (Vandermonde) matrix, $ \Phi_{nj} = x_n^j $,
$ \lambda \geq 0 $ is the regularization parameter (strength) controlling the balance between fit and complexity.
Solution (closed form):

$$ \mathbf{w}^* = (\Phi^\top \Phi + \lambda \mathbf{I})^{-1} \Phi^\top \mathbf{t}. $$
This is sometimes called ridge regression (L2 regularization). The penalty $ \lambda \| \mathbf{w} \|^2 $ shrinks the magnitude of the coefficients, discouraging overly complex models. [4] [5] [2]
Key features:
If $ \lambda = 0 $: Standard least-squares regression, possible overfitting.
If $ \lambda \to \infty $: Coefficients shrink towards zero, possible underfitting.
Optimizing $ \lambda $: Cross-validation typically used for selection.
Summary Table

| Phenomenon | Formulation | Implication |
|---|---|---|
| Overfitting | Large $ M $ drives the training error $ E(\mathbf{w}) \to 0 $ | Memorizes noise; poor test performance |
| Regularization | $ \widetilde{E}(\mathbf{w}) = E(\mathbf{w}) + \frac{\lambda}{2} \| \mathbf{w} \|^2 $ | Penalizes complexity; improves generalization |
Reference Notation
The notation and structure closely follow Bishop’s "Pattern Recognition and Machine Learning" and standard ML literature. [4] [8] [2]
Notes:
L1 regularization (lasso): $ \lambda \sum_j |w_j| $ — promotes sparsity.
L2 regularization (ridge): $ \lambda \sum_j w_j^2 $ — shrinks all weights. [5] [2]
Extensions: Elastic Net, Bayesian Ridge.
Conclusion:
Regularized regression mathematically augments the error function with a complexity penalty.
The balance of loss and regularization parameter $ \lambda $ determines the model's
generalization capacity and prevents overfitting by controlling coefficient values.
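As a minimal sketch of the closed-form ridge solution above (NumPy, the synthetic sinusoidal dataset, and the particular $ M $ and $ \lambda $ values are illustrative assumptions, not from the source):

```python
import numpy as np

# Sketch of regularized polynomial regression (ridge), following the
# closed-form solution w* = (Phi^T Phi + lambda I)^{-1} Phi^T t.
rng = np.random.default_rng(0)
N, M, lam = 20, 9, 1e-3                      # sample size, degree, regularization strength
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(N)   # noisy targets

Phi = np.vander(x, M + 1, increasing=True)   # design matrix, Phi[n, j] = x_n^j
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ t)

y = Phi @ w                                  # predictions at the training inputs
print("training RMS error:", np.sqrt(np.mean((y - t) ** 2)))
```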
⁂
1. [Link]
2. [Link]
3. [Link]
4. [Link]
5. [Link]
6. [Link]
7. [Link]
8. [Link]
Curse of Dimensionality
The curse of dimensionality refers to the host of problems that arise when working with data in
very high-dimensional feature spaces. [1] [2] [3]
As the number of dimensions (features) increases, data points become exponentially more
sparse, which makes generalization and pattern discovery difficult for learning algorithms.
Key impacts:
Exponential growth of sample requirements: To represent all possible combinations of
features well, you need exponentially more samples.
Overfitting risk: Models become complex and start to memorize training data noise
rather than general patterns, harming test performance.
Computational inefficiency: Training and processing require much more time and
resources.
Distance metrics break down: In high dimensions, all data points tend to be nearly
equidistant, impeding methods like k-NN and clustering.
Mitigation: Use dimensionality reduction (e.g., PCA), feature selection, and regularization to
reduce the impact and make modeling feasible. [2] [3] [1]
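A small illustrative experiment (the synthetic uniform data is an assumption chosen for demonstration) shows the distance-concentration effect directly:

```python
import numpy as np

# As dimension D grows, the ratio between the nearest and farthest pairwise
# distances approaches 1, i.e., points become nearly equidistant.
rng = np.random.default_rng(0)
for D in (2, 10, 100, 1000):
    X = rng.random((200, D))                       # 200 uniform points in [0,1]^D
    d = np.linalg.norm(X[:1] - X[1:], axis=1)      # distances from the first point
    print(f"D={D:5d}  min/max distance ratio = {d.min() / d.max():.3f}")
```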
MAP (Maximum a Posteriori) Hypothesis
MAP model selection chooses the model (hypothesis) with the highest posterior probability. This incorporates both the likelihood of the data given the model and the prior probability of the model, thus:

$$ h_{MAP} = \arg\max_{h} \; p(\mathcal{D} \mid h)\, p(h). $$
In practice, MAP model selection often leads to minimizing an objective that combines the fit
(e.g., sum of squares error) with a complexity penalty derived from the prior. This results in a
regularized regression form, where more complex models are penalized. [4]
Regularization as prior: In linear regression, the MAP estimate corresponds to ridge
regression when using a Gaussian prior over weights. This shrinks parameters, prevents
overfitting, and thus affects model selection. [5]
MDL (Minimum Description Length) Principle
Minimum Description Length (MDL) is an information-theoretic approach to model selection.
The central idea is:
The best model is the one that gives the shortest overall message encoding both the
model and the data given the model. [6] [7] [8]
Mathematically, it seeks to minimize:

$$ L(\text{model}) + L(\text{data} \mid \text{model}), $$

where:
$ L(\text{model}) $: length (in bits) to describe the model complexity,
$ L(\text{data} \mid \text{model}) $: length to encode the data given the model's predictions.
MDL is a formalization of Occam's Razor: balance model simplicity (short description)
against data fit.
MDL penalizes overly complex models (which are hard to describe and unlikely to
generalize) and avoids underfitting (models that cannot compress data well due to poor fit).
[7] [8] [6]
| Concept | Key idea | Implication |
|---|---|---|
| Curse of Dimensionality | High dimensions increase sparsity, risk of overfitting, resource use | Favors simpler models, requires dimensionality reduction |
| MAP Hypothesis | Select model with highest posterior probability (Bayesian) | Combines prior and data fit, regularizes model selection |
| MDL Principle | Select model with shortest encoding (model + data) | Balances complexity and fit, penalizes overfitting |
Summary:
Efficient model selection in machine learning relies on understanding these concepts. High-
dimensional data can render simple models ineffective, but overly complex ones are likely to
overfit. MAP hypothesis and MDL principle provide rigorous mathematical frameworks for
balancing fit and complexity, both incorporating penalties against complex models to promote
generalizability. [8] [1] [6]
⁂
1. [Link]
2. [Link]
3. [Link]
4. [Link]
5. [Link]
6. [Link]
7. [Link]
8. [Link]
Linear Models for Regression — Basis Function Models (Bishop Style)
The simplest linear regression model is

$$ y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \cdots + w_D x_D, $$

where $ \mathbf{x} = (x_1, \dots, x_D)^\top $ and $ \mathbf{w} = (w_0, w_1, \dots, w_D)^\top $. This is linear in both the parameters and the input variables.
Limitation
Such models only capture linear trends in input space. To model more complex, nonlinear
relationships, we introduce basis functions.
The basis function model is

$$ y(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}), $$

with $ \phi_0(\mathbf{x}) = 1 $ (bias term).
Linear in $ \mathbf{w} $ (parameters)
Possibly nonlinear in $ \mathbf{x} $ (inputs), due to $ \boldsymbol{\phi} $ [1] [2] [3]
$ \Phi \in \mathbb{R}^{N \times M} $, $ \Phi_{nj} = \phi_j(\mathbf{x}_n) $: design matrix.
$ \mathbf{y} = \Phi \mathbf{w} $: predictions.
Learning: Fit $ \mathbf{w} $ by minimizing sum-of-squares (or via maximum likelihood under Gaussian noise):

$$ \mathbf{w}_{ML} = (\Phi^\top \Phi)^{-1} \Phi^\top \mathbf{t}. $$
5. Advantages
Expressivity: Model nonlinear relations with simple linear estimation.
Tractable optimization: Still allows efficient analytical or numerical training.
Extends to generalized linear models (GLMs) and Bayesian approaches.
6. Typical Workflow
1. Choose basis functions (domain knowledge, cross-validation, etc.)
2. Construct (evaluate on data).
3. Solve for (least squares; add regularization as needed).
4. Make predictions: .
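A minimal sketch of this workflow, assuming Gaussian basis functions with hand-picked centers and width (illustrative choices, not from the source):

```python
import numpy as np

# Least-squares fit of a Gaussian basis function model y = w^T phi(x).
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 50)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)

centers = np.linspace(0, 1, 9)               # basis function centers mu_j
s = 0.1                                      # shared basis width

def design_matrix(x):
    """Phi[n, j] = phi_j(x_n); column 0 is the constant bias basis."""
    gauss = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))
    return np.hstack([np.ones((x.size, 1)), gauss])

Phi = design_matrix(x)
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)  # least-squares fit of the weights
print("fitted", w.size, "weights; training RMS:",
      np.sqrt(np.mean((Phi @ w - t) ** 2)))
```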
7. Summary Table

| Model form | Expression | Notes |
|---|---|---|
| Simple linear | $ y(\mathbf{x}, \mathbf{w}) = w_0 + \sum_i w_i x_i $ | Linear in inputs and parameters |
| Basis function model | $ y(\mathbf{x}, \mathbf{w}) = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}) $ | Nonlinear in inputs via $ \boldsymbol{\phi} $ |
| Prediction (vectorized) | $ \mathbf{y} = \Phi \mathbf{w} $ | $ \Phi $: design matrix |
8. Key Points
Choice of basis critically affects performance and capacity.
Still "linear models" in the mathematical sense; enables simplicity and power. [2] [3] [1]
Includes classical polynomial regression as a special case.
These notations and structure follow Bishop's Pattern Recognition and Machine Learning.
References
No explicit URLs included per instructions; see Bishop's "Pattern Recognition and Machine
Learning," and referenced PDFs for full derivations and detailed worked examples. [3] [1] [2]
⁂
1. [Link]
2. [Link]
3. [Link]
4. [Link]
Linear Models for Classification — Least Squares Method (Bishop Style)
Overview
Linear models for classification use a linear function of the input features to assign inputs to
classes. While learned commonly by methods like logistic regression or Fisher’s discriminant, an
alternative approach applies least squares regression to classification targets. This method
treats classification as a regression problem, fitting linear functions to numerical encodings of
class labels.
Model Setup
Consider a classification problem with $ K $ classes. Use a one-hot (1-of-K) encoding of the target vector $ \mathbf{t} \in \{0,1\}^K $, where the element $ t_k = 1 $ for the correct class and 0 otherwise.
For input $ \mathbf{x} \in \mathbb{R}^D $, define the classifier output vector:

$$ \mathbf{y}(\mathbf{x}) = \tilde{\mathbf{W}}^\top \tilde{\mathbf{x}}, $$

where
$ \tilde{\mathbf{x}} = (1, \mathbf{x}^\top)^\top \in \mathbb{R}^{D+1} $ is the augmented input vector (including bias term),
$ \tilde{\mathbf{W}} \in \mathbb{R}^{(D+1) \times K} $ is the weight matrix, with the $ k $-th column containing parameters $ \tilde{\mathbf{w}}_k $ for class $ k $.
The element-wise output $ y_k(\mathbf{x}) = \tilde{\mathbf{w}}_k^\top \tilde{\mathbf{x}} $ is a
linear discriminant function for class $ k $.
Classification Rule
The predicted class for a new point $ \mathbf{x} $ is:

$$ \hat{k} = \arg\max_{k} \; y_k(\mathbf{x}). $$
Conceptual Interpretation
Least squares for classification approximates the conditional expectation $ \mathbb{E}
[\mathbf{t} | \mathbf{x}] $ of the target vector given input.
This corresponds roughly to approximating class posterior probabilities, but the outputs $
y_k(\mathbf{x}) $ are not guaranteed to be probabilities (they can be negative or not sum
to 1).
The solution corresponds to maximum likelihood estimation under a Gaussian noise
assumption on the target vectors, which is not strictly valid for discrete class labels.
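A minimal sketch of this construction (the synthetic two-class Gaussian data and the pseudo-inverse solution are illustrative assumptions):

```python
import numpy as np

# Least-squares classification with one-hot targets: W~ = pinv(X~) @ T,
# then classify by the maximal output y_k(x).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
labels = np.repeat([0, 1], 50)

T = np.eye(2)[labels]                           # one-hot (1-of-K) target matrix
Xa = np.hstack([np.ones((X.shape[0], 1)), X])   # augmented inputs (bias column)
W = np.linalg.pinv(Xa) @ T                      # least-squares weight matrix

pred = np.argmax(Xa @ W, axis=1)                # classify by the maximal output
print("training accuracy:", np.mean(pred == labels))
```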
| Method | Principle | Assumptions | Notes |
|---|---|---|---|
| Fisher’s Linear Discriminant | Maximizes class separation in projection | Assumes homoscedastic Gaussian classes | Effective dimensionality reduction |
Summary
The least squares approach to linear classification provides a simple, closed-form solution by
treating class labels as numerical targets. It defines linear discriminant functions per class and
assigns labels by taking the maximal output. While elegant for computational efficiency and
mathematical simplicity, it lacks robustness and probabilistic rigor compared to methods like
logistic regression.
This explanation and notation closely follow Bishop’s Pattern Recognition and Machine Learning,
chapter 4.1.3, and standard machine learning literature on linear classification. [1] [2] [3]
⁂
1. [Link]
2. [Link]
3. [Link]
Fisher's Linear Discriminant: Full Formulation and Derivation
Fisher's Linear Discriminant (FLD) is a classical method for linear classification and dimensionality
reduction that seeks a linear projection of data maximizing class separability. It is specifically
designed for the two-class problem but can be generalized.
Problem Setup
Suppose we have two classes $ C_1 $ and $ C_2 $ with data points:
$ \{ \mathbf{x}_i \} $, $ i = 1, \dots, N_1 $ from class $ C_1 $,
$ \{ \mathbf{x}_j \} $, $ j = 1, \dots, N_2 $ from class $ C_2 $,
each point $ \mathbf{x} \in \mathbb{R}^D $.
Define:
Class means:

$$ \mathbf{m}_k = \frac{1}{N_k} \sum_{n \in C_k} \mathbf{x}_n, \qquad k = 1, 2. $$

Goal
Find a projection vector $ \mathbf{w} \in \mathbb{R}^D $ such that the projection of the data onto $ \mathbf{w} $, $ y = \mathbf{w}^\top \mathbf{x} $, maximizes the Fisher criterion

$$ J(\mathbf{w}) = \frac{\mathbf{w}^\top \mathbf{S}_B \mathbf{w}}{\mathbf{w}^\top \mathbf{S}_W \mathbf{w}}, \qquad \mathbf{S}_B = (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^\top, \quad \mathbf{S}_W = \sum_{k=1,2} \sum_{n \in C_k} (\mathbf{x}_n - \mathbf{m}_k)(\mathbf{x}_n - \mathbf{m}_k)^\top. $$

Maximizing $ J(\mathbf{w}) $ finds a projection that maximizes the distance between the class means while keeping within-class variance small, leading to better class separability.
Optimization Problem
Maximize $ J(\mathbf{w}) $ with respect to $ \mathbf{w} $. This is a generalized Rayleigh quotient maximization, with solution

$$ \mathbf{w}^* \propto \mathbf{S}_W^{-1} (\mathbf{m}_1 - \mathbf{m}_2). $$

Here, the optimal projection vector is proportional to the inverse of the within-class scatter matrix multiplied by the difference in class means.
Interpretation
$ \mathbf{S}_W^{-1} $ weights the mean difference by the inverse of the variability within
classes, emphasizing directions of small within-class variation.
Projecting onto $ \mathbf{w} $ maximizes class separation normalized by within-class
spread.
Classification Rule
Having found $ \mathbf{w}^* $, classify a new point $ \mathbf{x} $ by projecting $ y = \mathbf{w}^{*\top} \mathbf{x} $, then choosing the class whose projected mean $ \mathbf{w}^{*\top} \mathbf{m}_k $ is closest to $ y $.

| Quantity | Expression |
|---|---|
| Optimal projection vector | $ \mathbf{w}^* \propto \mathbf{S}_W^{-1} (\mathbf{m}_1 - \mathbf{m}_2) $ |
This derivation and formulation closely follow classical presentations found in standard machine
learning texts such as Bishop’s Pattern Recognition and Machine Learning and related
academic resources.
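A minimal numerical sketch (the synthetic Gaussian classes are an illustrative assumption, not from the source):

```python
import numpy as np

# Fisher's Linear Discriminant for two classes: w* ∝ S_W^{-1}(m1 - m2),
# then classify by the nearest projected class mean.
rng = np.random.default_rng(0)
X1 = rng.normal([-1, -1], 1.0, (60, 2))    # class C1 samples
X2 = rng.normal([+1, +1], 1.0, (60, 2))    # class C2 samples

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)  # class means
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)  # within-class scatter
w = np.linalg.solve(Sw, m1 - m2)           # optimal projection direction

def predict(x):
    y = x @ w                              # project onto w
    return 1 if abs(y - m1 @ w) < abs(y - m2 @ w) else 2

acc = (sum(predict(x) == 1 for x in X1) + sum(predict(x) == 2 for x in X2)) / 120
print("training accuracy:", acc)
```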
⁂
Perceptron Algorithm: Full Formulation and Explanation
The Perceptron algorithm is one of the earliest and simplest algorithms for supervised learning
of binary classifiers. It belongs to the family of linear classifiers and forms the basis for many
neural network architectures.
Model Setup
Inputs: A feature vector $ \mathbf{x} = (x_1, x_2, ···, x_D)^\top \in \mathbb{R}^D $.
Parameters: A weight vector $ \mathbf{w} = (w_1, w_2, ···, w_D)^\top $ and a bias term $ b
$.
Output: A predicted class label in $ \{+1, -1\} $ (or sometimes $ \{0, 1\} $).
The Perceptron computes a linear combination of the inputs plus bias and thresholds it:

$$ y(\mathbf{x}) = f(\mathbf{w}^\top \mathbf{x} + b), \qquad f(a) = \begin{cases} +1 & a \ge 0 \\ -1 & a < 0 \end{cases} $$

Learning iterates over the training examples:
1. Compute the prediction: $ y_n = f(\mathbf{w}^\top \mathbf{x}_n + b) $.
2. Compute error: $ e_n = t_n - y_n $.
3. Update: $ \mathbf{w} \leftarrow \mathbf{w} + \eta\, e_n \mathbf{x}_n $, $ b \leftarrow b + \eta\, e_n $, with learning rate $ \eta > 0 $.
Limitations
Only linearly separable data can be perfectly classified.
Does not provide probabilistic outputs.
Sensitive to the choice of learning rate and initial weights.
Cannot solve problems like XOR that are not linearly separable.
The Perceptron algorithm is a foundational concept in machine learning and neural networks,
providing insight into linear classification and learning rules.
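A minimal sketch of the mistake-driven variant of the update, with labels in $ \{-1, +1\} $ (the synthetic separable data is an illustrative assumption):

```python
import numpy as np

# Perceptron training: on a misclassified point, update w <- w + eta*t*x.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (100, 2))
t = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # separable by the line x1 + x2 = 0

w, b, eta = np.zeros(2), 0.0, 0.1
for epoch in range(50):
    mistakes = 0
    for xn, tn in zip(X, t):
        if tn * (w @ xn + b) <= 0:           # misclassified point
            w += eta * tn * xn               # perceptron update rule
            b += eta * tn
            mistakes += 1
    if mistakes == 0:                        # converged: all points correct
        break
print(f"stopped after {epoch + 1} epochs, w = {w}, b = {b:.3f}")
```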
This explanation closely follows classical references and standard formulations as presented in
sources such as Bishop's Pattern Recognition and Machine Learning and standard machine
learning literature. [1] [2] [3] [4]
⁂
1. [Link]
2. [Link]
3. [Link]
4. [Link]
Probabilistic Generative Classifiers — Naive Bayes Classifier
A probabilistic generative classifier models how data is generated by each class and uses
Bayes' theorem to invert this to classify new instances. The Naive Bayes classifier is one of the
simplest and most popular examples in this category, known for its computational efficiency and
decent performance despite a strong independence assumption.
The generative model factorizes as

$$ p(\mathbf{x}, C_k) = p(C_k)\, p(\mathbf{x} \mid C_k), $$

where
$ \mathbf{x} = (x_1, \dots, x_d) $ is the feature vector,
$ C_k $ is the class label.
Classification is done by applying Bayes' theorem to compute the posterior probabilities of classes given features:

$$ p(C_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_k)\, p(C_k)}{p(\mathbf{x})}. $$

The predicted class is the one with the highest posterior (MAP decision rule):

$$ \hat{C} = \arg\max_{k} \; p(C_k)\, p(\mathbf{x} \mid C_k). $$

The "naive" conditional independence assumption, $ p(\mathbf{x} \mid C_k) = \prod_{i=1}^{d} p(x_i \mid C_k) $, simplifies the potentially complex joint distribution into a product of simpler one-dimensional distributions. Hence, the posterior simplifies to:

$$ p(C_k \mid \mathbf{x}) \propto p(C_k) \prod_{i=1}^{d} p(x_i \mid C_k). $$
This assumption is rarely true in practice but dramatically reduces the number of parameters to
estimate, making training efficient and feasible even with high-dimensional data.
Model Components
1. Prior $ p(C_k) $: The class prior probability, typically estimated as the relative frequency of each class in the training data.
2. Likelihoods $ p(x_i \mid C_k) $: Class-conditional distributions for each feature. These are estimated separately for each feature and class, using methods appropriate to the type of data:
For discrete features, typically categorical or multinomial distributions.
For continuous features, typical choices include:
Gaussian (normal) distribution for numeric data.
Kernel density estimates or other parametric distributions as suitable.
Training
The parameters for $ p(C_k) $ and $ p(x_i \mid C_k) $ are estimated from data by maximum likelihood estimation (MLE) or Bayesian methods.
Prior: $ \hat{p}(C_k) = N_k / N $, the fraction of training points in class $ C_k $.
Likelihoods: estimated per feature and class (e.g., category frequencies for discrete features; sample mean and variance for Gaussian features).

Prediction computes, for each class,

$$ \log p(C_k) + \sum_{i=1}^{d} \log p(x_i \mid C_k), $$

and assigns the class with the largest score. Logs are used for numerical stability and to convert products into sums.
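A minimal sketch for discrete features, with add-one (Laplace) smoothing (the tiny binary-feature dataset is an illustrative assumption, not from the source):

```python
import numpy as np

# Categorical Naive Bayes: estimate priors and smoothed per-feature likelihoods,
# then classify via argmax_k log p(C_k) + sum_i log p(x_i | C_k).
X = np.array([[0, 1], [0, 0], [1, 1], [1, 0], [0, 1], [1, 1]])  # two binary features
y = np.array([0, 0, 1, 1, 0, 1])                                # class labels

classes = np.unique(y)
priors = {c: np.mean(y == c) for c in classes}                  # p(C_k) = N_k / N
# p(x_i = v | C_k) with Laplace smoothing over the 2 possible feature values
lik = {c: (np.array([[np.sum(X[y == c, i] == v) for v in (0, 1)]
                     for i in range(X.shape[1])]) + 1)
          / (np.sum(y == c) + 2)
       for c in classes}

def predict(x):
    scores = {c: np.log(priors[c]) + sum(np.log(lik[c][i, v])
                                         for i, v in enumerate(x))
              for c in classes}
    return max(scores, key=scores.get)

print([predict(x) for x in X])
```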
This comprehensive overview follows standard presentations as found in authoritative texts like
Bishop's Pattern Recognition and Machine Learning and other key machine learning literature. [1]
[2] [3]
⁂
1. [Link]
2. [Link]
3. [Link]
Naive Bayes and Gaussian Naive Bayes: Full Mathematical Derivations
Problem Setup
Given:
Feature vector $ \mathbf{x} = (x_1, x_2, ..., x_d) $,
Class variable $ C \in \{C_1, C_2, \dots, C_K\} $.
Goal: Assign a class label $ \hat{C} $ to a new instance by maximizing the posterior probability:

$$ \hat{C} = \arg\max_{k} \; p(C_k \mid \mathbf{x}). $$

By Bayes’ theorem:

$$ p(C_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_k)\, p(C_k)}{p(\mathbf{x})} \propto p(C_k) \prod_{i=1}^{d} p(x_i \mid C_k). $$
Given training data points $ \{ \mathbf{x}_n \}_{n=1}^{N_k} $ for class $ C_k $, estimate parameters:
Mean for each feature: $ \mu_{k,i} = \frac{1}{N_k} \sum_{n=1}^{N_k} x_{n,i} $
Variance for each feature: $ \sigma_{k,i}^2 = \frac{1}{N_k} \sum_{n=1}^{N_k} (x_{n,i} - \mu_{k,i})^2 $
Classification Rule
A new data point is classified according to:

$$ \hat{C} = \arg\max_{k} \left[ \log p(C_k) + \sum_{i=1}^{d} \log \mathcal{N}(x_i \mid \mu_{k,i}, \sigma_{k,i}^2) \right]. $$
| Aspect | Naive Bayes | Gaussian Naive Bayes |
|---|---|---|
| Feature type | Discrete, categorical, or continuous | Continuous features |
4. Interpretation
Naive Bayes simplifies joint likelihood estimation by assuming independence, making
parameter estimation tractable.
Gaussian Naive Bayes models continuous features using Gaussian distributions, providing a
parametric family for likelihood estimation.
Both methods are efficient, scalable, and often perform surprisingly well even when
independence assumptions are violated.
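A minimal Gaussian Naive Bayes sketch implementing these estimates and the log-posterior rule (the synthetic data is an illustrative assumption):

```python
import numpy as np

# Gaussian Naive Bayes: per-class, per-feature means/variances,
# then argmax of log p(C_k) + sum_i log N(x_i | mu_{k,i}, sigma^2_{k,i}).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.repeat([0, 1], 50)

classes = np.unique(y)
prior = {c: np.mean(y == c) for c in classes}
mu = {c: X[y == c].mean(axis=0) for c in classes}           # mu_{k,i}
var = {c: X[y == c].var(axis=0) for c in classes}           # sigma^2_{k,i}

def log_posterior(x, c):
    # log prior plus Gaussian log-likelihoods; the constant p(x) is dropped
    ll = -0.5 * np.sum(np.log(2 * np.pi * var[c]) + (x - mu[c]) ** 2 / var[c])
    return np.log(prior[c]) + ll

pred = [max(classes, key=lambda c: log_posterior(x, c)) for x in X]
print("training accuracy:", np.mean(np.array(pred) == y))
```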
This rigorous formulation and derivation closely follow classical probabilistic pattern recognition literature, including Bishop's Pattern Recognition and Machine Learning.
Bayesian Belief Networks
Bayesian Belief Networks (BBNs), also known simply as Bayesian Networks, are a type of
probabilistic graphical model that represent a set of variables and their conditional
dependencies using a directed acyclic graph (DAG). They provide a powerful framework for
reasoning and making decisions under uncertainty by encoding joint probability distributions
efficiently.
Bayesian Belief Networks thus combine graph theory and probability theory to represent and
manipulate uncertain knowledge efficiently, enabling sophisticated probabilistic reasoning and
decision-making.
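A BBN factorizes the joint distribution as $ p(x_1, \dots, x_n) = \prod_i p(x_i \mid \mathrm{parents}(x_i)) $. A minimal sketch (the rain/sprinkler/wet-grass network and its probability tables are illustrative assumptions, not from the source):

```python
# Joint probability of a tiny DAG: Rain -> WetGrass <- Sprinkler.
# p(R, S, W) = p(R) * p(S) * p(W | R, S)
p_rain = {True: 0.2, False: 0.8}
p_sprinkler = {True: 0.3, False: 0.7}
p_wet = {  # conditional probability table for p(W=True | R, S)
    (True, True): 0.99, (True, False): 0.9,
    (False, True): 0.8, (False, False): 0.05,
}

def joint(r, s, w):
    pw = p_wet[(r, s)] if w else 1 - p_wet[(r, s)]
    return p_rain[r] * p_sprinkler[s] * pw

# Inference by enumeration: p(Rain | WetGrass=True)
num = sum(joint(True, s, True) for s in (True, False))
den = sum(joint(r, s, True) for r in (True, False) for s in (True, False))
print("p(Rain | WetGrass) =", round(num / den, 4))
```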
This explanation follows standard presentations in machine learning literature and authoritative
sources on Bayesian networks. [1] [2] [3] [4]
⁂
1. [Link]
2. [Link]
3. [Link]
4. [Link]
Probabilistic Discriminative Classifiers — Logistic Regression
Overview
Logistic regression is a foundational probabilistic discriminative classifier used primarily for
binary classification problems. It models the conditional probability of the class label given the
input features directly, without attempting to model the data generation process (unlike
generative models such as Naive Bayes).
Model Formulation
Given:
Input feature vector $ \mathbf{x} \in \mathbb{R}^D $
Binary target $ t \in \{0,1\} $
Logistic regression models the posterior probability of the positive class as a logistic sigmoid function applied to a linear function of the inputs:

$$ y(\mathbf{x}, \mathbf{w}) = p(t = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})), $$

where

$$ \sigma(a) = \frac{1}{1 + e^{-a}}, $$

and
$ \mathbf{w} $ is the weight vector,
$ \boldsymbol{\phi}(\mathbf{x}) $ is the basis function vector of input features (can include nonlinear transformations),
$ \sigma(\cdot) $ is the sigmoid function.
The probability of the negative class is $ 1 - y(\mathbf{x}, \mathbf{w}) $.
Parameter Estimation
The objective function $ E(\mathbf{w}) $ is convex but nonlinear in $ \mathbf{w} $.
Parameters are typically estimated by maximum likelihood using iterative optimization
methods such as:
Gradient descent
Iteratively Reweighted Least Squares (IRLS)
Newton-Raphson method
Multiclass Extension
For classification with $ K > 2 $ classes, logistic regression generalizes to multinomial logistic regression (also known as softmax regression):

$$ p(C_k \mid \mathbf{x}) = \frac{\exp(\mathbf{w}_k^\top \boldsymbol{\phi}(\mathbf{x}))}{\sum_{j=1}^{K} \exp(\mathbf{w}_j^\top \boldsymbol{\phi}(\mathbf{x}))}, $$

where each class has its own weight vector $ \mathbf{w}_k $, collected as columns in the weight matrix $ \mathbf{W} $.
Summary Table

| Aspect | Description |
|---|---|
| Output | $ p(t = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})) $ |
| Extension to multiclass | Softmax function over linear scores |
This explanation and notation align closely with Bishop's Pattern Recognition and Machine
Learning as well as standard machine learning literature. Logistic regression remains a
fundamental tool for classification due to its simplicity, interpretability, and probabilistic
foundation.
⁂
Logistic Regression: Full Mathematical Derivation
1. Problem Setup
Given a binary classification problem with:
Input feature vector $ \mathbf{x} \in \mathbb{R}^D $,
Target label $ t \in \{0, 1\} $.
We want to model the conditional probability of the class label given the input:

$$ p(t = 1 \mid \mathbf{x}) = y(\mathbf{x}, \mathbf{w}) = \sigma(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})), \qquad \sigma(a) = \frac{1}{1 + e^{-a}}. $$

The likelihood of the dataset is

$$ p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}, \qquad y_n = \sigma(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n)), $$

and the negative log-likelihood (cross-entropy error) is

$$ E(\mathbf{w}) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln (1 - y_n) \}. $$

Since $ \sigma'(a) = \sigma(a)(1 - \sigma(a)) $, by the chain rule the gradient is

$$ \nabla_{\mathbf{w}} E(\mathbf{w}) = \sum_{n=1}^{N} (y_n - t_n)\, \boldsymbol{\phi}(\mathbf{x}_n) = \Phi^\top (\mathbf{y} - \mathbf{t}), $$

and the Hessian is $ \mathbf{H} = \Phi^\top \mathbf{R} \Phi $, with $ \mathbf{R} = \mathrm{diag}\big( y_n (1 - y_n) \big) $.
7. Summary of Algorithm
1. Initialize $ \mathbf{w} $ (e.g., zero vector).
2. For each iteration:
Compute $ y_n = \sigma(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n)) $.
Form $ \mathbf{R} $.
Compute gradient $ \nabla_{\mathbf{w}} E(\mathbf{w}) = \Phi^\top (\mathbf{y} -
\mathbf{t}) $.
Compute Hessian $ \mathbf{H} = \Phi^\top \mathbf{R} \Phi $.
Update (Newton-Raphson step): $ \mathbf{w} \leftarrow \mathbf{w} - \mathbf{H}^{-1} \nabla_{\mathbf{w}} E(\mathbf{w}) $.
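A minimal IRLS sketch for these updates (the synthetic data and the identity basis $ \boldsymbol{\phi}(\mathbf{x}) = (1, \mathbf{x}^\top)^\top $ are illustrative assumptions, not from the source):

```python
import numpy as np

# IRLS (Newton-Raphson) for logistic regression.
rng = np.random.default_rng(0)
X = rng.normal(0, 1, (200, 2))
t = (X @ np.array([2.0, -1.0]) + 0.5 + rng.logistic(0, 1, 200) > 0).astype(float)

Phi = np.hstack([np.ones((200, 1)), X])        # design matrix with bias column
w = np.zeros(Phi.shape[1])                     # initialize w to the zero vector

for _ in range(10):
    y = 1.0 / (1.0 + np.exp(-Phi @ w))         # y_n = sigma(w^T phi_n)
    R = y * (1 - y)                            # diagonal of R
    grad = Phi.T @ (y - t)                     # gradient Phi^T (y - t)
    H = Phi.T @ (R[:, None] * Phi)             # Hessian Phi^T R Phi
    w -= np.linalg.solve(H, grad)              # Newton update w <- w - H^{-1} grad
print("estimated weights:", np.round(w, 3))
```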
Parameter estimation is done via maximizing multinomial log-likelihood, using similar iterative
optimization.
This full derivation follows the principles and notation set out in Bishop's Pattern Recognition and
Machine Learning and is foundational in modern machine learning for classification tasks.
Probit Regression: Full Formulation and Derivation
Overview
Probit regression is a type of probabilistic classification model used for binary outcomes, where
the dependent variable can take only two values (e.g., 0 or 1). It models the probability of the
positive class as the cumulative distribution function (CDF) of a standard normal distribution
applied to a linear combination of input features. This is in contrast to logistic regression, which
uses the logistic sigmoid function.
Model Formulation
Given:
Input vector $ \mathbf{x} \in \mathbb{R}^D $,
Target label $ y \in \{0,1\} $,
the Probit regression model assumes:

$$ P(y = 1 \mid \mathbf{x}) = \Phi(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})), $$

where:
$ \boldsymbol{\phi}(\mathbf{x}) $ is a vector of basis functions of the input (possibly including an intercept term),
$ \mathbf{w} $ is the parameter (weight) vector,
$ \Phi(\cdot) $ is the CDF of the standard normal distribution:

$$ \Phi(a) = \int_{-\infty}^{a} \mathcal{N}(z \mid 0, 1)\, dz. $$

Thus, $ P(y = 0 \mid \mathbf{x}) = 1 - \Phi(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})) $.
Likelihood Function
Given a dataset $ \{ (\mathbf{x}_n, y_n) \}_{n=1}^N $, the likelihood of the observed data under the Probit model is:

$$ p(\mathbf{y} \mid \mathbf{w}) = \prod_{n=1}^{N} \Phi(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n))^{y_n} \big[ 1 - \Phi(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n)) \big]^{1 - y_n}. $$
The Hessian matrix can also be derived, allowing second-order optimization techniques like
Newton-Raphson.
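A minimal sketch of maximum-likelihood fitting by gradient ascent, using SciPy's `norm.cdf`/`norm.pdf` for $ \Phi $ and its density (the synthetic latent-variable data, step size, and iteration count are illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm

# Probit regression via gradient ascent on the log-likelihood.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((300, 1)), rng.normal(0, 1, (300, 2))])
w_true = np.array([0.5, 1.5, -1.0])
y = (X @ w_true + rng.standard_normal(300) > 0).astype(float)  # y = 1 iff latent z > 0

w = np.zeros(3)
for _ in range(2000):
    a = X @ w
    p = norm.cdf(a)
    # d logL / dw = sum_n pdf(a_n) * (y_n - p_n) / (p_n (1 - p_n)) * x_n
    g = X.T @ (norm.pdf(a) * (y - p) / (p * (1 - p) + 1e-12))
    w += 0.05 * g / len(y)                    # small gradient-ascent step
print("estimated w:", np.round(w, 2), "| true w:", w_true)
```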
Comparison with Logistic Regression
Logistic regression uses the logistic sigmoid $ \sigma(a) = 1/(1 + e^{-a}) $ as the link function, whereas Probit uses the standard normal CDF $ \Phi(a) $; the two produce similar fits, with the logistic function having slightly heavier tails.
Summary Table

| Aspect | Probit Regression | Logistic Regression |
|---|---|---|
| Output range | $ P(y=1 \mid \mathbf{x}) \in (0,1) $ | $ P(y=1 \mid \mathbf{x}) \in (0,1) $ |
| Interpretability | Coefficients interpreted as effects on latent variable's z-score | Coefficients interpreted as log-odds ratios |
This full formulation and derivation of Probit regression follow classical statistical and
econometric treatments of the model, closely aligned with presentations in canonical texts and
standard machine learning resources. It provides a principled, mathematically rigorous approach
for modeling binary outcomes using a latent Gaussian model and the standard normal cumulative
distribution function as the link.
⁂
Instance-Based Learning — K-Nearest Neighbors (K-NN)
Overview
Instance-based learning is a family of learning algorithms that work by storing training
instances and making predictions for new queries by comparing them directly to these stored
examples. Unlike model-based methods that explicitly learn a global generalization model from
the training data, instance-based learners postpone the induction or generalization until a query
is made. This is why they are also called lazy learners.
The most fundamental and popular instance-based algorithm is the K-Nearest Neighbors (K-
NN) algorithm.
Algorithm:
1. Store all training instances.
2. Define a distance metric, typically Euclidean:

$$ d(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{ \sum_{r=1}^{D} \big( a_r(\mathbf{x}_i) - a_r(\mathbf{x}_j) \big)^2 }, $$

where $ a_r(\mathbf{x}) $ is the value of the $ r^{\text{th}} $ attribute for instance $ \mathbf{x} $.
3. For a new query instance, compute its distance to all stored training instances.
4. Select the k closest neighbors based on the smallest distances.
5. Prediction:
Classification: Assign the class which is the majority among the k neighbors. The
simplest form is unweighted voting.
Regression: Predict the average target value among the k neighbors.
Optionally, weighted voting or averaging can be applied where neighbors closer to the query have more influence (e.g., weighted by the inverse of distance); a sketch follows below.
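A minimal k-NN sketch of steps 3-5 (the synthetic two-blob data and $ k = 5 $ are illustrative assumptions, not from the source):

```python
import numpy as np

# k-NN classification: distance to all stored instances, take the k closest,
# and predict by unweighted majority vote.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.repeat([0, 1], 50)

def knn_predict(query, k=5):
    d = np.linalg.norm(X - query, axis=1)        # distance to all stored instances
    neighbors = np.argsort(d)[:k]                # indices of the k closest
    votes = np.bincount(y[neighbors])            # unweighted majority vote
    return np.argmax(votes)

print(knn_predict(np.array([0.5, 0.5])), knn_predict(np.array([2.5, 3.0])))
```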
Choosing k
Small values of k (e.g., 1) can lead to noisy decisions and high variance (overfitting).
Large values of k smooth decisions but may lead to underfitting (high bias).
Typically, an odd number is chosen to avoid ties in classification.
The optimal value of k can be determined via cross-validation or methods such as the elbow
method.
Characteristics of K-NN

| Feature | Description |
|---|---|
| Distance metric | Usually Euclidean, but others like Manhattan or Minkowski can be used |
Advantages
Simple to understand and implement.
Makes no assumptions about data distribution.
Naturally handles multi-class problems.
Flexible decision boundaries that adapt to data shape.
Effective with large training data when good indexing/search structures are used.
Limitations
Computational cost high at prediction time (scales with dataset size).
Sensitive to irrelevant features and feature scaling.
Performance deteriorates in high-dimensional spaces (curse of dimensionality).
Capacity depends on quality and representativeness of stored instances.
Sensitive to noise and outliers (can be mitigated by weighting or data cleaning).
Summary
Instance-based learning methods like K-NN classify or regress for new samples by comparing
directly to stored data points. K-NN predicts based on the "k" most similar neighbors, making it
intuitive and effective for many practical tasks, especially when the relationship between input
and output is locally smooth but complex globally.
This explanation follows standard machine learning literature on instance-based methods and K-
NN algorithms. [1] [2] [3]
⁂
1. [Link]
2. [Link]
3. [Link]
Formulation of Nonlinear Models — Decision Trees
Decision trees are a fundamental class of nonlinear models used for both classification and
regression tasks. Unlike linear models that assume a linear relationship between features and the
output, decision trees model complex, nonlinear patterns by recursively partitioning the input
feature space into distinct regions and making simple predictions within each region.
Mathematical Formulation
1. Data and Notation:
Given training data $ \mathcal{D} = \{ (\mathbf{x}_n, t_n) \}_{n=1}^{N} $ with $ \mathbf{x}_n \in \mathbb{R}^D $, a decision tree partitions the input space into $ M $ disjoint regions $ R_1, \dots, R_M $ and predicts a constant $ c_m $ within each region:

$$ f(\mathbf{x}) = \sum_{m=1}^{M} c_m\, \mathbb{1}[\mathbf{x} \in R_m]. $$

| Component | Description |
|---|---|
| Input data | $ \mathcal{D} = \{ (\mathbf{x}_n, t_n) \} $, $ \mathbf{x}_n \in \mathbb{R}^D $, $ t_n $ a class label or real-valued target |
| Prediction function | Piecewise constant over regions: $ f(\mathbf{x}) = \sum_m c_m \mathbb{1}[\mathbf{x} \in R_m] $ |
| Impurity / loss | Gini, entropy, variance, or other impurity functions to measure split quality |
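A minimal sketch of the split-selection step on a single feature, using Gini impurity (the toy data is an illustrative assumption):

```python
import numpy as np

# Choose the threshold minimizing the weighted Gini impurity of the two children.
def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)                  # Gini = 1 - sum_k p_k^2

def best_split(x, y):
    """Scan thresholds on one feature; return (threshold, weighted Gini)."""
    best = (None, np.inf)
    for thr in np.unique(x)[:-1]:
        left, right = y[x <= thr], y[x > thr]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best[1]:
            best = (thr, score)
    return best

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_split(x, y))                          # expect a threshold near 3.0
```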
This explanation captures the standard mathematical formulation and interpretation of decision
trees as nonlinear models widely used in machine learning literature and practical applications. [1]
[2] [3]
1. [Link]
2. [Link]
3. [Link]
Ensemble Classifiers — Mathematical Formulations of Bagging and Boosting
Ensemble classifiers combine multiple base models to improve prediction accuracy, robustness,
and generalization performance relative to individual models. Two widely used ensemble
techniques are Bagging (Bootstrap Aggregating) and Boosting. Below is a detailed
mathematical formulation of each.
Setup:
Original training dataset: $ \mathcal{D} = \{ (\mathbf{x}_n, t_n) \}_{n=1}^{N} $.
Number of base learners: $ B $.
Each base learner is trained on a bootstrap sample $ \mathcal{D}_b $ created by sampling $ N $ points from $ \mathcal{D} $ with replacement.
The base learner's prediction function is $ y_b(\mathbf{x}) $.
Mathematical Formulation:
1. For $ b = 1, \dots, B $:
Sample $ \mathcal{D}_b $ by bootstrap sampling from $ \mathcal{D} $.
Train base classifier $ y_b $ on $ \mathcal{D}_b $.
2. To predict the class label for a new instance $ \mathbf{x} $:
Classification: Use majority voting over base classifiers:

$$ \hat{y}(\mathbf{x}) = \arg\max_{c} \sum_{b=1}^{B} \mathbb{1}[y_b(\mathbf{x}) = c]. $$
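A minimal bagging sketch with decision stumps as base learners (the stump learner and the synthetic data are illustrative assumptions, not from the source):

```python
import numpy as np

# Bagging: train B stumps on bootstrap samples, predict by majority vote.
rng = np.random.default_rng(0)
X = rng.normal(0, 1, (200, 1)).ravel()
y = np.where(X > 0.2, 1, -1)

def fit_stump(x, t):
    """Pick the threshold with lowest training error; return a predictor."""
    thrs = np.unique(x)
    errs = [np.mean(np.where(x > thr, 1, -1) != t) for thr in thrs]
    best = thrs[int(np.argmin(errs))]
    return lambda q: np.where(q > best, 1, -1)

B = 25
stumps = []
for _ in range(B):
    idx = rng.integers(0, len(X), len(X))     # bootstrap sample (with replacement)
    stumps.append(fit_stump(X[idx], y[idx]))

votes = np.sum([s(X) for s in stumps], axis=0)
print("bagged accuracy:", np.mean(np.sign(votes) == y))  # majority vote
```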
2. Boosting
Boosting sequentially trains base learners, where each learner focuses on the mistakes of the
previous ones to form a strong composite model by weighted combination of weak learners.
Setup:
Original training dataset: $ \mathcal{D} = \{ (\mathbf{x}_n, t_n) \}_{n=1}^{N} $, with labels $ t_n \in \{-1, +1\} $.
Number of base learners: $ M $.
Weight distribution over training samples at iteration $ m $: $ w_n^{(m)} $.
Base learner $ y_m $ at iteration $ m $ is trained on weighted data.
Learner's weight in final model: $ \alpha_m $.
1. Initialize the sample weights: $ w_n^{(1)} = 1/N $.
2. For $ m = 1, \dots, M $:
Train base classifier $ y_m $ using the weighted dataset with weights $ w_n^{(m)} $.
Compute weighted error rate:

$$ \epsilon_m = \frac{\sum_{n=1}^{N} w_n^{(m)}\, \mathbb{1}[y_m(\mathbf{x}_n) \neq t_n]}{\sum_{n=1}^{N} w_n^{(m)}}. $$

Compute the learner's weight $ \alpha_m = \frac{1}{2} \ln \frac{1 - \epsilon_m}{\epsilon_m} $ and update the sample weights $ w_n^{(m+1)} \propto w_n^{(m)} \exp\big( -\alpha_m t_n y_m(\mathbf{x}_n) \big) $, normalized to sum to 1.
3. Final model: $ \hat{y}(\mathbf{x}) = \mathrm{sign}\big( \sum_{m=1}^{M} \alpha_m y_m(\mathbf{x}) \big) $.
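A minimal AdaBoost-style sketch of this loop with weighted decision stumps (the synthetic data and the stump learner are illustrative assumptions):

```python
import numpy as np

# AdaBoost with threshold stumps; labels in {-1, +1}.
rng = np.random.default_rng(1)
X = rng.normal(0, 1, 300)
y = np.where(np.abs(X) > 1.0, 1, -1)          # not separable by a single stump

w = np.ones(len(X)) / len(X)                  # initialize w_n = 1/N
model = []                                    # list of (alpha_m, threshold, sign)
for m in range(20):
    best = None                               # weighted stump: all thresholds, both signs
    for thr in np.unique(X):
        for s in (1, -1):
            pred = s * np.where(X > thr, 1, -1)
            err = np.sum(w * (pred != y))
            if best is None or err < best[0]:
                best = (err, thr, s)
    eps, thr, s = best
    alpha = 0.5 * np.log((1 - eps) / eps)     # learner weight alpha_m
    pred = s * np.where(X > thr, 1, -1)
    w *= np.exp(-alpha * y * pred)            # up-weight misclassified points
    w /= w.sum()
    model.append((alpha, thr, s))

F = sum(a * s * np.where(X > thr, 1, -1) for a, thr, s in model)
print("boosted accuracy:", np.mean(np.sign(F) == y))
```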
Summary Table

| Aspect | Bagging | Boosting |
|---|---|---|
| Sampling | Bootstrap samples (with replacement) | Reweighted full dataset |
| Training | Base learners trained independently | Base learners trained sequentially on previous mistakes |
| Combination | Majority voting / averaging | Weighted combination with weights $ \alpha_m $ |
These mathematical formulations capture the essence and differences of Bagging and Boosting
ensemble methods widely used for classification and regression tasks. They follow the principles
outlined in standard machine learning literature and align with classical sources.
⁂
Lagrange Multipliers: Full Derivation and Formulation
Consider maximizing (or minimizing) $ f(\mathbf{x}) $ subject to an equality constraint $ g(\mathbf{x}) = 0 $. Introduce the Lagrangian

$$ L(\mathbf{x}, \lambda) = f(\mathbf{x}) + \lambda\, g(\mathbf{x}), $$

where $ \lambda $ is the Lagrange multiplier. The constrained optimization problem becomes finding stationary points of $ L $ with respect to both $ \mathbf{x} $ and $ \lambda $.
Necessary Conditions (Equality-Constrained)
Solve the system:

$$ \nabla_{\mathbf{x}} L = \nabla f(\mathbf{x}) + \lambda \nabla g(\mathbf{x}) = \mathbf{0}, \qquad \frac{\partial L}{\partial \lambda} = g(\mathbf{x}) = 0. $$
These yield candidate points for maxima or minima, turning a constrained problem into solving a
system of equations.
For inequality-constrained problems (minimize $ f(\mathbf{x}) $ subject to $ g_i(\mathbf{x}) \le 0 $ and $ h_j(\mathbf{x}) = 0 $), form the Lagrangian

$$ L(\mathbf{x}, \boldsymbol{\mu}, \boldsymbol{\lambda}) = f(\mathbf{x}) + \sum_i \mu_i\, g_i(\mathbf{x}) + \sum_j \lambda_j\, h_j(\mathbf{x}), $$

where
$ \mu_i $ are multipliers for inequality constraints,
$ \lambda_j $ are multipliers for equality constraints.
The Karush-Kuhn-Tucker (KKT) conditions are:
1. Stationarity: $ \nabla_{\mathbf{x}} L = \mathbf{0} $.
2. Primal feasibility: $ g_i(\mathbf{x}) \le 0 $, $ h_j(\mathbf{x}) = 0 $.
3. Dual feasibility: $ \mu_i \ge 0 $.
4. Complementary slackness: $ \mu_i\, g_i(\mathbf{x}) = 0 $ for all $ i $.
Summary
The Lagrange multiplier method handles equality constraints by introducing multipliers to
form a Lagrangian; optimizing this reduces to solving system of equations.
The KKT conditions generalize this to inequality constraints, introducing extra conditions
(dual feasibility and complementary slackness) alongside stationarity and primal feasibility.
These conditions are a fundamental tool in constrained nonlinear optimization, widely used in
economics, engineering, machine learning, and operations research.
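A short worked example, chosen for illustration (not from the source): maximize $ f(x, y) = xy $ subject to $ x + y = 1 $.

$$
L(x, y, \lambda) = xy + \lambda (x + y - 1), \qquad
\frac{\partial L}{\partial x} = y + \lambda = 0, \quad
\frac{\partial L}{\partial y} = x + \lambda = 0, \quad
\frac{\partial L}{\partial \lambda} = x + y - 1 = 0,
$$

giving $ x = y = \tfrac{1}{2} $, $ \lambda = -\tfrac{1}{2} $, and $ f_{\max} = \tfrac{1}{4} $.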
Support Vector Machines (SVM): Formulations, Dual Method, Soft Margin and
Hard Margin — Full Derivations
Hard margin (primal):

$$ \min_{\mathbf{w}, b} \; \frac{1}{2} \|\mathbf{w}\|^2 \quad \text{subject to} \quad t_n (\mathbf{w}^\top \mathbf{x}_n + b) \ge 1, \quad n = 1, \dots, N. $$

Hard margin (dual):

$$ \max_{\mathbf{a}} \; \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m t_n t_m \mathbf{x}_n^\top \mathbf{x}_m \quad \text{subject to} \quad a_n \ge 0, \quad \sum_{n=1}^{N} a_n t_n = 0. $$

Soft margin (primal):

$$ \min_{\mathbf{w}, b, \boldsymbol{\xi}} \; \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{n=1}^{N} \xi_n \quad \text{subject to} \quad t_n (\mathbf{w}^\top \mathbf{x}_n + b) \ge 1 - \xi_n, \quad \xi_n \ge 0. $$

Soft margin (dual): the same objective as the hard-margin dual, subject to $ 0 \le a_n \le C $ and $ \sum_n a_n t_n = 0 $. Note the key difference from hard margin: the $ a_n $ are upper bounded by $ C $, reflecting soft-margin slack penalties.

| Aspect | Hard Margin | Soft Margin |
|---|---|---|
| Assumption | Linearly separable data required | Allows misclassification and margin violations |
| Constraints | $ t_n(\mathbf{w}^\top \mathbf{x}_n + b) \ge 1 $ | $ t_n(\mathbf{w}^\top \mathbf{x}_n + b) \ge 1 - \xi_n $, $ \xi_n \ge 0 $ |
| Dual constraints | $ a_n \ge 0 $, with no upper bound reflecting slack penalties | $ 0 \le a_n \le C $ |
6. Interpretation
The SVM solution depends only on support vectors (training points with nonzero $ a_n $).
The dual formulation enables the kernel trick, allowing implicit mapping to high-dimensional feature spaces via kernel functions $ k(\mathbf{x}, \mathbf{x}') $.
Soft margin SVM is robust to noise and non-separable data via the slack variables $ \xi_n $ and the regularization parameter $ C $.
This comprehensive summary captures the formulation, derivation, and core mathematical
foundations of SVMs including primal and dual form, and differences between soft and hard
margin methods. The derivations align with classical treatments in machine learning theory and
optimization literature.
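A minimal sketch contrasting soft and (approximately) hard margins via the $ C $ parameter (scikit-learn's `SVC` and the synthetic data are illustrative assumptions, not the source's own method):

```python
import numpy as np
from sklearn.svm import SVC

# Small C: soft margin (many violations tolerated); very large C approximates
# a hard margin on separable-ish data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.8, (50, 2)), rng.normal(+1, 0.8, (50, 2))])
y = np.repeat([-1, 1], 50)

for C in (0.1, 1e6):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:g}: {clf.n_support_.sum()} support vectors, "
          f"train acc {clf.score(X, y):.2f}")
```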
⁂
K-Means Algorithm — Derivation Using EM and Gaussian Mixture Models (GMM)
Overview
The K-Means clustering algorithm can be rigorously derived as a special case of the
Expectation-Maximization (EM) algorithm applied to a Gaussian Mixture Model (GMM) with
simplifying assumptions. Understanding this connection helps explain K-Means as a hard
clustering method approximating the soft probabilistic clustering in GMMs.
The GMM models the data density as

$$ p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), $$

where:
$ K $ is the number of clusters,
$ \pi_k $ are mixing coefficients ($ \pi_k \ge 0 $, $ \sum_k \pi_k = 1 $),
$ \boldsymbol{\mu}_k $ are means,
$ \boldsymbol{\Sigma}_k $ covariance matrices.
The EM algorithm for GMM alternates between:
E-step: Compute the soft responsibilities:

$$ \gamma(z_{nk}) = \frac{\pi_k\, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}. $$

Taking $ \boldsymbol{\Sigma}_k = \epsilon \mathbf{I} $ and letting $ \epsilon \to 0 $, the responsibilities collapse to hard assignments: $ r_{nk} = 1 $ if $ k = \arg\min_j \| \mathbf{x}_n - \boldsymbol{\mu}_j \|^2 $ and zero otherwise, meaning each data point is assigned to the nearest centroid.
M-step (Centroid update): Given these hard assignments, update cluster centers to the mean of the assigned points:

$$ \boldsymbol{\mu}_k = \frac{\sum_n r_{nk}\, \mathbf{x}_n}{\sum_n r_{nk}}. $$
| Aspect | K-Means | GMM |
|---|---|---|
| Cluster shape assumption | Spherical and equal variance | Full covariance, learned from data |
Intuition
K-Means corresponds to a simplified, hard clustering version of GMM where clusters are
spherical with equal variance.
The probabilistic EM soft assignments collapse to hard assignments in K-Means.
EM algorithm optimizes the likelihood in GMM, while K-Means optimizes a distortion function.
This derivation and interpretation help bridge the understanding of K-Means from a heuristic
clustering method to a probabilistically principled algorithm derived from Gaussian mixture
modeling and expectation-maximization.
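A minimal hard-EM (K-Means) sketch of these two steps (the synthetic data and the initialization are illustrative assumptions):

```python
import numpy as np

# K-Means: E-step assigns each point to the nearest centroid,
# M-step recomputes centroids as the means of assigned points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(3, 0.5, (100, 2))])
K = 2
mu = X[rng.choice(len(X), K, replace=False)]    # initialize centroids at data points

for _ in range(20):
    d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # N x K distances
    r = np.argmin(d, axis=1)                    # hard E-step assignments
    new_mu = np.array([X[r == k].mean(axis=0) for k in range(K)])  # M-step
    if np.allclose(new_mu, mu):                 # stop when centroids are stable
        break
    mu = new_mu
print("centroids:\n", np.round(mu, 2))
```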
This explanation is based on classical EM and GMM theory, as found in machine learning
literature such as Bishop's Pattern Recognition and Machine Learning and supporting academic
sources. [1] [2] [3] [4] [5] [6]
⁂
1. [Link]
2. [Link]
3. [Link]
4. [Link]
5. [Link]
6. [Link]
Dimension Reduction — Principal Component Analysis (PCA): Full Derivation and
Steps, Including SVD-Based Derivation
Overview
Principal Component Analysis (PCA) is a fundamental technique for dimensionality reduction. It
finds a new set of orthogonal variables (principal components) that capture the maximal
variance in the data in descending order, reducing dimensionality while preserving as much
information as possible.
Problem Setup
Consider a dataset with $ N $ observations and $ D $ features arranged in a (mean-centered) data matrix $ \mathbf{X} \in \mathbb{R}^{N \times D} $.

SVD-Based Derivation
1. Compute the singular value decomposition

$$ \mathbf{X} = \mathbf{U} \mathbf{S} \mathbf{V}^\top, $$

where:
$ \mathbf{U} \in \mathbb{R}^{N \times N} $ is orthogonal,
$ \mathbf{S} \in \mathbb{R}^{N \times D} $ is diagonal (rectangular) with singular values $ s_1 \ge s_2 \ge \cdots \ge 0 $,
$ \mathbf{V} \in \mathbb{R}^{D \times D} $ is orthogonal; its columns are the right singular vectors.
2. The covariance matrix can be expressed as:

$$ \mathbf{C} = \frac{1}{N} \mathbf{X}^\top \mathbf{X} = \mathbf{V} \, \frac{\mathbf{S}^\top \mathbf{S}}{N} \, \mathbf{V}^\top, $$

so the right singular vectors are the principal components (eigenvectors of $ \mathbf{C} $), with eigenvalues $ \lambda_j = s_j^2 / N $.
Summary Table

| Step | Description | Formulation/Operation |
|---|---|---|
| Covariance computation | Calculate covariance matrix | $ \mathbf{C} = \frac{1}{N} \mathbf{X}^\top \mathbf{X} $ |
| Eigen-decomposition | Find eigenvalues/eigenvectors of covariance | $ \mathbf{C} \mathbf{v}_j = \lambda_j \mathbf{v}_j $, equivalently the SVD $ \mathbf{X} = \mathbf{U}\mathbf{S}\mathbf{V}^\top $ |
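A minimal PCA-via-SVD sketch (the synthetic correlated data is an illustrative assumption, not from the source):

```python
import numpy as np

# PCA via SVD: center the data, take the SVD, project onto the top-K
# right singular vectors.
rng = np.random.default_rng(0)
X = rng.normal(0, 1, (200, 3)) @ np.array([[3, 0, 0], [1, 1, 0], [0, 0, 0.1]])

Xc = X - X.mean(axis=0)                     # center each feature
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S ** 2 / np.sum(S ** 2)         # variance fraction per component
Z = Xc @ Vt[:2].T                           # project onto top-2 principal components

print("explained variance ratios:", np.round(explained, 3))
print("reduced shape:", Z.shape)
```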
This comprehensive derivation and explanation follows the classical PCA theory and numerical
linear algebra foundations as presented in standard references including Bishop’s Pattern
Recognition and Machine Learning and authoritative machine learning and statistics literature. [1]
[2] [3] [4]
1. [Link]
2. [Link]
3. [Link]
4. [Link]
Artificial Neural Networks (ANNs): Full Mathematical Formulation
Artificial Neural Networks (ANNs) are computational models inspired by the biological neural
networks of animal brains. They consist of interconnected layers of simple processing units
called neurons, arranged in layers: an input layer, one or more hidden layers, and an output layer.
ANNs approximate complex functions mapping input data to outputs through compositions of
linear transformations and nonlinear activation functions.
Each neuron $ j $ computes

$$ z_j = \sum_i w_{ji}\, x_i + b_j, \qquad a_j = h(z_j), $$

where:
$ x_i $ are inputs to the neuron (either original inputs or outputs from previous-layer neurons),
$ w_{ji} $ are weights connecting input $ i $ to neuron $ j $,
$ b_j $ is the bias of neuron $ j $,
$ h(\cdot) $ is an element-wise nonlinear activation function (e.g., sigmoid, ReLU, tanh),
$ z_j $ is the pre-activation input to the activation function,
$ a_j $ is the neuron's output (activation).
3. Activation Functions
Common choices include:
Sigmoid (logistic): $ \sigma(z) = \frac{1}{1 + e^{-z}} $
Hyperbolic tangent (tanh): $ \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}} $
Rectified Linear Unit (ReLU): $ \mathrm{ReLU}(z) = \max(0, z) $
Softmax (output layer for multiclass classification): $ \mathrm{softmax}(\mathbf{z})_k = \frac{e^{z_k}}{\sum_j e^{z_j}} $
Training minimizes the empirical loss

$$ E(\Theta) = \sum_{n=1}^{N} \ell(\mathbf{y}_n, \mathbf{t}_n), $$

where $ \mathbf{y}_n = F(\mathbf{x}_n; \Theta) $ is the ANN output for input $ \mathbf{x}_n $ with parameters $ \Theta $, and $ \ell $ is a loss function, such as the squared error $ \frac{1}{2} \| \mathbf{y}_n - \mathbf{t}_n \|^2 $ or the cross-entropy.
This mathematical formulation captures the core building blocks and function of artificial neural
networks, consistent with rigorous treatments found in seminal machine learning literature.
Backpropagation
Define the error signal at layer $ l $ as

$$ \boldsymbol{\delta}^{(l)} = \frac{\partial E}{\partial \mathbf{z}^{(l)}}, $$

which quantifies how the loss changes with changes in the pre-activation vector $ \mathbf{z}^{(l)} $.
4. Step-by-Step Derivation
(a) Output layer error:
For the output layer $ L $, assuming element-wise activation,

$$ \boldsymbol{\delta}^{(L)} = \nabla_{\mathbf{a}} E \odot h'(\mathbf{z}^{(L)}), $$

where
$ \nabla_{\mathbf{a}} E $ is the gradient of the loss w.r.t. the outputs,
$ h'(\mathbf{z}^{(L)}) $ is the element-wise derivative of the activation function,
$ \odot $ denotes the element-wise (Hadamard) product.
(b) Hidden layer recursion:

$$ \boldsymbol{\delta}^{(l)} = \big( \mathbf{W}^{(l+1)\top} \boldsymbol{\delta}^{(l+1)} \big) \odot h'(\mathbf{z}^{(l)}), \qquad \frac{\partial E}{\partial \mathbf{W}^{(l)}} = \boldsymbol{\delta}^{(l)} \, \mathbf{a}^{(l-1)\top}. $$

This expresses the error at layer $ l $ in terms of the error at layer $ l + 1 $, the weights connecting them, and the derivatives of the activation.
7. Intuition
Errors at output propagate backward, scaled by weights and element-wise modulated by
activation derivatives.
The gradient for each weight reflects how much a small change in that weight affects the
final error.
Backpropagation provides an efficient way to compute all gradients simultaneously,
avoiding redundant calculations inherent in naive differentiation.
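A minimal two-layer MLP with manual backpropagation implementing these equations (the XOR data, layer sizes, and learning rate are illustrative assumptions, not from the source):

```python
import numpy as np

# Forward pass, then backward pass using the delta recursion and
# gradients dE/dW = A^T delta; plain gradient-descent updates.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
T = np.array([[0], [1], [1], [0]], float)          # XOR targets

W1, b1 = rng.normal(0, 1, (2, 8)), np.zeros(8)     # input -> hidden
W2, b2 = rng.normal(0, 1, (8, 1)), np.zeros(1)     # hidden -> output
sigmoid = lambda z: 1 / (1 + np.exp(-z))

for _ in range(5000):
    # Forward pass
    Z1 = X @ W1 + b1; A1 = np.tanh(Z1)
    Z2 = A1 @ W2 + b2; Y = sigmoid(Z2)
    # Backward pass: output delta for cross-entropy + sigmoid is (Y - T)
    d2 = Y - T
    d1 = (d2 @ W2.T) * (1 - A1 ** 2)               # delta^(1) = (W2 d2) ⊙ tanh'(Z1)
    # Gradient-descent updates
    W2 -= 0.1 * A1.T @ d2; b2 -= 0.1 * d2.sum(axis=0)
    W1 -= 0.1 * X.T @ d1;  b1 -= 0.1 * d1.sum(axis=0)

print("predictions:", Y.ravel().round(2))          # should approach [0, 1, 1, 0]
```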
This derivation and explanation follow the standard approach used in neural network training literature, such as Bishop's Pattern Recognition and Machine Learning, and are foundational to modern deep learning.