Exhaustive Notes on Machine Learning and Pattern Recognition with
Elaborated Definitions and Theorems
References
Ethem Alpaydin (2013). Introduction to Machine Learning, 2nd
Edition, PHI Learning Publisher.
o A comprehensive guide to machine learning, detailing
foundational concepts, algorithms, and practical applications,
balancing theory and implementation for students and
practitioners.
Sergios Theodoridis and Konstantinos Koutroumbas (2014).
Pattern Recognition, 4th Edition, Academic Press Publisher.
o An in-depth exploration of pattern recognition, covering
statistical and machine learning methods, decision theory, and
real-world examples, with rigorous mathematical grounding.
Unit 1: Introduction to Bayesian Decision
Theory, Classification, Losses and Risks,
Discriminant Functions, Utility Theory,
Association Rules
1. Introduction to Bayesian Decision Theory
Elaborated Definition:
o Bayesian Decision Theory is a robust, probabilistic framework for
making optimal decisions under uncertainty, rooted in statistical
principles. It integrates prior knowledge about the likelihood of
events (e.g., class probabilities) with new evidence from data
(e.g., observed features) to guide choices, such as classifying
objects or predicting outcomes. It assumes uncertainty is
inherent in real-world problems—due to noisy data,
incomplete information, or randomness—and uses
probability to minimize errors or risks in decisions.
o Originates from the work of Thomas Bayes (18th century),
widely applied in machine learning, pattern recognition, and
decision-making fields like medicine, finance, and robotics.
Bayes' Theorem essentially provides a way to update your belief
about an event (A) after observing new evidence (B). It's a
fundamental concept in probability and statistics, used in various
applications like machine learning, medical diagnosis, and more.
Bayes’ Theorem:
o Statement: A mathematical rule to update probabilities based
on new evidence, forming the backbone of Bayesian decision-
making.
o Bayes' Theorem calculates the conditional probability of event A occurring given that event B has occurred, based on the prior probability of A and the likelihood of B given A.
o Formula: P(A|B) = [P(B|A) * P(A)] / P(B)
P(A|B): Posterior probability—the probability of event A
(e.g., class “spam”) given evidence B (e.g., email features
like keywords).
P(B|A): Likelihood—the probability of observing evidence B
given class A, derived from training data or models.
P(A): Prior probability—the initial belief about A’s likelihood
before seeing B, based on historical data or expert
judgment.
P(B): Evidence probability—the total probability of B across all classes; it acts as a normalizing constant and is computed as P(B) = Σ [P(B|A_i) * P(A_i)] over all possible A_i.
o Explanation:
Combines prior belief (P(A)) with observed data (P(B|A)) to
compute a revised probability (P(A|B)).
Example: Classifying an email—P(spam|words) = [P(words|
spam) * P(spam)] / P(words).
P(words) accounts for all ways words appear (in spam and
non-spam), ensuring probabilities sum to 1.
o Application: Choose the class with the highest posterior—e.g., if
P(spam|words) > P(not spam|words), classify as spam.
Process:
1. Estimate P(A) from past data or assumptions (e.g., 30% of emails
are spam).
2. Model P(B|A) from training data (e.g., frequency of “free” in spam
emails).
3. Compute P(A|B) for each class, decide based on maximum.
Advantages: Handles uncertainty, optimal when distributions are
known.
Challenges: Needs accurate priors and likelihoods, complex for high-
dimensional data.
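Illustrative Python sketch: a minimal Bayes update for the spam example above. All priors and likelihoods are made-up numbers chosen only to show the computation.

# Toy Bayes update for the spam example (all probabilities are illustrative assumptions).
p_spam, p_not_spam = 0.30, 0.70                # priors P(spam), P(not spam)
p_words_given_spam = 0.08                      # likelihood P(words | spam)
p_words_given_not_spam = 0.01                  # likelihood P(words | not spam)

# Evidence: P(words) = Σ P(words | class) * P(class) over both classes
p_words = p_words_given_spam * p_spam + p_words_given_not_spam * p_not_spam

# Posteriors via Bayes' theorem
p_spam_given_words = p_words_given_spam * p_spam / p_words
p_not_spam_given_words = p_words_given_not_spam * p_not_spam / p_words

# Decide: pick the class with the larger posterior
print(p_spam_given_words, p_not_spam_given_words)
print("spam" if p_spam_given_words > p_not_spam_given_words else "not spam")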
2. Classification
Elaborated Definition:
o Classification is a fundamental supervised learning task where a
model assigns input data, described by features (e.g.,
measurements, attributes like pixel values or word counts), to
one or more predefined categories or classes. It involves learning
patterns from labeled training data—pairs of inputs and their
correct labels—to predict the class of new, unseen instances. A
cornerstone of machine learning and pattern recognition, it’s
used in diverse domains like spam detection, medical diagnosis,
and image recognition, aiming to generalize from training to real-
world scenarios.
o Core idea: Find decision boundaries (linear or non-linear) that
separate classes based on features.
Types:
o Binary: Two classes—e.g., positive/negative, spam/not spam.
o Multiclass: Multiple mutually exclusive classes—e.g., digit
recognition (0-9).
o Multi-label: Assign multiple labels to one instance—e.g., a news
article tagged as “politics” and “economy.”
Process:
1. Collect labeled dataset: Features (x) and labels (y).
2. Train model to learn mapping from x to y (e.g., via decision trees,
neural networks).
3. Test: Predict labels for new data, evaluate accuracy.
Examples: Diagnose disease (sick/healthy), classify images (cat/dog).
Goal: Maximize correct predictions, minimize errors.
3. Losses and Risks
Elaborated Definition:
o Losses: A loss is a quantitative measure of the penalty or cost
incurred when a model’s decision or prediction deviates from the
true outcome. It reflects the consequence of errors, tailored to
the problem—e.g., in medicine, misclassifying a disease as
benign might be costlier than the reverse. Loss functions guide
model training by quantifying “how wrong” predictions are.
o Risks: Risk is the expected loss, averaging the loss across all
possible outcomes, weighted by their probabilities. It accounts
for uncertainty in predictions, combining the loss function with
the probability distribution of classes given data, and helps
evaluate and optimize decision rules in a probabilistic framework.
Loss Types:
o 0-1 Loss: Simple and common—assigns 0 if prediction is correct,
1 if incorrect.
o Squared Loss: (Predicted - Actual)², measures error magnitude,
often for regression but adaptable to classification.
o Custom Loss: Context-specific—e.g., in cancer diagnosis, false
negative (missing cancer) has higher cost (e.g., 10) than false
positive (e.g., 1).
Risk Details:
o Expected Risk: R(δ(x)) = Σ_y [L(δ(x), y) * P(y|x)], summed over all possible true classes y.
L(δ(x), y): Loss for decision δ(x) (e.g., predicted class) vs.
true class y.
P(y|x): Posterior probability of true class y given input x.
Explanation: Computes average loss over all possible true
classes, weighted by likelihood, guiding optimal decisions.
o Bayes Risk: The minimum expected risk, achieved by the
optimal decision rule (e.g., choose class with highest posterior in
Bayesian theory).
o Example: For binary classification, if L(correct) = 0, L(wrong) =
1, risk is the probability of error; minimize by picking max P(y|x).
Application: Train models to minimize risk, balance errors (e.g., false
positives vs. negatives).
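Illustrative Python sketch of the expected-risk rule: the loss table uses the asymmetric costs from the cancer example above (false negative = 10, false positive = 1), and the posterior values are assumed for illustration.

# Expected risk R(decision) = Σ_y L(decision, y) * P(y | x)
# Classes: 0 = benign, 1 = cancer. Losses and posteriors are illustrative assumptions.
loss = {          # loss[(decision, true_class)]
    (0, 0): 0.0,  # correctly say benign
    (0, 1): 10.0, # false negative: miss cancer (costly)
    (1, 0): 1.0,  # false positive
    (1, 1): 0.0,  # correctly detect cancer
}
posterior = {0: 0.85, 1: 0.15}   # assumed P(y | x)

risk = {d: sum(loss[(d, y)] * posterior[y] for y in (0, 1)) for d in (0, 1)}
best = min(risk, key=risk.get)   # Bayes decision: minimum expected risk
print(risk, "-> decide", best)   # the 10x penalty makes "cancer" the lower-risk call here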
4. Discriminant Functions
Elaborated Definition:
o Discriminant functions are mathematical constructs that map
input features (e.g., a vector of measurements) to scores or
values, enabling classification by separating data into distinct
classes. They define decision boundaries—regions in feature
space where the model assigns one class over another—based
on the relative scores for each class. Used in pattern recognition
and machine learning, they simplify complex data into actionable
decisions, forming the basis for methods like linear discriminant
analysis and support vector machines.
Types:
o Linear: g(x) = w^T x + b
w: Weight vector, adjusts feature importance.
x: Feature vector.
b: Bias, shifts boundary.
Classify: Assign to Class 1 if g(x) > 0, else Class 2.
o Quadratic: g(x) = x^T W x + w^T x + b
W: Matrix for quadratic terms, captures non-linear
relations.
Used for non-linear boundaries.
Bayesian Context:
o Derived from posteriors: g_i(x) = log(P(x|C_i)) + log(P(C_i))
P(x|C_i): Likelihood of features x given class C_i.
P(C_i): Prior probability of class C_i.
Classify to class i with highest g_i(x).
Example: Two-class problem—classify a point (x_1, x_2) as “cat” if
g_cat(x) > g_dog(x).
Advantages: Interpretable, computationally efficient for linear cases.
Cons: Limited by form (linear struggles with non-linear data).
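Illustrative Python sketch of a linear discriminant g(x) = w^T x + b; the weights and bias are arbitrary assumed values used only to show the decision rule.

import numpy as np

w = np.array([1.5, -0.7])   # assumed weight vector (feature importances)
b = -0.2                    # assumed bias (shifts the boundary)

def g(x):
    """Linear discriminant score: positive -> Class 1, negative -> Class 2."""
    return w @ x + b

x = np.array([0.4, 0.1])
print("Class 1" if g(x) > 0 else "Class 2")   # g(x) = 0.6 - 0.07 - 0.2 = 0.33 > 0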
5. Utility Theory
Elaborated Definition:
o Utility Theory is a decision-making framework from economics
and statistics, extended to machine learning, where outcomes of
decisions are assigned numerical values called utilities to
represent their desirability, benefit, or preference. Unlike loss,
which penalizes errors, utility quantifies the positive value of
correct or beneficial choices, guiding decisions to maximize
expected gain. It accounts for uncertainty by weighting utilities
with probabilities, making it ideal for scenarios where outcomes
have varying impacts (e.g., medical, financial decisions).
Expected Utility:
o Formula: EU(δ(x)) = Σ_y [U(δ(x), y) * P(y|x)], summed over all possible true classes y.
U(δ(x), y): Utility (benefit) of decision δ(x) (e.g., classify as
positive) given true class y.
P(y|x): Posterior probability of true class y given input x.
Explanation: Computes average utility across all possible
true classes, weighted by their likelihood. Decision rule:
Choose δ(x) to maximize EU.
o Example: In cancer testing:
U(detect cancer, cancer) = 10 (high benefit of early
detection).
U(detect cancer, no cancer) = -1 (cost of false positive).
Compute EU for “test positive” vs. “test negative,” pick
higher.
Contrast with Risk: Risk minimizes expected loss; utility maximizes
expected benefit.
Applications: Medical diagnosis (value of early treatment), business
(profit from decisions).
Challenges: Utility is subjective, hard to quantify, depends on context
or stakeholder views.
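Illustrative Python sketch of the expected-utility rule for the cancer-testing example; the utilities 10 and -1 follow the example above, while the remaining utilities and the posterior are assumed values.

# Expected utility EU(decision) = Σ_y U(decision, y) * P(y | x)
utility = {
    ("treat", "cancer"): 10.0,     # benefit of early detection (from the example)
    ("treat", "no_cancer"): -1.0,  # cost of a false positive (from the example)
    ("no_treat", "cancer"): -20.0, # assumed: missing cancer is very bad
    ("no_treat", "no_cancer"): 0.0,
}
posterior = {"cancer": 0.1, "no_cancer": 0.9}   # assumed P(y | x)

eu = {d: sum(utility[(d, y)] * posterior[y] for y in posterior)
      for d in ("treat", "no_treat")}
print(eu, "-> choose", max(eu, key=eu.get))     # decision rule: maximize expected utility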
6. Association Rules
Elaborated Definition:
o Association Rules are a data mining technique to uncover
frequent, meaningful relationships or patterns between items or
events in large datasets, typically expressed as “If A, then B”
(e.g., if bread is bought, butter is likely). Widely used in market
basket analysis, recommendation systems, and bioinformatics,
they identify co-occurrences or dependencies, helping to predict
behavior, optimize strategies, or discover insights. Rules are
evaluated by metrics like support, confidence, and lift to ensure
strength and relevance.
Key Metrics:
o Support: Proportion of transactions containing both A and B.
Formula: Support(A→B) = P(A and B) = (Count of {A, B}) /
(Total transactions).
Example: 10% of shoppers buy bread and butter together.
o Confidence: Probability of B given A, measuring rule reliability.
Formula: Confidence(A→B) = P(B|A) = Support(A→B) /
Support(A).
Example: 80% of bread buyers also buy butter.
o Lift: Measures strength of rule over random co-occurrence.
Formula: Lift(A→B) = Confidence(A→B) / P(B).
Interpretation: Lift > 1 (positive correlation), = 1
(independent), < 1 (negative).
Process:
1. Find frequent itemsets (e.g., {bread, butter}) using Apriori
algorithm, above min support.
2. Generate rules (e.g., bread→butter) with confidence above
threshold.
3. Evaluate with lift, other metrics.
Example: In a store, {bread} → {butter}, Support = 10%, Confidence
= 80%, Lift = 1.5.
Applications: Product bundling, cross-selling, gene association
studies.
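Illustrative Python sketch computing support, confidence, and lift for the rule {bread} → {butter} on a small made-up transaction list.

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]
n = len(transactions)

count_a = sum("bread" in t for t in transactions)               # transactions with A
count_ab = sum({"bread", "butter"} <= t for t in transactions)  # with both A and B
count_b = sum("butter" in t for t in transactions)              # transactions with B

support = count_ab / n                # P(A and B)
confidence = count_ab / count_a       # P(B | A) = Support(A->B) / Support(A)
lift = confidence / (count_b / n)     # Confidence / P(B); > 1 indicates positive association
print(support, confidence, lift)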
Unit 2: Supervised Machine Learning Algorithm, Non-Parametric
Methods: Histogram Estimator, Kernel Estimator, K-Nearest
Neighbor Estimator
1. Supervised Machine Learning Algorithm
Elaborated Definition:
o Supervised Machine Learning is a paradigm in which a model
learns to map input data (features, such as measurements,
images, or text attributes) to correct output labels (categories or
values) using a labeled training dataset. The “supervision” comes
from known input-output pairs, guiding the model to generalize
patterns for predicting outcomes on new, unseen data. A
cornerstone of machine learning, it’s applied in classification
(e.g., spam detection) and regression (e.g., price prediction),
relying on algorithms to minimize prediction errors through
training.
Process:
1. Collect labeled data: Features (x, e.g., pixel values) and labels (y,
e.g., “cat”).
2. Split into training (learn), test (evaluate) sets.
3. Train model: Minimize error via loss function.
4. Predict: Apply to new data, assess accuracy.
Types:
o Classification: Predict discrete labels (e.g., spam/not spam).
o Regression: Predict continuous values (e.g., temperature).
Examples:
o Linear Regression: y = w^T x + b, fits a line.
o Logistic Regression: Uses sigmoid for binary probability.
o Decision Trees: Split data by feature thresholds.
o SVM: Finds optimal hyperplane to separate classes.
Challenges: Overfitting (memorizes noise), underfitting (misses
patterns), data quality.
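Illustrative Python sketch of the supervised workflow above (collect labeled data, split, train, evaluate), assuming scikit-learn is available; logistic regression is just one possible choice of model.

# Minimal supervised-learning workflow: labeled data -> split -> fit -> predict -> evaluate.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)            # features x, labels y
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)               # train/test split

model = LogisticRegression(max_iter=5000)              # a simple classifier
model.fit(X_train, y_train)                            # learn the mapping x -> y
print(accuracy_score(y_test, model.predict(X_test)))   # evaluate on unseen data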
2. Non-Parametric Methods
Elaborated Definition:
o Non-Parametric Methods are flexible statistical and machine
learning techniques that make no strict assumptions about the
underlying data distribution (e.g., not assuming normality, as
parametric methods like linear regression do). Instead, they
adapt to the data’s shape, with model complexity (e.g., number
of parameters) growing with dataset size. Ideal for complex, non-
linear, or unknown distributions, they’re used in density
estimation and classification, offering versatility at the cost of
computation.
Characteristics:
o No fixed parameters (e.g., mean, variance).
o Flexible: Fit irregular patterns.
Pros: Handle diverse data, no distribution assumption.
Cons: Slow for large data, sensitive to noise.
Histogram Estimator
Elaborated Definition:
o The Histogram Estimator is a simple non-parametric method to
estimate the probability density function of a dataset by dividing
the feature space into discrete, fixed-width bins and counting the
frequency of data points in each. It approximates the distribution
by normalizing counts, providing a visual and intuitive way to
understand data patterns (e.g., spread, peaks). Common in
statistics and machine learning, it’s a starting point for density
estimation.
Process:
1. Divide feature range (e.g., age 0-100) into bins (e.g., 0-10, 10-
20).
2. Count points in each bin.
3. Normalize: Density = Count / (Total points * Bin width).
Example: For ages [5, 15, 12, 30] with bins 0-10 and 10-20: Density = 1/(4 × 10) = 0.025 for bin 0-10 (1 point), and 2/(4 × 10) = 0.05 for bin 10-20 (2 points).
Pros: Easy, visual, fast for small data.
Cons: Bin size affects result, poor for high dimensions.
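Illustrative Python sketch of the histogram estimator for the ages example above (4 points, bin width 10), assuming NumPy.

import numpy as np

ages = np.array([5, 15, 12, 30])
bins = np.arange(0, 110, 10)                 # bins 0-10, 10-20, ..., 90-100

counts, edges = np.histogram(ages, bins=bins)
density = counts / (len(ages) * 10)          # Count / (Total points * Bin width)
print(density[:4])   # [0.025, 0.05, 0., 0.025] for bins 0-10, 10-20, 20-30, 30-40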
Kernel Estimator
Elaborated Definition:
o The Kernel Estimator, or Kernel Density Estimation (KDE), is a
sophisticated non-parametric technique to estimate a smooth,
continuous probability density function by placing a kernel (a
weighting function, often Gaussian) at each data point and
summing their contributions. It avoids the blocky nature of
histograms, producing a smoother curve that better captures
data distribution, widely used in statistics, visualization, and
machine learning for flexible density estimation.
Process:
o Formula: f(x) = (1/nh) * Σ K((x - x_i) / h)
n: Number of data points.
x_i: Observed data points.
h: Bandwidth, controls smoothness.
K: Kernel function (e.g., Gaussian: K(z) = (1/√(2π)) * e^(-
z²/2)).
o Explanation:
Each point contributes a kernel, weighted by distance from
x.
Bandwidth h: Small h gives a spiky fit, large h oversmooths.
Example: Estimate income density—place Gaussian kernels at $30k,
$50k, sum for smooth curve.
Pros: Smooth, flexible, kernel choice adapts fit.
Cons: Bandwidth selection tricky, computationally heavy.
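Illustrative Python sketch of kernel density estimation with a Gaussian kernel; the income values and bandwidth are assumed toy numbers.

import numpy as np

def gaussian_kernel(z):
    """Standard Gaussian kernel K(z) = (1 / sqrt(2*pi)) * exp(-z^2 / 2)."""
    return np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)

def kde(x, data, h):
    """Kernel density estimate f(x) = (1 / (n*h)) * Σ_i K((x - x_i) / h)."""
    return gaussian_kernel((x - data[:, None]) / h).sum(axis=0) / (len(data) * h)

incomes = np.array([30_000, 35_000, 50_000, 52_000, 80_000])  # toy data points
grid = np.linspace(20_000, 100_000, 5)
print(kde(grid, incomes, h=5_000))   # smooth density values on the grid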
K-Nearest Neighbor Estimator
Elaborated Definition:
o The K-Nearest Neighbor (KNN) Estimator is a versatile non-
parametric method used for density estimation or classification,
relying on the k closest data points to a query point in feature
space. It estimates probabilities or assigns classes based on local
data, adapting to patterns without assuming a distribution.
Simple yet powerful, it’s applied in pattern recognition, machine
learning, and data analysis, leveraging proximity to infer
outcomes.
Process:
o Density Estimation: f(x) = k / (n * V)
k: Number of nearest neighbors.
n: Total points.
V: Volume of region containing k neighbors (depends on
distance).
o Classification:
1. Compute distance (e.g., Euclidean: sqrt(Σ (x_i - y_i)²)).
2. Find k nearest points.
3. Assign class by majority vote.
o Explanation: Local density or class depends on nearby points; k
controls sensitivity.
Example: Classify tumor—5 neighbors, 3 malignant, 2 benign →
malignant.
Pros: Intuitive, adaptive, no model assumption.
Cons: Slow for large data, k choice critical, struggles in high
dimensions.
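Illustrative Python sketch of KNN classification by majority vote; the two-feature "tumor" data and labels are made up for the example.

import numpy as np
from collections import Counter

def knn_predict(x_query, X_train, y_train, k=5):
    """Classify x_query by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)      # Euclidean distances
    nearest = np.argsort(dists)[:k]                         # indices of k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]   # majority label

# Toy tumor example: 0 = benign, 1 = malignant (features are made up).
X_train = np.array([[1.0, 1.2], [0.9, 1.0], [3.0, 3.1], [3.2, 2.9], [2.8, 3.3]])
y_train = np.array([0, 0, 1, 1, 1])
print(knn_predict(np.array([2.5, 2.5]), X_train, y_train, k=5))  # -> 1 (malignant)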
Unit 3: Dimensionality Reduction: Introduction, Subset Selection,
Principal Component Analysis (PCA), Factor Analysis, Singular Value
Decomposition and Matrix Factorization, Multidimensional Scaling,
Linear Discriminant Analysis
1. Introduction
Elaborated Definition:
o Dimensionality Reduction is a set of techniques in machine
learning and statistics to decrease the number of features
(dimensions) in a dataset while preserving essential information,
patterns, or structure. High-dimensional data (e.g., thousands of
features in genomics) poses challenges like increased
computation, storage, and the “curse of dimensionality” (sparse,
noisy data). These methods simplify models, enable visualization,
and improve performance, critical in image processing, text
analysis, and more.
Purpose:
o Reduce computation time, memory use.
o Prevent overfitting, improve generalization.
o Visualize high-dimensional data (e.g., 2D/3D plots).
Types:
o Feature Selection: Keep best original features.
o Feature Extraction: Transform to new, lower-dimensional space.
2. Subset Selection
Elaborated Definition:
o Subset Selection is a dimensionality reduction approach that
identifies and retains a smaller, optimal subset of original
features from a dataset, discarding those deemed irrelevant,
redundant, or noisy. It preserves the original feature space,
focusing on the most predictive or informative variables, and is
widely used in machine learning to simplify models, enhance
interpretability, and reduce complexity.
Methods:
o Filter: Rank features independently—e.g., by correlation with
output, variance; select top k.
o Wrapper: Test subsets with model—e.g., forward selection (add
features, check accuracy), backward elimination (remove, test).
o Embedded: Built into model—e.g., LASSO uses penalty to shrink
irrelevant features to zero.
Example: From 100 gene features, select 10 most correlated with
disease.
Pros: Interpretable, retains original meaning.
Cons: Misses feature interactions.
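Illustrative Python sketch of a filter-style subset selection: rank features by absolute correlation with the target and keep the top k. The synthetic data is constructed so that only two features matter.

import numpy as np

def filter_select(X, y, k):
    """Filter method: rank features by |correlation with y|, keep the top k."""
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]            # indices of the k best features

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))                     # 20 synthetic features
y = 3 * X[:, 4] - 2 * X[:, 7] + rng.normal(size=100)   # only features 4 and 7 matter
print(filter_select(X, y, k=2))                    # should recover features 4 and 7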
3. Principal Component Analysis (PCA)
Elaborated Definition:
o Principal Component Analysis (PCA) is a widely used feature
extraction technique that transforms high-dimensional data into
a new coordinate system of orthogonal axes (principal
components) that capture the maximum variance in the data. It
reduces dimensions by projecting data onto fewer axes,
prioritizing those with the most variation, making it invaluable for
visualization, noise reduction, and efficient modeling in machine
learning and pattern recognition.
Process:
1. Standardize data: Center (mean = 0), scale (variance = 1).
2. Compute covariance matrix: Measures feature correlations.
3. Find eigenvectors (directions) and eigenvalues (variance amount)
of matrix.
4. Sort by eigenvalues, project data onto top k components.
Explanation:
o Eigenvectors: New axes, uncorrelated, capture data spread.
o Eigenvalues: Show variance along each axis—pick top for max
info.
Example: Reduce 3D (height, weight, age) to 2D by projecting onto
axes of max variance.
Pros: Reduces dimensions, removes correlation.
Cons: Loses feature meaning, assumes linearity.
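Illustrative Python sketch of PCA via the covariance matrix and its eigen-decomposition, following the steps above; the 3-feature data is synthetic.

import numpy as np

def pca(X, k):
    """Project standardized data onto the top-k principal components."""
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)      # 1. standardize
    cov = np.cov(Xc, rowvar=False)                 # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)         # 3. eigenvectors / eigenvalues
    order = np.argsort(eigvals)[::-1][:k]          # 4. sort by variance, keep top k
    return Xc @ eigvecs[:, order], eigvals[order]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                      # three synthetic features
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=200)     # make one feature nearly redundant
scores, variances = pca(X, k=2)                    # 3D -> 2D projection
print(scores.shape, variances)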
4. Factor Analysis
Elaborated Definition:
o Factor Analysis is a statistical method for dimensionality
reduction that models observed variables (features) as linear
combinations of fewer, unobserved latent factors, plus error
terms. It assumes that correlations among variables arise from
underlying, hidden factors, aiming to uncover these to simplify
data interpretation. Used in psychology, social sciences, and
machine learning, it reduces dimensions while revealing latent
structure.
Process:
1. Model: x = Λ * f + e
x: Observed variables, Λ: Loadings (factor-variable
relations), f: Latent factors, e: Error.
2. Estimate loadings, factors via methods (e.g., maximum
likelihood).
3. Interpret factors—e.g., “intelligence” from test scores.
Explanation:
o Loadings: Show how strongly each variable ties to a factor.
o Factors: Latent variables explain correlations.
Example: IQ, math, verbal scores explained by a “cognitive ability”
factor.
Pros: Reveals hidden patterns, reduces dimensions.
Cons: Assumes linearity, interpretation subjective.
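Illustrative Python sketch of a one-factor model, assuming scikit-learn's FactorAnalysis is available; the three "test scores" are generated from a single synthetic latent factor.

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
ability = rng.normal(size=(300, 1))                     # latent "cognitive ability" f
noise = 0.3 * rng.normal(size=(300, 3))
scores = ability @ np.array([[0.9, 0.8, 0.7]]) + noise  # x = Λ * f + e

fa = FactorAnalysis(n_components=1).fit(scores)
print(fa.components_)   # estimated loadings Λ (recovered up to sign/scale)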
5. Singular Value Decomposition and Matrix Factorization
Singular Value Decomposition (SVD):
o Elaborated Definition:
Singular Value Decomposition is a powerful linear algebra
technique that decomposes a matrix (e.g., data matrix)
into three components, revealing its structure and enabling
dimensionality reduction. It generalizes eigenvalue
decomposition to any rectangular matrix, widely used in
machine learning for compression, noise reduction, and
applications like recommender systems.
o Formula: X = U * Σ * V^T
X: n x m data matrix (e.g., n samples, m features).
U: n x n matrix of left singular vectors (sample patterns).
Σ: n x m diagonal matrix of singular values (importance
weights).
V^T: m x m matrix of right singular vectors (feature
patterns).
o Explanation:
Singular values in Σ: Rank by magnitude, show data
importance.
Reduce: Keep top k values, approximate X with lower rank.
o Example: Compress user-movie rating matrix for
recommendations.
Matrix Factorization:
o Elaborated Definition:
Matrix Factorization approximates a high-dimensional
matrix as the product of two or more lower-rank matrices,
reducing complexity while capturing key patterns. Common
in collaborative filtering, it uncovers latent factors (e.g.,
user preferences, item traits) to predict missing entries,
vital in recommender systems and data analysis.
o Formula: X ≈ W * H
X: Original matrix, W: n x k matrix, H: k x m matrix, k:
Reduced rank.
o Example: Predict ratings—W (user factors), H (movie factors).
Pros: Handles sparse data, reduces dimensions.
Cons: Computationally intensive, less interpretable.
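Illustrative Python sketch of a rank-k approximation via SVD on a toy user × movie rating matrix (made-up ratings).

import numpy as np

X = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U * Σ * V^T
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # keep only the top-k singular values
print(np.round(X_k, 1))                            # low-rank approximation of X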
6. Multidimensional Scaling (MDS)
Elaborated Definition:
o Multidimensional Scaling is a dimensionality reduction and
visualization technique that maps high-dimensional data points
to a lower-dimensional space (e.g., 2D or 3D) while preserving
pairwise distances or dissimilarities as closely as possible. It
helps visualize complex data, revealing clusters or patterns, and
is used in psychology, bioinformatics, and machine learning for
exploratory analysis.
Process:
1. Compute distance matrix (e.g., Euclidean) between all points.
2. Optimize low-dimensional coordinates to minimize stress
(difference between original and new distances).
Types:
o Metric MDS: Preserves exact distances.
o Non-metric MDS: Preserves rank order of distances.
Explanation:
o Stress: Measures fit quality—lower stress, better mapping.
o Example: Map city travel distances to 2D, see geographic
clusters.
Pros: Great for visualization, flexible distances.
Cons: Sensitive to noise, slow for large data.
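Illustrative Python sketch of classical (metric) MDS via double-centering and eigen-decomposition; non-metric MDS would instead iteratively minimize stress. The city distance matrix is made up.

import numpy as np

def classical_mds(D, k=2):
    """Classical metric MDS: embed points in k-D from a pairwise distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)
    top = np.argsort(eigvals)[::-1][:k]          # largest eigenvalues
    return eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0))

# Toy "travel distance" matrix between 4 cities (made-up numbers).
D = np.array([[0, 2, 6, 7],
              [2, 0, 5, 6],
              [6, 5, 0, 2],
              [7, 6, 2, 0]], dtype=float)
print(classical_mds(D, k=2))                     # 2-D coordinates preserving distances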
7. Linear Discriminant Analysis (LDA)
Elaborated Definition:
o Linear Discriminant Analysis is a supervised dimensionality
reduction and classification method that finds linear
combinations of features to maximize separation between known
classes while minimizing variation within each class. Unlike PCA
(variance-focused), LDA uses class labels to optimize
discriminability, making it ideal for pattern recognition and
machine learning tasks like face recognition or spam filtering.
Process:
1. Compute within-class scatter (S_W): Variation within each class.
2. Compute between-class scatter (S_B): Variation between class
means.
3. Find linear discriminants: Maximize the ratio of between-class to within-class scatter by taking the top eigenvectors of S_W^(-1) S_B.
4. Project data onto top discriminants.
Explanation:
o Goal: Spread classes apart, keep points within classes tight.
o Eigenvectors: Directions of max separation.
Example: Reduce 3D features to 1D to separate two species.
Pros: Supervised, optimizes class separation.
Cons: Assumes normal data, linear boundaries.
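Illustrative Python sketch of the two-class Fisher discriminant w = S_W^(-1)(m_1 - m_2), a common special case of LDA; the 3-D class data is synthetic.

import numpy as np

def fisher_lda_direction(X1, X2):
    """Two-class Fisher discriminant direction: w = S_W^(-1) (m1 - m2)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (np.cov(X1, rowvar=False) * (len(X1) - 1)
           + np.cov(X2, rowvar=False) * (len(X2) - 1))   # within-class scatter
    return np.linalg.solve(S_W, m1 - m2)                 # direction maximizing separation

rng = np.random.default_rng(0)
X1 = rng.normal([0, 0, 0], 1.0, size=(100, 3))   # class 1 in 3-D
X2 = rng.normal([3, 3, 0], 1.0, size=(100, 3))   # class 2, shifted in two features
w = fisher_lda_direction(X1, X2)
print((X1 @ w).mean(), (X2 @ w).mean())          # projected 1-D class means are well separated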
Unit 4: Unsupervised Learning: Introduction, Hierarchical Clustering:
Agglomerative Clustering Algorithm, The Single Linkage Algorithm,
The Complete Linkage Algorithm, The Average-Linkage Algorithm,
Partitional Clustering: Forgy’s Algorithm, The K-Means Algorithm
1. Introduction
Elaborated Definition:
o Unsupervised Learning is a machine learning paradigm where
models analyze unlabeled data—lacking predefined outputs or
categories—to discover hidden patterns, structures, or
relationships. Without supervision (no correct answers provided),
it relies on intrinsic data properties, such as similarity or density,
to group, reduce, or interpret data. Essential in clustering,
dimensionality reduction, and anomaly detection, it’s applied in
customer segmentation, image compression, and exploratory
analysis.
Purpose:
o Cluster similar items (e.g., group customers).
o Reduce dimensions (e.g., simplify data).
o Detect anomalies (e.g., fraud).
Contrast: No labels, unlike supervised learning; finds natural
structure.
2. Hierarchical Clustering
Elaborated Definition:
o Hierarchical Clustering is an unsupervised method that organizes
data into a nested hierarchy of clusters, building a tree-like
structure (dendrogram) to show relationships. It proceeds either
bottom-up (agglomerative, merging small clusters) or top-down
(divisive, splitting large clusters), using distance metrics to group
similar points, widely used in biology, social sciences, and data
analysis.
Agglomerative Clustering Algorithm:
o Process:
1. Start: Each data point is a cluster.
2. Compute pairwise distances (e.g., Euclidean: sqrt(Σ (x_i -
y_i)²)).
3. Merge closest clusters based on linkage criterion.
4. Repeat until one cluster or desired level.
o Output: Dendrogram—cut at level for k clusters.
o Example: Group patients by symptom similarity, see hierarchy.
The Single Linkage Algorithm
Elaborated Definition:
o The Single Linkage Algorithm is a hierarchical clustering
approach that merges clusters based on the minimum distance
between any two points, one from each cluster. It focuses on the
closest pair, forming chains of similar points, and is effective for
detecting elongated or non-spherical clusters in unsupervised
learning.
Distance: d(C1, C2) = min(dist(x, y)), x in C1, y in C2.
Example: Cluster points—merge if closest pair is 1 unit apart.
Pros: Simple, good for irregular clusters.
Cons: Chaining (long, stringy clusters), sensitive to noise.
The Complete Linkage Algorithm
Elaborated Definition:
o The Complete Linkage Algorithm is a hierarchical clustering
method that merges clusters based on the maximum distance
between any two points, one from each cluster. It prioritizes
compact, tight clusters, minimizing the spread within groups, and
is robust to outliers, used in pattern recognition and data
analysis.
Distance: d(C1, C2) = max(dist(x, y)), x in C1, y in C2.
Example: Merge clusters if farthest points are close, forms spheres.
Pros: Robust to noise, compact clusters.
Cons: Biased to spherical shapes, costly.
The Average-Linkage Algorithm
Elaborated Definition:
o The Average-Linkage Algorithm is a hierarchical clustering
technique that merges clusters based on the average distance
between all pairs of points, one from each cluster. It balances
single and complete linkage, offering a compromise that avoids
extreme chaining or compactness, making it versatile for
unsupervised data grouping.
Distance: d(C1, C2) = avg(dist(x, y)), x in C1, y in C2.
Example: Merge if average distance between points is small.
Pros: Balanced, less extreme than single/complete.
Cons: Computationally intensive, assumes balanced clusters.
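Illustrative Python sketch of the three linkage criteria as distance functions between two small clusters of points.

import numpy as np

def single_link(C1, C2):
    """Minimum distance between any point in C1 and any point in C2."""
    return min(np.linalg.norm(x - y) for x in C1 for y in C2)

def complete_link(C1, C2):
    """Maximum distance between any point in C1 and any point in C2."""
    return max(np.linalg.norm(x - y) for x in C1 for y in C2)

def average_link(C1, C2):
    """Average distance over all cross-cluster pairs."""
    return np.mean([np.linalg.norm(x - y) for x in C1 for y in C2])

C1 = [np.array([0.0, 0.0]), np.array([1.0, 0.0])]
C2 = [np.array([3.0, 0.0]), np.array([5.0, 0.0])]
print(single_link(C1, C2), complete_link(C1, C2), average_link(C1, C2))  # 2.0 5.0 3.5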
3. Partitional Clustering
Elaborated Definition:
o Partitional Clustering is an unsupervised learning approach that
divides a dataset into non-overlapping, distinct clusters, directly
partitioning points into a fixed number of groups. It optimizes a
criterion, such as minimizing within-cluster distances or variance,
to group similar objects, common in data analysis, image
segmentation, and market research.
Forgy’s Algorithm
Elaborated Definition:
o Forgy’s Algorithm is an early partitional clustering method that
assigns data points to a fixed number of clusters, iteratively
updating cluster centers (centroids) to group similar items. A
precursor to k-means, it relies on random initialization and
distance-based assignment, used in unsupervised learning to
explore data structure.
Process:
1. Randomly assign points to k clusters.
2. Compute centroid (mean) of each cluster.
3. Reassign points to nearest centroid.
4. Repeat until assignments stabilize.
Example: Group customers by purchases into 3 clusters.
Pros: Simple, foundational.
Cons: Sensitive to initial guess.
The K-Means Algorithm
Elaborated Definition:
o The K-Means Algorithm is a popular partitional clustering method
that divides data into k non-overlapping clusters by minimizing
the within-cluster variance (sum of squared distances to
centroids). Widely used in machine learning, it iteratively assigns
points to clusters and updates centers, effective for compact,
spherical groups in data analysis.
Process:
1. Choose k, randomly initialize k centroids.
2. Assign each point to nearest centroid (Euclidean distance).
3. Update centroids: Mean of assigned points.
4. Repeat until centroids stabilize.
Explanation:
o Objective: Minimize Σ_j Σ_(x_i in cluster j) ||x_i - μ_j||², where x_i is a data point and μ_j is the centroid of cluster j.
o Converges to local optimum, depends on initial centroids.
Example: Cluster images by color into 5 groups.
Pros: Fast, scalable, good for spherical clusters.
Cons: Needs k, sensitive to outliers, initialization.
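Illustrative Python sketch of plain k-means (random initialization from the data, assign to the nearest centroid, recompute centroids, repeat); the two well-separated blobs are synthetic.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: assign points to nearest centroid, recompute centroids, repeat."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random initialization
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                           # nearest-centroid assignment
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):               # converged
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))])  # two blobs
labels, centroids = kmeans(X, k=2)
print(centroids)   # roughly (0, 0) and (4, 4)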
Unit 5: Multilayer Perceptron: The Perceptron, Training a
Perceptron, Learning Boolean Functions, Multilayer Perceptrons,
Back Propagation Algorithm, Training Procedures, Tuning Network
Size
1. The Perceptron
Elaborated Definition:
o The Perceptron is a foundational artificial neural network model,
introduced by Frank Rosenblatt (1958), designed as a binary
linear classifier. It takes multiple input features, applies weights
to reflect their importance, adds a bias, and uses an activation
function (typically a step function) to produce a binary output,
classifying data into two categories. A building block of neural
networks, it’s limited to linearly separable problems but pivotal in
early machine learning.
Structure:
o Inputs: x_1, x_2, ..., x_n (feature vector).
o Weights: w_1, w_2, ..., w_n (importance of each input).
o Bias: b (adjusts threshold).
o Output: y = 1 if w^T x + b > 0, else 0 (step function).
Purpose: Classify linearly separable data (e.g., AND gate).
Limits: Fails for non-linear patterns (e.g., XOR).
2. Training a Perceptron
Elaborated Definition:
o Training a Perceptron is the process of iteratively adjusting its
weights and bias using labeled data to minimize classification
errors, enabling the model to correctly separate two classes. It
relies on a learning rule to update parameters based on
prediction mistakes, converging to a solution for linearly
separable data, a key concept in early neural network training.
Process:
1. Initialize weights, bias randomly (small values, e.g., 0.01).
2. For each sample (x, y):
Compute output: ŷ = step(w^T x + b).
Update: w = w + η * (y - ŷ) * x, b = b + η * (y - ŷ).
η: Learning rate (e.g., 0.01), controls step size.
3. Repeat until convergence (few errors).
Explanation:
o Error (y - ŷ): Drives update—positive if underpredict, negative if
overpredict.
o Converges: Finds line to separate classes if possible.
Example: Train to classify points above/below y = x.
Pros: Simple, guaranteed for linear data.
Cons: Fails for non-linear cases.
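Illustrative Python sketch of the perceptron learning rule, trained here on the (linearly separable) AND function.

import numpy as np

def train_perceptron(X, y, eta=0.1, epochs=50):
    """Perceptron rule: w <- w + eta*(y - y_hat)*x, b <- b + eta*(y - y_hat)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = 1 if w @ x_i + b > 0 else 0     # step activation
            w += eta * (y_i - y_hat) * x_i          # updates only on mistakes
            b += eta * (y_i - y_hat)
    return w, b

# Learn the AND function (linearly separable).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
print([(1 if w @ x + b > 0 else 0) for x in X])   # [0, 0, 0, 1]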
3. Learning Boolean Functions
Elaborated Definition:
o Learning Boolean Functions involves training a perceptron or
similar model to represent logical operations (e.g., AND, OR,
NOT) that map binary inputs (0 or 1) to binary outputs. These
functions model logical relationships, critical in computer science
and machine learning, but single perceptrons are limited to
linearly separable functions, requiring advanced models for
complex cases like XOR.
Examples:
o AND: Inputs (x_1, x_2), output 1 if x_1 = 1 and x_2 = 1.
Weights: w_1 = 1, w_2 = 1, b = -1.5.
Check: (1, 1) → 1 + 1 - 1.5 = 0.5 > 0 → 1.
o OR: Output 1 if x_1 = 1 or x_2 = 1.
Weights: w_1 = 1, w_2 = 1, b = -0.5.
Check: (1, 0) → 1 + 0 - 0.5 = 0.5 > 0 → 1.
o XOR: Not linearly separable, needs multilayer model.
Insight: Perceptron solves linear functions, not complex ones.
4. Multilayer Perceptrons (MLPs)
Elaborated Definition:
o Multilayer Perceptrons are advanced neural networks with
multiple layers—input, one or more hidden layers, and an output
layer—designed to model complex, non-linear relationships in
data. Each node (neuron) processes inputs with weights, bias,
and a non-linear activation function, enabling solutions to
problems like XOR, image recognition, and more, making MLPs a
cornerstone of deep learning.
Structure:
o Input layer: Features (x_1, x_2, ..., x_n).
o Hidden layers: Nodes apply w^T x + b, then activation (e.g.,
sigmoid, ReLU).
o Output layer: Produces prediction (class, value).
Activation: Non-linear—e.g., sigmoid (1/(1 + e^(-z))), ReLU (max(0,
z)).
Purpose: Solve non-linear problems.
5. Back Propagation Algorithm
Elaborated Definition:
o The Back Propagation Algorithm (short for “backward
propagation of errors”) is a supervised learning method to train
multilayer perceptrons by minimizing prediction error. It
computes the gradient of a loss function with respect to weights,
propagating errors backward through the network to update
parameters, leveraging the chain rule for efficiency. A
foundational technique in neural networks, it powers deep
learning.
Process:
1. Forward Pass:
Pass input through layers, apply weights, bias, activation.
Compute loss—e.g., mean squared error: (y - ŷ)².
2. Backward Pass:
Compute gradient: ∂Loss/∂w for each weight via chain rule.
Update: w = w - η * ∂Loss/∂w, η is learning rate.
3. Repeat for epochs until loss stabilizes.
Explanation:
o Chain rule: Breaks gradient into layer-by-layer contributions.
o Example: Loss = (y - ŷ)², gradient adjusts weights to reduce
error.
Example: Train MLP for digits, minimize classification error.
Pros: Effective for non-linear tasks.
Cons: Slow, risks local minima.
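Illustrative Python sketch of back propagation on a tiny MLP learning XOR (sigmoid activations, squared-error loss, batch gradient descent); the layer sizes, learning rate, and epoch count are assumed choices, and results may vary with initialization.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR data: not linearly separable, so a hidden layer is required.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros((1, 4))   # input -> 4 hidden units
W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros((1, 1))   # hidden -> output
eta = 0.5                                             # learning rate

for _ in range(20000):
    # Forward pass: compute hidden activations and output
    h = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)
    # Backward pass: error signals via the chain rule on 0.5 * (y_hat - y)^2
    delta2 = (y_hat - y) * y_hat * (1 - y_hat)        # output layer
    delta1 = (delta2 @ W2.T) * h * (1 - h)            # propagated back to hidden layer
    # Gradient-descent weight updates
    W2 -= eta * h.T @ delta2
    b2 -= eta * delta2.sum(axis=0, keepdims=True)
    W1 -= eta * X.T @ delta1
    b1 -= eta * delta1.sum(axis=0, keepdims=True)

print(np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2).ravel(), 2))  # ~[0, 1, 1, 0]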
6. Training Procedures
Elaborated Definition:
o Training Procedures encompass the systematic steps to prepare,
configure, and optimize a machine learning model (e.g., MLP)
using labeled data to minimize prediction error. It involves data
handling, model setup, loss definition, optimization, and
evaluation, ensuring the model learns patterns effectively while
avoiding issues like overfitting, critical in supervised learning.
Steps:
1. Data Prep: Collect labeled data, split: train (70%), validation
(15%), test (15%).
2. Initialize: Small random weights and bias (e.g., on the order of 0.01).
3. Loss Function: Choose—e.g., MSE for regression, cross-entropy
for classification.
4. Optimization:
Gradient descent: Update weights to minimize loss.
Variants: Stochastic GD (per sample), batch GD.
Learning rate (η): Small (0.01) for stability, large for speed.
5. Train: Run back propagation, adjust weights.
6. Validate: Check validation error, tune model.
7. Test: Evaluate on test set—accuracy, F1-score.
Challenges: Overfitting, underfitting, slow convergence.
7. Tuning Network Size
Elaborated Definition:
o Tuning Network Size is the process of adjusting the architecture
of a neural network—number of layers and nodes per layer—to
optimize performance, balancing complexity, accuracy, and
computational cost. Too small a network misses patterns; too
large risks overfitting and inefficiency. It’s a critical step in
designing effective neural models for machine learning tasks.
Factors:
o Layers: More layers capture complex, non-linear patterns.
o Nodes: More per layer model detail, increase computation.
Process:
1. Start: Small network (e.g., 1-2 layers, 10-20 nodes).
2. Evaluate: Check validation error, accuracy.
3. Adjust:
Increase size if underfitting (high error).
Reduce or regularize (e.g., dropout: drop nodes, weight
decay) if overfitting.
4. Cross-validate: Test sizes for robustness.
Example: For digits, try 2 layers (50, 20 nodes), tweak based on error.
Pros: Optimal size boosts accuracy, efficiency.
Cons: Trial-and-error, resource-heavy.