Role of Linear Algebra in ML
What is Linear Algebra?
It is a branch of mathematics that allows us to define and perform
operations on higher-dimensional coordinates and plane
interactions in a concise way. Linear algebra extends ordinary
algebra to an arbitrary number of dimensions and is chiefly
concerned with systems of linear equations.
Properties of Linear Algebra:
Associative Property: It is a property in Mathematics
which states that if a, b and c are mathematical objects
then a + (b + c) = (a + b) + c, in which + is a binary
operation.
Commutative Property: It is a property in Mathematics
which states that if a and b are mathematical objects
then a + b = b + a in which + is a binary operation.
Distributive Property: It is a property in Mathematics
which states that if a, b and c are mathematical objects
then a * (b + c) = (a * b) + (a * c), in which * and + are
binary operations.
Linear Algebra for Machine learning
In the context of Machine Learning, linear algebra is employed to
model and analyze relationships within data. It enables the
representation of data points as vectors and allows for efficient
computation of operations on these vectors. Linear
transformations, matrices, and vector spaces play a significant
role in defining and solving problems in ML.
The utilization of linear algebra in ML extends to solving systems
of linear equations, optimizing models, and comprehending
transformations inherent in algorithms like Principal Component
Analysis (PCA). The integration of linear algebra in ML provides a
powerful and versatile mathematical toolbox, to model, analyze,
and optimize complex relationships within data, thereby
advancing the capabilities of machine learning algorithms.
Importance of Linear Algebra in Machine Learning:
Linear algebra is fundamental to machine learning due to its role
in representing and solving systems of equations, defining
transformations, and optimizing algorithms. It provides a
mathematical framework for understanding and working with
high-dimensional data, making it a cornerstone for various
machine learning models and techniques.
Different ways to represent the Data in Linear Algebra
Linear algebra allows the representation of data using scalars and
vectors, enabling efficient storage and manipulation of large
datasets.
Linear Algebra concepts used in Machine Learning for
Representation of Data:
Scalar and Vector
Scalar:
It is a physical quantity described by a single element;
it has only magnitude and no direction. Basically, a
scalar is just a single number.
Example: 17 and 256
Vector:
It is a geometric object having both magnitude and
direction. It is an ordered array of numbers, always
arranged in a row or a column. A vector has just one index, which
can refer to a particular value within the vector.
V = [e1, e2, e3, e4]
Here V is a vector in which e1, e2, e3 and e4 are its
elements, and V[2] is e3.
Vector Operations
1. Scalar-Vector Multiplication
p = [e1, e2, e3]
The product of a scalar with a vector gives the result below.
When the scalar 2 is multiplied by a vector p, all the
elements of the vector p are multiplied by that scalar. This
operation satisfies the commutative property.
p * 2 = [2 * e1, 2 * e2, 2 * e3]
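A minimal sketch of the same operation in Python with NumPy (the element values below are arbitrary placeholders):

```python
import numpy as np

# Scalar-vector multiplication: every element of the vector is scaled.
p = np.array([1.0, 2.0, 3.0])   # stands in for [e1, e2, e3]
print(2 * p)                    # [2. 4. 6.]
print(p * 2)                    # same result -- the operation is commutative
```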
Matrix
It is an ordered 2D array of numbers, symbols or expressions
arranged in rows and columns. It has two indices: the first index
points to the row, and the second index points to the column. A
matrix can have any number of rows and columns.
M = [[e1, e2], [e3, e4]]
Above, M is a 2D matrix having e1, e2, e3 and e4 as elements,
and M[1][0] is e3.
A matrix having its main-diagonal elements equal to 1 and all other
elements equal to 0 is an Identity matrix.
Example:
[[1, 0], [0, 1]] is the 2 x 2 Identity Matrix, and [[1, 0, 0], [0, 1, 0], [0, 0, 1]] is
the 3 x 3 Identity Matrix.
Tensor
It is an algebraic object representing a linear mapping of
algebraic objects from one set to another. It is actually a 3D
array of numbers with a variable number of axes, arranged on a
regular grid. A tensor has three indices, first index points to the
row, the second index points to the column and the third
index points to the axis.
Here the tensor T = [e1, e2, e3, e4, e5, e6, e7, e8] has 8 elements arranged as a
three-dimensional tensor with dimensions 2 x 2 x 2, such that
T[1][1][1] = e8.
Tensors play a significant role in machine learning, particularly in
deep learning, due to their ability to represent and manage
multi-dimensional data.
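As a quick sketch, such a tensor can be built as a 3D NumPy array (the element values are placeholders):

```python
import numpy as np

# A 2 x 2 x 2 tensor holding eight elements, addressed by three indices.
T = np.arange(1, 9).reshape(2, 2, 2)   # elements e1..e8 represented as 1..8
print(T.shape)        # (2, 2, 2)
print(T[1, 1, 1])     # 8 -- the last element, analogous to e8 above
```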
Linear Algebra Operations:
Machine learning models often involve transformations of data.
Linear algebra provides a concise way to represent and analyze
these transformations using matrices and linear operators.
Matrix Operations:
1. Scalar-Matrix Multiplication
When the scalar a is multiplied by a matrix p, all the
elements of the matrix p are multiplied by that scalar. Scalar-
matrix multiplication is associative, distributive and
commutative.
p = [[e1, e2], [e3, e4]] and a is a scalar.
p * a = [[a * e1, a * e2], [a * e3, a * e4]]
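A minimal NumPy sketch of scalar-matrix multiplication (placeholder values):

```python
import numpy as np

# Scalar-matrix multiplication: every entry of the matrix is scaled.
p = np.array([[1.0, 2.0],
              [3.0, 4.0]])        # stands in for [[e1, e2], [e3, e4]]
a = 3.0
print(a * p)                      # [[ 3.  6.] [ 9. 12.]]
print(np.allclose(a * p, p * a))  # True -- commutative
```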
2. Matrix-Matrix Addition
In order to add matrices, the dimensions (rows and columns) of the matrices
should be equal. Each element of the first matrix is added to
the corresponding element of the other matrix, i.e. the element with the same
row and column indices. Matrix-Matrix addition is associative
and commutative. The addition of
matrix m1 = [[a, b], [c, d]] and m2 = [[p, q], [r, s]] gives the result below.
m1 + m2 = [[a + p, b + q], [c + r, d + s]]
3. Matrix-Matrix Subtraction
Each element of the first matrix is subtracted from the corresponding
element of the other matrix, i.e. the element with the same row and column
indices. As with addition, the matrices must have equal dimensions
in order to be subtracted; unlike addition, however, matrix subtraction
is neither associative nor commutative. The subtraction
of matrix m2 from m1 gives the result below:
m1 - m2 = [[a - p, b - q], [c - r, d - s]]
4. Matrix-Matrix Multiplication
To multiply two matrices, the number of columns of the
first matrix should be equal to the number of rows in the
second matrix. Matrix-Matrix multiplication is associative
and distributive but not commutative. The product of
matrix m1 = [[a, b], [c, d]] and m2 = [[p, q], [r, s]] is given below:
m1 * m2 = [[(a * p) + (b * r), (a * q) + (b * s)], [(c * p) + (d * r), (c * q) + (d * s)]]
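The three matrix-matrix operations above can be sketched in NumPy as follows (placeholder values; `@` denotes the matrix product):

```python
import numpy as np

m1 = np.array([[1.0, 2.0],
               [3.0, 4.0]])            # stands in for [[a, b], [c, d]]
m2 = np.array([[5.0, 6.0],
               [7.0, 8.0]])            # stands in for [[p, q], [r, s]]

print(m1 + m2)                          # element-wise addition
print(m1 - m2)                          # element-wise subtraction
print(m1 @ m2)                          # rows of m1 dotted with columns of m2
print(np.allclose(m1 @ m2, m2 @ m1))    # False -- multiplication is not commutative
```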
Vector-Matrix Operations (Vector-Matrix Multiplication):
The number of columns of the matrix must be equal to the number of
elements of the vector; only then can they be multiplied. Vector-
Matrix multiplication is associative and distributive but not
commutative.
Multiplying a matrix p with a vector q gives the product below:
p = [[e1, e2], [e3, e4], [e5, e6]], q = [a, b]
p * q = [(e1 * a) + (e2 * b), (e3 * a) + (e4 * b), (e5 * a) + (e6 * b)]
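A small NumPy sketch of the same matrix-vector product (placeholder values):

```python
import numpy as np

p = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])      # 3 x 2 matrix, as in [[e1, e2], [e3, e4], [e5, e6]]
q = np.array([10.0, 20.0])      # vector [a, b]

print(p @ q)                    # [ 50. 110. 170.] -- each row of p dotted with q
```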
Transpose
The transpose of a matrix generates a new matrix in which the
rows become columns and columns become rows of the original
matrix. Transposition is vital for tasks like computing correlations
and solving linear equations.
The transpose of an m x n matrix is an n x m matrix.
m = [[a, b], [c, d]], Transpose(m) = [[a, c], [b, d]]
Inverse
The inverse of a matrix is the matrix that, when multiplied with the
original matrix, gives the Identity matrix as the product. If m is a
matrix and n is the inverse matrix of m, then m*n = I, in
which I represents the Identity matrix.
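Both operations are available directly in NumPy; a minimal sketch (the matrix entries are arbitrary):

```python
import numpy as np

m = np.array([[1.0, 2.0],
              [3.0, 4.0]])

print(m.T)                              # transpose: rows become columns
n = np.linalg.inv(m)                    # inverse (exists because det(m) != 0)
print(np.allclose(m @ n, np.eye(2)))    # True -- m times its inverse is the identity
```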
Eigenvalues and Eigenvectors:
Understanding eigenvectors and eigenvalues provides insights
into the behavior of linear transformations and is foundational in
various fields, especially in the analysis of square matrices.
In linear algebra, eigenvectors and eigenvalues are
crucial for diagonalizing matrices. Diagonalization
simplifies matrix operations, making computations more
efficient.
They are used in various applications, such as principal
component analysis (PCA) in data analysis and solving
systems of linear differential equations.
They capture the intrinsic properties of a transformation
or dataset.
Eigenvectors
Eigenvectors are special vectors that remain in the same
direction after a linear transformation. When a matrix A is
multiplied by its corresponding eigenvector (v), the result is a
scaled version of the original vector, i.e.,
Av = λv,
where λ is the eigenvalue and v is the eigenvector.
Eigenvectors are essentially the “directions” that remain
unchanged, only scaled, when a transformation is applied.
Eigenvalues
Eigenvalues λ are the scaling factors by which the eigenvectors
are stretched or compressed during a linear transformation. They
represent how much the eigenvector is “stretched” or “shrunk”
by the linear transformation. Larger eigenvalues indicate a
greater stretching, and smaller eigenvalues indicate
compression.
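A short NumPy sketch that checks the defining relation Av = λv on an arbitrary example matrix:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
v = eigenvectors[:, 0]              # first eigenvector (a column of the result)
lam = eigenvalues[0]                # its eigenvalue

print(np.allclose(A @ v, lam * v))  # True -- A v equals lambda v
```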
Matrix Factorization
Matrix decomposition techniques such as SVD are among the most
widely used tools of linear algebra in machine learning.
Singular Value Decomposition (SVD) is a powerful technique
for decomposing a matrix into three constituent matrices: U, S,
and V^T. These matrices can be used to represent the original matrix in a
more compact and informative way.
SVD has a wide range of applications in machine learning,
including:
Data compression, noise reduction by discarding the
smaller singular values and their corresponding singular
vectors to reduce the storage requirements for data
without significantly affecting its quality.
Dimensionality reduction by keeping only the most
important singular values and their corresponding
singular vectors useful for tasks such as data
visualization and feature extraction.
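One way to sketch the compression idea is a truncated SVD in NumPy, keeping only the k largest singular values (random data used purely for illustration):

```python
import numpy as np

A = np.random.rand(6, 4)                       # example data matrix
U, S, VT = np.linalg.svd(A, full_matrices=False)

k = 2                                          # keep only the k largest singular values
A_approx = U[:, :k] @ np.diag(S[:k]) @ VT[:k]  # rank-k reconstruction of A
print(np.linalg.norm(A - A_approx))            # reconstruction error
```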
Linear Algebra in Machine Learning:
Datasets in machine learning serve as the foundation for model
training and evaluation. These datasets are essentially matrices,
where each row represents a unique observation or data point,
and each column represents a specific feature or variable. The
tabular structure of datasets aligns with the principles of linear
algebra, where matrices are fundamental entities.
Linear algebra provides the tools to manipulate and transform
these datasets efficiently. Operations like matrix multiplication,
addition, and decomposition are crucial for tasks such as feature
engineering, data preprocessing, and computing various
statistical measures. The representation of datasets as matrices
allows for seamless integration of linear algebra techniques into
the machine learning workflow.
One-hot Encoding
In machine learning, dealing with categorical variables
often involves converting them into a numerical format,
and one-hot encoding is a prevalent technique for this
purpose. It transforms categorical variables into binary
vectors, where each category is represented by a
column, and the presence or absence of that category is
indicated by binary values.
The resulting one-hot encoded representation can be
viewed as a sparse matrix, where most elements are
zero, and linear algebra’s vector representation becomes
evident. This compact encoding simplifies the handling of
categorical data in machine learning algorithms,
facilitating efficient computations and reducing the risk
of bias associated with numerical encodings.
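A hand-rolled sketch of one-hot encoding in NumPy (in practice a library helper such as scikit-learn's OneHotEncoder would typically be used; the category names here are made up):

```python
import numpy as np

categories = ["red", "green", "blue", "green"]
labels = sorted(set(categories))                      # ['blue', 'green', 'red']
index = {label: i for i, label in enumerate(labels)}

one_hot = np.zeros((len(categories), len(labels)))
for row, value in enumerate(categories):
    one_hot[row, index[value]] = 1.0                  # mark the category's column

print(one_hot)   # each row is a binary vector with exactly one 1
```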
Linear Regression
Linear regression is a fundamental machine learning
algorithm, and its implementation underscores the
importance of linear algebra in the field. Linear algebra
provides the mathematical foundation for understanding
and solving the equations involved in linear regression.
The use of matrices and vectors simplifies the
formulation and computation, making the
implementation more efficient and scalable.
Understanding linear algebra is essential for grasping the
underlying principles of linear regression and other
machine learning algorithms.
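As one illustration, ordinary least squares can be solved with the normal equation, theta = (X^T X)^(-1) X^T y, using nothing but the matrix operations covered above (synthetic data, sketch only):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.random(100)])  # bias column plus one feature
true_theta = np.array([2.0, 3.0])
y = X @ true_theta + 0.1 * rng.standard_normal(100)   # noisy targets

theta = np.linalg.inv(X.T @ X) @ X.T @ y              # normal equation
print(theta)                                          # close to [2, 3]
```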
Regularization
Regularization methods act as a form of constraint on the
model’s complexity, encouraging simpler models with
smaller coefficients. The elegant integration of linear
algebra concepts into regularization
techniques highlights the synergy between mathematical
principles and practical machine learning challenges.
The regularization term in both L1 and L2 regularization
is essentially a measure of the magnitude or length of
the coefficient vector, a concept directly borrowed from
linear algebra. In the case of L2 regularization, the
penalty term is proportional to the Euclidean norm (L2
norm) of the coefficient vector, emphasizing the role of
linear algebra’s vector norms in regularization.
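For example, ridge (L2-regularized) regression has the closed form theta = (X^T X + alpha * I)^(-1) X^T y, where the alpha term penalizes the L2 norm of the coefficient vector (synthetic data, sketch only):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 5))
y = rng.random(100)

alpha = 1.0                                            # regularization strength
theta = np.linalg.inv(X.T @ X + alpha * np.eye(X.shape[1])) @ X.T @ y
print(np.linalg.norm(theta))                           # the penalty shrinks this L2 norm
```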
Principal Component Analysis (PCA)
Principal Component Analysis (PCA) stands out as a
powerful dimensionality reduction technique widely used
in machine learning and data analysis. Its primary
objective is to transform a high-dimensional dataset into
a lower-dimensional representation while retaining as
much variability as possible.
At its core, PCA involves the computation of eigenvectors
and eigenvalues of the dataset’s covariance matrix—a
task that aligns with linear algebra principles. The
covariance matrix captures the relationships between
different features, and its eigenvectors represent the
principal components, or the directions of maximum
variance.
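A from-scratch sketch of PCA along these lines, using the eigen-decomposition of the covariance matrix (random data for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((200, 4))                          # 200 observations, 4 features

X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)            # 4 x 4 covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # eigh: covariance matrix is symmetric

order = np.argsort(eigenvalues)[::-1]             # sort by decreasing variance
components = eigenvectors[:, order[:2]]           # top two principal components
X_reduced = X_centered @ components               # project onto a 2-D subspace
print(X_reduced.shape)                            # (200, 2)
```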
Images and Photographs
Images and photographs, vital components of computer
vision applications, are inherently structured as matrices
of pixel values. Each pixel’s position corresponds to a
specific element in the matrix, and its intensity is
encoded as the value of that element. Linear algebra
operations play a central role in image processing tasks,
such as scaling, rotating, and filtering.
Transformations applied to images can be represented as
matrix operations, making linear algebra an essential
tool in image manipulation. For instance, a rotation
transformation can be expressed as a matrix
multiplication, showcasing the versatility of linear
algebra in handling image data.
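A minimal sketch of that rotation idea, applied to a single pixel coordinate (a full image pipeline would also resample intensities, which is omitted here):

```python
import numpy as np

theta = np.deg2rad(90)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # 2-D rotation matrix

point = np.array([1.0, 0.0])                      # a pixel location
print(R @ point)                                  # approximately [0, 1] -- rotated 90 degrees
```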
Deep Learning
Deep learning, characterized by artificial neural
networks (ANNs) with multiple layers, relies extensively
on linear algebra structures for both model
representation and training. ANNs process information
through interconnected nodes organized in layers, where
each connection is associated with a weight.
The fundamental operations within a neural network—
matrix multiplications and element-wise activations—are
inherently linear algebraic. The input layer, hidden
layers, and output layer collectively involve manipulating
vectors, matrices, and tensors.
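A single dense layer, sketched as a matrix multiplication followed by an element-wise ReLU activation (shapes and values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((32, 10))           # batch of 32 inputs with 10 features
W = rng.random((10, 5))            # weight matrix of one hidden layer
b = np.zeros(5)                    # bias vector

hidden = np.maximum(0, x @ W + b)  # ReLU(xW + b)
print(hidden.shape)                # (32, 5)
```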
Conclusion:
Linear algebra is the cornerstone of mathematical concepts in
machine learning. A solid grasp of vectors, matrices, and
operations like matrix multiplication is essential for
understanding algorithms, developing models, and navigating
the intricacies of data transformations. Aspiring machine
learning practitioners benefit immensely from a strong
foundation in linear algebra, enhancing their ability to innovate
and contribute to this dynamic field.
Introduction
Entropy is one of the key aspects of Machine Learning. It is a must to know for
anyone who wants to make a mark in Machine Learning and yet it perplexes
many of us. The focus of this article is to understand the working of entropy in
machine learning by exploring the underlying concept of probability theory, how
the formula works, its significance, and why it is important for the Decision Tree
algorithm.
What is Entropy in Machine Learning?
In Machine Learning, entropy measures the level of disorder or uncertainty in
a given dataset or system. It is a metric that quantifies the amount of
information in a dataset, and it is commonly used to evaluate the quality of a
model and its ability to make accurate predictions.
A higher entropy value indicates a more heterogeneous dataset with diverse
classes, while a lower entropy signifies a more pure and homogeneous subset of
data. Decision tree models can use entropy to determine the best splits to make
informed decisions and build accurate predictive models.
The Origin of Entropy
Claude E. Shannon’s 1948 paper on “A Mathematical Theory of Communication ”
marked the birth of information theory. He aimed to mathematically measure the
statistical nature of lost information in phone-line signals and proposed
information entropy to estimate uncertainty reduced by a message. Entropy
measures the amount of surprise and data present in a variable. In information
theory, a random variable’s entropy reflects the average uncertainty level in its
possible outcomes. Events with higher uncertainty have higher entropy.
Information theory finds applications in machine learning models, including
Decision Trees. Understanding entropy helps improve data storage,
communication, and decision-making.
What is a Decision Tree in Machine Learning?
The Decision Tree is a popular supervised learning technique in machine
learning, serving as a hierarchical if-else statement based on feature comparison
operators. It is used for regression and classification problems, finding
relationships between predictor and response variables. The tree structure
includes Root, Branch, and Leaf nodes, representing all possible outcomes
based on specific conditions or rules. The algorithm aims to create homogenous
Leaf nodes containing records of a single type in the outcome variable. However,
sometimes restrictions may lead to mixed outcomes in the Leaf nodes. To build
the tree, the algorithm selects features and thresholds by optimizing a loss
function, aiming for the most accurate predictions. Decision Trees offer
interpretable models and are widely used for various applications, from simple
binary classification to complex decision-making tasks.
Components of a Decision Tree
Root Node: This is where the tree starts. It represents the entire dataset and is
divided into branches based on a chosen feature.
Internal Nodes: These nodes represent the questions or conditions asked about
the features. They lead to further branches or child nodes.
Branches and Edges: These show the possible outcomes of a condition. They
lead to child nodes or leaves.
Leaves (Terminal Nodes): These are the end points of the tree. They represent
the final decision or prediction.
Cost Function in a Decision Tree
The decision tree algorithm builds the tree from the dataset via the
optimization of a cost function. In the case of classification problems, the cost
or the loss function is a measure of impurity in the target column of the nodes
descending from a root node.
The impurity is nothing but the surprise or the uncertainty available in the
information that we had discussed above. At a given node, the impurity is a
measure of a mixture of different classes or in our case a mix of different car
types in the Y variable. Hence, the impurity is also referred to as heterogeneity
present in the information or at every node.
The goal is to minimize this impurity as much as possible at the leaf (or the end-
outcome) nodes. It means the objective function is to decrease the impurity (i.e.
uncertainty or surprise) of the target column or in other words, to increase the
homogeneity of the Y variable at every split of the given data.
To understand the objective function, we need to understand how the impurity or
the heterogeneity of the target column is computed. There are two metrics to
estimate this impurity: Entropy and Gini. In addition to this, to answer the
previous question on how the decision tree chooses the attributes, there are
various splitting methods including Chi-square, Gini index, and Entropy; however,
the focus here is on Entropy, and we will further explore how it helps to create the
tree.
Example of Cost Function in a Decision Tree
Now, it’s been a while since I have been talking about a lot of theory stuff. Let’s
do one thing: I offer you coffee and we perform an experiment. I have a box full
of an equal number of coffee pouches of two flavors: Caramel Latte and the
regular, Cappuccino. You may choose either of the flavors but with eyes closed.
The fun part is: in case you get the caramel latte pouch then you are free to stop
reading this article 🙂 or if you get the cappuccino pouch then you would have to
read the article till the end 🙂
This predicament where you would have to decide and this decision of yours that
can lead to results with equal probability is nothing else but said to be the state
of maximum uncertainty. In case, I had only caramel latte coffee pouches or
cappuccino pouches then we know what the outcome would have been and
hence the uncertainty (or surprise) will be zero.
The probability of getting each outcome of a caramel latte pouch or
cappuccino pouch is:
P(Coffee pouch == Caramel Latte) = 0.50
P(Coffee pouch == Cappuccino) = 1 – 0.50 = 0.50
When we have only one result either caramel latte or cappuccino pouch, then in
the absence of uncertainty, the probability of the event is:
P(Coffee pouch == Caramel Latte) = 1
P(Coffee pouch == Cappuccino) = 1 – 1 = 0
There is a relationship between heterogeneity and uncertainty; the more
heterogeneous the event the more uncertainty. On the other hand, the less
heterogeneous, or so to say, the more homogeneous the event, the lesser is the
uncertainty. The uncertainty is expressed as Gini or Entropy.
How Does Entropy Actually Work?
Claude E. Shannon had expressed this relationship between the probability and
the heterogeneity or impurity in the mathematical form with the help of the
following equation:
H(X) = – Σ (pi * log2(pi))
The uncertainty or the impurity is expressed using the log to base 2 of the
probability of a category (pi). The index i runs over the possible
categories; here there are two categories, as our problem is a binary classification.
This equation is graphically depicted by a symmetric curve as shown below. On
the x-axis is the probability of the event and the y-axis indicates the
heterogeneity or the impurity denoted by H(X).
Example of Entropy in Machine Learning
We will explore how the curve works in detail and then shall illustrate the
calculation of entropy for our coffee flavor experiment.
The term log2(pi) has a very useful property: when there are only two outcomes
and the probability of the event pi is either 1 or 0.50, then
log2(pi) takes the following values (ignoring the negative sign):
pi | log2(pi) (ignoring the sign)
1 | 0
0.50 | 1
Now, plotting these values of the probability against log2(pi), the catch is that
when the probability pi approaches 0, the magnitude of log2(pi) grows
towards infinity and the curve is no longer bounded.
The entropy or the impurity measure should only take values from 0 to 1, since the
probability ranges from 0 to 1, and hence we do not want the above situation. So,
to bring the curve and the value of log2(pi) back towards zero, we multiply log2(pi)
by the probability, i.e. by pi itself.
Therefore, the expression becomes (pi * log2(pi)). Because log2(pi) returns a negative
value for probabilities below 1, we multiply the result by a
negative sign to remove this effect, and the equation finally becomes:
H(X) = – Σ (pi * log2(pi))
Now, this expression can be used to show how the uncertainty changes
depending on the likelihood of an event.
The resulting curve is bounded between 0 and 1: it peaks at 1 when the probability
is 0.5 and falls to 0 when the probability is 0 or 1.
This scale of entropy from 0 to 1 is for binary classification problems. For a
multiple classification problem, the above relationship holds, however, the scale
may change.
Calculation of Entropy in Python
We shall estimate the entropy for three different scenarios. The event Y is getting
a caramel latte coffee pouch. The heterogeneity or the impurity formula for two
different classes is as follows:
H(X) = – [(pi * log2(pi)) + (qi * log2(qi))]
where,
pi = Probability of Y = 1 i.e. probability of success of the event
qi = Probability of Y = 0 i.e. probability of failure of the event
Case 1
Coffee flavor | Quantity of Pouches | Probability
Caramel Latte | 7 | 0.7
Cappuccino | 3 | 0.3
Total | 10 | 1
H(X) = – [(0.70 * log2(0.70)) + (0.30 * log2(0.30))] = 0.88129089
This value 0.88129089 is the measurement of uncertainty when given the box full
of coffee pouches and asked to pull out one of the pouches when there are
seven pouches of caramel latte flavor and three pouches of cappuccino flavor.
Case 2
Coffee flavor | Quantity of Pouches | Probability
Caramel Latte | 5 | 0.5
Cappuccino | 5 | 0.5
Total | 10 | 1
H(X) = – [(0.50 * log2(0.50)) + (0.50 * log2(0.50))] = 1
Case 3
Coffee flavor | Quantity of Pouches | Probability
Caramel Latte | 10 | 1
Cappuccino | 0 | 0
Total | 10 | 1
H(X) = – [(1.0 * log2(1.0)) + (0 * log2(0))] = 0 (taking 0 * log2(0) to be 0)
In scenarios 2 and 3, we can see that the entropy is 1 and 0, respectively. In
scenario 3, when we have only one flavor of coffee pouch, caramel latte, and
have removed all the pouches of cappuccino flavor, the uncertainty or the
surprise is completely removed and the entropy is zero. We
can then conclude that the outcome is fully determined and carries no surprise.
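The same numbers can be reproduced with a short Python helper (a minimal sketch using NumPy; the 0 * log2(0) term is treated as 0 by convention):

```python
import numpy as np

def entropy(probabilities):
    """Shannon entropy H(X) = -sum(p * log2(p)), treating 0 * log2(0) as 0."""
    probabilities = np.asarray(probabilities, dtype=float)
    nonzero = probabilities[probabilities > 0]
    return float(-np.sum(nonzero * np.log2(nonzero)))

print(entropy([0.7, 0.3]))   # ~0.8813  (Case 1)
print(entropy([0.5, 0.5]))   # 1.0      (Case 2)
print(entropy([1.0, 0.0]))   # 0.0      (Case 3)
```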
Use of Entropy in Decision Tree
As we have seen above, in decision trees the cost function is to minimize the
heterogeneity in the leaf nodes. Therefore, the aim is to find out the attributes
and within those attributes the threshold such that when the data is split into two,
we achieve the maximum possible homogeneity or in other words, results in the
maximum drop in the entropy within the two tree levels.
At the root level, the entropy of the target column is estimated via the formula
proposed by Shannon. After a split, the entropy computed for the
target column is a weighted entropy: the entropy of each child node weighted
by the fraction of the records that fall into that node.
The more the entropy decreases, the more information is gained.
Information Gain is the reduction in entropy achieved by a split: the entropy of
the parent node minus the weighted entropy of the child nodes. For the three
scenarios above, where the parent entropy is 1, it works out to 1 – entropy. The
entropy and information gain for the three scenarios are as follows:
Case | Entropy | Information Gain
Case 1 | 0.88129089 | 0.11870911
Case 2 | 1 | 0
Case 3 | 0 | 1
Estimation of Entropy and Information Gain at Node Level
We have the following tree with a total of four values at the root node that is split
into the first level having one value in one branch (say, Branch 1) and three
values in the other branch (Branch 2). The entropy at the root node is 1.
Now, to compute the entropy of the child node that holds three values, the class
proportions within that node are ⅓ and ⅔, and Shannon's entropy
formula is applied to them. As we saw above, the entropy of the child node
holding a single value is zero, because there is only one value in that node,
meaning there is no uncertainty and hence no heterogeneity present.
H(X) = – [(1/3 * log2(1/3)) + (2/3 * log2(2/3))] = 0.9184
The information gain for the above tree is the reduction in the weighted average
of the entropy.
Information Gain = 1 – (3/4 * 0.9184) – (1/4 * 0) = 0.3112
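The same node-level calculation can be sketched in Python, reusing the entropy helper from above:

```python
import numpy as np

def entropy(probabilities):
    probabilities = np.asarray(probabilities, dtype=float)
    nonzero = probabilities[probabilities > 0]
    return float(-np.sum(nonzero * np.log2(nonzero)))

# Root node: 2 records of each class -> entropy 1.
parent_entropy = entropy([0.5, 0.5])

# One branch holds 1 of the 4 records (pure), the other holds 3 records mixed 1:2.
weighted_children = (1 / 4) * entropy([1.0]) + (3 / 4) * entropy([1 / 3, 2 / 3])

information_gain = parent_entropy - weighted_children
print(round(information_gain, 4))   # ~0.3113 (0.3112 above comes from rounding 0.9184 first)
```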
Conclusion
Information Entropy or Shannon’s entropy quantifies the amount of uncertainty
(or surprise) involved in the value of a random variable or the outcome of a
random process. Its significance in the decision tree is that it allows us to
estimate the impurity or heterogeneity of the target variable. Subsequently, to
achieve the maximum level of homogeneity in the response variable, the child
nodes are created in such a way that the total entropy of these child nodes must
be less than the entropy of the parent node.
Entropy plays a fundamental role in machine learning, enabling us to measure
uncertainty and information content in data. Understanding entropy is crucial for
building accurate decision trees and improving various learning models.