
MACHINE LEARNING

Unit – 1 Fundamentals of Machine Learning

What is Machine Learning?

Machine learning is a field of computer science and artificial intelligence (AI) that involves
developing algorithms and statistical models that enable computer systems to
automatically learn and improve from experience without being explicitly programmed.
In other words, machine learning is a subset of AI that focuses on teaching machines to
recognize patterns and make predictions or decisions based on data.

The machine learning process typically involves several steps, including data collection
and preparation, algorithm selection, model training and evaluation, and deployment.
During training, the algorithm is fed input data and learns to recognize patterns in
the data that can be used to make predictions or decisions. Once the model has been
trained, it can be used to make predictions on new, unseen data.

A Machine Learning system learns from historical data, builds prediction models,
and whenever it receives new data, predicts the output for it. The accuracy of the
predicted output depends on the amount of data: a larger amount of data helps build a
better model which predicts the output more accurately.

Suppose we have a complex problem where we need to perform some predictions.
Instead of writing code for it, we just feed the data to generic algorithms; with the
help of these algorithms, the machine builds the logic from the data and predicts
the output. Machine learning has changed our way of thinking about such problems. The
block diagram below explains the working of a Machine Learning algorithm:

Features of Machine Learning:

● Machine learning uses data to detect various patterns in a given dataset.
● It can learn from past data and improve automatically.
● It is a data-driven technology.
● Machine learning is similar to data mining in that it also deals with huge
  amounts of data.

‭Need for Machine Learning‬

The need for machine learning is increasing day by day. The reason is that machine
learning is capable of doing tasks that are too complex for a person to implement
directly. As humans, we have limitations: we cannot manually process huge amounts
of data, so we need computer systems, and here machine learning comes in to make
things easy for us.

We can train machine learning algorithms by providing them with a huge amount of data
and letting them explore the data, construct models, and predict the required output
automatically. The performance of a machine learning algorithm depends on the
amount of data, and it can be measured using a cost function. With the help of machine
learning, we can save both time and money.

The importance of machine learning can be easily understood by its use cases.
Currently, machine learning is used in self-driving cars, cyber fraud detection, face
recognition, and friend suggestions by Facebook, among others. Top companies such as
Netflix and Amazon have built machine learning models that use vast amounts of
data to analyze user interests and recommend products accordingly.

Following are some key points that show the importance of Machine Learning:

● Rapid increase in the production of data
● Solving complex problems that are difficult for a human
● Decision-making in various sectors, including finance
● Finding hidden patterns and extracting useful information from data

‭Why use Machine Learning?‬

1. Improved decision-making: Machine learning can help automate decision-making
   processes by analyzing vast amounts of data, identifying patterns, and making
   predictions based on that data.
2. Increased efficiency: Machine learning algorithms can automate many tasks that
   would otherwise require human intervention, reducing the time and effort
   required to complete those tasks.
3. Personalization: Machine learning can help personalize experiences for users by
   analyzing their behavior and preferences, and making recommendations or
   adjustments based on that analysis.
4. Scalability: Machine learning algorithms can process large amounts of data
   quickly and efficiently, making it easier to scale operations and handle large
   volumes of data.
5. Improved accuracy: Machine learning algorithms can often make more accurate
   predictions or decisions than humans, especially when it comes to complex
   data sets.
6. Discovering insights: Machine learning can help uncover hidden patterns or
   insights in data that may not be immediately apparent to humans, leading to new
   discoveries.

Overall, machine learning can help organizations make more informed decisions,
increase efficiency, improve customer experiences, and achieve better business
outcomes.

‭Types of Machine Learning‬

1. Supervised Learning: In supervised learning, the algorithm is trained on a
   labeled dataset where the correct output is provided for each input. The
   algorithm learns to recognize patterns in the input data and map them to the
   correct output. The goal is to train the algorithm to predict the correct output
   for new, unseen input data. Examples of supervised learning include image
   classification, speech recognition, and predicting housing prices based on
   various features.
2. Unsupervised Learning: In unsupervised learning, the algorithm is trained on
   an unlabeled dataset where the correct output is not provided. The algorithm
   learns to identify patterns and relationships in the input data without any
   predefined labels. The goal is to uncover hidden structures in the data and
   group similar data points together. Examples of unsupervised learning include
   clustering, anomaly detection, and dimensionality reduction.
3. Reinforcement Learning: In reinforcement learning, the algorithm learns by
   interacting with an environment and receiving feedback in the form of rewards
   or penalties. The algorithm learns to make decisions that maximize the reward
   over time. The goal is to learn a policy that maximizes the expected cumulative
   reward. Examples of reinforcement learning include game playing, robotics, and
   autonomous driving.

Each type of machine learning has its own unique set of algorithms and techniques, and
the choice of which type to use depends on the specific problem and data available.

‭Supervised Machine Learning‬

Supervised learning is the type of machine learning in which machines are trained using
well "labeled" training data, and on the basis of that data, machines predict the output.
Labeled data means input data that is already tagged with the correct output.

‭How Supervised Learning Works?‬

In supervised learning, the training data provided to the machine works as the
supervisor, teaching the machine to predict the output correctly. It applies the same
concept as a student learning under the supervision of a teacher.

Supervised learning is a process of providing input data as well as correct output data
to the machine learning model. The aim of a supervised learning algorithm is to find a
mapping function that maps the input variable (x) to the output variable (y).

In the real world, supervised learning can be used for risk assessment, image
classification, fraud detection, spam filtering, etc.

In supervised learning, models are trained using a labeled dataset, where the model
learns about each type of data. Once the training process is completed, the model is
tested on held-out test data (a subset of the dataset that was not used for training),
and then it predicts the output.

The working of supervised learning can be easily understood by the example and
diagram below:

Suppose we have a dataset of different types of shapes, including squares,
rectangles, triangles, and polygons. The first step is to train the model for
each shape.

○ If the given shape has four sides, and all the sides are equal, then it will be
  labeled as a Square.

○ If the given shape has three sides, then it will be labeled as a Triangle.

○ If the given shape has six equal sides, then it will be labeled as a Hexagon.

Now, after training, we test our model using the test set, and the task of the model
is to identify the shape.

The machine is already trained on all types of shapes, and when it finds a new shape,
it classifies the shape on the basis of its number of sides and predicts the output.

Steps Involved in Supervised Learning:

○ First, determine the type of training dataset.

○ Collect/gather the labeled training data.

○ Split the dataset into a training set, a test set, and a validation set.

○ Determine the input features of the training dataset, which should carry enough
  information for the model to accurately predict the output.

○ Determine a suitable algorithm for the model, such as a support vector machine,
  a decision tree, etc.

○ Execute the algorithm on the training set. Sometimes we need a validation set
  to tune control parameters; the validation set is a held-out subset of the
  training data.

○ Evaluate the accuracy of the model using the test set. If the model predicts
  the correct outputs, the model is accurate.
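The steps above can be sketched end to end with a toy 1-nearest-neighbour classifier written from scratch; the shape features and labels below are hypothetical, invented purely for illustration:

```python
# Minimal sketch of the supervised learning steps: split labeled data,
# "train" (here: memorize) a 1-nearest-neighbour model, evaluate on the
# held-out test set. No external libraries needed.

def split(data, labels, test_ratio=0.25):
    """Hold out the last test_ratio fraction of the data as the test set."""
    n_test = max(1, int(len(data) * test_ratio))
    return data[:-n_test], labels[:-n_test], data[-n_test:], labels[-n_test:]

def predict_1nn(train_x, train_y, x):
    """Label x with the label of its closest training point."""
    dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in train_x]
    return train_y[dists.index(min(dists))]

# Feature vectors: (number of sides, all-sides-equal as 0/1)
data   = [(4, 1), (3, 0), (6, 1), (4, 1), (3, 1), (6, 1), (4, 1), (3, 0)]
labels = ["square", "triangle", "hexagon", "square",
          "triangle", "hexagon", "square", "triangle"]

train_x, train_y, test_x, test_y = split(data, labels)
accuracy = sum(predict_1nn(train_x, train_y, x) == y
               for x, y in zip(test_x, test_y)) / len(test_x)
print(accuracy)
```

On this tiny, clean dataset the model labels both held-out shapes correctly; real projects would use a library model and a randomized split.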

Types of Supervised Machine Learning Algorithms:

Supervised learning can be further divided into two types of problems:

1. Regression

Regression algorithms are used when there is a relationship between the input variable
and the output variable. Regression is used for the prediction of continuous variables,
such as weather forecasting and market trends. Below are some popular regression
algorithms which come under supervised learning:

‭○‬ ‭Linear Regression‬

‭○‬ ‭Regression Trees‬


‭○‬ ‭Non-Linear Regression‬

‭○‬ ‭Bayesian Linear Regression‬

‭○‬ ‭Polynomial Regression‬
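As a minimal sketch of the first algorithm in the list, here is one-feature linear regression fitted by the closed-form least-squares solution; the data points are hypothetical, generated from y = 2x + 1:

```python
# Minimal from-scratch linear regression: fit y = w*x + b by ordinary
# least squares on a single feature, using only the standard library.

def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Closed-form OLS solution for one feature.
    w = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - w * mean_x
    return w, b

# Hypothetical data generated from y = 2x + 1
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
w, b = fit_line(xs, ys)
print(w, b)   # recovers slope 2 and intercept 1
```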

2. Classification

Classification algorithms are used when the output variable is categorical, meaning
there are distinct classes such as Yes/No, Male/Female, or True/False, as in spam
filtering. Below are some popular classification algorithms which come under
supervised learning:

‭○‬ ‭Random Forest‬

‭○‬ ‭Decision Trees‬

‭○‬ ‭Logistic Regression‬

‭○‬ ‭Support vector Machines‬
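As a sketch of the classification setting, here is a minimal logistic regression trained by gradient descent on one hypothetical feature; the data and hyperparameters are invented for illustration:

```python
import math

# Minimal from-scratch binary classification: logistic regression on
# one feature, trained by stochastic gradient descent on the log-loss.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(xs, ys, lr=0.5, epochs=2000):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)
            w -= lr * (p - y) * x   # gradient of log-loss w.r.t. w
            b -= lr * (p - y)       # gradient of log-loss w.r.t. b
    return w, b

# One feature, two classes: small values -> 0, large values -> 1
xs = [0.0, 0.5, 1.0, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = train(xs, ys)
preds = [int(sigmoid(w * x + b) >= 0.5) for x in xs]
print(preds)
```

Because this toy data is linearly separable, the learned decision boundary classifies all six training points correctly.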

Advantages of Supervised Learning:

○ With the help of supervised learning, the model can predict the output on the
  basis of prior experience.

○ In supervised learning, we can have an exact idea about the classes of objects.

○ Supervised learning models help us solve various real-world problems such as
  fraud detection and spam filtering.

Disadvantages of Supervised Learning:

○ Supervised learning models are not suitable for handling very complex tasks.

○ Supervised learning cannot predict the correct output if the test data differs
  substantially from the training data.

○ Training requires a lot of computation time.

○ In supervised learning, we need sufficient knowledge about the classes of objects.

‭Unsupervised Machine Learning‬

In the previous topic, we learned about supervised machine learning, in which models
are trained using labeled data. But there may be many cases in which we do not have
labeled data and need to find the hidden patterns in the given dataset. To solve such
cases, we need unsupervised learning techniques.

‭What is Unsupervised Learning?‬

As the name suggests, unsupervised learning is a machine learning technique in which
models are not supervised using a training dataset. Instead, the models themselves find
hidden patterns and insights in the given data. It can be compared to the learning which
takes place in the human brain while learning new things. It can be defined as:

Unsupervised learning is a type of machine learning in which models are trained
using an unlabeled dataset and are allowed to act on that data without any supervision.

Unsupervised learning cannot be directly applied to a regression or classification
problem because, unlike supervised learning, we have the input data but no
corresponding output data. The goal of unsupervised learning is to find the underlying
structure of the dataset, group the data according to similarities, and represent the
dataset in a compressed format.

Example: Suppose the unsupervised learning algorithm is given an input dataset
containing images of different types of cats and dogs. The algorithm is never trained
on the given dataset, which means it has no idea about the features of
the dataset. The task of the unsupervised learning algorithm is to identify the image
features on its own. It will perform this task by clustering the image dataset into
groups according to the similarities between images.
‭Why use Unsupervised Learning?‬

Below are some main reasons which describe the importance of unsupervised
learning:

○ Unsupervised learning is helpful for finding useful insights in data.

○ Unsupervised learning is much like how humans learn to think from their own
  experiences, which brings it closer to real AI.

○ Unsupervised learning works on unlabeled and uncategorized data, which makes
  it all the more important.

○ In the real world, we do not always have input data with corresponding output,
  so to solve such cases we need unsupervised learning.

‭Working of Unsupervised Learning‬

Working of unsupervised learning can be understood by the below diagram:

Here, we have taken unlabeled input data, which means it is not categorized and
corresponding outputs are not given. This unlabeled input data is fed to the
machine learning model in order to train it. First, the model interprets the raw data
to find the hidden patterns, and then a suitable algorithm such as k-means clustering
or hierarchical clustering is applied.

Once a suitable algorithm is applied, the algorithm divides the data objects into groups
according to the similarities and differences between the objects.

‭Types of Unsupervised Learning Algorithm:‬

The unsupervised learning algorithm can be further categorized into two types of
problems:
○ Clustering: Clustering is a method of grouping objects into clusters such that
  objects with the most similarities remain in one group and have few or no
  similarities with the objects of another group. Cluster analysis finds the
  commonalities between data objects and categorizes them according to the
  presence or absence of those commonalities.

○ Association: An association rule is an unsupervised learning method used for
  finding relationships between variables in a large database. It determines the
  sets of items that occur together in the dataset. Association rules make
  marketing strategy more effective; for example, people who buy item X (say,
  bread) also tend to purchase item Y (butter or jam). A typical application of
  association rules is Market Basket Analysis.
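The two quantities behind association rules, support and confidence, can be sketched in a few lines; the transactions below are hypothetical market-basket data invented for illustration:

```python
# Minimal sketch of association-rule metrics: support and confidence
# for the rule "bread -> butter" over a tiny transaction database.

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent) estimated from the transactions."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "butter"}))       # bread and butter co-occur in 2 of 4 baskets
print(confidence({"bread"}, {"butter"}))  # of 3 bread baskets, 2 also contain butter
```

Algorithms such as Apriori essentially search for all itemsets whose support exceeds a threshold, then derive high-confidence rules from them.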

‭Unsupervised Learning algorithms:‬

‭Below is the list of some popular unsupervised learning algorithms:‬

‭○‬ ‭K-means clustering‬

‭○‬ ‭KNN (k-nearest neighbors)‬

○ Hierarchical clustering


‭○‬ ‭Anomaly detection‬

‭○‬ ‭Neural Networks‬

○ Principal Component Analysis

‭○‬ ‭Independent Component Analysis‬

‭○‬ ‭Apriori algorithm‬

‭○‬ ‭Singular value decomposition‬
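The first algorithm in the list, k-means clustering, can be sketched from scratch on 1-D points; the data and the deterministic "first k points" initialization are simplifications for illustration (a real implementation, e.g. scikit-learn's KMeans, uses smarter initialization):

```python
# Minimal from-scratch k-means sketch on 1-D points.

def kmeans(points, k, iters=10):
    centroids = points[:k]                      # naive deterministic init
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Two obvious groups, around 1 and around 10 (hypothetical data)
points = [1.0, 9.0, 1.2, 10.0, 0.8, 11.0]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))   # centroids settle near the two group centers
```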

‭Advantages of Unsupervised Learning‬

○ Unsupervised learning can be used for more complex tasks than supervised
  learning because, in unsupervised learning, we don't need labeled input data.

○ Unsupervised learning is often preferable because it is easier to obtain
  unlabeled data than labeled data.

‭Disadvantages of Unsupervised Learning‬

‭○‬ ‭Unsupervised learning is intrinsically more difficult than supervised learning as it‬
‭does not have corresponding output.‬

‭○‬ ‭The result of the unsupervised learning algorithm might be less accurate as input‬
‭data is not labeled, and algorithms do not know the exact output in advance.‬

‭Reinforcement Learning:‬

Reinforcement learning is an area of machine learning. It is about taking suitable
actions to maximize reward in a particular situation. It is employed by various software
systems and machines to find the best possible behavior or path to take in a specific
situation. Reinforcement learning differs from supervised learning in that, in
supervised learning, the training data carries the answer key, so the model is trained
with the correct answers; in reinforcement learning there is no answer key, and the
reinforcement agent decides what to do to perform the given task. In the absence of a
training dataset, it is bound to learn from its experience.

Reinforcement Learning (RL) is the science of decision-making. It is about learning the
optimal behavior in an environment to obtain maximum reward. In RL, data is
accumulated by the learning system itself through trial and error; labeled data is not
part of the input as it would be in supervised or unsupervised machine learning.

‭Reinforcement‬ ‭learning‬ ‭uses‬ ‭algorithms‬ ‭that‬ ‭learn‬ ‭from‬ ‭outcomes‬ ‭and‬ ‭decide‬ ‭which‬
‭action‬ ‭to‬ ‭take‬ ‭next.‬ ‭After‬ ‭each‬ ‭action,‬ ‭the‬ ‭algorithm‬ ‭receives‬ ‭feedback‬ ‭that‬ ‭helps‬ ‭it‬
‭determine‬ ‭whether‬ ‭the‬ ‭choice‬ ‭it‬ ‭made‬ ‭was‬ ‭correct,‬ ‭neutral‬ ‭or‬ ‭incorrect.‬ ‭It‬ ‭is‬ ‭a‬ ‭good‬
‭technique‬ ‭to‬ ‭use‬ ‭for‬ ‭automated‬ ‭systems‬ ‭that‬ ‭have‬ ‭to‬ ‭make‬ ‭a‬ ‭lot‬ ‭of‬ ‭small‬ ‭decisions‬
‭without human guidance.‬

Reinforcement learning is an autonomous, self-teaching system that essentially learns
by trial and error. It performs actions with the aim of maximizing rewards; in other
words, it is learning by doing in order to achieve the best outcomes.

Example:

The problem is as follows: we have an agent and a reward, with many hurdles in
between. The agent is supposed to find the best possible path to reach the reward. The
following example illustrates this more concretely.
The above image shows a robot, a diamond, and fire. The goal of the robot is to get the
reward, the diamond, while avoiding the hurdles, the fire. The robot learns by
trying all the possible paths and then choosing the path which gives it the reward with
the fewest hurdles. Each right step gives the robot a reward and each wrong step
subtracts from its reward. The total reward is calculated when it reaches the
final reward, the diamond.

‭Main points in Reinforcement learning –‬

● Input: The input should be an initial state from which the model will start.
● Output: There are many possible outputs, as there are a variety of solutions to
  a particular problem.
● Training: The training is based upon the input; the model will return a state,
  and the user will decide whether to reward or punish the model based on its output.
● The model continues to learn.
● The best solution is decided based on the maximum reward.
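This loop can be sketched with tabular Q-learning on a tiny toy environment; the 1-D corridor, reward scheme, and hyperparameters below are invented for illustration:

```python
import random

# Minimal tabular Q-learning sketch: a corridor of 5 cells. The agent
# starts in cell 0; cell 4 holds the reward (the "diamond").
# Actions: 0 = step left, 1 = step right.

random.seed(0)
N, GOAL = 5, 4
Q = [[0.0, 0.0] for _ in range(N)]    # Q[state][action]
alpha, gamma, eps = 0.5, 0.9, 0.2     # learning rate, discount, exploration

for _ in range(500):                  # training episodes
    s = 0
    while s != GOAL:
        # Epsilon-greedy action selection.
        a = random.randrange(2) if random.random() < eps else \
            (0 if Q[s][0] > Q[s][1] else 1)
        s2 = max(0, s - 1) if a == 0 else min(N - 1, s + 1)
        r = 1.0 if s2 == GOAL else 0.0            # reward only at the goal
        # Q-learning update toward reward plus discounted future value.
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

policy = [0 if q[0] > q[1] else 1 for q in Q[:GOAL]]
print(policy)   # the learned policy heads right, toward the reward
```

After training, every non-goal state prefers the "right" action, because the discounted value of moving toward the diamond always exceeds that of moving away.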

‭Types of Reinforcement:‬

‭There are two types of Reinforcement:‬


1. Positive: Positive reinforcement occurs when an event, occurring as a result of
   a particular behavior, increases the strength and the frequency of that behavior.
   In other words, it has a positive effect on behavior.
   Advantages of positive reinforcement:
   ● Maximizes performance
   ● Sustains change for a long period of time
   Drawback:
   ● Too much reinforcement can lead to an overload of states, which can
     diminish the results
2. Negative: Negative reinforcement is the strengthening of a behavior because a
   negative condition is stopped or avoided.
   Advantages of negative reinforcement:
   ● Increases behavior
   ● Provides a minimum standard of performance
   Drawback:
   ● It only provides enough to meet the minimum behavior

Elements of Reinforcement Learning

Reinforcement learning has four main elements:

1. Policy
2. Reward function
3. Value function
4. Model of the environment

Policy: A policy defines the learning agent's way of behaving at a given time. It is a
mapping from perceived states of the environment to the actions to be taken when in
those states.

Reward function: A reward function is used to define the goal in a reinforcement
learning problem. It provides a numerical score based on the state of the environment.

Value function: Value functions specify what is good in the long run. The value of a
state is the total amount of reward an agent can expect to accumulate over the future,
starting from that state.

Model of the environment: A model mimics the behavior of the environment and is
used for planning.

Applications of Reinforcement Learning

1. Robotics: Robots with pre-programmed behavior are useful in structured
environments, such as the assembly line of an automobile manufacturing plant, where
the task is repetitive in nature.

2. A master chess player makes a move. The choice is informed by planning,
anticipating possible replies and counter-replies.

3. An adaptive controller adjusts parameters of a petroleum refinery's operation in
real time.

‭RL can be used in large environments in the following situations:‬

‭1.‬ ‭A model of the environment is known, but an analytic solution is not available;‬
‭2.‬ ‭Only a simulation model of the environment is given (the subject of‬
‭simulation-based optimization)‬
‭3.‬ ‭The only way to collect information about the environment is to interact with it.‬

‭Advantages of Reinforcement Learning‬

‭1. Reinforcement learning can be used to solve very complex problems that cannot be‬
‭solved by conventional techniques.‬

2. The model can correct errors that occurred during the training process.

3. In RL, training data is obtained via the direct interaction of the agent with the
environment.

‭4. Reinforcement learning can handle environments that are non-deterministic, meaning‬
‭that the outcomes of actions are not always predictable. This is useful in real-world‬
‭applications where the environment may change over time or is uncertain.‬

‭5. Reinforcement learning can be used to solve a wide range of problems, including‬
‭those that involve decision making, control, and optimization.‬

‭6. Reinforcement learning is a flexible approach that can be combined with other‬
‭machine learning techniques, such as deep learning, to improve performance.‬

‭Disadvantages of Reinforcement Learning‬

1. Reinforcement learning is not preferable for solving simple problems.

2. Reinforcement learning needs a lot of data and a lot of computation.

‭3. Reinforcement learning is highly dependent on the quality of the reward function. If‬
‭the reward function is poorly designed, the agent may not learn the desired behavior.‬

‭4. Reinforcement learning can be difficult to debug and interpret. It is not always clear‬
‭why the agent is behaving in a certain way, which can make it difficult to diagnose and‬
‭fix problems.‬

‭Challenges of Machine Learning:‬

Machine learning has revolutionized many fields and brought significant benefits, but it
also faces several challenges. Here are some of the main challenges of machine
learning:

1. Data quality: Machine learning algorithms rely on high-quality data to make
   accurate predictions. Poor-quality data can lead to biased, inaccurate or
   unreliable results, so ensuring data quality is crucial.
2. Data quantity: Machine learning algorithms require large amounts of data to be
   trained effectively. In some cases, collecting enough data can be challenging, or
   the available data may not be representative of the entire population.
3. Overfitting: Overfitting occurs when a model is trained too well on a particular
   dataset, resulting in poor performance on new, unseen data. This can be caused
   by using overly complex models or training with insufficient data.
4. Interpretability: Some machine learning models, especially deep learning models,
   can be difficult to interpret, making it challenging to understand how they
   arrive at their predictions.
5. Algorithm selection: There are numerous algorithms available for different types
   of problems, and choosing the most appropriate algorithm can be difficult.
6. Scalability: Some machine learning algorithms can be computationally
   expensive, making it difficult to scale them to handle large volumes of data or
   real-time processing.
7. Ethical considerations: Machine learning can be used to make decisions that
   impact people's lives, so ethical considerations around bias, fairness, and
   privacy are important.

Overcoming these challenges requires careful consideration of the problem, data, and
algorithms involved, as well as ongoing research and development in the field.

‭Testing and Validation:‬

Testing and validation are crucial steps in the machine learning workflow to ensure that
the trained model performs well on new, unseen data. Here's an overview of testing and
validation in machine learning:

1. Training and testing data: The dataset is split into training and testing data.
   The model is trained on the training data, and the performance of the trained
   model is evaluated on the testing data.
2. Cross-validation: Cross-validation is a technique used to evaluate the
   performance of the model by dividing the dataset into k folds, where k is the
   number of subsets of data. The model is trained on k-1 folds and validated on
   the remaining fold, and this process is repeated k times, with each fold used
   once for validation. The results are then averaged to give a final estimate of
   the model's performance.
3. Overfitting and underfitting: Overfitting occurs when the model is too complex
   and learns the noise in the training data, resulting in poor performance on new
   data. Underfitting occurs when the model is too simple and is unable to capture
   the underlying patterns in the data, resulting in poor performance on both
   training and new data. Regularization techniques can be used to address
   overfitting, while increasing the complexity of the model or collecting more
   data can address underfitting.
4. Hyperparameter tuning: Hyperparameters are parameters that are set before
   training the model and control the learning process. Examples of
   hyperparameters include the learning rate, regularization strength, and number
   of hidden layers in a neural network. Hyperparameter tuning involves selecting
   the optimal hyperparameters to maximize the performance of the model.
5. Evaluation metrics: Evaluation metrics are used to measure the performance of
   the model. Common evaluation metrics include accuracy, precision, recall, F1
   score, and AUC-ROC. The choice of evaluation metric depends on the specific
   problem and the goals of the model.
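The k-fold splitting behind cross-validation can be sketched from scratch; libraries such as scikit-learn provide this ready-made (KFold, cross_val_score), so the helper below is purely illustrative:

```python
# Minimal sketch of k-fold cross-validation index splitting: every
# sample appears in exactly one validation fold and in k-1 training folds.

def kfold_indices(n, k):
    """Yield (train_idx, val_idx) pairs covering all n samples."""
    # Spread any remainder so fold sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, val
        start += size

folds = list(kfold_indices(n=10, k=5))
print(len(folds))   # 5 train/validation splits
print(folds[0])     # first split: samples 0-1 validate, the rest train
```

A model would be trained and scored once per split, and the k scores averaged.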

Overall, testing and validation are critical steps in the machine learning workflow to
ensure that the trained model performs well on new, unseen data and can generalize to
new scenarios.

‭Classification:‬

Classification is a type of supervised learning in machine learning where the goal is to
predict the class label of new, unseen instances based on a set of input features. The
input features are typically represented as a vector, and the output is a discrete class
label.

The process of classification involves training a model on a labeled dataset, where the
correct class labels are known for each input. The model learns to recognize patterns in
the input features that are associated with the different class labels. Once the model
is trained, it can be used to predict the class labels of new, unseen instances.

‭There are many types of classification algorithms, including:‬

1. Logistic Regression: A linear model that predicts the probability of a binary or
   multiclass outcome.
2. Decision Trees: A tree-based model that partitions the feature space into a
   series of binary decisions.
3. Random Forest: An ensemble model that combines multiple decision trees to
   improve performance and reduce overfitting.
4. Support Vector Machines (SVMs): A linear or nonlinear model that finds the
   hyperplane that maximally separates the different classes.
5. Naive Bayes: A probabilistic model that estimates the conditional probability of
   each class given the input features.
6. Neural Networks: A nonlinear model that consists of multiple layers of
   interconnected nodes and can learn complex patterns in the input features.

Classification is a widely used technique in many fields, including image classification,
text classification, fraud detection, and medical diagnosis, to name a few.

‭MNIST Dataset:‬

The MNIST dataset is a widely used benchmark dataset in machine learning,
particularly for image classification tasks. It consists of 70,000 grayscale images
of handwritten digits (0-9), each with a resolution of 28x28 pixels. The dataset is
split into a training set of 60,000 images and a testing set of 10,000 images.
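As a sketch of how each such image reaches a model, the 28x28 pixel grid is typically flattened into a 784-dimensional feature vector. The "image" below is a blank placeholder, not a real digit; fetching the actual dataset (e.g. via scikit-learn's fetch_openml("mnist_784")) requires network access:

```python
# Sketch of the MNIST input representation: a 28x28 grid of pixel
# intensities flattened into one feature vector per image.

ROWS = COLS = 28
image = [[0.0 for _ in range(COLS)] for _ in range(ROWS)]  # placeholder digit
features = [pixel for row in image for pixel in row]       # flatten row by row

print(len(features))   # 28 * 28 = 784 features per image
```

Classifiers such as logistic regression or SVMs consume these 784-dimensional vectors directly, while convolutional networks keep the 2-D grid structure.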

‭ he MNIST dataset is popular because it is relatively simple and well-suited for‬


T
‭evaluating the performance of different machine learning models. Many researchers‬
‭have used this dataset to compare the performance of different classification algorithms,‬
‭such as logistic regression, decision trees, random forests, support vector machines,‬
‭and neural networks.‬

‭ he MNIST dataset has also been extended to include variations such as rotated or‬
T
‭translated images, as well as modified versions that include noise or other types of‬
‭distortion. This allows researchers to evaluate the robustness of machine learning‬
‭models to different types of variations and distortions in the input data.‬

‭ verall, the MNIST dataset has played a crucial role in advancing the field of machine‬
O
‭learning and continues to be a valuable benchmark dataset for evaluating and‬
‭comparing different classification algorithms.‬
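MNIST itself is usually fetched over the network (for example with scikit-learn's `fetch_openml("mnist_784")`). As an offline stand-in, scikit-learn's built-in digits dataset has the same structure on a smaller scale, 8x8 grayscale images of the digits 0-9, so it is used here purely for illustration:

```python
# Load a small MNIST-like dataset of handwritten digit images.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
print(digits.images.shape)   # (1797, 8, 8) grayscale images
print(digits.data.shape)     # flattened to (1797, 64) feature vectors

# A train/test split mirrors MNIST's 60,000/10,000 partition.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0)
```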

‭Performance Measures:‬

Performance measures are used to evaluate the performance of a machine learning model.
The choice of performance measure depends on the specific problem and the goals of the
model. Here are some common performance measures used in machine learning:
1. Accuracy: The proportion of correctly classified instances out of the total number
of instances.
‭2.‬ ‭Precision: The proportion of true positive instances out of the total number of‬
‭instances predicted as positive.‬
‭3.‬ ‭Recall: The proportion of true positive instances out of the total number of actual‬
‭positive instances.‬
‭4.‬ ‭F1 score: The harmonic mean of precision and recall.‬
‭5.‬ ‭Area Under the Curve (AUC): The area under the Receiver Operating‬
‭Characteristic (ROC) curve, which plots the true positive rate against the false‬
‭positive rate.‬
‭6.‬ ‭Confusion matrix: A table that shows the number of true positive, false positive,‬
‭true negative, and false negative instances.‬
‭7.‬ ‭Mean squared error (MSE): A measure of the average squared difference‬
‭between the predicted and actual values.‬
‭8.‬ ‭Mean absolute error (MAE): A measure of the average absolute difference‬
‭between the predicted and actual values.‬
‭9.‬ ‭R-squared: A measure of how well the model fits the data, ranging from 0 to 1,‬
‭where 1 indicates a perfect fit.‬
10. Root mean squared error (RMSE): The square root of the average squared difference
between the predicted and actual values.
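Several of the count-based measures above can be computed by hand for a small binary example; the label vectors here are made up purely for illustration:

```python
# Tiny illustrative binary classification results.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Count the four confusion-matrix cells.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # 3
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # 3
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # 1
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # 1

accuracy  = (tp + tn) / len(y_true)               # 0.75
precision = tp / (tp + fp)                        # 0.75
recall    = tp / (tp + fn)                        # 0.75
f1        = 2 * precision * recall / (precision + recall)

# Regression-style error measures on the same numbers.
errors = [p - t for t, p in zip(y_true, y_pred)]
mse  = sum(e * e for e in errors) / len(errors)   # 0.25
mae  = sum(abs(e) for e in errors) / len(errors)  # 0.25
rmse = mse ** 0.5                                 # 0.5
```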

These performance measures can be used to compare the performance of different machine
learning models or to evaluate the performance of a single model under different
conditions or on different datasets. It is important to select the appropriate
performance measure for the specific problem and to interpret the results in the context
of the problem and the goals of the model.

‭Confusion Matrix:‬

The confusion matrix is a matrix used to determine the performance of classification
models for a given set of test data. It can only be determined if the true values for
the test data are known. The matrix itself is easy to understand, but the related
terminology may be confusing. Since it shows the errors in the model performance in the
form of a matrix, it is also known as an error matrix. Some features of the confusion
matrix are given below:

● For 2 prediction classes, the matrix is a 2x2 table; for 3 classes, a 3x3 table, and
  so on.
● The matrix is divided into two dimensions, predicted values and actual values, along
  with the total number of predictions.
● Predicted values are the values predicted by the model, and actual values are the
  true values for the given observations.
● It looks like the table below:

                      Actual: No         Actual: Yes
  Predicted: No       True Negative      False Negative
  Predicted: Yes      False Positive     True Positive

The above table has the following cases:

● True Negative: The model predicted No, and the real or actual value was also No.
● True Positive: The model predicted Yes, and the actual value was also Yes.
● False Negative: The model predicted No, but the actual value was Yes. It is also
  called a Type-II error.
● False Positive: The model predicted Yes, but the actual value was No. It is also
  called a Type-I error.
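With scikit-learn (assumed available), `confusion_matrix` puts actual values in rows and predicted values in columns; for binary 0/1 labels, `ravel()` unpacks exactly the four cells described above. The label vectors are illustrative:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows = actual class, columns = predicted class; ravel() walks the 2x2
# table in order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)   # 3 1 1 3
```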

‭Need for Confusion Matrix in Machine Learning‬

● It evaluates the performance of classification models when they make predictions on
  test data, and tells how good our classification model is.
● It tells not only the errors made by the classifier but also the type of each error,
  whether type-I or type-II.
● With the help of the confusion matrix, we can calculate different parameters for the
  model, such as accuracy, precision, etc.

‭Example‬‭: We can understand the confusion matrix using‬‭an example.‬

Suppose we are trying to create a model that predicts whether or not a person has a
particular disease. The confusion matrix for this is given as:

                      Actual: No     Actual: Yes
  Predicted: No       TN = 65        FN = 3
  Predicted: Yes      FP = 8         TP = 24

From the above example, we can conclude that:

○ The table is given for a two-class classifier, which has two predictions, "Yes"
  and "No." Here, Yes means the patient has the disease, and No means the patient
  does not have the disease.

○ The classifier has made a total of 100 predictions. Out of 100 predictions, 89
  are true predictions and 11 are incorrect predictions.

○ The model predicted "Yes" 32 times and "No" 68 times, whereas the actual "Yes"
  occurred 27 times and the actual "No" 73 times.
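The four cell counts can be recovered from the totals stated in the example (100 predictions, 89 correct, 32 predicted "Yes", 27 actual "Yes"); this short worked calculation is illustrative, not from the original text:

```python
# Totals stated in the example above.
total, correct = 100, 89
pred_yes, actual_yes = 32, 27
incorrect = total - correct                      # 11

# (TP + FP) + (TP + FN) = pred_yes + actual_yes, and FP + FN = incorrect,
# so 2*TP = pred_yes + actual_yes - incorrect.
tp = (pred_yes + actual_yes - incorrect) // 2    # 24
fp = pred_yes - tp                               # 8
fn = actual_yes - tp                             # 3
tn = correct - tp                                # 65

accuracy = (tp + tn) / total                     # 0.89
```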

‭Calculations using Confusion Matrix:‬

We can perform various calculations for the model, such as the model's accuracy, using
this matrix. These calculations are given below:

○ Classification Accuracy: It is one of the important parameters to determine the
  accuracy of classification problems. It defines how often the model predicts the
  correct output. It can be calculated as the ratio of the number of correct
  predictions made by the classifier to the total number of predictions made by the
  classifier. The formula is given below:

  Accuracy = (TP + TN) / (TP + TN + FP + FN)

○ Misclassification rate: It is also termed as Error rate, and it defines how often
  the model gives wrong predictions. The value of the error rate can be calculated as
  the ratio of the number of incorrect predictions to the total number of predictions
  made by the classifier. The formula is given below:

  Error rate = (FP + FN) / (TP + TN + FP + FN)

○ Precision: It can be defined as the number of correct outputs provided by the
  model, or, out of all the positive classes predicted by the model, how many of them
  were actually true. It can be calculated using the formula below:

  Precision = TP / (TP + FP)

○ Recall: It is defined as, out of the total positive classes, how many our model
  predicted correctly. The recall should be as high as possible. It can be calculated
  using the formula below:

  Recall = TP / (TP + FN)

○ F-measure: If two models have low precision and high recall or vice versa, it is
  difficult to compare them. So, for this purpose, we can use the F-score. This score
  helps us to evaluate recall and precision at the same time. The F-score is maximum
  when the recall is equal to the precision. It can be calculated using the formula
  below:

  F-measure = 2 * (Precision * Recall) / (Precision + Recall)

‭Other important terms used in Confusion Matrix:‬

‭○‬ ‭Null Error rate:‬‭It defines how often our model would‬‭be incorrect if it always‬
‭predicted the majority class. As per the accuracy paradox, it is said that "‬‭the best‬
‭classifier has a higher error rate than the null error rate.‬‭"‬

○ ROC Curve: The ROC is a graph displaying a classifier's performance for all
  possible thresholds. The graph is plotted with the true positive rate on the Y-axis
  and the false positive rate on the X-axis.
‭Precision and Recall:‬

Precision and recall are two important performance metrics used to evaluate the
performance of a machine learning model for binary classification problems.

Precision is a measure of the proportion of true positive predictions made by the model,
out of all the positive predictions it made. In other words, it measures how accurate
the positive predictions of the model are. The formula for precision is:

‭Precision = TP / (TP + FP)‬

‭where TP is the number of true positives and FP is the number of false positives.‬

Recall, on the other hand, is a measure of the proportion of true positive predictions
made by the model, out of all the actual positive instances in the data. In other words,
it measures how well the model is able to identify positive instances. The formula for
recall is:

‭Recall = TP / (TP + FN)‬

‭where FN is the number of false negatives.‬

I‭n general, there is a trade-off between precision and recall, and the choice of which‬
‭metric to optimize depends on the specific needs of the application. For example, if the‬
‭cost of false positives is high, then it may be important to optimize for high precision,‬
‭even if it comes at the expense of lower recall. On the other hand, if the cost of false‬
‭negatives is high, then it may be more important to optimize for high recall, even if it‬
‭comes at the expense of lower precision.‬

A common way to balance precision and recall is to use the F1 score, which is the
harmonic mean of precision and recall:

‭F1 score = 2 * (precision * recall) / (precision + recall)‬

The F1 score provides a single value that balances both precision and recall, and is
often used as a performance metric in binary classification problems.

‭Precision/Recall Tradeoff:‬
The precision/recall tradeoff is a common challenge in machine learning, particularly in
binary classification problems, where the goal is to classify instances into one of two
categories (e.g., positive or negative). The tradeoff arises because improving precision
typically results in a decrease in recall, and vice versa.

To understand the tradeoff, it's important to consider the decision threshold of the
classification algorithm. The decision threshold is the value above which an instance is
classified as positive and below which it is classified as negative. By default, most
classification algorithms use a threshold of 0.5, but this can be adjusted depending on
the needs of the application.

I‭f the decision threshold is increased (i.e., the algorithm becomes more conservative),‬
‭then the precision of the classifier typically improves, but the recall decreases. This is‬
‭because the classifier becomes more selective and only predicts positive instances that‬
‭are highly likely to be correct. On the other hand, if the decision threshold is decreased‬
‭(i.e., the algorithm becomes more liberal), then the recall of the classifier typically‬
‭improves, but the precision decreases. This is because the classifier predicts more‬
‭positive instances, but some of them may be incorrect.‬

The precision/recall tradeoff can be visualized using a precision-recall curve, which
plots the precision and recall of the classifier for different decision thresholds. The
ideal classifier would have both high precision and high recall at all decision
thresholds, resulting in a curve that is close to the top right corner of the plot. In
practice, however, the curve typically reflects a tradeoff between the two.

To select the optimal decision threshold for a given application, it's important to
consider the specific needs and constraints of the problem. For example, if the cost of
false positives is high, then it may be important to select a threshold that maximizes
precision, even if it comes at the expense of lower recall. On the other hand, if the
cost of false negatives is high, then it may be more important to select a threshold
that maximizes recall, even if it comes at the expense of lower precision.
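The threshold effect can be demonstrated with a short sketch, assuming scikit-learn is available; the synthetic dataset and threshold values are illustrative. Raising the threshold can only shrink the set of predicted positives, so recall never increases:

```python
# Sweep the decision threshold and watch precision/recall move in
# opposite directions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

X, y = make_classification(n_samples=1000, random_state=0)
model = LogisticRegression().fit(X, y)
proba = model.predict_proba(X)[:, 1]   # probability of the positive class

results = {}
for threshold in (0.3, 0.5, 0.7):
    y_pred = (proba >= threshold).astype(int)
    results[threshold] = (
        precision_score(y, y_pred, zero_division=0),
        recall_score(y, y_pred),
    )
```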

‭ROC Curve:‬

The ROC (Receiver Operating Characteristic) curve is another common tool for evaluating
the performance of a binary classification model, particularly in cases where the class
distribution is imbalanced. The ROC curve plots the true positive rate (TPR) against the
false positive rate (FPR) for different decision thresholds of the model.

The TPR is the proportion of true positive predictions made by the model, out of all the
actual positive instances in the data. In other words, it measures how well the model is
able to identify positive instances. The formula for TPR is:

‭TPR = TP / (TP + FN)‬

‭where TP is the number of true positives and FN is the number of false negatives.‬

The FPR, on the other hand, is the proportion of false positive predictions made by the
model, out of all the actual negative instances in the data. In other words, it measures
how often the model incorrectly predicts a positive instance when the actual class is
negative. The formula for FPR is:

‭FPR = FP / (FP + TN)‬

‭where FP is the number of false positives and TN is the number of true negatives.‬

The ROC curve is created by varying the decision threshold of the model and computing
the TPR and FPR for each threshold. The curve shows the tradeoff between TPR and FPR at
different thresholds, and provides a way to visualize the overall performance of the
model.

A perfect classifier would have a TPR of 1 and an FPR of 0 at all thresholds, resulting
in a curve that passes through the top left corner of the plot. In practice, however,
the curve will typically reflect a tradeoff between TPR and FPR, and the goal is to
choose a threshold that maximizes the overall performance of the classifier.

The area under the ROC curve (AUC) is a commonly used performance metric for binary
classification models. A perfect classifier would have an AUC of 1, while a random
classifier (one that randomly assigns labels) would have an AUC of 0.5. A higher AUC
indicates better overall performance of the classifier.
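Computing the curve and its area can be sketched with scikit-learn (assumed available) on synthetic data; `roc_curve` sweeps the decision threshold and returns the FPR/TPR pairs described above:

```python
# Compute ROC points and the area under the curve for a fitted model.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=1000, random_state=1)
scores = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

fpr, tpr, thresholds = roc_curve(y, scores)
auc = roc_auc_score(y, scores)   # 0.5 = random guessing, 1.0 = perfect
```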

‭AUC: Area Under the ROC curve‬

AUC stands for Area Under the ROC curve. As its name suggests, AUC calculates the
two-dimensional area under the entire ROC curve, ranging from (0,0) to (1,1).

In the ROC curve, AUC computes the performance of the binary classifier across different
thresholds and provides an aggregate measure. The value of AUC ranges from 0 to 1: an
excellent model will have an AUC near 1, indicating a good measure of separability.

‭Applications of AUC-ROC Curve:‬

1. Classification of 3D models
   The curve is used to classify a 3D model and separate it from the normal models.
   With the specified threshold level, the curve classifies the non-3D models and
   separates out the 3D models.

‭2.‬ ‭Healthcare‬
‭The curve has various applications in the healthcare sector. It can be used to‬
‭detect cancer disease in patients. It does this by using false positive and false‬
‭negative rates, and accuracy depends on the threshold value used for the curve.‬

3. Binary Classification
   The AUC-ROC curve is mainly used for binary classification problems to evaluate
   their performance.

‭Multiclass classification:‬

Multiclass classification is a type of classification problem where the goal is to
classify instances into three or more classes or categories. In contrast, binary
classification involves classifying instances into one of two categories.

‭There are several algorithms that can be used for multiclass classification, including‬
‭decision trees, random forests, naive Bayes, support vector machines, and neural‬
‭networks. One common approach is to use a one-vs-all (OVA) or one-vs-rest (OVR)‬
‭strategy, where multiple binary classifiers are trained to distinguish each class from the‬
‭others. Another approach is to use a multinomial logistic regression, which directly‬
‭models the probability of each class given the input features.‬
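The one-vs-rest strategy can be sketched with scikit-learn (assumed available): one binary logistic regression is fit per class, alongside a single multinomial model for comparison. The iris dataset is an illustrative choice:

```python
# One-vs-rest: one binary classifier per class, vs. a single
# multinomial logistic regression.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
multinomial = LogisticRegression(max_iter=1000).fit(X, y)

print(len(ovr.estimators_))   # 3 binary classifiers, one per class
```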

‭When evaluating the performance of a multiclass classification model, there are several‬
‭metrics that can be used. One commonly used metric is accuracy, which measures the‬
‭proportion of correct predictions made by the model. However, accuracy can be‬
‭misleading in cases where the class distribution is imbalanced, and it may be more‬
‭appropriate to use other metrics such as precision, recall, and F1-score.‬
‭Precision and recall can be extended to the multiclass classification setting using a‬
‭confusion matrix that counts the number of true positives, false positives, false‬
‭negatives, and true negatives for each class. Precision measures the proportion of‬
‭correct positive predictions made by the model out of all the positive predictions, while‬
‭recall measures the proportion of true positive predictions made by the model out of all‬
‭the actual positive instances. The F1-score is the harmonic mean of precision and‬
‭recall, and provides a single number that summarizes the overall performance of the‬
‭model.‬

‭There are also multiclass extensions of the ROC curve and AUC metric, such as the‬
‭micro- and macro-averaged ROC curves and AUCs. These metrics provide a way to‬
‭evaluate the overall performance of the model across all classes, and can be useful in‬
‭cases where some classes are more important than others.‬
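Micro- and macro-averaging can be illustrated with scikit-learn (assumed available): "micro" pools all decisions, and for single-label problems micro-F1 equals accuracy, while "macro" averages the per-class F1 scores. The label vectors are made up for illustration:

```python
# Compare micro- and macro-averaged F1 on a small 3-class example.
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 2]

micro = f1_score(y_true, y_pred, average="micro", zero_division=0)
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(micro == accuracy_score(y_true, y_pred))   # True
```

Micro-averaging weights every instance equally, so frequent classes dominate; macro-averaging weights every class equally, which exposes poor performance on rare classes.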

‭Error Analysis:‬

Error analysis is the process of examining the errors made by a machine learning model
during training and testing, in order to understand its weaknesses and improve its
performance. It involves analyzing the incorrect predictions made by the model,
identifying patterns or trends in the errors, and using this information to improve the
model or the data used to train it.

‭There are several steps involved in error analysis, including:‬

1. Collecting and analyzing data: The first step is to collect data on the errors made
by the model during training and testing. This can involve examining the confusion
matrix, analyzing misclassified instances, or looking at the distribution of errors
across different classes or features.

2. Identifying patterns or trends: The next step is to identify patterns or trends in
the errors made by the model. This can involve looking for common features or
characteristics of misclassified instances, such as specific image features or text
patterns.

3. Diagnosing the causes of errors: Once patterns or trends have been identified, the
next step is to diagnose the causes of the errors. This can involve analyzing the
features or characteristics of misclassified instances, or looking at the types of
errors made by the model, such as false positives or false negatives.

4. Improving the model or data: Based on the results of the error analysis, the final
step is to make changes to the model or data in order to improve its performance. This
can involve modifying the model architecture, adjusting the hyperparameters, or
augmenting the data to address specific weaknesses or biases.

Error analysis is an important tool for improving the performance of machine learning
models, particularly in cases where the errors are non-random or systematic. By
analyzing the errors made by the model, it is possible to identify areas where the model
is weak and make targeted improvements to address these weaknesses.
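The first step, collecting the misclassified instances and looking for per-class patterns, can be sketched as follows, assuming scikit-learn is available; the digits dataset and logistic regression model are illustrative choices:

```python
# Gather misclassified test instances and tally errors per true class.
from collections import Counter
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Indices of the errors, and a per-class tally to look for patterns.
errors = [i for i in range(len(y_test)) if y_pred[i] != y_test[i]]
errors_per_class = Counter(y_test[i] for i in errors)
```

Classes with unusually many entries in `errors_per_class` are the ones to inspect first.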

‭UNIT-4 FUNDAMENTALS OF DEEP LEARNING‬

‭What is Deep Learning?‬

Deep learning is a subfield of machine learning that focuses on training artificial
neural networks with multiple layers to perform complex tasks. It is inspired by the
structure and function of the human brain, where information is processed through
interconnected neurons.

Traditional machine learning algorithms often require manual feature extraction, where
human experts have to identify relevant features from the input data. Deep learning, on
the other hand, attempts to automatically learn these features by building hierarchical
representations of the data. This is achieved by constructing neural networks with
multiple layers, known as deep neural networks.

Deep learning algorithms utilize a technique called backpropagation to iteratively
adjust the weights and biases of the neural network during training. This process
involves feeding input data through the network, comparing the output with the desired
target, calculating the error, and propagating it back through the network to update the
parameters. The network gradually learns to recognize patterns, classify data, or make
predictions based on the given training examples.

One of the key advantages of deep learning is its ability to handle large-scale,
unstructured data such as images, audio, and text. Deep neural networks have shown
remarkable performance in various domains, including computer vision, natural language
processing, speech recognition, and recommender systems. They have achieved
state-of-the-art results in tasks such as image classification, object detection,
machine translation, and speech synthesis.

Deep learning has benefited from advances in hardware acceleration, particularly
graphics processing units (GPUs) and specialized chips called tensor processing units
(TPUs). These powerful computing resources enable the efficient training and deployment
of deep neural networks, allowing for rapid progress in the field.

Overall, deep learning has revolutionized the field of artificial intelligence by
enabling machines to learn and make intelligent decisions from complex and unstructured
data, paving the way for significant advancements in various industries and
applications.

‭Why need Deep Learning?‬

‭Deep learning is essential for several reasons:‬

1. Handling Complex and Unstructured Data: Deep learning excels at processing and
extracting meaningful information from large-scale, unstructured data, such as images,
audio, and text. Traditional machine learning algorithms often struggle with such
complex data, requiring extensive feature engineering. Deep learning algorithms, on the
other hand, can automatically learn and extract relevant features from the data, saving
time and effort.
‭2.‬ ‭Improved Accuracy and Performance: Deep learning models have achieved‬
‭state-of-the-art results in various tasks, surpassing traditional machine learning‬
‭approaches in terms of accuracy and performance. The ability of deep neural‬
‭networks to learn hierarchical representations and capture intricate patterns in‬
‭the data allows them to make highly accurate predictions and classifications.‬
‭3.‬ ‭Feature Learning and Representation: Deep learning algorithms excel at learning‬
‭and representing features from raw data. By automatically learning hierarchical‬
r‭ epresentations, deep neural networks can extract high-level features that are‬
‭more informative and discriminative. This eliminates the need for manual feature‬
‭engineering, which can be time-consuming and limited in its ability to capture‬
‭complex patterns.‬
4. Scalability: Deep learning models are highly scalable, capable of handling
large-scale datasets and complex models. With the availability of powerful hardware,
such as GPUs and TPUs, deep learning algorithms can efficiently process massive amounts
of data and train complex neural networks, enabling more sophisticated and accurate
predictions.
‭5.‬ ‭Wide Range of Applications: Deep learning has demonstrated significant‬
‭advancements and breakthroughs in various domains. It has been successfully‬
‭applied to computer vision tasks, such as image recognition, object detection,‬
‭and image synthesis. In natural language processing, deep learning has been‬
‭used for machine translation, sentiment analysis, and text generation. It has also‬
‭been applied to speech recognition, recommender systems, drug discovery, and‬
‭many other fields, showcasing its versatility and impact.‬
‭6.‬ ‭Continuous Improvement: Deep learning is an active and rapidly evolving field of‬
‭research. Ongoing advancements in algorithms, architectures, and training‬
‭techniques continually push the boundaries of what deep learning can achieve.‬
‭The deep learning community consistently introduces novel techniques and‬
‭architectures to improve performance, making it an exciting and dynamic area of‬
‭study.‬

Overall, deep learning is crucial because it enables machines to effectively learn from
complex data, achieve high accuracy, and automate feature learning. It has the potential
to revolutionize various industries and drive innovations across multiple domains.

‭Introduction to Artificial Neural Network:‬

An Artificial Neural Network (ANN) is a computational model inspired by the structure
and functioning of biological neural networks, such as the human brain. It is a key
component of deep learning and serves as the foundation for training and deploying deep
neural networks.

At its core, an ANN consists of interconnected artificial neurons, also known as nodes
or units, organized into layers. The three main types of layers in an ANN are the input
layer, hidden layer(s), and output layer. Each neuron receives input signals, processes
them using an activation function, and produces an output that is transmitted to the
next layer.

The connections between neurons are represented by weights, which determine the strength
and influence of the input signals. During the training process, these weights are
adjusted iteratively using algorithms like backpropagation, in order to minimize the
difference between the predicted output and the desired output for a given input.

The architecture and structure of an ANN can vary depending on the task and the
complexity of the problem being solved. Feedforward neural networks are a common type of
ANN, where information flows strictly in one direction, from the input layer to the
output layer, without any loops or feedback connections. Convolutional Neural Networks
(CNNs) are a specialized type of feedforward network commonly used for image processing
and computer vision tasks.

Recurrent Neural Networks (RNNs) introduce feedback connections, allowing the network to
have memory and process sequential data. This makes RNNs well-suited for tasks like
natural language processing and time series analysis.

ANNs have the ability to learn and generalize from training examples, making them
powerful tools for tasks such as classification, regression, pattern recognition, and
decision-making. By adjusting the weights and biases of the network through training,
ANNs can recognize complex patterns and make predictions on new, unseen data.

ANNs have been successful in a wide range of applications, including image and speech
recognition, natural language processing, recommendation systems, autonomous vehicles,
and more. Their versatility and ability to process large amounts of data have
contributed to significant advancements in the field of artificial intelligence and
machine learning.

I‭n summary, an Artificial Neural Network is a computational model that mimics the‬
‭behavior of biological neural networks. It consists of interconnected artificial neurons‬
‭organized into layers, and it learns from data to make predictions and solve complex‬
‭problems. ANNs are a fundamental component of deep learning and have‬
‭revolutionized various industries and domains.‬
‭Core components of Neural Network:‬

‭The core components of neural networks include:‬

1. Neurons (Nodes): Neurons are the basic units of a neural network. They receive input
signals, perform computations, and produce an output. Each neuron applies an activation
function to the weighted sum of its inputs to introduce non-linearity and determine its
output value.
‭2.‬ ‭Weights and Biases: Weights and biases are parameters associated with the‬
‭connections between neurons. Each connection between neurons is assigned a‬
‭weight, which determines the strength or importance of the signal passing‬
‭through that connection. Biases are additional values added to the inputs of‬
‭neurons, allowing them to learn and adjust their behavior.‬
‭3.‬ ‭Activation Function: An activation function determines the output of a neuron‬
‭based on its input. It introduces‬‭non-linearities‬‭into the network, enabling it to‬
‭learn complex patterns and relationships. Common activation functions include‬
‭the sigmoid function, hyperbolic tangent (tanh) function, and rectified linear unit‬
‭(ReLU) function.‬
‭4.‬ ‭Layers: Neurons are organized into layers in a neural network. The three main‬
‭types of layers are:‬

a. Input Layer: The input layer receives the initial data or features and passes them to
the subsequent layers.

b. Hidden Layer(s): Hidden layers are intermediate layers between the input and output
layers. They perform computations and progressively extract higher-level features from
the input data.
c‭ . Output Layer: The output layer produces the final predictions or outputs of the‬
‭network. The number of neurons in the output layer depends on the nature of the task,‬
‭such as binary classification, multi-class classification, or regression.‬

5. Connections and Architecture: Connections represent the paths through which signals
flow between neurons. They are represented by weights that are adjusted during the
training process. The architecture of a neural network refers to its structure,
including the arrangement and connectivity of neurons and layers. Different
architectures, such as feedforward networks, convolutional networks, and recurrent
networks, have specific characteristics and are suitable for different tasks.
‭6.‬ ‭Loss/Cost Function: The loss or cost function measures the discrepancy between‬
‭the predicted outputs of the network and the desired outputs. It quantifies the‬
‭network's performance and guides the learning process. During training, the goal‬
‭is to minimize the loss function by adjusting the weights and biases of the‬
‭network.‬
‭7.‬ ‭Optimization Algorithm: Optimization algorithms are used to adjust the weights‬
‭and biases of the network during the training process. The most common‬
‭algorithm is backpropagation, which calculates the gradients of the loss function‬
‭with respect to the network parameters and updates the weights and biases‬
‭accordingly.‬
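The computation performed by a single neuron, a weighted sum plus bias followed by a non-linear activation, can be written out with NumPy (assumed available); the input, weight, and bias values are illustrative:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Passes positive values through, clamps negatives to 0.
    return np.maximum(0.0, z)

x = np.array([0.5, -1.0, 2.0])   # inputs arriving on each connection
w = np.array([0.4, 0.3, -0.2])   # weight on each connection
b = 0.1                          # bias

z = np.dot(w, x) + b             # weighted sum: 0.2 - 0.3 - 0.4 + 0.1 = -0.4
print(sigmoid(z))                # below 0.5, since z is negative
print(np.tanh(z))                # squashed into (-1, 1)
print(relu(z))                   # 0.0, since z is negative
```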

These core components work together to enable neural networks to learn from data, make
predictions, and solve complex tasks. By adjusting the weights and biases based on the
training examples, neural networks can generalize and make accurate predictions on new,
unseen data.

‭Multi-Layer Perceptron (MLP):‬

The Multi-Layer Perceptron (MLP) is a type of feedforward neural network, which is one
of the most common and basic architectures used in deep learning. It is composed of
multiple layers of interconnected artificial neurons (perceptrons) and is widely
employed for tasks such as classification and regression.

‭Here are the key characteristics and components of an MLP:‬

‭ . Input Layer: The input layer receives the initial data or features and passes them to‬
1
‭the subsequent layers. Each input node represents a feature or attribute of the input‬
‭data.‬
2. Hidden Layers: The hidden layers are intermediate layers between the input and output layers. They perform computations by applying activation functions to the weighted sum of their inputs. MLPs can have one or more hidden layers, allowing them to learn increasingly complex representations of the input data.

3. Output Layer: The output layer produces the final predictions or outputs of the MLP. The number of neurons in the output layer depends on the specific task. For example, in binary classification, there would typically be one output neuron representing the probability of belonging to one class. In multi-class classification, there would be multiple output neurons, each representing the probability of belonging to a specific class.

4. Neurons and Activation Functions: Each neuron in an MLP applies an activation function to the weighted sum of its inputs. The most commonly used activation functions include the sigmoid function, hyperbolic tangent (tanh) function, and rectified linear unit (ReLU) function. These non-linear activation functions introduce non-linearity into the network, enabling it to learn and represent complex patterns.

5. Weights and Biases: MLPs have weights and biases associated with the connections between neurons. Each connection has a weight that determines the strength or importance of the signal passing through it. Biases are additional values added to the inputs of neurons, enabling them to learn and adjust their behavior.

6. Training: MLPs are trained using an optimization algorithm, typically backpropagation. Backpropagation calculates the gradients of the loss function with respect to the network parameters (weights and biases) and adjusts them iteratively to minimize the loss. The training process involves forward propagation, where inputs are passed through the network to produce predictions, and backward propagation, where the error is propagated back through the network to update the weights.
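The forward-propagation step described above can be sketched in plain Python. The layer sizes, weights, and biases below are made-up values for illustration only:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(inputs, weights, biases):
    """One dense layer: for every neuron, apply the activation
    function to the weighted sum of its inputs plus the bias."""
    return [sigmoid(sum(w * i for w, i in zip(neuron_weights, inputs)) + b)
            for neuron_weights, b in zip(weights, biases)]

# Toy 2-input -> 2-hidden -> 1-output MLP; the weights are arbitrary.
x = [0.5, -1.0]
hidden = layer_forward(x, [[0.1, 0.4], [-0.3, 0.2]], [0.0, 0.1])
output = layer_forward(hidden, [[0.7, -0.5]], [0.2])
print(output)  # one sigmoid output, so a value strictly between 0 and 1
```

Backward propagation would then compute how the loss changes with each of these weights and biases and update them accordingly.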

MLPs have been successful in various domains and tasks, including image classification, text analysis, and time series forecasting. While MLPs have limitations in capturing spatial relationships and sequential dependencies, they serve as a fundamental building block in more advanced architectures such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs).

‭Activation Functions:‬
I‭n deep learning, various activation functions are used to introduce non-linearities into‬
‭neural networks, allowing them to learn and represent complex patterns. Here are some‬
‭commonly used activation functions in deep learning:‬

1. Rectified Linear Unit (ReLU): ReLU is one of the most popular activation functions in deep learning. It maps negative input values to zero and keeps positive values unchanged. The activation function is defined as:

‭ReLU(x) = max(0, x)‬

ReLU is computationally efficient and helps alleviate the vanishing gradient problem. It has been widely adopted in deep neural networks and has been shown to accelerate training convergence.
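The definition translates directly into code:

```python
def relu(x):
    """ReLU(x) = max(0, x): zero for negative inputs, identity for positive."""
    return max(0.0, x)

print([relu(v) for v in [-2.0, -0.5, 0.0, 1.5]])  # [0.0, 0.0, 0.0, 1.5]
```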

2. Leaky ReLU: Leaky ReLU is a variation of ReLU that addresses the "dying ReLU" problem by allowing a small non-zero output for negative input values. The activation function is defined as:

‭LeakyReLU(x) = max(ax, x)‬

Here, a is a small positive constant with 0 < a < 1. By introducing a small slope for negative values, Leaky ReLU ensures that neurons can still receive gradients and learn during training.
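A minimal sketch in plain Python, using a = 0.01 (a common but arbitrary choice for the slope):

```python
def leaky_relu(x, a=0.01):
    """Leaky ReLU: max(a*x, x). With 0 < a < 1 this leaves positive inputs
    unchanged and gives negative inputs a small non-zero output."""
    return max(a * x, x)

print(leaky_relu(-3.0))  # roughly -0.03 instead of 0, so gradients survive
print(leaky_relu(2.0))   # 2.0, identical to ReLU for positive inputs
```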

3. Parametric ReLU (PReLU): PReLU is another variation of ReLU where the slope for negative input values is learned during training. Instead of using a fixed constant like in Leaky ReLU, PReLU allows the slope to be optimized as a parameter.

4. Sigmoid Function: The sigmoid function, also known as the logistic function, maps the input to a range between 0 and 1. It is given by:

‭σ(x) = 1 / (1 + e^(-x))‬

Sigmoid functions were widely used in the past, but they are less common in deep learning architectures today. They are still used in certain cases, such as the output layer of binary classification problems, where the output represents the probability of belonging to a class.

5. Hyperbolic Tangent (tanh) Function: The tanh function is similar to the sigmoid function but maps the input to a range between -1 and 1. It is defined as:

‭tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))‬


Tanh functions are also used less frequently in deep learning compared to ReLU and its variants. However, they can still be useful in specific cases where negative outputs are desired.

6. Softmax Function: The softmax function is typically used in the output layer of multi-class classification problems. It converts the outputs of the last layer into a probability distribution over multiple classes. The softmax function ensures that the predicted probabilities sum up to 1. It is defined as:

‭softmax(x_i) = e^(x_i) / (sum(e^(x_j)) for j in classes)‬

Softmax is commonly used when the goal is to classify inputs into mutually exclusive classes.
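A plain-Python sketch of the softmax formula. Subtracting the maximum score before exponentiating is a standard numerical-stability trick; it does not change the result because softmax is shift-invariant:

```python
import math

def softmax(scores):
    """Turn raw scores into a probability distribution summing to 1."""
    m = max(scores)                       # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)       # the largest score receives the largest probability
print(sum(probs))  # sums to 1, up to floating-point rounding
```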

The choice of activation function depends on the specific problem, network architecture, and characteristics of the data. Experimentation and empirical evaluation are often necessary to determine the most suitable activation function for a given task.


‭Sigmoid Function:‬

The sigmoid activation function, also known as the logistic function, is a common non-linear activation function used in neural networks. It has a characteristic S-shaped curve and maps the input to a range between 0 and 1. The sigmoid function is defined as:

‭σ(x) = 1 / (1 + e^(-x))‬

‭Here, e is the base of the natural logarithm and x is the input to the function.‬

‭Key properties and characteristics of the sigmoid function:‬

1. Output Range: The sigmoid function squashes the input values into the range (0, 1). As a result, the output of the sigmoid function can be interpreted as a probability or a measure of confidence.

2. Non-Linearity: The sigmoid function introduces non-linearity into the network, allowing neural networks to learn and represent complex relationships between inputs and outputs. This non-linearity is crucial for capturing intricate patterns in data.
3. Smoothness: The sigmoid function is a smooth and continuous function, which means it is differentiable at all points. This property is essential for the backpropagation algorithm, which relies on derivatives for updating the weights during training.

4. Gradient Saturation: One limitation of the sigmoid function is that its gradient saturates as the absolute value of the input becomes large. This saturation occurs because the slope of the sigmoid function approaches zero for extremely positive or negative inputs. As a result, during backpropagation, the gradients can become very small, leading to slower convergence and the vanishing gradient problem. This limitation makes the sigmoid function less commonly used in deep neural networks compared to other activation functions like ReLU.

5. Output Interpretation: The sigmoid function is often used in the output layer of a neural network for binary classification tasks. The output can be interpreted as the probability of belonging to a particular class, with values closer to 1 indicating a higher likelihood.

While the sigmoid function has been widely used in the past, it is less commonly employed in deep learning architectures today, primarily due to the issue of gradient saturation. Activation functions like ReLU and its variants, which do not suffer from gradient saturation, are more prevalent in deep neural networks. However, the sigmoid function can still be useful in certain cases, such as the output layer of binary classification problems or in architectures where its specific properties are desired.
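Gradient saturation is easy to see numerically: the sigmoid's derivative is σ(x)(1 − σ(x)), which peaks at 0.25 and collapses toward zero for large |x|. A small sketch:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [0.0, 2.0, 5.0, 10.0]:
    print(x, sigmoid_grad(x))
# The gradient is 0.25 at x = 0 and nearly zero by x = 10: saturation.
```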

‭Rectified Linear Unit (ReLU):‬

The Rectified Linear Unit (ReLU) is a widely used activation function in deep learning. It introduces non-linearity to neural networks and helps them represent complex patterns in the data. The ReLU activation function is defined as:

‭ReLU (x) = max(0, x)‬

In other words, ReLU returns the input value if it is positive, and returns zero for any negative value.

‭Here are some characteristics and advantages of ReLU activation function:‬


1. Simplicity: ReLU is a simple function to compute, involving only a comparison and a maximum operation. This simplicity contributes to faster training and inference times.

2. Non-linearity: ReLU is a non-linear activation function, which allows neural networks to model and learn non-linear relationships in the data. By introducing non-linearity, ReLU enables the network to capture complex patterns and make more expressive predictions.

3. Sparsity: ReLU has a desirable property of inducing sparsity in neural networks. When the input to a ReLU neuron is negative, the neuron becomes inactive (outputting zero). This sparsity property can help in reducing the overall complexity of the network and prevent overfitting by focusing on the most relevant features.

4. Efficient Computation: ReLU is computationally efficient compared to other activation functions like sigmoid or tanh, which involve more complex mathematical operations. The simplicity of ReLU makes it faster to compute, which is especially beneficial in large networks with millions of parameters.

5. Mitigating the Vanishing Gradient Problem: One major advantage of ReLU over other activation functions like sigmoid or tanh is that it mitigates the vanishing gradient problem. The vanishing gradient problem occurs when the gradients become very small during backpropagation, leading to slow learning or difficulty in training neural networks. ReLU helps alleviate this problem by avoiding saturation for positive input values.

6. Potential Dead Neurons: One drawback of ReLU is the issue of "dead neurons". A neuron becomes "dead" when its output is always zero, causing the neuron to no longer contribute to the learning process. Dead neurons can occur when the weights associated with a neuron are updated in a way that keeps the neuron's input always negative. In such cases, the neuron will never activate and its gradients will always be zero. Variants like Leaky ReLU or Parametric ReLU (PReLU) have been introduced to address this problem by allowing small non-zero outputs for negative input values.

‭Introduction to Tensors and Operations:‬

I‭n the context of deep learning, tensors are fundamental data structures that represent‬
‭multi-dimensional arrays or mathematical objects. They are the primary way to‬‭store‬
‭and manipulate data in neural networks‬‭. Tensors can‬‭have different dimensions, such‬
‭as scalars (0-dimensional), vectors (1-dimensional), matrices (2-dimensional), or‬
‭higher-dimensional arrays.‬
‭Here's an overview of tensors and operations commonly used in deep learning:‬

1. Scalars: Scalars are tensors of rank 0, representing single values. For example, a scalar can represent a single number like 5 or 0.8.

2. Vectors: Vectors are tensors of rank 1, representing a sequence of values arranged in a single dimension. For example, a vector can represent features of a data point, such as [2, 4, 6, 8].

3. Matrices: Matrices are tensors of rank 2, representing a 2-dimensional array of values. They are often used to store data in tabular form or to represent weights between layers in a neural network. For example, a matrix can represent a 3x3 grid of values like [[1, 2, 3], [4, 5, 6], [7, 8, 9]].

4. Higher-dimensional Tensors: Tensors of rank 3 or higher represent multi-dimensional arrays. For example, a rank-3 tensor can represent a collection of RGB images, where each image is represented by a 3-dimensional array of pixel values.

‭Common tensor operations in deep learning include:‬

1. Addition and Subtraction: Element-wise addition and subtraction are performed between tensors of the same shape, where the corresponding elements are added or subtracted.

2. Multiplication and Division: Element-wise multiplication and division are performed between tensors of the same shape, where the corresponding elements are multiplied or divided.

3. Dot Product: The dot product (also known as the inner product or scalar product) is a mathematical operation that combines two vectors and produces a scalar. It calculates the sum of the products of corresponding elements in the vectors.

4. Matrix Multiplication: Matrix multiplication is a fundamental operation that combines two matrices to produce a resulting matrix. It involves taking dot products of rows from the first matrix and columns from the second matrix.

5. Transposition: Transposing a matrix or tensor changes its shape by flipping its dimensions. Rows become columns, and columns become rows.

6. Element-wise Functions: Various mathematical functions can be applied element-wise to tensors, such as the sigmoid, ReLU, tanh, and softmax functions. These functions are applied independently to each element of the tensor.
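A few of these operations written out in plain Python on nested lists (deep learning libraries provide far faster implementations, but the definitions are the same):

```python
def dot(u, v):
    """Dot product: sum of products of corresponding elements."""
    return sum(a * b for a, b in zip(u, v))

def matmul(A, B):
    """Matrix multiplication: rows of A dotted with columns of B."""
    Bt = list(zip(*B))                   # transpose B to iterate its columns
    return [[dot(row, col) for col in Bt] for row in A]

def transpose(A):
    """Flip rows and columns."""
    return [list(row) for row in zip(*A)]

print(dot([1, 2, 3], [4, 5, 6]))                    # 32
print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))   # [[19, 22], [43, 50]]
print(transpose([[1, 2, 3], [4, 5, 6]]))            # [[1, 4], [2, 5], [3, 6]]
```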
These operations form the foundation for many computations in deep learning, including forward and backward propagation during training, weight updates, and gradient calculations.

Deep learning libraries like TensorFlow and PyTorch provide efficient implementations of these tensor operations, along with additional functionalities for building and training neural networks.

‭TensorFlow Framework:‬

TensorFlow is an open-source deep learning framework developed by Google. It is widely used in the field of machine learning and has become one of the most popular frameworks for building and training neural networks. TensorFlow provides a comprehensive ecosystem of tools, libraries, and resources that make it easier to develop and deploy machine learning models.

‭Here are some key features and components of TensorFlow:‬

1. Computational Graph: TensorFlow uses a computational graph paradigm to represent and execute computations. The graph consists of nodes that represent mathematical operations, and edges that represent the flow of data between operations. This graph-based approach allows for efficient execution on various hardware platforms, including CPUs, GPUs, and TPUs (Tensor Processing Units).

2. Automatic Differentiation: TensorFlow has built-in support for automatic differentiation, which is crucial for training neural networks using techniques like backpropagation. It can automatically compute gradients of functions and operations, enabling efficient computation of gradients for parameter updates during the training process.

3. Eager Execution: TensorFlow supports both static graph execution and dynamic graph execution through its eager execution mode. With eager execution, you can execute operations immediately and get results directly, making it easier for debugging and experimentation.

4. High-Level APIs: TensorFlow provides high-level APIs, such as Keras and tf.data, that simplify the process of building and training neural networks. Keras is a user-friendly API that allows for fast prototyping and supports a wide range of neural network architectures. tf.data is a powerful API for efficient data loading and preprocessing.
5. TensorFlow Hub: TensorFlow Hub is a repository of pre-trained machine learning models, including various neural network architectures. It allows users to reuse and transfer pre-trained models for different tasks, making it easier to leverage existing knowledge and accelerate development.

6. TensorBoard: TensorBoard is a visualization tool included with TensorFlow. It provides interactive visualizations of training metrics, model graphs, and histograms, enabling users to monitor and analyze the performance and behavior of their models.

7. TensorFlow Serving: TensorFlow Serving is a dedicated serving system that allows you to deploy trained TensorFlow models in production. It provides a flexible and scalable serving infrastructure, enabling efficient inference and serving of machine learning models.

8. TensorFlow.js: TensorFlow also has a JavaScript library called TensorFlow.js that allows for the deployment and execution of machine learning models in web browsers or Node.js. This enables the development of machine learning applications that run directly in the browser.

TensorFlow is known for its versatility, scalability, and community support. It has a vast ecosystem of resources, including tutorials, documentation, and pre-trained models, which makes it easier for developers to get started and explore the capabilities of deep learning.

‭UNIT 2 -‬‭Training Models‬

‭Linear Regression‬

‭Linear regression is one of the easiest and most popular Machine Learning algorithms.‬
‭It is a statistical method that is used for‬‭predictive‬‭analysis.‬‭Linear regression makes‬
‭predictions for continuous/real or numeric variables such as sales, salary, age, product‬
‭price, etc.‬
‭Linear regression is a statistical modeling technique used to establish a linear‬
‭relationship between a dependent variable and one or more independent variables. It‬
‭aims to predict or estimate the value of the dependent variable based on the given‬
‭independent variables.‬
‭The linear regression model provides a sloped straight line representing the relationship‬
‭between the variables. Consider the below image:‬

‭The key assumptions of linear regression include:‬


‭1.‬ ‭Linearity: There should be a linear relationship between the independent‬
‭variables and the dependent variable. This assumption implies that the‬
‭relationship can be represented by a straight line.‬
‭2.‬ ‭Independence: The observations or data points should be independent of each‬
‭other. This assumption assumes that the observations are not influenced by each‬
‭other.‬
‭3.‬ ‭Homoscedasticity: The variance of the residuals (the differences between the‬
‭predicted and actual values) should be constant across all levels of the‬
‭independent variables. In other words, the spread of the residuals should be‬
‭consistent.‬
‭4.‬ ‭Normality: The residuals should follow a normal distribution. This assumption is‬
‭important for conducting statistical inference and hypothesis testing.‬

‭The general equation for simple linear regression, with a single independent variable,‬
‭can be represented as:‬
‭y = β₀ + β₁x + ɛ‬
‭Where:‬
‭●‬ ‭y is the dependent variable‬
‭●‬ ‭x is the independent variable‬
‭●‬ ‭β₀ is the y-intercept (the value of y when x is zero)‬
‭●‬ ‭β₁ is the slope (the change in y for a unit change in x)‬
‭●‬ ‭ɛ is the error term, representing the deviations of the actual values from the‬
‭predicted values‬
‭The goal of linear regression is to estimate the values of β₀ and β₁ that minimize the‬
‭sum of squared errors, which is achieved using various optimization techniques such as‬
‭ordinary least squares (OLS).‬

‭Linear regression can be further extended to multiple linear regression, where there are‬
‭multiple independent variables. The equation takes the form:‬
‭y = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ + ɛ‬
‭Where x₁, x₂, ..., xₚ are the independent variables, and β₁, β₂, ..., βₚ are their respective‬
‭slopes.‬
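For simple linear regression, OLS has a closed form: β₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and β₀ = ȳ − β₁x̄. A plain-Python sketch (the data below are made up, generated from y = 1 + 2x with no noise, so OLS recovers the line exactly):

```python
def fit_simple_ols(xs, ys):
    """Closed-form OLS estimates for the model y = b0 + b1*x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    b1 = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
          / sum((x - x_mean) ** 2 for x in xs))
    b0 = y_mean - b1 * x_mean
    return b0, b1

b0, b1 = fit_simple_ols([0, 1, 2, 3], [1, 3, 5, 7])
print(b0, b1)  # 1.0 2.0
```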

‭Types of Linear Regression‬

‭Linear regression can be further divided into two types of the algorithm:‬
‭○‬ ‭Simple Linear Regression:‬
‭If a single independent variable is used to predict the value of a numerical‬
‭dependent variable, then such a Linear Regression algorithm is called Simple‬
‭Linear Regression.‬
‭○‬ ‭Multiple Linear regression:‬
‭If more than one independent variable is used to predict the value of a numerical‬
‭dependent variable, then such a Linear Regression algorithm is called Multiple‬
‭Linear Regression.‬

‭Linear Regression Line‬


‭A linear line showing the relationship between the dependent and independent variables‬
‭is called a regression line. A regression line can show two types of relationship:‬
‭○‬ ‭Positive Linear Relationship:‬
‭If the dependent variable increases on the Y-axis and independent variable‬
‭increases on X-axis, then such a relationship is termed as a Positive linear‬
‭relationship.‬

‭○‬ ‭Negative Linear Relationship:‬


‭If the dependent variable decreases on the Y-axis and independent variable‬
‭increases on the X-axis, then such a relationship is called a negative linear‬
‭relationship.‬
GRADIENT DESCENT

‭Gradient‬‭Descent‬‭is‬‭known‬‭as‬‭one‬‭of‬‭the‬‭most‬‭commonly‬‭used‬‭optimization‬‭algorithms‬
‭to‬ ‭train‬ ‭machine‬ ‭learning‬ ‭models‬ ‭by‬ ‭means‬ ‭of‬ ‭minimizing‬ ‭errors‬‭between‬‭actual‬‭and‬
‭expected results. Further, gradient descent is also used to train Neural Networks.‬

‭In‬ ‭mathematical‬ ‭terminology,‬ ‭Optimization‬ ‭algorithm‬ ‭refers‬ ‭to‬ ‭the‬ ‭task‬ ‭of‬
‭minimizing/maximizing‬ ‭an‬ ‭objective‬ ‭function‬ ‭f(x)‬ ‭parameterized‬ ‭by‬ ‭x.‬ ‭Similarly,‬ ‭in‬
‭machine‬ ‭learning,‬ ‭optimization‬ ‭is‬ ‭the‬ ‭task‬ ‭of‬ ‭minimizing‬ ‭the‬ ‭cost‬ ‭function‬
‭parameterized‬‭by‬‭the‬‭model's‬‭parameters.‬‭The‬‭main‬‭objective‬‭of‬‭gradient‬‭descent‬‭is‬‭to‬
‭minimize‬ ‭the‬ ‭convex‬ ‭function‬ ‭using‬ ‭iteration‬ ‭of‬ ‭parameter‬ ‭updates.‬ ‭Once‬ ‭these‬
‭machine‬ ‭learning‬ ‭models‬ ‭are‬ ‭optimized,‬ ‭these‬‭models‬‭can‬‭be‬‭used‬‭as‬‭powerful‬‭tools‬
‭for Artificial Intelligence and various computer science applications.‬

‭What is Gradient Descent or Steepest Descent?‬


Gradient descent was initially discovered by Augustin-Louis Cauchy in the mid-19th century. Gradient Descent is defined as one of the most commonly used iterative optimization algorithms of machine learning to train machine learning and deep learning models. It helps in finding the local minimum of a function.

‭The‬ ‭best‬ ‭way‬ ‭to‬ ‭define‬ ‭the‬ ‭local‬ ‭minimum‬ ‭or‬ ‭local‬ ‭maximum‬ ‭of‬ ‭a‬ ‭function‬ ‭using‬
‭gradient descent is as follows:‬

‭○‬ ‭If we move towards a negative gradient or away from the gradient of the‬
‭function at the current point, it will give the local minimum of that function.‬

‭○‬ ‭Whenever we move towards a positive gradient or towards the gradient of the‬
‭function at the current point, we will get the local maximum of that function.‬

Moving towards the positive gradient, as in the second case, is known as Gradient Ascent; moving against the gradient, as in the first, is Gradient Descent, also known as steepest descent. The main objective of using a gradient descent algorithm is to minimize the cost function using iteration. To achieve this goal, it performs two steps iteratively:

‭○‬ ‭Calculates the first-order derivative of the function to compute the gradient or‬
‭slope of that function.‬

○ Moves in the direction opposite to the gradient, i.e., away from the increasing slope, stepping from the current point by alpha times the gradient, where alpha is defined as the learning rate.
‭It is a tuning parameter in the optimization process which helps to decide the‬
‭length of the steps.‬
‭What is Cost-function?‬

‭The‬ ‭cost‬ ‭function‬ ‭is‬‭defined‬‭as‬‭the‬‭measurement‬‭of‬‭difference‬‭or‬‭error‬‭between‬‭actual‬


‭values‬ ‭and‬ ‭expected‬ ‭values‬ ‭at‬ ‭the‬ ‭current‬ ‭position‬ ‭and‬ ‭present‬‭in‬‭the‬‭form‬‭of‬‭a‬‭single‬
‭real‬‭number.‬‭It‬‭helps‬‭to‬‭increase‬‭and‬‭improve‬‭machine‬‭learning‬‭efficiency‬‭by‬‭providing‬
‭feedback‬ ‭to‬ ‭this‬ ‭model‬ ‭so‬ ‭that‬ ‭it‬ ‭can‬ ‭minimize‬ ‭error‬ ‭and‬ ‭find‬ ‭the‬ ‭local‬ ‭or‬ ‭global‬
‭minimum.‬‭Further,‬‭it‬‭continuously‬‭iterates‬‭along‬‭the‬‭direction‬‭of‬‭the‬‭negative‬‭gradient‬
‭until‬‭the‬‭cost‬‭function‬‭approaches‬‭zero.‬‭At‬‭this‬‭steepest‬‭descent‬‭point,‬‭the‬‭model‬‭will‬
‭stop‬ ‭learning‬ ‭further.‬ ‭Although‬ ‭cost‬ ‭function‬ ‭and‬ ‭loss‬ ‭function‬ ‭are‬ ‭considered‬
‭synonymous,‬ ‭also‬ ‭there‬ ‭is‬ ‭a‬ ‭minor‬ ‭difference‬ ‭between‬ ‭them.‬ ‭The‬ ‭slight‬ ‭difference‬
‭between‬‭the‬‭loss‬‭function‬‭and‬‭the‬‭cost‬‭function‬‭is‬‭about‬‭the‬‭error‬‭within‬‭the‬‭training‬‭of‬
‭machine‬‭learning‬‭models,‬‭as‬‭loss‬‭function‬‭refers‬‭to‬‭the‬‭error‬‭of‬‭one‬‭training‬‭example,‬
‭while a cost function calculates the average error across an entire training set.‬

‭The‬ ‭cost‬ ‭function‬ ‭is‬ ‭calculated‬ ‭after‬‭making‬‭a‬‭hypothesis‬‭with‬‭initial‬‭parameters‬‭and‬


‭modifying‬ ‭these‬ ‭parameters‬ ‭using‬ ‭gradient‬ ‭descent‬ ‭algorithms‬ ‭over‬ ‭known‬ ‭data‬ ‭to‬
‭reduce the cost function.‬

‭How does Gradient Descent work?‬

‭Before‬‭starting‬‭the‬‭working‬‭principle‬‭of‬‭gradient‬‭descent,‬‭we‬‭should‬‭know‬‭some‬‭basic‬
‭concepts‬‭to‬‭find‬‭out‬‭the‬‭slope‬‭of‬‭a‬‭line‬‭from‬‭linear‬‭regression.‬‭The‬‭equation‬‭for‬‭simple‬
‭linear regression is given as:‬

‭1.‬ ‭Y‬‭=‭m
‬ X‬‭+c‬

‭Where‬ ‭'m'‬ ‭represents‬ ‭the‬ ‭slope‬ ‭of‬ ‭the‬ ‭line,‬ ‭and‬ ‭'c'‬ ‭represents‬ ‭the‬ ‭intercepts‬ ‭on‬ ‭the‬
‭y-axis.‬
The starting point (shown in the above figure) is used to evaluate the performance, as it is
‭considered‬ ‭just‬ ‭as‬ ‭an‬ ‭arbitrary‬ ‭point.‬ ‭At‬ ‭this‬ ‭starting‬ ‭point,‬ ‭we‬ ‭will‬ ‭derive‬ ‭the‬ ‭first‬
‭derivative‬‭or‬‭slope‬‭and‬‭then‬‭use‬‭a‬‭tangent‬‭line‬‭to‬‭calculate‬‭the‬‭steepness‬‭of‬‭this‬‭slope.‬
‭Further, this slope will inform the updates to the parameters (weights and bias).‬

‭The‬ ‭slope‬ ‭becomes‬ ‭steeper‬‭at‬‭the‬‭starting‬‭point‬‭or‬‭arbitrary‬‭point,‬‭but‬‭whenever‬‭new‬


parameters are generated, the steepness gradually reduces until it approaches the lowest point, which is called the point of convergence.

‭The‬ ‭main‬ ‭objective‬ ‭of‬ ‭gradient‬ ‭descent‬ ‭is‬ ‭to‬ ‭minimize‬ ‭the‬ ‭cost‬ ‭function‬ ‭or‬ ‭the‬ ‭error‬
‭between‬ ‭expected‬ ‭and‬ ‭actual.‬ ‭To‬ ‭minimize‬ ‭the‬ ‭cost‬ ‭function,‬ ‭two‬ ‭data‬ ‭points‬ ‭are‬
‭required:‬

‭○‬ ‭Direction & Learning Rate‬

‭These‬ ‭two‬ ‭factors‬ ‭are‬ ‭used‬ ‭to‬ ‭determine‬ ‭the‬ ‭partial‬ ‭derivative‬ ‭calculation‬ ‭of‬ ‭future‬
‭iteration‬‭and‬‭allow‬‭it‬‭to‬‭the‬‭point‬‭of‬‭convergence‬‭or‬‭local‬‭minimum‬‭or‬‭global‬‭minimum.‬
‭Let's discuss learning rate factors in brief;‬

‭Learning Rate:‬
‭It‬ ‭is‬ ‭defined‬ ‭as‬ ‭the‬ ‭step‬ ‭size‬ ‭taken‬ ‭to‬ ‭reach‬ ‭the‬ ‭minimum‬ ‭or‬ ‭lowest‬ ‭point.‬ ‭This‬ ‭is‬
‭typically‬‭a‬‭small‬‭value‬‭that‬‭is‬‭evaluated‬‭and‬‭updated‬‭based‬‭on‬‭the‬‭behavior‬‭of‬‭the‬‭cost‬
‭function.‬‭If‬‭the‬‭learning‬‭rate‬‭is‬‭high,‬‭it‬‭results‬‭in‬‭larger‬‭steps‬‭but‬‭also‬‭leads‬‭to‬‭risks‬‭of‬
overshooting the minimum. Conversely, a low learning rate takes small step sizes, which compromises overall efficiency but gives the advantage of more precision.
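The effect of the learning rate can be demonstrated on a toy cost function J(w) = (w − 3)², whose gradient is 2(w − 3); the rates and step count below are arbitrary illustrative values:

```python
def gradient_descent(lr, start=0.0, steps=50):
    """Minimize J(w) = (w - 3)^2 by stepping against the gradient 2*(w - 3)."""
    w = start
    for _ in range(steps):
        w -= lr * 2.0 * (w - 3.0)
    return w

print(gradient_descent(lr=0.1))   # converges close to the minimum at w = 3
print(gradient_descent(lr=1.1))   # too large: overshoots and diverges from 3
```

A moderate rate shrinks the distance to the minimum at every step, while an overly large rate makes each step overshoot by more than it gains.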

‭Types of Gradient Descent‬


‭Based‬‭on‬‭the‬‭error‬‭in‬‭various‬‭training‬‭models,‬‭the‬‭Gradient‬‭Descent‬‭learning‬‭algorithm‬
‭can‬ ‭be‬ ‭divided‬ ‭into‬ ‭Batch‬ ‭gradient‬ ‭descent,‬ ‭stochastic‬ ‭gradient‬ ‭descent,‬ ‭and‬
‭mini-batch‬ ‭gradient‬ ‭descent.‬ ‭Let's‬ ‭understand‬ ‭these‬ ‭different‬ ‭types‬ ‭of‬ ‭gradient‬
‭descent:‬

‭1. Batch Gradient Descent:‬

‭Batch‬‭gradient‬‭descent‬‭(BGD)‬‭is‬‭used‬‭to‬‭find‬‭the‬‭error‬‭for‬‭each‬‭point‬‭in‬‭the‬‭training‬‭set‬
‭and‬ ‭update‬‭the‬‭model‬‭after‬‭evaluating‬‭all‬‭training‬‭examples.‬‭This‬‭procedure‬‭is‬‭known‬
‭as‬‭the‬‭training‬‭epoch.‬‭In‬‭simple‬‭words,‬‭it‬‭is‬‭a‬‭greedy‬‭approach‬‭where‬‭we‬‭have‬‭to‬‭sum‬
‭over all examples for each update.‬
‭Advantages of Batch gradient descent:‬

‭○‬ ‭It produces less noise in comparison to other gradient descent.‬

‭○‬ ‭It produces stable gradient descent convergence.‬

○ It is computationally efficient, as all resources are used for all training samples.

‭2. Stochastic gradient descent‬

Stochastic gradient descent (SGD) is a type of gradient descent that runs one training example per iteration: it updates the model's parameters for each training example within the dataset, one at a time. As it requires only one training example at a time, it is easier to store in allocated memory. However, it loses some computational efficiency compared to batch gradient descent, because its frequent updates are more costly overall. These frequent updates also produce a noisy gradient; sometimes, however, that noise can be helpful for escaping local minima and finding the global minimum.

‭Advantages of Stochastic gradient descent:‬

‭In‬ ‭Stochastic‬ ‭gradient‬ ‭descent‬ ‭(SGD),‬ ‭learning‬ ‭happens‬ ‭on‬ ‭every‬ ‭example,‬ ‭and‬ ‭it‬
‭consists of a few advantages over other gradient descent.‬

‭○‬ ‭It is easier to allocate in desired memory.‬

‭○‬ ‭It is relatively fast to compute than batch gradient descent.‬

‭○‬ ‭It is more efficient for large datasets.‬

‭3. MiniBatch Gradient Descent:‬

‭Mini‬ ‭Batch‬ ‭gradient‬ ‭descent‬ ‭is‬ ‭the‬ ‭combination‬ ‭of‬ ‭both‬ ‭batch‬ ‭gradient‬ ‭descent‬ ‭and‬
‭stochastic‬‭gradient‬‭descent.‬‭It‬‭divides‬‭the‬‭training‬‭datasets‬‭into‬‭small‬‭batch‬‭sizes‬‭then‬
performs the updates on those batches separately. Splitting the training dataset into smaller batches strikes a balance between the computational efficiency of batch gradient descent and the speed of stochastic gradient descent. Hence, we achieve a type of gradient descent with higher computational efficiency and a less noisy gradient.

‭Advantages of Mini Batch gradient descent:‬

‭○‬ ‭It is easier to fit in allocated memory.‬

‭○‬ ‭It is computationally efficient.‬

‭○‬ ‭It produces stable gradient descent convergence.‬
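The three variants differ only in how many examples feed each update. A sketch of mini-batch gradient descent for a one-parameter linear model (the batch size, learning rate, and data below are illustrative; setting batch_size to the dataset size gives batch gradient descent, and batch_size = 1 gives SGD):

```python
import random

def minibatch_gd(xs, ys, lr=0.05, batch_size=2, epochs=200, seed=0):
    """Mini-batch gradient descent for the model y = w*x with
    squared-error loss; the gradient of (w*x - y)^2 is 2*(w*x - y)*x."""
    rng = random.Random(seed)
    w = 0.0
    data = list(zip(xs, ys))
    for _ in range(epochs):
        rng.shuffle(data)                      # new batch split each epoch
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad                     # one update per batch
    return w

# Data generated from y = 2x, so the fitted weight should be close to 2.
print(minibatch_gd([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]))
```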

‭Challenges with the Gradient Descent‬


‭Although‬ ‭we‬ ‭know‬ ‭Gradient‬ ‭Descent‬ ‭is‬ ‭one‬ ‭of‬ ‭the‬ ‭most‬ ‭popular‬ ‭methods‬ ‭for‬
‭optimization‬‭problems,‬‭it‬‭still‬‭also‬‭has‬‭some‬‭challenges.‬‭There‬‭are‬‭a‬‭few‬‭challenges‬‭as‬
‭follows:‬

‭1. Local Minima and Saddle Point:‬

‭For‬ ‭convex‬ ‭problems,‬ ‭gradient‬ ‭descent‬ ‭can‬ ‭find‬ ‭the‬ ‭global‬ ‭minimum‬‭easily,‬‭while‬‭for‬
‭non-convex‬ ‭problems,‬ ‭it‬ ‭is‬‭sometimes‬‭difficult‬‭to‬‭find‬‭the‬‭global‬‭minimum,‬‭where‬‭the‬
‭machine learning models achieve the best results.‬
‭Whenever‬ ‭the‬ ‭slope‬ ‭of‬ ‭the‬ ‭cost‬ ‭function‬ ‭is‬ ‭at‬ ‭zero‬ ‭or‬ ‭just‬ ‭close‬ ‭to‬ ‭zero,‬ ‭this‬ ‭model‬
‭stops‬ ‭learning‬ ‭further.‬ ‭Apart‬ ‭from‬ ‭the‬ ‭global‬ ‭minimum,‬ ‭there‬ ‭occur‬ ‭some‬ ‭scenarios‬
that can show this slope: the saddle point and the local minimum. Local minima generate a shape similar to the global minimum, where the slope of the cost function increases on both sides of the current point.

‭In‬ ‭contrast,‬ ‭with‬ ‭saddle‬ ‭points,‬ ‭the‬ ‭negative‬ ‭gradient‬ ‭only‬ ‭occurs‬ ‭on‬ ‭one‬ ‭side‬ ‭of‬ ‭the‬
‭point,‬ ‭which‬ ‭reaches‬ ‭a‬‭local‬‭maximum‬‭on‬‭one‬‭side‬‭and‬‭a‬‭local‬‭minimum‬‭on‬‭the‬‭other‬
side. The saddle point takes its name from the shape of a horse's saddle.

‭The‬‭name‬‭of‬‭local‬‭minima‬‭is‬‭because‬‭the‬‭value‬‭of‬‭the‬‭loss‬‭function‬‭is‬‭minimum‬‭at‬‭that‬
‭point‬‭in‬‭a‬‭local‬‭region.‬‭In‬‭contrast,‬‭the‬‭name‬‭of‬‭the‬‭global‬‭minima‬‭is‬‭given‬‭so‬‭because‬
‭the‬ ‭value‬ ‭of‬ ‭the‬ ‭loss‬ ‭function‬ ‭is‬‭minimum‬‭there,‬‭globally‬‭across‬‭the‬‭entire‬‭domain‬‭of‬
‭the loss function.‬

‭2. Vanishing and Exploding Gradient‬


‭In‬ ‭a‬ ‭deep‬ ‭neural‬ ‭network,‬ ‭if‬ ‭the‬ ‭model‬ ‭is‬ ‭trained‬ ‭with‬ ‭gradient‬ ‭descent‬ ‭and‬
‭backpropagation,‬‭there‬‭can‬‭occur‬‭two‬‭more‬‭issues‬‭other‬‭than‬‭local‬‭minima‬‭and‬‭saddle‬
‭point.‬

‭Vanishing Gradients:‬

A vanishing gradient occurs when the gradient is smaller than expected. During backpropagation, the gradient becomes progressively smaller, causing the earlier layers of the network to learn more slowly than the later layers. Once this happens, the weight updates become insignificant and the earlier layers effectively stop learning.

‭Exploding Gradient:‬

The exploding gradient problem is the opposite of the vanishing gradient problem: it occurs when the gradient grows too large, which creates an unstable model. In this scenario, the model weights grow so large that they may overflow and end up represented as NaN. This problem can be mitigated with techniques such as gradient clipping, or by dimensionality reduction, which helps to minimize complexity within the model.

‭Batch Gradient Descent‬

‭Batch gradient descent is an optimization algorithm used to minimize the cost function‬
‭in machine learning and deep learning models. It operates by calculating the gradient of‬
‭the cost function with respect to the model parameters using the entire training dataset‬
‭at each iteration. The parameters are then updated based on the average gradient‬
‭across all the training examples.‬
‭Here's a step-by-step overview of the batch gradient descent algorithm:‬
‭1.‬ ‭Initialization: Initialize the parameters of the model with some initial values.‬
‭2.‬ ‭Compute the cost function: Evaluate the cost function, which measures the‬
‭discrepancy between the model's predictions and the actual values of the training‬
‭data.‬
‭3.‬ ‭Compute the gradient: Calculate the gradient of the cost function with respect to‬
‭each parameter. This involves taking the derivative of the cost function with‬
‭respect to each parameter, considering all the training examples.‬
‭4.‬ ‭Update the parameters: Adjust the parameters by subtracting the learning rate‬
‭(α) times the average gradient from the current parameter values. The learning‬
‭rate determines the step size of the update and controls the convergence speed‬
‭of the algorithm.‬
‭Parameters_new = Parameters_old - α * Average(Gradient)‬
‭5.‬ ‭Repeat steps 2-4: Iterate the process by recalculating the cost function, gradient,‬
‭and updating the parameters until a stopping criterion is met. This criterion can‬
‭be a maximum number of iterations or reaching a specific threshold for the cost‬
‭function.‬
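The steps above can be sketched for simple linear regression as follows; the toy dataset, learning rate, and iteration count are illustrative assumptions, not values prescribed by these notes:

```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.1, n_iters=1000):
    """Minimize mean squared error for a linear model y ≈ Xw + b,
    using the ENTIRE training set at every iteration."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0                   # step 1: initialization
    for _ in range(n_iters):
        error = X @ w + b - y                 # step 2: predictions vs. targets
        grad_w = (2 / n) * X.T @ error        # step 3: average gradient
        grad_b = (2 / n) * error.sum()
        w -= lr * grad_w                      # step 4: Parameters_new =
        b -= lr * grad_b                      #   Parameters_old - α * gradient
    return w, b

# Toy data generated from y = 3x + 2 (an illustrative assumption)
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))
y = 3 * X[:, 0] + 2
w, b = batch_gradient_descent(X, y)
print(w, b)  # approaches [3.] and 2.0
```

Because the data here are noise-free and the cost is convex, the fitted parameters converge to the true values.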
‭Batch gradient descent has a few advantages and considerations:‬
‭Advantages:‬
●　Convergence to the global minimum: for convex cost functions, batch gradient descent converges to the global minimum, since each update follows the exact gradient computed over the entire training dataset.
‭●‬ ‭Stable updates: The updates based on the average gradient tend to be smoother‬
‭and more stable compared to stochastic gradient descent.‬
‭●‬ ‭Suitable for small datasets: Batch gradient descent is computationally feasible for‬
‭small to medium-sized datasets, where the memory can accommodate the entire‬
‭training dataset.‬
‭Considerations:‬
‭●‬ ‭Computational cost: As batch gradient descent uses the entire training dataset to‬
‭compute the gradient, it can be computationally expensive for large datasets.‬
‭●‬ ‭Memory requirements: The algorithm requires storing the entire training dataset‬
‭in memory, which can be challenging for datasets that don't fit in memory.‬
‭●‬ ‭Lack of parallelization: Since the gradient computation depends on the entire‬
‭dataset, it is not easily parallelizable across multiple processors or distributed‬
‭systems.‬
Overall, batch gradient descent is a reliable optimization algorithm for models with relatively small datasets. It provides stable updates and, for convex cost functions, converges to the global minimum. However, its computational cost and memory requirements make it
‭less efficient for large-scale datasets. In such cases, stochastic gradient descent or‬
‭mini-batch gradient descent are often preferred alternatives.‬

‭Stochastic Gradient Descent‬

‭Stochastic Gradient Descent (SGD) is an optimization algorithm commonly used in‬


‭machine learning to minimize the cost function of a model. Unlike batch gradient‬
‭descent, which computes the gradient using the entire training dataset, SGD updates‬
‭the model parameters based on the gradient of the cost function computed on a single‬
‭training example at a time.‬
‭Here's a step-by-step overview of the stochastic gradient descent algorithm:‬
‭1.‬ ‭Initialization: Initialize the parameters of the model with some initial values.‬
‭2.‬ ‭Shuffle the training dataset: Randomly shuffle the training examples to introduce‬
‭randomness in the order of processing.‬
‭3.‬ ‭Iterate over the training examples: For each training example, perform the‬
‭following steps:‬
‭a. Compute the cost function: Evaluate the cost function using the current model‬
‭parameters and the training example.‬
‭b. Compute the gradient: Calculate the gradient of the cost function with respect‬
‭to each parameter using only the current training example.‬
‭c. Update the parameters: Adjust the parameters by subtracting the learning rate‬
‭(α) times the gradient from the current parameter values.‬
‭Parameters_new = Parameters_old - α * Gradient‬
‭4.‬ ‭Repeat steps 3a-3c: Iterate over the entire training dataset, processing one‬
‭example at a time, until a stopping criterion is met. This criterion can be a‬
‭maximum number of iterations or reaching a specific threshold for the cost‬
‭function.‬
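The per-example update loop above can be sketched as follows; the toy dataset, learning rate, and epoch count are illustrative assumptions:

```python
import numpy as np

def sgd(X, y, lr=0.05, n_epochs=50, seed=0):
    """Stochastic gradient descent for linear regression: parameters are
    updated from ONE training example at a time."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0                 # step 1: initialization
    for _ in range(n_epochs):
        for i in rng.permutation(n):        # step 2: shuffle each epoch
            error = (X[i] @ w + b) - y[i]   # steps 3a/3b: a single example
            w -= lr * 2 * error * X[i]      # step 3c: noisy per-example update
            b -= lr * 2 * error
    return w, b

# Toy data generated from y = 3x + 2 (an illustrative assumption)
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 1))
y = 3 * X[:, 0] + 2
w, b = sgd(X, y)
print(w, b)
```

Each update uses only one example, so individual steps are noisy, but over many epochs the parameters settle near the true values.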
‭Stochastic gradient descent has a few advantages and considerations:‬
‭Advantages:‬
‭●‬ ‭Computationally efficient: Processing one training example at a time makes SGD‬
‭computationally efficient, especially for large datasets, as it requires less memory‬
‭and computational resources compared to batch gradient descent.‬
‭●‬ ‭Ability to escape local minima: The randomness introduced by processing‬
‭examples individually allows SGD to escape local minima and potentially find‬
‭better solutions.‬
‭●‬ ‭Online learning: SGD lends itself well to online learning scenarios, where new‬
‭data arrives continuously, as it can update the model incrementally as new‬
‭examples come in.‬
‭Considerations:‬
‭●‬ ‭Noisy updates: Since the updates are based on a single training example, the‬
‭gradient estimates can be noisy, which can lead to oscillations or slower‬
‭convergence compared to batch gradient descent.‬
‭●‬ ‭Learning rate tuning: Selecting an appropriate learning rate is crucial for SGD. If‬
‭the learning rate is too high, the algorithm may fail to converge, while a learning‬
‭rate that is too low can result in slow convergence.‬
‭●‬ ‭Lack of convergence to global minimum: SGD does not guarantee convergence‬
‭to the global minimum due to the inherent noise in the gradient estimates.‬
‭However, it often converges to a good solution and is widely used in practice.‬
‭To mitigate the noise and improve convergence, variations of SGD, such as mini-batch‬
‭gradient descent, are commonly used. Mini-batch SGD computes the gradient based on‬
‭a small subset (mini-batch) of training examples, striking a balance between efficiency‬
‭and stability.‬
‭In summary, stochastic gradient descent is a popular optimization algorithm for training‬
‭machine learning models. It is computationally efficient and can handle large datasets.‬
‭However, it trades off some stability and precision for speed, and careful tuning of the‬
‭learning rate is necessary for optimal performance.‬
Mini-batch Gradient Descent

‭Mini-batch Gradient Descent is an optimization algorithm that combines the advantages‬


‭of both Batch Gradient Descent and Stochastic Gradient Descent. It computes the‬
‭gradient and updates the model parameters based on a small subset or mini-batch of‬
‭training examples at each iteration. The mini-batch size is typically between 10 and‬
‭1,000, striking a balance between the efficiency of SGD and the stability of batch‬
‭gradient descent.‬
‭Here's an overview of the Mini-batch Gradient Descent algorithm:‬
‭1.‬ ‭Initialization: Initialize the parameters of the model with some initial values.‬
‭2.‬ ‭Shuffle the training dataset: Randomly shuffle the training examples to introduce‬
‭randomness in the order of processing.‬
‭3.‬ ‭Partition the dataset into mini-batches: Divide the shuffled training dataset into‬
‭smaller mini-batches, each containing a predefined number of examples.‬
‭4.‬ ‭Iterate over the mini-batches: For each mini-batch, perform the following steps:‬
‭a. Compute the cost function: Evaluate the cost function using the current model‬
‭parameters and the examples in the mini-batch.‬
‭b. Compute the gradient: Calculate the gradient of the cost function with respect‬
‭to each parameter using the examples in the mini-batch.‬
‭c. Update the parameters: Adjust the parameters by subtracting the learning rate‬
‭(α) times the gradient from the current parameter values.‬
‭Parameters_new = Parameters_old - α * Gradient‬
‭5.‬ ‭Repeat steps 4a-4c: Iterate over the mini-batches, processing one mini-batch at‬
‭a time, until a stopping criterion is met. This criterion can be a maximum number‬
‭of iterations or reaching a specific threshold for the cost function.‬
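The shuffle–partition–update loop above can be sketched as follows; the toy dataset, learning rate, batch size, and epoch count are illustrative assumptions:

```python
import numpy as np

def minibatch_gd(X, y, lr=0.1, batch_size=32, n_epochs=100, seed=0):
    """Mini-batch gradient descent for linear regression: each update
    averages the gradient over a small, randomly drawn batch."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0                      # step 1: initialization
    for _ in range(n_epochs):
        order = rng.permutation(n)               # step 2: shuffle
        for start in range(0, n, batch_size):    # step 3: partition
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            error = Xb @ w + b - yb              # steps 4a/4b on the batch
            w -= lr * (2 / len(idx)) * Xb.T @ error  # step 4c: averaged gradient
            b -= lr * (2 / len(idx)) * error.sum()
    return w, b

# Toy data generated from y = 3x + 2 (an illustrative assumption)
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 1))
y = 3 * X[:, 0] + 2
w, b = minibatch_gd(X, y)
print(w, b)
```

Averaging over a batch smooths the updates relative to pure SGD while keeping each step cheap.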
‭Mini-batch Gradient Descent combines the efficiency of processing multiple examples in‬
‭parallel (like batch gradient descent) with the improved stability and faster convergence‬
‭per iteration (like stochastic gradient descent). It benefits from the noise reduction and‬
‭smoother updates from the mini-batches, which can result in faster convergence and‬
‭better generalization compared to stochastic gradient descent.‬
‭The choice of mini-batch size is crucial. A small mini-batch size introduces more noise‬
‭but provides faster updates, while a larger mini-batch size reduces the noise but slows‬
‭down the updates. The mini-batch size is often determined through experimentation,‬
‭balancing the computational efficiency and convergence speed of the algorithm.‬
‭Mini-batch Gradient Descent is widely used in deep learning, where datasets are‬
‭typically large and can contain millions of examples. It leverages parallel processing‬
‭capabilities and strikes a balance between efficiency and stability, making it a popular‬
‭choice for training neural networks and other complex models.‬

DIFFERENCE BETWEEN THE THREE GRADIENT DESCENT VARIANTS

‭1.‬ ‭Batch Gradient Descent:‬


‭●‬ ‭Computes the gradient of the cost function using the entire training dataset.‬
‭●‬ ‭Updates the model parameters based on the average gradient across all training‬
‭examples.‬
●　Guarantees convergence to the global minimum for convex cost functions, but can be computationally expensive for large datasets.
‭2.‬ ‭Stochastic Gradient Descent:‬
‭●‬ ‭Computes the gradient and updates the model parameters for each training‬
‭example individually.‬
‭●‬ ‭Updates the parameters based on the gradient of a single training example.‬
‭●‬ ‭Converges faster per iteration but does not guarantee convergence to the global‬
‭minimum due to the noisy nature of individual training examples.‬
‭●‬ ‭Computationally efficient, especially for large datasets, as it processes one‬
‭training example at a time.‬
‭3.‬ ‭Mini-batch Gradient Descent:‬
‭●‬ ‭Computes the gradient and updates the model parameters based on a small‬
‭subset or mini-batch of training examples.‬
‭●‬ ‭Strikes a balance between the efficiency of batch gradient descent and the‬
‭stability of stochastic gradient descent.‬
‭●‬ ‭Typically uses mini-batch sizes between 10 and 1,000.‬
‭●‬ ‭Provides faster convergence compared to batch gradient descent and reduces‬
‭the noise and oscillations in updates compared to stochastic gradient descent.‬
‭●‬ ‭Widely used in deep learning and other models with large datasets.‬
‭In summary, batch gradient descent processes the entire dataset at each iteration,‬
‭stochastic gradient descent processes one example at a time, and mini-batch gradient‬
‭descent processes a small subset of examples at each iteration. Batch gradient descent‬
‭guarantees convergence but can be computationally expensive. Stochastic gradient‬
‭descent is computationally efficient but introduces noise. Mini-batch gradient descent‬
‭combines efficiency and stability by processing mini-batches of examples, striking a‬
‭balance between the two extremes.‬
‭Polynomial Regression‬

‭Polynomial regression is a form of regression analysis where the relationship between‬


‭the independent variable(s) and the dependent variable is modeled as an nth degree‬
‭polynomial. It is an extension of linear regression that allows for nonlinear relationships‬
‭between the variables.‬
‭In polynomial regression, the model takes the form:‬
‭y = β₀ + β₁x + β₂x² + ... + βₙxⁿ + ε‬
‭where:‬
‭●‬ ‭y is the dependent variable.‬
‭●‬ ‭x is the independent variable.‬
‭●‬ ‭β₀, β₁, β₂, ..., βₙ are the coefficients of the polynomial terms.‬
‭●‬ ‭ε is the error term.‬
‭The key steps involved in polynomial regression are:‬
‭1.‬ ‭Data preparation: Prepare the dataset by collecting the independent and‬
‭dependent variables.‬
‭2.‬ ‭Feature engineering: Create additional polynomial terms by raising the‬
‭independent variable to different powers (e.g., x², x³) to capture the nonlinear‬
‭relationships.‬
‭3.‬ ‭Model fitting: Use the prepared dataset and the polynomial features to fit the‬
‭regression model. This involves estimating the coefficients (β₀, β₁, β₂, ..., βₙ) that‬
‭minimize the error between the predicted values and the actual values.‬
‭4.‬ ‭Model evaluation: Assess the quality of the polynomial regression model using‬
‭evaluation metrics such as mean squared error (MSE), R-squared, or adjusted‬
‭R-squared. These metrics measure how well the model fits the data and its‬
‭predictive performance.‬
‭5.‬ ‭Prediction: Utilize the trained polynomial regression model to make predictions‬
‭on new or unseen data.‬
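The steps above can be sketched with NumPy; the quadratic dataset and degree below are illustrative assumptions:

```python
import numpy as np

# Step 1: quadratic data y = 1 + 2x + 3x² (coefficients chosen for illustration)
x = np.linspace(-2, 2, 50)
y = 1 + 2 * x + 3 * x**2

# Step 2 (feature engineering): design matrix with columns [1, x, x²]
degree = 2
X_poly = np.vander(x, degree + 1, increasing=True)

# Step 3 (model fitting): least-squares estimate of β₀, β₁, β₂
beta, *_ = np.linalg.lstsq(X_poly, y, rcond=None)

# Step 5 (prediction) at a new point x = 3
y_pred = np.vander(np.array([3.0]), degree + 1, increasing=True) @ beta
print(beta, y_pred)  # beta ≈ [1, 2, 3], y_pred ≈ [34]
```

Because the polynomial model is linear in the coefficients, ordinary least squares fits it directly once the polynomial features are built.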

‭Need for Polynomial Regression:‬

‭The need of Polynomial Regression in ML can be understood in the below points:‬

○　If we apply a linear model on a linear dataset, it provides a good result, as we have seen in Simple Linear Regression. But if we apply the same model, without any modification, on a non-linear dataset, it will produce poor results: the loss function will increase, the error rate will be high, and accuracy will decrease.

○　So for such cases, where data points are arranged in a non-linear fashion, we need the Polynomial Regression model. A straight line fitted to such a dataset hardly covers any data points, whereas a curve, which is what the Polynomial model provides, can cover most of them.

○　Hence, if the dataset is arranged in a non-linear fashion, we should use the Polynomial Regression model instead of Simple Linear Regression.

Note: The Polynomial Regression algorithm is also called Polynomial Linear Regression because the model is linear in its coefficients, which are arranged in a linear fashion, even though it is nonlinear in the input variable.

‭Equation of the Polynomial Regression Model:‬


Simple Linear Regression equation: y = b₀ + b₁x .........(a)

Multiple Linear Regression equation: y = b₀ + b₁x₁ + b₂x₂ + b₃x₃ + .... + bₙxₙ .........(b)

Polynomial Regression equation: y = b₀ + b₁x + b₂x² + b₃x³ + .... + bₙxⁿ ..........(c)
When we compare the above three equations, we can clearly see that all three are polynomial equations, differing only in the degree of the variables. The Simple and Multiple Linear equations are polynomial equations of degree one, and the Polynomial Regression equation is a linear equation with terms up to the nth degree. So if we add higher-degree terms to our linear equation, it is converted into a Polynomial Linear equation.

‭Polynomial regression can be a powerful tool to capture nonlinear relationships between‬


‭variables. By including polynomial terms of higher degrees, the model can fit more‬
‭complex patterns in the data. However, it is essential to select the appropriate degree of‬
‭the polynomial and‬‭avoid overfitting‬‭the model to‬‭the training data.‬
‭It is worth noting that polynomial regression can be sensitive to outliers and‬
‭multicollinearity between the polynomial terms. Preprocessing techniques such as data‬
‭normalization and regularization methods like Ridge or Lasso regression can help‬
‭mitigate these issues.‬
‭Overall, polynomial regression provides a flexible approach to model nonlinear‬
‭relationships in data and is widely used in various fields such as physics, economics,‬
‭and social sciences.‬

‭LEARNING CURVES‬

‭Learning curves are a valuable tool for evaluating and diagnosing the performance of‬
‭machine learning models. They provide insights into how the model's performance‬
‭changes as the training dataset size increases. Learning curves plot the training and‬
‭validation performance metrics (such as accuracy or error) against the number of‬
‭training examples or iterations.‬
‭Here's a general process for creating learning curves:‬
‭1.‬ ‭Vary the training dataset size: Start by training the model with a small subset of‬
‭the training data. Gradually increase the dataset size in predefined intervals or by‬
‭a fixed proportion.‬
‭2.‬ ‭Train the model: For each dataset size, train the model using the corresponding‬
‭subset of the training data.‬
‭3.‬ ‭Evaluate performance: After training, assess the model's performance on both‬
‭the training set and a separate validation set. Compute the desired performance‬
‭metric (e.g., accuracy, error, or loss) for both the training and validation sets.‬
‭4.‬ ‭Repeat steps 2-3: Repeat the training and evaluation process for each dataset‬
‭size, recording the performance metrics.‬
‭5.‬ ‭Plot the learning curves: Visualize the performance metrics on a line plot, with the‬
‭dataset size or iterations on the x-axis and the performance metric on the y-axis.‬
‭Typically, separate curves are plotted for the training and validation performance.‬
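The steps above can be sketched for a linear model, using least squares as the learner and MSE as the metric; the synthetic dataset and size schedule are illustrative assumptions:

```python
import numpy as np

def mse(X, y, beta):
    return np.mean((X @ beta - y) ** 2)

# Synthetic data: y = 1 + 2x plus noise (illustrative values)
rng = np.random.default_rng(0)
X = np.c_[np.ones(500), rng.normal(size=500)]
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.5, size=500)
X_train, y_train = X[:400], y[:400]          # training pool
X_val, y_val = X[400:], y[400:]              # fixed validation set

curve = []
for m in range(10, 401, 50):                 # step 1: vary training size
    beta, *_ = np.linalg.lstsq(X_train[:m], y_train[:m], rcond=None)  # step 2
    curve.append((m,                          # step 3: record both metrics
                  mse(X_train[:m], y_train[:m], beta),
                  mse(X_val, y_val, beta)))
# Step 5 would plot these (size, train_mse, val_mse) triples as two curves.
for m, tr, va in curve:
    print(m, round(tr, 3), round(va, 3))
```

As the training size grows, the validation error settles toward the irreducible noise level, which is exactly the behavior a learning curve is meant to reveal.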

‭Interpreting learning curves can provide several insights into the model's behavior:‬
‭●‬ ‭Bias and Variance: Learning curves can help diagnose whether the model suffers‬
‭from bias (underfitting) or variance (overfitting). If both the training and validation‬
‭performance plateau at a high error, it indicates high bias. If there is a significant‬
‭gap between the training and validation performance, it suggests high variance.‬
‭●‬ ‭Overfitting and Underfitting: Learning curves can show if the model is overfitting‬
‭or underfitting the data. Overfitting is indicated by a large gap between the‬
‭training and validation performance, where the training performance is high while‬
‭the validation performance is low. Underfitting is reflected in low performance for‬
‭both the training and validation sets.‬
‭●‬ ‭Model Improvement: Learning curves illustrate how the model's performance‬
‭changes as more data is added. They can reveal if the model would benefit from‬
‭additional training data or if it has already reached its optimal performance.‬
‭●‬ ‭Generalization: Learning curves can indicate whether the model's performance‬
‭generalizes well to unseen data. If the training and validation performance‬
‭converge and plateau at similar values, it suggests good generalization.‬
‭However, if the performance gap persists, the model may have difficulty‬
‭generalizing beyond the training data.‬
‭By analyzing learning curves, you can make informed decisions regarding model‬
‭improvements, such as adjusting the model architecture, adding more training data, or‬
‭implementing regularization techniques.‬
‭In summary, learning curves provide valuable insights into a machine learning model's‬
‭performance, helping to diagnose issues such as bias, variance, overfitting, and‬
‭underfitting. They aid in understanding the model's behavior, assessing generalization‬
‭capabilities, and guiding improvements in the training process.‬

The most popular example of a learning curve is loss over time. Loss (or cost) measures our model error, or "how bad our model is doing", so the lower the loss becomes, the better our model's performance will be.

Although a loss curve has slight ups and downs, in the long term the loss decreases over time, so the model is learning.

Other examples of very popular learning curves are accuracy, precision, and recall. All of these capture model performance, so the higher they are, the better our model becomes.

The model performance is growing over time, which means the model is improving with experience (it's learning).

We also see it grows at the beginning, but over time it reaches a plateau, meaning it's not able to learn anymore.

‭Multiple Curves:‬
One of the most widely used metric combinations is training loss + validation loss over time.
‭The training loss indicates how well the model is fitting the training data, while the‬
‭validation loss indicates how well the model fits new data.‬

‭We often see these two types of learning curves appearing in charts:‬

●　Optimization Learning Curves: Learning curves calculated on the metric by which the parameters of the model are being optimized, such as loss or Mean Squared Error
‭●‬ ‭Performance Learning Curves: Learning curves calculated on the metric by‬
‭which the model will be evaluated and selected, such as accuracy, precision,‬
‭recall, or F1 score‬

A learning curve can help to find the right amount of training data to fit our model with a good bias-variance trade-off. This is why learning curves are so important.

‭The Bias/Variance Tradeoff‬

‭The bias-variance tradeoff is a fundamental concept in machine learning that deals with‬
‭the relationship between a model's bias and variance and their impact on the model's‬
‭predictive performance.‬
‭Bias refers to the error introduced by approximating a real-world problem with a‬
‭simplified model. It represents the model's tendency to consistently underestimate or‬
‭overestimate the true values. A high-bias model typically oversimplifies the underlying‬
‭problem and may struggle to capture complex patterns in the data. It is associated with‬
‭underfitting, where the model fails to capture the training data's inherent structure.‬


‭Variance, on the other hand, refers to the model's sensitivity to fluctuations in the‬
‭training data. It measures the extent to which the model's predictions vary when trained‬
‭on different subsets of the training data. A high-variance model is overly sensitive to‬
‭noise or random fluctuations in the training data and tends to fit the training data too‬
‭closely. This can lead to poor generalization performance on unseen data, a‬
‭phenomenon known as overfitting.‬


‭The bias-variance tradeoff arises because reducing bias often increases variance, and‬
‭vice versa. As a result, finding the right balance between bias and variance is crucial for‬
‭building models that generalize well to unseen data.‬
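For squared-error loss, this tradeoff can be stated precisely through the standard bias-variance decomposition of the expected test error (a standard identity, added here for reference; f is the true function, f̂ the learned model, and σ² the irreducible noise):

```latex
\underbrace{\mathbb{E}\big[(y - \hat{f}(x))^2\big]}_{\text{expected test error}}
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Reducing one of the first two terms typically increases the other, which is the tradeoff described above.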
The best fit is given by the hypothesis at the tradeoff point. On an error-versus-complexity graph, training error keeps decreasing as model complexity grows, while test error first decreases and then rises again; the tradeoff point lies at the minimum of the test error.

‭Here's an overview of the bias-variance tradeoff and its implications:‬


‭1.‬ ‭High Bias, Low Variance:‬
‭●‬ ‭Underfitting: A high-bias model fails to capture the underlying patterns in‬
‭the data.‬
‭●‬ ‭Characterized by poor performance on both the training and validation‬
‭sets.‬
‭●‬ ‭Examples include overly simple models or models with insufficient‬
‭complexity to represent the data adequately.‬
‭2.‬ ‭Low Bias, High Variance:‬
‭●‬ ‭Overfitting: A low-bias model fits the training data too closely and‬
‭struggles to generalize.‬
‭●‬ ‭Achieves high performance on the training set but poor performance on‬
‭the validation set.‬
‭●‬ ‭Examples include models with high complexity or those trained with‬
‭limited data.‬
‭3.‬ ‭Tradeoff:‬
‭●‬ ‭Balancing bias and variance is essential for achieving the best‬
‭generalization performance.‬
‭●‬ ‭As the model's complexity increases, bias decreases, and variance‬
‭increases.‬
‭●‬ ‭The optimal tradeoff depends on the specific problem, dataset, and‬
‭available resources.‬
‭●‬ ‭Techniques like regularization, cross-validation, and ensemble methods‬
‭can help manage the bias-variance tradeoff.‬
‭The aim of machine learning is to find a model that strikes an optimal balance between‬
‭bias and variance, achieving both low bias (to capture important patterns) and low‬
‭variance (to generalize well). This typically involves iteratively adjusting the model's‬
‭complexity, regularization techniques, or dataset size to find the sweet spot where the‬
‭model performs well on unseen data.‬
Understanding the bias-variance tradeoff helps guide model selection, feature engineering, and regularization strategies, ultimately leading to better-performing models with improved generalization capabilities.

‭Ridge regression (L2 Regularization)‬

‭Ridge regression is a regularization technique used in linear regression to mitigate the‬


‭problem of multicollinearity (high correlation) among the independent variables. It adds‬
‭a penalty term to the linear regression cost function, which helps reduce the impact of‬
‭multicollinearity and prevents overfitting.‬
‭In ridge regression, the model's objective is to minimize the following cost function:‬
‭J(β) = RSS(β) + α * ∑(βᵢ²)‬
‭where:‬
‭●‬ ‭J(β) is the cost function to be minimized.‬
‭●‬ ‭RSS(β) is the residual sum of squares, which measures the difference between‬
‭the predicted and actual values.‬
‭●‬ ‭β is the vector of regression coefficients.‬
‭●‬ ‭βᵢ represents each individual coefficient in β.‬
‭●‬ ‭∑(βᵢ²) is the sum of squares of the regression coefficients.‬
‭●‬ ‭α is the regularization parameter (also known as λ or the regularization strength).‬
‭The regularization term (∑(βᵢ²)) imposes a penalty on the coefficients, making them‬
‭smaller. The magnitude of the penalty is controlled by the regularization parameter α. As‬
‭α increases, the impact of the penalty becomes stronger, shrinking the coefficients‬
‭further.‬
‭Key features of ridge regression:‬
‭1.‬ ‭Shrinks coefficients: Ridge regression shrinks the coefficients towards zero, but‬
‭they are not forced to be exactly zero. This allows the model to include all‬
‭features in the prediction, although with reduced magnitudes.‬
‭2.‬ ‭Handles multicollinearity: Ridge regression is particularly useful when dealing‬
‭with multicollinearity, where independent variables are highly correlated. By‬
‭reducing the impact of correlated variables, it improves the stability and‬
‭interpretability of the model.‬
‭3.‬ ‭Bias-variance tradeoff: Ridge regression introduces a bias (due to the penalty) in‬
‭exchange for reducing variance. It can help prevent overfitting by reducing the‬
‭model's complexity and sensitivity to noise.‬
‭4.‬ ‭Choosing the regularization parameter: The choice of α is crucial in ridge‬
‭regression. A small α will result in coefficients close to ordinary least squares‬
‭(OLS) regression, while a large α will heavily shrink the coefficients. The optimal‬
‭value of α can be determined through techniques such as cross-validation.‬
‭Ridge regression is widely used in situations where multicollinearity is expected or‬
‭observed. It provides a solution for improving the stability and performance of the linear‬
‭regression model. By controlling the regularization parameter, ridge regression strikes a‬
‭balance between bias and variance, leading to better generalization and improved‬
‭prediction accuracy.‬
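Minimizing J(β) above has a closed-form solution, β = (XᵀX + αI)⁻¹Xᵀy, obtained by setting the gradient to zero. A minimal sketch, ignoring the intercept for simplicity and using illustrative data and α values:

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge estimate β = (XᵀX + αI)⁻¹ Xᵀ y
    (a sketch without an intercept term; alpha is the strength α above)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

# Toy data with true coefficients [1, 2, 3] (an illustrative assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, 3.0])

b_ols = ridge_fit(X, y, alpha=0.0)      # α = 0 reduces to ordinary least squares
b_ridge = ridge_fit(X, y, alpha=100.0)  # a large α shrinks the coefficients
print(np.linalg.norm(b_ridge) < np.linalg.norm(b_ols))  # True
```

Note that the ridge coefficients are shrunk toward zero but, unlike lasso, none of them is forced to be exactly zero.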

‭Lasso Regression (L1 Regularization)‬

‭Lasso regression, short for "Least Absolute Shrinkage and Selection Operator," is‬
‭another regularization technique used in linear regression to address multicollinearity‬
‭and perform feature selection. Similar to ridge regression, lasso regression adds a‬
‭penalty term to the linear regression cost function. However, lasso regression uses the‬
‭L1 regularization term, which encourages sparsity by driving some coefficients to‬
‭exactly zero.‬
‭In lasso regression, the model's objective is to minimize the following cost function:‬
‭J(β) = RSS(β) + α * ∑(|βᵢ|)‬
‭where:‬
‭●‬ ‭J(β) is the cost function to be minimized.‬
‭●‬ ‭RSS(β) is the residual sum of squares, which measures the difference between‬
‭the predicted and actual values.‬
‭●‬ ‭β is the vector of regression coefficients.‬
‭●‬ ‭βᵢ represents each individual coefficient in β.‬
‭●‬ ‭∑(|βᵢ|) is the sum of the absolute values of the regression coefficients.‬
‭●‬ ‭α is the regularization parameter (also known as λ or the regularization strength).‬
‭The regularization term (∑(|βᵢ|)) in lasso regression promotes sparsity in the coefficient‬
‭values. As α increases, more coefficients are pushed to exactly zero, resulting in a‬
‭sparse model where only a subset of features is selected for the prediction.‬
‭Key features of lasso regression:‬
‭1.‬ ‭Sparsity and feature selection: Lasso regression's L1 regularization tends to set‬
‭some regression coefficients to exactly zero. This leads to feature selection,‬
‭where irrelevant or less important features are eliminated from the model. Lasso‬
‭can automatically identify and exclude unnecessary features.‬
‭2.‬ ‭Handles multicollinearity: Similar to ridge regression, lasso regression is effective‬
‭in handling multicollinearity, reducing the impact of highly correlated variables. By‬
‭driving some coefficients to zero, it automatically selects one variable from a set‬
‭of highly correlated variables while excluding the others.‬
‭3.‬ ‭Bias-variance tradeoff: Lasso regression introduces a bias (due to the penalty) in‬
‭exchange for reducing variance. It can help prevent overfitting by reducing the‬
‭model's complexity and sensitivity to noise.‬
‭4.‬ ‭Choosing the regularization parameter: Similar to ridge regression, the choice of‬
‭α is essential in lasso regression. Smaller α values result in less shrinkage and‬
‭fewer coefficients driven to zero, while larger α values increase sparsity by‬
‭driving more coefficients to zero. The optimal α value can be determined through‬
‭techniques like cross-validation.‬
‭Lasso regression is particularly useful when dealing with high-dimensional datasets or‬
‭when feature selection is desired. By automatically selecting relevant features and‬
‭shrinking irrelevant ones to zero, lasso regression provides interpretable models and‬
‭can improve prediction accuracy. It strikes a balance between bias and variance,‬
‭leading to better generalization and more robust models.‬
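Unlike ridge, lasso has no closed-form solution, but J(β) can be minimized with proximal gradient descent (ISTA), one standard solver: each gradient step on the RSS term is followed by soft-thresholding, the operation that drives coefficients exactly to zero. A sketch with illustrative data and α:

```python
import numpy as np

def soft_threshold(z, thresh):
    """Proximal operator of the L1 penalty: shrinks values toward 0 and sets
    entries with |z| <= thresh exactly to 0 -- the source of sparsity."""
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)

def lasso_ista(X, y, alpha, n_iters=2000):
    """Minimize J(β) = ‖Xβ − y‖² + α∑|βᵢ| by proximal gradient descent."""
    step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)   # safe step size (1/L)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = 2 * X.T @ (X @ beta - y)            # gradient of the RSS term
        beta = soft_threshold(beta - step * grad, step * alpha)
    return beta

# Sparse ground truth: only features 0 and 3 matter (illustrative values)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, 0.0, -2.0, 0.0])
beta = lasso_ista(X, y, alpha=20.0)
print(beta)  # the irrelevant coefficients are driven to (or very near) zero
```

The relevant coefficients survive with slightly shrunken magnitudes, while the irrelevant ones are eliminated, illustrating lasso's built-in feature selection.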

‭Early Stopping‬
‭Early stopping is a technique used in machine learning, particularly in iterative training‬
‭processes like gradient descent, to prevent overfitting and determine the optimal‬
‭stopping point for training. It involves monitoring the performance of a model on a‬
‭validation set during the training process and stopping the training when the‬
‭performance starts to deteriorate.‬
‭The main idea behind early stopping is that as training progresses, the model learns to‬
‭fit the training data better. However, there is a risk of overfitting, where the model‬
‭becomes too specialized to the training data and performs poorly on unseen data. Early‬
‭stopping helps find the point where the model achieves good generalization by stopping‬
‭the training before overfitting occurs.‬
‭Here's how early stopping typically works:‬
‭1.‬ ‭Split the data: Divide the available dataset into three subsets: a training set, a‬
‭validation set, and a test set. The training set is used to train the model, the‬
‭validation set is used to monitor performance during training, and the test set is‬
‭used to evaluate the final model's performance.‬
‭2.‬ ‭Define a performance metric: Choose a performance metric, such as accuracy,‬
‭loss, or validation error, to assess the model's performance.‬
‭3.‬ ‭Training with monitoring: During the training process, after each iteration or‬
‭epoch, evaluate the model's performance on the validation set using the chosen‬
‭performance metric. Track the performance metric over time.‬
‭4.‬ ‭Early stopping criterion: Define a stopping criterion based on the performance‬
‭metric. For example, if the validation error increases or no longer improves for a‬
‭certain number of iterations, it may indicate that the model has started to overfit,‬
‭and training can be stopped.‬
‭5.‬ ‭Stopping and model selection: When the stopping criterion is met, stop the‬
‭training process and select the model at the point of best performance on the‬
‭validation set. This model is expected to have good generalization capabilities.‬
‭6.‬ ‭Final evaluation: Evaluate the selected model on the test set to estimate its‬
‭performance on unseen data. This provides an unbiased assessment of the‬
‭model's generalization ability.‬
‭The advantages of early stopping include:‬
‭●‬ ‭Prevention of overfitting: Early stopping helps avoid overfitting by stopping the‬
‭training process before the model becomes too specialized to the training data.‬
‭●‬ ‭Efficient use of resources: Early stopping reduces training time and‬
‭computational resources by stopping the process as soon as the model's‬
‭performance plateaus or starts to deteriorate.‬
‭●‬ ‭Simplicity and interpretability: Early stopping is a simple and intuitive technique‬
‭that can be easily implemented and understood.‬
‭However, it's important to note that early stopping is not always guaranteed to improve‬
‭the model's performance. It requires careful monitoring and selection of appropriate‬
‭stopping criteria to achieve the desired outcome.‬
‭Overall, early stopping is a powerful technique to prevent overfitting and determine the‬
‭optimal stopping point during model training. It helps strike a balance between training‬
‭the model sufficiently and avoiding excessive complexity, leading to better‬
‭generalization and improved performance on unseen data.‬

‭Logistic Regression‬

‭○‬ ‭Logistic regression is one of the most popular Machine Learning algorithms,‬
‭which comes under the Supervised Learning technique. It is used for predicting‬
‭the categorical dependent variable using a given set of independent variables.‬
‭○‬ ‭Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome must be a categorical or discrete value. It can be Yes or No, 0 or 1, True or False, etc., but instead of giving the exact values 0 and 1, it gives probabilistic values which lie between 0 and 1.‬
‭○‬ ‭Logistic Regression is similar to Linear Regression except in how it is used: Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.‬
‭○‬ ‭In Logistic regression, instead of fitting a regression line, we fit an "S" shaped‬
‭logistic function, which predicts two maximum values (0 or 1).‬
‭○‬ ‭The curve from the logistic function indicates the likelihood of something such as‬
‭whether the cells are cancerous or not, a mouse is obese or not based on its‬
‭weight, etc.‬
‭○‬ ‭Logistic Regression is a significant machine learning algorithm because it has the‬
‭ability to provide probabilities and classify new data using continuous and‬
‭discrete datasets.‬
‭○‬ ‭Logistic Regression can be used to classify observations using different types of data and can easily determine the most effective variables for the classification.‬

‭The logistic function is defined as:‬


‭p = 1 / (1 + e^(-z))‬
‭where:‬
‭●‬ ‭p is the probability of the positive class (e.g., class 1)‬
‭●‬ ‭z is the linear combination of the features and their associated coefficients‬
‭In logistic regression, the model's parameters (coefficients) are estimated using‬
‭maximum likelihood estimation, which seeks to find the set of parameters that‬
‭maximizes the likelihood of the observed data given the model.‬

‭Key features of logistic regression:‬


‭1.‬ ‭Binary classification: Logistic regression is well-suited for binary classification‬
‭problems, where the target variable has two classes (e.g., yes/no, true/false,‬
‭0/1).‬
‭2.‬ ‭Logistic function: The logistic function converts the linear combination of features‬
‭into a probability value between 0 and 1, representing the likelihood of the‬
‭positive class.‬
‭3.‬ ‭Coefficient interpretation: The coefficients in logistic regression provide insight‬
‭into the impact of each feature on the predicted probability. A positive coefficient‬
‭indicates that an increase in the feature value increases the probability of the‬
‭positive class, while a negative coefficient indicates the opposite.‬
‭4.‬ ‭Decision boundary: Logistic regression uses a decision boundary (typically at p =‬
‭0.5) to separate the two classes. Instances with probabilities above the threshold‬
‭are predicted as the positive class, while those below the threshold are predicted‬
‭as the negative class.‬
‭5.‬ ‭Regularization: Logistic regression can be regularized to prevent overfitting.‬
‭Regularization techniques like L1 (Lasso) or L2 (Ridge) regularization can be‬
‭applied to shrink the coefficients or perform feature selection.‬
‭6.‬ ‭Multiclass classification: Logistic regression can also be extended to handle‬
‭multiclass classification problems using techniques like one-vs-rest or softmax‬
‭regression.‬
‭Logistic regression is widely used in various domains, including finance, healthcare,‬
‭marketing, and social sciences, where binary classification tasks are prevalent. It‬
‭provides a simple yet effective approach to estimate the probability of an instance‬
‭belonging to a particular class.‬
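For illustration, the probability returned by scikit-learn's `LogisticRegression` can be reproduced by applying the logistic function p = 1 / (1 + e^(-z)) to the linear combination z. The synthetic dataset is an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression().fit(X, y)

proba = clf.predict_proba(X[:1])[0]   # [P(class 0), P(class 1)]
pred = clf.predict(X[:1])[0]          # class chosen by the p = 0.5 threshold

# Recover P(class 1) manually from the logistic function
z = float(X[0] @ clf.coef_.ravel() + clf.intercept_[0])
p = 1 / (1 + np.exp(-z))
```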

‭Decision Boundaries‬
‭Decision boundaries are a fundamental concept in classification tasks and refer to the‬
‭boundaries or surfaces that separate different classes or categories in a machine‬
‭learning model. In a binary classification problem, the decision boundary is a line, curve,‬
‭or surface that separates the instances of one class from the instances of the other‬
‭class.‬
‭The decision boundary is determined by the learned parameters (coefficients) of the‬
‭classification algorithm. The goal is to find a decision boundary that best separates the‬
‭classes based on the available features or input variables.‬
‭Here are a few examples of decision boundaries in different types of classification‬
‭problems:‬
‭1.‬ ‭Linear Decision Boundary:‬
‭●‬ ‭In logistic regression or linear SVM (Support Vector Machine), the decision‬
‭boundary is a straight line or hyperplane that separates the classes.‬
‭●‬ ‭For example, in a 2D feature space, a linear decision boundary could be a‬
‭straight line that separates instances with different labels.‬

‭2.‬ ‭Non-linear Decision Boundary:‬
‭●‬ ‭In more complex classification problems, the decision boundary may not‬
‭be a straight line or hyperplane. It can have curves, bends, or more‬
‭complex shapes.‬
‭●‬ ‭Non-linear decision boundaries can be represented by polynomial‬
‭functions, splines, or other non-linear transformations of the input features.‬
‭●‬ ‭Classification algorithms like decision trees, random forests, or kernel‬
‭SVMs can learn non-linear decision boundaries.‬
‭3.‬ ‭Probabilistic Decision Boundary:‬
‭●‬ ‭In probabilistic classification algorithms like logistic regression or Naive‬
‭Bayes, the decision boundary is typically set at a probability threshold‬
‭(e.g., 0.5).‬
‭●‬ ‭Instances with predicted probabilities above the threshold are assigned to‬
‭one class, while those below the threshold are assigned to the other class.‬

‭4.‬ ‭Multiclass Decision Boundary:‬
‭●‬ ‭In multiclass classification problems, there are multiple decision‬
‭boundaries that separate each class from the rest.‬
‭●‬ ‭Techniques like one-vs-rest or softmax regression are used to determine‬
‭the decision boundaries for each class.‬

‭It's important to note that the complexity and shape of the decision boundary depend on‬
‭the underlying data distribution, the chosen classification algorithm, and the features‬
‭used for classification. Different algorithms have different capabilities in capturing‬
‭complex decision boundaries.‬
‭Visualizing the decision boundary can provide valuable insights into the model's‬
‭behavior and how it separates different classes in the feature space. Decision‬
‭boundaries play a crucial role in classification tasks as they define the regions where‬
‭instances are assigned to specific classes based on the learned parameters and‬
‭classification algorithm.‬
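In the linear case, the decision boundary can be computed directly from the learned parameters: it is the set of points where z = 0, i.e. p = 0.5. A sketch with synthetic 2-D data (scikit-learn assumed):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=2, n_redundant=0, random_state=0)
clf = LogisticRegression().fit(X, y)
(w1, w2), b = clf.coef_[0], clf.intercept_[0]

# The linear decision boundary satisfies w1*x1 + w2*x2 + b = 0,
# i.e. x2 = -(w1*x1 + b) / w2
x1 = np.linspace(X[:, 0].min(), X[:, 0].max(), 50)
x2 = -(w1 * x1 + b) / w2

# Every point on this line has predicted probability 0.5
boundary = np.column_stack([x1, x2])
p = clf.predict_proba(boundary)[:, 1]
```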

‭Softmax Regression‬

‭Softmax regression, also known as multinomial logistic regression or maximum entropy classifier, is a classification algorithm used to assign instances to multiple classes. It is an extension of logistic regression, which is used for binary classification, to handle problems with more than two classes.‬
‭In softmax regression, the goal is to estimate the probabilities of an instance belonging‬
‭to each class based on its input features. The output of the softmax regression model is‬
‭a probability distribution over the possible classes, where each class is assigned a‬
‭probability value between 0 and 1 that sums up to 1.‬
‭Here's how softmax regression works:‬
‭1.‬ ‭Model representation: The softmax regression model consists of a set of learned‬
‭parameters or coefficients, one for each input feature and each output class.‬
‭These coefficients are usually represented as a weight matrix.‬
‭2.‬ ‭Linear combination: For each instance, the model computes a linear combination‬
‭of the input features and their associated coefficients for each class. This linear‬
‭combination is often called the "logits" or "scores".‬
‭3.‬ ‭Softmax function: The logits are then transformed into probabilities using the‬
‭softmax function. The softmax function ensures that the probabilities are‬
‭non-negative and sum up to 1. It is defined as follows for a K-class classification‬
‭problem:‬
‭p(y=k|x) = e^(z_k) / (∑(e^(z_j)) for j=1 to K)‬
‭where p(y=k|x) is the probability of the instance belonging to class k, z_k is the‬
‭logit for class k, and the denominator is the sum of exponential logits for all‬
‭classes.‬
‭4.‬ ‭Prediction: The class with the highest probability is selected as the predicted‬
‭class for each instance.‬
‭5.‬ ‭Training: The softmax regression model is trained by minimizing a loss function,‬
‭typically the cross-entropy loss. The loss function measures the difference‬
‭between the predicted probabilities and the true class labels. The model‬
‭parameters are adjusted using optimization algorithms like gradient descent to‬
‭minimize the loss.‬
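The softmax transformation in step 3 can be sketched in a few lines (subtracting the maximum logit before exponentiating is a standard numerical-stability trick, not part of the definition above):

```python
import numpy as np

def softmax(z):
    """Convert a vector of logits into probabilities that sum to 1."""
    z = z - np.max(z)          # stability: shift does not change the result
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])   # hypothetical scores for 3 classes
probs = softmax(logits)

# Step 4: the predicted class is the one with the highest probability
predicted_class = int(np.argmax(probs))
```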
‭Softmax regression is widely used in multiclass classification problems where there are‬
‭more than two possible output classes. It is commonly applied in natural language‬
‭processing tasks such as text categorization, sentiment analysis, and language‬
‭modeling. Softmax regression provides interpretable probabilities for each class and‬
‭can handle situations where instances may belong to multiple classes simultaneously.‬
‭Compared to other multiclass classification algorithms like one-vs-rest or multiclass‬
‭SVM, softmax regression offers a more direct probabilistic interpretation and is generally‬
‭computationally efficient. However, it assumes that the classes are mutually exclusive‬
‭and do not overlap.‬

‭Cross Entropy‬

‭Cross-entropy is a loss function commonly used in machine learning for classification tasks. It measures the dissimilarity between the predicted probability distribution and the true distribution of the target variable.‬
‭In the context of classification, the cross-entropy loss quantifies how well a predicted‬
‭probability distribution aligns with the actual labels. It is often used as the objective or‬
‭loss function to be minimized during the training of classification models.‬
‭The cross-entropy loss is calculated using the predicted probabilities (outputted by the‬
‭model) and the true labels. For a multi-class classification problem with K classes, the‬
‭cross-entropy loss is computed as follows:‬
‭L = -∑(y_i * log(p_i))‬
‭where L is the cross-entropy loss, y_i is the true label for class i (represented as a‬
‭one-hot vector), and p_i is the predicted probability for class i.‬
‭In this formulation, log refers to the natural logarithm. The loss is calculated for each‬
‭class and then summed to obtain the overall cross-entropy loss.‬
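A minimal sketch of the formula L = -∑(y_i * log(p_i)) (the small epsilon guarding against log(0) is an implementation detail, not part of the definition):

```python
import numpy as np

def cross_entropy(y_true_onehot, p_pred):
    """Cross-entropy loss for one instance: -sum(y_i * log(p_i))."""
    eps = 1e-12                       # avoid log(0)
    return -np.sum(y_true_onehot * np.log(p_pred + eps))

y_true = np.array([0, 1, 0])          # true class is class 1 (one-hot)
good = np.array([0.1, 0.8, 0.1])      # high probability on the true class
bad = np.array([0.7, 0.2, 0.1])       # high probability on a wrong class

loss_good = cross_entropy(y_true, good)   # small loss
loss_bad = cross_entropy(y_true, bad)     # much larger loss
```

This shows the penalization behavior described next: low predicted probability for the true class yields a large loss.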
‭Intuitively, the cross-entropy loss penalizes models more heavily when they predict low‬
‭probabilities for the true classes and high probabilities for incorrect classes. It‬
‭encourages the model to correctly assign higher probabilities to the true classes.‬
‭During training, the model iteratively adjusts its parameters to minimize the‬
‭cross-entropy loss using optimization algorithms such as gradient descent. By‬
‭minimizing the loss, the model learns to improve its predictions and better align them‬
‭with the true labels.‬
‭The cross-entropy loss is commonly used in various classification algorithms, including‬
‭logistic regression, softmax regression, and neural networks. It provides a measure of‬
‭the discrepancy between the predicted probabilities and the true labels, allowing the‬
‭model to learn and make better predictions over time.‬
‭It's worth noting that the cross-entropy loss is just one of several loss functions available‬
‭for classification tasks. Its popularity stems from its effectiveness and compatibility with‬
‭probabilistic interpretations of classification problems.‬

‭Unit 3 -‬‭Support Vector Machines‬

‭Q.What is similarity function?‬

I‭n machine learning, a similarity function is a mathematical function that measures the‬
‭similarity between two data points in a dataset. The function takes two data points as‬
‭input and returns a similarity score that quantifies how similar or dissimilar the two‬
‭points are.‬

There are many different types of similarity functions, each designed for a specific type of data and application. For example, some similarity functions are designed for comparing text data, while others are designed for comparing image data or numerical data.

One common type of similarity function is the cosine similarity function, which measures the cosine of the angle between two vectors in a high-dimensional space. This function is often used for text data and is a popular choice for tasks such as document classification and information retrieval.

Other examples of similarity functions include the Euclidean distance function, which measures the distance between two points in a high-dimensional space, and the Jaccard similarity function, which measures the similarity between two sets of data.

The choice of similarity function depends on the specific task and the type of data being analyzed. Selecting the appropriate similarity function is an important step in designing effective machine learning algorithms.
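Minimal sketches of the cosine and Jaccard similarity functions mentioned above (the example vectors and sets are hypothetical):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def jaccard_similarity(s1, s2):
    """|intersection| / |union| of two sets."""
    return len(s1 & s2) / len(s1 | s2)

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])      # same direction as a -> similarity 1
sim = cosine_similarity(a, b)

j = jaccard_similarity({"a", "b", "c"}, {"b", "c", "d"})  # 2 shared of 4 total
```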

‭Q.what is soft margin classification?‬

Soft margin classification is a variation of the Support Vector Machine (SVM) algorithm
‭used in machine learning for binary classification tasks. Unlike the traditional SVM,‬
‭which requires that the data be linearly separable, soft margin classification allows for‬
‭some amount of misclassification in order to find a more generalizable solution.‬

I‭n soft margin classification, a "margin" is a boundary that separates the data points of‬
‭different classes. The goal of the algorithm is to find the optimal margin that maximizes‬
‭the distance between the margin and the closest data points of each class, while also‬
‭minimizing the number of misclassified points. Parameter C, which is a hyperparameter‬
‭that controls the trade-off between maximizing the margin and minimizing the‬
‭misclassification, is introduced in the soft margin classification.‬

Soft margin classification allows for some amount of misclassification by allowing data
‭points to be on the wrong side of the margin or within the margin itself. The degree to‬
‭which misclassification is allowed is controlled by the parameter C. When C is small, the‬
‭algorithm allows for more misclassification and a wider margin. When C is large, the‬
‭algorithm allows for less misclassification and a narrower margin.‬

Soft margin classification is useful when the data is not perfectly separable, which is
‭often the case in real-world problems. By allowing for some amount of‬
‭misclassification, the algorithm can still find a good separation between the data points‬
‭of different classes, while avoiding overfitting to the training data.‬
‭Q.What is the fundamental idea behind Support Vector Machines?‬

The fundamental idea behind Support Vector Machines (SVMs) is to find the optimal hyperplane that separates the data points of different classes in a way that maximizes the margin between the hyperplane and the closest data points of each class. In other words, the goal of SVMs is to find a decision boundary that not only correctly classifies the training data but also generalizes well to unseen data.

To achieve this, SVMs transform the input data into a higher-dimensional feature space using a kernel function. In this feature space, the data points of different classes can be separated by a hyperplane. The hyperplane is chosen such that it maximizes the distance between the closest data points of each class, which is known as the margin.

In addition to finding the optimal hyperplane, SVMs also handle outliers and noisy data points by introducing a soft margin. The soft margin allows for some misclassification of data points that fall within the margin or on the wrong side of the hyperplane. The degree to which misclassification is allowed is controlled by a hyperparameter known as the regularization parameter.

SVMs can be used for both binary and multi-class classification tasks, as well as regression tasks. They have proven to be effective in a wide range of applications, including text classification, image classification, and bioinformatics. SVMs are also used alongside other learning pipelines as a tool for dimensionality reduction, feature extraction, and classification.

‭Q.Explain the CART Training Algorithm‬

The Classification and Regression Tree (CART) algorithm is a popular decision tree
‭algorithm used in machine learning for both classification and regression tasks. The‬
‭algorithm works by recursively partitioning the data into smaller subsets, where each‬
‭partition is represented by a node in the tree.‬

‭The CART algorithm follows the following steps to train a decision tree:‬
‭1.‬ S ‭ elect the best attribute: The algorithm starts by selecting the best attribute to‬
‭split the data at the current node. The best attribute is selected based on a‬
‭criterion such as Gini index or information gain, which measures the impurity of‬
‭the data.‬
‭2.‬ ‭Split the data: The data is split into two or more subsets based on the value of‬
‭the selected attribute. Each subset forms a branch of the tree.‬
‭3.‬ ‭Recurse: The above steps are repeated recursively for each subset until a‬
‭stopping criterion is met, such as a minimum number of instances per node or a‬
‭maximum depth of the tree.‬
‭4.‬ ‭Prune the tree: Finally, the tree is pruned to avoid overfitting by removing‬
‭branches that do not improve the accuracy of the tree on the validation data.‬
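The steps above can be sketched with scikit-learn's `DecisionTreeClassifier`, which implements an optimized version of CART (the Iris dataset and the hyperparameter values are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion="gini" selects splits by Gini impurity (step 1);
# max_depth and min_samples_leaf are stopping criteria (step 3);
# ccp_alpha enables cost-complexity pruning after growth (step 4)
tree = DecisionTreeClassifier(criterion="gini", max_depth=4,
                              min_samples_leaf=5, ccp_alpha=0.01,
                              random_state=0).fit(X_train, y_train)

acc = tree.score(X_test, y_test)
```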

The CART algorithm is a greedy algorithm, meaning that it selects the best attribute at
‭each step without considering the global optimum. This can lead to suboptimal trees,‬
‭but the algorithm can still produce good results with proper tuning of hyperparameters.‬

‭Advantages of CART algorithm:‬

CART trees can be used for both classification and regression tasks. For classification
‭tasks, the decision tree outputs the majority class of the instances in the leaf node. For‬
‭regression tasks, the decision tree outputs the mean or median value of the instances in‬
‭the leaf node.‬

‭1.‬ E ‭ asy to understand and interpret: CART decision trees are easy to understand‬
‭and interpret, even for non-experts. The tree structure is intuitive and can be‬
‭visualized, allowing users to easily understand how the algorithm arrived at a‬
‭particular decision.‬
‭2.‬ ‭Handles both categorical and numerical data: The CART algorithm can handle‬
‭both categorical and numerical data, making it a versatile algorithm that can be‬
‭applied to a wide range of datasets.‬
‭3.‬ ‭Can handle missing data: The CART algorithm can handle missing data by‬
‭making an estimate based on the available data, which is a useful feature when‬
‭working with real-world datasets that often have missing values.‬
‭4.‬ ‭Scalable: The CART algorithm can handle large datasets and can be parallelized‬
‭to speed up computation.‬
‭5.‬ N ‭ on-parametric: The CART algorithm is non-parametric, which means it makes‬
‭no assumptions about the underlying distribution of the data. This makes it‬
‭useful for modeling complex relationships between variables.‬
‭6.‬ ‭Can handle both classification and regression tasks: The CART algorithm can be‬
‭used for both classification and regression tasks, making it a versatile algorithm‬
‭that can be applied to a wide range of problems.‬
‭7.‬ ‭Can be used in ensemble methods: CART decision trees can be used in ensemble‬
‭methods such as random forests and boosting, which can improve the accuracy‬
‭of the model.‬

Overall, the CART algorithm is a powerful and flexible machine learning algorithm that is
‭widely used for both classification and regression tasks. Its ability to handle both‬
‭categorical and numerical data, missing values, and large datasets make it a popular‬
‭choice for data scientists and machine learning practitioners.‬

‭Q.‬‭What is a kernel trick? what are different types‬‭of kernel functions‬

The kernel trick is a technique used in machine learning to allow linear algorithms to perform nonlinear classification or regression tasks by mapping the original input space into a higher-dimensional feature space. The kernel trick computes the dot product of the transformed input vectors in the higher-dimensional space without explicitly computing the transformation, which would be computationally expensive.

Kernel functions are used to define the dot product between two vectors in the transformed feature space. There are several types of kernel functions, including:

‭1.‬ L ‭ inear kernel: The linear kernel simply computes the dot product between the‬
‭original input vectors, without any transformation.‬
‭2.‬ ‭Polynomial kernel: The polynomial kernel maps the original input vectors into a‬
‭higher-dimensional space using a polynomial function.‬
‭3.‬ ‭Radial basis function (RBF) kernel: The RBF kernel maps the original input vectors‬
‭into an infinite-dimensional space using a Gaussian function.‬
‭4.‬ ‭Sigmoid kernel: The sigmoid kernel maps the original input vectors into a‬
‭higher-dimensional space using a sigmoid function.‬
‭5.‬ ‭Laplacian kernel: The Laplacian kernel is similar to the RBF kernel, but uses the‬
‭Laplacian function instead of the Gaussian function.‬
The choice of kernel function depends on the nature of the data and the specific problem being solved. The linear kernel is often used when the data is linearly separable, while the polynomial and RBF kernels are used for nonlinear problems. The sigmoid and Laplacian kernels are less commonly used but can be useful in certain situations.

The kernel trick is used in a variety of machine learning algorithms, including support vector machines (SVMs), kernel PCA, and Gaussian processes. It has proven to be a powerful tool for solving complex machine learning problems and has led to significant advances in the field.
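A sketch of the kernel trick in practice, using the standard concentric-circles illustration (the dataset choice is an assumption, not from the text). A linear kernel fails because no separating line exists, while the RBF kernel separates the classes through its implicit feature map:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: not linearly separable in the original 2-D space
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)    # kernel trick: implicit nonlinear mapping

acc_linear = linear.score(X, y)      # poor: roughly chance level
acc_rbf = rbf.score(X, y)            # near-perfect separation
```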

‭Q. Write a note on Decision Trees.‬

Decision trees are a popular machine learning algorithm used for both classification and regression tasks. They are a non-parametric supervised learning method that builds a model in the form of a tree structure. Each internal node in the tree represents a decision based on a feature or attribute, and each leaf node represents a predicted outcome or target value.

The decision tree algorithm partitions the training data recursively based on different attributes, creating splits in the data based on the values of the features. The goal is to find the best splits that maximize the information gain or decrease the impurity within each partition. The impurity or uncertainty can be measured using various criteria such as Gini index or entropy.

During the training process, the decision tree algorithm evaluates different attributes and their potential splits based on the chosen impurity measure. The attribute that provides the best split, resulting in the highest information gain or the greatest reduction in impurity, is selected at each node. This process continues until a stopping criterion is met, such as reaching a maximum tree depth, a minimum number of samples in a leaf node, or when no further improvement can be achieved.

Once the decision tree is built, it can be used to make predictions on new, unseen data by traversing the tree based on the attribute values of the input features. The path followed through the tree leads to a leaf node, which represents the predicted outcome or value.

Decision trees have several advantages:

1. Interpretability: Decision trees are easy to understand and interpret. The tree structure provides a clear visual representation of the decision-making process.
2. Feature importance: Decision trees can rank the importance of features based on how much they contribute to the decision-making process. This information can be useful for feature selection or understanding the underlying relationships in the data.
3. Handling both numerical and categorical data: Decision trees can handle a mixture of numerical and categorical features without requiring feature scaling or one-hot encoding.
4. Non-linear relationships: Decision trees can capture non-linear relationships between features and the target variable.

However, decision trees also have some limitations:

1. Overfitting: Decision trees are prone to overfitting, especially when the tree becomes too deep or when the training data is noisy. Overfitting occurs when the tree captures the training data's noise or outliers, leading to poor generalization on unseen data.
2. Lack of smoothness: Decision trees partition the feature space into rectangular regions, which can result in a lack of smoothness in the predicted outcomes. This limitation can be overcome by using ensemble methods like random forests or gradient boosting.
3. Instability: Decision trees can be sensitive to small changes in the training data, which can result in different tree structures. This instability can be mitigated by using ensemble methods or by setting randomization parameters during training.

To address some of the limitations, ensemble methods like random forests and gradient boosting are often used with decision trees. These methods combine multiple decision trees to make more robust and accurate predictions.

Overall, decision trees are a versatile and widely used algorithm in machine learning due to their interpretability, ability to handle different types of data, and capability to capture complex relationships.
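For illustration, the feature-importance ranking mentioned in advantage 2 can be read directly from a fitted tree in scikit-learn (the dataset and depth limit are assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data.data, data.target)

# Importance of each feature, derived from how much it reduced impurity
# across the splits where it was used; the values sum to 1
importance = dict(zip(data.feature_names, tree.feature_importances_))
```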

‭Q.What is regularization? How do you reduce the risk of overfitting of Decision Tree?‬

Regularization is a technique used in machine learning to prevent overfitting of models.
‭Overfitting occurs when a model is too complex and fits the training data too closely,‬
‭resulting in poor generalization performance on new, unseen data.‬

‭ ecision trees can also suffer from overfitting when they become too complex and‬
D
‭capture the noise in the training data. Regularization techniques can be used to reduce‬
‭the risk of overfitting in decision trees.‬

‭There are several ways to regularize decision trees:‬

‭1.‬ P
‭ runing: Pruning is a technique used to remove branches or nodes from the tree‬
‭that do not improve the accuracy of the model on the validation data. Pruning‬
‭can be done using techniques such as reduced-error pruning, cost-complexity‬
‭pruning, or minimum description length pruning.‬
‭2.‬ M ‭ inimum samples per leaf: This technique sets a minimum threshold for the‬
‭number of samples required to create a leaf node in the decision tree. This helps‬
‭to prevent the creation of overly specific leaves that capture noise in the data.‬
‭3.‬ ‭Maximum depth: Setting a maximum depth for the decision tree can prevent it‬
‭from becoming too complex and overfitting the training data.‬
‭4.‬ ‭Minimum impurity decrease: This technique sets a threshold for the minimum‬
‭improvement in impurity that must be achieved for a split to be considered in the‬
‭decision tree. This helps to prevent the creation of overly complex branches that‬
‭capture noise in the data.‬
‭5.‬ ‭Ensemble methods: Ensemble methods such as random forests and boosting‬
‭can be used to regularize decision trees by creating multiple trees and combining‬
‭their predictions.‬

By using one or more of these regularization techniques, the risk of overfitting can be
‭reduced, and the decision tree can generalize better to new, unseen data.‬
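Techniques 2–4 map directly onto `DecisionTreeClassifier` hyperparameters in scikit-learn. The sketch below compares an unconstrained tree with a regularized one on noisy synthetic data (all settings are illustrative assumptions); the unconstrained tree shows a much larger gap between training and test accuracy, the signature of overfitting:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y=0.2 injects label noise, which unconstrained trees memorize
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # no constraints
regularized = DecisionTreeClassifier(max_depth=3,              # technique 3
                                     min_samples_leaf=10,      # technique 2
                                     min_impurity_decrease=0.01,  # technique 4
                                     random_state=0).fit(X_tr, y_tr)

gap_deep = deep.score(X_tr, y_tr) - deep.score(X_te, y_te)
gap_reg = regularized.score(X_tr, y_tr) - regularized.score(X_te, y_te)
```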

‭Q. What is Entropy. Also describe Information Gain.‬

‭Entropy:‬

Entropy is a concept from information theory that measures the amount of uncertainty
‭or disorder in a set of data. In the context of machine learning and decision trees,‬
‭entropy is often used as a measure of impurity or randomness in a given dataset or a‬
‭subset of data.‬

The entropy is minimum (0) when all the data points in a subset belong to the same
‭class, indicating perfect purity. On the other hand, entropy is maximum when the‬
‭distribution of classes is uniform or evenly distributed, indicating maximum impurity or‬
‭randomness.‬

I‭n the context of decision trees, entropy is used as a criterion to determine the best‬
‭attribute to split the data at each node. The attribute that leads to the greatest reduction‬
‭in entropy after the split is chosen as the splitting criterion. The reduction in entropy is‬
‭calculated by comparing the entropy of the parent node with the weighted average‬
‭entropy of the child nodes after the split.‬
By using entropy as a measure of impurity, decision trees aim to create splits that
‭maximize the information gain. Information gain is defined as the difference between‬
‭the entropy of the parent node and the weighted average entropy of the child nodes. The‬
‭goal is to find the attribute that results in the highest information gain, indicating the‬
‭most significant reduction in uncertainty or impurity.‬

In summary, entropy is a measure of uncertainty or disorder in a dataset. In decision trees, it is used as a criterion to evaluate the purity of subsets of data and select the best attribute for splitting the data, aiming to create a tree that effectively classifies or predicts the target variable.

Mathematically, entropy is calculated using the probability distribution of the different classes or outcomes in the dataset. For a binary classification problem, where there are two possible outcomes (e.g., class A and class B), the entropy is defined as:

Entropy = -Pyes * log2(Pyes) - Pno * log2(Pno)

where Pyes and Pno are the probabilities of class yes and class no, respectively. The logarithm base 2 is commonly used, which results in entropy measured in bits.
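A minimal sketch of the binary entropy formula (treating 0·log2(0) as 0 by the usual convention):

```python
import numpy as np

def entropy(p_yes):
    """Binary entropy in bits: -p*log2(p) - (1-p)*log2(1-p)."""
    p_no = 1 - p_yes
    terms = [p * np.log2(p) for p in (p_yes, p_no) if p > 0]  # 0*log2(0) -> 0
    return -sum(terms)

pure = entropy(1.0)    # all samples in one class -> 0 bits (perfect purity)
mixed = entropy(0.5)   # 50/50 split -> 1 bit (maximum impurity)
```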

‭Information Gain:‬

Information gain is a concept used in decision tree algorithms to measure the effectiveness of a particular attribute in splitting the data. It quantifies the amount of information gained by partitioning the data based on that attribute.

I‭nformation gain is calculated by comparing the entropy (or impurity) of the parent node‬
‭before the split with the weighted average of the entropies of the child nodes after the‬
‭split. The attribute that results in the highest information gain is chosen as the splitting‬
‭criterion.‬

To understand information gain, let's consider a binary classification problem with two classes: class A and class B. The entropy of the parent node before the split is calculated using the probabilities of class A (p(A)) and class B (p(B)). The entropy is given by:

‭Entropy(parent) = -p(A) * log2(p(A)) - p(B) * log2(p(B))‬

Now, suppose we split the data based on a specific attribute, creating child nodes. The entropy of each child node is calculated using the probabilities of class A and class B within that node. The weighted average entropy of the child nodes is calculated by considering the proportion of samples in each child node.

The information gain is then calculated as the difference between the entropy of the parent node and the weighted average entropy of the child nodes:

‭Information Gain = Entropy(parent) - Weighted Average Entropy(child nodes)‬

The attribute that maximizes the information gain is chosen as the best attribute to split the data. Higher information gain implies a greater reduction in entropy, indicating a more effective split and better discrimination between the classes.
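The calculation above can be sketched in plain Python (an illustrative sketch; the helper names entropy and information_gain are our own):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy(parent) minus the weighted average entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Toy split: a parent of 5 A's and 5 B's divided into two child nodes.
parent = ["A"] * 5 + ["B"] * 5
left, right = ["A"] * 4 + ["B"], ["A"] + ["B"] * 4
print(round(information_gain(parent, [left, right]), 3))  # → 0.278
```

A perfect split (all A's in one child, all B's in the other) would give the maximum gain of 1.0 bit here, while a split that leaves both children mixed 50/50 would give a gain of 0.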

The information gain criterion is commonly used in decision tree algorithms like ID3 (Iterative Dichotomiser 3) and C4.5. However, information gain has a bias towards attributes with a large number of distinct values. To address this bias, another criterion called gain ratio, which takes into account the intrinsic information of an attribute, is often used.

Information gain is a key concept in decision tree algorithms as it guides the tree-building process by selecting the most informative attributes for splitting the data and constructing an effective predictive model.

‭Q.What is Gini Impurity?‬

Gini impurity is a measure of impurity or randomness used in decision trees for classification problems. It measures the probability of incorrectly classifying a randomly chosen element in the dataset if it were labeled randomly according to the distribution of classes in the dataset.

I‭n other words, Gini impurity is a measure of the likelihood of a randomly chosen sample‬
‭being incorrectly labeled if it were randomly labeled according to the proportion of‬
‭classes in the subset it belongs to.‬

Mathematically, the Gini impurity of a set of samples is defined as follows:

Gini(D) = 1 - (p_1^2 + p_2^2 + ... + p_k^2)

where D is the set of samples, k is the number of classes, and p_i is the proportion of samples that belong to class i.

A Gini impurity of 0 indicates a perfectly pure dataset, where all samples belong to the same class. Gini impurity is highest when the samples are evenly distributed among all classes; for k classes its maximum value is 1 - 1/k (0.5 for a two-class problem).

I‭n decision tree algorithms, Gini impurity is used to evaluate the quality of a split. The‬
‭goal is to find the split that results in the lowest Gini impurity, which corresponds to the‬
‭split that separates the classes most effectively. By iteratively splitting the data based‬
‭on the features that result in the lowest Gini impurity, a decision tree can be constructed‬
‭that effectively classifies new, unseen data.‬
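The Gini calculation can be sketched in a few lines (an illustrative sketch; the helper name gini_impurity is our own):

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_impurity(["A"] * 6))              # pure node → 0.0
print(gini_impurity(["A"] * 3 + ["B"] * 3))  # even two-class split → 0.5
```

Note how the even two-class split reaches the binary maximum of 0.5, matching 1 - 1/k for k = 2.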
‭Q.Write a note on the advantages and disadvantages of using decision trees.‬

Decision trees are a popular machine learning algorithm that can be used for both classification and regression tasks. They have several advantages and disadvantages, which are discussed below.

‭Advantages:‬

1. Easy to understand and interpret: Decision trees are easy to visualize and
‭understand, making them a popular choice for many applications. The resulting‬
‭decision tree can be easily explained to stakeholders and decision-makers,‬
‭making it a useful tool for decision-making.‬
‭2.‬ ‭Can handle both categorical and numerical data: Decision trees can handle both‬
‭categorical and numerical data, making them versatile for a wide range of‬
‭applications.‬
‭3.‬ ‭Robust to outliers and missing data: Decision trees are robust to outliers and‬
‭missing data, as they do not require the data to be preprocessed in any specific‬
‭way.‬
‭4.‬ ‭Non-parametric method: Decision trees are a non-parametric method, meaning‬
‭that they do not assume any specific distribution for the data.‬
‭5.‬ ‭Can handle non-linear relationships: Decision trees can handle non-linear‬
‭relationships between features, making them useful for non-linear classification‬
‭and regression problems.‬

‭Disadvantages:‬

1. Prone to overfitting: Decision trees are prone to overfitting, as they can create
‭overly complex trees that capture the noise in the training data. Regularization‬
‭techniques such as pruning can be used to mitigate this issue.‬
‭2.‬ ‭Can be unstable: Small changes in the training data can result in significant‬
‭changes to the resulting decision tree, making them potentially unstable.‬
3. Biased towards features with more levels: Decision trees are biased towards
‭features with more levels, as these features can result in more splits and‬
‭therefore more information gain.‬
‭4.‬ ‭Poor performance with imbalanced data: Decision trees can perform poorly with‬
‭imbalanced data, where one class is significantly more prevalent than the others.‬
‭5.‬ ‭Not suitable for some tasks: Decision trees may not be suitable for tasks where‬
‭the relationship between features and the target variable is too complex or not‬
‭well-defined.‬

Overall, decision trees are a useful machine learning algorithm with several advantages and disadvantages. Careful consideration should be given to the specific application and the characteristics of the data before deciding whether to use a decision tree or another algorithm.

‭ .What is SVM? Briefly explain support vectors, hyperplane, and margin with respect‬
Q
‭to SVM.‬

SVM (Support Vector Machine) is a popular machine learning algorithm used for classification and regression tasks. The basic idea of SVM is to find the hyperplane that best separates the data into different classes. The hyperplane is chosen so that it maximizes the distance between the closest points of different classes, which are called support vectors. The distance between the support vectors and the hyperplane is called the margin.

‭Here are some key concepts in SVM:‬

1. Support Vectors: Support vectors are the data points that lie closest to the
‭decision boundary or hyperplane. These are the critical data points that‬
‭determine the position and orientation of the hyperplane. SVM aims to maximize‬
‭the margin, which is the distance between the hyperplane and the closest‬
‭support vectors.‬
‭2.‬ ‭Hyperplane: A hyperplane is a decision boundary that separates the data into‬
‭different classes. In SVM, the goal is to find the hyperplane that maximizes the‬
‭margin, which is the distance between the hyperplane and the closest support‬
‭vectors.‬
‭3.‬ ‭Margin: The margin is the distance between the hyperplane and the closest‬
‭support vectors. In SVM, the goal is to find the hyperplane that maximizes the‬
margin, as this is expected to lead to better generalization performance on new, unseen data.

SVM can be used for both linear and non-linear classification tasks. In the case of non-linear classification, SVM can use a kernel function to map the input data into a higher-dimensional feature space, where the data can be more easily separated by a linear hyperplane.

Overall, SVM is a powerful machine learning algorithm that has been shown to perform well in many applications. Its ability to handle both linear and non-linear classification tasks, and its robustness to outliers and noise, make it a popular choice for many machine learning practitioners.

‭Q.Write a note on the Types of SVMs. Why SVMs are used in Machine Learning?‬

There are several types of SVMs, which can be classified based on their use case and the complexity of the decision boundary:

1. Linear SVM: In linear SVM, the decision boundary is a linear hyperplane. Linear
‭SVM is used when the data is linearly separable and can be separated into‬
‭different classes using a straight line.‬
‭2.‬ ‭Non-linear SVM: In non-linear SVM, the decision boundary is a non-linear function‬
‭that can separate the data into different classes. Non-linear SVM is used when‬
‭the data is not linearly separable and requires a more complex decision‬
‭boundary.‬
‭3.‬ ‭Support Vector Regression (SVR): In SVR, the goal is to predict a continuous‬
‭output variable, rather than a discrete class label. The hyperplane in SVR is‬
‭chosen to minimize the deviation of the predicted output from the actual output.‬
‭4.‬ ‭Nu-SVM: Nu-SVM is a variant of SVM that uses a parameter called "nu" to control‬
‭the trade-off between the margin and the number of support vectors. This can‬
‭lead to a more efficient and accurate SVM model.‬
‭5.‬ ‭One-Class SVM: One-Class SVM is used for anomaly detection, where the goal is‬
‭to identify data points that are significantly different from the rest of the data.‬
‭One-Class SVM is trained on only one class of data and is used to identify data‬
‭points that are outside the decision boundary.‬
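Non-linear SVMs rely on a kernel function. The widely used RBF (Gaussian) kernel scores the similarity of two points as exp(-gamma * ||x - z||^2); here is a minimal sketch (the helper name and the gamma value are our own choices):

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel: 1.0 for identical points, decaying toward
    0.0 as the points move apart in input space."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)

print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))  # identical points → 1.0
print(rbf_kernel([0.0, 0.0], [3.0, 4.0]))  # distant points → close to 0
```

An SVM never needs the high-dimensional mapping explicitly; it only ever evaluates this kernel between pairs of training points, which is what makes the kernel trick efficient.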

‭SVMs are used in machine learning for several reasons:‬


1. Robustness: SVMs are robust to outliers and noise in the data, as they aim to
‭maximize the margin between the classes rather than minimizing the error.‬
‭2.‬ ‭Flexibility: SVMs can handle both linear and non-linear classification tasks, as‬
‭well as regression tasks.‬
‭3.‬ ‭Efficiency: SVMs can be efficiently trained on large datasets, as they only depend‬
‭on the support vectors.‬
‭4.‬ ‭Generalization: SVMs have good generalization performance on new, unseen‬
‭data, as they aim to maximize the margin and minimize the error.‬

Overall, SVMs are a powerful machine learning algorithm that can be used for a wide range of classification and regression tasks. Their robustness, flexibility, efficiency, and generalization performance make them a popular choice for many machine learning practitioners.

‭Q.Explain the working of SVM‬

‭The working of SVM can be summarized in the following steps:‬

1. Input data: SVM takes as input a set of labeled training examples, where each example is a pair of input feature vectors and their corresponding class labels.
2. Feature mapping: In non-linear SVM, the input features are mapped to a higher-dimensional feature space using a kernel function. This mapping is done to make the data more separable by a linear hyperplane in the new feature space.
3. Margin maximization: SVM finds the hyperplane that maximizes the margin between the closest points of different classes. The hyperplane is chosen such that it maximizes the distance between the hyperplane and the closest data points, which are called support vectors.
4. Classification: After the hyperplane is identified, new input examples can be classified based on which side of the hyperplane they fall on.
5. Model optimization: In SVM, the optimization problem is formulated as a convex optimization problem, which can be solved efficiently using various numerical methods. The goal is to minimize the sum of misclassifications subject to the constraint that the margin is maximized.
6. Model evaluation: After the model is trained on the labeled training examples, it is evaluated on a separate set of test examples to measure its performance. The performance metrics can include accuracy, precision, recall, and F1 score.
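For the linear case, the steps above can be sketched with a toy hinge-loss sub-gradient trainer (an illustrative sketch, not a production solver; the function names, learning rate, regularization strength, and sample data are our own inventions):

```python
def train_linear_svm(X, y, lr=0.01, lam=0.01, epochs=500):
    """Minimize the regularized hinge loss lam*||w||^2 + sum(max(0, 1 - y*(w.x + b)))
    by sub-gradient descent. Labels must be +1 or -1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # inside the margin or misclassified: hinge term is active
                w = [wj - lr * (2 * lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:           # correctly classified with room to spare: only shrink w
                w = [wj - lr * 2 * lam * wj for wj in w]
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Two linearly separable clusters.
X = [[1, 1], [2, 1], [1, 2], [6, 6], [7, 6], [6, 7]]
y = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(X, y)
print([predict(w, b, x) for x in X])  # should recover the training labels
```

The margin < 1 test is where the support-vector idea shows up: only points on or inside the margin ever push the hyperplane, which mirrors the fact that the trained model depends only on its support vectors.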

SVM is a versatile machine learning algorithm that can be used for both classification and regression tasks. In the case of regression, the goal is to predict a continuous output variable rather than a discrete class label. In regression SVM, the hyperplane is chosen to minimize the deviation of the predicted output from the actual output.

Overall, SVM is a powerful machine learning algorithm that can handle both linear and non-linear classification and regression tasks. Its ability to maximize the margin between classes, handle noisy data, and generalize well to new, unseen data make it a popular choice for many machine learning practitioners.

‭Q.Briefly outline the use of Gini Impurity vs Entropy in decision trees.‬

Gini Impurity and Entropy are two popular measures of impurity used in decision tree algorithms. Both measures are used to determine the best split in a decision tree by minimizing the impurity of the resulting child nodes.

Gini Impurity measures the probability of misclassification if a random sample is classified according to the distribution of the classes in a given node. A Gini score of 0 indicates that all the samples in a node belong to the same class, while the score is highest (1 - 1/k for k classes, or 0.5 for a binary problem) when the samples are equally distributed among all classes. Gini Impurity is often preferred in practice for its computational efficiency, since it avoids the logarithm computations that entropy requires.

Entropy, on the other hand, measures the degree of disorder or uncertainty in a set of samples, and determines the information gain that would result from splitting a node based on a particular attribute. Entropy is calculated as the negative sum of the class probabilities weighted by their base-2 logarithms. A low entropy score indicates that the samples in a node mostly belong to the same class, while a high entropy score indicates that the samples are more evenly distributed among the classes. Entropy can be more expensive to compute, especially on large datasets, but in most cases the two measures rank candidate splits very similarly.

In practice, both Gini Impurity and Entropy can be used interchangeably in decision trees, and their performance depends on the specific dataset and problem at hand. Some decision tree algorithms, such as CART, allow the user to choose between Gini Impurity and Entropy as the splitting criterion.
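The two criteria can be compared side by side on the same class distributions (an illustrative sketch; the helper names are our own):

```python
import math

def gini(probs):
    return 1.0 - sum(p ** 2 for p in probs)

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Both measures are 0 for a pure node and peak at a uniform distribution,
# which is why they usually rank candidate splits the same way.
for probs in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):
    print(probs, round(gini(probs), 3), round(entropy(probs), 3))
```

For the binary case, Gini ranges over [0, 0.5] while entropy ranges over [0, 1] bit, but both are minimized and maximized at the same distributions.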
