
MLT Unit 1

Uploaded by

sahil.utube2003
Copyright
© All Rights Reserved

Unit 1

Introduction to Machine Learning


1.1-Machine Learning and its Types
Machine Learning
 Machine learning is a branch of artificial intelligence (AI) and computer
science which focuses on the use of data and algorithms to imitate the
way that humans learn, gradually improving its accuracy.
 Machine learning is an important component of the growing field of data
science. Through the use of statistical methods, algorithms are trained to
make classifications or predictions, and to uncover key insights in data
mining projects.
 Machine learning enables a machine to automatically learn from data,
improve performance from experiences, and predict things without being
explicitly programmed.
 A Machine Learning system learns from historical data, builds the
prediction models, and whenever it receives new data, predicts the output
for it.
 The accuracy of the predicted output depends on the amount and quality of
the data: a larger, representative dataset generally helps to build a
model that predicts the output more accurately.

Terminology:

 Model: Also known as a “hypothesis”, a machine learning model is the
mathematical representation of a real-world process. A machine learning
algorithm along with the training data builds a machine learning model.
 Feature: A feature is a measurable property or parameter of the data-set.
 Feature Vector: It is a set of multiple numeric features. We use it as an
input to the machine learning model for training and prediction purposes.
 Training: An algorithm takes a set of data known as “training data” as
input. The learning algorithm finds patterns in the input data and trains
the model for expected results (target). The output of the training process
is the machine learning model.
 Prediction: Once the machine learning model is ready, it can be fed with
input data to provide a predicted output.
 Target (Label): The value that the machine learning model has to predict
is called the target or label.
 Overfitting: When a model is too complex for the amount of training data,
or is trained on it for too long, it tends to learn the noise and
inaccurate entries in that data. The model then performs well on the
training data but fails to generalize to new data.
 Underfitting: The scenario in which the model fails to capture the
underlying trend in the input data, so its accuracy is poor even on the
training data. In simple terms, the model or the algorithm does not fit
the data well enough.
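
To make overfitting and underfitting concrete, here is a minimal sketch in plain Python. The data, the three toy models, and all function names are assumptions made for illustration: a constant "mean" model stands in for underfitting, a least-squares line matches the true trend, and a "memorizer" that copies the nearest training target stands in for overfitting.

```python
# Noisy training data: the true trend is y = 2x, with alternating +1/-1 noise.
train = [(float(x), 2.0 * x + (1.0 if x % 2 == 0 else -1.0)) for x in range(10)]
# Unseen, noise-free test points that lie between the training inputs.
test = [(x + 0.25, 2.0 * (x + 0.25)) for x in range(9)]

def mse(model, data):
    """Mean squared error of `model` (a function x -> y) on `data`."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

def fit_mean(data):
    """Underfitting candidate: always predicts the average target."""
    mean_y = sum(y for _, y in data) / len(data)
    return lambda x: mean_y

def fit_line(data):
    """Least-squares straight line, which has the same form as the true trend."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in data)
             / sum((x - mean_x) ** 2 for x, _ in data))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def fit_memorizer(data):
    """Overfitting candidate: returns the target of the nearest training point."""
    return lambda x: min(data, key=lambda point: abs(point[0] - x))[1]

models = {"mean": fit_mean(train), "line": fit_line(train),
          "memorizer": fit_memorizer(train)}
errors = {name: (mse(m, train), mse(m, test)) for name, m in models.items()}
for name, (train_err, test_err) in errors.items():
    print(f"{name:10s} train MSE = {train_err:6.2f}   test MSE = {test_err:6.2f}")
```

In this sketch the memorizer has zero training error yet a larger test error than the line, because it has also learned the noise. The mean model is poor on both sets, because it ignores the trend.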

Types of Machine Learning

1.Supervised Machine Learning

 Supervised learning is the type of machine learning in which machines
are trained using well-labelled training data and, on the basis of that
data, predict the output.
 Labelled data means that the input data is already tagged with the
correct output.
 In supervised learning, the training data provided to the machine works
as the supervisor that teaches the machine to predict the output correctly.
 The aim of a supervised learning algorithm is to find a mapping function
that maps the input variable X to the output variable Y.
 Supervised learning can be used for risk assessment, image
classification, and fraud detection.

How Supervised learning works

 In supervised learning, models are trained using a labelled dataset, from
which the model learns about each type of data. Once the training process
is completed, the model is tested on test data (data held out from
training) and then predicts the output.
 Suppose we have a dataset of different types of shapes, including
squares, rectangles, triangles, and other polygons such as hexagons. The
first step is to train the model on each shape.
 If the given shape has four sides, and all the sides are equal, then it will
be labelled as a square.
 If the given shape has three sides, then it will be labelled as a triangle.
 If the given shape has six equal sides, then it will be labelled as a
hexagon.
 After training, we test our model using the test set, and the task of
the model is to identify the shape.
 The machine has been trained on all of these shapes, so when it sees a
new shape, it classifies the shape on the basis of its number of sides and
predicts the output.
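
The shape example can be sketched in a few lines of Python. This is only an illustration under assumptions: the feature vector (number of sides, whether all sides are equal), the tiny training set, and the "most common label per feature vector" learner are all hypothetical simplifications, not a real learning algorithm from a library.

```python
from collections import Counter, defaultdict

# Labelled training data: (number_of_sides, all_sides_equal) -> shape name.
training_data = [
    ((4, True), "square"),
    ((4, True), "square"),
    ((4, False), "rectangle"),
    ((3, True), "triangle"),
    ((3, False), "triangle"),
    ((6, True), "hexagon"),
]

def train(examples):
    """Learn the most common label seen for each feature vector."""
    votes = defaultdict(Counter)
    for features, label in examples:
        votes[features][label] += 1
    return {features: counts.most_common(1)[0][0]
            for features, counts in votes.items()}

def predict(model, features):
    """Return the learned label, or 'unknown' for a feature vector not seen in training."""
    return model.get(features, "unknown")

model = train(training_data)
print(predict(model, (4, True)))   # a new four-sided shape with equal sides
print(predict(model, (6, True)))   # a new six-sided shape with equal sides
print(predict(model, (5, True)))   # pentagons were not in the training data
```

The last call shows a limitation of this sketch: a model can only predict labels for kinds of input that the labelled training data covered.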

Steps involved in supervised learning

 First, determine the type of training dataset.
 Collect/gather the labelled training data.
 Split the dataset into a training dataset, a validation dataset, and a
test dataset.
 Determine the input features of the training dataset, which should carry
enough information for the model to predict the output accurately.
 Determine a suitable algorithm for the model, such as a support vector
machine, a decision tree, etc.
 Execute the algorithm on the training dataset. Sometimes we also need a
validation set, held out from the training data, to tune the control
parameters of the model.
 Evaluate the accuracy of the model on the test set. If the model predicts
the correct output for most test examples, the model is accurate.
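
The splitting step can be sketched as follows. The 60/20/20 proportions, the fixed seed, and the function name `split_dataset` are assumptions chosen for illustration; the right proportions depend on how much data is available.

```python
import random

def split_dataset(examples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle the examples and split them into training, validation, and test lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]       # whatever remains is the test set
    return train, validation, test

# Hypothetical labelled examples: (feature, label).
examples = [(x, x % 2) for x in range(20)]
train_set, val_set, test_set = split_dataset(examples)
print(len(train_set), len(val_set), len(test_set))
```

Shuffling before splitting avoids giving one subset only the first or last rows of an ordered dataset. Each example ends up in exactly one of the three subsets.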

Types of Supervised Machine Learning Algorithms

1.Regression

Regression algorithms are used when there is a relationship between the input
variables and a continuous output variable. They are used for the prediction of
continuous quantities, such as in weather forecasting, market trends, etc. Below
are some popular regression algorithms which come under supervised learning:

 Linear Regression
 Regression Trees
 Non-Linear Regression
 Bayesian Linear Regression
 Polynomial Regression
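
As a small illustration of the first algorithm in the list, the sketch below fits a straight line by ordinary least squares using the closed-form formulas for one input variable. The temperature data and the function name are hypothetical.

```python
def fit_linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: temperature (a continuous target) against hour of the day.
hours = [6.0, 8.0, 10.0, 12.0, 14.0]
temps = [15.0, 18.5, 21.0, 25.5, 28.0]
slope, intercept = fit_linear_regression(hours, temps)
predicted = slope * 11.0 + intercept   # prediction for an hour not in the data
print(f"temp = {slope:.2f} * hour + {intercept:.2f}; at 11h: {predicted:.2f}")
```

The output of the model is a number on a continuous scale, which is what distinguishes regression from classification.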

2. Classification

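Classification algorithms are used when the output variable is categorical,
which means it takes one of a fixed set of classes. With two classes, such as
Yes-No, Male-Female, or True-False, the problem is called binary
classification; with more than two, multi-class classification. Some popular
classification algorithms are: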

 Random Forest
 Decision Trees
 Logistic Regression
 Support vector Machines
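
To show what a classifier learns, here is a minimal sketch of a decision stump, which can be thought of as a decision tree with a single split on one numeric feature. The spam data, the feature, and the function names are assumptions for illustration. Real decision-tree learners choose splits with criteria such as Gini impurity rather than a raw error count.

```python
def fit_stump(values, labels):
    """Find the threshold on one numeric feature that misclassifies the fewest examples.

    The learned rule is: predict 1 if value > threshold, else 0.
    """
    candidates = sorted(set(values))
    # Try thresholds halfway between consecutive distinct feature values.
    thresholds = [(a + b) / 2 for a, b in zip(candidates, candidates[1:])]
    best_threshold, best_errors = None, len(values) + 1
    for threshold in thresholds:
        errors = sum((value > threshold) != bool(label)
                     for value, label in zip(values, labels))
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def predict(threshold, value):
    """Apply the learned rule to a new feature value."""
    return 1 if value > threshold else 0

# Hypothetical data: number of suspicious words per email; label 1 means spam.
suspicious_words = [0, 1, 1, 2, 6, 7, 9, 10]
is_spam = [0, 0, 0, 0, 1, 1, 1, 1]
threshold = fit_stump(suspicious_words, is_spam)
print(threshold, predict(threshold, 8), predict(threshold, 1))
```

Here the output is one of two classes, so this is a binary classification problem.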

Advantages of Supervised Machine Learning

 With the help of supervised learning, the model can predict the output on
the basis of prior experiences.
 In supervised learning, we can have an exact idea about the classes of
objects.
 Supervised learning model helps us to solve various real-world problems
such as fraud detection, spam filtering, etc.

Disadvantages of Supervised Machine Learning

 Supervised learning models can struggle with very complex tasks,
particularly when labelled examples are scarce.
 Supervised learning may not predict the correct output if the test data
differs substantially from the training dataset.
 Training can require a lot of computation time.
 In supervised learning, we need enough knowledge about the classes of
objects.

2.Unsupervised Machine Learning

 As the name suggests, unsupervised learning is a machine learning
technique in which models are not supervised using a labelled training
dataset. Instead, the models themselves find hidden patterns and insights
in the given data. It can be compared to the learning which takes place in
the human brain while learning new things.
 Unsupervised learning is a type of machine learning in which models are
trained using an unlabelled dataset and are allowed to act on that data
without any supervision.
 Unsupervised learning cannot be directly applied to a regression or
classification problem because unlike supervised learning, we have the
input data but no corresponding output data.
 The goal of unsupervised learning is to find the underlying structure of
dataset, group that data according to similarities, and represent that
dataset in a compressed format.

Why use Unsupervised Learning

 Unsupervised learning is helpful for finding useful insights from the data.
 Unsupervised learning is similar to how a human learns to think from
their own experiences, which is sometimes said to make it closer to real AI.
 Unsupervised learning works on unlabelled and uncategorized data, which
is why it is important in practice.
 In the real world, we do not always have input data with the corresponding
output, so to solve such cases we need unsupervised learning.

Working of Unsupervised learning

 We take unlabelled input data, which means it is not categorized and the
corresponding outputs are not given.
 This unlabelled input data is fed to the machine learning model in
order to train it.
 First, the model interprets the raw data to find hidden patterns in it,
using a suitable algorithm such as k-means clustering or hierarchical
clustering.
 Once the algorithm is applied, it divides the data objects into groups
according to the similarities and differences between the objects.

Types of Unsupervised Learning Algorithms


1.Clustering

 Clustering is a method of grouping objects into clusters such that
objects with the most similarities remain in one group and have few or no
similarities with the objects of another group.
 Cluster analysis finds the commonalities between the data objects and
categorizes them according to the presence or absence of those
commonalities.
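
Clustering can be illustrated with a minimal k-means sketch in plain Python. The points, the choice of starting centroids, and the function names are assumptions for illustration. Passing the starting centroids in explicitly keeps the example deterministic; practical implementations usually choose them randomly and run several restarts.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, initial_centroids, max_iterations=100):
    """Plain k-means: alternate assignment and centroid update until labels stop changing."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster with the nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = tuple(sum(coord) / len(members)
                                     for coord in zip(*members))
    return labels, centroids

# Unlabelled data: two visibly separate groups of 2-D points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centroids = k_means(points, initial_centroids=[points[0], points[3]])
print(labels)
print(centroids)
```

No labels are supplied anywhere: the two groups emerge only from the distances between the points.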

2.Association

 Association rule learning is an unsupervised learning method used for
finding relationships between variables in a large database.
 It determines the sets of items that occur together in the dataset.
Association rules can make a marketing strategy more effective. For
example, people who buy item X (say, bread) also tend to purchase item Y
(butter or jam). A typical example of association rule learning is Market
Basket Analysis.
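
The bread-and-butter example can be quantified with the two standard measures used for association rules, support and confidence. The sketch below computes both for a small set of made-up transactions; the data and function names are assumptions for illustration.

```python
# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "milk"},
    {"milk", "eggs"},
    {"bread", "butter", "milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) = support(both) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

print(support({"bread", "butter"}, transactions))
print(confidence({"bread"}, {"butter"}, transactions))
```

In this toy data, bread and butter appear together in three of the five baskets, and three of the four baskets that contain bread also contain butter. Algorithms such as Apriori search for all rules whose support and confidence exceed chosen minimums.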

Unsupervised Learning algorithms:

 K-means clustering
 Hierarchical clustering
 Anomaly detection
 Neural Networks
 Principal Component Analysis
 Independent Component Analysis
 Apriori algorithm
 Singular value decomposition

Advantages of Unsupervised Learning

 Unsupervised learning can be applied to tasks that supervised learning
cannot handle directly, because it does not need labelled input data.
 Unsupervised learning is often preferable because unlabelled data is
easier to obtain than labelled data.

Disadvantages of Unsupervised Learning

 Unsupervised learning is intrinsically more difficult than supervised
learning as it does not have corresponding output.
 The result of an unsupervised learning algorithm might be less accurate,
as the input data is not labelled and the algorithm does not know the exact
output in advance.

3.Reinforcement Learning

 Reinforcement Learning (RL) is a feedback-based machine learning technique
in which an agent learns to behave in an environment by performing
actions and seeing the results of those actions. For each good action, the
agent gets positive feedback, and for each bad action, the agent gets
negative feedback or a penalty.
 In RL, the agent learns automatically from feedback, without the
labelled data used in supervised learning.
 RL solves a specific type of problem where decision making is
sequential and the goal is long-term, such as game playing, robotics, etc.
 The agent interacts with the environment and explores it by itself. The
primary goal of an agent in RL is to improve its performance by collecting
the maximum positive reward.
 RL is a type of machine learning method where an intelligent agent, i.e.,
a computer program, interacts with the environment and learns to act within
it.
 The agent keeps doing three things: taking an action, changing state (or
remaining in the same state), and getting feedback. By doing so, it learns
about and explores the environment.
 The agent learns which actions lead to positive feedback and which
actions lead to negative feedback or a penalty. As a positive reward the
agent gets a positive point, and as a penalty it gets a negative point.
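
The act, observe, and get-feedback loop can be sketched with tabular Q-learning, one common RL algorithm (the notes above do not name a specific one). Everything in the sketch is an assumption for illustration: a corridor of five states, a reward of +1 at the goal, a penalty of -1 for walking into the wall, and the learning rate and discount values.

```python
import random

N_STATES = 5          # states 0..4 on a line; state 4 is the goal
ACTIONS = (-1, +1)    # move left or move right

def step(state, action):
    """Environment: returns (next_state, reward, done)."""
    next_state = state + action
    if next_state < 0:
        return state, -1.0, False        # bumped into the wall: penalty
    if next_state == N_STATES - 1:
        return next_state, 1.0, True     # reached the goal: positive reward
    return next_state, 0.0, False

def q_learning(episodes=500, alpha=0.5, gamma=0.9, max_steps=100, seed=0):
    """Learn action values q[state][action_index] from random exploration."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for episode in range(episodes):
        state = 0
        for _ in range(max_steps):
            a = rng.randrange(len(ACTIONS))   # explore with random actions
            next_state, reward, done = step(state, ACTIONS[a])
            # Q-learning target: reward plus discounted best value of the next state.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
            if done:
                break
    return q

q = q_learning()
# Greedy policy: in each non-goal state, pick the action with the higher value.
policy = ["left" if row[0] > row[1] else "right" for row in q[:-1]]
print(policy)
```

The agent is never told that moving right is correct. It discovers this from the rewards and penalties it collects, which is the difference from supervised learning described above.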

Types of RL

1.Positive Reinforcement learning

 Positive reinforcement means adding something to increase the tendency
that the expected behaviour will occur again. It impacts the behaviour of
the agent positively and increases the strength of the behaviour.
 This type of reinforcement can sustain the change for a long time, but too
much positive reinforcement may lead to an overload of states, which can
diminish the results.

2.Negative Reinforcement learning


 Negative reinforcement is the opposite of positive reinforcement: it
increases the tendency that a specific behaviour will occur again by
removing or avoiding a negative condition.
 Depending on the situation and behaviour, it can be more effective than
positive reinforcement, but it provides only enough reinforcement to meet
the minimum behaviour.

1.2-Well-posed Learning Problems


A computer program is said to learn from experience E with respect to some task
T and some performance measure P, if its performance on T, as measured by P,
improves with experience E.

Any problem can be classified as a well-posed learning problem if it has three
traits:

 Task
 Performance Measure
 Experience

Some examples that illustrate well-posed learning problems are:

1. To better filter emails as spam or not

 Task – Classifying emails as spam or not spam
 Performance Measure – The fraction of emails accurately classified as
spam or not spam
 Experience – Observing you label emails as spam or not spam

2. A checkers learning problem

 Task – Playing the game of checkers
 Performance Measure – Percent of games won against opponents
 Experience – Playing practice games against itself

3. Handwriting Recognition Problem

 Task – Recognizing and classifying handwritten words within images
 Performance Measure – Percent of words accurately classified
 Experience – A database of handwritten words with given classifications

4. A Robot Driving Problem

 Task – Driving on public four-lane highways using vision sensors
 Performance Measure – Average distance travelled before an error
 Experience – A sequence of images and steering commands recorded while
observing a human driver
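
For the spam example, the performance measure P can be written down directly. The sketch below uses made-up labels and made-up predictions from two hypothetical versions of a classifier, and checks whether P improved between them.

```python
def accuracy(predicted, actual):
    """Performance measure P: fraction of emails classified correctly."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Hypothetical labels a user gave to six emails (the source of experience E).
user_labels = ["spam", "ham", "spam", "ham", "ham", "spam"]
# Hypothetical predictions on the same emails (task T) from two model versions.
early_model = ["ham", "ham", "spam", "spam", "ham", "ham"]
later_model = ["spam", "ham", "spam", "spam", "ham", "spam"]

p_early = accuracy(early_model, user_labels)
p_later = accuracy(later_model, user_labels)
print(p_early, p_later, "learning" if p_later > p_early else "not learning")
```

Under the definition above, the program counts as learning only if P on task T rises as experience E accumulates, which is what the final comparison checks.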

1.3-Applications of Machine Learning

1. Image Recognition:

Image recognition is one of the most common applications of machine learning.
It is used to identify objects, persons, places, etc. in digital images. A popular
use case of image recognition and face detection is the automatic friend tagging
suggestion:

Facebook provides a feature of automatic friend tagging suggestions. Whenever we
upload a photo with our Facebook friends, we automatically get a tagging
suggestion with a name, and the technology behind this is machine learning's face
detection and recognition algorithm.

It is based on the Facebook project named "DeepFace," which is responsible
for face recognition and person identification in the picture.

2. Speech Recognition

While using Google, we get an option of "Search by voice." It comes under
speech recognition, which is a popular application of machine learning.

Speech recognition is the process of converting voice instructions into text, and
it is also known as "Speech to text" or "Computer speech recognition." At
present, machine learning algorithms are widely used by various applications of
speech recognition. Google Assistant, Siri, Cortana, and Alexa use speech
recognition technology to follow voice instructions.

3. Traffic prediction:

If we want to visit a new place, we take the help of Google Maps, which shows us
the shortest route and predicts the traffic conditions.

It predicts whether traffic is clear, slow-moving, or heavily congested
in two ways:

 The real-time location of vehicles from the Google Maps app and sensors
 The average time taken on past days at the same time of day

Everyone who uses Google Maps is helping to make the app better. It takes
information from the user and sends it back to its database to improve
performance.

1.4-Issues in Machine Learning


1. Inadequate Training Data

 A major issue that arises while using machine learning algorithms is
the lack of quality as well as quantity of data. Data plays a vital role in
machine learning, and many data scientists report that inadequate, noisy,
and unclean data severely degrade the performance of machine learning
algorithms.
 For example, a simple task may require thousands of samples, and an
advanced task such as speech or image recognition may need millions of
samples.
 Further, data quality is also important for the algorithms to work well,
but poor data quality is common in machine learning applications. Data
quality can be affected by factors such as the following:

 Noisy data- It is responsible for inaccurate predictions, which affect
the decisions as well as the accuracy in classification tasks.
 Incorrect data- It is responsible for faulty results obtained from
machine learning models. Hence, incorrect data may also affect the
accuracy of the results.
 Generalizing of output data- Sometimes generalizing from the output
data becomes complex, which results in comparatively poor future actions.

2. Poor quality of data


 As we have discussed above, data plays a significant role in machine
learning, and it must be of good quality as well.
 Noisy data, incomplete data, inaccurate data, and unclean data lead to
lower accuracy in classification and low-quality results.
 Hence, data quality can also be considered a major common problem
while applying machine learning algorithms.

3. Non-representative training data

 For our trained model to generalize well, we have to ensure that the
sample training data is representative of the new cases that we need to
generalize to. The training data should cover cases that have already
occurred as well as cases that are occurring now.
 Further, if we use non-representative training data in the model, it
results in less accurate predictions.
 A machine learning model is said to be ideal if it predicts well for
generalized cases and provides accurate decisions. If there is too little
training data, there will be sampling noise in the model; such a training
set is called a non-representative training set.
 A model trained on such data won't be accurate in its predictions and
may be biased against one class or group.
 Hence, we should use representative data in training to protect against
bias and to make accurate predictions without any drift.

4. Monitoring and maintenance

 Because every machine learning model must keep producing output that
generalizes well, regular monitoring and maintenance become compulsory.
 Different results for different actions require changes to the data; hence
editing the code, as well as resources for monitoring it, also become
necessary.

5. Getting bad recommendations

 A machine learning model operates under a specific context; when that
context changes, the result is bad recommendations and concept drift in
the model.
 For example, at a specific time a customer is looking for some gadgets,
but the customer's requirements change over time. If the machine learning
model keeps showing the same recommendations even though the customer's
expectations have changed, the recommendations become bad ones.
 This situation is called data drift. It generally occurs when new data is
introduced or the interpretation of the data changes. However, we can
overcome this by regularly monitoring the data and updating the model
according to the expectations.
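
One simple way to monitor for drift is to compare recent data against the data the model was trained on. The sketch below is a crude heuristic under assumptions: the feature, the numbers, and the three-standard-deviation threshold are all made up for illustration, and production systems typically use proper statistical tests.

```python
from statistics import mean, stdev

def has_drifted(reference, recent, threshold=3.0):
    """Flag drift when the recent mean lies more than `threshold` reference
    standard deviations away from the reference mean (a crude heuristic)."""
    shift = abs(mean(recent) - mean(reference))
    return shift > threshold * stdev(reference)

# Hypothetical feature: average price of items a customer viewed per session.
training_period = [100, 105, 98, 102, 101, 99, 103, 97]
similar_period = [101, 99, 104, 100]
changed_period = [150, 160, 155, 158]

print(has_drifted(training_period, similar_period))
print(has_drifted(training_period, changed_period))
```

A check like this can run on a schedule, with a positive result prompting a review of the data and possibly retraining of the model.
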
6. Data Bias

 Data bias is also a big challenge in machine learning. These errors
exist when certain elements of the dataset are weighted more heavily or
given more importance than others.
 Biased data leads to inaccurate results, skewed outcomes, and other
analytical errors. However, we can resolve this error by determining
where the data is actually biased in the dataset and then taking the
necessary steps to reduce it.
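
A first step in finding where a dataset is biased is to look at how its classes are distributed. The sketch below uses hypothetical loan-decision labels, and the 20% cutoff is an arbitrary example value, not a recommended standard.

```python
from collections import Counter

def class_shares(labels):
    """Return each class's share of the dataset as a fraction."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

def underrepresented(labels, min_share=0.2):
    """Classes whose share falls below `min_share` (an arbitrary example cutoff)."""
    return sorted(label for label, share in class_shares(labels).items()
                  if share < min_share)

# Hypothetical labels: 18 approved applications and only 2 rejected ones.
labels = ["approved"] * 18 + ["rejected"] * 2
print(class_shares(labels))
print(underrepresented(labels))
```

Once an underrepresented class is identified, typical remedies include collecting more examples of it or re-weighting the examples during training.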

1.5-Types of Data

Almost anything can be turned into data. Building a deep understanding of the
different data types is a crucial prerequisite for doing Exploratory Data Analysis
(EDA) and feature engineering for machine learning models.

Quantitative Data Types

This type of data consists of numerical values: anything which is measured
by numbers.

Numerical Data-These are numbers, and can be split into two categories.

A-Discrete Data

Numbers that are limited to integers. Example: - The number of cars passing by.

Numerical data which have discrete values or whole numbers.

Values of this type have no proper meaning if expressed in decimal format.
Their values can be counted.

B-Continuous Data

Numbers that can take any value within a range. Example: - The price of an item,
the size of an item.

Numerical measures which can take any value within a certain range. Values of
this type keep their meaning when expressed in decimal format.

Their values cannot be counted, but they can be measured.

Qualitative Data Types

These are the data types that cannot be expressed in numbers. This describes
categories or groups and hence known as the categorical data types.
A-Categorical Data

These are values that cannot be measured up against each other. Example: -
Colour values or any Yes/No values.

B-Structured Data

This type of data is either numbers or words and follows a fixed format. It is
typically expressed in tabular form.

C-Unstructured Data

This type of data does not have a proper format and is thus known as unstructured
data. It comprises textual data, sounds, images, etc.

D-Ordinal Data

These are like categorical data, but can be measured up against each other.

Example-School grades where A is better than B and so on.

E-Nominal Data

These are categories with no inherent order, so they cannot be measured up
against each other. This data type also belongs to the categorical type.
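
The difference between ordinal and nominal data matters when turning categories into numbers for a model. In the sketch below, ordinal values are mapped to ranks that preserve their order, while nominal values are one-hot encoded so that no order is implied. The grade ordering, the colour list, and the function names are assumptions for illustration.

```python
GRADE_ORDER = ["D", "C", "B", "A"]   # worst to best; an assumed ordering

def encode_ordinal(value, ordered_categories):
    """Ordinal data: map each category to its rank, preserving the order."""
    return ordered_categories.index(value)

def encode_one_hot(value, categories):
    """Nominal data: one indicator per category, implying no order."""
    return [1 if value == category else 0 for category in categories]

colours = ["red", "green", "blue"]
print(encode_ordinal("A", GRADE_ORDER), encode_ordinal("B", GRADE_ORDER))
print(encode_one_hot("green", colours))
```

Encoding colours as 0, 1, 2 would wrongly suggest that blue is "greater than" red, which is why nominal data is usually given a separate indicator per category.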

1.6-Data remediation
 Data remediation is the process of cleansing, organizing, and migrating
data so that it’s properly protected and best serves its intended purpose.
 There is a misconception that data remediation simply means deleting
business data that is no longer needed.
 It is important to remember that the key word “remediation” derives from
the word “remedy,” which is to correct a mistake. Since the core initiative
is to correct data, the data remediation process typically involves
replacing, modifying, cleansing, or deleting any “dirty” data.
Data remediation terminology

As you explore the data remediation process, you will come across unique
terminology. These are common terms related to data remediation that you
should get acquainted with.

 Data Migration – The process of moving data between two or more
systems, data formats or servers.
 Data Discovery – A manual or automated process of searching for
patterns in data sets to identify structured and unstructured data in an
organization’s systems.
 ROT – An acronym that stands for redundant, obsolete, and trivial data.
According to the Association for Intelligent Information Management,
ROT data accounts for nearly 80 percent of the unstructured data that is
beyond its recommended retention period and no longer useful to an
organization.
 Dark Data – Any information that businesses collect, process and store,
but do not use for other purposes. Some examples include customer call
records, raw survey data or email correspondences. Often, the storing and
securing of this type of data incurs more expense and sometimes even
greater risk than it does value.
 Dirty Data – Data that damages the integrity of the organization’s
complete dataset. This can include data that is unnecessarily duplicated,
outdated, incomplete or inaccurate.
 Data Overload – This is when an organization has acquired too much
data, including low-quality or dark data. Data overload makes the tasks of
identifying, classifying and remediating data laborious.
 Data Cleansing – Transforming data in its native state to a predefined
standardized format.
 Data Governance – Management of the availability, usability, integrity,
and security of the data stored within an organization.
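
As a small illustration of remediating dirty data, the sketch below removes incomplete and duplicated records and brings the remaining ones into a standard format. The record fields, the rule that the email address identifies a duplicate, and the function name are all assumptions for illustration.

```python
def clean_records(records):
    """Drop incomplete and duplicate records and standardize the format."""
    cleaned, seen = [], set()
    for record in records:
        # Standardize: trim whitespace, title-case names, lower-case emails.
        name = (record.get("name") or "").strip().title()
        email = (record.get("email") or "").strip().lower()
        if not name or not email:
            continue                  # incomplete: a required field is missing
        if email in seen:
            continue                  # duplicate: this email was already kept
        seen.add(email)
        cleaned.append({"name": name, "email": email})
    return cleaned

# Hypothetical customer records with typical problems.
dirty = [
    {"name": "  alice smith ", "email": "Alice@Example.com"},
    {"name": "Alice Smith", "email": "alice@example.com "},   # duplicate
    {"name": "Bob Jones", "email": None},                     # incomplete
    {"name": "carol white", "email": "carol@example.com"},
]
for row in clean_records(dirty):
    print(row)
```

Standardizing before comparing is what lets the two differently formatted Alice records be recognized as the same entry.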
