Machine Learning 2

Machine Learning
Dr. Soumi Dutta
Advantages of Machine Learning
Automation − Machine learning can take over many tasks, especially repetitive ones, saving humans time and energy.
For example, deploying chatbots has improved customer experience and reduced waiting times, while human
agents can focus on creative and complex problems.
Enhancing user experience and decision making − Machine learning models can analyze large datasets and extract
insights that support decision making. Machine learning also enables the personalization of products and services to
enhance the customer experience: an algorithm analyzes customer preferences and past behavior to recommend
products, which benefits both the retailer and the user.
Wide Applicability − This technology has a wide range of applications. From health care and finance to business and
marketing, machine learning is applied in almost all sectors to improve productivity.
Continuous Improvement − Machine learning models can keep learning to improve accuracy and efficiency.
Each time a model is retrained on new data, its decisions can improve.
Disadvantages of Machine Learning
Data acquisition − One of the most crucial and most difficult tasks in machine learning is collecting data. Every
machine learning algorithm requires data that is relevant, unbiased, and of good quality. Better data generally
results in better performance of the machine learning model.
Inaccurate Results − Another major challenge in machine learning is the credibility of the interpreted result
generated by the algorithm.
Chances of Error − Machine learning depends on two things: data and algorithm. Any error or bias in either
can result in inaccurate outcomes. For example, if the training dataset is small, the algorithm cannot fully
learn the underlying patterns, resulting in biased and irrelevant predictions.
Maintenance − Machine learning models have to continuously be maintained and monitored to ensure that they
remain effective and accurate over time.
Machine Learning Algorithms Vs. Traditional Programming
What is a dataset?
A dataset is a collection of data arranged in some order. A dataset can contain anything from a simple array
to a database table. The table below shows an example of a dataset:
A tabular dataset can be understood as a database table or
matrix, where each column corresponds to a particular
variable and each row corresponds to one record of the
dataset. The most widely supported file type for a tabular dataset
is the "Comma-Separated Values" (CSV) file. To store tree-like data,
a JSON file is usually more suitable.
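The difference between the two file types can be sketched with Python's standard library. The column names and values below are made up for illustration; they are not taken from any real dataset.

```python
import csv
import io
import json

# Hypothetical tabular data in CSV form: one header row, then one record per row.
csv_text = """city,temperature_c,rainy
Kolkata,31.5,Yes
Delhi,28.0,No
"""

# Each row becomes a dict keyed by column name (all values are read as strings).
rows = list(csv.DictReader(io.StringIO(csv_text)))
temperatures = [float(row["temperature_c"]) for row in rows]

# Hypothetical tree-like data: nesting is natural in JSON but awkward in CSV.
json_text = '{"city": "Kolkata", "sensors": [{"id": 1, "readings": [31.5, 32.0]}]}'
record = json.loads(json_text)
first_reading = record["sensors"][0]["readings"][0]

print(rows[0]["city"], temperatures, first_reading)
```

Note that CSV values arrive as text and must be converted to numbers explicitly, while JSON keeps numbers and nested lists as they are.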
Types of data in datasets
Numerical data: Such as house price, temperature, etc.
Categorical data: Such as Yes/No, True/False, Blue/green, etc.
Ordinal data: Similar to categorical data, but the categories have a meaningful order
(for example, Low/Medium/High).
Note: Real-world datasets are often very large, which makes them difficult to manage and
process at first. To practice machine learning algorithms, we can therefore use a small
dummy dataset.
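As a rough sketch of how the three data types are usually turned into numbers for a model, the example below keeps a numerical column as-is, maps an ordinal column to ranks, and one-hot encodes a categorical column. The records and the names SIZE_RANK, COLORS, and encode are hypothetical.

```python
# Hypothetical records mixing the three data types.
records = [
    {"price": 250000.0, "color": "Blue", "size": "Small"},
    {"price": 410000.0, "color": "Green", "size": "Large"},
    {"price": 320000.0, "color": "Blue", "size": "Medium"},
]

# Ordinal data: categories have a meaningful order, so map them to ranks.
SIZE_RANK = {"Small": 0, "Medium": 1, "Large": 2}

# Categorical data: no order, so use one indicator (one-hot) column per category.
COLORS = sorted({r["color"] for r in records})  # ['Blue', 'Green']

def encode(record):
    """Return [price, size rank, one indicator per color] for one record."""
    one_hot = [1 if record["color"] == c else 0 for c in COLORS]
    return [record["price"], SIZE_RANK[record["size"]]] + one_hot

encoded = [encode(r) for r in records]
print(encoded)
```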
Types of datasets
Machine learning spans different domains, each requiring specific kinds of datasets. A few common kinds of
datasets used in machine learning include:
Image Datasets: Image datasets contain a collection of images and are commonly used in computer vision
tasks such as image classification, object detection, and image segmentation.
Examples :
◦ ImageNet
◦ CIFAR-10
◦ MNIST
Types of datasets
Text Datasets:
Text datasets consist of textual information, like articles, books, or social media posts. These datasets are
used in NLP tasks like sentiment analysis, text classification, and machine translation.
Examples :
◦ Project Gutenberg dataset
◦ IMDb movie reviews dataset

Types of datasets
Time Series Datasets:
Time series datasets consist of data points collected over time. They are commonly used in
forecasting, anomaly detection, and trend analysis.
Examples :
◦ Stock market data
◦ Weather data
◦ Sensor readings
Types of datasets
Tabular Datasets:
Tabular datasets are structured data organized in tables or spreadsheets. They contain rows representing
instances or samples and columns representing features or attributes. Tabular datasets are used for tasks like
regression and classification. The dataset shown earlier is an example of a tabular dataset.
Need for Datasets
Well-prepared and pre-processed datasets are important for machine learning projects.
They provide the foundation for training accurate and reliable models. However, working with
large datasets can present challenges in management and processing.
To address these challenges, efficient data management techniques and processing algorithms
are needed.
Data Pre-processing
Data pre-processing is a fundamental step in preparing datasets for machine learning. It involves
transforming raw data into a format suitable for model training. Common pre-processing
techniques include data cleaning to remove inconsistencies or errors, normalization to scale
data within a specific range, feature scaling to ensure features have comparable ranges, and
handling missing values through imputation or removal.
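Two of the techniques named above, imputation of missing values and normalization to a fixed range, can be sketched in a few lines. This is a minimal illustration on a made-up column; the function names impute_mean and min_max_scale are hypothetical, and mean imputation is only one of several possible strategies.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

raw = [10.0, None, 30.0, 20.0]   # hypothetical column with one missing value
filled = impute_mean(raw)        # [10.0, 20.0, 30.0, 20.0]
scaled = min_max_scale(filled)   # [0.0, 0.5, 1.0, 0.5]
print(filled, scaled)
```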
During the development of the ML project, the developers completely rely on the datasets. In building
ML applications, datasets are divided into two parts:
Training dataset
Test Dataset
Training Dataset and Test Dataset:
In machine learning, datasets are usually divided into two parts: the training dataset and the
test dataset. The training dataset is used to train the machine learning model, while the test
dataset is used to evaluate the model's performance. This division assesses the model's ability to
generalize to unseen data. It is essential to ensure that the datasets are representative of the
problem domain and properly split to avoid bias or overfitting.
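A simple way to make such a split is to shuffle the samples and cut off a fraction for testing. The sketch below assumes a hypothetical helper named train_test_split with an 80/20 split; the fraction and the seed are illustrative choices, not recommendations from the slides.

```python
import random

def train_test_split(samples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the samples and split it into (train, test) lists."""
    shuffled = list(samples)
    # A seeded generator makes the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(10))  # stand-in for 10 samples
train, test = train_test_split(data, test_fraction=0.2, seed=42)
print(len(train), len(test))
```

Shuffling before splitting helps both parts stay representative when the original data is ordered, for example by date or by class.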
Popular sources for Machine Learning datasets
Kaggle is one of the best sources of datasets for
data scientists and machine learning practitioners.
It allows users to find, download, and publish
datasets easily. It also provides the
opportunity to work with other machine learning
engineers and solve difficult data science
tasks.
Kaggle provides high-quality datasets in different
formats that we can easily find and download.
The link for Kaggle datasets is
https://www.kaggle.com/datasets.
UCI Machine Learning Repository
The UCI Machine Learning Repository is a
valuable resource that has been widely used by
researchers and practitioners since 1987. It
contains a large collection of datasets sorted by
machine learning tasks such as regression,
classification, and clustering. Notable datasets
in the repository include the Iris dataset, the
Car Evaluation dataset, and the Poker Hand
dataset.
The link for the UCI machine learning repository is
https://archive.ics.uci.edu/ml/index.php.
Datasets via AWS
We can search, download, access, and share
datasets that are publicly available via AWS
resources. These datasets are accessed
through AWS but provided and
maintained by different government organizations,
researchers, businesses, or individuals. This source
lists various types of datasets together with
examples and ways to use each dataset. It also
provides a search box for finding
the required dataset. Anyone can add a
dataset or example to the Registry of Open Data
on AWS.
Google's Dataset Search Engine
Google's Dataset Search helps
researchers find and access
relevant datasets from different
sources across the web. It indexes
datasets from areas like
social sciences, science, and
environmental science. Researchers
can use keywords to find
datasets, filter results based on
specific criteria, and access the
datasets directly from the
source.
Microsoft Datasets
Microsoft has launched the
"Microsoft Research Open Data"
repository, a collection of
free datasets in various areas such
as natural language processing,
computer vision, and domain-specific
sciences. It gives
access to diverse and
curated datasets that can be
valuable for machine learning
projects.
Awesome Public Dataset Collection
The Awesome Public Datasets collection
provides high-quality datasets
arranged in a well-organized
list according to
topics such as Agriculture, Biology,
Climate, Complex networks, etc.
Most of the datasets are available
for free, but some may not be, so it is
better to check the license before
downloading a dataset.
Government Datasets
There are different sources of government-related data. Various countries publish data collected by their
departments for public use.
The goal of providing these datasets is to increase the transparency of government work and
to encourage innovative uses of the data. Below are some sources of government datasets:
• Indian Government dataset
• US Government Dataset
• Northern Ireland Public Sector Datasets
• European Union Open Data Portal

Computer Vision Datasets
VisualData provides a large number of
datasets that are specific to computer
vision tasks such as image classification, video
classification, image segmentation, etc.
Therefore, if you want to build a project on deep
learning or image processing, you can refer
to this source.
The link for downloading datasets from this
source is https://www.visualdata.io/
Scikit-learn dataset
Scikit-learn, a well-known machine learning
library in Python, provides several built-in datasets
for practice and experimentation. These datasets are
accessible through the scikit-learn API
and can be used for learning different
machine learning algorithms. Scikit-learn offers
both toy datasets, which are small and simplified,
and real-world datasets with greater complexity.
Examples of scikit-learn datasets include the
Iris dataset, the Boston Housing dataset (removed
from the library in version 1.2), and the
Wine dataset.
Types of Machine Learning
Machine learning models can be categorized mainly into the following four types. Three of them (supervised,
unsupervised, and reinforcement learning) are discussed in detail below.
1) Supervised Machine Learning
2) Unsupervised Machine Learning
3) Semi-supervised Machine Learning
4) Reinforcement Machine Learning

Supervised Learning
Supervised learning uses labeled datasets to train algorithms to recognize patterns in data and predict outcomes. For example,
sorting an email into the inbox or the spam folder. Supervised learning can be further classified into two types − classification and
regression. The following supervised learning algorithms are widely used −
a) Linear Regression
b) Logistic Regression
c) Decision Trees
d) Random Forest
e) K-nearest Neighbor
f) Support Vector Machine
g) Naive Bayes
h) Linear Discriminant Analysis
i) Neural Networks
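To make the idea of learning from labeled data concrete, here is a minimal sketch of the first algorithm in the list, linear regression with one input, fitted by ordinary least squares. The data is made up so that it follows y = 3x + 2 exactly, and fit_line is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical labeled data: each input x comes with its known output y.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [5.0, 8.0, 11.0, 14.0]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept  # predict the output for an unseen input
print(slope, intercept, prediction)
```

This is regression because the output is a number; a classification algorithm would instead predict a label such as "spam" or "inbox".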
Unsupervised Learning
Unsupervised learning is a type of machine learning that uses unlabeled datasets to discover patterns without any explicit guidance
or instruction. For example, customer segmentation, i.e., dividing a company's customers into groups of similar customers.
Unsupervised learning algorithms can be further classified into three types − clustering, association, and dimensionality reduction.
The following are some commonly used unsupervised learning algorithms −
a) K-Means Clustering
b) Principal Component Analysis(PCA)
c) Hierarchical Clustering
d) DBSCAN Clustering
e) Agglomerative Clustering
f) Apriori Algorithm
g) Autoencoder
h) Restricted Boltzmann machine (RBM)
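The customer segmentation example can be sketched with K-Means, the first algorithm in the list. This is a simplified one-dimensional version with hand-picked starting centroids; the spending values and the function name kmeans_1d are hypothetical, and practical implementations handle many dimensions and choose starting points more carefully.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Plain k-means on 1-D points, starting from the given centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical monthly spend per customer; no labels are given.
spend = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = kmeans_1d(spend, centroids=[1.0, 12.0])
print(centroids, clusters)
```

The algorithm is never told which customers belong together; the two groups emerge from the distances between the points alone.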

Reinforcement Learning
Reinforcement learning algorithms learn to make decisions by trial and error: an agent interacts with an environment, receives
rewards or penalties for its actions, and gradually improves its behavior to maximize the total reward. For example, robotics.
Following are some common reinforcement learning algorithms −
a) Q-learning
b) Markov Decision Process (MDP)
c) SARSA
d) DQN
e) DDPG
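As an illustration of the first algorithm in the list, the sketch below runs tabular Q-learning on a made-up environment: five states in a row, where the agent starts at the left end and is rewarded only on reaching the right end. The environment, the hyperparameter values, and the names step and train are all assumptions chosen to keep the example small.

```python
import random

N_STATES = 5          # states 0..4 on a line; state 4 is the goal
ACTIONS = (-1, +1)    # move left or right

def step(state, action):
    """Apply an action; the reward is 1.0 only when the goal is reached."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action index]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy: explore sometimes (and on ties), otherwise exploit.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = step(state, ACTIONS[a])
            # Q-learning update toward reward plus discounted best next value.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q

q_table = train()
policy = ["left" if q[0] > q[1] else "right" for q in q_table[:-1]]
print(policy)
```

The agent is never told that moving right is correct; it discovers this from the rewards it receives while exploring.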
A Few Important Terms
a) Data
b) Feature
c) Model
d) Training
e) Testing
f) Overfitting
g) Underfitting
Data
Data is the foundation of machine learning. Without data, there would be nothing for the
algorithm to learn from. Data can come in many forms, including structured data (such as
spreadsheets and databases) and unstructured data (such as text and images). The quality
and quantity of the data used to train the machine learning algorithm are crucial factors that
can significantly impact its performance.

Feature
In machine learning, features are the variables or attributes used to describe the input data.
The goal is to select the most relevant and informative features that will allow the algorithm
to make accurate predictions or decisions. Feature selection is a crucial step in the machine
learning process because the performance of the algorithm is heavily dependent on the
quality and relevance of the features used.
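One simple way to judge relevance, sketched below, is to score each feature by the strength of its linear correlation with the target and keep the highest-scoring ones. The feature names and values are hypothetical, and correlation is only one of many possible selection criteria; it misses non-linear relationships.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical candidate features for predicting a house price.
features = {
    "area_sqft": [500.0, 750.0, 1000.0, 1250.0],
    "house_number": [12.0, 7.0, 33.0, 8.0],
}
price = [50.0, 76.0, 99.0, 126.0]

# Score each feature by the absolute correlation with the target.
scores = {name: abs(pearson(column, price)) for name, column in features.items()}
best_feature = max(scores, key=scores.get)
print(scores, best_feature)
```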

Model
A machine learning model is a mathematical representation of the relationship between the
input data (features) and the output (predictions or decisions). The model is created using a
training dataset and then evaluated using a separate validation dataset. The goal is to create
a model that can accurately generalize to new, unseen data.

Training
Training is the process of teaching the machine learning algorithm to make accurate
predictions or decisions. This is done by providing the algorithm with a large dataset and
allowing it to learn from the patterns and relationships in the data. During training, the
algorithm adjusts its internal parameters to minimize the difference between its predicted
output and the actual output.
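The parameter adjustment described above can be sketched with gradient descent on a model that has a single parameter w. The data follows y = 2x exactly, so training should move w toward 2. The learning rate and number of epochs are illustrative assumptions, and train_weight is a hypothetical name.

```python
def train_weight(xs, ys, learning_rate=0.01, epochs=200):
    """Fit y = w * x by gradient descent on the mean squared error."""
    w = 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of (1/n) * sum((w*x - y)^2) with respect to w.
        grad = (2.0 / n) * sum((w * x - y) * x for x, y in zip(xs, ys))
        # Step against the gradient to reduce the error.
        w -= learning_rate * grad
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = train_weight(xs, ys)
print(w)
```

Each pass over the data shrinks the gap between predicted and actual output; larger models repeat the same idea over many parameters at once.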

Testing
Testing is the process of evaluating the performance of the machine learning
algorithm on a separate dataset that it has not seen before. The goal is to
determine how well the algorithm generalizes to new, unseen data. If the
algorithm performs well on the testing dataset, it is considered to be a successful
model.
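For a classifier, a common way to summarize test performance is accuracy, the fraction of test examples predicted correctly. The sketch below uses a stand-in rule in place of a trained model; spam_model and the test data are hypothetical.

```python
def accuracy(model, test_inputs, test_labels):
    """Fraction of test examples for which the model predicts the true label."""
    correct = sum(1 for x, y in zip(test_inputs, test_labels) if model(x) == y)
    return correct / len(test_labels)

# Stand-in for a trained model: flags a message as spam if it has more than 2 links.
def spam_model(num_links):
    return "spam" if num_links > 2 else "inbox"

# Held-out examples the model has not seen: link counts and their true labels.
test_inputs = [0, 5, 1, 3, 2]
test_labels = ["inbox", "spam", "inbox", "inbox", "inbox"]

test_accuracy = accuracy(spam_model, test_inputs, test_labels)
print(test_accuracy)  # 0.8, since 4 of the 5 predictions match
```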
Overfitting
Overfitting occurs when a machine learning model is too complex and fits the training
data too closely. This can lead to poor performance on new, unseen data because the
model is too specialized to the training dataset. To prevent overfitting, it is important
to use a validation dataset to evaluate the model's performance and to use
regularization techniques to simplify the model.

Underfitting
Underfitting occurs when a machine learning model is too simple and cannot capture the patterns
and relationships in the data. This can lead to poor performance on both the training and testing
datasets. To prevent underfitting, we can use several techniques such as increasing model
complexity, collecting more data, reducing regularization, and feature engineering.
It is important to note that preventing underfitting is a balancing act between model complexity
and the amount of data available. Increasing model complexity can help prevent underfitting, but
if there is not enough data to support the increased complexity, overfitting may occur instead.
Therefore, it is important to monitor the model's performance and adjust the complexity as
necessary.
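The contrast between underfitting and overfitting can be sketched by comparing three models on the same made-up data, where the true relationship is y = 2x and the training labels carry noise of plus or minus 1. The constant model stands in for a model that is too simple, the memorizer for one that is too flexible; both are illustrative assumptions rather than standard algorithms.

```python
def mse(model, xs, ys):
    """Mean squared error of a model on the given inputs and targets."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data: true relationship y = 2x; training labels include noise.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [1.0, 1.0, 5.0, 5.0, 9.0, 9.0]
test_x = [0.4, 1.4, 2.4, 3.4, 4.4, 5.4]
test_y = [2 * x for x in test_x]

# Least squares line fitted to the training data.
mean_x = sum(train_x) / len(train_x)
mean_y = sum(train_y) / len(train_y)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
         / sum((x - mean_x) ** 2 for x in train_x))
intercept = mean_y - slope * mean_x

def constant_model(x):
    # Too simple: ignores x entirely (underfits).
    return mean_y

def linear_model(x):
    # Matches the true structure of the data.
    return slope * x + intercept

def memorizer(x):
    # Too flexible: copies the nearest training label, noise included (overfits).
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

models = {"constant": constant_model, "linear": linear_model, "memorizer": memorizer}
train_error = {name: mse(m, train_x, train_y) for name, m in models.items()}
test_error = {name: mse(m, test_x, test_y) for name, m in models.items()}
print(train_error)
print(test_error)
```

The constant model has a high error on both sets, the memorizer has zero training error but a higher test error than the line, and the linear model has the lowest test error of the three.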
Machine Learning Vs. Deep Learning
Deep learning is a sub-field of machine learning. The main difference between the two is the way
the algorithm learns.
In machine learning, computers learn from large datasets using algorithms to perform tasks like
prediction and recommendation, whereas deep learning uses layered neural networks
loosely modeled on the human brain.
Deep learning models are generally more effective than traditional machine learning models on
complex problems. For example, autonomous vehicles are usually developed using deep learning,
which can identify a U-turn signboard directly from images using image segmentation. With a
traditional machine learning model, the features of the signboard would first be selected by hand
and then passed to a classifier algorithm.
Machine Learning Vs. Generative AI
Machine learning and Generative AI are used for different purposes. While
machine learning is used for predictive analysis and decision-making, Generative AI focuses on
creating content, including realistic images and videos, based on patterns learned from existing data.
