Unit – 5
Subject – Foundation of Data Science
Name – Prof. Garima Jain Branch – Computer Science & Engineering
• Value. Data has intrinsic value in business, but it is of no use until that value is discovered. Because
big data assembles both breadth and depth of information, somewhere within it lie insights that can
benefit your organization. This value can be internal, such as operational processes that might be
optimized, or external, such as customer profile suggestions that can maximize engagement.
Types of Big Data: The following are the types of Big Data:
1. Structured
Any data that can be stored, accessed and processed in a fixed format is termed 'structured' data.
Over time, computer science has developed increasingly successful techniques for working with such
data (where the format is known in advance) and for deriving value from it. However, issues now
arise when the size of such data grows to a huge extent; typical sizes are in the range of multiple
zettabytes.
2. Unstructured
Any data with an unknown form or structure is classified as unstructured data. In addition to the
huge size, unstructured data poses multiple challenges regarding its processing for deriving value
out of it. A typical example of unstructured data is a heterogeneous data source containing a
combination of simple text files, images, videos etc. Nowadays, organisations have a wealth of
available data. Still, unfortunately, they don't know how to derive value from it since this data is in
its raw form or unstructured format.
3. Semi-structured
Semi-structured data can contain elements of both forms. It appears structured, but it is not defined
by a fixed schema such as a table definition in a relational DBMS. A typical example of
semi-structured data is data represented in an XML file.
3. Financial services. When it comes to security, it’s not just a few rogue attackers—you’re up
against entire expert teams. Security landscapes and compliance requirements are constantly
evolving. Big data helps you identify patterns in data that indicate fraud and aggregate large
volumes of information to make regulatory reporting much faster.
4. Manufacturing. Factors that can predict mechanical failures may be deeply buried in structured
data—think the year, make, and model of equipment—as well as in unstructured data that covers
millions of log entries, sensor data, error messages, and engine temperature readings. By analyzing
these indications of potential issues before problems happen, organizations can deploy maintenance
more cost effectively and maximize parts and equipment uptime.
5. Government and public services. Government offices can potentially collect data from many
different sources, such as DMV records, traffic data, police/firefighter data, public school records,
and more. This can drive efficiencies in many different ways, such as detecting driver trends for
optimized intersection management and better resource allocation in schools. Governments can also
post data publicly, allowing for improved transparency to bolster public trust.
and highlighting hidden drivers of human error. Whether technical problems or staff performance
issues, big data produces insights about how an organization operates—and how it can improve.
6. Rapid Change
The fact that technology is evolving quickly is another potential disadvantage of big data analytics.
Businesses must deal with the possibility of spending money on one technology only to see
something better emerge a few months later. This big data drawback was ranked fourth among all
the potential difficulties by Syncsort respondents.
A computer only has a limited amount of RAM. When you try to squeeze more data
into this memory than actually fits, the OS will start swapping out memory blocks to disks,
which is far less efficient than having it all in memory. But only a few algorithms are designed
to handle large data sets; most of them load the whole data set into memory at once, which
causes the out-of-memory error. Other algorithms need to hold multiple copies of the data in
memory or store intermediate results. All of these aggravate the problem.
Even when you cure the memory issues, you may need to deal with another limited
resource: time. Although a computer may think you live for millions of years, in reality
you won’t. Certain algorithms don’t take time into account; they’ll keep running forever.
Other algorithms can’t finish in a reasonable amount of time even when they only need to process a
few megabytes of data.
A third thing you’ll observe when dealing with large data sets is that components of
your computer can start to form a bottleneck while leaving other systems idle. Although this
isn’t as severe as a never-ending algorithm or out-of-memory errors, it still incurs a serious
cost. Think of the cost savings in terms of person days and computing infrastructure for CPU
starvation. Certain programs don’t feed data fast enough to the processor because they have to
read data from the hard drive, which is one of the slowest components on a computer. This
has been addressed with the introduction of solid state drives (SSD), but SSDs are still much
more expensive than the slower and more widespread hard disk drive (HDD) technology.
• Take a random sample of your data, such as the first 1,000 or 100,000 rows, and use this smaller
sample to work through your problem before fitting a final model on all of your data (using
progressive data loading techniques); see the sketch after this list.
• This is generally good practice in machine learning, as it gives you quick spot-checks of
algorithms and fast turnaround of results.
• You may also consider performing a sensitivity analysis of the amount of data used to fit one
algorithm compared to the model skill. Perhaps there is a natural point of diminishing
returns that you can use as a heuristic size for your smaller sample.
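As a quick illustration of the sampling idea above, a minimal pandas sketch might look like the following (the file name large_dataset.csv is a hypothetical placeholder):

    import pandas as pd

    # Quick prototype on a smaller slice: read only the first 100,000 rows.
    # ("large_dataset.csv" is a hypothetical file name used for illustration.)
    subset = pd.read_csv("large_dataset.csv", nrows=100_000)

    # Draw a further 1% random sample so the prototype is not biased towards
    # whatever happens to sit at the top of the file.
    sample = subset.sample(frac=0.01, random_state=42)
    print(sample.shape)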
3. Use a Computer with More Memory
• Perhaps you can get access to a much larger computer with an order of magnitude more
memory.
• For example, a good option is to rent compute time on a cloud service like Amazon Web
Services that offers machines with tens of gigabytes of RAM for less than a US dollar per
hour.
• Perhaps you can speed up data loading and use less memory by using another data format. A
good example is a binary format like GRIB, NetCDF, or HDF.
• There are many command line tools that you can use to transform one data format into
another that do not require the entire dataset to be loaded into memory.
• Using another format may allow you to store the data in a more compact form that saves
memory, such as 2-byte integers or 4-byte floats, as in the sketch below.
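A minimal sketch of the compact-type idea, assuming pandas and NumPy are available (the HDF5 step also needs the optional 'tables' package); the data and file names are invented for illustration:

    import numpy as np
    import pandas as pd

    # Hypothetical numeric DataFrame: one million rows, five float64 columns.
    df = pd.DataFrame(np.random.rand(1_000_000, 5), columns=list("abcde"))
    print(df.memory_usage(deep=True).sum())      # roughly 40 MB as 8-byte floats

    # Downcast to 4-byte floats: half the memory, often enough precision.
    df = df.astype(np.float32)
    print(df.memory_usage(deep=True).sum())      # roughly 20 MB as 4-byte floats

    # Store in a binary HDF5 file (needs the optional 'tables' package).
    df.to_hdf("data.h5", key="df", mode="w")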
5. Stream Data or Use Progressive Loading
This may require algorithms that can learn iteratively using optimization techniques such as
stochastic gradient descent, instead of algorithms that require all data in memory to perform matrix
operations such as some implementations of linear and logistic regression.
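A minimal sketch of progressive loading with scikit-learn's SGDClassifier, using randomly generated chunks as stand-ins for data read from disk (names, chunk sizes, and the "log_loss" spelling used by recent scikit-learn versions are assumptions of this sketch):

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    # Logistic regression trained by stochastic gradient descent; partial_fit
    # lets the model learn from one chunk of data at a time.
    model = SGDClassifier(loss="log_loss", random_state=42)
    classes = np.array([0, 1])                   # must be declared on the first call

    for _ in range(10):                          # ten simulated chunks of 1,000 rows
        X_chunk = np.random.rand(1000, 20)       # stand-in for data read from disk
        y_chunk = np.random.randint(0, 2, size=1000)
        model.partial_fit(X_chunk, y_chunk, classes=classes)

    print(model.score(X_chunk, y_chunk))         # accuracy on the last chunk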
• Relational databases provide a standard way of storing and accessing very large datasets.
• Internally, the data is stored on disk, can be progressively loaded in batches, and can be
queried using a standard query language (SQL).
• Free open-source database tools like MySQL or Postgres can be used, and most programming
languages and many machine learning tools can connect directly to relational databases. You
can also use a lightweight approach such as SQLite; see the sketch below.
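For example, assuming a hypothetical SQLite file data.db containing a table named measurements, pandas can stream query results in batches instead of loading everything at once:

    import sqlite3
    import pandas as pd

    def process(chunk):
        # Placeholder for your own per-batch logic (fit, aggregate, etc.).
        print(len(chunk), "rows in this batch")

    conn = sqlite3.connect("data.db")            # hypothetical database file

    # chunksize turns read_sql_query into an iterator of DataFrames, so only
    # one batch of 10,000 rows is held in memory at a time.
    for chunk in pd.read_sql_query("SELECT * FROM measurements", conn,
                                   chunksize=10_000):
        process(chunk)

    conn.close()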
Machine Learning:
Machine learning (ML) is a type of artificial intelligence (AI) that allows computers to learn
without being explicitly programmed. This section explores the concept of machine learning,
provides various definitions, discusses its applications, and covers the different classifications of
machine learning tasks, giving you a comprehensive understanding of this powerful technology.
A subset of artificial intelligence known as machine learning focuses primarily on the creation of
algorithms that enable a computer to independently learn from data and previous experiences.
Arthur Samuel first used the term "machine learning" in 1959.
2. Automation
• Reduced Manual Intervention: Once trained, ML models can automatically make
predictions or decisions without human intervention, streamlining processes.
3. Generalization
• Ability to Generalize: A well-trained model can apply learned knowledge to new, unseen
data, making it useful in real-world scenarios.
4. Pattern Recognition
• Identifying Patterns: ML excels at recognizing complex patterns and correlations in large
datasets, which might be difficult for humans to identify.
5. Scalability
• Handling Large Datasets: ML algorithms can efficiently process and analyze vast amounts
of data, making them suitable for big data applications.
6. Versatility
• Wide Range of Applications: ML can be applied across various domains, including
finance, healthcare, marketing, and more, adapting to different types of data and problems.
7. Continuous Improvement
• Incremental Learning: ML models can continue to learn and improve over time as new
data becomes available, allowing for ongoing refinement.
8. Feature Engineering
• Automatic Feature Selection: Many ML algorithms can automatically identify the most
relevant features for making predictions, reducing the need for manual feature selection.
9. Predictive Capabilities
• Making Predictions: ML models are designed to predict future outcomes based on
historical data, enabling proactive decision-making.
2. Unsupervised learning:
Unsupervised learning is a type of machine learning algorithm used to draw inferences from
datasets consisting of input data without labeled responses. In unsupervised learning, classification
or categorization is not included in the observations. It works with unlabeled data, seeking to find
patterns or groupings (e.g., clustering customers based on purchasing behavior).
3. Reinforcement learning:
Reinforcement learning is the problem of getting an agent to act in the world so as to maximize its
rewards.
A learner is not told what actions to take, as in most forms of machine learning, but instead must
discover which actions yield the most reward by trying them. It involves agents that learn to make
decisions by taking actions in an environment to maximize some notion of cumulative reward (e.g.,
game-playing AI).
•Feeding engineered data: The input to any ML model is data. Even the most advanced
machine learning model can only be as good as the data from which it learns. The simple
concept of ”garbage in, garbage out” explains why feature engineering is so relevant for—
and intertwined with—model training. Feature engineering should be performed with
awareness of common hidden errors in the data set that can bias the training, including data
leakage where the target is indirectly represented in one or more features.
•Parametrized machine learning algorithm: ML algorithms are coded procedures with a
set of input parameters, known as “hyperparameters”. The developer can customize the
hyperparameters to tune the algorithm’s learning to the specific data set and use case. The
documentation of each algorithm should highlight implementation details, including the
complete set of tunable hyperparameters.
•Model with optimal learned trained parameters: Machine learning algorithms have
another set of parameters, known as “trainable parameters”, which correspond to the
coefficients automatically learned during model training. Trainable parameters make the
algorithm derive an output from an unseen input at prediction time within an expected range
of accuracy. Each algorithm learns in its own specific way, so each has a unique set of
trainable parameters. For example, a decision tree learns the choice of decision variables at
each node, while a neural network learns the weights associated with each layer’s activation
function.
•Minimize an objective function: An objective function defines how a machine learning
model learns its trainable parameters. The model adjusts those parameters so as to optimize
(minimize or maximize) the value returned by the objective function. Loss functions are the type
of objective function most commonly used in ML training, often accompanied by a regularization
term. A loss function measures how well the algorithm models the training data by providing an
error between the estimated and the true output value. The higher the error, the more the trainable
parameters are updated in that training iteration.
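To make this concrete, here is a small NumPy sketch of a mean-squared-error loss with an L2 regularization term for a simple linear model (all names and values are illustrative, not a specific library API):

    import numpy as np

    def loss(w, X, y, lam=0.1):
        """Mean squared error plus an L2 regularization penalty."""
        error = X @ w - y                       # estimated minus true outputs
        return np.mean(error ** 2) + lam * np.sum(w ** 2)

    # Toy data: 100 samples, 3 features, targets generated from known weights.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    true_w = np.array([1.0, -2.0, 0.5])
    y = X @ true_w + rng.normal(scale=0.1, size=100)

    print(loss(np.zeros(3), X, y))              # large loss before any learning
    print(loss(true_w, X, y))                   # small loss near the optimum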
Types of cross-validation
1. K-fold cross-validation
2. Hold-out cross-validation
3. Stratified k-fold cross-validation
4. Leave-p-out cross-validation
5. Leave-one-out cross-validation
6. Monte Carlo (shuffle-split) cross-validation
7. Time series (rolling) cross-validation
K-fold cross-validation
In this technique, the whole dataset is partitioned into k parts of equal size, and each partition is
called a fold. It’s known as k-fold since there are k parts, where k can be any integer: 3, 4, 5, etc.
One fold is used for validation and the other k-1 folds are used for training the model. So that every
fold serves once as the validation set with the remaining folds as the training set, the technique is
repeated k times.
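A short scikit-learn sketch of 5-fold cross-validation (the dataset and model here are illustrative placeholders):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold, cross_val_score

    X, y = load_iris(return_X_y=True)

    # Five folds; each fold is used exactly once as the validation set.
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)
    print(scores, scores.mean())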
Holdout Validation
Similar to train-test split, holdout validation involves setting aside a portion of the data for
evaluation. However, this portion is held out during the entire training process and only evaluated
once the final model is built.
This can be useful for datasets that are constantly updated, as the holdout set can be used to evaluate
the model's performance on the most recent data.
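A minimal holdout split with scikit-learn, keeping 20% of the data aside until the single final evaluation (the split size and model are illustrative):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)

    # 80% for training; 20% held out and evaluated only once, at the very end.
    X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.2,
                                                        random_state=42)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(model.score(X_hold, y_hold))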
Stratified k-fold cross-validation
Stratified k-fold cross-validation ensures each fold contains a representative proportion of each
class, which is important for datasets with imbalanced classes, where one class dominates the others.
The data is shuffled once, split into k parts, and each part is then used in turn to test the model.
This prevents the model from favoring the majority class and provides a more accurate assessment
of its performance across all classes.
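In scikit-learn, the only change from plain k-fold is the splitter class; this sketch assumes an imbalanced binary target generated at random:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    # Imbalanced toy data: roughly 90% of the labels are 0.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5))
    y = (rng.random(500) > 0.9).astype(int)

    # Every fold keeps approximately the same 90/10 class ratio as the full data.
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(LogisticRegression(), X, y, cv=skf)
    print(scores.mean())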
Leave-p-out cross-validation
An exhaustive cross-validation technique: if a dataset has n samples, p samples are used as the
validation set and the remaining n-p samples are used as the training set. The process is repeated
until every possible subset of p samples has served as the validation set.
The technique produces good results but has a very high computation time and is generally
considered computationally infeasible for larger datasets. It is also not ideal for an imbalanced
dataset: if the training set ends up containing samples of only one class, the model will not be able
to generalize properly and will become biased towards that class.
Leave-one-out cross-validation
In this technique, only 1 sample point is used as the validation set and the remaining n-1 samples are
used as the training set. Think of it as a special case of leave-p-out cross-validation with p = 1.
To understand this better, consider an example: there are 1,000 instances in your dataset. In each
iteration, 1 instance is used as the validation set and the remaining 999 instances are used as the
training set. The process repeats until every instance in the dataset has been used as a
validation sample.
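Both exhaustive schemes are available in scikit-learn; the sketch below deliberately uses a tiny subset of a toy dataset, because the number of iterations grows very quickly:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import LeaveOneOut, LeavePOut, cross_val_score

    X, y = load_iris(return_X_y=True)
    X, y = X[::5], y[::5]                        # 30 samples: exhaustive CV is costly

    # Leave-one-out: 30 iterations, one sample validated per iteration.
    loo = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
    print(loo.mean())

    # Leave-p-out with p = 2: every possible pair serves once as the validation set.
    lpo = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeavePOut(p=2))
    print(lpo.mean())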
Monte Carlo cross-validation (shuffle-split)
Also known as shuffle-split cross-validation or repeated random subsampling cross-validation, the
Monte Carlo technique involves randomly splitting the whole dataset into training data and test data.
The split can be 70-30%, 60-40%, or any proportion you prefer; the proportion is usually kept fixed
across iterations, and what changes each time is the random assignment of samples to the training
and test sets.
The next step is to fit the model on the training set for that iteration and calculate the accuracy of
the fitted model on the test set. Repeat these iterations many times (100, 400, 500 or even more)
and take the average of all the test errors to conclude how well your model performs.
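scikit-learn implements this as ShuffleSplit; the sketch below repeats a random 70-30% split 100 times and averages the scores (dataset and model are illustrative):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import ShuffleSplit, cross_val_score

    X, y = load_iris(return_X_y=True)

    # 100 independent random 70/30 splits; average the 100 test scores.
    ss = ShuffleSplit(n_splits=100, test_size=0.3, random_state=42)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=ss)
    print(scores.mean(), scores.std())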
Time series (rolling) cross-validation
In rolling cross-validation the folds are formed in chronological order: the model is trained on past
observations and validated on the observations that immediately follow. This mimics real-world
scenarios where past data is used to predict the future, and it prevents the model from peering into
the future during training.
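scikit-learn's TimeSeriesSplit produces exactly this rolling behaviour; in the sketch below every training window contains only observations that come before the corresponding validation window (the data is a dummy sequence):

    import numpy as np
    from sklearn.model_selection import TimeSeriesSplit

    # Twelve time-ordered observations (e.g. monthly values).
    X = np.arange(12).reshape(-1, 1)
    y = np.arange(12)

    tscv = TimeSeriesSplit(n_splits=4)
    for train_idx, test_idx in tscv.split(X):
        # Training indices always come before the corresponding test indices.
        print("train:", train_idx, "test:", test_idx)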
Supervised learning
Supervised learning is a type of machine learning algorithm that learns from labeled data. Labeled
data is data that has been tagged with a correct answer or classification.
Supervised learning, as the name indicates, involves the presence of a supervisor acting as a teacher.
Supervised learning is when we teach or train the machine using data that is well labelled, which
means some data is already tagged with the correct answer. After that, the machine is provided with
a new set of examples (data) so that the supervised learning algorithm analyses the training data
(the set of training examples) and produces a correct outcome from labelled data.
For example, a labeled dataset of images of Elephant, Camel and Cow would have each image tagged
with the name of the animal it shows.
Key Points:
•Supervised learning involves training a machine from labeled data.
•Labeled data consists of examples with the correct answer or classification.
•The machine learns the relationship between inputs (fruit images) and outputs (fruit labels).
•The trained machine can then make predictions on new, unlabeled data.
Example:
Let’s say you have a basket of fruit that you want to identify. The machine would first analyze the
image to extract features such as its shape, color, and texture. Then, it would compare these features
to the features of the fruits it has already learned about. If the new image’s features are most similar
to those of an apple, the machine would predict that the fruit is an apple.
For instance, suppose you are given a basket filled with different kinds of fruits. Now the first step
is to train the machine with all the different fruits one by one like this:
•If the shape of the object is rounded and has a depression at the top, is red in color, then it will
be labeled as –Apple.
•If the shape of the object is a long curving cylinder having Green-Yellow color, then it will be
labeled as –Banana.
Now suppose that, after training, you are given a new, separate fruit from the basket, say a banana,
and are asked to identify it.
Since the machine has already learned from the previous data, it now has to use that knowledge
wisely. It will first classify the fruit by its shape and color, confirm the fruit name as BANANA, and
put it in the Banana category. Thus the machine learns from the training data (the basket containing
fruits) and then applies that knowledge to the test data (the new fruit).
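A toy version of this fruit example, with shape and colour encoded as invented numeric features and a decision tree as the supervised model (the encoding and values are made up for illustration):

    from sklearn.tree import DecisionTreeClassifier

    # Invented numeric encoding: [roundness 0-1, length in cm, redness 0-1].
    X_train = [
        [0.9, 7, 0.9],    # apple: round, short, red
        [0.9, 8, 0.8],    # apple
        [0.2, 18, 0.1],   # banana: long curving cylinder, green-yellow
        [0.3, 20, 0.1],   # banana
    ]
    y_train = ["apple", "apple", "banana", "banana"]

    model = DecisionTreeClassifier().fit(X_train, y_train)

    # A new, unseen fruit: elongated, long and green-yellow.
    print(model.predict([[0.25, 19, 0.15]]))     # expected: ['banana']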
Evaluating supervised learning models is an important step in ensuring that the model is accurate
and generalizable. There are a number of different metrics that can be used to evaluate supervised
learning models, but some of the most common ones include:
For Regression
•Mean Squared Error (MSE): MSE measures the average squared difference between the
predicted values and the actual values. Lower MSE values indicate better model performance.
•Root Mean Squared Error (RMSE): RMSE is the square root of MSE, representing the
standard deviation of the prediction errors. Similar to MSE, lower RMSE values indicate better
model performance.
•Mean Absolute Error (MAE): MAE measures the average absolute difference between the
predicted values and the actual values. It is less sensitive to outliers compared to MSE or
RMSE.
•R-squared (Coefficient of Determination): R-squared measures the proportion of the
variance in the target variable that is explained by the model. Higher R-squared values indicate
better model fit.
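These regression metrics map directly onto scikit-learn functions; a small sketch with made-up predictions:

    import numpy as np
    from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

    y_true = np.array([3.0, 5.0, 2.5, 7.0])      # actual target values
    y_pred = np.array([2.8, 5.4, 2.0, 6.5])      # model predictions

    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)                          # RMSE is the square root of MSE
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(mse, rmse, mae, r2)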
For Classification
•Accuracy: Accuracy is the percentage of predictions that the model makes correctly. It is
calculated by dividing the number of correct predictions by the total number of predictions.
•Precision: Precision is the percentage of positive predictions that the model makes that are
actually correct. It is calculated by dividing the number of true positives by the total number of
positive predictions.
•Recall: Recall is the percentage of all positive examples that the model correctly identifies. It
is calculated by dividing the number of true positives by the total number of positive examples.
•F1 score: The F1 score is the harmonic mean of precision and recall, combining the two metrics
into a single value that balances them.
•Confusion matrix: A confusion matrix is a table that shows the number of predictions for each
class, along with the actual class labels. It can be used to visualize the performance of the model
and identify areas where the model is struggling.
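The classification metrics above are likewise available in scikit-learn; a sketch with a small set of made-up binary predictions:

    from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                                 f1_score, confusion_matrix)

    y_true = [1, 0, 1, 1, 0, 1, 0, 0]            # actual labels
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0]            # model predictions

    print("accuracy: ", accuracy_score(y_true, y_pred))
    print("precision:", precision_score(y_true, y_pred))
    print("recall:   ", recall_score(y_true, y_pred))
    print("f1:       ", f1_score(y_true, y_pred))
    print(confusion_matrix(y_true, y_pred))      # rows: actual, columns: predicted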
Applications of Supervised learning
•Medical diagnosis: Supervised learning can assist in medical diagnosis by analyzing patient
data, such as medical images, test results, and patient history, to identify patterns that suggest
specific diseases or conditions.
•Fraud detection: Supervised learning models can analyze financial transactions and identify
patterns that indicate fraudulent activity, helping financial institutions prevent fraud and protect
their customers.
•Natural language processing (NLP): Supervised learning plays a crucial role in NLP tasks,
including sentiment analysis, machine translation, and text summarization, enabling machines to
understand and process human language effectively.
Advantages of Supervised learning
•Supervised learning allows you to collect data and produce outputs informed by previous
experience.
•Helps to optimize performance criteria with the help of experience.
•Supervised machine learning helps to solve various types of real-world computation problems.
•It performs classification and regression tasks.
•It allows estimating or mapping the result to a new sample.
•We have complete control over choosing the number of classes we want in the training data.
Disadvantages of Supervised learning
•Classifying big data can be challenging.
•Training supervised learning models requires a lot of computation time and resources.
•Supervised learning cannot handle all complex tasks in machine learning.
•It requires a labelled data set.
•It requires a training process.
Unsupervised learning
Unsupervised learning is a type of machine learning that learns from unlabeled data. This means
that the data does not have any pre-existing labels or categories. The goal of unsupervised learning
is to discover patterns and relationships in the data without any explicit guidance.
Unsupervised learning is the training of a machine using information that is neither classified nor
labeled, allowing the algorithm to act on that information without guidance. Here the task of the
machine is to group unsorted information according to similarities, patterns, and differences without
any prior training on the data.
Unlike supervised learning, no teacher is provided, which means no training labels are given to the
machine. Therefore the machine is restricted to finding the hidden structure in unlabeled data by itself.
For example, you can use unsupervised learning to examine animal data that has been gathered and
distinguish several groups according to the traits and actions of the animals. These groupings might
correspond to various animal species, allowing you to categorize the creatures without depending
on labels that already exist.
Key Points
•Unsupervised learning allows the model to discover patterns and relationships in unlabeled
data.
•Clustering algorithms group similar data points together based on their inherent characteristics.
•Feature extraction captures essential information from the data, enabling the model to make
meaningful distinctions.
•Label association assigns categories to the clusters based on the extracted patterns and
characteristics.
Example
Imagine you have a machine learning model trained on a large dataset of unlabeled images,
containing both dogs and cats. The model has never seen an image of a dog or cat before, and it has
no pre-existing labels or categories for these animals. Your task is to use unsupervised learning to
identify the dogs and cats in a new, unseen image.
For instance, suppose the model is given an image containing both dogs and cats that it has never
seen before.
The machine has no idea of the features of dogs and cats, so it cannot categorize the image as ‘dogs
and cats’. But it can group the animals according to their similarities, patterns, and differences: the
picture can easily be divided into two parts, one containing all the pictures of dogs and the other
containing all the pictures of cats, even though nothing was learned beforehand, i.e., there is no
training data or examples.
This allows the model to work on its own to discover patterns and information that were previously
undetected. It mainly deals with unlabelled data.
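A minimal clustering sketch in the same spirit: k-means groups unlabeled points into two clusters using only their similarity (the 2-D points below are synthetic stand-ins for image features):

    import numpy as np
    from sklearn.cluster import KMeans

    # Two unlabeled "blobs" standing in for dog-like and cat-like feature vectors.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
                   rng.normal(loc=5.0, scale=0.5, size=(50, 2))])

    # No labels are given; k-means groups the points purely by similarity.
    labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
    print(labels[:5], labels[-5:])               # the two groups get different ids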
Evaluating unsupervised learning models is an important step in ensuring that the model is
effective and useful. However, it can be more challenging than evaluating supervised learning
models, as there is usually no ground truth data to compare the model’s predictions to.
There are a number of different metrics that can be used to evaluate unsupervised learning
models, but some of the most common ones include:
•Silhouette score: The silhouette score measures how well each data point is clustered with its
own cluster members and separated from other clusters. It ranges from -1 to 1, with higher
scores indicating better clustering.
•Calinski-Harabasz score: The Calinski-Harabasz score measures the ratio between the
variance between clusters and the variance within clusters. It ranges from 0 to infinity, with
higher scores indicating better clustering.
•Adjusted Rand index: The adjusted Rand index measures the similarity between two
clusterings. It ranges from -1 to 1, with higher scores indicating more similar clusterings.
•Davies-Bouldin index: The Davies-Bouldin index measures the average similarity between
clusters. It ranges from 0 to infinity, with lower scores indicating better clustering.
•F1 score: The F1 score is the harmonic mean of precision and recall, two metrics commonly
used in supervised learning to evaluate classification models. When ground-truth labels are
available, it can also be used to evaluate unsupervised models such as clustering models.
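The internal clustering metrics listed above can be computed directly from the data and the cluster assignments; a sketch reusing a simple k-means result on synthetic data:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                                 davies_bouldin_score)

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
                   rng.normal(5.0, 0.5, size=(50, 2))])
    labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

    print(silhouette_score(X, labels))           # closer to 1 is better
    print(calinski_harabasz_score(X, labels))    # higher is better
    print(davies_bouldin_score(X, labels))       # lower is better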
Application of Unsupervised learning
Difference between supervised and unsupervised machine learning:

Parameters | Supervised machine learning | Unsupervised machine learning
Computational complexity | Simpler method | Computationally complex
Training data | Uses training data to infer the model. | No training data is used.
Complex model | It is not possible to learn larger and more complex models than with supervised learning. | It is possible to learn larger and more complex models with unsupervised learning.
Model | We can test our model. | We cannot test our model.