Unit – 5
Subject – Foundation of Data Science
Name – Prof. Garima Jain Branch – Computer Science & Engineering
• Value. Data has intrinsic value in business, but it is of no use until that value is discovered. Because
big data assembles both breadth and depth of information, somewhere within it lie insights that can
benefit your organization. This value can be internal, such as operational processes that might be
optimized, or external, such as customer profile suggestions that can maximize engagement.
Types of Big Data: The following are the types of Big Data:
1. Structured
Any data that can be stored, accessed and processed in a fixed format is termed 'structured' data.
Over time, computer science has developed increasingly successful techniques for working with such
data (where the format is known in advance) and for deriving value from it. However, issues now
arise when the size of such data grows to a huge extent; typical sizes are in the range of multiple
zettabytes.
2. Unstructured
Any data with an unknown form or structure is classified as unstructured data. In addition to the
huge size, unstructured data poses multiple challenges regarding its processing for deriving value
out of it. A typical example of unstructured data is a heterogeneous data source containing a
combination of simple text files, images, videos etc. Nowadays, organisations have a wealth of
available data. Still, unfortunately, they don't know how to derive value from it since this data is in
its raw form or unstructured format.
3. Semi-structured
Semi-structured data can contain elements of both forms. It appears structured, but it is not defined
by a fixed schema such as a table definition in a relational DBMS. A typical example of
semi-structured data is data represented in an XML file.
3. Financial services. When it comes to security, it’s not just a few rogue attackers—you’re up
against entire expert teams. Security landscapes and compliance requirements are constantly
evolving. Big data helps you identify patterns in data that indicate fraud and aggregate large
volumes of information to make regulatory reporting much faster.
4. Manufacturing. Factors that can predict mechanical failures may be deeply buried in structured
data—think the year, make, and model of equipment—as well as in unstructured data that covers
millions of log entries, sensor data, error messages, and engine temperature readings. By analyzing
these indications of potential issues before problems happen, organizations can deploy maintenance
more cost effectively and maximize parts and equipment uptime.
5. Government and public services. Government offices can potentially collect data from many
different sources, such as DMV records, traffic data, police/firefighter data, public school records,
and more. This can drive efficiencies in many different ways, such as detecting driver trends for
optimized intersection management and better resource allocation in schools. Governments can also
post data publicly, allowing for improved transparency to bolster public trust.
and highlighting hidden drivers of human error. Whether technical problems or staff performance
issues, big data produces insights about how an organization operates—and how it can improve.
6. Rapid Change
The fact that technology is evolving quickly is another potential disadvantage of big data analytics.
Businesses must deal with the possibility of spending money on one technology only to see
something better emerge a few months later. This big data drawback was ranked fourth among all
the potential difficulties by Syncsort respondents.
A computer only has a limited amount of RAM. When you try to squeeze more data
into this memory than actually fits, the OS will start swapping out memory blocks to disks,
which is far less efficient than having it all in memory. But only a few algorithms are designed
to handle large data sets; most of them load the whole data set into memory at once, which
causes the out-of-memory error. Other algorithms need to hold multiple copies of the data in
memory or store intermediate results. All of these aggravate the problem.
Even when you cure the memory issues, you may need to deal with another limited
resource: time. Although a computer may think you live for millions of years, in reality
you won’t. Certain algorithms don’t take time into account; they’ll keep running forever.
Other algorithms can’t finish in a reasonable amount of time even when they only need to process a
few megabytes of data.
A third thing you’ll observe when dealing with large data sets is that components of
your computer can start to form a bottleneck while leaving other systems idle. Although this
isn’t as severe as a never-ending algorithm or out-of-memory errors, it still incurs a serious
cost. Think of the cost savings in terms of person days and computing infrastructure for CPU
starvation. Certain programs don’t feed data fast enough to the processor because they have to
read data from the hard drive, which is one of the slowest components on a computer. This
has been addressed with the introduction of solid state drives (SSD), but SSDs are still much
more expensive than the slower and more widespread hard disk drive (HDD) technology.
• Take a random sample of your data, such as the first 1,000 or 100,000 rows, and use this smaller
sample to work through your problem before fitting a final model on all of your data (using
progressive data loading techniques); see the sketch after this list.
• This is generally good practice in machine learning, as it gives you quick spot-checks of
algorithms and fast turnaround of results.
• You may also consider performing a sensitivity analysis of the amount of data used to fit one
algorithm compared to the model skill. Perhaps there is a natural point of diminishing
returns that you can use as a heuristic size for your smaller sample.
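As a quick illustration of the sampling idea above, a minimal pandas sketch might look like the following (the file name large_dataset.csv is a hypothetical placeholder):

    import pandas as pd

    # Quick prototype on a smaller slice: read only the first 100,000 rows.
    # ("large_dataset.csv" is a hypothetical file name used for illustration.)
    subset = pd.read_csv("large_dataset.csv", nrows=100_000)

    # Draw a further 1% random sample so the prototype is not biased towards
    # whatever happens to sit at the top of the file.
    sample = subset.sample(frac=0.01, random_state=42)
    print(sample.shape)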
3. Use a Computer with More Memory
• Perhaps you can get access to a much larger computer with an order of magnitude more
memory.
• For example, a good option is to rent compute time on a cloud service like Amazon Web
Services that offers machines with tens of gigabytes of RAM for less than a US dollar per
hour.
• Perhaps you can speed up data loading and use less memory by using another data format. A
good example is a binary format like GRIB, NetCDF, or HDF.
• There are many command line tools that you can use to transform one data format into
another that do not require the entire dataset to be loaded into memory.
• Using another format may allow you to store the data in a more compact form that saves
memory, such as 2-byte integers or 4-byte floats, as in the sketch below.
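A minimal sketch of the compact-type idea, assuming pandas and NumPy are available (the HDF5 step also needs the optional 'tables' package); the data and file names are invented for illustration:

    import numpy as np
    import pandas as pd

    # Hypothetical numeric DataFrame: one million rows, five float64 columns.
    df = pd.DataFrame(np.random.rand(1_000_000, 5), columns=list("abcde"))
    print(df.memory_usage(deep=True).sum())      # roughly 40 MB as 8-byte floats

    # Downcast to 4-byte floats: half the memory, often enough precision.
    df = df.astype(np.float32)
    print(df.memory_usage(deep=True).sum())      # roughly 20 MB as 4-byte floats

    # Store in a binary HDF5 file (needs the optional 'tables' package).
    df.to_hdf("data.h5", key="df", mode="w")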
5. Stream Data or Use Progressive Loading
This may require algorithms that can learn iteratively using optimization techniques such as
stochastic gradient descent, instead of algorithms that require all data in memory to perform matrix
operations such as some implementations of linear and logistic regression.
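A minimal sketch of progressive loading with scikit-learn's SGDClassifier, using randomly generated chunks as stand-ins for data read from disk (names, chunk sizes, and the "log_loss" spelling used by recent scikit-learn versions are assumptions of this sketch):

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    # Logistic regression trained by stochastic gradient descent; partial_fit
    # lets the model learn from one chunk of data at a time.
    model = SGDClassifier(loss="log_loss", random_state=42)
    classes = np.array([0, 1])                   # must be declared on the first call

    for _ in range(10):                          # ten simulated chunks of 1,000 rows
        X_chunk = np.random.rand(1000, 20)       # stand-in for data read from disk
        y_chunk = np.random.randint(0, 2, size=1000)
        model.partial_fit(X_chunk, y_chunk, classes=classes)

    print(model.score(X_chunk, y_chunk))         # accuracy on the last chunk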
• Relational databases provide a standard way of storing and accessing very large datasets.
• Internally, the data is stored on disk, can be progressively loaded in batches, and can be
queried using a standard query language (SQL).
• Free open-source database tools like MySQL or Postgres can be used, and most programming
languages and many machine learning tools can connect directly to relational databases. You
can also use a lightweight approach such as SQLite; see the sketch below.
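For example, assuming a hypothetical SQLite file data.db containing a table named measurements, pandas can stream query results in batches instead of loading everything at once:

    import sqlite3
    import pandas as pd

    def process(chunk):
        # Placeholder for your own per-batch logic (fit, aggregate, etc.).
        print(len(chunk), "rows in this batch")

    conn = sqlite3.connect("data.db")            # hypothetical database file

    # chunksize turns read_sql_query into an iterator of DataFrames, so only
    # one batch of 10,000 rows is held in memory at a time.
    for chunk in pd.read_sql_query("SELECT * FROM measurements", conn,
                                   chunksize=10_000):
        process(chunk)

    conn.close()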
Machine Learning:
Machine learning (ML) is a type of artificial intelligence (AI) that allows computers to learn
without being explicitly programmed. This section explores the concept of machine learning,
provides various definitions, discusses its applications, and covers the different classifications of
machine learning tasks, giving you a comprehensive understanding of this powerful technology.
A subset of artificial intelligence known as machine learning focuses primarily on the creation of
algorithms that enable a computer to independently learn from data and previous experiences.
Arthur Samuel first used the term "machine learning" in 1959.
2. Automation
• Reduced Manual Intervention: Once trained, ML models can automatically make
predictions or decisions without human intervention, streamlining processes.
3. Generalization
• Ability to Generalize: A well-trained model can apply learned knowledge to new, unseen
data, making it useful in real-world scenarios.
4. Pattern Recognition
• Identifying Patterns: ML excels at recognizing complex patterns and correlations in large
datasets, which might be difficult for humans to identify.
5. Scalability
• Handling Large Datasets: ML algorithms can efficiently process and analyze vast amounts
of data, making them suitable for big data applications.
6. Versatility
• Wide Range of Applications: ML can be applied across various domains, including
finance, healthcare, marketing, and more, adapting to different types of data and problems.
7. Continuous Improvement
• Incremental Learning: ML models can continue to learn and improve over time as new
data becomes available, allowing for ongoing refinement.
8. Feature Engineering
• Automatic Feature Selection: Many ML algorithms can automatically identify the most
relevant features for making predictions, reducing the need for manual feature selection.
9. Predictive Capabilities
• Making Predictions: ML models are designed to predict future outcomes based on
historical data, enabling proactive decision-making.
2. Unsupervised learning:
Unsupervised learning is a type of machine learning algorithm used to draw inferences from
datasets consisting of input data without labeled responses. In unsupervised learning, classification
or categorization is not included in the observations. It works with unlabeled data, seeking to find
patterns or groupings (e.g., clustering customers based on purchasing behavior).
3. Reinforcement learning:
Reinforcement learning is the problem of getting an agent to act in the world so as to maximize its
rewards.
A learner is not told what actions to take, as in most forms of machine learning, but instead must
discover which actions yield the most reward by trying them. It involves agents that learn to make
decisions by taking actions in an environment to maximize some notion of cumulative reward (e.g.,
game-playing AI).
•Feeding engineered data: The input to any ML model is data. Even the most advanced
machine learning model can only be as good as the data from which it learns. The simple
concept of ”garbage in, garbage out” explains why feature engineering is so relevant for—
and intertwined with—model training. Feature engineering should be performed with
awareness of common hidden errors in the data set that can bias the training, including data
leakage where the target is indirectly represented in one or more features.
•Parametrized machine learning algorithm: ML algorithms are coded procedures with a
set of input parameters, known as “hyperparameters”. The developer can customize the
hyperparameters to tune the algorithm’s learning to the specific data set and use case. The
documentation of each algorithm should highlight implementation details, including the
complete set of tunable hyperparameters.
•Model with optimal learned trained parameters: Machine learning algorithms have
another set of parameters, known as “trainable parameters”, which correspond to the
coefficients automatically learned during model training. Trainable parameters make the
algorithm derive an output from an unseen input at prediction time within an expected range
of accuracy. Each algorithm learns in its own specific way, so each has a unique set of
trainable parameters. For example, a decision tree learns the choice of decision variables at
each node, while a neural network learns the weights associated with each layer’s activation
function.
•Minimize an objective function: An objective function defines how a machine learning
model learns its trainable parameters. The model adjusts those parameters so as to optimize
(minimize or maximize) the value returned by the objective function. Loss functions are the type
of objective function most commonly used in ML training, often accompanied by a regularization
term. A loss function measures how well the algorithm models the training data by providing an
error between the estimated and the true output value. The higher the error, the more the trainable
parameters are updated in that training iteration.
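To make this concrete, here is a small NumPy sketch of a mean-squared-error loss with an L2 regularization term for a simple linear model (all names and values are illustrative, not a specific library API):

    import numpy as np

    def loss(w, X, y, lam=0.1):
        """Mean squared error plus an L2 regularization penalty."""
        error = X @ w - y                       # estimated minus true outputs
        return np.mean(error ** 2) + lam * np.sum(w ** 2)

    # Toy data: 100 samples, 3 features, targets generated from known weights.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    true_w = np.array([1.0, -2.0, 0.5])
    y = X @ true_w + rng.normal(scale=0.1, size=100)

    print(loss(np.zeros(3), X, y))              # large loss before any learning
    print(loss(true_w, X, y))                   # small loss near the optimum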
Types of cross-validation
1. K-fold cross-validation
2. Hold-out cross-validation
3. Stratified k-fold cross-validation
4. Leave-p-out cross-validation
5. Leave-one-out cross-validation
6. Monte Carlo (shuffle-split) cross-validation
7. Time series (rolling) cross-validation
K-fold cross-validation
In this technique, the whole dataset is partitioned into k parts of equal size, and each partition is
called a fold. It’s known as k-fold since there are k parts, where k can be any integer: 3, 4, 5, etc.
One fold is used for validation and the other k-1 folds are used for training the model. So that every
fold serves once as the validation set with the remaining folds as the training set, the technique is
repeated k times.
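A short scikit-learn sketch of 5-fold cross-validation (the dataset and model here are illustrative placeholders):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold, cross_val_score

    X, y = load_iris(return_X_y=True)

    # Five folds; each fold is used exactly once as the validation set.
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)
    print(scores, scores.mean())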
Holdout Validation
Similar to train-test split, holdout validation involves setting aside a portion of the data for
evaluation. However, this portion is held out during the entire training process and only evaluated
once the final model is built.
This can be useful for datasets that are constantly updated, as the holdout set can be used to evaluate
the model's performance on the most recent data.
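A minimal holdout split with scikit-learn, keeping 20% of the data aside until the single final evaluation (the split size and model are illustrative):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)

    # 80% for training; 20% held out and evaluated only once, at the very end.
    X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.2,
                                                        random_state=42)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(model.score(X_hold, y_hold))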
Stratified k-fold cross-validation
Stratified k-fold cross-validation ensures each fold contains a representative proportion of each
class, which is important for datasets with imbalanced classes, where one class dominates the others.
The data is shuffled once, split into k parts, and each part is then used in turn to test the model.
This prevents the model from favoring the majority class and provides a more accurate assessment
of its performance across all classes.
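In scikit-learn, the only change from plain k-fold is the splitter class; this sketch assumes an imbalanced binary target generated at random:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    # Imbalanced toy data: roughly 90% of the labels are 0.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5))
    y = (rng.random(500) > 0.9).astype(int)

    # Every fold keeps approximately the same 90/10 class ratio as the full data.
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(LogisticRegression(), X, y, cv=skf)
    print(scores.mean())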
Leave-p-out cross-validation
An exhaustive cross-validation technique: if a dataset has n samples, p samples are used as the
validation set and the remaining n-p samples are used as the training set. The process is repeated
until every possible subset of p samples has served as the validation set.
The technique produces good results but has a very high computation time and is generally
considered computationally infeasible for larger datasets. It is also not ideal for an imbalanced
dataset: if the training set ends up containing samples of only one class, the model will not be able
to generalize properly and will become biased towards that class.
Leave-one-out cross-validation
In this technique, only 1 sample point is used as the validation set and the remaining n-1 samples are
used as the training set. Think of it as a special case of leave-p-out cross-validation with p = 1.
To understand this better, consider an example: there are 1,000 instances in your dataset. In each
iteration, 1 instance is used as the validation set and the remaining 999 instances are used as the
training set. The process repeats until every instance in the dataset has been used as a
validation sample.
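Both exhaustive schemes are available in scikit-learn; the sketch below deliberately uses a tiny subset of a toy dataset, because the number of iterations grows very quickly:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import LeaveOneOut, LeavePOut, cross_val_score

    X, y = load_iris(return_X_y=True)
    X, y = X[::5], y[::5]                        # 30 samples: exhaustive CV is costly

    # Leave-one-out: 30 iterations, one sample validated per iteration.
    loo = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
    print(loo.mean())

    # Leave-p-out with p = 2: every possible pair serves once as the validation set.
    lpo = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeavePOut(p=2))
    print(lpo.mean())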
Monte Carlo cross-validation (shuffle-split)
Also known as shuffle-split cross-validation or repeated random subsampling cross-validation, the
Monte Carlo technique involves randomly splitting the whole dataset into training data and test data.
The split can be 70-30%, 60-40%, or any proportion you prefer; the proportion is usually kept fixed
across iterations, and what changes each time is the random assignment of samples to the training
and test sets.
The next step is to fit the model on the training set for that iteration and calculate the accuracy of
the fitted model on the test set. Repeat these iterations many times (100, 400, 500 or even more)
and take the average of all the test errors to conclude how well your model performs.
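scikit-learn implements this as ShuffleSplit; the sketch below repeats a random 70-30% split 100 times and averages the scores (dataset and model are illustrative):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import ShuffleSplit, cross_val_score

    X, y = load_iris(return_X_y=True)

    # 100 independent random 70/30 splits; average the 100 test scores.
    ss = ShuffleSplit(n_splits=100, test_size=0.3, random_state=42)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=ss)
    print(scores.mean(), scores.std())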
Time series (rolling) cross-validation
In rolling cross-validation the folds are formed in chronological order: the model is trained on past
observations and validated on the observations that immediately follow. This mimics real-world
scenarios where past data is used to predict the future, and it prevents the model from peering into
the future during training.
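scikit-learn's TimeSeriesSplit produces exactly this rolling behaviour; in the sketch below every training window contains only observations that come before the corresponding validation window (the data is a dummy sequence):

    import numpy as np
    from sklearn.model_selection import TimeSeriesSplit

    # Twelve time-ordered observations (e.g. monthly values).
    X = np.arange(12).reshape(-1, 1)
    y = np.arange(12)

    tscv = TimeSeriesSplit(n_splits=4)
    for train_idx, test_idx in tscv.split(X):
        # Training indices always come before the corresponding test indices.
        print("train:", train_idx, "test:", test_idx)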
Supervised learning
Supervised learning is a type of machine learning algorithm that learns from labeled data. Labeled
data is data that has been tagged with a correct answer or classification.
Supervised learning, as the name indicates, involves the presence of a supervisor acting as a teacher.
Supervised learning is when we teach or train the machine using data that is well labelled, which
means some data is already tagged with the correct answer. After that, the machine is provided with
a new set of examples (data) so that the supervised learning algorithm analyses the training data
(the set of training examples) and produces a correct outcome from labelled data.
For example, a labeled dataset of images of Elephant, Camel and Cow would have each image tagged
with the name of the animal it shows.
Key Points:
•Supervised learning involves training a machine from labeled data.
•Labeled data consists of examples with the correct answer or classification.
•The machine learns the relationship between inputs (fruit images) and outputs (fruit labels).
•The trained machine can then make predictions on new, unlabeled data.
Example:
Let’s say you have a basket of fruit that you want to identify. The machine would first analyze the
image to extract features such as its shape, color, and texture. Then, it would compare these features
to the features of the fruits it has already learned about. If the new image’s features are most similar
to those of an apple, the machine would predict that the fruit is an apple.
For instance, suppose you are given a basket filled with different kinds of fruits. Now the first step
is to train the machine with all the different fruits one by one like this:
•If the shape of the object is rounded and has a depression at the top, is red in color, then it will
be labeled as –Apple.
•If the shape of the object is a long curving cylinder having Green-Yellow color, then it will be
labeled as –Banana.
Now suppose that, after training, you are given a new, separate fruit from the basket, say a banana,
and are asked to identify it.
Since the machine has already learned from the previous data, it now has to use that knowledge
wisely. It will first classify the fruit by its shape and color, confirm the fruit name as BANANA, and
put it in the Banana category. Thus the machine learns from the training data (the basket containing
fruits) and then applies that knowledge to the test data (the new fruit).
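A toy version of this fruit example, with shape and colour encoded as invented numeric features and a decision tree as the supervised model (the encoding and values are made up for illustration):

    from sklearn.tree import DecisionTreeClassifier

    # Invented numeric encoding: [roundness 0-1, length in cm, redness 0-1].
    X_train = [
        [0.9, 7, 0.9],    # apple: round, short, red
        [0.9, 8, 0.8],    # apple
        [0.2, 18, 0.1],   # banana: long curving cylinder, green-yellow
        [0.3, 20, 0.1],   # banana
    ]
    y_train = ["apple", "apple", "banana", "banana"]

    model = DecisionTreeClassifier().fit(X_train, y_train)

    # A new, unseen fruit: elongated, long and green-yellow.
    print(model.predict([[0.25, 19, 0.15]]))     # expected: ['banana']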
Evaluating supervised learning models is an important step in ensuring that the model is accurate
and generalizable. There are a number of different metrics that can be used to evaluate supervised
learning models, but some of the most common ones include:
For Regression
•Mean Squared Error (MSE): MSE measures the average squared difference between the
predicted values and the actual values. Lower MSE values indicate better model performance.
•Root Mean Squared Error (RMSE): RMSE is the square root of MSE, representing the
standard deviation of the prediction errors. Similar to MSE, lower RMSE values indicate better
model performance.
•Mean Absolute Error (MAE): MAE measures the average absolute difference between the
predicted values and the actual values. It is less sensitive to outliers compared to MSE or
RMSE.
•R-squared (Coefficient of Determination): R-squared measures the proportion of the
variance in the target variable that is explained by the model. Higher R-squared values indicate
better model fit.
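These regression metrics map directly onto scikit-learn functions; a small sketch with made-up predictions:

    import numpy as np
    from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

    y_true = np.array([3.0, 5.0, 2.5, 7.0])      # actual target values
    y_pred = np.array([2.8, 5.4, 2.0, 6.5])      # model predictions

    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)                          # RMSE is the square root of MSE
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(mse, rmse, mae, r2)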
For Classification
•Accuracy: Accuracy is the percentage of predictions that the model makes correctly. It is
calculated by dividing the number of correct predictions by the total number of predictions.
•Precision: Precision is the percentage of positive predictions that the model makes that are
actually correct. It is calculated by dividing the number of true positives by the total number of
positive predictions.
•Recall: Recall is the percentage of all positive examples that the model correctly identifies. It
is calculated by dividing the number of true positives by the total number of positive examples.
•F1 score: The F1 score is the harmonic mean of precision and recall, combining the two metrics
into a single value that balances them.
•Confusion matrix: A confusion matrix is a table that shows the number of predictions for each
class, along with the actual class labels. It can be used to visualize the performance of the model
and identify areas where the model is struggling.
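The classification metrics above are likewise available in scikit-learn; a sketch with a small set of made-up binary predictions:

    from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                                 f1_score, confusion_matrix)

    y_true = [1, 0, 1, 1, 0, 1, 0, 0]            # actual labels
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0]            # model predictions

    print("accuracy: ", accuracy_score(y_true, y_pred))
    print("precision:", precision_score(y_true, y_pred))
    print("recall:   ", recall_score(y_true, y_pred))
    print("f1:       ", f1_score(y_true, y_pred))
    print(confusion_matrix(y_true, y_pred))      # rows: actual, columns: predicted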
Applications of Supervised learning
•Medical diagnosis: Supervised learning can assist in medical diagnosis by analyzing patient
data, such as medical images, test results, and patient history, to identify patterns that suggest
specific diseases or conditions.
•Fraud detection: Supervised learning models can analyze financial transactions and identify
patterns that indicate fraudulent activity, helping financial institutions prevent fraud and protect
their customers.
•Natural language processing (NLP): Supervised learning plays a crucial role in NLP tasks,
including sentiment analysis, machine translation, and text summarization, enabling machines to
understand and process human language effectively.
Advantages of Supervised learning
•Supervised learning allows you to collect data and produce outputs informed by previous
experience.
•Helps to optimize performance criteria with the help of experience.
•Supervised machine learning helps to solve various types of real-world computation problems.
•It performs classification and regression tasks.
•It allows estimating or mapping the result to a new sample.
•We have complete control over choosing the number of classes we want in the training data.
Disadvantages of Supervised learning
•Classifying big data can be challenging.
•Training supervised learning models requires a lot of computation time and resources.
•Supervised learning cannot handle all complex tasks in machine learning.
•It requires a labelled data set.
•It requires a training process.
Unsupervised learning
Unsupervised learning is a type of machine learning that learns from unlabeled data. This means
that the data does not have any pre-existing labels or categories. The goal of unsupervised learning
is to discover patterns and relationships in the data without any explicit guidance.
Unsupervised learning is the training of a machine using information that is neither classified nor
labeled, allowing the algorithm to act on that information without guidance. Here the task of the
machine is to group unsorted information according to similarities, patterns, and differences without
any prior training on the data.
Unlike supervised learning, no teacher is provided, which means no training labels are given to the
machine. Therefore the machine is restricted to finding the hidden structure in unlabeled data by itself.
For example, you can use unsupervised learning to examine animal data that has been gathered and
distinguish several groups according to the traits and actions of the animals. These groupings might
correspond to various animal species, allowing you to categorize the creatures without depending
on labels that already exist.
Key Points
•Unsupervised learning allows the model to discover patterns and relationships in unlabeled
data.
•Clustering algorithms group similar data points together based on their inherent characteristics.
•Feature extraction captures essential information from the data, enabling the model to make
meaningful distinctions.
•Label association assigns categories to the clusters based on the extracted patterns and
characteristics.
Example
Imagine you have a machine learning model trained on a large dataset of unlabeled images,
containing both dogs and cats. The model has never seen an image of a dog or cat before, and it has
no pre-existing labels or categories for these animals. Your task is to use unsupervised learning to
identify the dogs and cats in a new, unseen image.
For instance, suppose the model is given an image containing both dogs and cats that it has never
seen before.
The machine has no idea of the features of dogs and cats, so it cannot categorize the image as ‘dogs
and cats’. But it can group the animals according to their similarities, patterns, and differences: the
picture can easily be divided into two parts, one containing all the pictures of dogs and the other
containing all the pictures of cats, even though nothing was learned beforehand, i.e., there is no
training data or examples.
This allows the model to work on its own to discover patterns and information that were previously
undetected. It mainly deals with unlabelled data.
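A minimal clustering sketch in the same spirit: k-means groups unlabeled points into two clusters using only their similarity (the 2-D points below are synthetic stand-ins for image features):

    import numpy as np
    from sklearn.cluster import KMeans

    # Two unlabeled "blobs" standing in for dog-like and cat-like feature vectors.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
                   rng.normal(loc=5.0, scale=0.5, size=(50, 2))])

    # No labels are given; k-means groups the points purely by similarity.
    labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
    print(labels[:5], labels[-5:])               # the two groups get different ids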
Evaluating unsupervised learning models is an important step in ensuring that the model is
effective and useful. However, it can be more challenging than evaluating supervised learning
models, as there is usually no ground truth data to compare the model’s predictions to.
There are a number of different metrics that can be used to evaluate unsupervised learning
models, but some of the most common ones include:
•Silhouette score: The silhouette score measures how well each data point is clustered with its
own cluster members and separated from other clusters. It ranges from -1 to 1, with higher
scores indicating better clustering.
•Calinski-Harabasz score: The Calinski-Harabasz score measures the ratio between the
variance between clusters and the variance within clusters. It ranges from 0 to infinity, with
higher scores indicating better clustering.
•Adjusted Rand index: The adjusted Rand index measures the similarity between two
clusterings. It ranges from -1 to 1, with higher scores indicating more similar clusterings.
•Davies-Bouldin index: The Davies-Bouldin index measures the average similarity between
clusters. It ranges from 0 to infinity, with lower scores indicating better clustering.
•F1 score: The F1 score is the harmonic mean of precision and recall, two metrics commonly
used in supervised learning to evaluate classification models. When ground-truth labels are
available, it can also be used to evaluate unsupervised models such as clustering models.
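The internal clustering metrics listed above can be computed directly from the data and the cluster assignments; a sketch reusing a simple k-means result on synthetic data:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                                 davies_bouldin_score)

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
                   rng.normal(5.0, 0.5, size=(50, 2))])
    labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

    print(silhouette_score(X, labels))           # closer to 1 is better
    print(calinski_harabasz_score(X, labels))    # higher is better
    print(davies_bouldin_score(X, labels))       # lower is better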
Application of Unsupervised learning
Difference between supervised and unsupervised machine learning:

Parameters | Supervised machine learning | Unsupervised machine learning
Computational complexity | Simpler method | Computationally complex
Training data | Uses training data to infer the model. | No training data is used.
Complex model | It is not possible to learn larger and more complex models than with supervised learning. | It is possible to learn larger and more complex models with unsupervised learning.
Model | We can test our model. | We cannot test our model.