Understanding Data Science Basics
INTRODUCTION
The term "Data Science" was coined at the beginning of the 21st century. It is
commonly attributed to William S. Cleveland, who proposed it as an extension of statistics in 2001.
Data analytics is the science of analyzing raw data in order to make conclusions
about that information.
Data science is the study of the generalizable extraction of knowledge from data.
Data science is the domain of study that deals with vast volumes of data using
modern tools and techniques and machine learning algorithms to derive
meaningful information and make business decisions.
The data used for analysis can come from multiple sources and be present in various formats.
The goal of data science is to gain insights and knowledge from any type of data,
both structured and unstructured.
Examples:
1. Basketball teams use data to track team strategies and the outcomes of matches.
2. Amazon has a huge amount of consumer purchasing data. The data consists of
consumer demographics (age, sex, location), purchasing history and past browsing
history. Based on this data, Amazon segments its customers, finds patterns and
recommends the right product to the right customer at the right time.
3. Google's self-driving car is a smart, driverless car. It collects data from the environment
through sensors and takes decisions such as when to speed up, slow down or overtake.
The common goals in these examples are:
→ Prediction
→ Pattern discovery
Where is it used?
Applications / Use Cases
1. Travel:
Dynamic pricing
Predicting flight delays
2. Marketing:
Upselling
Cross selling
Predicting the life time value of a customer
3. Healthcare:
Disease prediction
Medication effectiveness
4. Social Media:
Sentiment Analysis
Digital marketing
5. Automation:
Self driving car
Pilotless aircraft, drones
Credit and Insurance
6. Sales:
Discount offering
Demand forecasting
7. Fraud & Risk detection:
Claims prediction
Detection of fraudulent transactions
Life Cycle
Phase 1—Discovery:
It involves acquiring the data from all the identified internal and external sources
that can help answer the business question.
The data could be: logs from web servers, social media data, census datasets, data
streamed from online sources
The data can have inconsistencies like missing values, blank columns, incorrect
data format which need to be cleaned. So, it is required to explore and preprocess
the data prior to data modeling. This will help us to spot the outlier and establish a
relationship between the variables.
We clean and preprocess the data by removing the outliers, filling in the null
values, and normalizing data types during this phase.
Model planning: In this phase, we determine the methods and techniques to draw the
relationships between the variables.
Phase 5—Operationalize:
In this phase, you deliver final reports, briefings, code and technical documents.
This will provide you a clear picture of the performance and other related
constraints on a small scale before full deployment.
So, in the last phase, you identify all the key findings, communicate to the
stakeholders and determine if the results of the project are a success or a failure
based on the criteria developed in Phase 1.
Product Recommendation
Suppose a salesperson at Big Bazaar is trying to increase the sales of the store
by bundling products together and giving discounts on them. He bundles
shampoo and conditioner together and offers a discount on the bundle, so
customers are more likely to buy them together at the discounted price.
Future Forecasting:
Predictive analysis is one of the most used domains in data science. We are all
aware of Weather forecasting or future forecasting based on various types of data
that are collected from various sources. For example, suppose we want to
forecast COVID-19 cases to get an overview of the coming days of the pandemic.
Based on the collected data, data science techniques can be used to forecast the
future situation.
Fraud and Risk Detection:
As online transactions boom, there are many possibilities of losing your
personal data. So one of the most important applications of data science is
fraud and risk detection.
For example, credit card fraud detection depends on the amount, merchant,
location, time and other variables. If any of them looks unusual, the
transaction is automatically cancelled and the card may be blocked for 24
hours or more.
Self Driving Car:
In today's world, the self-driving car is one of the most notable inventions.
Based on previous data, we train the car to take decisions on its own. In this
process, we can give a penalty to the model if it does not perform well.
The car (model) becomes more intelligent over time as it learns from all of its
real-time experiences.
Image Recognition:
When you want to recognize images, data science provides the ability to detect
an object and then classify and recognize it. The most popular example of image
recognition is face recognition: if you ask your smartphone to unlock, it
will scan your face.
So first the system detects the face, then classifies it as a human face,
and only after that decides whether the phone belongs to the actual owner.
Data science has plenty of exciting applications like this to work on.
Speech-to-Text Conversion:
This may seem like a lot to take in, but we can look at the bigger picture:
Google Assistant first tries to recognize our speech and then converts that
speech into text form using an algorithm.
2. Mathematics: Mathematics is the most critical, primary, and necessary part of data
science. It is used to study structure, quantity, quality, space, and change in data. So
every aspiring data scientist must have good knowledge in mathematics to read the
data mathematically and build meaningful insights from the data
5. Domain Expertise: Domain expertise helps data scientists interpret data and results
properly by applying knowledge of the specific area they are working in.
7. Machine learning: Machine learning is one of the most useful and essential parts of data
science. It helps identify the best features and build accurate models.
1. Data Storage
2. Exploratory Data Analysis
3. Data Modelling
4. Data Visualization
Apache Spark – This tool is an improved alternative to Hadoop MapReduce and can run up
to 100 times faster. Spark is designed specifically to manage both batch
processing and stream processing. Several Machine Learning APIs in Spark
help data scientists make accurate and powerful predictions from given data. It has
an edge over many other big-data platforms because it can process real-time data,
unlike analytical tools that can only process batches of historical data.
TensorFlow – TensorFlow is also used for Machine Learning, but for more advanced
algorithms such as deep learning. Due to the high processing ability of TensorFlow, it
finds a variety of applications in image classification, speech recognition, drug
discovery, etc.
BIG DATA
Big Data is a collection of data that is huge in volume.
→ Big data can be analyzed for insights that lead to better decisions and strategic
business moves.
Types of Big Data
Structured: Structured is one of the types of big data and by structured data, we
mean data that can be processed, stored, and retrieved in a fixed format.
Unstructured: Unstructured data refers to the data that lacks any specific form.
This makes it very difficult and time-consuming to process and analyze
unstructured data.
1)Variety:
Variety of Big Data refers to structured, unstructured, and semi structured data that
is gathered from multiple sources. While in the past, data could only be collected
from spreadsheets and databases, today data comes in an array of forms such as
emails, PDFs, photos, videos, audios and more.
2) Velocity: Velocity essentially refers to the speed at which data is being created
in real-time.
3) Volume: We already know that Big Data indicates huge ‘volumes’ of data that
is being generated on a daily basis from various sources like social media
platforms, business processes, machines, networks, human interactions, etc. Such a
large amount of data is stored in data warehouses.
It tells us what other people think it is, or how they are perceiving it.
i) Hacking skills
These are the hacking skills that make for a successful data hacker.
Once you have acquired and cleaned the data, the next step is to actually
extract insight from it. For this, you need to apply appropriate math and
statistics methods, which requires ii) Math and statistics knowledge.
Machine learning refers to the ability to automatically and quickly apply
mathematical calculations to big data.
Nathan Yau's 2009 post, "Rise of the Data Scientist", lists skills which include:
Visualization (graphs, tools, etc.)
Datafication refers to the fact that daily interactions of living things can be
rendered into a data format and put to social use.
Once we datafy things, we can transform their purpose and turn the information
into new forms of value.
Examples:
Social platforms such as Facebook or Instagram, for example, collect and
monitor data about our friendships in order to market products and services to us
and to provide surveillance services to agencies, which in turn changes our
behaviour; the promotions we see daily on social media are also the result of this
monitored data. In this model, data is used to redefine how content is created:
datafication informs the content itself, not just recommendation systems.
However, datafication is also actively used in many other industries.
• What is EDA
• Exploratory data analysis tools
• Types of exploratory data analysis
• EDA Process
• Datascience Process
• What does a Data Scientist do
What is exploratory data analysis?
Exploratory data analysis (EDA) is used by data scientists to analyze and
investigate data sets and summarize their main characteristics, often using data
visualization methods.
Common techniques used in exploratory data analysis include:
• Univariate visualization
• Multivariate visualization
• K-means clustering
• Predictive models
Types of exploratory data analysis
1. Univariate non-graphical
Univariate non-graphical EDA techniques are concerned with understanding the
underlying sample distribution and making observations about the population. This
also involves outlier detection.
For univariate categorical data, we are interested in the range and the
frequency.
2. Univariate graphical
3. Multivariate nongraphical
Variable Identification
The very first step in exploratory data analysis is to identify the type of variables in
the dataset. Variables are of two types — Numerical and Categorical. They can be
further classified as follows:
Once the type of variables is identified, the next step is to identify the Predictor
(Inputs) and Target (output) variables.
Univariate Analysis
Bivariate Analysis
Missing Value Treatment
Outlier Removal
Outliers are unusual values in your dataset, and they can distort statistical analyses and violate
their assumptions. Outliers increase the variability in your data, which decreases
statistical power.
Imagine, for example, that we're measuring the heights of adult men: a few extreme values in such a dataset would pull the mean away from the typical height and inflate the variance.
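As a quick illustration, here is a minimal Python sketch (the height values are made up for illustration) that flags outliers with the common 1.5 * IQR rule, which is discussed further under Measures of Spread:

import numpy as np

# Hypothetical heights (in cm) of adult men, with one obvious data-entry error
heights = np.array([172, 168, 180, 175, 169, 177, 171, 174, 250, 173])

# Flag outliers with the common 1.5 * IQR rule
q1, q3 = np.percentile(heights, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = heights[(heights < lower) | (heights > upper)]

print("Mean with outlier:   ", heights.mean())
print("Mean without outlier:", heights[(heights >= lower) & (heights <= upper)].mean())
print("Detected outliers:   ", outliers)

The single extreme value pulls the mean up by several centimetres, which is exactly the distortion described above.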
The first thing you have to do before you solve a problem is to define exactly what
it is. You need to be able to translate data questions into something actionable.
You’ll often get ambiguous inputs from the people who have problems. You’ll
have to develop the intuition to turn scarce inputs into actionable outputs–and to
ask the questions that nobody else is asking.
Say you’re solving a problem for the VP Sales of your company. You should start
by understanding their goals and the underlying why behind their data questions.
Before you can start thinking of solutions, you’ll want to work with them to clearly
define the problem.
4. What is different from segments who are performing well and those that are
performing below expectations?
5. How much money will we lose if we don’t actively sell the product to these
groups?
It’s important that at the end of this stage, you have all of the information and
context you need to solve this problem.
Once you’ve defined the problem, you’ll need data to give you the insights needed
to turn the problem around with a solution. This part of the process involves
finding ways to get that data, whether it’s querying internal databases, or
purchasing external datasets.
You might find out that your company stores all of their sales data in a CRM or a
customer relationship management software platform. You can export the CRM
data in a CSV file for further analysis.
Now that you have all of the raw data, you’ll need to process it before you can do
any analysis. Oftentimes, data can be quite messy, especially if it hasn’t been well-
maintained. You’ll see errors that will corrupt your analysis: values set to null
though they really are zero, duplicate values, and missing values. It’s up to you to
go through and check your data to make sure you’ll get accurate insights.
3. Timezone differences, perhaps your database doesn’t take into account the
different timezones of your users
4. Date range errors, perhaps you'll have dates that make no sense, such as
data registered from before sales started
You’ll need to look through aggregates of your file rows and columns and sample
some test values to see if your values make sense. If you detect something that
doesn’t make sense, you’ll need to remove that data or replace it with a default
value. You’ll need to use your intuition here: if a customer doesn’t have an initial
contact date, does it make sense to say that there was NO initial contact date? Or
do you have to hunt down the VP Sales and ask if anybody has data on the
customer’s missing initial contact dates?
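A minimal pandas sketch of this kind of cleaning is shown below; the file name, column names and cut-off date are hypothetical, chosen only to illustrate the checks described above, not to reproduce any particular CRM export.

import pandas as pd

# Hypothetical CRM export; column names are illustrative
df = pd.read_csv("crm_export.csv")

# Remove exact duplicate rows
df = df.drop_duplicates()

# Treat nulls that really mean zero (e.g., no purchases recorded yet)
df["purchases"] = df["purchases"].fillna(0)

# Parse dates; invalid strings become NaT instead of corrupting the analysis
df["initial_contact_date"] = pd.to_datetime(df["initial_contact_date"], errors="coerce")

# Drop rows with dates that make no sense, e.g., before sales started
sales_start = pd.Timestamp("2015-01-01")   # hypothetical cut-off
df = df[df["initial_contact_date"].isna() | (df["initial_contact_date"] >= sales_start)]

print(df.info())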
Once you’re done working with those questions and cleaning your data, you’ll be
ready for exploratory data analysis (EDA).
When your data is clean, you should start playing with it!
The difficulty here isn’t coming up with ideas to test, it’s coming up with ideas that
are likely to turn into insights. You’ll have a fixed deadline for your data science
project (your VP Sales is probably waiting on your analysis eagerly!), so you’ll
have to prioritize your questions.
You’ll have to look at some of the most interesting patterns that can help explain
why sales are reduced for this group. You might notice that they don’t tend to be
very active on social media, with few of them having Twitter or Facebook
accounts. You might also notice that most of them are older than your general
audience. From that you can begin to trace patterns you can analyze more deeply.
In this case, you might have to create a predictive model that compares your
underperforming group with your average customer. You might find out that the
age and social media activity are significant factors in predicting who will buy the
product.
If you’d asked a lot of the right questions while framing your problem, you might
realize that the company has been concentrating heavily on social media marketing
efforts, with messaging that is aimed at younger audiences. You would know that
certain demographics prefer being reached by telephone rather than by social
media. You begin to see how the way the product has been marketed is
significantly affecting sales: maybe this problem group isn't a lost cause! A change
in tactics from social media marketing to more in-person interactions could change
everything for the better. This is something you’ll have to flag to your VP Sales.
You can now combine all of those qualitative insights with data from your
quantitative analysis to craft a story that moves people to action.
It’s important that the VP Sales understand why the insights you’ve uncovered
are important. Ultimately, you’ve been called upon to create a solution
throughout the data science process. Proper communication will mean the
difference between action and inaction on your proposals.
You need to craft a compelling story here that ties your data with their knowledge.
You start by explaining the reasons behind the underperformance of the older
demographic. You tie that in with the answers your VP Sales gave you and the
insights you’ve uncovered from the data. Then you move to concrete solutions that
address the problem: we could shift some resources from social media to personal
calls. You tie it all together into a narrative that solves the pain of your VP Sales:
she now has clarity on how she can reclaim sales and hit her objectives.
It’s important to understand these steps if you want to systematically think about
data science, and even more so if you’re looking to start a career in data science.
The data science process can be pictured as a cycle: ask questions to frame the business problem, get relevant data for analysis, explore the data to make error corrections, model the data for in-depth analysis, and communicate the results. The main steps are:
• Frame the problem
• Collect the raw data
• Process the data for analysis
• Explore the data
• Perform in-depth analysis
• Communicate results of the analysis
Categories of Data
Data can be categorized into two sub-categories:
1. Qualitative Data
2. Quantitative Data
Qualitative Data: Qualitative data deals with characteristics and descriptors that can’t be easily
measured, but can be observed subjectively. Qualitative data is further divided into two types of
data:
Nominal Data: Data with no inherent order or ranking such as gender or race.
Ordinal Data: Data with an ordered series of information is called ordinal data.
Quantitative Data: Quantitative data deals with numbers and things you can measure
objectively. This is further divided into two:
Discrete Data: Numeric data that can hold only a finite (countable) number of possible
values, such as the number of students in a class.
Continuous Data: Data that can hold an infinite number of possible values.
So these were the different categories of data. The upcoming sections will focus on the basic
Statistics concepts, so buckle up and get ready to do some math.
What Is Statistics?
Statistics is an area of applied mathematics concerned with data collection, analysis,
interpretation, and presentation.
This area of mathematics deals with understanding how data can be used to solve complex
problems. Here are a couple of example problems that can be solved by using statistics:
Your company has created a new drug that may cure cancer. How would you conduct a
test to confirm the drug’s effectiveness?
You and a friend are at a baseball game, and out of the blue, he offers you a bet that
neither team will hit a home run in that game. Should you take the bet?
The latest sales data have just come in, and your boss wants you to prepare a report for
management on places where the company could improve its business. What should you
look for? What should you not look for?
These above-mentioned problems can be easily solved by using statistical techniques. In the
upcoming sections, we will see how this can be done.
Now you must be wondering how one can choose a sample that best represents the entire
population.
Sampling Techniques
Sampling is a statistical method that deals with the selection of individual observations within a
population. It is performed to infer statistical knowledge about a population.
Consider a scenario wherein you’re asked to perform a survey about the eating habits of
teenagers in the US. There are over 42 million teens in the US at present and this number is
growing as you read this blog. Is it possible to survey each of these 42 million individuals about
their health? Obviously not! That’s why sampling is used. It is a method wherein a sample of the
population is studied in order to draw inference about the entire population.
1. Probability Sampling
2. Non-Probability Sampling
Here, we'll be focusing only on probability sampling techniques, because non-probability
sampling is beyond the present scope.
Probability Sampling: This is a sampling technique in which samples from a large population
are chosen using the theory of probability. There are three types of probability sampling:
Random Sampling: In this method, each member of the population has an equal chance
of being selected in the sample.
Systematic Sampling: In Systematic sampling, every nth record is chosen from the
population to be a part of the sample. Refer the below figure to better understand how
Systematic sampling works.
Stratified Sampling: In stratified sampling, the population is divided into subgroups
(strata) that share at least one common characteristic, and a sample is drawn from each
stratum.
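The sketch below shows, using an assumed population table and column names, how the three probability sampling techniques could be implemented with pandas; it is only an illustrative sketch, not a survey-grade sampling design.

import pandas as pd

# Hypothetical population of teenagers with an age-group column
population = pd.DataFrame({
    "id": range(1, 1001),
    "age_group": (["13-15", "16-17", "18-19"] * 334)[:1000],
})

# Random sampling: every member has an equal chance of selection
random_sample = population.sample(n=100, random_state=42)

# Systematic sampling: every nth record (here every 10th)
systematic_sample = population.iloc[::10]

# Stratified sampling: sample proportionally from each stratum (age group)
stratified_sample = (
    population.groupby("age_group", group_keys=False)
    .apply(lambda g: g.sample(frac=0.1, random_state=42))
)

print(len(random_sample), len(systematic_sample), len(stratified_sample))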
Types of Statistics
There are two well-defined types of statistics:
1. Descriptive Statistics
2. Inferential Statistics
Descriptive Statistics
Descriptive statistics is a method used to describe and understand the features of a specific data
set by giving short summaries about the sample and measures of the data.
Descriptive Statistics is mainly focused upon the main characteristics of data. It provides a
graphical summary of the data.
Suppose you want to gift all your classmate’s t-shirts. To study the average shirt size of students
in a classroom, in descriptive statistics you would record the shirt size of all students in the class
and then you would find out the maximum, minimum and average shirt size of the class.
Inferential Statistics
Inferential statistics makes inferences and predictions about a population based on a sample of
data taken from the population in question.
Inferential statistics generalizes a large dataset and applies probability to draw a conclusion. It
allows us to infer data parameters based on a statistical model using sample data.
So, if we consider the same example of finding the average shirt size of students in a class, in
Inferential Statistics, you will take a sample set of the class, which is basically a few people from
the entire class. You have already grouped the class into large, medium and small. In this
method, you basically build a statistical model and expand it for the entire population in the
class.
Measures of Centre
Measures of the center are statistical measures that represent the summary of a dataset. There are
three main measures of center:
1. Mean: Measure of the average of all the values in a sample is called Mean.
2. Median: Measure of the central value of the sample set is called Median.
3. Mode: The value most recurrent in the sample set is known as Mode.
To better understand the Measures of central tendency let’s look at an example. The below cars
dataset contains the following variables:
Cars
Mileage per Gallon(mpg)
Cylinder Type (cyl)
Displacement (disp)
Horse Power(hp)
Rear Axle Ratio (drat)
Using descriptive Analysis, you can analyze each of the variables in the sample data set for
mean, standard deviation, minimum and maximum.
If we want to find out the mean or average horsepower of the cars among the population of cars,
we will check and calculate the average of all values. In this case, we’ll take the sum of the
Horse Power of each car, divided by the total number of cars:
Mean = (110+110+93+96+90+110+110+110)/8 = 103.625
If we want to find out the center value of mpg among the population of cars, we will arrange the
mpg values in ascending or descending order and choose the middle value. In this case, we have
8 values which is an even entry. Hence we must take the average of the two middle values.
If we want to find out the most common type of cylinder among the population of cars, we will
check the value which is repeated the most number of times. Here we can see that the cylinders
come in two values, 4 and 6. Take a look at the data set, you can see that the most recurring
value is 6. Hence 6 is our Mode.
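As a quick check of these calculations, here is a minimal Python sketch using the eight horsepower values from the example; since the individual mpg and cylinder values are not reproduced in this text, the cylinder list below is an assumption, chosen only to be consistent with the description (values 4 and 6, with 6 the most frequent).

from statistics import mean, median, mode

# Horse power values from the example above
hp = [110, 110, 93, 96, 90, 110, 110, 110]

# Cylinder values: assumed, consistent with the text (only 4s and 6s, 6 most frequent)
cyl = [6, 6, 4, 4, 4, 6, 6, 6]

print("Mean horsepower:  ", mean(hp))     # 103.625, as computed above
print("Median horsepower:", median(hp))   # average of the two middle values
print("Mode of cylinders:", mode(cyl))    # 6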
Measures Of The Spread
A measure of spread, sometimes also called a measure of dispersion, is used to describe the
variability in a sample or population.
Just like the measures of center, we also have measures of spread, which comprise the
following:
Range: It is a measure of how spread apart the values in a data set are. The range
is calculated as:
Range = Maximum value – Minimum value
Quartile: Quartiles tell us about the spread of a data set by breaking the data set into
quarters, just like the median breaks it in half.
To better understand how quartile and the IQR are calculated, let’s look at an example.
Consider the marks of 100 students ordered from lowest to highest scores. The
quartiles lie in the following ranges:
1. The first quartile (Q1) lies between the 25th and 26th observation.
2. The second quartile (Q2) lies between the 50th and 51st observation.
3. The third quartile (Q3) lies between the 75th and 76th observation.
Inter Quartile Range (IQR): It is the measure of variability, based on dividing a data set
into quartiles. The interquartile range is equal to Q3 minus Q1, i.e. IQR = Q3 – Q1
Variance: It describes how much a random variable differs from its expected value. It
entails computing squares of deviations. Variance can be calculated by using the below
formula:
σ² = Σ(x – x̅)² / n
Here,
x: Individual data points
n: Total number of data points
x̅ : Mean of data points
Deviation is the difference between each element from the mean. It can be calculated by
using the below formula:
Deviation = (xᵢ – µ)
Sample Variance is the average of the squared differences from the mean. It can be
calculated by using the below formula:
σ² = Σ(xᵢ – µ)² / n
Standard Deviation: It is the measure of the dispersion of a set of data from its mean. It
is the square root of the variance and can be calculated by using the below formula:
σ = √( Σ(xᵢ – µ)² / n )
To better understand how the Measures of spread are calculated, let’s look at a use case.
Problem statement: Daenerys has 20 Dragons. They have the numbers 9, 2, 5, 4, 12, 7, 8, 11, 9,
3, 7, 4, 12, 5, 4, 10, 9, 6, 9, 4. Work out the Standard Deviation.
Let’s look at the solution step by step:
Step 1: Find the mean of the 20 values:
µ = (9+2+5+4+12+7+8+11+9+3+7+4+12+5+4+10+9+6+9+4) / 20 = 140 / 20
µ = 7
Step 2: Then for each number, subtract the Mean and square the result.
(x_i – μ)²
(9-7)²= 2²=4
(2-7)²= (-5)²=25
(5-7)²= (-2)²=4
And so on…
Step 3: Then work out the mean of those squared differences:
σ² = (4+25+4+9+25+0+1+16+4+16+0+9+25+4+9+9+4+1+4+9) / 20 = 178 / 20
∴ σ² = 8.9
Step 4: Take the square root to get the standard deviation:
σ = √8.9 ≈ 2.983
To better understand the measures of spread and center, let’s execute a short demo by using the
R language.
Probability Distributions
➢ As an example, imagine the roll of a fair die. In this case, there are
multiple possible events, each of them having the same probability of
happening (a uniform distribution).
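A tiny simulation illustrates this: rolling a fair die many times gives each face roughly the same relative frequency. The NumPy sketch below is purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=60_000)   # simulate 60,000 rolls of a fair die

faces, counts = np.unique(rolls, return_counts=True)
for face, count in zip(faces, counts):
    print(f"Face {face}: {count / len(rolls):.3f}")   # each close to 1/6 ≈ 0.167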
Model Fitting
➢ Underfitting – Underfitting happens when the machine learning model can neither
sufficiently model the training data nor generalize to new data; it performs poorly on
both the training set and the test set.
➢ Overfitting – Overfitting happens when the model learns the training data too closely,
including its noise. It will perform well on the training set, but very poorly on the test set.
Overfitting: High Specificity, Low Generalizability
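A common way to see the difference in practice is to compare training and test error. The sketch below uses toy data (NumPy only, values invented for illustration): a degree-0 model underfits, while a degree-9 polynomial fitted to ten points overfits, giving near-zero training error but a much larger test error.

import numpy as np

rng = np.random.default_rng(1)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 10)
x_test = np.linspace(0.03, 0.97, 10)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, 10)

def mse(coeffs, x, y):
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

under = np.polyfit(x_train, y_train, deg=0)   # underfits: just a constant
over = np.polyfit(x_train, y_train, deg=9)    # overfits: passes through every training point

print("Underfit train/test MSE:", mse(under, x_train, y_train), mse(under, x_test, y_test))
print("Overfit  train/test MSE:", mse(over, x_train, y_train), mse(over, x_test, y_test))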
Unit-III
Machine learning algorithms are programs that can learn from data and improve
from experience, without human intervention. Learning tasks may include learning
the function that maps the input to the output, learning the hidden structure in
unlabeled data; or ‘instance-based learning’, where a class label is produced for a
new instance by comparing the new instance (row) to instances from the training
data, which were stored in memory. ‘Instance-based learning’ does not create an
abstraction from specific instances.
In supervised learning, the goal is to learn the mapping function from the input variables
(X) to the output variable (Y): Y = f(X).
We’ll talk about two types of supervised learning: classification and regression.
Classification predicts the outcome of a given sample when the output variable is
in the form of categories. A classification model might look at the input data and
try to predict labels like “sick” or “healthy.”
Regression is used to predict the outcome of a given sample when the output
variable is in the form of real values. For example, a regression model might
process input data to predict the amount of rainfall, the height of a person, etc.
Linear Regression, Logistic Regression, CART, Naïve Bayes, and K-Nearest
Neighbors (KNN) are examples of supervised learning algorithms.
Clustering is used to group samples such that objects within the same cluster are
more similar to each other than to the objects from another cluster.
Algorithms 6-8 that we cover here — Apriori, K-means, PCA — are examples of
unsupervised learning.
Reinforcement learning:
Reinforcement algorithms usually learn optimal actions through trial and error.
Imagine, for example, a video game in which the player needs to move to certain
places at certain times to earn points. A reinforcement algorithm playing that game
would start by moving randomly but, over time through trial and error, it would
learn where and when it needed to move the in-game character to maximize its
point total.
1. Linear Regression
In machine learning, we have a set of input variables (x) that are used to determine
an output variable (y). A relationship exists between the input variables and the
output variable. The goal of ML is to quantify this relationship.
Figure 1 shows the plotted x and y values for a data set. The goal is to fit a line that
is nearest to most of the points. This would reduce the distance (‘error’) between
the y value of a data point and the line.
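As a sketch of this idea, the code below fits a least-squares line to a small made-up dataset with NumPy; the experience/salary numbers are purely illustrative.

import numpy as np

# Hypothetical data: x = years of experience, y = salary (in thousands)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([30, 35, 39, 46, 50, 57, 60, 68])

# Fit y = b1 * x + b0 by minimizing the squared error between the line and the points
b1, b0 = np.polyfit(x, y, deg=1)
predictions = b1 * x + b0

print(f"slope b1 = {b1:.2f}, intercept b0 = {b0:.2f}")
print("residuals:", y - predictions)   # distances between each point and the fitted line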
2. Logistic Regression
Whereas linear regression predictions are continuous values (e.g., rainfall in cm), logistic
regression predictions are discrete values (e.g., whether a student passed or failed)
after applying a transformation function.
Logistic regression is best suited for binary classification: data sets where y = 0 or
1, where 1 denotes the default class. For example, in predicting whether an event
will occur or not, there are only two possibilities: that it occurs (which we denote
as 1) or that it does not (0). So if we were predicting whether a patient was sick, we
would label sick patients using the value of 1 in our data set.
In logistic regression, the output takes the form of probabilities of the default class
(unlike linear regression, where the output is directly produced). As it is a
probability, the output lies in the range of 0-1. So, for example, if we’re trying to
predict whether patients are sick, we already know that sick patients are denoted
as 1, so if our algorithm assigns the score of 0.98 to a patient, it thinks that patient
is quite likely to be sick.
This output (y-value) is generated by log transforming the x-value, using the
logistic function h(x)= 1/ (1 + e^ -x) . A threshold is then applied to force this
probability into a binary classification.
Figure 2: Logistic Regression to determine if a tumor is malignant or benign.
Classified as malignant if the probability h(x)>= 0.5. Source
The logistic regression equation P(x) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x)) can be
transformed into ln(P(x) / (1 – P(x))) = b0 + b1*x.
The goal of logistic regression is to use the training data to find the values of
coefficients b0 and b1 such that it will minimize the error between the predicted
outcome and the actual outcome. These coefficients are estimated using the
technique of Maximum Likelihood Estimation.
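Here is a minimal scikit-learn sketch of binary logistic regression; the tumor-size values and labels are hypothetical, chosen only to mirror the malignant/benign example of Figure 2, and the 0.5 threshold is the same one used there.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: tumor size (cm) and whether it was malignant (1) or benign (0)
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# Predicted probability of the default class (malignant) for a new tumor of 2.2 cm
prob_malignant = model.predict_proba([[2.2]])[0, 1]
print(f"P(malignant) = {prob_malignant:.2f}")
print("Classified as malignant:", prob_malignant >= 0.5)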
3. CART
The non-terminal nodes of Classification and Regression Trees are the root node
and the internal node. The terminal nodes are the leaf nodes. Each non-terminal
node represents a single input variable (x) and a splitting point on that variable; the
leaf nodes represent the output variable (y). The model is used as follows to make
predictions: walk the splits of the tree to arrive at a leaf node and output the value
present at the leaf node.
The decision tree in Figure 3 below classifies whether a person will buy a sports
car or a minivan depending on their age and marital status. If the person is over 30
years and is not married, we walk the tree as follows : ‘over 30 years?’ -> yes ->
’married?’ -> no. Hence, the model outputs a sports car.
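The tiny scikit-learn sketch below builds a decision tree on made-up rows shaped like the Figure 3 example (age and marital status predicting sports car vs. minivan); the training rows are invented purely for illustration.

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical training rows: [age, married (1 = yes, 0 = no)]
X = [[25, 0], [28, 0], [24, 1], [35, 0], [40, 0], [45, 1], [38, 1], [50, 1]]
# Target: what the person bought
y = ["sports car", "sports car", "sports car", "sports car",
     "sports car", "minivan", "minivan", "minivan"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=["age", "married"]))  # the learned splits
print(tree.predict([[35, 0]]))   # over 30 and not married -> expect "sports car"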
4. Naïve Bayes
To calculate the probability that an event will occur, given that another event has
already occurred, we use Bayes's Theorem. To calculate the probability of a
hypothesis (h) being true, given our prior knowledge (d), we use Bayes's Theorem
as follows:
P(h|d) = ( P(d|h) * P(h) ) / P(d)
where:
• P(h|d) is the posterior probability: the probability of hypothesis h given the data d
• P(d|h) is the likelihood: the probability of data d given that hypothesis h is true
• P(h) is the prior probability of hypothesis h being true (regardless of the data)
• P(d) is the probability of the data (regardless of the hypothesis)
This algorithm is called ‘naive’ because it assumes that all the variables are
independent of each other, which is a naive assumption to make in real-world
examples.
Figure 4: Using Naive Bayes to predict the status of ‘play’ using the variable
‘weather’.
To determine the outcome play = ‘yes’ or ‘no’ given the value of variable weather
= ‘sunny’, calculate P(yes|sunny) and P(no|sunny) and choose the outcome with
higher probability.
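Since the frequency table of Figure 4 is not reproduced here, the sketch below uses assumed counts from the classic weather/play example (9 "yes" days and 5 "no" days, of which 3 and 2 respectively were sunny) to show the calculation; the counts are assumptions, not values taken from the figure.

# Assumed counts for a 14-day weather/play dataset (illustrative only)
total_yes, total_no = 9, 5
sunny_given_yes, sunny_given_no = 3, 2
total = total_yes + total_no

# Bayes's theorem: P(play | sunny) is proportional to P(sunny | play) * P(play)
score_yes = (sunny_given_yes / total_yes) * (total_yes / total)
score_no = (sunny_given_no / total_no) * (total_no / total)

p_yes = score_yes / (score_yes + score_no)
p_no = score_no / (score_yes + score_no)

print(f"P(yes | sunny) = {p_yes:.2f}")   # the higher probability, so the prediction is 'yes'
print(f"P(no  | sunny) = {p_no:.2f}")
print("Prediction:", "yes" if p_yes > p_no else "no")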
5. KNN
The K-Nearest Neighbors algorithm uses the entire data set as the training set,
rather than splitting the data set into a training set and test set.
When an outcome is required for a new data instance, the KNN algorithm goes
through the entire data set to find the k-nearest instances to the new instance, or the
k number of instances most similar to the new record, and then outputs the mean of
the outcomes (for a regression problem) or the mode (most frequent class) for a
classification problem. The value of k is user-specified.
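A short scikit-learn sketch of KNN classification is shown below; the two-feature toy dataset and the choice of k = 3 are arbitrary and only meant to illustrate the neighbour vote.

from sklearn.neighbors import KNeighborsClassifier

# Toy dataset: [height (cm), weight (kg)] labelled by a hypothetical shirt size
X = [[158, 58], [160, 60], [163, 61], [165, 64], [170, 68], [173, 72], [175, 75], [180, 80]]
y = ["S", "S", "S", "M", "M", "L", "L", "L"]

# k is user-specified; here the 3 nearest neighbours vote on the class
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

print(knn.predict([[168, 65]]))   # the most frequent class among the 3 nearest neighbours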
6. Apriori
The Apriori algorithm is used in a transactional database to mine frequent item sets
and then generate association rules. It is popularly used in market basket analysis,
where one checks for combinations of products that frequently co-occur in the
database. In general, we write the association rule for ‘if a person purchases item
X, then he purchases item Y’ as : X -> Y.
Example: if a person purchases milk and sugar, then she is likely to purchase
coffee powder. This could be written in the form of an association rule as:
{milk,sugar} -> coffee powder. Association rules are generated after crossing the
threshold for support and confidence.
Figure 5: Formulae for support, confidence and lift for the association rule X -> Y:
Support = frequency(X, Y) / N
Confidence = frequency(X, Y) / frequency(X)
Lift = Support(X, Y) / (Support(X) * Support(Y))
The Support measure helps prune the number of candidate item sets to be
considered during frequent item set generation. This support measure is guided by
the Apriori principle. The Apriori principle states that if an itemset is frequent, then
all of its subsets must also be frequent.
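The support and confidence of the {milk, sugar} -> coffee powder rule can be computed directly from a list of transactions; the five transactions below are made up for illustration.

# Hypothetical transactions
transactions = [
    {"milk", "sugar", "coffee powder"},
    {"milk", "sugar", "coffee powder", "bread"},
    {"milk", "bread"},
    {"milk", "sugar", "bread"},
    {"sugar", "coffee powder"},
]

antecedent = {"milk", "sugar"}
consequent = {"coffee powder"}

n = len(transactions)
count_x = sum(1 for t in transactions if antecedent <= t)
count_xy = sum(1 for t in transactions if (antecedent | consequent) <= t)
count_y = sum(1 for t in transactions if consequent <= t)

support = count_xy / n                               # frequency(X, Y) / N
confidence = count_xy / count_x                      # frequency(X, Y) / frequency(X)
lift = support / ((count_x / n) * (count_y / n))     # support(X, Y) / (support(X) * support(Y))

print(f"support = {support:.2f}, confidence = {confidence:.2f}, lift = {lift:.2f}")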
7. K-means
K-means is an iterative algorithm that groups similar data into clusters. It calculates
the centroids of k clusters and assigns each data point to the cluster whose centroid is
at the least distance from that point.
Next (step 2), reassign each point to the closest cluster centroid. In the figure above, the
upper 5 points got assigned to the cluster with the blue centroid. Follow the same
procedure to assign points to the clusters containing the red and green centroids.
Then (step 3), calculate centroids for the new clusters. The old centroids are gray stars; the
new centroids are the red, green, and blue stars.
Finally, repeat steps 2-3 until there is no switching of points from one cluster to
another. Once there is no switching for 2 consecutive steps, exit the K-means
algorithm.
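The following scikit-learn sketch runs K-means on a small two-dimensional toy dataset; the points and the choice of k = 2 are arbitrary, meant only to illustrate the assign-and-recompute loop described above.

import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D points forming two rough groups
points = np.array([[1, 2], [1, 4], [2, 3], [8, 8], [9, 10], [10, 9]])

# n_init repeats the whole procedure with different random initial centroids
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

print("Cluster labels:   ", labels)
print("Cluster centroids:", kmeans.cluster_centers_)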
Unit-IV
Data viz is the communication of data in a visual manner, or turning raw data into insights that
can be easily interpreted by your readers.
Wikipedia's definition of Data Visualization: Data visualization refers to the techniques used to
communicate data or information by encoding it as visual objects (points, lines or bars)
contained in graphics.
Techopedia's definition of Data Visualization: Data visualization is the process of displaying data
or information in graphical charts, figures and bars.
In the world of Big Data, data visualization tools and technologies are essential to analyze
massive amounts of information and make data-driven decisions.
The best data in the world won't be worth anything if no one can understand it. The job of a data
analyst is not only to collect and analyze data, but also to present it to the end users and other interested
parties who will then act on that data. Here’s where data visualization comes in.
Many data analysts are not necessarily experts in data communication or graphic design. This means a
lot of them can be lost in the translation of data from the collection to the presentation in the boardroom. I
often find myself teaching data visualization classes to more and more data science teams, who
recognize this as an area of weakness.
If your job entails presenting findings from a set of data or analysis to a group of laymen, then it’s part of
your job to present it to them in such a way that it’s easy to understand and therefore take appropriate
action.
In this post, I’ll share a few tips to help you turn data into actionable insights that people will understand.
Tables should be used when you want to display precise values. Graphs should be
used to present information with regards to data patterns, relationships, and how
things change over time. From my experience, it’s best to reduce the use of tables
and focus more on the graphs.
Provide Context
A well-done presentation should prompt the user to act on the presented data.
However, this is hard to achieve if the context for that action has not been
provided. Use size, color, and other visual cues to provide context, and be sure to
include some short narratives to highlight the key insights.
Ensure that your displays of information are vertically and horizontally aligned, so
that they can be compared accurately. This also helps to prevent misleading optical
illusions in your presentation.
You should use color to draw the attention of the audience to key data points, not
just to brighten your presentation. Moreover, choose your color combinations
wisely. For instance, you don't want to use red and green in the same diagram,
since they are difficult to distinguish for color-blind viewers.
Give your graphs and charts useful, explanatory titles. This helps to highlight the
focus of the presentation. View titles as the headline that draws people in, gives
them a snapshot of the key insights, and focuses them on the right questions.
Steer away from fancy gauges and labels that can affect the clarity. Always start at
zero when labeling the axis of a graph or chart, unless there’s a strong reason not
to, such as when the data has been clustered at unreasonably high values.
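As a small illustration of these principles, the matplotlib sketch below uses a plain bar chart, an explanatory title, and a y-axis that starts at zero; the quarterly sales figures are made up.

import matplotlib.pyplot as plt

# Hypothetical quarterly sales figures (in thousands of units)
quarters = ["Q1", "Q2", "Q3", "Q4"]
sales = [120, 135, 128, 160]

plt.bar(quarters, sales, color="steelblue")
plt.title("Unit sales grew 33% from Q1 to Q4")   # explanatory title that states the key insight
plt.ylabel("Units sold (thousands)")
plt.ylim(bottom=0)                               # axis starts at zero to avoid exaggerating change
plt.show()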
These basic principles should help you increase the effectiveness of your data
presentation and communication. This way, key stakeholders will be in a better
position to make better, and more informed decisions based on the data you have
gathered and presented.