Understanding Data Science Basics
DATA SCIENCE AND ANALYTICS

INTRODUCTION

The term "Data Science" was coined at the beginning of the 21st century; it is
commonly attributed to William S. Cleveland.

Purpose of using data science and data analytics:

Data science is all about using data to solve problems.

Data analytics is the science of analyzing raw data in order to make conclusions
about that information.

What is Data science?

Data science is the study of the generalizable extraction of knowledge from data.

Data science is an inter-disciplinary field that uses scientific methods, processes,
programming skills, mathematics, algorithms and systems to extract knowledge
and insights from structured and unstructured data.

Data science is the domain of study that deals with vast volumes of data using
modern tools and techniques and machine learning algorithms to derive
meaningful information and make business decisions.

The data used for analysis can come from multiple sources and be present in
various formats.

The goal of data science is to gain insights and knowledge from any type of data,
both structured and unstructured.
Examples:
1. Basketball teams use data to track team strategies and match outcomes.
2. Amazon has a huge amount of consumer purchasing data, consisting of consumer
demographics (age, sex, location), purchasing history and past browsing history.
Based on this data, Amazon segments its customers, finds patterns and recommends
the right product to the right customer at the right time.
3. Google's self-driving car is a smart, driverless car. It collects data from the
environment through sensors and takes decisions such as when to speed up, slow
down, overtake or stop.

Data Science can answer a lot of questions

1. Which viewers like the same kind of TV show?

2. Which route should the cab take so that it reaches faster?

3. Will the refrigerator fail in the next 3 years? Yes/No

4. Who will win the elections?

Why do we use data science?


• Decision making

• Prediction

• Pattern discovery

Where it is used?

There are various industries like banking, finance, manufacturing, transport,
e-commerce, education, etc. that use data science.

1. Recommend the right product to the right customer to enhance the business

2. Predict fraudulent transactions beforehand

3. Build intelligence and ability into machines

4. Perform sentiment analysis to predict the outcome of elections

5. Predict the characteristics of high-LTV (lifetime value) customers to help in user segmentation

Applications / Use cases
1. Travel:
• Dynamic pricing
• Predicting flight delays
2. Marketing:
• Upselling
• Cross-selling
• Predicting the lifetime value of a customer
3. Healthcare:
• Disease prediction
• Medication effectiveness
4. Social media:
• Sentiment analysis
• Digital marketing
5. Automation:
• Self-driving cars
• Pilotless aircraft, drones
• Credit and insurance
6. Sales:
• Discount offering
• Demand forecasting
7. Fraud & risk detection:
• Claims prediction
• Detection of fraudulent transactions
Life Cycle
Phase 1—Discovery:

It involves acquiring the data from all the identified internal and external sources
that can help answer the business question.

The data could be: logs from web servers, social media data, census datasets, data
streamed from online sources

Phase 2—Data preparation:

The data can have inconsistencies like missing values, blank columns and incorrect
data formats, which need to be cleaned. So it is required to explore and preprocess
the data prior to data modeling. This helps us spot outliers and establish
relationships between the variables.

We clean and preprocess the data by removing outliers, filling in null values, and
normalizing data types during this phase.
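A minimal sketch of these cleaning steps in plain Python (illustrative only: missing entries are modeled as None and filled with the mean, and the 1.5-standard-deviation outlier cutoff is an arbitrary choice for this example):

```python
def prepare(values):
    # Fill missing entries (None) with the mean of the observed values.
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    filled = [mean if v is None else float(v) for v in values]
    # Drop outliers further than 1.5 standard deviations from the mean.
    variance = sum((v - mean) ** 2 for v in observed) / len(observed)
    std = variance ** 0.5
    return [v for v in filled if abs(v - mean) <= 1.5 * std]

data = [5.8, 6.0, None, 5.9, 10.8]  # one missing value, one outlier
print([round(v, 3) for v in prepare(data)])  # → [5.8, 6.0, 7.125, 5.9]
```

Real projects typically do the same operations with pandas (`fillna`, boolean masks) rather than by hand.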

Phase 3—Model planning:

In this phase, we determine the methods and techniques to draw the relationships
between variables. We apply EDA using various statistical formulas and
visualization tools; histograms, line graphs and box plots give a fair idea of the
distribution of the data.

Tools: SQL Analysis Services, R, SAS/ACCESS

Phase 4—Model building:

In this phase, we will:

• Develop datasets for training and testing purposes.

• Analyze various learning techniques like classification, association and
clustering to build the model.

Tools: SAS Enterprise Miner, WEKA, R, Python, Statistica
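The first step, developing training and testing datasets, can be sketched with the standard library alone (scikit-learn offers a ready-made `train_test_split`; this hand-rolled sketch with a hypothetical 75/25 split just shows the idea):

```python
import random

def train_test_split(rows, test_ratio=0.25, seed=42):
    # Shuffle a copy so the split is random but reproducible via the seed.
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_ratio))
    return rows[:cut], rows[cut:]

data = list(range(20))
train, test = train_test_split(data)
print(len(train), len(test))  # → 15 5
```

Shuffling before the cut matters: if the rows are ordered (say, by date), an unshuffled split would give the model a biased view of the data.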

Phase 5—Operationalize:

In this phase, you deliver final reports, briefings, code and technical documents.
This will provide you a clear picture of the performance and other related
constraints on a small scale before full deployment.

Phase 6—Communicate results:

So, in the last phase, you identify all the key findings, communicate to the
stakeholders and determine if the results of the project are a success or a failure
based on the criteria developed in Phase 1.

Applications of Data Science


Some of the popular applications of data science are:

Product Recommendation

Product recommendation has become one of the most popular techniques to
influence customers to buy similar products. Let's see an example.

Suppose a salesperson at Big Bazaar is trying to increase the sales of the store
by bundling products together and giving discounts on them. So he bundles
shampoo and conditioner together and gives a discount on the bundle. Customers
are then more likely to buy them together at the discounted price.
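The intuition behind bundling is co-purchase frequency. A minimal sketch over a hypothetical set of market baskets (association-rule algorithms such as Apriori formalize this idea):

```python
from collections import Counter
from itertools import combinations

# Count which product pairs are bought together most often -- the signal
# behind "bundle shampoo with conditioner".
baskets = [
    {"shampoo", "conditioner", "soap"},
    {"shampoo", "conditioner"},
    {"soap", "toothpaste"},
    {"shampoo", "conditioner", "toothpaste"},
]
pairs = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pairs[pair] += 1

print(pairs.most_common(1))  # → [(('conditioner', 'shampoo'), 3)]
```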
Future Forecasting:

Predictive analysis is one of the most used domains in data science. We are all
aware of weather forecasting, or future forecasting based on various types of data
collected from various sources. For example, suppose we want to forecast
COVID-19 cases to get an overview of upcoming days in a pandemic situation.
Based on the collected data, data science techniques can be used to forecast the
future condition.

Fraud and Risk Detection:

As online transactions boom over time, there are high possibilities of losing your
personal data. So one of the most important applications of data science is fraud
and risk detection.

For example, credit card fraud detection depends on the amount, merchant,
location, time and other variables as well. If any of them looks unnatural, the
transaction will be automatically cancelled, and the card may be blocked for 24
hours or more.
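A toy version of such a check can be sketched as below; the field names, threshold and home-country default are hypothetical, and production systems replace hand-written rules like these with statistical or machine learning models over many more variables:

```python
def is_suspicious(txn, home_country="IN", daily_limit=50000):
    # Hypothetical rule-based check: flag a transaction if the amount
    # exceeds the limit or the location differs from the home country.
    return txn["amount"] > daily_limit or txn["country"] != home_country

print(is_suspicious({"amount": 900, "country": "US"}))   # → True
print(is_suspicious({"amount": 1200, "country": "IN"}))  # → False
```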
Self Driving Car:

In today's world, the self-driving car is one of the most successful inventions.
Based on previous data, we train the car to take decisions on its own. In this
process, we can give a penalty to our model if it does not perform well.
The car (model) becomes more intelligent over time as it learns from real-time
experiences.

Image Recognition:

When you want to recognize images, data science has the ability to detect an
object and then classify and recognize it. The most popular example of image
recognition is face recognition: if you ask your smartphone to unlock, it will
scan your face. So first, the system detects the face, then classifies it as a human
face, and only after that decides whether the phone belongs to the actual owner.
Data science has plenty of exciting applications like this to work on.
Speech to text Convert:

Speech recognition is the process by which a computer understands natural
language.

We are all quite familiar with Google Assistant. Have you ever tried to understand
how this assistant works? It is quite a complex system, but we can look at the
bigger picture: Google Assistant first tries to recognize our speech, and then
converts that speech into text form using an algorithm.

What are the Components of Data Science?


1. Statistics: It is most important for a data scientist to understand data, and a
very firm hold on statistics will surely help with that. If you are starting with
data science, enhance your knowledge of statistics, as it is a vital component of
data science, covering both descriptive and inferential statistics.

2. Mathematics: Mathematics is the most critical and necessary part of data
science. It is used to study structure, quantity, quality, space, and change in data.
So every aspiring data scientist must have good knowledge of mathematics to
read the data mathematically and build meaningful insights from it.

3. Visualization: Visualization represents the context visually along with the
insights. It helps to understand huge volumes of data properly.
4. Data engineering: Data engineering helps to acquire, store, retrieve, and transform
the data, and it also includes metadata (data about data) to the data.

5. Domain expertise: Domain expertise helps in interpreting results correctly,
using specialized knowledge of different areas.

6. Advanced computing: Advanced computing is a big part of designing, writing,
debugging, and maintaining the source code of computer programs.

7. Machine learning: Machine learning is the most useful and essential part of
data science. It helps identify the best features to build an accurate model.

Tools for Data Science


Although there are various tools that a data scientist may have to use during a
project, here are some tools that you may require in every data science project.
These tools are divided into four categories:

1. Data Storage
2. Exploratory Data Analysis
3. Data Modelling
4. Data Visualization

1. Data Storage: Tools used to store huge amounts of data:


1. Apache Spark.
2. Microsoft HD Insights
3. Hadoop
2. Exploratory Data Analysis: EDA is an approach to analyzing these huge
amounts of data:
1. Informatica
2. SAS
3. Python
4. MATLAB
3. Data modelling: Data modelling tools come with inbuilt machine learning
algorithms, so all you need to do is pass the processed data to train your
model.
1. [Link]
2. BigML
3. DataRobot
4. Scikit Learn
5. TensorFlow
4. Data Visualization: After all this processing, we need to visualize our data
to find the insights and hidden patterns in it and build proper reports from
that.
1. Tableau
2. Matplotlib
3. Seaborn
Now I'll briefly describe a few of these tools:

SAS – It is specifically designed for statistical operations: a closed-source
proprietary software used mainly by large organizations to analyze data. It uses the
base SAS programming language, which is generally used for performing statistical
modelling. It also offers various statistical libraries and tools that are used by data
scientists for data modelling and organising.

Apache Spark – This tool is an improved alternative to Hadoop MapReduce and can
run workloads up to 100 times faster. Spark is designed to manage both batch
processing and stream processing. Several machine learning APIs in Spark help
data scientists make accurate and powerful predictions with given data. It stands
out among big-data platforms because it can process real-time data, unlike
analytical tools that can only process batches of historical data.

MATLAB – It is a numerical computing environment that can process complex
mathematical operations. It has a powerful graphics library to create great
visualizations, which aids image and signal processing applications. It is a popular
tool among data scientists as it can help with multiple problems, ranging from data
cleaning and analysis to much more advanced deep learning problems. It can be
easily integrated with enterprise applications and other embedded systems.
Tableau – It is a Data Visualization software that helps in creating interactive
visualizations with its powerful graphics. It is suited best for the industries working on
business intelligence projects. Tableau can easily interface with
spreadsheets, databases, and OLAP (Online Analytical Processing) cubes. It sees a
great application in visualizing geographical data.

Matplotlib – Matplotlib is a plotting and visualization library developed for Python,
used for generating graphs from analyzed data. It is a powerful tool for plotting
complex graphs with a few simple lines of code. The most widely used of its
modules is Pyplot, an open-source module with a MATLAB-like interface that is a
good alternative to MATLAB's graphics modules. NASA's data visualizations of the
Phoenix spacecraft's landing were illustrated using Matplotlib.

NLTK – It is a collection of Python libraries called the Natural Language Toolkit.
It helps in building statistical models that, along with several algorithms, help
machines understand human language.

Scikit-learn – It is a tool that makes complex ML algorithms simpler to use. A
variety of machine learning features such as data pre-processing, regression,
classification, and clustering are supported by scikit-learn, making it easy to apply
complex ML algorithms.
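A minimal example of scikit-learn's uniform fit/predict interface, assuming the library is installed (the four-point dataset is invented purely for illustration):

```python
from sklearn.linear_model import LinearRegression

# Fit y = 2x from four points; every scikit-learn estimator exposes the
# same fit/predict interface shown here.
X = [[1], [2], [3], [4]]
y = [2, 4, 6, 8]
model = LinearRegression().fit(X, y)
print(round(float(model.predict([[5]])[0]), 2))  # → 10.0
```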

TensorFlow – TensorFlow is also used for machine learning, but for more advanced
algorithms such as deep learning. Due to its high processing ability, it finds a
variety of applications in image classification, speech recognition, drug discovery,
etc.

BIG DATA
Big Data is a collection of data that is huge in volume.

• It is the act of gathering and storing large amounts of information.

• Big data can be analyzed for insights that lead to better decisions and strategic
business moves.
Types of Big Data

Structured: Structured data is data that can be processed, stored, and retrieved in
a fixed format.

Unstructured: Unstructured data refers to the data that lacks any specific form.
This makes it very difficult and time-consuming to process and analyze
unstructured data.

Email is an example of unstructured data.

Semi-structured: Semi-structured is the third type of big data, a combination of
both structured and unstructured data. It refers to data that has some
organizational properties but has not been classified under a particular repository
(database).

Characteristics of Big Data

1)Variety:

Variety of Big Data refers to structured, unstructured, and semi-structured data
gathered from multiple sources. While in the past data could only be collected
from spreadsheets and databases, today data comes in an array of forms such as
emails, PDFs, photos, videos, audio and more.
2) Velocity: Velocity essentially refers to the speed at which data is being created
in real-time.

3) Volume: We already know that Big Data indicates huge 'volumes' of data
generated on a daily basis from various sources like social media platforms,
business processes, machines, networks, human interactions, etc. Such a large
amount of data is stored in data warehouses.

Benefits of Big Data Processing:

• Businesses can utilize outside intelligence while making decisions

• Improved customer service

• Early identification of risks to the product/services, if any

• Better operational efficiency
DATA SCIENCE HYPE: GETTING PAST THE HYPE. WHY NOW?

The current landscape

• This is an ongoing discussion, but one way to understand what's going on in
this industry is to look online and see what discussions are currently taking
place.

• It tells us what other people think data science is, or how they perceive it.

• It gives a practical sense of the tools and material in use.

Current landscape perspectives

The Venn diagram comprises the disciplines of:

i) Hacking skills

ii) Maths and statistics

iii) Machine learning

iv) Substantive expertise

Hacking skills refer to computer skills needed to efficiently manipulate data:
being able to manipulate text files at the command line and to think
algorithmically. These are the hacking skills that make for a successful data
hacker.

Maths and statistics consist of linear algebra, probability and calculus. Once you
have acquired and cleaned the data, the next step is to actually extract insight
from it; for that, you need to apply the appropriate mathematical and statistical
methods.

Machine learning refers to the ability to automatically and quickly apply
mathematical calculations to big data.

Substantive expertise refers to working knowledge and skills in a particular
domain, which vary from role to role (e.g., a product manager) and from company
to company.

Nathan Yau's 2009 post "Rise of the Data Scientist" lists the required skills as:

• Statistics (the traditional analysis you're used to thinking about)

• Data munging (parsing, scraping and formatting data)

• Visualization (graphs, tools, etc.)

Datafication and current landscape perspectives

• Datafication is a technological trend turning many aspects of our life into data,
which is subsequently transformed into information and realized as a new form
of value.

• Datafication is the transformation of social action into online quantified data,
allowing for real-time tracking and predictive analysis. Simply put, it is about
taking a previously invisible process or activity and turning it into data that can
be monitored, tracked, analysed and optimised. The latest technologies we use
have enabled many new ways to 'datify' our daily and basic activities.

Datafication refers to the fact that daily interactions of living things can be
rendered into a data format and put to social use.

Once we datafy things, we can transform their purpose and turn the information
into new forms of value.

Datafication is an interesting concept that leads us to consider its importance with
respect to people's intentions about sharing their own data. We are being datafied,
or rather our actions are: when we "like" someone or something online, we are
intending to be datafied, or at least we should expect to be. But when we merely
browse the Web, we are unintentionally, or at least passively, being datafied
through cookies that we might or might not be aware of. And when we walk
around in a store, or even on the street, we are being datafied in a completely
unintentional way, via sensors, cameras, or Google Glass.

Examples:

There are many examples of datafication.

Social platforms such as Facebook or Instagram collect and monitor data about
our friendships to market products and services to us, and to provide surveillance
services to agencies, which in turn changes our behaviour; the promotions we see
daily on social media are also the result of this monitored data. In this model, data
is used to redefine how content is created, with datafication informing content
rather than just recommendation systems.

However, there are other industries where the datafication process is actively used:

• Insurance: Data used to update risk profile development and business models.

• Banking: Data used to establish trustworthiness and the likelihood of a person
paying back a loan.

• Human resources: Data used to identify, e.g., employees' risk-taking profiles.

• Hiring and recruitment: Data used to replace personality tests.

• Social science research: Datafication replaces sampling techniques and
restructures the manner in which social science research is performed.

In short, datafication takes the activities of life and converts them into data.


Why datafication now?

• Organizations can only keep up with the latest technological advancements if
they turn to datafication.

• Datafication makes it possible for businesses to improve operations and
increase productivity.

• It can help organizations accomplish day-to-day tasks while maximizing
resources.

• It can streamline current processes, allowing users to remain competitive.

Required Skills for a Data Scientist

Programming: Python, SQL, Scala, Java, R, MATLAB

Machine learning: Natural language processing, classification, clustering,
ensemble methods, deep learning

Data visualization: Tableau, SAS, [Link], Python, Java, R libraries

Big data platforms: MongoDB, Oracle, Microsoft Azure, Cloudera


UNIT II: EDA

• What is EDA
• Exploratory data analysis tools
• Types of exploratory data analysis
• EDA process
• Data science process
• What does a data scientist do
What is exploratory data analysis?

Exploratory data analysis (EDA) is used by data scientists to analyze and
investigate data sets and summarize their main characteristics, often employing
data visualization methods.

Exploratory data analysis does three main things:

• Understanding your variables

• Cleaning your dataset

• Analyzing relationships between variables

Exploratory data analysis tools

1. Clustering and dimension reduction techniques

2. Univariate visualization

3. Bivariate visualizations and summary statistics

4. Multivariate visualizations

5. K-means Clustering

6. Predictive models
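Of these, k-means clustering is easy to sketch in one dimension: it alternates between assigning each point to its nearest centroid and moving each centroid to its cluster's mean. This is a minimal illustration; real implementations such as scikit-learn's `KMeans` handle many dimensions and smarter initialization:

```python
def kmeans_1d(points, k=2, iters=10):
    # Initialize centroids to the first k distinct values.
    centroids = sorted(set(points))[:k]
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[i].append(p)
        # Update step: move each centroid to its cluster mean.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

print(kmeans_1d([1, 2, 3, 10, 11, 12]))  # → [2.0, 11.0]
```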
Types of exploratory data analysis

1. Univariate non-graphical
Univariate non-graphical EDA techniques are concerned with understanding the
underlying sample distribution and making observations about the population.
This also involves outlier detection.

For univariate categorical data, we are interested in the range and the
frequency.

Univariate EDA for quantitative data involves making preliminary assessments
about the population distribution of the variable using the data from the observed
sample. The characteristics of the population distribution inferred include center,
spread, modality, shape and outliers.

2. Univariate graphical

For graphical analysis of univariate quantitative data, histograms are commonly
used. A histogram represents the frequency (count) or proportion (count/total
count) of cases for a range of values; typically, between about 5 and 30 bins are
chosen. Histograms are one of the best ways to quickly learn a lot about your
data, including central tendency, spread, modality, shape and outliers.
Stem-and-leaf plots can be used for the same purpose. Boxplots can also present
information about central tendency, symmetry and skew, as well as outliers.
Quantile-normal (QQ) plots and other techniques can be used here as well.
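The binning that underlies a histogram can be sketched directly; plotting libraries such as Matplotlib perform the same counting before drawing the bars (equal-width bins are assumed here):

```python
def histogram(values, bins=5):
    # Count how many values fall into each of `bins` equal-width intervals.
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        i = min(int((v - lo) / width), bins - 1)  # clamp the max into the last bin
        counts[i] += 1
    return counts

data = [1, 2, 2, 3, 3, 3, 4, 4, 5]
print(histogram(data, bins=5))  # → [1, 2, 3, 2, 1]
```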

3. Multivariate non-graphical

Multivariate non-graphical EDA techniques generally show the relationship
between two or more variables in the form of either cross-tabulation or statistics.
For each combination of a categorical variable (usually explanatory) and a
quantitative variable (usually the outcome), we can compute a statistic for the
quantitative variable separately for each level of the categorical variable, and then
compare the statistics across levels.
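Computing a statistic per level of a categorical variable can be sketched as follows; here the statistic is the mean, over a hypothetical region/sales dataset:

```python
from collections import defaultdict

def group_means(records):
    # records: (category, value) pairs; compute the mean per category level.
    groups = defaultdict(list)
    for cat, val in records:
        groups[cat].append(val)
    return {cat: sum(vs) / len(vs) for cat, vs in groups.items()}

sales = [("north", 10), ("south", 20), ("north", 14), ("south", 22)]
print(group_means(sales))  # → {'north': 12.0, 'south': 21.0}
```

Comparing the per-level means (12.0 vs 21.0) is exactly the "compare statistics across levels" step described above.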
4. Multivariate graphical

For multivariate categorical data, the most commonly used graphical technique
is the barplot, with each group representing one level of one variable and each
bar within a group representing the levels of the other variable. For each
category, we could have side-by-side (parallel) boxplots. For two quantitative
variables, the basic graphical EDA technique is the scatterplot, which has one
variable on the x-axis, one on the y-axis, and a point for each case in your
dataset. Typically, the explanatory variable goes on the x-axis. Additional
categorical variables can be accommodated by the use of colour or symbols.

Steps In Exploratory Data Analysis

Variable Identification

The very first step in exploratory data analysis is to identify the type of each
variable in the dataset. Variables are of two types, numerical and categorical.
Once the type of each variable is identified, the next step is to identify the
predictor (input) and target (output) variables.
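A rough sketch of this identification step for tabular records; it assumes numerical columns hold int/float values (tools such as pandas infer column types far more carefully):

```python
def classify_variables(rows):
    # Label each column Numerical or Categorical from its observed values.
    kinds = {}
    for col in rows[0]:
        values = [r[col] for r in rows]
        numeric = all(isinstance(v, (int, float)) for v in values)
        kinds[col] = "Numerical" if numeric else "Categorical"
    return kinds

customers = [
    {"age": 34, "gender": "F", "spend": 120.5},
    {"age": 29, "gender": "M", "spend": 80.0},
]
print(classify_variables(customers))
# → {'age': 'Numerical', 'gender': 'Categorical', 'spend': 'Numerical'}
```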

Univariate Analysis

Bivariate Analysis
Missing Value Treatment

Outlier Removal
Outliers are unusual values in your dataset; they can distort statistical analyses
and violate their assumptions. Outliers increase the variability in your data, which
decreases statistical power.

Imagine that we're measuring the height of adult men and gather a dataset in
which the value 10.8135 appears. This value is clearly an outlier: not only does it
stand out, it is an impossible height. Examining the number more closely, we
might conclude that an extra zero was typed accidentally. These types of errors
are easy cases to understand. If you determine that an outlier value is an error,
correct the value when possible.
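A common automatic version of this check is Tukey's interquartile-range rule, the same fences that boxplots draw. A minimal sketch (the index-based quartiles here are a rough approximation; statistics libraries interpolate):

```python
def find_outliers(values, k=1.5):
    # Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences).
    s = sorted(values)
    n = len(s)
    q1 = s[n // 4]
    q3 = s[(3 * n) // 4]
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

heights = [1.78, 1.82, 1.85, 1.74, 1.80, 1.79, 10.8135, 1.76]
print(find_outliers(heights))  # → [10.8135]
```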

Exploratory data analysis refers to the critical process of performing initial
investigations on data so as to discover patterns, spot anomalies, test hypotheses
and check assumptions with the help of summary statistics and graphical
representations.
The Data Science Process

Step 1: Frame the problem

The first thing you have to do before you solve a problem is to define exactly what
it is. You need to be able to translate data questions into something actionable.

You’ll often get ambiguous inputs from the people who have problems. You’ll
have to develop the intuition to turn scarce inputs into actionable outputs–and to
ask the questions that nobody else is asking.

Say you’re solving a problem for the VP Sales of your company. You should start
by understanding their goals and the underlying why behind their data questions.
Before you can start thinking of solutions, you’ll want to work with them to clearly
define the problem.

A great way to do this is to ask the right questions.


You should then figure out what the sales process looks like, and who the
customers are. You need as much context as possible for your numbers to become
insights.

You should ask questions like the following:

1. Who are the customers?

2. Why are they buying our product?

3. How do we predict if a customer is going to buy our product?

4. What is different between segments that are performing well and those
performing below expectations?

5. How much money will we lose if we don’t actively sell the product to these
groups?

It’s important that at the end of this stage, you have all of the information and
context you need to solve this problem.

Step 2: Collect the raw data needed for your problem

Once you’ve defined the problem, you’ll need data to give you the insights needed
to turn the problem around with a solution. This part of the process involves
finding ways to get that data, whether it’s querying internal databases, or
purchasing external datasets.

You might find out that your company stores all of their sales data in a CRM or a
customer relationship management software platform. You can export the CRM
data in a CSV file for further analysis.

Step 3: Process the data for analysis

Now that you have all of the raw data, you’ll need to process it before you can do
any analysis. Oftentimes, data can be quite messy, especially if it hasn’t been well-
maintained. You’ll see errors that will corrupt your analysis: values set to null
though they really are zero, duplicate values, and missing values. It’s up to you to
go through and check your data to make sure you’ll get accurate insights.

You’ll want to check for the following common errors:


1. Missing values, perhaps customers without an initial contact date

2. Corrupted values, such as invalid entries

3. Timezone differences, perhaps your database doesn’t take into account the
different timezones of your users

4. Date range errors, perhaps you'll have dates that make no sense, such as
data registered from before sales started

You’ll need to look through aggregates of your file rows and columns and sample
some test values to see if your values make sense. If you detect something that
doesn’t make sense, you’ll need to remove that data or replace it with a default
value. You’ll need to use your intuition here: if a customer doesn’t have an initial
contact date, does it make sense to say that there was NO initial contact date? Or
do you have to hunt down the VP Sales and ask if anybody has data on the
customer’s missing initial contact dates?

Once you’re done working with those questions and cleaning your data, you’ll be
ready for exploratory data analysis (EDA).
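A minimal sketch of such checks on a hypothetical CRM export (the column names and the 2015 cutoff are invented for the example; real pipelines would typically use pandas):

```python
import csv, io

# Hypothetical CRM export: the "first_contact" column has a missing value
# and an impossible date (before sales started, assumed here to be 2015).
raw = """customer,first_contact
Alice,2018-03-01
Bob,
Carol,1901-01-01
"""

def clean_rows(text, min_year=2015):
    rows = list(csv.DictReader(io.StringIO(text)))
    good = []
    for row in rows:
        date = row["first_contact"]
        if not date:                  # missing value
            continue
        if int(date[:4]) < min_year:  # date-range error
            continue
        good.append(row)
    return good

print([r["customer"] for r in clean_rows(raw)])  # → ['Alice']
```

Whether the dropped rows should instead be flagged for follow-up (as the text suggests for missing contact dates) is a judgment call, not something the code can decide.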

Step 4: Explore the data

When your data is clean, you should start playing with it!

The difficulty here isn't coming up with ideas to test, it's coming up with ideas that
are likely to turn into insights. You'll have a fixed deadline for your data science
project (your VP Sales is probably waiting on your analysis eagerly!), so you'll
have to prioritize your questions.

You’ll have to look at some of the most interesting patterns that can help explain
why sales are reduced for this group. You might notice that they don’t tend to be
very active on social media, with few of them having Twitter or Facebook
accounts. You might also notice that most of them are older than your general
audience. From that you can begin to trace patterns you can analyze more deeply.

Step 5: Perform in-depth analysis


This step of the process is where you’re going to have to apply your statistical,
mathematical and technological knowledge and leverage all of the data science
tools at your disposal to crunch the data and find every insight you can.

In this case, you might have to create a predictive model that compares your
underperforming group with your average customer. You might find out that the
age and social media activity are significant factors in predicting who will buy the
product.

If you'd asked the right questions while framing your problem, you might realize
that the company has been concentrating heavily on social media marketing
efforts, with messaging aimed at younger audiences. You would know that
certain demographics prefer being reached by telephone rather than by social
media. You begin to see how the way the product has been marketed is
significantly affecting sales: maybe this problem group isn't a lost cause! A change
in tactics from social media marketing to more in-person interactions could change
everything for the better. This is something you'll have to flag to your VP Sales.

You can now combine all of those qualitative insights with data from your
quantitative analysis to craft a story that moves people to action.

Step 6: Communicate results of the analysis

It’s important that the VP Sales understand why the insights you’ve uncovered
are important. Ultimately, you’ve been called upon to create a solution
throughout the data science process. Proper communication will mean the
difference between action and inaction on your proposals.

You need to craft a compelling story here that ties your data with their knowledge.
You start by explaining the reasons behind the underperformance of the older
demographic. You tie that in with the answers your VP Sales gave you and the
insights you’ve uncovered from the data. Then you move to concrete solutions that
address the problem: we could shift some resources from social media to personal
calls. You tie it all together into a narrative that solves the pain of your VP Sales:
she now has clarity on how she can reclaim sales and hit her objectives.

She is now ready to act on your proposals.


Throughout the data science process, your day-to-day will vary significantly
depending on where you are–and you will definitely receive tasks that fall outside
of this standard process! You’ll also often be juggling different projects all at once.

It’s important to understand these steps if you want to systematically think about
data science, and even more so if you’re looking to start a career in data science.

A Data Scientist's Role in the Data Science Process

[Figure: the data science process as a cycle: frame the business problem, get relevant data, explore the data to correct errors, model the data for in-depth analysis, and communicate the results.]
• Frame the problem
• Collect the raw data
• Process the data for analysis
• Explore the data
• Perform in-depth analysis
• Communicate results of the analysis

Data refers to facts and statistics collected together for reference or analysis.
Data can be collected, measured and analyzed. It can also be visualized by using
statistical models and graphs.

Categories of Data
Data can be categorized into two sub-categories:

1. Qualitative Data
2. Quantitative Data


Qualitative Data: Qualitative data deals with characteristics and descriptors that can’t be easily
measured, but can be observed subjectively. Qualitative data is further divided into two types of
data:

 Nominal Data: Data with no inherent order or ranking, such as gender or race.

 Ordinal Data: Data with an ordered series of information, such as satisfaction ratings of poor, fair, and good.

Quantitative Data: Quantitative data deals with numbers and things you can measure
objectively. This is further divided into two:

 Discrete Data: Data that can take only a finite (countable) number of possible
values.

Example: Number of students in a class.

 Continuous Data: Data that can hold an infinite number of possible values.

Example: Weight of a person.

So these were the different categories of data. The upcoming sections will focus on the basic
Statistics concepts, so buckle up and get ready to do some math.

What Is Statistics?
Statistics is an area of applied mathematics concerned with data collection, analysis,
interpretation, and presentation.

This area of mathematics deals with understanding how data can be used to solve complex
problems. Here are a couple of example problems that can be solved by using statistics:

 Your company has created a new drug that may cure cancer. How would you conduct a
test to confirm the drug’s effectiveness?
 You and a friend are at a baseball game, and out of the blue, he offers you a bet that
neither team will hit a home run in that game. Should you take the bet?
 The latest sales data have just come in, and your boss wants you to prepare a report for
management on places where the company could improve its business. What should you
look for? What should you not look for?

These above-mentioned problems can be easily solved by using statistical techniques. In the
upcoming sections, we will see how this can be done.

If you want a more in-depth explanation on Statistics and Probability, you can check out this
video by our Statistics experts.

Basic Terminologies in Statistics


Before you dive deep into Statistics, it is important that you understand the basic terminologies
used in Statistics. The two most important terminologies in statistics are population and sample.
Population and Sample

 Population: A collection or set of individuals, objects, or events whose properties are to be analyzed.
 Sample: A subset of the population is called ‘Sample’. A well-chosen sample will
contain most of the information about a particular population parameter.

Now you must be wondering how one can choose a sample that best represents the entire
population.

Sampling Techniques
Sampling is a statistical method that deals with the selection of individual observations within a
population. It is performed to infer statistical knowledge about a population.

Consider a scenario wherein you’re asked to perform a survey about the eating habits of
teenagers in the US. There are over 42 million teens in the US at present and this number is
growing as you read this blog. Is it possible to survey each of these 42 million individuals about
their health? Obviously not! That’s why sampling is used. It is a method wherein a sample of the
population is studied in order to draw inference about the entire population.

There are two main types of Sampling techniques:

1. Probability Sampling
2. Non-Probability Sampling


In this blog, we’ll be focusing only on probability sampling techniques because non-probability
sampling is not within the scope of this blog.
Probability Sampling: This is a sampling technique in which samples from a large population
are chosen using the theory of probability. There are three types of probability sampling:

 Random Sampling: In this method, each member of the population has an equal chance
of being selected in the sample.


 Systematic Sampling: In Systematic sampling, every nth record is chosen from the
population to be a part of the sample. Refer to the below figure to better understand how
Systematic sampling works.

[Figure: Systematic Sampling, every nth record is selected]

 Stratified Sampling: In Stratified sampling, strata are used to form samples from a
large population. A stratum is a subset of the population that shares at least one common
characteristic. After this, random sampling is used to select a sufficient number of
subjects from each stratum.
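As a quick illustration, the three techniques can be sketched in Python. The population and the stratifying characteristic below are assumptions for demonstration, not data from the text:

```python
import random

# An illustrative population of 100 member IDs.
population = list(range(1, 101))
random.seed(42)

# Random sampling: every member has an equal chance of being selected.
random_sample = random.sample(population, 10)

# Systematic sampling: every nth record, here n = 10.
n = 10
systematic_sample = population[::n]

# Stratified sampling: split the population into strata sharing a common
# characteristic (here, odd vs. even IDs), then randomly sample each stratum.
strata = {
    "odd": [x for x in population if x % 2 == 1],
    "even": [x for x in population if x % 2 == 0],
}
stratified_sample = {name: random.sample(members, 5)
                     for name, members in strata.items()}

print(len(random_sample), systematic_sample[:3])
```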

Types of Statistics
There are two well-defined types of statistics:

1. Descriptive Statistics
2. Inferential Statistics

Descriptive Statistics
Descriptive statistics is a method used to describe and understand the features of a specific data
set by giving short summaries about the sample and measures of the data.

Descriptive Statistics is mainly focused upon the main characteristics of data. It provides a
graphical summary of the data.


Suppose you want to gift t-shirts to all your classmates. To study the average shirt size of students
in a classroom, in descriptive statistics you would record the shirt size of all students in the class
and then you would find out the maximum, minimum and average shirt size of the class.

Inferential Statistics
Inferential statistics makes inferences and predictions about a population based on a sample of
data taken from the population in question.

Inferential statistics generalizes a large dataset and applies probability to draw a conclusion. It
allows us to infer data parameters based on a statistical model using sample data.

So, if we consider the same example of finding the average shirt size of students in a class, in
Inferential Statistics, you will take a sample set of the class, which is basically a few people from
the entire class. You would already have grouped the class into large, medium, and small. In this
method, you basically build a statistical model and expand it for the entire population in the
class.

Understanding Descriptive Statistics


Descriptive Statistics is broken down into two categories:

1. Measures of Central Tendency
2. Measures of Variability (spread)

Measures of Centre

Measures of the center are statistical measures that represent the summary of a dataset. There are
three main measures of center:


1. Mean: Measure of the average of all the values in a sample is called Mean.
2. Median: Measure of the central value of the sample set is called Median.
3. Mode: The value most recurrent in the sample set is known as Mode.
To better understand the measures of central tendency, let's look at an example. The below cars
dataset contains the following variables:

[Table: sample cars dataset]

 Cars
 Mileage per Gallon(mpg)
 Cylinder Type (cyl)
 Displacement (disp)
 Horse Power(hp)
 Real Axle Ratio(drat)

Using descriptive Analysis, you can analyze each of the variables in the sample data set for
mean, standard deviation, minimum and maximum.

If we want to find out the mean or average horsepower of the cars among the population of cars,
we will check and calculate the average of all values. In this case, we’ll take the sum of the
Horse Power of each car, divided by the total number of cars:
Mean = (110+110+93+96+90+110+110+110)/8 = 103.625

If we want to find out the center value of mpg among the population of cars, we will arrange the
mpg values in ascending or descending order and choose the middle value. In this case, we have
8 values which is an even entry. Hence we must take the average of the two middle values.

The mpg values for the 8 cars, in ascending order: 21, 21, 21.3, 22.8, 23, 23, 23, 23


Median = (22.8+23 ) /2 = 22.9

If we want to find out the most common type of cylinder among the population of cars, we will
check the value which is repeated the most number of times. Here we can see that the cylinders
come in two values, 4 and 6. Take a look at the data set, you can see that the most recurring
value is 6. Hence 6 is our Mode.
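A minimal sketch in Python of the three computations above, using the horsepower and mpg values from the example (the cylinder values are assumed, chosen to be consistent with a mode of 6):

```python
import statistics

hp = [110, 110, 93, 96, 90, 110, 110, 110]
mpg = [21, 21, 21.3, 22.8, 23, 23, 23, 23]
cyl = [6, 6, 4, 6, 6, 6, 4, 6]  # assumed values consistent with mode = 6

mean_hp = statistics.mean(hp)        # sum of horsepower / number of cars
median_mpg = statistics.median(mpg)  # average of the two middle values (even n)
mode_cyl = statistics.mode(cyl)      # most frequently occurring value

print(mean_hp, round(median_mpg, 1), mode_cyl)  # → 103.625 22.9 6
```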
Measures Of The Spread

A measure of spread, sometimes also called a measure of dispersion, is used to describe the
variability in a sample or population.


Just like the measures of center, we also have measures of the spread, which comprise the
following measures:

 Range: It is a measure of how spread apart the values in a data set are. The range
can be calculated as:

Range = Max(x_i) – Min(x_i)

Here,

Max(x_i): Maximum value of x

Min(x_i): Minimum value of x

 Quartile: Quartiles tell us about the spread of a data set by breaking the data set into
quarters, just like the median breaks it in half.

To better understand how quartile and the IQR are calculated, let’s look at an example.
[Figure: marks of 100 students ordered from lowest to highest scores]

The above image shows marks of 100 students ordered from lowest to highest scores. The
quartiles lie in the following ranges:

1. The first quartile (Q1) lies between the 25th and 26th observation.
2. The second quartile (Q2) lies between the 50th and 51st observation.
3. The third quartile (Q3) lies between the 75th and 76th observation.

 Inter Quartile Range (IQR): It is the measure of variability, based on dividing a data set
into quartiles. The interquartile range is equal to Q3 minus Q1, i.e. IQR = Q3 – Q1

 Variance: It describes how much a random variable differs from its expected value. It
entails computing the squares of the deviations. Variance can be calculated by using the
below formula:

Variance = Σ(x_i – x̅)² / n

Here,
x: Individual data points
n: Total number of data points
x̅ : Mean of data points

 Deviation is the difference between each element from the mean. It can be calculated by
using the below formula:

Deviation = (𝑥_𝑖 – µ)

 Population Variance is the average of the squared deviations. It can be calculated by
using the below formula:

σ² = Σ(x_i – µ)² / N

 Sample Variance is the average of the squared differences from the mean. It can be
calculated by using the below formula:

s² = Σ(x_i – x̅)² / (n – 1)

 Standard Deviation: It is the measure of the dispersion of a set of data from its mean. It
can be calculated by using the below formula:

σ = √( Σ(x_i – µ)² / N )

To better understand how the Measures of spread are calculated, let’s look at a use case.

Problem statement: Daenerys has 20 Dragons. They have the numbers 9, 2, 5, 4, 12, 7, 8, 11, 9,
3, 7, 4, 12, 5, 4, 10, 9, 6, 9, 4. Work out the Standard Deviation.
Let’s look at the solution step by step:

Step 1: Find out the mean for your sample set.

Mean = (9+2+5+4+12+7+8+11+9+3+7+4+12+5+4+10+9+6+9+4) / 20 = 140 / 20

µ = 7

Step 2: Then for each number, subtract the Mean and square the result.

(x_i – μ)²

(9-7)²= 2²=4
(2-7)²= (-5)²=25
(5-7)²= (-2)²=4
And so on…

We get the following results:


4, 25, 4, 9, 25, 0, 1, 16, 4, 16, 0, 9, 25, 4, 9, 9, 4, 1, 4, 9

Step 3: Then work out the mean of those squared differences.

(4+25+4+9+25+0+1+16+4+16+0+9+25+4+9+9+4+1+4+9) / 20 = 178 / 20

∴ σ² = 8.9

Step 4: Take the square root of σ².

σ = 2.983

To better understand the measures of spread and center, let’s execute a short demo by using the
R language.
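The text's demo uses R; here is a comparable sketch in Python that reproduces the dragon example and also computes the range and quartiles for the same data:

```python
import math
import statistics

dragons = [9, 2, 5, 4, 12, 7, 8, 11, 9, 3,
           7, 4, 12, 5, 4, 10, 9, 6, 9, 4]

# Step 1: the mean.
mu = sum(dragons) / len(dragons)                 # 140 / 20 = 7.0

# Step 2: square each deviation from the mean.
squared_devs = [(x - mu) ** 2 for x in dragons]

# Step 3: the variance is the mean of those squared deviations.
variance = sum(squared_devs) / len(dragons)      # 178 / 20 = 8.9

# Step 4: the standard deviation is the square root of the variance.
sigma = math.sqrt(variance)

# Range and inter-quartile range for the same data.
value_range = max(dragons) - min(dragons)        # Max - Min
q1, q2, q3 = statistics.quantiles(dragons, n=4)  # the three quartile cut points
iqr = q3 - q1                                    # IQR = Q3 - Q1

print(mu, variance, round(sigma, 3), value_range, iqr)
```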

Probability Distribution & Model Fitting


Probability Distribution
A distribution is a function that gives the possible values of a variable and how
often they occur. A probability distribution is a mathematical function that gives
the probabilities of occurrence of the various possible outcomes of an experiment.

Types of Probability Distribution

1. Normal Distribution

➢ The Normal Distribution is one of the most used distributions in Data Science.

➢ Many real-world quantities approximately follow a normal distribution, such as
the income distribution in the economy or students' average scores.

➢ In addition, the sum of many small random variables also usually turns out to
follow a normal distribution.

➢ To identify a Normal Distribution, you should look for a bell-shaped curve.

2. Binomial Distribution

➢ The Binomial distribution is a discrete distribution.

➢ It gives the probability of x successes in n trials, given a success probability p in each trial.

➢ A distribution is called a binomial distribution if it satisfies the below conditions:

1. There is a fixed number of trials.

2. Each trial has only two possible outcomes.

3. The trials are independent.

4. The probabilities of success and failure remain the same across trials.

3. Bernoulli Distribution

➢ The Bernoulli distribution is the simplest of all the distributions.

➢ It is similar to the binomial distribution; the only difference is that it considers a
single trial, while the binomial distribution considers n trials.

➢ There are only two possible outcomes, i.e., success vs. failure.

➢ Consider a random variable X with a single parameter p, which represents the
probability of occurrence of the event.

4. Uniform Distribution

➢ The Uniform distribution is also called the rectangular distribution.

➢ The Uniform Distribution can be seen as a generalization of the Bernoulli
Distribution.

➢ Here, a possibly unlimited number of outcomes is allowed, and all the
events have the same probability of taking place.

➢ As an example, imagine the roll of a fair die: there are multiple possible
outcomes, each with the same probability of occurring.

5. Poisson Distribution

➢ The Poisson distribution is a distribution of counts, i.e., the number of times an
event has occurred in a given interval of time.

➢ It can be used to predict the probability of a number of events
occurring in a specific interval of time.

➢ For example, if a call center received 50 calls in 1 hour, then using the Poisson
distribution we can predict the probability of getting 20 calls in the next 30 minutes.
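The call-center example can be computed directly from the Poisson probability mass function P(X = k) = e^(−λ) λ^k / k!; a rate of 50 calls per hour implies an expected λ of 25 calls per 30-minute window:

```python
import math

def poisson_pmf(k, lam):
    """Probability of observing exactly k events when lam events are expected."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# 50 calls per hour implies an expected lam = 25 calls per 30 minutes.
p_20_calls = poisson_pmf(20, 25)
print(round(p_20_calls, 3))  # → 0.052
```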

Model Fitting
➢ Underfitting - Underfitting happens when the machine learning model
cannot sufficiently model the training data nor generalize new data.

➢ An underfit machine learning model is not a suitable model; this will be
obvious as it will have poor performance on the training data.

UnderFitting: Low Specificity, High Generalizability

➢ Overfitting - Overfitting negatively impacts the performance of the model on
new data. It occurs when a model learns the details and noise in the training data
too closely.
➢ When random fluctuations or the noise in the training data are picked up
and learned as concepts by the model, the model “overfits”.

➢ It will perform well on the training set, but very poorly on the test set.
OverFitting: High Specificity, Low Generalizability

[Figure: model fitting, from underfitting through a good fit to overfitting]

Unit-III

Machine learning algorithms are programs that can learn from data and improve
from experience, without human intervention. Learning tasks may include learning
the function that maps the input to the output, learning the hidden structure in
unlabeled data; or ‘instance-based learning’, where a class label is produced for a
new instance by comparing the new instance (row) to instances from the training
data, which were stored in memory. ‘Instance-based learning’ does not create an
abstraction from specific instances.

Types of Machine Learning Algorithms

There are 3 types of machine learning (ML) algorithms:

Supervised Learning Algorithms:


Supervised learning uses labeled training data to learn the mapping function that
turns input variables (X) into the output variable (Y). In other words, it solves
for f in the following equation:

Y = f (X)

This allows us to accurately generate outputs when given new inputs.

We’ll talk about two types of supervised learning: classification and regression.

Classification predicts the outcome of a given sample when the output variable is
in the form of categories. A classification model might look at the input data and
try to predict labels like “sick” or “healthy.”

Regression is used to predict the outcome of a given sample when the output
variable is in the form of real values. For example, a regression model might
process input data to predict the amount of rainfall, the height of a person, etc.
Linear Regression, Logistic Regression, CART, Naïve-Bayes, and K-Nearest
Neighbors (KNN) — are examples of supervised learning.

Unsupervised Learning Algorithms:


Unsupervised learning models are used when we only have the input variables (X)
and no corresponding output variables. They use unlabeled training data to model
the underlying structure of the data.

Three types of unsupervised learning:

Association is used to discover the probability of the co-occurrence of items in a
collection. It is extensively used in market-basket analysis. For example, an
association model might be used to discover that if a customer purchases bread,
s/he is 80% likely to also purchase eggs.

Clustering is used to group samples such that objects within the same cluster are
more similar to each other than to the objects from another cluster.

Dimensionality Reduction is used to reduce the number of variables of a data set
while ensuring that important information is still conveyed. Dimensionality
Reduction can be done using Feature Extraction methods and Feature Selection
methods. Feature Selection selects a subset of the original variables. Feature
Extraction performs data transformation from a high-dimensional space to a low-
dimensional space. Example: the PCA algorithm is a Feature Extraction approach.

Algorithms 6-8 that we cover here — Apriori, K-means, PCA — are examples of
unsupervised learning.

Reinforcement learning:

Reinforcement learning is a type of machine learning algorithm that allows an
agent to decide the best next action based on its current state, by learning behaviors
that will maximize a reward.

Reinforcement algorithms usually learn optimal actions through trial and error.
Imagine, for example, a video game in which the player needs to move to certain
places at certain times to earn points. A reinforcement algorithm playing that game
would start by moving randomly but, over time through trial and error, it would
learn where and when it needed to move the in-game character to maximize its
point total.

1. Linear Regression

In machine learning, we have a set of input variables (x) that are used to determine
an output variable (y). A relationship exists between the input variables and the
output variable. The goal of ML is to quantify this relationship.

Figure 1: Linear Regression is represented as a line in the form of y = a + bx.
In Linear Regression, the relationship between the input variables (x) and output
variable (y) is expressed as an equation of the form y = a + bx. Thus, the goal of
linear regression is to find out the values of coefficients a and b. Here, a is the
intercept and b is the slope of the line.

Figure 1 shows the plotted x and y values for a data set. The goal is to fit a line that
is nearest to most of the points. This would reduce the distance (‘error’) between
the y value of a data point and the line.
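The coefficients a and b have a closed-form least-squares solution, sketched below on illustrative data (the x and y values are assumptions, not from the text):

```python
# Illustrative data roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Slope b minimizes the total squared vertical distance ('error') to the line.
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
a = y_bar - b * x_bar  # intercept: the fitted line passes through (x_bar, y_bar)

def predict(x):
    return a + b * x

print(round(a, 2), round(b, 2))  # → 0.05 1.99
```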

2. Logistic Regression

While linear regression predictions are continuous values (e.g., rainfall in cm),
logistic regression predictions are discrete values (e.g., whether a student
passed or failed) after applying a transformation function.

Logistic regression is best suited for binary classification: data sets where y = 0 or
1, where 1 denotes the default class. For example, in predicting whether an event
will occur or not, there are only two possibilities: that it occurs (which we denote
as 1) or that it does not (0). So if we were predicting whether a patient was sick, we
would label sick patients using the value of 1 in our data set.

Logistic regression is named after the transformation function it uses, which is
called the logistic function h(x) = 1 / (1 + e^(-x)). This forms an S-shaped curve.

In logistic regression, the output takes the form of probabilities of the default class
(unlike linear regression, where the output is directly produced). As it is a
probability, the output lies in the range of 0-1. So, for example, if we’re trying to
predict whether patients are sick, we already know that sick patients are denoted
as 1, so if our algorithm assigns the score of 0.98 to a patient, it thinks that patient
is quite likely to be sick.

This output (y-value) is generated by log-transforming the x-value using the
logistic function h(x) = 1 / (1 + e^(-x)). A threshold is then applied to force this
probability into a binary classification.
Figure 2: Logistic Regression to determine if a tumor is malignant or benign.
Classified as malignant if the probability h(x) >= 0.5.

In Figure 2, to determine whether a tumor is malignant or not, the default variable
is y = 1 (tumor = malignant). The x variable could be a measurement of the tumor,
such as the size of the tumor. As shown in the figure, the logistic function
transforms the x-value of the various instances of the data set, into the range of 0 to
1. If the probability crosses the threshold of 0.5 (shown by the horizontal line), the
tumor is classified as malignant.

The logistic regression equation p(x) = e^(b0 + b1x) / (1 + e^(b0 + b1x)) can be
transformed into ln(p(x) / (1 – p(x))) = b0 + b1x.

The goal of logistic regression is to use the training data to find the values of
coefficients b0 and b1 such that it will minimize the error between the predicted
outcome and the actual outcome. These coefficients are estimated using the
technique of Maximum Likelihood Estimation.
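The pieces described above can be sketched in a few lines; the coefficients b0 and b1 below are assumed values for illustration, not estimates from real data:

```python
import math

def h(x):
    """Logistic (sigmoid) function: maps any real value into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

b0, b1 = -4.0, 1.5  # assumed coefficients (in practice found by MLE)

def predict_proba(x):
    return h(b0 + b1 * x)  # probability of the default class (y = 1)

def classify(x, threshold=0.5):
    # The threshold forces the probability into a binary classification.
    return 1 if predict_proba(x) >= threshold else 0

print(round(predict_proba(4.0), 3), classify(4.0), classify(0.0))  # → 0.881 1 0
```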

3. CART

Classification and Regression Trees (CART) are one implementation of Decision
Trees.

The non-terminal nodes of Classification and Regression Trees are the root node
and the internal node. The terminal nodes are the leaf nodes. Each non-terminal
node represents a single input variable (x) and a splitting point on that variable; the
leaf nodes represent the output variable (y). The model is used as follows to make
predictions: walk the splits of the tree to arrive at a leaf node and output the value
present at the leaf node.
The decision tree in Figure 3 below classifies whether a person will buy a sports
car or a minivan depending on their age and marital status. If the person is over 30
years and is not married, we walk the tree as follows : ‘over 30 years?’ -> yes ->
’married?’ -> no. Hence, the model outputs a sports car.

Figure 3: Parts of a decision tree.
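The walk through the splits can be written as nested conditionals. Note that the under-30 branch below is an assumption, since the text only walks the over-30 path:

```python
def predict_vehicle(age, married):
    # Root node: 'over 30 years?'
    if age > 30:
        # Internal node: 'married?'
        if married:
            return "minivan"
        return "sports car"
    # Assumed behavior for the under-30 branch (not specified in the text).
    return "sports car"

# Walking the splits for a person over 30 and not married:
print(predict_vehicle(age=35, married=False))  # → sports car
```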

4. Naïve Bayes

To calculate the probability that an event will occur, given that another event has
already occurred, we use Bayes’s Theorem. To calculate the probability of
hypothesis(h) being true, given our prior knowledge(d), we use Bayes’s Theorem
as follows:

P(h|d)= (P(d|h) P(h)) / P(d)

where:

 P(h|d) = Posterior probability. The probability of hypothesis h being true,
given the data d. Under the naive independence assumption,
P(h|d) = (P(d1|h) P(d2|h) … P(dn|h) P(h)) / P(d)

 P(d|h) = Likelihood. The probability of data d given that the hypothesis h
was true.

 P(h) = Class prior probability. The probability of hypothesis h being true
(irrespective of the data).

 P(d) = Predictor prior probability. The probability of the data (irrespective of the
hypothesis).

This algorithm is called ‘naive’ because it assumes that all the variables are
independent of each other, which is a naive assumption to make in real-world
examples.

Figure 4: Using Naive Bayes to predict the status of ‘play’ using the variable
‘weather’.

Using Figure 4 as an example, what is the outcome if weather = ‘sunny’?

To determine the outcome play = ‘yes’ or ‘no’ given the value of variable weather
= ‘sunny’, calculate P(yes|sunny) and P(no|sunny) and choose the outcome with
higher probability.

->P(yes|sunny)= (P(sunny|yes) * P(yes)) / P(sunny) = (3/9 * 9/14 ) / (5/14) = 0.60

-> P(no|sunny)= (P(sunny|no) * P(no)) / P(sunny) = (2/5 * 5/14 ) / (5/14) = 0.40

Thus, if the weather = ‘sunny’, the outcome is play = ‘yes’.
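The arithmetic of the weather example can be reproduced directly; the counts below are those implied by the worked calculation (9 'yes' and 5 'no' outcomes out of 14 days, 5 sunny days, of which 3 were 'yes' days and 2 were 'no' days):

```python
# Probabilities implied by the example's counts.
p_sunny_given_yes = 3 / 9
p_yes = 9 / 14
p_sunny_given_no = 2 / 5
p_no = 5 / 14
p_sunny = 5 / 14

# Bayes's Theorem for each candidate outcome.
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
p_no_given_sunny = p_sunny_given_no * p_no / p_sunny

# Choose the outcome with the higher posterior probability.
prediction = "yes" if p_yes_given_sunny > p_no_given_sunny else "no"
print(round(p_yes_given_sunny, 2), round(p_no_given_sunny, 2), prediction)
```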

5. KNN

The K-Nearest Neighbors algorithm uses the entire data set as the training set,
rather than splitting the data set into a training set and test set.
When an outcome is required for a new data instance, the KNN algorithm goes
through the entire data set to find the k-nearest instances to the new instance, or the
k number of instances most similar to the new record, and then outputs the mean of
the outcomes (for a regression problem) or the mode (most frequent class) for a
classification problem. The value of k is user-specified.

The similarity between instances is calculated using measures such as Euclidean
distance and Hamming distance.
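A bare-bones KNN classifier following the description above; the 2-D training points are illustrative assumptions:

```python
import math
from collections import Counter

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_classify(train, new_point, k):
    """train is a list of (features, label) pairs; returns the most frequent
    label (mode) among the k instances nearest to new_point."""
    nearest = sorted(train, key=lambda item: euclidean(item[0], new_point))[:k]
    labels = [label for _, label in nearest]
    return Counter(labels).most_common(1)[0][0]

# Illustrative training set: two well-separated groups.
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B")]

print(knn_classify(train, (2, 2), k=3))  # → A
```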

Unsupervised learning algorithms

6. Apriori

The Apriori algorithm is used in a transactional database to mine frequent item sets
and then generate association rules. It is popularly used in market basket analysis,
where one checks for combinations of products that frequently co-occur in the
database. In general, we write the association rule for ‘if a person purchases item
X, then he purchases item Y’ as : X -> Y.

Example: if a person purchases milk and sugar, then she is likely to purchase
coffee powder. This could be written in the form of an association rule as:
{milk,sugar} -> coffee powder. Association rules are generated after crossing the
threshold for support and confidence.

Figure 5: Formulae for support, confidence and lift for the association rule X->Y.

The Support measure helps prune the number of candidate item sets to be
considered during frequent item set generation. This support measure is guided by
the Apriori principle. The Apriori principle states that if an itemset is frequent, then
all of its subsets must also be frequent.
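Support and confidence for the rule {milk, sugar} -> {coffee powder} can be computed by simple counting; the transaction database below is an illustrative assumption:

```python
transactions = [
    {"milk", "sugar", "coffee powder"},
    {"milk", "sugar", "coffee powder"},
    {"milk", "sugar"},
    {"milk", "bread"},
    {"sugar", "bread"},
]

antecedent = {"milk", "sugar"}   # X
consequent = {"coffee powder"}   # Y

n = len(transactions)
count_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
count_x = sum(1 for t in transactions if antecedent <= t)

support = count_both / n           # fraction of transactions containing X and Y
confidence = count_both / count_x  # of transactions with X, fraction also with Y

print(support, round(confidence, 3))  # → 0.4 0.667
```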
7. K-means

K-means is an iterative algorithm that groups similar data into clusters. It
calculates the centroids of k clusters and assigns a data point to the cluster
with the least distance between its centroid and the data point.

Figure 6: Steps of the K-means algorithm.

Here’s how it works:

We start by choosing a value of k. Here, let us say k = 3. Then, we randomly
assign each data point to any of the 3 clusters and compute the cluster centroid for
each of the clusters. The red, blue and green stars denote the centroids for each of
the 3 clusters.

Next, reassign each point to the closest cluster centroid. In the figure above, the
upper 5 points got assigned to the cluster with the blue centroid. Follow the same
procedure to assign points to the clusters containing the red and green centroids.

Then, calculate centroids for the new clusters. The old centroids are gray stars; the
new centroids are the red, green, and blue stars.

Finally, repeat steps 2-3 until there is no switching of points from one cluster to
another. Once there is no switching for 2 consecutive steps, exit the K-means
algorithm.
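The loop described above in minimal Python, on illustrative 2-D points (real implementations also handle empty clusters and multiple restarts):

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Assign points to the nearest centroid, recompute centroids, and stop
    once no point switches clusters."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(iterations):
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:  # no switching: we are done
            break
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # recompute the centroid as the mean of its members
                centroids[c] = tuple(sum(coord) / len(members)
                                     for coord in zip(*members))
    return centroids, assignment

# Two obvious groups of illustrative points.
pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, labels = kmeans(pts, k=2)
print(sorted(centroids))
```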
Unit-IV

What is data visualization?


There's a story behind your numbers. Visualizing data brings them to life.

Data viz is the communication of data in a visual manner, or turning raw data into insights that
can be easily interpreted by your readers.

Other definitions include:

 Wikipedia's definition of Data Visualization: Data visualization refers to the techniques used to
communicate data or information by encoding it as visual objects (points, lines or bars)
contained in graphics.
 Techopedia's definition of Data Visualization: Data visualization is the process of displaying data
or information in graphical charts, figures and bars.

Data visualization is a graphical representation of information and data. By using visual
elements like charts, graphs, and maps, data visualization tools provide an accessible way to
see and understand trends, outliers, and patterns in data.

In the world of Big Data, data visualization tools and technologies are essential to analyze
massive amounts of information and make data-driven decisions.

Basic Principles of Data Visualization

The best data in the world won't be worth anything if no one can understand it. The job of a data
analyst is not only to collect and analyze data, but also to present it to the end users and other interested
parties who will then act on that data. Here’s where data visualization comes in.
Many data analysts are not necessarily experts in data communication or graphic design. This means a
lot of them can be lost in the translation of data from the collection to the presentation in the boardroom. I
often find myself teaching data visualization classes to more and more data science teams, who
recognize this as an area of weakness.

If your job entails presenting findings from a set of data or analysis to a group of laymen, then it’s part of
your job to present it to them in such a way that it’s easy to understand and therefore take appropriate
action.

In this post, I’ll share a few tips to help you turn data into actionable insights that people will understand.

Keep your Audience in Mind


Any data visualization should be designed in such a way that it meets the needs of the audience and their
information needs. As such, you need to determine exactly who is in that audience, and the kind of
questions they may need answers to.

Choose the Chart Wisely


Not all charts are equal, and some will do a better job at displaying certain kinds of information than
others. Check out the following flowchart to help you choose the best type of chart to display your
information.

Think Beyond the PowerPoint Templates


PowerPoint is by far the most popular visualization tool, but the built-in templates in the program might not
be doing your data any favors. Rather than trying to get fancy (yeah, this is directed at 3D pie charts), try
to keep your visualizations as simple and uncluttered as possible. If you really want to go for it, Design
Bundles has a great selection of tools for infographics. These can look spectacular and really make data
sing.

Form follows Function


How will your audience use the data? Consider this and let it determine how you will present the data.
Think of your visualization as the dashboard of a cockpit, and be sure to present only the most useful,
relevant information, in the clearest way possible.

Direct Attention to the Important Details


As you design your visualizations, be sure to leverage the sensory details like size, color, graphics, and
fonts to direct the attention of your audience to the most important pieces of the information.
Use Tables and Graphs Appropriately

Tables should be used when you want to display precise values. Graphs should be
used to present information with regards to data patterns, relationships, and how
things change over time. From my experience, it’s best to reduce the use of tables
and focus more on the graphs.

Provide Context

A well-done presentation should prompt the user to act on the presented data.
However, this is hard to achieve if the context for that action has not been
provided. Use size, color, and other visual cues to provide context, and be sure to
include some short narratives to highlight the key insights.

Align the Data and the Displays Right

Ensure that your displays of information are vertically and horizontally aligned, to
make sure that they can be compared accurately. This also helps to prevent
misleading optical illusions in your presentation.

Choose the Right Colors

You should use color to draw the attention of the audience to key data pieces, not
just to brighten your presentation. Moreover, choose your color combinations
wisely. For instance, you don't want to use red and green in the same diagram,
since people with red-green color blindness cannot tell them apart.

Pay Careful Attention to Titles

Give your graphs and charts useful, explanatory titles. This helps to highlight the
focus of the presentation. View titles as the headline that draws people in, gives
readers a snapshot of the key insights, and focuses them on the right questions.

Use Clear Axis Labels and Numbers

Steer away from fancy gauges and labels that can reduce clarity. Always start at
zero when labeling the axis of a graph or chart, unless there’s a strong reason not
to, such as when the data has been clustered at unreasonably high values.

Leverage Interactivity When Appropriate


The newer generations of data visualization tools allow you to build interactivity
into visualizations that can benefit the end user. However, remember that this is
more of a parlor trick, which should be used when the interactivity helps to clarify,
and not confuse the data presentation.

These basic principles should help you increase the effectiveness of your data
presentation and communication. This way, key stakeholders will be in a better
position to make better, and more informed decisions based on the data you have
gathered and presented.
