Unit 1 - Exploratory Data Analysis Fundamentals
• Now that popular frameworks such as Hadoop have solved the problem of storing data, the focus has shifted to processing it, and this is where Data Science plays a big role.
• What is Data Science?
• Data science is the science of analyzing raw data using statistics and machine
learning techniques with the purpose of drawing conclusions about that
information.
• Data Scientists can also be divided into different roles based on their skill sets:
• Data Researchers
• Data Developers
Roles & Responsibilities of a Data Scientist:
• Collecting large amounts of noisy data and transforming it into a more usable
format.
• Solving business-related problems using data-driven techniques.
• Working with a variety of programming languages, including Statistical
Analysis System (SAS), R and Python.
• Having a solid grasp of statistics, including statistical tests and distributions.
• Staying on top of analytical techniques such as machine learning, deep
learning and text analytics.
• Communicating and collaborating with both IT and business.
• Looking for order and patterns in data, as well as spotting trends that can
help a business’s bottom line.
• What’s in a data scientist’s toolbox?
• Unstructured data:
• It is typically categorized as qualitative data and cannot be processed or analyzed with conventional data tools and methods.
• Unstructured data can take the form of text files, photos, or video files.
• It is best managed in non-relational (NoSQL) databases; another way to manage unstructured data is to use data lakes to preserve it in its raw form.
• Uses: Structured data is used in machine learning (ML) and drives its algorithms, whereas
unstructured data is used in natural language processing (NLP) and text mining.
• Data Preprocessing:
• Data preprocessing is a process of preparing the raw data and making it
suitable for a machine learning model.
• Why is Data preprocessing important?
• Preprocessing of data is mainly about checking data quality, which can be assessed along the following dimensions:
• 1. Accuracy: To check whether the data entered is correct or not.
• 2. Completeness: To check whether all required data is available or some values were never recorded.
• 3. Consistency: To check whether the same data, kept in different places, matches.
• 4. Timeliness: The data should be updated correctly.
• 5. Believability: The data should be trustable.
• 6. Interpretability: The understandability of the data.
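• A minimal pandas sketch of how some of these quality checks (completeness, consistency, and a rough accuracy check) might be run; the DataFrame, column names, and values are illustrative assumptions, not part of a real dataset:

import pandas as pd

# Illustrative patient records; column names and values are assumptions for the sketch.
df = pd.DataFrame({
    "patient_id": [1001, 1002, 1002, 1003],
    "weight": [10, None, 58, 72],
    "gender": ["Female", "Male", "Male", None],
})

# Completeness: count missing values per column.
print(df.isna().sum())

# Consistency: look for duplicate identifiers that should be unique.
print(df["patient_id"].duplicated().sum(), "duplicate patient IDs")

# Accuracy (rough check): flag values outside a plausible range.
print(df[(df["weight"] <= 0) | (df["weight"] > 300)])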
• Data Cleaning
• It is done as part of data preprocessing to clean the data by filling in missing values, smoothing noisy data, resolving inconsistencies, and removing outliers.
• Handling Missing Values: The null values in the dataset are imputed using mean/median
or mode based on the type of data that is missing:
• Numerical Data: If a numerical value is missing, replace the NaN value with the mean or median. The median is usually preferred, because the mean is influenced by outliers and skewness in the data and is pulled in their direction.
• Categorical Data: When a categorical value is missing, replace it with the most frequently occurring value, i.e. the mode.
• Drop the columns: If a column has, say, 50% of its values missing, we do not replace all of those missing values with the median or mode; we drop that column instead. Imputing that many values would bias the column towards the median/mode value and give it an unnaturally strong influence on the dependent variable (a short pandas sketch of these rules follows).
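• A minimal pandas sketch of these imputation rules, assuming an illustrative DataFrame with hypothetical columns (age, gender, notes):

import pandas as pd

# Illustrative data; column names and values are assumptions for the sketch.
df = pd.DataFrame({
    "age": [25, None, 40, 35, None],
    "gender": ["Female", "Male", None, "Male", "Male"],
    "notes": [None, None, None, "ok", None],  # mostly missing
})

# Drop columns with 50% or more of their values missing.
df = df.loc[:, df.isna().mean() < 0.5]

# Numerical columns: impute missing values with the median.
for col in df.select_dtypes(include="number"):
    df[col] = df[col].fillna(df[col].median())

# Categorical columns: impute missing values with the mode (most frequent value).
for col in df.select_dtypes(exclude="number"):
    df[col] = df[col].fillna(df[col].mode().iloc[0])

print(df)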
The significance of EDA
• Different fields of science, economics, engineering, and marketing accumulate
and store data primarily in electronic databases.
• To be certain of the insights that the collected data provides and to make
further decisions, data mining is performed where we go through distinctive
analysis processes.
• Exploratory data analysis is key, and usually the first exercise in data mining.
• It allows us to visualize data to understand it as well as to create hypotheses
for further analysis.
• The exploratory analysis centers around creating a synopsis of data or insights
for the next steps in a data mining project.
• Steps in EDA
• Problem definition: Before trying to extract useful insight from the data, it is
essential to define the business problem to be solved.
• The problem definition works as the driving force for a data analysis plan
execution.
• The main tasks involved in problem definition are defining the main objective
of the analysis, defining the main deliverables, outlining the main roles and
responsibilities, obtaining the current status of the data, defining the
timetable, and performing cost/benefit analysis.
• Based on such a problem definition, an execution plan can be created.
• Data preparation: This step involves methods for preparing the dataset before
actual analysis.
• In this step, we define the sources of data, define data schemas and tables,
understand the main characteristics of the data, clean the dataset, delete non-
relevant datasets, transform the data, and divide the data into required chunks
for analysis.
• Data analysis: The main tasks involve summarizing the data, finding the
hidden correlation and relationships among the data, developing predictive
models, evaluating the models, and calculating the accuracies.
• Some of the techniques used for data summarization are summary tables, graphs, descriptive statistics, inferential statistics, correlation statistics, searching, grouping, and mathematical models (a few of these are sketched in the code example after this list).
• Classical data analysis: For the classical data analysis approach, the problem
definition and data collection step are followed by model development,
which is followed by analysis and result communication.
• Exploratory data analysis approach: The EDA approach follows the same sequence as classical data analysis, except that the model imposition and data analysis steps are swapped.
• The main focus is on the data, its structure, outliers, models, and
visualizations. Generally, in EDA, we do not impose any deterministic or
probabilistic models on the data.
• Bayesian data analysis approach: The Bayesian approach incorporates prior
probability distribution knowledge into the analysis steps.
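• As mentioned in the data analysis step above, here is a minimal pandas sketch of a few of the summarization techniques (summary tables/descriptive statistics, correlation, grouping); the DataFrame, column names, and values are illustrative assumptions:

import pandas as pd

# Illustrative data; column names and values are assumptions for the sketch.
df = pd.DataFrame({
    "age": [23, 45, 31, 52, 37],
    "weight": [60, 82, 70, 90, 75],
    "gender": ["Female", "Male", "Male", "Male", "Female"],
})

print(df.describe())                          # descriptive statistics (summary table)
print(df[["age", "weight"]].corr())           # correlation between numerical variables
print(df.groupby("gender")["weight"].mean())  # grouping / aggregation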
Software tools available for EDA
• There are several software tools that are available to facilitate EDA.
• Python: This is an open source programming language widely used in data analysis, data mining, and data science (https://www.python.org/). In this course, we will be using Python.
• Weka: This is an open source data mining package that involves several EDA tools and
algorithms (https://www.cs.waikato.ac.nz/ml/weka/).
• KNIME: This is an open source tool for data analysis and is based on Eclipse
(https://www.knime.com/).
Making sense of data
• It is crucial to identify the type of data under analysis.
• Different disciplines store different kinds of data for different purposes.
• For example, medical researchers store patients' data, universities store students' and teachers' data, and the real estate industry stores house and building datasets.
• A dataset contains many observations about a particular object.
• For instance, a dataset about patients in a hospital can contain many
observations.
• A patient can be described by a patient identifier (ID), name, address, weight, date of birth, email, and gender. Each of these features that describes a patient is a variable.
• Each observation can have a specific value for each of these variables. For
example, a patient can have the following:
• PATIENT_ID = 1001
• Name = Yoshmi Mukhiya
• Address = Mannsverk 61, 5094, Bergen, Norway
• Date of birth = 10th July 2018
• Email = yoshmimukhiya@gmail.com
• Weight = 10
• Gender = Female
• These datasets are stored in hospitals and are presented for analysis. Most of
this data is stored in some sort of database management system in
tables/schema. An example of a table for storing patient information is shown
here:
• To summarize the preceding table, there are five observations (001, 002, 003, 004, 005).
• Each observation describes variables (PatientID, name, address, dob, email,
gender, and weight).
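• As a minimal sketch, the example observation given earlier could be stored as one row of a pandas DataFrame, with each variable as a column (only the single record from the text is used; nothing else is assumed about the table):

import pandas as pd

# One observation (row); each variable that describes a patient is a column.
patients = pd.DataFrame([{
    "PATIENT_ID": 1001,
    "Name": "Yoshmi Mukhiya",
    "Address": "Mannsverk 61, 5094, Bergen, Norway",
    "Date of birth": "10th July 2018",
    "Email": "yoshmimukhiya@gmail.com",
    "Weight": 10,
    "Gender": "Female",
}])

print(patients)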
• Most of the dataset broadly falls into two groups—numerical data and
categorical data.
• Numerical data
• This data has a sense of measurement involved in it; for example, a person's
age, height, weight, blood pressure, heart rate, temperature, number of teeth,
number of bones, and the number of family members.
• Discrete data
• This is data that is countable and its values can be listed out. For example, if
we flip a coin, the number of heads in 200 coin flips can take values from 0 to
200 (finite) cases.
• Continuous data
• A variable that can have an infinite number of numerical values within a
specific range is classified as continuous data.
• A variable describing continuous data is a continuous variable. For example,
what is the temperature of your city today?
• Categorical data
• This type of data represents the characteristics of an object; for example,
gender, marital status, type of address, or categories of the movies.
• This data is often referred to as qualitative datasets in statistics.
• To understand clearly, here are some of the most common types of categorical
data you can find in data:
• Gender (Male, Female, Other, or Unknown)
• Marital Status (Annulled, Divorced, Interlocutory, Legally Separated, Married,
Polygamous, Never Married, Domestic Partner, Unmarried, Widowed, or
Unknown)
• Movie genres (Action, Adventure, Comedy, Crime, Drama, Fantasy, Historical,
Horror, Mystery, Philosophical, Political, Romance, Saga, Satire, Science
Fiction, Social, Thriller, Urban, or Western)
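• A minimal pandas sketch of how numerical and categorical variables might be distinguished and marked in a DataFrame; the column names and values are illustrative assumptions:

import pandas as pd

# Illustrative data mixing numerical and categorical variables.
df = pd.DataFrame({
    "age": [25, 40, 35],                       # numerical (discrete)
    "temperature": [36.6, 37.2, 38.1],         # numerical (continuous)
    "gender": ["Female", "Male", "Male"],      # categorical
    "genre": ["Action", "Comedy", "Horror"],   # categorical
})

# Inspect which columns pandas treats as numerical vs object (text).
print(df.dtypes)

# Explicitly mark the qualitative columns as categorical.
df["gender"] = df["gender"].astype("category")
df["genre"] = df["genre"].astype("category")

print(df.dtypes)
print(df.describe(include="category"))  # summary of the categorical columns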
Visual Aids for Exploratory Data Analysis
• Line chart
• Bar chart
• Scatter plot
• Histogram
• Pie chart
• Box plot
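• As a minimal matplotlib sketch, assuming a small synthetic series, a few of these plot types could be produced as follows:

import matplotlib.pyplot as plt
import numpy as np

# Small synthetic dataset, purely illustrative.
rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=200)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

axes[0].plot(values)            # line chart
axes[0].set_title("Line chart")

axes[1].hist(values, bins=20)   # histogram
axes[1].set_title("Histogram")

axes[2].boxplot(values)         # box plot
axes[2].set_title("Box plot")

plt.tight_layout()
plt.show()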
What is Data Visualization?