
Fundamentals of Data Science
LECTURE 2: DATA SCIENCE PROCESS


2. Data Science Process
• The methodical discovery of useful relationships and patterns in data is enabled by a set of iterative activities collectively known as the data science process.
• The standard data science process involves:
1. Understanding the problem (Prior Knowledge)
2. Preparing the data samples (Preparation)
3. Developing the model (Modeling)
4. Applying the model on a dataset to see how the model may work in the real world (Application)
5. Deploying and maintaining the models (Knowledge)
What Motivated Data Mining?
Why Is It Important?
• Wide availability of huge amounts of
data and the imminent need for turning
such data into useful information and
knowledge.
• Data mining can be viewed as a result of
the natural evolution of information
technology.
Data science process frameworks
• One of the most popular data science process frameworks is CRISP-DM, an acronym for Cross Industry Standard Process for Data Mining.
• The CRISP-DM process is the most widely adopted framework for developing data science solutions.
• Other data science frameworks are SEMMA, an acronym for Sample, Explore, Modify, Model, and Assess;
• DMAIC, an acronym for Define, Measure, Analyze, Improve, and Control, used in Six Sigma practice; and the Selection, Preprocessing, Transformation, Data Mining, Interpretation, and Evaluation framework used in the knowledge discovery in databases (KDD) process.
CRISP-DM process
• CRISP-DM is a process model with six phases that naturally describes the data science life cycle: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment.
• It's like a set of guardrails to help you plan, organize, and implement your data science (or machine learning) project.
• The Business Understanding phase focuses on understanding the objectives and requirements of the project, i.e., understanding the customer's needs.
• The Data Understanding phase focuses on identifying, collecting, and analyzing the data sets.
Data science Process
• The data science process is a generic set of steps that is
problem, algorithm, and data science tool agnostic.
• The fundamental objective of any process that involves
data science is to address the analysis question.
• The learning algorithm used to solve the business
question could be a decision tree, an artificial neural
network, or a scatterplot.
• The software tool used to develop and implement the data science algorithm could be custom coding, RapidMiner, R, Weka, SAS, Oracle Data Miner, or Python.
Data Science Process
Process flow mapped to the five phases:
• 1. Prior Knowledge: Business Understanding, Data Understanding
• 2. Preparation: Prepare Data
• 3. Modeling: Building the model on training data using algorithms; applying the model to test data and evaluating performance
• 4. Application: Deployment
• 5. Knowledge: Knowledge and Actions
1. Prior Knowledge
• Prior knowledge refers to information that is already known about a subject.
• The prior knowledge step in the data science process helps to define what
problem is being solved, how it fits in the business context, and what data is
needed in order to solve the problem.

Gaining information on:

1. Objective of the problem


2. Subject area of the problem
3. Data
1. Prior Knowledge
Gaining information on:
1. Objective of the problem
- The data science process starts with a need for analysis, a question,
or a business objective.
- This is the most important step in the data science process
- Without a well-defined statement of the problem, it is impossible to
come up with the right dataset and pick the right data science
algorithm.
- As an iterative process, it is common to go back to previous data science process steps and revise the assumptions, approach, and tactics.
- However, it is imperative to get the first step, the objective of the whole process, right.
1. Prior Knowledge
Gaining information on:

2- Subject area of the problem


– The process of data science uncovers hidden patterns in the dataset
by exposing relationships between attributes.
– But the problem is that it uncovers a lot of patterns.
– False or spurious (fake) signals are a major problem in the data
science process.
– It is up to the practitioner to sift through the exposed patterns and
accept the ones that are valid and relevant to the answer of the
objective question.
– Hence, it is essential to know the subject matter, the context, and the
business process generating the data.
1. Prior Knowledge
Gaining information on:
3- Data
• Similar to the prior knowledge in the subject area, prior knowledge in the
data can also be gathered.
• Understanding how the data is collected, stored, transformed, reported,
and used is essential to the data science process.
• This part of the step surveys all the data available to answer the business
question and narrows down the new data that need to be sourced.
• There is a range of factors to consider: the quality, quantity, and availability of the data, gaps in the data, and whether a lack of data compels the practitioner to change the business question.
• The objective of this step is to come up with a dataset to answer the
business question through the data science process.
1. Prior Knowledge
3- Data
The terminology used in the data science process
• A dataset (example set) is a collection of data
with a defined structure.
• Table 2.1 has a well-defined structure with 10
rows and 3 columns along with the column
headers. This structure is also sometimes
referred to as a “data frame”.
• A data point (record, object or example) is a
single instance in the dataset.
• Each row in the table is a data point.
1. Prior Knowledge
3- Data
The terminology used in the data science process
• Each instance contains the same structure as
the dataset.
• An attribute (feature, input, dimension,
variable, or predictor) is a single property of the
dataset.
• Each column in the table is an attribute.
• Attributes can be numeric, categorical, date-
time, text, or Boolean data types.
1. Prior Knowledge
3- Data
The terminology used in the data science process
• A label (class label, output, prediction, target, or
response) is the special attribute to be
predicted based on all the input attributes.
• In Table 2.1, the interest rate is the output
variable.
• Identifiers (PK) are special attributes that are
used for locating or providing context to
individual records. For example, common
attributes like names, account numbers, and
employee ID numbers are identifier attributes.
• In Table 2.1, the attribute ID is the identifier.
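As a concrete illustration of this terminology, the sketch below builds a small pandas data frame with the same structure as Table 2.1: an ID identifier, a credit score attribute, and an interest rate label. The values are invented for demonstration and are not the actual contents of the table.

```python
import pandas as pd

# Illustrative data frame mirroring the structure of Table 2.1:
# ID is the identifier, Credit Score an input attribute, and
# Interest Rate the label. Values are made up for demonstration.
df = pd.DataFrame({
    "ID": range(1, 11),
    "Credit Score": [500, 600, 700, 750, 800, 550, 650, 720, 780, 820],
    "Interest Rate": [9.5, 8.0, 6.5, 6.0, 5.5, 9.0, 7.5, 6.2, 5.8, 5.2],
})

print(df.shape)    # (10, 3): 10 data points (rows), 3 attributes (columns)
print(df.iloc[0])  # a single data point (record / example)
```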
2. Data Preparation
• Preparing the dataset to suit a data science task is the most time-consuming part of the process.
• It is extremely rare that datasets are available in the form required by the data science algorithms.
• Most data science algorithms require the data to be structured in a tabular format, with records in the rows and attributes in the columns.
• If the data is in any other format, it needs to be transformed by applying pivot, type conversion, join, or transpose functions, etc., to condition the data into the required structure, as in the sketch below.
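A minimal conditioning sketch, assuming the raw data arrive in a "long" key-value layout (the column names here are hypothetical): a pivot turns it into the record-per-row, attribute-per-column structure most algorithms expect.

```python
import pandas as pd

# Hypothetical "long" layout: one row per (record, attribute) pair.
long_df = pd.DataFrame({
    "ID":        [1, 1, 2, 2],
    "attribute": ["Credit Score", "Interest Rate", "Credit Score", "Interest Rate"],
    "value":     [700, 6.5, 550, 9.0],
})

# Pivot into the tabular structure most algorithms expect:
# one record per row, one attribute per column.
wide_df = long_df.pivot(index="ID", columns="attribute", values="value").reset_index()
print(wide_df)
```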
2. Data Preparation
1. Data Exploration
2. Data quality
3. Handling missing values
4. Data type conversion
5. Transformation
6. Outliers
7. Feature selection
8. Sampling
2. Data Preparation
1. Data Exploration
– Data exploration, also known as exploratory
data analysis, provides a set of simple tools to
achieve basic understanding of the data.
– Data exploration approaches involve computing
descriptive statistics and visualization of data.
– These approaches can expose the structure of
the data, the distribution of the values, the
presence of extreme values, and highlight the
inter-relationships within the dataset.
2. Data Preparation
1. Data Exploration
• Descriptive statistics like mean, median,
mode, standard deviation, and range for
each attribute provide an easily
readable summary of the key
characteristics of the distribution of
data.
• Fig. 2.3 shows the scatterplot of credit
score vs. loan interest rate and it can be
observed that as credit score
increases, interest rate decreases.
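A minimal exploration sketch, continuing with the illustrative data frame from the terminology example: describe() computes the descriptive statistics mentioned above, and a scatterplot (in the spirit of Fig. 2.3) exposes the relationship between credit score and interest rate.

```python
import matplotlib.pyplot as plt

# Descriptive statistics (count, mean, std, min/max, quartiles) per attribute.
print(df[["Credit Score", "Interest Rate"]].describe())

# Scatterplot of credit score vs. interest rate; a downward-sloping cloud
# means the interest rate decreases as the credit score increases.
df.plot.scatter(x="Credit Score", y="Interest Rate")
plt.show()
```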
2. Data Preparation
2- Data quality
• Data quality is an ongoing concern
wherever data is collected, processed,
and stored.
• In the interest rate dataset (Table 2.1),
how does one know if the credit score
and interest rate data are accurate?
What if a credit score has a recorded
value of 900 (beyond the theoretical limit)
or if there was a data entry error? Errors
in data will impact the
representativeness of the model.
2. Data Preparation
2- Data quality
• Organizations use data alerts, cleansing, and transformation
techniques to improve and manage the quality of the data and
store them in companywide repositories called data warehouses.
• Data sourced from well-maintained data warehouses have
higher quality, as there are proper controls in place to ensure a
level of data accuracy for new and existing data.
• The data cleansing practices include elimination of duplicate
records, quarantining outlier records that exceed the bounds,
standardization of attribute values, substitution of missing values,
etc.
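A minimal cleansing sketch on the illustrative data frame: duplicate records are removed and records whose credit score falls outside an assumed plausible range of 300 to 850 (such as the 900 mentioned earlier) are quarantined for review.

```python
# Remove exact duplicate records.
df = df.drop_duplicates()

# Quarantine records whose credit score exceeds the assumed bounds.
in_bounds = df["Credit Score"].between(300, 850)
quarantined = df[~in_bounds]   # set aside for manual review
df = df[in_bounds]
```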
2. Data Preparation
3- Handling missing values
• One of the most common data quality issues is that some records have missing
attribute values
• There are several different mitigation methods to deal with this problem, but each method
has pros and cons.
• The first step of managing missing values is to understand the reason behind why
the values are missing.
• Missing credit score values can be replaced with a credit score derived from the dataset
(mean, minimum, or maximum value, depending on the characteristics of the attribute).
• This method is useful if the missing values occur randomly and the frequency of occurrence
is quite rare.
• Alternatively, to build the representative model, all the data records with missing values or
records with poor data quality can be ignored. This method reduces the size of the
dataset.
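The two mitigation approaches described above might look like this in pandas, continuing with the illustrative data frame; which option is appropriate depends on why the values are missing.

```python
# Option 1: substitute missing credit scores with the attribute mean
# (reasonable when values are missing at random and only rarely).
df["Credit Score"] = df["Credit Score"].fillna(df["Credit Score"].mean())

# Option 2: ignore records that still contain missing values
# (simpler, but it reduces the size of the dataset).
df = df.dropna()
```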
2. Data Preparation
4- Data type conversion
• The attributes in a dataset can be of different types, such as
continuous numeric (interest rate), integer numeric (credit score), or
categorical. For example, the credit score can be expressed as
categorical values (poor, good, excellent) or numeric score.
• In the case of linear regression models, the input attributes have to be numeric. If the available data are categorical, they must be converted to continuous numeric attributes.
• Numeric values can be converted to categorical data types by a
technique called binning, where a range of values are specified for
each category, for example, a score between 400 and 500 can be
encoded as “low” and so on.
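A sketch of both directions of type conversion on the illustrative data frame; the cut points and band labels are assumptions chosen for demonstration.

```python
import pandas as pd

# Numeric -> categorical: bin credit scores into labelled ranges
# (the bin edges and labels are illustrative assumptions).
df["Score Band"] = pd.cut(df["Credit Score"],
                          bins=[300, 500, 700, 850],
                          labels=["low", "medium", "high"])

# Categorical -> numeric: one-hot encode the bands for algorithms,
# such as linear regression, that require numeric inputs.
df = pd.get_dummies(df, columns=["Score Band"])
```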
2. Data Preparation
5- Transformation
• In some data science algorithms like k-nearest neighbor (k-NN), the input
attributes are expected to be numeric and normalized, because the algorithm
compares the values of different attributes and calculates distance between
the data points.
• Normalization prevents one attribute from dominating the distance results because
of large values. For example, consider income (expressed in USD, in
thousands) and credit score (in hundreds). The distance calculation will
always be dominated by slight variations in income. One solution is to convert
the range of income and credit score to a more uniform scale from 0 to 1 by
normalization.
• This way, a consistent comparison can be made between the two different
attributes with different units.
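A minimal min-max normalization sketch; the income and credit score values below are hypothetical and stand in for the example described above.

```python
import pandas as pd

# Min-max normalization rescales each input attribute to the 0-1 range so
# that an attribute with large values (e.g., income in thousands of USD)
# cannot dominate the distance calculation in algorithms such as k-NN.
features = pd.DataFrame({
    "Income":       [40, 55, 80, 95, 120],      # thousands of USD (hypothetical)
    "Credit Score": [500, 600, 700, 750, 800],  # hundreds (hypothetical)
})
normalized = (features - features.min()) / (features.max() - features.min())
print(normalized)
```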
2. Data Preparation
6- Outliers
• Outliers are anomalies, i.e., abnormal observations, in a given dataset.
• Outliers may occur because of correct data capture (few people
with income in tens of millions) or erroneous data capture (human
height as 1.73 cm instead of 1.73 m).
• Regardless, the presence of outliers needs to be understood and
will require special treatments.
• Detecting outliers may be the primary purpose of some data
science applications, like fraud or intrusion detection.
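One common, simple way to flag potential outliers is the interquartile-range (IQR) rule sketched below on the illustrative data frame; the 1.5 x IQR multiplier is a conventional choice, not something prescribed by this lecture.

```python
# Flag values far outside the middle 50% of the attribute's distribution.
q1, q3 = df["Credit Score"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["Credit Score"] < q1 - 1.5 * iqr) |
              (df["Credit Score"] > q3 + 1.5 * iqr)]
print(outliers)   # records to inspect: correct extremes or data entry errors?
```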
2. Data Preparation
7- Feature selection
• The example dataset shown in Table 2.1 has one attribute or feature (the credit score) and one label (the interest rate). In practice, many data science problems involve a dataset with hundreds to thousands of attributes.
• A large number of attributes in the dataset significantly increases the complexity of a model and may degrade the performance of the model due to the curse of dimensionality.
• Not all the attributes are equally important or
useful in predicting the target.
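A small sketch of automated feature selection using scikit-learn's SelectKBest on synthetic data with many attributes; the scoring function and the choice of k are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic dataset: 100 records, 20 attributes, but only two of them
# actually drive the (numeric) target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.1, size=100)

# Keep the 5 attributes most strongly related to the target.
selector = SelectKBest(score_func=f_regression, k=5)
X_selected = selector.fit_transform(X, y)
print(selector.get_support(indices=True))  # indices of the retained attributes
```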
2. Data Preparation
8- Sampling
• Sampling is a process of selecting a subset of records as a
representation of the original dataset for use in data analysis or
modeling.
• The sample data serve as a representative of the original dataset with
similar properties, such as a similar mean.
• Sampling reduces the amount of data that need to be processed and
speeds up the build process of the modeling.
• In most cases, it is sufficient to work with samples to gain insights, extract information, and build representative predictive models.
• Theoretically, the error introduced by sampling impacts the relevancy of the model, but the benefits of sampling far outweigh the risks.
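A minimal sampling sketch with pandas; the 10% fraction is an arbitrary choice, and comparing means is a quick, not exhaustive, check of representativeness.

```python
# Draw a 10% random sample of the records; random_state makes it repeatable.
sample = df.sample(frac=0.1, random_state=42)

# Quick representativeness check: compare a summary statistic
# of the sample against the original dataset.
print(df["Interest Rate"].mean(), sample["Interest Rate"].mean())
```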
3. Modeling
• A model is the abstract representation of the data and the relationships in a given dataset.
• A simple rule of thumb like "mortgage interest rate reduces with increase in credit score" is a model; although there is not enough quantitative information to use in a production scenario, it provides directional information by abstracting the relationship between credit score and interest rate.
3. Modeling
• Fig. 2.4 shows the steps in the modeling phase of predictive data science.
• Association analysis and clustering are descriptive data science techniques where there is no target variable to predict; hence, there is no test dataset.
• However, both predictive and descriptive models have an evaluation step.
3. Modeling
Splitting training and test data sets
• The modeling step creates a representative
model inferred from the data. The dataset used
to create the model, with known attributes and
target, is called the training dataset.
• The validity of the created model will also need
to be checked with another known dataset
called the test dataset or validation dataset.
• To facilitate this process, the overall known
dataset can be split into a training dataset and
a test dataset.
• A standard rule of thumb is that two-thirds of the data are used as a training dataset and one-third as a test dataset, as illustrated in the sketch following the figure below.
3. Modeling
Splitting training and test data sets (figure: the known dataset partitioned into training data and test data).
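A sketch of the split, build, and evaluate cycle using scikit-learn on the illustrative credit score data frame; linear regression stands in for whichever algorithm is chosen, and test_size=1/3 follows the rule of thumb above.

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Split the known dataset: two-thirds for training, one-third for testing.
X = df[["Credit Score"]]   # input attribute(s)
y = df["Interest Rate"]    # label to be predicted
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=42)

# Build the model on the training data only ...
model = LinearRegression().fit(X_train, y_train)

# ... then check its validity on the unseen test data.
print(mean_absolute_error(y_test, model.predict(X_test)))
```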
4. Application
• In business applications, the results of the data science process have to be assimilated into the business process, usually in software applications.
• Deployment is the stage at which the model becomes production ready; a minimal persistence sketch follows the list below.
• The model deployment stage has to deal with:
1. Product readiness
2. Technical integration
3. Model response time
4. Remodeling
5. Assimilation
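A minimal persistence sketch, assuming the model fitted in the earlier modeling example: the trained model is saved with joblib and reloaded inside the deployed application so that new records can be scored without retraining. Loading the model once at startup also helps keep the response time low.

```python
import joblib

# Persist the trained model so the production application can reuse it
# without retraining; the file name is an arbitrary choice.
joblib.dump(model, "interest_rate_model.joblib")

# Inside the deployed application: load the model once and score new records.
deployed_model = joblib.load("interest_rate_model.joblib")
print(deployed_model.predict([[720]]))  # hypothetical new credit score
```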
5. Knowledge
• The data science process provides a framework to extract nontrivial information from data.
• To extract knowledge from these massive data assets, advanced approaches, like data science algorithms, need to be employed.
• The data science process starts with prior knowledge and ends with posterior knowledge, which is the incremental insight gained.
• The data science process can bring up spurious, irrelevant patterns from the dataset.
