Data Science Methodology

Uploaded by

Aathmika Vijay
Copyright
© All Rights Reserved
Data Science Methodology


• A data science methodology is the routine followed to find a solution to a specific problem.
• It is a cyclic process that guides business analysts and data scientists on how to act at each stage.
• It is a structured approach to solving complex problems using data.
Steps involved in a Data Science methodology
The steps are classified into 5 parts:
• From Problem to Approach
• From Requirements to Collection
• From Understanding to Preparation
• From Modeling to Evaluation
• From Deployment to Feedback
From Problem to Approach

• Every customer’s request starts with a problem. The data scientist’s first job is to understand it, and then to approach it with statistical and machine-learning techniques.
Business Understanding
• This is a crucial stage.
• The business problem is defined, and the objective of the analysis is identified.
• The data scientist asks the customer many questions about every aspect of the problem.
• The data science team should work closely with the business stakeholders to understand the problem and define the goals.
• At the end of this stage, the team should have a list of business requirements.
Analytical Approach
• Once the business problem has been clearly stated, the data scientist can define the analytic approach to solve it.
• This means expressing the problem in the context of statistical and machine-learning techniques, which helps identify what type of patterns will be needed to address the question most effectively.
✓ If the question is to determine the probabilities of something, use a predictive model.
✓ If the question is to show relationships, use a descriptive approach.
✓ If the problem requires counts, statistical analysis is the best way to solve it.
• For each type of approach, we use different algorithms.


From Requirements to Collection

• Once we have found a way to solve our problem, we need to find the right data for our model.
Data Requirements
✓ This is the stage where we identify the necessary data content, formats, and sources for initial data collection. These data are then fed to the algorithm of the approach we chose.
Data Collection Stage
• The data scientists identify the available data resources relevant to the problem domain.
• To retrieve data, we can do web scraping on a related website, or we can use a repository with premade datasets ready to use.
• Premade datasets are usually CSV or Excel files. To work with data collected from a website or repository, we can use Pandas, a Python library that is useful for loading, converting, and modifying datasets.
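As a minimal sketch of the collection step, the example below loads a small CSV dataset with Pandas. The sample rows and column names are hypothetical, and the text is read from memory with io.StringIO only so that the example runs without a file; in practice a file path or URL would be passed to pd.read_csv instead.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a premade CSV dataset.
# In practice, pass a file path or URL to pd.read_csv instead.
csv_text = """name,drink,likes_it
Ana,WaTeR,yes
Ben,water,no
Cid,Juice,yes
"""

# Read the CSV text into a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)          # (rows, columns)
print(list(df.columns))  # column names
```

For Excel files, Pandas offers pd.read_excel, which works in a similar way but needs an additional Excel reader package installed.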
Data collection
• Data collection is the process of gathering information from all the relevant sources to find a solution to the research problem.
• It helps to evaluate the outcome of the problem.
• Data collection methods allow a person to reach an answer to the relevant question.
• Most organizations use data collection methods to make assumptions about future probabilities and trends.
Data can be classified into two types:
• primary data
• secondary data
• The primary importance of data collection in any research or business process is that it helps to determine many important things about the company, particularly its performance.
• The data collection process plays an important role in all fields.
From Understanding to Preparation
• The data scientists use descriptive statistics and visualization techniques to understand the data better.
• Data scientists explore the dataset to understand its content, to determine whether additional data is necessary to fill any gaps, and to verify the quality of the data.
Data Understanding stage
• Data scientists try to understand more about the data collected in the previous stage.
• They check the type of each data field and learn more about the attributes and their names.
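The checks described above can be sketched with a few Pandas calls. This is an illustration only: the DataFrame, its column names, and its values are hypothetical.

```python
import pandas as pd

# Hypothetical sample data; column names and values are for illustration only.
df = pd.DataFrame(
    {
        "age": [25, 32, 47, 51],
        "city": ["Pune", "Delhi", "Pune", None],
        "income": [30000.0, 42000.0, 58000.0, 61000.0],
    }
)

# Type of each column.
print(df.dtypes)

# Descriptive statistics for the numeric columns.
print(df.describe())

# Missing values per column, a quick data-quality check.
missing = df.isna().sum()
print(missing)
```

A non-zero count in the missing-values output is one sign that additional data may be needed to fill gaps.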
Data Preparation stage
• Data scientists prepare the data for modeling. This is one of the most crucial steps, because the data given to the model have to be clean and without errors.
• They make sure the data are in the correct format for the machine learning algorithm chosen in the analytic approach stage.
• The data frame has to have appropriate column names and unified boolean values (yes/no or 1/0).
• Pay attention to the values in each column, because the same thing is sometimes written with different casing, for example "WaTeR" and "water". We can fix this by making all the values of a column lowercase.
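The preparation steps above (column names, unified boolean values, lowercase text) might look like this in Pandas. The raw table and the new column names are hypothetical, and coding the booleans as 1 and 0 is just one of the two options mentioned.

```python
import pandas as pd

# Hypothetical raw data with inconsistent casing and mixed boolean spellings.
raw = pd.DataFrame(
    {
        "Drink Name": ["WaTeR", "water", "Juice"],
        "Is Cold": ["yes", "No", "YES"],
    }
)

# Give the columns consistent, code-friendly names.
clean = raw.rename(columns={"Drink Name": "drink", "Is Cold": "is_cold"})

# Lowercase the text values so "WaTeR" and "water" become the same value.
clean["drink"] = clean["drink"].str.lower()

# Unify the boolean values to 1 and 0.
clean["is_cold"] = clean["is_cold"].str.lower().map({"yes": 1, "no": 0})

print(clean)
```

After this step the drink column holds two distinct values instead of three, because the two spellings of water have been merged.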
From Modeling to Evaluation
• Data Modeling: The data science team selects the appropriate modeling techniques to analyze the data and build predictive models.
• This stage involves selecting the right algorithms, tuning the model parameters, and validating the model.
• The model is either descriptive or predictive, depending on the analytic approach that was taken.
• Descriptive modeling is a mathematical process that describes real-world events and the relationships between the factors responsible for them. For example, a descriptive model might examine things like: if a person did this, then they’re likely to prefer that.
• Predictive modeling is a process that uses data mining and probability to forecast outcomes. For example, a predictive model might be used to determine whether an email is spam or not.
• For predictive modeling, data scientists use a training set, which is a set of historical data in which the outcomes are already known. This step can be repeated several times until the model understands the question and can answer it.
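To make the idea of a training set concrete, here is a deliberately simple sketch of a spam predictor in plain Python. It is not a method taken from these slides: the four training emails are hypothetical, and the word-count scoring is a toy stand-in for the algorithms a real project would take from a machine-learning library.

```python
from collections import Counter

# Hypothetical training set: historical emails whose outcome is already known.
training_set = [
    ("win a free prize now", "spam"),
    ("free money win big", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project report attached", "ham"),
]


def train(examples):
    """Count how often each word appears in each class."""
    counts = {"spam": Counter(), "ham": Counter()}
    for text, label in examples:
        counts[label].update(text.split())
    return counts


def predict(counts, text):
    """Label a message by the class that has seen its words more often."""
    words = text.split()
    spam_score = sum(counts["spam"][w] for w in words)
    ham_score = sum(counts["ham"][w] for w in words)
    return "spam" if spam_score > ham_score else "ham"


model = train(training_set)
print(predict(model, "win a free prize"))
print(predict(model, "monday project meeting"))
```

Repeating the step, as described above, would mean adding more labeled emails to training_set and training again.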
Evaluation:
• The data science team evaluates the model’s performance and
its ability to solve the business problem.
• They use various evaluation metrics to determine the
effectiveness of the model and make improvements if
necessary.
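The slides do not name specific evaluation metrics. As one possible illustration, the sketch below computes three common ones (accuracy, precision, and recall) for a binary classifier. The lists of actual and predicted labels are hypothetical.

```python
# Hypothetical known outcomes and model predictions (1 = spam, 0 = not spam).
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# Count the four outcome types.
tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)

accuracy = (tp + tn) / len(actual)  # share of all predictions that are correct
precision = tp / (tp + fp)          # share of predicted spam that is really spam
recall = tp / (tp + fn)             # share of real spam that was caught

print(accuracy, precision, recall)
```

Which metric matters most depends on the business problem, for example whether a missed spam email or a wrongly blocked email is the more costly error.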
From Deployment to Feedback

• Data scientists have to make the stakeholders familiar with the tool produced in different scenarios. Once the model is evaluated and the data scientist is confident it will work, it is deployed and put to the ultimate test.
Deployment:
• The data science team deploys the model in the
✓production environment,
✓integrating it into the business processes, and
✓ensuring that it is working correctly.
• The deployment stage depends on the purpose of the model, and it may be rolled out to a limited group of users or in a test environment.
• For example, a model destined for the healthcare system can be deployed for some low-risk patients and for some high-risk patients too.
Feedback stage
• Feedback usually comes mostly from the customer.
• After the deployment stage, customers can say whether the model works for their purposes or not.
• Data scientists take this feedback and decide if they should improve the model, because the process from modeling to feedback is highly iterative.
• The data science team monitors the model’s performance in the production environment, making necessary changes and improvements to ensure that it continues to work effectively.
Summary
• Data science methodology involves a structured approach to problem-solving using data, including:
✓ understanding the business problem,
✓ collecting and preparing the data,
✓ modeling and evaluating the data,
✓ deploying the model, and
✓ monitoring and maintaining its performance.

• When the model meets all the requirements of the customer, our data
science project is complete.
Data science process
Data Management Plan
• Describes how research data are collected or created, how data are used and stored during research, and how they are made accessible for others after the research has been completed.
✓ Briefly describe what kind of data will be collected and how they will be collected.
✓ Outline the type of data (e.g., survey, interview, face-to-face, focus group, etc.) and estimate the foreseeable amount and volume of each data type.
✓ Describe any existing data you will reuse.
Ethical Issues: copyrights
• Who owns the copyright, intellectual property rights, and management rights to the data?
• Who has the right to grant access to the data?
• What procedures are used to inform research participants?
Processing Quantitative Data Files
