Data Science Methodology
• Every customer request starts with a problem. A data scientist's job is
first to understand that problem and then to approach it with statistical
and machine-learning techniques.
Business Understanding
• Crucial stage
• Once we have found a way to solve the problem, we need to identify the
right data for our model.
Data Requirements
✓This is the stage where we identify the necessary data content, formats,
and sources for initial data collection. The data are then used by
the algorithm of the approach we chose.
Data Collection Stage
• The data scientists identify the available data resources relevant to
the problem domain.
• They check the type of each data item and learn more about the attributes
and their names.
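As a minimal sketch of checking attribute names and types, the snippet below reads a small CSV sample and reports an inferred type for each column. The sample data, the column names, and the helpers `infer_type` and `describe_columns` are hypothetical and only illustrate the idea; a real project would more likely rely on a data-frame library.

```python
import csv
import io

# Hypothetical sample standing in for a collected data source.
RAW = """name,age,height_m,subscribed
Ana,34,1.62,yes
Ben,41,1.80,no
"""

def infer_type(values):
    """Return 'int', 'float', or 'str' for a column of raw text values."""
    for caster, label in ((int, "int"), (float, "float")):
        try:
            for value in values:
                caster(value)
            return label
        except ValueError:
            continue  # at least one value does not fit; try the next type
    return "str"

def describe_columns(text):
    """Map each attribute (column) name to its inferred type."""
    rows = list(csv.DictReader(io.StringIO(text)))
    names = list(rows[0].keys()) if rows else []
    return {name: infer_type([row[name] for row in rows]) for name in names}

schema = describe_columns(RAW)
print(schema)
```

Inspecting names and types this early makes it easier to spot columns that will need conversion in the preparation stage.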
Data Preparation stage
• Data scientists prepare the data for modeling. This is one of the most crucial steps
because the data fed to the model have to be clean and free of errors.
• The data must be in the correct format for the machine learning algorithm we chose
in the analytic approach stage.
• The data frame has to have appropriate column names and unified boolean
values (yes/no or 1/0).
• Pay attention to the values in each column: the same value may be written with
different capitalization (for example, WaTeR and water). We can fix this by
converting all the values of a column to lowercase.
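The cleaning steps above (consistent column names, lowercase text values, unified booleans) can be sketched in plain Python as follows. The records, the column names, and the helpers `to_bool` and `clean` are hypothetical, and the choice of 1/0 as the unified boolean form is an assumption.

```python
# Hypothetical raw records; names and values are made up for illustration.
records = [
    {"Drink": "WaTeR", "Sparkling": "yes"},
    {"Drink": "water", "Sparkling": "No"},
    {"Drink": " Juice", "Sparkling": "0"},
]

TRUE_VALUES = {"yes", "1", "true"}
FALSE_VALUES = {"no", "0", "false"}

def to_bool(value):
    """Map the different spellings of a boolean onto 1 or 0."""
    text = str(value).strip().lower()
    if text in TRUE_VALUES:
        return 1
    if text in FALSE_VALUES:
        return 0
    raise ValueError(f"unrecognised boolean value: {value!r}")

def clean(record):
    """Return a record with lowercase column names and normalised values."""
    return {
        "drink": record["Drink"].strip().lower(),
        "sparkling": to_bool(record["Sparkling"]),
    }

cleaned = [clean(record) for record in records]
print(cleaned)
```

Raising an error on an unknown boolean spelling, rather than guessing, keeps bad values from slipping silently into the model.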
From Modeling to Evaluation
• Data Modeling: The data science team selects the appropriate modeling techniques
to analyze the data and build predictive models.
• This stage involves selecting the right algorithms, tuning the model parameters, and
validating the model.
• The model can be descriptive or predictive, depending on the analytic approach that was taken.
• Descriptive modeling is a mathematical process that describes real-world events and
the relationships between the factors responsible for them. For example, a descriptive
model might examine things like: if a person did this, then they are likely to prefer
that.
• Predictive modeling is a process that uses data mining and probability to forecast
outcomes. For example, a predictive model might be used to determine whether an
email is spam or not.
• For predictive modeling, data scientists use a training set, which is a set of historical
data in which the outcomes are already known. This step can be repeated several times
until the model understands the question and can answer it.
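To make the idea of a training set concrete, here is a minimal sketch of a predictive model for the spam example: a toy Naive Bayes classifier with add-one smoothing, trained on a handful of messages whose labels are already known. The messages, the labels "spam" and "ham", and the functions `train` and `predict` are assumptions for illustration, not a production approach.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training set: historical messages with known outcomes.
TRAINING = [
    ("win money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting at noon", "ham"),
    ("lunch at noon tomorrow", "ham"),
]

def train(examples):
    """Count words per label and how often each label occurs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    return word_counts, label_counts, vocab

def predict(model, text):
    """Return the label with the highest log-probability score."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # Log prior plus log likelihood with add-one (Laplace) smoothing.
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

model = train(TRAINING)
print(predict(model, "claim free money"))
print(predict(model, "meeting tomorrow at noon"))
```

Repeating the step, as described above, would mean retraining with more or better-labelled historical data until the predictions are good enough.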
Evaluation:
• The data science team evaluates the model’s performance and
its ability to solve the business problem.
• They use various evaluation metrics to determine the
effectiveness of the model and make improvements if
necessary.
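The slides do not name specific metrics, so as an assumption the sketch below computes three common ones for a binary classifier: accuracy, precision, and recall. The labels, the example predictions, and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Count true/false positives and negatives for one positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn

def evaluate(y_true, y_pred, positive="spam"):
    """Return accuracy, precision, and recall (0.0 when undefined)."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical known outcomes and model predictions.
y_true = ["spam", "spam", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "spam", "spam"]
metrics = evaluate(y_true, y_pred)
print(metrics)
```

Which metric matters most depends on the business problem: for a spam filter, for instance, a false positive (a real email marked as spam) may be more costly than a missed spam message.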
From Deployment to Feedback
• Data scientists have to familiarize the stakeholders with the tool they have
produced in different scenarios. Once the model has been evaluated and the
data scientist is confident it will work, it is deployed and put to the
ultimate test.
Deployment:
• The data science team deploys the model in the
✓production environment,
✓integrating it into the business processes, and
✓ensuring that it is working correctly.
• The Deployment stage depends on the purpose of the model, and it
may be rolled out to a limited group of users or in a test environment.
• A real case-study example is a model intended for the healthcare
system: the model can be deployed for some low-risk patients
and for high-risk patients too.
Feedback stage
• Feedback usually comes mostly from the customer.
• After the deployment stage, customers can say whether the model works for
their purposes or not.
• Data scientists take this feedback and decide whether they should improve
the model, because the process from modeling to feedback is highly
iterative.
• The data science team monitors the model's performance in
the production environment, making necessary changes and
improvements to ensure that it continues to work effectively.
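One simple way to monitor a deployed model, sketched here under assumptions, is to track accuracy over the most recent predictions and flag the model when it drops below a threshold. The `AccuracyMonitor` class, the window size, and the threshold are hypothetical, and the sketch assumes the true outcome eventually becomes known (for example, through customer feedback).

```python
from collections import deque

class AccuracyMonitor:
    """Track accuracy over the most recent `window` predictions."""

    def __init__(self, window=100, threshold=0.8):
        self.results = deque(maxlen=window)  # old results drop off automatically
        self.threshold = threshold

    def record(self, predicted, actual):
        """Store whether one production prediction was correct."""
        self.results.append(predicted == actual)

    def accuracy(self):
        """Return recent accuracy, or None if nothing has been recorded."""
        if not self.results:
            return None
        return sum(self.results) / len(self.results)

    def needs_attention(self):
        """Return True when recent accuracy falls below the threshold."""
        acc = self.accuracy()
        return acc is not None and acc < self.threshold

# Hypothetical stream of (predicted, actual) pairs from production.
monitor = AccuracyMonitor(window=4, threshold=0.75)
stream = [("spam", "spam"), ("ham", "ham"), ("spam", "ham"),
          ("ham", "spam"), ("ham", "spam")]
for predicted, actual in stream:
    monitor.record(predicted, actual)

print(monitor.accuracy(), monitor.needs_attention())
```

A flagged model would send the team back to the modeling or data preparation stages, which is what makes the process iterative.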
Summary
• Data science methodology involves a structured approach to
problem-solving using data, including:
✓understanding the business problem,
✓collecting and preparing the data,
✓modeling and evaluating the data,
✓deploying the model, and
✓monitoring and maintaining its performance.
• When the model meets all the requirements of the customer, our data
science project is complete.
Data science process
Data Management Plan
• Describes how research data are collected or created, how data are
used and stored during research, and how they are made accessible to others
after the research has been completed.
✓Briefly describe what kind of data will be collected and how they will
be collected
✓Outline the type of data (e.g., survey, interview, face-to-face, focus
group, etc.) and estimate the foreseeable amount and volume of each
data type
✓Describe any existing data you will reuse
Ethical Issues: copyright