AI Class 12
INTRODUCTION TO DATA SCIENCE METHODOLOGY
A data scientist is someone who studies data to find useful
information.
They use math, coding, and AI to understand patterns and make
predictions.
For example, a data scientist can:
✅ Help a company recommend products (like Netflix suggesting
movies)
✅ Predict the weather using past data
✅ Detect fraud in banking transactions
They work with large amounts of data and computing tools to solve real-world problems.
A methodology gives the data scientist a framework for designing an AI project.
The framework helps the team decide on the methods, processes and strategies that will be employed to obtain the correct output from the AI project.
It is the best way to organize the entire project and complete it systematically without losing time or exceeding cost.
Data Science Methodology is a process with a prescribed sequence of iterative steps that
data scientists follow to approach a problem and find a solution.
The Data Science Methodology was proposed by John Rollins, a Data Scientist at IBM Analytics.
It consists of 10 steps.
This methodology provides a deep insight on how every AI project can be solved from
beginning to end.
There are five modules, each going through two stages of the methodology, explaining the
rationale as to why each stage is required.
1. From Problem to Approach
2. From Requirements to Collection
3. From Understanding to Preparation
4. From Modelling to Evaluation
5. From Deployment to Feedback
1. From Problem to Approach
This is also known as Problem Scoping and Definition.
Key Pointers :
✅ Understand the Problem – Ask questions to know the customer’s exact needs.
✅ Define Objectives – Identify goals that align with the customer’s requirements.
✅ Use 5W1H Canvas – Analyze the issue deeply (Who, What, When, Where, Why, How).
✅ Apply Design Thinking (DT) – Focus on user-centric problem-solving.
✅ Engage Stakeholders – Discuss with everyone involved to gather insights.
✅ List Business Needs – Document all requirements for a clear action plan.
When the business problem has been established clearly, the data scientist will be able to
define the analytical approach to solve the problem.
Key Pointers :
✅ Clarify the Problem – Ask stakeholders detailed questions to choose the best AI approach.
✅ Choose the Right Method – Decide based on the type of problem:
• Regression → Finding "how much" or "how many".
• Classification → Identifying categories.
• Clustering → Grouping similar data.
• Anomaly Detection → Spotting unusual patterns.
• Recommendation → Suggesting options to users.
✅ Types of Data Analytics:
Descriptive Analytics
It summarizes past data to show trends and patterns using graphs, charts, and statistics (mean,
median, mode, etc.).
It also checks data spread with range, variance, and standard deviation.
🔹 Example: Finding the average marks of students or analyzing last year’s sales.
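The descriptive measures above can be computed with Python's built-in `statistics` module. A minimal sketch, using a hypothetical list of student marks:

```python
import statistics

# Hypothetical marks of ten students (assumed data, for illustration only)
marks = [56, 72, 72, 61, 85, 90, 47, 72, 66, 79]

mean = statistics.mean(marks)           # average marks
median = statistics.median(marks)       # middle value when sorted
mode = statistics.mode(marks)           # most frequent value
spread = max(marks) - min(marks)        # range
variance = statistics.pvariance(marks)  # population variance
std_dev = statistics.pstdev(marks)      # population standard deviation

print("Mean:", mean, "Median:", median, "Mode:", mode)
print("Range:", spread, "Variance:", variance, "Std Dev:", round(std_dev, 2))
```

Here the mean is 70, the median and mode are both 72, and the range is 43, summarizing past data without predicting anything, which is exactly what descriptive analytics does.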
Diagnostic Analytics: It explains why something happened by analyzing past data using root
cause analysis, hypothesis testing, and correlation analysis.
The main purpose is to identify the causes or factors that led to a certain outcome.
🔹 Example: If a company's sales drop, diagnostic analytics finds out why—like poor
customer service or low product quality.
Predictive Analytics: This uses the past data to make predictions about future events or trends,
using techniques like regression, classification, clustering etc.
Its main purpose is to foresee future outcomes and make informed decisions.
For example: A company can use predictive analytics to forecast its sales, demand, inventory,
customer purchase pattern etc., based on previous sales data.
Prescriptive Analytics: This recommends the action to be taken to achieve the desired
outcome, using techniques such as optimization, simulation, decision analysis etc.
Its main purpose is to guide decisions by suggesting the best course of action based on data
analysis.
For example: To design the right strategy to increase the sales during festival season by
analyzing past data and thus optimize pricing, marketing, production etc.
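The pricing idea above can be sketched as a tiny simulation: try every discount and pick the one that maximizes revenue. The demand model below (every 1% discount lifts demand by 3%) is a pure assumption for illustration, not a real business rule:

```python
# Prescriptive sketch: choose the discount that maximizes festival revenue
# under a hypothetical demand model (all numbers are assumptions).
base_price = 100.0
base_demand = 500  # units sold at full price (assumed)

def revenue(discount_pct):
    price = base_price * (1 - discount_pct / 100)
    # Assumption: each 1% discount increases demand by 3%
    demand = base_demand * (1 + 0.03 * discount_pct)
    return price * demand

# Simulate discounts from 0% to 50% and pick the best one
best = max(range(0, 51), key=revenue)
print("Best discount:", best, "% giving revenue", round(revenue(best), 2))
```

Under these assumed numbers the simulation recommends a 33% discount; prescriptive analytics is this "recommend an action" step, whatever optimization technique is used.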
Q1. Which type of analytics questioning is being utilized here?
A) Descriptive Analytics B) Diagnostic Analytics C) Predictive Analytics D) Prescriptive Analytics
Q2. What type of approach is chosen here?
Answer: Classification Approach
Q3. Which algorithm is depicted in the figure given here?
Answer: Decision Tree
2. From Requirements to Collection
The 5W1H questioning method can be employed in this stage also to determine the data requirements. It is
necessary to identify the data content, formats, and sources for initial data collection, in this stage.
This stage defines what data is needed, its format, source, and preprocessing to make it usable.
🔹 Types of Data:
✅ Structured Data → Organized (e.g., customer databases)
✅ Unstructured Data → No fixed structure (e.g., social media posts, images)
✅ Semi-Structured Data → Partially organized (e.g., emails, XML files)
Understanding data types helps in efficient collection and management for project success!
Data collection is a systematic process of gathering observations or measurements.
There are mainly two sources of data collection:
→ Primary Data Source: A primary data source refers to the original source of data, where
the data is collected firsthand through direct observation, experimentation, surveys,
interviews, or other methods.
This data is raw, unprocessed, and unbiased, providing the most accurate and reliable
information for research, analysis, or decision-making purposes.
Examples include marketing campaigns, feedback forms, IoT sensor data, etc.
→ Secondary Data Source: A secondary data source refers to data that is already stored
and ready for use. Data given in books, journals, websites, internal transactional databases,
etc. can be reused for data analysis.
Some methods of collecting secondary data are social media data tracking, web scraping,
and satellite data tracking. Some sources of online data are data.gov, World Bank Open Data,
UNICEF, Open Data Network, Kaggle, the World Health Organization, Google, etc. Smart forms
are an easy way to procure data online.
The Data Collection stage may be revisited after the Data Understanding stage, where gaps in the data are identified, and
strategies are developed to either collect additional data or make substitutions to ensure data completeness.
3. From Understanding to Preparation
Data Understanding encompasses all activities related to constructing the dataset. In this
stage, we check whether the collected data actually represents the problem to be solved.
🔹 Key Activities:
✅ Check Data Relevance – Does it match the problem?
✅ Assess Data Quality – Is it complete and accurate?
✅ Use Descriptive Statistics – Methods like univariate analysis and correlation.
✅ Visualize Data – Graphs like histograms help in understanding patterns.
This step helps in identifying any gaps or errors before moving to data preparation!
a. Basic ingredients of sushi are rice, soy sauce, wasabi and vegetables. Is the dish listed in
the data? Are all ingredients available?
Yes, the dish is listed, but vegetables and soy sauce are not available in the data.
b. Find out the ingredients for the dish “Pulao”. Check for invalid data or missing data.
Common ingredients for Pulao are rice, vegetables, oil, garlic, soy sauce, salt, chilli and
onion. Here, garlic is not found in the data.
c. Inspect all columns for invalid, incorrect or missing data and list them below.
Invalid: Salt (Y), Oil (N), Sugar (N)
Incorrect: Rice (one), Chicken (2)
Missing data: Potato, Soy sauce
d. Which ingredient is common to all dishes? Which ingredient is not used for any dish?
Common: Rice; Not used: Potato
This stage covers all the activities to build the set of data that will be used in the modelling
step. Data is transformed into a state where it is easier to work with.
Data preparation includes:
● cleaning data (dealing with invalid or missing values, removing duplicate values and
assigning a suitable format)
● combining data from multiple sources (archives, tables and platforms)
● transforming data into meaningful input variables
Feature Engineering is a part of Data Preparation. Data preparation is the most
time-consuming step among the Data Science stages.
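The cleaning steps listed above can be sketched in plain Python. The records below are hypothetical, chosen to show a missing value, a duplicate and an inconsistent format:

```python
# Hypothetical raw records: (name, marks) with missing values,
# duplicates and inconsistent formatting (assumed data for illustration).
raw = [
    ("Asha", "85"),
    ("Ravi", None),     # missing value -> will be dropped
    ("Asha", "85"),     # duplicate -> will be removed
    ("meena", " 72 "),  # inconsistent case and stray spaces -> will be fixed
]

cleaned = []
seen = set()
for name, marks in raw:
    if marks is None:                # deal with missing values
        continue
    name = name.strip().title()      # assign a suitable, consistent format
    marks = int(str(marks).strip())  # convert text to a usable number
    if (name, marks) in seen:        # remove duplicate values
        continue
    seen.add((name, marks))
    cleaned.append((name, marks))

print(cleaned)  # [('Asha', 85), ('Meena', 72)]
```

Real projects would typically use a library such as pandas for this, but the logic (drop missing, standardize format, deduplicate) is the same.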
4. From Modelling to Evaluation
Modeling Process (Simplified):
🔹 Iterative Process → Models are tested and refined multiple times.
🔹 Data Adjustments → Data may need modifications for better accuracy.
🔹 Algorithm Selection → Different machine learning models are tested to find the best
one.
💡 Goal: Choose the most effective model for solving the problem in the Capstone Project
Data Modelling focuses on developing models that are either descriptive or predictive.
1. Descriptive Modelling 📊
✅ Goal: Summarizes and explains data without making predictions.
✅ Use: Helps understand trends, patterns, and key characteristics in a dataset.
✅ Techniques:
•Summary Statistics → Mean, median, mode, standard deviation, range.
•Visualizations → Bar charts, histograms, pie charts, box plots, scatter plots.
💡 Example: Analyzing student scores to find the average and distribution but not
predicting future scores.
2. Predictive Modelling 🔮
✅ Goal: Uses past data to make predictions about future outcomes.
✅ Use: Identifies patterns and forecasts trends using statistical models.
✅ Techniques:
•Regression → Predicts numerical values (e.g., future sales).
•Classification → Categorizes data (e.g., spam or not spam).
•Time-Series Forecasting → Predicts future trends (e.g., weather, stock prices).
💡 Example: Using past exam scores to predict future student performance.
🔹 Key Takeaway:
•Descriptive modeling explains what happened 📈.
•Predictive modeling forecasts what might happen next 🔮. 🚀
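The predictive-modelling example (using past exam scores to predict future performance) can be sketched as a least-squares line fit in pure Python. All scores here are hypothetical:

```python
# Hypothetical past data: x = practice-test score, y = final exam score
x = [40, 50, 60, 70, 80]
y = [45, 55, 62, 71, 82]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares regression line: y ≈ a*x + b
a = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / \
    sum((xi - mean_x) ** 2 for xi in x)
b = mean_y - a * mean_x

# Predict the final score for a new practice score of 65 (illustrative)
predicted = a * 65 + b
print("slope:", a, "intercept:", b, "prediction:", predicted)
```

For this data the fitted line is y = 0.9x + 9, so a practice score of 65 predicts a final score of 67.5. This is the "forecast what might happen next" half of the table above; the summary statistics computed earlier are the descriptive half.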
Evaluation in an AI project cycle is the process of assessing how well a model performs after
training. This helps determine if the model is reliable and effective before deploying it in
real-world situations. Model evaluation has two main phases.
1. Diagnostic Measures ️
✅ Purpose: Checks if the model is working as expected.
✅ How?
•For Predictive Models → Use decision trees to see if predictions align with expected results.
•For Descriptive Models → Test relationships using a dataset with known outcomes.
💡 Example: If a predictive model forecasts student grades, we compare it with actual past scores
to verify accuracy.
2. Statistical Significance Test 📊
✅ Purpose: Ensures the model is processing data correctly and its results are not due to random
chance.
✅ How? Apply statistical tests (like p-value, confidence intervals) to confirm reliability.
💡 Example: A model predicting exam scores should show a real pattern, not just random guesses.
🔹 Why is this important?
•Phase 1 checks if the model needs improvement ️.
•Phase 2 confirms if the model is trustworthy ✅.
🚀 Final Goal: A well-tested, accurate AI model ready for real-world use!
5. From Deployment to Feedback
Deployment is the final stage where the AI model is made available for real-world use.
✅ Key Steps:
1.Make stakeholders familiar with how the AI model works.
2.Test the model in a small, controlled environment before full release.
3.Roll out gradually to ensure smooth integration into business processes.
4.Collaborate with internal teams (IT, software engineers, etc.) for seamless deployment.
💡 Example: A customer support chatbot is tested with a few users first before launching for
all customers.
🚀 Final Goal: Ensure the AI model works effectively in real-world scenarios!
Feedback
✅ What is it?
The final step where user feedback and real-world results help improve the AI model.
✅ Why is it important?
•Helps refine the model for better accuracy and performance.
•Identifies issues and areas for improvement.
•Ensures the model is useful and effective for stakeholders.
✅ How is feedback collected?
1.User reviews & client feedback (direct input from those using the model).
2.Performance monitoring (checking how well the model is working).
3.Automated feedback systems (AI adjusts itself based on new data).
💡 Example: If a chatbot struggles to answer customer queries, feedback helps train it better
for improved responses.
🔄 Final Goal: A continuously improving AI model that meets user needs! 🚀
2.2. MODEL VALIDATION
Evaluating the performance of a trained machine learning model is essential. Model
Validation offers a systematic approach to measure its accuracy and reliability, providing
insights into how well it generalizes to new, unseen data.
Model validation is the step conducted after Model Training, wherein the effectiveness of the
trained model is assessed using a validation testing dataset.
Why is it important?
✔ Improves model quality
✔ Reduces errors
✔ Prevents overfitting (model too specific to training data)
✔ Prevents underfitting (model too simple to capture patterns)
Model Validation Techniques
The commonly used validation techniques are Train-Test Split, K-Fold Cross-Validation,
Leave-One-Out Cross-Validation, Time Series Cross-Validation, etc. Let us discuss the
Train-Test Split and K-Fold Cross-Validation.
Train-Test Split (Simplified) 🎯
✅ What is it?
The train-test split is a technique for evaluating the performance of a machine learning
algorithm. It can be used for classification or regression problems, and for any supervised
learning algorithm.
Training Set – Used to fit the machine learning model.
Testing Set – Used to test the model’s accuracy
✅ Why use it?
✔ Helps estimate model performance on new data
✔ Prevents overfitting (memorizing training data)
✔ Ensures the model generalizes well to unseen data
✅ How to split the data?
The train-test procedure is appropriate when there
is a sufficiently large dataset available.
The dataset is divided into a training and test set. Common split ratios:
🔹 80% Train – 20% Test
🔹 70% Train – 30% Test
🔹 67% Train – 33% Test
✅ How to choose the split?
🔸 Consider computational cost (more training = longer processing)
🔸 Ensure the train & test sets represent the full dataset
🔸 Larger datasets can use smaller test sizes (e.g., 80/20 split)
💡 Example:
Imagine a student preparing for an exam. They study 80% of the syllabus and test themselves on 20% of the
syllabus to check their understanding!
from sklearn.model_selection import train_test_split
# Sample data (features & labels)
X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] # Features
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1] # Labels
# Splitting data (80% Train, 20% Test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Output the split data
print("Training Features:", X_train)
print("Testing Features:", X_test)
print("Training Labels:", y_train)
print("Testing Labels:", y_test)
K-Fold Cross-Validation is a great technique to evaluate model performance more reliably.
How It Works:
1.The dataset is split into k equal-sized subsets (folds).
2.The model is trained on k-1 folds and tested on the remaining 1 fold.
3.This process is repeated k times, with each fold serving as the test set once.
4.The final model performance is the average score across all k runs.
Example: k=5
•If you have 100 data points, you split them into 5 folds (each fold has 20 data points).
•Iteration 1: Train on Folds 2,3,4,5, test on Fold 1.
•Iteration 2: Train on Folds 1,3,4,5, test on Fold 2.
•Iteration 3: Train on Folds 1,2,4,5, test on Fold 3.
•Iteration 4: Train on Folds 1,2,3,5, test on Fold 4.
•Iteration 5: Train on Folds 1,2,3,4, test on Fold 5.
•Final result = average of all 5 test performances.
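The five iterations above can be sketched in plain Python. The 100 data points are represented by their indices, and the "train and evaluate" step is left as a comment:

```python
# Sketch of the k-fold rotation for k = 5 over 100 data points.
data = list(range(100))  # stand-in for 100 samples
k = 5
fold_size = len(data) // k  # 20 points per fold

for i in range(k):
    # Fold i is the test set; the remaining k-1 folds form the training set
    test_fold = data[i * fold_size:(i + 1) * fold_size]
    train_folds = data[:i * fold_size] + data[(i + 1) * fold_size:]
    # ... train the model on train_folds, evaluate on test_fold ...
    print(f"Iteration {i + 1}: train={len(train_folds)}, test={len(test_fold)}")
```

Each iteration trains on 80 points and tests on 20, and every point is used for testing exactly once. In practice, scikit-learn's `KFold` or `cross_val_score` handles this rotation (including shuffling) for you.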
Advantages:
✔ More reliable performance estimates
✔ Uses the full dataset for training & testing
✔ Reduces overfitting compared to a simple train-test split
2.3. MODEL PERFORMANCE - EVALUATION METRICS
Evaluation metrics help assess the performance of a trained model on a test dataset,
providing insights into its strengths and weaknesses.
2.3.1 Evaluation Metrics for Classification
1. Confusion Matrix
A confusion matrix is a table that compares predicted labels with actual labels, recording the
counts of true positives (TP), true negatives (TN), false positives (FP) and false negatives
(FN). Metrics such as accuracy, precision and recall are derived from it.
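The four counts of a binary confusion matrix can be tallied by hand. The actual and predicted labels below are hypothetical:

```python
# Hypothetical actual vs predicted labels (1 = positive, 0 = negative)
actual    = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)  # true positives
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)  # true negatives
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)  # false positives
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(actual)
print(f"TP={tp} TN={tn} FP={fp} FN={fn} Accuracy={accuracy}")
```

For these labels the matrix is TP=4, TN=4, FP=1, FN=1, giving an accuracy of 0.8. Libraries such as scikit-learn provide `confusion_matrix` to do the same tally.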
2.3.2 Evaluation Metrics for Regression
1. MAE: Mean Absolute Error is the average of the absolute differences between predictions
and actual values. A value of 0 indicates no error, i.e., perfect predictions.
2. MSE: Mean Square Error is the most commonly used metric to evaluate the performance of a
regression model. MSE is the mean (average) of the squared distances between our target
variable and the predicted values.
3. RMSE: Root Mean Square Error is the standard deviation of the residuals (prediction
errors). RMSE is often preferred over MSE because it is easier to interpret, since it is in
the same units as the target variable.
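All three metrics can be computed directly from their definitions. The actual and predicted values below are hypothetical:

```python
import math

# Hypothetical actual vs predicted values from a regression model
actual    = [3.0, 5.0, 2.0, 8.0]
predicted = [2.5, 5.0, 3.0, 7.5]

errors = [a - p for a, p in zip(actual, predicted)]  # residuals
mae  = sum(abs(e) for e in errors) / len(errors)     # Mean Absolute Error
mse  = sum(e ** 2 for e in errors) / len(errors)     # Mean Square Error
rmse = math.sqrt(mse)                                # Root Mean Square Error

print("MAE:", mae, "MSE:", mse, "RMSE:", round(rmse, 4))
```

Here MAE = 0.5 and MSE = 0.375, while RMSE ≈ 0.612 is back in the same units as the target variable, which is why it is often the easier metric to interpret.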