AI Class 12
INTRODUCTION TO DATA SCIENCE METHODOLOGY
A data scientist is someone who studies data to find useful
information.
They use math, coding, and AI to understand patterns and make
predictions.
For example, a data scientist can:
✅ Help a company recommend products (like Netflix suggesting
movies)
✅ Predict the weather using past data
✅ Detect fraud in banking transactions
They work with large amounts of data and computing tools to solve real-world problems.
A methodology gives the data scientist a framework for designing an AI project.
The framework helps the team decide on the methods, processes and strategies that will be employed to obtain the correct output from the AI project.
It is the best way to organize the entire project and complete it systematically without losing time or exceeding cost.
Data Science Methodology is a process with a prescribed sequence of iterative steps that
data scientists follow to approach a problem and find a solution.
The Data Science Methodology was proposed by John Rollins, a Data Scientist at IBM Analytics.
It consists of 10 steps.
This methodology provides a deep insight on how every AI project can be solved from
beginning to end.
There are five modules, each going through two stages of the methodology, explaining the
rationale as to why each stage is required.
1. From Problem to Approach
2. From Requirements to Collection
3. From Understanding to Preparation
4. From Modelling to Evaluation
5. From Deployment to Feedback
1. From Problem to Approach
This is also known as Problem Scoping and Definition.
Key Pointers :
✅ Understand the Problem – Ask questions to know the customer’s exact needs.
✅ Define Objectives – Identify goals that align with the customer’s requirements.
✅ Use 5W1H Canvas – Analyze the issue deeply (Who, What, When, Where, Why, How).
✅ Apply Design Thinking (DT) – Focus on user-centric problem-solving.
✅ Engage Stakeholders – Discuss with everyone involved to gather insights.
✅ List Business Needs – Document all requirements for a clear action plan.
When the business problem has been established clearly, the data scientist will be able to
define the analytical approach to solve the problem.
Key Pointers :
✅ Clarify the Problem – Ask stakeholders detailed questions to choose the best AI approach.
✅ Choose the Right Method – Decide based on the type of problem:
• Regression → Finding "how much" or "how many".
• Classification → Identifying categories.
• Clustering → Grouping similar data.
• Anomaly Detection → Spotting unusual patterns.
• Recommendation → Suggesting options to users.
✅ Types of Data Analytics:
Descriptive Analytics
It summarizes past data to show trends and patterns using graphs, charts, and statistics (mean,
median, mode, etc.).
It also checks data spread with range, variance, and standard deviation.
🔹 Example: Finding the average marks of students or analyzing last year’s sales.
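The descriptive measures above can be computed with Python's built-in `statistics` module. A minimal sketch, using a hypothetical list of student marks:

```python
import statistics

# Hypothetical marks of ten students (assumed data, for illustration only)
marks = [56, 72, 72, 61, 85, 90, 47, 72, 66, 79]

mean = statistics.mean(marks)           # average marks
median = statistics.median(marks)       # middle value when sorted
mode = statistics.mode(marks)           # most frequent value
spread = max(marks) - min(marks)        # range
variance = statistics.pvariance(marks)  # population variance
std_dev = statistics.pstdev(marks)      # population standard deviation

print("Mean:", mean, "Median:", median, "Mode:", mode)
print("Range:", spread, "Variance:", variance, "Std Dev:", round(std_dev, 2))
```

Here the mean is 70, the median and mode are both 72, and the range is 43, summarizing past data without predicting anything, which is exactly what descriptive analytics does.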
Diagnostic Analytics: It explains why something happened by analyzing past data using root
cause analysis, hypothesis testing, and correlation analysis.
The main purpose is to identify the causes or factors that led to a certain outcome.
🔹 Example: If a company's sales drop, diagnostic analytics finds out why—like poor
customer service or low product quality.
Predictive Analytics: This uses the past data to make predictions about future events or trends,
using techniques like regression, classification, clustering etc.
Its main purpose is to foresee future outcomes and make informed decisions.
For example: A company can use predictive analytics to forecast its sales, demand, inventory,
customer purchase pattern etc., based on previous sales data.
Prescriptive Analytics: This recommends the action to be taken to achieve the desired
outcome, using techniques such as optimization, simulation, decision analysis etc.
Its main purpose is to guide decisions by suggesting the best course of action based on data
analysis.
For example: To design the right strategy to increase the sales during festival season by
analyzing past data and thus optimize pricing, marketing, production etc.
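The pricing idea above can be sketched as a tiny simulation: try every discount and pick the one that maximizes revenue. The demand model below (every 1% discount lifts demand by 3%) is a pure assumption for illustration, not a real business rule:

```python
# Prescriptive sketch: choose the discount that maximizes festival revenue
# under a hypothetical demand model (all numbers are assumptions).
base_price = 100.0
base_demand = 500  # units sold at full price (assumed)

def revenue(discount_pct):
    price = base_price * (1 - discount_pct / 100)
    # Assumption: each 1% discount increases demand by 3%
    demand = base_demand * (1 + 0.03 * discount_pct)
    return price * demand

# Simulate discounts from 0% to 50% and pick the best one
best = max(range(0, 51), key=revenue)
print("Best discount:", best, "% giving revenue", round(revenue(best), 2))
```

Under these assumed numbers the simulation recommends a 33% discount; prescriptive analytics is this "recommend an action" step, whatever optimization technique is used.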
Q1. Which type of analytics questioning is being utilized here?
A) Descriptive Analytics B) Diagnostic Analytics C) Predictive Analytics D) Prescriptive Analytics
Q2. What type of approach is chosen here?
Answer: Classification Approach
Q3. Which algorithm is depicted in the figure given here?
Answer: Decision Tree
2. From Requirements to Collection
The 5W1H questioning method can be employed in this stage also to determine the data requirements. It is
necessary to identify the data content, formats, and sources for initial data collection, in this stage.
This stage defines what data is needed, its format, source, and preprocessing to make it usable.
🔹 Types of Data:
✅ Structured Data → Organized (e.g., customer databases)
✅ Unstructured Data → No fixed structure (e.g., social media posts, images)
✅ Semi-Structured Data → Partially organized (e.g., emails, XML files)
Understanding data types helps in efficient collection and management for project success!
Data collection is a systematic process of gathering observations or measurements.
There are mainly two sources of data collection:
→ Primary Data Source: A primary data source refers to the original source of data, where
the data is collected firsthand through direct observation, experimentation, surveys,
interviews, or other methods.
This data is raw, unprocessed, and unbiased, providing the most accurate and reliable
information for research, analysis, or decision-making purposes.
Examples include marketing campaigns, feedback forms, IoT sensor data, etc.
→ Secondary Data Source: A secondary data source refers to data that is already stored
and ready for use. Data given in books, journals, websites, internal transactional databases,
etc. can be reused for data analysis.
Some methods of collecting secondary data are social media data tracking, web scraping,
and satellite data tracking. Some sources of online data are data.gov, World Bank Open Data,
UNICEF, Open Data Network, Kaggle, the World Health Organization, Google, etc. Smart forms
are an easy way to procure data online.
The Data Collection stage may be revisited after the Data Understanding stage, where gaps in the data are identified, and
strategies are developed to either collect additional data or make substitutions to ensure data completeness.
3. From Understanding to Preparation
Data Understanding encompasses all activities related to constructing the dataset. In this
stage, we check whether the collected data actually represents the problem to be solved.
🔹 Key Activities:
✅ Check Data Relevance – Does it match the problem?
✅ Assess Data Quality – Is it complete and accurate?
✅ Use Descriptive Statistics – Methods like univariate analysis and correlation.
✅ Visualize Data – Graphs like histograms help in understanding patterns.
This step helps in identifying any gaps or errors before moving to data preparation!
a. Basic ingredients of sushi are rice, soy sauce, wasabi and vegetables. Is the dish listed in
the data? Are all ingredients available?
Yes, the dish is listed, but vegetables and soy sauce are not available in the data.
b. Find out the ingredients for the dish “Pulao”. Check for invalid data or missing data.
Common ingredients for Pulao are rice, vegetables, oil, garlic, soy sauce, salt, chilli and
onion. Here, garlic is not found in the data.
c. Inspect all columns for invalid, incorrect or missing data and list them below.
Invalid: Salt (Y), Oil (N), Sugar (N)
Incorrect: Rice (one), Chicken (2)
Missing data: Potato, Soy sauce
d. Which ingredient is common to all dishes? Which ingredient is not used for any dish?
Common: Rice; Not used: Potato
This stage covers all the activities to build the set of data that will be used in the modelling
step. Data is transformed into a state where it is easier to work with.
Data preparation includes:
● cleaning data (dealing with invalid or missing values, removing duplicate values and
assigning a suitable format)
● combining data from multiple sources (archives, tables and platforms)
● transforming data into meaningful input variables
Feature Engineering is a part of Data Preparation. Data preparation is the most
time-consuming step among the Data Science stages.
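The cleaning steps listed above can be sketched in plain Python. The records below are hypothetical, chosen to show a missing value, a duplicate and an inconsistent format:

```python
# Hypothetical raw records: (name, marks) with missing values,
# duplicates and inconsistent formatting (assumed data for illustration).
raw = [
    ("Asha", "85"),
    ("Ravi", None),     # missing value -> will be dropped
    ("Asha", "85"),     # duplicate -> will be removed
    ("meena", " 72 "),  # inconsistent case and stray spaces -> will be fixed
]

cleaned = []
seen = set()
for name, marks in raw:
    if marks is None:                # deal with missing values
        continue
    name = name.strip().title()      # assign a suitable, consistent format
    marks = int(str(marks).strip())  # convert text to a usable number
    if (name, marks) in seen:        # remove duplicate values
        continue
    seen.add((name, marks))
    cleaned.append((name, marks))

print(cleaned)  # [('Asha', 85), ('Meena', 72)]
```

Real projects would typically use a library such as pandas for this, but the logic (drop missing, standardize format, deduplicate) is the same.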
4. From Modelling to Evaluation
Modeling Process (Simplified):
🔹 Iterative Process → Models are tested and refined multiple times.
🔹 Data Adjustments → Data may need modifications for better accuracy.
🔹 Algorithm Selection → Different machine learning models are tested to find the best
one.
💡 Goal: Choose the most effective model for solving the problem in the Capstone Project
Data Modelling focuses on developing models that are either descriptive or predictive.
1. Descriptive Modelling 📊
✅ Goal: Summarizes and explains data without making predictions.
✅ Use: Helps understand trends, patterns, and key characteristics in a dataset.
✅ Techniques:
•Summary Statistics → Mean, median, mode, standard deviation, range.
•Visualizations → Bar charts, histograms, pie charts, box plots, scatter plots.
💡 Example: Analyzing student scores to find the average and distribution but not
predicting future scores.
2. Predictive Modelling 🔮
✅ Goal: Uses past data to make predictions about future outcomes.
✅ Use: Identifies patterns and forecasts trends using statistical models.
✅ Techniques:
•Regression → Predicts numerical values (e.g., future sales).
•Classification → Categorizes data (e.g., spam or not spam).
•Time-Series Forecasting → Predicts future trends (e.g., weather, stock prices).
💡 Example: Using past exam scores to predict future student performance.
🔹 Key Takeaway:
•Descriptive modeling explains what happened 📈.
•Predictive modeling forecasts what might happen next 🔮. 🚀
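The predictive-modelling example (using past exam scores to predict future performance) can be sketched as a least-squares line fit in pure Python. All scores here are hypothetical:

```python
# Hypothetical past data: x = practice-test score, y = final exam score
x = [40, 50, 60, 70, 80]
y = [45, 55, 62, 71, 82]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares regression line: y ≈ a*x + b
a = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / \
    sum((xi - mean_x) ** 2 for xi in x)
b = mean_y - a * mean_x

# Predict the final score for a new practice score of 65 (illustrative)
predicted = a * 65 + b
print("slope:", a, "intercept:", b, "prediction:", predicted)
```

For this data the fitted line is y = 0.9x + 9, so a practice score of 65 predicts a final score of 67.5. This is the "forecast what might happen next" half of the table above; the summary statistics computed earlier are the descriptive half.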
Evaluation in an AI project cycle is the process of assessing how well a model performs after
training. This helps determine if the model is reliable and effective before deploying it in
real-world situations. Model evaluation has two main phases.
1. Diagnostic Measures ️
✅ Purpose: Checks if the model is working as expected.
✅ How?
•For Predictive Models → Use decision trees to see if predictions align with expected results.
•For Descriptive Models → Test relationships using a dataset with known outcomes.
💡 Example: If a predictive model forecasts student grades, we compare it with actual past scores
to verify accuracy.
2. Statistical Significance Test 📊
✅ Purpose: Ensures the model is processing data correctly and its results are not due to random
chance.
✅ How? Apply statistical tests (like p-value, confidence intervals) to confirm reliability.
💡 Example: A model predicting exam scores should show a real pattern, not just random guesses.
🔹 Why is this important?
•Phase 1 checks if the model needs improvement ️.
•Phase 2 confirms if the model is trustworthy ✅.
🚀 Final Goal: A well-tested, accurate AI model ready for real-world use!
5. From Deployment to Feedback
Deployment is the final stage where the AI model is made available for real-world use.
✅ Key Steps:
1.Make stakeholders familiar with how the AI model works.
2.Test the model in a small, controlled environment before full release.
3.Roll out gradually to ensure smooth integration into business processes.
4.Collaborate with internal teams (IT, software engineers, etc.) for seamless deployment.
💡 Example: A customer support chatbot is tested with a few users first before launching for
all customers.
🚀 Final Goal: Ensure the AI model works effectively in real-world scenarios!
Feedback
✅ What is it?
The final step where user feedback and real-world results help improve the AI model.
✅ Why is it important?
•Helps refine the model for better accuracy and performance.
•Identifies issues and areas for improvement.
•Ensures the model is useful and effective for stakeholders.
✅ How is feedback collected?
1.User reviews & client feedback (direct input from those using the model).
2.Performance monitoring (checking how well the model is working).
3.Automated feedback systems (AI adjusts itself based on new data).
💡 Example: If a chatbot struggles to answer customer queries, feedback helps train it better
for improved responses.
🔄 Final Goal: A continuously improving AI model that meets user needs! 🚀
2.2. MODEL VALIDATION
Evaluating the performance of a trained machine learning model is essential. Model
Validation offers a systematic approach to measure its accuracy and reliability, providing
insights into how well it generalizes to new, unseen data.
Model validation is the step conducted after Model Training, wherein the effectiveness of the
trained model is assessed using a validation testing dataset.
Why is it important?
✔ Improves model quality
✔ Reduces errors
✔ Prevents overfitting (model too specific to training data)
✔ Prevents underfitting (model too simple to capture patterns)
Model Validation Techniques
The commonly used validation techniques are Train-Test Split, K-Fold Cross-Validation,
Leave-One-Out Cross-Validation, Time Series Cross-Validation, etc. Let us discuss the
Train-Test Split and K-Fold Cross-Validation.
Train-Test Split (Simplified) 🎯
✅ What is it?
The train-test split is a technique for evaluating the performance of a machine learning
algorithm. It can be used for classification or regression problems, and for any supervised
learning algorithm.
Training Set – Used to fit the machine learning model.
Testing Set – Used to test the model’s accuracy
✅ Why use it?
✔ Helps estimate model performance on new data
✔ Prevents overfitting (memorizing training data)
✔ Ensures the model generalizes well to unseen data
✅ How to split the data?
The train-test procedure is appropriate when there
is a sufficiently large dataset available.
The dataset is divided into a training and test set. Common split ratios:
🔹 80% Train – 20% Test
🔹 70% Train – 30% Test
🔹 67% Train – 33% Test
✅ How to choose the split?
🔸 Consider computational cost (more training = longer processing)
🔸 Ensure the train & test sets represent the full dataset
🔸 Larger datasets can use smaller test sizes (e.g., 80/20 split)
💡 Example:
Imagine a student preparing for an exam. They study 80% of the syllabus and test themselves on 20% of the
syllabus to check their understanding!
from sklearn.model_selection import train_test_split
# Sample data (features & labels)
X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] # Features
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1] # Labels
# Splitting data (80% Train, 20% Test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Output the split data
print("Training Features:", X_train)
print("Testing Features:", X_test)
print("Training Labels:", y_train)
print("Testing Labels:", y_test)
K-Fold Cross-Validation is a great technique to evaluate model performance more reliably.
How It Works:
1.The dataset is split into k equal-sized subsets (folds).
2.The model is trained on k-1 folds and tested on the remaining 1 fold.
3.This process is repeated k times, with each fold serving as the test set once.
4.The final model performance is the average score across all k runs.
Example: k=5
•If you have 100 data points, you split them into 5 folds (each fold has 20 data points).
•Iteration 1: Train on Folds 2,3,4,5, test on Fold 1.
•Iteration 2: Train on Folds 1,3,4,5, test on Fold 2.
•Iteration 3: Train on Folds 1,2,4,5, test on Fold 3.
•Iteration 4: Train on Folds 1,2,3,5, test on Fold 4.
•Iteration 5: Train on Folds 1,2,3,4, test on Fold 5.
•Final result = average of all 5 test performances.
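The five iterations above can be sketched in plain Python. The 100 data points are represented by their indices, and the "train and evaluate" step is left as a comment:

```python
# Sketch of the k-fold rotation for k = 5 over 100 data points.
data = list(range(100))  # stand-in for 100 samples
k = 5
fold_size = len(data) // k  # 20 points per fold

for i in range(k):
    # Fold i is the test set; the remaining k-1 folds form the training set
    test_fold = data[i * fold_size:(i + 1) * fold_size]
    train_folds = data[:i * fold_size] + data[(i + 1) * fold_size:]
    # ... train the model on train_folds, evaluate on test_fold ...
    print(f"Iteration {i + 1}: train={len(train_folds)}, test={len(test_fold)}")
```

Each iteration trains on 80 points and tests on 20, and every point is used for testing exactly once. In practice, scikit-learn's `KFold` or `cross_val_score` handles this rotation (including shuffling) for you.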
Advantages:
✔ More reliable performance estimates
✔ Uses the full dataset for training & testing
✔ Reduces overfitting compared to a simple train-test split
2.3. MODEL PERFORMANCE - EVALUATION METRICS
Evaluation metrics help assess the performance of a trained model on a test dataset,
providing insights into its strengths and weaknesses.
2.3.1 Evaluation Metrics for Classification
1. Confusion Matrix
A confusion matrix is a table that compares predicted labels with actual labels, recording the
counts of true positives (TP), true negatives (TN), false positives (FP) and false negatives
(FN). Metrics such as accuracy, precision and recall are derived from it.
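The four counts of a binary confusion matrix can be tallied by hand. The actual and predicted labels below are hypothetical:

```python
# Hypothetical actual vs predicted labels (1 = positive, 0 = negative)
actual    = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)  # true positives
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)  # true negatives
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)  # false positives
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(actual)
print(f"TP={tp} TN={tn} FP={fp} FN={fn} Accuracy={accuracy}")
```

For these labels the matrix is TP=4, TN=4, FP=1, FN=1, giving an accuracy of 0.8. Libraries such as scikit-learn provide `confusion_matrix` to do the same tally.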
2.3.2 Evaluation Metrics for Regression
1. MAE: Mean Absolute Error is the average of the absolute differences between predictions
and actual values. A value of 0 indicates no error, i.e., perfect predictions.
2. MSE: Mean Square Error is the most commonly used metric to evaluate the performance of a
regression model. MSE is the mean (average) of the squared distances between our target
variable and the predicted values.
3. RMSE: Root Mean Square Error is the standard deviation of the residuals (prediction
errors). RMSE is often preferred over MSE because it is easier to interpret, since it is in
the same units as the target variable.
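All three metrics can be computed directly from their definitions. The actual and predicted values below are hypothetical:

```python
import math

# Hypothetical actual vs predicted values from a regression model
actual    = [3.0, 5.0, 2.0, 8.0]
predicted = [2.5, 5.0, 3.0, 7.5]

errors = [a - p for a, p in zip(actual, predicted)]  # residuals
mae  = sum(abs(e) for e in errors) / len(errors)     # Mean Absolute Error
mse  = sum(e ** 2 for e in errors) / len(errors)     # Mean Square Error
rmse = math.sqrt(mse)                                # Root Mean Square Error

print("MAE:", mae, "MSE:", mse, "RMSE:", round(rmse, 4))
```

Here MAE = 0.5 and MSE = 0.375, while RMSE ≈ 0.612 is back in the same units as the target variable, which is why it is often the easier metric to interpret.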