Module 1 Applied Data Science 1.1 and 1.2

Applied Data
Science
B. Tech. Semester - VIII
Scheme
Applied Data Science
Applied Data Science
Laboratory
• Mid Semester Examination • Experiment – 5 Marks
– 20 Marks • CIAP – (20 Marks, (Activity-
• CCE Activity – (20 Marks 1), 20 Marks (Activity-2),
(Activity-1), 20 Marks Attendance – 10 Marks) – 40
(Activity-2), Attendance – 10 Marks
Marks) – 20 Marks
• End Semester Examination
– (100 Marks) – 60 Marks
Course Outcome
After successful completion of this course, the students should be able

to
1.Gain a comprehensive understanding of the data science process, recognize its
significance, and distinguish between key areas and differences with data analytics.
2.Demonstrate expertise in descriptive and inferential statistics, encompassing
univariate and multivariate exploration, correlation, probability distributions, and
hypothesis testing.
3.Possess advanced skills in data processing, employing supervised and unsupervised
methods, cross-validation, and error metrics. Proficiency in visualization tools like
Microsoft Power BI and Tableau will enable effective interpretation of complex
datasets for practical problem-solving.
4.Expertise in detecting anomalies, understanding outlier causes, and mastering
advanced time series forecasting methods.
Syllabus
Module 2: Data
Module 1:
Processing &
Introduction
Visualization
Module 4: Time Module 3: Anomaly

Series Detection
Module 1: Introduction
Module 1.1 & 1.2: Introduction & Process
1. Introduction to Data Science

2. Data Science Process
3. Types of Analytics
4. Applications of Data Science in Industries
5. Data Ethics and Challenges
6. Future Trends and Career Opportunities in Data Science
Lecture 1:
Introduction to
Data Science
Objective: Understand the basics of Data
Science, its importance, and core
concepts.
Definition and scope
What is Data Science? Differences between Data Science, AI, ML, and
Big Data
Data collection and cleaning

Data analysis and visualization
Components of Data Science Machine learning models
Data interpretation
Outline Importance of Data Science Real-world impact (examples)
Roles in Data Science Data Scientist, Data Analyst, Data Engineer
Programming (Python, R)
Skills Required for Data Statistics and probability
Science Data visualization tools
What is Data Science?
Differences between
Definition of Data
Data Science, ML, AI
Science.
and Big Data Analytics.
Definition of Data Science
Definition of Data Science
• Data Science is an interdisciplinary field that involves extracting meaningful insights and knowledge from
structured and unstructured data using scientific methods, processes, algorithms, and systems. It
combines elements of statistics, mathematics, computer science, and domain expertise to analyze data
and solve complex real-world problems.
Key Aspects of Data Science:
• Data Handling
• Exploratory Data Analysis (EDA)
• Modeling and Machine Learning
• Visualization and Communication
• Applications Across Domains
Example
• Predicting customer behavior in e-commerce to improve product recommendations and increase sales.
Artificial Intelligence
Definition:
• The simulation of human intelligence in machines,

enabling them to perform tasks like reasoning,
Difference learning, and problem-solving.
between Data Key Features:
Science, AI, ML • Encompasses a broad range of techniques,

including logic-based reasoning and machine
and Big Data learning.

• Focuses on creating systems that mimic human
intelligence.
Analytics • Subfields: Machine Learning, Natural Language
Processing (NLP), Computer Vision.
Applications:
• Chatbots, autonomous vehicles, facial recognition

Machine Learning
Definition:
• A subset of AI that uses statistical techniques to

enable machines to learn patterns from data and
Difference improve
programming.
performance without explicit
Science, AI, ML • Focuses on building models to make predictions

or decisions.
and Big Data • Types of ML: Supervised, Unsupervised,
Reinforcement Learning.
Analytics • Algorithms: Linear regression, decision trees,
neural networks.
Applications:
• Spam email detection, stock price prediction,

recommendation systems.
Big Data Analytics
Definition:
• The process of examining large and complex

datasets (Big Data) to uncover patterns, trends,
Difference and insights that traditional data-processing
tools cannot handle.
Science, AI, ML • Focuses on volume, velocity, variety, and
and Big Data veracity of data (4 Vs).
• Involves distributed computing systems like
Hadoop and Spark.
Analytics • Tools: Apache Hive, Pig, Kafka.
Applications:
• Social media analysis, supply chain

optimization, predictive maintenance.
Difference between Data Science, AI,
ML and Big Data Analytics
Aspect DS AI ML BDA
Creating
Analyzing large
Focus Insights from data intelligent Learning from data
datasets
systems
Statistics, Reasoning, Algorithms, data- Distributed
Core Technique
visualization problem-solving driven learning computing
Broad: includes AI, Subset of Data Specialized for
Scope Subset of AI
ML, and analytics Science large-scale data
Python, R, SQL, TensorFlow, Hadoop, Spark,
Tools Scikit-learn, Keras
Tableau, PowerBI PyTorch Hive
Robotics, natural Personalized Social media
Business insights,
Application language recommendations, trends, real-time
research
understanding predictive modeling data analysis
What is Data Science? Differences between Data Science, AI, ML,
and Big Data

Data analysis and visualization
Components of Data Science Machine learning models
Data interpretation
Components of Data Science
Data collection and Data analysis and

cleaning visualization
Machine learning
Data interpretation
models
This is the foundational step in any Data Science project.
Data Collection:
• The process of gathering data from various sources for analysis.
• Sources of Data:
• Databases (e.g., SQL, NoSQL)
• Web scraping (e.g., BeautifulSoup, Scrapy)
• APIs (e.g., Twitter API, REST APIs)
• Sensors and IoT devices
• Surveys and questionnaires
• Challenges in Data Collection:
1. Data Collection
• Missing data
• Inconsistent formats
• High volume of unstructured data
& Cleaning Data Cleaning:

• Preparing the collected data by fixing errors, handling missing values, and
standardizing formats.
Key Tasks:
• Removing duplicates
• Handling missing values (e.g., imputation, deletion)
• Encoding categorical data (e.g., one-hot encoding)
• Standardizing and normalizing numerical data
• Tools Used:
• Python Libraries: Pandas, NumPy
• Software: Excel, OpenRefine
The goal is to explore and summarize the data to
understand trends, patterns, and relationships.
Data Analysis:
•Descriptive Statistics: Mean, median, standard deviation

•Exploratory Data Analysis (EDA):Understanding distributions,
outliers, and correlations
•Hypothesis testing and drawing preliminary insights
2. Data Analysis Data Visualization:
and Visualization •Presenting data in a visual format to convey insights
effectively.
•Types of Visualizations:
•Bar charts, line graphs, scatter plots
•Heatmaps, pie charts, histograms
•Interactive dashboards
•Tools Used:
•Python Libraries: Matplotlib, Seaborn, Plotly
•Software: Tableau, Power BI
Machine Learning enables Data Science projects to
predict, classify, and optimize outcomes.
Types of Models:
• Supervised Learning:
• Examples: Linear Regression, Decision Trees
• Use case: Predicting house prices
• Unsupervised Learning:
• Examples: K-Means Clustering, PCA
Machine Learning • Use case: Customer segmentation
• Reinforcement Learning:
Models • Example: Q-Learning

• Use case: Robotics and game AI
Model Evaluation Metrics:
• Accuracy, precision, recall, F1-scoreMean Squared Error (MSE), R-

squared for regression models
Tools Used:
• Libraries: Scikit-learn, TensorFlow, PyTorch

The final step involves making sense of the results and
deriving actionable insights.
Key Elements:
• Interpret Model Results:

• Understanding the significance of model coefficients or feature
importance
• Analyzing model outputs to answer business questions
Communicating Insights:
4. Data • Creating reports, dashboards, and presentations
Interpretation • Translating technical results into business recommendations
Real-World Impact:
• Helping stakeholders make data-driven decisions

• Implementing solutions based on findings (e.g., improving customer
service, optimizing inventory)
Best Practices for Interpretation:
• Use simple language to explain complex concepts.

• Back up insights with data visualizations and statistical evidence.
and Big Data

Components of Data Data analysis and visualization
Science Machine learning models

Data interpretation
Importance of Data Science
1. Data-Driven Decision 6. Fraud Detection and Risk
Making Management
2. Improved Business 7. Enhanced Decision
Efficiency Accuracy in Complex
3. Personalized Customer Systems
Experiences 8. Supports Sustainable
4. Competitive Advantage Development
5. Innovation and Product 9. Boosts Workforce

Development Productivity
10. Applicable Across
Industries
Real-life Example
E-Commerce: Amazon's Recommendation System
Amazon uses Data Science to analyze customer browsing and

Scenario: purchasing behavior, identifying patterns to recommend products.
By analyzing data such as past purchases, search history, and items

in the cart, Amazon's recommendation system suggests products
How Data Science Helps: tailored to individual preferences.
This increases customer satisfaction, boosts sales, and improves
user engagement.
Personalized recommendations drive approximately 35% of

Impact: Amazon's total revenue.
Real-life Example
Retail: Walmart’s Inventory Management
Walmart needs to ensure that products are always available while

Scenario: minimizing excess inventory.
Walmart uses Data Science to analyze sales data, weather patterns,

and local events to forecast demand accurately.
How Data Science Helps: Advanced algorithms optimize inventory levels, reducing stockouts
and overstock situations.
Walmart saves millions in operational costs while maintaining high

Impact: customer satisfaction.
Real-life Example
Entertainment: Netflix’s Content Recommendation System
Netflix relies on Data Science to recommend shows and movies

Scenario: based on user preferences.
Data on viewing habits, ratings, and user interactions is analyzed to

suggest personalized content.
How Data Science Helps: Machine learning models predict what users will enjoy watching
next.
Personalized recommendations keep users engaged, reducing

Impact: churn rates and increasing subscriptions.
Real-life Example
Transportation: Uber’s Surge Pricing
Scenario: Uber dynamically adjusts pricing based on supply and demand.
Data on real-time demand, traffic conditions, and driver availability

is analyzed to determine surge pricing.
How Data Science Helps: This ensures that riders get cabs faster while incentivizing drivers to
operate in high-demand areas.
Efficient resource allocation and increased driver earnings during

Impact: peak times.
Detailed Discussion - Amazon
DATA TECHNIQUES BENEFITS CHALLENGES IMPACT

SOURCES ADDRESSED
Data Sources used
• Past Purchases: Helps identify what a customer is likely to buy next.
• Browsing History: Tracks which products a customer has viewed but not
purchased.
• Cart Items and Wishlist Data: Items added to the cart or wishlist indicate strong
purchase intent.
• Search Queries: Analyzes what customers are searching for, even if they don’t
make a purchase.
• Customer Demographics: Uses age, location, and gender to tailor
recommendations.
• Behavior of Similar Customers: Recommends products based on what other
customers with similar interests or behaviors have purchased.
Collaborative Filtering: Recommends products based on
the preferences of similar users.
Content-Based Filtering: Focuses on product

characteristics and recommends items similar to those the
user has interacted with
Data Science
Techniques Hybrid Recommendation Systems: Combines collaborative
and content-based filtering for better accuracy.
Deep Learning and Natural Language Processing (NLP):

Analyzes product reviews and ratings to understand
customer sentiment.
Real-Time Analytics: Tracks user activity in real time and

updates recommendations dynamically.
Boosts Sales and Revenue
How It Helps:
• Personalized recommendations increase the likelihood
of customers making additional purchases.
• Recommendations like "Customers who bought this also
bought" encourage users to explore related products,
Benefits leading to upselling and cross-selling.
Proof:
• McKinsey Report: Suggests that 35% of Amazon's total
revenue is generated from its recommendation system.
• Case Study: Research published in Journal of Big Data
highlights that personalized product recommendations
can increase average order value by 10-15%.
Increases Customer Retention and Loyalty
How It Helps:
• By offering a personalized shopping experience,
customers feel understood and valued, leading to higher
retention rates.
• Repeat customers tend to purchase more, contributing
significantly to overall revenue.
Benefits
Proof:
• Forbes Study (2022): Found that 75% of customers are
more likely to return to an online retailer that provides
personalized recommendations.
• Amazon Prime Impact: Prime members are offered highly
personalized recommendations, which helps Amazon
achieve higher retention rates compared to competitors.
Enhances Customer Experience
How It Helps:
• Simplifies product discovery by filtering through millions
of items and presenting relevant options.
• Reduces decision fatigue, saving customers time and
effort in searching for products.
Benefits Proof:
• Baymard Institute Study: Reports that 58% of users
abandon e-commerce sites due to difficulty in finding
relevant products. Amazon mitigates this by offering
tailored recommendations, improving user satisfaction.
• Statista (2023): Amazon scored 78 out of 100 in
customer satisfaction surveys, partly attributed to its
recommendation system.
Drives User Engagement
How It Helps:
• By continuously suggesting new and relevant
products, users spend more time browsing Amazon’s
platform.
• Increased time on site correlates with a higher
probability of making a purchase.
Benefits
Proof:
• Deloitte Research: Shows that personalized product
recommendations increase site engagement by up to
30%.
• Internal Amazon Analytics (reported by CNBC): Users
who engage with recommendations are 3 times more
likely to complete a purchase than those who don’t.
Facilitates Efficient Inventory Management
How It Helps:
• Data insights help Amazon understand product
demand patterns, allowing for better inventory
planning.
• Recommending slow-moving items to targeted
customers helps reduce inventory holding costs.
Benefits
Proof:
• Business Insider Report: Amazon's efficient inventory
turnover is attributed to its recommendation system’s
ability to boost sales of underperforming products.
• Walmart Benchmarking Study: Shows that Amazon’s
inventory management outperforms traditional
retailers, with Data Science playing a crucial role.
Improves Marketing ROI
How It Helps:
• Amazon uses recommendation insights to target
customers with personalized emails and ads, improving
the return on investment (ROI) for marketing campaigns.
• Targeted advertising increases the relevance of
promotions, driving better conversion rates.
Benefits Proof:
• Econsultancy Research: Personalized email
recommendations result in 26% higher click-through
rates and 760% more revenue compared to non-
personalized emails.
• Amazon’s Sponsored Product Ads: These leverage
recommendation data to show relevant ads, which have
significantly higher conversion rates than generic ads.
Competitive Advantage and Market Leadership
How It Helps:
• The recommendation system differentiates Amazon from
competitors, providing a unique and seamless shopping
experience.
• Competitors like Walmart and Alibaba have tried to
replicate Amazon's recommendation engine but have not
Benefits matched its efficiency.
Proof:
• eMarketer (2023): Amazon holds a 39.5% share of the U.S.
e-commerce market, far ahead of competitors, with its
personalized shopping experience being a key factor.
• Comparison Study: Studies show that Amazon’s conversion
rate (13%) is significantly higher than the industry average
(2-3%), largely due to its recommendation engine.
Empowers Small Sellers on Amazon Marketplace
How It Helps:
• The recommendation system promotes products from small
and medium-sized sellers, leveling the playing field.
• By analyzing data on customer preferences, even niche
products gain visibility through recommendations.
Benefits Proof:
• Amazon Seller Statistics (2022): Over 50% of Amazon’s sales
come from third-party sellers. Many report significant sales
growth due to visibility from Amazon’s recommendation
system.
• Seller Testimonials: Several small businesses credit
Amazon’s recommendation system for boosting their sales
without requiring extensive marketing efforts.
Overwhelming Product Choices:
• With millions of products, customers can
feel overwhelmed.
• The recommendation system narrows
options, presenting relevant and desirable
choices.
Challenges Dynamic Customer Preferences:
Addressed Using • Customer interests change over time.
• Data Science models adapt by continuously
Data Science learning from new data to stay relevant.
Handling Large-Scale Data:

• Amazon processes terabytes of data daily,
and advanced Data Science techniques
ensure real-time recommendations without
delays.
Revenue Growth:
• Personalized recommendations account
for approximately 35% of Amazon’s
total revenue, showing how effective
the system is in driving sales.
Customer Retention:
Impact on • A seamless and personalized shopping
Amazon’s Business experience encourages repeat visits
and fosters customer loyalty.
Market Leadership:
• By leveraging Data Science, Amazon
maintains its competitive edge and sets
the benchmark for e-commerce
personalization.
and Big Data


Data interpretation
Core Data Science Roles
Role Primary Responsibility Key Skills
Python, R, Machine Learning (Scikit-
Extract insights and build predictive
Data Scientist learn, TensorFlow), Statistics, Data
models to solve business problems.
Visualization (Tableau, Power BI)
SQL, Apache Spark, Data Pipelines,
Build and maintain data infrastructure
Data Engineer Cloud Platforms (AWS, Azure), Big
and pipelines.
Data (Hadoop, Kafka)
Excel, SQL, Tableau, Power BI,
Analyze data and create actionable
Data Analyst Exploratory Data Analysis, Statistical
insights for decision-making.
Techniques
Machine Learning Develop and deploy machine learning Python, TensorFlow, PyTorch, MLOps
Engineer models into production. (MLflow), Model Deployment (APIs)
BI Tools (Power BI, QlikView), KPI
Business Provide data-driven insights through
Analysis, Communication, Business
Intelligence Analyst dashboards and reports.
Acumen
Specialized Data Science Roles
Role Primary Responsibility Key Skills
Data Modeling (Erwin), Cloud Data
Design and manage the overall data
Data Architect Architecture, SQL, NoSQL, Data
framework and architecture.
Governance
Database Management (MySQL,
Ensure the smooth operation and
Data Administrator Oracle), Performance Optimization,
security of databases.
Data Backup & Recovery
Apply statistical techniques to analyze SAS, SPSS, R, Hypothesis Testing,
Statistician
and interpret data. Regression Analysis
Deep Learning, NLP, TensorFlow,
Conduct research to develop advanced
AI Research Scientist PyTorch, Research & Publication,
AI algorithms and solutions.
Advanced Mathematics
Data Privacy Laws (GDPR), Data
Data Governance Ensure compliance, data integrity, and
Management Tools (Collibra), Risk
Specialist ethical data use.
Management, Communication
and Big Data


Data interpretation
Technical Skills Analytical Skills Soft Skills
Programming Language: Statistical Knowledge Communication Skills
Python, R, SQL,
Skills Java/Scala, etc.
Data Exploration &
Analysis
Business Acumen
Teamwork &
Required Data Manipulation and
Processing: Pandas,
Problem Solving Collaboration
Critical Thinking Adaptability & Curiosity
for Data Data Visualization Tools:
NumPy, Excel, etc.
Science Tableau, Power BI,

Matplotlib, Seaborn, etc.
Machine Learning
Framework: Scikit-learn,
TensorFlow, PyTorch
Cloud Platforms: AWS,
Azure, Google Cloud
Lecture 2: Data
Science Process
Objective: Learn the systematic approach
to solving problems using Data Science.
Outline
Define the problem
Collect data
Overview of the Data Science Clean and preprocess data
Process Explore and analyze data

Build models
Interpret and communicate results
Problem formulation: understanding the business objective

Data acquisition: sources, web scraping, APIs
Detailed Steps in the Process Data cleaning: handling missing data, outliers
Exploratory Data Analysis (EDA): identifying patterns, trends
Real-World Examples Use cases from industries

Overview of the Data Science Process
The Data Science process involves a structured approach to solving

real-world problems. It can be broken into six key steps:
• Define the Problem: Clearly articulate the problem to align the solution with business
objectives. Example: Predict customer churn in a subscription-based business.
• Collect Data: Gather relevant data from various sources like databases, APIs, and web
scraping. Example: Collect customer interaction logs and subscription details.
• Clean and Preprocess Data: Handle missing values, remove outliers, and format data
for analysis. Example: Fill missing age data with median values and normalize income
data.
• Explore and Analyze Data: Use statistical and visualization tools to identify trends,
patterns, and correlations. Example: Use histograms and scatterplots to find
relationships between variables.
• Build Models: Develop machine learning models to predict or classify outcomes.
Example: Use logistic regression or decision trees to predict customer churn.
• Interpret and Communicate Results: Translate model outputs into actionable insights
and present findings to stakeholders. Example: Create a dashboard showing churn
probability for each customer.
Data Science Process – Life Cycle
Problem
Data Acquisition
Formulation
Exploratory Data
Data Preparation
Analytics
Interpret &
Build Models Communicate
Result
Problem Formulation
Understanding the Problem Statement, thorough study of the
Business model is required.
Problem Formulation
Problem formulation is the foundation of any Data Science project. It

involves clearly defining the problem to ensure the solution aligns with
business goals. This step sets the direction for the entire Data Science
process and helps stakeholders understand what is being addressed,
why it is important, and how success will be measured.
1. Understanding the Business Context and Objectives
2. Defining the Scope, Constraints, and Success Criteria
Problem Formation - Understanding
the Business Context and Objectives
• To solve a problem effectively, it is critical to understand the business's domain, operations,

and objectives. This requires close collaboration with stakeholders, including business
managers, subject matter experts, and end-users.
• Key Aspects:
• Business Context: Understand the industry and its challenges (e.g., retail, healthcare,
finance).
• Stakeholder Goals: Identify what the stakeholders aim to achieve.
• Current State: Analyze the existing processes, tools, and data availability.
Problem Formation – Define Scope,
Constraints & Success Criteria
• After understanding the problem, the next step is to define the project's boundaries,
constraints, and measures of success.
• Scope: The scope specifies the problem's extent and what will be included or excluded from
the analysis. It helps set realistic expectations and prevents scope creep (uncontrolled
expansion of project boundaries).
• Constraints: Constraints are the limitations or restrictions that might affect the project. These
could be related to resources, time, data availability, or technology.
• Success Criteria: Define measurable outcomes to evaluate whether the solution is
successful. These criteria help stakeholders assess the project’s impact and guide iterative
improvements.
Problem Formation - Example
• Retail Business – Optimizing Inventory Management
• Scenario
• A retail chain faces frequent challenges with inventory. Some products are overstocked, leading to high storage costs, while others frequently run out of
stock, causing customer dissatisfaction and missed sales opportunities.
• Business Objective
• To optimize inventory management by predicting product demand and maintaining optimal stock levels, which would reduce costs and improve sales.
• Problem Formulation:
• Understand the Context:
• The company operates 100 retail outlets.
• Seasonal trends and promotional events significantly affect product demand.
• Manual inventory tracking is time-consuming and prone to errors.
• Define the Scope:
• Analyze data for 50 high-demand product categories.
• Focus on sales data from the last two years.
• Exclude niche products and data from outlets that do not generate significant revenue.
• Identify Constraints:
• Incomplete data for certain periods due to system outages.
• Limited IT infrastructure for real-time analytics.
• Deadline to implement the solution before the upcoming holiday season.
• Set Success Criteria:
• Achieve a 90% demand forecast accuracy for key products.
• Reduce stockouts to fewer than 5 occurrences per month.
• Lower inventory holding costs by at least 10%.
Problem Formulation
Data Acquisition
Identify the relevant data sources as per defined problem statement
and decide the format and various tools for data acquisition.
Identify Relevant Data Sources in
Data Science
• Identifying the right data sources is a crucial step in any Data Science project. The quality and
relevance of data directly impact the accuracy and usefulness of the insights generated. Data
sources can be broadly categorized into internal and external sources, and specific tools are
used to acquire this data effectively.
• Types of Data
• Tools for Data Acquisition
Problem Formulation
Data Preparation
the process of cleaning, organizing, and
transforming raw data into a format that can
be analyzed
3
Data Acquisition
Identify the relevant data sources as per defined problem statement
and decide the format and various tools for data acquisition.
Data Cleaning
Data cleaning, also known as data cleansing or preprocessing, is a crucial step in the
Data Science process. It involves preparing raw data for analysis by identifying and
correcting errors, filling in missing values, standardizing formats, and addressing
inconsistencies. Clean data ensures the accuracy, reliability, and efficiency of subsequent
analyses or machine learning models.
Why Data Cleaning Is Important
Enhances Model
Improves Data Quality:
Performance: Clean data
Ensures accuracy,
results in better-performing
consistency, and reliability.
machine learning models.
Facilitates Analysis:
Saves Time: Prevents issues
Reduces noise, making it
during later stages of data
easier to detect meaningful
analysis.
patterns.
Key Steps in Data Cleaning
Detect and Handle Missing Data
• Methods to Handle Missing Data:

• Imputation
• Removal
Identify & Remove Outliers
• Method to detect outliers

• Z-Score
• IQR
• Visualization
• Handling Outlier
Data Transformation
Data inconsistency can

arise from variations in
formats, units, or Example of Standardize
categories. Standardization Data:
ensures uniformity across
the dataset.
Categorical Encoding:
Units and Scales: Convert
Standardize categorical
Date Formats: Convert all measurements to a
variables using consistent
date entries to a consistent common unit (e.g., inches
labels or encoding
format (e.g., YYYY-MM-DD). to centimeters, USD to
methods (e.g., one-hot
EUR).
encoding, label encoding).
Other Common Data Cleaning Tasks
1 2 3
Remove Duplicate Fix Structural Errors: Filter Irrelevant Data:
Records: Eliminate Correct typos, Remove data that does
redundant entries to inconsistent not contribute to
avoid skewing results. capitalization, or solving the problem
misnamed categories (e.g., unrelated
(e.g., “NY” vs. “New columns).
York”).
Importance of Data Cleaning
• Improves Data Quality: Reduces errors and inconsistencies.

• Enhances Model Accuracy: Clean data leads to better-performing machine
learning models.
• Facilitates Analysis: Easier to identify patterns and draw insights.
• Saves Time: Avoids time spent addressing issues during analysis or modeling.
• Accurate Analysis: Reduces the risk of incorrect insights due to bad data.
• Improved Model Performance: Clean data enhances the predictive power of
machine learning models.
• Efficiency: Streamlines the workflow by minimizing errors in later stages.
Problem Formulation
Data Preparation
be analyzed
3
4
Data Acquisition Exploratory Data Analytics
Identify the relevant data sources as per is a process used by data scientists to:
defined problem statement and decide the Validate data, Generate hypotheses, Identify
format and various tools for data acquisition. trends, Summarize data characteristics
Exploratory Data Analytics
Exploratory Data Analysis (EDA) is a critical step in the data science

process, where you analyze and summarize the main characteristics
of a dataset, often using visual methods. The goal of EDA is to
uncover patterns, spot anomalies, test hypotheses, and check
assumptions using both statistical and graphical techniques.
• Objective of EDA
• Key Steps in EDA
• Tools for EDA
• Benefits of EDA
Problem Formulation
Data Preparation
be analyzed
3
5
Build Model
is the process of deploying
machines for
understanding a system
4
Building the Model
Define the Problem Type Select the Appropriate Model Train the Model
Supervised Learning (e.g., regression, Regression: Linear Regression, Random Data splitting: Training and testing sets
classification) Forest Cross-validation techniques
Unsupervised Learning (e.g., clustering, Classification: Logistic Regression, SVM,
anomaly detection) Neural Networks
Reinforcement Learning Clustering: k-Means, Hierarchical
Clustering
Anomaly Detection: Isolation Forest,
Autoencoders
Building the Model
Evaluate Model Performance Fine-Tune the Model Test and Deploy the Model
Regression Metrics: MAE, MSE, RMSE, Hyperparameter tuning: Grid search, Final evaluation using test data
R-squared random search Deployment for real-time predictions
Classification Metrics: Accuracy,
Precision, Recall, F1-Score, ROC-AUC
Clustering Metrics: Silhouette Score,
Davies-Bouldin Index
Problem Formulation
Data Preparation
the process of cleaning, organizing, and transforming
raw data into a format that can be analyzed Interpret &
3 Communicate
Summarize key finding & provide recommendation
5
Build Model
is the process of deploying
machines for
understanding a system
4
Interpret & Communicate
Highlight essential insights derived from the model.
Summarize Key Findings Explain the significance of key features driving the predictions.
Types of visualizations:
•Bar charts, pie charts, line graphs, scatter plots, heatmaps, confusion
Use Clear and Effective Visualizations matrix.
Tools for visualization: Matplotlib, Seaborn, Tableau, Power BI.
Relate findings to business goals.

Provide Context to Stakeholders Use simple language and practical examples.
Optimize business strategies (e.g., marketing, operations).

Offer Actionable Recommendations Mitigate risks (e.g., fraud detection, churn reduction).
Create comprehensive reports with an executive summary and detailed

Document and Present Results appendices.
Deliver engaging presentations using visuals and storytelling.
Tailor communication to the audience (technical vs. non-technical).

Best Practices Emphasize actionable steps and business impact.
Address limitations and assumptions in the analysis.
Real-life Scenerio
• Netflix:
https://www.analyticsvidhya.com/blog/2023/06/netflix-
case-study-eda-unveiling-data-driven-strategies-for-
streaming/#h-official-documentation-and-resources
• Airbnb, AstraZeneca, Johnson & Johnson, IMD, HDFC
https://www.upgrad.com/blog/top-data-science-case-
studies-for-inspiration/
Lecture 3: Types of
Analytics
Objective: Understand the different
types of analytics and their applications.
Outline
Definition and examples (e.g., sales reports, website traffic)
Descriptive Analytics Tools and techniques
Identifying root causes of issues

Diagnostic Analytics Techniques like drill-down analysis
Forecasting future outcomes (e.g., customer churn prediction)

Predictive Analytics Models: regression, decision trees
Recommending actions (e.g., pricing strategies, inventory optimization)

Prescriptive Analytics Tools: optimization and simulation
Examples Across Industries Application of each type

Descriptive Analytics
Purpose:
• Summarizes historical data to describe what has happened.

• Provides insights into business performance through trends and patterns.
• Helps decision-makers understand key performance indicators (KPIs) and past outcomes.
Techniques:
• Data Aggregation: Summing up or averaging data to provide a summary.

• Trend Analysis: Observing data points over time to identify patterns.
• Data Visualization: Creating graphs, charts, and dashboards for easy comprehension.
Examples in Business:
• Sales Reports: Understanding monthly, quarterly, or yearly sales performance.

• Website Analytics: Analyzing user visits, session duration, and bounce rates to improve website experience.
• Customer Demographics: Segmenting customers by age, location, or purchase history to tailor marketing campaigns.
Tools:
• Microsoft Excel, Tableau, Power BI, Google Analytics.

Diagnostic Analytics
Purpose:
• Answers "Why did it happen?" by drilling down into data to identify the root causes of events.
• Enables businesses to uncover insights about failures, inefficiencies, or unexpected trends.
Techniques:
• Drill-Down Analysis: Breaking data into finer levels to uncover deeper insights.
• Correlation Analysis: Identifying relationships between variables (e.g., increased social media ads lead to higher
sales).
• Root Cause Analysis: Tracing anomalies or unexpected results to their source.
• Analyzing why a particular product's sales declined in a specific region.

• Investigating high employee turnover rates in a department.
Tools:
• SQL for querying data, Python for deeper statistical analysis, and tools like Splunk for system diagnostics.
Predictive Analytics
Purpose:
• Answers "What will happen?" by forecasting future outcomes based on historical data.
• Provides actionable insights to prepare for potential scenarios or risks.
Techniques:
• Regression Analysis: Predicting numeric outcomes (e.g., sales revenue).

• Classification Models: Categorizing data (e.g., whether a customer will churn or not).
• Time Series Analysis: Analyzing sequential data over time (e.g., stock prices, weather forecasts).
• Customer Churn Prediction: Identifying customers likely to leave the service.

• Demand Forecasting: Predicting future inventory requirements to optimize supply chain.
Tools:
• Python (libraries like scikit-learn, TensorFlow), R, SAS, IBM Watson.

Prescriptive Analytics
Purpose:
• Answers "What should we do?" by providing recommendations and optimizing decisions.

• Helps businesses make informed decisions to achieve desired outcomes.
Techniques:
• Optimization Models: Finding the best solution among various alternatives (e.g., minimizing costs or maximizing profits).
• Simulation: Modeling scenarios to understand potential outcomes and their impact.
• Dynamic Pricing: Adjusting prices in real-time based on demand, competition, and market trends (e.g., surge pricing in
ride-hailing apps).
• Inventory Optimization: Ensuring the right amount of stock to meet demand without overstocking or understocking.
Tools:
• MATLAB, Gurobi, Python (PuLP library), AnyLogic.

Examples Across Industries
Healthcare:
• Descriptive: Patient record summaries for hospital management.

• Diagnostic: Identifying causes of increased hospital readmissions.
• Predictive: Forecasting patient recovery times based on health records.
• Prescriptive: Optimizing staff allocation based on patient flow.
Retail:
• Descriptive: Weekly sales reports to assess performance.

• Diagnostic: Analyzing reasons for low footfall in certain stores.
• Predictive: Forecasting holiday season sales for inventory planning.
• Prescriptive: Recommending ideal product placements in stores to boost sales.
Finance:
• Descriptive: Summarizing monthly expense reports.

• Diagnostic: Detecting the cause of failed transactions.
• Predictive: Estimating credit risk for loan approvals.
• Prescriptive: Recommending investment portfolios based on risk tolerance.
Real-World Impact of Analytics
Descriptive helps businesses reflect on past performance and adjust
strategies accordingly.
Diagnostic improves operational efficiency by identifying

inefficiencies.
Predictive empowers organizations to anticipate market changes and

adapt proactively.
Prescriptive delivers optimal solutions, ensuring businesses stay

competitive.
Lecture 4:
Applications of Data
Science in Industry
Objective: Explore real-world
applications of Data Science.
Applications of Data Science Across
Industries
Retail & E-
Healthcare Finance & Banking Manufacturing Telecommunication
Commerce
Transportation & Media & Government &

Energy & Utilities Education
Logistics Entertainment Public Sector
Agriculture Real-Estate Sports Travel & Tourism Insurance
Marketing &
Legal Cybersecurity Aerospace Fashion
Advertisement
Lecture 5: Data
Ethics and
Challenges
Objective: Discuss ethical considerations
and challenges in Data Science.
Outline
Data privacy and security
Ethical Issues Bias and fairness in algorithms
Transparency and accountability
Data quality and availability

Challenges Model interpretability
Scalability and computational challenges
Regulations and Standards GDPR, HIPAA, and other data protection laws
Case Studies Ethical dilemmas in AI and Data Science

Lecture 6: Future
Trends and Career
Opportunities in Data
Science
Objective: Highlight emerging trends
and career paths in Data Science.
Rise of AutoML and no-code
AI tools
Future Trends Integration of Data Science

with IoT and Blockchain
Advancements in NLP and
Computer Vision
Roles: Data Scientist, ML

Engineer, Business Analyst
Career Opportunities
Outline Pathways: certifications,
degree programs, self-
learning
Soft skills: communication,

Industry Expectations problem-solving
Technical skills:
and Required Skills programming, statistical
analysis
Data Exploration
Data exploration can be broadly classified into two types—descriptive statistics and
data visualization.
Descriptive statistics is the process of condensing key characteristics of the dataset

into simple numeric metrics. Some of the common quantitative metrics used are
mean, standard deviation, and correlation.
Visualization is the process of projecting the data, or parts of it, into multi-dimensional
space or abstract images. All the useful (and adorable) charts fall under this category.
Data exploration in the context of data science uses both descriptive statistics and
visualization techniques.
OBJECTIVES OF DATA EXPLORATION
In the data science process, data exploration is leveraged in many different steps including
preprocessing or data preparation, modeling, and interpretation of the modeling results
Data Understanding
Data Preparation
Data Science Task
Interpreting the results

Nominal Attribute
Binary Attribute
Types of Ordinal Attribute

Data Discrete
Numeric Attributes
Continuous
Qualitative &
Quantitative Attributes
Nominal Attribute
Nominal means “relating to names” . The utilities of a nominal attribute are sign or title of
objects . Each value represents some kind of category, code or state, and so nominal
attributes are also referred to as categorial.
Example – Suppose that skin color and education status are two attributes of expressing
person objects. In our implementation, possible values for skin color are dark, white,
brown. The attributes for education status can contain the values- undergraduate,
postgraduate, matriculate. Both skin color and education status are nominal attributes.
Binary Attribute
A binary attribute is a category of nominal attributes that contains only two

classes: 0 or 1, where 0 often tells that the attribute is not present, and 1 tells that
it is existing. Binary attributes are mentioned as Boolean if the two conditions
agree to true and false.
Example – Given the attribute drinker narrate a patient item, 1 specify that the
drinker drinks, while 0 specify that the patient does not. Similarly, suppose the
patient undergoes a medical test that has two practicable outcomes.
Ordinal Attribute
Ordinal attribute is an attribute with a viable advantage that has
a significant sequence or ranking among them, but the
enormity between consecutive values is not known.
Suppose that food quantity corresponds to the variety of dishes

available at a restaurant. The nominal attribute has three
possible values: starters, main course, combo.
The values have a meaningful sequence that corresponds to

different food quantity however, we cannot tell from the values
how much bigger, say, a medium is than a large.
Numeric Attributes
A numeric attribute is calculable, that is, it is a quantifiable amount that

constitutes integer or real values.
Discrete Attribute : A discrete attribute The attributes skin color, drinker, medical
has a limited or restricted unlimited set report, and drink size each have a finite number
of values, which may appear as integers. of values, and so are discrete.
Continuous Attribute : A continuous

Height, weight, and temperature have real
attribute has real numbers as attribute values .
values.
Properties of Data
• Accuracy: There are many possible reasons for flawed or inaccurate data here. i.e. Having incorrect values of properties
that could be human or computer errors.
• Completeness: For some reasons, incomplete data can occur, attributes of interest such as customer information for sales
& transaction data may not always be available.
• Consistency: Incorrect data can also result from inconsistencies in naming convention or data codes, or from input field
incoherent format. Duplicate tuples need cleaning of details, too.
• Timeliness: It also affects the quality of the data. At the end of the month, several sales representatives fail to file their sales
records on time. There are also several corrections & adjustments which flow into after the end of the month. Data stored in
the database are incomplete for a time after each month.
• Believability: It is reflective of how much users trust the data.
• Interpretability: It reflects how easy the users can understand the data.
• Descriptive statistics refers to the study of the aggregate
DESCRIPTIVE
quantities of a dataset.
• These measures are some of the commonly used notations in

everyday life.
STATISTICS • Some examples of descriptive statistics include average annual

income, median home price in a neighborhood, range of credit
scores of a population, etc.
DESCRIPTIVE STATISTICS
Descriptive statistics can be broadly classified into univariate and multivariate

exploration depending on the number of attributes under analysis
Univariate data exploration denotes analysis of one attribute at a time.
Multivariate exploration is the study of more than one attribute in the dataset
simultaneously. This technique is critical to understanding the relationship
between the attributes, which is central to data science methods.
Univariate Exploration
Measure of Central Tendency
• Mean
• Median
• Mode
Measure of Spread
• Range
• Deviation
Univariate Exploration: Measure of
Central Tendency
The objective of finding the central location of an attribute is to
quantify the dataset with one central or most common number.
Mean: Arithmetic Average
Median: Value of the central point in the distribution.
Mode: Most frequently occurring observation

Univariate Exploration: Measure
of Central Tendency: Mean
σ 𝑿𝒊 𝑿𝟏 + 𝑿𝟐 + ⋯ + 𝑿𝒏
ഥ=
𝑴𝒆𝒂𝒏 𝒐𝒓 𝑿 =
𝒏 𝒏
σ 𝑤𝑖 𝑋𝑖 𝑤1 𝑋1 + 𝑤2 𝑋2 + ⋯ + 𝑤𝑛 𝑋𝑛
𝑊𝑒𝑖𝑔ℎ𝑡𝑒𝑑 𝑀𝑒𝑎𝑛 𝑜𝑟 𝑋𝑤 = =
σ 𝑤𝑖 𝑤1 + 𝑤2 + ⋯ + 𝑤𝑛
Central Tendency: Median
Median is positional average.

𝑛+1
𝑀𝑒𝑑𝑖𝑎𝑛 𝑀 = 𝑣𝑎𝑙𝑢𝑒 𝑜𝑓 𝑡ℎ 𝑖𝑡𝑒𝑚
2
Central Tendency: Mode
• It is the most commonly or frequently occurred value in a series.

• The mode in a distribution is that item around which there is
maximum concentration.
Geometric Mean & Harmonic
Mean
𝑛 𝑛
𝐺𝑒𝑜𝑚𝑒𝑡𝑟𝑖𝑐 𝑀𝑒𝑎𝑛 𝑜𝑟 𝐺. 𝑀. = 𝜋𝑋𝑖 = 𝑋1 . 𝑋2 . 𝑋3 … 𝑋𝑛
σ 𝑅𝑒𝑐 𝑋𝑖 𝑅𝑒𝑐.𝑋1 +𝑅𝑒𝑐.𝑋2 +⋯+𝑅𝑒𝑐.𝑋𝑛

Harmonic 𝑀𝑒𝑎𝑛 𝑜𝑟 𝐻. 𝑀. = 𝑅𝑒𝑐. = 𝑅𝑒𝑐.
𝑛 𝑛
Relation Between Mean, Median and
Mode
• The three measures of central values i.e. mean, median, and

mode are closely connected by the following relations (called an
empirical relationship).
2 ∗ 𝑀𝑒𝑎𝑛 + 𝑀𝑜𝑑𝑒 = 𝑀𝑒𝑑𝑖𝑎𝑛
For example, we have data whose mode = 65 and median = 61.6.
Find Mean.
Mean v/s Average
Mean v/s Median

Module 1 Applied Data Science 1.1 and 1.2

Uploaded by

Module 1 Applied Data Science 1.1 and 1.2

Uploaded by

Applied Data

After successful completion of this course, the students should be able

Module 4: Time Module 3: Anomaly

1. Introduction to Data Science

Data collection and cleaning

Outline Importance of Data Science Real-world impact (examples)

Roles in Data Science Data Scientist, Data Analyst, Data Engineer

Key Aspects of Data Science:

• The simulation of human intelligence in machines,

between Data Key Features:

Science, AI, ML • Encompasses a broad range of techniques,

and Big Data learning.

• Chatbots, autonomous vehicles, facial recognition

• A subset of AI that uses statistical techniques to

between Data Key Features:

Science, AI, ML • Focuses on building models to make predictions

• Spam email detection, stock price prediction,

• The process of examining large and complex

• Social media analysis, supply chain

Data collection and cleaning

Outline Importance of Data Science Real-world impact (examples)

Roles in Data Science Data Scientist, Data Analyst, Data Engineer

Data collection and Data analysis and

& Cleaning Data Cleaning:

•Descriptive Statistics: Mean, median, standard deviation

Models • Example: Q-Learning

Model Evaluation Metrics:

• Accuracy, precision, recall, F1-scoreMean Squared Error (MSE), R-

• Libraries: Scikit-learn, TensorFlow, PyTorch

• Interpret Model Results:

Interpretation • Translating technical results into business recommendations

• Helping stakeholders make data-driven decisions

Best Practices for Interpretation:

• Use simple language to explain complex concepts.

Data collection and cleaning

Science Machine learning models

Outline Importance of Data Science Real-world impact (examples)

Roles in Data Science Data Scientist, Data Analyst, Data Engineer

5. Innovation and Product 9. Boosts Workforce

E-Commerce: Amazon's Recommendation System

Amazon uses Data Science to analyze customer browsing and

By analyzing data such as past purchases, search history, and items

Personalized recommendations drive approximately 35% of

Retail: Walmart’s Inventory Management

Walmart needs to ensure that products are always available while

Walmart uses Data Science to analyze sales data, weather patterns,

Walmart saves millions in operational costs while maintaining high

Entertainment: Netflix’s Content Recommendation System

Netflix relies on Data Science to recommend shows and movies

Data on viewing habits, ratings, and user interactions is analyzed to

Personalized recommendations keep users engaged, reducing

Transportation: Uber’s Surge Pricing

Scenario: Uber dynamically adjusts pricing based on supply and demand.

Data on real-time demand, traffic conditions, and driver availability

Efficient resource allocation and increased driver earnings during

DATA TECHNIQUES BENEFITS CHALLENGES IMPACT

Content-Based Filtering: Focuses on product

Deep Learning and Natural Language Processing (NLP):

Real-Time Analytics: Tracks user activity in real time and

Handling Large-Scale Data:

Data collection and cleaning

Science Machine learning models

Outline Importance of Data Science Real-world impact (examples)

Roles in Data Science Data Scientist, Data Analyst, Data Engineer

Data collection and cleaning

Science Machine learning models

Outline Importance of Data Science Real-world impact (examples)

Roles in Data Science Data Scientist, Data Analyst, Data Engineer

Science Tableau, Power BI,

Process Explore and analyze data

Problem formulation: understanding the business objective

Real-World Examples Use cases from industries

The Data Science process involves a structured approach to solving

Problem formulation is the foundation of any Data Science project. It

• To solve a problem effectively, it is critical to understand the business's domain, operations,

Detect and Handle Missing Data