DS01 Introduction To Data Science

The document provides a comprehensive overview of Data Science, including its definition, importance, applications, and the data science life cycle. It differentiates between Data Science and Information Science, structured and unstructured data, and compares Business Intelligence with Data Science. Additionally, it covers key concepts such as Data Wrangling, Data Cleaning, Data Transformation, and Data Reduction, along with their significance and methods.


DS:01 Introduction to Data Science

08 March 2026

1) What is Data Science? Differentiate between Data Science and Information Science?
2) Explain various applications of Data Science.
3) Difference between structured data and unstructured data
4) Compare Business Intelligence and Data Science.
5) Explain the use of Data Science in AI
6) What is Data Wrangling? Explain various methods of data wrangling.
7) Why is Data Wrangling Important in Data Science?
8) Explain Data Cleaning & Data Transformation.
9) What is Data Integration & Data Discretization?
10) List & explain Data Reduction techniques with example.

What is Data Science?


Definition:
Data Science is a field that uses data, scientific methods, algorithms, and tools to extract useful
information, patterns, and insights from data to help in decision-making.
It processes raw data to address business challenges and predict future trends.
Simple Meaning
In very simple words:
Data Science = Collecting data + Understanding it + Using it to make smart decisions
Example:
• Netflix suggests movies
• Amazon recommends products
• Google Maps shows fastest route

Why is Data Science Important?


Data Science helps to:
• Understand customer behaviour
• Improve business decisions
• Predict future trends
• Automate systems

Steps in Data Science (Life Cycle)


1. Data Collection: Gathering raw data from various sources like databases, websites, sensors, etc.
2. Data Cleaning: Removing errors, missing values, and unwanted data
3. Exploratory Data Analysis (EDA): Studying data to find patterns, trends and relationships
4. Data Visualization: Presenting data using visual insights like graphs, charts & dashboards
5. Model Building: Using machine learning algorithms to make predictions & train the model using
data
6. Model Evaluation: Check accuracy and performance, improve model if needed.
7. Deployment: Deploy model into real world system, make it usable for users.
8. Decision Making: Using insights to solve problems or improve systems

Applications of Data Science


1. Healthcare -
• Data science helps doctors and hospitals to make better decisions.
• It helps to predict diseases at an early stage.
• Analyses patients' health and suggests appropriate treatments.

2. Finance and banking -


• In finance, data science is used in fraud detection, risk management and customer analysis.

• Banks use it to detect unusual transactions and prevent fraud.
• It is also used in the stock market for prediction and investment planning.

3. Ecommerce -
• Ecommerce companies like Amazon, Flipkart, etc. use data science to improve customer experience.
• It helps in recommending products based on user behaviour and previous purchases.

4. Education -
• Helps to analyze student performance and improve learning style.
• It can identify weak students and provide personalized learning plans.
• Online platforms use data science to recommend courses and track student progress.

5. Social media -
• Social media platforms like Instagram and Facebook use it to show personalized content to users.
• It analyzes user behaviour, likes, and interactions to recommend posts, videos, and ads.
• Helps in detecting fake accounts and harmful content.

Use of Data Science in AI


Introduction
Data Science and Artificial Intelligence (AI) are closely related fields.
Data Science provides the data, analysis, and insights that AI systems use to learn and make
intelligent decisions.
In simple words:
Data Science = Brain food (data) for AI

Uses of Data Science in AI (In Points) -


1. Data Collection and Preparation
• Collects large amounts of data from different sources
• Cleans and preprocesses data (removes errors, missing values)
• Converts raw data into usable format
Important because AI needs high-quality data

2. Training AI Models
• AI systems learn from data using algorithms
• Data Science provides structured datasets for training
• Improves model accuracy
Example: Training a chatbot using past messages

3. Pattern Recognition
• Data Science identifies patterns and relationships
• AI uses these patterns to make decisions
Example: Face recognition systems

4. Prediction and Decision Making


• Data Science builds predictive models
• AI uses them to predict future outcomes
Example: Predicting customer behavior

5. Natural Language Processing (NLP)
• Helps AI understand human language


• Used in chatbots, translators, voice assistants
Example: ChatGPT, Alexa

6. Computer Vision
• Data Science processes image and video data
• AI recognizes objects, faces, and actions
Example: Self-driving cars detecting objects

7. Recommendation Systems
• Analyses user behaviour and preferences
• AI recommends products, movies, or content
Example: Netflix, Amazon recommendations

8. Continuous Improvement
• Data Science monitors AI performance
• Updates models with new data
• Improves accuracy over time

Real-Life Applications
• Virtual assistants (Siri, Alexa)
• Fraud detection systems
• Healthcare diagnosis
• Self-driving cars

Difference between Data Science and Information Science


Data Science | Information Science
Focuses on data analysis and predictions. | Focuses on managing and organizing information.
Goal is to extract useful insights from the data. | Goal is to manage and distribute information.
Works with raw and unstructured data. | Works with structured and processed information.
Uses AI, machine learning, statistics, etc. | Uses databases and information systems.
Requires programming skills. | Requires knowledge of information systems.
Field nature is technical and analytical. | Field nature is managerial and organizational.
Output: Predictions, patterns and insights. | Output: Organized information.
Example: Recommendation systems. | Example: Library management systems.

Data and Data Types


What is data?
Data refers to raw facts, figures, symbols, or observations that are collected for analysis.
Data = unprocessed information.
Ex: Numbers: 10, 50, 100
Names: Srushti, Purva
Images, Videos, Text

Difference between structured data and unstructured data


Basis | Structured Data | Unstructured Data
Definition | Data organized in a fixed format (rows and columns) | Data without a predefined structure or format
Format | Organized (rows and columns) | No fixed format
Storage | Databases like SQL, Excel sheets | Files, documents, images, videos, etc.
Ease of Analysis | Easy to analyse using traditional tools | Difficult to analyse; requires advanced techniques
Examples | Excel sheets, SQL tables, databases | Images, videos, emails, social media posts
Processing Tools | SQL, Excel, traditional database tools | AI, Machine Learning (ML), NLP tools
Flexibility | Less flexible (fixed schema) | Highly flexible (no schema required)
Data Type | Mostly numerical and categorical data | Text, audio, video, images
Scalability | Easier to manage in structured systems | Difficult to manage due to large variety
Usage | Traditional business systems | Modern applications like social media, AI

Difference between Qualitative Data & Quantitative Data


Basis | Qualitative Data | Quantitative Data
Definition | Describes qualities (non-numeric) | Represents numbers (numeric)
Nature | Descriptive | Measurable
Type | Non-numeric | Numeric
Purpose | To describe characteristics | To measure quantity
Examples | Colour, gender, feedback | Age, marks, height
Data Type | Categories (labels) | Discrete & Continuous
Analysis | Difficult to analyse | Easy to analyse
Representation | Words or labels | Numbers, graphs
Collection Methods | Interviews, observation | Surveys, experiments
Accuracy | Less precise | More precise
Tools Used | Text analysis | Statistical tools
Use Case | Customer opinions | Sales, performance

Compare Business Intelligence and Data Science.


Basis | Business Intelligence (BI) | Data Science
Focus | Past and present data | Future predictions
Purpose | Reporting and decision support | Prediction and insight generation
Type of Analysis | Descriptive (what happened) | Predictive & Prescriptive (what will happen)
Data Type | Structured data | Structured + Unstructured data
Tools Used | Power BI, Excel, Tableau | Python, R, ML algorithms
Complexity | Less complex | More complex
Output | Reports, dashboards | Models, predictions
Example | Monthly sales dashboard | Customer churn prediction using an ML model



Data Wrangling & its various methods
Data Wrangling is the process of cleaning, transforming & organizing raw, messy data into a clean
and structured format for analysis.
In simple words:
Data Wrangling = Cleaning + Organizing data

Methods of Data Wrangling -


1. Discovering:
Before doing anything, you must understand what the data contains.
You look for patterns, data types, and how big the dataset is.

2. Structuring:
Raw data often comes in one big mess (like a long text file). Structuring
means organizing this data into a proper format, such as Rows and
Columns (Tables).

3. Cleaning:
This is the most important step. It involves:
• Removing duplicate records.
• ❌ Handling null/missing values (either deleting them or filling
them with an average).

4. ➕ Enriching:
Sometimes the current data isn't enough. Enriching means adding new
information from other sources to make the analysis better.
(Example: Adding "Pin Code" to an "Address" dataset).

5. ✅ Validating:
This is a quality check. You verify if the data follows the rules.
(Example: Checking if the "Birth Date" column contains only dates and
not random text).

6. Publishing:
The final cleaned and structured data is saved into a new file or database
so that Data Scientists can use it for analysis.



Real-Life Applications
• Healthcare data cleaning
• E-commerce recommendation systems
• Financial data analysis
• Social media data processing

Why is Data Wrangling Important in Data Science?


Introduction
Data Wrangling is a very important step in Data Science because real-world data is often messy,
incomplete, and inconsistent. Before analysis or model building, data must be cleaned and organized.
In simple words:
Data Wrangling makes data usable

Importance of Data Wrangling -


1. Improves Data Quality
• Removes errors, missing values, and duplicates
• Ensures data is accurate and reliable
Good quality data = Correct results ✅

2. Increases Accuracy of Models


• Machine learning models depend on clean data
• Poor data leads to wrong predictions
“Garbage in = Garbage out” ❌

3. Makes Data Easy to Analyse


• Converts raw data into structured format
• Simplifies analysis and visualization
Easy to use in tools like Excel, Power BI

4. Saves Time and Effort
• Clean data reduces errors during analysis


• Makes model building faster and smoother

5. Integrates Data from Multiple Sources


• Combines data from different files, databases, APIs
• Creates a unified dataset
Example: Merging customer data from different platforms

6. Helps in Better Decision-Making


• Clean data provides meaningful insights
• Leads to correct business decisions

7. Handles Complex Real-World Data


• Real-world data is:
○ Incomplete
○ Noisy
○ Unstructured
• Wrangling makes it usable

8. Detects and Fixes Outliers


• Removes extreme or incorrect values
• Prevents misleading analysis

Example:
In your house price prediction project :
• Missing prices → filled or removed
• Wrong values → corrected
• Duplicate records → removed
Result: More accurate prediction model

Data Cleaning
Definition
Data Cleaning is the process of identifying and handling errors, removing unwanted data,
inconsistency and improving the quality of data so that it can be used for analysis.
In simple words:
Data Cleaning = Fixing and improving raw data

Why is Data Cleaning Important?


• Improves data quality
• Ensures accurate analysis
• Removes incorrect or duplicate data
• Essential before analysis or modelling

Methods of Data Cleaning-


1. Handling Missing Values :
• Fill missing values (mean, median, mode)
• Remove records with too many missing values
• Replace with default values.
2. Removing Duplicates :
• Identify repeated records in the data.
• Delete duplicate entries
3. ❌ Correcting Errors :
• Fix wrong or invalid values
Example: Marks = 150 (invalid)
4. Handling Outliers :
• Detecting unusual values
• Remove or replace them with nearest acceptable value.
5. Standardizing Data :
• Ensure consistent format
Example: Date format, text format, Phone no.- +91 format

Example
• Replacing missing age values with the average age
• Removing duplicate customer records
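The two cleaning steps above can be sketched with pandas (pandas is an assumption here; the notes name no library, and the names and values are hypothetical):

```python
import pandas as pd

# Hypothetical customer records with one duplicate row and one missing age
df = pd.DataFrame({
    "name": ["Rahul", "Sneha", "Sneha", "Amit"],
    "age": [20, 22, 22, None],
})

df = df.drop_duplicates()                        # remove duplicate records
df["age"] = df["age"].fillna(df["age"].mean())   # fill missing age with the average
```

After this, Amit's missing age becomes 21, the average of the known ages 20 and 22.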

Data Transformation
✅ Definition
Data Transformation is the process of converting data into a suitable format or structure for analysis
and modelling.
In simple words:
Data Transformation = Changing data into useful format

Why is Data Transformation Important?


• Makes data useful for analysis
• Improves model performance
• Converts data into required format
• Helps in better understanding

Methods of Data Transformation -


1. Normalization:
• Scales data into a fixed range (0 to 1)
• Used when data values are very large
• Ex:
Original Age | Normalized Value
20 | 0.20
50 | 0.50
80 | 0.80

2. Aggregation:
• ➕ Combining multiple data values into single value
• Reduces dataset size
• Ex: Daily sales → Monthly sales

3. Encoding:
• Converts categorical data into numerical form
• Types:
• Label Encoding – assigns each category a number
• One-Hot Encoding – creates a separate binary column for each category
• Ex:
Gender | Encoded Value
Male | 0
Female | 1

4. Discretization:
• Converts continuous data into categories
• Simplifies complex data.
• Ex:
Age | Category
18 | Young
35 | Adult
60 | Old

5. Data Type Conversion:


• Changing data from one type to another
• Ex: "25" (string) → 25

6. Standardization:
• Converts data to have mean = 0 and standard deviation = 1
• Used in machine learning algorithms

7. Smoothing:
• Removes noise or irregular values from the data
• Ex: Using averages to smooth values



Example-
Raw Data:
Name | Gender | Marks
Rahul | Male | 85
Sneha | Female | 90

After Transformation:
Name | Gender | Marks
Rahul | 0 | 0.85
Sneha | 1 | 0.90
Here:
• Gender is encoded
• Marks are normalized.
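The transformation example above can be reproduced with a short pandas sketch (pandas assumed; the gender mapping and the divisor of 100 follow the table):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Rahul", "Sneha"],
    "gender": ["Male", "Female"],
    "marks": [85, 90],
})

# Encoding: map the categorical gender column to numbers
df["gender"] = df["gender"].map({"Male": 0, "Female": 1})

# Normalization: scale marks into the 0-1 range by dividing by the maximum of 100
df["marks"] = df["marks"] / 100
```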

Data Reduction
Definition
Data Reduction is the process of reducing the size of data while maintaining its important information
so that it becomes easier to store, process, and analyse.
In simple words:
Data Reduction = Less data, same information

Why is Data Reduction Important?


• Large data is difficult to handle
• Saves storage space
• Reduces processing time
• Improves efficiency of analysis and models

Methods / Techniques of Data Reduction -


1. Data Cube Aggregation
• ➕ Combines data into summary form
• Reduces data size & highlights trends
• Ex: Daily sales → Monthly sales
Minute-wise data → Hourly values

2. Dimensionality Reduction
• Reduces the number of attributes (features) in a dataset
• ❌ Removes irrelevant or less important features
• ⚡ Improves performance & reduces complexity
• Techniques: PCA (Principal Component Analysis)
• Ex: 100 features → reduced to 10 main features
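PCA is normally called from a library such as scikit-learn; the following is only a minimal NumPy sketch of the same idea on synthetic data, projecting 10 features onto the top 2 principal components:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))   # 100 records, 10 features (synthetic)

# Centre the data, then project onto the top 2 principal components via SVD
Xc = X - X.mean(axis=0)
_, _, vt = np.linalg.svd(Xc, full_matrices=False)
X_reduced = Xc @ vt[:2].T        # 10 features reduced to 2
```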

3. Numerosity Reduction
• Replaces large data with smaller representation
• Techniques:
• Sampling – take a small sample from the dataset
• Clustering – grouping similar data
• Regression – use a mathematical model
• Ex: Use a sample of 1,000 records instead of 1 lakh

4. Data Compression
• Compresses data to reduce storage size
• It encodes data using fewer bits
• Types:
• Lossless compression: No data loss
• Lossy compression: Some data loss
• Ex: ZIP files, image compression
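Lossless compression can be demonstrated with Python's standard `zlib` module (the repetitive sample data is hypothetical):

```python
import zlib

# Repetitive data compresses very well
data = b"sales,region,amount\n" * 1000

compressed = zlib.compress(data)

# Lossless: decompressing restores every original byte
assert zlib.decompress(compressed) == data
print(len(data), "->", len(compressed))
```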

5. Sampling
• Selects a subset of data that represents the whole dataset
• Techniques:
• Random sampling
• Stratified sampling
• ⚡ Faster analysis with acceptable accuracy
• Ex: Analysing 10,000 records from a dataset of one million customer records
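The sampling example above can be sketched with pandas (assumed; the dataset is synthetic):

```python
import pandas as pd

# A large dataset of one million customer ids (synthetic)
df = pd.DataFrame({"customer_id": range(1_000_000)})

# Random sampling: keep 1% of the rows for faster analysis
sample = df.sample(frac=0.01, random_state=42)
print(len(sample))  # 10000 rows
```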

Example of Data Reduction -


Original Dataset:- 1,00,000 customer records with 50 features.
After Data Reduction:
• 10,000 records selected using sampling
• 15 important features selected using dimensionality reduction.
Result:- Dataset becomes smaller but still contains useful information for analysis.

Data Integration
Definition
Data Integration is the process of combining data from multiple sources into a single, unified dataset
for better analysis and decision-making.
In simple words:
Data Integration = Combining data from different sources

Why is Data Integration Important?


• Data is often stored in different places (files, databases, websites)
• Integration brings all data together
• Provides a complete and consistent view
• Improves accuracy and decision-making

Sources of Data Integration


Data can be combined from:
• Databases (SQL, Excel)
• Websites and APIs
• Cloud storage
• Different departments (sales, marketing, finance)

Steps in Data Integration


1. Data Collection – Gather data from different sources
2. Data Cleaning – Remove errors and inconsistencies
3. Data Transformation – Convert data into common format
4. Data Merging – Combine datasets
5. Data Storage – Store in a unified system

Methods of Data Integration-


1. Data Consolidation
• Data from multiple sources is collected and stored in one central system
• Provides a centralized data system.
Example: Data warehouse



2. Data Federation
• Data remains in different sources but is accessed together through a single interface.
• No need to physically move data
Example: Querying multiple databases together

3. Data Propagation
• Data is copied and synchronized across systems
• Ensures all systems have updated data.
Example: Updating data in multiple systems automatically

Example (Easy to Understand)


A company has data in different departments:
• Sales data → Excel
• Customer data → Database
• Website data → Online
After Data Integration:
• All data is combined into one system
✔ Now company can analyse everything together
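The merging step of data integration can be sketched with pandas (assumed; the departmental datasets and key are hypothetical):

```python
import pandas as pd

# Hypothetical departmental datasets sharing a customer_id key
sales = pd.DataFrame({"customer_id": [1, 2], "amount": [100, 250]})
customers = pd.DataFrame({"customer_id": [1, 2], "city": ["Pune", "Mumbai"]})

# Data merging: combine both sources into one unified dataset
unified = sales.merge(customers, on="customer_id")
```

The merged table contains one row per customer with columns from both sources.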

Data Discretization
Definition
Data Discretization is the process of converting continuous data into a limited number of categories
(groups or intervals).
In simple words:
Data Discretization = Converting numbers into categories

Why is Data Discretization Important?


• Simplifies complex data
• Makes data easier to understand
• Reduces data size (data reduction)
• Helps in better analysis and visualization
• Improves performance of some machine learning models

Types of Discretization
1. Unsupervised Discretization – ❌ Does not use class labels
2. Supervised Discretization – ✅ Uses target variable (class label)

Methods of Data Discretization -


1. Equal Width Binning
• Divides the range of data into intervals of equal size
• Ex: Age (0–100) divided into:
0–20
21–40
41–60
61–80
81–100
Each interval has the same width.
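The equal-width split above can be sketched with `pandas.cut` (pandas assumed; the sample ages are hypothetical):

```python
import pandas as pd

ages = pd.Series([5, 18, 33, 47, 62, 79, 95])

# Five equal-width bins over 0-100, matching the intervals above
binned = pd.cut(ages, bins=[0, 20, 40, 60, 80, 100],
                labels=["0-20", "21-40", "41-60", "61-80", "81-100"])
```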

2. Equal Frequency Binning


• Equal number of data values in each bin
• Ex: Marks of 10 students divided into 2 groups:
Group 1 → 5 students
Group 2 → 5 students
Each group contains equal number of values.
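Equal-frequency binning is what `pandas.qcut` does (pandas assumed; the ten marks are hypothetical):

```python
import pandas as pd

# Marks of 10 students
marks = pd.Series([35, 40, 52, 58, 61, 67, 72, 80, 88, 95])

# Two equal-frequency bins: each group gets the same number of students
groups = pd.qcut(marks, q=2, labels=["Group 1", "Group 2"])
```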

3. Clustering-Based Method
• Groups similar data points together
• Ex: Group customers based on spending habits:
Low spender
Medium spender
High spender

4. Decision Tree Method


• Used to automatically create categories based on the data.
• This method uses machine learning techniques.
• Ex:
Income < 50k → Low
Income ≥ 50k → High

5. Histogram Based Method


• Uses histogram to divide data into bins
• Ex: Height grouped using histogram bars

Example-
Original Data (Continuous):
Age = 18, 25, 35, 45, 60
After Discretization:
Age | Category
18 | Young
25 | Adult
35 | Adult
45 | Middle Age
60 | Old
✔ Data becomes easier to analyse

Imagine you are given a dataset with NULL and NaN values. Apply
data wrangling methods used in data science.
Introduction
In real-world datasets, we often find NULL or NaN (Not a Number) values, which indicate missing or
undefined data.
Data Wrangling techniques are applied to handle these missing values and make the data clean and
usable.
In simple words:
Fix missing data → Make data usable

Methods to Handle NULL and NaN Values -


1. Removing Missing Values
• Delete rows or columns with NULL/NaN values
When to use:
• When missing data is very small
Example:
• Remove rows where marks are missing

2. Replacing with Mean, Median, or Mode

• Fill missing values with statistical values
Example:
• Age column → replace NaN with the average value
• Marks → replace with the median
✔ Most commonly used method

3. Forward Fill / Backward Fill


• Fill missing values using nearby values
• Forward Fill → fills with the previous value
• Backward Fill → fills with the next value
Example:
• NaN replaced with previous value
✔ Useful for time-series data
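Forward and backward fill can be sketched with pandas (assumed; the short series with two gaps is hypothetical):

```python
import pandas as pd

# A short time series with two missing readings
s = pd.Series([10.0, None, None, 13.0])

forward = s.ffill()    # each NaN takes the previous value
backward = s.bfill()   # each NaN takes the next value
```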

4. Using Constant Values


• Replace missing values with fixed value
Example:
• City → Unknown
• Marks → 0

5. Predicting Missing Values


• Use machine learning models to predict missing values
Example:
• Predict missing salary based on age and experience

6. Removing Entire Columns
• Remove entire column if too many missing values


Example:
• Column with 80% NaN values

Example (Easy to Understand)


Raw Data:
Name | Age | Marks
Rahul | 20 | 85
Sneha | NaN | 90
Amit | 19 | NaN

After Wrangling:
Name | Age | Marks
Rahul | 20 | 85
Sneha | 20 | 90
Amit | 19 | 88
✔ Missing values replaced with average

Steps to Apply Data Wrangling


1. Identify NULL/NaN values
2. Decide method (remove/replace)
3. Apply technique (mean, fill, drop)
4. Verify cleaned data
Advantages
• Improves data quality
• Prevents errors in analysis
• Increases model accuracy

