
Unit-1

Introduction to Data Science


1. Introduction

2. Importance

3. Career Paths

4. Core Components

5. Applications

6. Data Science Lifecycle

7. Workflow of a Data Scientist

What is Data Science?


● Data science is a multidisciplinary field that involves transforming raw data into
meaningful insights and actionable knowledge.
● It uses various tools and techniques to manipulate data, allowing us to discover new
and significant information.
● This deep study of massive amounts of data includes extracting insights from both
structured and unstructured data using scientific methods, technologies, and
algorithms.

Example: Choosing the Best Restaurant for Dinner

Imagine you want to choose the best restaurant for dinner. To make this decision, you need to
consider various factors such as:

Type of Cuisine: What kind of food you want to eat (e.g., Italian, Chinese, Indian).

Distance: How far each restaurant is from your location.

Reviews and Ratings: Customer feedback and ratings on platforms like Yelp or Google.

Price Range: The cost of dining at each restaurant.


Operating Hours: Whether the restaurant is open at the time you plan to go.

Reservation Availability: Whether you can get a reservation or if there's a long waiting time.

Ambiance and Environment: The atmosphere of the restaurant (e.g., casual, fine dining).

Special Offers: Any discounts or special deals available.

All these decision factors act as input data. By analyzing this data, you can determine the best
restaurant that meets your preferences and constraints. This process of evaluating multiple
factors to make an informed decision is a form of data analysis, which is a part of data science.

Need for Data Science: -

Some years ago, data was scarce and mostly available in a structured form, which could be easily
stored in Excel sheets and processed using BI tools.

But in today's world, data has become so vast that approximately 2.5 quintillion bytes (2.5 billion
GB) of data are generated every day, leading to a data explosion. Research estimated that by 2020,
1.7 MB of data would be created every single second for every person on earth.
Every company requires data to work, grow, and improve its business.

Handling such a huge amount of data is a challenging task for every organization. So, to
handle, process, and analyze it, we require complex, powerful, and efficient
algorithms and technology, and that technology came into existence as data science.

Types of Data Science Jobs: -

As per various surveys, the data scientist role is becoming the most in-demand job of the 21st
century due to the increasing demand for data science. Some have even called it "the hottest job
title of the 21st century". Data scientists are experts who can use various statistical tools and
machine learning algorithms to understand and analyze data.

The main job roles are given below:


1. Data Scientist
2. Data Analyst
3. Machine Learning Expert
4. Data Engineer
5. Data Architect
6. Data Administrator
7. Business Analyst
8. Business Intelligence Manager

Below is the explanation of some critical job titles of data science.

1. Data Analyst:

A data analyst is an individual who mines large amounts of data, models the data, and looks for
patterns, relationships, trends, and so on. At the end of the day, he or she produces visualizations
and reports that support decision making and problem solving.

Skill required: To become a data analyst, you need a good background in mathematics,
business intelligence, data mining, and basic statistics. You should also be
familiar with computer languages and tools such as MATLAB, Python, SQL, Excel, SAS, R,
JS, Spark, etc.

2. Machine Learning Expert:

The machine learning expert is the one who works with various machine learning algorithms
used in data science such as regression, clustering, classification, decision tree, random forest,
etc.

Skill Required: Programming languages such as Python, C++, R, and Java, along with frameworks
such as Hadoop. You should also understand various algorithms, have strong problem-solving and
analytical skills, and know probability and statistics.

3. Data Engineer:
A data engineer works with massive amounts of data and is responsible for building and
maintaining the data architecture of a data science project. Data engineers also work on the
creation of data set processes used in modeling, mining, acquisition, and verification.

Skill required: A data engineer must have in-depth knowledge of SQL, MongoDB, Cassandra,
HBase, Apache Spark, Hive, and MapReduce, along with programming knowledge of Python, C/C++,
Java, Perl, etc.

4. Data Scientist:

A data scientist is a professional who works with an enormous amount of data to come up with
compelling business insights through the deployment of various tools, techniques,
methodologies, algorithms, etc.

Skill required: To become a data scientist, one should have technical skills in languages and
tools such as R, SAS, SQL, Python, Apache Spark, and MATLAB. Data scientists must also have a
solid grounding in statistics and mathematics, along with visualization and communication skills.

Data Science Components:

● Data:

We will need a lot of data to get started. The more data, the better the analysis. But make
sure you’re also using quality data from reputable sources. This data can be collected from
various sources such as sensors, text documents, social media posts, images, or human
behavior.

● Data visualization:
Data visualization is the process of displaying data in a way that helps people (non-
technical) understand it better. Data visualization makes it simple to access massive
amounts of data in visual form. Data scientists use various tools to visualize their
findings, including graphs, charts, maps, and other visualizations.

● Statistics:
Mathematical concepts such as statistics and probability are the key components of the
data science process and understanding machine learning algorithms.
It is used in collecting and analyzing huge quantities of numerical data to draw useful
conclusions.

Statistics is concerned with gathering, organizing, analyzing, and presenting data.

Probability is concerned with numerical descriptions of how likely an event will occur.

● Programming Language:

Programming is typically used to perform data organizing and analysis. Python and R are two of
the most common programming languages in data science and analysis. R is usually used for
statistical analysis, but Python provides a more thorough approach to data science.
● Machine learning:

Machine learning (ML) is regarded as the backbone of data science. It refers to the process of
training the machine so that it can automatically learn and carry out tasks without the need for
human intervention. With the use of Machine Learning, it becomes easy to create predictions
on unforeseen data. Several algorithms are applied to solve the problems.

For example, in fraud detection, and client retention, ML is used to make predictions, analyze
patterns and provide recommendations.
● Data engineering:

Data engineering is concerned with the technologies and systems used to acquire, manage, and
utilize data. It is mainly concerned with the development of software solutions for data problems.
Ultimately, data engineering enables data to flow from or to the product via the ecosystem and
various stakeholders. Without data engineering, data science cannot be conducted. Compared with
a data scientist, a data engineer is typically a stronger programmer and distributed-systems specialist.
● Domain expertise:

Domain expertise refers to a person’s in-depth familiarity with a certain subject or domain.
There are many areas where we need professionals in their respective fields in data science.
Moreover, you cannot unlock any feature of the ML algorithm without proper knowledge of the
domain from which the data is coming. With domain expertise, you can even build more
accurate models. Of course, you need not be an expert in all fields, but usually, data scientists
are required to be experts in at least one or two domains.

Applications of Data Science:

● Image recognition and speech recognition:
Data science is currently being used for image and speech recognition. When you
upload an image on Facebook, you start getting suggestions to tag your friends.
This automatic tagging suggestion uses an image recognition algorithm, which is part
of data science.
When you say something to "Ok Google", Siri, Cortana, etc., and these devices
respond as per the voice command, this is made possible by speech recognition algorithms.
● Gaming world:
In the gaming world, the use of machine learning algorithms is increasing day by
day. EA Sports, Sony, and Nintendo are widely using data science to enhance user
experience.
● Internet search:
When we want to search for something on the internet, we use different search
engines such as Google, Yahoo, Bing, Ask, etc. All of these search engines use
data science technology to make the search experience better, and you get a
search result within a fraction of a second.
● Transport:
Transport industries also use data science technology to create self-driving cars.
With self-driving cars, it will be easy to reduce the number of road accidents.
● Healthcare:
In the healthcare sector, data science is providing lots of benefits. Data science is
being used for tumor detection, drug discovery, medical image analysis, virtual
medical bots, etc.
● Recommendation systems:
Most companies, such as Amazon, Netflix, Google Play, etc., use data
science technology to create a better user experience with personalized
recommendations. For example, when you search for something on Amazon, you
start getting suggestions for similar products; this is because of data science
technology.

● Risk detection:
Finance industries have always faced issues of fraud and risk of losses, but with the
help of data science, these risks can be reduced.

Most finance companies are looking for data scientists to reduce risk and losses
while increasing customer satisfaction.

Data Science Lifecycle-

1. Problem Definition:
Objective:

Clearly define the business problem or research question you aim to solve.
Determine the goals, deliverables, and success criteria for the project.

Activities:

Collaborate with stakeholders to identify the core problem or question.


Outline project objectives and desired outcomes.
Define success criteria and key performance indicators (KPIs).

Example-

Weather Prediction
Restaurant Recommendation
Stock Price Forecasting
Sentiment Analysis
Image Classification
Speech Recognition
Product Recommendation

2. Data Collection:
Objective:

Identify and gather relevant data from various sources.


Activities:

Identify data sources such as databases, APIs, web scraping, sensors, and surveys.
Collect structured, semi-structured, and unstructured data as needed.
Ensure data collection methods are reliable and comprehensive.

1. Public Datasets
● Google Dataset Search: A search engine for datasets across the web.
● Kaggle: Offers a wide range of datasets for different domains.
● UCI Machine Learning Repository: A collection of databases, domain
theories, and datasets.
● Government Portals: Such as [Link] (USA), [Link] (UK), and other
national and local government open data portals.

2. Surveys and Questionnaires


● Google Forms: Easy to create and analyze surveys.
● SurveyMonkey: A platform for creating professional surveys and analyzing
results

3. APIs (Application Programming Interfaces)


● Social media APIs: Twitter, Facebook, Instagram, etc.
● Weather APIs: OpenWeatherMap, Weatherstack, etc.
● Financial APIs: Alpha Vantage, Yahoo Finance, etc.
● Geolocation APIs: Google Maps, Mapbox, etc.
● Restaurant APIs: Yelp, Zomato, Google Places, etc.

4. Web Scraping
● BeautifulSoup: A Python library for pulling data out of HTML and XML files.
● Scrapy: An open-source and collaborative web crawling framework for
Python.
● Selenium: A portable framework for testing web applications.

3. Data Cleaning and Preprocessing:


Objective:
Prepare the data for analysis by ensuring its quality and transforming it as necessary.

Activities:

Data Cleaning Activities


1. Removing Duplicates
2. Handling Missing Values
3. Data Type Conversion
4. Outlier Detection and Treatment (data points, range)
5. Standardizing and Normalizing Data
6. Handling Inconsistent Data
7. Addressing Data Entry Errors
8. Removing Irrelevant Data

Data Preprocessing Activities


1. Data Transformation
2. Feature Scaling
3. Encoding Categorical Variables
4. Data Aggregation
5. Feature Engineering
6. Binning
7. Text Data Cleaning
8. Time Series Data Smoothing
9. Data Splitting
10. Data Integration

4. Exploratory Data Analysis (EDA):


Objective:

Gain insights into the data through visualization and statistical analysis.
Activities:

1. Descriptive Statistics: Summarize the main characteristics of the dataset using


measures like mean, median, and standard deviation.
2. Data Visualization: Present the data visually through plots such as histograms,
scatter plots, and box plots to uncover patterns and relationships.
3. Distribution Analysis: Examine the distribution of variables to understand their
underlying patterns and potential skewness.
4. Correlation Analysis: Investigate the relationships between variables to identify
potential associations or dependencies.
5. Outlier Detection: Identify and handle data points that deviate significantly from
most of the dataset.
6. Pattern Recognition: Discover underlying patterns or structures within the data to
inform further analysis.
7. Feature Engineering: Create new features or transform existing ones to improve the
performance of machine learning models.
8. Dimensionality Reduction: Reduce the number of features while preserving
important information to simplify analysis or modelling.
9. Hypothesis Testing: Evaluate statistical hypotheses to make inferences about the
population based on sample data.
10. Data Segmentation: Divide the dataset into subsets based on certain criteria to
analyze each segment separately.
11. Trend Analysis: Identify trends or patterns over time to understand long-term
behavior or changes.
12. Seasonality Analysis: Explore recurring patterns or fluctuations that follow seasonal
cycles in time-series data

5. Model Building:
Objective:

Develop predictive or descriptive models based on the data.

Activities:
1. Model Selection: Choose the appropriate machine learning algorithm or model
architecture.
2. Model Training: Train the selected model using the training data.
3. Hyperparameter Tuning: Optimize the model's hyperparameters to improve
performance.

6. Model Evaluation:
Objective:
Assess the performance and validity of the models.

Activities:

1. Performance Metrics Calculation: Compute various evaluation metrics to assess the


model's performance, such as accuracy, precision, recall, F1-score, and ROC-AUC.
2. Model Evaluation: Evaluate the model's performance on unseen data (validation or
test set) to assess its generalization ability.
3. Visualizing Results: Visualize model predictions and evaluation metrics to gain
insights into its strengths and weaknesses.
4. Comparing Models: Compare the performance of multiple models to identify the
best-performing one.
5. Interpreting Results: Interpret the model's performance in the context of the
problem domain and application requirements.
6. Iterative Improvement: Use evaluation results to iteratively refine the model,
feature engineering, or hyperparameters.
7. Model Selection: Select the final model based on its performance on validation or
test data.

7. Deployment:
Objective:

Implement the model in a production environment where it can process new data and
make predictions.

Activities:

Deploy the model into a production environment.


Develop APIs to allow other applications to interact with the model.
Create dashboards or user interfaces for stakeholders to access model outputs.

8. Monitoring and Maintenance:


Objective:

Ensure the model remains accurate and effective over time.

Activities:
Continuously monitor the model's performance in production.
Retrain and update the model with new data.
Collect feedback from users and stakeholders for continuous improvement.

How Does a Data Scientist Work?

A Data Scientist requires expertise in several backgrounds:

● Machine Learning
● Statistics
● Programming (Python or R)
● Mathematics
● Databases

A Data Scientist must find patterns within the data. Before he/she can find the patterns, he/she
must organize the data in a standard format.

Here is how a Data Scientist works:

1. Ask the right questions - To understand the business problem.

2. Explore and collect data - From databases, web logs, customer feedback, etc.
3. Extract the data - Transform the data to a standardized format.
4. Clean the data - Remove erroneous values from the data.
5. Find and replace missing values - Check for missing values and replace them with a
suitable value (e.g. an average value).
6. Normalize data - Scale the values to a practical range (e.g. 140 cm is smaller than
1.8 m; however, the number 140 is larger than 1.8, so scaling is important).
7. Analyze data, find patterns and make future predictions.
8. Represent the result - Present the result with useful insights in a way the "company"
can understand.
Unit-2
Data Collection and Data Pre-Processing
2.1 Essential Python Libraries for Data Science: NumPy, Pandas, Matplotlib, Seaborn,
scikit-learn, SciPy
2.2 Understanding Data and Its Types in Data Science
2.3 Objectives and Resources for Data Collection
2.4 Data Collection Methods: CSV and Excel Files, Web Scraping, APIs, Databases
2.5 Introduction to Data Cleaning and Preprocessing
2.6 Data Cleaning Techniques: Handling Missing Data, Duplicates, and Outliers
2.7 Data Preprocessing Techniques: Data Transformation, Normalization,
Standardization, Encoding Categorical Variables, Data Splitting

Python Libraries for Data Science-

· NumPy
· Pandas
· Matplotlib
· Seaborn
· scikit-learn
· SciPy

Command to install-

pip install numpy pandas matplotlib seaborn scikit-learn scipy


NumPy-
NumPy stands for Numerical Python. It is the fundamental package for scientific computing with
Python. It provides support for multidimensional arrays and matrices along with a collection of
mathematical functions to operate on these arrays efficiently.
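
For illustration, a minimal NumPy sketch (the array values are arbitrary):

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])  # a 2-D array (matrix)
print(a.shape)   # (2, 3)
print(a.mean())  # 3.5 -- aggregate over all elements
print(a * 2)     # element-wise (vectorized) multiplication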
Pandas-
Pandas is a powerful library for data manipulation and analysis. It provides data structures like
DataFrame and Series, which are highly efficient for working with structured data. Pandas is
particularly useful for tasks such as data cleaning, data exploration, and data preparation.
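
A minimal Pandas sketch (column names and values are illustrative):

import pandas as pd

df = pd.DataFrame({'name': ['Asha', 'Ravi', 'Meera'], 'age': [25, 30, 28]})
print(df.head())           # inspect the first rows
print(df['age'].mean())    # per-column statistics
print(df[df['age'] > 26])  # filter rows by a condition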
Matplotlib-
Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations
in Python. It provides a MATLAB-like interface for plotting. Matplotlib is highly customizable
and allows users to create a wide variety of plots and charts.
Seaborn-
Seaborn is a statistical data visualization library based on Matplotlib. It provides a high-level
interface for drawing attractive and informative statistical graphics. Seaborn simplifies the
process of creating complex visualizations by providing built-in themes and functions for
common statistical plots.
Matplotlib:

● Purpose: Matplotlib is a low-level library for creating static, animated, and interactive
visualizations in Python. It provides a lot of control over the visual aspects of plots,
making it highly customizable.
● Abstraction: Low-level. Users often need to write more code to create and customize
plots.

Seaborn:

● Purpose: Seaborn is built on top of Matplotlib and is specifically designed for statistical
data visualization. It provides a high-level interface for drawing attractive and
informative statistical graphics.
● Abstraction: High-level. Seaborn abstracts much of the complexity of Matplotlib,
allowing users to create complex plots with less code.
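
A minimal sketch contrasting the two libraries on the same (made-up) data:

import matplotlib.pyplot as plt
import seaborn as sns

data = [12, 15, 13, 18, 20, 22, 19, 15, 14, 17]

# Matplotlib: low-level -- each element of the plot is set explicitly
plt.hist(data, bins=5)
plt.xlabel('value')
plt.ylabel('count')
plt.title('Histogram with Matplotlib')
plt.show()

# Seaborn: high-level -- one call produces a styled histogram with a density curve
sns.histplot(data, bins=5, kde=True)
plt.show()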
scikit-learn
scikit-learn is a machine learning library that provides simple and efficient tools for data mining
and data analysis. It features various algorithms and utilities for classification, regression,
clustering, dimensionality reduction, and model selection. scikit-learn is built on top of NumPy,
SciPy, and Matplotlib.

Example-
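
A minimal sketch of a typical scikit-learn workflow; the built-in Iris dataset and the
k-nearest-neighbors model are illustrative choices:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load a built-in dataset and split it into training and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a classifier and evaluate it on unseen data
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on the test set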

SciPy:
SciPy is a library used for scientific and technical computing. It builds on NumPy and provides
many functions for numerical integration, interpolation, optimization, linear algebra, statistics,
signal processing, and more. SciPy is designed to work with NumPy arrays and is often used
with other libraries like Matplotlib and scikit-learn.
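
A minimal SciPy sketch (the function and sample values are arbitrary):

from scipy import optimize, stats

# Minimize a simple function: f(x) = (x - 3)^2 has its minimum at x = 3
result = optimize.minimize_scalar(lambda x: (x - 3) ** 2)
print(result.x)  # approximately 3.0

# One-sample t-test: is the sample mean significantly different from 30?
t_stat, p_value = stats.ttest_1samp([28, 31, 29, 33, 30], popmean=30)
print(p_value)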
What is Data and Its Types in Data Science?

What is Data?
In data science, "data" refers to raw information that is collected, measured, and analyzed to
gain insights, make decisions, and create predictive models.

One purpose of Data Science is to structure data, making it interpretable and easy to work with.

Types: -

Data can be categorized into various types based on its structure, origin, and format. Here are the
main types of data in data science:

1. Structured Data

Definition: Data organized into a defined format, usually tabular, with rows and columns.

Examples:

Spreadsheets
CSV files

SQL tables

Characteristics:

Easy to store, query, and analyze using traditional database management systems.

Schema-based with well-defined data types (e.g., integers, strings, dates).

2. Semi-Structured Data

Definition: Data that does not have a rigid structure but still contains tags or markers to separate
elements.

Examples:

XML files

JSON files

NoSQL databases (e.g., MongoDB)

Characteristics:

Flexible format; can contain nested structures.

Easier to store and parse than unstructured data but requires more processing than structured
data.

3. Unstructured Data

Definition: Data that does not have a predefined structure or organization.

Examples:

Text documents

Emails

Social media posts

Images, audio, and video files

Characteristics:
Requires advanced techniques for storage, processing, and analysis.

Often analyzed using natural language processing (NLP), computer vision, and other advanced
algorithms.

4. Time-Series Data

Definition: Data that is collected at different points in time, often at regular intervals.

Examples:

Stock prices

Sensor readings

Weather data

Characteristics:

Ordered sequence of data points.

Analyzed for trends, seasonal patterns, and forecasting.

5. Spatial Data

Definition: Data that represents the physical location and shape of objects.

Examples:

Geographic Information Systems (GIS) data

GPS coordinates

Satellite imagery

Characteristics:

Contains geographic components like latitude, longitude, and elevation.

Used in mapping, navigation, and location-based services.

6. Categorical Data

Definition: Data that represents categories or groups.


Examples:

Gender (Male, Female)

Blood type (A, B, AB, O)

Product categories

Characteristics:

Non-numeric, qualitative.

Analyzed using frequency counts, mode, and proportions.

7. Numerical Data

Definition: Data that represents measurable quantities.

Examples:

Height, weight, age

Sales figures

Temperature

Characteristics:

Quantitative, can be discrete (countable) or continuous (measurable).

Analyzed using statistical methods, mathematical models, and machine learning algorithms.

8. Text Data

Definition: Data that is in a textual format.

Examples:

Articles, blogs, and reviews

Customer feedback

Transcripts of conversations

Characteristics:
Requires processing using natural language processing (NLP) techniques.

Analyzed for sentiment, topic modeling, and keyword extraction.

9. Image Data

Definition: Data that consists of visual content.

Examples:

Photographs

Medical imaging (e.g., X-rays, MRIs)

Satellite images

Characteristics:

Analyzed using computer vision techniques.

Contains pixel information and can be processed for pattern recognition, object detection, and
classification.

10. Audio Data

Definition: Data that represents sound.

Examples:

Music files

Speech recordings

Sound sensors data

Characteristics:

Analyzed using signal processing techniques and speech recognition.

Contains waveform information, frequency, and amplitude.

11. Video Data

Definition: Data that consists of moving visual content.


Examples:

Movies

Surveillance footage

Video calls

Characteristics:

Combination of image and audio data.

Analyzed using techniques from both computer vision and audio processing.

Quantitative Data

1. These types of data are the easiest to explain. They try to find answers to questions such as:

● "how many",
● "how much", and
● "how often"
2. It can be expressed as a number, so it can be quantified. In simple words, it can be measured
by numerical variables.

3. These are easily open for statistical manipulation and can be represented by a wide variety of
statistical types of graphs and charts like line charts, bar graphs, scatter plots, etc.

Examples of quantitative data:

● Scores of tests and exams e.g. 74, 67, 98, etc.


● The weight of a person.
● The temperature in a room.

There are 2 general types of quantitative data:

● Discrete data
● Continuous data

Discrete Data

1. Discrete data shows counts that involve only integers; we cannot subdivide discrete values
into parts.

For example, the number of students in a class is discrete data, since we can
count whole individuals but can't have 2.5 or 3.75 kids.

2. In simple words, discrete data can take only certain values, and the data variables cannot be
divided into smaller parts.

3. It has a limited number of possible values, e.g. days of the month.

Examples of discrete data:

● The number of students in a class.


● The number of workers in a company.
● The number of test questions you answered correctly.

Continuous Data

1. Continuous data represents information that can be meaningfully divided into finer levels. It
can be measured on a scale or continuum and can take almost any numeric value.

For example, we can measure height at very precise scales in different units such as meters,
centimetres, millimetres, etc.

2. The key difference from discrete data is that continuous data can be recorded at many
different levels of measurement, such as width, temperature, time, etc.

3. Continuous variables can take any value between two numbers. For example, between
60 and 82 inches there are millions of possible heights, such as 62.04762 inches,
79.948376 inches, etc.

4. A good rule of thumb for deciding whether data is continuous or discrete: if the point of
measurement can be reduced in half and still make sense, the data is continuous.

Examples of continuous data:

● The amount of time required to complete a project.


● The height of children.
● The speed of cars.

Interval Data

1. Interval data is measurable and ordered, with equal intervals between adjacent values, but it
has no meaningful zero.

Let’s understand the meaning of “Interval Scale”:


In the interval scale, the term 'interval' signifies the space in between, which is important to
remember: interval scales not only tell us about the order but also give information
about the value between each item.

2. Fundamentally, we can present interval data in much the same way as ratio data, but we
must note the difference in their zero points.

3. Hence, with the help of interval data, we can easily compare the degrees of the data and also
add or subtract the values.

4. There are some descriptive statistics that we can calculate for interval data such as :

● Central measures of tendency (mean, median, mode)


● Range (minimum, maximum)
● Spread (percentiles, interquartile range, and standard deviation).

These are not the only statistical things to be calculated, but we can calculate more things also.

Examples of Interval data:

● Temperature (°C or °F)
● Dates (1055, 1297, 1976, etc.)
● Time Gap on a 12-hour clock (6 am, 6 pm)

Ratio Data

1. Ratio data is also in ordered units that have the same difference between adjacent values.

2. Ratio values are the same as interval values, with the one difference that ratio data has
an absolute zero. For example: height, weight, length, etc.

3. Ratio data is measured and ordered with equidistant items and a meaningful zero, and unlike
interval data it can never be negative.

Let's understand this with an example: the measurement of heights.

Height can be measured in units like centimeters, inches, meters, or feet, and it is not possible
to have a negative height.

4. It tells us about the order of the variables, the differences among them, and that they have
an absolute zero.

5. Ratio data is fundamentally the same as interval data, except that zero means none.
6. The descriptive statistics which we can calculate for ratio data are the same as interval data
such as :

● Central measures of tendency (mean, median, mode)


● Range (minimum, maximum)
● Spread (percentiles, interquartile range, and standard deviation).

Example of Ratio data:

● Age (from 0 years to 100+)


● Time interval (measured with a stop-watch or similar)

For the above examples of ratio data, we can see there is an actual and meaningful zero point:
the age of a person, the distance from a specified point, and elapsed time all
have real zeros.

Qualitative Data

1. Qualitative data can’t be expressed as a number, so it can’t be measured. It mainly consists of


words, pictures, and symbols, but not numbers.

2. It is also known as Categorical Data as the information can be sorted by category, not by
number.

3. It can answer questions like:

● "how did this happen", or

● "why did this happen".

Examples of qualitative data:

● Colors e.g. the color of the sea


● Popular holiday destinations such as Switzerland, New Zealand, South Africa, etc.
● Ethnicity such as American Indian, Asian, etc.

In general, there are 2 types of qualitative data:

● Nominal data
● Ordinal data.
Nominal Data

1. This data type is used just for labeling variables, without any quantitative value. The
term 'nominal' comes from the Latin word "nomen", which means 'name'.

2. It just names a thing without implying any particular order. Nominal data is sometimes
referred to as "labels".

Examples of Nominal Data:

● Gender (Women, Men)


● Hair color (Blonde, Brown, Brunette, Red, etc.)
● Marital status (Married, Single, Widowed)

As you can observe from the examples, there is no intrinsic ordering to the variables.

Eye color is a nominal variable with a few levels or categories such as Blue, Green, Brown, etc.,
and there is no way to order these categories in a rank-wise manner, i.e., from highest
to lowest or vice versa.

Ordinal Data

1. The crucial difference from nominal data is that ordinal data shows where a value
stands in a particular order.

2. This type of data is placed into some kind of order by its position on a scale. Ordinal data
may indicate superiority.

3. We cannot do arithmetic with ordinal data because the values only show the sequence.

4. Ordinal variables are considered "in-between" qualitative and quantitative variables.

5. In simple words, we can understand ordinal data as qualitative data for which the values
are ordered.

6. By comparison, nominal data is qualitative data for which the values cannot be placed
in an order.

7. Based on relative position, we can also assign numbers to ordinal data, but we cannot do
math with those numbers. For example: "first, second, third", etc.

Examples of Ordinal Data:

● Ranking of users in a competition: The first, second, and third, etc.


● Rating of a product taken by the company on a scale of 1-10.
● Economic status: low, medium, and high.

Data Collection:
Objective:

· Identify and gather relevant data from various sources.


· Gather sufficient and relevant data to solve the defined problem.
· Ensure the data is accurate, complete, and timely.
· Minimize bias and errors in the collected data.

Activities:
· Identify data sources such as databases, APIs, web scraping, sensors, and
surveys.
· Collect structured, semi-structured, and unstructured data as needed.
· Ensure data collection methods are reliable and comprehensive.
Data collection methods-

1. Surveys : digital or physical questionnaires to collect the desired data.

Ex. Feedback form

2. Transactional Tracking : collect the data generated by the transactions made by users.

Ex. Bills for a customer

3. Interviews and Focus Groups : Interviews and focus groups consist of talking to subjects face-
to-face about a specific topic or issue and collecting the required attributes for analysis.

4. Observation : Observing people interacting with a website or product can be useful for data
collection because of the openness it offers. While less accessible than other data collection
methods, observation enables you to see firsthand how users interact with a website or product.

5. Forms

Online forms are beneficial for gathering data.

Ex. Registration forms

6. Social media monitoring

Ex. Tweets, comments, posts, reactions, etc.

Websites for Data collection-

Here are the names of the websites for data collection: -

Government Open Data Portals:

● National Aeronautics and Space Administration (NASA) Open Data Portal:


[Link] (Space exploration, astronomy data)
● World Bank Open Data: [Link] (Development indicators,
economic data across countries)
● FAOSTAT: [Link] (Agriculture, food production data)
● OECD Data: [Link] (Economic, social, and environmental statistics for
countries worldwide)
● UNdata: [Link] (Comprehensive data on a variety of topics from the
United Nations)

Research Institutions and Universities:

● ICPSR Data Archive: [Link] (Social science research data)


● KDD Cup: [Link] (Datasets used in data mining and knowledge
discovery competitions)
● UCI Machine Learning Repository: [Link] (Classic datasets for
machine learning tasks)
● Dataverse Network: [Link]
source-application-sharing-discovering-and (Open-source repository for research data
across disciplines)

Non-Profit Organizations and NGOs:

● World Health Organization (WHO) Data: [Link] (Global health data,


disease statistics)
● Knoema: [Link] (Economic, demographic, and social data)
● Gapminder: [Link] (Socioeconomic data visualized for global
development comparisons)
● Human Development Reports: [Link]
report-2020 (Human development indicators by country)

Data Aggregators and Platforms:

● Kaggle: [Link] (Platform for data science competitions and public


datasets)
● Google Dataset Search: [Link] (Search engine for
publicly available datasets)
● Quandl: [Link] (Financial and economic data
platform)
● Amazon Public Datasets: [Link] (Datasets for machine
learning, natural language processing, and more)

Social Media Data:


● Twitter Public Data: [Link] (Access to a stream of public tweets)
● Facebook Public Data: [Link]
public-content-access/ (Limited access to anonymized and aggregated data) [Keep in
mind Facebook's terms of use when considering data collection from their platform]

Data collection using Python-

CSV and Excel Files-

Collecting data from CSV and Excel files is common in data science projects. Python's Pandas
library is highly efficient for this purpose.

Installation command-

pip install pandas openpyxl (for excel file)
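
A minimal sketch (file names are placeholders):

import pandas as pd

df_csv = pd.read_csv('data.csv')      # load a CSV file into a DataFrame
df_xlsx = pd.read_excel('data.xlsx')  # load an Excel file (uses openpyxl)

print(df_csv.head())                       # preview the first rows
df_csv.to_csv('cleaned.csv', index=False)  # write a DataFrame back to CSV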

Web Scraping

Web scraping involves extracting data from websites. Python has several libraries that make
web scraping straightforward, such as BeautifulSoup, Scrapy, and Selenium.

BeautifulSoup:

Parses HTML and XML documents to extract data.


Installation command-

pip install requests


pip install beautifulsoup4
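
A minimal BeautifulSoup sketch; the URL and tag structure are placeholders, and you should
always check a site's terms before scraping:

import requests
from bs4 import BeautifulSoup

response = requests.get('https://example.com')
soup = BeautifulSoup(response.text, 'html.parser')

print(soup.title.text)           # the page title
for link in soup.find_all('a'):  # every anchor tag on the page
    print(link.get('href'))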
Using API-

Many services provide APIs (Application Programming Interfaces) to access their data
programmatically. Python's requests library is commonly used for making API requests.
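
A minimal sketch; the endpoint and parameters below are placeholders for a real API:

import requests

response = requests.get('https://api.example.com/data',
                        params={'city': 'Pune'})
if response.status_code == 200:
    data = response.json()  # most APIs return JSON
    print(data)
else:
    print('Request failed:', response.status_code)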

Databases-
Accessing databases to collect data is another common practice. Python provides libraries for
interacting with various types of databases.

· SQL Databases: Using sqlite3, psycopg2 for PostgreSQL, or pymysql for MySQL.
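
A minimal sqlite3 sketch; the database file and table name are placeholders:

import sqlite3
import pandas as pd

conn = sqlite3.connect('example.db')
df = pd.read_sql_query('SELECT * FROM customers', conn)  # query straight into a DataFrame
conn.close()
print(df.head())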
Data Cleaning and Preprocessing-
· Real-world databases are highly susceptible to noisy, missing, and inconsistent data
due to their typically huge size (often several gigabytes or more) and their likely origin
from multiple, heterogeneous sources.
· Data cleaning is a crucial step in the data science workflow that
involves identifying and correcting (or removing) errors and inconsistencies in the data
to improve its quality and reliability. Preprocessing involves transforming raw data
into a clean and usable format to make it suitable for analysis.
Activities-

Data Cleaning Activities


1. Removing Duplicates
2. Handling Missing Values
3. Data Type Conversion
4. Outlier Detection and Treatment (data points, range)
5. Standardizing and Normalizing Data
6. Handling Inconsistent Data
7. Addressing Data Entry Errors
8. Removing Irrelevant Data

1. Handling Missing Data


Missing data is a common issue in datasets. Pandas provides several ways to handle missing
data.

· Identifying Missing Values:

df.isnull().sum() # Count missing values in each column

· Dropping Missing Values:

df.dropna(inplace=True) # Drop rows with any missing values
df.dropna(axis=1, inplace=True) # Drop columns with any missing values

· Filling Missing Values:

df.fillna(0, inplace=True) # Fill missing values with 0
df['column_name'].fillna(df['column_name'].mean(), inplace=True) # Fill with the mean
2. Handling Duplicates
Duplicate data can skew analysis and should be handled appropriately.

· Identifying Duplicates:

df.duplicated().sum() # Count duplicate rows

· Removing Duplicates:

df.drop_duplicates(inplace=True) # Drop duplicate rows

3. Handling Outliers

Outliers can affect the analysis. Detecting and handling them is essential.

· Detecting Outliers:

Q1 = df['column_name'].quantile(0.25)
Q3 = df['column_name'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['column_name'] < (Q1 - 1.5 * IQR)) | (df['column_name'] > (Q3 + 1.5 * IQR))]

· Removing Outliers:

df = df[~((df['column_name'] < (Q1 - 1.5 * IQR)) | (df['column_name'] > (Q3 + 1.5 * IQR)))]

Working Code-
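
A minimal end-to-end sketch combining the cleaning steps above; the file and column names are
placeholders:

import pandas as pd

df = pd.read_csv('data.csv')

df.drop_duplicates(inplace=True)                # 1. remove duplicate rows
df['age'] = df['age'].fillna(df['age'].mean())  # 2. fill missing ages with the mean

# 3. remove outliers in 'salary' using the IQR rule
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
df = df[(df['salary'] >= Q1 - 1.5 * IQR) & (df['salary'] <= Q3 + 1.5 * IQR)]

print(df.info())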

Data Preprocessing Activities


1. Data Transformation
2. Feature Scaling
3. Encoding Categorical Variables
4. Data Aggregation
5. Feature Engineering
6. Binning
7. Text Data Cleaning
8. Time Series Data Smoothing
9. Data Splitting
10. Data Integration

1. Data Transformation
Data transformation involves changing the format, structure, or values of data. Common
transformations include log transformation, square root transformation, and polynomial
transformation.

2. Feature Scaling
Feature scaling ensures that all features have the same scale. Common methods are
standardization and normalization.
Data Standardization

Data standardization involves transforming data to have a mean of zero and a standard
deviation of one. This process ensures that the data follows a standard normal distribution. It’s
useful when the data features have different units or scales.

How it works:

1. Mean Calculation: Calculate the average value (mean) of the data.

2. Standard Deviation Calculation: Determine how much the values in the data deviate
from the mean (standard deviation).
3. Transform: Subtract the mean from each data point and then divide by the standard
deviation.

Example:

● Original data: [10, 20, 30, 40, 50]


● Mean: 30
● Standard deviation: 14.14
● Standardized data: [(10-30)/14.14, (20-30)/14.14, (30-30)/14.14, (40-30)/14.14, (50-
30)/14.14]
● Result: [-1.41, -0.71, 0, 0.71, 1.41]
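
The same transformation with scikit-learn's StandardScaler, one common way to do this in practice:

from sklearn.preprocessing import StandardScaler
import numpy as np

data = np.array([[10], [20], [30], [40], [50]])
scaler = StandardScaler()                  # rescales to mean 0, standard deviation 1
standardized = scaler.fit_transform(data)
print(standardized.ravel())                # approximately [-1.41 -0.71 0. 0.71 1.41]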

Data Normalization-

Data normalization scales the data to a fixed range, typically between 0 and 1. This technique is
useful when you want to ensure that all data features contribute equally to the analysis,
especially when the data has different units or scales.
How it works:

1. Find Min and Max: Identify the minimum and maximum values in the data.
2. Transform: Subtract the minimum value from each data point and then divide by the
range (max - min).

Example:

● Original data: [10, 20, 30, 40, 50]


● Min value: 10
● Max value: 50
● Normalized data: [(10-10)/(50-10), (20-10)/(50-10), (30-10)/(50-10), (40-10)/(50-10),
(50-10)/(50-10)]
● Result: [0, 0.25, 0.5, 0.75, 1]
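
The same transformation with scikit-learn's MinMaxScaler:

from sklearn.preprocessing import MinMaxScaler
import numpy as np

data = np.array([[10], [20], [30], [40], [50]])
scaler = MinMaxScaler()                  # rescales to the [0, 1] range
normalized = scaler.fit_transform(data)
print(normalized.ravel())                # [0. 0.25 0.5 0.75 1.]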

When to Use Each

● Standardization: Use when you assume the data follows a normal distribution and when
the algorithm (like SVM, KNN) might be sensitive to the scale of the data.
● Normalization: Use when you need data within a specific range or when using
algorithms (like neural networks) that are sensitive to the scale of input features.

Both techniques are essential in ensuring that your data is in the right form for analysis and that
your models perform better and more consistently.
3. Encoding Categorical Variables

Categorical variables need to be encoded into numerical values. Common methods are one-hot
encoding and label encoding.

One-Hot Encoding

One-hot encoding is a technique used to convert categorical data into a format that can be
provided to machine learning algorithms to improve predictions. It turns each category value
into a new column and assigns a binary value (1 or 0) to indicate the presence of each category.

How it works:

1. Identify Categories: Determine all unique categories in your data.

2. Create Columns: Create a new binary column for each unique category.
3. Assign Values: For each row, place a 1 in the column for the category that the row
belongs to, and 0 in all other columns.

Example:

● Original data (colors): ["Red", "Green", "Blue"]


● Unique categories: "Red", "Green", "Blue"
● One-hot encoded data:
· Red: [1, 0, 0]
· Green: [0, 1, 0]
· Blue: [0, 0, 1]
So, if you have a list like ["Red", "Green", "Blue", "Green"], the one-hot encoded version would
be:

● [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
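
A minimal sketch using Pandas' get_dummies, one common way to one-hot encode:

import pandas as pd

df = pd.DataFrame({'color': ['Red', 'Green', 'Blue', 'Green']})
encoded = pd.get_dummies(df, columns=['color'])
print(encoded)
# One binary column per category: color_Blue, color_Green, color_Red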

Label Encoding

Label encoding is a technique that assigns a unique integer to each category. This is useful
when the categorical data has an inherent order or ranking.

How it works:

1. Identify Categories: Determine all unique categories in your data.

2. Assign Integers: Assign a unique integer to each category.

Example:

● Original data (colors): ["Red", "Green", "Blue"]


● Unique categories: "Red", "Green", "Blue"
● Assign integers:
· Red: 0
· Green: 1
· Blue: 2

So, if you have a list like ["Red", "Green", "Blue", "Green"], the label encoded version would be:

● [0, 1, 2, 1]
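
A minimal sketch with scikit-learn's LabelEncoder; note that it assigns integers alphabetically,
so the mapping can differ from the hand-worked one above:

from sklearn.preprocessing import LabelEncoder

colors = ['Red', 'Green', 'Blue', 'Green']
encoder = LabelEncoder()
labels = encoder.fit_transform(colors)
print(labels)            # [2 1 0 1] -- integers follow alphabetical order of the classes
print(encoder.classes_)  # ['Blue' 'Green' 'Red']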

When to Use Each


● One-Hot Encoding: Use when the categorical variables do not have any ordinal
relationship (e.g., colors, city names). It's great for algorithms that do not assume any
order in the categories.
● Label Encoding: Use when the categorical variables have a meaningful order or ranking
(e.g., low, medium, high). It’s suitable for algorithms that can understand or require
ordered features.

Both techniques are used to make categorical data understandable for machine learning
models, which generally require numerical input.

4. Data Splitting
Splitting data into training and testing sets is crucial for model evaluation.

Data splitting is a fundamental step in preparing a dataset for training and evaluating machine
learning models. It involves dividing the dataset into separate subsets to ensure that the model
can be properly trained and tested. Here’s an explanation in simple terms:

Why Split Data?


● Training Set: This subset is used to train the machine learning model. The model learns
from this data.
● Validation Set (Optional): This subset is used to tune the model parameters and choose
the best model. It helps to prevent overfitting.
● Test Set: This subset is used to evaluate the final model. It provides an unbiased
assessment of how well the model will perform on new, unseen data.

Common Splitting Techniques


1. Train-Test Split
· How it works: Split the dataset into two parts: one for training the model and
one for testing it.
· Typical ratio: 70-80% training, 20-30% testing.
· Example:
 Original data: 1000 samples
 Training set: 800 samples
 Test set: 200 samples
2. Train-Validation-Test Split
· How it works: Split the dataset into three parts: one for training, one for
validation, and one for testing.
· Typical ratio: 60-70% training, 10-20% validation, 20-30% testing.
· Example:
 Original data: 1000 samples
 Training set: 600 samples
 Validation set: 200 samples
 Test set: 200 samples
3. Cross-Validation (CV)
· How it works: The dataset is split into k subsets (folds). The model is trained k
times, each time using k-1 folds for training and the remaining fold for testing.
This process rotates through all the folds.
· Typical value: k=5 or k=10.
· Example:
 Original data: 1000 samples, k=5
 Split into 5 folds: 200 samples each
 Training/testing cycles: Train on 800 samples, test on 200 samples, repeat
5 times, each time with a different fold as the test set.

Why Is Data Splitting Important?


● Avoid Overfitting: Ensures that the model is not just memorizing the training data but
can generalize to new, unseen data.
● Model Evaluation: Provides a way to assess the model’s performance and compare
different models or parameter settings.
● Parameter Tuning: Helps in tuning model parameters using validation data to find the
best model configuration.

Example in Practice

Suppose you have a dataset of 1000 records of house prices. You want to build a model to
predict house prices based on various features like size, location, and number of bedrooms.

1. Train-Test Split:
· Split the dataset: 800 records for training, 200 records for testing.
· Train the model on the 800 training records.
· Evaluate the model on the 200 test records to see how well it performs on new
data.
2. Train-Validation-Test Split:
· Split the dataset: 600 records for training, 200 records for validation, 200 records
for testing.
· Train the model on the 600 training records.
· Use the 200 validation records to fine-tune the model parameters.
· Evaluate the final model on the 200 test records.
3. Cross-Validation:
· Split the dataset into 5 folds of 200 records each.
· Perform 5 training/testing cycles, each time using a different fold for testing and
the remaining folds for training.
· Average the results from the 5 cycles to get a robust estimate of the model’s
performance.

By splitting the data appropriately, you ensure that your machine learning model is robust,
generalizes well to new data, and is properly evaluated.
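
A minimal sketch of the train-test split and cross-validation with scikit-learn; the synthetic
data stands in for the house-price records:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Synthetic data: 1000 samples, 3 features (e.g. size, location score, bedrooms)
X = np.random.rand(1000, 3)
y = X @ np.array([3.0, 2.0, 1.0]) + 0.1 * np.random.randn(1000)

# Train-test split (80/20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))  # R^2 on the held-out test set

# 5-fold cross-validation on the full dataset
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print(scores.mean())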

Unit-3
Exploratory Data Analysis and Data Visualization
3.1 Importance of Exploratory Data Analysis (EDA)
3.2 Basics of Statistics
3.3 Steps to Perform Exploratory Data Analysis
3.4 EDA Techniques: Univariate, Bivariate, and Multivariate Analysis
3.5 Data Visualization with Matplotlib and Seaborn

Importance of Exploratory Data Analysis (EDA)-


· Exploratory Data Analysis (EDA) is like exploring a new place. You look around, observe
things, and try to understand what’s going on. Similarly, in EDA, you look at a dataset,
check out the different parts, and try to figure out what’s happening in the data. It
involves using statistics and visual tools to understand and summarize data, helping data
scientists and data analysts inspect the dataset from various angles without making
assumptions about its contents
· Exploratory Data Analysis (EDA) is a crucial step in the data science life cycle. It helps
data scientists understand the main characteristics of the data and often employs visual
methods to summarize and interpret these characteristics.
· Exploratory Data Analysis (EDA) is a crucial step in the data analysis process that
involves summarizing the main characteristics of a dataset, often using visual methods.
EDA helps in understanding the data, detecting anomalies, and forming hypotheses. It
sets the foundation for further analysis and modeling by providing deep insights into the
data's structure and patterns.

Why Exploratory Data Analysis is Important?


Exploratory Data Analysis (EDA) is important for several reasons, especially in the context of
data science and statistical modeling. Here are some of the key reasons why EDA is a critical
step in the data analysis process:
1. Understanding Data Distribution:
· Purpose: Understanding how the data is distributed helps in identifying its shape,
central tendency, and variability.
· Importance:
 Helps determine if the data follows a normal distribution or is skewed.
 Understanding distribution is critical for selecting appropriate statistical tests
and models.
· Techniques:
 Histograms
 Density plots
 Box plots
2. Identifying Patterns and Relationships:
· Purpose: Discovering hidden patterns, correlations, and relationships between variables
can provide valuable insights.
· Importance:
 Facilitates feature selection and engineering by identifying important variables.
 Helps in understanding the interaction between variables.
· Techniques:
 Scatter plots
 Correlation matrices
 Pair plots
3. Detecting Outliers and Anomalies:
· Purpose: Identifying outliers and anomalies is essential as they can significantly impact
the analysis and model performance.
· Importance:
 Outliers can skew the results and lead to incorrect conclusions.
 Anomalies might indicate data entry errors or rare events that need special
handling.
· Techniques:
 Box plots
 Z-score analysis
 Isolation Forest
4. Data Cleaning and Preparation:
· Purpose: Cleaning and preparing data is vital for ensuring data quality and accuracy.
· Importance:
 Missing values and inconsistencies can lead to biased analysis and poor model
performance.
 Data preparation ensures that the data is in the correct format for analysis.
· Techniques:
 Handling missing values (e.g., imputation, removal)
 Removing duplicates
 Data transformation and normalization
5. Hypothesis Generation:
· Purpose: EDA aids in formulating hypotheses based on initial data insights, guiding
further analysis and experiments.
· Importance:
 Helps in designing experiments and selecting appropriate statistical tests.
 Provides a basis for testing assumptions and making data-driven decisions.
· Techniques:
 Observing patterns and trends
 Formulating testable hypotheses

Basics of Statistics-
Understanding basic statistics is crucial for performing EDA. Here are some fundamental
concepts and their importance:

Descriptive Statistics:
Descriptive statistics summarize and organize the characteristics of a data set. These statistics
aim to describe the basic features of the data in a study, providing simple summaries and
visualizations. Key measures in descriptive statistics include:
● Measures of Central Tendency: Mean, median, and mode.
● Measures of Variability (Spread): Range, variance, standard deviation.
● Measures of Position: Percentiles and quartiles.
● Visualizations: Charts, graphs, histograms, and scatter plots.

● Mean: The average value of a dataset.


· Helps understand the central tendency of the data.
● Median: The middle value when the dataset is ordered.
· Useful for understanding the central tendency, especially with skewed data.
● Mode: The most frequent value in the dataset.
· Helps identify the most common value.
● Standard Deviation: Measures the spread of the data around the mean.
· Indicates the variability of the data.
● Variance: The square of the standard deviation.
· Another measure of data variability.
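
A minimal Pandas sketch computing these measures (the sample values are arbitrary):

import pandas as pd

heights = pd.Series([150, 160, 160, 165, 170, 175, 180])
print(heights.mean())      # mean
print(heights.median())    # median
print(heights.mode()[0])   # mode
print(heights.std())       # standard deviation
print(heights.var())       # variance
print(heights.describe())  # count, mean, std, min, quartiles, max in one call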

Inferential Statistics:

● Definition: Inferential statistics involves making predictions or inferences about a population


based on a sample of data.
● Objective: To draw conclusions about population parameters (e.g., mean, proportion) using
sample statistics.

Consider you want to know the average height of all adult women in a country (the population
parameter). It's not feasible to measure every single adult woman's height, so you take a random
sample and measure their heights. You then use this sample to estimate the average height of all adult
women in the country.

Key Concepts:
· Population: The entire group being studied (e.g., all adult women in a country).
· Sample: A subset of the population selected for analysis (e.g., heights of 50 randomly
chosen women).
· Parameter: A characteristic of the population (e.g., population mean μ).
· Statistic: A characteristic of the sample (e.g., sample mean x̄).
· Sample Statistic: A measure computed from the sample data, used to estimate a
population parameter. Examples include the sample mean (x̄), sample proportion (p̂), and
sample standard deviation (s).
· Population Parameter: A measure that describes a characteristic of the population,
which is usually unknown and needs to be estimated. Examples include the population
mean (μ), population proportion (p), and population standard deviation (σ).
· Estimation: The process of inferring the value of a population parameter based on
sample data. Estimators, such as sample mean and sample proportion, provide
estimates of the unknown population parameters.
· Point Estimation: Using a single value (statistic) to estimate a population parameter.
o Example: Using the sample mean to estimate the population mean.
· Interval Estimation (Confidence Intervals): Provides a range within which the
population parameter is expected to lie.
· Confidence Interval: A range of values, derived from sample statistics, that is likely
to contain the value of an unknown population parameter. The confidence level
indicates the degree of certainty (e.g., 95%) that the interval contains the parameter.

· Hypothesis Testing: A statistical method used to decide whether there is enough


evidence to reject a null hypothesis about a population parameter. For example,
testing if the population mean is equal to a specified value.
· Null Hypothesis (H0): Assumes no effect or no difference (e.g., sample mean equals a
specified value).
· Alternative Hypothesis (H1): Assumes there is an effect or a difference.
· Test Statistic: Measures the degree to which the sample data deviates from the null
hypothesis.
· P-value: The probability of obtaining the observed results if the null hypothesis is
true.
o If p-value < 0.05, reject the null hypothesis.

● Confidence Interval: A range of values within which a population parameter is


estimated to lie.
· Provides an estimate of the uncertainty around the sample statistics.
Scenario

You are studying the average height of students in a particular school. You take a random sample
of students and measure their heights.

Calculations
1. Sample Mean: You calculate the average height of the students in your sample.
2. Standard Error: You calculate the standard error, which measures the variability of the sample
mean.
3. Confidence Level: You choose a confidence level, commonly 95%.

Confidence Interval Calculation

Using the sample mean, standard error, and a value from the t-distribution or z-distribution
(depending on your sample size and whether the population standard deviation is known), you
compute the confidence interval.

Let's say your sample mean height is 165 cm, the standard error is 2 cm, and you are using a
95% confidence level. For a 95% confidence level the z-value is approximately 1.96, so the
interval is 165 ± 1.96 × 2 = 165 ± 3.92, i.e. approximately (161 cm, 169 cm).
Interpretation
● The confidence interval (161 cm, 169 cm) means that you are 95% confident that the true
average height of all students in the school is between 161 cm and 169 cm.
● If you were to take many samples and calculate the confidence interval for each sample, about
95% of those intervals would contain the true population mean.

Key Points
● Interval: The range of values (e.g., 161 to 169 cm) is the interval.
● Confidence Level: The percentage (e.g., 95%) that expresses how confident you are that the
interval contains the true population parameter.
● Sample-Based: The interval is based on your sample data and provides an estimate about the
population.

Why Use Confidence Intervals?


● Uncertainty Quantification: They provide a way to express the uncertainty around an estimate.
● Reliability: Confidence intervals give a range of plausible values for the population parameter,
offering more information than a single point estimate.
Steps for Performing Exploratory Data Analysis-

1. Data Collection:
· Gather data from various sources (databases, CSV files, APIs, etc.).
· Ensure data is collected in a consistent format.
2. Data Cleaning:
· Handle missing values: fill in with mean/median/mode, use forward/backward
fill, or remove rows/columns.
· Remove duplicates to ensure data integrity.
· Correct inconsistencies (e.g., standardize units, correct typos).
3. Data Transformation:
· Normalize or standardize data if needed.
· Convert categorical variables to numeric (e.g., one-hot encoding).
4. Descriptive Statistics:
· Summarize and describe the main features of the data.
· Compute mean, median, mode, standard deviation, and variance.
5. Data Visualization:
· Create plots and charts to visually inspect the data.
· Use histograms, box plots, scatter plots, etc.
6. Pattern Detection:
· Identify relationships and correlations between variables.
· Use correlation matrices, pair plots, and other techniques.
7. Hypothesis Testing:
· Formulate and test hypotheses based on observed patterns.
· Use statistical tests (e.g., t-tests, chi-square tests).
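The following minimal pandas sketch illustrates several of these steps; the file name
"students.csv" and the "height" and "gender" columns are hypothetical placeholders.

# EDA steps sketch (hypothetical file and columns)
import pandas as pd

df = pd.read_csv("students.csv")                          # 1. Data collection
df = df.drop_duplicates()                                 # 2. Remove duplicate rows
df["height"] = df["height"].fillna(df["height"].mean())   # 2. Fill missing values with the mean
df = pd.get_dummies(df, columns=["gender"])               # 3. One-hot encode a categorical column
print(df.describe())                                      # 4. Descriptive statistics
print(df.corr(numeric_only=True))                         # 6. Correlations between numeric variables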
EDA Techniques-

Here are four key types of EDA techniques:

1. Visualization Techniques: EDA relies heavily on visualization methods to depict
data distributions, trends, and associations. Various charts and graphs, such as
bar charts, line charts, scatter plots, and heatmaps, are used to make data easier
to understand and interpret.
2. Univariate Analysis: Univariate analysis examines individual variables to
understand their distributions and summary statistics. This includes calculating
measures such as mean, median, mode, and standard deviation, and visualizing
the data using histograms, bar charts, box plots, and violin plots.
Ø Histograms: Used to visualize the distribution of a variable.
Ø Box plots: Useful for detecting outliers and understanding the spread and skewness of the
data.
Ø Bar charts: Employed for categorical data to show the frequency of each category.
Ø Summary statistics: Calculations like mean, median, mode, variance, and standard
deviation that describe the central tendency and dispersion of the data.
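A small univariate sketch with seaborn and matplotlib, using a made-up "height" column
(the 190 cm value is included as an obvious outlier):

# Univariate analysis sketch (hypothetical data)
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({"height": [162, 168, 171, 159, 166, 170, 163, 190]})

sns.histplot(df["height"], bins=5)   # distribution of a single variable
plt.show()
sns.boxplot(x=df["height"])          # spread, skewness, and the outlier
plt.show()
print(df["height"].describe())       # mean, std, quartiles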

3. Bivariate Analysis: Bivariate analysis explores the relationship between two
variables. It uncovers patterns through techniques like scatter plots, line graphs,
and heatmaps. This helps to identify potential associations or dependencies
between variables.
· Scatter Plots: These are one of the most common tools used in bivariate analysis. A scatter
plot helps visualize the relationship between two continuous variables.
· Correlation Coefficient: This statistical measure (often Pearson’s correlation coefficient for
linear relationships) quantifies the degree to which two variables are related.
· Cross-tabulation: Also known as contingency tables, cross-tabulation is used to analyze the
relationship between two categorical variables. It shows the frequency distribution of
categories of one variable in rows and the other in columns, which helps in understanding the
relationship between the two variables.
· Line Graphs: In the context of time series data, line graphs can be used to compare two
variables over time. This helps in identifying trends, cycles, or patterns that emerge in the
interaction of the variables over the specified period.
· Covariance: Covariance is a measure used to determine how much two random variables
change together. However, it is sensitive to the scale of the variables, so it’s often
supplemented by the correlation coefficient for a more standardized assessment of the
relationship.
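A minimal bivariate sketch, assuming made-up "study_hours" and "marks" columns:

# Bivariate analysis sketch (hypothetical data)
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({"study_hours": [1, 2, 3, 4, 5, 6],
                   "marks":       [35, 45, 50, 62, 70, 78]})

sns.scatterplot(x="study_hours", y="marks", data=df)  # visualize the relationship
plt.show()
print(df["study_hours"].corr(df["marks"]))  # Pearson correlation coefficient (close to 1 here)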

4. Multivariate Analysis: Multivariate analysis involves examining more than two
variables simultaneously to understand their relationships and combined effects.
Techniques such as pair plots and principal component analysis (PCA) are
commonly used in multivariate EDA.
Ø Pair plots: Visualize relationships across several variables simultaneously to capture a
comprehensive view of potential interactions.
Ø Principal Component Analysis (PCA): A dimensionality reduction technique used to
reduce the dimensionality of large datasets, while preserving as much variance as
possible.
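A short multivariate sketch using the classic iris dataset from scikit-learn (a pair plot plus
PCA down to two components):

# Multivariate analysis sketch
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)  # four numeric features per flower

pca = PCA(n_components=2)             # keep the two components with the most variance
components = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # variance captured by each component (roughly [0.92, 0.05])

sns.pairplot(pd.DataFrame(X, columns=["sepal_len", "sepal_wid", "petal_len", "petal_wid"]))
plt.show()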

In summary:

1. Univariate Analysis: analyzing a single variable to understand its distribution and other
properties.
2. Bivariate Analysis: analyzing two variables to find relationships between them.
3. Multivariate Analysis: analyzing more than two variables simultaneously to understand
complex relationships in the dataset.
Unit-4
Model Development
4.1 Introduction to Machine Learning
4.2 Types of Machine Learning Algorithms: Supervised and Unsupervised
4.3 Steps for Model Development
4.4 Linear Regression Algorithm
4.5 KNN Classification Algorithm
4.6 K-Means Clustering Algorithm

· Machine learning is the field of study that enables computers to learn from data and
make decisions without explicit programming.
· It is a subset of artificial intelligence that focuses on developing algorithms capable
of learning hidden patterns and relationships within data, allowing them to generalize
and make better predictions or decisions on new data.

Types of ML algorithm-
· Supervised learning involves training a model on labelled data, where the algorithm learns
from the input data and its corresponding target (output labels). The goal is to learn a
mapping from input to output, allowing the model to capture the relationship and make
predictions on new data.

Example of supervised learning is house price prediction. In this case, a model is trained to
predict the price of a house based on features such as the number of bedrooms, size, location, and
age of the house.

The house price prediction example works as follows:

1. Training Data:
● Input (features): Information about each house (e.g., square footage, number of
bedrooms, location etc.).
● Labels (target): The actual sale price of each house.
2. Model Training:

The model learns the relationship between the features (e.g., size, location) and the target
(price) by analyzing the patterns in the historical data.

3. Prediction:

Once trained, the model can predict the price of a new house by taking into account its
features, such as square footage or number of rooms.

4. Feedback:

As more data (new house sales) becomes available, the model can be updated to improve its
accuracy in predicting future house prices.

Supervised learning has two types:

● Classification: It predicts the class of a data point based on the independent input
variables. A class is a categorical or discrete value, e.g., whether the image of an animal
shows a cat or a dog.
Classification Examples:
1. Email Spam Detection:
· Objective: Classify emails as either "spam" or "not spam" based on features like
subject line, sender, and keywords.
· Output: A categorical value (spam or not spam).
2. Credit Card Fraud Detection:
· Objective: Classify credit card transactions as either "fraudulent" or "non-
fraudulent" based on transaction details.
· Output: A categorical value (fraudulent or non-fraudulent).
3. Sentiment Analysis:
· Objective: Classify the sentiment of a product review as either "positive,"
"neutral," or "negative" based on the text.
· Output: A categorical value (positive, neutral, negative).

● Regression: It predicts a continuous output variable based on the independent input
variables, e.g., the prediction of house prices based on parameters like house age,
distance from the main road, location, area, etc.
Regression Examples:
1. House Price Prediction:
· Objective: Predict the price of a house based on features such as size, number of
rooms, location, etc.
· Output: A continuous value (e.g., $300,000).
2. Stock Market Prediction:
· Objective: Predict the future price of a stock based on historical stock prices,
trading volumes, and other financial indicators.
· Output: A continuous value (e.g., the stock price after one day).
3. Temperature Forecasting:
· Objective: Predict the temperature for the next day based on weather data (e.g.,
humidity, wind speed, pressure).
· Output: A continuous value (e.g., 22°C).
4. Sales Forecasting:
· Objective: Predict the number of units of a product that will be sold in the next
quarter.
· Output: A continuous value (e.g., 5000 units).

● Common Algorithms:
○ Linear Regression: Used for predicting a continuous output variable based on one or more input
features.
○ Logistic Regression: Used for binary classification problems, predicting the probability of a binary
outcome.
○ Decision Trees: Used for both classification and regression tasks by splitting the data into subsets
based on feature values.
○ Support Vector Machines (SVM): Used for classification tasks by finding the hyperplane that best
separates the classes.
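A minimal sketch of the supervised workflow with scikit-learn; the tiny dataset below is made
up, and logistic regression stands in for any of the algorithms above:

# Supervised learning sketch (hypothetical data)
from sklearn.linear_model import LogisticRegression

X = [[1, 2], [2, 1], [8, 9], [9, 8]]  # input features
y = [0, 0, 1, 1]                      # labels: the "supervision"

model = LogisticRegression().fit(X, y)           # learn the input-to-output mapping
print(model.predict([[1.5, 1.5], [8.5, 8.5]]))   # predictions for new data: [0 1]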

· Unsupervised learning, on the other hand, deals with unlabeled datasets, where algorithms
try to uncover hidden patterns or structures within the data. Unlike supervised learning,
which depends on labeled data to learn patterns or relationships for further predictions,
unsupervised learning operates without such guidance.

Example: Suppose an unsupervised learning algorithm is given an input dataset containing
images of several types of cats and dogs. The algorithm is never trained on the given dataset,
which means it has no prior knowledge of the dataset's features. The task of the unsupervised
learning algorithm is to identify the image features on its own, which it does by clustering
the image dataset into groups according to similarities between the images.

Unsupervised learning has two types:


· Clustering
· Dimensionality Reduction

Clustering Algorithms:

· K-Means: Partitions the data into K clusters where each data point belongs to the
cluster with the nearest mean.
· Hierarchical Clustering: Builds a hierarchy of clusters either by merging or splitting
clusters iteratively.

Dimensionality Reduction Algorithms:

· Principal Component Analysis (PCA): Reduces the dimensionality of the data by
transforming it into a new set of variables (principal components) that capture the most
variance.

1. Grouping Books in a Library:


● Problem: A library has thousands of books, and you want to organize them into
categories based on their content, without having predefined genres.
● Unsupervised Solution: Use clustering to group books into categories based on text
analysis (e.g., fiction, science, history).
2. Organizing Photos on a Smartphone:
● Problem: You have a diverse collection of photos and want to automatically group them
by themes, like family, vacations, or pets, without manually tagging them.
● Unsupervised Solution: Use clustering or face recognition algorithms to group photos
based on visual similarities.
3. Recommending Movies:
● Problem: A streaming service wants to recommend movies to users based on their
viewing history, without predefined genres or ratings.
● Unsupervised Solution: Use collaborative filtering to find clusters of users with similar
viewing habits and recommend movies that similar users have liked.
4. Identifying Customer Segments for Marketing:
● Problem: A retail company wants to tailor marketing strategies to different customer
groups but does not have predefined segments.
● Unsupervised Solution: Use clustering to group customers based on purchase history,
spending patterns, and demographics, allowing for targeted marketing.

Model Development-
Model development is a crucial phase in the data science pipeline, where we transform insights
gained from Exploratory Data Analysis (EDA) into actionable predictive models.

Data Splitting:
• Training Set: Used to train the model (typically 70-80% of the data).
• Test Set: Used to evaluate the final model's performance (typically 20-30% of the data).

Selecting the Right Algorithm:


• Problem Type: Choose based on whether the task is regression, classification, clustering,
etc.
• Data Size and Complexity: Consider algorithms that can handle the data volume and
complexity efficiently.

Model Training:
Use the training data to build the model.

Linear Regression-
Linear regression is one of the simplest and most popular machine learning algorithms. It is a
statistical method used for predictive analysis. Linear regression makes predictions for
continuous/real or numeric variables such as sales, salary, age, product price, etc.
The linear regression algorithm models a linear relationship between a dependent variable (y)
and one or more independent variables (x), hence the name linear regression. Since the
relationship is linear, the model describes how the value of the dependent variable changes
with the value of the independent variable(s).
The linear regression model provides a sloped straight line representing the relationship
between the variables.

Dependent and Independent variables-


Predicting Salary
Dependent Variable (Target Variable)
Salary: The variable we are trying to predict.
Independent Variables (Predictor Variables)
Age, Education Level, Years of Experience, Job Role/Position, Location, Industry, Performance
Ratings, Skills/Certifications.

Predicting Marks (Student Performance)


Dependent Variable (Target Variable)
Marks: The final grades or scores that we are trying to predict.
Independent Variables (Predictor Variables)
Age, Attendance Rate, Study Hours, Participation in Extracurricular Activities,
Parental Education Level, Teaching Quality.
Types of Linear Regression

Linear regression can be further divided into two types of the algorithm:

· Simple Linear Regression:


Simple linear regression is a statistical method used to model the relationship
between two variables: one independent variable (predictor) and one dependent
variable (outcome).

· Multiple Linear regression:


Multiple linear regression extends simple linear regression to model the relationship
between multiple independent variables and one dependent variable.

Simple Linear Regression:

Model Representation:
y = β0 + β1x
where:
· y is the dependent variable
· x is the independent variable
· β0 is the intercept
· β1 is the slope

Example Use Case:


Predicting house prices based on the area of the house (single predictor).
Linear Regression Line

A linear line showing the relationship between the dependent and independent variables is
called a regression line. A regression line can show two types of relationship:

Positive Linear Relationship:

If the dependent variable increases on the Y-axis as the independent variable increases on the
X-axis, then such a relationship is termed a positive linear relationship.

Negative Linear Relationship:

If the dependent variable decreases on the Y-axis as the independent variable increases on
the X-axis, then such a relationship is called a negative linear relationship.

Finding the best fit line:

When working with linear regression, our main goal is to find the best fit line, that means
the error between predicted values and actual values should be minimized. The best fit line
will have the least error.

The different values for the weights or coefficients of the line (β0, β1) give different
regression lines, so we need to calculate the best values for β0 and β1 to find the best fit
line. To calculate this we use a cost function, typically the mean squared error between the
predicted and actual values.
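A minimal simple-linear-regression sketch with scikit-learn; the (area, price) pairs are made
up for illustration:

# Simple linear regression sketch: price = β0 + β1 * area (hypothetical data)
import numpy as np
from sklearn.linear_model import LinearRegression

area = np.array([[500], [750], [1000], [1250], [1500]])     # sq. ft
price = np.array([150000, 200000, 260000, 310000, 360000])  # dollars

model = LinearRegression().fit(area, price)  # fits the best fit line by minimizing squared error
print(model.intercept_, model.coef_)         # estimated β0 and β1
print(model.predict([[1100]]))               # predicted price for an 1100 sq. ft house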
KNN Classification Algorithm-

· K-Nearest Neighbour is one of the simplest Machine Learning algorithms, based on the
Supervised Learning technique.
· K-NN stores all the available data and classifies a new data point based on similarity.
This means that when new data appears, it can be easily classified into a well-suited
category using the K-NN algorithm.
· K-NN can be used for Regression as well as for Classification, but it is mostly used
for Classification problems.
· K-NN is a non-parametric algorithm, which means it does not make any assumption about
the underlying data.
· It is also called a lazy learner algorithm because it does not learn from the training set
immediately; instead it stores the dataset, and at the time of classification it performs an
action on the dataset.
· At the training phase, the KNN algorithm just stores the dataset, and when it gets new
data, it classifies that data into a category that is most similar to the new data.
· Example: Suppose we have an image of a creature that looks similar to both a cat and a
dog, and we want to know whether it is a cat or a dog. For this identification, we can use
the KNN algorithm, as it works on a similarity measure. Our KNN model will find the features
of the new image most similar to the cat and dog images and, based on the most similar
features, place it in either the cat or the dog category.

Why do we need a K-NN Algorithm?

Suppose there are two categories, Category A and Category B, and we have a new data
point x1. Which of these categories will the data point fall into? To solve this type of
problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the
category or class of a particular data point.

How does K-NN work?

The K-NN working can be explained on the basis of the below algorithm:
· Step-1: Select the number K of neighbors.
· Step-2: Calculate the Euclidean distance from the new point to the data points.
· Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
· Step-4: Among these K neighbors, count the number of data points in each category.
· Step-5: Assign the new data point to the category for which the number of neighbors
is maximum.
· Step-6: Our model is ready.
Suppose we have a new data point and we need to put it in the required category.

· Firstly, we will choose the number of neighbors, so we will choose k=5.
· Next, we will calculate the Euclidean distance between the data points. The Euclidean
distance between two points (x1, y1) and (x2, y2) is:
d = √((x2 − x1)² + (y2 − y1)²)
· By calculating the Euclidean distance we get the nearest neighbors: three nearest
neighbors in category A and two nearest neighbors in category B.
· As the majority (3 of the 5) of the nearest neighbors are from category A, this new data
point must belong to category A.
How to select the value of K in the K-NN Algorithm?
Below are some points to remember while selecting the value of K in the K-NN algorithm:
· There is no particular way to determine the best value for K, so we need to try some
values to find the best among them. A commonly preferred starting value for K is 5.
· A very low value for K, such as K=1 or K=2, can be noisy and make the model sensitive
to outliers.
· Large values for K reduce the effect of noise but can blur the boundaries between
categories.
Advantages of KNN Algorithm:
· It is simple to implement.
· It is robust to noisy training data.
· It can be more effective if the training data is large.
Disadvantages of KNN Algorithm:
· The value of K always needs to be determined, which can sometimes be complex.
· The computation cost is high because the distance from the new point to all the
training samples must be calculated.
Example-

In this example, the dataset is a simple, artificial set of points with two features each. Here's the
detailed breakdown of the dataset:

● Features (X): Each point has two features (e.g., [1, 2]).
● Labels (y): Each point is assigned a class label (either 0 or 1).

The dataset consists of 10 samples:

The points [1, 2], [2, 3], and [3, 3] are labeled as class 0.
The points [6, 7], [7, 8], and [8, 8] are labeled as class 1.
The points [1, 0] and [0, 1] are labeled as class 0.
The points [9, 9] and [10, 10] are labeled as class 1.

To determine the class for the input [5, 5] using the K-Nearest Neighbors (KNN) algorithm with
k=3, the algorithm follows these steps:

1. Calculate the Distance: Compute the distance between the input point [5, 5] and each
point in the dataset. The most common distance metric is the Euclidean distance.
2. Identify Neighbors: Find the k nearest neighbors to the input point. Since k=3, we will
identify the three closest points.
3. Vote for the Class: The class of the input point is determined by a majority vote among
the classes of the k nearest neighbors.
Step-by-Step Calculation:
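A sketch of this calculation with scikit-learn, using the dataset above and k=3; the comments
trace the distances by hand:

# KNN sketch for the example above
from sklearn.neighbors import KNeighborsClassifier

X = [[1, 2], [2, 3], [3, 3], [6, 7], [7, 8],
     [8, 8], [1, 0], [0, 1], [9, 9], [10, 10]]
y = [0, 0, 0, 1, 1, 1, 0, 0, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[5, 5]]))
# Euclidean distances from [5, 5]: the closest points are [6, 7] (≈2.24, class 1) and
# [3, 3] (≈2.83, class 0), followed by a tie at ≈3.61 between [2, 3] (class 0) and
# [7, 8] (class 1). If the tie resolves to [2, 3], the 3-neighbor vote is 2-1 for class 0.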
K-means clustering algorithm-

· K-Means Clustering is an Unsupervised Learning algorithm which groups an unlabeled
dataset into different clusters. Here K defines the number of pre-defined clusters that
need to be created in the process: if K=2, there will be two clusters; for K=3, there
will be three clusters; and so on.

· It is an iterative algorithm that divides the unlabeled dataset into k different clusters
in such a way that each data point belongs to only one group of points with similar
properties.

· It allows us to cluster the data into different groups and is a convenient way to
discover the categories of groups in an unlabeled dataset on its own, without the
need for any training.
· It is a centroid-based algorithm, where each cluster is associated with a centroid.
The main aim of this algorithm is to minimize the sum of distances between the data
point and their corresponding clusters.
· The algorithm takes the unlabeled dataset as input, divides the dataset into k
clusters, and repeats the process until the best clusters are found (i.e., the cluster
assignments stop changing). The value of k should be predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:

Ø Determines the best positions for the K center points or centroids through an iterative
process.
Ø Assigns each data point to its closest k-center. The data points that are near a
particular k-center create a cluster.

Hence each cluster has data points with some commonalities, and it is away from the other
clusters.

How does the K-Means Algorithm Work?


The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points as initial centroids (they may or may not be points from the
input dataset).
Step-3: Assign each data point to its closest centroid, which will form the predefined K
clusters.
Step-4: Calculate the new centroid (mean) of each cluster and place it there.
Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of
each cluster.
Step-6: If any reassignment occurs, then go to step-4, else go to FINISH.
Step-7: The model is ready.
Let's understand the above steps with a worked example.

Suppose we have two variables, M1 and M2, plotted on an x-y scatter plot.

· Let's take the number of clusters as K=2, meaning we will try to group the dataset into
two different clusters.
· We need to choose K random points or centroids to form the clusters. These points can be
either points from the dataset or any other points; here we select two points that are not
part of our dataset.
· Now we assign each data point of the scatter plot to its closest centroid. We compute this
using the distance between two points and draw a median line between both centroids.
· Points on the left side of the line are nearer to the K1 (blue) centroid, and points on
the right of the line are closer to the yellow centroid; we color them blue and yellow for
clear visualization.
· As we need to find the closest clusters, we repeat the process by choosing new centroids:
we compute the center of gravity of the points in each cluster and place the new centroid
there.
· Next, we reassign each data point to the new closest centroid, repeating the same process
of finding a median line between the centroids.
· After this step, one yellow point falls on the blue side of the line, and two blue points
fall on the yellow side, so these three points are reassigned to the new centroids.
· As reassignment has taken place, we again go to step-4, which is finding new centroids or
K-points.
· We repeat the process of finding the center of gravity, drawing the median line, and
reassigning the data points until no reassignment occurs.
· Once the model is ready, we can remove the assumed centroids, and the two final clusters
remain.
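A minimal K-means sketch with scikit-learn; the 2-D points are made up, and K=2 as in the
walkthrough above:

# K-means sketch (hypothetical data)
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print(kmeans.labels_)           # cluster assignment of each point
print(kmeans.cluster_centers_)  # final centroids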
Unit-5
Model Evaluation
5.1 Overview of Model Evaluation
5.2 Model Evaluation Metrics for regression: MAE, MSE, RMSE
5.3 Model Evaluation Metrics for classification: Accuracy, Precision, Recall, F1 Score
5.4 Model Evaluation Techniques: Cross-Validation, Train-Test Split
Model evaluation and validation
Model evaluation is the process of using metrics to analyze the performance of a model. Model
development is a multi-step process, and a check should be kept on how well the model
generalizes to future predictions. Evaluating a model therefore plays a vital role, as it lets
us judge the model's performance and analyze its key weaknesses. Common metrics include
Accuracy, Precision, Recall, F1 score, Area Under the Curve, the Confusion Matrix, and Mean
Squared Error.

Model Evaluation Metrics for regression: MAE, MSE, RMSE

In regression tasks, the model predicts a continuous value, and various metrics are used to
evaluate how well the model's predictions match the actual values. The most common metrics
include Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared
Error (RMSE). Let's dive into these with examples and Python code.

Dataset Example:-

We will use a simple example where a model is trained to predict house prices. Suppose we have
predictions for five houses whose absolute errors (|actual − predicted|) are $10,000, $5,000,
$15,000, $10,000, and $10,000.
Mean Absolute Error (MAE)
· Definition: MAE measures the average absolute difference between the predicted and
actual values. It shows how far the predictions are from the true values on average.
· Formula: MAE = (1/n) Σ |actual − predicted|
Interpretation: A lower MAE indicates better performance, as the predicted values are closer to
the actual values.

Total Absolute Error = 10,000 + 5,000 + 15,000 + 10,000 + 10,000 = 50,000


MAE = 50,000 / 5 = 10,000

Mean Squared Error (MSE)


· Definition: MSE calculates the average of the squared differences between actual and
predicted values. By squaring the errors, it penalizes larger errors more than smaller
ones.
· Formula: MSE = (1/n) Σ (actual − predicted)²

Interpretation: A lower MSE indicates better performance, but it is sensitive to outliers due to
the squaring of errors.

Total Squared Error = 100,000,000 + 25,000,000 + 225,000,000 + 100,000,000 + 100,000,000
= 550,000,000

MSE = 550,000,000 / 5 = 110,000,000

Root Mean Squared Error (RMSE)

· Definition: RMSE is the square root of the MSE, which brings the error back to the same
scale as the target variable.
· Formula: RMSE = √MSE

Interpretation: A lower RMSE indicates better performance. RMSE gives more weight to
larger errors, making it useful when larger errors are more problematic.

Example Calculation

Using the MSE calculated above: RMSE = √110,000,000 ≈ 10,488
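A sketch computing all three metrics with scikit-learn; the actual and predicted prices are
made-up values chosen so that the absolute errors match this example:

# MAE, MSE, and RMSE sketch (hypothetical prices)
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

actual    = np.array([200000, 250000, 300000, 350000, 400000])
predicted = np.array([210000, 245000, 315000, 340000, 410000])  # errors: 10k, 5k, 15k, 10k, 10k

mae = mean_absolute_error(actual, predicted)  # 10,000
mse = mean_squared_error(actual, predicted)   # 110,000,000
rmse = np.sqrt(mse)                           # ≈ 10,488
print(mae, mse, rmse)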

Insights-

The choice of using MAE, MSE, or RMSE depends on the specific characteristics of your
problem and what you prioritize in evaluating your regression model. Here's how to use each
one based on different needs:

1. Mean Absolute Error (MAE)


· When to Use:
· You want a simple, easily interpretable measure of error in the same unit as the target
variable.
· You care about all errors equally, regardless of their size.
· Outliers or large deviations in predictions are not a major concern for you.
· Example:
· In a housing price prediction model, if the average deviation (error) between predicted
and actual prices is important to understand in straightforward terms, use MAE. For
instance, if MAE is $10,000, you can tell that, on average, your model's predictions are
off by $10,000.
· Pros:
· Easy to interpret.
· Doesn't heavily penalize outliers, so it's useful when you expect some variability or
noise in the data.
· Cons:
· Doesn't give much importance to large errors, so it might not be suitable if reducing
large errors is important.

2. Mean Squared Error (MSE)


· When to Use:
· You want to heavily penalize larger errors and are concerned about outliers.
· You care more about models that handle large deviations (big mistakes) better.
· You are okay with the fact that the error metric is in squared units and harder to
interpret directly.
· Example:
· For predicting the performance of high-risk investments, where large mistakes could be
costly, use MSE. MSE will help to emphasize large prediction errors and encourage
models that minimize those.
· Pros:
· Larger errors are penalized more, so it's useful when big mistakes matter a lot.
· It helps focus on minimizing large deviations, making the model more reliable.
· Cons:
· Harder to interpret because it's not in the same units as the target variable (it's in
squared units).
· Sensitive to outliers, meaning if you have a few very large errors, it can dominate the
MSE value.

3. Root Mean Squared Error (RMSE)


· When to Use:
· You want to emphasize large errors like MSE does but still keep the error metric
interpretable in the same units as the target variable.
· You want a balanced approach that penalizes large errors more but is still easy to
explain.
· Example:
· In weather forecasting, where both small and large errors matter, use RMSE. If RMSE is
5°C, you can interpret it as "on average, the temperature predictions are off by around
5°C."
· Pros:
· Like MSE, it penalizes large errors more than small ones, but it's in the same units as
the target variable, making it easier to understand.
· Widely used in regression problems for this reason—it strikes a balance between
interpretability and sensitivity to large errors.
· Cons:
· Still more sensitive to outliers than MAE, so if your data has extreme outliers, RMSE
might be dominated by a few large errors.

Summary of When to Use:


● MAE is good when you want a simple average error and don’t care about large errors
dominating the metric.
● MSE is best when you want to penalize large errors heavily and don't mind a less interpretable
result.
● RMSE is a great balance, offering penalization of large errors like MSE but remaining easy to
interpret like MAE.
Model Evaluation Metrics for classification-

Classification metrics help evaluate the performance of a classification model by comparing the
predicted categories (e.g., spam vs. not spam) with the actual categories. These metrics are
especially useful when you are dealing with imbalanced datasets or when misclassifications
have different costs (e.g., in medical diagnosis).

Before diving into the metrics, let’s first understand the confusion matrix, which forms the
basis for calculating many of these metrics.

Confusion Matrix-

The confusion matrix is a table that helps summarize the performance of a classification
algorithm by breaking down the predicted outcomes. Here's a recap of the four terms within a
confusion matrix:

● True Positive (TP): The model correctly predicts the positive class.
● True Negative (TN): The model correctly predicts the negative class.
● False Positive (FP): The model incorrectly predicts the positive class (also called a
"Type I Error").
● False Negative (FN): The model incorrectly predicts the negative class (also called a
"Type II Error").
The confusion matrix serves as the foundation for calculating all other classification metrics.
Accuracy-

Accuracy is the most basic metric and measures the proportion of total correct predictions (both true
positives and true negatives) out of all predictions made:
Accuracy = (TP + TN) / (TP + TN + FP + FN)

Example: -

Suppose a model's confusion matrix yields 85 correct predictions (true positives plus true
negatives) out of 100 total predictions. The model then has an accuracy of 85 / 100 = 85%,
meaning that out of 100 total predictions, 85 predictions (either positive or negative) were
correct.

Interpretation:
Accuracy tells you the overall correctness of the model’s predictions. For example, if a model
predicts spam emails and has an accuracy of 90%, it means that it correctly classified 90% of
emails as either spam or not spam.

Limitations:

Accuracy can be misleading when the dataset is imbalanced. For example, if 95% of the emails
are not spam and the model predicts everything as "not spam," the accuracy will be 95%, but the
model hasn't learned to detect spam at all.

Example:

In a test set of 100 emails where 90 are not spam (negative) and 10 are spam (positive):

● If the model predicts all emails as not spam, accuracy would be 90 / 100 = 90%.

Even though the model has a high accuracy, it is useless for detecting spam since it missed all
positive cases.

When to Use:
● Use accuracy when the classes are balanced, meaning the positive and negative classes occur in
roughly equal numbers.
● Avoid relying solely on accuracy when dealing with imbalanced data (e.g., rare disease
detection).

Precision-
Precision measures how many of the predicted positive cases were actually positive. It answers
the question: "Of all the instances the model predicted as positive, how many were
correct?"
Formula: Precision = TP / (TP + FP)

Example 1:-
With 40 true positives and 15 false positives (55 total predicted positives), the precision is
40 / 55 = 72.7%, which means that out of all the instances the model predicted as positive,
72.7% were actually positive.

Example-2:

● If the model predicts 50 emails as spam but only 30 of them are actually spam (20 are
legitimate emails), the precision would be 30 / 50 = 60%.
Interpretation:
● High precision means that the model makes fewer false positive errors.
● In scenarios where predicting a false positive has a high cost, precision becomes important.

Recall (Sensitivity or True Positive Rate)-

Recall measures how many of the actual positive cases the model correctly identified. It
answers the question: "Of all the actual positives, how many did the model correctly
identify?"
Formula: Recall = TP / (TP + FN)

Example: Calculating Recall

Suppose there are 80 actual positive cases, of which the model correctly identifies 50 (true
positives) and misses 30 (false negatives). Then Recall = 50 / 80 = 62.5%.

Interpretation:

The recall is 62.5%, meaning that out of all the actual positive cases (80 total), the model
correctly identified 62.5% of them as positive. The other 37.5% were missed (false negatives).

Insights:
● Recall focuses on the model’s ability to detect actual positive cases. In this case, about 37.5% of
the actual positive cases were missed by the model.
● High recall is crucial when missing positive cases is costly, like in medical diagnosis or fraud
detection where false negatives can have severe consequences.

When to Use:

Use recall when it is critical to capture as many positive cases as possible, even if some false
positives are tolerated (i.e., the model should not miss actual positive instances).

F1 Score-

The F1 Score is the harmonic mean of precision and recall. It provides a single metric that
balances both precision and recall, which can be especially useful when dealing with imbalanced
datasets.
Formula: F1 = 2 × (Precision × Recall) / (Precision + Recall)

Example:

Let's say a medical model for detecting a rare disease has a precision of 70% and a recall of
80%. The F1 score would be: F1 = 2 × (0.70 × 0.80) / (0.70 + 0.80) = 1.12 / 1.5 ≈ 0.747,
i.e., about 74.7%.

This score gives a balance between the model’s ability to correctly identify positives (recall) and
its ability to avoid false positives (precision).

Interpretation:
● The F1 score provides a balanced measure that accounts for both precision and recall,
especially when they are in tension (e.g., when increasing precision decreases recall and vice
versa).
● It’s particularly useful in imbalanced datasets, where one class is much more common than the
other.

When to Use:
● Use the F1 score when you need a balance between precision and recall, especially if you are
dealing with imbalanced classes.
● It is useful in scenarios like rare disease detection, where both missing positive cases (false
negatives) and wrongly predicting positive cases (false positives) can have serious
consequences.

Summary of Metrics:
● Accuracy = (TP + TN) / total predictions: overall correctness; best when classes are balanced.
● Precision = TP / (TP + FP): of the predicted positives, how many were correct; use when false
positives are costly.
● Recall = TP / (TP + FN): of the actual positives, how many were found; use when false
negatives are costly.
● F1 Score = 2 × (Precision × Recall) / (Precision + Recall): a balance of the two, useful for
imbalanced datasets.
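A minimal sketch computing these four metrics with scikit-learn on made-up labels
(1 = positive, 0 = negative):

# Classification metrics sketch (hypothetical labels: 4 TP, 4 TN, 1 FP, 1 FN)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))   # (4 + 4) / 10 = 0.8
print("Precision:", precision_score(y_true, y_pred))  # 4 / (4 + 1) = 0.8
print("Recall   :", recall_score(y_true, y_pred))     # 4 / (4 + 1) = 0.8
print("F1 score :", f1_score(y_true, y_pred))         # 0.8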
Train-Test Split-

The train-test split is a simple and widely used method where the dataset is split into two parts:

● Training set: Used to train the machine learning model.


● Test set: Used to evaluate the model's performance after training.

How it Works:

Typically, you split the data into 70-80% for training and 20-30% for testing. After the model
is trained on the training set, it is evaluated on the test set to measure how well it generalizes to
unseen data.

Steps:
1. Divide the dataset into two subsets: one for training and one for testing (e.g., 80% for training,
20% for testing).
2. Train the model on the training set.
3. Evaluate the model's performance using the test set (e.g., accuracy, precision, recall, etc.).
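A minimal train-test split sketch with scikit-learn, using the built-in iris dataset as a
stand-in:

# Train-test split sketch (80% train / 20% test)
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)  # train on the training set
print(accuracy_score(y_test, model.predict(X_test)))             # evaluate on the test set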
Advantages:
● Simplicity: Easy to implement.
● Speed: Fast, since the model is trained and evaluated only once.

Disadvantages:
● The results depend heavily on how the data is split. If the data split is not representative (e.g.,
training set is much different from the test set), the evaluation may be biased.
● Since the test set is small, it may not give a reliable estimate of model performance for all
possible variations in the data.

Cross-Validation-
Cross-validation is a more robust model evaluation technique that involves splitting the dataset
into multiple parts (called folds) and repeatedly training and testing the model on different
subsets of the data. The most common form is k-fold cross-validation.

How it Works:
● The dataset is divided into k equal-sized folds (usually k = 5 or 10).
● The model is trained on k-1 folds and tested on the remaining fold. This process is repeated k
times, with each fold being used as the test set once.
● The final performance is averaged across all k trials, providing a more reliable estimate of how
the model will perform on unseen data.

Steps:
1. Split the dataset into k equal parts (folds).
2. Train the model on k-1 folds and test on the remaining fold.
3. Repeat the process k times, each time using a different fold as the test set.
4. Calculate the average performance across all k iterations.
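A minimal 5-fold cross-validation sketch with scikit-learn, again using the iris dataset as a
stand-in:

# 5-fold cross-validation sketch
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print(scores)         # one accuracy score per fold
print(scores.mean())  # average performance across the 5 folds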
Advantages:
● More reliable: The model is evaluated on multiple test sets, so the evaluation is less dependent
on how the data is split.
● Reduces bias: All data points are used for both training and testing, which reduces the risk of
overfitting or underfitting.
● Works well on smaller datasets: When you have limited data, cross-validation helps maximize
the amount of data used for training and testing.

Disadvantages:
● Computational cost: Since the model is trained k times, cross-validation can be computationally
expensive, especially for large datasets or complex models.
