Chapter 04 Advanced Use of Python Libraries for AI and Data Science

1

Introduction
What is data science?
Data science is an interdisciplinary field focused on extracting
knowledge, insights, and value from structured and unstructured
data. It combines techniques from statistics, mathematics,
computer science, and domain-specific expertise to analyze,
interpret, and predict patterns in data.

Data science leverages tools and methods like machine learning,


artificial intelligence (AI), data mining, and big data processing to
derive meaningful information that supports decision-making,
problem-solving, and innovation in various industries.

2
Introduction
Key Components of Data Science
• Data Collection and Storage: Gathering data from multiple sources,
including databases, APIs, web scraping, sensors, and more. Ensuring
data is stored in formats that are accessible and organized for analysis is
critical.
• Data Cleaning and Preparation: Before analysis, data often requires
significant cleaning and transformation, such as handling missing values,
correcting errors, standardizing formats, and ensuring consistency.
• Exploratory Data Analysis (EDA): EDA involves understanding data‟s
structure, distributions, and key features through visualization and
summary statistics to form hypotheses and identify trends.

3
Introduction
Key Components of Data Science
• Statistical Analysis and Hypothesis Testing: Using statistical methods to test
hypotheses, identify correlations, and draw inferences from data.
• Machine Learning and Modeling: Applying algorithms to build predictive
models, classify data, detect patterns, and automate decision-making. Models
range from simple linear regressions to complex deep learning models,
depending on the problem and data.
• Data Visualization: Presenting insights visually, using tools like charts, graphs,
dashboards, and infographics. Visualization helps stakeholders (or Users)
understand complex data and models intuitively.
• Deployment and Optimization: Implementing data-driven solutions into
production environments where they can be used in real-time applications. This
includes ongoing monitoring and optimization to ensure models remain accurate
over time.
4
Introduction
Data categories
In data science, data types are generally categorized based on their
structure and format: structured, semi-structured, and unstructured
data
1. Structured Data
Structured data is highly organized and easily searchable in predefined formats, often found in tabular form with
rows and columns.
Characteristics:
Follows a strict schema (e.g., rows and columns in a database).
Easily searchable with SQL queries.
Data types and formats are defined in advance, making the data predictable and consistent.
Examples:
Relational databases (e.g., MySQL, PostgreSQL).
Spreadsheets (e.g., Excel tables).
Financial records, inventory data, and customer information stored in databases.
5
Introduction
Data categories
2. Semi-Structured Data
Semi-structured data lacks a fixed schema but still contains tags or markers to separate
elements, making it partially organized.
Characteristics:
Data doesn‟t strictly adhere to a fixed schema but still has some structure, like tags or
other markers.
More flexible than structured data but still organized enough for certain querying and
parsing methods.
Often represented in text files with a flexible schema (e.g., XML or JSON format).
Examples:
JSON and XML documents.
Email messages (with structured headers but free-form body content).
NoSQL databases (e.g., MongoDB, which allows documents with variable
structures).
HTML pages, where some parts have a defined structure but contain unstructured data
in free-form text.
6
Introduction
Data categories
3. Unstructured Data
Unstructured data lacks any predefined format or organization, making it harder to
process and analyze without specialized tools.
Characteristics:
No inherent structure, schema, or easily searchable format.
Data may be stored in formats that make analysis challenging without natural
language processing (NLP) or image recognition tools.
Often requires data processing and transformation before analysis can be done.
Examples:
Text documents, PDF files.
Media files (e.g., images, audio, video).
Social media posts, chat messages, and emails.
Sensor data from IoT devices, which may have raw, unprocessed values.
7
Introduction
Data categories

Summary Table

Data Type        | Characteristics                                             | Examples
Structured       | Rigid schema, organized, easily searchable                  | Relational databases, Excel sheets
Semi-Structured  | Flexible schema, contains tags/markers for organization     | JSON, XML, NoSQL databases, emails
Unstructured     | No schema, difficult to search/organize without processing  | Images, audio, video, social media posts, text files

8
Introduction to Key Libraries for AI and
Data science

1- NumPy (Numerical Python) :


Multidimensional array manipulation

9
Introduction to NumPy

NumPy (Numerical Python) is a powerful library


in Python used for numerical computations.

One of its most important features is the support for


multidimensional arrays, which are known as
ndarray objects.

These arrays make it easy to perform mathematical


and logical operations on data efficiently.

10
Introduction to NumPy
Why Use NumPy Over Regular Python Lists
Performance: NumPy arrays are implemented in C, which makes array
operations significantly faster than equivalent operations on regular Python
lists. This is particularly true for large datasets.
Memory Efficiency: NumPy uses less memory than lists, which is
essential when working with big data.
Vectorized Operations: NumPy supports vectorized operations, allowing
you to apply a function to an entire array without needing explicit loops,
which enhances performance.
Mathematical Functions: NumPy includes a wide variety of built-in
mathematical functions and statistical operations, making it easier to
perform complex calculations.
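
As a quick illustration of the vectorized-operations point above, here is a minimal sketch (the array size and contents are arbitrary examples); on large arrays the vectorized version is typically much faster than the explicit loop:

import numpy as np

values = np.arange(1_000_000)
squared = values ** 2                                # vectorized: applied element-wise in C, no Python loop
squared_list = [v ** 2 for v in range(1_000_000)]    # equivalent pure-Python loop, noticeably slower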
11
Introduction to NumPy

Applications in Data Science and AI


•Data Preprocessing: Clean and transform datasets by handling missing data, normalizing values, and
reshaping data structures for machine learning models.
•Feature Extraction: Use NumPy functions to extract and manipulate features from datasets.
•Statistical Analysis: Calculate statistics to understand data distributions, variances, and correlations that
guide model selection.
•Linear Algebra: Perform operations such as matrix decomposition, linear transformations, which are
essential for building machine learning algorithms.
•Training Data Preparation: Use ndarray operations to efficiently shuffle, split, and prepare training and
testing sets.
•Model Implementation: Implement basic neural networks and custom algorithms using NumPy for
operations like dot products and activation functions.
•Simulation and Testing: Create mock data or simulate scenarios to test and validate machine learning
models.

12
Introduction to NumPy

Practical Applications
•Data Analysis: Efficiently compute statistical metrics.
•Image Processing: Manipulate pixel data as 2D or 3D arrays.
•Scientific Computing: Perform simulations and mathematical modeling.
•Machine Learning Pipelines: Integrate data transformation and feature
scaling as part of ML workflows

13
Introduction to NumPy
pip install numpy will install the latest version of NumPy and any dependencies it requires
import numpy as np
Creating Multidimensional Arrays
1D Array: arr=np.array([1, 2, 3])
2D Array: arr=np.array([[1, 2, 3], [4, 5, 6]])
3D Array: arr=np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
ND Array (N-dimensional)
# Using zeros, ones, and arange …etc
zeros_array = np.zeros((3, 3)) # 3x3 array of zeros
ones_array = np.ones((2, 4)) # 2x4 array of ones
range_array = np.arange(0, 10, 2) # Array: [0, 2, 4, 6, 8]
full_array = np.full((3, 3), 7) # 3x3 array filled with the value 7
empty_array = np.empty((2, 3)) # 2x3 uninitialized array
linear_space = np.linspace(0, 10, 5) # 5 values from 0 to 10
random_array = np.random.rand(3, 2) # 3x2 array of random numbers between 0 and 1
random_integers = np.random.randint(1, 10, (2, 3)) # 2x3 array of random integers between 1 and 9
14
Introduction to NumPy
pip install numpy will install the latest version of NumPy and any dependencies it requires
import numpy as np
Specifying Data Types During Creation:
When creating an array, you can specify the data type (dtype) to ensure the array
elements are of a particular type, such as integers, floats, or complex numbers. This can
be done using the dtype parameter.
import numpy as np
# Integer array
int_array = np.array([1, 2, 3, 4], dtype=int)

# Float array
float_array = np.array([1, 2, 3, 4], dtype=float)

# Complex array
complex_array = np.array([1, 2, 3, 4], dtype=complex)

Specifying dtype helps in controlling memory usage and ensures consistent data handling, especially when performing mathematical operations.
15
Introduction to NumPy
Typecasting Arrays with astype()
You can change the data type of an existing array using the astype()
method. This is especially useful if you need to convert an array's data type
after creation.

import numpy as np
# Converting integer array to float
float_array = int_array.astype(float)

# Converting float array to integer
int_array = float_array.astype(int)

# Converting integer array to complex
complex_array = int_array.astype(complex)

Typecasting can be beneficial for compatibility with specific functions or for saving memory by switching to smaller data types when possible.

16
Introduction to NumPy
Working with Complex Data Types: Structured Arrays and Records
Structured arrays allow you to handle complex data by combining multiple data types in a
single array, similar to records or rows in a database table. Structured arrays are created
using compound data types, often useful when working with tabular data.
Creating a Structured Array
You can define structured arrays by specifying field names and data types:
import numpy as np
# Define a structured array with fields 'name', 'age', and 'height'
person_dtype = [('name', 'U10'), ('age', 'i4'), ('height', 'f4')]
people = np.array([('Alice', 25, 5.5), ('Bob', 30, 6.0)], dtype=person_dtype)

•U10 specifies a Unicode string of length 10 for name.


•i4 specifies a 4-byte integer for age.
•f4 specifies a 4-byte float for height.

17
Introduction to NumPy

Accessing Structured Array Data:


You can access data in structured arrays by field name or index:

# Access all names


names = people['name']

# Access the first person's age


first_age = people['age'][0]

Structured arrays are ideal when dealing with datasets that contain multiple data types across fields,
such as database records or CSV data.

18
Introduction to NumPy
Array Indexing and Slicing
• Indexing: Access specific elements (e.g., arr[0, 1] for a 2D array).
• Slicing: Extract subarrays (e.g., arr[:, 1] for selecting the second
column).
Reshaping and Flattening Arrays
• Reshape: Change the shape of an array without altering its data
(e.g., arr.reshape(3, 2)).
• Flatten: Convert a multidimensional array to a 1D array using
arr.flatten() or arr.ravel().
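
A minimal sketch combining these operations on a small 2D array (the values are arbitrary):

import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

print(arr[0, 1])          # indexing: row 0, column 1 -> 2
print(arr[:, 1])          # slicing: second column -> [2 5]
print(arr.reshape(3, 2))  # reshape to 3 rows x 2 columns
print(arr.flatten())      # flatten to 1D -> [1 2 3 4 5 6]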

19
Introduction to NumPy
Some Attributes and Functions
Attributes:
shape: Dimensions of the array.
size: Total number of elements in the array.
dtype: Data type of the array elements.
ndim: Number of dimensions (axes) of the array.
itemsize: Number of bytes per element in the array.
Functions:
reshape(): Change the shape of an array while keeping the same number of elements.
flatten(): Convert a multi-dimensional array into a 1D array.
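
For example, the attributes can be inspected as follows (a minimal sketch; the exact dtype and itemsize depend on the platform):

import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr.shape)     # (2, 3)
print(arr.size)      # 6
print(arr.dtype)     # e.g. int64
print(arr.ndim)      # 2
print(arr.itemsize)  # bytes per element, e.g. 8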

20
Introduction to NumPy

Advanced Indexing Techniques


• Boolean Indexing: Select elements based on a condition (e.g., arr[arr > 5]).
• Fancy Indexing: Use arrays of indices to access specific elements (e.g.,
arr[[0, 2], [1, 3]]).
Manipulating Arrays
• Concatenation: Combine arrays using np.concatenate([arr1, arr2], axis=0).
• Stacking: Vertical stacking (np.vstack([arr1, arr2])) and horizontal stacking
(np.hstack([arr1, arr2])).
• Splitting: Divide arrays using np.split(arr, sections, axis=0).
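
A short sketch of these techniques on small example arrays (values chosen only for illustration):

import numpy as np

arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
print(arr[arr > 5])          # boolean indexing -> [6 7 8]
print(arr[[0, 1], [1, 3]])   # fancy indexing: elements (0,1) and (1,3) -> [2 8]

a = np.array([1, 2])
b = np.array([3, 4])
print(np.concatenate([a, b]))     # [1 2 3 4]
print(np.vstack([a, b]))          # [[1 2] [3 4]]
print(np.hstack([a, b]))          # [1 2 3 4]
print(np.split(np.arange(6), 3))  # three arrays of two elements each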

21
Introduction to NumPy
Array Operations
Element-wise Operations: Operations applied to each element (e.g., arr1 + arr2).
Matrix Multiplication: Use np.dot(arr1, arr2) or arr1 @ arr2.
Broadcasting: Automatic expansion of array dimensions for operations (e.g., adding a 1D array to a 2D array). Rule: two dimensions are compatible when they are equal, or when one of them is 1.

import numpy as np
# Add a 1D array to each row of a 2D array
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([10, 20, 30])
print(a + b)
# Output: [[11, 22, 33], [14, 25, 36]]
22
Introduction to NumPy

Aggregation Functions
• np.sum(arr): Sum of all elements.
• np.mean(arr): Mean value.
• np.min(arr), np.max(arr): Minimum and maximum values.
• np.std(arr): Standard deviation.

23
Introduction to NumPy

Handling Missing Data


• Use np.nan to represent missing values.
• Functions like np.isnan(arr) can identify NaN values.
• np.nan_to_num(arr) replaces NaN with zero.
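
A minimal sketch of these functions (the array values are arbitrary):

import numpy as np

arr = np.array([1.0, np.nan, 3.0])
print(np.isnan(arr))        # [False  True False]
print(np.nan_to_num(arr))   # [1. 0. 3.]
print(np.nansum(arr))       # NaN-aware sum -> 4.0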

24
25
Pandas ("Panel Data") is a Python library used for data analysis.
Panel data is a term from econometrics and statistics; it refers to data that contains observations over multiple time periods for the same individuals or entities.

26
Pandas extends NumPy by providing functions for exploratory
data analysis, statistics, and data visualization.

It can be thought of as Python's equivalent of Microsoft Excel


spreadsheets for working with and exploring tabular data.

27
Key Features of Pandas

• Create a structured data set like R‟s data frame and Excel spreadsheet.
• Reading data from various sources such as CSV, TXT, XLSX, JSON, SQL
database etc.
• Writing and exporting data in csv or Excel format ( built in methods)
• It has a fast and efficient DataFrame object with default and customized
indexing.
• It is used for reshaping and pivoting of the data sets.
• It can do group by data aggregations and transformations.
• It integrates with the other libraries like SciPy, and scikit-learn.
• Many more …
28
29
30
Difference between NumPy and Pandas
Before we dive into the details of Pandas, we would like to point out some
important differences between NumPy and Pandas.
• Pandas mainly works with the tabular data and is popularly used for data
analysis and visualization, whereas the NumPy works primarily with the
numerical data.
• Pandas provides some sets of powerful tools like DataFrame and Series
that is mainly used for analyzing the data, whereas NumPy offers a
powerful object called ndArray.
• The performance of NumPy is better than Pandas for 50K rows or fewer, whereas Pandas is better than NumPy for 500K rows or more. Between 50K and 500K rows, performance depends on the kind of operation.
• NumPy consumes less memory as compared to Pandas.
• Indexing of Series objects is quite slow as compared to NumPy arrays.
31
Pandas Data Objects
Pandas provides two data structures/objects for processing the data, i.e., Series
and DataFrame. Pandas objects can be thought of as enhanced versions of
NumPy structured arrays in which the rows and columns are identified with
labels rather than simple integer indices.
1. Pandas Series: It is a one-dimensional labeled array to store the data
elements which are generally similar to the columns in Excel. A Series
contains a sequence of values and an associated array of data labels,
called its index.

2. Pandas DataFrame: It is a mutable two-dimensional data structure


with labeled rows and columns, which are generally similar to Excel spreadsheets or SQL tables.
32
Pandas Data Objects

Note that the individual columns in Pandas are referred to as “Series”. A


collection of series forms a DataFrame with a common index. In the case of
lack of alignment, Pandas introduces missing values as NaN.

33
Pandas Series Object

• Series are a special type of data structure available in the Pandas


library.
• It is defined as a one-dimensional labeled, homogeneously typed
array.
• One can create a series with objects of any data type - be it
integers, floats, strings, any other data type.

• It can be seen as a data structure with two arrays: One functioning


as the index, i.e., the labels, and the other contains the actual data
values.
• There are several different ways to create a series.
34
Pandas Series Object

We can easily convert the list, tuple, and dictionary into series using “Series”
method with the following syntax.
pandas.Series(data, index, dtype, copy)

Series has the following parameters:


• data: It should be a python object like list, dictionary, or a scalar value.
• index: The value of index should be unique and hashable. It must be of the
same length as data. If we do not pass any index, default np.arange(n) will
be used where n is the length of data.
• dtype: It refers to the data type.
• copy: It is used for copying the data. (Default is False)
35
Pandas Series Object

• copy: It is used for copying the data. (Default is False)


Constructing a Series from a list with copy=False:

>>> r = [1, 2]
>>> ser = pd.Series(r, copy=False)
>>> ser.iloc[0] = 999
>>> r
[1, 2]
>>> ser
0    999
1      2
dtype: int64

Due to the input data type, the Series has a copy of the original data even though copy=False, so the original data is unchanged.

Constructing a Series from a 1d ndarray with copy=False:

>>> r = np.array([1, 2])
>>> ser = pd.Series(r, copy=False)
>>> ser.iloc[0] = 999
>>> r
array([999, 2])
>>> ser
0    999
1      2
dtype: int64

Due to the input data type, the Series has a view on the original data, so the original data is changed as well.
36
Pandas Series Object

import pandas as pd
The syntax that is used for creating an empty Series.
series1 = pd.Series()
print(series1)

Output : Series([], dtype: float64)

The above example creates an empty Series object that has no values and the default data type, i.e., float64 (note that recent Pandas versions default to object dtype instead).

37
Pandas Series Object
Creating a Series using inputs:
list1 = [10,20,30]
arr1 = np.array([10,20,30])
dict1 = {'a':10,'b':20,'c':30}

We can create Series by using following kinds of inputs:


• List
• Array
• Dictionary

38
Pandas Series Object
In [315]: pd.Series(list1)
Out[315]:
0    10
1    20
2    30
dtype: int64

labels = ['a', 'b', 'c']
In [320]: series1 = pd.Series(list1, index = labels)
In [321]: series1
Out[321]:
a    10
b    20
c    30
dtype: int64
39
Pandas Series Object

In [323]: print(series1[0]) # using numerical index


10
In [324]: print(series1['a']) # using label
10

40
Pandas Series Object

In [325]: ser_dict = pd.Series(dict1)


In [326]: print(ser_dict)
a 10
b 20
c 30
dtype: int64

( Don‟t use index)

41
Pandas Series Object
Accessing Data from Series

In [323]: print(series1[0]) # using numerical index


10

In [329]: series1[[0,1,2]]
Out[329]:
a 10
b 20
c 30
dtype: int64
42
Pandas Series Object
Series object attributes
Attributes Description
Series.index Defines the index of the Series
Series.shape Returns a tuple of shape of the data
Series.dtype Returns the data type
Series.size Returns the total elements in the series
Returns True if series is empty, otherwise
Series.empty
returns False
Returns True if there are any missing
Series.hasnans
values, otherwise returns False
Series.nbytes Returns the number of bytes in the data
Series.ndim Returns the dimensions
43
Pandas Series Object
Series object attributes

In [330]: print(series1.index)
Index(['a', 'b', 'c'], dtype='object')

In [331]: print(series1.values)
[10 20 30]

In [332]: print(series1.dtype)
int64

In [334]: print(series1.shape)
(3,)
44
Pandas Series Object
Series object attributes

In [335]: print("Dimension of series:", series1.ndim)


Dimension of series: 1

In [336]: print("Size of a series:", series1.size)


Size of a series: 3

In [337]: print("Number of bytes:", series1.nbytes, 'bytes')


Number of bytes: 24 bytes

45
Pandas Series Object
Series.to_frame()
A Series is a one-dimensional labeled array that can hold integers, strings, floating-point values, etc. By default its index runs from 0 to n-1, where n is the number of values in the Series. The main difference between a Series and a DataFrame is that a Series contains only a single column of values, whereas a DataFrame is a combination of more than one Series and is used to analyze the data. The series.to_frame() function is used to convert a Series object to a DataFrame.
Syntax:
series.to_frame(name = None)
46
Pandas Series Object

series.to_frame(name = None)

name: The name of the resulting column. Its default value is None; if a value is passed, it will be substituted for the Series name. The method returns the DataFrame representation of the Series.

In [352]: series1.to_frame()
Out[352]:
   0
0  4
1  7
2  6
3  1
4  3
5  4
47
Pandas Series Object

Series.value_counts()
To extract information about the values contained in a Series, use the value_counts() function.
that contains counts of unique values. It returns an object that will be in
descending order so that its first element will be the most-frequently-occurred
element. By default, it excludes NA values.

Syntax
Series.value_counts(normalize = False, sort = True, ascending =
False, bins = None, dropna = True)
48
Pandas Series Object

Series.value_counts()
In [353]: x = pd.Series([4,1,7,6,1])
In [354]: x.value_counts()
Out[354]:
1    2
4    1
7    1
6    1
dtype: int64

49
Pandas DATAFRAME Object

Why Use Pandas DataFrames?


A DataFrame in Pandas is like a table in a database or an Excel
spreadsheet, but with more flexibility and powerful tools for data
manipulation.

50
Pandas DATAFRAME Object
Advantages of DataFrames:

Ease of Use:
Intuitive methods for reading, writing, and manipulating data.
Ability to handle heterogeneous data (e.g., numeric, textual, and categorical
data).
Efficient Data Handling:
Works well with large datasets.
Optimized for performance using built-in vectorized operations.
Flexible Data Manipulation:
Add, modify, and drop columns or rows easily.
Supports reshaping, merging, and filtering data intuitively.

51
Pandas DATAFRAME Object
Advantages of DataFrames:
Data Cleaning:
Handle missing data with tools for filling, interpolating, or dropping NaN values.
Perform data transformations such as renaming columns, changing data types, or applying
functions.
Integration:
Visualize data using libraries like Matplotlib and Seaborn.
Use DataFrames as inputs for machine learning models with Scikit-learn.
Rich Indexing and Slicing:
Access data by labels or positions.
Perform row or column filtering with intuitive syntax.
Support for Data Analysis:
Generate descriptive statistics with .describe().
Group data and compute aggregations using .groupby().

52
Pandas DATAFRAME Object
pip install pandas
Creating DataFrames
import pandas as pd
From dictionaries: data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)
Name Age
0 Alice 25
1 Bob 30

data = [['Alice', 25], ['Bob', 30]]


From lists: df = pd.DataFrame(data, columns=['Name', 'Age'])

From CSV files: df = pd.read_csv('data.csv')  # We will see more later
53


Pandas DATAFRAME Object
Viewing DataFrames

df.head() # First 5 rows


df.tail() # Last 5 rows

Summary Information:

df.info()
df.describe()

Output of df.describe() for the Age column:
        Age
count   2.000000
mean   27.500000
std     3.535534
min    25.000000
25%    26.250000
50%    27.500000
75%    28.750000
max    30.000000
54
Pandas DATAFRAME Object
Data Selection and Indexing
Selecting Data
By columns:
df['Name'] # Single column
df[['Name', 'Age']] # Multiple columns

By rows (using .loc and .iloc):

df.loc[0] # Row by label


df.iloc[0] # Row by index

56
Pandas DATAFRAME Object
Data Selection and Indexing
Filtering Data
Using conditions:
df[df['Age'] > 25]

Setting an Index ( use a column as a label)


Set and reset:
df.set_index('Name', inplace=True)
print(df.loc['Alice'])  # Row by label

Reset index: df.reset_index(inplace=True)
57


Pandas DATAFRAME Object
Data Cleaning
Handling Missing Data
Detect missing values:
df.isnull().sum()
Fill or drop:
df.fillna(0, inplace=True)
df.dropna(inplace=True)

58
Pandas DATAFRAME Object
Renaming Columns
df.rename(columns={'Age': 'Years'}, inplace=True)

Changing Data Types


df['Age'] = df['Age'].astype('float')

59
Pandas DATAFRAME Object
Data Transformation
Adding/Updating Columns

df['NewColumn'] = df['Age'] * 2
Applying Functions
Row-wise or column-wise:

df['Transformed'] = df['Age'].apply(lambda x: x + 5)

60
Pandas DATAFRAME Object
Data Aggregation
Group By
grouped = df.groupby('Name').mean()

Input DataFrame:
      Name  Age  Score
0    Alice   25     85
1      Bob   30     90
2    Alice   28     88
3      Bob   35     85
4  Charlie   40     95

Grouped result:
          Age  Score
Name
Alice    26.5   86.5
Bob      32.5   87.5
Charlie  40.0   95.0

Sorting
df.sort_values('Age', ascending=False, inplace=True)
61
Pandas DATAFRAME Object
Data Aggregation
Using Other Aggregations:
Replace .mean() with other aggregation functions:
• .sum(): Total sum of values in each group.
• .count(): Count of rows in each group.
• .max(): Maximum value in each group.
• .min(): Minimum value in each group.
62
Pandas DATAFRAME Object
Saving DataFrames
To CSV
df.to_csv('output.csv', index=False)

To Excel
df.to_excel('output.xlsx', index=False)

63
64
What is Matplotlib?
• Matplotlib is a low level graph plotting
library in python that serves as a
visualization utility.
• Matplotlib is open source and we can use
it freely.
• Matplotlib is mostly written in Python; a few segments are written in C, Objective-C, and JavaScript for platform compatibility.

65
Installation of Matplotlib: pip install matplotlib
Use it in app: import matplotlib

Checking Matplotlib Version:

import matplotlib
print(matplotlib.__version__)

66
Pyplot:
Most of the Matplotlib utilities lie under the pyplot submodule and are usually imported under the plt alias:

import matplotlib.pyplot as plt


Now the Pyplot package can be referred to as plt.

68
Plotting x and y points
• The plot() function is used to draw points (markers) in a diagram.
• By default, the plot() function draws a line from point to point.
• The function takes parameters for specifying points in the diagram.
• Parameter 1 is an array containing the points on the x-axis.
• Parameter 2 is an array containing the points on the y-axis.
• If we need to plot a line from (1, 3) to (8, 10), we have to pass two arrays
[1, 8] and [3, 10] to the plot function.

69
Plotting x and y points
import matplotlib.pyplot as plt
import numpy as np

xpoints = np.array([1, 8])


ypoints = np.array([3, 10])

plt.plot(xpoints, ypoints)
plt.show()

70
Plotting Without Line
• To plot only the markers, you can use the shortcut string notation parameter 'o', which draws ring (circle) markers without a connecting line.
import matplotlib.pyplot as plt
import numpy as np

xpoints = np.array([1, 8])


ypoints = np.array([3, 10])

plt.plot(xpoints, ypoints, 'o')


plt.show()

71
Multiple Points
• You can plot as many points as you like, just make sure you have the same
number of points in both axis.

import matplotlib.pyplot as plt


import numpy as np

xpoints = np.array([1, 2, 6, 8])


ypoints = np.array([3, 8, 1, 10])

plt.plot(xpoints, ypoints)
plt.show()

72
Default X-Points

If we do not specify the points on the x-axis, they will get the default values 0, 1, 2, 3 etc.,
depending on the length of the y-points.
So, if we take the same example as above, and leave out the x-points, the diagram will look
like this:

import matplotlib.pyplot as plt


import numpy as np

ypoints = np.array([3, 8, 1, 10, 5, 7])

plt.plot(ypoints)
plt.show()
73
Markers
You can use the keyword argument marker to emphasize each point
with a specified marker:

import matplotlib.pyplot as plt


import numpy as np

ypoints = np.array([3, 8, 1, 10])

plt.plot(ypoints, marker = 'o')


plt.show()

74
Markers
You can use the keyword argument marker to emphasize each point with a
specified marker:

...
plt.plot(ypoints, marker = '*')
...

75
Markers
Marker  Description
'o'     Circle
'*'     Star
'.'     Point
','     Pixel
'x'     X
'X'     X (filled)
'+'     Plus
'P'     Plus (filled)
's'     Square
'D'     Diamond
'd'     Diamond (thin)
'p'     Pentagon
'H'     Hexagon
'h'     Hexagon
'v'     Triangle Down
'^'     Triangle Up
'<'     Triangle Left
'>'     Triangle Right
'1'     Tri Down
'2'     Tri Up
'3'     Tri Left
'4'     Tri Right
'|'     Vline
'_'     Hline

76
Format Strings fmt
You can also use the shortcut string notation parameter to specify the marker.
This parameter is also called fmt, and is written with this syntax:
marker|line|color

Example: 'o:r' marks each point with a circle ('o'), connects the points with a dotted line (':'), and colors the line red ('r'):
import matplotlib.pyplot as plt
import numpy as np

ypoints = np.array([3, 8, 1, 10])

plt.plot(ypoints, 'o:r')
plt.show()
77
Marker Size
You can use the keyword argument markersize or the shorter version, ms
to set the size of the markers:
Set the size of the markers to 20:

import matplotlib.pyplot as plt


import numpy as np

ypoints = np.array([3, 8, 1, 10])

plt.plot(ypoints, marker = 'o', ms = 20)


plt.show()

78
Marker Color
You can use the keyword argument markeredgecolor or the shorter mec to
set the color of the edge of the markers:
Set the EDGE color to red:

import matplotlib.pyplot as plt


import numpy as np

ypoints = np.array([3, 8, 1, 10])

plt.plot(ypoints, marker = 'o', ms = 20,


mec = 'r')
plt.show()

79
Marker Color
You can use the keyword argument markerfacecolor or the shorter mfc to set the color
inside the edge of the markers:
Set the FACE color to red:

import matplotlib.pyplot as plt


import numpy as np

ypoints = np.array([3, 8, 1, 10])

plt.plot(ypoints, marker = 'o', ms = 20,


mfc = 'r')
plt.show()

You can use both mec and mfc


plt.plot(ypoints, marker = 'o', ms = 20, mec = 'r', mfc = 'r')
80
Matplotlib Line
Linestyle: You can use the keyword argument linestyle, or
shorter ls, to change the style of the plotted line:
plt.plot(ypoints, linestyle = 'dotted')

plt.plot(ypoints, ls = ':')

Style               Shortcut
'solid' (default)   '-'
'dotted'            ':'
'dashed'            '--'
'dashdot'           '-.'
'None'              '' or ' '
81
Matplotlib Line
Line Color: You can use the keyword argument color or the
shorter c to set the color of the line:
plt.plot(ypoints, color = 'r')
plt.plot(ypoints, c = '#4CAF50')

plt.plot(ypoints, c = 'hotpink')

Line Width: You can use the keyword argument


linewidth or the shorter lw to change the width of
the line.
The value is a floating number, in points:
plt.plot(ypoints, linewidth = 20.5)
82
Multiple Lines
You can plot as many lines as you like by simply adding
more plt.plot() functions:
y1 = np.array([3, 8, 1, 10])
y2 = np.array([6, 2, 7, 11])

plt.plot(y1)
plt.plot(y2)

x1 = np.array([0, 1, 2, 3])
y1 = np.array([3, 8, 1, 10])
x2 = np.array([0, 1, 2, 3])
y2 = np.array([6, 2, 7, 11])

plt.plot(x1, y1, x2, y2)


plt.show()
83
Matplotlib Labels and Title
Create Labels for a Plot: With Pyplot, you can use
the xlabel() and ylabel() functions to set a label for the
x- and y-axis.
plt.plot(x, y)

plt.xlabel("Average Pulse")
plt.ylabel("Calorie Burnage")

Create a Title for a Plot: With Pyplot, you can use


the title() function to set a title for the plot.

plt.title("Sports Watch Data")

84
Set Font Properties for Title and Labels
You can use the fontdict parameter in xlabel(),
ylabel(), and title() to set font properties for the title
and labels.

font1 = {'family':'serif','color':'blue','size':20}
font2 = {'family':'serif','color':'darkred','size':15}

plt.title("Sports Watch Data", fontdict = font1)


plt.xlabel("Average Pulse", fontdict = font2)
plt.ylabel("Calorie Burnage", fontdict = font2)

86
Position the Title

You can use the loc parameter in title() to position the title.
Legal values are: 'left', 'right', and 'center'. Default value is
'center'.

plt.title("Sports Watch Data", loc = 'left')


plt.xlabel("Average Pulse")
plt.ylabel("Calorie Burnage")

87
Matplotlib Adding Grid Lines
Add Grid Lines to a Plot: With Pyplot, you can use the
grid() function to add grid lines to the plot.
plt.plot(x, y)

plt.grid()
plt.show()

Specify Which Grid Lines to Display: You can use the axis
parameter in the grid() function to specify which grid lines to
display.
Legal values are: 'x', 'y', and 'both'. Default value is 'both'.

plt.grid(axis = 'x')

plt.grid(axis = 'y')
88
Matplotlib Adding Grid Lines
Set Line Properties for the Grid: You can also set the line properties of the grid, like this: grid(color = 'color',
linestyle = 'linestyle', linewidth = number).

plt.grid(color = 'green', linestyle = '--', linewidth = 0.5)

89
Matplotlib Subplot
Display Multiple Plots: With the subplot() function you can draw multiple plots in
one figure:
#plot 1:
x = np.array([0, 1, 2, 3])
y = np.array([3, 8, 1, 10])

plt.subplot(1, 2, 1)
plt.plot(x,y)

#plot 2:
x = np.array([0, 1, 2, 3])
y = np.array([10, 20, 30, 40])

plt.subplot(1, 2, 2)
plt.plot(x,y)

plt.show()
90
The subplot() Function
The subplot() function takes three arguments that describes the layout of the figure.
The layout is organized in rows and columns, which are represented by the first and
second argument.
The third argument represents the index of the current plot.

plt.subplot(1, 2, 1)
#the figure has 1 row, 2 columns, and this plot is the first plot.

plt.subplot(1, 2, 2)
#the figure has 1 row, 2 columns, and this plot is the
second plot.

91
The subplot() Function
Example: Draw 2 plots on top of each other:

#plot 1:
x = np.array([0, 1, 2, 3])
y = np.array([3, 8, 1, 10])

plt.subplot(2, 1, 1)
plt.plot(x,y)

#plot 2:
x = np.array([0, 1, 2, 3])
y = np.array([10, 20, 30,
40])

plt.subplot(2, 1, 2)
plt.plot(x,y)
plt.show()
92
The subplot() Function
You can draw as many plots as you like on one figure; just describe the number of rows, columns, and the index of each plot.

x = np.array([0, 1, 2, 3])
y = np.array([3, 8, 1, 10])

plt.subplot(2, 3, 1)
plt.plot(x,y)

x = np.array([0, 1, 2, 3])
y = np.array([10, 20, 30, 40])

plt.subplot(2, 3, 2)
plt.plot(x,y)

x = np.array([0, 1, 2, 3])
y = np.array([3, 8, 1, 10])

plt.subplot(2, 3, 3)
plt.plot(x,y)

x = np.array([0, 1, 2, 3])
y = np.array([10, 20, 30, 40])

plt.subplot(2, 3, 4)
plt.plot(x,y)

x = np.array([0, 1, 2, 3])
y = np.array([3, 8, 1, 10])

plt.subplot(2, 3, 5)
plt.plot(x,y)

x = np.array([0, 1, 2, 3])
y = np.array([10, 20, 30, 40])

plt.subplot(2, 3, 6)
plt.plot(x,y)

plt.show()
93
Title and Super Title
You can add a title to each plot with the title() function:
#plot 1:
x = np.array([0, 1, 2, 3])
y = np.array([3, 8, 1, 10])

plt.subplot(1, 2, 1)
plt.plot(x,y)
plt.title("SALES")

#plot 2:
x = np.array([0, 1, 2, 3])
y = np.array([10, 20, 30, 40])

plt.subplot(1, 2, 2)
plt.plot(x,y)
plt.title("INCOME")

plt.show()

94
Title and Super Title
You can add a title to the entire figure with the suptitle() function:
#plot 1:
x = np.array([0, 1, 2, 3])
y = np.array([3, 8, 1, 10])

plt.subplot(1, 2, 1)
plt.plot(x,y)
plt.title("SALES")

#plot 2:
x = np.array([0, 1, 2, 3])
y = np.array([10, 20, 30, 40])

plt.subplot(1, 2, 2)
plt.plot(x,y)
plt.title("INCOME")

plt.suptitle("MY SHOP")
plt.show()
95
Matplotlib Scatter
Creating Scatter Plots: With Pyplot, you can use the scatter() function to draw a scatter plot.
The scatter() function plots one dot for each observation. It needs two arrays of the same length, one for
the values of the x-axis, and one for values on the y-axis:

x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])

plt.scatter(x, y)
plt.show()

The observation in the example above is the result of 13 cars


passing by.
The X-axis shows how old the car is.
The Y-axis shows the speed of the car when it passes.
Are there any relationships between the observations?
It seems that the newer the car, the faster it drives, but that could
be a coincidence, after all we only registered 13 cars.
96
Matplotlib Scatter
Compare Plots: In the previous example, there seems to be a relationship between
speed and age, but what if we plot the observations from another day as well? Will the
scatter plot tell us something else?
#day one, the age and speed of 13 cars:
x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
plt.scatter(x, y)

#day two, the age and speed of 15 cars:
x = np.array([2,2,8,1,15,8,12,9,7,3,11,4,7,14,12])
y = np.array([100,105,84,105,90,99,90,95,94,100,79,112,91,80,85])
plt.scatter(x, y)

plt.show()

By comparing the two plots, it seems safe to say that they both give us the same conclusion: the newer the car, the faster it drives.
97
Matplotlib Scatter
Colors: you can set your own color for each scatter plot with the color or the c
argument:

x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
plt.scatter(x, y, color = 'hotpink')

x = np.array([2,2,8,1,15,8,12,9,7,3,11,4,7,14,12])
y = np.array([100,105,84,105,90,99,90,95,94,100,79,112,91,80,85])
plt.scatter(x, y, color = '#88c999')

98
Matplotlib Scatter
Color Each Dot:
colors = np.array(["red","green","blue","yellow","pink","black","orange","purple","beige","brown","gray","cyan","magenta"])

plt.scatter(x, y, c=colors)

ColorMap: The Matplotlib module has a number of available colormaps.


A colormap is like a list of colors, where each color has a value that ranges from 0 to 100.
This colormap is called 'viridis' and as you can see it ranges from 0, which is a purple color, up
to 100, which is a yellow color.
colors = np.array([0, 10, 20, 30, 40, 45, 50, 55, 60, 70, 80, 90, 100])

plt.scatter(x, y, c=colors, cmap='viridis')


plt.colorbar()  # to show the ColorMap bar
99
Matplotlib Scatter
Size: You can change the size of the dots with the s argument.
Just like colors, make sure the array for sizes has the same length as the arrays for the x-
and y-axis:
import matplotlib.pyplot as plt
import numpy as np

x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
sizes = np.array([20,50,100,200,500,1000,60,90,10,300,600,800,75])

plt.scatter(x, y, s=sizes)

plt.show()

100
Matplotlib Scatter
Alpha: You can adjust the transparency of the dots
with the alpha argument.
Just like colors, make sure the array for sizes has
the same length as the arrays for the x- and y-axis:
plt.scatter(x, y, s=sizes, alpha=0.5)

Combine Color Size and Alpha


x = np.random.randint(100, size=(100))
y = np.random.randint(100, size=(100))
colors = np.random.randint(100, size=(100))
sizes = 10 * np.random.randint(100, size=(100))

plt.scatter(x, y, c=colors, s=sizes, alpha=0.5, cmap='nipy_spectral')

plt.colorbar()
plt.show()
101
Matplotlib Bars
Creating Bars: With Pyplot, you can use the bar() function to draw bar graphs:

x = np.array(["A", "B", "C", "D"])
y = np.array([3, 8, 1, 10])

plt.bar(x,y)
plt.show()

The bar() function takes arguments that describe the layout of the bars.
The categories and their values are represented by the first and second arguments as arrays.

x = ["APPLES", "BANANAS"]
y = [400, 350]
plt.bar(x, y)

102
Matplotlib Bars
Horizontal Bars: If you want the bars to be displayed horizontally instead of vertically, use the barh() function:

x = np.array(["A", "B", "C", "D"])
y = np.array([3, 8, 1, 10])

plt.barh(x,y)
plt.show()

Bar Color: The bar() and barh() take the keyword argument color to set the color of the bars:

plt.bar(x, y, color = "red")
plt.show()

103
Matplotlib Bars
Bar Width: The bar() takes the keyword argument width to set the width of the bars (the default width value is 0.8):

x = np.array(["A", "B", "C", "D"])
y = np.array([3, 8, 1, 10])

plt.bar(x, y, width = 0.1)
plt.show()

Note: For horizontal bars, use height instead of width.

plt.barh(x, y, height = 0.1)

104
Matplotlib Histograms
Histogram: A histogram is a graph showing frequency distributions.
It is a graph showing the number of observations within each given interval.
Example: Say you ask for the height of 250 people, you might end up with a histogram
like this:
You can read from the histogram that there are
approximately:
2 people from 140 to 145cm
5 people from 145 to 150cm
15 people from 151 to 156cm
31 people from 157 to 162cm
46 people from 163 to 168cm
53 people from 168 to 173cm
45 people from 173 to 178cm
28 people from 179 to 184cm
21 people from 185 to 190cm
4 people from 190 to 195cm
105
Matplotlib Histograms
Create Histogram: In Matplotlib, we use the hist() function to create histograms.
The hist() function will use an array of numbers to create a histogram, the array is sent
into the function as an argument.
For simplicity we use NumPy to randomly generate an array with 250 values, where the
values will concentrate around 170, and the standard deviation is 10

x = np.random.normal(170, 10, 250)

plt.hist(x)
plt.show()

106
Matplotlib Pie Charts
Creating Pie Charts: With Pyplot, you can use the pie() function to draw pie charts:
import matplotlib.pyplot as plt
import numpy as np

y = np.array([35, 25, 25, 15])

plt.pie(y)
plt.show()

As you can see the pie chart draws one piece


(called a wedge) for each value in the array (in this
case [35, 25, 25, 15]).
By default the plotting of the first wedge starts
from the x-axis and moves counterclockwise:
108
Matplotlib Pie Charts
Labels: Add labels to the pie chart with
the labels parameter.
The labels parameter must be an array
with one label for each wedge:

y = np.array([35, 25, 25, 15])


mylabels = ["Apples", "Bananas",
"Cherries", "Dates"]

plt.pie(y, labels = mylabels)


plt.show()

109
Matplotlib Pie Charts
Start Angle
As mentioned the default start angle is at the
x-axis, but you can change the start angle by
specifying a startangle parameter.
The startangle parameter is defined with an
angle in degrees, default angle is 0:

110
Matplotlib Pie Charts
Start Angle
Start the first wedge at 90 degrees:

y = np.array([35, 25, 25, 15])


mylabels = ["Apples", "Bananas", "Cherries",
"Dates"]

plt.pie(y, labels = mylabels, startangle = 90)


plt.show()
111
Matplotlib Pie Charts
Explode
Maybe you want one of the wedges to stand out? The explode
parameter allows you to do that.
The explode parameter, if specified, and not None, must be an
array with one value for each wedge.
Each value represents how far from the center each wedge is
displayed:

y = np.array([35, 25, 25, 15])


mylabels = ["Apples", "Bananas", "Cherries", "Dates"]
myexplode = [0.2, 0, 0, 0]

plt.pie(y, labels = mylabels, explode = myexplode)


plt.show()
112
Matplotlib Pie Charts
Shadow

Add a shadow to the pie chart by setting the shadow parameter to True:
y = np.array([35, 25, 25, 15])
mylabels = ["Apples", "Bananas", "Cherries", "Dates"]
myexplode = [0.2, 0, 0, 0]

plt.pie(y, labels = mylabels, explode = myexplode,


shadow = True)
plt.show()

113
Matplotlib Pie Charts
Colors
You can set the color of each wedge with the colors parameter.
The colors parameter, if specified, must be an array with one
value for each wedge:

y = np.array([35, 25, 25, 15])


mylabels = ["Apples", "Bananas", "Cherries", "Dates"]
mycolors = ["black", "hotpink", "b", "#4CAF50"]

plt.pie(y, labels = mylabels, colors = mycolors)


plt.show()

114
Matplotlib Pie Charts
Colors
Color shortcut values:

'r' - Red      'm' - Magenta
'g' - Green    'y' - Yellow
'b' - Blue     'k' - Black
'c' - Cyan     'w' - White

115
Matplotlib Pie Charts
Legend
To add a list of explanation for each wedge,
use the legend() function:

y = np.array([35, 25, 25, 15])


mylabels = ["Apples", "Bananas", "Cherries", "Dates"]

plt.pie(y, labels = mylabels)


plt.legend()
plt.show()

116
Matplotlib Pie Charts
Legend With Header
To add a header to the legend, add the
title parameter to the legend function.

y = np.array([35, 25, 25, 15])


mylabels = ["Apples", "Bananas", "Cherries", "Dates"]

plt.pie(y, labels = mylabels)


plt.legend(title = "Four Fruits:")
plt.show()

117
118
1. Scikit-learn
Definition:
Scikit-learn is a Python library for classical machine learning tasks,
built on NumPy, SciPy, and Matplotlib. It provides efficient tools for
data mining, preprocessing, and machine learning algorithms.

119
1. Scikit-learn

Uses:

Classification: SVM, k-NN, Random Forest, Logistic Regression, etc.


Regression: Linear Regression, Ridge, Lasso, etc.
Clustering: K-Means, DBSCAN, etc.
Dimensionality Reduction: PCA, t-SNE.
Model Evaluation: Cross-validation, metrics.
Feature Engineering: Encoding, scaling.
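
A small sketch of a typical scikit-learn workflow covering several of the tasks above (the dataset and model are chosen only for illustration):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load a built-in dataset and split it into training and testing sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Random Forest classifier and evaluate it
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))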

120
1. Scikit-learn

Advantages:
Ease of Use: Intuitive and simple API for classical ML tasks.
Rich Library: Includes most classical ML algorithms.
Integration: Works seamlessly with NumPy, Pandas, and Matplotlib.
Good Documentation: Well-documented and beginner-friendly.
Disadvantages:
No Deep Learning: Not suitable for neural networks or large-scale deep learning tasks.
Performance: Slower for very large datasets compared to specialized frameworks.

121
2. TensorFlow

Definition:
TensorFlow is an open-source framework for numerical
computation and large-scale machine learning, developed by Google.
It provides both low-level and high-level APIs for building and
training machine learning models, particularly for deep learning.

122
2. TensorFlow

Uses:

Deep Learning: Build custom architectures, including CNNs, RNNs, GANs (generative adversarial networks), transformers, etc.
Large-Scale ML: Handles massive datasets and distributed training across multiple
GPUs/TPUs.
Production Deployment: TensorFlow Lite for mobile and IoT devices, TensorFlow Serving
for cloud-based applications.
Numerical Computation: Works well for general-purpose numerical tasks, beyond machine
learning.
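
A minimal sketch of TensorFlow used for plain numerical computation (assuming TensorFlow 2.x is installed; the tensors are arbitrary examples):

import tensorflow as tf

a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[1.0, 0.0], [0.0, 1.0]])

# Matrix multiplication; TensorFlow places it on a GPU/TPU automatically if one is available
print(tf.matmul(a, b))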

123
2. TensorFlow

Advantages:
Scalability: Distributed training and deployment capabilities make it suitable
for production.
Customizability: Low-level APIs allow detailed control of computations.
Ecosystem: TensorFlow Lite, TensorFlow.js, and TensorFlow Serving provide
tools for cross-platform development.
Performance: Optimized for speed and works well on GPUs and TPUs.
Disadvantages:
Complexity: The low-level API requires extensive knowledge and verbose
code.
Debugging: The static computational graph (in earlier versions) could make
debugging harder compared to frameworks like PyTorch.
124
3. Keras

Definition:

Keras is a high-level API for building and training neural networks,


designed to be user-friendly and modular. It became integrated with
TensorFlow as its official high-level API in 2017 but can still be
used with other backends like Theano or CNTK.

125
3. Keras

Uses:

Rapid Prototyping: Build and experiment with neural networks quickly using
minimal code.
Beginner-Friendly Deep Learning: Offers simple APIs for standard tasks.
Model Customization: Supports functional and sequential APIs for defining
complex models.
Transfer Learning: Easy to load and fine-tune pre-trained models.
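
A minimal sketch of the Sequential API (the layer sizes and the synthetic data below are arbitrary and only for illustration):

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# A tiny fully connected classifier
model = keras.Sequential([
    keras.Input(shape=(4,)),
    layers.Dense(16, activation='relu'),
    layers.Dense(3, activation='softmax')
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Synthetic data, just to show the training call
X = np.random.rand(100, 4)
y = np.random.randint(0, 3, size=100)
model.fit(X, y, epochs=3, verbose=0)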

126
3. Keras

Advantages:
Simplicity: High-level abstractions reduce the need for boilerplate code.
Backend Flexibility: Though now tied to TensorFlow, older versions allowed using Theano
or CNTK as a backend.
Pre-trained Models: Includes many pre-trained architectures (e.g., ResNet, VGG,
MobileNet) for quick use.

Disadvantages:
Dependency on Backends: Keras‟ performance depends on the underlying backend
(TensorFlow in modern versions).
Limited Low-Level Control: For highly customized models, the low-level TensorFlow API
or other frameworks may be better.
127
4. PyTorch

Definition:

PyTorch is a deep learning framework developed by


Facebook, offering a dynamic computational graph that is
Pythonic and flexible, making it popular in research and
development.
128
4. PyTorch

Uses:

Deep Learning: Build CNNs, RNNs, GANs, etc.


Research: Ideal for experimenting with new architectures due
to its flexibility.
Production: TorchServe for model deployment.
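
A minimal sketch showing PyTorch's eager (dynamic-graph) execution (the tensor shapes are arbitrary):

import torch
import torch.nn as nn

x = torch.randn(8, 3)           # a small random batch
layer = nn.Linear(3, 2)

out = layer(x)                  # operations run eagerly, building the graph on the fly
loss = out.sum()
loss.backward()                 # gradients are computed dynamically
print(layer.weight.grad.shape)  # torch.Size([2, 3])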
129
4. PyTorch

Advantages:
Dynamic Graph: Easier to debug and modify than TensorFlow‟s static graph.
Pythonic: Feels like native Python, reducing complexity.
Community Support: Strong and growing ecosystem, widely used in research.
Disadvantages:
Production Tools: While improving, it historically lagged behind TensorFlow
for production environments.
Fewer High-Level Tools: Keras-like functionality is not as built-in (though
torch.nn helps).

130
Feature             | Scikit-learn                      | TensorFlow                        | Keras                                   | PyTorch
Primary Use         | Classical ML (e.g., regression)   | Deep learning, scalable ML        | High-level API for deep learning        | Research-focused deep learning
Ease of Use         | Extremely easy, beginner-friendly | Steeper learning curve            | Beginner-friendly                       | Flexible, moderate complexity
Core Philosophy     | Simplicity for traditional ML     | Production-ready, end-to-end ML   | Simplified interface for TensorFlow     | Dynamic graph execution for research
Performance         | CPU-optimized                     | GPU/TPU-optimized                 | Depends on TensorFlow backend           | GPU-optimized, fast and flexible
Debugging           | Limited needs, stable API         | Static (TF 1.x), Eager (TF 2.x)   | Relies on TensorFlow tools              | Easy debugging with dynamic graph
Popular Use Cases   | Regression, clustering            | Image/NLP/time series processing  | Rapid neural network prototyping        | Advanced research, custom models
Community & Support | Large community, extensive docs   | Backed by Google, huge ecosystem  | Integrated into TensorFlow's ecosystem  | Growing, backed by Meta (Facebook)
131
Scikit-learn: Ideal for small-scale projects focused on classical ML techniques.
TensorFlow: Suitable for production-ready deep learning and scalable
applications.
Keras: Best for beginners or rapid prototyping of deep learning models.
PyTorch: Great for researchers and experimental deep learning tasks requiring
dynamic graph execution.

132
134
Introduction to scikit-learn
Scikit-learn is a powerful and versatile open-source library in Python, designed for machine learning and
data science tasks. It is one of the most widely used libraries by data scientists and machine learning
practitioners due to its simplicity, efficiency, and comprehensive coverage of machine learning algorithms.
Built on top of other popular Python libraries such as NumPy, SciPy, and Matplotlib, scikit-learn provides a
seamless and unified interface for various machine learning models, preprocessing techniques, and evaluation
metrics. Its modular and consistent API allows users to focus on solving problems rather than worrying
about compatibility or implementation details.

135
Introduction to scikit-learn
The library supports both supervised and unsupervised learning methods, ranging from
simple algorithms like linear regression and k-nearest neighbors to more advanced
techniques like support vector machines, random forests…etc.

For unsupervised learning, it offers clustering algorithms like k-means and hierarchical
clustering, as well as dimensionality reduction techniques like principal component analysis
(PCA). What sets scikit-learn apart is its focus on user experience; the interface remains
consistent across all algorithms, making it easy to switch between models or compare their
performance with minimal changes to the code.
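
As a small sketch of the unsupervised side (PCA followed by k-means on the built-in Iris data; the parameter choices are only illustrative):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X, _ = load_iris(return_X_y=True)

X_2d = PCA(n_components=2).fit_transform(X)                                   # reduce to 2 dimensions
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_2d)  # cluster the projected points
print(labels[:10])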

136
Introduction to scikit-learn
One of the key features of scikit-learn is its preprocessing utilities, which help ensure that the
data is ready for analysis. Real-world datasets often have missing values, categorical
variables, or features with vastly different scales. Scikit-learn provides tools to handle these
issues efficiently, such as scaling, encoding, and imputing missing values. These
preprocessing techniques are essential for achieving optimal performance from machine
learning models. Additionally, scikit-learn integrates seamlessly with Pandas for handling
datasets, Matplotlib for visualization, and NumPy for numerical computations, making it a
perfect fit for end-to-end machine learning workflows.
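
A minimal sketch of these preprocessing utilities on a tiny made-up DataFrame (the column names and values are hypothetical; the sparse_output argument assumes scikit-learn >= 1.2, older versions use sparse=False):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

df = pd.DataFrame({'age': [25, np.nan, 40], 'city': ['Paris', 'Lyon', 'Paris']})

age = SimpleImputer(strategy='mean').fit_transform(df[['age']])                 # fill the missing value
age_scaled = StandardScaler().fit_transform(age)                                # standardize the scale
city_encoded = OneHotEncoder(sparse_output=False).fit_transform(df[['city']])   # encode the categories

print(age_scaled.ravel())
print(city_encoded)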

137
Introduction to scikit-learn
Installing scikit-learn is straightforward and can be done using Python's
package manager, pip.
A single command pip install scikit-learn is all it takes to get started.

Scikit-learn's documentation is highly regarded in the programming


community for its clarity and depth, offering numerous examples and
tutorials to help users understand and implement machine learning
algorithms effectively.

138
Introduction to scikit-learn
The essence of scikit-learn lies in its ability to streamline the process of
building and deploying machine learning models. Instead of writing
complex code from scratch, users can utilize pre-built functions for tasks such
as data splitting, model training, and evaluation.

Scikit-learn also supports cross-validation, a crucial technique for assessing


the generalization ability of a model. By automating these tasks, scikit-learn
not only saves time but also reduces the likelihood of errors, enabling users to
focus on extracting meaningful insights from data.
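
A minimal sketch of cross-validation with one of these pre-built functions (the model and dataset are only illustrative):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: the model is trained and evaluated on 5 different splits
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())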

139
Loading and Exploring Data

Machine learning starts with data. Before building models or performing


analysis, it is essential to understand, prepare, and manipulate the data properly.
Scikit-learn provides a robust set of tools for loading datasets, exploring their
structure, and preparing them for analysis. The library seamlessly integrates
with Python's data-handling ecosystem, making it easy to load data from
various sources like CSV files, databases, or even scikit-learn’s preloaded
datasets. Proper data handling ensures that the subsequent steps in your
machine learning workflow, such as preprocessing and model building, are
efficient and error-free.
140
Loading and Exploring Data
Loading Built-in Datasets
Scikit-learn comes with several built-in datasets designed for experimentation and
learning purposes. These datasets, like the Iris dataset, the Breast Cancer dataset, and
the Digits dataset, are small and formatted for direct use. Loading these datasets is
straightforward using the datasets module. For example, let‟s explore the Iris dataset, which
contains measurements of different iris flower species.
from sklearn import datasets

# Load the Iris dataset
iris = datasets.load_iris()

# Explore the dataset
print("Keys in the dataset:", iris.keys())
print("Feature names:", iris.feature_names)
print("Target names (classes):", iris.target_names)
print("Data shape:", iris.data.shape)
print("Sample data point:", iris.data[0])

Output:
Keys in the dataset: dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target names (classes): ['setosa' 'versicolor' 'virginica']
Data shape: (150, 4)
Sample data point: [5.1 3.5 1.4 0.2]
141
Loading and Exploring Data
Loading Built-in Datasets

In the previous code, we loaded the Iris dataset and explored its structure. Scikit-learn datasets are usually stored as dictionary-like objects with keys such as data (the feature values), target (the labels), and feature_names (the names of the features).

The data array contains the measurements of each flower, while the target array maps
each flower to a species (e.g., Setosa, Versicolor, or Virginica). These datasets are
excellent for learning because they are clean, well-structured, and ready for use.
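Because the features come back as a plain NumPy array, one convenient extra step (a minimal sketch, not part of the original example) is to wrap them in a Pandas DataFrame so the feature names are attached to the columns:

import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# Wrap the feature matrix in a DataFrame with named columns
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target_names[iris.target]  # map numeric labels to class names
print(df.head())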

142
Loading and Exploring Data
Loading External Data
Real-world datasets are usually not prepackaged and require manual loading. Scikit-learn supports loading data from NumPy arrays, Pandas DataFrames, or even SciPy sparse matrices. For instance, let's load a CSV file using Pandas and convert it into a format suitable for scikit-learn.

import pandas as pd
from sklearn.model_selection import train_test_split

# Load a CSV file into a Pandas DataFrame
data = pd.read_csv('path/to/dataset.csv')

# Separate features and target
X = data.drop(columns='target_column')
y = data['target_column']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training data shape:", X_train.shape)
print("Testing data shape:", X_test.shape)
143
Loading and Exploring Data
Loading External Data

In this example, we used Pandas to read a CSV file into a DataFrame. We then
separated the features (X) and the target variable (y), as machine learning models
require these inputs to be in separate variables. Finally, we split the dataset into
training and testing sets using scikit-learn's train_test_split function.
144
Exploring Data
Exploration is a vital step in understanding the characteristics of your dataset. For
example, knowing the distribution of values, the presence of missing data, or the
relationships between variables can influence the choice of preprocessing techniques and
algorithms.
Here's how to perform basic data exploration:

#...
# Basic statistics
print(X.describe())

# Check for missing values
print("Missing values in each column:\n", X.isnull().sum())

# Visualize relationships (for numeric features)
import matplotlib.pyplot as plt
import seaborn as sns
sns.pairplot(data, diag_kind='kde')
plt.show()

The describe method provides a quick statistical summary of numeric columns, including mean, standard deviation, and range. Checking for missing values is another critical step, as many machine learning algorithms cannot handle missing data directly. The pairplot function from Seaborn helps visualize the relationships between different features, offering insights into potential correlations or clusters.

145
Preliminary Data Cleaning

Real-world datasets are rarely clean and often require preprocessing. For example,
missing values can be handled using scikit-learn's imputation techniques.

from sklearn.impute import SimpleImputer
import numpy as np

# Create an imputer object
imputer = SimpleImputer(strategy='mean')

# Assume we have missing values in our dataset
X = np.array([[1, 2, np.nan], [3, np.nan, 6], [7, 8, 9]])

# Impute missing values
X_imputed = imputer.fit_transform(X)

print("Original Data:\n", X)
print("Data after imputation:\n", X_imputed)

This example demonstrates how to fill missing values with the mean of each column using the SimpleImputer class. Imputation strategies can also include filling with median or a constant value, depending on the context of your data.

146
Converting Categorical Data
Categorical data often needs to be converted into a numeric format for machine
learning models. Scikit-learn provides the LabelEncoder and OneHotEncoder
classes for this purpose. Let's look at how to encode a column of categorical values.

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Label encoding
categories = ['dog', 'cat', 'mouse']
le = LabelEncoder()
labels = le.fit_transform(categories)
print("Label Encoded:", labels)

# One-hot encoding
ohe = OneHotEncoder(sparse_output=False)
one_hot = ohe.fit_transform(labels.reshape(-1, 1))
print("One-Hot Encoded:\n", one_hot)

Label encoding maps each category to a unique integer, while one-hot encoding creates binary vectors representing each category. The choice between these methods depends on the algorithm being used and whether it can handle ordinal relationships in the data.

Label Encoded: [1 0 2]
One-Hot Encoded:
[[0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]]
147
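Note that OneHotEncoder can also be applied directly to the string categories, without label-encoding them first; a minimal sketch of that variant:

from sklearn.preprocessing import OneHotEncoder
import numpy as np

categories = np.array(['dog', 'cat', 'mouse']).reshape(-1, 1)

# One-hot encode the strings directly
ohe = OneHotEncoder(sparse_output=False)
print(ohe.fit_transform(categories))
print("Categories:", ohe.categories_)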
Preprocessing and Feature Engineering

Data preprocessing and feature engineering are crucial steps in any machine learning
pipeline. The quality and structure of the input data directly impact the performance of
machine learning models. In many cases, preprocessing and feature engineering are the most
time-intensive steps because they involve cleaning, transforming, and preparing raw data
into a format suitable for model training. Scikit-learn provides an extensive toolkit to address
these challenges, ranging from scaling and encoding to more advanced techniques like
generating polynomial features or selecting relevant features.

148
Preprocessing and Feature Engineering
Standardization and Normalization
One of the most common preprocessing steps is scaling numerical features.
Standardization and normalization are two popular techniques for this purpose:
Standardization scales data to have a mean of 0 and a standard deviation of 1.
Normalization scales data to a range of [0, 1] or to have unit norm.
from sklearn.preprocessing import StandardScaler, Normalizer
import numpy as np

# Sample dataset
X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Standardization
scaler = StandardScaler()
X_standardized = scaler.fit_transform(X)
print("Standardized Data:\n", X_standardized)

# Normalization
normalizer = Normalizer()
X_normalized = normalizer.fit_transform(X)
print("Normalized Data:\n", X_normalized)

In this code, the StandardScaler removes the mean and scales the data to unit variance, while the Normalizer scales each row (sample) to have unit norm. Standardization is particularly useful for algorithms sensitive to feature scaling, such as support vector machines and k-nearest neighbors.

Standardized Data:
[[-1.22474487 -1.22474487 -1.22474487]
 [ 0.          0.          0.        ]
 [ 1.22474487  1.22474487  1.22474487]]
Normalized Data:
[[0.26726124 0.53452248 0.80178373]
 [0.45584231 0.56980288 0.68376346]
 [0.50257071 0.57436653 0.64616234]]
149
Preprocessing and Feature Engineering
Generating Polynomial Features
Sometimes, raw features are insufficient for capturing complex
relationships in data.
Polynomial features introduce higher-order interactions between
variables, allowing models to learn more intricate patterns.
from sklearn.preprocessing import PolynomialFeatures
import numpy as np

# Sample dataset
X = np.array([[1, 2], [3, 4], [5, 6]])

# Generate polynomial features
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

print("Original Data:\n", X)
print("Polynomial Features:\n", X_poly)

The PolynomialFeatures class creates new features by computing all possible combinations of input features up to a specified degree. For example, if you have two features x1 and x2, polynomial features of degree 2 include x1², x2², and x1 × x2.

Original Data:
[[1 2]
 [3 4]
 [5 6]]
Polynomial Features:
[[ 1.  2.  1.  2.  4.]
 [ 3.  4.  9. 12. 16.]
 [ 5.  6. 25. 30. 36.]]
150
Preprocessing and Feature Engineering
Binarizing Continuous Features
In some cases, you may want to convert continuous features into binary values
(e.g., 0 or 1). This is useful for specific tasks like threshold-based classification.
The Binarizer class converts values above a given threshold to 1 and below it to 0, which can
be useful in certain scenarios where binary outcomes are needed.

from sklearn.preprocessing import Binarizer
import numpy as np

# Sample data
X = np.array([[1.5], [2.3], [0.8], [3.7]])

# Binarization
binarizer = Binarizer(threshold=2.0)
X_binary = binarizer.fit_transform(X)
print("Original Data:\n", X)
print("Binarized Data:\n", X_binary)

Original Data:
[[1.5]
 [2.3]
 [0.8]
 [3.7]]
Binarized Data:
[[0.]
 [1.]
 [0.]
 [1.]]

151
Supervised Learning Models

Supervised learning is the most commonly used machine learning paradigm where
models learn a mapping between input features (X) and target labels (y) using
labeled datasets. In supervised learning, the goal is either to classify data points
into predefined categories (classification) or to predict a continuous value
(regression). Scikit-learn provides a wide variety of algorithms for both tasks,
with consistent interfaces for training, predicting, and evaluating models.

This section delves into the most commonly used supervised learning models,
explaining their working principles and showcasing their implementation with
scikit-learn.

152
Supervised Learning Models
Linear Regression

Linear regression is a simple yet powerful algorithm for regression tasks. It assumes a linear relationship between the independent variables (features) and the dependent variable (target):

y = β0 + β1·x1 + β2·x2 + ⋯ + βn·xn

The model tries to fit a straight line (or hyperplane in higher dimensions) that minimizes the sum of squared differences between the observed and predicted values.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Generate synthetic data
X = np.random.rand(100, 1) * 10  # Feature: 100 random points
y = 3 * X.squeeze() + 5 + np.random.randn(100)  # Target: linear with noise

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("R² Score:", r2_score(y_test, y_pred))

In this example:
The LinearRegression class is used to create and train a model.
The coefficients and intercept define the best-fit line, which the model learns during training.
Evaluation metrics like Mean Squared Error (MSE) and R² Score assess the model's performance. A higher R² score indicates better alignment between predictions and actual values.
153
Supervised Learning Models
Logistic Regression

Logistic regression, despite its name, is a classification algorithm. It models the probability of a data point belonging to a certain class using the logistic function. It is especially useful for binary classification tasks and can be extended to multiclass problems using techniques like one-vs-rest (OvR).

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Load Iris dataset
data = load_iris()
X, y = data.data, data.target

# Binary classification: select two classes
X, y = X[y != 2], y[y != 2]  # Remove class 2 for binary classification

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
print(classification_report(y_test, y_pred))

This example demonstrates logistic regression for binary classification. The classification_report provides key metrics like precision, recall, and F1-score, helping assess the model's ability to classify data points accurately.

154
Supervised Learning Models
Support Vector Machines (SVM)

Support Vector Machines are versatile algorithms for classification and regression. SVMs work by finding a hyperplane that best separates classes in a high-dimensional space. With the help of kernel functions, SVMs can handle non-linear relationships by transforming data into higher dimensions.

from sklearn.svm import SVC
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np

# Generate synthetic dataset
X, y = make_blobs(n_samples=100, centers=2, random_state=42)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the model
model = SVC(kernel='linear')
model.fit(X_train, y_train)

# Visualize decision boundary
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='autumn')
ax = plt.gca()
xlim = ax.get_xlim()
ylim = ax.get_ylim()

# Create grid
xx = np.linspace(xlim[0], xlim[1], 30)
yy = np.linspace(ylim[0], ylim[1], 30)
YY, XX = np.meshgrid(yy, xx)
xy = np.vstack([XX.ravel(), YY.ravel()]).T
Z = model.decision_function(xy).reshape(XX.shape)

# Plot decision boundary
ax.contour(XX, YY, Z, colors='k', levels=[0], alpha=0.5)
plt.show()

This example uses an SVM with a linear kernel. The decision boundary separating the two classes is visualized on a 2D plane. SVMs are highly effective for datasets where the classes are well-separated.
155
Supervised Learning Models
K-Nearest Neighbors (KNN)

K-Nearest Neighbors is a simple algorithm that classifies data points based on their proximity to other points. The model assigns the class most common among the k nearest neighbors in feature space.

from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_blobs(n_samples=200, centers=3, random_state=42)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Make predictions
y_pred = knn.predict(X_test)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))

In KNN, the choice of k (number of neighbors) significantly impacts the model's performance. Smaller values of k make the model sensitive to noise, while larger values generalize the decision boundary.
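To see this sensitivity in practice, here is a minimal sketch (assuming the same train/test split as above) that compares test accuracy for several values of k:

# Compare test accuracy for several values of k
for k in [1, 3, 5, 11, 21]:
    knn_k = KNeighborsClassifier(n_neighbors=k)
    knn_k.fit(X_train, y_train)
    acc = accuracy_score(y_test, knn_k.predict(X_test))
    print(f"k={k}: test accuracy = {acc:.3f}")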
156
Supervised Learning Models
Decision Trees

Decision trees are interpretable models that split the data into subsets based on feature values. Each internal node represents a decision rule, and each leaf node represents a class or prediction.

from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

# Initialize and train the model (reusing the training split from the KNN example)
tree = DecisionTreeClassifier(max_depth=3)
tree.fit(X_train, y_train)

# Visualize the tree
plt.figure(figsize=(10, 8))
plot_tree(tree, feature_names=['Feature 1', 'Feature 2'],
          class_names=['Class 0', 'Class 1', 'Class 2'], filled=True)
plt.show()

Decision trees are intuitive and easy to visualize. However, they are prone to overfitting when allowed to grow too deep, which is why parameters such as max_depth are often constrained.
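A minimal sketch of this effect, assuming the same train/test split as in the previous examples, compares training and test accuracy for several tree depths (an unconstrained tree typically fits the training data almost perfectly while generalizing less well):

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Compare train vs. test accuracy for increasing tree depth
for depth in [1, 3, 10, None]:
    t = DecisionTreeClassifier(max_depth=depth, random_state=42)
    t.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, t.predict(X_train))
    test_acc = accuracy_score(y_test, t.predict(X_test))
    print(f"max_depth={depth}: train={train_acc:.2f}, test={test_acc:.2f}")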

157
Unsupervised Learning Models

Unsupervised learning focuses on finding hidden structures or patterns in unlabeled data.


Unlike supervised learning, there are no predefined outputs to predict. Instead, the goal is to
explore the data, group similar instances, or reduce dimensionality for better visualization and
analysis.

Scikit-learn provides a robust suite of algorithms for unsupervised learning, including clustering techniques, dimensionality reduction methods, and anomaly detection.

This section will cover popular unsupervised learning algorithms, their use cases, and how to
implement them with scikit-learn.

158
Unsupervised Learning Models

Clustering:
Clustering algorithms divide the dataset into groups (clusters) of similar instances based on
their feature values. Each cluster contains data points that are more similar to each other than to
those in other clusters.

K-Means Clustering:
K-Means is one of the most popular clustering algorithms. It partitions data into k clusters,
where k is specified by the user. The algorithm assigns each point to the nearest cluster center
and iteratively adjusts the cluster centers to minimize intra-cluster variance.

159
Unsupervised Learning Models
K-Means Clustering:
K-Means is one of the most popular clustering algorithms. It partitions data into k clusters, where k is specified by the user. The algorithm assigns each point to the nearest cluster center and iteratively adjusts the cluster centers to minimize intra-cluster variance.

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

# Apply K-Means
kmeans = KMeans(n_clusters=4, random_state=42)
kmeans.fit(X)

# Get cluster centers and labels
centers = kmeans.cluster_centers_
labels = kmeans.labels_

# Visualize the clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='X', s=200, label='Centers')
plt.legend()
plt.show()

In this example:
• The make_blobs function generates synthetic data with 4 clusters.
• The KMeans model clusters the data into 4 groups.
• The results are visualized, showing cluster centers and color-coded clusters.
160
Unsupervised Learning Models
Choosing the Number of Clusters:
The elbow method helps determine the optimal number of clusters by plotting the sum of squared distances from points to their assigned cluster center for different k values. The "elbow" point indicates the best k.

# Elbow Method
inertia = []
k_values = range(1, 10)
for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    inertia.append(kmeans.inertia_)

plt.plot(k_values, inertia, marker='o')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()

161
Hierarchical Clustering:
Hierarchical clustering creates a tree of clusters (dendrogram), representing nested groupings. There are two
main approaches:
Agglomerative Clustering: (Bottom-Up Approach) Starts with individual data points and merges them
iteratively.
Divisive Clustering: (Top-Down Approach) Starts with all data points in one cluster and splits them
iteratively.
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

# Apply Agglomerative Clustering
model = AgglomerativeClustering(n_clusters=3)
labels = model.fit_predict(X)

# Visualize dendrogram
linkage_matrix = linkage(X, method='ward')
dendrogram(linkage_matrix)
plt.title("Dendrogram")
plt.xlabel("Sample Index")
plt.ylabel("Distance")
plt.show()

Agglomerative clustering is useful when you want to understand the hierarchy of data, and the dendrogram provides a detailed visualization of the clustering process.
162
Unsupervised Learning Models
Dimensionality Reduction
Dimensionality reduction techniques reduce the number of features while retaining as much information as possible. These techniques are invaluable for visualization, noise reduction, and speeding up computations.

Principal Component Analysis (PCA)
PCA is a widely used dimensionality reduction technique. It projects data into a lower-dimensional space by finding new axes (principal components) that maximize variance.

from sklearn.decomposition import PCA
import numpy as np
import matplotlib.pyplot as plt

# Generate synthetic data
np.random.seed(42)
X = np.random.rand(100, 5)  # 5-dimensional data

# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

print("Original Shape:", X.shape)
print("Reduced Shape:", X_pca.shape)

# Visualize the reduced data
plt.scatter(X_pca[:, 0], X_pca[:, 1], c='blue')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA Result')
plt.show()

In this example:
• The original 5-dimensional data is reduced to 2 dimensions.
• PCA identifies the directions of maximum variance, helping to preserve the most important information.
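To check how much of the original variance the retained components actually capture, you can inspect the fitted model's explained_variance_ratio_ attribute; a short sketch continuing the example above:

# How much variance each principal component explains (continuing the example above)
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total variance retained:", pca.explained_variance_ratio_.sum())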
163
Model Training, Evaluation and Metrics

Model evaluation is a critical part of the machine learning workflow. It helps you understand
how well your model performs and whether it generalizes effectively to unseen data. In
supervised learning, we measure the agreement between the predicted and true labels for
classification, or the closeness of predicted values to true values for regression. For
unsupervised learning, evaluation focuses on the quality of clusters or the reduction in
dimensionality. Scikit-learn offers a robust set of metrics and techniques to evaluate models
comprehensively.

In this section, we will cover the essential evaluation methods for classification, regression,
and clustering tasks. Detailed examples are provided to demonstrate how these metrics can
guide you in choosing and improving your models.

164
Model Training, Evaluation and Metrics
Train-Test Split
The simplest way to evaluate a model is to divide the dataset into two parts: a training set to train the model and a test set to evaluate its performance. This ensures that the model is tested on data it has never seen before, providing a realistic estimate of how it will perform in production.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic dataset
X = [[i] for i in range(10)]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Evaluate the model
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

The dataset is split into 70% training and 30% testing. The accuracy score measures the proportion of correctly classified instances.
165
Model Training, Evaluation and Metrics
Cross-Validation

Cross-validation is a more robust evaluation method that divides the dataset into multiple folds. Each fold acts as a test set once, while the remaining folds are used for training. The final performance metric is the average across all folds.

# K-Fold Cross-Validation
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Dataset
X = [[i] for i in range(10)]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

# Perform cross-validation
model = DecisionTreeClassifier()
scores = cross_val_score(model, X, y, cv=5)
print("Cross-Validation Scores:", scores)
print("Mean Score:", scores.mean())  # average accuracy

This example splits the data into 5 folds and reports the scores for each fold, along with the mean score, which provides a better estimate of model performance than a single train-test split.

166
Model Training, Evaluation and Metrics
Classification Metrics
For classification tasks, the choice of evaluation metric depends on the problem's requirements. Scikit-learn provides various metrics such as accuracy, precision, recall, F1-score, and ROC-AUC.

True Positives (TP), False Positives (FP), True Negatives (TN), and False
Negatives (FN):
In classification problems, confusion matrices are used to evaluate the
performance of a classification model. The matrix shows the counts of true
positives, false positives, true negatives, and false negatives, which are critical in
calculating various evaluation metrics, such as accuracy, precision, recall, F1-
score, and specificity.

167
Model Training, Evaluation and Metrics

Classification Metrics

True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN):
Here's a breakdown of these terms:
• True Positives (TP): These are the cases where the model correctly predicted the positive class. In other
words, the true label is positive (1), and the model predicted it as positive (1).
Example: The model correctly predicts that a person has a disease when they actually do.
• False Positives (FP): These are the cases where the model incorrectly predicted the positive class. In other
words, the true label is negative (0), but the model predicted it as positive (1).
Example: The model incorrectly predicts that a person has a disease when they actually do not.
• True Negatives (TN): These are the cases where the model correctly predicted the negative class. In other
words, the true label is negative (0), and the model predicted it as negative (0).
Example: The model correctly predicts that a person does not have a disease when they actually do not.
• False Negatives (FN): These are the cases where the model incorrectly predicted the negative class. In other
words, the true label is positive (1), but the model predicted it as negative (0).
Example: The model incorrectly predicts that a person does not have a disease when they actually do.
168
Model Training, Evaluation and Metrics

Classification Metrics:

Confusion Matrix
• The confusion matrix summarizes prediction results, showing counts of true
positives, false positives, true negatives, and false negatives.

A confusion matrix looks like this:

                        Predicted Positive (1)    Predicted Negative (0)
Actual Positive (1)     True Positive (TP)        False Negative (FN)
Actual Negative (0)     False Positive (FP)       True Negative (TN)

169
Model Training, Evaluation and Metrics

Classification Metrics:

Confusion Matrix
• The confusion matrix summarizes prediction results, showing counts of true
positives, false positives, true negatives, and false negatives.
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

print("Confusion Matrix:\n", confusion_matrix(y_true, y_pred))

[[2 0]
 [1 2]]
170
Model Training, Evaluation and Metrics

Classification Metrics

Accuracy:
Accuracy is the proportion of correctly classified instances. While simple, it can be misleading for imbalanced datasets where one class dominates.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

from sklearn.metrics import accuracy_score

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
print("Accuracy:", accuracy_score(y_true, y_pred))

In this example, accuracy is calculated as the ratio of correct predictions to the total number of predictions.
171
Model Training, Evaluation and Metrics
Classification Metrics:
Precision, Recall, and F1-Score:
• Precision: The proportion of true positives among predicted positives.
  Precision = TP / (TP + FP)
• Recall (Sensitivity or True Positive Rate): The proportion of true positives among actual positives.
  Recall = TP / (TP + FN)
• F1-Score: The harmonic mean of precision and recall, balancing both metrics.

from sklearn.metrics import classification_report

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

print(classification_report(y_true, y_pred))

The classification_report provides a detailed breakdown of precision, recall, and F1-score for each class.
172
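If you need the individual numbers rather than the full report, scikit-learn also exposes precision_score, recall_score, and f1_score; a minimal sketch reusing the same labels:

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall:", recall_score(y_true, y_pred))        # TP / (TP + FN)
print("F1-Score:", f1_score(y_true, y_pred))          # harmonic mean of the two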
Model Training, Evaluation and Metrics
Classification Metrics:
ROC and AUC:
The Receiver Operating Characteristic (ROC) curve plots the true positive rate against the false positive rate. The Area Under the Curve (AUC) measures the model's ability to distinguish between classes.

from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

y_true = [0, 1, 1, 0, 1]
y_scores = [0.2, 0.8, 0.6, 0.4, 0.9]  # Model probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
auc_score = roc_auc_score(y_true, y_scores)

plt.plot(fpr, tpr, marker='.')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title(f'ROC Curve (AUC = {auc_score:.2f})')
plt.show()
173
Model Training, Evaluation and Metrics
Regression Metrics:
For regression tasks, common evaluation metrics include Mean Squared Error (MSE),
Mean Absolute Error (MAE), and R² Score.
Mean Squared Error and Mean Absolute Error: (lower values are better)
• MSE is the average of the squared differences between the actual values and the predicted values.
• Lower MSE indicates better performance, as it means the predictions are closer to the actual values. A
smaller MSE implies that the model's predictions are more accurate.

• MAE is the average of the absolute differences between the actual values and the predicted values.
• lower MAE indicates better performance because it suggests the model is making more accurate
predictions without large errors.

from sklearn.metrics import mean_squared_error, mean_absolute_error

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

print("MSE:", mean_squared_error(y_true, y_pred))
print("MAE:", mean_absolute_error(y_true, y_pred))

MSE: 0.375
MAE: 0.5
174
Model Training, Evaluation and Metrics
Regression Metrics:

R² Score (higher values are better)


R² score measures the proportion of the variance in the dependent variable that is predictable
from the independent variables. It is a measure of how well the model explains the data.

Higher R² values are better because they indicate that the model explains more of the variance
in the data. An R² score of 1 means the model perfectly predicts the data, and an R² score of 0
means the model does no better than predicting the mean value.

from sklearn.metrics import r2_score

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

print("R² Score:", r2_score(y_true, y_pred))

R² Score: 0.948
175
Model Training, Evaluation and Metrics
Clustering Metrics:
Evaluating clustering models is challenging because there are no true labels. Metrics like
Silhouette Score and Adjusted Rand Index (ARI) help assess clustering performance.

Silhouette Score:
Silhouette Score measures how similar a point is to its cluster compared to other clusters.
A higher score indicates well-separated clusters.
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans

# Clustering example
kmeans = KMeans(n_clusters=2, random_state=42)
labels = kmeans.fit_predict([[1], [2], [8], [9]])

print("Silhouette Score:", silhouette_score([[1], [2], [8], [9]], labels))

176
Model Training, Evaluation and Metrics
Clustering Metrics:
Adjusted Rand Index:
ARI compares the similarity between predicted and true clusters, accounting for chance.

from sklearn.metrics import adjusted_rand_score

true_labels = [0, 0, 1, 1]
pred_labels = [0, 0, 1, 1]

print("ARI:", adjusted_rand_score(true_labels, pred_labels))

177
Hyperparameter Tuning
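The slides do not include an example for this topic, but a common scikit-learn approach is exhaustive grid search with cross-validation via GridSearchCV. The following is a minimal sketch on the Iris data, with an illustrative parameter grid rather than a recommended one:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Candidate hyperparameter values (illustrative choices)
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}

# 5-fold cross-validated grid search
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X, y)

print("Best parameters:", grid.best_params_)
print("Best cross-validation score:", grid.best_score_)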

178
Any questions?

179
