Chapter 04 Advanced Use of Python Libraries for AI and Data Science
Introduction
What is data science?
Data science is an interdisciplinary field focused on extracting
knowledge, insights, and value from structured and unstructured
data. It combines techniques from statistics, mathematics,
computer science, and domain-specific expertise to analyze,
interpret, and predict patterns in data.
2
Introduction
Key Components of Data Science
• Data Collection and Storage: Gathering data from multiple sources,
including databases, APIs, web scraping, sensors, and more. Ensuring
data is stored in formats that are accessible and organized for analysis is
critical.
• Data Cleaning and Preparation: Before analysis, data often requires
significant cleaning and transformation, such as handling missing values,
correcting errors, standardizing formats, and ensuring consistency.
• Exploratory Data Analysis (EDA): EDA involves understanding the data's
structure, distributions, and key features through visualization and
summary statistics to form hypotheses and identify trends.
3
Introduction
Key Components of Data Science
• Statistical Analysis and Hypothesis Testing: Using statistical methods to test
hypotheses, identify correlations, and draw inferences from data.
• Machine Learning and Modeling: Applying algorithms to build predictive
models, classify data, detect patterns, and automate decision-making. Models
range from simple linear regressions to complex deep learning models,
depending on the problem and data.
• Data Visualization: Presenting insights visually, using tools like charts, graphs,
dashboards, and infographics. Visualization helps stakeholders (or Users)
understand complex data and models intuitively.
• Deployment and Optimization: Implementing data-driven solutions into
production environments where they can be used in real-time applications. This
includes ongoing monitoring and optimization to ensure models remain accurate
over time.
4
Introduction
Data categories
In data science, data types are generally categorized based on their
structure and format: structured, semi-structured, and unstructured
data
1. Structured Data
Structured data is highly organized and easily searchable in predefined formats, often found in tabular form with
rows and columns.
Characteristics:
Follows a strict schema (e.g., rows and columns in a database).
Easily searchable with SQL queries.
Data types and formats are defined in advance, making the data predictable and consistent.
Examples:
Relational databases (e.g., MySQL, PostgreSQL).
Spreadsheets (e.g., Excel tables).
Financial records, inventory data, and customer information stored in databases.
5
Introduction
Data categories
2. Semi-Structured Data
Semi-structured data lacks a fixed schema but still contains tags or markers to separate
elements, making it partially organized.
Characteristics:
Data doesn't strictly adhere to a fixed schema but still has some structure, like tags or
other markers.
More flexible than structured data but still organized enough for certain querying and
parsing methods.
Often represented in text files with a flexible schema (e.g., XML or JSON format).
Examples:
JSON and XML documents.
Email messages (with structured headers but free-form body content).
NoSQL databases (e.g., MongoDB, which allows documents with variable
structures).
HTML pages, where some parts have a defined structure but contain unstructured data
in free-form text.
6
Introduction
Data categories
3. Unstructured Data
Unstructured data lacks any predefined format or organization, making it harder to
process and analyze without specialized tools.
Characteristics:
No inherent structure, schema, or easily searchable format.
Data may be stored in formats that make analysis challenging without natural
language processing (NLP) or image recognition tools.
Often requires data processing and transformation before analysis can be done.
Examples:
Text documents, PDF files.
Media files (e.g., images, audio, video).
Social media posts, chat messages, and emails.
Sensor data from IoT devices, which may have raw, unprocessed values.
7
Introduction
Data categories
Summary Table
Data Type        Characteristics                                              Examples
Structured       Rigid schema, organized, easily searchable                   Relational databases, Excel sheets
Semi-Structured  Flexible schema, contains tags/markers for organization      JSON, XML, NoSQL databases, emails
Unstructured     No schema, difficult to search/organize without processing   Images, audio, video, social media posts, text files
8
Introduction to Key Libraries for AI and Data Science
9
Introduction to NumPy
10
Introduction to NumPy
Why Use NumPy Over Regular Python Lists
Performance: NumPy arrays are implemented in C, which makes array
operations significantly faster than equivalent operations on regular Python
lists. This is particularly true for large datasets.
Memory Efficiency: NumPy uses less memory than lists, which is
essential when working with big data.
Vectorized Operations: NumPy supports vectorized operations, allowing
you to apply a function to an entire array without needing explicit loops,
which enhances performance.
Mathematical Functions: NumPy includes a wide variety of built-in
mathematical functions and statistical operations, making it easier to
perform complex calculations.
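A quick sketch of these points (array values chosen for illustration):
import numpy as np

arr = np.arange(1_000_000)

# Vectorized operation: applied to the whole array at once, no explicit Python loop
squared = arr ** 2

# Built-in mathematical and statistical functions
print(arr.mean(), arr.std(), arr.sum())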
11
Introduction to NumPy
12
Introduction to NumPy
Practical Applications
•Data Analysis: Efficiently compute statistical metrics.
•Image Processing: Manipulate pixel data as 2D or 3D arrays.
•Scientific Computing: Perform simulations and mathematical modeling.
•Machine Learning Pipelines: Integrate data transformation and feature
scaling as part of ML workflows
13
Introduction to NumPy
pip install numpy will install the latest version of NumPy and any dependencies it requires
import numpy as np
Creating Multidimensional Arrays
1D Array: arr=np.array([1, 2, 3])
2D Array: arr=np.array([[1, 2, 3], [4, 5, 6]])
3D Array: arr=np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
ND Array (N-dimensional)
# Using zeros, ones, and arange …etc
zeros_array = np.zeros((3, 3)) # 3x3 array of zeros
ones_array = np.ones((2, 4)) # 2x4 array of ones
range_array = np.arange(0, 10, 2) # Array: [0, 2, 4, 6, 8]
full_array = np.full((3, 3), 7) # 3x3 array filled with the value 7
empty_array = np.empty((2, 3)) # 2x3 uninitialized array
linear_space = np.linspace(0, 10, 5) # 5 values from 0 to 10
random_array = np.random.rand(3, 2) # 3x2 array of random numbers between 0 and 1
random_integers = np.random.randint(1, 10, (2, 3)) # 2x3 array of random integers between 1 and 9
14
Introduction to NumPy
pip install numpy will install the latest version of NumPy and any dependencies it requires
import numpy as np
Specifying Data Types During Creation:
When creating an array, you can specify the data type (dtype) to ensure the array
elements are of a particular type, such as integers, floats, or complex numbers. This can
be done using the dtype parameter.
import numpy as np
# Integer array
int_array = np.array([1, 2, 3, 4], dtype=int)
# Float array
float_array = np.array([1, 2, 3, 4], dtype=float)
# Complex array
complex_array = np.array([1, 2, 3, 4], dtype=complex)
Specifying dtype helps in controlling memory usage and ensures consistent data handling, especially when performing mathematical operations.
15
Introduction to NumPy
Typecasting Arrays with astype()
You can change the data type of an existing array using the astype()
method. This is especially useful if you need to convert an array's data type
after creation.
import numpy as np
# Converting integer array to float
float_array = int_array.astype(float)
# Converting float array to integer
int_array = float_array.astype(int)
# Converting integer array to complex
complex_array = int_array.astype(complex)
Typecasting can be beneficial for compatibility with specific functions or for saving memory by switching to smaller data types when possible.
16
Introduction to NumPy
Working with Complex Data Types: Structured Arrays and Records
Structured arrays allow you to handle complex data by combining multiple data types in a
single array, similar to records or rows in a database table. Structured arrays are created
using compound data types, often useful when working with tabular data.
Creating a Structured Array
You can define structured arrays by specifying field names and data types:
import numpy as np
# Define a structured array with fields 'name', 'age', and 'height'
person_dtype = [('name', 'U10'), ('age', 'i4'), ('height', 'f4')]
people = np.array([('Alice', 25, 5.5), ('Bob', 30, 6.0)], dtype=person_dtype)
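Individual fields of the structured array can then be accessed by name, for example:
print(people['name'])        # ['Alice' 'Bob']
print(people['age'].mean())  # 27.5
print(people[0])             # ('Alice', 25, 5.5)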
17
Introduction to NumPy
Structured arrays are ideal when dealing with datasets that contain multiple data types across fields,
such as database records or CSV data.
18
Introduction to NumPy
Array Indexing and Slicing
• Indexing: Access specific elements (e.g., arr[0, 1] for a 2D array).
• Slicing: Extract subarrays (e.g., arr[:, 1] for selecting the second
column).
Reshaping and Flattening Arrays
• Reshape: Change the shape of an array without altering its data
(e.g., arr.reshape(3, 2)).
• Flatten: Convert a multidimensional array to a 1D array using
arr.flatten() or arr.ravel().
19
Introduction to NumPy
Some Attributes and Functions
Attributes:
shape: Dimensions of the array.
size: Total number of elements in the array.
dtype: Data type of the array elements.
ndim: Number of dimensions (axes) of the array.
itemsize: Number of bytes per element in the array.
Functions:
reshape(): Change the shape of an array while keeping the same number of elements.
flatten(): Convert a multi-dimensional array into a 1D array.
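A short example showing these attributes and functions on a 2x3 array:
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr.shape)     # (2, 3)
print(arr.size)      # 6
print(arr.dtype)     # e.g. int64 (platform dependent)
print(arr.ndim)      # 2
print(arr.itemsize)  # bytes per element, e.g. 8 for int64
print(arr.reshape(3, 2))
print(arr.flatten()) # [1 2 3 4 5 6]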
20
Introduction to NumPy
21
Introduction to NumPy
Array Operations
Element-wise Operations: Operations applied to each element (e.g., arr1 + arr2).
Matrix Multiplication: Use np.dot(arr1, arr2) or arr1 @ arr2.
Broadcasting: Automatic expansion of array dimensions for operations (e.g., adding a
1D array to a 2D array). Rule: two dimensions are compatible when they are equal, or one of them is 1.
import numpy as np
# Add a 1D array to each row of a 2D array
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([10, 20, 30])
print(a + b)
# Output: [[11, 22, 33], [14, 25, 36]]
22
Introduction to NumPy
Aggregation Functions
• np.sum(arr): Sum of all elements.
• np.mean(arr): Mean value.
• np.min(arr), np.max(arr): Minimum and maximum values.
• np.std(arr): Standard deviation.
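For example:
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])
print(np.sum(arr))               # 21
print(np.mean(arr))              # 3.5
print(np.min(arr), np.max(arr))  # 1 6
print(np.std(arr))               # ~1.708
# Aggregations can also be computed along an axis
print(np.sum(arr, axis=0))       # [5 7 9]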
23
Introduction to NumPy
24
25
Pandas ("panel data") is a Python library used for data analysis.
Panel data is a term from econometrics and statistics. It refers to data that
contains observations over multiple time periods for the same individuals
or entities.
26
Pandas extends NumPy by providing functions for exploratory
data analysis, statistics, and data visualization.
27
Key Features of Pandas
• Create a structured data set like R's data frame and Excel spreadsheet.
• Reading data from various sources such as CSV, TXT, XLSX, JSON, SQL
database etc.
• Writing and exporting data in CSV or Excel format (built-in methods)
• It has a fast and efficient DataFrame object with default and customized
indexing.
• It is used for reshaping and pivoting of the data sets.
• It can do group by data aggregations and transformations.
• It integrates with the other libraries like SciPy, and scikit-learn.
• Many more …
28
29
30
Difference between NumPy and Pandas
Before we dive into the details of Pandas, we would like to point out some
important differences between NumPy and Pandas.
• Pandas mainly works with the tabular data and is popularly used for data
analysis and visualization, whereas the NumPy works primarily with the
numerical data.
• Pandas provides some sets of powerful tools like DataFrame and Series
that is mainly used for analyzing the data, whereas NumPy offers a
powerful object called ndarray.
• NumPy performs better than Pandas for about 50K rows or fewer, whereas Pandas
performs better than NumPy for about 500K rows or more. Between 50K and 500K
rows, performance depends on the kind of operation.
• NumPy consumes less memory as compared to Pandas.
• Indexing of Series objects is quite slow as compared to NumPy arrays.
31
Pandas Data Objects
Pandas provides two data structures/objects for processing the data, i.e., Series
and DataFrame. Pandas objects can be thought of as enhanced versions of
NumPy structured arrays in which the rows and columns are identified with
labels rather than simple integer indices.
1. Pandas Series: It is a one-dimensional labeled array to store the data
elements which are generally similar to the columns in Excel. A Series
contains a sequence of values and an associated array of data labels,
called its index.
33
Pandas Series Object
We can easily convert a list, tuple, or dictionary into a Series using the Series()
constructor with the following syntax:
pandas.Series(data, index, dtype, copy)
import pandas as pd
The following syntax creates an empty Series:
series1 = pd.Series()
print(series1)
The example above creates an empty Series object that has no values
and uses the default data type, float64.
37
Pandas Series Object
Creating a Series using inputs:
list1 = [10,20,30]
arr1 = np.array([10,20,30])
dict1 = {'a':10,'b':20,'c':30}
38
Pandas Series Object
In [315]: pd.Series(list1)
Out[315]:
0    10
1    20
2    30
dtype: int64

labels = ['a', 'b', 'c']
In [320]: series1 = pd.Series(list1, index=labels)
In [321]: series1
Out[321]:
a    10
b    20
c    30
dtype: int64
39
Pandas Series Object
40
Pandas Series Object
41
Pandas Series Object
Accessing Data from Series
In [329]: series1[[0,1,2]]
Out[329]:
a 10
b 20
c 30
dtype: int64
42
Pandas Series Object
Series object attributes
Attribute          Description
Series.index       Defines the index of the Series
Series.shape       Returns a tuple of the shape of the data
Series.dtype       Returns the data type
Series.size        Returns the total number of elements in the Series
Series.empty       Returns True if the Series is empty, otherwise False
Series.hasnans     Returns True if there are any missing values, otherwise False
Series.nbytes      Returns the number of bytes in the data
Series.ndim        Returns the number of dimensions
43
Pandas Series Object
Series object attributes
In [330]: print(series1.index)
Index(['a', 'b', 'c'], dtype='object')
In [331]: print(series1.values)
[10 20 30]
In [332]: print(series1.dtype)
int64
In [334]: print(series1.shape)
(3,)
44
Pandas Series Object
Series object attributes
45
Pandas Series Object
Series.to_frame()
A Series is a one-dimensional, list-like structure that can hold integers, strings, floats,
etc. It returns an object in the form of a list with an index running from 0 to n-1, where
n is the number of values in the Series. The main difference between a Series and a
DataFrame is that a Series can only contain a single list with a particular index, whereas
a DataFrame is a combination of more than one Series that can be analyzed together.
The series.to_frame() function is used to convert a Series object to a DataFrame.
Syntax:
series.to_frame(name = None)
46
Pandas Series Object
series.to_frame(name = None)
name: Refers to the name of the resulting column. Its default value is None. If a value is
passed, it will be used as the column name in place of the Series name. The method returns
the DataFrame representation of the Series.
In [352]: series1.to_frame()
Out[352]:
   0
0  4
1  7
2  6
3  1
4  3
5  4
47
Pandas Series Object
Series.value_counts()
To extract information about the values contained in a series, use the
value_counts() function. The value_counts() function returns a Series object
that contains counts of unique values. It returns an object that will be in
descending order so that its first element will be the most-frequently-occurred
element. By default, it excludes NA values.
Syntax
Series.value_counts(normalize = False, sort = True, ascending =
False, bins = None, dropna = True)
48
Pandas Series Object
Series.value_counts()
In [353]: x = pd.Series([4,1,7,6,1])
In [354]: x.value_counts()
Out[354]:
1    2
4    1
7    1
6    1
dtype: int64
49
Pandas DATAFRAME Object
50
Pandas DATAFRAME Object
Advantages of DataFrames:
Ease of Use:
Intuitive methods for reading, writing, and manipulating data.
Ability to handle heterogeneous data (e.g., numeric, textual, and categorical
data).
Efficient Data Handling:
Works well with large datasets.
Optimized for performance using built-in vectorized operations.
Flexible Data Manipulation:
Add, modify, and drop columns or rows easily.
Supports reshaping, merging, and filtering data intuitively.
51
Pandas DATAFRAME Object
Advantages of DataFrames:
Data Cleaning:
Handle missing data with tools for filling, interpolating, or dropping NaN values.
Perform data transformations such as renaming columns, changing data types, or applying
functions.
Integration:
Visualize data using libraries like Matplotlib and Seaborn.
Use DataFrames as inputs for machine learning models with Scikit-learn.
Rich Indexing and Slicing:
Access data by labels or positions.
Perform row or column filtering with intuitive syntax.
Support for Data Analysis:
Generate descriptive statistics with .describe().
Group data and compute aggregations using .groupby().
52
Pandas DATAFRAME Object
pip install pandas
Creating DataFrames
import pandas as pd
From dictionaries: data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)
Name Age
0 Alice 25
1 Bob 30
54
Pandas DATAFRAME Object
Viewing DataFrames
Summary Information:
df.info()
df.describe()

Output of df.describe():
             Age
count   2.000000
mean   27.500000
std     3.535534
min    25.000000
25%    26.250000
50%    27.500000
75%    28.750000
max    30.000000
55
Pandas DATAFRAME Object
Data Selection and Indexing
Selecting Data
By columns:
df['Name'] # Single column
df[['Name', 'Age']] # Multiple columns
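Rows (and individual values) can also be selected by label with loc or by position with iloc; a small sketch using the df created earlier:
df.loc[0]              # Row with index label 0
df.iloc[0]             # First row by position
df.loc[0, 'Name']      # Single value: 'Alice'
df.iloc[0:1, 0:2]      # Slice of rows and columns by position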
56
Pandas DATAFRAME Object
Data Selection and Indexing
Filtering Data
Using conditions:
df[df['Age'] > 25]
58
Pandas DATAFRAME Object
Renaming Columns
df.rename(columns={'Age': 'Years'}, inplace=True)
59
Pandas DATAFRAME Object
Data Transformation
Adding/Updating Columns
df['NewColumn'] = df['Age'] * 2
Applying Functions
Row-wise or column-wise:
df['Transformed'] = df['Age'].apply(lambda x: x + 5)
60
Pandas DATAFRAME Object
Data Aggregation
Group By
grouped = df.groupby('Name').mean()

Input DataFrame:
      Name  Age  Score
0    Alice   25     85
1      Bob   30     90
2    Alice   28     88
3      Bob   35     85
4  Charlie   40     95

Result:
          Age  Score
Name
Alice    26.5   86.5
Bob      32.5   87.5
Charlie  40.0   95.0

Sorting
df.sort_values('Age', ascending=False, inplace=True)
61
Pandas DATAFRAME Object
Data Aggregation
Group By
grouped = df.groupby('Name').mean()
Using Other Aggregations:
Replace .mean() with other aggregation functions:
• .sum(): Total sum of values in each group.
• .count(): Count of rows in each group.
• .max(): Maximum value in each group.
• .min(): Minimum value in each group.
Sorting
df.sort_values('Age', ascending=False, inplace=True)
62
Pandas DATAFRAME Object
Saving DataFrames
To CSV
df.to_csv('output.csv', index=False)
To Excel
df.to_excel('output.xlsx', index=False)
63
64
What is Matplotlib?
• Matplotlib is a low level graph plotting
library in python that serves as a
visualization utility.
• Matplotlib is open source and we can use
it freely.
• Matplotlib is mostly written in Python; a
few segments are written in C, Objective-C,
and JavaScript for platform
compatibility.
65
Installation of Matplotlib: pip install matplotlib
Use it in app: import matplotlib
import matplotlib
print(matplotlib.__version__)
66
67
Pyplot:
Most of the Matplotlib utilities lie under the pyplot submodule
and are usually imported under the plt alias:
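For example:
import matplotlib.pyplot as plt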
68
Plotting x and y points
• The plot() function is used to draw points (markers) in a diagram.
• By default, the plot() function draws a line from point to point.
• The function takes parameters for specifying points in the diagram.
• Parameter 1 is an array containing the points on the x-axis.
• Parameter 2 is an array containing the points on the y-axis.
• If we need to plot a line from (1, 3) to (8, 10), we have to pass two arrays
[1, 8] and [3, 10] to the plot function.
69
Plotting x and y points
import matplotlib.pyplot as plt
import numpy as np
xpoints = np.array([1, 8])
ypoints = np.array([3, 10])
plt.plot(xpoints, ypoints)
plt.show()
70
Plotting Without Line
• To plot only the markers, you can use the shortcut string notation parameter 'o',
which means 'rings'.
import matplotlib.pyplot as plt
import numpy as np
plt.plot(xpoints, ypoints, 'o')
plt.show()
71
Multiple Points
• You can plot as many points as you like; just make sure you have the same
number of points on both axes.
plt.plot(xpoints, ypoints)
plt.show()
72
Default X-Points
If we do not specify the points on the x-axis, they will get the default values 0, 1, 2, 3 etc.,
depending on the length of the y-points.
So, if we take the same example as above, and leave out the x-points, the diagram will look
like this:
plt.plot(ypoints)
plt.show()
73
Markers
You can use the keyword argument marker to emphasize each point
with a specified marker:
74
Markers
You can use the keyword argument marker to emphasize each point with a
specified marker:
...
plt.plot(ypoints, marker = '*')
...
75
Markers
Marker  Description
'o'     Circle
'*'     Star
'.'     Point
','     Pixel
'x'     X
'X'     X (filled)
'+'     Plus
'P'     Plus (filled)
's'     Square
'D'     Diamond
'd'     Diamond (thin)
'p'     Pentagon
'H'     Hexagon
'h'     Hexagon
'v'     Triangle Down
'^'     Triangle Up
'<'     Triangle Left
'>'     Triangle Right
'1'     Tri Down
'2'     Tri Up
'3'     Tri Left
'4'     Tri Right
'|'     Vline
'_'     Hline
76
Format Strings fmt
You can also use the shortcut string notation parameter to specify the marker.
This parameter is also called fmt, and is written with this syntax:
marker|line|color
Example: 'o:r' marks each point with a circle, uses a dotted line, and colors it red.
import matplotlib.pyplot as plt
import numpy as np
plt.plot(ypoints, 'o:r')
plt.show()
77
Marker Size
You can use the keyword argument markersize or the shorter version, ms
to set the size of the markers:
Set the size of the markers to 20:
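For example (assuming ypoints is defined as in the earlier examples):
plt.plot(ypoints, marker = 'o', ms = 20)
plt.show()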
78
Marker Color
You can use the keyword argument markeredgecolor or the shorter mec to
set the color of the edge of the markers:
Set the EDGE color to red:
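For example (assuming ypoints as before):
plt.plot(ypoints, marker = 'o', ms = 20, mec = 'r')
plt.show()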
79
Marker Color
You can use the keyword argument markerfacecolor or the shorter mfc to set the color
inside the edge of the markers:
Set the FACE color to red:
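For example (assuming ypoints as before):
plt.plot(ypoints, marker = 'o', ms = 20, mfc = 'r')
plt.show()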
Linestyle
You can use the keyword argument linestyle, or the shorter ls, to change the style of the plotted line:
plt.plot(ypoints, ls = ':')
Style               Shorthand
'solid' (default)   '-'
'dotted'            ':'
'dashed'            '--'
'dashdot'           '-.'
'None'              '' or ' '
81
Matplotlib Line
Line Color: You can use the keyword argument color or the
shorter c to set the color of the line:
plt.plot(ypoints, color = 'r')
plt.plot(ypoints, c = '#4CAF50')
plt.plot(ypoints, c = 'hotpink')

Multiple Lines: you can plot as many lines as you like by adding more plt.plot() calls:
x1 = np.array([0, 1, 2, 3])
y1 = np.array([3, 8, 1, 10])
x2 = np.array([0, 1, 2, 3])
y2 = np.array([6, 2, 7, 11])
plt.plot(x1, y1)
plt.plot(x2, y2)
84
Matplotlib Labels and Title
Create Labels for a Plot: With Pyplot, you can use
the xlabel() and ylabel() functions to set a label for the
x- and y-axis.
plt.plot(x, y)
plt.xlabel("Average Pulse")
plt.ylabel("Calorie Burnage")
85
Set Font Properties for Title and Labels
You can use the fontdict parameter in xlabel(),
ylabel(), and title() to set font properties for the title
and labels.
font1 = {'family':'serif','color':'blue','size':20}
font2 = {'family':'serif','color':'darkred','size':15}
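These dictionaries are then passed through the fontdict parameter, for example (assuming x and y are already defined):
plt.title("Plot Title", fontdict = font1)
plt.xlabel("Average Pulse", fontdict = font2)
plt.ylabel("Calorie Burnage", fontdict = font2)
plt.plot(x, y)
plt.show()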
86
Position the Title
You can use the loc parameter in title() to position the title.
Legal values are: 'left', 'right', and 'center'. Default value is
'center'.
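For example:
plt.title("Plot Title", loc = 'left')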
87
Matplotlib Adding Grid Lines
Add Grid Lines to a Plot: With Pyplot, you can use the
grid() function to add grid lines to the plot.
plt.plot(x, y)
plt.grid()
plt.show()
Specify Which Grid Lines to Display: You can use the axis
parameter in the grid() function to specify which grid lines to
display.
Legal values are: 'x', 'y', and 'both'. Default value is 'both'.
plt.grid(axis = 'x')
plt.grid(axis = 'y')
88
Matplotlib Adding Grid Lines
Set Line Properties for the Grid: You can also set the line properties of the grid, like this: grid(color = 'color',
linestyle = 'linestyle', linewidth = number).
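For example:
plt.grid(color = 'green', linestyle = '--', linewidth = 0.5)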
89
Matplotlib Subplot
Display Multiple Plots: With the subplot() function you can draw multiple plots in
one figure:
#plot 1:
x = np.array([0, 1, 2, 3])
y = np.array([3, 8, 1, 10])
plt.subplot(1, 2, 1)
plt.plot(x,y)
#plot 2:
x = np.array([0, 1, 2, 3])
y = np.array([10, 20, 30, 40])
plt.subplot(1, 2, 2)
plt.plot(x,y)
plt.show()
90
The subplot() Function
The subplot() function takes three arguments that describes the layout of the figure.
The layout is organized in rows and columns, which are represented by the first and
second argument.
The third argument represents the index of the current plot.
plt.subplot(1, 2, 1)
#the figure has 1 row, 2 columns, and this plot is the first plot.
plt.subplot(1, 2, 2)
#the figure has 1 row, 2 columns, and this plot is the second plot.
91
The subplot() Function
Example: Draw 2 plots on top of each other:
#plot 1:
x = np.array([0, 1, 2, 3])
y = np.array([3, 8, 1, 10])
plt.subplot(2, 1, 1)
plt.plot(x,y)
#plot 2:
x = np.array([0, 1, 2, 3])
y = np.array([10, 20, 30, 40])
plt.subplot(2, 1, 2)
plt.plot(x,y)
plt.show()
92
The subplot() Function
You can draw as many plots as you like on one figure; just describe the number of rows, columns, and the index of the plot.
Example: draw six plots in a 2x3 grid:
x = np.array([0, 1, 2, 3])
y = np.array([3, 8, 1, 10])
plt.subplot(2, 3, 1)
plt.plot(x, y)

x = np.array([0, 1, 2, 3])
y = np.array([10, 20, 30, 40])
plt.subplot(2, 3, 2)
plt.plot(x, y)

plt.subplot(2, 3, 3)
plt.plot(x, y)

x = np.array([0, 1, 2, 3])
y = np.array([10, 20, 30, 40])
plt.subplot(2, 3, 4)
plt.plot(x, y)

x = np.array([0, 1, 2, 3])
y = np.array([3, 8, 1, 10])
plt.subplot(2, 3, 5)
plt.plot(x, y)

x = np.array([0, 1, 2, 3])
y = np.array([10, 20, 30, 40])
plt.subplot(2, 3, 6)
plt.plot(x, y)
plt.show()
93
Title and Super Title
You can add a title to each plot with the title() function:
#plot 1:
x = np.array([0, 1, 2, 3])
y = np.array([3, 8, 1, 10])
plt.subplot(1, 2, 1)
plt.plot(x,y)
plt.title("SALES")
#plot 2:
x = np.array([0, 1, 2, 3])
y = np.array([10, 20, 30, 40])
plt.subplot(1, 2, 2)
plt.plot(x,y)
plt.title("INCOME")
plt.show()
94
Title and Super Title
You can add a title to the entire figure with the suptitle() function:
#plot 1:
x = np.array([0, 1, 2, 3])
y = np.array([3, 8, 1, 10])
plt.subplot(1, 2, 1)
plt.plot(x,y)
plt.title("SALES")
#plot 2:
x = np.array([0, 1, 2, 3])
y = np.array([10, 20, 30, 40])
plt.subplot(1, 2, 2)
plt.plot(x,y)
plt.title("INCOME")
plt.suptitle("MY SHOP")
plt.show()
95
Matplotlib Scatter
Creating Scatter Plots: With Pyplot, you can use the scatter() function to draw a scatter plot.
The scatter() function plots one dot for each observation. It needs two arrays of the same length, one for
the values of the x-axis, and one for values on the y-axis:
x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
plt.scatter(x, y)
plt.show()
By comparing the two plots, it is safe to say that they both give us the same conclusion:
the newer the car, the faster it drives.
97
Matplotlib Scatter
Colors: you can set your own color for each scatter plot with the color or the c
argument:
x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
plt.scatter(x, y, color = 'hotpink')
x = np.array([2,2,8,1,15,8,12,9,7,3,11,4,7,14,12])
y = np.array([100,105,84,105,90,99,90,95,94,100,79,112,91,80,85])
plt.scatter(x, y, color = '#88c999')
98
Matplotlib Scatter
Color Each Dot: you can set a specific color for each dot with the c argument, and a specific size for each dot with the s argument. Both arrays must have the same length as the x- and y-axis arrays:
x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
colors = np.array(["red","green","blue","yellow","pink","black","orange","purple","beige","brown","gray","cyan","magenta"])
plt.scatter(x, y, c=colors)

sizes = np.array([20,50,100,200,500,1000,60,90,10,300,600,800,75])
plt.scatter(x, y, s=sizes)
plt.show()
100
Matplotlib Scatter
Alpha: You can adjust the transparency of the dots
with the alpha argument.
Just like colors, make sure the array for sizes has
the same length as the arrays for the x- and y-axis:
plt.scatter(x, y, s=sizes, alpha=0.5)
plt.show()
101
Matplotlib Bars
Creating Bars: With Pyplot, you can use the bar() function to draw bar graphs:
x = np.array(["A", "B", "C", "D"])
y = np.array([3, 8, 1, 10])
plt.bar(x, y)
plt.show()

x = ["APPLES", "BANANAS"]
y = [400, 350]
plt.bar(x, y)
102
Matplotlib Bars
Horizontal Bars: If you want the bars to be displayed horizontally instead of vertically, use the barh() function:
x = np.array(["A", "B", "C", "D"])
y = np.array([3, 8, 1, 10])
plt.barh(x, y)
plt.show()
103
Matplotlib Bars
Bar Width: The bar() function takes the keyword argument width to set the width of the bars (the default width is 0.8):
x = np.array(["A", "B", "C", "D"])
y = np.array([3, 8, 1, 10])
plt.bar(x, y, width = 0.1)
plt.show()
104
Matplotlib Histograms
Histogram: A histogram is a graph showing frequency distributions.
It is a graph showing the number of observations within each given interval.
Example: Say you ask for the height of 250 people, you might end up with a histogram
like this:
You can read from the histogram that there are
approximately:
2 people from 140 to 145cm
5 people from 145 to 150cm
15 people from 151 to 156cm
31 people from 157 to 162cm
46 people from 163 to 168cm
53 people from 168 to 173cm
45 people from 173 to 178cm
28 people from 179 to 184cm
21 people from 185 to 190cm
4 people from 190 to 195cm
105
Matplotlib Histograms
Create Histogram: In Matplotlib, we use the hist() function to create histograms.
The hist() function will use an array of numbers to create a histogram, the array is sent
into the function as an argument.
For simplicity we use NumPy to randomly generate an array with 250 values, where the
values will concentrate around 170, and the standard deviation is 10:
x = np.random.normal(170, 10, 250)
plt.hist(x)
plt.show()
106
Matplotlib Pie Charts
Creating Pie Charts: With Pyplot, you can use the pie() function to draw pie charts:
import matplotlib.pyplot as plt
import numpy as np
y = np.array([35, 25, 25, 15])
plt.pie(y)
plt.show()
109
Matplotlib Pie Charts
Start Angle
As mentioned the default start angle is at the
x-axis, but you can change the start angle by
specifying a startangle parameter.
The startangle parameter is defined with an
angle in degrees, default angle is 0:
110
Matplotlib Pie Charts
Start Angle
Start the first wedge at 90 degrees:
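For example (assuming y is the array of wedge sizes):
plt.pie(y, startangle = 90)
plt.show()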
113
Matplotlib Pie Charts
Colors
You can set the color of each wedge with the colors parameter.
The colors parameter, if specified, must be an array with one
value for each wedge:
114
Matplotlib Pie Charts
Colors
y = np.array([35, 25, 25, 15])
mylabels = ["Apples", "Bananas", "Cherries", "Dates"]
mycolors = ["black", "hotpink", "b", "#4CAF50"]
plt.pie(y, labels = mylabels, colors = mycolors)
plt.show()
116
Matplotlib Pie Charts
Legend With Header
To add a header to the legend, add the
title parameter to the legend function.
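For example (assuming y and mylabels as defined above):
plt.pie(y, labels = mylabels)
plt.legend(title = "Four Fruits:")
plt.show()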
117
118
1. Scikit-learn
Definition:
Scikit-learn is a Python library for classical machine learning tasks,
built on NumPy, SciPy, and Matplotlib. It provides efficient tools for
data mining, preprocessing, and machine learning algorithms.
119
1. Scikit-learn
Uses:
Classification, regression, clustering, dimensionality reduction, model selection, and
data preprocessing for classical machine learning workflows.
120
1. Scikit-learn
Advantages:
Ease of Use: Intuitive and simple API for classical ML tasks.
Rich Library: Includes most classical ML algorithms.
Integration: Works seamlessly with NumPy, Pandas, and Matplotlib.
Good Documentation: Well-documented and beginner-friendly.
Disadvantages:
No Deep Learning: Not suitable for neural networks or large-scale deep learning tasks.
Performance: Slower for very large datasets compared to specialized frameworks.
121
2. TensorFlow
Definition:
TensorFlow is an open-source framework for numerical
computation and large-scale machine learning, developed by Google.
It provides both low-level and high-level APIs for building and
training machine learning models, particularly for deep learning.
122
2. TensorFlow
Uses:
Deep Learning: Build custom architectures, including CNNs, RNNs, GANs (generative
adversarial network) , transformers, etc.
Large-Scale ML: Handles massive datasets and distributed training across multiple
GPUs/TPUs.
Production Deployment: TensorFlow Lite for mobile and IoT devices, TensorFlow Serving
for cloud-based applications.
Numerical Computation: Works well for general-purpose numerical tasks, beyond machine
learning.
123
2. TensorFlow
Advantages:
Scalability: Distributed training and deployment capabilities make it suitable
for production.
Customizability: Low-level APIs allow detailed control of computations.
Ecosystem: TensorFlow Lite, TensorFlow.js, and TensorFlow Serving provide
tools for cross-platform development.
Performance: Optimized for speed and works well on GPUs and TPUs.
Disadvantages:
Complexity: The low-level API requires extensive knowledge and verbose
code.
Debugging: The static computational graph (in earlier versions) could make
debugging harder compared to frameworks like PyTorch.
124
3. Keras
Definition:
Keras is a high-level deep learning API written in Python that runs on top of
TensorFlow. It is designed for fast experimentation with neural networks through a
simple, user-friendly interface.
125
3. Keras
Uses:
Rapid Prototyping: Build and experiment with neural networks quickly using
minimal code.
Beginner-Friendly Deep Learning: Offers simple APIs for standard tasks.
Model Customization: Supports functional and sequential APIs for defining
complex models.
Transfer Learning: Easy to load and fine-tune pre-trained models.
126
3. Keras
Advantages:
Simplicity: High-level abstractions reduce the need for boilerplate code.
Backend Flexibility: Though now tied to TensorFlow, older versions allowed using Theano
or CNTK as a backend.
Pre-trained Models: Includes many pre-trained architectures (e.g., ResNet, VGG,
MobileNet) for quick use.
Disadvantages:
Dependency on Backends: Keras' performance depends on the underlying backend
(TensorFlow in modern versions).
Limited Low-Level Control: For highly customized models, the low-level TensorFlow API
or other frameworks may be better.
127
4. PyTorch
Definition:
PyTorch is an open-source deep learning framework developed by Meta (Facebook). It is
known for its dynamic computation graph and Pythonic programming style.
Uses:
Research prototyping, computer vision, natural language processing, and custom deep
learning models requiring flexible, dynamic graph execution.
Advantages:
Dynamic Graph: Easier to debug and modify than TensorFlow's static graph.
Pythonic: Feels like native Python, reducing complexity.
Community Support: Strong and growing ecosystem, widely used in research.
Disadvantages:
Production Tools: While improving, it historically lagged behind TensorFlow
for production environments.
Fewer High-Level Tools: Keras-like functionality is not as built-in (though
torch.nn helps).
130
Feature              Scikit-learn                     TensorFlow                          Keras                                     PyTorch
Primary Use          Classical ML (e.g., regression)  Deep learning, scalable ML          High-level API for deep learning          Research-focused deep learning
Core Philosophy      Simplicity for traditional ML    Production-ready, end-to-end ML     Simplified interface for TensorFlow       Dynamic graph execution for research
Community & Support  Large community, extensive docs  Backed by Google, huge ecosystem    Integrated into TensorFlow's ecosystem    Growing, backed by Meta (Facebook)
131
Scikit-learn: Ideal for small-scale projects focused on classical ML techniques.
TensorFlow: Suitable for production-ready deep learning and scalable
applications.
Keras: Best for beginners or rapid prototyping of deep learning models.
PyTorch: Great for researchers and experimental deep learning tasks requiring
dynamic graph execution.
132
134
Introduction to scikit-learn
Scikit-learn is a powerful and versatile open-source library in Python, designed for machine learning and
data science tasks. It is one of the most widely used libraries by data scientists and machine learning
practitioners due to its simplicity, efficiency, and comprehensive coverage of machine learning algorithms.
Built on top of other popular Python libraries such as NumPy, SciPy, and Matplotlib, scikit-learn provides a
seamless and unified interface for various machine learning models, preprocessing techniques, and evaluation
metrics. Its modular and consistent API allows users to focus on solving problems rather than worrying
about compatibility or implementation details.
135
Introduction to scikit-learn
The library supports both supervised and unsupervised learning methods, ranging from
simple algorithms like linear regression and k-nearest neighbors to more advanced
techniques like support vector machines, random forests…etc.
For unsupervised learning, it offers clustering algorithms like k-means and hierarchical
clustering, as well as dimensionality reduction techniques like principal component analysis
(PCA). What sets scikit-learn apart is its focus on user experience; the interface remains
consistent across all algorithms, making it easy to switch between models or compare their
performance with minimal changes to the code.
136
Introduction to scikit-learn
One of the key features of scikit-learn is its preprocessing utilities, which help ensure that the
data is ready for analysis. Real-world datasets often have missing values, categorical
variables, or features with vastly different scales. Scikit-learn provides tools to handle these
issues efficiently, such as scaling, encoding, and imputing missing values. These
preprocessing techniques are essential for achieving optimal performance from machine
learning models. Additionally, scikit-learn integrates seamlessly with Pandas for handling
datasets, Matplotlib for visualization, and NumPy for numerical computations, making it a
perfect fit for end-to-end machine learning workflows.
137
Introduction to scikit-learn
Installing scikit-learn is straightforward and can be done using Python's
package manager, pip.
A single command pip install scikit-learn is all it takes to get started.
138
Introduction to scikit-learn
The essence of scikit-learn lies in its ability to streamline the process of
building and deploying machine learning models. Instead of writing
complex code from scratch, users can utilize pre-built functions for tasks such
as data splitting, model training, and evaluation.
139
Loading and Exploring Data
from sklearn import datasets

# Load the Iris dataset
iris = datasets.load_iris()

# Explore the dataset
print("Keys in the dataset:", iris.keys())
print("Feature names:", iris.feature_names)
print("Target names (classes):", iris.target_names)
print("Data shape:", iris.data.shape)
print("Sample data point:", iris.data[0])

Output:
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target names (classes): ['setosa' 'versicolor' 'virginica']
Data shape: (150, 4)
Sample data point: [5.1 3.5 1.4 0.2]
141
Loading and Exploring Data
Loading Built-in Datasets
In the previous code, we loaded the Iris dataset and explored its structure. Scikit-learn
datasets are usually stored as dictionaries with keys like data (the feature values),
target (the labels), and feature_names (the names of the features).
The data array contains the measurements of each flower, while the target array maps
each flower to a species (e.g., Setosa, Versicolor, or Virginica). These datasets are
excellent for learning because they are clean, well-structured, and ready for use.
142
Loading and Exploring Data
Loading External Data
Real-world datasets are not prepackaged and require manual loading. Scikit-learn
supports loading data from NumPy arrays, Pandas DataFrames, or even SciPy
sparse matrices. For instance, let's load a CSV file using Pandas and convert it into a
format suitable for scikit-learn.
import pandas as pd
from sklearn.model_selection import train_test_split
In this example, we used Pandas to read a CSV file into a DataFrame. We then
separated the features (X) and the target variable (y), as machine learning models
require these inputs to be in separate variables. Finally, we split the dataset into
training and testing sets using scikit-learn's train_test_split function.
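A minimal sketch of this workflow, assuming a hypothetical data.csv file whose label column is named 'target':
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the CSV file into a DataFrame (file and column names are illustrative)
df = pd.read_csv('data.csv')

# Separate features (X) and the target variable (y)
X = df.drop('target', axis=1)
y = df['target']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)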
145
Preliminary Data Cleaning
Real-world datasets are rarely clean and often require preprocessing. For example,
missing values can be handled using scikit-learn's imputation techniques.
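For example, a small sketch using SimpleImputer to replace missing values with the column mean:
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
print(X_imputed)   # The NaN is replaced by the column mean (4.0)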
146
Converting Categorical Data
Categorical data often needs to be converted into a numeric format for machine
learning models. Scikit-learn provides the LabelEncoder and OneHotEncoder
classes for this purpose. Let‟s look at how to encode a column of categorical
values.
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Label encoding
categories = ['dog', 'cat', 'mouse']
le = LabelEncoder()
labels = le.fit_transform(categories)
print("Label Encoded:", labels)

# One-hot encoding
ohe = OneHotEncoder(sparse_output=False)
one_hot = ohe.fit_transform(labels.reshape(-1, 1))
print("One-Hot Encoded:\n", one_hot)

Output:
Label Encoded: [1 0 2]
One-Hot Encoded:
[[0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]]

Label encoding maps each category to a unique integer, while one-hot encoding creates binary
vectors representing each category. The choice between these methods depends on the
algorithm being used and whether it can handle ordinal relationships in the data.
147
Preprocessing and Feature Engineering
Data preprocessing and feature engineering are crucial steps in any machine learning
pipeline. The quality and structure of the input data directly impact the performance of
machine learning models. In many cases, preprocessing and feature engineering are the most
time-intensive steps because they involve cleaning, transforming, and preparing raw data
into a format suitable for model training. Scikit-learn provides an extensive toolkit to address
these challenges, ranging from scaling and encoding to more advanced techniques like
generating polynomial features or selecting relevant features.
148
Preprocessing and Feature Engineering
Standardization and Normalization
One of the most common preprocessing steps is scaling numerical features.
Standardization and normalization are two popular techniques for this purpose:
Standardization scales data to have a mean of 0 and a standard deviation of 1.
Normalization scales data to a range of [0, 1] or to have unit norm.
from sklearn.preprocessing import StandardScaler, Normalizer
import numpy as np

# Sample dataset
X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Standardization
scaler = StandardScaler()
X_standardized = scaler.fit_transform(X)
print("Standardized Data:\n", X_standardized)

# Normalization
normalizer = Normalizer()
X_normalized = normalizer.fit_transform(X)
print("Normalized Data:\n", X_normalized)

In this code, the StandardScaler removes the mean and scales the data to unit variance, while the
Normalizer scales each row (sample) to have unit norm. Standardization is particularly useful for
algorithms sensitive to feature scaling, such as support vector machines and k-nearest neighbors.

Output:
Standardized Data:
[[-1.22474487 -1.22474487 -1.22474487]
 [ 0.          0.          0.        ]
 [ 1.22474487  1.22474487  1.22474487]]
Normalized Data:
[[0.26726124 0.53452248 0.80178373]
 [0.45584231 0.56980288 0.68376346]
 [0.50257071 0.57436653 0.64616234]]
149
Preprocessing and Feature Engineering
Generating Polynomial Features
Sometimes, raw features are insufficient for capturing complex
relationships in data.
Polynomial features introduce higher-order interactions between
variables, allowing models to learn more intricate patterns.
from sklearn.preprocessing import PolynomialFeatures
import numpy as np

# Sample dataset
X = np.array([[1, 2], [3, 4], [5, 6]])

# Generate polynomial features
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

print("Original Data:\n", X)
print("Polynomial Features:\n", X_poly)

The PolynomialFeatures class creates new features by computing all possible combinations of
input features up to a specified degree. For example, if you have two features x1 and x2,
polynomial features of degree 2 include x1^2, x2^2, and x1*x2.

Output:
Original Data:
[[1 2]
 [3 4]
 [5 6]]
Polynomial Features:
[[ 1.  2.  1.  2.  4.]
 [ 3.  4.  9. 12. 16.]
 [ 5.  6. 25. 30. 36.]]
150
Preprocessing and Feature Engineering
Binarizing Continuous Features
In some cases, you may want to convert continuous features into binary values
(e.g., 0 or 1). This is useful for specific tasks like threshold-based classification.
The Binarizer class converts values above a given threshold to 1 and below it to 0, which can
be useful in certain scenarios where binary outcomes are needed.
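A short sketch (the threshold value is chosen for illustration):
import numpy as np
from sklearn.preprocessing import Binarizer

X = np.array([[1.5, 0.2], [0.4, 3.0], [2.1, 0.9]])
binarizer = Binarizer(threshold=1.0)
X_binary = binarizer.fit_transform(X)
print(X_binary)   # Values above 1.0 become 1, the rest become 0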
151
Supervised Learning Models
Supervised learning is the most commonly used machine learning paradigm where
models learn a mapping between input features (X) and target labels (y) using
labeled datasets. In supervised learning, the goal is either to classify data points
into predefined categories (classification) or to predict a continuous value
(regression). Scikit-learn provides a wide variety of algorithms for both tasks,
with consistent interfaces for training, predicting, and evaluating models.
This section delves into the most commonly used supervised learning models,
explaining their working principles and showcasing their implementation with
scikit-learn.
152
Supervised Learning Models
Linear Regression
Linear regression is a simple yet powerful algorithm for regression tasks. It assumes a linear
relationship between the independent variables (features) and the dependent variable (target):
y = β0 + β1x1 + β2x2 + ... + βnxn
The model tries to fit a straight line (or hyperplane in higher dimensions) that minimizes the sum of
squared differences between the observed and predicted values.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Generate synthetic data
X = np.random.rand(100, 1) * 10                   # Feature: 100 random points
y = 3 * X.squeeze() + 5 + np.random.randn(100)    # Target: linear with noise

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("R² Score:", r2_score(y_test, y_pred))

In this example:
• The LinearRegression class is used to create and train a model.
• The coefficients and intercept define the best-fit line, which the model learns during training.
• Evaluation metrics like Mean Squared Error (MSE) and R² Score assess the model's performance.
  A higher R² score indicates better alignment between predictions and actual values.
153
Supervised Learning Models
Logistic Regression
Logistic regression is a classification algorithm; it can be extended to multiclass problems using
techniques like one-vs-rest (OvR).

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Load the data and split it into train/test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
print(classification_report(y_test, y_pred))

This example demonstrates logistic regression for classification. The classification_report provides key
metrics like precision, recall, and F1-score, helping assess the model's ability to classify data points
accurately.
154
Supervised Learning Models
Support Vector Machines (SVM)
Support Vector Machines are versatile algorithms for classification and regression. SVMs work by
finding a hyperplane that best separates classes in a high-dimensional space. With the help of kernel
functions, SVMs can handle non-linear relationships by transforming data into higher dimensions.

from sklearn.svm import SVC
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np

# Generate synthetic dataset
X, y = make_blobs(n_samples=100, centers=2, random_state=42)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the model
model = SVC(kernel='linear')
model.fit(X_train, y_train)

# Visualize decision boundary
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='autumn')
ax = plt.gca()
xlim = ax.get_xlim()
ylim = ax.get_ylim()

# Create grid to evaluate the decision function
xx = np.linspace(xlim[0], xlim[1], 30)
yy = np.linspace(ylim[0], ylim[1], 30)
YY, XX = np.meshgrid(yy, xx)
xy = np.vstack([XX.ravel(), YY.ravel()]).T
Z = model.decision_function(xy).reshape(XX.shape)

# Plot the decision boundary and margins
ax.contour(XX, YY, Z, colors='k', levels=[-1, 0, 1], alpha=0.5, linestyles=['--', '-', '--'])
plt.show()

This example uses an SVM with a linear kernel. The decision boundary separating the two classes is
visualized on a 2D plane. SVMs are highly effective for datasets where the classes are well-separated.
K-Nearest Neighbors (KNN)
KNN is a simple algorithm that classifies data points based on their proximity to other points.
The model assigns the class most common among a point's k nearest neighbors.

from sklearn.neighbors import KNeighborsClassifier

# Initialize and train the model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Make predictions
y_pred = knn.predict(X_test)

In KNN, the choice of k (number of neighbors) significantly impacts the model's performance.
Smaller values of k make the model sensitive to noise, while larger values generalize the
decision boundary.
156
Supervised Learning Models
Decision Trees
Decision trees are interpretable models that split the data into subsets based on feature values.
Each internal node represents a decision rule, and each leaf node represents a class or prediction.

from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree

# Initialize and train the model
tree = DecisionTreeClassifier(max_depth=3)
tree.fit(X_train, y_train)

# Visualize the tree
plt.figure(figsize=(10, 8))
plot_tree(tree, feature_names=['Feature 1', 'Feature 2'], class_names=['Class 0', 'Class 1', 'Class 2'], filled=True)
plt.show()

Decision trees are intuitive and easy to visualize. However, they are prone to
overfitting, especially when grown deep without constraints such as max_depth.
157
Unsupervised Learning Models
This section will cover popular unsupervised learning algorithms, their use cases, and how to
implement them with scikit-learn.
158
Unsupervised Learning Models
Clustering:
Clustering algorithms divide the dataset into groups (clusters) of similar instances based on
their feature values. Each cluster contains data points that are more similar to each other than to
those in other clusters.
K-Means Clustering:
K-Means is one of the most popular clustering algorithms. It partitions data into k clusters,
where k is specified by the user. The algorithm assigns each point to the nearest cluster center
and iteratively adjusts the cluster centers to minimize intra-cluster variance.
159
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

# Apply K-Means
kmeans = KMeans(n_clusters=4, random_state=42)
kmeans.fit(X)

# Get cluster centers and labels
centers = kmeans.cluster_centers_
labels = kmeans.labels_

In this example:
• The make_blobs function generates synthetic data with 4 clusters.
• The KMeans model clusters the data into 4 groups.
• The results are visualized, showing cluster centers and color-coded clusters (see the sketch below).
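The visualization mentioned above can be produced with a few extra lines, for example:
# Plot the clustered points and the cluster centers
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=30)
plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='x', s=200)
plt.title('K-Means Clustering')
plt.show()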
161
Hierarchical Clustering:
Hierarchical clustering creates a tree of clusters (dendrogram), representing nested groupings. There are two
main approaches:
Agglomerative Clustering: (Bottom-Up Approach) Starts with individual data points and merges them
iteratively.
Divisive Clustering: (Top-Down Approach) Starts with all data points in one cluster and splits them
iteratively.
Agglomerative clustering is useful when
from sklearn.cluster import AgglomerativeClustering you want to understand the hierarchy of
from scipy.cluster.hierarchy import dendrogram, linkage data, and the dendrogram provides a
import matplotlib.pyplot as plt detailed visualization of the clustering
process.
# Apply Agglomerative Clustering
model = AgglomerativeClustering(n_clusters=3)
labels = model.fit_predict(X)
# Visualize dendrogram
linkage_matrix = linkage(X, method='ward')
dendrogram(linkage_matrix)
plt.title("Dendrogram")
plt.xlabel("Sample Index")
plt.ylabel("Distance")
plt.show()
162
Unsupervised Learning Models
Dimensionality Reduction
Dimensionality reduction techniques reduce the number of features while retaining as much
information as possible. These techniques are invaluable for visualization, noise reduction, and
speeding up computations.
Principal Component Analysis (PCA)
PCA is a widely used dimensionality reduction technique. It projects data into a lower-dimensional
space by finding new axes (principal components) that maximize variance.

from sklearn.decomposition import PCA
import numpy as np
import matplotlib.pyplot as plt

# Generate synthetic data
np.random.seed(42)
X = np.random.rand(100, 5)   # 5-dimensional data

# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

print("Original Shape:", X.shape)
print("Reduced Shape:", X_pca.shape)

# Visualize the reduced data
plt.scatter(X_pca[:, 0], X_pca[:, 1], c='blue')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA Result')
plt.show()

In this example:
• The original 5-dimensional data is reduced to 2 dimensions.
• PCA identifies the directions of maximum variance, helping to preserve the most important information.
163
Model Training, Evaluation and Metrics
Model evaluation is a critical part of the machine learning workflow. It helps you understand
how well your model performs and whether it generalizes effectively to unseen data. In
supervised learning, we measure the agreement between the predicted and true labels for
classification, or the closeness of predicted values to true values for regression. For
unsupervised learning, evaluation focuses on the quality of clusters or the reduction in
dimensionality. Scikit-learn offers a robust set of metrics and techniques to evaluate models
comprehensively.
In this section, we will cover the essential evaluation methods for classification, regression,
and clustering tasks. Detailed examples are provided to demonstrate how these metrics can
guide you in choosing and improving your models.
164
Model Training, Evaluation and Metrics
Train-Test Split
The simplest way to evaluate a model is to divide the dataset into two parts: a training set to train the
model and a test set to evaluate its performance. This ensures that the model is tested on data it has
never seen before, providing a realistic estimate of how it will perform in production.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic dataset
X = [[i] for i in range(10)]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Evaluate accuracy on the held-out test set
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
166
Model Training, Evaluation and Metrics
Classification Metrics
For classification tasks, the choice of evaluation metric depends on the problem's
requirements.
Scikit-learn provides various metrics such as accuracy, precision, recall, F1-score,
and ROC-AUC.
True Positives (TP), False Positives (FP), True Negatives (TN), and False
Negatives (FN):
In classification problems, confusion matrices are used to evaluate the
performance of a classification model. The matrix shows the counts of true
positives, false positives, true negatives, and false negatives, which are critical in
calculating various evaluation metrics, such as accuracy, precision, recall, F1-
score, and specificity.
167
Model Training, Evaluation and Metrics
Classification Metrics
True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN):
Here's a breakdown of these terms:
• True Positives (TP): These are the cases where the model correctly predicted the positive class. In other
words, the true label is positive (1), and the model predicted it as positive (1).
Example: The model correctly predicts that a person has a disease when they actually do.
• False Positives (FP): These are the cases where the model incorrectly predicted the positive class. In other
words, the true label is negative (0), but the model predicted it as positive (1).
Example: The model incorrectly predicts that a person has a disease when they actually do not.
• True Negatives (TN): These are the cases where the model correctly predicted the negative class. In other
words, the true label is negative (0), and the model predicted it as negative (0).
Example: The model correctly predicts that a person does not have a disease when they actually do not.
• False Negatives (FN): These are the cases where the model incorrectly predicted the negative class. In other
words, the true label is positive (1), but the model predicted it as negative (0).
Example: The model incorrectly predicts that a person does not have a disease when they actually do.
168
Model Training, Evaluation and Metrics
Classification Metrics:
Confusion Matrix
• The confusion matrix summarizes prediction results, showing counts of true
positives, false positives, true negatives, and false negatives.
from sklearn.metrics import confusion_matrix
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
print(confusion_matrix(y_true, y_pred))

Output:
[[2 0]
 [1 2]]
170
Model Training, Evaluation and Metrics
Classification Metrics
Accuracy:
Accuracy is the proportion of correctly classified instances. While simple, it can be
misleading for imbalanced datasets where one class dominates.
In this example, accuracy is calculated as the ratio of correct predictions to the total
number of predictions.
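For example:
from sklearn.metrics import accuracy_score

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
print("Accuracy:", accuracy_score(y_true, y_pred))   # 0.8 (4 of 5 predictions correct)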
171
Model Training, Evaluation and Metrics
Classification Metrics:
Precision, Recall, and F1-Score:
• Precision: The proportion of true positives among predicted positives.
  Precision = TP / (TP + FP)
• Recall (Sensitivity or True Positive Rate): The proportion of true positives among actual positives.
  Recall = TP / (TP + FN)
• F1-Score: The harmonic mean of precision and recall, balancing both metrics.
  F1 = 2 * (Precision * Recall) / (Precision + Recall)

from sklearn.metrics import classification_report
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
print(classification_report(y_true, y_pred))

The classification_report provides a detailed breakdown of precision, recall, and F1-score for each class.
172
Model Training, Evaluation and Metrics
Classification Metrics:
ROC and AUC:
The Receiver Operating Characteristic (ROC) curve plots the true positive rate against the
false positive rate. The Area Under the Curve (AUC) measures the model's ability to
distinguish between classes.
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt
y_true = [0, 1, 1, 0, 1]
y_scores = [0.2, 0.8, 0.6, 0.4, 0.9]   # Model probabilities
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print("AUC:", roc_auc_score(y_true, y_scores))

Regression Metrics:
Mean Absolute Error (MAE) and R² Score:
• MAE is the average of the absolute differences between the actual values and the predicted values.
  A lower MAE indicates better performance because it suggests the model is making more accurate
  predictions without large errors.
• Higher R² values are better because they indicate that the model explains more of the variance
  in the data. An R² score of 1 means the model perfectly predicts the data, and an R² score of 0
  means the model does no better than predicting the mean value.
Silhouette Score:
Silhouette Score measures how similar a point is to its cluster compared to other clusters.
A higher score indicates well-separated clusters.

from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans

# Clustering example
X = [[1], [2], [8], [9]]
kmeans = KMeans(n_clusters=2, random_state=42)
labels = kmeans.fit_predict(X)
print("Silhouette Score:", silhouette_score(X, labels))
176
Model Training, Evaluation and Metrics
Clustering Metrics:
Adjusted Rand Index (ARI):
ARI compares the similarity between predicted and true clusters, accounting for chance.

from sklearn.metrics import adjusted_rand_score
true_labels = [0, 0, 1, 1]
pred_labels = [0, 0, 1, 1]
print("Adjusted Rand Index:", adjusted_rand_score(true_labels, pred_labels))   # 1.0 for identical clusterings
177
Hyperparameter Tuning
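A minimal sketch of hyperparameter tuning with scikit-learn's GridSearchCV (the model and parameter grid are chosen for illustration):
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Define the hyperparameters to search over
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}

# GridSearchCV tries every combination using 5-fold cross-validation
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X, y)

print("Best parameters:", grid.best_params_)
print("Best cross-validation score:", grid.best_score_)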
178
Any questions?
179