Chapter 04 Advanced Use of Python Libraries for AI and Data Science
Introduction
What is data science?
Data science is an interdisciplinary field focused on extracting
knowledge, insights, and value from structured and unstructured
data. It combines techniques from statistics, mathematics,
computer science, and domain-specific expertise to analyze,
interpret, and predict patterns in data.
2
Introduction
Key Components of Data Science
• Data Collection and Storage: Gathering data from multiple sources,
including databases, APIs, web scraping, sensors, and more. Ensuring
data is stored in formats that are accessible and organized for analysis is
critical.
• Data Cleaning and Preparation: Before analysis, data often requires
significant cleaning and transformation, such as handling missing values,
correcting errors, standardizing formats, and ensuring consistency.
• Exploratory Data Analysis (EDA): EDA involves understanding the data's
structure, distributions, and key features through visualization and
summary statistics to form hypotheses and identify trends.
3
Introduction
Key Components of Data Science
• Statistical Analysis and Hypothesis Testing: Using statistical methods to test
hypotheses, identify correlations, and draw inferences from data.
• Machine Learning and Modeling: Applying algorithms to build predictive
models, classify data, detect patterns, and automate decision-making. Models
range from simple linear regressions to complex deep learning models,
depending on the problem and data.
• Data Visualization: Presenting insights visually, using tools like charts, graphs,
dashboards, and infographics. Visualization helps stakeholders (or Users)
understand complex data and models intuitively.
• Deployment and Optimization: Implementing data-driven solutions into
production environments where they can be used in real-time applications. This
includes ongoing monitoring and optimization to ensure models remain accurate
over time.
4
Introduction
Data categories
In data science, data types are generally categorized based on their
structure and format: structured, semi-structured, and unstructured
data
1. Structured Data
Structured data is highly organized and easily searchable in predefined formats, often found in tabular form with
rows and columns.
Characteristics:
Follows a strict schema (e.g., rows and columns in a database).
Easily searchable with SQL queries.
Data types and formats are defined in advance, making the data predictable and consistent.
Examples:
Relational databases (e.g., MySQL, PostgreSQL).
Spreadsheets (e.g., Excel tables).
Financial records, inventory data, and customer information stored in databases.
5
Introduction
Data categories
2. Semi-Structured Data
Semi-structured data lacks a fixed schema but still contains tags or markers to separate
elements, making it partially organized.
Characteristics:
Data doesn't strictly adhere to a fixed schema but still has some structure, like tags or
other markers.
More flexible than structured data but still organized enough for certain querying and
parsing methods.
Often represented in text files with a flexible schema (e.g., XML or JSON format).
Examples:
JSON and XML documents.
Email messages (with structured headers but free-form body content).
NoSQL databases (e.g., MongoDB, which allows documents with variable
structures).
HTML pages, where some parts have a defined structure but contain unstructured data
in free-form text.
6
Introduction
Data categories
3. Unstructured Data
Unstructured data lacks any predefined format or organization, making it harder to
process and analyze without specialized tools.
Characteristics:
No inherent structure, schema, or easily searchable format.
Data may be stored in formats that make analysis challenging without natural
language processing (NLP) or image recognition tools.
Often requires data processing and transformation before analysis can be done.
Examples:
Text documents, PDF files.
Media files (e.g., images, audio, video).
Social media posts, chat messages, and emails.
Sensor data from IoT devices, which may have raw, unprocessed values.
7
Introduction
Data categories
Summary Table
Data Type        Characteristics                                              Examples
Structured       Rigid schema, organized, easily searchable                   Relational databases, Excel sheets
Semi-Structured  Flexible schema, contains tags/markers for organization      JSON, XML, NoSQL databases, emails
Unstructured     No schema, difficult to search/organize without processing   Images, audio, video, social media posts, text files
8
Introduction to Key Libraries for AI and Data Science
9
Introduction to NumPy
10
Introduction to NumPy
Why Use NumPy Over Regular Python Lists
Performance: NumPy arrays are implemented in C, which makes array
operations significantly faster than equivalent operations on regular Python
lists. This is particularly true for large datasets.
Memory Efficiency: NumPy uses less memory than lists, which is
essential when working with big data.
Vectorized Operations: NumPy supports vectorized operations, allowing
you to apply a function to an entire array without needing explicit loops,
which enhances performance.
Mathematical Functions: NumPy includes a wide variety of built-in
mathematical functions and statistical operations, making it easier to
perform complex calculations.
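A quick sketch of these points (array values chosen for illustration):
import numpy as np

arr = np.arange(1_000_000)

# Vectorized operation: applied to the whole array at once, no explicit Python loop
squared = arr ** 2

# Built-in mathematical and statistical functions
print(arr.mean(), arr.std(), arr.sum())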
11
Introduction to NumPy
12
Introduction to NumPy
Practical Applications
•Data Analysis: Efficiently compute statistical metrics.
•Image Processing: Manipulate pixel data as 2D or 3D arrays.
•Scientific Computing: Perform simulations and mathematical modeling.
•Machine Learning Pipelines: Integrate data transformation and feature
scaling as part of ML workflows
13
Introduction to NumPy
pip install numpy will install the latest version of NumPy and any dependencies it requires
import numpy as np
Creating Multidimensional Arrays
1D Array: arr=np.array([1, 2, 3])
2D Array: arr=np.array([[1, 2, 3], [4, 5, 6]])
3D Array: arr=np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
ND Array (N-dimensional)
# Using zeros, ones, and arange …etc
zeros_array = np.zeros((3, 3)) # 3x3 array of zeros
ones_array = np.ones((2, 4)) # 2x4 array of ones
range_array = np.arange(0, 10, 2) # Array: [0, 2, 4, 6, 8]
full_array = np.full((3, 3), 7) # 3x3 array filled with the value 7
empty_array = np.empty((2, 3)) # 2x3 uninitialized array
linear_space = np.linspace(0, 10, 5) # 5 values from 0 to 10
random_array = np.random.rand(3, 2) # 3x2 array of random numbers between 0 and 1
random_integers = np.random.randint(1, 10, (2, 3)) # 2x3 array of random integers between 1 and 9
14
Introduction to NumPy
pip install numpy will install the latest version of NumPy and any dependencies it requires
import numpy as np
Specifying Data Types During Creation:
When creating an array, you can specify the data type (dtype) to ensure the array
elements are of a particular type, such as integers, floats, or complex numbers. This can
be done using the dtype parameter.
import numpy as np
# Integer array
int_array = np.array([1, 2, 3, 4], dtype=int)
# Float array
float_array = np.array([1, 2, 3, 4], dtype=float)
# Complex array
complex_array = np.array([1, 2, 3, 4], dtype=complex)
Specifying dtype helps in controlling memory usage and ensures consistent data handling, especially when performing mathematical operations.
15
Introduction to NumPy
Typecasting Arrays with astype()
You can change the data type of an existing array using the astype()
method. This is especially useful if you need to convert an array's data type
after creation.
import numpy as np
# Converting integer array to float
float_array = int_array.astype(float)
# Converting float array to integer
int_array = float_array.astype(int)
# Converting integer array to complex
complex_array = int_array.astype(complex)
Typecasting can be beneficial for compatibility with specific functions or for saving memory by switching to smaller data types when possible.
16
Introduction to NumPy
Working with Complex Data Types: Structured Arrays and Records
Structured arrays allow you to handle complex data by combining multiple data types in a
single array, similar to records or rows in a database table. Structured arrays are created
using compound data types, often useful when working with tabular data.
Creating a Structured Array
You can define structured arrays by specifying field names and data types:
import numpy as np
# Define a structured array with fields 'name', 'age', and 'height'
person_dtype = [('name', 'U10'), ('age', 'i4'), ('height', 'f4')]
people = np.array([('Alice', 25, 5.5), ('Bob', 30, 6.0)], dtype=person_dtype)
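Individual fields of the structured array can then be accessed by name, for example:
print(people['name'])        # ['Alice' 'Bob']
print(people['age'].mean())  # 27.5
print(people[0])             # ('Alice', 25, 5.5)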
17
Introduction to NumPy
Structured arrays are ideal when dealing with datasets that contain multiple data types across fields,
such as database records or CSV data.
18
Introduction to NumPy
Array Indexing and Slicing
• Indexing: Access specific elements (e.g., arr[0, 1] for a 2D array).
• Slicing: Extract subarrays (e.g., arr[:, 1] for selecting the second
column).
Reshaping and Flattening Arrays
• Reshape: Change the shape of an array without altering its data
(e.g., arr.reshape(3, 2)).
• Flatten: Convert a multidimensional array to a 1D array using
arr.flatten() or arr.ravel().
19
Introduction to NumPy
Some Attributes and Functions
Attributes:
shape: Dimensions of the array.
size: Total number of elements in the array.
dtype: Data type of the array elements.
ndim: Number of dimensions (axes) of the array.
itemsize: Number of bytes per element in the array.
Functions:
reshape(): Change the shape of an array while keeping the same number of elements.
flatten(): Convert a multi-dimensional array into a 1D array.
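A short example showing these attributes and functions on a 2x3 array:
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr.shape)     # (2, 3)
print(arr.size)      # 6
print(arr.dtype)     # e.g. int64 (platform dependent)
print(arr.ndim)      # 2
print(arr.itemsize)  # bytes per element, e.g. 8 for int64
print(arr.reshape(3, 2))
print(arr.flatten()) # [1 2 3 4 5 6]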
20
Introduction to NumPy
21
Introduction to NumPy
Array Operations
Element-wise Operations: Operations applied to each element (e.g., arr1 + arr2).
Matrix Multiplication: Use np.dot(arr1, arr2) or arr1 @ arr2.
Broadcasting: Automatic expansion of array dimensions for operations (e.g., adding a
1D array to a 2D array). Rule: two dimensions are compatible when they are equal, or one of them is 1.
import numpy as np
# Add a 1D array to each row of a 2D array
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([10, 20, 30])
print(a + b)
# Output: [[11, 22, 33], [14, 25, 36]]
22
Introduction to NumPy
Aggregation Functions
• np.sum(arr): Sum of all elements.
• np.mean(arr): Mean value.
• np.min(arr), np.max(arr): Minimum and maximum values.
• np.std(arr): Standard deviation.
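For example:
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])
print(np.sum(arr))               # 21
print(np.mean(arr))              # 3.5
print(np.min(arr), np.max(arr))  # 1 6
print(np.std(arr))               # ~1.708
# Aggregations can also be computed along an axis
print(np.sum(arr, axis=0))       # [5 7 9]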
23
Introduction to NumPy
24
25
Pandas ("panel data") is a Python library used for data analysis.
Panel data is a term from econometrics and statistics. It refers to data that
contains observations over multiple time periods for the same individuals
or entities.
26
Pandas extends NumPy by providing functions for exploratory
data analysis, statistics, and data visualization.
27
Key Features of Pandas
• Create a structured data set like R's data frame and Excel spreadsheet.
• Reading data from various sources such as CSV, TXT, XLSX, JSON, SQL
database etc.
• Writing and exporting data in CSV or Excel format (built-in methods)
• It has a fast and efficient DataFrame object with default and customized
indexing.
• It is used for reshaping and pivoting of the data sets.
• It can do group by data aggregations and transformations.
• It integrates with the other libraries like SciPy, and scikit-learn.
• Many more …
28
29
30
Difference between NumPy and Pandas
Before we dive into the details of Pandas, we would like to point out some
important differences between NumPy and Pandas.
• Pandas mainly works with the tabular data and is popularly used for data
analysis and visualization, whereas the NumPy works primarily with the
numerical data.
• Pandas provides some sets of powerful tools like DataFrame and Series
that is mainly used for analyzing the data, whereas NumPy offers a
powerful object called ndarray.
• NumPy performs better than Pandas for about 50K rows or fewer, whereas Pandas
performs better than NumPy for about 500K rows or more. Between 50K and 500K
rows, performance depends on the kind of operation.
• NumPy consumes less memory as compared to Pandas.
• Indexing of Series objects is quite slow as compared to NumPy arrays.
31
Pandas Data Objects
Pandas provides two data structures/objects for processing the data, i.e., Series
and DataFrame. Pandas objects can be thought of as enhanced versions of
NumPy structured arrays in which the rows and columns are identified with
labels rather than simple integer indices.
1. Pandas Series: It is a one-dimensional labeled array to store the data
elements which are generally similar to the columns in Excel. A Series
contains a sequence of values and an associated array of data labels,
called its index.
33
Pandas Series Object
We can easily convert a list, tuple, or dictionary into a Series using the Series()
constructor with the following syntax:
pandas.Series(data, index, dtype, copy)
import pandas as pd
The following syntax creates an empty Series:
series1 = pd.Series()
print(series1)
The example above creates an empty Series object that has no values
and uses the default data type, float64.
37
Pandas Series Object
Creating a Series using inputs:
list1 = [10,20,30]
arr1 = np.array([10,20,30])
dict1 = {'a':10,'b':20,'c':30}
38
Pandas Series Object
In [315]: pd.Series(list1)
Out[315]:
0    10
1    20
2    30
dtype: int64

labels = ['a', 'b', 'c']
In [320]: series1 = pd.Series(list1, index=labels)
In [321]: series1
Out[321]:
a    10
b    20
c    30
dtype: int64
39
Pandas Series Object
40
Pandas Series Object
41
Pandas Series Object
Accessing Data from Series
In [329]: series1[[0,1,2]]
Out[329]:
a 10
b 20
c 30
dtype: int64
42
Pandas Series Object
Series object attributes
Attribute          Description
Series.index       Defines the index of the Series
Series.shape       Returns a tuple of the shape of the data
Series.dtype       Returns the data type
Series.size        Returns the total number of elements in the Series
Series.empty       Returns True if the Series is empty, otherwise False
Series.hasnans     Returns True if there are any missing values, otherwise False
Series.nbytes      Returns the number of bytes in the data
Series.ndim        Returns the number of dimensions
43
Pandas Series Object
Series object attributes
In [330]: print(series1.index)
Index(['a', 'b', 'c'], dtype='object')
In [331]: print(series1.values)
[10 20 30]
In [332]: print(series1.dtype)
int64
In [334]: print(series1.shape)
(3,)
44
Pandas Series Object
Series object attributes
45
Pandas Series Object
Series.to_frame()
A Series is a one-dimensional, list-like structure that can hold integers, strings, floats,
etc. It returns an object in the form of a list with an index running from 0 to n-1, where
n is the number of values in the Series. The main difference between a Series and a
DataFrame is that a Series can only contain a single list with a particular index, whereas
a DataFrame is a combination of more than one Series that can be analyzed together.
The series.to_frame() function is used to convert a Series object to a DataFrame.
Syntax:
series.to_frame(name = None)
46
Pandas Series Object
series.to_frame(name = None)
name: Refers to the name of the resulting column. Its default value is None. If a value is
passed, it will be used as the column name in place of the Series name. The method returns
the DataFrame representation of the Series.
In [352]: series1.to_frame()
Out[352]:
   0
0  4
1  7
2  6
3  1
4  3
5  4
47
Pandas Series Object
Series.value_counts()
To extract information about the values contained in a series, use the
value_counts() function. The value_counts() function returns a Series object
that contains counts of unique values. It returns an object that will be in
descending order so that its first element will be the most-frequently-occurred
element. By default, it excludes NA values.
Syntax
Series.value_counts(normalize = False, sort = True, ascending =
False, bins = None, dropna = True)
48
Pandas Series Object
Series.value_counts()
In [353]: x = pd.Series([4,1,7,6,1])
In [354]: x.value_counts()
Out[354]:
1    2
4    1
7    1
6    1
dtype: int64
49
Pandas DATAFRAME Object
50
Pandas DATAFRAME Object
Advantages of DataFrames:
Ease of Use:
Intuitive methods for reading, writing, and manipulating data.
Ability to handle heterogeneous data (e.g., numeric, textual, and categorical
data).
Efficient Data Handling:
Works well with large datasets.
Optimized for performance using built-in vectorized operations.
Flexible Data Manipulation:
Add, modify, and drop columns or rows easily.
Supports reshaping, merging, and filtering data intuitively.
51
Pandas DATAFRAME Object
Advantages of DataFrames:
Data Cleaning:
Handle missing data with tools for filling, interpolating, or dropping NaN values.
Perform data transformations such as renaming columns, changing data types, or applying
functions.
Integration:
Visualize data using libraries like Matplotlib and Seaborn.
Use DataFrames as inputs for machine learning models with Scikit-learn.
Rich Indexing and Slicing:
Access data by labels or positions.
Perform row or column filtering with intuitive syntax.
Support for Data Analysis:
Generate descriptive statistics with .describe().
Group data and compute aggregations using .groupby().
52
Pandas DATAFRAME Object
pip install pandas
Creating DataFrames
import pandas as pd
From dictionaries: data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)
Name Age
0 Alice 25
1 Bob 30
54
Pandas DATAFRAME Object
Viewing DataFrames
Summary Information:
df.info()
df.describe()

Output of df.describe():
             Age
count   2.000000
mean   27.500000
std     3.535534
min    25.000000
25%    26.250000
50%    27.500000
75%    28.750000
max    30.000000
55
Pandas DATAFRAME Object
Data Selection and Indexing
Selecting Data
By columns:
df['Name'] # Single column
df[['Name', 'Age']] # Multiple columns
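Rows (and individual values) can also be selected by label with loc or by position with iloc; a small sketch using the df created earlier:
df.loc[0]              # Row with index label 0
df.iloc[0]             # First row by position
df.loc[0, 'Name']      # Single value: 'Alice'
df.iloc[0:1, 0:2]      # Slice of rows and columns by position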
56
Pandas DATAFRAME Object
Data Selection and Indexing
Filtering Data
Using conditions:
df[df['Age'] > 25]
58
Pandas DATAFRAME Object
Renaming Columns
df.rename(columns={'Age': 'Years'}, inplace=True)
59
Pandas DATAFRAME Object
Data Transformation
Adding/Updating Columns
df['NewColumn'] = df['Age'] * 2
Applying Functions
Row-wise or column-wise:
df['Transformed'] = df['Age'].apply(lambda x: x + 5)
60
Pandas DATAFRAME Object
Data Aggregation
Group By
grouped = df.groupby('Name').mean()

Input DataFrame:
      Name  Age  Score
0    Alice   25     85
1      Bob   30     90
2    Alice   28     88
3      Bob   35     85
4  Charlie   40     95

Result:
          Age  Score
Name
Alice    26.5   86.5
Bob      32.5   87.5
Charlie  40.0   95.0

Sorting
df.sort_values('Age', ascending=False, inplace=True)
61
Pandas DATAFRAME Object
Data Aggregation
Group By
grouped = df.groupby('Name').mean()
Using Other Aggregations:
Replace .mean() with other aggregation functions:
• .sum(): Total sum of values in each group.
• .count(): Count of rows in each group.
• .max(): Maximum value in each group.
• .min(): Minimum value in each group.
Sorting
df.sort_values('Age', ascending=False, inplace=True)
62
Pandas DATAFRAME Object
Saving DataFrames
To CSV
df.to_csv('output.csv', index=False)
To Excel
df.to_excel('output.xlsx', index=False)
63
64
What is Matplotlib?
• Matplotlib is a low level graph plotting
library in python that serves as a
visualization utility.
• Matplotlib is open source and we can use
it freely.
• Matplotlib is mostly written in Python; a
few segments are written in C, Objective-C,
and JavaScript for platform
compatibility.
65
Installation of Matplotlib: pip install matplotlib
Use it in app: import matplotlib
import matplotlib
print(matplotlib.__version__)
66
67
Pyplot:
Most of the Matplotlib utilities lie under the pyplot submodule
and are usually imported under the plt alias:
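For example:
import matplotlib.pyplot as plt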
68
Plotting x and y points
• The plot() function is used to draw points (markers) in a diagram.
• By default, the plot() function draws a line from point to point.
• The function takes parameters for specifying points in the diagram.
• Parameter 1 is an array containing the points on the x-axis.
• Parameter 2 is an array containing the points on the y-axis.
• If we need to plot a line from (1, 3) to (8, 10), we have to pass two arrays
[1, 8] and [3, 10] to the plot function.
69
Plotting x and y points
import matplotlib.pyplot as plt
import numpy as np
xpoints = np.array([1, 8])
ypoints = np.array([3, 10])
plt.plot(xpoints, ypoints)
plt.show()
70
Plotting Without Line
• To plot only the markers, you can use the shortcut string notation parameter 'o',
which means 'rings'.
import matplotlib.pyplot as plt
import numpy as np
plt.plot(xpoints, ypoints, 'o')
plt.show()
71
Multiple Points
• You can plot as many points as you like; just make sure you have the same
number of points on both axes.
plt.plot(xpoints, ypoints)
plt.show()
72
Default X-Points
If we do not specify the points on the x-axis, they will get the default values 0, 1, 2, 3 etc.,
depending on the length of the y-points.
So, if we take the same example as above, and leave out the x-points, the diagram will look
like this:
plt.plot(ypoints)
plt.show()
73
Markers
You can use the keyword argument marker to emphasize each point
with a specified marker:
74
Markers
You can use the keyword argument marker to emphasize each point with a
specified marker:
...
plt.plot(ypoints, marker = '*')
...
75
Markers
Marker  Description
'o'     Circle
'*'     Star
'.'     Point
','     Pixel
'x'     X
'X'     X (filled)
'+'     Plus
'P'     Plus (filled)
's'     Square
'D'     Diamond
'd'     Diamond (thin)
'p'     Pentagon
'H'     Hexagon
'h'     Hexagon
'v'     Triangle Down
'^'     Triangle Up
'<'     Triangle Left
'>'     Triangle Right
'1'     Tri Down
'2'     Tri Up
'3'     Tri Left
'4'     Tri Right
'|'     Vline
'_'     Hline
76
Format Strings fmt
You can also use the shortcut string notation parameter to specify the marker.
This parameter is also called fmt, and is written with this syntax:
marker|line|color
Example: 'o:r' marks each point with a circle, uses a dotted line, and colors it red.
import matplotlib.pyplot as plt
import numpy as np
plt.plot(ypoints, 'o:r')
plt.show()
77
Marker Size
You can use the keyword argument markersize or the shorter version, ms
to set the size of the markers:
Set the size of the markers to 20:
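For example (assuming ypoints is defined as in the earlier examples):
plt.plot(ypoints, marker = 'o', ms = 20)
plt.show()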
78
Marker Color
You can use the keyword argument markeredgecolor or the shorter mec to
set the color of the edge of the markers:
Set the EDGE color to red:
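For example (assuming ypoints as before):
plt.plot(ypoints, marker = 'o', ms = 20, mec = 'r')
plt.show()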
79
Marker Color
You can use the keyword argument markerfacecolor or the shorter mfc to set the color
inside the edge of the markers:
Set the FACE color to red:
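For example (assuming ypoints as before):
plt.plot(ypoints, marker = 'o', ms = 20, mfc = 'r')
plt.show()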
Linestyle
You can use the keyword argument linestyle, or the shorter ls, to change the style of the plotted line:
plt.plot(ypoints, ls = ':')
Style               Shorthand
'solid' (default)   '-'
'dotted'            ':'
'dashed'            '--'
'dashdot'           '-.'
'None'              '' or ' '
81
Matplotlib Line
Line Color: You can use the keyword argument color or the
shorter c to set the color of the line:
plt.plot(ypoints, color = 'r')
plt.plot(ypoints, c = '#4CAF50')
plt.plot(ypoints, c = 'hotpink')

Multiple Lines: you can plot as many lines as you like by adding more plt.plot() calls:
x1 = np.array([0, 1, 2, 3])
y1 = np.array([3, 8, 1, 10])
x2 = np.array([0, 1, 2, 3])
y2 = np.array([6, 2, 7, 11])
plt.plot(x1, y1)
plt.plot(x2, y2)
84
Matplotlib Labels and Title
Create Labels for a Plot: With Pyplot, you can use
the xlabel() and ylabel() functions to set a label for the
x- and y-axis.
plt.plot(x, y)
plt.xlabel("Average Pulse")
plt.ylabel("Calorie Burnage")
85
Set Font Properties for Title and Labels
You can use the fontdict parameter in xlabel(),
ylabel(), and title() to set font properties for the title
and labels.
font1 = {'family':'serif','color':'blue','size':20}
font2 = {'family':'serif','color':'darkred','size':15}
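These dictionaries are then passed through the fontdict parameter, for example (assuming x and y are already defined):
plt.title("Plot Title", fontdict = font1)
plt.xlabel("Average Pulse", fontdict = font2)
plt.ylabel("Calorie Burnage", fontdict = font2)
plt.plot(x, y)
plt.show()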
86
Position the Title
You can use the loc parameter in title() to position the title.
Legal values are: 'left', 'right', and 'center'. Default value is
'center'.
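For example:
plt.title("Plot Title", loc = 'left')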
87
Matplotlib Adding Grid Lines
Add Grid Lines to a Plot: With Pyplot, you can use the
grid() function to add grid lines to the plot.
plt.plot(x, y)
plt.grid()
plt.show()
Specify Which Grid Lines to Display: You can use the axis
parameter in the grid() function to specify which grid lines to
display.
Legal values are: 'x', 'y', and 'both'. Default value is 'both'.
plt.grid(axis = 'x')
plt.grid(axis = 'y')
88
Matplotlib Adding Grid Lines
Set Line Properties for the Grid: You can also set the line properties of the grid, like this: grid(color = 'color',
linestyle = 'linestyle', linewidth = number).
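For example:
plt.grid(color = 'green', linestyle = '--', linewidth = 0.5)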
89
Matplotlib Subplot
Display Multiple Plots: With the subplot() function you can draw multiple plots in
one figure:
#plot 1:
x = np.array([0, 1, 2, 3])
y = np.array([3, 8, 1, 10])
plt.subplot(1, 2, 1)
plt.plot(x,y)
#plot 2:
x = np.array([0, 1, 2, 3])
y = np.array([10, 20, 30, 40])
plt.subplot(1, 2, 2)
plt.plot(x,y)
plt.show()
90
The subplot() Function
The subplot() function takes three arguments that describes the layout of the figure.
The layout is organized in rows and columns, which are represented by the first and
second argument.
The third argument represents the index of the current plot.
plt.subplot(1, 2, 1)
#the figure has 1 row, 2 columns, and this plot is the first plot.
plt.subplot(1, 2, 2)
#the figure has 1 row, 2 columns, and this plot is the second plot.
91
The subplot() Function
Example: Draw 2 plots on top of each other:
#plot 1:
x = np.array([0, 1, 2, 3])
y = np.array([3, 8, 1, 10])
plt.subplot(2, 1, 1)
plt.plot(x,y)
#plot 2:
x = np.array([0, 1, 2, 3])
y = np.array([10, 20, 30, 40])
plt.subplot(2, 1, 2)
plt.plot(x,y)
plt.show()
92
The subplot() Function
You can draw as many plots as you like on one figure; just describe the number of rows, columns, and the index of the plot.
Example: draw six plots in a 2x3 grid:
x = np.array([0, 1, 2, 3])
y = np.array([3, 8, 1, 10])
plt.subplot(2, 3, 1)
plt.plot(x, y)

x = np.array([0, 1, 2, 3])
y = np.array([10, 20, 30, 40])
plt.subplot(2, 3, 2)
plt.plot(x, y)

plt.subplot(2, 3, 3)
plt.plot(x, y)

x = np.array([0, 1, 2, 3])
y = np.array([10, 20, 30, 40])
plt.subplot(2, 3, 4)
plt.plot(x, y)

x = np.array([0, 1, 2, 3])
y = np.array([3, 8, 1, 10])
plt.subplot(2, 3, 5)
plt.plot(x, y)

x = np.array([0, 1, 2, 3])
y = np.array([10, 20, 30, 40])
plt.subplot(2, 3, 6)
plt.plot(x, y)
plt.show()
93
Title and Super Title
You can add a title to each plot with the title() function:
#plot 1:
x = np.array([0, 1, 2, 3])
y = np.array([3, 8, 1, 10])
plt.subplot(1, 2, 1)
plt.plot(x,y)
plt.title("SALES")
#plot 2:
x = np.array([0, 1, 2, 3])
y = np.array([10, 20, 30, 40])
plt.subplot(1, 2, 2)
plt.plot(x,y)
plt.title("INCOME")
plt.show()
94
Title and Super Title
You can add a title to the entire figure with the suptitle() function:
#plot 1:
x = np.array([0, 1, 2, 3])
y = np.array([3, 8, 1, 10])
plt.subplot(1, 2, 1)
plt.plot(x,y)
plt.title("SALES")
#plot 2:
x = np.array([0, 1, 2, 3])
y = np.array([10, 20, 30, 40])
plt.subplot(1, 2, 2)
plt.plot(x,y)
plt.title("INCOME")
plt.suptitle("MY SHOP")
plt.show()
95
Matplotlib Scatter
Creating Scatter Plots: With Pyplot, you can use the scatter() function to draw a scatter plot.
The scatter() function plots one dot for each observation. It needs two arrays of the same length, one for
the values of the x-axis, and one for values on the y-axis:
x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
plt.scatter(x, y)
plt.show()
By comparing the two plots, it is safe to say that they both give us the same conclusion:
the newer the car, the faster it drives.
97
Matplotlib Scatter
Colors: you can set your own color for each scatter plot with the color or the c
argument:
x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
plt.scatter(x, y, color = 'hotpink')
x = np.array([2,2,8,1,15,8,12,9,7,3,11,4,7,14,12])
y = np.array([100,105,84,105,90,99,90,95,94,100,79,112,91,80,85])
plt.scatter(x, y, color = '#88c999')
98
Matplotlib Scatter
Color Each Dot: you can set a specific color for each dot with the c argument, and a specific size for each dot with the s argument. Both arrays must have the same length as the x- and y-axis arrays:
x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
colors = np.array(["red","green","blue","yellow","pink","black","orange","purple","beige","brown","gray","cyan","magenta"])
plt.scatter(x, y, c=colors)

sizes = np.array([20,50,100,200,500,1000,60,90,10,300,600,800,75])
plt.scatter(x, y, s=sizes)
plt.show()
100
Matplotlib Scatter
Alpha: You can adjust the transparency of the dots
with the alpha argument.
Just like colors, make sure the array for sizes has
the same length as the arrays for the x- and y-axis:
plt.scatter(x, y, s=sizes, alpha=0.5)
plt.show()
101
Matplotlib Bars
Creating Bars: With Pyplot, you can use the bar() function to draw bar graphs:
x = np.array(["A", "B", "C", "D"])
y = np.array([3, 8, 1, 10])
plt.bar(x, y)
plt.show()

x = ["APPLES", "BANANAS"]
y = [400, 350]
plt.bar(x, y)
102
Matplotlib Bars
Horizontal Bars: If you want the bars to be displayed horizontally instead of vertically, use the barh() function:
x = np.array(["A", "B", "C", "D"])
y = np.array([3, 8, 1, 10])
plt.barh(x, y)
plt.show()
103
Matplotlib Bars
Bar Width: The bar() function takes the keyword argument width to set the width of the bars (the default width is 0.8):
x = np.array(["A", "B", "C", "D"])
y = np.array([3, 8, 1, 10])
plt.bar(x, y, width = 0.1)
plt.show()
104
Matplotlib Histograms
Histogram: A histogram is a graph showing frequency distributions.
It is a graph showing the number of observations within each given interval.
Example: Say you ask for the height of 250 people, you might end up with a histogram
like this:
You can read from the histogram that there are
approximately:
2 people from 140 to 145cm
5 people from 145 to 150cm
15 people from 151 to 156cm
31 people from 157 to 162cm
46 people from 163 to 168cm
53 people from 168 to 173cm
45 people from 173 to 178cm
28 people from 179 to 184cm
21 people from 185 to 190cm
4 people from 190 to 195cm
105
Matplotlib Histograms
Create Histogram: In Matplotlib, we use the hist() function to create histograms.
The hist() function will use an array of numbers to create a histogram, the array is sent
into the function as an argument.
For simplicity we use NumPy to randomly generate an array with 250 values, where the
values will concentrate around 170, and the standard deviation is 10:
x = np.random.normal(170, 10, 250)
plt.hist(x)
plt.show()
106
Matplotlib Pie Charts
Creating Pie Charts: With Pyplot, you can use the pie() function to draw pie charts:
import matplotlib.pyplot as plt
import numpy as np
y = np.array([35, 25, 25, 15])
plt.pie(y)
plt.show()
109
Matplotlib Pie Charts
Start Angle
As mentioned the default start angle is at the
x-axis, but you can change the start angle by
specifying a startangle parameter.
The startangle parameter is defined with an
angle in degrees, default angle is 0:
110
Matplotlib Pie Charts
Start Angle
Start the first wedge at 90 degrees:
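For example (assuming y is the array of wedge sizes):
plt.pie(y, startangle = 90)
plt.show()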
113
Matplotlib Pie Charts
Colors
You can set the color of each wedge with the colors parameter.
The colors parameter, if specified, must be an array with one
value for each wedge:
114
Matplotlib Pie Charts
Colors
y = np.array([35, 25, 25, 15])
mylabels = ["Apples", "Bananas", "Cherries", "Dates"]
mycolors = ["black", "hotpink", "b", "#4CAF50"]
plt.pie(y, labels = mylabels, colors = mycolors)
plt.show()
116
Matplotlib Pie Charts
Legend With Header
To add a header to the legend, add the
title parameter to the legend function.
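For example (assuming y and mylabels as defined above):
plt.pie(y, labels = mylabels)
plt.legend(title = "Four Fruits:")
plt.show()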
117
118
1. Scikit-learn
Definition:
Scikit-learn is a Python library for classical machine learning tasks,
built on NumPy, SciPy, and Matplotlib. It provides efficient tools for
data mining, preprocessing, and machine learning algorithms.
119
1. Scikit-learn
Uses:
Classification, regression, clustering, dimensionality reduction, model selection, and
data preprocessing for classical machine learning workflows.
120
1. Scikit-learn
Advantages:
Ease of Use: Intuitive and simple API for classical ML tasks.
Rich Library: Includes most classical ML algorithms.
Integration: Works seamlessly with NumPy, Pandas, and Matplotlib.
Good Documentation: Well-documented and beginner-friendly.
Disadvantages:
No Deep Learning: Not suitable for neural networks or large-scale deep learning tasks.
Performance: Slower for very large datasets compared to specialized frameworks.
121
2. TensorFlow
Definition:
TensorFlow is an open-source framework for numerical
computation and large-scale machine learning, developed by Google.
It provides both low-level and high-level APIs for building and
training machine learning models, particularly for deep learning.
122
2. TensorFlow
Uses:
Deep Learning: Build custom architectures, including CNNs, RNNs, GANs (generative
adversarial network) , transformers, etc.
Large-Scale ML: Handles massive datasets and distributed training across multiple
GPUs/TPUs.
Production Deployment: TensorFlow Lite for mobile and IoT devices, TensorFlow Serving
for cloud-based applications.
Numerical Computation: Works well for general-purpose numerical tasks, beyond machine
learning.
123
2. TensorFlow
Advantages:
Scalability: Distributed training and deployment capabilities make it suitable
for production.
Customizability: Low-level APIs allow detailed control of computations.
Ecosystem: TensorFlow Lite, TensorFlow.js, and TensorFlow Serving provide
tools for cross-platform development.
Performance: Optimized for speed and works well on GPUs and TPUs.
Disadvantages:
Complexity: The low-level API requires extensive knowledge and verbose
code.
Debugging: The static computational graph (in earlier versions) could make
debugging harder compared to frameworks like PyTorch.
124
3. Keras
Definition:
Keras is a high-level deep learning API written in Python that runs on top of
TensorFlow. It is designed for fast experimentation with neural networks through a
simple, user-friendly interface.
125
3. Keras
Uses:
Rapid Prototyping: Build and experiment with neural networks quickly using
minimal code.
Beginner-Friendly Deep Learning: Offers simple APIs for standard tasks.
Model Customization: Supports functional and sequential APIs for defining
complex models.
Transfer Learning: Easy to load and fine-tune pre-trained models.
126
3. Keras
Advantages:
Simplicity: High-level abstractions reduce the need for boilerplate code.
Backend Flexibility: Though now tied to TensorFlow, older versions allowed using Theano
or CNTK as a backend.
Pre-trained Models: Includes many pre-trained architectures (e.g., ResNet, VGG,
MobileNet) for quick use.
Disadvantages:
Dependency on Backends: Keras' performance depends on the underlying backend
(TensorFlow in modern versions).
Limited Low-Level Control: For highly customized models, the low-level TensorFlow API
or other frameworks may be better.
127
4. PyTorch
Definition:
PyTorch is an open-source deep learning framework developed by Meta (Facebook). It is
known for its dynamic computation graph and Pythonic programming style.
Uses:
Research prototyping, computer vision, natural language processing, and custom deep
learning models requiring flexible, dynamic graph execution.
Advantages:
Dynamic Graph: Easier to debug and modify than TensorFlow's static graph.
Pythonic: Feels like native Python, reducing complexity.
Community Support: Strong and growing ecosystem, widely used in research.
Disadvantages:
Production Tools: While improving, it historically lagged behind TensorFlow
for production environments.
Fewer High-Level Tools: Keras-like functionality is not as built-in (though
torch.nn helps).
130
Feature              Scikit-learn                     TensorFlow                          Keras                                     PyTorch
Primary Use          Classical ML (e.g., regression)  Deep learning, scalable ML          High-level API for deep learning          Research-focused deep learning
Core Philosophy      Simplicity for traditional ML    Production-ready, end-to-end ML     Simplified interface for TensorFlow       Dynamic graph execution for research
Community & Support  Large community, extensive docs  Backed by Google, huge ecosystem    Integrated into TensorFlow's ecosystem    Growing, backed by Meta (Facebook)
131
Scikit-learn: Ideal for small-scale projects focused on classical ML techniques.
TensorFlow: Suitable for production-ready deep learning and scalable
applications.
Keras: Best for beginners or rapid prototyping of deep learning models.
PyTorch: Great for researchers and experimental deep learning tasks requiring
dynamic graph execution.
132
134
Introduction to scikit-learn
Scikit-learn is a powerful and versatile open-source library in Python, designed for machine learning and
data science tasks. It is one of the most widely used libraries by data scientists and machine learning
practitioners due to its simplicity, efficiency, and comprehensive coverage of machine learning algorithms.
Built on top of other popular Python libraries such as NumPy, SciPy, and Matplotlib, scikit-learn provides a
seamless and unified interface for various machine learning models, preprocessing techniques, and evaluation
metrics. Its modular and consistent API allows users to focus on solving problems rather than worrying
about compatibility or implementation details.
135
Introduction to scikit-learn
The library supports both supervised and unsupervised learning methods, ranging from
simple algorithms like linear regression and k-nearest neighbors to more advanced
techniques like support vector machines, random forests…etc.
For unsupervised learning, it offers clustering algorithms like k-means and hierarchical
clustering, as well as dimensionality reduction techniques like principal component analysis
(PCA). What sets scikit-learn apart is its focus on user experience; the interface remains
consistent across all algorithms, making it easy to switch between models or compare their
performance with minimal changes to the code.
136
Introduction to scikit-learn
One of the key features of scikit-learn is its preprocessing utilities, which help ensure that the
data is ready for analysis. Real-world datasets often have missing values, categorical
variables, or features with vastly different scales. Scikit-learn provides tools to handle these
issues efficiently, such as scaling, encoding, and imputing missing values. These
preprocessing techniques are essential for achieving optimal performance from machine
learning models. Additionally, scikit-learn integrates seamlessly with Pandas for handling
datasets, Matplotlib for visualization, and NumPy for numerical computations, making it a
perfect fit for end-to-end machine learning workflows.
137
Introduction to scikit-learn
Installing scikit-learn is straightforward and can be done using Python's
package manager, pip.
A single command pip install scikit-learn is all it takes to get started.
138
Introduction to scikit-learn
The essence of scikit-learn lies in its ability to streamline the process of
building and deploying machine learning models. Instead of writing
complex code from scratch, users can utilize pre-built functions for tasks such
as data splitting, model training, and evaluation.
139
Loading and Exploring Data
from sklearn import datasets

# Load the Iris dataset
iris = datasets.load_iris()

# Explore the dataset
print("Keys in the dataset:", iris.keys())
print("Feature names:", iris.feature_names)
print("Target names (classes):", iris.target_names)
print("Data shape:", iris.data.shape)
print("Sample data point:", iris.data[0])

Output:
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target names (classes): ['setosa' 'versicolor' 'virginica']
Data shape: (150, 4)
Sample data point: [5.1 3.5 1.4 0.2]
141
Loading and Exploring Data
Loading Built-in Datasets
In the previous code, we loaded the Iris dataset and explored its structure. Scikit-learn
datasets are usually stored as dictionaries with keys like data (the feature values),
target (the labels), and feature_names (the names of the features).
The data array contains the measurements of each flower, while the target array maps
each flower to a species (e.g., Setosa, Versicolor, or Virginica). These datasets are
excellent for learning because they are clean, well-structured, and ready for use.
142
Loading and Exploring Data
Loading External Data
Real-world datasets are not prepackaged and require manual loading. Scikit-learn
supports loading data from NumPy arrays, Pandas DataFrames, or even SciPy
sparse matrices. For instance, let's load a CSV file using Pandas and convert it into a
format suitable for scikit-learn.
import pandas as pd
from sklearn.model_selection import train_test_split
In this example, we used Pandas to read a CSV file into a DataFrame. We then
separated the features (X) and the target variable (y), as machine learning models
require these inputs to be in separate variables. Finally, we split the dataset into
training and testing sets using scikit-learn's train_test_split function.
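A minimal sketch of this workflow, assuming a hypothetical data.csv file whose label column is named 'target':
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the CSV file into a DataFrame (file and column names are illustrative)
df = pd.read_csv('data.csv')

# Separate features (X) and the target variable (y)
X = df.drop('target', axis=1)
y = df['target']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)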
145
Preliminary Data Cleaning
Real-world datasets are rarely clean and often require preprocessing. For example,
missing values can be handled using scikit-learn's imputation techniques.
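For example, a small sketch using SimpleImputer to replace missing values with the column mean:
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
print(X_imputed)   # The NaN is replaced by the column mean (4.0)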
146
Converting Categorical Data
Categorical data often needs to be converted into a numeric format for machine
learning models. Scikit-learn provides the LabelEncoder and OneHotEncoder
classes for this purpose. Let‟s look at how to encode a column of categorical
values.
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Label encoding
categories = ['dog', 'cat', 'mouse']
le = LabelEncoder()
labels = le.fit_transform(categories)
print("Label Encoded:", labels)

# One-hot encoding
ohe = OneHotEncoder(sparse_output=False)
one_hot = ohe.fit_transform(labels.reshape(-1, 1))
print("One-Hot Encoded:\n", one_hot)

Output:
Label Encoded: [1 0 2]
One-Hot Encoded:
[[0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]]

Label encoding maps each category to a unique integer, while one-hot encoding creates binary
vectors representing each category. The choice between these methods depends on the
algorithm being used and whether it can handle ordinal relationships in the data.
147
Preprocessing and Feature Engineering
Data preprocessing and feature engineering are crucial steps in any machine learning
pipeline. The quality and structure of the input data directly impact the performance of
machine learning models. In many cases, preprocessing and feature engineering are the most
time-intensive steps because they involve cleaning, transforming, and preparing raw data
into a format suitable for model training. Scikit-learn provides an extensive toolkit to address
these challenges, ranging from scaling and encoding to more advanced techniques like
generating polynomial features or selecting relevant features.
148
Preprocessing and Feature Engineering
Standardization and Normalization
One of the most common preprocessing steps is scaling numerical features.
Standardization and normalization are two popular techniques for this purpose:
Standardization scales data to have a mean of 0 and a standard deviation of 1.
Normalization scales data to a range of [0, 1] or to have unit norm.
from sklearn.preprocessing import StandardScaler, Normalizer
import numpy as np

# Sample dataset
X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Standardization
scaler = StandardScaler()
X_standardized = scaler.fit_transform(X)
print("Standardized Data:\n", X_standardized)

# Normalization
normalizer = Normalizer()
X_normalized = normalizer.fit_transform(X)
print("Normalized Data:\n", X_normalized)

In this code, the StandardScaler removes the mean and scales the data to unit variance, while the
Normalizer scales each row (sample) to have unit norm. Standardization is particularly useful for
algorithms sensitive to feature scaling, such as support vector machines and k-nearest neighbors.

Output:
Standardized Data:
[[-1.22474487 -1.22474487 -1.22474487]
 [ 0.          0.          0.        ]
 [ 1.22474487  1.22474487  1.22474487]]
Normalized Data:
[[0.26726124 0.53452248 0.80178373]
 [0.45584231 0.56980288 0.68376346]
 [0.50257071 0.57436653 0.64616234]]
149
Preprocessing and Feature Engineering
Generating Polynomial Features
Sometimes, raw features are insufficient for capturing complex
relationships in data.
Polynomial features introduce higher-order interactions between
variables, allowing models to learn more intricate patterns.
from sklearn.preprocessing import PolynomialFeatures
import numpy as np

# Sample dataset
X = np.array([[1, 2], [3, 4], [5, 6]])

# Generate polynomial features
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

print("Original Data:\n", X)
print("Polynomial Features:\n", X_poly)

The PolynomialFeatures class creates new features by computing all possible combinations of
input features up to a specified degree. For example, if you have two features x1 and x2,
polynomial features of degree 2 include x1^2, x2^2, and x1*x2.

Output:
Original Data:
[[1 2]
 [3 4]
 [5 6]]
Polynomial Features:
[[ 1.  2.  1.  2.  4.]
 [ 3.  4.  9. 12. 16.]
 [ 5.  6. 25. 30. 36.]]
150
Preprocessing and Feature Engineering
Binarizing Continuous Features
In some cases, you may want to convert continuous features into binary values
(e.g., 0 or 1). This is useful for specific tasks like threshold-based classification.
The Binarizer class converts values above a given threshold to 1 and below it to 0, which can
be useful in certain scenarios where binary outcomes are needed.
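A short sketch (the threshold value is chosen for illustration):
import numpy as np
from sklearn.preprocessing import Binarizer

X = np.array([[1.5, 0.2], [0.4, 3.0], [2.1, 0.9]])
binarizer = Binarizer(threshold=1.0)
X_binary = binarizer.fit_transform(X)
print(X_binary)   # Values above 1.0 become 1, the rest become 0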
151
Supervised Learning Models
Supervised learning is the most commonly used machine learning paradigm where
models learn a mapping between input features (X) and target labels (y) using
labeled datasets. In supervised learning, the goal is either to classify data points
into predefined categories (classification) or to predict a continuous value
(regression). Scikit-learn provides a wide variety of algorithms for both tasks,
with consistent interfaces for training, predicting, and evaluating models.
This section delves into the most commonly used supervised learning models,
explaining their working principles and showcasing their implementation with
scikit-learn.
152
Supervised Learning Models
Linear Regression
Linear regression is a simple yet powerful algorithm for regression tasks. It assumes a linear
relationship between the independent variables (features) and the dependent variable (target):
y = β0 + β1x1 + β2x2 + ... + βnxn
The model tries to fit a straight line (or hyperplane in higher dimensions) that minimizes the sum of
squared differences between the observed and predicted values.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Generate synthetic data
X = np.random.rand(100, 1) * 10                   # Feature: 100 random points
y = 3 * X.squeeze() + 5 + np.random.randn(100)    # Target: linear with noise

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("R² Score:", r2_score(y_test, y_pred))

In this example:
• The LinearRegression class is used to create and train a model.
• The coefficients and intercept define the best-fit line, which the model learns during training.
• Evaluation metrics like Mean Squared Error (MSE) and R² Score assess the model's performance.
  A higher R² score indicates better alignment between predictions and actual values.
153
Supervised Learning Models
Logistic Regression
Logistic regression is a classification algorithm; it can be extended to multiclass problems using
techniques like one-vs-rest (OvR).

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Load the data and split it into train/test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
print(classification_report(y_test, y_pred))

This example demonstrates logistic regression for classification. The classification_report provides key
metrics like precision, recall, and F1-score, helping assess the model's ability to classify data points
accurately.
154
Supervised Learning Models
Support Vector Machines (SVM)
Support Vector Machines are versatile algorithms for classification and regression. SVMs work by
finding a hyperplane that best separates classes in a high-dimensional space. With the help of kernel
functions, SVMs can handle non-linear relationships by transforming data into higher dimensions.

from sklearn.svm import SVC
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np

# Generate synthetic dataset
X, y = make_blobs(n_samples=100, centers=2, random_state=42)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the model
model = SVC(kernel='linear')
model.fit(X_train, y_train)

# Visualize decision boundary
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='autumn')
ax = plt.gca()
xlim = ax.get_xlim()
ylim = ax.get_ylim()

# Create grid to evaluate the decision function
xx = np.linspace(xlim[0], xlim[1], 30)
yy = np.linspace(ylim[0], ylim[1], 30)
YY, XX = np.meshgrid(yy, xx)
xy = np.vstack([XX.ravel(), YY.ravel()]).T
Z = model.decision_function(xy).reshape(XX.shape)

# Plot the decision boundary and margins
ax.contour(XX, YY, Z, colors='k', levels=[-1, 0, 1], alpha=0.5, linestyles=['--', '-', '--'])
plt.show()

This example uses an SVM with a linear kernel. The decision boundary separating the two classes is
visualized on a 2D plane. SVMs are highly effective for datasets where the classes are well-separated.
K-Nearest Neighbors (KNN)
KNN is a simple algorithm that classifies data points based on their proximity to other points.
The model assigns the class most common among a point's k nearest neighbors.

from sklearn.neighbors import KNeighborsClassifier

# Initialize and train the model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Make predictions
y_pred = knn.predict(X_test)

In KNN, the choice of k (number of neighbors) significantly impacts the model's performance.
Smaller values of k make the model sensitive to noise, while larger values generalize the
decision boundary.
156
Supervised Learning Models
Decision Trees
Decision trees are interpretable models that split the data into subsets based on feature values.
Each internal node represents a decision rule, and each leaf node represents a class or prediction.

from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree

# Initialize and train the model
tree = DecisionTreeClassifier(max_depth=3)
tree.fit(X_train, y_train)

# Visualize the tree
plt.figure(figsize=(10, 8))
plot_tree(tree, feature_names=['Feature 1', 'Feature 2'], class_names=['Class 0', 'Class 1', 'Class 2'], filled=True)
plt.show()

Decision trees are intuitive and easy to visualize. However, they are prone to
overfitting, especially when grown deep without constraints such as max_depth.
157
Unsupervised Learning Models
This section will cover popular unsupervised learning algorithms, their use cases, and how to
implement them with scikit-learn.
158
Unsupervised Learning Models
Clustering:
Clustering algorithms divide the dataset into groups (clusters) of similar instances based on
their feature values. Each cluster contains data points that are more similar to each other than to
those in other clusters.
K-Means Clustering:
K-Means is one of the most popular clustering algorithms. It partitions data into k clusters,
where k is specified by the user. The algorithm assigns each point to the nearest cluster center
and iteratively adjusts the cluster centers to minimize intra-cluster variance.
159
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

# Apply K-Means
kmeans = KMeans(n_clusters=4, random_state=42)
kmeans.fit(X)

# Get cluster centers and labels
centers = kmeans.cluster_centers_
labels = kmeans.labels_

In this example:
• The make_blobs function generates synthetic data with 4 clusters.
• The KMeans model clusters the data into 4 groups.
• The results are visualized, showing cluster centers and color-coded clusters (see the sketch below).
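The visualization mentioned above can be produced with a few extra lines, for example:
# Plot the clustered points and the cluster centers
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=30)
plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='x', s=200)
plt.title('K-Means Clustering')
plt.show()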
161
Hierarchical Clustering:
Hierarchical clustering creates a tree of clusters (dendrogram), representing nested groupings. There are two
main approaches:
Agglomerative Clustering: (Bottom-Up Approach) Starts with individual data points and merges them
iteratively.
Divisive Clustering: (Top-Down Approach) Starts with all data points in one cluster and splits them
iteratively.
Agglomerative clustering is useful when
from sklearn.cluster import AgglomerativeClustering you want to understand the hierarchy of
from scipy.cluster.hierarchy import dendrogram, linkage data, and the dendrogram provides a
import matplotlib.pyplot as plt detailed visualization of the clustering
process.
# Apply Agglomerative Clustering
model = AgglomerativeClustering(n_clusters=3)
labels = model.fit_predict(X)
# Visualize dendrogram
linkage_matrix = linkage(X, method='ward')
dendrogram(linkage_matrix)
plt.title("Dendrogram")
plt.xlabel("Sample Index")
plt.ylabel("Distance")
plt.show()
162
Unsupervised Learning Models
Dimensionality Reduction
Dimensionality reduction techniques reduce the number of features while retaining as much
information as possible. These techniques are invaluable for visualization, noise reduction, and
speeding up computations.
Principal Component Analysis (PCA)
PCA is a widely used dimensionality reduction technique. It projects data into a lower-dimensional
space by finding new axes (principal components) that maximize variance.

from sklearn.decomposition import PCA
import numpy as np
import matplotlib.pyplot as plt

# Generate synthetic data
np.random.seed(42)
X = np.random.rand(100, 5)   # 5-dimensional data

# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

print("Original Shape:", X.shape)
print("Reduced Shape:", X_pca.shape)

# Visualize the reduced data
plt.scatter(X_pca[:, 0], X_pca[:, 1], c='blue')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA Result')
plt.show()

In this example:
• The original 5-dimensional data is reduced to 2 dimensions.
• PCA identifies the directions of maximum variance, helping to preserve the most important information.
163
Model Training, Evaluation and Metrics
Model evaluation is a critical part of the machine learning workflow. It helps you understand
how well your model performs and whether it generalizes effectively to unseen data. In
supervised learning, we measure the agreement between the predicted and true labels for
classification, or the closeness of predicted values to true values for regression. For
unsupervised learning, evaluation focuses on the quality of clusters or the reduction in
dimensionality. Scikit-learn offers a robust set of metrics and techniques to evaluate models
comprehensively.
In this section, we will cover the essential evaluation methods for classification, regression,
and clustering tasks. Detailed examples are provided to demonstrate how these metrics can
guide you in choosing and improving your models.
164
Model Training, Evaluation and Metrics
Train-Test Split
The simplest way to evaluate a model is to divide the dataset into two parts: a training set to train the
model and a test set to evaluate its performance. This ensures that the model is tested on data it has
never seen before, providing a realistic estimate of how it will perform in production.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic dataset
X = [[i] for i in range(10)]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Evaluate accuracy on the held-out test set
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
166
Model Training, Evaluation and Metrics
Classification Metrics
For classification tasks, the choice of evaluation metric depends on the problem's
requirements.
Scikit-learn provides various metrics such as accuracy, precision, recall, F1-score,
and ROC-AUC.
True Positives (TP), False Positives (FP), True Negatives (TN), and False
Negatives (FN):
In classification problems, confusion matrices are used to evaluate the
performance of a classification model. The matrix shows the counts of true
positives, false positives, true negatives, and false negatives, which are critical in
calculating various evaluation metrics, such as accuracy, precision, recall, F1-
score, and specificity.
167
Model Training, Evaluation and Metrics
Classification Metrics
True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN):
Here's a breakdown of these terms:
• True Positives (TP): These are the cases where the model correctly predicted the positive class. In other
words, the true label is positive (1), and the model predicted it as positive (1).
Example: The model correctly predicts that a person has a disease when they actually do.
• False Positives (FP): These are the cases where the model incorrectly predicted the positive class. In other
words, the true label is negative (0), but the model predicted it as positive (1).
Example: The model incorrectly predicts that a person has a disease when they actually do not.
• True Negatives (TN): These are the cases where the model correctly predicted the negative class. In other
words, the true label is negative (0), and the model predicted it as negative (0).
Example: The model correctly predicts that a person does not have a disease when they actually do not.
• False Negatives (FN): These are the cases where the model incorrectly predicted the negative class. In other
words, the true label is positive (1), but the model predicted it as negative (0).
Example: The model incorrectly predicts that a person does not have a disease when they actually do.
168
Model Training, Evaluation and Metrics
Classification Metrics:
Confusion Matrix
• The confusion matrix summarizes prediction results, showing counts of true
positives, false positives, true negatives, and false negatives.
from sklearn.metrics import confusion_matrix
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
print(confusion_matrix(y_true, y_pred))

Output:
[[2 0]
 [1 2]]
170
Model Training, Evaluation and Metrics
Classification Metrics
Accuracy:
Accuracy is the proportion of correctly classified instances. While simple, it can be
misleading for imbalanced datasets where one class dominates.
In this example, accuracy is calculated as the ratio of correct predictions to the total
number of predictions.
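For example:
from sklearn.metrics import accuracy_score

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
print("Accuracy:", accuracy_score(y_true, y_pred))   # 0.8 (4 of 5 predictions correct)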
171
Model Training, Evaluation and Metrics
Classification Metrics:
Precision, Recall, and F1-Score:
• Precision: The proportion of true positives among predicted positives.
  Precision = TP / (TP + FP)
• Recall (Sensitivity or True Positive Rate): The proportion of true positives among actual positives.
  Recall = TP / (TP + FN)
• F1-Score: The harmonic mean of precision and recall, balancing both metrics.
  F1 = 2 * (Precision * Recall) / (Precision + Recall)

from sklearn.metrics import classification_report
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
print(classification_report(y_true, y_pred))

The classification_report provides a detailed breakdown of precision, recall, and F1-score for each class.
172
Model Training, Evaluation and Metrics
Classification Metrics:
ROC and AUC:
The Receiver Operating Characteristic (ROC) curve plots the true positive rate against the
false positive rate. The Area Under the Curve (AUC) measures the model's ability to
distinguish between classes.
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt
y_true = [0, 1, 1, 0, 1]
y_scores = [0.2, 0.8, 0.6, 0.4, 0.9]   # Model probabilities
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print("AUC:", roc_auc_score(y_true, y_scores))

Regression Metrics:
Mean Absolute Error (MAE) and R² Score:
• MAE is the average of the absolute differences between the actual values and the predicted values.
  A lower MAE indicates better performance because it suggests the model is making more accurate
  predictions without large errors.
• Higher R² values are better because they indicate that the model explains more of the variance
  in the data. An R² score of 1 means the model perfectly predicts the data, and an R² score of 0
  means the model does no better than predicting the mean value.
Silhouette Score:
Silhouette Score measures how similar a point is to its cluster compared to other clusters.
A higher score indicates well-separated clusters.

from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans

# Clustering example
X = [[1], [2], [8], [9]]
kmeans = KMeans(n_clusters=2, random_state=42)
labels = kmeans.fit_predict(X)
print("Silhouette Score:", silhouette_score(X, labels))
176
Model Training, Evaluation and Metrics
Clustering Metrics:
Adjusted Rand Index (ARI):
ARI compares the similarity between predicted and true clusters, accounting for chance.

from sklearn.metrics import adjusted_rand_score
true_labels = [0, 0, 1, 1]
pred_labels = [0, 0, 1, 1]
print("Adjusted Rand Index:", adjusted_rand_score(true_labels, pred_labels))   # 1.0 for identical clusterings
177
Hyperparameter Tuning
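A minimal sketch of hyperparameter tuning with scikit-learn's GridSearchCV (the model and parameter grid are chosen for illustration):
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Define the hyperparameters to search over
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}

# GridSearchCV tries every combination using 5-fold cross-validation
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X, y)

print("Best parameters:", grid.best_params_)
print("Best cross-validation score:", grid.best_score_)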
178
Any questions?
179