INTRODUCTION TO PYTHON
(PART III)
Presenter: Prof. Amit Kumar Das
Assistant Professor,
Dept. of Computer Science and Engg.,
Institute of Engineering & Management.
IMPORTANT LIBRARIES IN PYTHON
scikit-learn – Primary library providing machine learning algorithms
NumPy – Fundamental package for scientific computing
SciPy – Package providing mathematical functions and statistical distributions
matplotlib – Primary library supporting scientific plotting, e.g. line diagrams, histograms, scatter plots
pandas – Primary library providing data manipulation functionalities
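As a quick reference, a minimal sketch of the conventional import aliases for these libraries is shown below (the aliases are community conventions, not requirements):
# Conventional import aliases (illustrative only)
import numpy as np                 # scientific computing with n-dimensional arrays
import scipy.stats as stats        # statistical distributions and functions from SciPy
import matplotlib.pyplot as plt    # scientific plotting
import pandas as pd                # data manipulation
from sklearn import datasets       # scikit-learn (machine learning), e.g. its bundled data sets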
BASIC PYTHON LIBRARIES - NUMPY
The NumPy package contains functionality for multidimensional arrays, high-level mathematical
functions (e.g. linear algebra and Fourier transform operations), random number generators, etc.
In scikit-learn, the NumPy array is the primary data structure used to input data. Any data used
needs to be converted to a NumPy array.
numpy.array(object, dtype, copy, order, subok, ndmin)
dtype means data-type, i.e. the desired data-type for the array. If not given, the type will be
determined as the minimum type required to hold the objects in the sequence.
empty - Return a new uninitialized array
full - Return a new array of given shape filled with value
ones - Return a new array setting values to one
zeros - Return a new array setting values to zero
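A minimal sketch of these creation routines (the shapes, fill value and dtype below are arbitrary examples):
import numpy as np
a = np.array([1, 2, 3], dtype=np.float64)  # array from a sequence with an explicit dtype
b = np.empty((2, 2))                       # new array, values left uninitialized
c = np.full((2, 3), 7)                     # new array filled with the value 7
d = np.ones((2, 3))                        # new array of ones
e = np.zeros((2, 3))                       # new array of zeros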
BASIC PYTHON LIBRARIES - NUMPY
# Defining an array variable with data ...
import numpy as np
arr1 = np.empty((2,3))
arr2 = np.array([[10,2,3], [23,45,67]])
print(arr2)
[[10  2  3]
 [23 45 67]]
# Create an array of 1s ...
arr3 = np.ones((2,3))
[[ 1.,  1.,  1.],
 [ 1.,  1.,  1.]]
# Create an array of 0s ...
arr4 = np.zeros((2,3), dtype=np.int64)
[[0, 0, 0],
 [0, 0, 0]]
# Create an array with random numbers ...
np.random.random((2,2))
[[ 0.47448072,  0.49876875],
 [ 0.29531478,  0.48425055]]
BASIC PYTHON LIBRARIES – NUMPY (CONTD.)
# Defining 1-D array variable with data ...
var2 = np.empty(4)
var2[0] = 5.67
var2[1] = 2
var2[2] = 56
var2[3] = 304
print(var2)
[ 5.67 2. 56. 304. ]
print(var2.shape) # Returns the shape of the array ...
(4,)
print(var2.size) # Returns the size of the array ...
4
# Defining 2-D array variable with data ...
var3 = np.empty((2,3))
var3[0][0] = 5.67
var3[0][1] = 2
var3[0][2] = 56
var3[1][0] = .09
var3[1][1] = 132
var3[1][2] = 1056
print(var3)
[[ 5.67000000e+00 2.00000000e+00 5.60000000e+01]
[ 9.00000000e-02 1.32000000e+02 1.05600000e+03]]
[Note: The same result will be obtained with dtype=np.float64]
print(var3.shape)
(2, 3)
BASIC PYTHON LIBRARIES – NUMPY (CONTD.)
# Same declaration with dtype mentioned (element assignments repeated as above) ...
var3 = np.empty((2,3), dtype=np.int64)
[[   5,    2,   56],
 [   0,  132, 1056]]
print(var3[1]) # Returns a row of an array ...
[ 0 132 1056]
print(var3[[0, 1]]) # Returns multiple rows of an array ...
[[ 5 2 56]
[ 0 132 1056]]
print(var3[:, 2]) # Returns a column of an array ...
[ 56 1056]
print(var3[:, [1, 2]]) # Returns multiple columns of an array ...
[[ 2 56]
[ 132 1056]]
print(var3[1][2]) # Returns a cell value of an array ...
1056
print(var3[1, 2]) # Returns a cell value of an array ...
1056
print(np.transpose(var3)) # Returns transpose of an array ...
[[ 5 0]
[ 2 132]
[ 56 1056]]
print(var3.reshape(3,2)) # Returns a re-shaped array ...
[[ 5 2]
[ 56 0]
[ 132 1056]]
BASIC PYTHON LIBRARIES – NUMPY (CONTD.)
Create and concatenate arrays:
import numpy as np
arr1 = np.empty((2,3), dtype=np.int64)
arr1[0][0] = 5.67
arr1[0][1] = 2
arr1[0][2] = 56
arr1[1][0] = .09
arr1[1][1] = 132
arr1[1][2] = 1056
[[ 5, 2, 56],
[ 0, 132, 1056]]
arr2 = np.empty((1,3), dtype=np.int64)
arr2[0][0] = 37
arr2[0][1] = 2.193
arr2[0][2] = 5609
[[ 37, 2, 5609]]
BASIC PYTHON LIBRARIES – NUMPY (CONTD.)
arr_concat = np.concatenate((arr1, arr2), axis=0)
print(arr_concat)
[[ 5 2 56]
[ 0 132 1056]
[ 37 2 5609]]
var2.min() # Returns minimum value stored in an array ...
2.0
var2.max() # Returns maximum value stored in an array ...
304.0
var2.cumsum() # Returns cumulative sum of the values stored in an array ...
array([ 5.67, 7.67, 63.67, 367.67])
var2.mean() # Returns mean or average value stored in an array ...
91.917500000000004
var2.std() # Returns standard deviation of values stored in an array ...
124.2908299865682
BASIC PYTHON LIBRARIES – PANDAS
pandas is a Python package providing fast and flexible functionalities designed to work with
“relational” or “labeled” data.
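For example, labeled data can be built directly from a Python dict (a minimal sketch; the column names and values here are made up for illustration):
# Hypothetical labeled data: column labels and an automatic row index
import pandas as pd
cars = pd.DataFrame({'car name': ['car_a', 'car_b'], 'mpg': [24.0, 31.5]})
print(cars)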
import pandas as pd # “pd” is just an alias for pandas
data = pd.read_csv("auto-mpg.csv") # Loads data from a .csv file (here, the auto-mpg data set)
type(data) # To find the type of the data set object loaded
pandas.core.frame.DataFrame
data.shape # To find the dimensions, i.e. number of rows and columns, of the data set loaded
(398, 9)
nrow_count = data.shape[0] # To find just the number of rows
print(nrow_count)
398
ncol_count = data.shape[1] # To find just the number of columns
print(ncol_count)
9
BASIC PYTHON LIBRARIES – PANDAS (CONTD.)
data.columns # To get the columns of a dataframe
Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
'acceleration', 'model year', 'origin', 'car name'],
dtype='object')
# To change the column names of a dataframe, e.g. 'mpg' in this case ...
data.columns = ['miles_per_gallon', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model year', 'origin', 'car name']
data.columns # To get the revised column names of the dataframe ...
Index(['miles_per_gallon', 'cylinders', ...], dtype='object')
data.rename(columns={'displacement': 'disp'}, inplace=True) # To rename an individual column ...
BASIC PYTHON LIBRARIES – PANDAS (CONTD.)
data.head() # By default displays top 5 rows
data.head(3) # To display the top 3 rows
data.tail() # By default displays bottom 5 rows
data.tail(3) # To display the bottom 3 rows
data.loc[200, 'cylinders'] # Will return the cell value of the row with index 200 and column 'cylinders' of the data frame
6
Alternatively, we can use the following code:
data.get_value(200,'cylinders')
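Note that DataFrame.get_value() has been deprecated and removed in newer pandas releases; the .at accessor provides the same scalar lookup:
data.at[200, 'cylinders'] # label-based scalar access; equivalent result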
data_cyl = data.loc[:, "car name"]
data_cyl.head()
0 chevrolet chevelle malibu
1 buick skylark 320
2 plymouth satellite
3 amc rebel sst
4 ford torino
Name: car name, dtype: object
BASIC PYTHON LIBRARIES – PANDAS (CONTD.)
Find missing values in a data set:
import numpy as np
import pandas as pd
# Creation of a data set with missing values ...
var1 = [np.nan, np.nan, np.nan, 10.1, 12, 123.14, 0.121]
var2 = [40.2, 11.78, 7801, 0.25, 34.2, np.nan, np.nan]
var3 = [1234, np.nan, 34.5, np.nan, 78.25, 14.5, np.nan]
df = pd.DataFrame({'Attr_1': var1, 'Attr_2': var2, 'Attr_3': var3})
print(df)
Attr_1 Attr_2 Attr_3
0 NaN 40.20 1234.00
1 NaN 11.78 NaN
2 NaN 7801.00 34.50
3 10.100 0.25 NaN
4 12.000 34.20 78.25
5 123.140 NaN 14.50
6 0.121 NaN NaN
# Find missing values in a data set
miss_val = df[df['Attr_1'].isnull()]
print(miss_val)
Attr_1 Attr_2 Attr_3
0 NaN 40.20 1234.0
1 NaN 11.78 NaN
2 NaN 7801.00 34.5
BASIC PYTHON LIBRARIES – PANDAS (CONTD.)
>>> np.mean(data[["mpg"]])
23.514573
>>> np.median(data[["mpg"]])
23.0
>>> np.var(data[["mpg"]])
60.936119
>>> np.std(data[["mpg"]])
7.806159
BASIC PYTHON LIBRARIES – MATPLOTLIB
Constructing a box plot for the Iris data set
A popular data set in machine learning
Consists of 3 different types of iris flower - Setosa, Versicolour, and Virginica
4 columns - Sepal Length, Sepal Width, Petal Length and Petal Width
First, the datasets module has to be imported from scikit-learn
>>> from sklearn import datasets
# import some data to play with
>>> iris = datasets.load_iris()
>>> import matplotlib.pyplot as plt
>>> X = iris.data[:, :4]
>>> plt.boxplot(X)
>>> plt.show()
BASIC PYTHON LIBRARIES – MATPLOTLIB (CONTD.)
Box plot for Iris data set (all features):
BASIC PYTHON LIBRARIES – MATPLOTLIB (CONTD.)
>>> plt.boxplot(X[:, 1])
>>> plt.show()
Box plot for Iris data set (single feature)
BASIC PYTHON LIBRARIES – MATPLOTLIB (CONTD.)
>>> import matplotlib.pyplot as plt
>>> X = iris.data[:, :1]
Histogram:
>>> plt.hist(X)
>>> plt.xlabel('Sepal length')
>>> plt.show()
BASIC PYTHON LIBRARIES – MATPLOTLIB (CONTD.)
Scatterplot of Iris data set : Sepal length vs. Petal length
>>> X = iris.data[:, :4] # We take the first 4 features
>>> y = iris.target
>>> plt.scatter(X[:, 2], X[:, 0], c=y, cmap=plt.cm.Set1, edgecolor='k')
>>> plt.xlabel('Petal length')
>>> plt.ylabel('Sepal length')
>>> plt.show()
BASIC PYTHON LIBRARIES – MATPLOTLIB (CONTD.)
Scatterplot of Iris data set : Sepal length vs. Petal length
DATA PRE-PROCESSING
Mainly deals with two things –
Handling outliers
Remediating missing values
Primary measures for remediating outliers and missing values are:
Removing specific rows containing outliers / missing values
Imputing the value (i.e. outlier / missing value) with a standard
statistical measure e.g. mean or median or mode for that attribute
Estimate the value (i.e. outlier / missing value) based on the value of the
attribute in similar records and replace it with the estimated value
Cap the values within 1.5 × IQR limits (a capping sketch is shown after this list)
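A minimal sketch of the capping (winsorizing) approach, assuming a pandas DataFrame ds and a numeric column col (the function name and arguments are chosen here for illustration):
def cap_outlier(ds, col):
    # Cap values lying outside the 1.5 x IQR limits at the limits themselves
    quart1 = ds[col].quantile(0.25)
    quart3 = ds[col].quantile(0.75)
    IQR = quart3 - quart1
    low_val = quart1 - 1.5*IQR
    high_val = quart3 + 1.5*IQR
    ds[col] = ds[col].clip(lower=low_val, upper=high_val)
    return ds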
DATA PRE-PROCESSING (CONTD.)
>>> df = pd.read_csv("auto-mpg.csv")
Finding missing values in a data set:
>>> miss_val = df[df['horsepower'].isnull()]
>>> print(miss_val)
DATA PRE-PROCESSING (CONTD.)
Finding Outliers (Option 1) :
>>> import matplotlib.pyplot as plt
>>> X = data["mpg"]
>>> plt.boxplot(X)
>>> plt.show()
>>> outliers = plt.boxplot(X)["fliers"][0].get_data()[1]
>>> outliers
array([ 46.6])
DATA PRE-PROCESSING (CONTD.)
Finding Outliers (Option 2) :
def find_outlier(ds, col):
    quart1 = ds[col].quantile(0.25)
    quart3 = ds[col].quantile(0.75)
    IQR = quart3 - quart1 # Inter-quartile range
    low_val = quart1 - 1.5*IQR
    high_val = quart3 + 1.5*IQR
    ds = ds.loc[(ds[col] < low_val) | (ds[col] > high_val)]
    return ds
>>> outliers = find_outlier(data, "mpg")
>>> outliers
mpg cylinders displacement horsepower weight acceleration \
322 46.6 4 86.0 65.0 2110 17.9
model year origin car name
322 80 3 mazda glc
DATA PRE-PROCESSING (CONTD.)
Removing records with missing values / outliers:
We can drop the rows / columns with missing values using the code below.
>>> data.dropna(axis=0, how='any')
In a similar way, outlier values can be removed.
def remove_outlier(ds, col):
    quart1 = ds[col].quantile(0.25)
    quart3 = ds[col].quantile(0.75)
    IQR = quart3 - quart1 # Inter-quartile range
    low_val = quart1 - 1.5*IQR
    high_val = quart3 + 1.5*IQR
    df_out = ds.loc[(ds[col] > low_val) & (ds[col] < high_val)]
    return df_out
>>> data = remove_outlier(data, "mpg")
DATA PRE-PROCESSING (CONTD.)
Imputing standard values:
Only the affected rows are identified, and in those rows the value of the attribute is replaced
with the mean value of that attribute.
>>> hp_mean = np.mean(data['horsepower'])
>>> imputedrows = data[data['horsepower'].isnull()]
>>> imputedrows = imputedrows.replace(np.nan, hp_mean)
Then the portion of the data set not having any missing row is kept apart.
>>> missval_removed_rows = data.dropna(subset=['horsepower'])
Then join back the imputed rows and the remaining part of the data set.
>>> data_mod = missval_removed_rows.append(imputedrows, ignore_index=True)
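Note that DataFrame.append() has since been removed in recent pandas versions; pd.concat() gives the same result:
>>> data_mod = pd.concat([missval_removed_rows, imputedrows], ignore_index=True)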
In a similar way, outlier values can be imputed.
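For instance, a minimal sketch of mean imputation for outliers, reusing the find_outlier() and remove_outlier() helpers defined earlier (the variable names here are illustrative):
>>> outlier_rows = find_outlier(data, 'mpg').copy()
>>> outlier_rows['mpg'] = np.mean(data['mpg']) # replace the outlier values with the mean
>>> clean_rows = remove_outlier(data, 'mpg') # rows whose 'mpg' value is not an outlier
>>> data_mod = pd.concat([clean_rows, outlier_rows], ignore_index=True)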
THANK YOU &
STAY TUNED!