0% found this document useful (0 votes)
77 views14 pages

Kabir Data Preprocessing Python

This document discusses data preprocessing techniques in Python, including standardization, missing value replacement, resampling, discretization, feature selection, dimensionality reduction, and relevant Python packages. It provides code examples for standardization, imputing missing values, and PCA. Key packages mentioned are Scikit-learn, Pandas, NumPy, and SciPy.
Copyright
© © All Rights Reserved
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
Download as pptx, pdf, or txt
0% found this document useful (0 votes)
77 views14 pages

Kabir Data Preprocessing Python

This document discusses data preprocessing techniques in Python, including standardization, missing value replacement, resampling, discretization, feature selection, dimensionality reduction, and relevant Python packages. It provides code examples for standardization, imputing missing values, and PCA. Key packages mentioned are Scikit-learn, Pandas, NumPy, and SciPy.
Copyright
© © All Rights Reserved
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
Download as pptx, pdf, or txt
Download as pptx, pdf, or txt
You are on page 1/ 14

Data Preprocessing in

1
Python
Ahmedul Kabir
TA, CS 548, Spring 2015
2 Preprocessing Techniques Covered

Standardization and Normalization


Missing value replacement
Resampling
Discretization
Feature Selection
Dimensionality Reduction: PCA
3 Python Packages/Tools for Data Mining

Scikit-learn
Orange
Pandas
MLPy
MDP
PyBrain … and many more
4 Some Other Basic Packages

 NumPy and SciPy


 Fundamental Packages for scientific computing with Python
 Contains powerful n-dimensional array objects
 Useful linear algebra, random number and other capabilities
 Pandas
 Contains useful data structures and algorithms
 Matplotlib
 Contains functions for plotting/visualizing data.
5 Standardization and Normalization

 Standardization: To transform data so that it has zero mean and unit variance.
Also called scaling
 Use function sklearn.preprocessing.scale()
 Parameters:
 X: Data to be scaled
 with_mean: Boolean. Whether to center the data (make zero mean)
 with_std: Boolean (whether to make unit standard deviation

 Normalization: to transform data so that it is scaled to the [0,1] range.


 Use function sklearn.preprocessing.normalize()
 Parameters:
 X: Data to be normalized
 norm: which norm to use: l1 or l2
 axis: whether to normalize by row or column
6 Example code of
Standardization/Scaling
>>> from sklearn import preprocessing
>>> import numpy as np
>>> X = np.array([[ 1., -1., 2.],
... [ 2., 0., 0.],
... [ 0., 1., -1.]])
>>> X_scaled = preprocessing.scale(X)

>>> X_scaled
array([[ 0. ..., -1.22..., 1.33...],
[ 1.22..., 0. ..., -0.26...],
[-1.22..., 1.22..., -1.06...]])
7 Missing Value Replacement

 In scikit-learn, this is referred to as “Imputation”


 Class be used sklearn.preprocessing.Imputer
 Important parameters:
 strategy: What to replace the missing value with: mean / median / most_frequent
 axis: Boolean. Whether to replace along rows or columns

 Attribute:
 statistics_ : The imputer-filled values for each feature

 Important methods
 fit(X[, y]) Fit the model with X.
 transform(X) Replace all the missing values in X.
8 Example code for Replacing Missing
Values
>>> import numpy as np
>>> from sklearn.preprocessing import Imputer
>>> imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
>>> imp.fit([[1, 2], [np.nan, 3], [7, 6]])
Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)
>>> X = [[np.nan, 2], [6, np.nan], [7, 6]]
>>> print(imp.transform(X))
[[ 4. 2. ]
[ 6. 3.666...]
[ 7. 6. ]]
9 Resampling

 Using class sklearn.utils.resample


 Important parameters:
 n_sample: No. of samples to keep
 replace: Boolean. Whether to resample with or without replacement
 Returns sequence of resampled views of the collections. The
original arrays are not impacted.

 Another useful class is sklearn.utils.shuffle


10 Discretization

 Scikit-learn doesn’t have a direct class that performs


discretization.

 Can be performed with cut and qcut functions available


in pandas.

 Orange has discretization functions in


Orange.feature.discretization
11 Feature Selection

 The sklearn.feature_selection module implements feature selection


algorithms.
 Some classes in this module are:
 GenericUnivariateSelect: Univariate feature selector based on statistical tests.
 SelectKBest: Select features according to the k highest scores.
 RFE: Feature ranking with recursive feature elimination.
 VarianceThreshold: Feature selector that removes all low-variance features.

 Scikit-learn does not have a CFS implementation, but RFE works in


somewhat similar fashion.
12 Dimensionality Reduction: PCA
 The sklearn.decomposition module includes matrix decomposition
algorithms, including PCA
 sklearn.decomposition.PCA class
 Important parameters:
 n_components: No. of components to keep

 Important attributes:
 components_ : Components with maximum variance
 explained_variance_ratio_ : Percentage of variance explained by each of the selected
components

 Important methods
 fit(X[, y]) Fit the model with X.
 score_samples(X) Return the log-likelihood of each sample
 transform(X) Apply the dimensionality reduction on X.
13 Other Useful Information

 Generate a random permutation of numbers 1.… n:


numpy.random.permutation(n)
 You can randomly generate some toy datasets using Sample generators in
sklearn.datasets
 Scikit-learn doesn’t directly handle categorical/nominal attributes well. In
order to use them in the dataset, some sort of encoding needs to be
performed.
 One good way to encode categorical attributes: if there are n categories,
create n dummy binary variables representing each category.
 Can be done easily using the sklearn.preprocessing.oneHotEncoder class.
14 References

 Preprocessing Modules: http://scikit-learn.org/stable/modules/preprocessing.html


 Video Tutorial: http://conference.scipy.org/scipy2013/tutorial_detail.php?id=107
 Quick Start Tutorial http://scikit-learn.org/stable/tutorial/basic/tutorial.html
 User Guide http://scikit-learn.org/stable/user_guide.html
 API Reference http://scikit-learn.org/stable/modules/classes.html
 Example Gallery http://scikit-learn.org/stable/auto_examples/index.html

You might also like