Kabir Data Preprocessing Python
1
Python
Ahmedul Kabir
TA, CS 548, Spring 2015
2 Preprocessing Techniques Covered
Scikit-learn
Orange
Pandas
MLPy
MDP
PyBrain … and many more
4 Some Other Basic Packages
Standardization: Transforms the data so that each feature has zero mean and unit variance.
Also called scaling
Use the function sklearn.preprocessing.scale()
Parameters:
X: Data to be scaled
with_mean: Boolean. Whether to center the data (make zero mean)
with_std: Boolean. Whether to scale the data to unit standard deviation
>>> X_scaled
array([[ 0. ..., -1.22..., 1.33...],
[ 1.22..., 0. ..., -0.26...],
[-1.22..., 1.22..., -1.06...]])
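The X_scaled output above appears without the call that produced it. The following is a minimal sketch of how such a result could be obtained. The input matrix X is an assumption (it is not shown on the slide), chosen because its scaled values agree with the output above.

```python
import numpy as np
from sklearn import preprocessing

# Assumed input: three samples (rows) with three features (columns).
X = np.array([[1., -1.,  2.],
              [2.,  0.,  0.],
              [0.,  1., -1.]])

# Center each column to zero mean and scale it to unit standard deviation.
X_scaled = preprocessing.scale(X, with_mean=True, with_std=True)

# Per-feature statistics after scaling: means close to 0, std devs close to 1.
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)

print(X_scaled)
print(col_means, col_stds)
```

Setting with_mean=False or with_std=False skips the centering or the scaling step, respectively.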
7 Missing Value Replacement
Attribute:
statistics_ : The fill value computed for each feature (e.g. the column mean)
Important methods
fit(X[, y]) Fit the model with X.
transform(X) Replace all the missing values in X.
8 Example code for Replacing Missing Values
>>> import numpy as np
>>> from sklearn.preprocessing import Imputer
>>> imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
>>> imp.fit([[1, 2], [np.nan, 3], [7, 6]])
Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)
>>> X = [[np.nan, 2], [6, np.nan], [7, 6]]
>>> print(imp.transform(X))
[[ 4. 2. ]
[ 6. 3.666...]
[ 7. 6. ]]
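The Imputer class used above was removed from sklearn.preprocessing in scikit-learn 0.22. As a hedged sketch, the same example can be written for newer releases with sklearn.impute.SimpleImputer, which always imputes column-wise and therefore has no axis parameter. The variable names X_train and X_filled are illustrative and not part of the original slide.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Replace NaN entries with the mean of the corresponding column.
imp = SimpleImputer(missing_values=np.nan, strategy='mean')

# Learn the per-column means from the training data (NaNs are ignored).
X_train = [[1, 2], [np.nan, 3], [7, 6]]
imp.fit(X_train)

# Fill the missing values of new data using the learned means.
X = [[np.nan, 2], [6, np.nan], [7, 6]]
X_filled = imp.transform(X)

print(imp.statistics_)  # the learned fill value for each column
print(X_filled)
```

The fill values come from the data passed to fit(), not from the data passed to transform(), which is why the first column is filled with the mean of 1 and 7.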
9 Resampling
Important attributes:
components_ : Components with maximum variance
explained_variance_ratio_ : Fraction of the total variance explained by each of the selected components
Important methods
fit(X[, y]) Fit the model with X.
score_samples(X) Return the log-likelihood of each sample
transform(X) Apply the dimensionality reduction on X.
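The attributes and methods listed above match those of sklearn.decomposition.PCA. Assuming that is the class being described (the slide text does not name it), a minimal sketch of their use could look like this. The toy data and the choice of n_components=1 are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data (assumed for illustration): six samples with two correlated features.
X = np.array([[-1., -1.], [-2., -1.], [-3., -2.],
              [1., 1.], [2., 1.], [3., 2.]])

# Keep only the single direction of maximum variance.
pca = PCA(n_components=1)
pca.fit(X)

components = pca.components_            # principal axes, shape (1, 2)
ratios = pca.explained_variance_ratio_  # fraction of variance per component
X_reduced = pca.transform(X)            # projected data, shape (6, 1)
log_lik = pca.score_samples(X)          # log-likelihood of each sample

print(components)
print(ratios)
print(X_reduced.shape, log_lik.shape)
```

Because the two features are strongly correlated, one component retains almost all of the variance in this toy data set.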
13 Other Useful Information