05 Preprocessing and Sklearn - Slides
Sebastian Raschka
http://stat.wisc.edu/~sraschka/teaching/stat451-fs2020/
6. Scikit-learn Pipelines
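[Figure: supervised learning workflow. Labels and the training dataset feed a learning algorithm, which produces the final model; the final model is applied to new data to predict labels. The learning step covers model selection, cross-validation, performance metrics, and hyperparameter optimization.]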
Dataset paper: Fisher, R.A. "The use of multiple measurements in taxonomic problems." Annals of Eugenics, 7, Part II, 179-188 (1936);
also in "Contributions to Mathematical Statistics" (John Wiley, NY, 1950).
https://pandas.pydata.org
Source: https://github.com/rasbt/python_reference/blob/master/useful_scripts/large_csv_to_sqlite.py
Modin: https://github.com/modin-project/modin
MLXTEND http://rasbt.github.io/mlxtend/
Raschka, Sebastian. "MLxtend: Providing machine learning and data science utilities and
extensions to Python’s scientific computing stack."
The Journal of Open Source Software 3.24 (2018).
Code notebook: https://github.com/rasbt/stat451-machine-learning-fs20/blob/master/L05/code/05-preprocessing-and-sklearn__notes.ipynb
Sebastian Raschka STAT 451: Intro to ML Lecture 5: Scikit-learn 28
Python Classes
The "Main" Machine Learning Library for Python
http://scikit-learn.org
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V. and Vanderplas, J., 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, pp. 2825-2830.
[Figure: the scikit-learn estimator API. ① est.fit(X_train, y_train) fits the estimator on the training data and training labels. ② est.predict(X_test) returns the predicted labels.]
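The two estimator calls in the figure can be sketched as follows. This is a minimal illustration, not the lecture notebook's code: the choice of `KNeighborsClassifier`, the 70/30 split, and `random_state=123` are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Iris is the dataset cited in these slides; the split settings are assumptions.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123
)

# Any scikit-learn classifier exposes the same fit/predict interface;
# k-nearest neighbors is used here only as an example.
est = KNeighborsClassifier(n_neighbors=3)

est.fit(X_train, y_train)      # step 1: learn from training data and labels
y_pred = est.predict(X_test)   # step 2: predict labels for unseen data

accuracy = est.score(X_test, y_test)  # fraction of correct test predictions
print(f"Test accuracy: {accuracy:.3f}")
```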
Figure 1: Distribution of Iris flower classes upon random subsampling into training and test sets.
https://archive.ics.uci.edu/ml/datasets/iris
Stratified Splits
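A stratified split keeps the class proportions the same in the training and test sets. A minimal sketch using the `stratify` argument of `train_test_split`; the 70/30 split and `random_state=123` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 examples, 50 per class

# stratify=y asks for the class proportions of y to be preserved in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y
)

# Count how many examples of each class landed in each subset.
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print("Training class counts:", train_counts)
print("Test class counts:", test_counts)
```

Without `stratify=y`, the class counts in each subset generally vary with the random seed, which is the effect shown in Figure 1.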
\[
x_{\text{norm}}^{[i]} = \frac{x^{[i]} - x_{\min}}{x_{\max} - x_{\min}}
\]
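A minimal NumPy sketch of the min-max formula above. The feature values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical feature values (e.g., lengths in cm).
x = np.array([10.0, 20.0, 30.0])

# Min-max scaling: shift by the minimum, divide by the range.
x_min, x_max = x.min(), x.max()
x_norm = (x - x_min) / (x_max - x_min)

print(x_norm)  # the smallest value maps to 0, the largest to 1
```

scikit-learn offers the same transformation as `sklearn.preprocessing.MinMaxScaler`, which stores the minimum and range from `fit` so they can be reused on new data.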
\[
x_{\text{std}}^{[i]} = \frac{x^{[i]} - \mu_x}{\sigma_x}
\]
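A minimal NumPy sketch of the standardization formula above. The three values are hypothetical; they were picked so that the mean is 20 cm.

```python
import numpy as np

# Hypothetical feature values in cm with a mean of 20.
x = np.array([10.0, 20.0, 30.0])

mu_x = x.mean()     # feature mean
sigma_x = x.std()   # population standard deviation (NumPy default, ddof=0)

# Standardization: subtract the mean, divide by the standard deviation.
x_std = (x - mu_x) / sigma_x

print(x_std)
print(x_std.mean(), x_std.std())  # approximately 0 and 1
```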
\[
s_x = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left(x^{[i]} - \bar{x}\right)^2}
\]

\[
\sigma_x = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(x^{[i]} - \mu_x\right)^2}
\]
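The two estimators differ only in the divisor (n − 1 versus n). In NumPy the `ddof` argument selects between them; a short sketch with hypothetical values:

```python
import numpy as np

# Hypothetical feature values in cm.
x = np.array([10.0, 20.0, 30.0])

# Population standard deviation: divide by n (NumPy default, ddof=0).
sigma_x = np.std(x, ddof=0)

# Sample standard deviation: divide by n - 1 (ddof=1).
s_x = np.std(x, ddof=1)

print(f"sigma_x = {sigma_x:.4f}")
print(f"s_x     = {s_x:.4f}")
```

Note that pandas' `Series.std` defaults to `ddof=1`, whereas NumPy's `std` and scikit-learn's `StandardScaler` use the population version, so standardized values can differ slightly between libraries on small datasets.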
[Figure: worked feature-scaling example. Estimate: mean: 20 cm. Standardize. New example: example6: 7 cm -> class ? Estimate "new" mean and std.]
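[Figure: the scikit-learn transformer API. ① est.fit(X_train) is called on the training data; the training data and test data are then converted into transformed training data and transformed test data.]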
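The transformer workflow can be sketched with `StandardScaler`: the scaling parameters are estimated once from the training data and then reused for the test data. The numbers below are hypothetical (only the 20 cm training mean and the 7 cm example appear in the slides).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical one-feature datasets in cm; the training mean is 20 cm.
X_train = np.array([[10.0], [20.0], [30.0]])
X_test = np.array([[5.0], [6.0], [7.0]])

est = StandardScaler()
est.fit(X_train)  # estimate mean and standard deviation from training data only

# Apply the *training* parameters to both datasets.
X_train_std = est.transform(X_train)
X_test_std = est.transform(X_test)

print("Estimated mean:", est.mean_)
print("Transformed test data:", X_test_std.ravel())
```

Because the test values are scaled with the training mean and standard deviation, they all come out well below zero, reflecting that they are small relative to the training data. Re-estimating a "new" mean and standard deviation on the test set would instead center them around zero and hide that difference.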
[Figure: scikit-learn Pipeline.
(Step 1) pipeline.fit(…) on the training set and its class labels: Scaling calls .fit(…) & .transform(…), Dimensionality Reduction calls .fit(…) & .transform(…), and the Learning Algorithm calls .fit(…), yielding the predictive model.
(Step 2) pipeline.predict(…) on the test set: Scaling calls .transform(…), Dimensionality Reduction calls .transform(…), and the predictive model calls .predict(…), returning class labels.]
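A minimal sketch of the pipeline in the figure, built with `make_pipeline`. The specific components (`StandardScaler` for scaling, `PCA` for dimensionality reduction, `KNeighborsClassifier` as the learning algorithm) and all parameter values are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y
)

# Scaling -> dimensionality reduction -> learning algorithm.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KNeighborsClassifier(n_neighbors=3),
)

# Step 1: fit and transform with each preprocessing step, then fit the classifier.
pipeline.fit(X_train, y_train)

# Step 2: only transform (no refitting) in each preprocessing step, then predict.
y_pred = pipeline.predict(X_test)

print(f"Test accuracy: {pipeline.score(X_test, y_test):.3f}")
```

Bundling the steps this way means the scaling and projection parameters are always estimated from whatever data is passed to `fit`, which helps avoid accidentally fitting a preprocessing step on test data.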
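[Figure: hyperparameter tuning loop. A machine learning algorithm is fit to produce a predictive model, which is evaluated; change hyperparameters and repeat. The selected model then provides the final performance estimate.]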
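One possible reading of the loop in the figure is holdout-based tuning with a separate validation set. The sketch below assumes that reading; the split sizes, the candidate values for `n_neighbors`, and the choice of classifier are all hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Hold out a test set first, then carve a validation set out of the remainder.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123, stratify=y
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_temp, y_temp, test_size=0.2, random_state=123, stratify=y_temp
)

# Fit -> evaluate -> change hyperparameters and repeat.
best_k, best_acc = None, -1.0
for k in (1, 3, 5, 7):
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    model.fit(X_train, y_train)
    acc = model.score(X_valid, y_valid)  # evaluate on the validation set only
    if acc > best_acc:
        best_k, best_acc = k, acc

# Refit with the chosen setting on training + validation data,
# then touch the test set once for the final performance estimate.
final_model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=best_k))
final_model.fit(X_temp, y_temp)
test_acc = final_model.score(X_test, y_test)

print(f"Best n_neighbors: {best_k} (validation accuracy {best_acc:.3f})")
print(f"Final test accuracy: {test_acc:.3f}")
```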
https://github.com/scikit-learn/scikit-learn/pull/13900
https://github.com/rasbt/stat451-machine-learning-fs20/blob/master/L05/code/05-preprocessing-and-sklearn__notes.ipynb