Credits: Forked from PyCon 2015 Scikit-learn Tutorial by Jake VanderPlas
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn
from sklearn.linear_model import LinearRegression
from scipy import stats
import pylab as pl
seaborn.set()
from IPython.display import Image
Image("http://scikit-learn.org/dev/_static/ml_map.png", width=800)
Given a scikit-learn estimator object named model, the following methods are available:

- model.fit(): fit training data. For supervised learning applications, this accepts two arguments: the data X and the labels y (e.g. model.fit(X, y)). For unsupervised learning applications, this accepts only a single argument, the data X (e.g. model.fit(X)).
- model.predict(): given a trained model, predict the label of a new set of data. This method accepts one argument, the new data X_new (e.g. model.predict(X_new)), and returns the learned label for each object in the array.
- model.predict_proba(): for classification problems, some estimators also provide this method, which returns the probability that a new observation has each categorical label. In this case, the label with the highest probability is returned by model.predict().
- model.score(): for classification or regression problems, most (all?) estimators implement a score method. Scores are between 0 and 1, with a larger score indicating a better fit.
- model.predict() (for unsupervised models): predict labels in clustering algorithms.
- model.transform(): given an unsupervised model, transform new data into the new basis. This also accepts one argument X_new, and returns the new representation of the data based on the unsupervised model.
- model.fit_transform(): some estimators implement this method, which more efficiently performs a fit and a transform on the same input data.
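Here is a minimal sketch of that interface, using LinearRegression (imported above) for the supervised case and PCA for the unsupervised case; the tiny arrays are made up purely to show the method calls:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA

# supervised: fit() takes the data X and the labels y
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.1, 5.9, 8.2])
model = LinearRegression()
model.fit(X, y)
print(model.predict([[5.0]]))   # predicted value for a new sample
print(model.score(X, y))        # goodness of fit (R^2 for regression)

# unsupervised: fit() takes only X; fit_transform() fits and transforms in one call
X2 = np.array([[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2]])
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X2)
print(X_reduced.shape)          # (4, 1): the data re-expressed in the new basis

The same fit/predict pattern carries over to the classifiers used below.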
from sklearn.datasets import load_iris
iris = load_iris()
n_samples, n_features = iris.data.shape
print(iris.keys())
print((n_samples, n_features))
print(iris.data.shape)
print(iris.target.shape)
print(iris.target_names)
print(iris.feature_names)
['target_names', 'data', 'target', 'DESCR', 'feature_names']
(150, 4)
(150, 4)
(150,)
['setosa' 'versicolor' 'virginica']
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
import numpy as np
import matplotlib.pyplot as plt
# 'sepal width (cm)'
x_index = 1
# 'petal length (cm)'
y_index = 2
# this formatter will label the colorbar with the correct target names
formatter = plt.FuncFormatter(lambda i, *args: iris.target_names[int(i)])
plt.scatter(iris.data[:, x_index], iris.data[:, y_index],
c=iris.target, cmap=plt.cm.get_cmap('RdYlBu', 3))
plt.colorbar(ticks=[0, 1, 2], format=formatter)
plt.clim(-0.5, 2.5)
plt.xlabel(iris.feature_names[x_index])
plt.ylabel(iris.feature_names[y_index]);
The K-Nearest Neighbors (KNN) algorithm is a method used for classification or for regression. In both cases, the input consists of the k closest training examples in the feature space. Given a new, unknown observation, we look up the k training points whose features are closest and assign the predominant class among them.
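To make that lookup concrete, here is a toy NumPy sketch of the core computation (purely illustrative; scikit-learn's own implementation uses optimized brute-force and tree-based neighbor searches):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    # Euclidean distance from the query point to every training point
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # indices of the k nearest training points
    nearest = np.argsort(dists)[:k]
    # majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

scikit-learn packages this idea, together with distance weighting and faster search structures, in KNeighborsClassifier: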
from sklearn import neighbors, datasets
iris = datasets.load_iris()
X, y = iris.data, iris.target
# create the model
knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform')
# fit the model
knn.fit(X, y)
# What kind of iris has 3cm x 5cm sepal and 4cm x 2cm petal?
X_pred = [3, 5, 4, 2]  # [sepal length, sepal width, petal length, petal width] in cm
result = knn.predict([X_pred])
print(iris.target_names[result])
print(iris.target_names)
print(knn.predict_proba([X_pred]))
from fig_code import plot_iris_knn
plot_iris_knn()
['versicolor']
['setosa' 'versicolor' 'virginica']
[[ 0.   0.8  0.2]]
Note that we see overfitting in the K-Nearest Neighbors model above. We'll be addressing overfitting and model validation in a later notebook.
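As a quick preview, one simple check is to hold out part of the data and score the model only on samples it has never seen. The sketch below assumes a recent scikit-learn where train_test_split lives in sklearn.model_selection, and reuses the X and y loaded above:

from sklearn.model_selection import train_test_split

# hold out 25% of the iris data, fit on the rest, and score on the unseen part
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
knn_holdout = neighbors.KNeighborsClassifier(n_neighbors=5)
knn_holdout.fit(X_train, y_train)
print(knn_holdout.score(X_test, y_test))  # accuracy on held-out data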