Scikit-Learn-Exercises - Jupyter Notebook
Notes:
There may be more than one different way to answer a question or complete an exercise.
Some skeleton code has been implemented for you.
Exercises are based on (and directly taken from) the quick introduction to Scikit-Learn notebook
(https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-
tools/introduction-to-scikit-learn.ipynb).
Different tasks will be detailed by comments or text. Places to put your own code are defined by ###
(don't remove anything other than ### ).
For further reference and resources, it's advised to check out the Scikit-Learn documentation (https://scikit-
learn.org/stable/user_guide.html).
And if you get stuck, try searching for a question in the following format: "how to do XYZ with Scikit-Learn",
where XYZ is the function you want to leverage from Scikit-Learn.
Since we'll be working with data, we'll import Scikit-Learn along with its usual companions: Matplotlib, NumPy and pandas.
In [3]:
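A minimal sketch of the imports this cell likely contains (the standard np, pd and plt aliases are assumptions):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline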
Note: When viewing a .csv on GitHub, make sure it's in the raw format. For example, the URL should look
like: https://raw.githubusercontent.com/mrdbourke/zero-to-mastery-ml/master/data/heart-disease.csv
(https://raw.githubusercontent.com/mrdbourke/zero-to-mastery-ml/master/data/heart-disease.csv)
In [4]:
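A sketch of how the heart disease data might be loaded using the raw URL above (the variable name heart_disease is an assumption):

heart_disease = pd.read_csv("https://raw.githubusercontent.com/mrdbourke/zero-to-mastery-ml/master/data/heart-disease.csv")
heart_disease.head()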
Out[4]:
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal target
Our goal here is to build a machine learning model on all of the columns except target to predict target .
In essence, the target column is our target variable (also called y or labels ) and the remaining
columns are our independent variables (also called data or X ).
And since our target variable is one thing or another (heart disease or not), we know our problem is a
classification problem (classifying whether something is one thing or another).
In [5]:
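A sketch of the X/y split (assuming the dataframe is called heart_disease as above):

# Split the data into X (all columns except target) and y (the target column)
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]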
Now that we've split our data into X and y , we'll use Scikit-Learn to split it into training and test sets.
In [6]:
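A sketch of the train/test split (the 80/20 split is an assumption, though it's consistent with the 61-sample test set seen later):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)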
In [7]:
Out[7]:
Since our data is now in training and test sets, we'll build a machine learning model to fit patterns in the training
data and then make predictions on the test data.
To figure out which machine learning model we should use, you can refer to Scikit-Learn's machine learning
map (https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html).
After following the map, you decide to use the RandomForestClassifier (https://scikit-
learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).
In [8]:
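A sketch of instantiating the classifier (the variable name clf matches the later cells):

# Import the RandomForestClassifier and create an instance of it
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()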
Now that you've got a RandomForestClassifier instance, let's fit it to the training data.
In [9]:
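Fitting presumably looks like this:

# Fit the classifier to the training data
clf.fit(X_train, y_train)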
Out[9]:
RandomForestClassifier()
In [10]:
# Use the fitted model to make predictions on the test data and
# save the predictions to a variable called y_preds
y_preds = clf.predict(X_test)
In [11]:
# Evaluate the fitted model on the training set using the score() function
clf.score(X_train, y_train)
Out[11]:
1.0
In [12]:
# Evaluate the fitted model on the test set using the score() function
clf.score(X_test, y_test)
Out[12]:
0.819672131147541
For this exercise, the models we're going to try and compare are:
LinearSVC (https://scikit-learn.org/stable/modules/svm.html#classification)
KNeighborsClassifier (https://scikit-learn.org/stable/modules/neighbors.html) (also known as K-Nearest
Neighbors or KNN)
SVC (https://scikit-learn.org/stable/modules/svm.html#classification) (also known as support vector
classifier, a form of support vector machine (https://en.wikipedia.org/wiki/Support-vector_machine))
LogisticRegression (https://scikit-
learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) (despite the name, this
is actually a classifier)
RandomForestClassifier (https://scikit-
learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) (an ensemble method
and what we used above)
We'll follow the same workflow we used above (except this time for multiple models):
Note: Since we've already got the data ready, we can reuse it in this section.
In [13]:
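A sketch of the imports needed for the model comparison (all five classes are named above):

from sklearn.svm import LinearSVC, SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier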
Thanks to the consistency of Scikit-Learn's API design, we can use virtually the same code to fit, score and
make predictions with each of our models.
If you're wondering what it means to instantiate each model in a dictionary, see the example below.
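A sketch of what example_dict might look like (which model it contains is an assumption):

# EXAMPLE: instantiating a model as the value of a dictionary entry
example_dict = {"LinearSVC": LinearSVC()}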
In [14]:
# Create a dictionary called models which contains all of the classification models we've imported
# Make sure the dictionary is in the same format as example_dict
# The models dictionary should contain 5 models
models = {"LinearSVC": LinearSVC(),
          "KNN": KNeighborsClassifier(),
          "SVC": SVC(),
          "LogisticRegression": LogisticRegression(),
          "RandomForestClassifier": RandomForestClassifier()}
Since each model we're using has the same fit() and score() functions, we can loop through our models
dictionary, call fit() on the training data and then call score() with the test data.
In [15]:
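One way the example results dictionary could have been built (example_clf is a hypothetical name and the exact score will vary from run to run):

# EXAMPLE: create an empty results dictionary, then fit and score a single model
results = {}
example_clf = RandomForestClassifier()
example_clf.fit(X_train, y_train)
results["RandomForestClassifier"] = example_clf.score(X_test, y_test)
results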
Out[15]:
{'RandomForestClassifier': 0.8360655737704918}
In [16]:
# Loop through the models dictionary items, fitting the model on the training data
# and appending the model name and model score on the test data to the results dictionary
for model_name, model in models.items():
    model.fit(X_train, y_train)
    results[model_name] = model.score(X_test, y_test)
C:\Users\User\anaconda3\lib\site-packages\sklearn\svm\_base.py:985: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  warnings.warn("Liblinear failed to converge, increase "
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:763: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
Out[16]:
{'LinearSVC': 0.7868852459016393,
'KNN': 0.6885245901639344,
'SVC': 0.6557377049180327,
'LogisticRegression': 0.8360655737704918,
'RandomForestClassifier': 0.819672131147541}
Due to the randomness of how each model finds patterns in the data, you might notice different results each
time.
Without manually setting the random state using the random_state parameter of some models or using a
NumPy random seed, every time you run the cell, you'll get slightly different results.
Let's see this in effect by running the same code as the cell above, except this time setting a NumPy random
seed equal to 42 (https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.random.seed.html).
In [17]:
# Run the same code as the cell above, except this time set a NumPy random seed
# equal to 42
np.random.seed(42)
for model_name, model in models.items():
    model.fit(X_train, y_train)
    results[model_name] = model.score(X_test, y_test)
results
C:\Users\User\anaconda3\lib\site-packages\sklearn\svm\_base.py:985: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  warnings.warn("Liblinear failed to converge, increase "
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:763: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
Out[17]:
{'LinearSVC': 0.7868852459016393,
'KNN': 0.6885245901639344,
'SVC': 0.6557377049180327,
'LogisticRegression': 0.8360655737704918,
'RandomForestClassifier': 0.8360655737704918}
Run the cell above a few times, what do you notice about the results?
Which model performs the best this time?
What happens if you add a NumPy random seed to the cell where you called train_test_split()
(towards the top of the notebook) and then rerun the cell above?
In [18]:
# Create a pandas dataframe with the data as the values of the results dictionary,
# the index as the keys of the results dictionary and a single column called accuracy.
# Be sure to save the dataframe to a variable.
results_df = pd.DataFrame(data=results.values(),
                          index=results.keys(),
                          columns=["accuracy"])
Using np.random.seed(42) results in the LogisticRegression model performing the best (at least on my
computer).
Hyperparameter Tuning
Remember, if you're ever trying to tune a machine learning model's hyperparameters and you're not sure where
to start, you can always search something like "MODEL_NAME hyperparameter tuning".
In the case of LogisticRegression, you might come across articles, such as Hyperparameter Tuning Using Grid
Search by Chris Albon
(https://chrisalbon.com/machine_learning/model_selection/hyperparameter_tuning_using_grid_search/).
The different hyperparameters to search over have been setup for you in log_reg_grid but feel free to
change them.
In [19]:
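A sketch of the hyperparameter grid (these particular values are an assumption, though they're consistent with the best parameters found below):

# Different LogisticRegression hyperparameters to search over
log_reg_grid = {"C": np.logspace(-4, 4, 20),
                "solver": ["liblinear"]}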
Since we've got a set of hyperparameters we can import RandomizedSearchCV , pass it our dictionary of
hyperparameters and let it search for the best combination.
In [20]:
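A sketch of setting up and fitting the randomized search (the cv, n_iter and verbose values are assumptions):

from sklearn.model_selection import RandomizedSearchCV

# Set up a randomized hyperparameter search for LogisticRegression
rs_log_reg = RandomizedSearchCV(estimator=LogisticRegression(),
                                param_distributions=log_reg_grid,
                                cv=5,
                                n_iter=5,
                                verbose=True)

# Fit the randomized search to the training data
rs_log_reg.fit(X_train, y_train)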
Out[20]:
Once RandomizedSearchCV has finished, we can find the best hyperparameters it found using the
best_params_ attribute.
In [21]:
# Find the best parameters of the RandomizedSearchCV instance using the best_params_ attribute
rs_log_reg.best_params_
Out[21]:
In [22]:
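The score below presumably comes from evaluating the randomized search model on the test data:

# Evaluate the model found by RandomizedSearchCV on the test data
rs_log_reg.score(X_test, y_test)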
Out[22]:
0.8360655737704918
After hyperparameter tuning, did the model's score improve? What else could you try to improve it? Are there
any other methods of hyperparameter tuning you can find for LogisticRegression ?
But when it comes to classification, you'll likely want to use a few more evaluation metrics, including a confusion matrix, a classification report, precision, recall, the F1 score and a ROC curve with its AUC score.
Before we get to these, we'll instantiate a new instance of our model using the best hyperparameters found by
RandomizedSearchCV .
In [23]:
# Fit a new instance of LogisticRegression with the best hyperparameters on the training data
clf = LogisticRegression(C=0.23357214690901212, solver="liblinear")
clf.fit(X_train, y_train)
Out[23]:
LogisticRegression(C=0.23357214690901212, solver='liblinear')
Now it's time to import the relevant Scikit-Learn methods for each of the classification evaluation metrics we're after.
In [24]:
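A sketch of the imports (plot_roc_curve lived in sklearn.metrics in the Scikit-Learn version used here; newer versions replace it with RocCurveDisplay):

from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import plot_roc_curve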
Evaluation metrics are very often comparing a model's predictions to some ground truth labels.
Let's make some predictions on the test data using our latest model and save them to y_preds .
In [25]:
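Presumably the same call as earlier:

# Make predictions on the test data with the tuned model
y_preds = clf.predict(X_test)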
Time to use the predictions our model has made to evaluate it beyond accuracy.
In [26]:
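A sketch of creating the confusion matrix (the variable name cof_mat matches the plotting function below):

# Compare the truth labels to the predicted labels
cof_mat = confusion_matrix(y_test, y_preds)
cof_mat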
Out[26]:
array([[21, 7],
[ 3, 30]], dtype=int64)
Challenge: The in-built confusion_matrix function in Scikit-Learn produces output that isn't very visual. How
could you make your confusion matrix more visual?
You might want to search something like "how to plot a confusion matrix". Note: There may be more than one
way to do this.
In [27]:
import seaborn as sns

def plot_cof_mat(cof_mat):
    """
    Plot a confusion matrix using Seaborn's heatmap().
    """
    fig, ax = plt.subplots(figsize=(3, 3))
    ax = sns.heatmap(cof_mat,
                     annot=True,  # annotate the boxes with the counts
                     cbar=False)
    plt.xlabel("Predicted label")  # columns of the matrix are predictions
    plt.ylabel("True label")  # rows of the matrix are the truth labels

plot_cof_mat(cof_mat)
In [28]:
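The truncated output below likely comes from a classification report, something like:

# Print a classification report comparing the truth labels to the predictions
print(classification_report(y_test, y_preds))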
accuracy 0.84 61
macro avg 0.84 0.83 0.83 61
weighted avg 0.84 0.84 0.83 61
Challenge: Write down what each of the columns in this classification report are.
Precision - Indicates the proportion of positive identifications (model predicted class 1) which were actually
correct. A model which produces no false positives has a precision of 1.0.
Recall - Indicates the proportion of actual positives which were correctly classified. A model which
produces no false negatives has a recall of 1.0.
F1 score - A combination of precision and recall. A perfect model achieves an F1 score of 1.0.
Support - The number of samples each metric was calculated on.
Accuracy - The accuracy of the model in decimal form. Perfect accuracy is equal to 1.0.
Macro avg - Short for macro average, the average precision, recall and F1 score between classes. Macro
avg doesn't take class imbalance into account, so if you do have class imbalances, pay attention to this metric.
Weighted avg - Short for weighted average, the weighted average precision, recall and F1 score between
classes. Weighted means each metric is calculated with respect to how many samples there are in each
class. This metric will favour the majority class (e.g. it will give a high value when one class outperforms
another due to having more samples).
The classification report gives us a range of values for precision, recall and F1 score. Now it's time to find these metrics
using Scikit-Learn functions.
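A sketch of the three calls the next few cells likely contain (one per cell):

# Precision, recall and F1 score of the tuned model on the test data
precision_score(y_test, y_preds)
recall_score(y_test, y_preds)
f1_score(y_test, y_preds)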
In [29]:
Out[29]:
0.8108108108108109
In [30]:
Out[30]:
0.9090909090909091
In [31]:
Out[31]:
0.8571428571428571
Confusion matrix: done. Classification report: done. ROC (receiver operator characteristic) curve & AUC (area
under curve) score: not done.
If you're unfamiliar with what a ROC curve is, that's your first challenge: read up on what one is.
And the AUC score is the area under the ROC curve.
Scikit-Learn provides a handy function for creating both of these called plot_roc_curve() (https://scikit-
learn.org/stable/modules/generated/sklearn.metrics.plot_roc_curve.html).
In [32]:
# Plot a ROC curve using our current machine learning model using plot_roc_curve
###
plot_roc_curve(clf, X_test, y_test)
Out[32]:
<sklearn.metrics._plot.roc_curve.RocCurveDisplay at 0x20477ed7970>
Beautiful! We've gone far beyond accuracy with a plethora of extra classification evaluation metrics.
If you're not sure about any of these, don't worry, they can take a while to understand. That could be an optional
extension, reading up on a classification metric you're not sure of.
The thing to note here is all of these metrics have been calculated using a single training set and a single test
set. Whilst this is okay, a more robust way is to calculate them using cross-validation (https://scikit-
learn.org/stable/modules/cross_validation.html).
We can calculate various evaluation metrics using cross-validation using Scikit-Learn's cross_val_score()
(https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function along
with the scoring parameter.
In [33]:
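A sketch of importing cross_val_score and using it for cross-validated accuracy (cv=5 and scoring on the full X and y are assumptions):

from sklearn.model_selection import cross_val_score

# Returns an array of accuracy scores, one per cross-validation fold
cross_val_score(clf, X, y, scoring="accuracy", cv=5)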
In [34]:
Out[34]:
In [35]:
cross_val_acc = np.mean(cross_val_score(clf, X, y, scoring="accuracy", cv=5))  # cv=5 is an assumption
cross_val_acc
Out[35]:
0.8479781420765027
In the examples, the cross-validated accuracy is found by taking the mean of the array returned by
cross_val_score() .
Now it's time to find the same for precision, recall and F1 score.
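A sketch of how the next three cells might compute these (same assumptions as above):

# Cross-validated precision, recall and F1 score (one call per cell)
np.mean(cross_val_score(clf, X, y, scoring="precision", cv=5))
np.mean(cross_val_score(clf, X, y, scoring="recall", cv=5))
np.mean(cross_val_score(clf, X, y, scoring="f1", cv=5))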
In [36]:
Out[36]:
0.8215873015873015
In [37]:
Out[37]:
0.9272727272727274
In [38]:
Out[38]:
0.8705403543192143
One method of exporting and importing models is using the joblib library.
In Scikit-Learn, exporting and importing a trained model is known as model persistence (https://scikit-
learn.org/stable/modules/model_persistence.html).
In [39]:
# Import the dump and load functions from the joblib library
###
from joblib import dump, load
In [40]:
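The export presumably looks like this (the filename matches the output below):

# Use the dump function to export the trained model to file
dump(clf, filename="trained-classifier.joblib")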
Out[40]:
['trained-classifier.joblib']
In [41]:
# Use the load function to import the trained model you just exported
# Save it to a different variable name to the original trained model
###
loaded_clf=load('trained-classifier.joblib')
# Evaluate the loaded trained model on the test data
###
loaded_clf.score(X_test, y_test)
Out[41]:
0.8360655737704918
What do you notice about the loaded trained model results versus the original (pre-exported) model results?
We'll use Scikit-Learn's built-in regression machine learning models to try and learn the patterns between the car
characteristics and their prices on one portion of the dataset, before trying to predict the sale price of a group
of cars the model has never seen before.
In [42]:
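A sketch of loading the car sales data (the file path and name are assumptions; the dataset contains Make, Colour, Odometer (KM), Doors and Price columns with some missing values):

car_sales = pd.read_csv("car-sales-extended-missing-data.csv")
car_sales.head()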
Out[42]:
In [43]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Make 951 non-null object
1 Colour 950 non-null object
2 Odometer (KM) 950 non-null float64
3 Doors 950 non-null float64
4 Price 950 non-null float64
dtypes: float64(3), object(2)
memory usage: 39.2+ KB
In [44]:
Out[44]:
Make 49
Colour 50
Odometer (KM) 50
Doors 50
Price 50
dtype: int64
In [45]:
Out[45]:
Make object
Colour object
Odometer (KM) float64
Doors float64
Price float64
dtype: object
Knowing this information, what would happen if we tried to model our data as it is?
Let's see.
In [46]:
# EXAMPLE: This doesn't work because our car_sales data isn't all numerical
from sklearn.ensemble import RandomForestRegressor
car_sales_X, car_sales_y = car_sales.drop("Price", axis=1), car_sales.Price
rf_regressor = RandomForestRegressor().fit(car_sales_X, car_sales_y)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-46-476d8071e1b5> in <module>
      2 from sklearn.ensemble import RandomForestRegressor
      3 car_sales_X, car_sales_y = car_sales.drop("Price", axis=1), car_sales.Price
----> 4 rf_regressor = RandomForestRegressor().fit(car_sales_X, car_sales_y)

~\anaconda3\lib\site-packages\sklearn\ensemble\_forest.py in fit(self, X, y, sample_weight)
    303             )
--> 304         X, y = self._validate_data(X, y, multi_output=True,
    305                                    accept_sparse="csc", dtype=DTYPE)
    306         if sample_weight is not None:

~\anaconda3\lib\site-packages\sklearn\base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    432             else:
--> 433                 X, y = check_X_y(X, y, **check_params)
    434             out = X, y

~\anaconda3\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, ...)
    813
--> 814     X = check_array(X, accept_sparse=accept_sparse,
    815                     accept_large_sparse=accept_large_sparse,
    816                     dtype=dtype, order=order, copy=copy,

~\anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, ...)

(traceback truncated; check_array() ultimately raises the ValueError because the data isn't all numerical)
As we see, the cell above breaks because our data contains non-numerical values as well as missing data.
To take care of some of the missing data, we'll remove the rows which have no labels (all the rows with missing
values in the Price column).
In [47]:
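A sketch of removing the unlabelled rows:

# Remove the rows with missing values in the Price column
car_sales.dropna(subset=["Price"], inplace=True)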
Building a pipeline
Since our car_sales data has missing values and isn't all numerical, we'll have to
fix these things before we can fit a machine learning model on it.
There are ways we could do this with pandas but since we're practicing Scikit-Learn, we'll see how we might do
it with the Pipeline (https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) class.
Because we're modifying columns in our dataframe (filling missing values, converting non-numerical data to
numbers) we'll need the ColumnTransformer (https://scikit-
learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html), SimpleImputer
(https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) and OneHotEncoder
(https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) classes as well.
Finally, because we'll need to split our data into training and test sets, we'll import train_test_split as well.
In [48]:
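A sketch of the imports named above:

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split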
Now we've got the necessary tools we need to create our preprocessing Pipeline which fills missing values
along with turning all non-numerical data into numbers.
In [49]:
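A sketch of the categorical transformer (the variable names and imputation strategy are assumptions):

# Categorical features: fill missing values with a constant, then one-hot encode them
categorical_features = ["Make", "Colour"]
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))])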
It would be safe to treat Doors as a categorical feature as well, however since we know the vast majority of
cars have 4 doors, we'll impute the missing Doors values as 4.
In [50]:
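A sketch of the Doors transformer:

# Impute missing Doors values with 4 (the most common number of doors)
door_feature = ["Doors"]
door_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value=4))])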
Now onto the numeric features. In this case, the only numeric feature is the Odometer (KM) column. Let's fill its missing values.
In [51]:
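A sketch of the numeric transformer (filling with the mean is an assumption):

# Fill missing Odometer (KM) values
numeric_features = ["Odometer (KM)"]
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean"))])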
Time to put all of our individual transformer Pipeline 's into a single ColumnTransformer instance.
In [52]:
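A sketch of combining the transformers (the step names are assumptions; the preprocessor variable is reused below):

preprocessor = ColumnTransformer(transformers=[
    ("cat", categorical_transformer, categorical_features),
    ("door", door_transformer, door_feature),
    ("num", numeric_transformer, numeric_features)])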
Boom! Now our preprocessor is ready, time to import some regression models to try out.
RidgeRegression (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html)
SVR(kernel="linear") (https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html) - short for
Support Vector Regressor, a form of support vector machine.
SVR(kernel="rbf") (https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html) - short for
Support Vector Regressor, a form of support vector machine.
RandomForestRegressor (https://scikit-
learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) - the regression
version of RandomForestClassifier.
In [53]:
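A sketch of the imports needed for the regression models listed above:

from sklearn.linear_model import Ridge
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor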
Again, thanks to the design of the Scikit-Learn library, we're able to use very similar code for each of these
models.
To test them all, we'll create a dictionary of regression models and an empty dictionary for regression model
results.
In [54]:
# Create dictionary of model instances, there should be 4 total key, value pairs
# in the form {"model_name": model_instance}.
# Don't forget there's two versions of SVR, one with a "linear" kernel and the
# other with kernel set to "rbf".
regression_models = {"Ridge": Ridge(),
                     "SVR_linear": SVR(kernel="linear"),
                     "SVR_rbf": SVR(kernel="rbf"),
                     "RandomForestRegressor": RandomForestRegressor()}

# Create an empty dictionary for the regression results to be appended to
regression_results = {}
Our regression model dictionary is prepared as well as an empty dictionary to append results to, time to get the
data split into X (feature variables) and y (target variable) as well as training and test sets.
In our car sales problem, we're trying to use the different characteristics of a car ( X ) to predict its sale price
( y ).
In [55]:
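A sketch of the splits (the variable names match the later cells; the 80/20 split is an assumption):

# Split the car sales data into features (X) and target (y)
car_X = car_sales.drop("Price", axis=1)
car_y = car_sales["Price"]

# Split the data into training and test sets
car_X_train, car_X_test, car_y_train, car_y_test = train_test_split(car_X,
                                                                    car_y,
                                                                    test_size=0.2)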
In [56]:
Out[56]:
Alright, our data is split into training and test sets, time to build a small loop which is going to:
1. Go through the regression_models dictionary
2. Create a model Pipeline containing the preprocessor steps as well as the current model
3. Fit the model Pipeline to the car sales training data
4. Evaluate the target model on the car sales test data and append the results to our
regression_results dictionary
In [57]:
for model_name, model in regression_models.items():
    # Create a Pipeline combining the preprocessor and the current model, then fit it
    model_pipeline = Pipeline(steps=[("preprocessor", preprocessor), ("model", model)])
    print(f"Fitting {model_name}...")
    model_pipeline.fit(car_X_train, car_y_train)
    # Score the model pipeline on the test data appending the model_name to the results dictionary
    print(f"Scoring {model_name}...")
    regression_results[model_name] = model_pipeline.score(car_X_test, car_y_test)
Fitting Ridge...
Scoring Ridge...
Fitting SVR_linear...
Scoring SVR_linear...
Fitting SVR_rbf...
Scoring SVR_rbf...
Fitting RandomForestRegressor...
Scoring RandomForestRegressor...
Our regression models have been fit, let's see how they did!
In [58]:
Out[58]:
{'Ridge': 0.254026110579439,
'SVR_linear': -0.489452821008145,
'SVR_rbf': 0.0018546241516633755,
'RandomForestRegressor': 0.2291358152962253}
Since we've fitted some models but only compared them via the default metric contained in the score()
method (R^2 score or coefficient of determination), let's take the RidgeRegression model and evaluate it with
a few other regression metrics (https://scikit-learn.org/stable/modules/model_evaluation.html#regression-
metrics).
1. R^2 (pronounced r-squared) or coefficient of determination - Compares your models predictions to the
mean of the targets. Values can range from negative infinity (a very poor model) to 1. For example, if all
your model does is predict the mean of the targets, its R^2 value would be 0. And if your model perfectly
predicts a range of numbers, its R^2 value would be 1.
2. Mean absolute error (MAE) - The average of the absolute differences between predictions and actual
values. It gives you an idea of how wrong your predictions were.
3. Mean squared error (MSE) - The average squared differences between predictions and actual values.
Squaring the errors removes negative errors. It also amplifies outliers (samples which have larger errors).
Scikit-Learn has a few functions built-in which are going to help us with these, namely, mean_absolute_error
(https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html),
mean_squared_error (https://scikit-
learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html) and r2_score (https://scikit-
learn.org/stable/modules/generated/sklearn.metrics.r2_score.html).
In [59]:
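A sketch of the imports named above:

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score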
All the evaluation metrics we're concerned with compare a model's predictions with the ground truth labels.
Knowing this, we'll have to make some predictions.
Let's create a Pipeline with the preprocessor and a Ridge() model, fit it on the car sales training data
and then make predictions on the car sales test data.
In [60]:
# Make predictions on the car sales test data using the RidgeRegression Pipeline
car_y_preds = ridge_pipeline.predict(car_X_test)
Out[60]:
Nice! Now we've got some predictions, time to evaluate them. We'll find the mean squared error (MSE), mean
absolute error (MAE) and R^2 score (coefficient of determination) of our model.
In [61]:
# EXAMPLE: Find the MSE by comparing the car sales test labels to the car sales predictions
mse = mean_squared_error(car_y_test, car_y_preds)
# Return the MSE
mse
Out[61]:
49950182.63337458
In [62]:
# Find the MAE by comparing the car sales test labels to the car sales predictions
mae = mean_absolute_error(car_y_test, car_y_preds)
# Return the MAE
mae
Out[62]:
5713.8215208551555
In [63]:
# Find the R^2 score by comparing the car sales test labels to the car sales predictions
r2 = r2_score(car_y_test, car_y_preds)
# Return the R^2 score
r2
Out[63]:
0.254026110579439
Boom! Our model could potentially do with some hyperparameter tuning (this would be a great extension). And
we could probably do with finding some more data for our problem; 1,000 rows doesn't seem to be sufficient.
Extensions
You should be proud. Getting this far means you've worked through a classification problem and regression
problem using pure (mostly) Scikit-Learn (no easy feat!).
For more exercises, check out the Scikit-Learn getting started documentation (https://scikit-
learn.org/stable/getting_started.html). A good practice would be to read through it and for the parts you find
interesting, add them into the end of this notebook.
Finally, as always, remember, the best way to learn something new is to try it. And try it relentlessly. If you're
unsure of how to do something, never be afraid to ask a question or search for something such as, "how to tune
the hyperparameters of a scikit-learn ridge regression model".