Linear Regression
In this notebook we'll experiment with linear regression in the prediction of used car prices.
We'll try different models, use polynomial features, and implement forward feature selection.
Instructions:
Problems for you to insert code are indicated with lines that begin with #@ followed by a problem number.
Always use a 75/25 split when splitting the data into training/test sets.
Always use random_state = 0 when calling train_test_split so that you get the same answers as the model output.
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import rcParams
import seaborn as sns
from scipy.stats import zscore
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split, cross_val_score
from IPython.display import HTML

HTML('''<script>
code_show=true;
function code_toggle() {
  if (code_show){
    $('div.input').hide();
  } else {
    $('div.input').show();
  }
  code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to display/hide the code."></form>''')
df = pd.read_csv("https://raw.githubusercontent.com/grbruns/cst383/master/kuiper-2008-cars.csv")
df.drop(['Model', 'Trim'], inplace=True, axis=1)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 804 entries, 0 to 803
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Price 804 non-null float64
1 Mileage 804 non-null int64
2 Make 804 non-null object
3 Type 804 non-null object
4 Cylinder 804 non-null int64
5 Liter 804 non-null float64
6 Doors 804 non-null int64
7 Cruise 804 non-null int64
8 Sound 804 non-null int64
9 Leather 804 non-null int64
dtypes: float64(2), int64(6), object(2)
memory usage: 62.9+ KB
There appears to be no missing data; at least none in the form of NA values. Let's look at the
relationships between some of the features.
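One quick way to look at these relationships (a sketch; the notebook's original figure isn't reproduced here) is a pairwise scatter plot of a few of the numeric columns:

# pairwise scatter plots of price against a few numeric features
sns.pairplot(df[['Price', 'Mileage', 'Cylinder', 'Liter']])
plt.show()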
What kinds of cars are in the data set and what are their proportions?
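A quick check (a sketch) using value_counts with normalize=True to get proportions:

# proportions of each vehicle type and make in the data
print(df['Type'].value_counts(normalize=True))
print(df['Make'].value_counts(normalize=True))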
Let's build a model to predict a used car's price from its mileage.
With one predictor, our linear model can be shown as a line. Let's look at the model compared to the
data.
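A minimal sketch of such a model (the name reg1 is just illustrative); a fit along these lines produced the output shown below:

X = df[['Mileage']]
y = df['Price']
reg1 = LinearRegression()
reg1.fit(X, y)

print("intercept: {:.2f}".format(reg1.intercept_))
print("coefficient for Mileage: {:.2f}".format(reg1.coef_[0]))
print("r-squared value: {:.2f}".format(reg1.score(X, y)))

# overlay the fitted line on a scatter plot of the data
plt.scatter(df['Mileage'], df['Price'], s=10, alpha=0.5)
plt.plot(df['Mileage'], reg1.predict(X), color='red')
plt.xlabel('Mileage')
plt.ylabel('Price')
plt.show()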
intercept: 24764.56
coefficient for Mileage: -0.17
r-squared value: 0.02
The model says that for every extra mile of mileage, our prediction for price will go down by 17 cents!
It makes sense that more mileage means a lower price. Let's build another model to predict price, but
this time using cruise control and leather interior as additional features.
X = df[['Mileage', 'Cruise', 'Leather']]   # mileage plus the cruise-control and leather indicators
y = df['Price']
reg2 = LinearRegression()
reg2.fit(X, y)
Intercept: 14297.18
coefficients:
Mileage: -0.19
Cruise: 10256.12
Leather: 4175.58
Amazingly, the coefficient for cruise control indicates that -- in this model -- the presence of cruise
control will increase the predicted price by over $10,000. That seems crazy, but maybe cruise
control is a good differentiator between a cheap car and an expensive car.
What would this model predict for the price of a car with 20,000 miles and cruise control but not a
leather interior?
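Using the rounded coefficients printed above, a quick back-of-the-envelope calculation (a sketch) gives roughly $20,750:

# intercept + mileage term + cruise term + leather term
price = 14297.18 + (-0.19) * 20000 + 10256.12 * 1 + 4175.58 * 0
print("predicted price: {:.2f}".format(price))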
A good way to see how well a regression model is working is to plot predicted values against actual
values. In our case we'll plot predicted price vs. actual price. Ideally, all points will be close to the
line showing where actual=predicted.
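A sketch of such a plot for reg2 (actual price on the x-axis, predicted price on the y-axis, matching the discussion below):

y_pred = reg2.predict(X)
plt.scatter(y, y_pred, s=10, alpha=0.5)
lims = [y.min(), y.max()]
plt.plot(lims, lims, color='red')   # the actual = predicted line
plt.xlabel('actual price')
plt.ylabel('predicted price')
plt.show()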
The plot shows that many predictions are bad -- they are far off the line. The points on the far right
are cases in which our predicted price is much less than the actual price. For example, in one of
these points the actual price is about 60,000 dollars but our predicted price is about 24,000 dollars.
In the other direction, there's a car with an actual price of about 10,000 in which our model predicts
about 25,000. You wouldn't want to use this model in deciding how much to pay for a used car.
We've been using one data set both to train our model and to make predictions. It's much better to
split the data into separate training and test sets.
#@ 10 Create another linear model (again building a model to predict Price from
# Mileage, Cruise, and Leather). Call your new model reg3.
# However, this time fit the model using the training data.
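One possible solution sketch, assuming X and y are the Mileage/Cruise/Leather predictors and Price target defined above:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
reg3 = LinearRegression()
reg3.fit(X_train, y_train)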
Let's plot predicted vs. actual for predictions made on the training set. This should be similar to our
last predicted vs. actual plot.
Now let's do a predicted vs. actual plot for predictions made on the test data.
The two plots look pretty similar. Again we see some really bad predictions, like the predicted price
of about 28,000 for a car having an actual price of 70,000.
A good way to measure the goodness of a regression model is by using the root mean squared
error.
#@ 13 Print the root mean squared error on the test data. This is the
# square root of the average squared error. Write your own code
# to compute the RMSE; don't use a library function.
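A sketch of one way to do this, using reg3 and the test split from problem 10:

y_pred = reg3.predict(X_test)
rmse = np.sqrt(((y_pred - y_test) ** 2).mean())   # root of the mean squared error
print("RMSE: {:.2f}".format(rmse))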
RMSE: 8517.23
We can make better predictions if we use more features as predictors. Let's use a feature that
represents the number of cylinders in a car's engine.
#@ 14 Create a new model reg4 that is like reg3, but adds 'Cylinder'
# as a new predictor. Do a train/test split (with random_state = 0), and fit your model to
# the training data.
X = df[['Mileage', 'Cruise', 'Leather', 'Cylinder']]   # reg3's predictors plus Cylinder
y = df['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
reg4 = LinearRegression()
reg4.fit(X_train, y_train)
Let's look at the RMSE of the new model, as well as the R-squared statistic.
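A sketch of how these could be computed for reg4 (variable names illustrative):

rmse4 = np.sqrt(((reg4.predict(X_test) - y_test) ** 2).mean())
print("RMSE: {:.2f}".format(rmse4))
print("R-squared: {:.2f}".format(reg4.score(X_train, y_train)))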
The new cylinder feature reduces the RMSE by over 10 percent. Let's look at predicted vs. actual for
the new model.
Some of the predictions are still really bad. Would scaling the data help us make better predictions?
Let's check the R-squared score of our new model with scaled data. How does it compare to the R-
squared value above for the unscaled data?
X_train_s = X_train.apply(zscore)   # standardize the predictors before fitting
regs = LinearRegression()
regs.fit(X_train_s, y_train)
With Scikit-Learn, all features must be numeric. Let's now transform all categorical variables into
numeric variables using the dummy variables approach.
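One plausible way to do this (a sketch): pd.get_dummies with drop_first=True, which matches the column list below, where one level of each category is absent:

# replace the Make and Type columns with 0/1 dummy columns
df = pd.get_dummies(df, drop_first=True)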
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 804 entries, 0 to 803
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Price 804 non-null float64
1 Mileage 804 non-null int64
2 Cylinder 804 non-null int64
3 Liter 804 non-null float64
4 Doors 804 non-null int64
5 Cruise 804 non-null int64
6 Sound 804 non-null int64
7 Leather 804 non-null int64
8 Make_Cadillac 804 non-null uint8
9 Make_Chevrolet 804 non-null uint8
10 Make_Pontiac 804 non-null uint8
11 Make_SAAB 804 non-null uint8
12 Make_Saturn 804 non-null uint8
13 Type_Coupe 804 non-null uint8
14 Type_Hatchback 804 non-null uint8
15 Type_Sedan 804 non-null uint8
16 Type_Wagon 804 non-null uint8
dtypes: float64(2), int64(6), uint8(9)
memory usage: 57.4 KB
Let's build a linear model in which we use all of the features as predictors, and look at the values of
all the coefficients.
#@ 18 Make a model with all features.
# First, create X and y where y contains 'Price' values
# and X contains the other columns of df.
# Next, perform a test train split to get X_train, X_test, etc.
# Then create a linear model using LinearRegression. Call your model reg5.
# Finally, print the coefficients of your model (use a loop
# in printing all coefficients except the intercept).
X = df.drop('Price', axis=1)
y = df['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

reg5 = LinearRegression()
reg5.fit(X_train, y_train)

print("intercept: {:.2f}".format(reg5.intercept_))
print("coefficients:")
# pair each coefficient with its predictor name (the target 'Price' is not in X)
for name, coef in zip(X_train.columns, reg5.coef_):
    print("  {}: {:.2f}".format(name, coef))
intercept: 34393.88
coefficients:
  Mileage: -0.19
  Cylinder: -1217.73
  Liter: 5674.20
  Doors: -5338.01
  Cruise: 149.11
  Sound: 150.82
  Leather: 112.58
  Make_Cadillac: 15638.25
  Make_Chevrolet: -1616.20
  Make_Pontiac: -1831.79
  Make_SAAB: 10623.51
  Make_Saturn: -1345.82
  Type_Coupe: -12563.40
  Type_Hatchback: -2209.53
  Type_Sedan: -2221.69
  Type_Wagon: 1762.21
It is interesting to look at the coefficient values. We see that Saab commands a premium price, and
that coupes are associated with a low price compared to sedans. None of the features like cruise
control, sound system, or leather seats makes a big difference in the predictions from this model.
#@ 19 Print the r-squared value for your model based on the training data
# and also print the RMSE based on the test data.
print("R-squared: {:.2f}".format(reg5.score(X_train,y_train)))
print("RMSE: {:.2f}".format(rmse))
R-squared: 0.94
RMSE: 2685.22
We can add even more features to the model by adding derived features. Let's try the
PolynomialFeatures class.
#@ 20 From your NumPy array X create an extended data set X_poly using
# PolynomialFeatures with degree=2. Assign your PolynomialFeatures
# object to variable pf.
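A possible sketch for this step (assuming X holds all predictor columns of df, i.e. everything except Price):

pf = PolynomialFeatures(degree=2)
X_poly = pf.fit_transform(X)   # adds squared and pairwise-product features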
As a sanity check, how many rows and columns are in the new data set?
X_poly.shape
(804, 153)
Let's go for broke and build a model using all features. What is the RMSE on the test data for such a
model?
X_train, X_test, y_train, y_test = train_test_split(X_poly, y, test_size=0.25, random_state=0)
reg6 = LinearRegression()
reg6.fit(X_train, y_train)
rmse = np.sqrt(((reg6.predict(X_test) - y_test)**2).mean())
print("RMSE: {:.2f}".format(rmse))
RMSE: 1446.13
This RMSE value is much lower than before, but our model has 153 features! It would be good to
know which feature is the most important. As an experiment, let's look at the RMSE if we use only
the first feature.
# compute negated mean squared error scores using 5-fold cross validation
X_0 = X_train[:, [0]]   # just the first column (first feature) of X_train
scores = cross_val_score(LinearRegression(), X_0, y_train, scoring='neg_mean_squared_error', cv=5)
rmse_0 = np.sqrt(-scores.mean())
Now we'll compute the RMSE for each individual feature, and see which is lowest.
#@ 22 Using the ideas in the last cell, compute the RMSE for each individual
# feature using 5-fold cross validation. Save the index and RMSE associated
# with the best feature as the two variables i_min and rmse_min.
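One way to fill in the per-feature errors (a sketch, following the single-feature example above; X_train here is the polynomial-feature array):

# cross-validated RMSE for each single column of X_train
errors = np.array([
    np.sqrt(-cross_val_score(LinearRegression(), X_train[:, [i]], y_train,
                             scoring='neg_mean_squared_error', cv=5).mean())
    for i in range(X_train.shape[1])
])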
rmse_min = errors.min()
i_min = np.where(errors==errors.min())[0][0]
Which is the best set of 10 features? The number of possible sets of 10 features is 153-choose-10,
that is, (153 * 152 * 151 * ... * 144) / 10!, which is a gigantic number (more than a million billion). Instead we can use
forward feature selection, which is a kind of greedy method.
remaining = list(range(X_train.shape[1]))
selected = []
n = 10
while len(selected) < n:
    # find the single feature that works best in conjunction
    # with the already-selected features
    rmse_min = 1e7
    for i in remaining:
        selected.append(i)
        scores = cross_val_score(LinearRegression(), X_train[:, selected], y_train,
                                 scoring='neg_mean_squared_error', cv=5)
        rmse = np.sqrt(-scores.mean())
        selected.pop()
        if rmse < rmse_min:
            rmse_min = rmse
            i_min = i
    remaining.remove(i_min)
    selected.append(i_min)
    print('num features: {}; rmse: {:.2f}'.format(len(selected), rmse_min))
How does the test RMSE of the model created using the 10 features found with forward feature
selection compare with the test RMSE of the model that uses all the features?
reg6 = LinearRegression()
reg6.fit(X_train[:, selected], y_train)
rmse = np.sqrt(((reg6.predict(X_test[:, selected]) - y_test)**2).mean())
print("test RMSE with 10 features: {:.2f}".format(rmse))