Linear Regression
In this notebook we'll experiment with linear regression in the prediction of used car prices.
We'll try different models, use polynomial features, and implement forward feature selection.
Instructions:
Problems for you to insert code are indicated with lines that begin with #@ followed by a problem number.
Always use a 75/25 split when splitting the data into training/test sets.
Always use random_state = 0 when calling train_test_split so that you get the same answers as the model output.
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import rcParams
import seaborn as sns
from scipy.stats import zscore
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split, cross_val_score
from IPython.display import HTML

HTML('''<script>
code_show=true;
function code_toggle() {
  if (code_show){
    $('div.input').hide();
  } else {
    $('div.input').show();
  }
  code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to display/hide the code."></form>''')
df = pd.read_csv("https://raw.githubusercontent.com/grbruns/cst383/master/kuiper-2008-cars.csv")
df.drop(['Model', 'Trim'], inplace=True, axis=1)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 804 entries, 0 to 803
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Price 804 non-null float64
1 Mileage 804 non-null int64
2 Make 804 non-null object
3 Type 804 non-null object
4 Cylinder 804 non-null int64
5 Liter 804 non-null float64
6 Doors 804 non-null int64
7 Cruise 804 non-null int64
8 Sound 804 non-null int64
9 Leather 804 non-null int64
dtypes: float64(2), int64(6), object(2)
memory usage: 62.9+ KB
There appears to be no missing data; at least none in the form of NA values. Let's look at the
relationships between some of the features.
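One quick way to look at these relationships (a sketch; the notebook's original figure isn't reproduced here) is a pairwise scatter plot of a few of the numeric columns:

# pairwise scatter plots of price against a few numeric features
sns.pairplot(df[['Price', 'Mileage', 'Cylinder', 'Liter']])
plt.show()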
What kinds of cars are in the data set and what are their proportions?
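A quick check (a sketch) using value_counts with normalize=True to get proportions:

# proportions of each vehicle type and make in the data
print(df['Type'].value_counts(normalize=True))
print(df['Make'].value_counts(normalize=True))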
Let's build a model to predict a used car's price from its mileage.
With one predictor, our linear model can be shown as a line. Let's look at the model compared to the
data.
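A minimal sketch of such a model (the name reg1 is just illustrative); a fit along these lines produced the output shown below:

X = df[['Mileage']]
y = df['Price']
reg1 = LinearRegression()
reg1.fit(X, y)

print("intercept: {:.2f}".format(reg1.intercept_))
print("coefficient for Mileage: {:.2f}".format(reg1.coef_[0]))
print("r-squared value: {:.2f}".format(reg1.score(X, y)))

# overlay the fitted line on a scatter plot of the data
plt.scatter(df['Mileage'], df['Price'], s=10, alpha=0.5)
plt.plot(df['Mileage'], reg1.predict(X), color='red')
plt.xlabel('Mileage')
plt.ylabel('Price')
plt.show()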
intercept: 24764.56
coefficient for Mileage: -0.17
r-squared value: 0.02
The model says that for every extra mile of mileage, our prediction for price will go down by 17 cents!
It makes sense that more mileage means a lower price. Let's build another model to predict price, but
this time using cruise control and leather interior as additional features.
X = df[['Mileage', 'Cruise', 'Leather']]   # mileage plus the cruise-control and leather indicators
y = df['Price']
reg2 = LinearRegression()
reg2.fit(X, y)
Intercept: 14297.18
coefficients:
Mileage: -0.19
Cruise: 10256.12
Leather: 4175.58
Amazingly, the coefficient for cruise control indicates that -- in this model -- the presence of cruise
control will increase the predicted price by over $10,000. That seems crazy, but maybe cruise
control is a good differentiator between a cheap car and an expensive car.
What would this model predict for the price of a car with 20,000 miles and cruise control but not a
leather interior?
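Using the rounded coefficients printed above, a quick back-of-the-envelope calculation (a sketch) gives roughly $20,750:

# intercept + mileage term + cruise term + leather term
price = 14297.18 + (-0.19) * 20000 + 10256.12 * 1 + 4175.58 * 0
print("predicted price: {:.2f}".format(price))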
A good way to see how well a regression model is working is to plot predicted values against actual
values. In our case we'll plot predicted price vs. actual price. Ideally, all points will be close to the
line showing where actual=predicted.
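A sketch of such a plot for reg2 (actual price on the x-axis, predicted price on the y-axis, matching the discussion below):

y_pred = reg2.predict(X)
plt.scatter(y, y_pred, s=10, alpha=0.5)
lims = [y.min(), y.max()]
plt.plot(lims, lims, color='red')   # the actual = predicted line
plt.xlabel('actual price')
plt.ylabel('predicted price')
plt.show()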
The plot shows that many predictions are bad -- they are far off the line. The points on the far right
are cases in which our predicted price is much less than the actual price. For example, in one of
these points the actual price is about 60,000 dollars but our predicted price is about 24,000 dollars.
In the other direction, there's a car with an actual price of about 10,000 in which our model predicts
about 25,000. You wouldn't want to use this model in deciding how much to pay for a used car.
We've been using one data set both to train our model and to make predictions. It's much better to
split the data into separate training and test sets.
#@ 10 Create another linear model (again building a model to predict Price from
# Mileage, Cruise, and Leather). Call your new model reg3.
# However, this time fit the model using the training data.
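One possible solution sketch, assuming X and y are the Mileage/Cruise/Leather predictors and Price target defined above:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
reg3 = LinearRegression()
reg3.fit(X_train, y_train)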
Let's plot predicted vs. actual for predictions made on the training set. This should be similar to our
last predicted vs. actual plot.
Now let's do a predicted vs. actual plot for predictions made on the test data.
The two plots look pretty similar. Again we see some really bad predictions, like the predicted price
of about 28,000 for a car having an actual price of 70,000.
A good way to measure the goodness of a regression model is by using the root mean squared
error.
#@ 13 Print the root mean squared error on the test data. This is the
# square root of the average squared error. Write your own code
# to compute the RMSE; don't use a library function.
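A sketch of one way to do this, using reg3 and the test split from problem 10:

y_pred = reg3.predict(X_test)
rmse = np.sqrt(((y_pred - y_test) ** 2).mean())   # root of the mean squared error
print("RMSE: {:.2f}".format(rmse))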
RMSE: 8517.23
We can make better predictions if we use more features as predictors. Let's use a feature that
represents the number of cylinders in a car's engine.
#@ 14 Create a new model reg4 that is like reg3, but adds 'Cylinder'
# as a new predictor. Do a train/test split (with random_state = 0), and fit your model to
# the training data.
X = df[['Mileage', 'Cruise', 'Leather', 'Cylinder']]   # reg3's predictors plus Cylinder
y = df['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
reg4 = LinearRegression()
reg4.fit(X_train, y_train)
Let's look at the RMSE of the new model, as well as the R-squared statistic.
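A sketch of how these could be computed for reg4 (variable names illustrative):

rmse4 = np.sqrt(((reg4.predict(X_test) - y_test) ** 2).mean())
print("RMSE: {:.2f}".format(rmse4))
print("R-squared: {:.2f}".format(reg4.score(X_train, y_train)))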
The new cylinder feature reduces the RMSE by over 10 percent. Let's look at predicted vs. actual for
the new model.
Some of the predictions are still really bad. Would scaling the data help us make better predictions?
Let's check the R-squared score of our new model with scaled data. How does it compare to the R-
squared value above for the unscaled data?
X_train_s = X_train.apply(zscore)   # standardize the predictors before fitting
regs = LinearRegression()
regs.fit(X_train_s, y_train)
With Scikit-Learn, all features must be numeric. Let's now transform all categorical variables into
numeric variables using the dummy variables approach.
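One plausible way to do this (a sketch): pd.get_dummies with drop_first=True, which matches the column list below, where one level of each category is absent:

# replace the Make and Type columns with 0/1 dummy columns
df = pd.get_dummies(df, drop_first=True)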
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 804 entries, 0 to 803
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Price 804 non-null float64
1 Mileage 804 non-null int64
2 Cylinder 804 non-null int64
3 Liter 804 non-null float64
4 Doors 804 non-null int64
5 Cruise 804 non-null int64
6 Sound 804 non-null int64
7 Leather 804 non-null int64
8 Make_Cadillac 804 non-null uint8
9 Make_Chevrolet 804 non-null uint8
10 Make_Pontiac 804 non-null uint8
11 Make_SAAB 804 non-null uint8
12 Make_Saturn 804 non-null uint8
13 Type_Coupe 804 non-null uint8
14 Type_Hatchback 804 non-null uint8
15 Type_Sedan 804 non-null uint8
16 Type_Wagon 804 non-null uint8
dtypes: float64(2), int64(6), uint8(9)
memory usage: 57.4 KB
Let's build a linear model in which we use all of the features as predictors, and look at the values of
all the coefficients.
#@ 18 Make a model with all features.
# First, create X and y where y contains 'Price' values
# and X contains the other columns of df.
# Next, perform a test train split to get X_train, X_test, etc.
# Then create a linear model using LinearRegression. Call your model reg5.
# Finally, print the coefficients of your model (use a loop
# in printing all coefficients except the intercept).
X = df.drop('Price', axis=1)
y = df['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

reg5 = LinearRegression()
reg5.fit(X_train, y_train)

print("intercept: {:.2f}".format(reg5.intercept_))
print("coefficients:")
# pair each coefficient with its predictor name (the target 'Price' is not in X)
for name, coef in zip(X_train.columns, reg5.coef_):
    print("  {}: {:.2f}".format(name, coef))
intercept: 34393.88
coefficients:
  Mileage: -0.19
  Cylinder: -1217.73
  Liter: 5674.20
  Doors: -5338.01
  Cruise: 149.11
  Sound: 150.82
  Leather: 112.58
  Make_Cadillac: 15638.25
  Make_Chevrolet: -1616.20
  Make_Pontiac: -1831.79
  Make_SAAB: 10623.51
  Make_Saturn: -1345.82
  Type_Coupe: -12563.40
  Type_Hatchback: -2209.53
  Type_Sedan: -2221.69
  Type_Wagon: 1762.21
It is interesting to look at the coefficient values. We see that Saab commands a premium price, and
that coupes are associated with a low price compared to sedans. None of the features like cruise
control, sound system, or leather seats makes a big difference in the predictions from this model.
#@ 19 Print the r-squared value for your model based on the training data
# and also print the RMSE based on the test data.
print("R-squared: {:.2f}".format(reg5.score(X_train,y_train)))
print("RMSE: {:.2f}".format(rmse))
R-squared: 0.94
RMSE: 2685.22
We can add even more features to the model by adding derived features. Let's try the
PolynomialFeatures class.
#@ 20 From your NumPy array X create an extended data set X_poly using
# PolynomialFeatures with degree=2. Assign your PolynomialFeatures
# object to variable pf.
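A possible sketch for this step (assuming X holds all predictor columns of df, i.e. everything except Price):

pf = PolynomialFeatures(degree=2)
X_poly = pf.fit_transform(X)   # adds squared and pairwise-product features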
As a sanity check, how many rows and columns are in the new data set?
X_poly.shape
(804, 153)
Let's go for broke and build a model using all features. What is the RMSE on the test data for such a
model?
X_train, X_test, y_train, y_test = train_test_split(X_poly, y, test_size=0.25, random_state=0)
reg6 = LinearRegression()
reg6.fit(X_train, y_train)
rmse = np.sqrt(((reg6.predict(X_test) - y_test)**2).mean())
print("RMSE: {:.2f}".format(rmse))
RMSE: 1446.13
This RMSE value is much lower than before, but our model has 153 features! It would be good to
know which feature is the most important. As an experiment, let's look at the RMSE if we use only
the first feature.
# compute negated mean squared error scores using 5-fold cross validation
X_0 = X_train[:, [0]]   # just the first column (first feature) of X_train
scores = cross_val_score(LinearRegression(), X_0, y_train, scoring='neg_mean_squared_error', cv=5)
rmse_0 = np.sqrt(-scores.mean())
Now we'll compute the RMSE for each individual feature, and see which is lowest.
#@ 22 Using the ideas in the last cell, compute the RMSE for each individual
# feature using 5-fold cross validation. Save the index and RMSE associated
# with the best feature as the two variables i_min and rmse_min.
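One way to fill in the per-feature errors (a sketch, following the single-feature example above; X_train here is the polynomial-feature array):

# cross-validated RMSE for each single column of X_train
errors = np.array([
    np.sqrt(-cross_val_score(LinearRegression(), X_train[:, [i]], y_train,
                             scoring='neg_mean_squared_error', cv=5).mean())
    for i in range(X_train.shape[1])
])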
rmse_min = errors.min()
i_min = np.where(errors==errors.min())[0][0]
Which is the best set of 10 features? The number of possible sets of 10 features is 153-choose-10,
that is, (153 * 152 * 151 * ... * 144) / 10!, which is a gigantic number (more than a million billion). Instead we can use
forward feature selection, which is a kind of greedy method.
remaining = list(range(X_train.shape[1]))
selected = []
n = 10
while len(selected) < n:
    # find the single feature that works best in conjunction
    # with the already-selected features
    rmse_min = 1e7
    for i in remaining:
        selected.append(i)
        scores = cross_val_score(LinearRegression(), X_train[:, selected], y_train,
                                 scoring='neg_mean_squared_error', cv=5)
        rmse = np.sqrt(-scores.mean())
        selected.pop()
        if rmse < rmse_min:
            rmse_min = rmse
            i_min = i
    remaining.remove(i_min)
    selected.append(i_min)
    print('num features: {}; rmse: {:.2f}'.format(len(selected), rmse_min))
How does the test RMSE of the model created using the 10 features found with forward feature
selection compare with the test RMSE of the model that uses all the features?
reg6 = LinearRegression()
reg6.fit(X_train[:, selected], y_train)
rmse = np.sqrt(((reg6.predict(X_test[:, selected]) - y_test)**2).mean())
print("test RMSE with 10 features: {:.2f}".format(rmse))