House Price Prediction Using Machine Learning in Python
We have all been through the experience of looking for a new house to buy, a journey that involves dealing with potential frauds, negotiating deals, researching the local area and so on. In this article we build machine learning models in Python that predict a house's sale price from its features.
Some of the columns in the dataset and what they describe (the dataset has 13 columns in total):
10. Exterior1st – Exterior covering on the house
11. BsmtFinSF2 – Type 2 finished square feet
12. TotalBsmtSF – Total square feet of basement area
13. SalePrice – The target value to be predicted
To get started, we load the dataset into a pandas DataFrame and preview the first few records.
Python3
import pandas as pd

# Load the dataset and show the first five rows
dataset = pd.read_excel("HousePricePrediction.xlsx")
print(dataset.head(5))
Output:
Now that the data is imported, the shape attribute shows the dimensions of the dataset.
Python3
dataset.shape
Output:
(2919, 13)
Data Preprocessing
Now we categorize the features by their datatype (int, float, object) and count how many columns fall into each group.
Python3
obj = (dataset.dtypes == 'object')
object_cols = list(obj[obj].index)
print("Categorical variables:", len(object_cols))

int_ = (dataset.dtypes == 'int')
num_cols = list(int_[int_].index)
print("Integer variables:", len(num_cols))

fl = (dataset.dtypes == 'float')
fl_cols = list(fl[fl].index)
print("Float variables:", len(fl_cols))
Output:
Categorical variables: 4
Integer variables: 6
Float variables: 3
Next, let's draw a heatmap of the correlations between the numerical features.
Python3
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation heatmap of the numerical columns
plt.figure(figsize=(12, 6))
sns.heatmap(dataset.corr(numeric_only=True),  # numeric_only needs pandas >= 1.5
            cmap='BrBG',
            fmt='.2f',
            linewidths=2,
            annot=True)
Output:
To analyze the categorical features, we first plot how many unique values each of them has.
Python3
unique_values = []
for col in object_cols:
    unique_values.append(dataset[col].unique().size)

plt.figure(figsize=(10, 6))
plt.xticks(rotation=90)
sns.barplot(x=object_cols, y=unique_values)
Output:
The plot shows that Exterior1st has around 16 unique categories, while the other features have around 6. To find out the actual count of each category, we can plot a bar graph for each of the four categorical features separately.
Python3
plt.figure(figsize=(18, 36))
plt.xticks(rotation=90)
index = 1

# One bar plot of category counts per categorical feature
for col in object_cols:
    y = dataset[col].value_counts()
    plt.subplot(11, 4, index)
    plt.xticks(rotation=90)
    sns.barplot(x=list(y.index), y=y)
    index += 1
Output:
Data Cleaning
Data cleaning is the process of improving the data by correcting or removing incorrect, corrupted or irrelevant records.
In our dataset there are some columns that are not important or relevant for model training, so we can drop them before training. There are two approaches to dealing with empty/null values:
1. Delete the column/row (if the feature or record is not very important).
2. Fill the empty slots with the mean/mode/0/NA/etc., depending on what the dataset requires (a short sketch of this approach follows the list).
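For example, a categorical column could be filled with its most frequent value (its mode); a minimal sketch, using Exterior1st purely as an illustration:
Python3
# Fill missing values in a categorical column with its most frequent value (mode)
dataset['Exterior1st'] = dataset['Exterior1st'].fillna(
    dataset['Exterior1st'].mode()[0])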
The Id column will not participate in any prediction, so we can drop it.
Python3
dataset.drop(['Id'],
axis=1,
inplace=True)
Replace the missing SalePrice values with the column mean to keep the data distribution symmetric.
Python3
dataset['SalePrice'] = dataset['SalePrice'].fillna(
dataset['SalePrice'].mean())
Drop the records that still contain null values (there are very few such records).
Python3
new_dataset = dataset.dropna()
Check which features still have null values in the new dataframe (if any remain).
Python3
new_dataset.isnull().sum()
Output:
OneHotEncoder – for categorical features
One-hot encoding converts categorical data into binary vectors: each category becomes its own 0/1 column. Using OneHotEncoder we can easily convert the object columns into numeric form. To do so, we first collect all the features that have the object datatype.
Python3
s = (new_dataset.dtypes == 'object')
object_cols = list(s[s].index)
print("Categorical variables:")
print(object_cols)
print("No. of categorical features:", len(object_cols))
Output:
Once we have the list of all the categorical features, we can apply OneHotEncoding to the whole list.
Python3
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the categorical columns (use sparse=False on scikit-learn < 1.2)
OH_encoder = OneHotEncoder(sparse_output=False)
OH_cols = pd.DataFrame(OH_encoder.fit_transform(new_dataset[object_cols]))
OH_cols.index = new_dataset.index
OH_cols.columns = OH_encoder.get_feature_names_out()
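The regression models below work on a dataframe called df_final that combines the numeric columns of new_dataset with the one-hot encoded columns. That assembly step is not shown above, so here is a minimal sketch of one way to build it (treat the exact construction as an assumption):
Python3
# Drop the original categorical columns and append their one-hot encoded versions
df_final = new_dataset.drop(object_cols, axis=1)
df_final = pd.concat([df_final, OH_cols], axis=1)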
SVM – Support Vector Machine
Before training we separate the target (SalePrice) from the features and split the data into training and validation sets; then we fit scikit-learn's SVR regressor and evaluate it with the mean absolute percentage error.
Python3
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_percentage_error

X = df_final.drop(['SalePrice'], axis=1)
Y = df_final['SalePrice']

# 80/20 train/validation split (the split ratio and random_state are illustrative choices)
X_train, X_valid, Y_train, Y_valid = train_test_split(
    X, Y, train_size=0.8, test_size=0.2, random_state=0)

model_SVR = svm.SVR()
model_SVR.fit(X_train, Y_train)
Y_pred = model_SVR.predict(X_valid)
print(mean_absolute_percentage_error(Y_valid, Y_pred))
Output :
0.18705129
Random Forest Regression
Random Forest is an ensemble technique that combines multiple decision trees and can be used for both regression and classification tasks.
Python3
from sklearn.ensemble import RandomForestRegressor

# Random Forest with 10 trees, evaluated with mean absolute percentage error
model_RFR = RandomForestRegressor(n_estimators=10)
model_RFR.fit(X_train, Y_train)
Y_pred = model_RFR.predict(X_valid)
print(mean_absolute_percentage_error(Y_valid, Y_pred))
Output :
0.1929469
Linear Regression
Linear Regression predicts the dependent output value from the given independent features. Here, for example, we predict SalePrice from features such as MSSubClass, YearBuilt, BldgType, Exterior1st, etc.
Python3
from sklearn.linear_model import LinearRegression
model_LR = LinearRegression()
model_LR.fit(X_train, Y_train)
Y_pred = model_LR.predict(X_valid)
print(mean_absolute_percentage_error(Y_valid, Y_pred))
Output :
0.187416838
Conclusion
Clearly, the SVM model gives the best result here: its mean absolute percentage error is the lowest of all the regressors, at roughly 0.187 (about 18.7% average error), with Linear Regression a very close second. To get better results, ensemble learning techniques like Bagging and Boosting can also be used, as sketched below.
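As a sketch of that suggestion, scikit-learn's BaggingRegressor and GradientBoostingRegressor can be evaluated in exactly the same way as the models above; the hyperparameters below are illustrative assumptions, not tuned values.
Python3
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_percentage_error

# Bagging: many regressors trained on bootstrap samples, predictions averaged
model_bag = BaggingRegressor(n_estimators=50, random_state=0)
model_bag.fit(X_train, Y_train)
print("Bagging MAPE:", mean_absolute_percentage_error(Y_valid, model_bag.predict(X_valid)))

# Boosting: trees added sequentially, each one correcting the errors of the previous ones
model_gbr = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05, random_state=0)
model_gbr.fit(X_train, Y_train)
print("Boosting MAPE:", mean_absolute_percentage_error(Y_valid, model_gbr.predict(X_valid)))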