House Price Prediction
Experiment Findings · February 2021
DOI: 10.13140/RG.2.2.27657.98408
1 author:
Udit Deo
Maulana Azad National Institute of Technology, Bhopal
All content following this page was uploaded by Udit Deo on 21 February 2021.
HOUSE PRICE PREDICTION USING REGRESSION
TECHNIQUES
A MINI PROJECT
REPORT
Submitted by
Udit Deo
17bcs057
Uday Deo
17bcs056
B. TECH.
IN
Computer Science and Engineering
SHRI MATA VAISHNO DEVI UNIVERSITY
November 29, 2019
SHRI MATA VAISHNO DEVI UNIVERSITY
CERTIFICATE
Certified that this project report “HOUSE PRICE PREDICTION USING
REGRESSION TECHNIQUES” is the work of Udit Deo and Uday Deo, who carried out the mini
project work under my supervision.
Pooja Sharma
Assistant Professor
CSE, SMVDU
Submitted to the Viva voce Examination held on _____________________
INTERNAL EXAMINER EXTERNAL EXAMINER
TABLE OF CONTENTS
Chapter Number Contents
Abstract
List of Tables
List of Figures
List of Symbols and Abbreviations
1. INTRODUCTION
1.1 AIM and IMPORTANCE
1.1.1 Aim
1.1.2 Need and Motivation
2. DATASET
2.1 Steps in Preparing Data for Model
2.1.1 Data Exploration
2.1.2 Data Visualization
2.1.3 Data Selection
2.1.4 Data Transformation
3. LANGUAGE AND MODELS USED
3.1 Python
3.1.1 Jupyter Notebook
3.1.2 NumPy
3.1.3 Pandas
3.1.4 Seaborn
3.1.5 Matplotlib
3.2 Models Used
3.2.1 Multiple Linear Regression
3.2.2 Random Forest Regressor
3.2.3 XGBoost Regressor
4. RESULTS AND DISCUSSIONS
4.1 BEST SUITED MODEL
4.2 DEPLOYMENT APP
5. CONCLUSION
Abstract
House price forecasting is an important topic in real estate. The literature attempts to
derive useful knowledge from historical data on property markets. In this work, machine
learning techniques are applied to historical property transactions in India to discover
models that are useful to house buyers and sellers. The analysis reveals a large discrepancy
between house prices in the most expensive and the most affordable suburbs of Mumbai.
Moreover, experiments demonstrate that Multiple Linear Regression, assessed using a
mean-squared-error measure, is a competitive approach.
INTRODUCTION
AIM and IMPORTANCE
Aim
We evaluate the project against the following objectives:
• Create an effective price prediction model
• Validate the model’s prediction accuracy
• Identify the important home price attributes which feed the model’s predictive power.
Need and Motivation
Having lived in India for many years, I had taken one thing for granted: that housing and
rental prices continue to rise. Since the housing crisis of 2008, housing prices have
recovered remarkably well, especially in major housing markets. However, in the
4th quarter of 2016, I was surprised to read that Bombay housing prices had fallen the most in
the last 4 years. In fact, median resale prices for condos and co-ops fell 6.3%, marking the first
time there was a decline since Q1 of 2017. The decline has been partly attributed to political
uncertainty domestically and abroad and to the 2014 election. A price prediction model helps
maintain transparency for customers and makes comparison easy: if a customer finds that the
price of a house listed on a website is higher than the price predicted by the model, they
can reject that house.
DATASET
We web-scraped the data from the [Link] website, one of the leading real estate websites
operating in India.
Our data contains Bombay houses only.
The dataset looks as follows:
Data Exploration
Data exploration is the first step in data analysis and typically involves summarizing the main
characteristics of a data set, including its size, accuracy, initial patterns in the data and other
attributes. It is commonly conducted by data analysts using visual analytics tools, but it can
also be done with more advanced statistical tools such as Python. Before it can conduct analysis on
data collected by multiple data sources and stored in data warehouses, an organization must
know how many cases are in a data set, what variables are included, how many missing
values there are and what general hypotheses the data is likely to support. An initial
exploration of the data set can help answer these questions by familiarizing analysts with the
data with which they are working.
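The questions listed above (how many cases, which variables, how many missing values) can be answered with a few pandas calls. The following is a minimal sketch on a small made-up table; the column names (area_sqft, bedrooms, price) and all values are illustrative assumptions, not the scraped dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the scraped listings; columns and values are
# illustrative assumptions only.
df = pd.DataFrame({
    "area_sqft": [650.0, 900.0, np.nan, 1200.0, 480.0],
    "bedrooms": [1, 2, 2, 3, 1],
    "price": [9.5e6, 1.4e7, 1.1e7, np.nan, 7.2e6],
})

n_rows, n_cols = df.shape   # how many cases and how many variables
dtypes = df.dtypes          # type of each variable
missing = df.isna().sum()   # number of missing values per column
summary = df.describe()     # count, mean, std and quartiles per column

print(f"{n_rows} rows, {n_cols} columns")
print(dtypes)
print(missing)
print(summary)
```

On a real dataset, the missing-value counts are usually the first thing to act on, since most regression models cannot be fitted on rows that contain gaps.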
We split the data 9:1 into training and testing sets, respectively.
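A 9:1 split of this kind can be sketched with scikit-learn's train_test_split. The data below is synthetic, and the random_state value is an arbitrary assumption made only so that the shuffle is reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)

# test_size=0.1 holds out 10% of the rows for testing; random_state
# makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42
)

print(X_train.shape, X_test.shape)
```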
Data Visualization
Data visualization is the graphical representation of information and data. By
using visual elements like charts, graphs, and maps, data visualization tools provide an
accessible way to see and understand trends, outliers, and patterns in data. In the
world of Big Data, data visualization tools and technologies are essential to analyse
massive amounts of information and make data-driven decisions.
Data Selection
Data selection is defined as the process of determining the appropriate data
type and source, as well as suitable instruments to collect data. Data selection
precedes the actual practice of data collection. This definition distinguishes data
selection from selective data reporting (selectively excluding data that is not
supportive of a research hypothesis) and interactive/active data selection (using
collected data for monitoring activities/events, or conducting secondary data
analyses). The process of selecting suitable data for a research project can impact data
integrity.
The primary objective of data selection is the determination of appropriate data type,
source, and instrument(s) that allow investigators to adequately answer research
questions. This determination is often discipline-specific and is primarily driven by
the nature of the investigation, existing literature, and accessibility to necessary data
sources.
Correlation Heatmap
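The numbers behind a correlation heatmap are the pairwise correlations between the numeric columns. The sketch below computes them with pandas on synthetic data; the column names and the relationship between them are assumptions made for illustration and do not reproduce the figure from this report.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; column names are illustrative assumptions.
rng = np.random.default_rng(0)
area = rng.uniform(300, 2000, size=200)
bedrooms = np.clip(np.round(area / 500), 1, None)
price = 15000 * area + rng.normal(0, 2e6, size=200)
df = pd.DataFrame({"area": area, "bedrooms": bedrooms, "price": price})

# Pairwise Pearson correlations between the numeric columns.
corr = df.corr()
print(corr.round(2))

# Features ranked by absolute correlation with the target column.
ranking = corr["price"].drop("price").abs().sort_values(ascending=False)
print(ranking)

# To draw the heatmap itself, seaborn offers:
#   import seaborn as sns
#   sns.heatmap(corr, annot=True, cmap="coolwarm")
```

Ranking features by their correlation with the target is one simple way to decide which attributes to keep for the model.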
Data Transformation
The log transformation can be used to make highly skewed distributions less skewed. This
can be valuable both for making patterns in the data more interpretable and for helping to
meet the assumptions of inferential statistics.
It is hard to discern a pattern in the upper panel whereas the strong relationship is shown
clearly in the lower panel. The comparison of the means of log-transformed data is actually a
comparison of geometric means. This occurs because, as shown below, the anti-log of
the arithmetic mean of log-transformed values is the geometric mean.
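Both claims can be checked numerically. For positive values x_1, ..., x_n, exp((1/n) * sum(ln x_i)) equals (x_1 * ... * x_n)^(1/n), which is the geometric mean. The sketch below uses synthetic log-normal "prices" (an assumption, not the project data) to show the skewness dropping after the log transformation and to confirm the identity on a small sample.

```python
import numpy as np

# Synthetic right-skewed "prices" drawn from a log-normal distribution
# (illustrative only).
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=16.0, sigma=0.8, size=1000)

def skewness(x):
    """Sample skewness: the third standardized moment."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

log_prices = np.log(prices)
print("skew before log:", skewness(prices))
print("skew after log: ", skewness(log_prices))

# Anti-log of the arithmetic mean of the logs ...
small = prices[:5]
geo_from_logs = np.exp(np.log(small).mean())
# ... equals the geometric mean, the n-th root of the product
# (a small sample keeps the product well within float range).
geo_direct = np.prod(small) ** (1.0 / len(small))
print(np.isclose(geo_from_logs, geo_direct))
```

A model fitted on log-transformed prices predicts log prices, so its predictions need to be passed through the exponential to get back to the original price scale.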
Skewed Price
Normal Price
Skewed Area
Normal Area
Skewed Price/Sq.
Normal Price/Sq.
LANGUAGE AND MODELS USED
Python
Python is widely used in scientific and numeric computing:
• SciPy is a collection of packages for mathematics, science, and engineering.
• Pandas is a data analysis and modelling library.
• IPython is a powerful interactive shell that features easy editing and recording of a work
session, and supports visualizations and parallel computing.
• The Software Carpentry Course teaches basic skills for scientific computing, running
bootcamps and providing open-access teaching materials.
Libraries used for this project include:
• Pandas
• NumPy
• Matplotlib
• Seaborn
• scikit-learn
• XGBoost
MODELS USED
Regression Model
• Linear Regression is a machine learning algorithm based on supervised learning.
• It performs a regression task. Regression models a target prediction value based on
independent variables.
• It is mostly used for finding out the relationship between variables and forecasting.
Real Vs Predicted
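As an illustration of the points above, the following sketch fits a multiple linear regression with scikit-learn on synthetic data whose true coefficients are known. The data, the coefficients, and the noise level are assumptions chosen for the example; this is not the model trained in this project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data with a known linear relationship (illustrative only):
# y = 2.0 * x0 - 1.0 * x1 + 0.5 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 + rng.normal(0, 0.1, size=200)

# Fit by ordinary least squares.
model = LinearRegression()
model.fit(X, y)

# Predictions that a "real vs predicted" plot would compare against y.
y_pred = model.predict(X)
rmse = float(np.sqrt(mean_squared_error(y, y_pred)))

print("coefficients:", model.coef_)
print("intercept:", model.intercept_)
print("RMSE:", rmse)
```

The fitted coefficients show how strongly each independent variable moves the prediction, which is why linear regression is convenient for studying relationships between variables as well as for forecasting.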
Random Forest Regression Model
• A Random Forest is an ensemble technique capable of performing both regression and
classification tasks with the use of multiple decision trees and a technique
called Bootstrap Aggregation, commonly known as bagging.
• Bagging, in the Random Forest method, involves training each decision tree on a
different data sample where sampling is done with replacement.
• The basic idea behind this is to combine multiple decision trees in determining the
final output rather than relying on individual decision trees.
Real Vs Predicted
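The bagging idea described above can be sketched by hand: each tree is trained on a bootstrap sample drawn with replacement, and the trees' predictions are averaged. The data and the number of trees are assumptions for illustration. scikit-learn's RandomForestRegressor packages the same idea and can additionally restrict the features considered at each split, so this sketch is plain bagging rather than a full random forest.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic regression data (illustrative only).
X = rng.uniform(0, 1, size=(300, 2))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(0, 0.1, size=300)
X_train, y_train = X[:200], y[:200]
X_test, y_test = X[200:], y[200:]

n_trees = 25  # hypothetical ensemble size
trees = []
for i in range(n_trees):
    # Bootstrap sample: draw n row indices with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeRegressor(random_state=i)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Combine the trees by averaging their predictions.
all_preds = np.stack([t.predict(X_test) for t in trees])
bagged_pred = all_preds.mean(axis=0)

def rmse(y_true, y_hat):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_hat) ** 2)))

single_rmse = rmse(y_test, trees[0].predict(X_test))
bagged_rmse = rmse(y_test, bagged_pred)
print("single tree RMSE:", single_rmse)
print("bagged RMSE:     ", bagged_rmse)
```

Averaging cannot do worse than the trees do on average in terms of mean squared error, which is the reason for combining many trees rather than relying on one.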
XGBoost Regressor Model
• XGBoost stands for eXtreme Gradient Boosting.
• The XGBoost library implements the gradient boosting decision tree algorithm.
• Boosting is an ensemble technique where new models are added to correct the errors
made by existing models.
• Models are added sequentially until no further improvements can be made.
Real Vs Predicted
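The sequential error-correcting idea described above can be sketched with plain gradient boosting for squared error, built from scikit-learn decision trees. This is a simplified illustration and not the XGBoost implementation, which adds regularization and other refinements and is exposed in Python through xgboost.XGBRegressor. The data, learning rate, tree depth, and number of rounds are assumptions chosen for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic non-linear regression data (illustrative only).
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=300)

learning_rate = 0.1  # shrinks each tree's contribution
n_rounds = 100       # hypothetical cap on the number of trees

# Start from a constant prediction: the mean of the target.
base = y.mean()
pred = np.full_like(y, base)
trees = []
train_mse = [float(np.mean((y - pred) ** 2))]

for _ in range(n_rounds):
    # For squared error, the negative gradient is the residual.
    residual = y - pred
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residual)
    # Add the new tree to correct the current ensemble's errors.
    pred = pred + learning_rate * tree.predict(X)
    trees.append(tree)
    train_mse.append(float(np.mean((y - pred) ** 2)))

def predict(X_new):
    """Sum the base value and every tree's shrunken correction."""
    out = np.full(len(X_new), base)
    for t in trees:
        out += learning_rate * t.predict(X_new)
    return out

print("MSE before boosting:", train_mse[0])
print("MSE after boosting: ", train_mse[-1])
```

In practice the number of rounds is usually chosen by watching the error on a held-out validation set and stopping once it no longer improves, since the training error alone keeps falling.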
RESULTS AND DISCUSSIONS
Best Suited Model
Our study showed the following.
Linear Regression displayed the best performance for this dataset and can be used for
deployment.
Random Forest Regressor and XGBoost Regressor performed considerably worse, so they cannot be
recommended for deployment.
RMSE Bar Graph
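RMSE is the square root of the mean of the squared differences between actual and predicted values, and a lower value is better. The sketch below shows how such a comparison can be computed. The model names and all numbers are placeholders made up for illustration; they are not results from this report.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical targets and predictions from two placeholder models;
# these numbers are made up and are not results from this report.
y_true = [3.0, 5.0, 7.0, 9.0]
predictions = {
    "model_a": [3.0, 5.0, 7.0, 9.0],
    "model_b": [4.0, 4.0, 8.0, 8.0],
}

# Score every model and pick the one with the lowest RMSE.
scores = {name: rmse(y_true, p) for name, p in predictions.items()}
best = min(scores, key=scores.get)
print(scores)
print("lowest RMSE:", best)
```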
Deployment App
The model is deployed as a web app built with Flask, a Python web framework, with an HTML
and CSS front end.
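A minimal sketch of such a Flask app is given below. The /predict route, the area_sqft field, and the predict_price rule are hypothetical names and values introduced for illustration; the actual app's routes, inputs, and templates are not described in this report. The sketch returns JSON for brevity, whereas an HTML front end would typically render a template with the result.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict_price(area_sqft):
    """Placeholder for the trained model's prediction.

    A real app would load the fitted regressor and call its predict
    method; this linear rule is made up for illustration.
    """
    return 10000.0 * area_sqft

@app.route("/predict", methods=["POST"])
def predict():
    # Read the JSON body; fall back to an empty dict if it is missing.
    data = request.get_json(silent=True) or {}
    try:
        area = float(data["area_sqft"])
    except (KeyError, TypeError, ValueError):
        return jsonify({"error": "area_sqft must be a number"}), 400
    return jsonify({"predicted_price": predict_price(area)})

# Start the development server from a shell with, for example:
#   flask --app <module_name> run
```

Validating the input before calling the model keeps malformed requests from reaching the prediction code and gives the user a clear error instead.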
Conclusion
Our aim is achieved: we have met all the objectives listed in the Aim section. Circle rate
is the most effective attribute in predicting the house price, and Linear Regression is the
most effective model for our dataset, with an RMSE of approximately 0.5026.
APPENDIX – 4
LIST OF TABLES
TABLE TITLE
4.1 Raw Bombay House Dataset
4.2 Transformed House Dataset
APPENDIX – 5
LIST OF FIGURES
FIGURE TITLE
4.1 Bar Plots of Various Parameters
4.2 Correlation Heatmap
4.3 Skewed Plot
4.4 Real Vs Predicted Plots
4.5 Comparison Bar Plot
4.6 Website View
APPENDIX – 6
ABBREVIATIONS
SqM Square Meter
SqFt Square Feet
LR Linear Regression
XGB eXtreme Gradient Boosting
RMSE Root Mean Square Error
Log Logarithmic