Assignment II

This document discusses applying preprocessing and regression analysis to property data. It summarizes preprocessing steps like handling missing data, converting categorical variables to factors, and splitting data into training and test sets. Regression analysis is then performed to understand relationships between factors like area, price per square foot, and number of bathrooms on property price. The linear regression model finds several variables have a statistically significant relationship with price. Model performance is evaluated by comparing actual and predicted prices in the test set.

Uploaded by

21324jesika

Kathmandu University School of Management

Big Data Analytics

Assignment-II

Submitted by:

Aayush Kumar Chaudhary (21305)

MBA (Spring 2021)
August, 2021

Submitted to:

Asst. Prof Baljeet Kaur

Pre Processing and Regression Analysis

Question: Apply preprocessing and regression analysis to the given property data.

Regression analysis is used to understand the relationships among different factors so that we can
determine which factors matter most, which can be disregarded, and how the factors interact. To
analyze the data in the MagicBricks.csv file and see how its attributes relate to each other, I
have used a linear regression approach, since the target variable (Price) is continuous.

Details of Data

The dataset consists of 11 columns and 1259 rows. Furnishing, Locality, Status, Transaction, and
Type are of character type, while Area, BHK, Bathroom, Parking, Price, and Per_Sqft are numeric.
On closer examination, Area, Price, and Per_Sqft are continuous variables, whereas the rest are
categorical. Although BHK, Parking, and Bathroom hold numeric data, each takes only a small set of
repeated values, so they can also be regarded as categorical. The goal is to examine the effect of
these factors on the price of a property, which is the target variable.

Explanation of Code

The dataset contains numerous missing values that must be handled before modeling by replacing
them with values that suit each variable. The first step is therefore preprocessing, which begins
with installing and loading the necessary libraries and then using their functions to bring the
data into the desired format.

The dplyr and stringr libraries are used to manipulate the original dataset, for example to delete,
edit, or add values. caTools is used to split the data into training and testing sets.

The data in the MagicBricks.csv file is read into the propertyData variable using the read.csv()
function.


Preprocessing

I have created a Mode function that is used to calculate the mode (the most frequent value) of
categorical variables.
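As a minimal sketch, the Mode function from the code listing can be tried on small vectors. The sample values below are invented for illustration and are not taken from MagicBricks.csv.

```r
# Mode: most frequent value in a vector (the first one wins on ties)
Mode <- function(x) {
  ux <- unique(x)                        # distinct values, in order of appearance
  ux[which.max(tabulate(match(x, ux)))]  # value with the highest count
}

# Made-up sample values, for illustration only
furnishing <- c("Semi", "Furnished", "Semi", "Unfurnished", "Semi")
bathrooms  <- c(2, 3, 2, 1, 3, 2)

Mode(furnishing)  # "Semi" occurs three times
Mode(bathrooms)   # 2 occurs three times
```

The same function works for character and numeric columns, which is why it can be reused for Furnishing, Type, Bathroom, and Parking.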

• nrow(propertyData): returns the number of rows

• length(propertyData): returns the number of columns

Manually searching every column for missing or NA data is not practical. To count the missing
values in each column, I used the sapply() function. Missing values in character variables appear
as empty strings, while missing values in numeric variables appear as NA. To get the exact number
of missing values per column, I have therefore used a condition that checks for empty strings in
character columns and for NA in all other columns.
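The per-column check described above can be sketched on a toy data frame. This version uses base R's trimws() in place of stringr's str_trim() so that it runs without extra packages; the helper name count_missing and the data values are assumptions made for this example.

```r
# Toy data frame with both kinds of missing values (values are invented)
toy <- data.frame(
  Bathroom   = c(2, NA, 3, 2),
  Furnishing = c("Semi", "", "Furnished", "  "),
  stringsAsFactors = FALSE
)

# Hypothetical helper: empty or blank strings count as missing for
# character columns, NA counts as missing for every other column type
count_missing <- function(x) {
  if (is.character(x)) sum(trimws(x) == "") else sum(is.na(x))
}

missing_counts <- sapply(toy, count_missing)
missing_counts  # Bathroom: 1, Furnishing: 2
```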

The output for the sapply() function is:

We can see that Bathroom has 2 missing values, Furnishing has 5, Parking has 33, Type has 5, and
Per_Sqft has 241. Since the given CSV file has only 1259 rows, which is quite low, I have not
removed the rows with missing values. Instead, I have replaced the missing values with the mode or
the mean, whichever fits the variable best.

The code counts the number of unique values in Locality using length(unique()), and the subset()
function is used to remove the Locality column from the propertyData dataset. The Locality
variable has 365 distinct values across 1259 rows, so each value occurs only a few times on
average. As a result, the model cannot gain much insight from the values of Locality. Therefore, I
eliminated the Locality column. Additionally, if Locality were used when developing the model, the
test data could contain locality values that the model was not trained on, which could result in
an error at prediction time.
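The check and the column removal can be sketched as follows. The locality names and areas are invented for illustration; only the calls length(unique()) and subset() come from the report.

```r
# Toy data frame; the values are invented for illustration only
toy <- data.frame(
  Locality = c("Rohini", "Dwarka", "Lajpat Nagar", "Dwarka", "Saket"),
  Area     = c(800, 1200, 950, 1100, 1500),
  stringsAsFactors = FALSE
)

n_localities <- length(unique(toy$Locality))  # number of distinct localities
n_localities / nrow(toy)                      # distinct values as a share of rows

# Drop the column by name
toy <- subset(toy, select = -Locality)
names(toy)
```

A share close to 1 means almost every row has its own locality, which leaves very few observations per level for the model to learn from.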

The mode of each variable is calculated using the user-defined Mode function that I created at
the beginning of the code.

The missing values are then filled in with the mode. To ensure that only missing values are
replaced and that all other values remain unchanged, I have used an ifelse() condition. The mode
is used to fill the gaps in all categorical variables. If the missing values in the Parking and
Bathroom columns were replaced with the mean, treating these columns as continuous, the result
might not be meaningful. For instance, a mean of 2.75 bathrooms cannot be interpreted, because the
value has to be discrete, either 2 or 3. Therefore, treating Parking and Bathroom as categorical
variables and replacing their missing values with the mode produces relevant output.
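A short sketch of why the mode is preferred for count-like columns, using invented Parking values rather than the real data:

```r
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Made-up Parking counts with two missing entries
parking <- c(1, 2, NA, 1, 1, NA, 3)

mean(parking, na.rm = TRUE)   # 1.6, not a valid number of parking spaces
mode_parking <- Mode(parking) # 1, the most frequent observed value

# Replace only the NA entries; all other values are left unchanged
parking_filled <- ifelse(is.na(parking), mode_parking, parking)
parking_filled
```

Note that this Mode function counts NA as a value too, so it would return NA if NA were the most frequent entry. That is not the case here, nor for the missing-value counts reported above.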

The missing values in the Per_Sqft column are replaced by the mean of that column, since the data
is continuous.
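Mean imputation for a continuous column can be sketched like this, with invented per-square-foot values:

```r
# Made-up Per_Sqft values with two missing entries
per_sqft <- c(5000, NA, 7000, 9000, NA)

# Mean of the observed values only (na.rm = TRUE skips the NAs)
observed_mean <- mean(per_sqft, na.rm = TRUE)

# Logical indexing assigns the mean to the NA positions only
per_sqft[is.na(per_sqft)] <- observed_mean
per_sqft
```

One consequence worth keeping in mind: with 241 of 1259 values imputed, a large share of rows ends up with the same Per_Sqft value, which reduces the variance of that column.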


When we now run the same sapply() call to count the missing values per column, we can see that
there are no missing values left.

Linear regression works with numeric and categorical predictors, so the columns with character
values are converted to factor type. The variables BHK, Bathroom, and Parking can also be seen as
categorical, but they hold a fixed set of numeric values that R can use directly. If they are left
as numbers, the model fits a single slope for each of them, so converting them to factors is not
necessary here and would only add lines of code. For character columns, lm() converts the values
to factors internally, but converting them explicitly with as.factor() makes the levels explicit
and easy to inspect with str(). Therefore, only the character-type categorical variables are
converted to factors.

When categorical data is stored as factors, R's statistical modeling functions handle it
correctly. For data science and machine learning, it is important for the variables to be of the
right data type. The as.factor() function is used to convert the variables into factors.
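A minimal sketch of the conversion on a toy data frame (the Type labels and prices are invented for illustration):

```r
toy <- data.frame(
  Type  = c("Apartment", "Builder_Floor", "Apartment"),
  Price = c(6500000, 5000000, 15500000),
  stringsAsFactors = FALSE
)

is.character(toy$Type)   # TRUE before the conversion

toy$Type <- as.factor(toy$Type)

levels(toy$Type)         # the distinct categories, sorted alphabetically
as.integer(toy$Type)     # the integer code stored for each row
```

In a regression, each factor level other than the first becomes its own indicator variable, which is how a model such as lm() estimates a separate effect per category.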

The preprocessing of the data is complete at this point, and the cleaned data is now used for the
analysis. A linear regression model is built by first training it on the larger part of the
processed data and then testing it on the remaining, smaller part. A model generally predicts
better when it is trained on more representative data.

Analysis


The data is divided using the sample.split() function so that 70% of the data goes into the
training set and the remaining 30% goes into the test set. The larger share is used for training
so that the model has more data to learn from. The set.seed() function is called first so that
the same train and test sets are produced every time the code runs.
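The idea of a reproducible 70/30 split can be sketched in base R. This example swaps in sample() for caTools::sample.split() so that it runs without extra packages; the toy data and the variable name train_rows are assumptions for illustration.

```r
set.seed(10)  # fixes the random draw so the split is reproducible

n   <- 20
toy <- data.frame(Area = seq(500, by = 100, length.out = n))
toy$Price <- toy$Area * 5000   # invented prices

# Draw 70% of the row indices at random for training
train_rows <- sample(seq_len(n), size = round(0.7 * n))

train <- toy[train_rows, ]
test  <- toy[-train_rows, ]   # every row that was not drawn

c(train = nrow(train), test = nrow(test))
```

Calling set.seed(10) again before sample() reproduces exactly the same rows, which is what makes results comparable between runs.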

We can use plot(propertyData) to look at how the variables relate to one another. It draws a
scatter plot for every pair of variables, with each row and column of the grid corresponding to
one variable. The panel immediately below the box labeled "Area" shows the relationship between
Area and BHK. The more closely the points in a panel follow a line or band, the stronger the
association between the two variables. The relationship between Area and BHK can be read as: the
larger the Area, the higher the BHK tends to be.


The training dataset is used to fit the linear regression model, ppModel. This model is then used
to predict the price for the test data. To make comparison easier, I have added the predicted
price as a column at the end of the test dataset and moved the Price column so that it sits right
before the Predicted Price column.
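The fit-then-predict workflow can be sketched on simulated data. All values, the coefficients used to simulate prices, and the names toy_model and PredictedPrice are assumptions for this example, not results from the assignment.

```r
set.seed(1)

# Simulated training data: price depends on area and bathrooms, plus noise
train <- data.frame(
  Area     = c(600, 800, 1000, 1200, 1500, 1800),
  Bathroom = c(1, 2, 2, 3, 3, 4)
)
train$Price <- 5000 * train$Area + 200000 * train$Bathroom + rnorm(6, sd = 1e5)

# "Price ~ ." regresses Price on every other column in the data frame
toy_model <- lm(Price ~ ., data = train)

# Held-out rows with known prices (invented)
test <- data.frame(Area = c(900, 1400), Bathroom = c(2, 3),
                   Price = c(4900000, 7600000))

# Predictions are stored next to the actual price for comparison
test$PredictedPrice <- predict(toy_model, newdata = test)
test[, c("Price", "PredictedPrice")]
```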

We can now compare Price and Predicted Price side by side. A few predictions differ hugely from
the actual price, whereas others are very close to it. The rows with large differences can be
considered outliers, which may have resulted from improper collection of data.

Calling summary() on the linear regression model ppModel displays the estimated coefficients and
their statistical significance. Three stars to the right of a row in the table indicate a highly
significant relationship (p < 0.001). According to the table, there is a significant relationship
between the price of a property and its per-square-foot cost. Similarly, the number of bathrooms
and the area have a significant relationship with price and can be used to estimate the price of
a property. Other variables related to Price are also marked in the table: three and two stars
indicate a significant relationship, while one star and the dot (.) symbol indicate weaker
evidence of a relationship.
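The significance codes can be reproduced by hand from the p-values in the coefficient table. The sketch below uses simulated data (Area drives Price, Noise is unrelated), so the numbers are not those of ppModel; the cut points are the ones R prints in the legend under the summary table.

```r
set.seed(42)

n   <- 50
toy <- data.frame(Area = runif(n, 500, 2000), Noise = rnorm(n))
toy$Price <- 5000 * toy$Area + rnorm(n, sd = 5e5)  # Price depends on Area only

toy_model <- lm(Price ~ Area + Noise, data = toy)

# Coefficient matrix: estimate, std. error, t value, p-value per term
coefs    <- summary(toy_model)$coefficients
p_values <- coefs[, "Pr(>|t|)"]

# Map p-values to R's significance codes
stars <- cut(p_values,
             breaks = c(0, 0.001, 0.01, 0.05, 0.1, 1),
             labels = c("***", "**", "*", ".", " "),
             include.lowest = TRUE)

data.frame(p_value = p_values, signif = stars)
```

The stars describe how strong the evidence for a relationship is, not how large the effect is; the size of the effect is given by the Estimate column.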

When we plot the actual price against the predicted price, we can see that the majority of
low-priced properties are predicted well. Points that are packed closely along the line show that
the actual and predicted prices are close to one another. We can also see many outliers in the
graph that lie far from the line. Therefore, more data should be used to train the model for
better predictions.


When we overlay the line graph of the test data's predicted price in blue on the plot of the
actual price in red, we can see that most of the red line is covered by the blue line, indicating
that the bulk of the data is predicted accurately. However, there are many outliers where the
actual and predicted prices differ significantly. These show up as large gaps between the red and
blue lines.

Finally, when root mean square error (RMSE) is used to measure the model's accuracy, the result
comes out to be 832710. The model cannot fit the dataset perfectly, and a high RMSE indicates a
poor fit. However, the value is low compared with the prices in the test data, which suggests
that most of the test data is predicted reasonably well. Note that this figure comes from the
expression sqrt(mean(actual - predicted)^2), which is the absolute value of the mean error; RMSE
proper squares each error before averaging, sqrt(mean((actual - predicted)^2)), and can only be
equal or larger. For the model to perform better and predict the price of a property accurately,
it should be trained on more data. In doing so, the model will learn more about how the price
changes as a function of predictor variables like Parking, BHK, etc., and should make more
accurate predictions for future datasets.
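The difference between RMSE and the absolute mean error can be seen in a small worked example. The prices below are invented; the point is only where the squaring happens.

```r
# Invented actual and predicted prices
actual    <- c(100, 200, 300)
predicted <- c(110, 190, 330)

errors <- actual - predicted        # -10, 10, -30

# RMSE: square each error, average, then take the square root
rmse <- sqrt(mean(errors^2))

# Squaring after averaging gives |mean error| instead,
# because positive and negative errors cancel first
abs_mean_error <- sqrt(mean(errors)^2)

c(rmse = rmse, abs_mean_error = abs_mean_error)
```

Here the absolute mean error is 10 while the RMSE is about 19.1, because the +10 and -10 errors cancel in the mean but not in the RMSE.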


R Code for preprocessing and regression analysis:

#Install (once, if not already installed) and load required libraries
#install.packages(c("dplyr", "stringr", "caTools"))
library(dplyr)
library(stringr)
library(caTools)

#Function to calculate mode (most frequent value)

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

#Reading CSV File

propertyData <- read.csv("D:/Study Materials/MBA/SEM III/Big Data Analytics/R Programs/Assignment II/MagicBricks.csv")

#Find no of empty cells (character columns) and cells with NA value (other columns)

sapply(propertyData, function(x)
  if (is.character(x)) sum(str_trim(x) == "") else sum(is.na(x)))
str(propertyData)

#Removing Locality column as it has too many unique values,
#which is not going to help the model learn much
length(unique(propertyData$Locality)) #output 365
propertyData <- subset(propertyData, select = -Locality)

View(propertyData)
#calculating Mode values for categorical data
modevalueFurnishing <- Mode(propertyData$Furnishing)


modevalueType <- Mode(propertyData$Type)

modevalueBathroom <- Mode(propertyData$Bathroom)
modevalueParking <- Mode(propertyData$Parking)

#Replacing missing values with mode values

propertyData$Furnishing <- ifelse(str_trim(propertyData$Furnishing) == "",
                                  modevalueFurnishing, propertyData$Furnishing)
propertyData$Type <- ifelse(str_trim(propertyData$Type) == "",
                            modevalueType, propertyData$Type)
propertyData$Bathroom <- ifelse(is.na(propertyData$Bathroom),
                                modevalueBathroom, propertyData$Bathroom)
propertyData$Parking <- ifelse(is.na(propertyData$Parking),
                               modevalueParking, propertyData$Parking)

#Replacing NA values in Per_Sqft with mean value

propertyData$Per_Sqft[is.na(propertyData$Per_Sqft)] <-
  mean(propertyData$Per_Sqft, na.rm = TRUE)
View(propertyData)
#Converting character columns to factors so that the
#categorical levels are explicit for the regression model
propertyData$Type <- as.factor(propertyData$Type)
propertyData$Transaction <- as.factor(propertyData$Transaction)
propertyData$Status <- as.factor(propertyData$Status)
propertyData$Furnishing <- as.factor(propertyData$Furnishing)

#Display and Describe property data

str(propertyData)
View(propertyData)

#To generate the same split every time the program runs

set.seed(10)

#Splitting data into training and testing data

#70% of rows go to training data; sample.split() expects a vector
#(one value per row), so the Price column is passed, not the data frame
split <- sample.split(propertyData$Price, SplitRatio = 0.7)
train <- subset(propertyData, split == TRUE)
test <- subset(propertyData, split == FALSE)

#Scatter plot matrix showing how each pair of variables relates

plot(propertyData)

#Creating multiple regression model

ppModel <- lm(Price~., data = train)

#Predicting the test result

predictedPrice<- predict(ppModel, newdata = test)
test["Predicted Price"] = predictedPrice
test <- test %>% relocate(Price,.before = `Predicted Price`)
View(test)

#Shows significant relation in tabular form

summary(ppModel)

#Scatter plot to show Actual and predicted price

plot(test$Price,test$`Predicted Price`, xlab = "Actual Price",ylab = "Predicted price")
#Add a fitted line through the actual vs predicted points
abline(lm(test$`Predicted Price`~test$Price))

#Line graph to compare actual and predicted values of test data


plot(test$Price,ylab = "Price " , type = 'l', lty = 1, col = "red")

#Adding predicted value to existing plot of actual values
lines(predictedPrice, type = "l",lty = 1, col = "blue")
legend("top", legend=c("Actual", "Predicted"),
col=c("Red", "Blue"), lty = 1,
bty='n',horiz = TRUE, cex=0.4)

#Calculating accuracy (RMSE)

#Square root of the mean squared difference between actual and predicted values
rmse <- sqrt(mean((test$Price - test$`Predicted Price`)^2))
rmse
