
Assignment II

This document discusses applying preprocessing and regression analysis to property data. It summarizes preprocessing steps like handling missing data, converting categorical variables to factors, and splitting data into training and test sets. Regression analysis is then performed to understand relationships between factors like area, price per square foot, and number of bathrooms on property price. The linear regression model finds several variables have a statistically significant relationship with price. Model performance is evaluated by comparing actual and predicted prices in the test set.


Kathmandu University School of Management

Big Data Analytics

Assignment-II

Submitted by:

Aayush Kumar Chaudhary (21305)


MBA (Spring 2021)
August, 2021

Submitted to:

Asst. Prof Baljeet Kaur


Pre Processing and Regression Analysis

Question: Apply Pre Processing and Regression Analysis on the given Property Data.

Regression analysis is used to understand the relationships among different factors so that we can
determine which factors are most important, which ones can be disregarded, and how these
factors interact. To analyze the data contained in the MagicBricks.csv file and how its
attributes are related to each other, I have used a linear regression approach, since the target
variable is continuous in nature.

Details of Data

The dataset consists of 11 columns and 1259 rows. Furnishing, Locality, Status, Transaction,
and Type are of character type, while Area, BHK, Bathroom, Parking, Price, and Per_Sqft
are of numeric type. When we carefully examine the values of the variables, we can see that Area,
Price, and Per_Sqft are continuous variables whereas the rest are categorical. Although BHK,
Parking, and Bathroom hold numerical data, they take only a small set of repeated values, so
they may also be regarded as categorical data. Here, we need to examine the effects of
several factors on the price of a property, which is our target variable.

Explanation of Code

The dataset contains numerous missing values that must be handled before the model is built.
The first step is therefore preprocessing: installing and loading the necessary libraries and
using their functions to clean the data and convert it to the desired format.

The dplyr and stringr libraries are used for manipulating the original dataset, such as deleting,
editing, or adding values. caTools is used to split the data into training and testing data.

The data in the MagicBricks.csv file is read into the propertyData variable using the read.csv()
function.


Preprocessing

I have created a Mode function that is used to calculate the mode (the most frequent value) of
categorical variables.
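As a minimal sketch of how such a function behaves, the snippet below applies the same Mode logic to a small made-up vector. The `furnishing` values are illustrative only and are not taken from MagicBricks.csv.

```r
# Mode: returns the most frequent value of a vector (character or numeric).
Mode <- function(x) {
  ux <- unique(x)                        # distinct values, in order of first appearance
  ux[which.max(tabulate(match(x, ux)))]  # distinct value with the highest count
}

# Toy data (hypothetical values, for illustration only)
furnishing <- c("Semi-Furnished", "Furnished", "Semi-Furnished", "Unfurnished")
modeFurnishing <- Mode(furnishing)
modeFurnishing
```

If two values are tied, which.max() picks the one that appears first in the vector.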

• nrow(propertyData): Used to calculate the number of rows
• length(propertyData): Used to calculate the number of columns

Manually searching every column for missing or NA data is not practical. To determine the count
of missing values for each column, I used the sapply() function. Missing values in character
variables are represented by empty strings, while missing values in numeric variables are
represented by NA. To determine the exact number of missing values for each column, I
have therefore used an if condition to check both cases, NA and empty value.
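The check described above can be sketched on a small made-up data frame. The column names mirror the dataset, but the values are hypothetical, and base R's trimws() is used here as a stand-in for stringr's str_trim() so that the snippet runs without extra packages.

```r
# Toy data frame with both kinds of missing values (hypothetical values)
propertySample <- data.frame(
  Bathroom   = c(2, NA, 3, 2),
  Furnishing = c("Furnished", "", "Semi-Furnished", " "),
  Per_Sqft   = c(NA, 6667, NA, 5000),
  stringsAsFactors = FALSE
)

# Count missing cells: empty or blank strings for character columns, NA otherwise
countMissing <- function(x) {
  if (is.character(x)) {
    sum(is.na(x) | trimws(x) == "")
  } else {
    sum(is.na(x))
  }
}

missingCounts <- sapply(propertySample, countMissing)
missingCounts
```

sapply() applies countMissing() to every column and returns one count per column name.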

The output for the sapply() function is:

We can see that Bathroom has 2 missing values, Furnishing has 5, Parking has 33,
Type has 5, and Per_Sqft has 241. Since the given dataset has only 1259 rows, which is
quite low, I have not removed the rows with missing values. Instead, I have replaced the
missing values with the mode or the mean, whichever fits the variable best.

This code counts the number of unique values in Locality, and the subset() function is then used
to remove the Locality column from the propertyData dataset. The Locality variable has 365
distinct values, so many localities appear only a few times. As a result, the model won't be able
to gain much insight from the values of Locality. Therefore, I eliminated the Locality column.
Additionally, if we used Locality when developing the model, the test data could contain
locality values on which the model had not been trained, which could result in an error.
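The risk with unseen locality values can be illustrated with a small sketch. The data frames `toyTrain` and `toyTest` and their values are hypothetical; the point is only that predict() stops with an error when the test data holds a category the model never saw during training.

```r
# Toy training data with three localities, and test data with an unseen one
toyTrain <- data.frame(
  Locality = c("A", "A", "B", "B", "C", "C"),
  Price    = c(10, 12, 20, 22, 30, 33),
  stringsAsFactors = FALSE
)
toyTest <- data.frame(Locality = "D", stringsAsFactors = FALSE)

length(unique(toyTrain$Locality))  # number of distinct localities in the toy data

localityModel <- lm(Price ~ Locality, data = toyTrain)

# predict() fails for the unseen locality; the error message is captured as text
predictionResult <- tryCatch(
  predict(localityModel, newdata = toyTest),
  error = function(e) conditionMessage(e)
)
predictionResult

# Dropping the column, as done for propertyData
toyTrain <- subset(toyTrain, select = -Locality)
names(toyTrain)
```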

The above code calculates the mode value of each categorical variable using the user-defined
Mode function that I created at the beginning of the code.

The variables are accessed using the code shown above, and the mode value is used to fill in any
missing values. To ensure that replacement occurs only where a value is missing and that the
other values remain unchanged, I have used an ifelse() condition. The mode value is used
to fill in the gaps for all categorical variables. If we replaced the missing values in the
Parking and Bathroom columns with the mean, treating these columns as continuous
values, the result might not be meaningful. For instance, if the mean number of bathrooms were
2.75, we would be unable to interpret this value. It should be discrete, meaning that it has to be
either 2 or 3. Therefore, treating the Parking and Bathroom variables as categorical variables and
replacing their missing values with the mode produces relevant output.

The missing values in the Per_Sqft column are replaced by the mean value of that column, since
the data is continuous.
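Both replacement rules can be sketched together on made-up data: the mode for a discrete count column and the mean for a continuous column. The values below are hypothetical and chosen only so the result is easy to check by hand.

```r
# Same Mode logic as in the report
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Toy data (hypothetical values)
propertySample <- data.frame(
  Parking  = c(1, NA, 1, 2, 1),
  Per_Sqft = c(5000, NA, 7000, 6000, NA)
)

# Discrete column: fill NA with the mode, leave other values unchanged
modeParking <- Mode(propertySample$Parking)
propertySample$Parking <- ifelse(is.na(propertySample$Parking),
                                 modeParking, propertySample$Parking)

# Continuous column: fill NA with the mean of the observed values
propertySample$Per_Sqft[is.na(propertySample$Per_Sqft)] <-
  mean(propertySample$Per_Sqft, na.rm = TRUE)

propertySample
```

Here the missing Parking value becomes 1 (the most frequent value) and the missing Per_Sqft values become 6000 (the mean of 5000, 7000, and 6000).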

When we execute the same sapply() function again, we can see that no column has any missing
values.

A linear regression model works with numeric and categorical predictors, so the columns that
hold character values are converted to factor type. The variables BHK, Bathroom, and Parking
can also be regarded as categorical, but since they are already numeric, R can use them directly
and they enter the model as numeric predictors. Converting them to factors is therefore not
required here and would only add extra lines of code. Character-type categorical variables, on
the other hand, should be stored as factors so that R treats them as categories with a fixed set
of levels. Therefore, only the character-type categorical variables are converted to factors.

If we store categorical data as factors, R's modelling functions will treat such data as
categorical. For data science and machine learning, it's important for the variables to be in the
right data type. The as.factor() function is used to convert variables into factors.
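A minimal sketch of the conversion on made-up data is shown here. The column name follows the dataset, but the values are hypothetical.

```r
# Toy data with a character column (hypothetical values)
propertySample <- data.frame(
  Furnishing = c("Furnished", "Unfurnished", "Semi-Furnished", "Furnished"),
  Area       = c(800, 750, 950, 600),
  stringsAsFactors = FALSE
)

# Convert the character column to a factor
propertySample$Furnishing <- as.factor(propertySample$Furnishing)

levels(propertySample$Furnishing)  # the distinct categories
str(propertySample)                # Furnishing is now reported as a Factor
```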

The pre-processing of the data is finished at this point. The cleaned data is now used for
further analysis. A linear regression model is created by first training it on the majority of the
processed data and then testing it on the remaining smaller portion. In general, the model
performs better when it has more data to train on.

Analysis


The data is divided using the sample.split() function so that 70% of it goes into the training set
and the remaining 30% goes into the test set. The larger share is used to train the model.
The set.seed() function is called first so that the same training and test sets are produced
every time the program runs, on any system.
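The idea of a reproducible 70/30 split can be sketched with base R's sample() on made-up data. This is a stand-in for caTools' sample.split(), used here only so that the snippet runs without extra packages; `toyData` and its values are hypothetical.

```r
# Toy data: 20 rows (hypothetical values)
n <- 20
toyData <- data.frame(Area = seq(500, by = 50, length.out = n))
toyData$Price <- toyData$Area * 6000

# Fixing the seed makes the random selection repeatable
set.seed(10)
trainRows <- sample(seq_len(n), size = round(0.7 * n))

train <- toyData[trainRows, ]   # 70% of the rows
test  <- toyData[-trainRows, ]  # remaining 30%

c(train = nrow(train), test = nrow(test))

# Running the selection again with the same seed gives the same rows
set.seed(10)
trainRowsAgain <- sample(seq_len(n), size = round(0.7 * n))
identical(trainRows, trainRowsAgain)
```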

We can use the plot(propertyData) function to look at how the variables relate to one
another. It draws a matrix of scatter plots in which each row and column corresponds to one
variable. The box immediately below the box labeled "Area" displays the relationship between
Area and BHK. The more tightly the points in a box cluster along a line, the stronger the
association between the two variables. The link between Area and BHK can be understood as:
the higher the Area, the higher the BHK count tends to be.


The training dataset is used to fit the linear regression model, ppModel. This model is then
used to predict the price for the test data. To make comparisons easier, I have added the
predicted price as a new column at the end of the test dataset and moved the Price column to
sit right before the Predicted Price column.
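The fit-then-predict step can be sketched on synthetic data. Everything below (`toyData`, the coefficients used to generate Price, and the name `PredictedPrice`) is hypothetical and only illustrates the workflow; it does not reproduce the results for MagicBricks.csv.

```r
# Synthetic property-like data (hypothetical values)
set.seed(1)
n <- 40
toyData <- data.frame(
  Area     = runif(n, min = 400, max = 2000),
  Bathroom = sample(1:4, n, replace = TRUE)
)
toyData$Price <- 6000 * toyData$Area + 300000 * toyData$Bathroom +
  rnorm(n, sd = 500000)

# First 28 rows for training, last 12 for testing
train <- toyData[1:28, ]
test  <- toyData[29:40, ]

# Fit on the training data, using all other columns as predictors
toyModel <- lm(Price ~ ., data = train)

# Predict for the test data and place the prediction next to the actual price
test$PredictedPrice <- predict(toyModel, newdata = test)
head(test[, c("Price", "PredictedPrice")])
```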

We can now compare Price and Predicted Price side by side and see that a few predictions
differ greatly from the actual price, whereas others are very close to it. The rows
with large differences can be considered outliers, which might have resulted from improper
collection of data.

The summary() function applied to the linear regression model ppModel displays the strength of
the relationship between each predictor and Price. The stars to the right of the coefficient table
mark the significance level: three stars indicate a p-value below 0.001, two stars below 0.01,
one star below 0.05, and a dot (.) below 0.1. According to the table, there is a highly significant
relationship between the price of a property and its per square foot cost. Similarly, the number
of bathrooms and the area have a significant relationship with price and can be used to estimate
the price of a property. The other predictors marked with three or two stars are also significantly
related to Price, while those marked with one star or a dot show only a weak relationship.
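The coefficient table behind the stars can also be read programmatically. The sketch below uses synthetic data (all names and values are hypothetical) and pulls the p-values out of the summary with coef(); the 0.05 cutoff corresponds to one star or better.

```r
# Synthetic property-like data (hypothetical values)
set.seed(1)
n <- 40
toyData <- data.frame(
  Area     = runif(n, min = 400, max = 2000),
  Bathroom = sample(1:4, n, replace = TRUE)
)
toyData$Price <- 6000 * toyData$Area + 300000 * toyData$Bathroom +
  rnorm(n, sd = 500000)

toyModel <- lm(Price ~ ., data = toyData)

# Coefficient table: estimate, standard error, t value, and p-value per term
coefTable <- coef(summary(toyModel))
pValues <- coefTable[, "Pr(>|t|)"]
pValues

# Terms whose p-value is below 0.05
significantTerms <- names(pValues)[pValues < 0.05]
significantTerms
```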

When we plot the actual price against the predicted price, we can see that the majority of
low-priced properties are predicted well. Points that are packed along the line show that actual
and predicted prices are quite close to one another. We can also see many outliers in the graph
that lie far from the line. Therefore, more data should be used to train the model for better
predictions.


When we overlay the line graph of the test data's predicted price in blue on the plot of actual
price in red, we can see that most of the red line is covered by the blue line, indicating that the
bulk of the data is predicted accurately. However, there are many outliers where the actual and
predicted prices differ significantly. The outliers show up as large gaps between the red and
blue lines.

Finally, the model's accuracy is measured with the root mean square error (RMSE), the square
root of the mean of the squared differences between actual and predicted prices. The value
reported by the script is 832710. A high RMSE indicates a poor fit, and no model can fit a
dataset perfectly. However, when we compare this value with the prices in the test data, it
appears to be relatively low, which suggests that most of the test data is predicted reasonably
well. Therefore, for the model to perform better and predict the price of a property accurately,
it should be trained on much more data. In doing so, the model will learn more about
how the price changes as a function of predictor variables like Parking, BHK, etc., and will make
more accurate predictions for future datasets.
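The RMSE calculation can be sketched with a small helper on made-up numbers. The function name `rmse` and the toy vectors are assumptions for illustration. The placement of the parentheses matters: each error has to be squared before the mean is taken, otherwise positive and negative errors cancel.

```r
# RMSE: square each error, average the squares, then take the square root
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Toy values (hypothetical): errors are -2, +2, and 0
actualPrice    <- c(10, 20, 30)
predictedPrice <- c(12, 18, 30)

rmseValue <- rmse(actualPrice, predictedPrice)
rmseValue  # sqrt(8 / 3), about 1.63

# Squaring after mean() gives the absolute mean error instead,
# which is 0 here because the errors cancel out
absMeanError <- sqrt(mean(actualPrice - predictedPrice)^2)
absMeanError
```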


R Code for preprocessing and regression analysis:


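#Install required libraries (run once if they are not already installed)
#install.packages(c("dplyr", "stringr", "caTools"))

#Load required libraries
library(dplyr)    #relocate() and the %>% pipe
library(stringr)  #str_trim()
library(caTools)  #sample.split()

#Function to calculate mode (most frequent value)
Mode <- function(x) {
  ux <- unique(x)                        #distinct values of x
  ux[which.max(tabulate(match(x, ux)))]  #distinct value with the highest count
}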

#Reading CSV File
propertyData <- read.csv("D:/Study Materials/MBA/SEM III/Big Data Analytics/R Programs/Assignment II/MagicBricks.csv")

#Find no of empty cells and cells with NA value
#(empty strings for character columns, NA for all other columns)
sapply(propertyData, function(x) if (is.character(x)) sum(str_trim(x) == "") else sum(is.na(x)))
str(propertyData)

#Removing Locality column as it has so many unique values
#which is not going to help model to learn much
length(unique(propertyData$Locality)) #output 365
propertyData <- subset(propertyData, select = -Locality)
View(propertyData)

#Calculating Mode values for categorical data
modevalueFurnishing <- Mode(propertyData$Furnishing)
modevalueType <- Mode(propertyData$Type)
modevalueBathroom <- Mode(propertyData$Bathroom)
modevalueParking <- Mode(propertyData$Parking)

#Replacing missing values with mode values
propertyData$Furnishing <- ifelse(str_trim(propertyData$Furnishing) == "",
                                  modevalueFurnishing, propertyData$Furnishing)
propertyData$Type <- ifelse(str_trim(propertyData$Type) == "",
                            modevalueType, propertyData$Type)
propertyData$Bathroom <- ifelse(is.na(propertyData$Bathroom),
                                modevalueBathroom, propertyData$Bathroom)
propertyData$Parking <- ifelse(is.na(propertyData$Parking),
                               modevalueParking, propertyData$Parking)

#Replacing missing values in Per_Sqft with Mean value
propertyData$Per_Sqft[is.na(propertyData$Per_Sqft)] <- mean(propertyData$Per_Sqft, na.rm = TRUE)
View(propertyData)

#Converting character columns to factor values so that
#they are treated as categorical variables
propertyData$Type <- as.factor(propertyData$Type)
propertyData$Transaction <- as.factor(propertyData$Transaction)
propertyData$Status <- as.factor(propertyData$Status)
propertyData$Furnishing <- as.factor(propertyData$Furnishing)

#Display and Describe property data
str(propertyData)
View(propertyData)

#To generate the same split every time the program runs
set.seed(10)

#Splitting data to training and testing data
#70% of data goes to training data
#sample.split() expects a vector with one entry per row, so the target column is passed
split <- sample.split(propertyData$Price, SplitRatio = 0.7)
train <- subset(propertyData, split == TRUE)
test <- subset(propertyData, split == FALSE)

#Scatter plot matrix showing pairwise relationships among variables
plot(propertyData)

#Creating multiple regression model
ppModel <- lm(Price ~ ., data = train)

#Predicting the test result
predictedPrice <- predict(ppModel, newdata = test)
test["Predicted Price"] <- predictedPrice
test <- test %>% relocate(Price, .before = `Predicted Price`)
View(test)

#Shows significant relation in tabular form
summary(ppModel)

#Scatter plot to show Actual and predicted price
plot(test$Price, test$`Predicted Price`, xlab = "Actual Price", ylab = "Predicted price")
#Fitted line to show relationship between Actual and predicted
abline(lm(test$`Predicted Price` ~ test$Price))

#Line graph to compare actual and predicted values of test data
plot(test$Price, ylab = "Price", type = "l", lty = 1, col = "red")
#Adding predicted value to existing plot of actual values
lines(predictedPrice, type = "l", lty = 1, col = "blue")
legend("top", legend = c("Actual", "Predicted"),
       col = c("red", "blue"), lty = 1,
       bty = "n", horiz = TRUE, cex = 0.4)

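#Calculating accuracy (RMSE)
#Square root of the mean of the squared differences between actual and predicted values
#(each error is squared before averaging so that positive and negative errors do not cancel)
rmse <- sqrt(mean((test$Price - test$`Predicted Price`)^2))
rmse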
