Assignment 2

The document discusses a real estate agency that employs auditors to study various geographic and property features of houses to estimate their pricing. The agency has provided a dataset of 506 houses in Boston with details on crime rates, pollution levels, education facilities, distance from highways, and other variables. The data will be analyzed to build regression models to predict average home prices based on the other variables.

Uploaded by

Gnaneshwar Rao
Copyright
© All Rights Reserved

Terro’s real estate agency

Terro’s real estate is an agency that estimates the pricing of houses in a
certain locality. The price is determined from different features of a
property, which also helps the agency identify the property's business value.
For this activity the company employs an “Auditor”, who studies various
geographic features of a property such as pollution level (NOX), crime rate,
education facilities (pupil-to-teacher ratio), and connectivity (distance from
highway). The agency has provided a dataset of 506 houses in Boston. The
details of the dataset are as follows:

Data Dictionary: The data consists of the following variables.


Variable     Description
CRIME RATE   Per capita crime rate by town
INDUSTRY     Proportion of non-retail business acres per town (in percentage terms)
NOX          Nitric oxides concentration (parts per 10 million)
AVG_ROOM     Average number of rooms per house
AGE          Proportion of houses built prior to 1940 (in percentage terms)
DISTANCE     Distance from highway (in miles)
TAX          Full-value property-tax rate per $10,000
PTRATIO      Pupil-teacher ratio by town
LSTAT        % Lower status of the population
AVG_PRICE    Average value of houses in $1000's
EXCEL WEEK 2 ASSESSMENT:
GNANESHWAR RAO R

1) Generate the summary statistics for each variable in the table. (Use Data analysis
toolpak). Write down your observations. (5 marks)

Based on the measures of shape (skewness and kurtosis), ‘AVG_ROOM’ has the
sharpest peak, as it has the highest kurtosis, while ‘AVG_PRICE’ is the most
positively skewed variable.
Based on the measures of variability, the standard deviation of the ‘TAX’
variable is the highest, indicating that its data is the most spread out.
Based on the minimum and maximum values, ‘TAX’ and ‘AGE’ cover a very wide
range, which suggests that outliers may be present in these variables.
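The skewness and kurtosis figures above come from Excel's descriptive statistics output. As a rough sketch of what those two numbers measure, the snippet below implements the adjusted sample formulas that, to my understanding, match Excel's SKEW and KURT worksheet functions. The two small samples are hypothetical and are not taken from the Boston dataset.

```python
import math

def sample_skewness(xs):
    """Adjusted sample skewness (assumed to match Excel's SKEW)."""
    n = len(xs)
    mean = sum(xs) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in xs)

def sample_excess_kurtosis(xs):
    """Adjusted sample excess kurtosis (assumed to match Excel's KURT)."""
    n = len(xs)
    mean = sum(xs) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    fourth = sum(((x - mean) / s) ** 4 for x in xs)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * fourth
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

# Hypothetical samples, NOT values from the dataset.
symmetric = [1, 2, 3, 4, 5, 6, 7]
right_skewed = [1, 1, 2, 2, 3, 4, 15]

print(sample_skewness(symmetric))            # close to 0 for a symmetric sample
print(sample_skewness(right_skewed))         # positive: long right tail
print(sample_excess_kurtosis(right_skewed))  # positive: heavier tail than a normal curve
```

A positive skewness corresponds to the long right tail described for ‘AVG_PRICE’, and a larger kurtosis corresponds to the sharper peak and heavier tails described for ‘AVG_ROOM’.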

2) Plot a histogram of the Avg_Price variable. What do you infer? (5 marks)


Based on the shape of the distribution, the AVG_PRICE variable has a positive
skew, meaning most of the values lie below the mean.
Because most of the data points fall on the left side of the mean and the
tail extends to the right, this is called right-skewed or positively skewed
data.
The general relationship among the measures of central tendency in a
positively skewed distribution can be expressed by the following inequality:
Mean > Median > Mode.
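The inequality can be checked on a small sample. The sketch below uses made-up prices (in $1000's) with a few expensive houses pulling the mean upward; the numbers are purely illustrative and are not from the dataset. The ordering is a rule of thumb that holds for many, though not all, right-skewed distributions.

```python
import statistics

# Hypothetical house prices in $1000's; NOT values from the dataset.
prices = [15, 18, 20, 20, 20, 22, 23, 25, 28, 35, 50]

mean = statistics.mean(prices)      # pulled upward by the few expensive houses
median = statistics.median(prices)  # middle value of the sorted sample
mode = statistics.mode(prices)      # most frequent value

print(mean, median, mode)
print(mean > median > mode)  # True for this sample
```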

3) Compute the covariance matrix. Share your observations. (5 marks)


From the above covariance matrix, we can infer that the variables TAX and AGE
have the highest covariance, and TAX and DISTANCE have the second highest
covariance.
Meanwhile, TAX and AVG_PRICE have the lowest covariance.

4) Create a correlation matrix of all the variables (Use Data analysis tool pack).
a) Which are the top 3 positively correlated pairs.
b) Which are the top 3 negatively correlated pairs. (5 marks)

Top 3 positively correlated pairs are: TAX & DISTANCE, NOX & INDUS, NOX & AGE.
Top 3 negatively correlated pairs are: AVG_PRICE & LSTAT, AVG_ROOM & LSTAT,
AVG_PRICE & PTRATIO.
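Covariance depends on the units of each variable, which is one reason a variable with a large spread such as TAX tends to dominate the covariance matrix, whereas correlation rescales covariance to lie between -1 and 1. The sketch below illustrates this with hypothetical columns. It assumes population covariance (dividing by n), which to my understanding is what the Data Analysis ToolPak reports.

```python
import math

def covariance(xs, ys):
    """Population covariance (divides by n)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical columns, NOT values from the dataset.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
x_scaled = [100 * v for v in x]  # the same variable expressed in smaller units

print(covariance(x, y), covariance(x_scaled, y))    # covariance grows 100-fold
print(correlation(x, y), correlation(x_scaled, y))  # correlation is unchanged
```

This is why the pairs ranked highest by covariance need not be the pairs ranked highest by correlation.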

5) Build an initial regression model with AVG_PRICE as ‘y’ (Dependent variable) and LSTAT
variable as Independent Variable. Generate the residual plot. (8 marks)
a) What do you infer from the Regression Summary output in terms of variance explained,
coefficient value, Intercept, and the Residual plot?
b) Is the LSTAT variable significant for the analysis based on your model?

a) The Regression Summary output shows how well the model fits the data and
describes the relationship between the independent and dependent variables.
Since the R-squared value is low, the model does not explain the variation in
price very well. The negative coefficient of the LSTAT variable means that the
price goes down as LSTAT goes up. The residual plot shows no pattern, which
indicates no issues with the regression model.
b) The p-value for the LSTAT variable is less than 0.05, so it is considered a
significant variable.
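The quantities discussed above (intercept, coefficient, R-squared, and residuals) can be sketched for a single-predictor model as follows. This is a minimal least-squares illustration, not the Excel output itself: the function name `fit_simple_ols` and the (LSTAT, AVG_PRICE) pairs are hypothetical.

```python
def fit_simple_ols(xs, ys):
    """Least-squares fit of y = intercept + slope * x.

    Returns (intercept, slope, r_squared).
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return intercept, slope, 1 - ss_res / ss_tot

# Hypothetical (LSTAT, AVG_PRICE) pairs, NOT values from the dataset.
lstat = [5.0, 10.0, 15.0, 20.0, 25.0, 30.0]
price = [31.0, 24.5, 21.0, 16.0, 14.5, 9.0]

intercept, slope, r2 = fit_simple_ols(lstat, price)
# Residual = observed price minus fitted price; these are what the residual plot shows.
residuals = [y - (intercept + slope * x) for x, y in zip(lstat, price)]

print(intercept, slope, r2)
print(residuals)
```

A negative slope corresponds to the price falling as LSTAT rises, and residuals scattered around zero without a trend correspond to a residual plot with no pattern.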

6) Build a new Regression model including LSTAT and AVG_ROOM together as independent
variables and AVG_PRICE as dependent variable. (6 marks)
a) Write the Regression equation. If a new house in this locality has 7 rooms (on an average)
and has a value of 20 for L-STAT, then what will be the value of AVG_PRICE? How does it
compare to the company quoting a value of 30000 USD for this locality? Is the company
Overcharging/Undercharging?
b) Is the performance of this model better than the previous model you built in Question 5?
Compare in terms of adjusted R-square and explain.

a) Regression equation:
AVG_PRICE = -1.3582 + 5.0947*AVG_ROOM - 0.6423*LSTAT.
For AVG_ROOM = 7 and LSTAT = 20, the predicted price is about 21.46, that is,
roughly 21.5K USD. Since the quoted price is 30K USD, the company is
overcharging.
b) Since the adjusted R-squared value is higher than in the previous model
(Question 5), this model explains the dependent variable better.
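The prediction in part (a) can be reproduced by substituting the given values into the regression equation. The sketch below uses the coefficients exactly as reported in the answer; the function name `predict_avg_price` is only illustrative.

```python
# Coefficients as reported in answer 6(a); AVG_PRICE is in $1000's.
INTERCEPT = -1.3582
COEF_AVG_ROOM = 5.0947
COEF_LSTAT = -0.6423

def predict_avg_price(avg_room, lstat):
    """Predicted AVG_PRICE (in $1000's) from the two-variable model."""
    return INTERCEPT + COEF_AVG_ROOM * avg_room + COEF_LSTAT * lstat

predicted = predict_avg_price(avg_room=7, lstat=20)
quoted = 30.0  # the company's quote of 30000 USD, expressed in $1000's

print(round(predicted, 2))  # 21.46
print("overcharging" if quoted > predicted else "undercharging")
```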

7) Build another Regression model with all variables where AVG_PRICE alone be the
Dependent Variable and all the other variables are independent. Interpret the output in
terms of adjusted R-square, coefficient and Intercept values. Explain the significance of each
independent variable with respect to AVG_PRICE. (8 marks)

The R-squared value is 0.69, or 69%, which indicates a reasonably good fit.
NOX, TAX, PTRATIO, and LSTAT have negative coefficients, indicating that an
increase in those variables results in a decrease in the average price. All
other variables have positive coefficients.
Crime rate is the only variable whose p-value is not less than 0.05.
Therefore, all variables except ‘crime rate’ are significant for the
prediction of the average price.
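Adjusted R-squared penalises R-squared for the number of predictors, which is why it is the measure used to compare models of different sizes. As a rough worked example, the sketch below applies the standard formula to the rounded R-squared of 0.69 with the 506 houses and the 9 independent variables listed in the data dictionary. Because the input is rounded, the result is only approximate and may differ slightly from the Excel output.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# R^2 rounded to 0.69 as reported; n = 506 houses; k = 9 independent variables.
adj = adjusted_r_squared(0.69, 506, 9)
print(round(adj, 3))  # 0.684
```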

8) Pick out only the significant variables from the previous question. Make another instance
of the Regression model using only the significant variables you just picked and answer the
questions below: (8 marks)
a) Interpret the output of this model.
b) Compare the adjusted R-square value of this model with the model in the previous
question, which model performs better according to the value of adjusted R-square?
c) Sort the values of the Coefficients in ascending order. What will happen to the average
price if the value of NOX is more in a locality in this town?
d) Write the regression equation from this model.

a) This model has an R-squared value very similar to that of the previous
model, but a slightly higher adjusted R-squared value. All the p-values are
less than 0.05, making all the variables significant.
b) Since this model has a slightly higher adjusted R-squared value, it
explains the y variable better.
c) Since the NOX variable has a negative coefficient, a higher value of NOX
leads to a decrease in price.
d) Equation: AVG_PRICE = 29.42 - 10.27*NOX - 1.07*PTRATIO - 0.60*LSTAT
- 0.01*TAX + 0.03*AGE + 0.13*INDUS + 0.26*DISTANCE + 4.12*AVG_ROOM.
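The equation in part (d) can be written as a small function to see the effect described in part (c). The sketch below uses the rounded coefficients from the equation; the house's feature values are hypothetical and serve only to show the direction and size of the change when NOX increases.

```python
# Rounded coefficients from the equation in answer 8(d); AVG_PRICE is in $1000's.
INTERCEPT = 29.42
COEFFICIENTS = {
    "NOX": -10.27, "PTRATIO": -1.07, "LSTAT": -0.60, "TAX": -0.01,
    "AGE": 0.03, "INDUS": 0.13, "DISTANCE": 0.26, "AVG_ROOM": 4.12,
}

def predict(features):
    """Predicted AVG_PRICE for a dict of feature values keyed by variable name."""
    return INTERCEPT + sum(COEFFICIENTS[name] * value for name, value in features.items())

# Coefficients sorted in ascending order, as asked in part (c).
ascending = sorted(COEFFICIENTS.items(), key=lambda kv: kv[1])
print(ascending)

# Hypothetical house, NOT a row from the dataset.
house = {"NOX": 0.5, "PTRATIO": 18, "LSTAT": 12, "TAX": 400,
         "AGE": 70, "INDUS": 10, "DISTANCE": 5, "AVG_ROOM": 6}
higher_nox = dict(house, NOX=house["NOX"] + 0.1)

# Raising NOX by 0.1 changes the prediction by 0.1 * (-10.27), about -1.027.
diff = predict(higher_nox) - predict(house)
print(diff)
```

With every other feature held fixed, each additional unit of NOX lowers the predicted average price by 10.27 (in $1000's), the largest negative effect per unit among the coefficients.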