
Computer Class 1: Multiple Regression Analysis

The aim of the computer classes in this module is to introduce you to the use of
various estimation commands in Stata. It is by no means comprehensive, but it will
allow you to learn and understand some basic regression commands. The computer
classes are organised into four sessions. If you do not manage to work through a
session during a computer class, make sure that you do so in your own time. Although
the work done in these sessions does not count towards your final grade in this module, you will find that it greatly facilitates your understanding of the material taught in the lectures. Studying the lecture notes alongside the practical examples
demonstrated in the computer classes will put you in a better position to tackle the
exam paper, which will include questions around estimated regressions in Stata. In
today’s workshop, we focus on a practical example where you get an opportunity to
see how the various concepts taught in the first lecture are implemented in practice. It
is assumed that you have worked through the “Basics of STATA” hand-out and you
are familiar with some basic commands. It is particularly important that you practice
and retain these skills since the remaining computer classes will proceed on the
assumption that you have become familiar with the basic commands in Stata and you
are able to follow the class.

Multiple regression analysis extends the simple regression case by covering situations where the dependent variable (Y) is affected by more than one explanatory variable (the Xs):

Y = β0 + β1X1 + β2X2 + … + βkXk + u

Note: In a simple regression model there would be only one X variable on the right-
hand side. Let us look at some of the issues covered in lecture 1 through a real life
data set on house prices. The purpose of this practical exercise today is to estimate the
determinants of house prices using multiple regression analysis.

 Open Stata

 The data set we are going to use for this workshop is called hprice2.dta. It is
available from Moodle under the computer classes sections. Save the data onto
your personal university drive (or memory stick) and open it in Stata (click
File>Open>hprice2.dta)
 This dataset contains observations on house prices and related variables taken
in the year 1990. Since the observations are for individual houses in one
particular year, our data set is cross sectional.

 Let us first view the contents of the data set. In the main command window
(this is where you will type all subsequent instructions), type describe to get
the following:

(Your data set will load from the drive you saved it to and your screen will show a
different date and time)

 There are 6 variables in this data set. None of them is labelled, so it is hard to
know what each represents (note: when you collect data, it is always good
practice to properly define and label your variables). We label variables by
typing the following command lines:

label variable price "house price $"
label variable assess "assessed value $"
label variable bdrms "number of bdrms"
label variable lotsize "size of lot in square feet"
label variable sqrft "size of house in square feet"
label variable colonial "=1 if home is colonial style, otherwise 0"
If you type describe again, the labels will appear in the far right column, providing a
more detailed description of the variables. For example, we can see that variable price
is defined as “house price $”. The variable colonial takes value 1 if the house is of
colonial style and 0 otherwise. Such a variable is binary in nature (i.e. it takes only
two values) and in econometrics it is more commonly known as a dummy variable.

 Now let us see the content of some of the variables in the data set, say the first 20
observations of the variables price, bdrms, lotsize and colonial. Type list price
bdrms lotsize colonial in 1/20

 To list the last 20 observations, type list price bdrms lotsize colonial in -20/l

 Note: (i) The forward slash operator / in the above commands helps to identify a range of values (for instance, 1/20 means observations in the range 1 to 20). It also appears in other contexts, e.g. three forward slashes (///) are used to split a long command in a do-file across several lines.

(ii) -20 means count 20 observations starting from the last observation

(iii) l means the last observation in the dataset (note it is lower case letter l and
not number 1)

Summary Statistics

 It is good practice to compute some summary statistics on your data to get useful information on such things as the mean value of your variables, the standard deviation, maximum and minimum values, etc. This can also help spot obvious outliers or extreme values in your dataset.
 Suppose we want to look at the summary statistics for price and lotsize. For this we type summarize price lotsize

 We can see that there is a total of 88 houses in the data set. The cheapest house is priced at $111,000 and the most expensive house is priced at $725,000. The average house price is $293,546 and there is a large variation of house prices around this mean. The smallest plot on which a house is built is 1,000 square feet and the largest plot is 92,861 square feet.
 How many of these are colonial style houses? To find out, tabulate the dummy variable colonial by typing tabulate colonial
 Thus 61 houses in the data set are built in colonial style.

Scatter Plots
It is useful to visualise some variables on a scatter plot to get a feel for the data so that you know what sort of relationship to expect between two variables. (Note: all graphs in Stata are two-dimensional and therefore for certain graphs, such as a scatter plot, you can only plot two variables at a time.) Suppose we wish to look at the scatter plot of price and lotsize, or of price and sqrft. To generate a scatter plot between price
and lotsize, type scatter price lotsize. This will bring up a window with the following
graph:

TIP: A new feature introduced in Stata since version 10 is that graphs can be edited
by clicking on the Start Graphics Editor icon. This will load up some tools which
allow you to edit the graph. If you wish to copy the graph, right click on it, select
‘copy graph’ and paste into MS-Word or any other windows application.

Looking at the above graph, it appears that there is a positive association between the price of a house and the size of the plot on which it is built. This will be apparent if you attempt to draw a line of best fit through the observations. This line will have a positive slope, implying that the bigger the plot, the higher the price the house commands. This positive association is intuitive, since we expect a house on a bigger plot to attract a higher price. In many cases, the
association between variables may not be as intuitive and straightforward, and this
needs to be tested with real life data. Much of econometrics deals with testing and
quantifying the expected relationship between variables using real life data and
ascertaining whether such relationships are significant based on a statistical test.
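If you want Stata to draw that line of best fit rather than sketching it by eye, a minimal sketch (using the variable names above) is to overlay a linear fit on the scatter plot:

twoway (scatter price lotsize) (lfit price lotsize)

The lfit plot type adds the fitted regression line of price on lotsize to the same graph.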

Running a Regression

The regress command estimates a linear regression through OLS and its syntax is:

regress Y X1 X2 X3 X4…

where Y is the dependent variable and X1, X2, X3, X4, etc… are the explanatory
variables (you need to type the exact names you have given to the variables in your
data set). In this example, we want to explain the determinants of house prices using a multiple regression model:

price = β0 + β1 lotsize + β2 sqrft + β3 bdrms + u

 The formulation of the above model is rather intuitive. It is expected that each
of these variables has a positive impact on the price of a house: the bigger the
plot of land on which the house is built, the higher we expect its price to be; a
bigger house will tend to be more expensive; a house with more bedrooms will
tend to command a higher price.
 Although the above variables seem obvious determinants of house prices, we
need to test whether the expected relationship between them is borne out by
real life data, whether such relationships are significant and the extent to
which variations in these determinants can explain variations in house prices.
 Let’s start by estimating the above model. Type regress price lotsize sqrft
bdrms
 Stata executes the command and returns the following output.
TIP: If you want to regress the model without the intercept/constant term (this is
called regression through the origin), type regress price lotsize sqrft bdrms,
noconstant

 The first column gives the variable names, the second column reports the OLS
estimates (also called slope coefficients, excluding the constant) and the third column gives the standard errors of the OLS estimates.
 The column named t gives t-ratios for the coefficients on lot size, house size, number of bedrooms and the intercept. The intercept, or constant term, is denoted by _cons in Stata. Note: if you divide the coefficient of each variable
(Coef.) by its standard error (Std. Err.), you will obtain its t-ratio (t) - try this
for yourself.

Interpreting the Regression Results

When interpreting the results, careful attention has to be paid to the different units of
measurement for the variables under study. When interpreting the model, we first
focus on the coefficients by looking at their signs (partial relationships), the
magnitude of their impact on the dependent variable and whether they are statistically
significant or not. We then proceed to see whether overall the model we have
estimated has a good fit. Before accepting the model based on the above, we also need
to check for the presence of multicollinearity, heteroscedasticity and serial correlation
(although serial correlation is more of a problem in time series data sets). Remember
that the properties of the OLS estimators hold if certain assumptions hold, including
the absence of these three latter conditions. The presence of these conditions and
failing to satisfactorily address them may bias our results in many ways (see your lecture notes on the consequences of violating these OLS assumptions).
Sign and Magnitude of Regression Coefficients

 _cons: The coefficient on the intercept is always interpreted as the predicted value of the dependent variable if all the explanatory variables take value zero. In our example, it suggests that if all the variables take value zero, then the average house price will be $-21,770. Quite often, the constant term has no
meaningful interpretation and is ignored. In this case, for instance, if all the
right hand side variables take value zero, there will be no house and therefore
we should not be observing any price! There are other cases when setting the
explanatory variables to zero has interesting and meaningful implications.
 lotsize: The larger the plot on which the house is built, the higher the price you would expect to pay for the house. On average, an increase in plot size by one square foot increases the price of a house by $2.07, all other things remaining constant. The sign on this coefficient is in line with a-priori expectations.
 sqrft: The bigger the house, the higher the price that it commands. On average, an increase in the size of the house by one square foot increases its price by $122.78, all other things remaining constant. The sign on this coefficient is in line with a-priori expectations.
 bdrms: The more bedrooms in a house, the higher the price you pay for it. On average, an additional bedroom increases the price of the house by $13,852.52, ceteris paribus. The sign on this coefficient is in line with a-priori expectations.

Note the use of the expression all other things remaining constant (or ceteris paribus
for short) in the interpretation. This is extremely important because when we are
assessing the impact of changes in one particular explanatory variable on the
dependent variable, we have to assume that the other explanatory variables do not
change. It is important in interpretation to comment about whether the sign of the
variable is in line with prior beliefs. The reason you include a particular variable in
the model is because you believe that variable has an impact on the dependent
variable. You can hypothesise this impact to be positive, negative or in either
direction. When including the variable, you have a strong belief that at least this
impact is non-zero and test whether that belief is correct or not using real life data.
However, just finding the right sign on the coefficient of interest is not sufficient. It is
very important that the variable is also statistically significant to demonstrate impact,
as we discuss next.

Statistical significance of regression coefficients

Remember the p-value rules we use for determining whether a coefficient is statistically significant or not:

p < 0.01 (coefficient very significant)
p < 0.05 (coefficient significant)
p < 0.10 (coefficient fairly significant)
p > 0.10 (coefficient not significant)

On the basis of the above rules, lotsize and sqrft are very significant and bdrms is not
significant.

Model Fit

The R-squared value tells us how much of the variation in the dependent variable is
explained by variations in the independent variables. In our case, the R-squared is
0.6724, suggesting that about 67% of the variations in house prices can be explained
by variations in the independent variables. The remaining 33% is attributed to the error term (unobserved factors). However, in the presence of more than one explanatory variable, and
especially when comparing between models, it is better to focus on the adjusted R-
squared, which adjusts for the number of explanatory variables and the sample size.
You will recall that the adjusted R-squared can never be greater than the R-squared. Thus, in our example, about 66% of the variations in house prices can be
explained by variations in the independent variables, after adjusting for the number of
explanatory variables and the sample size.

There is no set rule on using R-squared to determine how well the model fits. Ideally,
we want the R-squared to be as close to 1 as possible, although you need to
understand that this means the model is very close to being deterministic, i.e. the error
term (other unobserved/unmeasured factors) plays no role in the regression model.
Also note that a high R-squared does not always mean a very good fit (e.g. in time
series data, we will see that there are cases where R-squared can be artificially high).
When estimating a model and interpreting results from it, do bear in mind that most
often the objective of the exercise is to obtain dependable estimates of the true
population regression coefficients and use them for statistical inference. One should
worry more about how logically or theoretically relevant the explanatory variables are
to the dependent variable and how statistically significant they are.

Ranking Variables (Standardised/Beta Coefficients)

Sometimes, researchers are interested in knowing which variable has the most important impact on the dependent variable, thus calling for a ranking of the variables. Variables can be ranked by order of importance (or significance) using the p-value: the variable with the lowest p-value has the highest rank. However, very often p-values are the same. Explanatory
variables cannot be directly compared on the basis of their magnitude, unless they are
all measured in the same unit. Comparison between variables with different units of
measurement is done through the standardised coefficients, which standardise all
coefficients and neutralise differences in units of measurement. In Stata, standardised
coefficients can be obtained as follows:

The standardised coefficients are given in the last column labelled Beta. It can be seen
that sqrft is the most important variable and bdrms is the least important variable.
Note that ranking is on the basis of absolute size (i.e. make all negative coefficients
positive and compare them). In our case, the same ranking can be obtained using the
p-value (i.e. the variable with the lowest p-value has the biggest impact).

TIP: Many of the commands can be written in shorthand form. For instance, the
shorthand for generate is ge, regress is reg, describe is de, summarize is su and so on. If you type help command_name, the underlined letters of the command name represent the minimum letters you need to type for the command to be recognised and executed. You can type anything from the minimum underlined letters all the way up to the full command name.
Example: any of reg/regr/regre/regres will be interpreted as regress and will run a
regression.

Re-scaling variables and their impact on the OLS estimates

Remember that changing the scale of the Y variable will lead to a corresponding
change in the magnitude of the coefficients and standard errors and therefore this does
not change the significance or interpretation. The same is true of any rescaling done to
any of the X variables, except that in the latter case, if you multiply the X variable by
say 5, then the coefficient on that rescaled X variable is one fifth the value of the
original coefficient. The p-values will be the same in both cases. We can illustrate
both these cases with our data set by creating a new variable Z which is 10 times the
value of our dependent variable price:

gen Z=10*price

Now run the same regression as before but with Z instead of price. We note from the
output table below that all the coefficients have a magnitude of 10 times their values
in the original model when we did not do any rescaling to the price variable. Now
create a new variable B which is 5 times lotsize and run the regression with B
amongst the explanatory variables instead of lotsize. Did you notice that the new
coefficient on variable B is one fifth of its value in the non-rescaled model? Notice
that in both cases, the statistical significance of the variables remains the same.
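For reference, a minimal sketch of the two rescaling experiments described above (using the Z variable generated earlier and a new variable B) is:

regress Z lotsize sqrft bdrms
gen B=5*lotsize
regress price B sqrft bdrms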

Post-estimation

Once the model is estimated, you can ask Stata to generate predicted values, residuals,
etc from the model. For more information, type help regress postestimation.
Suppose we wish to retrieve the predicted values for the dependent variable. We first
run the regression as above, then type predict pricehat, xb. Similarly, if you wish to
predict the residual terms from the regression, type predict error, resid. You can list
these two new variables pricehat and error (you could have named them differently)
to see how they behave. If you take the difference between price and pricehat, you
will obtain values equal (although with some very minor decimal place differences) to
the variable error.
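As a quick check that the difference between price and pricehat reproduces the residuals, a minimal sketch (check is an illustrative new variable name) is:

gen check = price - pricehat
list check error in 1/10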

Testing the Assumptions of the Model

So far, our interpretation of the regression results rested on the assumptions of the
classical linear regression model being satisfied. In empirical work, it is necessary to
verify whether these assumptions hold. In particular, the assumptions of no
multicollinearity, no heteroscedasticity and no serial correlation have to be checked as
the presence of any of these can impact quite severely on the interpretation of the
results. In the next sections, we cover the normality, no multicollinearity and no
heteroskedasticity assumptions.

Normality of Error Terms

A further assumption about the classical linear regression model is that the error term
is normally distributed. Although ideal, this is not a key assumption of OLS, and it is often not satisfied in practice, due to the way in which some models are specified. A quick way to check whether the error terms are normally distributed is by
drawing a histogram on them. Since we have already computed the variable error
earlier, type histogram error to obtain:
Unfortunately, the distribution of the error terms does not look very much like a
normal distribution, although a slight bell-shaped form can be identified. Alternatively,
you can look at the normal probability plot, which just simply plots the error terms
against what their expected values would have been if the residuals were normally
distributed. In Stata, this is implemented as pnorm error. There are more formal
statistical tests to check for the normality of a variable, such as the skewness/kurtosis
test (sktest error), the Shapiro-Wilk test (swilk error) and the Shapiro-Francia test
(sfrancia error). You can consult these in your own time.

Multicollinearity

Multicollinearity is a case where there is a strong relationship between two or more explanatory variables (the Xs). That is, the correlations between the variables are very high (usually more than 0.75 in absolute value). In this case, it becomes difficult to distinguish the contribution of individual correlated variables since they may be measuring the same phenomenon. There are two ways of checking for
multicollinearity: using the correlation matrix and using the VIF/Tolerance coefficient.

VIF/Tolerance

To obtain the VIF/Tolerance value type vif only after running a regression (Note:
Tolerance = 1/VIF):

From the above values, none of the Tolerance values is lower than 0.1 and none of the VIF values is more than 10. It appears that there is no problem of multicollinearity. We use the correlation matrix to confirm this result, typing correlate lotsize sqrft bdrms
In this model, there are no problems of multicollinearity since none of the correlation
coefficients is more than 0.75. To illustrate the potential problem of multicollinearity,
run the following model, which adds the variable assess to the explanatory variables:

price = β0 + β1 lotsize + β2 sqrft + β3 bdrms + β4 assess + u

where assess is the assessed value of the house in $ (usually a house is assessed by an expert to get an idea of its net worth before putting it on the market for sale). A priori, we would expect that the higher the assessed value of the house, the higher its price on the market, and the inclusion of this variable is therefore justified in the model. Let's run the above model by typing regress price lotsize sqrft bdrms assess

We immediately notice that one of the coefficients (sqrft) has the wrong sign and
many of them are now statistically insignificant. Notice how the R-squared is now so
much higher than in our previous model without the variable assess. An assessment of
the model would immediately lead us to suspect multicollinearity. We test if that is the case by typing vif again:

On the basis of the VIF test, it does not appear that multicollinearity is present. Let us check the correlation between the variables by typing correlate lotsize sqrft bdrms assess
The problem of multicollinearity becomes more apparent now when we look at the
correlation between assess and sqrft, which is quite high at 0.8656. It is not surprising
that this is the case, as one would generally expect a bigger house to be assessed at a
higher value, all other things constant. The inclusion of two highly collinear variables
therefore challenges the assumption of no multicollinearity. The consequences are
obvious: (i) large variances of the coefficients, making them statistically insignificant even though they are important in the model; (ii) one of the coefficients has the wrong sign; (iii) the R-square is artificially inflated. Therefore assess has to be dropped from the model, and we re-run the earlier model.

Heteroscedasticity

As we are working with cross sectional data, it is advisable to check for the presence
of heteroskedasticity. Remember that heteroscedasticity is the case where the variance
of the error term is not constant, but rather changes across different segments of the
observations. In particular, the variance can change according to different values of
the explanatory variables. We need to check for the presence of heteroscedasticity in
our model. Let us plot the residuals against the fitted values in a graph to see how they
behave. To do this, type rvfplot
From the above graph, a pattern can be identified: the residuals become more spread out as the fitted values increase, and hence there may be a problem of heteroscedasticity. There are a few formal tests for heteroscedasticity. All these tests
share the same hypotheses given as:

H0: Heteroscedasticity is NOT present (i.e. homoscedasticity)

H1: Heteroscedasticity is present

One of the tests is the Cook-Weisberg Test, also known as the Breusch-Pagan Test, which is applied in Stata by typing estat hettest after the regression:

Since the p-value is less than 0.05, we reject H0, and hence conclude that there is a problem of heteroscedasticity. An alternative test in Stata is White's test, which is available as part of an information matrix test for the regression model that reports tests for heteroscedasticity, skewness and kurtosis on the error term. The information matrix tests are produced by typing estat imtest, white

You only need to focus on the part of the output which displays the White test (this is
similar to the heteroscedasticity test in the first line of the table). Recall that the White
test is based on a chi-square test and the p-value is well below the 0.01 level, rejecting
the null of homoscedasticity (constant variance of error terms). Hence, the above test
also confirms the presence of heteroscedasticity (non-constant variance of error terms).

In the presence of heteroscedasticity, the R-square and adjusted R-square are not affected, nor are the coefficients biased or inconsistent. However, because the
variance (and therefore standard errors) of the coefficients are biased, they are no
longer valid to construct t-statistics and make inference. In particular, the t-statistics
generated in the presence of heteroscedasticity do not have a t-distribution and
therefore inference becomes difficult. Similarly, the F-statistic no longer follows a F-
distribution and is not valid any more to assess model fit. A quick solution in the
presence of heteroscedasticity is to use robust standard errors for the coefficients,
which can be used for inference irrespective of whether the model suffers from
heteroscedasticity or not. Robust standard errors in Stata can be obtained by
typing ,robust as an option in the regress command. The resulting output produces
heteroscedasticity-robust standard errors and with these, it is possible to make
inference on the coefficients using the t-statistic. The F-statistic can also now be used
for testing model reliability.
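For example, for the house price model estimated above, this would be:

regress price lotsize sqrft bdrms, robust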
Correcting for Heteroscedasticity

Generating robust standard errors is one way to deal with the problem of
heteroscedasticity, but it is an ideal solution for large samples only. With small
sample sizes, the robust t statistics can still have distributions that are not close to t
distributions and again this makes statistical inference problematic. There are several
ways to deal with heteroscedasticity. A very common method is to apply logarithmic
transformations on the variables and re-run the regression as usual. Logarithmic
transformations are a convenient solution to deal with heteroscedasticity, especially
when the source of the problem is model misspecification (e.g. the model should be a log-linear model but you run a linear model). In a later subsection, we show how to run a log-linear model and how to interpret the coefficients.

Another method for dealing with heteroscedasticity is through the use of Weighted
Least Squares if you know which particular X variable is responsible for
heteroscedasticity. If this is not known, then you can alternatively use Feasible
Generalised Least Squares (FGLS), which estimates the form of the heteroscedasticity from the residuals in a first step and then re-weights the observations accordingly. A problem with implementing FGLS is that the auxiliary regression in the second step may not necessarily be one where the error variance is linearly related to the X variables; the X variables could enter in polynomial form (squares, cubes, etc). Heteroscedasticity could simply
be a case of model misspecification, in which case attempting to correct it through
GLS may not be correct.

We will not cover the use of GLS in this tutorial but you can refer to the examples
provided in Wooldridge and attempt to replicate them in Stata in your own time.
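For those who want to experiment in their own time, here is a minimal do-file sketch of one common FGLS procedure (the one described in Wooldridge), applied to the house price model; the variable names uhat, ghat and hhat are illustrative, not part of the dataset:

* step 1: run OLS and obtain the residuals
regress price lotsize sqrft bdrms
predict uhat, resid
* step 2: model the error variance as a function of the Xs
gen loguhat2 = ln(uhat^2)
regress loguhat2 lotsize sqrft bdrms
predict ghat, xb
gen hhat = exp(ghat)
* step 3: re-estimate the model, weighting each observation by 1/hhat
regress price lotsize sqrft bdrms [aweight = 1/hhat]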

Estimating Log-linear models


Logarithmic transformations are a convenient way for correcting heteroscedasticity,
especially when the source of heteroscedasticity is model misspecification (e.g. the
model should be a log-linear model but you run a linear model). Logarithmic
transformations are also appropriate when there are outliers in the model, as a log
transformation scales down the variables by squeezing together the larger values in
the data set and stretching out the smaller values. The interpretation of log-linear
models is different from models that use all variables in level form. While the
interpretation is the same when discussing the sign and statistical significance of
coefficients or R-square, the interpretation of the magnitude of the coefficients is
different. Estimated coefficients in log-linear models involve a mix of elasticities and
semi-elasticities depending on how the log-linear model is formulated.

In particular, assuming Y is the dependent variable and X is the independent variable, the following table summarises how to interpret the coefficient β on X. Level in this context means there is no logarithmic transformation applied on the variable. When reading the table below, assume β has a positive sign.

Model         Dependent variable   Independent variable   Interpretation of β
Level-level   Y                    X                      a one-unit increase in X increases Y by β units
Level-log     Y                    ln(X)                  a 1% increase in X increases Y by (β/100) units
Log-level     ln(Y)                X                      a one-unit increase in X increases Y by (100×β)%
Log-log       ln(Y)                ln(X)                  a 1% increase in X increases Y by β% (an elasticity)

Perform a logarithmic transformation on some of the variables in our dataset as follows (it is good practice to label your newly created variables):

generate lprice=ln(price)

generate llotsize=ln(lotsize)

generate lsqrft=ln(sqrft)

Note that we do not apply a logarithmic transformation on bdrms as this is a variable with small values. We now estimate the following model:

ln(price) = β0 + β1 ln(lotsize) + β2 ln(sqrft) + β3 bdrms + u

Type regress lprice llotsize lsqrft bdrms
In the above example, a 1% increase in the size of the lot will increase the price of a
house by 0.17% on average, ceteris paribus. A 1% increase in house size will increase
its price by 0.70%, ceteris paribus. An additional bedroom in a house increases its
price by approximately 3.70% (0.037*100), all other things remaining constant. Does
this model suffer from heteroscedasticity? We use our previous tests, starting with estat imtest, white

Now try with estat hettest. As you can notice, under both tests we cannot reject the null hypothesis of homoscedasticity (although under estat hettest it appears that the p-value is weakly significant). Since the first test is not significant, we base our
decision on that one. Note that it is not uncommon in applied work to come across
cases where alternative tests for the same problem might give contradicting
conclusions. This is due to alternative tests using different criteria for testing
particular problems. In these cases, you will need to use reasonable judgement to
make a decision.
Dummy Variables in Regression

There are several ways to introduce dummy variables in a regression model, depending on whether you believe intercept, slope or interactive dummy variables are appropriate to capture the relationship of interest in the model.

Intercept Dummy Variables

Our dataset contains a dummy variable colonial which takes value one if the house is
of colonial style and zero otherwise. We may want to know whether colonial style
houses attract a higher price than noncolonial style houses. To check this, we include
the variable colonial in our model by typing regress lprice llotsize lsqrft bdrms colonial

Running the above regression, we find that on average, the price of colonial houses is
higher than noncolonial houses by 5.38%, ceteris paribus (because our dependent
variable is in logs). However, this variable is not statistically significant, so we do not
reject the null hypothesis that colonial style and noncolonial style houses do not
differ in price.

Slope Dummy Variables

Slope dummy variables are created by multiplying a dummy variable with one of the
explanatory variables and using it in estimation. For instance, if you hypothesise that plot size has a different effect on the price of colonial style houses than on the price of non-colonial style houses, you would create a new variable called lot_col, which is simply llotsize multiplied by colonial:
generate lot_col=llotsize*colonial

When running the regression model, you will then include lot_col as one of the
explanatory variables if you want to test whether this is statistically significant.
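A minimal sketch of that regression, using the log-linear specification estimated above, would be:

regress lprice llotsize lsqrft bdrms colonial lot_col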

Interactive Dummy Variables

Interactive dummy variables are created by multiplying two dummy variables. For
instance, suppose that in our dataset, we had two other dummy variables called
incinerator and neighbourhood. The first variable would indicate whether the house is
located close to an incinerator site (therefore potentially exposing the residents to air
pollution) or not and the second variable indicates whether the house is located in a
good neighbourhood or not. You might wish to assume that these two qualitative
factors are independent of each other and treat these as two separate dummy variables
and look at their impact individually on house prices. Alternatively, you might wish to
assume that these two qualitative factors are not independent. Here is one way of thinking about how these two factors could be interrelated (you might disagree). Assume that bad neighbourhoods tend to be areas where there is a lot of unemployment, poverty and crime. If there is an incinerator in an area with houses close by, then it is likely that the neighbourhood is bad. Incinerators would not be built close to good neighbourhoods (pressure from the residents against it, or quite simply, residents who can afford to will choose not to buy a property in that area). Hence, in this case, neighbourhood and incinerator would not be independent. Now suppose
you want to test the hypothesis that a house located close to an incinerator site and in
a bad neighbourhood will on average attract a much lower price. In this case, you
would need to multiply the two dummy variables, creating another variable (called an
interactive dummy) which you will add in the regression model and estimate. If your
hypothesis holds true, you would expect a negative and statistically significant
coefficient on this interactive dummy variable.
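Since incinerator and neighbourhood do not actually exist in hprice2.dta, the following is only an illustrative sketch of the mechanics (assuming both dummies were available and coded so that the interaction picks out the houses of interest):

generate inc_nbhd=incinerator*neighbourhood
regress lprice llotsize lsqrft bdrms colonial incinerator neighbourhood inc_nbhd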

Predictions

Often, researchers are interested in predicting what value a dependent variable will
take based on assumed values for the independent variables in the model. Suppose in
our previous model, we wish to predict house prices at given values of the
explanatory variables. If you wish to code these calculations in Stata, this would be
done as follows (note that the values of the coefficients are to 3 decimal places):
gen hp1=5.558+(0.168*llotsize)+(0.707*lsqrft)+(0.027*bdrms)+(0.054*colonial)

Note that this has to be coded accurately (i.e. you need to type in the values of the
coefficients from the estimated Stata regression output correctly). You can also ask
Stata to automatically retrieve the values of the coefficients from the regression output
and compute these predictions as follows:

gen hp2=_b[_cons]+(_b[llotsize]*llotsize)+(_b[lsqrft]*lsqrft)+(_b[bdrms]*bdrms)+(_b[colonial]*colonial)

Note that the value of an estimated coefficient from a Stata output is called by
_b[variablename] where variable name is exactly as it appears in the Stata regression
command.
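For example, immediately after running the regression you can type display _b[lsqrft] to print the stored coefficient on lsqrft (and display _b[_cons] for the intercept).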

A shortcut for all this in Stata is to use the predict command after you run a regression:

predict hp3

If you compare these three variables (edit hp1 hp2 hp3), you will note minor differences in their values at some decimal places. This is because of the varying number of decimal places on the coefficients used in the above calculations. Note that we are not done yet, because each of these three variables gives us the predicted values of lprice. Recall that lprice is the log of price. We are ultimately interested in the predicted values of price, not of the log of price. The exact formula for obtaining the predicted price is:

predicted price = mean(exp(error)) × exp(predicted lprice)

where error denotes the residuals from the log regression.

Here is how we will proceed with coding the above formula into Stata:

(i) First create the error terms – predict error, residuals (if you already created a variable called error from an earlier regression, drop it first with drop error)

(ii) Create the exponential of the predicted error terms – gen K=exp(error)

(iii) Find the average of the exponentiated predicted errors - egen L=mean(K)

(iv) Create the exponential of the predicted values of lprice (let’s use hp3) – gen
M=exp(hp3)

(v) Finally – gen price_predicted=L*M

What we have done above is called in-sample prediction. We have used actual values
on the X variable from the dataset itself. What if we wanted to predict house prices
based on values of X that do not appear in this dataset (by this we mean that we may
not find a house in our data set which has all the characteristics/X values which we
specify)? This is what is known as an out-of-sample prediction. As an example,
suppose we wish to know on average what would be the price of a colonial style
house which has a size of 3000 square feet, is built on a plot size of 7000 square feet
and has 4 bedrooms. In Stata, this will be coded as follows:

gen myprice=_b[_cons]+(_b[llotsize]*ln(7000))+(_b[lsqrft]*ln(3000))+(_b[bdrms]*4)+(_b[colonial]*1)

gen final=L*exp(myprice)

Of course, if asked a question like this in the exam, you will need to calculate these
manually (and the value of K or L will be given to you) as follows on your calculator
(I use 3 decimal places here):

gen myprice= 5.558 + (0.168*ln(7000)) + (0.707*ln(3000)) + (0.027*4) + (0.054*1)

gen final=L*exp(myprice)

To help you become more familiar with the topics covered, here are links to websites
to access the data sets used in “Principles of Econometrics” and “Introductory
Econometrics: A Modern Approach”. You can use these data sets in Stata to replicate
the many excellent examples provided in these textbooks.

http://principlesofeconometrics.com/poe4/poe4.htm

http://fmwww.bc.edu/ec-p/data/wooldridge/datasets.list.html

(This file is a modified version of Dr. Dev Vencappa’s hand-out)
