Regression Model and Its Applications
Author: Shuyu Jia
Instructor: Dr. Justin Corvino
Contents

1 Introduction
2 Linear Regression
  2.1 Simple Linear Regression
  2.2 Multiple Linear Regression
  2.3 Estimating Parameters: The Method of Least Squares
3 Logistic Regression
  3.1 General Logistic Regression
  3.2 Multinomial Logistic Regression
  3.3 Ordinal Logistic Regression
  3.4 Estimating Parameters: Maximum Likelihood Estimation
4 Case Study: NBA playoff results estimation
Acknowledgments
First of all, I would like to express my sincere gratitude to my instructor
Prof. Justin Corvino for his continuous support of my mathematics study, and for
his patience, motivation, and immense knowledge. His guidance helped me
throughout this semester's research and the writing of this paper. Besides my
instructor, I would like to thank Prof. Jeffrey Liebner for his insightful advice,
his encouraging comments, and all the material he suggested for self-study.
Last but not least, I want to thank my fellow classmates and friends John Jamieson,
Christina Sirico, and Daniel Turchiano for their precious feedback and ideas on
my paper.
1 Introduction
Regression models, in general, have two main objectives. The first is to establish
if there exists a relationship between two or more variables. Positive relationships
are observed when an increase of one variable results in an increase of the other.
Conversely, in a negative relationship, we find that if one variable increases, the
other variable tends to decrease. More specifically, regression models are used to
establish if there is a statistically significant relationship between the variables. For
instance, on average, we expect that people in families that earn higher income will
generally spend more. In this case, a positive relationship exists between income
and spending. Another example could be the relationship between a student's height
and his or her exam score. Of course, we expect no relationship in this case, and we
can use regression models to test that.
There are two different roles that the variables play in regression models. The
first one is the dependent variable. This is the variable whose values we want to
explain or predict. We call it the dependent variable because its value depends on
something else. We usually denote it as Y. The other variable is the independent
variable; this is the variable that affects the other one. We usually denote it
as X.
This paper has two main parts. In the first part I will introduce different
types of linear regression models and logistic regression models, give examples
of them, and discuss how to estimate the model parameters for each type of
regression model. In the second part I will present a case study to display the
capabilities and applications of regression models. The case study is the prediction
of NBA playoff results using regular-season data. I will use one of the regression
models introduced in the first part, ordinal logistic regression, fit it in R Studio,
and explain the results at the end.
2 Linear Regression
2.1 Simple Linear Regression
Simple linear regression uses a linear equation to model the relationship be-
tween two variables, an independent variable and a dependent variable. Data, in
general, consists of a series of x and y observations, which on average may follow
a linear pattern. We use a linear regression model to represent such trends. Here
is an example of a simple linear regression.
The fitted line generally does not pass through every observation. Rather, the
vertical distance between each observation and the regression line measures the
error for that observation. The purpose of linear regression is to find the line
that minimizes these errors, so the linear regression model must include an error
term. A simple linear regression model is shown below. [2][Ch.12]
y = β0 + β1 x + ε.    (1)

Here y is the dependent variable, whose value depends on the other terms in the
equation; x is the independent variable that affects the dependent variable y; β0
is the constant term or intercept; β1 is the slope coefficient for x; and ε is the
error term which we are trying to minimize.
Now let us look at an example with actual data. Suppose there are 40 observations
of 40 different families, and their weekly income and weekly consumption of a given
product are known. Our simple linear regression model is as follows:

Consumption = β0 + β1 · Income + ε.    (2)

What we want to test is whether income is good enough to explain consumption.
Suppose we fit the model to these data in R Studio and get the following parameter
estimates:

Consumption = 49.13 + 0.85 · Income + ε.    (3)
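In R Studio, a model like Eq. (2) can be fit with lm(). The following is only an
illustrative sketch: the data are simulated so that the true coefficients are close
to the estimates quoted in Eq. (3), and the data frame and column names are
placeholders rather than the actual dataset.

# Hedged sketch: simulate 40 families and fit Eq. (2) with lm().
set.seed(1)
income <- runif(40, min = 100, max = 1000)               # weekly income
consumption <- 49 + 0.85 * income + rnorm(40, sd = 30)   # weekly consumption
families <- data.frame(income, consumption)

fit <- lm(consumption ~ income, data = families)
coef(fit)   # estimates of beta0 (intercept) and beta1 (slope) for this simulated sample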
Let us interpret the coefficients. The value 49.13 could be interpreted as the
consumption level of a family with zero income. This makes little sense unless that
family receives financial assistance from the government or some other source. In
general, the intercept may not have an intuitive interpretation, so we often do not
interpret it. The value 0.85 is the marginal effect of one unit of income on
consumption: for every additional unit of income a family has, we estimate that its
consumption grows by 0.85 units. Note that the slope always has an intuitive
interpretation; it represents the sensitivity of the dependent variable to changes
in the independent variable.
2.2 Multiple Linear Regression

Multiple linear regression models the relationship between one dependent variable
and two or more independent variables. Most of the same concepts from simple linear
regression apply to multiple linear regression. In fact, multiple linear regression
still creates a best-fit line through the data, but the data are multi-dimensional,
so we usually use matrices in the calculation instead. However, the basic concepts
are the same.
Let us take a look at the equation for a multiple linear regression model: [2][Ch.12]

y = β0 + β1 x1 + β2 x2 + · · · + βk xk + ε.    (4)
Suppose we have 100 observations of team-level basketball data (points scored,
turnovers, and shooting percentage) for this model, and R Studio estimates the
parameters. Based on the result, if a team increases its turnovers by 1, we would
expect the number of points scored by the team to decrease by 0.980. For each 1%
increase in a team's shooting percentage, we would expect that team to score 2.522
more points.
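A multiple regression like Eq. (4) is fit the same way in R, with extra terms in the
formula. The sketch below simulates team-level data whose true coefficients are
close to the estimates quoted above; the variable names are placeholders.

# Hedged sketch: multiple linear regression with two predictors.
set.seed(2)
n <- 100
turnovers    <- rnorm(n, mean = 14, sd = 3)     # turnovers per game
shooting_pct <- rnorm(n, mean = 46, sd = 4)     # shooting percentage
points <- 60 - 0.98 * turnovers + 2.52 * shooting_pct + rnorm(n, sd = 5)

fit <- lm(points ~ turnovers + shooting_pct)
coef(fit)   # slope estimates should be close to -0.98 and 2.52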
2.3 Estimating Parameters: The Method of Least Squares

The method of least squares minimizes the sum of squared vertical deviations

S(b0, b1) = Σ_{i=1}^{n} (yi − b0 − b1 xi)².

To achieve that, we take the partial derivatives of S with respect to both b0 and
b1, set both of them equal to zero, and solve the equations: [2][Ch.12]
∂S/∂b0 = Σ_{i=1}^{n} 2(yi − b0 − b1 xi)(−1) = 0    (9)

∂S/∂b1 = Σ_{i=1}^{n} 2(yi − b0 − b1 xi)(−xi) = 0.    (10)
There are two things we need to check. First, we need to make sure the critical
point we get is not a saddle point, using the second derivative test:

(∂²S/∂b0²)(∂²S/∂b1²) − (∂²S/∂b0 ∂b1)² = 2n · Σ_{i=1}^{n} 2xi² − (Σ_{i=1}^{n} 2xi)²    (13)

   = 4 (n Σ_{i=1}^{n} xi² − (Σ_{i=1}^{n} xi)²).    (14)
Since both sides are positive numbers, the inequality still holds when both sides
are squared:

(u · v)² ≤ |u|² |v|².    (18)

Based on their definitions, the row vector u and the column vector v are linearly
independent, so equality never holds. Therefore, from Eq. (16), we can confirm that
the critical point we get is not a saddle point:

(∂²S/∂b0²)(∂²S/∂b1²) − (∂²S/∂b0 ∂b1)² = 4 (|u|² |v|² − (u · v)²) > 0.    (19)
Next, we need to check that the second partial derivatives of S with respect to b0
and with respect to b1 are both positive. This ensures that the critical point
actually minimizes S.
∂²S/∂b0² = Σ_{i=1}^{n} 2 = 2n > 0    (20)

∂²S/∂b1² = Σ_{i=1}^{n} 2xi² > 0.    (21)
The solutions are denoted by β̂0 and β̂1. They are called the least squares
estimates, which minimize S, the sum of squared vertical deviations. The estimated
regression line, or least squares line, is then given by y = β̂0 + β̂1 x.
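For reference, solving Eqs. (9) and (10) simultaneously gives the standard
closed-form expressions for these estimates (a well-known result, stated here
without the intermediate algebra):

β̂1 = (n Σ xi yi − Σ xi Σ yi) / (n Σ xi² − (Σ xi)²) = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²,

β̂0 = ȳ − β̂1 x̄,

where x̄ and ȳ denote the sample means of the xi and yi, and all sums run from
i = 1 to n.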
For multiple linear regression, the model can be written in matrix form as
y = Xβ + ε. The X matrix has a row for each observation, consisting of a 1 followed
by the values of the three predictors: [2][Ch.12]

X =
  [ 1  x11  x12  x13
    1  x21  x22  x23
    ...
    1  xn1  xn2  xn3 ].    (24)
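To make the matrix formulation concrete, the following sketch computes the least
squares estimates directly from the normal equations (XᵀX)b = Xᵀy in R and checks
them against lm(); the simulated data and variable names are placeholders.

# Hedged sketch: least squares via the design matrix of Eq. (24).
set.seed(3)
n <- 50
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + 0.3 * x3 + rnorm(n)

X <- cbind(1, x1, x2, x3)                    # one row per observation: 1, then the predictors
beta_hat <- solve(t(X) %*% X, t(X) %*% y)    # solve the normal equations (X'X) b = X'y
cbind(beta_hat, coef(lm(y ~ x1 + x2 + x3)))  # the two approaches agree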
3 Logistic Regression
3.1 General Logistic Regression
Linear regression models, in general, have some problems dealing with binary
outcomes. However, a technique called logistic regression is good at answering
yes/no questions. Here are some examples of binary outcomes: (1) Should a bank give
a person a loan or not? (2) Is a particular student admitted to a school? (3) Is a
person voting against a new law? For these questions there are only two outcomes,
Yes and No, and we usually define a dummy variable to indicate whether an
observation is a Yes or a No in the following way:
y = {1 if Yes; 0 if No}.    (27)
Note that if we define our dummy variable the other way around, the coefficients
will have the same magnitudes but opposite signs. For example, whatever makes me
more likely to buy a product makes me correspondingly less likely not to buy it,
so the corresponding attribute receives the opposite sign.
Consider an example of customers’ subscription to a magazine. There are only
two outcomes in this example: subscribed to the magazine or not. Suppose we
have data on 1000 random customers from a given city, and we want to know what
determines their decision to subscribe to a magazine. In this case our dependent
variable is going to be an indicator variable that tells us if a customer has sub-
scribed to the magazine or not. The dependent variable is going to be a 1 if the
subscription took place and it is going to be a 0 otherwise. We will be examining
how age influences the likelihood of subscription in this example.
If we think about the problem, we do not see too many reasons why we could
not use a linear model, because aside from being binary, there is really nothing
special about our dependent variable y. In fact, if we want to change this binary
variable from a zero to a one, we are changing its value to a higher value and thus
anything that increases the value of y should favor the likelihood of a customer’s
subscribing to the magazine. Thus, we could run a simple linear regression model,
and suppose we get the following result:

subscribe = −1.700 + 0.064 · age.    (28)
As we interpret the result, we are left wondering what makes the outcome change
from a 0 to a 1. This can also be interpreted as what increases the likelihood of
subscription, or P(subscribe = 1), which we can simply denote as p. Therefore the
result can also be read as

p = −1.700 + 0.064 · age.    (29)
Now we can conclude that every additional year of age increases the probability of
subscription by 0.064, or 6.4 percentage points. However, problems arise when we
try to use this to forecast the probabilities for customers of given ages. We know
that probabilities are bounded, 0 ≤ p ≤ 1, and suppose the range of ages in the
dataset is 20 ≤ age ≤ 55. We can certainly predict that the probability a
35-year-old person subscribes is

p = −1.700 + 0.064 · 35 = 0.54.    (30)
However, once we try to predict the probabilities for ages 25 and 45, we get invalid
values: −1.700 + 0.064 · 25 = −0.10 and −1.700 + 0.064 · 45 = 1.18, a negative
probability and a probability greater than 1.
There are two main properties that must be satisfied. First, the probability must
always be non-negative, p ≥ 0; an exponential form such as p = e^(β0 + β1·age)
satisfies this. Second, the probability can never exceed 1, which we can guarantee
by dividing the exponential by itself plus one:

p = e^(β0 + β1·age) / (e^(β0 + β1·age) + 1).

Even though we have this more complex expression, the linear thinking is not
completely gone. If we do some algebra, the previous expression can be rewritten
as follows.
ln(p / (1 − p)) = β0 + β1 · age.    (35)
Therefore, even if the probability of a customer subscribing (p) is not a linear
function of age, we can perform a simple transformation on it such that it is now
a linear function of age. This approach is also called log-linear modeling. [3][Ch.8]
The above equation is the one used in logistic regressions. The result for our
example is as follows:
ln(p / (1 − p)) = −26.52 + 0.78 · age.    (36)
Written in terms of the probability p, instead, we have

p = e^(−26.52 + 0.78·age) / (e^(−26.52 + 0.78·age) + 1).    (37)
When we predict probabilities using the logistic regression model above, the
predicted probabilities are bounded between 0 and 1 for all ages. In this model, as
customers grow older, the probability asymptotically approaches 1; as age decreases,
the probability asymptotically approaches 0; the predictions never go below 0 or
above 1. This is why we want to use logistic regression models.
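In R, a logistic regression like Eq. (36) can be fit with glm() and a binomial
family. The following is only a small illustrative sketch; the miniature data frame
and its columns (age, subscribe) are placeholders for the 1000-customer dataset
described above.

# Hedged sketch: logistic regression of subscription on age.
subs <- data.frame(
  age       = c(22, 25, 28, 31, 34, 38, 41, 45, 50, 55),
  subscribe = c( 0,  0,  0,  0,  1,  0,  1,  1,  1,  1)
)
fit <- glm(subscribe ~ age, data = subs, family = binomial)
coef(fit)   # beta0 and beta1 on the log-odds scale

# predicted probabilities always stay strictly between 0 and 1
predict(fit, newdata = data.frame(age = c(20, 35, 60)), type = "response")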
3.2 Multinomial Logistic Regression
What I am going to introduce next is the concept of log odds. In the last section
we defined the binary dependent variable as follows:
subscribe = {1 if subscribed; 0 if not subscribed}.    (38)
We saw that the logit function ln(p / (1 − p)) actually follows a linear pattern,
because our model is

ln(p / (1 − p)) = β0 + β1 · age.    (39)
We also defined p as the probability of subscription, P(subscribe = 1), which we
simply denote as P(1). Therefore 1 − p is the probability of not subscribing, which
we denote as P(subscribe = 0), or P(0). Thus the logit function becomes

ln(P(1) / P(0)) = β0 + β1 · age.    (40)
The equation above is what we call the log odds in binary logistic regression.
This also explains the problem we left open in Section 3.1. Let us recall the model
we introduced before. [2][Ch.12] Suppose, for example, that we want to model a
student's choice of major among Maths, Engineering, History, and Art as a function
of gender and age, with Art serving as the baseline category. From that we can
build our multinomial logistic model:
ln(P(Maths) / P(Art)) = a0 + a1 · gender + a2 · age    (42)

ln(P(Engineer) / P(Art)) = b0 + b1 · gender + b2 · age    (43)

ln(P(History) / P(Art)) = c0 + c1 · gender + c2 · age,    (44)
where the a's, b's, and c's are the parameters we try to estimate.
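Equations (42)-(44) can be fit jointly in R with nnet::multinom, which estimates
all of the baseline-category logits at once. The sketch below uses simulated data
with placeholder column names (major, gender, age); Art is set as the reference
level so that the output matches the three equations above.

# Hedged sketch: multinomial logistic regression for the choice-of-major example.
library(nnet)
set.seed(4)
students <- data.frame(
  major  = factor(sample(c("Art", "Maths", "Engineer", "History"), 200, replace = TRUE),
                  levels = c("Art", "Maths", "Engineer", "History")),  # Art = baseline
  gender = sample(0:1, 200, replace = TRUE),
  age    = sample(18:25, 200, replace = TRUE)
)
fit <- multinom(major ~ gender + age, data = students)
summary(fit)$coefficients   # one row (intercept, gender, age) per non-baseline major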
3.4 Estimating Parameters: Maximum Likelihood Estimation

Maximum likelihood estimation (MLE) is one of the most widely used methods for
estimating parameters in statistical analyses. I will use an example of the U.S.
population to explain how
the MLE works. In this example, we would like to know the probability that a
given U.S. individual is male. Of course, since there are approximately 326 million
individuals in the U.S., it is impossible to collect sex information from all of them.
But let us suppose we have a sample of the U.S. population, and in the sample
we have the sex information for each individual. Suppose we have N individuals in
this sample, and we are interested in evaluating the probability that an individual
from the U.S. is male, given that we only have a sample from the U.S. population.
The idea is that there is some probability distribution which determines whether a
given individual is male or female. We call its probability density function
f(xi | p). Here p is the probability that an individual is male, a population
fraction which we are trying to estimate, but for the moment we assume that we know
p. The probability density function (PDF) is as follows:

f(xi | p) = p^(xi) · (1 − p)^(1 − xi),   xi ∈ {0, 1}.
When we substitute xi = 1 and xi = 0 into the PDF, we find that p is the probability
that a given individual is male and 1 − p is the probability that a given individual
is female.
We can see that our function f gives exactly the probabilities we defined before.
That is the case for one observation. Now we want to consider several observations
at once. In that circumstance, we define the joint pdf as follows.
Note that we are considering observations as random samples. In other words, they
are all independent of one another. Therefore the pdf is just going to be individual
pdfs multiplied together. Now we define some random variables X1 , X2 , ..., XN ,
and each of them represents the sex of a corresponding individual. When we
multiply each of the individual density functions, we obtain a joint probability in
the discrete case as follows:
P(X1 = x1, X2 = x2, ..., XN = xN) = ∏_{i=1}^{N} p^(xi) (1 − p)^(1 − xi) = L.    (53)
Since the likelihood function L is a product, taking the log of it turns it into a
sum. Simplify it further as follows:

l = ln( ∏_{i=1}^{N} p^(xi) (1 − p)^(1 − xi) )    (58)

  = Σ_{i=1}^{N} ln( p^(xi) (1 − p)^(1 − xi) )    (59)

  = Σ_{i=1}^{N} [xi ln(p) + (1 − xi) ln(1 − p)]    (60)

  = ln(p) · Σ_{i=1}^{N} xi + ln(1 − p) · Σ_{i=1}^{N} (1 − xi).    (61)
We may simplify this further if we recognize that Σ_{i=1}^{N} xi = N · x̄, where x̄
is the mean of all the observations xi:

l = ln(p) · N · x̄ + ln(1 − p) · N · (1 − x̄).    (62)
Now the log likelihood function l is ready to be differentiated with respect to
parameter p, and then we set it to zero in order to find the critical point.
dl/dp = N · x̄ / p − N · (1 − x̄) / (1 − p) = 0.    (63)
Using algebra, we get our critical point p = x̄. Then we need to take the second
derivative of l with respect to p and confirm that the result is negative:

d²l/dp² = −N · x̄ / p² − N · (1 − x̄) / (1 − p)² < 0.    (64)
Thus we know that this critical point actually maximizes the log likelihood l. We
denote the maximum likelihood estimator as p̂_ML, and we get p̂_ML = x̄. Since the
log function is monotonic, we know that p̂_ML also maximizes our original likelihood
function L. Finally, we have an estimate of the parameter p in the probability
density function we defined before. The result shows that the maximum likelihood
estimator of the population parameter p is simply the fraction of individuals in
the sample who are male. That makes intuitive sense, because p itself is the
fraction of individuals in the population who are male.
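As a quick numerical check of this result, the following sketch simulates a
placeholder sample, maximizes the log likelihood of Eq. (61) numerically, and
compares the maximizer with the sample mean.

# Hedged sketch: numerical check that the MLE equals the sample mean.
set.seed(5)
x <- rbinom(500, size = 1, prob = 0.49)          # 1 = male, 0 = female (simulated sample)

log_lik <- function(p) sum(x) * log(p) + sum(1 - x) * log(1 - p)   # Eq. (61)
opt <- optimize(log_lik, interval = c(1e-6, 1 - 1e-6), maximum = TRUE)

c(numerical_mle = opt$maximum, sample_mean = mean(x))   # the two values agree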
4 Case Study: NBA playoff results estimation
4.1 Introduction of the problem
Every year, people go crazy about the NBA playoffs. Plenty of people try to predict
the winner before each series. Often they will predict that the team with the higher
win rate in the regular season will also win the series, but sometimes that is not
the case. There are obviously more factors affecting a series, such as offense,
defense, or even the play style of the team. Therefore, I use the following
regular-season data, taken from basketball-reference.com, to estimate playoff
results: (1) regular-season total win rate; (2) points scored per game; (3) points
allowed per game; (4) the regular-season matchup result between the two teams.
4.2 Dataset
According to the NBA rules history, the 3-point line extension in the 1997-98 season
was one of the biggest rule changes, and it could make a statistical difference in
my predictors. [4] This is because the 3-point line extension likely affected the
shooting percentages of all teams, so both the points scored per game and the points
allowed per game are going to be different. Aside from this, many teams changed
their play style according to the distance from which they shoot the 3-pointer. As a
result, the matchups between the teams are going to have different results, so I am
only using playoff data since the 1997-98 season. The first round of the NBA
playoffs was not best-of-7 until the 2002-03 season, so I did not use first-round
playoff data from the seasons before 2002-03. In total, I acquired 260 playoff
series across the 20 seasons in my dataset. Below is a segment of my original
dataset (Figure 2); the full dataset contains 260 entries.
Figure 2: A segment of original dataset[5]
I use an ordinal logistic regression to model the relationship between the playoff
result and my four predictors, making several modifications to my original dataset.
First, all independent variables (predictors) should be numbers, so I formally define
the four predictors of my model as follows: (1) the difference of the win/loss
percentage between the two teams; (2) the difference of points scored per game
between the two teams; (3) the difference of points allowed per game between
the two teams; (4) the net-win between the two teams in the regular season. Take
one of the series in the 2015-16 season, the Cleveland Cavaliers versus the Golden
State Warriors, as an example. In my original dataset the win/loss fractions of the
two teams are 0.695 and 0.890, so the difference of the win/loss percentage between
the two teams is −0.195. Similar calculations were performed for the difference of
points scored per game between the two teams
and the difference of points allowed per game between the two teams. For the
regular season matching result between the two teams, I simply calculated the
difference and put the result in the net-win column. For example, the Cleveland
Cavaliers went 0-2 versus the Golden State Warriors in the regular season of 2015-
16, so the net-win is -2.
Second, the original dataset was ordered with the winner listed first, so only four
different results exist in the dataset: 4-0, 4-1, 4-2, and 4-3. However, in reality
we do not know which team is going to win before they play. Therefore I flipped half
of the 4-0 series into 0-4, negating the four predictors at the same time, and
performed the same operation on half of the 4-1, 4-2, and 4-3 series as well.
Finally, I had eight different results in my dataset.
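A minimal sketch of these two preparation steps is shown below; the data frame
series and all of its column names are placeholders, and the result column is
assumed to be stored as character strings such as "4-1".

# Hedged sketch: build the four difference predictors and flip half of each result group.
series$win_pct_diff <- series$win_pct_A - series$win_pct_B
series$pts_for_diff <- series$pts_for_A - series$pts_for_B
series$pts_agn_diff <- series$pts_agn_A - series$pts_agn_B
pred_cols <- c("win_pct_diff", "pts_for_diff", "pts_agn_diff", "net_win")

for (res in unique(series$result)) {
  idx  <- which(series$result == res)
  half <- idx[seq_len(length(idx) %/% 2)]
  series[half, pred_cols] <- -series[half, pred_cols]            # negate the predictors
  series$result[half] <- paste(rev(strsplit(res, "-")[[1]]), collapse = "-")  # "4-1" -> "1-4"
}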
Last but not least, since we are using ordinal logistic regression, we must define
an ordered factor over all of the possible responses. I changed the result column of
my dataset to its corresponding factor level, as shown below:
Figure 4: An excerpt of final dataset
I built the ordinal logistic regression model from the first 19 seasons of data and
used the 15 series of the last season as tests.
Figure 5: The model built from first 19 seasons
Here is the R code for two of the tests; there are another 13 tests, not shown here.
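A minimal sketch of how such a model could be fit with MASS::polr and used to
produce the probabilities discussed below is as follows; the data frames train and
test and the column names here are illustrative placeholders rather than the actual
dataset.

# Hedged sketch: ordinal logistic regression on the prepared playoff data.
library(MASS)

# response: ordered factor 0-7 as described above (0 = team A sweeps, ..., 7 = team A is swept)
train$result_factor <- factor(train$result_factor, levels = 0:7, ordered = TRUE)

fit <- polr(result_factor ~ win_pct_diff + pts_for_diff + pts_agn_diff + net_win,
            data = train, Hess = TRUE)
summary(fit)

# probability of each of the eight outcomes for the held-out series
probs <- predict(fit, newdata = test, type = "probs")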
In order to show the accuracy of this model, I performed two kinds of predictions.
First, to predict which team is going to win the series, we add the probabilities of
the 0, 1, 2, and 3 entries to get the probability that team A will win the series.
Similarly, by adding the 4, 5, 6, and 7 entries we get the probability that team B
will win. I compared the two probabilities and then made the prediction. In Test
247, we already know the actual result is 1, which means that team A won the series
4-1 last season. In the simulation, the sum of the 0, 1, 2, and 3 entries is about
89.1%, so the sum of the 4, 5, 6, and 7 entries has to be 1 − 89.1% = 10.9% (see
Figure 6). Since 89.1% is greater than 10.9%, we predict that team A will win the
series, which is a correct prediction based on the actual result.
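A minimal sketch of this summation, using the probability matrix probs from the
earlier sketch (and assuming several test series are predicted at once), could look
as follows:

# Hedged sketch: series-winner prediction by summing outcome probabilities.
p_team_A <- rowSums(probs[, 1:4, drop = FALSE])   # factor levels 0, 1, 2, 3: team A wins
p_team_B <- rowSums(probs[, 5:8, drop = FALSE])   # factor levels 4, 5, 6, 7: team B wins
predicted_winner <- ifelse(p_team_A > p_team_B, "A", "B")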
Second, for each test, R Studio calculates the probability of each possible result.
The actual result of Test 248, which happened last season, is 6, meaning team A lost
the series 1-4. Looking at the simulation results, the model gives the probability
of the factor being 6 as about 14.4% (see Figure 6), which is not small given that
there are eight possible results. Therefore, I am interested in how frequently the
actual result lands on a simulated probability greater than 10%. The 10% threshold
is arbitrary, chosen just to test how the model performs, and is not based on any
particular metric. Below are the details of the predictions:
Out of the 15 tests (the last season's data), 14 were correct win/loss predictions,
and 14 actual results were assigned a probability over 10% by the model (see
Figure 7). The accuracy is over 93% for both kinds of prediction. It seems that the
ordinal logistic regression model for NBA playoff prediction performs very well.
The NBA regular season this year has just ended, so now is the perfect time to
predict this year's playoff results. To begin, I acquired this season's data from
basketball-reference.com:
This time I will be using data from all 20 seasons to build the ordinal logistic
regression model.
Here are the probabilities provided by R Studio:
Figure 11: Eastern conference
Figure 12: Western conference
We perform the exact same operations we did to get the two predictions earlier, and
obtain the following predictions of this season's NBA first-round playoff results:
At the time this analysis was performed, four series in the first round had ended.
We have three out of four correct win/loss predictions, and three out of four actual
results were assigned a probability over 20% by the model [5] (see Figure 11 &
Figure 12). The only failed prediction is the series between the Portland Trail
Blazers and the New Orleans Pelicans. However, that result surprised most NBA
analysts, who expected the Blazers to win the series; since the result went the
other way, we can treat this series as an exception.
For all regression models, we do not want the coefficient of any predictor to be
zero, because a predictor with a zero coefficient contributes nothing to the
response. Therefore, if zero lies in the confidence interval of a coefficient, the
corresponding predictor is not statistically significant. The 95% two-tailed
confidence intervals are as follows:
From Figure 14, we can see that zero is in the confidence interval of the
coefficient of the win/loss difference, so the win/loss difference is not a
significant predictor. The other three confidence intervals do not contain zero, so
the difference in points scored per game, the difference in points allowed per game,
and the net-wins are all significant predictors. OR stands for odds ratio; the
further the odds ratio is from 1, the stronger the estimated effect of the
predictor. From this we can see that points allowed per game matter more than points
scored per game, which suggests that in the NBA playoffs defense plays a more
important role than offense.
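The confidence intervals and odds ratios reported in Figure 14 can be obtained from
a polr fit like the one sketched earlier, for example:

# Hedged sketch: 95% confidence intervals and odds ratios for the coefficients.
ci <- confint(fit)                       # profile-likelihood intervals for the four predictors
or <- exp(cbind(OR = coef(fit), ci))     # odds ratios with their confidence bounds
or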
Since the accuracy of our predictions is over 93%, the ordinal logistic regression
model is decent at predicting NBA playoff results. However, there are still other
factors worth noting that may affect the accuracy of the results. On one hand, from
the perspective of my original data collection, using the point differential in the
regular-season matchups instead of net-wins would be more convincing. In the 2013-14
season, the Miami Heat went 0-4 against the Brooklyn Nets in the regular season,
which looks like a big gap in level between the two teams. However, looking at the
point differential of those four games, the Miami Heat lost each game by a single
point. [5] This shows that net-wins in the regular season can sometimes be a flawed
predictor. On the other hand, injuries are very common in the NBA, and sometimes a
star player might be injured right before a playoff series. The 2013-14 season again
illustrates the point: the San Antonio Spurs went 0-4 against the Oklahoma City
Thunder in the regular season, [5] but the Thunder's top defensive player Serge
Ibaka was injured right before the series began. To sum up, using point differentials
in the dataset and removing such exceptional data points are two ways to make the
model more accurate.
References
[1] Wikipedia: Simple Linear Regression, Wikimedia Foundation, 10 Apr. 2018.

[2] Devore, Jay L., and Kenneth N. Berk: Modern Mathematical Statistics with
Applications, Springer-Verlag, New York, 2012.