Linear Regression Using R
An Introduction to Data Modeling
David J. Lilja
University of Minnesota, Minneapolis
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
You are free to:
Share – copy and redistribute the material in any medium or format
Adapt – remix, transform, and build upon the material
The licensor cannot revoke these freedoms as long as you follow the license terms.
Under the following terms:
Attribution – You must give appropriate credit, provide a link to the license, and indicate if changes were made. You
may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
NonCommercial – You may not use the material for commercial purposes.
Although every precaution has been taken to verify the accuracy of the information
contained herein, the author and publisher assume no responsibility for any errors or
omissions. No liability is assumed for damages that may result from the use of information
contained within.
ISBN-10: 1-946135-00-3
ISBN-13: 978-1-946135-00-1
https://doi.org/10.24926/8668/1301
Goals
Interest in what has become popularly known as data mining has expanded
significantly in the past few years, as the amount of data generated contin-
ues to explode. Furthermore, computing systems’ ever-increasing capabil-
ities make it feasible to deeply analyze data in ways that were previously
available only to individuals with access to expensive, high-performance
computing systems.
Learning about the broad field of data mining really means learning
a range of statistical tools and techniques. Regression modeling is one
of those fundamental techniques, while the R programming language is
widely used by statisticians, scientists, and engineers for a broad range of
statistical analyses. A working knowledge of R is an important skill for
anyone who is interested in performing most types of data analysis.
The primary goal of this tutorial is to explain, in step-by-step detail, how
to develop linear regression models. It uses a large, publicly available data
set as a running example throughout the text and employs the R program-
ming language environment as the computational engine for developing
the models.
This tutorial will not make you an expert in regression modeling, nor
a complete programmer in R. However, anyone who wants to understand
how to extract information from data needs a working knowledge of the
basic concepts used to develop reliable regression models, and should also
know how to use R. The specific focus, casual presentation, and detailed
examples will help you understand the modeling process, using R as your
computational tool.
All of the resources you will need to work through the examples in the
book are readily available on the book web site (see p. ii). Furthermore, a
fully functional R programming environment is available as a free, open-
source download [13].
Audience
Students taking university-level courses on data science, statistical model-
ing, and related topics, plus professional engineers and scientists who want
to learn how to perform linear regression modeling, are the primary audi-
ence for this tutorial. This tutorial assumes that you have at least some ex-
perience with programming, such as what you would typically learn while
studying for any science or engineering degree. However, you do not need
to be an expert programmer. In fact, one of the key advantages of R as a
programming language for developing regression models is that it is easy to
perform remarkably complex computations with only a few lines of code.
Acknowledgments
Writing a book requires a lot of time by yourself, concentrating on trying
to say what you want to say as clearly as possible. But developing and
publishing a book is rarely the result of just one person’s effort. This book
is no exception.
At the risk of omitting some of those who provided both direct and in-
direct assistance in preparing this book, I thank the following individuals
for their help: Professor Phil Bones of the University of Canterbury in
Christchurch, New Zealand, for providing me with a quiet place to work
on this text in one of the most beautiful countries in the world, and for our
many interesting conversations; Shane Nackerud and Kristi Jensen of the
University of Minnesota Libraries for their logistical and financial support
through the Libraries’ Partnership for Affordable Content grant program;
and Brian Conn, also of the University of Minnesota Libraries, for his in-
sights into the numerous publishing options available for this type of text,
and for steering me towards the Partnership for Affordable Content program. I also want to thank my copy editor, Ingrid Case, for gently and tactfully pointing out my errors and inconsistencies. Any errors that remain are my own.
Contents

1 Introduction
  1.1 What is a Linear Regression Model?
  1.2 What is R?
  1.3 What's Next?
3 One-Factor Regression
  3.1 Visualize the Data
  3.2 The Linear Model Function
  3.3 Evaluating the Quality of the Model
  3.4 Residual Analysis
4 Multi-factor Regression
  4.1 Visualizing the Relationships in the Data
  4.2 Identifying Potential Predictors
  4.3 The Backward Elimination Process
  4.4 An Example of the Backward Elimination Process
  4.5 Residual Analysis
  4.6 When Things Go Wrong
5 Predicting Responses
  5.1 Data Splitting for Training and Testing
  5.2 Training and Testing
  5.3 Predicting Across Data Sets
7 Summary
Bibliography
Index
Update History
Data mining is a phrase that has been popularly used to suggest the process of finding useful information from within a large collection
of data. I like to think of data mining as encompassing a broad range of
statistical techniques and tools that can be used to extract different types
of information from your data. Which particular technique or tool to use
depends on your specific goals.
One of the most fundamental of the broad range of data mining tech-
niques that have been developed is regression modeling. Regression mod-
eling is simply generating a mathematical model from measured data. This
model is said to explain an output value given a new set of input values.
Linear regression modeling is a specific form of regression modeling that
assumes that the output can be explained using a linear combination of the
input values.
A common goal for developing a regression model is to predict what the
output value of a system should be for a new set of input values, given that
you have a collection of data about similar systems. For example, as you
gain experience driving a car, you begin to develop an intuitive sense of
how long it might take you to drive somewhere if you know the type of
car, the weather, an estimate of the traffic, the distance, the condition of
the roads, and so on. What you have really done to make this estimate of driving time is construct a multi-factor regression model in your mind.
The inputs to your model are the type of car, the weather, etc. The output
is how long it will take you to drive from one point to another. When
you change any of the inputs, such as a sudden increase in traffic, you
automatically re-estimate how long it will take you to reach the destination.
This type of model building and estimating is precisely what we are going to do in the remainder of this tutorial.
The first column in this table is the index number (or name) from 1 to n
that we have arbitrarily assigned to each of the different systems measured.
Columns 2-4 are the input parameters. These are called the independent
variables for the system we will be modeling. The specific values of the
input parameters were set by the experimenter when the system was mea-
sured, or they were determined by the system configuration. In either case,
we know what the values are and we want to measure the performance
obtained for these input values. For example, in the first system, the pro-
cessor’s clock was 1500 MHz, the cache size was 64 kbytes, and the pro-
cessor contained 2 million transistors. The last column is the performance
that was measured for this system when it executed a standard benchmark
program. We refer to this value as the output of the system. More tech-
nically, this is known as the system’s dependent variable or the system’s
response.
The goal of regression modeling is to use these n independent mea-
surements to determine a mathematical function, f (), that describes the
relationship between the input parameters and the output, such as:

    performance = f(clock, cache, transistors)
As a final point, note that, since the regression model is a linear com-
bination of the input values, the values of the model parameters will auto-
matically be scaled as we develop the model. As a result, the units used for
the inputs and the output are arbitrary. In fact, we can rescale the values
of the inputs and the output before we begin the modeling process and still
produce a valid model.
1.2 What is R?
R is a computer language developed specifically for statistical computing.
It is actually more than that, though. R provides a complete environment
for interacting with your data. You can directly use the functions that are
provided in the environment to process your data without writing a com-
plete program. You also can write your own programs to perform opera-
tions that do not have built-in functions, or to repeat the same task multiple
times, for instance.
R is an object-oriented language that uses vectors and matrices as its ba-
sic operands. This feature makes it quite useful for working on large sets of
data using only a few lines of code. The R environment also provides ex-
cellent graphical tools for producing complex plots relatively easily. And,
perhaps best of all, it is free. It is an open source project developed by
many volunteers. You can learn more about the history of R, and down-
load a copy to your own computer, from the R Project web site [13].
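A short interactive session of the kind described next, reconstructed from the commands named in the text, looks like this:

```r
> x <- c(2, 4, 6, 8, 10, 12, 14, 16)
> x
[1]  2  4  6  8 10 12 14 16
> mean(x)
[1] 9
> var(x)
[1] 24
```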
In this listing, the “>” character indicates that R is waiting for input. The
line x <- c(2, 4, 6, 8, 10, 12, 14, 16) concatenates all of the values in
the argument into a vector and assigns that vector to the variable x. Simply
typing x by itself causes R to print the contents of the vector. Note that R
treats vectors as a matrix with a single row. Thus, the “[1]” preceding the
values is R’s notation to show that this is the first row of the matrix x. The
next line, mean(x), calls a function in R that computes the arithmetic mean
of the input vector, x. The function var(x) computes the corresponding
variance.
This book will not make you an expert in programming using the R
computer language. Developing good regression models is an interactive
process that requires you to dig in and play around with your data and your
models. Thus, I am more interested in using R as a computing environment
for doing statistical analysis than as a programming language. Instead of
teaching you the language’s syntax and semantics directly, this tutorial will
introduce what you need to know about R as you need it to perform the spe-
cific steps to develop a regression model. You should already have some
programming expertise so that you can follow the examples in the remain-
der of the book. However, you do not need to be an expert programmer.
Good data is the basis of any sort of regression model, because we use this data to actually construct the model. If the data is flawed, the
model will be flawed. It is the old maxim of garbage in, garbage out.
Thus, the first step in regression modeling is to ensure that your data is
reliable. There is no universal approach to verifying the quality of your
data, unfortunately. If you collect it yourself, you at least have the advan-
tage of knowing its provenance. If you obtain your data from somewhere
else, though, you depend on the source to ensure data quality. Your job
then becomes verifying your source’s reliability and correctness as much
as possible.
CHAPTER 2. UNDERSTAND YOUR DATA
To compute the mean of a vector while ignoring its NA values, you must explicitly tell the function to remove the NA values using mean(x, na.rm=TRUE).
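For example, with an illustrative vector containing one missing value:

```r
x <- c(2, 4, NA, 8)
mean(x)               # returns NA because of the missing value
mean(x, na.rm=TRUE)   # returns the mean of 2, 4, and 8
```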
The feature size, channel length, and FO4 (fanout-of-four) delay are related to gate delays in the processor's logic. Because these parameters have a direct effect on how much processing can be done per clock cycle and affect the critical path delays, at least some of these parameters could be important in a regression model that describes performance.
Finally, the memory-related parameters recorded in the database are the
separate L1 instruction and data cache sizes, and the unified L2 and L3
cache sizes. Because memory delays are critical to a processor’s perfor-
mance, all of these memory-related parameters have the potential for being
important in the regression models.
The reported performance metric is the score obtained from the SPEC
CPU integer and floating-point benchmark programs from 1992, 1995,
2000, and 2006 [6–8]. This performance result will be the regression
model’s output. Note that performance results are not available for every
processor running every benchmark. Most of the processors have perfor-
mance results for only those benchmark sets that were current when the
processor was introduced into the market. Thus, although there are more
than 1,500 lines in the database representing more than 1,500 unique pro-
cessor configurations, a much smaller number of results are reported for
each individual benchmark.
How we read the data into R depends on how the data is organized in the file. We will defer the specifics of
reading the CPU DB file into R until Chapter 6. For now, we will use a
function called extract_data(), which was specifically written for reading
the CPU DB file.
To use this function, copy both the all-data.csv and read-data.R files
into a directory on your computer (you can download both of these files
from this book’s web site shown on p. ii). Then start the R environment
and set the local directory in R to be this directory using the File -> Change
dir pull-down menu. Then use the File -> Source R code pull-down menu
to read the read-data.R file into R. When the R code in this file completes,
you should have eight new data frames in your R environment workspace:
int92.dat, fp92.dat, int95.dat, fp95.dat, int00.dat, fp00.dat, int06.dat,
and fp06.dat.
The data frame int92.dat contains the data from the CPU DB database
for all of the processors for which performance results were available for
the SPEC Integer 1992 (Int1992) benchmark program. Similarly, fp92.dat
contains the data for the processors that executed the Floating-Point 1992
(Fp1992) benchmarks, and so on. I use the .dat suffix to show that the
corresponding variable name is a data frame.
Simply typing the name of the data frame will cause R to print the en-
tire table. For example, here are the first few lines printed after I type
int92.dat, truncated to fit within the page:
The first row is the header, which shows the name of each column. Each
subsequent row contains the data corresponding to an individual processor.
The first column is the index number assigned to the processor whose data
is in that row. The next columns are the specific values recorded for that
parameter for each processor. The function head(int92.dat) prints out just
the header and the first few rows of the corresponding data frame. It gives
you a quick glance at the data frame when you interact with your data.
Table 2.1 shows the complete list of column names available in these data frames.

Table 2.1: The names and definitions of the columns in the data frames containing the data from CPU DB.

 Column  Column name   Definition
  1      (blank)       Processor index number
  2      nperf         Normalized performance
  3      perf          SPEC performance
  4      clock         Clock frequency (MHz)
  5      threads       Number of hardware threads available
  6      cores         Number of hardware cores available
  7      TDP           Thermal design power
  8      transistors   Number of transistors on the chip (M)
  9      dieSize       The size of the chip
 10      voltage       Nominal operating voltage
 11      featureSize   Fabrication feature size
 12      channel       Fabrication channel size
 13      FO4delay      Fan-out-four delay
 14      L1icache      Level 1 instruction cache size
 15      L1dcache      Level 1 data cache size
 16      L2cache       Level 2 cache size
 17      L3cache       Level 3 cache size
> int92.dat[15,12]
[1] 180
We can also access cells by name by putting quotes around the name:
> int92.dat["71","perf"]
[1] 105.1
This expression returns the data in the row labeled 71 and the column
labeled perf. Note that this is not row 71, but rather the row that contains
the data for the processor whose name is 71.
We can access an entire column by leaving the first parameter in the
square brackets empty. For instance, the following prints the value in every
row for the column labeled clock:
> int92.dat[,"clock"]
[1] 100 125 166 175 190 ...
Similarly, this expression prints the values in all of the columns for row
36:
> int92.dat[36,]
nperf perf clock threads cores ...
36 13.07378 79.86399 80 1 1 ...
The functions nrow() and ncol() return the number of rows and columns,
respectively, in the data frame:
> nrow(int92.dat)
[1] 78
> ncol(int92.dat)
[1] 16
The notation int92.dat$perf says to use the data in the column named perf from the data frame named int92.dat. We can make yet a further simplification using the
attach function. This function makes the corresponding data frame local to
the current workspace, thereby eliminating the need to use the potentially
awkward $ or square-bracket indexing notation. The following example
shows how this works:
> attach(int92.dat)
> min(perf)
[1] 36.7
> max(perf)
[1] 366.857
> mean(perf)
[1] 124.2859
> sd(perf)
[1] 78.0974
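The plotting call discussed next can be reconstructed from the parameter descriptions that follow; a sketch (the exact call in the original may differ slightly):

```r
> plot(int00.dat[,"clock"], int00.dat[,"perf"], main="Int2000",
+     xlab="Clock", ylab="Performance")
```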
generates the plot shown in Figure 3.1. The first parameter in this func-
tion call is the value we will plot on the x-axis. In this case, we will plot
the clock values from the int00.dat data frame as the independent variable
on the x-axis. The dependent variable is the perf column from int00.dat, which we plot on the y-axis. The function argument main="Int2000" provides a title for the plot, while xlab="Clock" and ylab="Performance" provide labels for the x- and y-axes, respectively.

Figure 3.1: A scatter plot of the performance of the processors that were tested using the Int2000 benchmark versus the clock frequency.
This figure shows that the performance tends to increase as the clock fre-
quency increases, as we expected. If we superimpose a straight line on this
scatter plot, we see that the relationship between the predictor (the clock
frequency) and the output (the performance) is roughly linear. It is not per-
fectly linear, however. As the clock frequency increases, we see a larger
spread in performance values. Our next step is to develop a regression
model that will help us quantify the degree of linearity in the relationship
between the output and the predictor.
y = a0 + a1 x1 (3.1)
> attach(int00.dat)
> int00.lm <- lm(perf ~ clock)
The first line in this example attaches the int00.dat data frame to the
current workspace. The next line calls the lm() function and assigns the
resulting linear model object to the variable int00.lm. We use the suffix
.lm to emphasize that this variable contains a linear model. The argument
in the lm() function, (perf ~ clock), says that we want to find a model
where the predictor clock explains the output perf.
Typing the variable’s name, int00.lm, by itself causes R to print the ar-
gument with which the function lm() was called, along with the computed
coefficients for the regression model.
> int00.lm
Call:
lm(formula = perf ~ clock)
Coefficients:
(Intercept) clock
51.7871 0.5863
[Figure 3.2: perf (y-axis) plotted against clock (x-axis) with the fitted regression line.]
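A plot like Figure 3.2 can be produced by superimposing the fitted line on the scatter plot; a sketch, assuming int00.dat is attached as above:

```r
> plot(clock, perf)
> abline(int00.lm)
```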
There are many ways to evaluate a regression model's quality. Many of the techniques can be rather technical, and the details of
them are beyond the scope of this tutorial. However, the function summary()
extracts some additional information that we can use to determine how
well the data fit the resulting model. When called with the model object
int00.lm as the argument, summary() produces the following information:
> summary(int00.lm)
Call:
lm(formula = perf ~ clock)
Residuals:
Min 1Q Median 3Q Max
-634.61 -276.17 -30.83 75.38 1299.52
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 51.78709 53.31513 0.971 0.332
clock 0.58635 0.02697 21.741 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Call:
lm(formula = perf ~ clock)
These first few lines simply repeat how the lm() function was called. It
is useful to look at this information to verify that you actually called the
function as you intended.
Residuals:
Min 1Q Median 3Q Max
-634.61 -276.17 -30.83 75.38 1299.52
The residuals are the differences between the actual measured values and
the corresponding values on the fitted regression line. In Figure 3.2, each
data point’s residual is the distance that the individual data point is above
(positive residual) or below (negative residual) the regression line. Min is
the minimum residual value, which is the distance from the regression line
to the point furthest below the line. Similarly, Max is the distance from the
regression line of the point furthest above the line. Median is the median
value of all of the residuals. The 1Q and 3Q values are the points that mark
the first and third quartiles of all the sorted residual values.
How should we interpret these values? If the line is a good fit with the
data, we would expect residual values that are normally distributed around
a mean of zero. (Recall that a normal distribution is also called a Gaussian
distribution.) This distribution implies that there is a decreasing probability
of finding residual values as we move further away from the mean. That
is, a good model’s residuals should be roughly balanced around and not
too far away from the mean of zero. Consequently, when we look at the
residual values reported by summary(), a good model would tend to have
a median value near zero, minimum and maximum values of roughly the
same magnitude, and first and third quartile values of roughly the same
magnitude. For this model, the residual values are not too far off what we
would expect for Gaussian-distributed numbers. In Section 3.4, we present
a simple visual test to determine whether the residuals appear to follow a
normal distribution.
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 51.78709 53.31513 0.971 0.332
clock 0.58635 0.02697 21.741 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
This portion of the output shows the estimated coefficient values. These
values are simply the fitted regression model values from Equation 3.2.
The Std. Error column shows the statistical standard error for each of the
coefficients. For a good model, we typically would like to see a standard
error that is at least five to ten times smaller than the corresponding coeffi-
cient. For example, the standard error for clock is 21.7 times smaller than
the coefficient value (0.58635/0.02697 = 21.7). This large ratio means that
there is relatively little variability in the slope estimate, a1 . The standard
error for the intercept, a0 , is 53.31513, which is roughly the same as the es-
timated value of 51.78709 for this coefficient. These similar values suggest
that the estimate of this coefficient for this model can vary significantly.
The last column, labeled Pr(>|t|), shows the probability that the corre-
sponding coefficient is not relevant in the model. This value is also known
These final few lines in the output provide some statistical information
about the quality of the regression model’s fit to the data. The Residual
standard error is a measure of the total variation in the residual values.
If the residuals are distributed normally, this standard error should be about 1.5 times the magnitudes of the first and third quartiles of the residuals.
The number of degrees of freedom is the total number of measurements
or observations used to generate the model, minus the number of coeffi-
cients in the model. This example had 256 unique rows in the data frame,
corresponding to 256 independent measurements. We used this data to pro-
duce a regression model with two coefficients: the slope and the intercept.
Thus, we are left with (256 - 2 = 254) degrees of freedom.
The Multiple R-squared value is a number between 0 and 1. It is a statis-
tical measure of how well the model describes the measured data. We com-
pute it by dividing the total variation that the model explains by the data’s
total variation. Multiplying this value by 100 gives a value that we can
interpret as a percentage between 0 and 100. The reported R2 of 0.6505
for this model means that the model explains 65.05 percent of the data’s
variation. Random chance and measurement errors creep in, so the model
will never explain all data variation. Consequently, you should not ever
expect an R2 value of exactly one. In general, values of R2 that are closer
to one indicate a better-fitting model. However, a good model does not
necessarily require a large R2 value. It may still accurately predict future
observations, even with a small R2 value.
The Adjusted R-squared value is the R2 value modified to take into ac-
count the number of predictors used in the model. The adjusted R2 is
always smaller than the R2 value. We will discuss the meaning of the ad-
justed R2 in Chapter 4, when we present regression models that use more
than one predictor.
The final line shows the F-statistic. This value compares the current
model to a model that has one fewer parameters. Because the one-factor
model already has only a single parameter, this test is not particularly use-
ful in this case. It is an interesting statistic for the multi-factor models,
however, as we will discuss later.
Figure 3.3: The residual values versus the input values for the one-factor
model developed using the Int2000 data.
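A residual plot like Figure 3.3 can be generated with R's fitted() and resid() helper functions, which extract the fitted values and the residuals from a linear model object:

```r
> plot(fitted(int00.lm), resid(int00.lm))
```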
Figure 3.4: The Q-Q plot for the one-factor model developed using the
Int2000 data.
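The standard qqnorm() and qqline() functions produce this type of Q-Q plot of the residuals:

```r
> qqnorm(resid(int00.lm))
> qqline(resid(int00.lm))
```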
The pairs() function produces the plot shown in Figure 4.1. This plot
provides a pairwise comparison of all the data in the int00.dat data frame.
The gap parameter in the function call controls the spacing between the
individual plots. Set it to zero to eliminate any space between plots.
As an example of how to read this plot, locate the box near the upper left corner labeled perf. This is the value of the performance measured for the int00.dat data set. The box immediately to the right of this one is a scatter plot, with perf data on the vertical axis and clock data on the horizontal axis. This is the same information we previously plotted in Figure 3.1. By scanning through these plots, we can see any obviously significant relationships between the variables. For example, we quickly observe that there is a somewhat proportional relationship between perf and clock. Scanning down the perf column, we also see that there might be a weakly inverse relationship between perf and featureSize.

Figure 4.1: All of the pairwise comparisons for the Int2000 data frame.
Notice that there is a perfect linear correlation between perf and nperf.
This relationship occurs because nperf is a simple rescaling of perf. The
reported benchmark performance values in the database - that is, the perf
values - use different scales for different benchmarks. To directly compare
the values that our models will predict, it is useful to rescale perf to the
range [0,100]. We can do this quite easily using the following R code:
max_perf = max(perf)
min_perf = min(perf)
range = max_perf - min_perf
nperf = 100 * (perf - min_perf) / range
Note that this rescaling has no effect on the models we will develop, be-
cause it is a linear transformation of perf. For convenience and consistency,
we use nperf in the remainder of this tutorial.
    R2_adjusted = 1 - (1 - R2) * (n - 1) / (n - m)    (4.2)
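Equation 4.2 is easy to check in R. The helper below is a sketch, assuming n is the number of observations and m counts the regression coefficients (the predictors plus the intercept):

```r
# Adjusted R-squared per Equation 4.2.
# r2: the model's R-squared value
# n:  number of observations
# m:  number of regression coefficients (assumed here to be
#     the predictors plus the intercept)
adjusted_r2 <- function(r2, n, m) {
    1 - (1 - r2) * (n - 1) / (n - m)
}

# One-factor model from Chapter 3: R2 = 0.6505, n = 256, m = 2
adjusted_r2(0.6505, 256, 2)   # ~0.6491
```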
Table 4.1: The list of potential predictors to be used in the model development process.

clock           threads        cores           transistors
dieSize         voltage        featureSize     channel
FO4delay        L1icache       sqrt(L1icache)  L1dcache
sqrt(L1dcache)  L2cache        sqrt(L2cache)
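The function call in question can be reconstructed from the Call: line in the summary() output shown below:

```r
> int00.lm <- lm(nperf ~ clock + threads + cores + transistors +
    dieSize + voltage + featureSize + channel + FO4delay +
    L1icache + sqrt(L1icache) + L1dcache + sqrt(L1dcache) +
    L2cache + sqrt(L2cache), data=int00.dat)
```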
This function call assigns the resulting linear model object to the variable
int00.lm. As before, we use the suffix .lm to remind us that this variable
is a linear model developed from the data in the corresponding data frame,
int00.dat. The arguments in the function call tell lm() to compute a linear
model that explains the output nperf as a function of the predictors sepa-
rated by the “+” signs. The argument data=int00.dat explicitly passes to
the lm() function the name of the data frame that should be used when de-
veloping this model. This data= argument is not necessary if we attach()
the data frame int00.dat to the current workspace. However, it is useful to
explicitly specify the data frame that lm() should use, to avoid confusion
when you manipulate multiple models simultaneously.
The summary() function gives us a great deal of information about the
linear model we just created:
> summary(int00.lm)
Call:
lm(formula = nperf ~ clock + threads + cores + transistors +
dieSize + voltage + featureSize + channel + FO4delay +
L1icache + sqrt(L1icache) + L1dcache + sqrt(L1dcache) +
L2cache + sqrt(L2cache), data = int00.dat)
Residuals:
Min 1Q Median 3Q Max
-10.804 -2.702 0.000 2.285 9.809
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.108e+01 7.852e+01 -0.268 0.78927
clock 2.605e-02 1.671e-03 15.594 < 2e-16 ***
threads -2.346e+00 2.089e+00 -1.123 0.26596
cores 2.246e+00 1.782e+00 1.260 0.21235
transistors -5.580e-03 1.388e-02 -0.402 0.68897
dieSize 1.021e-02 1.746e-02 0.585 0.56084
voltage -2.623e+01 7.698e+00 -3.408 0.00117 **
featureSize 3.101e+01 1.122e+02 0.276 0.78324
channel 9.496e+01 5.945e+02 0.160 0.87361
FO4delay -1.765e-02 1.600e+00 -0.011 0.99123
L1icache 1.102e+02 4.206e+01 2.619 0.01111 *
sqrt(L1icache) -7.390e+02 2.980e+02 -2.480 0.01593 *
L1dcache -1.114e+02 4.019e+01 -2.771 0.00739 **
sqrt(L1dcache) 7.492e+02 2.739e+02 2.735 0.00815 **
L2cache -9.684e-03 1.745e-03 -5.550 6.57e-07 ***
sqrt(L2cache) 1.221e+00 2.425e-01 5.034 4.54e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Notice a few things in this summary: First, a quick glance at the residu-
als shows that they are roughly balanced around a median of zero, which is
what we like to see in our models. Also, notice the line, (179 observations
deleted due to missingness). This tells us that in 179 of the rows in the
data frame - that is, in 179 of the processors for which performance re-
sults were reported for the Int2000 benchmark - some of the values in the
columns that we would like to use as potential predictors were missing.
These NA values caused R to automatically remove these data rows when
computing the linear model.
The total number of observations used in the model equals the number
of degrees of freedom remaining (61 in this case) plus the total number of
predictors in the model. Finally, notice that the R2 and adjusted R2 values
are relatively close to one, indicating that the model explains the nperf
values well. Recall, however, that these large R2 values may simply show
us that the model is good at modeling the noise in the measurements. We
must still determine whether we should retain all these potential predictors
in the model.
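Incidentally, the R2 values reported here can be extracted from the summary object directly, which is convenient when comparing candidate models; for example:

```r
# The summary object stores both R-squared values as components,
# so we can save and compare them across candidate models.
s <- summary(int00.lm)
s$r.squared       # multiple R-squared
s$adj.r.squared   # adjusted R-squared
```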
To continue developing the model, we apply the backward elimination
procedure by identifying the predictor with the largest p-value that exceeds
our predetermined threshold of p = 0.05. This predictor is FO4delay, which
has a p-value of 0.99123. We can use the update() function to eliminate a
given predictor and recompute the model in one step. The notation “.~.”
means that update() should keep the left- and right-hand sides of the model
the same. By including “- FO4delay,” we also tell it to remove that predic-
tor from the model, as shown in the following:
> int00.lm <- update(int00.lm, .~. - FO4delay, data = int00.dat)
> summary(int00.lm)
Call:
lm(formula = nperf ~ clock + threads + cores + transistors +
dieSize + voltage + featureSize + channel + L1icache +
sqrt(L1icache) + L1dcache + sqrt(L1dcache) + L2cache +
sqrt(L2cache), data = int00.dat)
Residuals:
Min 1Q Median 3Q Max
-10.795 -2.714 0.000 2.283 9.809
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.088e+01 7.584e+01 -0.275 0.783983
clock 2.604e-02 1.563e-03 16.662 < 2e-16 ***
threads -2.345e+00 2.070e+00 -1.133 0.261641
cores 2.248e+00 1.759e+00 1.278 0.206080
transistors -5.556e-03 1.359e-02 -0.409 0.684020
dieSize 1.013e-02 1.571e-02 0.645 0.521488
voltage -2.626e+01 7.302e+00 -3.596 0.000642 ***
featureSize 3.104e+01 1.113e+02 0.279 0.781232
Remove featureSize:
> int00.lm <- update(int00.lm, .~. - featureSize, data=int00.dat)
> summary(int00.lm)
Call:
lm(formula = nperf ~ clock + threads + cores + transistors +
dieSize + voltage + channel + L1icache + sqrt(L1icache) +
L1dcache + sqrt(L1dcache) + L2cache + sqrt(L2cache), data =
int00.dat)
Residuals:
Min 1Q Median 3Q Max
-10.5548 -2.6442 0.0937 2.2010 10.0264
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.129e+01 6.554e+01 -0.477 0.634666
clock 2.591e-02 1.471e-03 17.609 < 2e-16 ***
threads -2.447e+00 2.022e+00 -1.210 0.230755
cores 1.901e+00 1.233e+00 1.541 0.128305
transistors -5.366e-03 1.347e-02 -0.398 0.691700
dieSize 1.325e-02 1.097e-02 1.208 0.231608
voltage -2.519e+01 6.182e+00 -4.075 0.000131 ***
channel 1.188e+02 5.504e+01 2.158 0.034735 *
L1icache 1.037e+02 3.255e+01 3.186 0.002246 **
sqrt(L1icache) -6.930e+02 2.307e+02 -3.004 0.003818 **
L1dcache -1.052e+02 3.106e+01 -3.387 0.001223 **
sqrt(L1dcache) 7.069e+02 2.116e+02 3.341 0.001406 **
L2cache -9.548e-03 1.390e-03 -6.870 3.37e-09 ***
sqrt(L2cache) 1.202e+00 1.821e-01 6.598 9.96e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Remove transistors:
> int00.lm <- update(int00.lm, .~. - transistors, data=int00.dat)
> summary(int00.lm)
Call:
lm(formula = nperf ~ clock + threads + cores + dieSize + voltage +
channel + L1icache + sqrt(L1icache) + L1dcache +
sqrt(L1dcache) + L2cache + sqrt(L2cache), data = int00.dat)
Residuals:
Min 1Q Median 3Q Max
-9.8861 -3.0801 -0.1871 2.4534 10.4863
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.789e+01 4.318e+01 -1.804 0.075745 .
clock 2.566e-02 1.422e-03 18.040 < 2e-16 ***
threads -1.801e+00 1.995e+00 -0.903 0.369794
cores 1.805e+00 1.132e+00 1.595 0.115496
dieSize 1.111e-02 8.807e-03 1.262 0.211407
voltage -2.379e+01 5.734e+00 -4.148 9.64e-05 ***
channel 1.512e+02 3.918e+01 3.861 0.000257 ***
L1icache 8.159e+01 2.006e+01 4.067 0.000128 ***
sqrt(L1icache) -5.386e+02 1.418e+02 -3.798 0.000317 ***
L1dcache -8.422e+01 1.914e+01 -4.401 3.96e-05 ***
sqrt(L1dcache) 5.671e+02 1.299e+02 4.365 4.51e-05 ***
L2cache -8.700e-03 1.262e-03 -6.893 2.35e-09 ***
sqrt(L2cache) 1.069e+00 1.654e-01 6.465 1.36e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Remove threads:
> int00.lm <- update(int00.lm, .~. - threads, data=int00.dat)
> summary(int00.lm)
Call:
lm(formula = nperf ~ clock + cores + dieSize + voltage + channel +
L1icache + sqrt(L1icache) + L1dcache + sqrt(L1dcache) +
L2cache + sqrt(L2cache), data = int00.dat)
Residuals:
Min 1Q Median 3Q Max
-9.7388 -3.2326 0.1496 2.6633 10.6255
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.022e+01 4.304e+01 -1.864 0.066675 .
clock 2.552e-02 1.412e-03 18.074 < 2e-16 ***
cores 2.271e+00 1.006e+00 2.257 0.027226 *
dieSize 1.281e-02 8.592e-03 1.491 0.140520
voltage -2.299e+01 5.657e+00 -4.063 0.000128 ***
channel 1.491e+02 3.905e+01 3.818 0.000293 ***
L1icache 8.131e+01 2.003e+01 4.059 0.000130 ***
sqrt(L1icache) -5.356e+02 1.416e+02 -3.783 0.000329 ***
L1dcache -8.388e+01 1.911e+01 -4.390 4.05e-05 ***
sqrt(L1dcache) 5.637e+02 1.297e+02 4.346 4.74e-05 ***
L2cache -8.567e-03 1.252e-03 -6.844 2.71e-09 ***
sqrt(L2cache) 1.040e+00 1.619e-01 6.422 1.54e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Remove dieSize:
> int00.lm <- update(int00.lm, .~. - dieSize, data=int00.dat)
> summary(int00.lm)
Call:
lm(formula = nperf ~ clock + cores + voltage + channel + L1icache
+ sqrt(L1icache) + L1dcache + sqrt(L1dcache) + L2cache +
sqrt(L2cache), data = int00.dat)
Residuals:
Min 1Q Median 3Q Max
-10.0240 -3.5195 0.3577 2.5486 12.0545
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.822e+01 3.840e+01 -1.516 0.133913
clock 2.482e-02 1.246e-03 19.922 < 2e-16 ***
cores 2.397e+00 1.004e+00 2.389 0.019561 *
voltage -2.358e+01 5.495e+00 -4.291 5.52e-05 ***
channel 1.399e+02 3.960e+01 3.533 0.000726 ***
L1icache 8.703e+01 1.972e+01 4.412 3.57e-05 ***
At this point, the p-values for all of the predictors are less than 0.02,
which is less than our predetermined threshold of 0.05. This tells us to
stop the backward elimination process. Intuition and experience tell us
that ten predictors are a rather large number to use in this type of model.
Nevertheless, all of these predictors have p-values below our significance
threshold, so we have no reason to exclude any specific predictor. We
decide to include all ten predictors in the final model.
Also notice that, as predictors drop from the model, the R2 values stay
very close to 0.965. However, the adjusted R2 value tends to increase very
slightly with each dropped predictor. This increase indicates that the model
with fewer predictors and more degrees of freedom tends to explain the
data slightly better than the previous model, which had one more predictor.
These changes in R2 values are very small, though, so we should not read
too much into them. It is possible that these changes are simply due to
random data fluctuations. Nevertheless, it is nice to see them behaving as
we expect.
Roughly speaking, the F-test compares the current model to a model
with one fewer predictor. If the current model is better than the reduced
model, the p-value will be small. In all of our models, we see that the
p-value for the F-test is quite small and consistent from model to model.
As a result, this F-test does not particularly help us discriminate between
potential models.
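The manual elimination steps above can also be written as a loop. The following is an illustrative sketch, not code from the text; it assumes every term in the model can be dropped independently:

```r
# Illustrative sketch of backward elimination: repeatedly drop the
# predictor with the largest p-value until all remaining p-values
# are at or below the threshold.
backward_eliminate <- function(model, threshold = 0.05) {
  repeat {
    # p-values for every term except the intercept
    p <- summary(model)$coefficients[-1, "Pr(>|t|)"]
    worst <- which.max(p)
    if (p[worst] <= threshold) break
    # remove the worst term and refit the model in one step
    model <- update(model, as.formula(paste(". ~ . -", names(p)[worst])))
  }
  model
}
```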
The command
> plot(fitted(int00.lm), resid(int00.lm))
produces the plot shown in Figure 4.2. We see that the residuals appear
to be somewhat uniformly scattered about zero. At least, we do not see
any obvious patterns that lead us to think that the residuals are not well
behaved. Consequently, this plot gives us no reason to believe that we have
produced a poor model.
The Q-Q plot in Figure 4.3 is generated using these commands:
> qqnorm(resid(int00.lm))
> qqline(resid(int00.lm))
We see that the residuals roughly follow the indicated line. In this plot,
we can see a bit more of a pattern and some obvious nonlinearities, leading
us to be slightly more cautious about concluding that the residuals are
normally distributed.
Figure 4.2: The fitted versus residual values for the multi-factor model de-
veloped from the Int2000 data.
Figure 4.3: The Q-Q plot for the multi-factor model developed from the
Int2000 data.
Call:
lm(formula = nperf ~ clock + threads + cores + transistors +
dieSize + voltage + featureSize + channel + FO4delay +
L1icache + sqrt(L1icache) + L1dcache + sqrt(L1dcache) +
L2cache + sqrt(L2cache))
Residuals:
14 15 16 17 18 19
0.4096 1.3957 -2.3612 0.1498 -1.5513 1.9575
dieSize NA NA NA NA
voltage NA NA NA NA
featureSize NA NA NA NA
channel NA NA NA NA
FO4delay NA NA NA NA
L1icache NA NA NA NA
sqrt(L1icache) NA NA NA NA
L1dcache NA NA NA NA
sqrt(L1dcache) NA NA NA NA
L2cache NA NA NA NA
sqrt(L2cache) NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Notice that every predictor but clock has NA for every entry. Furthermore,
we see a line that says that fourteen coefficients were “not defined because
of singularities.” This statement means that R could not compute a value
for those coefficients because of some anomalies in the data. (More techni-
cally, it could not invert the matrix used in the least-squares minimization
process.)
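R can also help locate these anomalies directly; a small illustration using the model above (int92.lm):

```r
# List the coefficients that R could not compute (reported as NA) ...
names(coef(int92.lm))[is.na(coef(int92.lm))]
# ... and report the linear dependencies among the predictors that
# caused the singularities.
alias(int92.lm)
```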
The first step toward resolving this problem is to notice that 72
observations were deleted due to “missingness,” leaving only four degrees
of freedom. We use the function nrow(int92.dat) to determine that there
are 78 total rows in this data frame. These 78 observations account for
the two predictors used in the model, plus the four remaining degrees of
freedom, plus the 72 deleted rows. When we tried to develop the model
using lm(), however, some of our data remained unused.
To determine why these rows were excluded, we must do a bit of sanity
checking to see what data anomalies may be causing the problem. The
function table() provides a quick way to summarize a data vector, to see
if anything looks obviously out of place. Executing this function on the
clock column, we obtain the following:
> table(clock)
clock
 48  50  60  64  66  70  75  77  80  85  90  96  99 100 101 110
  1   3   4   1   5   1   4   1   2   1   2   1   2  10   1   1
118 120 125 133 150 166 175 180 190 200 225 231 233 250 266 275
  1   3   4   4   3   2   2   1   1   4   1   1   2   2   2   1
291 300 333 350
  1   1   1   1
The top line shows the unique values that appear in the column. The
list of numbers directly below that line is the count of how many times
that particular value appeared in the column. For example, 48 appeared
once, while 50 appeared three times and 60 appeared four times. We see a
reasonable range of values with minimum (48) and maximum (350) values
that are not unexpected. Some of the values occur only once; the most
frequent value occurs ten times, which again does not seem unreasonable.
In short, we do not see anything obviously amiss with these results. We
conclude that the problem likely is with a different data column.
Executing the table() function on the next column in the data frame
threads produces this output:
> table(threads)
threads
1
78
Aha! Now we are getting somewhere. This result shows that all of the
78 entries in this column contain the same value: 1. An input factor in
which all of the elements are the same value has no predictive power in
a regression model. If every row has the same value, we have no way to
distinguish one row from another. Thus, we conclude that threads is not a
useful predictor for our model and we eliminate it as a potential predictor
as we continue to develop our Int1992 regression model.
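Checking columns one at a time with table() works well for exploration; for larger data frames, the same single-value check can be sketched as a small helper function (this helper is illustrative, not from the text):

```r
# Return the names of all columns whose non-NA entries are a single
# repeated value; such columns have no predictive power.
find_constant_columns <- function(df) {
  is_constant <- sapply(df, function(col) {
    v <- col[!is.na(col)]
    length(v) > 0 && length(unique(v)) == 1
  })
  names(df)[is_constant]
}
```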
We continue by executing table() on the column labeled cores. This
operation shows that this column also consists of only a single value, 1. Us-
ing the update() function to eliminate these two predictors from the model
gives the following:
> int92.lm <- update(int92.lm, .~. - threads - cores)
> summary(int92.lm)
Call:
lm(formula = nperf ~ clock + transistors + dieSize + voltage +
featureSize + channel + FO4delay + L1icache + sqrt(L1icache) +
L1dcache + sqrt(L1dcache) + L2cache + sqrt(L2cache))
Residuals:
14 15 16 17 18 19
0.4096 1.3957 -2.3612 0.1498 -1.5513 1.9575
Running table() on the remaining columns eventually reveals the problem:
the L2cache column contains only three unique values. Although these
specific data values do not look out of place, having only three unique
values can make it impossible for lm() to compute the model coefficients.
Dropping L2cache and sqrt(L2cache) as potential predictors finally
produces the type of result we expect:
> int92.lm <- update(int92.lm, .~. - L2cache - sqrt(L2cache))
> summary(int92.lm)
Call:
lm(formula = nperf ~ clock + transistors + dieSize + voltage +
featureSize + channel + FO4delay + L1icache + sqrt(L1icache) +
L1dcache + sqrt(L1dcache))
Residuals:
Min 1Q Median 3Q Max
-7.3233 -1.1756 0.2151 1.0157 8.0634
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -58.51730 17.70879 -3.304 0.00278 **
clock 0.23444 0.01792 13.084 6.03e-13 ***
transistors -0.32032 1.13593 -0.282 0.78018
dieSize 0.25550 0.04800 5.323 1.44e-05 ***
voltage 1.66368 1.61147 1.032 0.31139
featureSize 377.84287 69.85249 5.409 1.15e-05 ***
channel -493.84797 88.12198 -5.604 6.88e-06 ***
FO4delay 0.14082 0.08581 1.641 0.11283
L1icache 4.21569 1.74565 2.415 0.02307 *
sqrt(L1icache) -12.33773 7.76656 -1.589 0.12425
L1dcache -5.53450 2.10354 -2.631 0.01412 *
sqrt(L1dcache) 23.89764 7.98986 2.991 0.00602 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Call:
lm(formula = nperf ~ clock + dieSize + voltage + featureSize +
channel + FO4delay + L1icache + sqrt(L1icache) + L1dcache +
sqrt(L1dcache))
Residuals:
Min 1Q Median 3Q Max
-13.2935 -3.6068 -0.3808 2.4535 19.9617
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Call:
lm(formula = nperf ~ clock + dieSize + featureSize + channel +
FO4delay + L1icache + sqrt(L1icache) + L1dcache +
sqrt(L1dcache) + transistors)
Residuals:
Min 1Q Median 3Q Max
-10.0828 -1.3106 0.1447 1.5501 8.7589
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -50.28514 15.27839 -3.291 0.002700 **
clock 0.21854 0.01718 12.722 3.71e-13 ***
dieSize 0.20348 0.04401 4.623 7.77e-05 ***
featureSize 409.68604 67.00007 6.115 1.34e-06 ***
channel -490.99083 86.23288 -5.694 4.18e-06 ***
FO4delay 0.12986 0.09159 1.418 0.167264
The adjusted R-squared value now is 0.9746, which is much closer to the
adjusted R-squared value we had before dropping transistors. Continuing
with the backward elimination process, we first drop sqrt(L1icache) with a
p-value of 0.471413, then FO4delay with a p-value of 0.180836, and finally
sqrt(L1dcache) with a p-value of 0.071730.
After completing this backward elimination process, we find that the
following predictors belong in the final model for Int1992. As shown
below, all of these predictors have p-values below our threshold of 0.05.
Additionally, the adjusted R-squared looks quite good at 0.9722.
> int92.lm <- update(int92.lm, .~. -sqrt(L1dcache))
> summary(int92.lm)
Call:
lm(formula = nperf ~ clock + dieSize + featureSize + channel +
L1icache + L1dcache + transistors, data = int92.dat)
Residuals:
Min 1Q Median 3Q Max
-10.1742 -1.5180 0.1324 1.9967 10.1737
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -34.17260 5.47413 -6.243 6.16e-07 ***
clock 0.18973 0.01265 15.004 9.21e-16 ***
dieSize 0.11751 0.02034 5.778 2.31e-06 ***
featureSize 305.79593 52.76134 5.796 2.20e-06 ***
channel -328.13544 53.04160 -6.186 7.23e-07 ***
L1icache 0.78911 0.16045 4.918 2.72e-05 ***
L1dcache -0.23335 0.03222 -7.242 3.80e-08 ***
This example illustrates that you cannot always look at only the p-values
to determine which potential predictors to eliminate in each step of the
backward elimination process. You also must be careful to look at the
broader picture, such as changes in the adjusted R-squared value and large
changes in the p-values of other predictors, after each change to the model.
CHAPTER 5. PREDICTING RESPONSES
would be like copying exam answers from the answer key and then using
that same answer key to grade your exam. Of course you would get a
perfect result. Instead, we must use one set of data to train the model and
another set of data to test it.
The difficulty with this train-test process is that we need separate but
similar data sets. A standard way to find these two different data sets is
to split the available data into two parts. We take a random portion of all
the available data and call it our training set. We then use this portion of
the data in the lm() function to compute the specific values of the model’s
coefficients. We use the remaining portion of the data as our testing set to
see how well the model predicts the results, compared to this test data.
The following sequence of operations splits the int00.dat data set into
the training and testing sets:
rows <- nrow(int00.dat)
f <- 0.5
upper_bound <- floor(f * rows)
permuted_int00.dat <- int00.dat[sample(rows), ]
train.dat <- permuted_int00.dat[1:upper_bound, ]
test.dat <- permuted_int00.dat[(upper_bound+1):rows, ]
The first line assigns the total number of rows in the int00.dat data
frame to the variable rows. The next line assigns to the variable f the frac-
tion of the entire data set we wish to use for the training set. In this case, we
somewhat arbitrarily decide to use half of the data as the training set and
the other half as the testing set. The floor() function rounds its argument
value down to the nearest integer. So the line upper_bound <- floor(f *
rows) assigns the middle row’s index number to the variable upper_bound.
The interesting action happens in the next line. The sample() function
returns a permutation of the integers between 1 and n when we give it
the integer value n as its input argument. In this code, the expression
sample(rows) returns a vector that is a permutation of the integers between
1 and rows, where rows is the total number of rows in the int00.dat data
frame. Using this vector as the row index for this data frame gives a ran-
dom permutation of all of the rows in the data frame, which we assign to the
new data frame, permuted_int00.dat. The next two lines assign the first
portion of this new data frame to the training data set and the remaining
portion to the testing data set, respectively. This randomization process
ensures that we obtain a new random selection of the rows in the
train-and-test data sets each time we execute this sequence of operations.
Figure 5.1: The training and testing process for evaluating the predictions
produced by a regression model.
We first train a new model, int00_new.lm, by calling lm() with the
train.dat data frame as its input data.
The predict() function takes this new model as one of its arguments. It
uses this model to compute the predicted outputs when we use the test.dat
data frame as the input, as follows:
predicted.dat <- predict(int00_new.lm, newdata=test.dat)
For each processor in the test set, we then compute the difference between
the predicted and measured performance:
delta <- predicted.dat - test.dat$nperf
Note that we use the $ notation to select the column with the output value,
nperf, from the test.dat data frame.
The mean of these ∆ differences for n different processors is:

\bar{\Delta} = \frac{1}{n} \sum_{i=1}^{n} \Delta_i \qquad (5.1)
A confidence interval computed for this mean will give us some indication
of how well a model trained on the train.dat data set predicted the per-
formance of the processors in the test.dat data set. The t.test() function
computes a confidence interval for the desired confidence level of these ∆i
values as follows:
> t.test(delta, conf.level = 0.95)
data: delta
t = -0.65496, df = 41, p-value = 0.5161
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-2.232621 1.139121
sample estimates:
mean of x
-0.5467502
A value of ∆i > 0 means that the model predicted that the performance was
higher than it actually was; ∆i < 0, on the other hand, means that the
model predicted that the performance was lower than it actually was.
Consequently, if the predictions
were reasonably good, we would expect to see a tight confidence interval
around zero. In this case, we obtain a 95 percent confidence interval of
[-2.23, 1.14]. Given that nperf is scaled to between 0 and 100, this is a
reasonably tight confidence interval that includes zero. Thus, we conclude
that the model is reasonably good at predicting values in the test.dat data
set when trained on the train.dat data set.
Another way to get a sense of the predictions’ quality is to generate a
scatter plot of the ∆i values using the plot() function:
plot(delta)
This function call produces the plot shown in Figure 5.2. Good predictions
would produce a tight band of values uniformly scattered around zero. In
this figure, we do see such a distribution, although there are a few outliers
that are more than ten points above or below zero.
It is important to realize that the sample() function will return a different
random permutation each time we execute it. These differing permutations
will partition different processors (i.e., rows in the data frame) into the train
and test sets. Thus, if we run this experiment again with exactly the same
inputs, we will likely get a different confidence interval and ∆i scatter plot.
For example, when we repeat the same test five times with identical inputs,
we obtain the following confidence intervals: [-1.94, 1.46], [-1.95, 2.68],
[-2.66, 3.81], [-6.13, 0.75], [-4.21, 5.29]. Similarly, varying the fraction
of the data we assign to the train and test sets by changing f = 0.5 also
changes the results.
It is good practice to run this type of experiment several times and ob-
serve how the results change. If you see the results vary wildly when you
re-run these tests, you have good reason for concern. On the other hand,
a series of similar results does not necessarily mean your results are good,
only that they are consistently reproducible. It is often easier to spot a bad
model than to determine that a model is good.
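The repeated experiments described above can be scripted so the whole split-train-test sequence runs in one call. The following is an illustrative sketch (the helper function name is ours; the formula is the ten-predictor Int2000 model from the text):

```r
# Run one random train-and-test split and return the 95 percent
# confidence interval for the prediction differences.
run_split <- function(df, f = 0.5) {
  rows <- nrow(df)
  upper_bound <- floor(f * rows)
  permuted <- df[sample(rows), ]
  train.dat <- permuted[1:upper_bound, ]
  test.dat  <- permuted[(upper_bound + 1):rows, ]
  m <- lm(nperf ~ clock + cores + voltage + channel + L1icache +
          sqrt(L1icache) + L1dcache + sqrt(L1dcache) + L2cache +
          sqrt(L2cache), data = train.dat)
  delta <- predict(m, newdata = test.dat) - test.dat$nperf
  t.test(delta, conf.level = 0.95)$conf.int
}
# Five repetitions give five (generally different) intervals:
replicate(5, run_split(int00.dat))
```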
Based on the repeated confidence interval results and the corresponding
scatter plot, similar to Figure 5.2, we conclude that this model is reasonably
good at predicting the performance of a set of processors when the model
is trained on a different set of processors executing the same benchmark
Figure 5.2: An example scatter plot of the differences between the pre-
dicted and actual performance results for the Int2000 bench-
mark when using the data-splitting technique to train and test
the model.
program. It is not perfect, but it is also not too bad. Whether the differences
are large enough to warrant concern is up to you.
To perform this experiment, we first train the model using all the Int2000
data available in the int00.dat data frame. We then predict the Fp2000
results using this model and the fp00.dat data. Again,
we assign the differences between the predicted and actual results to the
vector delta. Figure 5.3 shows the overall data flow for this training and
testing. The corresponding R commands are:
> int00.lm <- lm(nperf ~ clock + cores + voltage + channel +
L1icache + sqrt(L1icache) + L1dcache + sqrt(L1dcache) +
L2cache + sqrt(L2cache), data = int00.dat)
> predicted.dat <- predict(int00.lm, newdata=fp00.dat)
> delta <- predicted.dat - fp00.dat$nperf
> t.test(delta, conf.level = 0.95)
data: delta
t = 1.5231, df = 80, p-value = 0.1317
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-0.4532477 3.4099288
sample estimates:
mean of x
1.478341
Figure 5.3: Predicting the Fp2000 results using the model developed with
the Int2000 data.
The resulting confidence interval for the delta values contains zero and
is relatively small. This result suggests that the model developed using
the Int2000 data is reasonably good at predicting the Fp2000 benchmark
program’s results. The scatter plot in Figure 5.4 shows the resulting delta
values for each of the processors we used in the prediction. The results
tend to be randomly distributed around zero, as we would expect from
a good regression model. Note, however, that some of the values differ
significantly from zero. The maximum positive deviation is almost 20,
and the magnitude of the largest negative value is greater than 43. The
confidence interval suggests relatively good results, but this scatter plot
shows that not all the values are well predicted.
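A quick way to locate the poorly predicted processors mentioned above is to inspect delta directly; for example:

```r
# The largest positive and negative prediction errors ...
range(delta)
# ... and the positions of the processors whose predictions miss
# by more than 20 points of nperf.
which(abs(delta) > 20)
```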
Figure 5.4: A scatter plot of the differences between the predicted and ac-
tual performance results for the Fp2000 benchmark when pre-
dicted using the Int2000 regression model.
Repeating this experiment, this time using the Int2000 model to predict
the results of the next-generation Int2006 benchmark, produces the
following output:
data: delta
t = 49.339, df = 168, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
48.87259 52.94662
sample estimates:
mean of x
50.9096
In this case, the confidence interval for the delta values does not include
zero. In fact, the mean value of the differences is 50.9096, which indicates
that the average of the model-predicted values is substantially larger than
the actual average value. The scatter plot shown in Figure 5.5 further con-
firms that the predicted values are all much larger than the actual values.
This example is a good reminder that models have their limits. Appar-
ently, there are more factors that affect the performance of the next gener-
ation of the benchmark programs, Int2006, than the model we developed
using the Int2000 results captures. To develop a model that better predicts
future performance, we would have to uncover those factors. Doing so
requires a deeper understanding of the factors that affect computer perfor-
mance, which is beyond the scope of this tutorial.
Figure 5.5: A scatter plot of the differences between the predicted and ac-
tual performance results for the Int2006 benchmark, predicted
using the Int2000 regression model.
CHAPTER 6. READING DATA INTO THE R ENVIRONMENT
> processors <- read.csv("all-data.csv")
The name between the quotes is the name of the csv-formatted file to be
read. Each file line corresponds to one data record. Commas separate the
individual data fields in each record. This function assigns each data record
to a new row in the data frame, and assigns each data field to the corre-
sponding column. When this function completes, the variable processors
contains all the data from the file all-data.csv nicely organized into rows
and columns in a data frame.
If you type processors to see what is stored in the data frame, you will
get a long, confusing list of data. Typing
> head(processors)
will show a list of column headings and the values of the first few rows of
data. From this list, we can determine which columns to extract for our
model development. Although this is conceptually a simple problem, the
execution can be rather messy, depending on how the data was collected
and organized in the file.
As with any programming language, R lets you define your own func-
tions. This feature is useful when you must perform a sequence of opera-
tions multiple times on different data pieces, for instance. The format for
defining a function is:
function-name <- function(a1, a2, ...) {
R expressions
return(object)
}
where function-name is the function name you choose and a1, a2, ... is
the list of arguments in your function. The R system evaluates the expres-
sions in the body of the definition when the function is called. A function
can return any type of data object using the return() statement.
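For example, a trivial function following this format (the name and arguments here are purely illustrative):

```r
# Normalize a performance value x relative to a reference maximum,
# returning a score between 0 and 100.
normalize <- function(x, max_value) {
  result <- x * 100 / max_value
  return(result)
}
normalize(50, 200)   # returns 25
```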
We will define a new function called extract_data that extracts every row
with a result for the given benchmark program from the processors data
frame and assigns those rows to a corresponding new data frame, such as
int92.dat, fp92.dat, and so on.
We define the extract_data function as follows:
extract_data <- function(benchmark) {
The first line with the paste functions looks rather complicated. How-
ever, it simply forms the name of the column with the given benchmark
results. For example, when extract_data is called with Int2000 as the ar-
gument, the nested paste functions simply concatenate the strings "Spec",
"Int2000", and "..average.base.". The final string corresponds to the
name of the column in the processors data frame that contains the perfor-
mance results for the Int2000 benchmark, "SpecInt2000..average.base.".
The next line calls the function get_column, which selects all the rows
with the desired column name. In this case, that column contains the actual
performance result reported for the given benchmark program, perf. The
next four lines compute the normalized performance value, nperf, from the
perf value we obtained from the data frame. The following sequence of
calls to get_column extracts the data for each of the predictors we intend to
use in developing the regression model. Note that the second parameter in
each case, such as "Processor.Clock..MHz.", is the name of a column in the
processors data frame. Finally, the data.frame() function is a predefined
R function that assembles all its arguments into a single data frame. The
new function we have just defined, extract_data(), returns this new data
frame.
Next, we define the get_column() function to return all the data in a given
column for which the given benchmark program has been defined:
get_column <- function(x,y) {
The argument x is a string with the name of the benchmark program, and y
is a string with the name of the desired column. The nested paste() func-
tions produce the same result as the extract_data() function. The is.na()
function performs the interesting work. It returns a vector with TRUE in
the positions corresponding to the rows of the processors data frame that
have NA values in the column selected by the benchmark index, and FALSE
wherever a value is present. Thus, is.na() indicates which rows are
missing performance results for the benchmark of interest. Inserting the
exclamation point in front of this function complements its output. As a
result, the variable ix will contain a vector that identifies every row
that contains performance results for the indicated benchmark program.
The function then extracts the selected rows from the given column and
returns them.
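Putting the description above together, get_column might look roughly like the following; this is a sketch reconstructed from the text, not the book's exact code:

```r
get_column <- function(x, y) {
  # Form the name of the column holding this benchmark's results,
  # e.g., "SpecInt2000..average.base." when x is "Int2000"
  benchmark <- paste("Spec", paste(x, "..average.base.", sep = ""), sep = "")
  # ix is TRUE for every row that has a result for this benchmark
  ix <- !is.na(processors[, benchmark])
  # Return the values of column y for exactly those rows
  return(processors[ix, y])
}
```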
CHAPTER 7. SUMMARY
Ultimately, you need to feel confident that your data set’s values are
reasonable and consistent.
7. Predict.
Now that you have a model that you feel appropriately explains your
data, you can use it to predict previously unknown output values.
Here are a few suggested exercises to help you learn more about regression
modeling using R.
1. Show how you would clean the data set for one of the selected bench-
mark results (Int1992, Int1995, etc.). For example, for every column
in the data frame, you could:
• Compute the average, variance, minimum, and maximum.
• Sort the column data to look for outliers or unusual patterns.
• Determine the fraction of NA values for each column.
How else could you verify that the data looks reasonable?
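As a starting point for this exercise, the per-column summaries can be sketched in a few lines, assuming the columns of interest are numeric (int92.dat is used as an example data frame):

```r
# Compute the mean, variance, minimum, maximum, and fraction of NA
# values for every column of the data frame in one pass.
sapply(int92.dat, function(col)
  c(mean        = mean(col, na.rm = TRUE),
    variance    = var(col,  na.rm = TRUE),
    minimum     = min(col,  na.rm = TRUE),
    maximum     = max(col,  na.rm = TRUE),
    na.fraction = mean(is.na(col))))
```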
2. Plot the processor performance versus the clock frequency for each
of the benchmark results, similar to Figure 3.1.
CHAPTER 8. A FEW THINGS TO TRY NEXT
10. What can you say about these models’ predictive abilities, based on
the results from the previous problem? For example, how well does
a model developed for the integer benchmarks predict the same-year
performance of the floating-point benchmarks? What about predic-
tions across benchmark generations?
12. Repeat the previous problem, varying f for all the other data sets.
77
Linear Regression Using R: An Introduction to Data Modeling presents
one of the fundamental data modeling techniques in an informal
tutorial style. Learn how to predict system outputs from measured
data using a detailed step-by-step process to develop, train, and test
reliable regression models. Key modeling and programming concepts
are intuitively described using the R programming language. All of the
necessary resources are freely available online.