
Lecture notes 7 March 28, 2016

Sparse regression

1 Linear regression
In statistics, the problem of regression is that of learning a function that allows us to estimate
a certain quantity of interest, the response or dependent variable, from several observed
variables, known as covariates, features or independent variables. For example, we might be
interested in estimating the price of a house based on its size, the number of rooms,
the year it was built, etc. The function that models the relation between the response and
the predictors is learnt from training data and can then be used to predict the response for
new examples.
In linear regression, we assume that the response is well modeled as a linear combination of
the predictors. The model is parametrized by an intercept β0 ∈ R and a vector of weights
β ∈ Rp , where p is the number of predictors. Let us assume that we have n data points
consisting of a value of the response yi and the corresponding values of the predictors Xi1 ,
Xi2 , . . . , Xip , 1 ≤ i ≤ n. The linear model is given by
$$y_i \approx \beta_0 + \sum_{j=1}^{p} \beta_j X_{ij}, \qquad 1 \le i \le n, \qquad (1)$$

or, in matrix form,

$$\begin{bmatrix} y_1 \\ y_2 \\ \cdots \\ y_n \end{bmatrix} \approx \begin{bmatrix} 1 \\ 1 \\ \cdots \\ 1 \end{bmatrix} \beta_0 + \begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1p} \\ X_{21} & X_{22} & \cdots & X_{2p} \\ \cdots & \cdots & & \cdots \\ X_{n1} & X_{n2} & \cdots & X_{np} \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \\ \cdots \\ \beta_p \end{bmatrix} = \mathbf{1}\,\beta_0 + X\beta. \qquad (2)$$

We have already encountered linear models that arise in inverse problems such as compressed
sensing or super-resolution. A major difference between statistics and those applications is
that in statistics the ultimate aim is to predict y accurately for new values of the predictors,
not to estimate β. The role of β is merely to quantify the linear relation between the response
and the predictors. In contrast, when solving an inverse problem the main objective is to
determine β, which has a physical interpretation: an image of a 2D slice of a body in MRI,
the spectrum of a multisinusoidal signal in spectral super-resolution, reflection coefficients of
strata in seismography, etc.

1.1 Least-squares estimation

To calibrate the linear regression model, we estimate the weight vector from the training
data. By far the most popular method to compute the estimate is to minimize the `2 norm
of the fitting error on the training set. In more detail, the weight estimate βls is the solution
of the least-squares problem

$$\underset{\tilde{\beta}}{\text{minimize}} \quad \left\| y - X\tilde{\beta} \right\|_2. \qquad (3)$$

The least-squares cost function is convenient from a computational view, since it is convex
and can be minimized efficiently (in fact, as we will see in a moment it has a closed-form
solution). In addition, as detailed in Proposition 1.3 below, it has a reasonable probabilistic
interpretation.
The following proposition, proved in Section A.1 of the appendix, shows that the least-squares
problem has a closed-form solution.
Proposition 1.1 (Least-squares solution). For n ≥ p, if X is full rank then the solution to
the least-squares problem (3) is

$$\beta_{\mathrm{ls}} := \left( X^T X \right)^{-1} X^T y. \qquad (4)$$

A corollary to this result provides a geometric interpretation for the least-squares estimate
of y: it is obtained by projecting the response onto the column space of the matrix formed
by the predictors.
Corollary 1.2. For n ≥ p, if X is full rank then Xβls is the projection of y onto the column
space of X.

Proof. Let X = UΣV^T be the singular-value decomposition of X. Since X is full rank and
n ≥ p, we have U^T U = I, V^T V = I and Σ is a square invertible matrix, which implies

$$X\beta_{\mathrm{ls}} = X \left( X^T X \right)^{-1} X^T y \qquad (5)$$
$$= U \Sigma V^T \left( V \Sigma U^T U \Sigma V^T \right)^{-1} V \Sigma U^T y \qquad (6)$$
$$= U U^T y. \qquad (7)$$
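The closed-form solution and its geometric interpretation are easy to verify numerically. The following sketch (our illustration, not part of the original notes; it assumes numpy is available) computes β_ls from the normal equations and checks that Xβ_ls coincides with the orthogonal projection of y onto the column space of X, as stated in Corollary 1.2.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5                         # n >= p so that X^T X is invertible
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Least-squares solution (4): beta_ls = (X^T X)^{-1} X^T y
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)

# Projection of y onto the column space of X computed from the SVD: U U^T y
U, s, Vt = np.linalg.svd(X, full_matrices=False)
projection = U @ (U.T @ y)

print(np.allclose(X @ beta_ls, projection))   # True: X beta_ls is the projection
```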

Proposition 1.3 (Least-squares solution as maximum-likelihood estimate). If we model y
as

$$y = X\beta + z, \qquad (8)$$

where X ∈ R^{n×p}, n ≥ p, β ∈ R^p and y ∈ R^n are fixed and the entries of z ∈ R^n are
iid Gaussian random variables with mean zero and the same variance, then the maximum-
likelihood estimate of β given y is equal to βls.

The proposition is proved in Section A.2 of the appendix.

1.2 Preprocessing

The careful reader might notice that we have not explained how to fit the intercept β0 . Before
fitting a linear regression model, we typically perform the following preprocessing steps.

1. Centering each predictor column Xj by subtracting its mean

$$\mu_j := \frac{1}{n} \sum_{i=1}^{n} X_{ij}, \qquad (9)$$

so that each column of X has mean zero.

2. Normalizing each predictor column Xj by dividing by

$$\sigma_j := \sqrt{ \sum_{i=1}^{n} \left( X_{ij} - \mu_j \right)^2 }, \qquad (10)$$

so that each column of X has `2 norm equal to one. The objective is to make the
estimate invariant to the units used to measure each predictor.

3. Centering the response vector y by subtracting its mean

$$\mu_y := \frac{1}{n} \sum_{i=1}^{n} y_i. \qquad (11)$$

By the following lemma, once we have centered the data the intercept of the least-squares
fit is equal to zero.

Lemma 1.4 (Intercept). If the mean of y and of each of the columns of X is zero, then the
intercept in the least-squares solution is also zero.

The lemma is proved in Section A.3 of the appendix.


Once we have solved the least-squares problem using the centered and normalized data to obtain
βls, we can use the model to estimate the response corresponding to a new vector of predictors
x ∈ R^p by computing

$$f(x) := \mu_y + \sum_{i=1}^{p} \beta_{\mathrm{ls},i}\, \frac{x_i - \mu_i}{\sigma_i}. \qquad (12)$$
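As a concrete illustration of this pipeline, the following sketch (ours, assuming numpy is available; the function and variable names are not from the notes) centers and normalizes the predictors, centers the response, solves the least-squares problem, and predicts on new data using (12).

```python
import numpy as np

def fit_linear_model(X, y):
    """Center/normalize the predictors, center the response, fit by least squares."""
    mu = X.mean(axis=0)                      # predictor means (9)
    sigma = np.linalg.norm(X - mu, axis=0)   # column norms after centering (10)
    mu_y = y.mean()                          # response mean (11)
    Xc = (X - mu) / sigma
    beta_ls, *_ = np.linalg.lstsq(Xc, y - mu_y, rcond=None)
    return mu, sigma, mu_y, beta_ls

def predict(x, mu, sigma, mu_y, beta_ls):
    """Estimate the response for a new vector of predictors x, as in (12)."""
    return mu_y + np.dot(beta_ls, (x - mu) / sigma)

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
y = 2.0 + X @ np.array([1.0, -0.5, 0.3]) + 0.1 * rng.standard_normal(200)
params = fit_linear_model(X, y)
print(predict(np.array([0.2, -1.0, 0.5]), *params))
```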

1.3 Overfitting

Imagine that a friend tells you:


I found a cool way to predict the temperature in New York: It’s just a linear combination of
the temperature in every other state. I fit the model on data from the last month and a half
and it’s perfect!
Your friend is not lying, but the problem is that she is using a number of data points to fit
the linear model that is roughly the same as the number of parameters. If n = p we can find
a β such that y = Xβ exactly, even if y and X have nothing to do with each other! This is
called overfitting and is usually caused by using a model that is too flexible for the amount
of data available.
To evaluate whether a model suffers from overfitting we separate the data into a training set
and a test set. The training set is used to fit the model and the test set is used to evaluate
the error. A model that overfits the training set will have a very low error when evaluated
on the training examples, but will not generalize well to the test examples.
Figure 1 shows the result of evaluating the training error and the test error of a linear model
with p = 50 parameters fitted from n training examples. The training and test data are
generated by fixing a vector of weights β and then computing

ytrain = Xtrain β + ztrain , (13)


ytest = Xtest β, (14)

where the entries of Xtrain , Xtest , ztrain and β are sampled independently at random from a
Gaussian distribution with zero mean and unit variance. The training and test errors are
defined as

$$\text{error}_{\text{train}} = \frac{\left\| X_{\text{train}}\, \beta_{\mathrm{ls}} - y_{\text{train}} \right\|_2}{\left\| y_{\text{train}} \right\|_2}, \qquad (15)$$
$$\text{error}_{\text{test}} = \frac{\left\| X_{\text{test}}\, \beta_{\mathrm{ls}} - y_{\text{test}} \right\|_2}{\left\| y_{\text{test}} \right\|_2}. \qquad (16)$$
Note that even the true β does not achieve zero training error because of the presence of the
noise, but the test error is actually zero if we manage to estimate β exactly.
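The experiment of Figure 1 can be reproduced with a few lines of code. This is a sketch under our own choices (numpy, a test set of 1000 examples, an arbitrary grid of values of n); it is meant to illustrate the computation of (13)–(16), not to replicate the exact figure.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 50
beta = rng.standard_normal(p)                     # fixed true weights

for n in [50, 100, 200, 500]:
    X_train = rng.standard_normal((n, p))
    y_train = X_train @ beta + rng.standard_normal(n)   # (13)
    X_test = rng.standard_normal((1000, p))
    y_test = X_test @ beta                               # (14), noiseless test response

    beta_ls, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

    err_train = np.linalg.norm(X_train @ beta_ls - y_train) / np.linalg.norm(y_train)  # (15)
    err_test = np.linalg.norm(X_test @ beta_ls - y_test) / np.linalg.norm(y_test)      # (16)
    print(f"n={n:4d}  train error {err_train:.3f}  test error {err_test:.3f}")
```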
The training error of the linear model grows with n. This makes sense as the model has to fit
more data using the same number of parameters. When n is close to p (50), the fitted model
is much better than the true model at replicating the training data (the error of the true
model is shown in green). This is a sign of overfitting: the model is adapting to the noise
and not learning the true linear structure. Indeed, in that regime the test error is extremely
high. At larger n, the training error rises to the level achieved by the true linear model and
the test error decreases, indicating that we are learning the underlying model.


Figure 1: Relative `2 -norm error in estimating the response achieved using least-squares regression
for different values of n (the number of training data). The training error is plotted in blue, whereas
the test error is plotted in red. The green line indicates the training error of the true model used
to generate the data.

1.4 Theoretical analysis of linear regression

In this section we analyze the least-squares regression fit in order to understand how its
accuracy depends on the number of training examples. The following theorem, proved in
Section A.4 of the appendix, characterizes the error in estimating the weights under the
assumption that the data are indeed generated by a linear model with additive noise.
Theorem 1.5. Assume the data y are generated according to a linear model with additive
noise,

y = Xβ + z, (17)

where X ∈ Rn×p and β ∈ Rp , and that the entries of z ∈ Rn are drawn independently at
random from a Gaussian distribution with zero mean and variance σz2 . The least-squares
estimate

$$\beta_{\mathrm{ls}} := \arg\min_{\tilde{\beta}} \left\| y - X\tilde{\beta} \right\|_2 \qquad (18)$$

satisfies

$$\frac{p\,\sigma_z^2\,(1-\epsilon)}{\sigma_{\max}^2} \le \left\| \beta - \beta_{\mathrm{ls}} \right\|_2^2 \le \frac{p\,\sigma_z^2\,(1+\epsilon)}{\sigma_{\min}^2} \qquad (19)$$

with probability $1 - 2\exp\left(-\tfrac{p\,\epsilon^2}{8}\right)$, where ε ∈ (0, 1) is fixed and σmin and σmax denote the smallest and largest singular
values of X respectively.

The bounds in the theorem are in terms of the singular values of the predictor matrix
X ∈ R^{n×p}. To provide some intuition as to the scaling of these singular values when we fix
p and increase n, let us assume that the entries of X are drawn independently at random
from a standard Gaussian distribution. Then by Proposition 3.4 in Lecture Notes 5 both
σmin and σmax are close to √n with high probability as long as n > C p for some constant
C. This implies that if the variance of the noise z equals one,

$$\left\| \beta - \beta_{\mathrm{ls}} \right\|_2 \approx \sqrt{\frac{p}{n}}. \qquad (20)$$
If each of the entries of β has constant magnitude, the `2 norm of β is approximately proportional
to √p. In this case, the theoretical analysis predicts that the normalized error satisfies

$$\frac{\left\| \beta - \beta_{\mathrm{ls}} \right\|_2}{\left\| \beta \right\|_2} \approx \frac{1}{\sqrt{n}}. \qquad (21)$$
Figure 2 shows the result of a simulation where the entries of X, β and z are all generated by
sampling independently from standard Gaussian random variables. The relative error scales
precisely as 1/√n.
For a fixed number of predictors, this analysis indicates that the least-squares solution
converges to the true weights as the number of data points n grows. In statistics jargon, the estimator
is consistent. The result suggests two scenarios in which the least-squares estimator may not
yield a good estimate: when p is of the same order as n and when some of the predictors are
highly correlated, as some of the singular values of X may be very small.
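The 1/√n scaling in (21) can be checked directly with a short simulation; the following sketch (ours, assuming numpy; the grid of values of n is arbitrary) compares the relative coefficient error with the predicted rate.

```python
import numpy as np

rng = np.random.default_rng(3)
p = 50
beta = rng.standard_normal(p)

for n in [500, 2000, 8000]:
    X = rng.standard_normal((n, p))
    z = rng.standard_normal(n)           # noise with unit variance
    beta_ls, *_ = np.linalg.lstsq(X, X @ beta + z, rcond=None)
    rel_err = np.linalg.norm(beta - beta_ls) / np.linalg.norm(beta)
    # The relative error should be close to 1/sqrt(n), as predicted by (21)
    print(f"n={n:5d}  relative error {rel_err:.4f}  1/sqrt(n) {1/np.sqrt(n):.4f}")
```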

1.5 Global warming

In this section we describe the application of linear regression to climate data. In particular,
we analyze temperature data taken at a weather station in Oxford over 150 years.1 Our
objective is not to perform prediction, but rather to determine whether temperatures have
risen or decreased during the last 150 years in Oxford.
In order to separate the temperature into different components that account for seasonal
effects, we use a simple linear model with three predictors and an intercept,

$$y \approx \beta_0 + \beta_1 \cos\left( \frac{2\pi t}{12} \right) + \beta_2 \sin\left( \frac{2\pi t}{12} \right) + \beta_3\, t, \qquad (22)$$
1
The data is available at http://www.metoffice.gov.uk/pub/data/weather/uk/climate/
stationdata/oxforddata.txt.


Figure 2: Relative `2 -norm error of the least-squares estimate as n grows. The entries of X, β
and z are all generated by sampling independently from standard Gaussian random variables. The
simulation is consistent with (21).

where t denotes the time in months. The corresponding matrix of predictors is

$$X := \begin{bmatrix} 1 & \cos\left( \frac{2\pi t_1}{12} \right) & \sin\left( \frac{2\pi t_1}{12} \right) & t_1 \\ 1 & \cos\left( \frac{2\pi t_2}{12} \right) & \sin\left( \frac{2\pi t_2}{12} \right) & t_2 \\ \cdots & \cdots & \cdots & \cdots \\ 1 & \cos\left( \frac{2\pi t_n}{12} \right) & \sin\left( \frac{2\pi t_n}{12} \right) & t_n \end{bmatrix}. \qquad (23)$$

The intercept β0 represents the mean temperature, β1 and β2 account for periodic yearly
fluctuations, and β3 is the overall trend. If β3 is positive then the model indicates that
temperatures are increasing; if it is negative then it indicates that temperatures are decreasing.
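To make the construction of the predictor matrix in (23) concrete, here is a sketch of how the model could be fit in practice (our own code, assuming numpy; temps is a hypothetical array of monthly temperatures, and the synthetic data at the end are only for illustration).

```python
import numpy as np

def fit_seasonal_trend(temps):
    """Fit the model (22): intercept, yearly cosine/sine terms and a linear trend."""
    n = len(temps)
    t = np.arange(n)                                   # time in months
    X = np.column_stack([np.ones(n),
                         np.cos(2 * np.pi * t / 12),
                         np.sin(2 * np.pi * t / 12),
                         t])                           # matrix of predictors (23)
    coeffs, *_ = np.linalg.lstsq(X, temps, rcond=None)
    beta0, beta1, beta2, beta3 = coeffs
    return beta0, beta1, beta2, beta3 * 12 * 100       # trend in degrees per 100 years

# Example with synthetic monthly data spanning 150 years
rng = np.random.default_rng(4)
months = np.arange(150 * 12)
synthetic = 10 + 5 * np.cos(2 * np.pi * months / 12) + 0.0007 * months \
            + rng.standard_normal(len(months))
print(fit_seasonal_trend(synthetic))
```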
The results of fitting the linear model are shown in Figures 3 and 4. The fitted model
indicates that both the maximum and minimum temperatures have an increasing trend of
about 0.8 degrees Celsius per 100 years (around 1.4 degrees Fahrenheit).


Figure 3: Temperature data together with the linear model described by (22) for both maximum
and minimum temperatures.

[Figure 4 appears here; the fitted trends are +0.75 °C / 100 years for the maximum temperature and +0.88 °C / 100 years for the minimum temperature.]


Figure 4: Temperature trend obtained by fitting the model described by (22) for both maximum
and minimum temperatures.

1.6 Logistic regression

The problem of classification in statistics and machine learning is closely related to regression.
The only difference is that in classification the response is binary: it is equal to either 0 or 1.
Of course, the labels 0 and 1 are arbitrary; they represent two distinct classes into which we
aim to classify our data. For example, the predictors might be pictures and the two classes
cats and dogs.
In linear regression, we use a linear model to predict the response from the predictors. In
logistic regression, we use a linear model to predict how likely it is for the response to equal
0 or 1. This requires mapping the output of the linear model to the [0, 1] interval, which is
achieved by applying the logistic function or sigmoid,
$$g(t) := \frac{1}{1 + \exp(-t)}, \qquad (24)$$
to a linear combination of the predictors. In more detail, we model the probability that
yi = 1 by

$$P\left(y_i = 1 \mid X_{i1}, X_{i2}, \ldots, X_{ip}\right) \approx g\left( \beta_0 + \sum_{j=1}^{p} \beta_j X_{ij} \right) \qquad (25)$$
$$= \frac{1}{1 + \exp\left( -\beta_0 - \sum_{j=1}^{p} \beta_j X_{ij} \right)}. \qquad (26)$$
In Proposition 1.3 we established that least-squares fitting computes a maximum-likelihood
estimate in the case of linear models with additive Gaussian noise. The following proposition
derives a cost function to calibrate a logistic-regression model by maximizing the likelihood
under the assumption that the response is a Bernoulli random variable parametrized by the
linear model.

Proposition 1.6 (Logistic-regression cost function). If we assume that the response values
y1, y2, . . . , yn in the training data are independent samples from Bernoulli random variables
y̌1, y̌2, . . . , y̌n such that

$$P\left(\check{y}_i = 1 \mid X_{i1}, X_{i2}, \ldots, X_{ip}\right) := g\left( \beta_0 + \sum_{j=1}^{p} \beta_j X_{ij} \right), \qquad (27)$$
$$P\left(\check{y}_i = 0 \mid X_{i1}, X_{i2}, \ldots, X_{ip}\right) := 1 - g\left( \beta_0 + \sum_{j=1}^{p} \beta_j X_{ij} \right), \qquad (28)$$

then the maximum-likelihood estimates of the intercept β0 and the weights β are obtained
by maximizing the function

$$\log \mathcal{L}\left(\tilde{\beta}_0, \tilde{\beta}\right) := \sum_{i=1}^{n} y_i \log g\left( \tilde{\beta}_0 + \sum_{j=1}^{p} \tilde{\beta}_j X_{ij} \right) + \left(1 - y_i\right) \log\left( 1 - g\left( \tilde{\beta}_0 + \sum_{j=1}^{p} \tilde{\beta}_j X_{ij} \right) \right). \qquad (29)$$

Proof. Due to the independence assumption, the joint probability mass function (pmf) of
the random vector y̌ equals

$$p_{\check{y}}(y) := \prod_{i=1}^{n} g\left( \tilde{\beta}_0 + \sum_{j=1}^{p} \tilde{\beta}_j X_{ij} \right)^{y_i} \left( 1 - g\left( \tilde{\beta}_0 + \sum_{j=1}^{p} \tilde{\beta}_j X_{ij} \right) \right)^{1 - y_i}. \qquad (30)$$

The likelihood is defined as the joint pmf parametrized by the weight vectors,

$$\mathcal{L}\left(\tilde{\beta}_0, \tilde{\beta}\right) := \prod_{i=1}^{n} g\left( \tilde{\beta}_0 + \sum_{j=1}^{p} \tilde{\beta}_j X_{ij} \right)^{y_i} \left( 1 - g\left( \tilde{\beta}_0 + \sum_{j=1}^{p} \tilde{\beta}_j X_{ij} \right) \right)^{1 - y_i}. \qquad (31)$$

Taking the logarithm of this nonnegative function completes the proof.

The log-likelihood function is concave, so the logistic-regression estimate can be computed
efficiently using convex optimization. Although the cost function is derived by assuming that the data follow a certain
probabilistic model, logistic regression is widely deployed in situations where the probabilistic
assumptions do not hold. The model will achieve high prediction accuracy on any dataset
where the two classes are linearly separable in the predictor space, as long as sufficient data is available.
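As an illustration of how the log-likelihood (29) can be maximized in practice, here is a minimal gradient-ascent sketch (our own, assuming numpy; the step size and number of iterations are arbitrary choices, and no regularization is used).

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic(X, y, step=0.1, iters=5000):
    """Maximize the log-likelihood (29) by gradient ascent (intercept included)."""
    Xa = np.column_stack([np.ones(len(y)), X])      # prepend a column of ones for beta_0
    beta = np.zeros(Xa.shape[1])
    for _ in range(iters):
        # Gradient of the log-likelihood: Xa^T (y - g(Xa beta))
        beta += step * Xa.T @ (y - sigmoid(Xa @ beta)) / len(y)
    return beta

rng = np.random.default_rng(5)
X = rng.standard_normal((500, 3))
true_beta = np.array([1.0, -2.0, 0.5])
y = (rng.random(500) < sigmoid(0.3 + X @ true_beta)).astype(float)
print(fit_logistic(X, y))     # approximately [0.3, 1.0, -2.0, 0.5]
```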

2 Sparse regression

2.1 Model selection

In Section 1.4 we established that least-squares regression allows us to learn a linear model when the
number of available examples n is large with respect to the number of predictors p. However,
in many modern applications the number of predictors can be extremely large. An example
is computational genomics, where the predictors may correspond to gene-expression measurements
from thousands of genes, whereas n, the number of patients, might only be in the hundreds.
It is obviously impossible to fit a linear model when p > n, or even when p ≈ n without
overfitting (depending on the noise level), but it may still be possible to fit a sparse linear
model that only depends on a subset of s < p predictors. Selecting the relevant predictors
to include in the model is called model selection in statistics.

2.2 Best subset selection and forward stepwise regression

A possible way to select a small number of relevant predictors from a training set is to fix the
order of the sparse model s < p and then evaluate the least-squares fit of all possible s-sparse
models in order to select the one that provides the best fit. This is called the best-subset
selection method. Unfortunately it is computationally intractable even for small values of s
and p. For instance, there are more than 10^13 possible models if s = 10 and p = 100.
An alternative to an exhaustive evaluation of all possible sparse models is to select the
predictors greedily, in the spirit of signal-processing methods such as orthogonal matching
pursuit. In forward stepwise regression we select the predictor that is most correlated with
the response and then project the rest of the predictors onto its orthogonal complement. Iterating
this procedure allows us to learn an s-sparse model in s steps.

Algorithm 2.1 (Forward stepwise regression). Given a matrix of predictors X ∈ R^{n×p} and
a response y ∈ R^n, we initialize the residual and the subset of relevant predictors S by setting

$$j_0 := \arg\max_{j} \left| \left\langle y, X_j \right\rangle \right|, \qquad (32)$$
$$S_0 := \{ j_0 \}, \qquad (33)$$
$$\beta_{\mathrm{ls}} := \arg\min_{\tilde{\beta}} \left\| y - X_{S_0} \tilde{\beta} \right\|_2, \qquad (34)$$
$$r^{(0)} := y - X_{S_0} \beta_{\mathrm{ls}}. \qquad (35)$$

Then for k = 1, 2, . . . , s − 1 we compute

$$j_k := \arg\max_{j \notin S_{k-1}} \left| \left\langle y,\ \mathcal{P}_{\mathrm{col}\left(X_{S_{k-1}}\right)^{\perp}} X_j \right\rangle \right|, \qquad (36)$$
$$S_k := S_{k-1} \cup \{ j_k \}, \qquad (37)$$
$$\beta_{\mathrm{ls}} := \arg\min_{\tilde{\beta}} \left\| y - X_{S_k} \tilde{\beta} \right\|_2, \qquad (38)$$
$$r^{(k)} := y - X_{S_k} \beta_{\mathrm{ls}}. \qquad (39)$$

The algorithm is very similar to orthogonal matching pursuit. The only difference is the
orthogonalization step in which we project the remaining predictors onto the orthogonal
complement of the span of the predictors that have been selected already.
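A sketch of Algorithm 2.1 in code (ours, assuming numpy): at each step it orthogonalizes the remaining predictors against the span of the selected ones via a QR factorization and refits the least-squares coefficients on the selected columns.

```python
import numpy as np

def forward_stepwise(X, y, s):
    """Greedily select s predictors as in Algorithm 2.1."""
    selected = []
    for _ in range(s):
        # Project the remaining predictors onto the orthogonal complement of the
        # span of the selected ones (no-op at the first iteration).
        if selected:
            Q, _ = np.linalg.qr(X[:, selected])
            X_perp = X - Q @ (Q.T @ X)
        else:
            X_perp = X
        scores = np.abs(X_perp.T @ y)
        scores[selected] = -np.inf                 # never reselect a predictor
        selected.append(int(np.argmax(scores)))
    beta_s, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
    return selected, beta_s

rng = np.random.default_rng(6)
X = rng.standard_normal((100, 20))
y = X[:, [2, 7, 11]] @ np.array([3.0, -2.0, 1.5]) + 0.1 * rng.standard_normal(100)
print(forward_stepwise(X, y, 3))   # should recover columns 2, 7 and 11
```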

2.3 The lasso

Fitting a sparse model that uses a subset of the available predictors is equivalent to learning
a weight vector β that only contains a small number of nonzeros. As we saw in the case of
sparse signal representations and underdetermined inverse problems, penalizing the `1 norm
is an efficient way of promoting sparsity. In statistics, `1 -norm regularized least squares is
known as the lasso,
$$\underset{\tilde{\beta}_0,\, \tilde{\beta}}{\text{minimize}} \quad \frac{1}{2n} \left\| y - \tilde{\beta}_0 - X\tilde{\beta} \right\|_2^2 + \lambda \left\| \tilde{\beta} \right\|_1, \qquad (40)$$

where λ > 0 is a regularization parameter that controls the tradeoff between the fit to the data
and the `1 norm of the weights. Figure 5 shows the values of the coefficients obtained by the
lasso for a linear regression problem with 50 predictors where the response only depends on
3 of them. If λ is very large, all coefficients are equal to zero. If λ is very small, then the
lasso estimate is equal to the least-squares estimate. For values in between, the lasso yields
a sparse model containing the coefficients corresponding to the relevant predictors.
A different formulation for the lasso, which is the one that appeared in the original paper [4],
incorporates a constraint on the `1 norm of the weight vector, instead of an additive term.
$$\text{minimize} \quad \left\| y - X\tilde{\beta} - \tilde{\beta}_0 \right\|_2^2 \qquad (42)$$
$$\text{subject to} \quad \left\| \tilde{\beta} \right\|_1 \le \tau. \qquad (43)$$

The two formulations are equivalent, but the relation between λ and τ depends on X and y.
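In practice the lasso is rarely coded from scratch. The following sketch uses scikit-learn (our choice of library, not mentioned in the notes), whose Lasso estimator minimizes the objective in (40); it traces which coefficients survive as the regularization parameter varies, mirroring Figure 5.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(7)
n, p = 100, 50
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [0.4, -0.3, 0.25]                 # the response depends on 3 predictors only
y = X @ beta + 0.1 * rng.standard_normal(n)

# Scikit-learn's Lasso minimizes (1/2n)||y - X b||_2^2 + alpha ||b||_1, as in (40)
for alpha in [1.0, 0.1, 0.01, 0.001]:
    model = Lasso(alpha=alpha, fit_intercept=True).fit(X, y)
    nonzero = np.flatnonzero(model.coef_)
    print(f"lambda={alpha:7.3f}  nonzero coefficients: {nonzero}")
```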


Figure 5: Magnitude of the lasso coefficients for different values of the regularization parameter.
The number of predictors is 50, but the response only depends on 3 of them, which are marked in
red.

To compare the performance of the lasso, forward stepwise regression and least-squares
regression on a dataset that follows a sparse linear model we generate simulated data by
computing

ytrain = Xtrain β + ztrain , (44)


ytest = Xtest β, (45)

where the entries of Xtrain, Xtest and ztrain are sampled independently at random from
a Gaussian distribution with zero mean and unit variance. 10 entries of β are also sampled
iid from a standard normal, but the rest are set to zero. As a result the response only
depends on 10 predictors out of a total of 50. The training and test errors were computed
as in (15) and (16).
Figure 6 shows the results for different values of n (to be clear we compute the estimates
once at each value of n). As expected, the least-squares estimator overfits the training data
and performs very badly on the test set for small values of n. In contrast, the lasso and
forward stepwise regression do not overfit and achieve much smaller errors on the test set,
even when n is equal to p.
In our implementation of forward stepwise regression we set the number of predictors in the
sparse model to the true number. On real data we would need to also estimate the order of
the model. The greedy method performs very well in some instances because it manages to
select the correct sparse model. However, in other cases, mistakes in selecting the relevant
predictors produce high fit errors, even on the training set. In contrast, the lasso achieves
accurate fits for every value of n.

2.4 Theoretical analysis of the lasso

Assume that we have data that follow a sparse linear model of the form

y = Xβ + z (46)

where the weight vector β is sparse, so that the response y only depends on s < p predictors.
By Theorem 1.5, if the noise has variance σz², least-squares regression yields an estimate of
the weight vector that satisfies

$$\left\| \beta - \beta_{\mathrm{ls}} \right\|_2 \approx \sigma_z \sqrt{\frac{p}{n}} \qquad (47)$$
as long as the matrix of predictors X is well conditioned and has entries with constant
amplitudes. When p is close to n this indicates that least-squares regression does not yield
a good estimate.
In order to characterize the error achieved by the lasso, we introduce the restricted-eigenvalue
property (REP), which is similar to the restricted-isometry property (RIP) that we studied
in Lecture Notes 5.
Definition 2.2 (Restricted-eigenvalue property). A matrix M ∈ R^{n×p} satisfies the restricted-
eigenvalue property with parameter s if there exists γ > 0 such that for any v ∈ R^p, if

$$\left\| v_{T^c} \right\|_1 \le \left\| v_T \right\|_1 \qquad (48)$$

for any subset T with cardinality s, then

$$\frac{1}{n} \left\| M v \right\|_2^2 \ge \gamma \left\| v \right\|_2^2. \qquad (49)$$

Just like the RIP, the REP states that the matrix preserves the norm of a certain class
of vectors. In this case, the vectors are not necessarily sparse, but rather have `1 -norm
concentrated on a sparse subset of the entries, which can be interpreted as a robustified
notion of sparsity. The property may hold even if p > n, i.e. when we have more predictors
than examples. The following theorem provides guarantees for the lasso under the REP.
Theorem 2.3. Assume that the data y are generated according to a linear model with additive
noise,

y = Xβ + z, (50)


Figure 6: Comparison of the training and test error of the lasso, forward stepwise regression and
least-squares regression for simulated data where the number of predictors is equal to 50 but only
10 are used to generate the response.

where X ∈ R^{n×p} and β ∈ R^p, and that the entries of z ∈ R^n are drawn independently at
random from a Gaussian distribution with zero mean and variance σz². If β has s nonzero
entries and X satisfies the restricted-eigenvalue property, the solution βlasso to

$$\text{minimize} \quad \left\| y - X\tilde{\beta} \right\|_2^2 \qquad (51)$$
$$\text{subject to} \quad \left\| \tilde{\beta} \right\|_1 \le \tau, \qquad (52)$$

with τ := ||β||1, satisfies

$$\left\| \beta - \beta_{\mathrm{lasso}} \right\|_2 \le \frac{\sigma_z \sqrt{32\, \alpha\, s \log p}}{\gamma\, n}\, \max_i \left\| X_i \right\|_2 \qquad (53)$$

with probability 1 − 2 exp(−(α − 1) log p) for any α > 2.


The result establishes that the lasso achieves an error that scales as σz √(s/n) up to a logarithmic
factor, which is the same rate achieved by least squares if the true sparse model is known!
In this section we have focused on the estimation of the weight vector. This is important
for model selection, as the sparsity pattern and the amplitude of the weights reveals the
sparse model used to predict the response. However, in statistics the main aim is often to
predict the response. For this purpose, in principle, conditions such as the REP should not
be necessary. For results on the prediction error of the lasso we refer the interested reader to
Chapter 11 of [3]. In Section 3 we discuss the performance of the lasso and related methods
when the predictor matrix does not satisfy the REP.

2.5 Sparse logistic regression

An advantage of the lasso over greedy methods to learn sparse linear models is that it can be
easily applied to logistic regression. All we need to do is add an `1 -norm regularization term
to the cost function derived in Section 1.6. In detail, to learn a sparse logistic-regression
model we minimize the function
$$-\sum_{i=1}^{n} \left[ y_i \log g\left( \tilde{\beta}_0 + \sum_{j=1}^{p} \tilde{\beta}_j X_{ij} \right) + \left(1 - y_i\right) \log\left( 1 - g\left( \tilde{\beta}_0 + \sum_{j=1}^{p} \tilde{\beta}_j X_{ij} \right) \right) \right] + \lambda \left\| \tilde{\beta} \right\|_1.$$
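For readers who prefer Python, here is a sketch (our own; the notes use the glmnet package in R for the experiment described below) of fitting an `1-penalized logistic-regression model with scikit-learn on synthetic data. The parameter C is the inverse of the regularization strength.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(8)
n, p = 300, 100
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = [2.0, -1.5, 1.0, -1.0, 0.8]       # only 5 relevant predictors
prob = 1.0 / (1.0 + np.exp(-(X @ beta)))
y = (rng.random(n) < prob).astype(int)

# l1-penalized logistic regression yields a sparse set of coefficients
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
print("nonzero coefficients:", np.flatnonzero(model.coef_))
```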

This version of the lasso can be used to obtain a sparse logistic model for prediction of
binary responses. To illustrate this, we consider a medical dataset.2 The response indicates
whether each of 271 patients suffers from arrhythmia or not. The predictors contain information
about each patient, such as age, sex, height and weight, as well as features extracted from
electrocardiogram recordings. The total number of predictors is 182. We use the glmnet
package in R [2] to fit the sparse logistic-regression model.

2 The data can be found at https://archive.ics.uci.edu/ml/datasets/Arrhythmia


Figure 7: Distribution of misclassification errors achieved after repeatedly fitting the model using
a random subset of 90% of the examples for different values of the regularization parameter λ. The
number of predictors included in the sparse model is indicated above the graph.

Figure 7 shows the distribution of the test error achieved after repeatedly fitting the model
using a random subset of 90% of the examples; a procedure known as cross-validation in
statistics. The number of predictors included in the sparse model is indicated above the
graph. The best results are obtained by a model containing 62 predictors, but a model
containing 18 achieves very similar accuracy (both are marked with a dotted line).

3 Correlated predictors
In many situations, some of the predictors in a dataset may be highly correlated. As a result,
the predictor matrix is ill conditioned, which is problematic for least squares regression, and
also for the lasso. In this section, we discuss this issue and show how it can be tackled
through regularization of the regression cost function.

3.1 Ridge regression

When the data in a linear regression problem are of the form y = Xβ + z, we can write the error
of the least-squares estimator in terms of the singular-value decomposition X = UΣV^T,

$$\left\| \beta - \beta_{\mathrm{ls}} \right\|_2 = \sqrt{ \sum_{j=1}^{p} \left( \frac{U_j^T z}{\sigma_j} \right)^2 }, \qquad (54)$$

see (102) in Section A.4. If a subset of the predictors are highly correlated, then some of
the singular values will be very small, which results in noise amplification. Ridge
regression is an estimation technique that controls noise amplification by introducing an
`2-norm penalty on the weight vector,

$$\text{minimize} \quad \left\| y - X\tilde{\beta} \right\|_2^2 + \lambda \left\| \tilde{\beta} \right\|_2^2, \qquad (55)$$

where λ > 0 is a regularization parameter that controls the weight of the regularization
term as in the lasso. In inverse problems, `2 -norm regularization is often known as Tikhonov
regularization.
The following proposition shows that, under the assumption that the data indeed follow a
linear model, the error of the ridge-regression estimator can be decomposed into a term that
depends on the signal and a term that depends on the noise.

Proposition 3.1 (Ridge-regression error). If y = Xβ + z and X ∈ R^{n×p}, n ≥ p, is full
rank, then the solution of Problem (55) can be written as

$$\beta_{\mathrm{ridge}} = V \begin{bmatrix} \frac{\sigma_1^2}{\sigma_1^2+\lambda} & 0 & \cdots & 0 \\ 0 & \frac{\sigma_2^2}{\sigma_2^2+\lambda} & \cdots & 0 \\ & & \cdots & \\ 0 & 0 & \cdots & \frac{\sigma_p^2}{\sigma_p^2+\lambda} \end{bmatrix} V^T \beta \;+\; V \begin{bmatrix} \frac{\sigma_1}{\sigma_1^2+\lambda} & 0 & \cdots & 0 \\ 0 & \frac{\sigma_2}{\sigma_2^2+\lambda} & \cdots & 0 \\ & & \cdots & \\ 0 & 0 & \cdots & \frac{\sigma_p}{\sigma_p^2+\lambda} \end{bmatrix} U^T z. \qquad (56)$$

We defer the proof to Section A.6 in the appendix.


Increasing the value of the regularization parameter λ allows us to control the noise term when
some of the predictors are highly correlated. However, this also increases the error term that
depends on the original signal; if z = 0 then we do not recover the true weight vector β unless
λ = 0. Calibrating the regularization parameter allows us to adapt to the conditioning of the
predictor matrix and the noise level in order to achieve a good tradeoff between the two terms.
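A minimal sketch of ridge regression (ours, assuming numpy) that evaluates the closed form (X^T X + λI)^{-1} X^T y for several values of λ on data with two highly correlated predictors:

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge regression: minimize ||y - X b||_2^2 + lam * ||b||_2^2 as in (55)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(9)
n, p = 100, 10
X = rng.standard_normal((n, p))
X[:, 1] = X[:, 0] + 0.01 * rng.standard_normal(n)     # two highly correlated predictors
y = X @ np.ones(p) + 0.1 * rng.standard_normal(n)

for lam in [0.0, 0.1, 10.0]:
    b = ridge(X, y, lam)
    print(f"lambda={lam:5.1f}  weights on the correlated pair: {np.round(b[:2], 2)}")
```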

3.2 The elastic net

Figure 8 shows the coefficients of the lasso and ridge regression learnt from a dataset where
the response follows a sparse regression model. The model includes 50 predictors but only
12 are used to generate the response. These 12 predictors are divided into two groups of 6
that are highly correlated3 .
The model obtained using ridge regression assigns similar weights to the correlated variables.
This is a desirable property, since all of these predictors are equally predictive of the response
value. However, the learnt model is not sparse for any value of the regularization parameter,
which means that it selects all of the irrelevant predictors. In contrast, the lasso produces a
sparse model, but the coefficients of the relevant predictors are very erratic. In fact, in the
regime where the coefficients are sparse not all the relevant predictors are included in the
model (two from the second group are missing).
The following lemma, proved in Section A.7 of the appendix, gives some intuition as to why the coefficient path of the
lasso tends to be erratic when some predictors are highly correlated. When two predictors
are exactly the same, the lasso chooses arbitrarily between the two, instead of including
both in the model with similar weights.

Lemma 3.2. If two columns of the predictor matrix X are identical, Xi = Xj with i ≠ j, and

3 In more detail, the predictors in each group are sampled from a Gaussian distribution with zero mean
and unit variance such that the covariance between each pair of predictors is equal to 0.95.

Figure 8: Coefficients of the lasso, ridge-regression and elastic-net estimate for a linear regression
problem where the response only depends on two groups of 6 predictors each out of a total of 50
predictors. The predictors in each group are highly correlated.
βlasso is a solution of the lasso, then

$$\beta(\alpha)_i := \alpha\, \beta_{\mathrm{lasso},i} + (1-\alpha)\, \beta_{\mathrm{lasso},j}, \qquad (57)$$
$$\beta(\alpha)_j := (1-\alpha)\, \beta_{\mathrm{lasso},i} + \alpha\, \beta_{\mathrm{lasso},j}, \qquad (58)$$
$$\beta(\alpha)_k := \beta_{\mathrm{lasso},k}, \quad k \notin \{i,j\}, \qquad (59)$$

is also a solution for any 0 < α < 1.

The following lemma, which we have borrowed from [6], provides some intuition as to why
strictly convex regularization functions, such as the one used in ridge regression, tend to weigh highly correlated
predictors in a similar way. This does not contradict the previous result because the `1 norm used by the
lasso is not strictly convex. The result is proved in Section A.8 of the appendix.
Lemma 3.3 (Identical predictors). Let us consider a regularized least-squares problem of
the form

$$\text{minimize} \quad \frac{1}{2n} \left\| y - X\tilde{\beta} \right\|_2^2 + \lambda\, R\left(\tilde{\beta}\right), \qquad (60)$$

where R is an arbitrary regularizer that is strictly convex and invariant to the ordering
of its argument. If two columns of the predictor matrix X are identical, Xi = Xj, then the
corresponding coefficients in the solution βR of the optimization problem are also identical:
βR,i = βR,j.

For sparse regression problems where some predictors are highly correlated, ridge regression
weighs correlated predictors similarly, as opposed to the lasso, but does not yield a sparse
model. The elastic net combines the lasso and ridge-regression cost functions, introducing
an extra regularization parameter α,

$$\text{minimize} \quad \left\| y - X\tilde{\beta} \right\|_2^2 + \lambda \left( \frac{1-\alpha}{2} \left\| \tilde{\beta} \right\|_2^2 + \alpha \left\| \tilde{\beta} \right\|_1 \right). \qquad (61)$$

For α = 0 the elastic net is equivalent to ridge regression, whereas for α = 1 it’s equivalent to
the lasso. For intermediate values of α the cost function yields sparse linear models where the
coefficients corresponding to highly correlated predictors have similar amplitudes, as shown
in Figure 8.
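Scikit-learn's ElasticNet implements this cost function with a slightly different parametrization (this mapping is our reading of its documentation: alpha plays the role of λ up to the 1/(2n) scaling of the least-squares term, and l1_ratio plays the role of α). A sketch on data with two groups of correlated relevant predictors:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(10)
n, p = 100, 50
# Two groups of 6 highly correlated relevant predictors, as in Figure 8
base1, base2 = rng.standard_normal(n), rng.standard_normal(n)
X = rng.standard_normal((n, p))
for j in range(6):
    X[:, j] = base1 + 0.1 * rng.standard_normal(n)
    X[:, 6 + j] = base2 + 0.1 * rng.standard_normal(n)
y = X[:, :12] @ np.full(12, 0.5) + 0.1 * rng.standard_normal(n)

# l1_ratio plays the role of alpha in (61); alpha plays the role of lambda
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("nonzero coefficients:", np.flatnonzero(np.abs(model.coef_) > 1e-6))
```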
Figure 9 plots the training and test error achieved by least squares, ridge regression, the
lasso and the elastic net on a dataset where the response only depends on two groups of
highly correlated predictors. The total number of predictors in the dataset is p = 50 and the
number of training examples is n = 100. Least squares overfits the data, yielding the best
error for the training set. Ridge regression has a significantly lower test error, but does not
achieve the performance of the lasso because it does not yield a sparse model as can be seen
in Figure 8. The elastic net achieves the lowest test error.


Figure 9: Training and test error achieved by least squares, ridge regression, the lasso and the elas-
tic net on a dataset where the response only depends on two groups of highly correlated predictors.
The coefficient paths are shown in Figure 8.

4 Group sparsity

4.1 Introduction

Group sparsity is a generalization of sparsity that allows us to design models that incorporate
prior information about the data. If the entries of a vector are partitioned into several groups,
then the vector is group sparse if only the entries corresponding to a small number of groups
are nonzero, no matter how many entries are zero or nonzero within the groups. We have
already exploited group sparsity in the context of denoising in Lecture Notes 4. There we
used block thresholding to enforce the prior assumption that the STFT coefficients in a
speech signal tend to have significant amplitude in contiguous areas. In this section, we
will focus on the application of group-sparsity assumptions to regression models, where the
groups are used to encode information about the structure of the predictors.
We consider a linear regression model where the predictors are partitioned into k groups G1,
G2, . . . , Gk,

$$y \approx \beta_0 + X\beta = \beta_0 + \begin{bmatrix} X_{G_1} & X_{G_2} & \cdots & X_{G_k} \end{bmatrix} \begin{bmatrix} \beta_{G_1} \\ \beta_{G_2} \\ \cdots \\ \beta_{G_k} \end{bmatrix}. \qquad (62)$$

A group-sparse regression model is a model in which only the predictors corresponding to a
small number of groups are used to predict the response. For such models, the coefficient
vector β has a group-sparse structure, since the entries corresponding to the rest of the groups
are equal to zero. For example, if the predictors include the temperature, air pressure and
other weather conditions at several locations, it might make sense to assume that the response
will only depend on the predictors associated with a small number of locations. This implies
that β should be group sparse, where each group contains the predictors associated with a
particular location.

4.2 Multi-task learning

Multi-task learning is a problem in machine learning and statistics which consists of learning
models for several learning problems simultaneously, exploiting common structure. In the
case of regression, an important example is when we want to estimate several responses Y1 ,
Y2 , . . . , Yk ∈ Rn that depend on the same predictors X1 , X2 , . . . , Xp ∈ Rn . To learn a linear

model for this problem we need to fit a matrix of coefficients B ∈ R^{p×k},

$$Y = \begin{bmatrix} Y_1 & Y_2 & \cdots & Y_k \end{bmatrix} \approx B_0 + XB \qquad (63)$$
$$= B_0 + X \begin{bmatrix} B_1 & B_2 & \cdots & B_k \end{bmatrix}. \qquad (64)$$

If we estimate B by solving a least-squares problem, then this is exactly equivalent to learning
k linear-regression models separately, one for each response. However, a reasonable assumption
in many cases is that the different responses depend on the same predictors, albeit
with different coefficients. This corresponds exactly to a group-sparse assumption on the
coefficient matrix B: B should have a small number of nonzero rows.

4.3 Mixed `1 /`2 norm

In order to promote group-sparse structure, a popular approach is to penalize the `1 /`2 norm,
which corresponds to the sum of the `2 -norms of the entries in the different groups. The
intuition is that minimizing the `1 norm induces sparsity, so minimizing the `1 norm of the
`2 -norms of the groups should induce sparsity at the group level.

Definition 4.1 (`1/`2 norm). The `1/`2 norm of a vector β with entries divided into k
groups G1, G2, . . . , Gk is defined as

$$\left\| \beta \right\|_{1,2} := \sum_{i=1}^{k} \left\| \beta_{G_i} \right\|_2. \qquad (65)$$

In the case of multitask learning, where the groups correspond to the rows of a matrix B ∈ R^{p×k}, the
`1/`2 norm of the matrix corresponds to the sum of the `2 norms of the rows,

$$\left\| B \right\|_{1,2} := \sum_{i=1}^{p} \left\| B_{i:} \right\|_2, \qquad (66)$$

where B_{i:} denotes the ith row of B.


Let us give a simple example that shows why penalizing the `1/`2 norm induces group
sparsity. Let us define two groups G1 := {1, 2} and G2 := {3}. The corresponding `1/`2
norm is given by

$$\left\| \left( \beta_1, \beta_2, \beta_3 \right) \right\|_{1,2} = \sqrt{\beta_1^2 + \beta_2^2} + \left| \beta_3 \right|. \qquad (67)$$
Our aim is to fit a regression model. Let us imagine that most of the response can be
explained by the first predictor, so that

$$y \approx X \begin{bmatrix} \beta_1 \\ 0 \\ 0 \end{bmatrix}, \qquad (68)$$

but the fit can be improved in two ways, by setting either β2 or β3 to a certain value α.
The question is which of the two options is cheaper in terms of `1/`2 norm, since this is the
option that will be chosen if we use the `1/`2 norm to regularize the fit. The answer is that
modifying β2 has much less impact on the `1/`2 norm,

$$\left\| \left( \beta_1, \alpha, 0 \right) \right\|_{1,2} = \sqrt{\beta_1^2 + \alpha^2}, \qquad (69)$$
$$\left\| \left( \beta_1, 0, \alpha \right) \right\|_{1,2} = \left| \beta_1 \right| + \alpha = \sqrt{\beta_1^2 + \alpha^2 + 2\left|\beta_1\right|\alpha}, \qquad (70)$$
especially if α is small or β1 is large. The `1 /`2 norm induces a group-sparse structure by
making it less costly to include entries that belong to groups which already have nonzero
entries, with respect to groups where all the entries are equal to zero.
The following lemma derives the subgradient of the `1 /`2 norm. We defer the proof to
Section A.9 in the appendix.
Lemma 4.2 (Subgradient of the `1/`2 norm). A vector g ∈ R^p is a subgradient of the `1/`2
norm at β ∈ R^p if and only if

$$g_{G_i} = \frac{\beta_{G_i}}{\left\| \beta_{G_i} \right\|_2} \quad \text{for } \beta_{G_i} \neq 0, \qquad (71)$$
$$\left\| g_{G_i} \right\|_2 \le 1 \quad \text{for } \beta_{G_i} = 0. \qquad (72)$$

4.4 Group and multitask lasso

The group lasso [5] combines a least-squares term with an `1/`2-norm regularization term to
fit a group-sparse linear regression model,

$$\text{minimize} \quad \left\| y - \tilde{\beta}_0 - X\tilde{\beta} \right\|_2^2 + \lambda \left\| \tilde{\beta} \right\|_{1,2}, \qquad (73)$$

where λ > 0 is a regularization parameter. Applying the exact same idea to the multi-task
regression problem yields the multi-task lasso,

$$\text{minimize} \quad \left\| Y - \tilde{B}_0 - X\tilde{B} \right\|_F^2 + \lambda \left\| \tilde{B} \right\|_{1,2}, \qquad (74)$$

where the Frobenius norm ||·||F is equivalent to the `2 norm of the matrix once it is vectorized.

[Figure 10 appears here; the top panel corresponds to s = 4 and the bottom panel to s = 30.]

Figure 10: Errors achieved by the lasso and the multitask lasso on a multitask regression problem
where the same s predictors (out of a total of p = 100 predictors) are used to produce the response
in k sparse linear regression models. The number of training examples is n = 50.

[Figure 11 appears here; its three panels show the coefficients fitted by the lasso, the coefficients fitted by the multitask lasso, and the original coefficients.]

Figure 11: Coefficients for the lasso and the multitask lasso when s = 30 and k = 40 in the
experiment described in Figure 10. The first 30 rows contain the relevant features.

Figure 10 shows a comparison between the errors achieved by the lasso and the multitask
lasso on a multitask regression problem where the same s predictors (out of a total of
p = 100 predictors) are used to produce the response in k sparse linear regression models.
The number of training examples is n = 50. The lasso fits each of the models separately, whereas
the multitask lasso produces a joint fit. This allows it to learn the model more effectively and
achieve a lower error, particularly when the number of relevant predictors is relatively large (s = 30).
Figure 11 shows the actual coefficients fit by the lasso and the multitask lasso when s = 30
and k = 40. The multitask lasso promotes a group-sparse structure, which results in
the correct identification of the relevant predictors. In contrast, the lasso fits sparse models
that do not necessarily contain the same predictors, making it easier to include irrelevant
predictors that seem relevant for a particular response because of the noise.
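Scikit-learn provides a MultiTaskLasso estimator whose penalty is the sum of the `2 norms of the rows of the coefficient matrix, i.e. the `1/`2 norm in (66) (the library choice and the simulation parameters below are ours, not the notes').

```python
import numpy as np
from sklearn.linear_model import Lasso, MultiTaskLasso

rng = np.random.default_rng(11)
n, p, k, s = 50, 100, 40, 10
X = rng.standard_normal((n, p))
B = np.zeros((p, k))
B[:s, :] = rng.standard_normal((s, k))        # same s relevant predictors for all tasks
Y = X @ B + 0.1 * rng.standard_normal((n, k))

# Joint fit: the l1/l2 penalty couples the k responses
joint = MultiTaskLasso(alpha=0.1).fit(X, Y)
rows_joint = np.flatnonzero(np.linalg.norm(joint.coef_, axis=0) > 1e-6)

# Separate lasso fits, one per response
rows_sep = set()
for j in range(k):
    rows_sep |= set(np.flatnonzero(Lasso(alpha=0.1).fit(X, Y[:, j]).coef_))

# Typically the joint fit selects fewer spurious predictors than the union of separate fits
print(len(rows_joint), len(rows_sep))
```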

4.5 Proximal-gradient algorithm

In order to apply the group lasso or the multitask lasso we need to solve a least-squares problem
with an `1/`2-norm regularization term. In this section we adapt the proximal-gradient algorithm described in
Lecture Notes 3 to this setting. The first step is to derive the proximal operator of this
norm.

Proposition 4.3 (Proximal operator of the `1/`2 norm). The solution to the optimization
problem

$$\text{minimize} \quad \frac{1}{2} \left\| \beta - \tilde{\beta} \right\|_2^2 + \alpha \left\| \tilde{\beta} \right\|_{1,2}, \qquad (75)$$

where α > 0, is obtained by applying a block soft-thresholding operator to β,

$$\mathrm{prox}_{\alpha \left\|\cdot\right\|_{1,2}}\left(\beta\right) = \mathcal{BS}_{\alpha}\left(\beta\right), \qquad (76)$$

where

$$\mathcal{BS}_{\alpha}\left(\beta\right)_{G_i} := \begin{cases} \beta_{G_i} - \alpha\, \dfrac{\beta_{G_i}}{\left\|\beta_{G_i}\right\|_2} & \text{if } \left\|\beta_{G_i}\right\|_2 \ge \alpha, \\ 0 & \text{otherwise.} \end{cases} \qquad (77)$$

The proof of this result is in Section A.10 of the appendix.


The proximal-gradient method alternates between block-thresholding and taking a gradient
step to minimize the least-squares fit.

Algorithm 4.4 (Iterative Block-Thresholding Algorithm). We set the initial point x^{(0)} to
an arbitrary value in R^n. Then we compute

$$x^{(k+1)} = \mathcal{BS}_{\alpha_k \lambda}\left( x^{(k)} - \alpha_k A^T \left( A x^{(k)} - y \right) \right), \qquad (78)$$

until a convergence criterion is satisfied.

Convergence may be accelerated using the ideas discussed in Lecture Notes 3 to motivate
the FISTA method.
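A compact sketch of Algorithm 4.4 (ours, assuming numpy): the block soft-thresholding operator of Proposition 4.3 followed by the proximal-gradient iteration, applied to a small group-sparse regression problem. The step size is chosen below 1/||A||²₂, a standard sufficient condition for convergence of this type of iteration.

```python
import numpy as np

def block_soft_threshold(beta, alpha, groups):
    """Proximal operator of alpha * ||.||_{1,2}, as in Proposition 4.3."""
    out = np.zeros_like(beta)
    for g in groups:
        norm_g = np.linalg.norm(beta[g])
        if norm_g >= alpha:
            out[g] = beta[g] - alpha * beta[g] / norm_g
    return out

def iterative_block_thresholding(A, y, groups, lam, step, iters=1000):
    """Algorithm 4.4: gradient step on the least-squares term, then block thresholding."""
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = block_soft_threshold(x - step * A.T @ (A @ x - y), step * lam, groups)
    return x

rng = np.random.default_rng(12)
A = rng.standard_normal((60, 20))
groups = [list(range(4 * i, 4 * i + 4)) for i in range(5)]   # 5 groups of 4 entries each
x_true = np.zeros(20)
x_true[:4] = [1.0, -2.0, 0.5, 1.5]                            # only the first group is active
y = A @ x_true + 0.05 * rng.standard_normal(60)

step = 1.0 / np.linalg.norm(A, 2) ** 2        # step size below 1 / ||A||_2^2
x_hat = iterative_block_thresholding(A, y, groups, lam=1.0, step=step)
# Group norms of the estimate: the first group should dominate
print(np.round([np.linalg.norm(x_hat[g]) for g in groups], 3))
```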

References
The book by Hastie, Tibshirani and Wainwright [3] is a great reference on sparse regression.
We also recommend [1].

[1] J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning. Springer.

[2] J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models
via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.

[3] T. Hastie, R. Tibshirani, and M. Wainwright. Statistical learning with sparsity: the lasso and
generalizations. CRC Press, 2015.

[4] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical
Society. Series B (Methodological), pages 267–288, 1996.

[5] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.

[6] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the
Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.

A Proofs

A.1 Proof of Proposition 1.1

Let X = UΣV^T be the singular-value decomposition (SVD) of X. Under the conditions of
the proposition, (X^T X)^{-1} X^T y = VΣ^{-1}U^T y. We begin by separating y into two components,

$$y = U U^T y + \left( I - U U^T \right) y, \qquad (79)$$

where UU^T y is the projection of y onto the column space of X. Note that (I − UU^T) y is
orthogonal to the column space of X and consequently to both UU^T y and Xβ̃ for any β̃.
By Pythagoras's theorem,

$$\left\| y - X\tilde{\beta} \right\|_2^2 = \left\| \left( I - U U^T \right) y \right\|_2^2 + \left\| U U^T y - X\tilde{\beta} \right\|_2^2. \qquad (80)$$

The minimum value of this cost function that can be achieved by optimizing over β̃ is
||(I − UU^T) y||²₂. It is attained by solving the system of equations

$$U U^T y = X\tilde{\beta} = U \Sigma V^T \tilde{\beta}. \qquad (81)$$

Since U^T U = I because n ≥ p, multiplying both sides of the equality by U^T yields the equivalent
system

$$U^T y = \Sigma V^T \tilde{\beta}. \qquad (82)$$

Since X is full rank, Σ and V are square and invertible (and by definition of the SVD
V^{-1} = V^T), so

$$\beta_{\mathrm{ls}} = V \Sigma^{-1} U^T y \qquad (83)$$

is the unique solution to the system and consequently also of the least-squares problem.

A.2 Proof of Proposition 1.3

We model the noise as a random vector ž whose entries are independent Gaussian random
variables with mean zero and variance σ². Note that ž is a random vector, whereas z is a
realization of that random vector. Similarly, the data y that we observe are interpreted as a
realization of a random vector y̌. The ith entry of

$$\check{y} = X\beta + \check{z} \qquad (84)$$

is a Gaussian random variable with mean (Xβ)_i and variance σ². The pdf of y̌_i is consequently
of the form

$$f_{\check{y}_i}(t) := \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{\left( t - (X\beta)_i \right)^2}{2\sigma^2} \right). \qquad (85)$$

By assumption, the entries of ž are independent, so the joint pdf of y̌ is equal to

$$f_{\check{y}}(\tilde{y}) := \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{\left( \tilde{y}_i - (X\beta)_i \right)^2}{2\sigma^2} \right) \qquad (86)$$
$$= \frac{1}{\sqrt{(2\pi)^n}\,\sigma^n} \exp\left( -\frac{1}{2\sigma^2} \left\| \tilde{y} - X\beta \right\|_2^2 \right). \qquad (87)$$

The likelihood is the probability density function of y̌ evaluated at the observed data y and
interpreted as a function of the weight vector β̃,

$$\mathcal{L}\left(\tilde{\beta}\right) = \frac{1}{\sqrt{(2\pi)^n}\,\sigma^n} \exp\left( -\frac{1}{2\sigma^2} \left\| y - X\tilde{\beta} \right\|_2^2 \right). \qquad (88)$$

Since the likelihood is nonnegative and the logarithm is a monotone function, we can
optimize over the logarithm of the likelihood to find the maximum-likelihood estimate. We
conclude that it is given by the solution to the least-squares problem, since

$$\beta_{\mathrm{ML}} = \arg\max_{\tilde{\beta}} \mathcal{L}\left(\tilde{\beta}\right) \qquad (89)$$
$$= \arg\max_{\tilde{\beta}} \log \mathcal{L}\left(\tilde{\beta}\right) \qquad (90)$$
$$= \arg\min_{\tilde{\beta}} \left\| y - X\tilde{\beta} \right\|_2. \qquad (91)$$

A.3 Proof of Lemma 1.4

Let 1 denote an n-dimensional vector of ones. The model with an intercept is equivalent to

$$y \approx \begin{bmatrix} X & \mathbf{1} \end{bmatrix} \begin{bmatrix} \tilde{\beta} \\ \tilde{\beta}_0 \end{bmatrix}. \qquad (92)$$

Applying Proposition 1.1, the least-squares fit is

$$\begin{bmatrix} \beta \\ \beta_0 \end{bmatrix}_{\mathrm{ls}} = \left( \begin{bmatrix} X & \mathbf{1} \end{bmatrix}^T \begin{bmatrix} X & \mathbf{1} \end{bmatrix} \right)^{-1} \begin{bmatrix} X & \mathbf{1} \end{bmatrix}^T y \qquad (93)$$
$$= \begin{bmatrix} X^T X & X^T \mathbf{1} \\ \mathbf{1}^T X & n \end{bmatrix}^{-1} \begin{bmatrix} X^T y \\ \mathbf{1}^T y \end{bmatrix} \qquad (94)$$
$$= \begin{bmatrix} X^T X & 0 \\ 0 & n \end{bmatrix}^{-1} \begin{bmatrix} X^T y \\ 0 \end{bmatrix} \qquad (95)$$
$$= \begin{bmatrix} \left(X^T X\right)^{-1} & 0 \\ 0 & \frac{1}{n} \end{bmatrix} \begin{bmatrix} X^T y \\ 0 \end{bmatrix} \qquad (96)$$
$$= \begin{bmatrix} \left(X^T X\right)^{-1} X^T y \\ 0 \end{bmatrix}, \qquad (97)$$

where we have used the fact that 1^T y = 0 and X^T 1 = 0 because the mean of y and of each of
the columns of X is equal to zero.

A.4 Proof of Theorem 1.5

By Proposition 1.1,

$$\left\| \beta - \beta_{\mathrm{ls}} \right\|_2^2 = \left\| \beta - V \Sigma^{-1} U^T y \right\|_2^2 \qquad (98)$$
$$= \left\| \beta - V \Sigma^{-1} U^T \left( X\beta + z \right) \right\|_2^2 \qquad (99)$$
$$= \left\| V \Sigma^{-1} U^T z \right\|_2^2 \qquad (100)$$
$$= \left\| \Sigma^{-1} U^T z \right\|_2^2 \qquad (101)$$
$$= \sum_{j=1}^{p} \left( \frac{U_j^T z}{\sigma_j} \right)^2, \qquad (102)$$

which implies

$$\frac{\left\| U^T z \right\|_2^2}{\sigma_{\max}^2} \le \left\| \beta - \beta_{\mathrm{ls}} \right\|_2^2 \le \frac{\left\| U^T z \right\|_2^2}{\sigma_{\min}^2}. \qquad (103)$$

The distribution of the p-dimensional vector U^T z is Gaussian with mean zero and covariance
matrix

$$U^T \Sigma_z U = \sigma_z^2 I, \qquad (104)$$

where Σ_z = σ_z² I is the covariance matrix of z. As a result, (1/σ_z²) ||U^T z||²₂ is a chi-square
random variable with p degrees of freedom. By Proposition A.2 in Lecture Notes 5 and the
union bound,

$$p\,\sigma_z^2 \left(1 - \epsilon\right) \le \left\| U^T z \right\|_2^2 \le p\,\sigma_z^2 \left(1 + \epsilon\right) \qquad (105)$$

with probability at least $1 - \exp\left(-\tfrac{p\epsilon^2}{8}\right) - \exp\left(-\tfrac{p\epsilon^2}{2}\right) \ge 1 - 2\exp\left(-\tfrac{p\epsilon^2}{8}\right)$.

A.5 Proof of Theorem 2.3

We define the error

h := β − βlasso . (106)

The following lemma shows that the error satisfies the robust-sparsity condition in the defi-
nition of the restricted-eigenvalue property.

Lemma A.1. In the setting of Theorem 2.3 h := β − βlasso satisfies

||hT c ||1 ≤ ||hT ||1 (107)

where T is the support of the nonzero entries of β.

Proof. Since βlasso is feasible and β is supported on T,

$$\tau = \left\| \beta \right\|_1 \ge \left\| \beta_{\mathrm{lasso}} \right\|_1 \qquad (108)$$
$$= \left\| \beta - h \right\|_1 \qquad (109)$$
$$= \left\| \beta_T - h_T \right\|_1 + \left\| h_{T^c} \right\|_1 \qquad (110)$$
$$\ge \left\| \beta \right\|_1 - \left\| h_T \right\|_1 + \left\| h_{T^c} \right\|_1. \qquad (111)$$
This implies that by the restricted-eigenvalue property we have

$$\left\| h \right\|_2^2 \le \frac{1}{\gamma n} \left\| X h \right\|_2^2. \qquad (112)$$

The following lemma allows us to bound the right-hand side.

Lemma A.2. In the setting of Theorem 2.3,

$$\left\| X h \right\|_2^2 \le 2 \left| z^T X h \right|. \qquad (113)$$

Proof. Because βlasso is the solution to the constrained optimization problem and β is also
feasible, we have

$$\left\| y - X\beta \right\|_2^2 \ge \left\| y - X\beta_{\mathrm{lasso}} \right\|_2^2. \qquad (114)$$

Substituting y = Xβ + z,

$$\left\| z \right\|_2^2 \ge \left\| z + X h \right\|_2^2, \qquad (115)$$

which implies the result.

Since ||hT c ||1 ≤ ||hT ||1 and hT only has s nonzero entries,

$$\left\| h \right\|_1 \le 2 \sqrt{s}\, \left\| h \right\|_2, \qquad (116)$$

so by Lemma A.2, Hölder's inequality and (112),

$$\left\| h \right\|_2^2 \le \frac{2 \left| z^T X h \right|}{\gamma n} \qquad (117)$$
$$\le \frac{2 \left\| X^T z \right\|_\infty \left\| h \right\|_1}{\gamma n} \qquad (118)$$
$$\le \frac{4 \sqrt{s}\, \left\| h \right\|_2 \left\| X^T z \right\|_\infty}{\gamma n}. \qquad (119)$$

The proof of the theorem is completed by the following lemma, which uses the assumption on
the noise z to bound ||X^T z||∞.

Lemma A.3. In the setting of Theorem 2.3,

$$P\left( \left\| X^T z \right\|_\infty > \sigma_z \sqrt{2\alpha \log p}\; \max_i \left\| X_i \right\|_2 \right) \le 2 \exp\left( -\left(\alpha - 1\right) \log p \right) \qquad (120)$$

for any α > 2.
Proof. X_i^T z is Gaussian with variance σ_z² ||X_i||²₂, so for t > 0, by Lemma 3.5 in Lecture Notes
5,

$$P\left( \left| X_i^T z \right| > t\, \sigma_z \left\| X_i \right\|_2 \right) \le 2 \exp\left( -\frac{t^2}{2} \right). \qquad (121)$$

By the union bound,

$$P\left( \left\| X^T z \right\|_\infty > t\, \sigma_z \max_i \left\| X_i \right\|_2 \right) \le 2 p \exp\left( -\frac{t^2}{2} \right) \qquad (122)$$
$$= 2 \exp\left( -\frac{t^2}{2} + \log p \right). \qquad (123)$$

Choosing t = √(2α log p) for α > 2 we obtain the desired result.

A.6 Proof of Proposition 3.1

The ridge-regression cost function is equivalent to the least-squares cost function

$$\text{minimize} \quad \left\| \begin{bmatrix} y \\ 0 \end{bmatrix} - \begin{bmatrix} X \\ \sqrt{\lambda}\, I \end{bmatrix} \tilde{\beta} \right\|_2^2. \qquad (124)$$

By Proposition 1.1 the solution to this problem is

$$\beta_{\mathrm{ridge}} := \left( \begin{bmatrix} X \\ \sqrt{\lambda}\, I \end{bmatrix}^T \begin{bmatrix} X \\ \sqrt{\lambda}\, I \end{bmatrix} \right)^{-1} \begin{bmatrix} X \\ \sqrt{\lambda}\, I \end{bmatrix}^T \begin{bmatrix} y \\ 0 \end{bmatrix} \qquad (125)$$
$$= \left( X^T X + \lambda I \right)^{-1} X^T \left( X\beta + z \right) \qquad (126)$$
$$= \left( V \Sigma^2 V^T + \lambda\, V V^T \right)^{-1} \left( V \Sigma^2 V^T \beta + V \Sigma U^T z \right) \qquad (127)$$
$$= V \left( \Sigma^2 + \lambda I \right)^{-1} V^T \left( V \Sigma^2 V^T \beta + V \Sigma U^T z \right), \qquad (128)$$

which is equal to the expression in (56).

A.7 Proof of Lemma 3.2

Since Xi = Xj,

$$X\beta(\alpha) = \left( \alpha\, \beta_{\mathrm{lasso},i} + (1-\alpha)\, \beta_{\mathrm{lasso},j} \right) X_i + \left( (1-\alpha)\, \beta_{\mathrm{lasso},i} + \alpha\, \beta_{\mathrm{lasso},j} \right) X_j + \sum_{k \notin \{i,j\}} \beta_{\mathrm{lasso},k} X_k$$
$$= \beta_{\mathrm{lasso},i} X_i + \beta_{\mathrm{lasso},j} X_j + \sum_{k \notin \{i,j\}} \beta_{\mathrm{lasso},k} X_k \qquad (129)$$
$$= X\beta_{\mathrm{lasso}}, \qquad (130)$$

which implies ||y − Xβ(α)||₂ = ||y − Xβlasso||₂. Similarly,

$$\left\| \beta(\alpha) \right\|_1 = \left| \alpha\, \beta_{\mathrm{lasso},i} + (1-\alpha)\, \beta_{\mathrm{lasso},j} \right| + \left| (1-\alpha)\, \beta_{\mathrm{lasso},i} + \alpha\, \beta_{\mathrm{lasso},j} \right| + \sum_{k \notin \{i,j\}} \left| \beta_{\mathrm{lasso},k} \right| \qquad (131)$$
$$\le \alpha \left| \beta_{\mathrm{lasso},i} \right| + (1-\alpha) \left| \beta_{\mathrm{lasso},j} \right| + (1-\alpha) \left| \beta_{\mathrm{lasso},i} \right| + \alpha \left| \beta_{\mathrm{lasso},j} \right| + \sum_{k \notin \{i,j\}} \left| \beta_{\mathrm{lasso},k} \right|$$
$$= \left\| \beta_{\mathrm{lasso}} \right\|_1. \qquad (132)$$

This implies that β(α) must also be a solution.
This implies that β (α) must also be a solution.

A.8 Proof of Lemma 3.3

Consider

$$\beta(\alpha)_i := \alpha\, \beta_{R,i} + (1-\alpha)\, \beta_{R,j}, \qquad (133)$$
$$\beta(\alpha)_j := (1-\alpha)\, \beta_{R,i} + \alpha\, \beta_{R,j}, \qquad (134)$$
$$\beta(\alpha)_k := \beta_{R,k}, \quad k \notin \{i,j\}, \qquad (135)$$

for 0 < α < 1. By the same argument as in (130), ||y − Xβ(α)||₂ = ||y − XβR||₂. We define

$$\beta'_{R,i} := \beta_{R,j}, \qquad (136)$$
$$\beta'_{R,j} := \beta_{R,i}, \qquad (137)$$
$$\beta'_{R,k} := \beta_{R,k}, \quad k \notin \{i,j\}. \qquad (138)$$

Note that because R is invariant to the ordering of its argument, R(βR) = R(β'R). Since
β(α) = α βR + (1 − α) β'R, by strict convexity of R,

$$R\left(\beta(\alpha)\right) < \alpha\, R\left(\beta_R\right) + (1-\alpha)\, R\left(\beta'_R\right) \qquad (139)$$
$$= R\left(\beta_R\right) \qquad (140)$$

if βR,i ≠ βR,j. Since this would mean that βR is not a solution to the regularization problem,
this implies that βR,i = βR,j.

A.9 Proof of Lemma 4.2

We have that

$$\left\| \beta + h \right\|_{1,2} \ge \left\| \beta \right\|_{1,2} + g^T h \qquad (141)$$

for all possible h ∈ R^p if and only if

$$\left\| \beta_{G_i} + h_{G_i} \right\|_2 \ge \left\| \beta_{G_i} \right\|_2 + g_{G_i}^T h_{G_i} \qquad (142)$$

for all possible h_{G_i} ∈ R^{|G_i|}, for 1 ≤ i ≤ k.

If β_{G_i} ≠ 0, the only vector g_{G_i} that satisfies (142) is the gradient of the `2 norm at β_{G_i},

$$\nabla \left\| \cdot \right\|_2 \left( \beta_{G_i} \right) = \frac{\beta_{G_i}}{\left\| \beta_{G_i} \right\|_2}. \qquad (143)$$

The fact that the gradient has this form follows from the chain rule.
If β_{G_i} = 0, any vector g_{G_i} with `2 norm bounded by one satisfies (142) by the Cauchy–Schwarz
inequality.

A.10 Proof of Proposition 4.3

We can separate the minimization problem into the different groups,

$$\min_{\tilde{\beta}} \frac{1}{2} \left\| \beta - \tilde{\beta} \right\|_2^2 + \alpha \left\| \tilde{\beta} \right\|_{1,2} = \sum_{i=1}^{k} \min_{\tilde{\beta}_{G_i}} \frac{1}{2} \left\| \beta_{G_i} - \tilde{\beta}_{G_i} \right\|_2^2 + \alpha \left\| \tilde{\beta}_{G_i} \right\|_2. \qquad (144)$$

We can therefore minimize the different terms of the sum separately. Each term

$$\frac{1}{2} \left\| \beta_{G_i} - \tilde{\beta}_{G_i} \right\|_2^2 + \alpha \left\| \tilde{\beta}_{G_i} \right\|_2 \qquad (145)$$

is convex and has subgradients of the form

$$g\left(\tilde{\beta}_{G_i}\right) := \tilde{\beta}_{G_i} - \beta_{G_i} + \alpha\, q\left(\tilde{\beta}_{G_i}\right), \qquad (146)$$

$$q\left(\tilde{\beta}_{G_i}\right) := \frac{\tilde{\beta}_{G_i}}{\left\|\tilde{\beta}_{G_i}\right\|_2} \ \text{ if } \tilde{\beta}_{G_i} \neq 0, \qquad \text{and any vector with } \left\| q\left(\tilde{\beta}_{G_i}\right) \right\|_2 \le 1 \ \text{ if } \tilde{\beta}_{G_i} = 0. \qquad (147)$$

This follows from Lemma 4.2 and the fact that the sum of subgradients of several functions
is a subgradient of their sum.
Any minimizer β̂_{G_i} of (145) must satisfy

$$g\left(\hat{\beta}_{G_i}\right) = 0 \qquad (148)$$

for some valid choice of q(β̂_{G_i}). This implies that if β̂_{G_i} ≠ 0 then

$$\beta_{G_i} = \hat{\beta}_{G_i} + \frac{\alpha\, \hat{\beta}_{G_i}}{\left\| \hat{\beta}_{G_i} \right\|_2} \qquad (149)$$
$$= \left( \left\| \hat{\beta}_{G_i} \right\|_2 + \alpha \right) \frac{\hat{\beta}_{G_i}}{\left\| \hat{\beta}_{G_i} \right\|_2}. \qquad (150)$$

As a result, β_{G_i} and β̂_{G_i} are collinear and

$$\left\| \hat{\beta}_{G_i} \right\|_2 = \left\| \beta_{G_i} \right\|_2 - \alpha, \qquad (151)$$

which can only hold if ||β_{G_i}||₂ ≥ α. In that case,

$$\hat{\beta}_{G_i} = \beta_{G_i} - \frac{\alpha\, \hat{\beta}_{G_i}}{\left\| \hat{\beta}_{G_i} \right\|_2} \qquad (152)$$
$$= \beta_{G_i} - \frac{\alpha\, \beta_{G_i}}{\left\| \beta_{G_i} \right\|_2}. \qquad (153)$$

This establishes that as long as ||β_{G_i}||₂ ≥ α, (153) is a solution to the proximal problem.

If β̂_{G_i} = 0, then by (148)

$$\alpha \ge \left\| \alpha\, q\left(\hat{\beta}_{G_i}\right) \right\|_2 = \left\| \hat{\beta}_{G_i} - \beta_{G_i} \right\|_2 \qquad (154)$$
$$= \left\| \beta_{G_i} \right\|_2. \qquad (155)$$

This establishes that as long as ||β_{G_i}||₂ ≤ α, β̂_{G_i} = 0 is a solution to the proximal problem.
