Local Distance-Based Generalized Linear PDF
XREAP2012-11
1. Introduction
Boj, Delicado, and Fortiana (2010) introduced the local Distance-Based Linear Model
(DB-LM), a nonparametric prediction technique extending (weighted) DB-LM. In the present
paper we introduce further extensions. In general, any statistical technique based on
Weighted Least Squares (WLS) can be adapted to data presented as an inter-individual
distance matrix by simply replacing each WLS step with the corresponding weighted DB-LM.
This procedure is easily extended to Iterative Weighted Least Squares (IWLS), as applied
in many statistical methods, ranging from Generalized Linear Models (GLM) (McCullagh
and Nelder 1989) to Robust Regression (see, for instance, Green (1984) and Street, Carroll,
and Ruppert (1988)). Here we develop in detail Distance-Based Generalized Linear
Models (DB-GLM), and then we construct their local version.
1 Universitat de Barcelona
2 Universitat Politècnica de Catalunya
3 CEEISCAT
4 Corresponding author: Eva Boj. Dept. de Matemàtica Econòmica, Financera i Actuarial, Univ.
de Barcelona, Diagonal 690, 08034 Barcelona, Spain. Tel: +34 934035744. Fax: +34-934034892.
E-mail: [email protected]
The dbstats R package (Boj, Caballé, Delicado, and Fortiana 2012) contains classes
and functions implementing distance-based prediction methods such as DB-LM, local
DB-LM, DB-GLM, local DB-GLM and Distance-Based Partial Least Squares Regression
(DB-PLSR) (Boj, Claramunt, Grané, and Fortiana 2007).
The paper is structured as follows: In Section 2.1 we review the main features of DB-
LM; in Section 2.2 we develop DB-GLM as an extension of DB-LM; in Section 2.3 we
introduce local DB-GLM. In Section 3 we describe the dbstats package for R. Finally,
in Section 4, we illustrate the use of dbstats to fit DB-GLM and local DB-GLM with
several examples.
2. Distance-Based Prediction
In this section, after recalling the main characteristics of DB-LM, we present DB-GLM
and then we show how to construct its local version.
DB-LM was introduced by Cuadras (1989) and has been developed in Cuadras and
Arenas (1990), Cuadras, Arenas, and Fortiana (1996), Boj, Claramunt, and Fortiana
(2007), Esteve, Boj, and Fortiana (2009) and Boj, Delicado, and Fortiana (2010). Here
we recall its main concepts, as given in these articles, where the reader is referred to for
more details and proofs.
Individuals in Ω are described by a set Z of variables, henceforth observed predictors,
possibly including both quantitative and qualitative measurements or, possibly, other
nonstandard quantities, such as character strings or functions. A distance (metric or
semi-metric) δ( · , · ) is defined in Ω, as a function of the Z variables. We denote by ∆
the n × n matrix whose entries are the squared distances $\delta^2(\Omega_i, \Omega_j)$.
Assume a new case $\Omega_{n+1}$ is available, and we are given the 1 × n vector $\delta_{n+1}$ of squared
distances from $\Omega_{n+1}$ to the n previously known individuals. $\Omega_{n+1}$ can be represented as
a k-vector $x_{n+1}$ in the row space of $X_w$. Then, the predicted Y for $\Omega_{n+1}$ is $x_{n+1} \cdot \hat\beta$,
where $\hat\beta$ is the vector of estimated regression coefficients.
DB-LM does not depend on a specific $X_w$, since the final quantities are obtained directly
from the distances. Usually such a configuration need not be made explicit, and neither
do $\hat\beta$ or $x_{n+1}$. In DB-LM the hat matrix is:
$$H_w = G_w \, D_w^{1/2} \, F_w^{+} \, D_w^{1/2}, \qquad (1)$$
where $D_w = \mathrm{diag}(w)$ is the diagonal matrix whose diagonal entries are the weights w,
$$F_w = D_w^{1/2} \, G_w \, D_w^{1/2},$$
and $F_w^{+}$ is the Moore-Penrose pseudo-inverse of $F_w$. Thus, $H_w$ is an intrinsic quantity,
meaning that it can be expressed directly as a function of the distances or, equivalently,
the inner products.
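As an illustration of equation (1), the following base R sketch builds the hat matrix from squared distances alone and compares it with the WLS hat matrix computed on an explicit centered configuration. The toy data, the helper pinv_sym and the centering formula $G_w = -\frac{1}{2} J_w \Delta J_w'$ with $J_w = I - 1 w'$ are assumptions made for this example; none of them is taken from dbstats.

```r
# Hat matrix of DB-LM computed from squared distances only (illustrative sketch).
set.seed(1)
n <- 8
Z <- matrix(rnorm(n * 2), n, 2)          # hypothetical observed predictors
w <- rep(1 / n, n)                       # weights standardized to unit sum
Delta <- as.matrix(dist(Z))^2            # squared Euclidean distances

# w-centered inner products (assumed form): G_w = -1/2 J Delta J', J = I - 1 w'
J <- diag(n) - matrix(1, n, 1) %*% t(w)
Gw <- -0.5 * J %*% Delta %*% t(J)

# Moore-Penrose pseudo-inverse of a symmetric matrix via its eigendecomposition
pinv_sym <- function(A, tol = 1e-10) {
  e <- eigen((A + t(A)) / 2, symmetric = TRUE)
  keep <- e$values > tol * max(e$values)
  V <- e$vectors[, keep, drop = FALSE]
  V %*% (t(V) / e$values[keep])          # V diag(1 / lambda) V'
}

Dh <- diag(sqrt(w))
Fw <- Dh %*% Gw %*% Dh
Hw <- Gw %*% Dh %*% pinv_sym(Fw) %*% Dh  # equation (1)

# Reference: WLS hat matrix on the explicit w-centered coordinates
Xc <- J %*% Z
H_wls <- Xc %*% solve(t(Xc) %*% diag(w) %*% Xc) %*% t(Xc) %*% diag(w)
```

Under these assumptions Hw and H_wls agree up to rounding error, which is the sense in which the hat matrix is intrinsic: the coordinates Xc are never needed to obtain Hw.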
The predicted Y for a new case $\Omega_{n+1}$, given its $\delta_{n+1}$ vector, is:
$$\hat y_{n+1} = \frac{1}{2} \, (g_w - \delta_{n+1}) \, D_w^{1/2} \, F_w^{+} \, D_w^{1/2} \, y. \qquad (2)$$
depending on the chosen metric, r can be as high as n − 1, giving an overparametrized
model with unstable predictions, a sensible procedure is to replace the pseudo-inverse
$F_w^{+}$ with a lower-rank approximation. This can be easily implemented by the Singular
Value Decomposition which, by the Schmidt-Eckart-Young Theorem (see, e.g., Stewart
(1993)), gives the best $\ell_2$ approximation of any given rank k, 1 ≤ k ≤ r. Cross-validation
can then be used to select a suitable k.
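A minimal base R sketch of this rank reduction follows. It inverts only the k largest singular values, that is, it returns the pseudo-inverse of the best rank-k approximation of its argument. The function name pinv_rank_k and the toy matrix are hypothetical and are not part of dbstats.

```r
# Rank-k approximation of a pseudo-inverse via the SVD (illustrative sketch).
pinv_rank_k <- function(A, k) {
  s <- svd(A)
  idx <- seq_len(k)                       # keep the k largest singular values
  s$v[, idx, drop = FALSE] %*% (t(s$u[, idx, drop = FALSE]) / s$d[idx])
}

set.seed(2)
X <- matrix(rnorm(10 * 4), 10, 4)
Fm <- X %*% t(X)                          # symmetric matrix of rank r = 4
P2 <- pinv_rank_k(Fm, 2)                  # rank-2 approximation
P4 <- pinv_rank_k(Fm, 4)                  # k = r: Moore-Penrose pseudo-inverse
```

With k equal to the full rank r the result satisfies the Moore-Penrose conditions; smaller values of k trade fidelity for stability, and k could be chosen by cross-validation as described above.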
In this section we review the basic concepts and notation of GLM, for ease of
reference. As is well known (see, e.g., McCullagh and Nelder (1989)), in a GLM
we have a linear predictor η = X · β, which is related to the mean µ of the response
variable Y by means of a link function g(·), η = g(µ), so that
$$\mu = g^{-1}(\eta). \qquad (3)$$
In a GLM it is assumed that each component of the response has a distribution in the
exponential family, taking the form:
$$f_Y(y; \theta, \phi) = \exp\left\{ \frac{y\,\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right\} \qquad (4)$$
for some specific functions a(·), b(·) and c(·). If φ is known, this is an exponential family
model with canonical parameter θ. The mean and variance of Y are
$$E(Y) = \mu = b'(\theta), \qquad (5)$$
and
$$\mathrm{var}(Y) = b''(\theta) \, a(\phi). \qquad (6)$$
The variance of Y is the product of two functions; one, $b''(\theta)$, depends on the canonical
parameter only (and hence on the mean (5)) and will be called the variance function,
while the other is independent of θ and depends only on φ. The variance function as a
function of µ will be written $V(\mu) = b''(\theta)$. Commonly a(φ) is of the form a(φ) = φ/w,
where φ, the dispersion parameter, is constant over observations and w is a known
prior weight that varies from observation to observation. For instance, if an observation
aggregates n independent readings, then w = n. Finally, we can write (6) as
$$\mathrm{var}(Y) = V(\mu) \cdot \frac{\phi}{w}. \qquad (7)$$
$$z_0 = \hat\eta_0 + (y - \hat\mu_0) \left( \frac{d\eta}{d\mu} \right)_0, \qquad (8)$$
where the link derivative is evaluated at $\hat\mu_0$. The quadratic weight is defined by:
$$W_0^{-1} = \left( \frac{d\eta}{d\mu} \right)_0^{2} \cdot \frac{V_0}{w}, \qquad (9)$$
where $V_0$ is the variance function evaluated at $\hat\mu_0$. Now regress $z_0$ on the covariates
with weight $W_0$ to give new estimates $\hat\beta_1$ of the parameters; from these form a new
estimate $\hat\eta_1$ of the linear predictor. Repeat until changes are sufficiently small.
Note that z is just a linearized form of the link function applied to the data, for, to first
order, $g(y) \simeq g(\mu) + (y - \mu) \cdot g'(\mu)$. The variance of z is $W^{-1}$ (ignoring the dispersion
parameter), assuming that η and µ are fixed and known.
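The iteration defined by (8) and (9) can be sketched in base R for the special case of a Poisson response with log link, for which dη/dµ = 1/µ and V(µ) = µ. The function name iwls_poisson, the starting value y + 0.1, the deviance-based stopping rule and the simulated data are assumptions chosen for this illustration; in DB-GLM the weighted regression step would be carried out by DB-LM instead of lm.wfit.

```r
# Minimal IWLS for a Poisson GLM with log link (illustrative sketch).
iwls_poisson <- function(X, y, w = rep(1, length(y)), maxiter = 100, eps = 1e-10) {
  mu <- y + 0.1                           # initial fitted values (assumed start)
  eta <- log(mu)
  dev_old <- Inf
  for (iter in seq_len(maxiter)) {
    deta_dmu <- 1 / mu                    # link derivative at the current mu
    z <- eta + (y - mu) * deta_dmu        # adjusted dependent variate, eq. (8)
    W <- w / (deta_dmu^2 * mu)            # quadratic weight, eq. (9), V(mu) = mu
    beta <- lm.wfit(X, z, W)$coefficients # weighted regression of z on X
    eta <- drop(X %*% beta)
    mu <- exp(eta)
    dev <- 2 * sum(w * (ifelse(y > 0, y * log(y / mu), 0) - (y - mu)))
    if (abs(dev - dev_old) < eps * (abs(dev) + 0.1)) break
    dev_old <- dev
  }
  list(coefficients = beta, deviance = dev, iterations = iter)
}

set.seed(3)
x <- runif(50)
y <- rpois(50, exp(0.5 + x))
X <- cbind(Intercept = 1, x = x)
fit <- iwls_poisson(X, y)
```

On these simulated data the coefficients can be compared with those of glm(y ~ x, family = poisson()), which uses the same kind of iteration.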
Both DB-GLM and DB-LM share the same elements: a set of n individuals with as-
sociated weights vector w (standardized to unit sum), for which we have observed the
w-centered response vector y and a set of predictors Z. From the latter we calculate
the n × n distances matrix ∆. Just as GLM with respect to LM, DB-GLM differs from
DB-LM in two aspects:
2. The relation between the linear predictor η = X w · β, obtained from the latent
Euclidean configuration X w , and the response y is given by a link function g(·)
as in (3).
The DB-GLM model definition is sound since it does not depend on the particular choice
of the Euclidean configuration. Indeed, as stated, the model consists of random vectors
$(Y_1, \dots, Y_n)'$ whose expectation, $(\mu_1, \dots, \mu_n)'$, transformed by the link function, is a vector
in the column space G of $X_w$. Since G is also the column space of $X_w \cdot X_w' = G_w$
(Rao 1973, p. 27), we conclude that the model depends only on ∆.
To fit DB-GLM we use the IWLS algorithm described above, where DB-LM substitutes
LM in formulas (8) and (9) to regress $z_0$ on the covariates with weight $W_0$, in order to
obtain the new estimate $\hat\eta_1$. Observe that the IWLS estimation process for DB-GLM
does not depend on a specific $X_w$, since the final quantities are obtained directly from
distances. In the first step we need an initial $\hat\mu_0$. Then we calculate $\hat\eta_0$ and $(d\eta/d\mu)_0$.
These two elements depend only on the link function. Finally, we calculate $V_0$, the
variance function in (7) evaluated at $\hat\mu_0$, which depends only on the fitted values $\hat\mu$ at each step.
Prediction for new observations is also independent of the choice of $X_w$. Given a new
case $\Omega_{n+1}$, described by the 1 × n vector $\delta_{n+1}$ of squared distances from $\Omega_{n+1}$ to the n
previously known individuals, the predicted $\eta_{n+1}$ for $\Omega_{n+1}$ is $x_{n+1} \cdot \hat\beta$, which is calculated
with formula (2) with the quantities of the last IWLS step. Then we can calculate
$\mu_{n+1} = g^{-1}(\eta_{n+1})$.
We consider again the framework of Section 2.2, where DB-GLM was introduced.
Our objective is now to fit a local DB-GLM, where local refers to the fact that when the
DB-GLM is used to predict the value of the response variable for an object Ωn+1 , we
use only the information provided by observed objects Ωi , i = 1, . . . , n, that are close to
Ωn+1 , giving to Ωi a weight that is a decreasing function of the distance between Ωi and
Ωn+1 . The idea is to translate to the DB-GLM context the principles of local likelihood, as
stated in Loader (1999) (see also Section 3.4 in Bowman and Azzalini (1997), Section 6.5
in Hastie, Tibshirani, and Friedman (2009) or Section 5.10 in Wasserman (2006)). Our
approach parallels that used in Boj, Delicado, and Fortiana (2010) when local DB-LM
is defined.
Let m(Ωn+1 ) be the expected value of the response y corresponding to the object Ωn+1 .
This is the value we want to estimate and we do that by using DB-GLM. We assume
that two distance functions, δ1 and δ2 , are defined between the elements of Ω (the set of
observable objects). We consider the weights
$$w_i(\Omega_{n+1}) = \frac{K(\delta_1(\Omega_{n+1}, \Omega_i)/h)}{\sum_{j=1}^{n} K(\delta_1(\Omega_{n+1}, \Omega_j)/h)},$$
We consider the new individual Ωn+1 and we compute the squared distances from object
Ωn+1 to other individuals Ωi :
Then we run the IWLS algorithm for DB-GLM to obtain the local DB-GLM estimator
of m(Ωn+1 ):
$$\hat m_{\mathrm{localDB\text{-}GLM}}(\Omega_{n+1}) = \hat y_{n+1}.$$
There are two distance functions involved in the local DB-GLM: one of them, δ1 , is used
to compute the weight of observed objects Ωi around the object Ωn+1 where the response
function is estimated, and the other, δ2 , defines the distances between observations for
computing DB-GLM. The distances δ1 and δ2 can coincide or not. In the context of
local DB-LM, Boj, Delicado, and Fortiana (2010) show that using two distance functions
provides more flexibility than using only one (that is, δ1 = δ2 ).
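The local weights used in this construction can be sketched in base R as follows. The Epanechnikov kernel is one of the kernels offered by the package; the function names, the bandwidth h = 1 and the distances in d1 are hypothetical values chosen for the illustration.

```r
# Local weights from an Epanechnikov kernel (illustrative sketch).
epanechnikov <- function(u) ifelse(abs(u) <= 1, 0.75 * (1 - u^2), 0)

local_weights <- function(d1, h, K = epanechnikov) {
  k <- K(d1 / h)                      # kernel evaluated at the scaled distances
  if (sum(k) == 0) stop("no observation within the bandwidth h")
  k / sum(k)                          # normalize so that the weights sum to one
}

# hypothetical distances delta_1(Omega_{n+1}, Omega_i), i = 1, ..., 5
d1 <- c(0.1, 0.4, 0.9, 1.5, 2.0)
w_loc <- local_weights(d1, h = 1)
```

Observations farther than h from the new case receive zero weight under this kernel, so only close objects contribute to the local fit; the distances δ2 enter later, when DB-GLM is fitted with these weights.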
The dbstats package for R (Boj, Caballé, Delicado, and Fortiana 2012) implements sev-
eral distance-based prediction methods. Currently the response is univariate. Distances
can either be directly input as an interdistances matrix, a squared interdistances matrix,
an inner-products matrix or computed from observed explanatory variables. We distin-
guish observed explanatory variables, denoted by Z, from Euclidean coordinates X w .
Observed explanatory variables Z are possibly a mixture of continuous, qualitative or
more general quantities. dbstats does not provide specific methods for computing dis-
tances, depending instead on other available functions and packages, such as:
daisy in the cluster package (Maechler 2012). Compared to dist above whose input
must be numeric variables, the main feature of daisy is its ability to handle other
variable types as well (e.g. nominal, ordinal, (a)symmetric binary) even when
different types occur in the same data set.
dist in the proxy package (Meyer and Buchta 2012). Supersedes the one in the stats
package. It allows a user-provided function, entered as a parameter, for evaluating
distances between observations, hence it can deal with any type of data.
Distance-related classes in dbstats are dist and dissimilarity (as in stats); D2, for
squared distance matrices; and Gram, for doubly centered inner product matrices. Utility
functions such as as.D2, as.Gram, D2toDist, D2toG, distoD2 and GtoD2 allow their
mutual interconversions (see Boj, Caballé, Delicado, and Fortiana (2012) for details).
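To make these conversions concrete, the following base R sketch maps a squared distance matrix to a doubly centered inner product matrix and back. The helper names d2_to_gram and gram_to_d2 are hypothetical stand-ins, not the dbstats functions, and the w-centering formula is the usual one assumed for weighted distance-based methods.

```r
# Base R analogues of the conversions between distance-type objects (sketch).
d2_to_gram <- function(D2, w = rep(1 / nrow(D2), nrow(D2))) {
  n <- nrow(D2)
  J <- diag(n) - matrix(1, n, 1) %*% t(w)   # w-centering matrix (assumed form)
  -0.5 * J %*% D2 %*% t(J)
}

gram_to_d2 <- function(G) {
  g <- diag(G)
  outer(g, g, "+") - 2 * G                  # delta^2_ij = g_ii + g_jj - 2 g_ij
}

set.seed(4)
Z <- matrix(rnorm(12), 6, 2)                # toy predictors
D2 <- as.matrix(dist(Z))^2                  # squared distances
G <- d2_to_gram(D2)                         # inner products
D2_back <- gram_to_d2(G)                    # recovers the squared distances
```

The round trip returns the original squared distances, which is why the package can accept any of these representations as input.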
• dblm for DB-LM.
Generalized linear and local generalized linear models with a univariate response:
In the next subsections we describe the usage of dblm, dbglm and ldbglm. For ldblm
and plsr we refer to Boj, Caballé, Delicado, and Fortiana (2012).
3.1. dblm
The usage of dblm depends on the input information. There are two ways to incorporate
predictors information: either as a formula or as a distance-type object of any of the
four classes: dist, dissimilarity, D2 or Gram.
For class D2
dblm(G, y, ... , method = "OCV", full_search = FALSE, weights,
rel.gvar = 0.95, eff.rank)
formula an object of class formula. A formula of the form y~Z. This argument is a
remnant of the lm function, kept for compatibility.
data an optional data frame containing the variables in the model (both response and
explanatory variables, either the observed ones, Z, or a Euclidean configuration
X w ).
metric metric function to be used when computing distances from observed explanatory
variables. One of "euclidean" (default), "manhattan", or "gower".
distance a dist or dissimilarity class object. See functions dist in the package
stats and daisy in the package cluster.
G a Gram class object. Doubly centered inner product matrix of the squared distances
matrix D2, i.e., Gw . See details above to learn the usage of dblm.Gram.
method sets the method to be used in deciding the effective rank, which is defined as the
number of linearly independent Euclidean coordinates used in prediction. There
are six different methods: "AIC", "BIC", "OCV" (default), "GCV", "eff.rank" and
"rel.gvar". OCV and GCV take the effective rank minimizing a cross-validatory
quantity (either ordinary ocv or generalized gcv). AIC and BIC take the effective
rank minimizing, respectively, the Akaike or Bayesian Information Criterion (see
the R function AIC for more details). The optimization procedure to be used in
the above four methods can be set with the full_search optional parameter.
When method is eff.rank, the effective rank is explicitly set by the user through
the eff.rank optional parameter which, in this case, becomes mandatory.
When method is rel.gvar, the fraction of the data geometric variability for model
fitting is explicitly set by the user through the rel.gvar optional parameter which,
in this case, becomes mandatory.
full_search sets which optimization procedure will be used to minimize the modelling
criterion specified in method. Needs to be specified only if method is "AIC", "BIC",
"OCV" or "GCV". If full_search=TRUE, the effective rank is set to its global best value,
after evaluating the criterion for all possible ranks. This is potentially too
computationally expensive. If full_search=FALSE, the R function optimize is called.
Computation time is then shorter, but the result may be only a local minimum.
rel.gvar relative geometric variability (a real number between 0 and 1). Take the lowest
effective rank with a relative geometric variability greater than or equal to rel.gvar.
The default value (rel.gvar=0.95) uses 95% of the total variability. Applies only
if method = "rel.gvar".
eff.rank integer between 1 and the number of observations minus one. Number of
Euclidean coordinates used for model fitting. Applies only if method="eff.rank".
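The rel.gvar rule can be illustrated with a short base R sketch. It assumes that the geometric variability captured by the first k Euclidean coordinates is measured by the sum of the k largest eigenvalues of the inner products matrix; this reading, the function name rank_for_rel_gvar and the toy matrix are assumptions, not a description of the dblm internals.

```r
# Lowest rank whose eigenvalues reach a given fraction of the total (sketch).
rank_for_rel_gvar <- function(G, rel.gvar = 0.95, tol = 1e-10) {
  lambda <- eigen((G + t(G)) / 2, symmetric = TRUE, only.values = TRUE)$values
  lambda <- lambda[lambda > tol * max(lambda)]   # positive eigenvalues only
  frac <- cumsum(lambda) / sum(lambda)           # cumulative relative variability
  which(frac >= rel.gvar - 1e-12)[1]             # small slack against rounding
}

G <- diag(c(6, 3, 1, 0))                 # toy matrix with eigenvalues 6, 3, 1, 0
k90 <- rank_for_rel_gvar(G, 0.90)        # two coordinates explain 9/10
k95 <- rank_for_rel_gvar(G, 0.95)        # three coordinates are needed
```

Setting rel.gvar = 1 in this sketch keeps every coordinate with a positive eigenvalue, which corresponds to using the complete latent configuration.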
The function returns a list of class dblm containing the following components:
eff.rank the dimensions chosen to estimate the model.
3.2. dbglm
For class D2
rel.gvar = 0.95, eff.rank = NULL, offset, mustart = NULL)
formula an object of class formula. A formula of the form y~Z. This argument is a
remnant of the glm function, kept for compatibility.
data an optional data frame containing the variables in the model (both response and
explanatory variables, either the observed ones, Z, or a Euclidean configuration
X w ).
metric metric function to be used when computing distances from observed explanatory
variables. One of "euclidean" (the default), "manhattan", or "gower".
distance a dist or dissimilarity class object. See functions dist in the package
stats and daisy in the package cluster.
G a Gram class object. Doubly centered inner product matrix of the squared distances
matrix D2, i.e., Gw . See details in dblm.
family a description of the error distribution and link function to be used in the model.
This can be a character string naming a family function, a family function or the
result of a call to a family function. (See the R function family for details of family
functions.)
method sets the method to be used in deciding the effective rank, used in predic-
tion. There are six different methods: "AIC", "BIC", "OCV" (default), "GCV",
"eff.rank" and "rel.gvar". OCV and GCV take the effective rank minimizing
a cross-validatory quantity (either ordinary ocv or generalized gcv). AIC and
BIC take the effective rank minimizing, respectively, the Akaike or Bayesian
Information Criterion (see the R function AIC for more details). The optimization
procedure to be used in the above four methods can be set with the full_search
optional parameter.
When method is eff.rank, the effective rank is explicitly set by the user through
the eff.rank optional parameter which, in this case, becomes mandatory.
When method is rel.gvar, the fraction of the data geometric variability for model
fitting is explicitly set by the user through the rel.gvar optional parameter which,
in this case, becomes mandatory.
full_search sets which optimization procedure will be used to minimize the modelling
criterion specified in method. Needs to be specified only if method is "AIC", "BIC",
"OCV" or "GCV". If full_search=TRUE, the effective rank is set to its global best value,
after evaluating the criterion for all possible ranks. This is potentially too
computationally expensive. If full_search=FALSE, the R function optimize is called.
Computation time is then shorter, but the result may be only a local minimum.
weights an optional numeric vector of prior weights to be used in the fitting process.
By default all individuals have the same weight.
maxiter maximum number of iterations in the iterated dblm algorithm. (Default = 100)
rel.gvar relative geometric variability (a real number between 0 and 1). At each dblm
iteration, take the lowest effective rank with a relative geometric variability greater
than or equal to rel.gvar. The default value (rel.gvar=0.95) uses 95% of the total
variability.
eff.rank integer between 1 and the number of observations minus one. Number of
Euclidean coordinates used for model fitting in each dblm iteration. If specified,
its value overrides rel.gvar. When eff.rank=NULL (default), calls to dblm are
made with method="rel.gvar".
offset this can be used to specify an a priori known component to be included in the
linear predictor during fitting. This should be NULL or a numeric vector of length
equal to the number of cases.
For Gamma-distributed responses, the domain of the canonical link function is not the
same as the permitted range of the mean. In particular, the linear predictor might be
negative, yielding an impossible negative mean. Should that event occur, dbglm stops
with an error message. The proposed alternative is to use a non-canonical link function.
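The issue can be reproduced with the family objects of base R alone. In the sketch below the values of the linear predictor are hypothetical; linkinv and validmu are components of the objects returned by Gamma().

```r
# Why the canonical (inverse) link can fail for Gamma responses (sketch).
fam_inv <- Gamma(link = "inverse")   # canonical link: mu = 1 / eta
fam_log <- Gamma(link = "log")       # non-canonical alternative: mu = exp(eta)

eta <- c(2, 0.5, -0.5)               # hypothetical linear predictor values
mu_inv <- fam_inv$linkinv(eta)       # the last mean is negative
mu_log <- fam_log$linkinv(eta)       # all means are positive

ok_inv <- fam_inv$validmu(mu_inv)    # admissibility check fails
ok_log <- fam_log$validmu(mu_log)    # admissibility check passes
```

With the log link any real value of the linear predictor maps to a positive mean, so the admissibility check cannot fail for that reason.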
The function returns a list of class dbglm containing the following components:
residuals the working residuals, that is the dblm residuals in the last iteration of dblm
fit.
aic.model A version of Akaike’s Information Criterion. Equal to minus twice the max-
imized log-likelihood plus twice the number of parameters. Computed by the aic
component of the family. For binomial and Poisson families the dispersion is fixed
at one and the number of parameters is the number of coefficients. For Gaussian,
Gamma and Inverse Gaussian families the dispersion is estimated from the resid-
ual deviance, and the number of parameters is the number of coefficients plus one.
For a Gaussian family the MLE of the dispersion is used so this is a valid value
of AIC, but for Gamma and Inverse Gaussian families it is not. For families fitted
by quasi-likelihood the value is NA.
null.deviance the deviance for the null model. The null model will include the offset,
and an intercept if there is one in the model. Note that this will be incorrect if
the link function depends on the data other than through the fitted mean: specify
a zero offset to force a correct calculation.
weights the working weights, that is, the weights in the last iteration of the dblm fit.
convcrit convergence criterion. One of: "DevStat" (stopping criterion 1), "muStat"
(stopping criterion 2), "maxiter" (maximum allowed number of iterations has
been exceeded).
eff.rank the working effective rank, that is the eff.rank in the last dblm iteration.
3.3. ldbglm
For class D2
ldbglm(D2_1, D2_2 = D2_1, y, family = gaussian(), kind.of.kernel = 1,
method = "GCV", weights, user_h = NULL, h.range = NULL, noh = 10,
k.knn = 3, rel.gvar = 0.95, eff.rank = NULL, maxiter = 100,
eps1 = 1e-10, eps2 = 1e-10, ...)
formula an object of class formula. A formula of the form y~Z. This argument is a
remnant of the loess function, kept for compatibility.
data an optional data frame containing the variables in the model (both response and
explanatory variables, either the observed ones, Z, or a Euclidean configuration
X w ).
dist1 a dist or dissimilarity class object. Distances between observations, used to
define the local neighborhood. Weights for observations are computed as a
decreasing function of their dist1 distances to the neighborhood center, e.g. a
new observation whose response has to be predicted. These weights are then
passed to dbglm, where distances are evaluated with dist2.
dist2 a dist or dissimilarity class object. Distances between observations, used for
fitting dbglm. Default dist2=dist1.
G1 a Gram class object. Doubly centered inner product matrix associated with the
squared distances matrix D2_1.
G2 a Gram class object. Doubly centered inner product matrix associated with the
squared distances matrix D2_2. Default G2=G1
family a description of the error distribution and link function to be used in the model.
This can be a character string naming a family function, a family function or the
result of a call to a family function. (See the R function family for details of family
functions.)
kind.of.kernel integer number between 1 and 6 which determines the user’s choice of
smoothing kernel. (1) Epanechnikov (Default), (2) Biweight, (3) Triweight, (4)
Normal, (5) Triangular, (6) Uniform.
metric1 metric function to be used when computing dist1 from observed explanatory
variables. One of "euclidean" (default), "manhattan", or "gower".
metric2 metric function to be used when computing dist2 from observed explanatory
variables. One of "euclidean" (default), "manhattan", or "gower".
method sets the method to be used in deciding the optimal bandwidth h. There are
five different methods, AIC, BIC, OCV, GCV (default) and user_h. OCV and GCV take
the optimal bandwidth minimizing a cross-validatory quantity (either ocv or gcv).
AIC and BIC take the optimal bandwidth minimizing, respectively, the Akaike or
Bayesian Information Criterion (see the R function AIC for more details). When
method is user_h, the bandwidth is explicitly set by the user through the user_h
optional parameter which, in this case, becomes mandatory.
user_h global bandwidth, set by the user, controlling the size of the local neighborhood
of Z. Smoothing parameter (Default: 1st quartile of all the distances d(i,j)
in dist1). Applies only if method="user_h".
h.range a vector of length 2 giving the range for automatic bandwidth choice. (Default:
quantiles 0.05 and 0.5 of d(i,j) in dist1).
noh number of bandwidth h values within h.range for automatic bandwidth choice (if
method!="user_h").
rel.gvar relative geometric variability (a real number between 0 and 1). At each dblm
iteration, take the lowest effective rank with a relative geometric variability greater
than or equal to rel.gvar. The default value (rel.gvar=0.95) uses 95% of the total
variability.
eff.rank integer between 1 and the number of observations minus one. Number of
Euclidean coordinates used for model fitting in each dblm iteration. If specified,
its value overrides rel.gvar. When eff.rank=NULL (default), calls to dblm are
made with method="rel.gvar".
maxiter maximum number of iterations in the iterated dblm algorithm. (Default = 100)
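The role of h.range and noh can be sketched in base R as follows. The default quantiles 0.05 and 0.5 follow the description of h.range above; the equally spaced grid, the function name bandwidth_grid and the toy data are assumptions, since the text does not state how the noh values are placed within the range.

```r
# Candidate bandwidths from quantiles of the pairwise distances (sketch).
bandwidth_grid <- function(dist1, h.range = NULL, noh = 10) {
  d <- as.numeric(dist1)                         # pairwise distances d(i, j)
  if (is.null(h.range)) h.range <- quantile(d, c(0.05, 0.5), names = FALSE)
  seq(h.range[1], h.range[2], length.out = noh)  # equally spaced grid (assumed)
}

set.seed(5)
Z <- matrix(rnorm(40), 20, 2)                    # toy predictors
h_grid <- bandwidth_grid(dist(Z), noh = 10)
```

Each candidate h would then be scored with the criterion selected in method (for instance GCV), and the bandwidth with the smallest value retained.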
The function returns a list of class ldbglm containing the following components:
dist1 the distance matrix (object of class "D2" or "dist") used to calculate the weights
of the observations.
dist2 the distance matrix (object of class "D2" or "dist") used to fit the dbglm.
4. Examples
We fit DB-GLM to the data set on Swedish third-party motor insurance in 1977 described
in Hallin and Ingenbleek (1983). The data are included in the faraway package under the
name motorins (Faraway 2012). Data for factor Zone = 1 can also be found in Andrews and
Herzberg (1985, pp. 413-421). These data correspond to the cities of Stockholm, Göteborg
and Malmö, and were obtained from a committee study of risk premiums in motor
insurance. The total number of observations (for Zone = 1) is n = 295, corresponding to
different non-empty risk groups. For each group, Y is the number of claims suffered
by the insured automobiles during the exposure w, which is the number of insured in
policy-years. Three factors are thought to be important in modeling the occurrence of
claims: Distance (Kilometers Travelled), Bonus (No-claims bonus) and Make (specified
car makes). The numbers of levels of these factors are 5, 7 and 9, respectively. Distance
and Bonus are continuous numerical predictors, and we have coded numerical versions
of them as follows:
We have represented each state of Distance by a class mark. Central classes are repre-
sented by the interval average, whereas class marks for the extreme classes are reasonably
representative values. The codes are:
Make will be considered as a nominal categorical variable in Gower’s formula (11). It
is coded numerically (as 1 to 9) just as a programming convenience. It represents 9
specified car makes.
R> library(dbstats)
R> require(faraway)
R> data(motorins)
R> Motor1 <- subset(motorins, Zone == 1)
R> Motor1$frequency <- Motor1$Claims / Motor1$Insured
R> y <- Motor1$frequency
R> w <- Motor1$Insured
R> Motor1$KmC <- rep(0,nrow(Motor1))
R> Motor1$KmC[Motor1$Kilometres == "1"] <- 750
R> Motor1$KmC[Motor1$Kilometres == "2"] <- 8000
R> Motor1$KmC[Motor1$Kilometres == "3"] <- 17500
R> Motor1$KmC[Motor1$Kilometres == "4"] <- 22500
R> Motor1$KmC[Motor1$Kilometres == "5"] <- 40000
R> Motor1$BonC <- as.numeric(Motor1$Bonus)
R> Motor1$MakeC <- as.numeric(Motor1$Make)
The first step in the treatment of these data by DB-GLM is the choice of a suitable
metric. In principle it is possible to tailor a metric to reflect specific information on
predictors and on how their proximity relates to the particular prediction under study.
Here it is sufficient to utilize an omnibus metric function which satisfies the Euclidean
condition. One very popular such metric for mixtures of numerical continuous, cate-
gorical and binary predictor variables is the one based on Gower’s general similarity
coefficient (see Gower (1971) for further details):
$$s_{ij} = \frac{\sum_{h=1}^{p_1} \left( 1 - |x_{ih} - x_{jh}| / G_h \right) + a + \alpha_{ij}}{p_1 + (p_2 - d) + p_3} \qquad (11)$$
where $p_1$ is the number of continuous variables, a and d are the number of positive
and negative matches, respectively, for the $p_2$ binary variables, and $\alpha_{ij}$ is the number
of matches for the $p_3$ multi-state categorical variables. $G_h$ is the range of the h-th
continuous variable. The squared distance is computed as $\delta_{ij}^2 = 1 - s_{ij}$. Gower (1971)
proves that this distance satisfies the Euclidean condition. In our example, $p_1 = 2$,
$p_2 = 0$ and $p_3 = 1$ in (11).
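Formula (11) can be sketched in base R for the case of this example, with continuous and nominal predictors and no binary ones ($p_2 = 0$). The function name gower_sq_dist and the three toy rows are hypothetical; they are not the motorins data.

```r
# Gower similarity for continuous and nominal predictors (illustrative sketch).
gower_sq_dist <- function(num, nom) {
  num <- as.matrix(num)
  nom <- as.matrix(nom)
  n <- nrow(num)
  ranges <- apply(num, 2, function(v) diff(range(v)))   # G_h
  S <- matrix(0, n, n)
  for (i in seq_len(n)) for (j in seq_len(n)) {
    s_num <- sum(1 - abs(num[i, ] - num[j, ]) / ranges) # continuous part
    s_nom <- sum(nom[i, ] == nom[j, ])                  # alpha_ij: nominal matches
    S[i, j] <- (s_num + s_nom) / (ncol(num) + ncol(nom))
  }
  1 - S                                                 # delta^2_ij = 1 - s_ij
}

num <- cbind(KmC = c(750, 8000, 40000), BonC = c(1, 4, 7))  # toy values
nom <- cbind(MakeC = c(1, 1, 9))
D2 <- gower_sq_dist(num, nom)
```

Rows 1 and 3 differ by the full range in both continuous variables and in the nominal one, so their squared distance reaches the maximum value 1, while identical rows are at squared distance 0.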
For GLM, we use 11 parameters: 2 for Distance and Bonus, with the class marks defined
above, and 9 for the nominal classes of Make, counting the intercept term (that is, the
intercept plus 8 binary indicators). Both in GLM and DB-GLM, we assume both Poisson
and Binomial distributions for claim frequency, combined with their associated canonical
links. The weights of the regressions are the exposures w.
1. rel.gvar = 1, i.e., we take into account for the model all the dimensions of the
latent Euclidean configurations;
2. method = "GCV", i.e., we choose the effective rank which minimizes a generalized
cross-validation (leave-one-out) statistic;
3. rel.gvar = 0.90, i.e., the model takes into account 90% of the explained
geometric variability. In both cases (Poisson and Binomial) this coincides with
an effective rank of 10, which equals the number of parameters used in the
fitted GLM (not counting the intercept term).
We obtain in both cases (the Poisson and the Binomial ones), lower residual deviances
with the distance-based treatment of the GLM than those obtained with the classical
GLM, see Tables 1 and 2. The detailed instructions used to elaborate these tables for
the function dbglm can be found in Annex A.
To illustrate the summary command, we choose a DB-GLM, the one with Poisson response
and Logarithmic link, using Gower’s distance and fitted with the "GCV" method. In
Annex A it is named dbglm2. We also show the results for dbglm4, which corresponds to
the Poisson model with Logarithmic link fitted using the Euclidean distance and the
complete geometric variability. That case is relevant because its results coincide with
those of the classical GLM assuming Poisson response and Logarithmic link (named glm1
in Annex A). Similarly, the summary of dbglm8 coincides with the summary of glm2 for
Binomial response and Logit link (see Annex A).
If we compare the output of the summary command for dbglm4 and glm1 below, we can
observe that both have a similar layout. The main difference is that dbglm reports no
coefficient estimates, because DB-GLM does not assign a coefficient to
each explanatory variable.
R> summary(dbglm2)
Deviance Residuals:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-6.39000 -0.71300 0.05910 0.02042 0.83250 6.72400
R> summary(dbglm4)
Deviance Residuals:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-6.51300 -0.89800 -0.06430 -0.06486 0.80760 10.09000
R> summary(glm1)
Call:
glm(formula = y ~ KmC + BonC + factor(MakeC), family =
poisson(link = "log"), data = Motor1, weights = w)
Deviance Residuals:
Min 1Q Median 3Q Max
-6.5134 -0.8980 -0.0643 0.8076 10.0902
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.640e+00 2.637e-02 -62.181 < 2e-16 ***
KmC 1.431e-05 6.381e-07 22.424 < 2e-16 ***
BonC -2.165e-01 2.762e-03 -78.387 < 2e-16 ***
factor(MakeC)2 1.282e-01 4.598e-02 2.788 0.00531 **
factor(MakeC)3 -2.140e-01 5.162e-02 -4.146 3.38e-05 ***
factor(MakeC)4 -5.162e-01 4.987e-02 -10.352 < 2e-16 ***
factor(MakeC)5 1.270e-01 4.850e-02 2.618 0.00883 **
factor(MakeC)6 -3.976e-01 4.467e-02 -8.900 < 2e-16 ***
factor(MakeC)7 -1.320e-01 5.891e-02 -2.240 0.02508 *
factor(MakeC)8 1.396e-01 8.673e-02 1.609 0.10762
factor(MakeC)9 -3.079e-02 2.276e-02 -1.353 0.17618
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
[1] -2.521631e-09
which if a subset of the plots is required, specify a subset of the numbers 1:6.
id.n number of points to be labelled in each plot, starting with the most extreme.
main an overall title for the plot. Only if one of the six plots is selected.
type glm the type of prediction (required only for a dbglm class object). Like
predict.dbglm, the default "link" is on the scale of the linear predictors; the alternative
"response" is on the scale of the response variable.
The first five plots are useful for residual analysis and are the same as in plot.lm. The
last plot shows the "OCV", "GCV", "AIC" or "BIC" criterion according to
which the rank used in the dblm function has been chosen. It applies only if the parameter
full_search in dblm is TRUE.
It is easy to get the predicted mean values, as these are calculated by the inverse link
function on the linear predictors. We refer to the R function family to view how to
insert user-defined linkfun and linkinv in dbstats.
To illustrate the plot command, we exhibit for model dbglm2 the five possible plots:
Residuals vs Fitted, Normal Q-Q, Scale-location, Cook’s distance and Residuals
vs Leverage, which can be found in Figures 1, 2, 3, 4 and 5 respectively.
Values predicted with predict may be the expected mean values of the response for the
new data (type="response"), or the linear predictors evaluated at the estimated dblm
of the last iteration, as in the plot command above. Additionally, we can choose
the type of the new data, which can be: "Z" if newdata contains the values of the
explanatory variables, "D2" if it contains the squared distances matrix, or "G" if it
contains the inner products matrix. Its usage is:
With model dbglm2, for data in the original set, such as insureds with 750 kilometers
travelled per year, with Bonus and Make both in class 1, we can execute:
[,1]
[1,] 0.1711442
Poisson / Logarithmic        Residual Deviance   Eff.rank
DB-GLM (rel.gvar = 1)                   454.05         18
DB-GLM (method = "GCV")                 485.72         13
DB-GLM (rel.gvar = 0.90)                539.55         10
GLM                                     779.36         10
Table 1: Results of the fitting for the Poisson model with the Logarithmic link
Table 2: Results of the fitting for the Binomial model with the Logit link
Comparing this prediction with the corresponding one from fitted.values, we confirm
that both values agree (up to a difference of order 1e-15).
R> dbglm2$fitted.values[1] - pr
[,1]
[1,] 4.746203e-15
Now, we compute the prediction for a new insured, e.g., with 900 kilometers travelled
per year, with Bonus and Make both in class 1, and we obtain the prediction:
[,1]
[1,] 0.1717889
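The three kinds of newdata accepted by predict ("Z", "D2" and "G") carry related information. As a sketch in base R, assuming Euclidean distance and uniform weights, the squared distances and the doubly centered inner products for a single set of individuals can be obtained from the explanatory variables as shown below. The matrix Z.toy is simulated, and this is an illustration of the standard identity, not the code used inside dbstats.

```r
set.seed(1)
n <- 5
Z.toy <- matrix(rnorm(n * 2), nrow = n)    # hypothetical explanatory variables

## "D2": squared Euclidean distances between individuals
D2.toy <- unname(as.matrix(dist(Z.toy)))^2

## "G": doubly centered inner products, G = -1/2 J D2 J
J <- diag(n) - matrix(1 / n, n, n)         # centering matrix (uniform weights)
G.toy <- -0.5 * J %*% D2.toy %*% J

## Check against inner products of the centered variables
Zc <- scale(Z.toy, center = TRUE, scale = FALSE)
G.direct <- Zc %*% t(Zc)
max.diff <- max(abs(G.toy - G.direct))     # should be numerically zero
```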
In this section we present an example using local DB-GLM with functional data as the
explanatory variable and a binary response. We say that an observed variable is functional
when a whole function is recorded for each individual in the sample (see Ramsay and
Silverman (2005) for a general perspective on Functional Data Analysis and Ferraty and
Vieu (2006a) for a nonparametric approach).
Figure 1: Residuals vs Fitted plot for DB-GLM with Poisson response and Logarithmic
link, using Gower’s distance and fitted taking into account the GCV method
Figure 2: Normal Q-Q plot for DB-GLM with Poisson response and Logarithmic
link, using Gower’s distance and fitted taking into account the GCV method
Figure 3: Scale-location plot for DB-GLM with Poisson response and Logarithmic
link, using Gower’s distance and fitted taking into account the GCV method
Figure 4: Cook’s distance plot for DB-GLM with Poisson response and Logarithmic
link, using Gower’s distance and fitted taking into account the GCV method
Figure 5: Residuals vs Leverage plot for DB-GLM with Poisson response and Logarithmic
link, using Gower’s distance and fitted taking into account the GCV method
We consider the near infrared (NIR) spectral data set of wheat samples
that was described in Kalivas (1997). This data set contains data from 100 wheat
samples. The available information for each sample consists of two scalar measures
(protein and moisture contents; only protein content is used here) and a functional
variable, the NIR spectra: samples were measured using diffuse reflection in units
of log inverse reflectance log(1/R) at wavelengths going from 1100 to 2500 nm in
2 nm intervals (reflectance refers to the fraction of incident electromagnetic power
that is reflected by the sample; see Brenchley, Hörchner, and Kalivas (1997) for more
details about NIR measurements). The protein and spectrum data are available at
ftp://ftp.clarkson.edu/users/h/o/hopkepk/chemdata/kalivas/, in files protein.asc
and whtspec.asc, respectively.
Let us define the binary variable y indicating for each wheat sample in the data set
whether its protein content is over the median value or not:
Our goal is to predict the variable y using the NIR spectra function (whtspec) as predictor.
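The definition of y described above can be sketched in base R as follows. The vector protein.toy is simulated purely for illustration (the real values come from protein.asc), and its name is hypothetical; only the thresholding at the median reflects the description in the text.

```r
set.seed(123)
protein.toy <- rnorm(100, mean = 11, sd = 1.5)   # hypothetical protein contents

## 1 if the protein content is over the median, 0 otherwise
y.toy <- as.numeric(protein.toy > median(protein.toy))

## With 100 distinct values, half of the samples fall in each class
table(y.toy)
```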
[Figure 6 plot. Panel title: Wheat data set. Vertical axis: NIR spectra ( log(1/R) ). Horizontal axis ticks from 1200 to 2400.]
Figure 6: Wheat data set. NIR spectra functions, jointly with their first and second
derivatives. Functions in red correspond to wheat samples with protein content
over the median.
We use the R package fda.usc (Febrero-Bande and Oviedo 2012) to deal with NIR
spectra data as functional data:
R> library(fda.usc)
R> whtspec.fdata <- fdata(mdata = whtspec, argvals = wave.length, names
= list(main = "Wheat data set", xlab = "Wave length (nm)", ylab =
"NIR spectra ( log(1/R) )"))
R> plot(whtspec.fdata, col = y+1)
R> plot(fdata.deriv(whtspec.fdata, nderiv = 1), col = y+1, main =
"Wheat data set. 1st derivative")
R> plot(fdata.deriv(whtspec.fdata, nderiv = 2), col = y+1, main=
"Wheat data set. 2nd derivative")
This way the functions, as well as their first and second derivatives, are plotted, as Figure
6 shows. Wheat samples have been colored according to the value of the binary variable
y (red when y == 1). From the figure it is not obvious how NIR spectra functions or
their derivatives could allow us to predict the value of y, the indicator of high protein
content.
In order to measure the prediction ability of a given prediction rule we randomly divide
the data set into a training set (with 60 wheat samples) and a validation set (with the
remaining 40 samples):
R> set.seed(2)
R> trai <- sample(1:100)[1:60]
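The split can be completed as in the following base R sketch. The names yt and yv appear elsewhere in this paper (in the ldbglm call and in Annex B), but the way they are built here is our assumption; vali is a hypothetical helper name, and y is replaced by a simulated placeholder so that the block runs on its own.

```r
set.seed(2)
trai <- sample(1:100)[1:60]          # training indices, as above
vali <- setdiff(1:100, trai)         # remaining 40 indices (hypothetical name)

y <- rbinom(100, size = 1, prob = 0.5)   # placeholder binary response

yt <- y[trai]                        # responses in the training set
yv <- y[-trai]                       # responses in the validation set
```

Negative indexing with -trai keeps the original order of the remaining samples, so y[-trai] and y[vali] coincide.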
We will compare the performance of the following binary prediction tools, all of them
using functional predictors:
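• Functional GLM: Functional generalized linear model, fitted with the function
fregre.glm of the package fda.usc.
• DB-GLM: Distance-based generalized linear model, developed in Section 2.2.
• Local DB-GLM: Local distance-based generalized linear model, developed in
Section 2.3.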
For each of these three prediction tools we have used three different functional predictors:
NIR spectra functions, their first derivatives and their second derivatives. The measure of
prediction quality that we use for comparing the 9 procedures under consideration
is the number of misclassified wheat samples among the 40 in the validation set.
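This quality measure can be written as a small helper, following the thresholding at 0.5 that appears in Annex B. The function name count.errors and the toy vectors are hypothetical and only serve to illustrate the computation.

```r
## Number of misclassified samples when predicted probabilities are
## compared with 0/1 responses using a 0.5 threshold
count.errors <- function(pred, truth) {
  sum(abs(pred - truth) > 0.5)
}

yv.toy   <- c(1, 0, 1, 1, 0)              # hypothetical true classes
pred.toy <- c(0.9, 0.4, 0.3, 0.6, 0.7)    # hypothetical predicted probabilities

n.err <- count.errors(pred.toy, yv.toy)   # third and fifth samples are wrong
```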
The following code was used to prepare the functional data to fit the functional GLM
with fregre.glm:
                                     Prediction tool
Functional predictor        Functional GLM   DB-GLM   Local DB-GLM
NIR spectra functions                    9       14              8
First derivative                        12       10              8
Second derivative                       14       12             11
Table 3: Number of misclassified wheat samples among the 40 in the validation set.
The choice of a basis of B-splines with 18 elements (nbasis = 18) is arbitrary and it
could be improved by using a choice based on a leave-one-out prediction error criterion.
These code lines allow us to call the function fregre.glm when the functional predictor
used is the observed NIR spectra function:
In order to use the first or the second derivatives the definition of formula f0 must be
modified as follows:
Then the call to fregre.glm must be changed accordingly (see Annex B). The first
column of Table 3 shows the number of wheat samples in the validation set that are
misclassified when using the functional GLM. It can be seen that the best results are
obtained when using the observed functions.
In order to fit a DB-GLM with the function dbglm, the first step is to compute the
inter-individual distance matrix. When dealing with functional data our choice is to
use one of the semimetrics defined in Ferraty and Vieu (2006a) as they are implemented
in their own R library NPFDA (Ferraty and Vieu (2006b); freely accessible online at
http://www.lsp.ups-tlse.fr/staph/npfda/), and also available in the package fda.usc
(see functions metric.lp, semimetric.basis and semimetric.NPFDA in this package).
In particular, here we use L2 distances between NIR spectra functions or their derivatives,
calculated after representing the functions in a B-spline basis.
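To make the idea of an L2 distance between discretized curves concrete, here is a base R sketch. It approximates the integral of the squared difference by the trapezoidal rule on the observation grid, which is a simpler stand-in for the B-spline based computation described above. The function name l2.dist2 and the toy curves are invented for the example.

```r
## Squared L2 distances between curves observed on a common grid,
## approximated with the trapezoidal rule (not the B-spline approach)
l2.dist2 <- function(X, grid) {
  # X: n x p matrix with one discretized curve per row
  # grid: increasing vector of length p with the observation points
  n <- nrow(X)
  w <- diff(grid)
  D2 <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      d2 <- (X[i, ] - X[j, ])^2
      D2[i, j] <- sum(w * (d2[-1] + d2[-length(d2)]) / 2)
    }
  }
  D2
}

## Toy curves on [0, 1]: f1(t) = t, f2(t) = 0, f3(t) = 1
grid.toy <- seq(0, 1, length.out = 1001)
X.curves <- rbind(grid.toy, rep(0, 1001), rep(1, 1001))
D2.curves <- l2.dist2(X.curves, grid.toy)
```

For these toy curves the exact squared distances are 1/3 between f1 and f2, 1/3 between f1 and f3, and 1 between f2 and f3, which the approximation reproduces closely.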
The following code was used to fit the functional DB-GLM with dbglm. NIR spectra
functions are used as predictors:
In order to use the first or the second derivatives, the parameter nderiv in the second
statement must be set equal to 1 or 2, respectively, to obtain the corresponding distance
matrices D2.1 and D2.1.trai, or D2.2 and D2.2.trai. The rest of the code must be changed
accordingly (see Annex B). The results of the DB-GLM fits, in terms of the number of wheat
samples in the validation set that are misclassified, are shown in the second column of
Table 3. In this case the use of derivatives improves the results. The performances of
functional GLM and DB-GLM are comparable.
We now fit local distance-based generalized linear models (local DB-GLM) with
the function ldbglm. We use again L2 distances between NIR spectra functions
or their derivatives. The automatic choice of the smoothing parameter h is done
with the Generalized Cross Validation criterion (method="GCV"; results using different
methods are similar).
We show the code used for fitting the local DB-GLM when NIR spectra functions are
used as predictors.
Observe that it has been necessary to modify the default range of candidate values for
h (h.range = c(2,4)). With the default range for h, [.5, 2], the optimum value
for h was the upper limit of the interval. Plotting the fitted model (using
plot(res.ldbglm.0, which = 3)) it can be seen that the range [2, 4] is adequate for
h. The optimal value is attained at h = 2.3331, as can be seen in the
summary of the fitted model:
R> summary(res.ldbglm.0)
call: ldbglm.D2(D2_1 = D2.0.trai, y = yt, family = "binomial", method = "GCV",
h.range = c(2, 4), rel.gvar = 0.9, maxiter = 25)
Residuals:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.7100 -0.2875 -0.0550 -0.0150 0.2150 0.8400
Number of Observations: 60
R-squared : 0.3852
Trace of smoother matrix: 4.74
family: binomial
The number of misclassified wheat samples is 8, as can be seen in the third column
of Table 3. In order to use the first or the second derivatives of NIR spectra functions
as predictors, distance matrices D2.0 and D2.0.trai must be replaced by D2.1
and D2.1.trai, or by D2.2 and D2.2.trai. The rest of the code must be changed
accordingly (see Annex B). Care must be taken when choosing a sensible range for the
bandwidth h, because the range of values for distances between the observed
functions is quite different from that corresponding to their first or second derivatives. In
this case the range used for h was h.range=c(0.003,0.007) with first derivatives
and h.range=c(0.0004,0.001) with second derivatives. The third
column of Table 3 shows the number of misclassified samples in the validation set. It
follows from these numbers that the three local DB-GLM fits perform similarly and that
they do a slightly better job than functional GLM or global DB-GLM.
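Since a sensible range for h depends on the scale of the inter-individual distances, a quick summary of those distances can help when setting h.range. The following base R sketch is one possible heuristic for illustration, not a procedure taken from dbstats; the data are simulated, and whether h should be compared with distances or with squared distances should be checked against the package documentation.

```r
set.seed(3)
X.ind   <- matrix(rnorm(60 * 4), nrow = 60)   # 60 hypothetical individuals
D2.ind  <- as.matrix(dist(X.ind))^2           # squared distance matrix

## Distances between distinct pairs of individuals
d.pairs <- sqrt(D2.ind[upper.tri(D2.ind)])

## Quantiles give an idea of the scale on which h could be searched
d.summary <- quantile(d.pairs, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))
d.summary
```

Repeating the summary for distances between derivatives would show the change of scale that motivates the different h.range values used above.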
Acknowledgments
Work supported in part by the Spanish Ministerio de Educación y Ciencia and FEDER,
grants MTM2010-17323 and MTM2010-14887, and by Generalitat de Catalunya, AGAUR
grant 2009SGR970.
A. Code excerpts
R> ###### Binomial (Logit link) and Euclidean distance
R> ## With dbglm, case: rel.gvar = 1
R> dbglm8 <- dbglm(y ~ KmC + BonC + factor(MakeC), Motor1,
family = binomial(link = "logit"), metric = "euclidean", weights
= w, rel.gvar = 1)
R> ## With glm
R> glm2 <- glm(y ~ KmC + BonC + factor(MakeC), family =
binomial(link = "logit"), data = Motor1, weights = w)
B. Code excerpts
R> dbglm.err.1 <- (abs(pred.dbglm.1 - yv) > .5)
R> print(sum(dbglm.err.1))
References
Andrews, D. F. and A. M. Herzberg (1985). Data. A collection of problems from many fields for the student and research worker. New York, NY, USA: Springer.
Boj, E., A. Caballé, P. Delicado, and J. Fortiana (2012). dbstats: Distance-based statistics (dbstats). R package version 1.0.2.
Boj, E., M. M. Claramunt, and J. Fortiana (2007). Selection of predictors in distance-based regression. Communications in Statistics A. Theory and Methods 36, 87–98.
Boj, E., M. M. Claramunt, A. Grané, and J. Fortiana (2007). Implementing pls for distance-based regression: computational issues. Computational Statistics 22, 237–248.
Boj, E., P. Delicado, and J. Fortiana (2010). Local linear functional regression based on weighted distance-based regression. Computational Statistics and Data Analysis 54, 429–437.
Bowman, A. and A. Azzalini (1997). Applied Smoothing Techniques for Data Analysis. Oxford: Oxford University Press.
Brenchley, J. M., U. Hörchner, and J. H. Kalivas (1997). Wavelength selection characterization for nir spectra. Applied Spectroscopy 51, 689–699.
Cuadras, C. and C. Arenas (1990). A distance-based regression model for prediction with mixed data. Communications in Statistics A. Theory and Methods 19, 2261–2279.
Cuadras, C. M. (1989). Distance analysis in discrimination and classification using both continuous and categorical variables. In Y. Dodge (Ed.), Statistical Data Analysis and Inference, Amsterdam, The Netherlands, pp. 459–473. North-Holland Publishing Co.
Cuadras, C. M., C. Arenas, and J. Fortiana (1996). Some computational aspects of a distance-based model for prediction. Communications in Statistics B. Simulation and Computation 25, 593–609.
Esteve, A., E. Boj, and J. Fortiana (2009). Interaction terms in distance-based regression. Communications in Statistics A. Theory and Methods 38, 3498–3509.
Faraway, J. (2012). faraway: Functions and datasets for books by Julian Faraway. R package version 1.0.5.
Febrero-Bande, M. and M. Oviedo (2012). fda.usc: Functional Data Analysis and Utilities for Statistical Computing (fda.usc). R package version 0.9.7.
Ferraty, F. and P. Vieu (2006a). Non parametric functional data analysis. Theory and practice. Springer.
Ferraty, F. and P. Vieu (2006b). Reference manual for implementing NonParametric Functional Data Analysis (NPFDA). Companion manual of the book: Non-Parametric Functional Data Analysis: Theory and Practice, Springer-Verlag (New York), 2006.
Gower, J. C. (1971). A general coefficient of similarity and some of its properties. Biometrics 27, 857–874.
Green, P. J. (1984). Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives. Journal of the Royal Statistical Society. Series B (Methodological) 46 (2), 149–192.
Hallin, M. and J. F. Ingenbleek (1983). The Swedish automobile portfolio in 1977. A statistical study. Skandinavisk Aktuarietidskrift (Scandinavian Actuarial Journal) 83, 49–64.
Hastie, T., R. Tibshirani, and J. Friedman (2009). The Elements of Statistical Learning. Data Mining, Inference, and Prediction (2nd ed.). Springer.
Kalivas, J. H. (1997). Two data sets of near infrared spectra. Chemometrics and Intelligent Laboratory Systems 37, 255–259.
Loader, C. (1999). Local regression and likelihood. New York: Springer.
Maechler, M. (2012). cluster: Cluster Analysis Extended Rousseeuw et al. R package version 1.14.2.
McCullagh, P. and J. A. Nelder (1989). Generalized Linear Models (2nd ed). London: Chapman and Hall.
Meyer, D. and C. Buchta (2012). proxy: Distance and Similarity Measures. R package version 0.4-7.
Ramsay, J. O. and B. W. Silverman (2005). Functional Data Analysis (2nd ed). New York: Springer.
Rao, C. R. (1973). Linear Statistical Inference and its Applications. New York, NY, USA: John Wiley & Sons.
Stewart, G. W. (1993). On the early history of the singular values decomposition. SIAM Review 35, 551–566.
Street, J. O., R. J. Carroll, and D. Ruppert (1988). A note on computing robust regression estimates via iteratively reweighted least squares. The American Statistician 42 (2), 152–154.
Wasserman, L. (2006). All of Nonparametric Statistics. New York: Springer.
Wood, S. N. (2006). Generalized Additive models: An Introduction with R. Boca Raton, FL, USA: Chapman & Hall/CRC.
SÈRIE DE DOCUMENTS DE TREBALL DE LA XREAP
XREAP2012-11
Boj, E. (CREB), Delicado, P., Fortiana, J., Esteve, A., Caballé, A.
“Local Distance-Based Generalized Linear Models using the dbstats package for R”
(May 2012)