Electronic Journal of Statistics

Vol. 17 (2023) 3226–3286


ISSN: 1935-7524
https://doi.org/10.1214/23-EJS2180

Regression diagnostics meets forecast evaluation: conditional calibration, reliability diagrams, and coefficient of determination
Tilmann Gneiting1,2 and Johannes Resin3,2,1
1 Computational Statistics, Heidelberg Institute for Theoretical Studies
e-mail: [email protected]
2 Institute for Stochastics, Karlsruhe Institute of Technology (KIT)
3 Alfred Weber Institute for Economics, Heidelberg University
e-mail: [email protected]

Abstract: A common principle in model diagnostics and forecast evaluation is that fitted or predicted distributions ought to be reliable, ideally in
the sense of auto-calibration, where the outcome is a random draw from the
posited distribution. For binary responses, auto-calibration is the universal
concept of reliability. For real-valued outcomes, a general theory of cali-
bration has been elusive, despite a recent surge of interest in distributional
regression and machine learning. We develop a framework rooted in prob-
ability theory, which gives rise to hierarchies of calibration, and applies to
both predictive distributions and stand-alone point forecasts. In a nutshell,
a prediction is conditionally T-calibrated if it can be taken at face value in
terms of an identifiable functional T. We introduce population versions of
T-reliability diagrams and revisit a score decomposition into measures of
miscalibration, discrimination, and uncertainty. In empirical settings, sta-
ble and efficient estimators of T-reliability diagrams and score components
arise via nonparametric isotonic regression and the pool-adjacent-violators
algorithm. For in-sample model diagnostics, we propose a universal coefficient of determination that nests and reinterprets the classical R2 in least squares regression and its natural analog R1 in quantile regression, yet applies to T-regression in general.

MSC2020 subject classifications: Primary 62G99, 62J20.


Keywords and phrases: Calibration test, canonical loss, consistent scor-
ing function, model diagnostics, nonparametric isotonic regression, pre-
quential principle, score decomposition, skill score.

Received October 2022.

1. Introduction

Predictive distributions ought to be calibrated or reliable (Dawid, 1984; Gneiting and Katzfuss, 2014). More generally, statistical models ought to provide

arXiv: 2108.03210

plausible probabilistic explanations of observations, be it in-sample or out-of-sample, ideally in the sense of auto-calibration, meaning that the outcomes are
indistinguishable from random draws from the posited distributions. For binary
outcomes, auto-calibration is the universal standard of reliability. In the general
case of linearly ordered, real-valued outcomes, weaker, typically unconditional
facets of calibration have been studied, with probabilistic calibration, which cor-
responds to the uniformity of the probability integral transform (PIT; Dawid,
1984; Diebold, Gunther and Tay, 1998), being the most popular notion. Re-
cently, conditional notions have been proposed (Mason et al., 2007; Bentzien
and Friederichs, 2014; Strähl and Ziegel, 2017), and there has been a surge
of attention to calibration in the machine learning community, where the full
conditional distribution of a response, given a feature vector, is of increasing
interest, as exemplified by the work of Guo et al. (2017), Kuleshov, Fenner
and Ermon (2018), Song et al. (2019), Gupta, Podkopaev and Ramdas (2020),
Zhao, Ma and Ermon (2020), Sahoo et al. (2021) and Roelofs et al. (2022).
While in the literature on forecast evaluation predictive performance is judged
out-of-sample, calibration is relevant in regression diagnostics as well, where
in-sample goodness-of-fit is assessed via test statistics or criteria of R2 -type. In
many ways, the complementary perspectives of model diagnostics and forecast
evaluation are two sides of the same coin.
In this paper, we strive to develop a theory of calibration for real-valued out-
comes that complements the aforementioned strands of literature. Starting from
measure theoretic and probabilistic foundations, we develop practical tools for
visualizing, diagnosing and testing calibration, for both in-sample and out-of-
sample settings, and applying both to full distributions and functionals thereof.
Section 2 develops an overarching, rigorous theoretical framework in a general
population setting, where we establish hierarchical relations between notions
of unconditional and conditional calibration, with Figure 1 summarizing key
results. We reduce a posited distribution to a typically single-valued statisti-
cal functional, T, and define conditional calibration in terms of said functional.
While in general auto-calibration fails to imply calibration in terms of a func-
tional, we prove this implication for functionals defined via an identification
function, such as event probabilities, means, quantiles, and generalized quan-
tiles. We plot recalibrated values of the functional against posited values to
obtain T-reliability diagrams and revisit extant score decompositions to define
nonnegative measures of miscalibration (MCB), discrimination (DSC) and un-
certainty (UNC), for which the mean score satisfies S̄ = MCB − DSC + UNC.
These considerations continue to apply when T-regression is studied as an end
in itself, such as in mean (least squares) and quantile regression. In this set-
ting, Theorem 2.26 establishes a general link between unconditional calibration
and canonical score optimization, which nests classical results in least squares
regression and the partitioning inequalities of quantile regression (Koenker and
Bassett, 1978, Theorem 3.4).
In Section 3, we turn to empirical settings and statistical inference. We
adopt and generalize the approach of Dimitriadis, Gneiting and Jordan (2021)
that uses isotonic regression and the pool-adjacent-violators (PAV) algorithm

[Fig 1 diagram: panel (a) Under Assumption 2.15, panel (b) Under Assumption 2.6 — implication hierarchies among AC, STC, CC, QC, TC, PC, and MC; see the caption below.]

Fig 1. Preview of key findings in Section 2.3: Hierarchies of calibration (a) for continuous,
strictly increasing cumulative distribution functions (CDFs) with common support and (b)
under minimal conditions, with auto-calibration (AC) being the strongest notion. Conditional
exceedance probability calibration (CC) is a conditional version of probabilistic calibration
(PC), whereas threshold calibration (TC) is a conditional version of marginal calibration
(MC). Quantile calibration (QC) differs from CC and TC in subtle ways. Strong threshold
calibration (STC) is a stronger notion of threshold calibration introduced by Sahoo et al.
(2021) for continuous CDFs. Hook arrows show conjectured implications.

(Ayer et al., 1955) to obtain consistent, optimally binned, reproducible, and PAV
based (CORP) estimates of T-reliability diagrams and score components, along
with uncertainty quantification via resampling. As opposed to extant estimators,
the CORP approach yields non-decreasing reliability diagrams and guarantees
the nonnegativity of the estimated MCB and DSC components. The regularizing
constraint of isotonicity avoids artifacts and overfitting. For in-sample model di-
agnostics, we introduce a generalized coefficient of determination R∗ that links to
skill scores, and nests both the classical variance explained or R2 in least squares
regression (Kvålseth, 1985), and its natural analogue R1 in quantile regression
(Koenker and Machado, 1999). Subject to modest conditions R∗ ∈ [0, 1], with
values of 0 and 1 indicating uninformative and immaculate fits, respectively.
In forecast evaluation, reliability diagrams and score components serve to
diagnose and quantify performance on test samples. The most prominent case
arises when T is the mean functional and performance is assessed by the mean
squared error (MSE). As a preview of the diagnostic tools developed in this
paper, we assess point forecasts by Tredennick et al. (2021) of (log-transformed)
butterfly population size from a ridge regression and a null model. The CORP
mean reliability diagrams and MSE decompositions in Figure 2 show that, while
both models are reliable, ridge regression enjoys considerably higher discrimi-
nation ability.
The paper closes in Section 4, where we discuss our findings and provide a
roadmap for follow-up research. While Dimitriadis, Gneiting and Jordan (2021)
introduced the CORP approach in the nested case of probability forecasts for
binary outcomes, the setting of real-valued outcomes treated in this paper is
far more complex as it necessitates the consideration of statistical functionals in
general. Throughout, we link the traditional case of regression diagnostics and
(stand-alone) point forecast evaluation, where functionals such as conditional
means, moments, quantiles, or expectiles are modeled and predicted, to model
diagnostics and forecast evaluation in the fully distributional setting (Gneiting
and Katzfuss, 2014; Hothorn, Kneib and Bühlmann, 2014). Appendices A–C
include material of more specialized or predominantly technical character.

Fig 2. CORP mean reliability diagrams for point forecasts of (log-transformed) butterfly pop-
ulation size from the null model (left) and ridge regression (right) of Tredennick et al. (2021),
along with 90% consistency bands and miscalibration (MCB), discrimination (DSC) and un-
certainty (UNC) components of the mean squared error (MSE).

2. Notions of calibration, reliability diagrams, and score decompositions

Generally, we use the symbol L to denote a generic conditional or unconditional law or distribution, and we identify distributions with their cumulative distri-
bution functions (CDFs). We write N (m, c2 ) to denote a normal distribution
with mean m and variance c2 , and we let ϕ and Φ denote the density and the
CDF, respectively, of a standard normal variable.

2.1. Prediction spaces and prequential principle

We consider the joint law of a posited distribution and the respective outcome
in the technical setting of Gneiting and Ranjan (2013). Specifically, let (Ω, A, Q)
be a prediction space, i.e., a probability space where the elementary elements
ω ∈ Ω correspond to realizations of the random triple

(F, Y, U ),

where Y is the real-valued outcome, F is a posited distribution for Y in the form of a CDF, and U is uniformly distributed on the unit interval.1 Statements in-
volving conditional or unconditional distributions, expectations, or probabilities,
generally refer to the probability measure Q, which specifies the joint distribu-
tion of the forecast F and the outcome Y . The uniform random variable U
1 Following the extant literature, we view both the predictive distributions, which might

depend on stochastic parameters, and the outcomes as random elements (Murphy and Winkler,
1987; Gneiting, Balabdaoui and Raftery, 2007). In parts of our paper, it suffices to consider a
point prediction space, where the elements of the probability space correspond to realizations
of the random tuple (X, Y ) where X is a point forecast (Ehm et al., 2016, eq. (20)).

allows for randomization. Throughout, we assume that U is independent of the σ-algebra generated by the random variable Y and the random function F in
the technical sense detailed prior to Definition 2.6 in Strähl and Ziegel (2017).
Let A0 ⊆ A denote the forecaster’s information basis, i.e., a sub-σ-algebra
such that F is measurable with respect to A0 . Then F is ideal relative to A0
(Gneiting and Ranjan, 2013) if
F (y) = Q (Y ≤ y | A0 ) almost surely, for all y ∈ R.
If F is ideal relative to some sub σ-algebra A0 , then it is auto-calibrated (Tsy-
plakov, 2013) in the sense that
F (y) = Q (Y ≤ y | F ) almost surely, for all y ∈ R,
which is equivalent to being ideal relative to the information basis σ(F ) ⊆
A0 . Extensions to prediction spaces with tuples (Y, F1 , . . . , Fk , U ) that allow
for multiple CDF-valued forecasts F1 , . . . , Fk with associated information bases
A1 , . . . , Ak ⊂ A are straightforward.
Example 2.1 (Gneiting and Ranjan, 2013; Pohle, 2020). Conditionally on a
standard normal variate μ, let the outcome Y be normal with mean μ and vari-
ance 1. Then the perfect forecast F1 = N (μ, 1) is ideal relative to the informa-
tion basis A1 = σ(μ) generated by μ. The unconditional forecast F2 = N (0, 2)
agrees with the marginal distribution of the outcome Y and is ideal relative to
the trivial σ-algebra A2 = {∅, Ω}.
More elaborate notions of prediction spaces are feasible. In particular, one
might include a covariate or feature vector Z and consider random tuples of
the form (Z, F, Y, U ). Indeed, the transdisciplinary scientific literature has con-
sidered reliability relative to covariate information, under labels such as strong
(Van Calster et al., 2016) or individual (Chung et al., 2021; Zhao, Ma and Er-
mon, 2020) calibration. We refrain from doing so as our simple setting adheres to
the prequential principle posited by Dawid (1984), according to which predictive
performance needs to be evaluated on the basis of the tuple (F, Y ) only, without
consideration of the forecast-generating mechanism. The aforementioned exten-
sions become critical in studies of cross-calibration (Strähl and Ziegel, 2017),
stratification (Ehm and Ovcharov, 2017; Allen, 2021; Bashaykh, 2022), sensitiv-
ity (Fissler and Pesenti, 2023), and fairness (Pleiss et al., 2017; Mitchell et al.,
2021).

2.2. Traditional notions of unconditional calibration

Let us recall the classical notions of calibration of predictive distributions for real-valued outcomes. In order to do so, we define the probability integral trans-
form (PIT)
ZF = F (Y −) + U (F (Y ) − F (Y −)) (1)
of the CDF-valued random quantity F , where F (y−) = limx↑y F (x) denotes
the left-hand limit of F at y ∈ R, and the random variable U is standard

uniform and independent of F and Y. The PIT of a continuous CDF F is simply ZF = F(Y). The predictive distribution F is probabilistically calibrated
or PIT calibrated if ZF is uniformly distributed on the unit interval. The use
of the probabilistic calibration criterion was suggested by Dawid (1984) and
popularized by Diebold, Gunther and Tay (1998), who proposed the use of PIT
histograms as a diagnostic tool. Importantly, in continuous settings probabilistic
calibration implies that prediction intervals bracketed by quantiles capture the
outcomes at the respective nominal level.
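For readers who wish to experiment, the following minimal Python sketch (our own illustration; the variable names and the use of numpy/scipy are assumptions on our part, not code from the paper) simulates the setting of Example 2.1 and checks probabilistic calibration by comparing the empirical distribution of the PIT ZF = F(Y) with the uniform distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 100_000
mu = rng.standard_normal(n)        # latent mean, mu ~ N(0, 1)
y = mu + rng.standard_normal(n)    # outcome, Y | mu ~ N(mu, 1)

# PIT values Z_F = F(Y) for the two continuous predictive CDFs of Example 2.1
pit_perfect = stats.norm.cdf(y, loc=mu, scale=1.0)                  # F1 = N(mu, 1)
pit_unconditional = stats.norm.cdf(y, loc=0.0, scale=np.sqrt(2.0))  # F2 = N(0, 2)

# both forecasts are probabilistically calibrated, so the PIT is (close to) uniform
for name, z in [("perfect", pit_perfect), ("unconditional", pit_unconditional)]:
    print(name, "KS distance from uniform:", round(stats.kstest(z, "uniform").statistic, 4))
```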
Furthermore, the predictive distribution F is marginally calibrated (Gneiting,
Balabdaoui and Raftery, 2007) if

EQ [F (y)] = Q (Y ≤ y) for all y ∈ R.

Thus, for a marginally calibrated predictive distribution, the frequency of (not) exceeding a threshold value matches the posited unconditional probability.
Example 2.2. In the setting of Example 2.1, let η attain the values ±η0 with
equal probability, independently of μ and Y , where η0 > 0. Then the unfocused
forecast with CDF
F(y) = (1/2) (Φ(y − μ) + Φ(y − μ − η))     (2)
is probabilistically calibrated but fails to be marginally calibrated (Gneiting,
Balabdaoui and Raftery, 2007). Similarly, let δ take the values ±δ0 with equal
probability, independently of μ and Y , where δ0 ∈ (0, 1). Then the lopsided
forecast F with density

f (y) = (1 − δ)ϕ(y − μ)1(y < μ) + (1 + δ)ϕ(y − μ)1(y > μ) (3)

is marginally calibrated but fails to be probabilistically calibrated. For details see Appendix A.2.
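A quick Monte Carlo check of these claims can be instructive. The sketch below (our own illustration under assumed simulation choices, not code from the paper) simulates the unfocused forecast in (2) with η0 = 1.5: the PIT is close to uniform, while EQ[F(t)] and Q(Y ≤ t) visibly disagree.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, eta0 = 200_000, 1.5
mu = rng.standard_normal(n)
y = mu + rng.standard_normal(n)
eta = rng.choice([-eta0, eta0], size=n)   # independent of mu and Y

def unfocused_cdf(t, mu, eta):
    """Unfocused predictive CDF from (2), evaluated at the point t."""
    return 0.5 * (stats.norm.cdf(t - mu) + stats.norm.cdf(t - mu - eta))

# probabilistic calibration: the PIT Z_F = F(Y) is (close to) uniform
pit = unfocused_cdf(y, mu, eta)
print("KS distance of PIT from uniform:", round(stats.kstest(pit, "uniform").statistic, 4))

# marginal calibration fails: E_Q[F(t)] differs from Q(Y <= t), e.g. at t = 1
t = 1.0
print("E_Q[F(t)] :", round(unfocused_cdf(t, mu, eta).mean(), 3))
print("Q(Y <= t) :", round((y <= t).mean(), 3))
```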
It is well known that an ideal forecast is both probabilistically calibrated and
marginally calibrated (Gneiting and Ranjan, 2013, Theorem 2.8; Song et al.,
2019, Theorem 1). Reformulated in terms of auto-calibration the following holds.
Theorem 2.3. Auto-calibration implies marginal and probabilistic calibration.
Auto-calibration thus is a stronger requirement than either marginal or prob-
abilistic calibration, and the latter are logically independent. However, in the
special case of a binary outcome, probabilistic calibration and auto-calibration
are equivalent (Gneiting and Ranjan, 2013, Theorem 2.11), and auto-calibration
serves as a universal notion of calibration. In the case of three or more dis-
tinct outcomes, Gneiting and Ranjan (2013) conjectured that auto-calibration
is stronger than simultaneous marginal and probabilistic calibration. Example 10
of Tsyplakov (2014) proves the conjecture by constructing continuous forecast
distributions that are both marginally and probabilistically calibrated but fail
to be auto-calibrated. The following example resolves the questions raised by
Gneiting and Ranjan (2013) in full detail.

Example 2.4. We begin by considering continuous CDFs and then discuss a discrete example with three distinct outcomes only.
(a) Suppose that μ is normal with mean 0 and variance c2 . Conditionally
on μ, let the piecewise uniform predictive distribution F be a mixture
of uniform measures on [μ, μ + 1], [μ + 1, μ + 2], and [μ + 2, μ + 3] with
weights p1 , p2 and p3 , respectively, and let the outcome Y be drawn from
a mixture with weights q1 , q2 and q3 on these intervals. Finally, let the
tuple (p1 , p2 , p3 ; q1 , q2 , q3 ) attain each of the values
(1/2, 1/4, 1/4; 5/10, 1/10, 4/10),  (1/4, 1/2, 1/4; 1/10, 8/10, 1/10),  and  (1/4, 1/4, 1/2; 4/10, 1/10, 5/10),

with Q-probability 1/3. Evidently, F fails to be auto-calibrated. However, F is marginally calibrated as, conditionally on μ, it assigns the same mass 1/3 to each of the intervals, in agreement with the conditional distribution of Y. As for the PIT ZF, conditionally on μ its CDF is piecewise linear on the partition induced by 0, 1/4, 1/2, 3/4, and 1. Thus, in order to establish probabilistic calibration it suffices to verify that Q(ZF ≤ x) = x for x ∈ {1/4, 1/2, 3/4}, as confirmed by elementary calculations. Integration over μ
completes the argument.
(b) For a full resolution of the aforementioned conjecture by Gneiting and
Ranjan (2013), we fix μ = 0 and replace the intervals by fixed numbers
y1 < y2 < y3 . Thus, F assigns mass pj to yj , whereas the event Y = yj
realizes with probability qj for j = 1, 2, and 3. The forecast remains
probabilistically and marginally calibrated, and fails to be auto-calibrated.

2.3. Conditional calibration

While checks for probabilistic calibration have become a cornerstone of predictive distribution evaluation (Dawid, 1984; Diebold, Gunther and Tay, 1998;
Gneiting, Balabdaoui and Raftery, 2007), both marginal and probabilistic cali-
bration concern unconditional facets of predictive performance, which is increas-
ingly being considered insufficient (e.g., Levi et al., 2022). Stronger conditional
notions of calibration, which condition on facets of the predictive distribution,
have emerged in various strands of the scientific literature. For example, Mason
et al. (2007) used conditional (non) exceedance probabilities (CEP) to assess the
calibration of ensemble weather forecasts. These were used by Held, Rufibach
and Balabdaoui (2010) and Strähl and Ziegel (2017) to derive calibration tests,
which operate under the hypothesis that the forecast F is CEP calibrated in the
sense that
 
Q(ZF ≤ α | qα−(F)) = α almost surely, for all α ∈ (0, 1),     (4)
where qα− (F ) = inf{x ∈ R : F (x) ≥ α} denotes the (lower) α-quantile of F .
Similarly, Henzi, Ziegel and Gneiting (2021) introduced the notion of a threshold
calibrated forecast F , which stipulates that
Q (Y ≤ t | F (t)) = F (t) almost surely, for all t ∈ R. (5)

Essentially, CEP calibration is a conditional version of probabilistic calibration, and threshold calibration is conditional marginal calibration.
Theorem 2.5. CEP calibration implies probabilistic calibration, and threshold
calibration implies marginal calibration.
Proof. Immediate by taking unconditional expectations, as noted by Henzi,
Ziegel and Gneiting (2021).
Variants of these concepts can be found scattered in the literature. Notably,
Sahoo et al. (2021) introduce a notion of calibration for continuous predictive
distributions, which requires that

Q(ZF ≤ α | F(t)) = α almost surely, for all α ∈ (0, 1), t ∈ R.     (6)

As in Figure 1, we refer to this property as strong threshold calibration. The notion is weaker than auto-calibration, but it implies both CEP calibration and
threshold calibration, subject to conditions that we discuss below.
We proceed to the general notion of conditional T-calibration in terms of a
statistical functional T as introduced by Arnold (2020) and Bashaykh (2022).
Other authors (Pohle, 2020; Krüger and Ziegel, 2021) refer to this notion or spe-
cial cases thereof as auto-calibration with respect to T. A statistical functional
on some class F of probability measures is a measurable function T : F → T
into a (typically, finite-dimensional) space T with Borel-σ-algebra B(T ). Tech-
nically, we work in the prediction space setting under a natural measurability
condition that is not restrictive (Fissler and Holzmann, 2022).
Assumption 2.6. The class F and the functional T are such that F ∈ F,
the mapping T(F ) : (Ω, A) → (T , B(T )) is measurable, and L (Y | T(F )) ∈ F
almost surely.
Definition 2.7. Under Assumption 2.6, the predictive distribution F is condi-
tionally T-calibrated, or simply T-calibrated, if

T (L(Y | T(F ))) = T(F ) almost surely.

Essentially, under a T-calibrated predictive distribution F, we can take T(F) at face value. Perhaps surprisingly, an auto-calibrated forecast is not necessar-
ily T-calibrated, as noted by Arnold (2020, Section 3.2). For a simple coun-
terexample, consider the perfect forecast from Example 2.1, which fails to be T-calibrated when T is the variance, the standard deviation, the interquartile range, or a related measure of dispersion: the posited variance T(F) = 1 is constant, so L(Y | T(F)) is the marginal distribution N(0, 2), whose variance is 2 ≠ 1.
We proceed to show that this startling issue does not occur with identifi-
able functionals, i.e., functionals induced by an identification function (Theo-
rem 2.11). Similar to the classical procedure in M -estimation (Huber, 1964),
an identification function weighs negative values in the case of underpredic-
tion against positive values in the case of overprediction, and the corresponding
functional maps to the possibly set-valued argument at which an associated
expectation changes sign. Following Jordan, Mühlemann and Ziegel (2022), a

Table 1
Key examples of identifiable functionals with associated parameters, identification function, and generic type. For a similar listing see Table 1 in Jordan, Mühlemann and Ziegel (2022).

Functional                    Parameters             Identification function                  Type
Threshold (non) exceedance    t ∈ R                  V(x, y) = x − 1{y ≤ t}                   singleton
Mean                                                 V(x, y) = x − y                          singleton
Median                                               V(x, y) = 1{y < x} − 1/2                 interval
Moment of order n (mn)        n = 1, 2, . . .        V(x, y) = x − y^n                        singleton
α-Expectile (eα)              α ∈ (0, 1)             V(x, y) = |1{y < x} − α| (x − y)         singleton
α-Quantile (qα)               α ∈ (0, 1)             V(x, y) = 1{y < x} − α                   interval
Huber                         α ∈ (0, 1), a, b > 0   V(x, y) = |1{y < x} − α| κa,b(x − y)     interval

measurable function V : R × R → R is an identification function if V(·, y) is increasing and left-continuous for all y ∈ R. We operate under Assumption 2.6
with the implicit understanding that V (x, ·) is quasi-integrable with respect to
all F ∈ F for all x ∈ R. Then, for any probability measure F in the class F, the
functional T(F ) induced by V is defined as

T(F ) = [T− (F ), T+ (F )] ⊆ [−∞, +∞] = R̄,

where the lower and upper bounds are given by the random variables
  

T−(F) = sup { x : ∫ V(x, y) dF(y) < 0 }     (7)

and

T+(F) = inf { x : ∫ V(x, y) dF(y) > 0 }.     (8)

An identifiable functional T is of singleton type if T(F) is a singleton for every F ∈ F. Otherwise, T is of interval type. Table 1 lists key examples,
such as threshold-defined event probabilities, quantiles, expectiles, and mo-
ments. The definition of the Huber functional involves the clipping function
κa,b (t) = max(min(t, b), −a) with parameters a, b > 0 (Taggart, 2022). In the
limiting cases as a = b → 0 and a = b → ∞, the Huber functional recovers the
α-quantile (qα ) and the α-expectile (eα ), respectively.
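As a concrete illustration of (7) and (8), the following Python sketch (our own; the helper names and the crude grid search are illustrative assumptions, not the paper's implementation) evaluates the lower and upper bounds of the α-quantile functional, with identification function V(x, y) = 1{y < x} − α, for the empirical measure of a small sample.

```python
import numpy as np

def expectation_V(x, ys, alpha):
    """Integral of the quantile identification function V(x, y) = 1{y < x} - alpha
    with respect to the empirical measure of the sample ys."""
    return np.mean(ys < x) - alpha

def interval_functional(ys, alpha, grid):
    """Lower and upper bounds T^-(F) and T^+(F) from (7) and (8), evaluated on a
    grid of candidate points (a crude numerical stand-in for exact order statistics)."""
    vals = np.array([expectation_V(x, ys, alpha) for x in grid])
    t_minus = grid[vals < 0].max()   # sup{x : integral < 0}
    t_plus = grid[vals > 0].min()    # inf{x : integral > 0}
    return t_minus, t_plus

rng = np.random.default_rng(3)
ys = rng.standard_normal(25)
grid = np.linspace(-4, 4, 2001)
print(interval_functional(ys, alpha=0.5, grid=grid))   # bracket of the median
print(np.quantile(ys, 0.5))                            # numpy's sample median, for comparison
```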
For identifiable functionals we can define an unconditional notion of T-cali-
bration as well. Note that in contrast to traditional settings, where F is fixed, we
work in the prediction space setting, where F is a random CDF. In principle, the
subsequent Definitions 2.9 and 2.24 depend on the choice of the identification
function. However, as we demonstrate in Appendix A.4, the following condition
ensures that the identification function is unique, up to a positive constant, so
that ambiguities are avoided.

Assumption 2.8. The identification function V induces the functional T on a convex class F0 ⊇ F of probability measures, which contains the Dirac measures
δy for all y ∈ R. The identification function V is
(i) of prediction error form, i.e., there exists an increasing, left-continuous
function v : R → R such that V (x, y) = v(x − y) with v(−r) < 0 and
v(r) > 0 for some r > 0, or
(ii) of the form V (x, y) = x − T(δy ) for a functional T of singleton type.
The examples in Table 1 all satisfy Assumption 2.8.
Definition 2.9. Suppose that the functional T is generated by an identifica-
tion function V , and let Assumptions 2.6 and 2.8 hold. Then the predictive
distribution F is unconditionally T-calibrated if
EQ [V (T+ (F )−ε, Y )] ≤ 0 and EQ [V (T− (F )+ε, Y )] ≥ 0 for all ε > 0. (9)
For the interval-valued α-quantile functional, qα (F ) = [qα− (F ), qα+ (F )], con-
dition (9) reduces to the traditional unconditional coverage condition
   
Q(Y ≤ qα−(F)) ≥ α  and  Q(Y ≥ qα+(F)) ≥ 1 − α,     (10)
with the latter being equivalent to Q(Y < qα+ (F )) ≤ α. Probabilistic calibra-
tion implies unconditional α-quantile calibration at every level α ∈ (0, 1), as
hinted at by Kuleshov, Fenner and Ermon (2018, Section 3.1), and proved in
our Appendix A.5. Under technical assumptions, condition (9) simplifies to
EQ [V (T(F ), Y )] = 0, (11)
with the classical unbiasedness condition EQ [m1 (F )] = EQ [Y ] arising in the case
of the mean or expectation functional.
Example 2.10. Let T be the mean functional or a quantile. Then the unfo-
cused forecast from Example 2.2 is unconditionally T-calibrated but fails to be
conditionally T-calibrated. For details see Figure 4 and Appendix A.1.
Importantly, for any identifiable functional auto-calibration implies both un-
conditional and conditional T-calibration, as we demonstrate now. Note that
Assumption 2.6 is a minimal condition, which is required to define conditional
T-calibration in the first place, whereas Assumption 2.8 ensures that uncondi-
tional T-calibration is a well-defined concept.
Theorem 2.11. Suppose that the functional T is generated by an identification
function and Assumption 2.6 holds. Then auto-calibration implies conditional
T-calibration, and, subject to Assumption 2.8, conditional T-calibration implies
unconditional T-calibration.
Proof. The statements in this proof are understood to hold almost surely. By
Theorem 4.34 and Proposition 4.35 of Breiman (1992) in concert with auto-
calibration, F is a regular conditional distribution of Y given F , and we conclude
that
E[V(x, Y) | F] = ∫ V(x, y) dF(y).

Furthermore, a regular conditional distribution FT = L(Y | T(F)) of Y given T(F) exists, and the tower property of conditional expectation implies that

∫ V(x, y) dFT(y) = E[V(x, Y) | T(F)] = E[E[V(x, Y) | F] | T(F)] = E[ ∫ V(x, y) dF(y) | T(F) ].

Let T(F ) = [T− (F ), T+ (F )] and T(FT ) = [T− (FT ), T+ (FT )], where the bound-
aries are random variables. The proof of the first part is complete if we can show
that T− (FT ) = T− (F ) and T+ (FT ) = T+ (F ).
Let ε > 0. By the definition of T+(F), we know that ∫ V(T+(F), y) dF(y) ≤ 0 and ∫ V(T+(F) + ε, y) dF(y) > 0. Using nested conditional expectations as
above, the same inequalities hold almost surely when integrating with respect to
FT . Hence, by the definition of T+ (FT ), we obtain T+ (F ) ≤ T+ (FT ) < T+ (F )+
ε. An analogous argument shows that T− (F ) − ε ≤ T− (FT ) ≤ T− (F ) + ε,
which completes the proof of the first part and shows that F is conditionally
T-calibrated.
Finally, if F is conditionally T-calibrated, unconditional T-calibration follows
by taking nested expectations in the terms in the defining inequalities.
An analogous result is easily derived for CEP calibration.
Theorem 2.12. Under Assumption 2.6 for quantiles, auto-calibration implies
CEP calibration.
Proof. It holds that

Q(ZF ≤ α | qα−(F)) = E[1{ZF ≤ α} | qα−(F)] = E[E[1{ZF ≤ α} | F] | qα−(F)]

almost surely for α ∈ (0, 1). As F is a version of L(Y | F ), the nested expectation
equals α almost surely by Proposition 2.1 of Rüschendorf (2009), which implies
CEP calibration.
When evaluating full predictive distributions, it is natural to consider families
of functionals as in the subsequent definition, where part (a) is compatible with
the extant notion in (5).
Definition 2.13. A predictive distribution F is
(a) threshold calibrated if it is conditionally F (t)-calibrated for all t ∈ R;
(b) quantile calibrated if it is conditionally qα -calibrated for all α ∈ (0, 1);
(c) expectile calibrated if it is conditionally eα -calibrated for all α ∈ (0, 1);
(d) moment calibrated if it is conditionally n-th moment calibrated for all
integers n = 1, 2, . . .
While CEP, quantile, and threshold calibration are closely related notions,
they generally are not equivalent. For illustration, we consider predictive CDFs
in the spirit of Example 2.4.

Example 2.14.
(a) Let μ ∼ N (0, c2 ). Conditionally on μ, let F be a mixture of uniform
distributions on the intervals [μ, μ + 1], [μ + 1, μ + 2], [μ + 2, μ + 3], and
[μ + 3, μ + 4] with weights p1 , p2 , p3 , and p4 , respectively, and let Y be
from a mixture with weights q1 , q2 , q3 , and q4 . Furthermore, let the tuple
(p1 , p2 , p3 , p4 ; q1 , q2 , q3 , q4 ) attain each of the values
(1/2, 0, 1/2, 0; 1/4, 0, 3/4, 0),  (1/2, 0, 0, 1/2; 1/4, 0, 0, 3/4),
(0, 1/2, 1/2, 0; 0, 1/4, 3/4, 0),  (0, 1/2, 0, 1/2; 0, 3/4, 0, 1/4)
with equal probability. Then the continuous forecast F is threshold cali-
brated and CEP calibrated but fails to be quantile calibrated.
(b) Let the tuple (p1 , p2 , p3 ; q1 , q2 , q3 ) attain each of the values
(1/2, 1/4, 1/4; 5/10, 4/10, 1/10),  (1/4, 1/2, 1/4; 1/10, 5/10, 4/10),  and  (1/4, 1/4, 1/2; 4/10, 1/10, 5/10)

with equal probability. Let F assign mass pj to numbers yj for j = 1, 2, 3, where y1 < y2 < y3, and let the event Y = yj realize with probabil-
ity qj . The resulting discrete forecast is quantile and threshold calibrated.
However, it fails to be CEP calibrated or even PIT calibrated.
Under the following conditions CEP, quantile, and threshold calibration co-
incide.
Assumption 2.15. In addition to Assumption 2.6 for quantiles and threshold
(non) exceedances at all levels and thresholds, respectively, let the following
hold.
(i) The CDFs in the class F are continuous and strictly increasing on a com-
mon support interval.
(ii) There exists a countable set G ⊆ F such that Q(F ∈ G) = 1.
Theorem 2.16.
(a) Under Assumption 2.15(i) CEP and quantile calibration are equivalent and
imply probabilistic calibration.
(b) Under Assumptions 2.15(i)–(ii) CEP, quantile and threshold calibration
are equivalent and imply both probabilistic and marginal calibration.
Proof. By Assumption 2.15(i) the CDFs F ∈ F are invertible on the common
support with the quantile function α → qα− (F ) as inverse. Hence, for every
α ∈ (0, 1) the functional qα is of singleton-type and qα (F ) = {qα− (F )}. In this
light, the almost-sure identity
Q(ZF ≤ α | qα− (F )) = Q(Y ≤ qα− (F ) | qα (F ))
implies part (a). To prove part (b), let G be as in Assumption 2.15(ii) and
assume without loss of generality that Q(F = G) > 0 for all G ∈ G. If α ∈ (0, 1)
and t ∈ R are such that Q(F (t) = α) > 0, then
Q(Y ≤ t | F (t) = α) = Q(Y ≤ qα− (F ) | qα− (F ) = t),

Fig 3. The equiprobable predictive distribution F picks the piecewise linear, partially (namely,
for y ≤ 2) identical CDFs F1 and F2 with equal probability. It is jointly CEP, quantile, and
threshold calibrated but fails to be auto-calibrated. For a very similar construction see Example
10 of Tsyplakov (2014).

where Assumption 2.15(ii) ensures that the events conditioned on have posi-
tive probability. Hence, quantile and threshold calibration are equivalent. The
remaining implications are immediate from Theorem 2.5.
We conjecture that the statement of part (b) holds under Assumption 2.15(i)
alone but are unaware of a measure theoretic argument that serves to generalize
the discrete reasoning in our proof. As indicated in panel (b) of Figure 1, we
also conjecture that CEP or quantile calibration imply threshold calibration in
general, though we have not been able to prove this implication, nor can we
show that CEP or quantile calibration imply marginal calibration in general.
Strong threshold calibration as defined in (6) implies both CEP and threshold
calibration under Assumption 2.15, by arguments similar to those in the above
proof. The following result thus demonstrates that the hierarchies in panel (a)
and, with the aforementioned exceptions, in panel (b) of Figure 1 are complete,
with the caveat that hierarchies may collapse if the class F is sufficiently small,
as exemplified by Theorem 2.11 of Gneiting and Ranjan (2013).
Proposition 2.17. Even under Assumption 2.15 the following hold:
(a) Strong threshold calibration does not imply auto-calibration.
(b) Joint CEP, quantile, and threshold calibration does not imply strong thresh-
old calibration.
(c) Joint probabilistic and marginal calibration does not imply threshold cali-
bration.
(d) Probabilistic calibration does not imply marginal calibration.
(e) Marginal calibration does not imply probabilistic calibration.

Proof. We establish the claims in a series of (counter) examples, starting with part (b), where we present an example based on two equiprobable, partially
overlapping CDFs in Figure 3. A similar example based on four equiprobable,

Table 2
Properties of the forecasts in our examples. We note whether they are auto-calibrated (AC), CEP calibrated (CC), quantile calibrated (QC), threshold calibrated (TC), probabilistically calibrated (PC), or marginally calibrated (MC), and whether the involved distributions are continuous and strictly increasing on a common support (CSI) as in Assumption 2.15(i). Except for the auto-calibrated cases, the forecasts fail to be moment calibrated.

Source          Forecast Type                  CSI  AC  CC  QC  TC  PC  MC
Example 2.1     Perfect                        ✓    ✓   ✓   ✓   ✓   ✓   ✓
Example 2.1     Unconditional                  ✓    ✓   ✓   ✓   ✓   ✓   ✓
Figure 3        Equiprobable                   ✓        ✓   ✓   ✓   ✓   ✓
Example 2.4     Piecewise uniform as c → 0     ✓                    ✓   ✓
Example 2.2     Unfocused                      ✓                    ✓
Example 2.2     Lopsided                       ✓                        ✓
Example 2.14    Continuous                          ✓       ✓   ✓   ✓
Example 2.14    Discrete                                ✓   ✓       ✓

partially overlapping CDFs in Appendix A.6 yields part (a). As for part (c), we
return to the piecewise uniform forecast in Example 2.4, where for simplicity
we fix μ = 0. This forecast is probabilistically and marginally calibrated, but it
fails to be threshold calibrated because
 
Q(Y ≤ 3/2 | F(3/2) = 5/8) = 5/10 + (1/2) · (1/10) = 11/20 ≠ 5/8.

As for parts (d) and (e), we refer to the unfocused and lopsided forecasts from
Example 2.2 with μ = 0 fixed.
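The arithmetic behind the counterexample in part (c) is easy to reproduce exactly. In the sketch below (our own illustration), the three weight tuples of the piecewise uniform forecast with μ = 0 are processed with exact fractions; only the first case yields F(3/2) = 5/8, and there Q(Y ≤ 3/2) = 11/20 differs from 5/8.

```python
from fractions import Fraction as Fr

# weight tuples (p1, p2, p3; q1, q2, q3) of the piecewise uniform forecast (mu = 0)
cases = [((Fr(1, 2), Fr(1, 4), Fr(1, 4)), (Fr(5, 10), Fr(1, 10), Fr(4, 10))),
         ((Fr(1, 4), Fr(1, 2), Fr(1, 4)), (Fr(1, 10), Fr(8, 10), Fr(1, 10))),
         ((Fr(1, 4), Fr(1, 4), Fr(1, 2)), (Fr(4, 10), Fr(1, 10), Fr(5, 10)))]

for p, q in cases:
    F_t = p[0] + p[1] / 2        # posited probability F(3/2): full first interval, half of second
    Y_le_t = q[0] + q[1] / 2     # actual probability Q(Y <= 3/2) under the matching outcome weights
    print(f"F(3/2) = {F_t},  Q(Y <= 3/2) = {Y_le_t}")
# the event F(3/2) = 5/8 identifies the first case, where Q(Y <= 3/2) = 11/20 != 5/8
```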
Clearly, further hierarchical relations are immediate. For example, given that
probabilistic calibration does not imply marginal calibration, it does not imply
threshold calibration nor auto-calibration. We leave further discussion to future
work but note that moment calibration does not imply probabilistic nor marginal
calibration, as follows easily from classical results on the moment problem (e.g.,
Stoyanov, 2000). For an overview of calibration properties in our examples, see
Table 2.

2.4. Reliability diagrams

As we proceed to define reliability diagrams, it is useful to restrict attention to single-valued functionals. To this end, if an identifiable functional T is of
interval type, we instead consider its single-valued lower or upper bound, T− (F )
or T+ (F ), which we call the lower and upper version of T, or simply the lower
and upper functional, without explicit reference to the original functional T.
The following result demonstrates that T-calibration implies calibration of
the upper and lower functional.
Proposition 2.18. Suppose that the functional T is generated by an identifi-
cation function V , and let Assumption 2.6 hold. Then conditional T-calibration
implies conditional T− - and T+ -calibration, and, subject to Assumption 2.8,
unconditional T-calibration implies unconditional T− - and T+ -calibration.

Fig 4. Threshold (top), quantile (middle), and moment (bottom) reliability diagrams for point
forecasts induced by (left) the unfocused forecast with η0 = 1.5 and (middle) the lopsided
forecast with δ0 = 0.7 from Example 2.2, and (right) the piecewise uniform forecast with
c = 0.5 from Example 2.4. Each display plots recalibrated against original values. Deviations
from the diagonal indicate violations of T-calibration. For details see Appendix A.

Proof. Suppose that T∗ is the lower or upper version of a functional T generated by the identification function V. As σ(T∗(F)) ⊆ σ(T(F)), we find that

E [V (x, Y ) | T∗ (F )] = E [E [V (x, Y ) | T(F )] | T∗ (F )]

is almost surely ≤ 0 if x < T∗(F), and almost surely ≥ 0 if x > T∗(F). Hence, T∗(F) ∈ T(L(Y | T∗(F))). If T∗ is the lower functional, the former inequality
is strict and hence T− (F ) = min T(L(Y | T− (F ))), whereas if T∗ is the upper
functional, the latter is strict and hence T+ (F ) = max T(L(Y | T+ (F ))).
Unconditional T∗ -calibration is an immediate consequence of unconditional
T-calibration.

In this light, we restrict attention to single-valued functionals that are lower or upper versions of identifiable functionals, or identifiable functionals of singleton
type. Any such functional can be associated with a random variable X = T(F ),
and we call any random variable Xrc , for which

Xrc = T (L (Y | X)) (12)

almost surely, a recalibrated version of X. Clearly, we can also define Xrc for a
stand-alone point forecast X, based on conceptualized distributions, by resorting
to the joint distribution of the random tuple (X, Y ), provided the right-hand
side of (12) is well defined and finite almost surely. The point forecast X is
conditionally T-calibrated, or simply T-calibrated, if X = Xrc almost surely.
Subject to Assumption 2.8, X is unconditionally T-calibrated if

E[V (X − ε, Y )] ≤ 0 and E[V (X + ε, Y )] ≥ 0 for all ε > 0. (13)

For recent discussions of the particular cases of the mean or expectation and
quantile functionals see, e.g., Nolde and Ziegel (2017, Sections 2.1–2.2), Patton
(2020, Proposition 2), Krüger and Ziegel (2021, Definition 3.1) and Satopää
(2021, Section 2).
To compare the posited functional X with its recalibrated version Xrc , we
introduce the T-reliability diagram.
Assumption 2.19. The functional T is a lower or upper version of an identifi-
able functional, or an identifiable functional of singleton type. The point forecast
X is a random variable, and the recalibrated forecast Xrc = T(L(Y | X)) is well
defined and finite almost surely.
Definition 2.20. Under Assumption 2.19, the T-reliability diagram is the graph
of a mapping x → T (L(Y | X = x)) on the support of X.
While technically the T-reliability diagram depends on the choice of a regular
conditional distribution for the outcome Y , this issue is not a matter of practi-
cal relevance. Evidently, for a T-calibrated forecast the T-reliability diagram is
concentrated on the diagonal. Conversely, deviations from the diagonal indicate
violations of T-calibration and can be interpreted diagnostically, as illustrated
in Figure 4 for threshold, quantile, and moment calibration. For a similar display
in the specific case of mean calibration see Figure 1 of Pohle (2020).
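In empirical work, the mapping x → T(L(Y | X = x)) has to be estimated; Section 3 develops the CORP approach based on isotonic regression for this purpose. As a crude stand-in, the following Python sketch (our own illustration with made-up simulation parameters) estimates points of a mean reliability diagram by averaging outcomes within bins of the posited values, for a conditionally biased mean forecast in the spirit of the unfocused forecast of Example 2.2.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50_000
mu = rng.standard_normal(n)
y = mu + rng.standard_normal(n)
eta = rng.choice([-1.5, 1.5], size=n)
x = mu + eta / 2                  # posited mean of the unfocused forecast

# crude binned estimate of the mean reliability diagram x -> E[Y | X = x]:
# average the outcomes within quantile bins of the posited values
bins = np.quantile(x, np.linspace(0, 1, 21))
idx = np.clip(np.digitize(x, bins) - 1, 0, 19)
for b in range(0, 20, 4):
    mask = idx == b
    print(f"posited ~{x[mask].mean():6.2f}   recalibrated ~{y[mask].mean():6.2f}")
```

Deviations of the recalibrated from the posited values reflect the lack of conditional mean calibration noted in Example 2.10.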
In the setting of fully specified predictive distributions, the distinction be-
tween unconditional and conditional T-calibration is natural. Perhaps surpris-
ingly, the distinction vanishes in the setting of stand-alone point forecasts if the
associated identification function is of prediction error form and the forecast
and the residual are independent.
Theorem 2.21. Let Assumption 2.19 hold, and suppose that the underlying
identification function V satisfies Assumption 2.8. Suppose furthermore that the
point forecast X and the generalized residual T(δY ) − X are independent. Then
X is conditionally T-calibrated if, and only if, it is unconditionally T-calibrated.

Proof. In the case of an identification function V of type (i) in Assumption 2.8, let b(y) = T(δy) − y for y ∈ R. Then b = b(y) is constant as v(x − y′) = v(x − (y′ − y) − y) implies T(δy′) = T(δy) + y′ − y. Given any constant c ∈ R it
holds that

E[V (X + c, Y ) | X] = E[v(X + c − T(δY ) + b) | X] = E[v(c + b − (T(δY ) − X))].

In view of (12) and (13), conditional and unconditional T-calibration are equiv-
alent. In the case of an identification function of type (ii) in Assumption 2.8,
the above arguments continue to hold if we take v to be the identity map x → x
and b = 0.
For quantiles, expectiles, and Huber functionals, the identification function V
is of prediction error form and the generalized residual reduces to the standard
residual, X −Y . In particular, this observation applies in the case of least squares
regression, where T is the mean functional, and the forecast and the residual
have typically been assumed to be independent in the literature. We discuss the
statistical implications of Theorem 2.21 in Appendix B.

2.5. Score decompositions

We now revisit a score decomposition into measures of miscalibration (MCB), discrimination (DSC), and uncertainty (UNC) based on consistent scoring func-
tions. Specifically, suppose that S is a consistent loss or scoring function for the
functional T on the class F in the sense that

EF [S(t, Y )] ≤ EF [S(x, Y )]

for all F ∈ F, all t ∈ T(F) = [T−(F), T+(F)] and all x ∈ R (Savage, 1971; Gneiting, 2011). If the inequality is strict unless x ∈ T(F), then S is strictly
consistent. Consistent scoring functions serve as all-purpose performance mea-
sures that elicit fair and honest assessments and reward the utilization of broad
information bases (Holzmann and Eulert, 2014). If the functional T is of in-
terval type, a consistent scoring function S is consistent for both T− and T+ ,
but strict consistency is lost when T is replaced by its lower or upper version
and S is strictly consistent for T. For prominent examples of consistent scoring
functions, see Table 3.
A functional is elicitable if it admits a strictly consistent scoring function
(Gneiting, 2011). Under general conditions, elicitability is equivalent to identifi-
ability (Steinwart et al., 2014, Theorem 5). The respective functionals allow for
both principled relative forecast evaluation through the use of consistent scoring
functions, and principled absolute forecast evaluation via T-reliability diagrams
and score decompositions, as discussed in what follows.
Let L(Y ) denote the unconditional distribution of the outcome and suppose
that x0 = T(L(Y )) is well defined. As before, we operate under Assumption 2.19
and work with X = T(F ), its recalibrated version Xrc , and the reference forecast
x0 . Again, the simplified notation accommodates stand-alone point forecasts,

and it suffices to consider the joint distribution of the tuple (X, Y ). Follow-
ing the lead of Dawid (1986) in the case of binary outcomes, and Ehm and
Ovcharov (2017) and Pohle (2020) in the setting of point forecasts for real-
valued outcomes, we consider the expected scores

S̄ = EQ [S(X, Y )], S̄rc = EQ [S(Xrc , Y )], and S̄mg = EQ [S(x0 , Y )] (14)

for the forecast at hand, its recalibrated version, and the marginal reference
forecast x0 , respectively.
Definition 2.22. Let Assumption 2.19 hold, and let x0 = T(L(Y )) and the
expectations S̄, S̄rc , and S̄mg in (14) be well defined and finite. Then we refer to

MCBS = S̄ − S̄rc , DSCS = S̄mg − S̄rc , and UNCS = S̄mg ,

as miscalibration, discrimination, and uncertainty, respectively.


Following Dimitriadis, Gneiting and Jordan (2021), our terminology differs
from terms typically used to describe score decompositions for probabilistic
forecasts under proper scoring rules, where miscalibration and discrimination are
commonly referred to as reliability and resolution, respectively (e.g., Bröcker,
2009; Siegert, 2017). The following result decomposes the expected score S̄ for
the forecast at hand into miscalibration (MCBS ), discrimination (DSCS ), and
uncertainty (UNCS ) components.
Theorem 2.23 (Dawid, 1986; Pohle, 2020). In the setting of Definition 2.22,
suppose that the scoring function S is consistent for the functional T. Then it
holds that
S̄ = MCBS − DSCS + UNCS , (15)
where MCBS ≥ 0 with equality if X is conditionally T-calibrated, and DSCS ≥ 0
with equality if Xrc = x0 almost surely. If S is strictly consistent then MCBS = 0
only if X is conditionally T-calibrated, and DSCS = 0 only if Xrc = x0 almost
surely.
A remaining question is what consistent scoring function S ought to be used in
practice. To address this issue, we resort to mixture or Choquet representations
of consistent loss functions, as introduced by Ehm et al. (2016) for quantiles
and expectiles and developed in full generality by Dawid (2016), Ziegel (2016)
and Jordan, Mühlemann and Ziegel (2022). Specifically, we rely on an obvious
generalization of Proposition 2.6 of Jordan, Mühlemann and Ziegel (2022), as
noted at the start of their Section 2. Let T be identifiable with identification
function V satisfying Assumption 2.8, and let η ∈ R. Then the elementary loss
function Sη , given by

Sη (x, y) = (1{η ≤ x} − 1{η ≤ y}) V (η, y), (16)

is consistent for T. As an immediate consequence, any well-defined function of the form

S(x, y) = ∫R Sη(x, y) dH(η),     (17)

Table 3
Canonical loss functions in the sense of Definition 2.24.

Functional            Parameter          Canonical Loss
Moment of order n     n = 1, 2, . . .    S(x, y) = (x − y^n)²
α-Expectile           α ∈ (0, 1)         S(x, y) = 2 |1{x ≥ y} − α| (x − y)²
α-Quantile            α ∈ (0, 1)         S(x, y) = 2 (1{x ≥ y} − α) (x − y)

Table 4
Components of the decomposition (15) for the mean squared error (MSE) under mean-forecasts induced by the predictive distributions in Examples 2.1 and 2.2. Uncertainty (UNC) equals 2 irrespective of the forecast at hand. The term I(η0) is in integral form and can be evaluated numerically. For details see Appendix A.

Predictive Distribution   Mean-Forecast     MSE              MCB                               DSC
Perfect                   μ                 1                0                                 1
Unconditional             0                 2                0                                 0
Unfocused                 μ + η/2           1 + η0²/4        (1/4 − I(η0)) η0²                 1 − I(η0) η0²
Lopsided                  μ + √(2/π) δ      1 + (2/π) δ0²    (1/4 − I(√(8/π) δ0)) (8/π) δ0²    1 − I(√(8/π) δ0) (8/π) δ0²

where H is a locally finite measure on R, is consistent for T. If T is a quantile, an expectile, an event probability or a moment, then the construction includes
all consistent scoring functions, subject to standard conditions, and agrees with
suitably adapted classes of generalized piecewise linear (GPL) and Bregman
functions, respectively (Gneiting, 2011; Ehm et al., 2016).
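For a quick numerical sanity check of the mixture representation (17), the sketch below (our own illustration) integrates the elementary losses Sη from (16) for the mean identification function V(x, y) = x − y over η with respect to Lebesgue measure; the result is (x − y)²/2, i.e., the squared error up to the constant factor a = 2 appearing in Definition 2.24 below.

```python
import numpy as np

def elementary_loss(eta, x, y):
    """Elementary loss S_eta from (16) for the mean identification V(x, y) = x - y."""
    return (float(eta <= x) - float(eta <= y)) * (eta - y)

# Riemann-sum integration of S_eta over eta (Lebesgue measure, i.e. uniform mixing H);
# for the mean functional this recovers (x - y)^2 / 2
x, y = 1.2, -0.6
grid = np.linspace(-10.0, 10.0, 200_001)
dx = grid[1] - grid[0]
integral = sum(elementary_loss(e, x, y) for e in grid) * dx
print(round(integral, 4), (x - y) ** 2 / 2)
```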
We now formalize what Ehm and Ovcharov (2017, p. 477) call the “most
prominent” choice, namely, scoring functions for which the mixing measure H
in the representation (17) is uniform.
Definition 2.24. Suppose that the functional T is generated by an identifica-
tion function V satisfying Assumption 2.8, with elementary loss functions Sη as
defined in (16). Then a loss function S is canonical for T if it is nonnegative and
admits a representation of the form

S(x, y) = a ∫R Sη(x, y) dλ(η) + b(y),     (18)
where λ is the Lebesgue measure, a > 0 is a constant, and b is a measurable
function.
Clearly, any canonical loss function is a consistent scoring function for T.
Furthermore, if the identification function is of the prediction error form, then
any canonical loss function has score differentials that are invariant under trans-
lation in the sense that S(x1 + c, y + c) − S(x2 + c, y + c) = S(x1 , y) − S(x2 , y).
Conversely, we note from Section 5.1 of Savage (1971) that for the mean func-
tional the canonical loss functions are the only consistent scoring functions of
this type.
Typically, one chooses the constant a > 0 and the measurable function b(y)
in (18) such that the canonical loss admits a concise closed form, as exemplified

Fig 5. Components of the decomposition (15) for the mean squared error (MSE) under mean-
forecasts induced by the unfocused and lopsided predictive distributions from Example 2.2 and
Table 4, as functions of η0 ≥ 0 and δ0 ∈ (0, 1), respectively.

in Table 3. Since any selection incurs the same point forecast ranking, we refer
to the choice in Table 3 as the canonical loss function. The most prominent
example arises when T is the mean functional, where the ubiquitous quadratic
or squared error scoring function,
S(x, y) = (x − y)²,     (19)
is canonical. In this case, the UNC component equals the unconditional variance
of Y as x0 is simply the marginal mean μY of Y , and the MCB and DSC
components of the general score decomposition (15) are
MCB = E[(X − Xrc)²]  and  DSC = E[(Xrc − μY)²],
respectively. Note that here and in the following, we drop the subscript S when-
ever we use a canonical loss. Table 4 and Figure 5 provide explicit examples.
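An empirical version of the decomposition (15) for the mean functional is straightforward once recalibrated values are available. The Python sketch below (our own illustration; it uses scikit-learn's isotonic regression as a stand-in for the CORP estimators developed in Section 3, and the simulated forecast is an assumption of ours) computes MSE, MCB, DSC, and UNC for a conditionally biased mean forecast and verifies that MSE = MCB − DSC + UNC.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(5)
n = 50_000
mu = rng.standard_normal(n)
y = mu + rng.standard_normal(n)
x = mu + 0.75 * rng.choice([-1.0, 1.0], size=n)    # mean forecast with conditional bias

x_rc = IsotonicRegression(out_of_bounds="clip").fit_transform(x, y)  # recalibrated X_rc
x0 = y.mean()                                                        # marginal mean forecast

mse = np.mean((x - y) ** 2)          # mean score S_bar
s_rc = np.mean((x_rc - y) ** 2)      # S_bar_rc for the recalibrated forecast
unc = np.mean((x0 - y) ** 2)         # S_bar_mg, the unconditional variance of Y

mcb, dsc = mse - s_rc, unc - s_rc
print(f"MSE {mse:.3f} = MCB {mcb:.3f} - DSC {dsc:.3f} + UNC {unc:.3f}")
```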
In the nested case of a binary outcome Y , where X and Xrc specify event
probabilities, the quadratic loss function reduces to the Brier score (Gneiting
and Raftery, 2007), and we refer to Dimitriadis, Gneiting and Jordan (2021) and
references therein for details on score decompositions. In the case of threshold
calibration, the point forecast x = F (t) is induced by a predictive distribution,
and the Brier score can be written as
S(x, y) = (F(t) − 1{y ≤ t})².     (20)
For both real-valued and binary outcomes, it is often preferable to use the square
root of the miscalibration component (MCB1/2 ) as a measure of calibration error
that can be interpreted on natural scales (e.g., Roelofs et al., 2022).
A canonical loss function for the Huber functional (Table 1) is given by

S(x, y) = 2 |1{x ≥ y} − α| · { 2a|x − y| − a²   if x − y < −a,
                               (x − y)²          if −a ≤ x − y ≤ b,
                               2b|x − y| − b²    if x − y > b };

cf. Taggart (2022, Definition 4.2). In the limiting case as a = b → ∞, we recover the canonical loss functions for the α-expectile, which include the quadratic loss
in (19). Similarly, if we rescale suitably and take the limit as a = b → 0, we
recover the asymmetric piecewise linear or pinball loss, as listed in Table 3,
which lies at the heart of quantile regression.
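The two limits can be checked numerically. The sketch below (our own illustration; the function name and the rescaling factor 1/(2a) in the second check are our choices) implements the canonical Huber loss as displayed above and compares it with the canonical α-expectile loss for large a = b and, after rescaling, with the pinball loss for small a = b.

```python
def canonical_huber_loss(x, y, alpha, a, b):
    """Canonical loss for the Huber functional as displayed above; a sketch."""
    d = x - y
    if d < -a:
        core = 2 * a * abs(d) - a ** 2
    elif d > b:
        core = 2 * b * abs(d) - b ** 2
    else:
        core = d ** 2
    return 2 * abs((1.0 if x >= y else 0.0) - alpha) * core

x, y, alpha = 1.3, 0.2, 0.7
# a = b large: close to the canonical alpha-expectile loss 2|1{x >= y} - alpha|(x - y)^2
print(canonical_huber_loss(x, y, alpha, a=100.0, b=100.0),
      2 * abs(1.0 - alpha) * (x - y) ** 2)
# a = b small, rescaled by 1/(2a): close to the pinball loss 2(1{x >= y} - alpha)(x - y)
a = 1e-6
print(canonical_huber_loss(x, y, alpha, a=a, b=a) / (2 * a),
      2 * (1.0 - alpha) * (x - y))
```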
We move on to a remarkable property of canonical loss functions. In a nut-
shell, the point forecast X is unconditionally T-calibrated if, and only if, the
expected canonical loss deteriorates under translation. This property, which
nests classical results in regression theory, as we demonstrate at the end of
Section 3.3, does not hold under consistent scoring functions in general. For a
counterexample see Appendix A.8.
Assumption 2.25. The point forecast X, the functional T and the identifica-
tion function V satisfy Assumptions 2.8 and 2.19, and S is a canonical loss for
T. Furthermore, E[S(X + η, Y )] and E[V (X + η, Y )] are well defined and locally
bounded as functions of η ∈ R.
Theorem 2.26. Under Assumption 2.25, the point forecast X is uncondition-
ally T-calibrated if, and only if,

E [S(X + c, Y )] ≥ E [S(X, Y )] for all c ∈ R.

Proof. If X is unconditionally T-calibrated and c > 0, then

E[S(X + c, Y)] − E[S(X, Y)]     (21)
   = E[ ∫ (1{η ≤ X + c} − 1{η ≤ X}) V(η, Y) dη ]
   = E[ ∫(0,c] V(X + η, Y) dη ]  =  ∫(0,c] E[V(X + η, Y)] dη

is nonnegative by the second part of the unconditional T-calibration criterion (13). Conversely, if the score difference in (21) is nonnegative for all c > 0,
then so is
 
E[V(X + c, Y)] = (1/c) ∫(0,c] E[V(X + c, Y)] dη ≥ (1/c) ∫(0,c] E[V(X + η, Y)] dη

as the identification function V is non-decreasing in its first argument. Hence, the second part of (13) is satisfied.
the second part of (13) is satisfied.
An analogous argument shows that the score difference (21) is nonnegative
for all c < 0 if, and only if, the first inequality in (13) is satisfied.

As a consequence, under a canonical loss function the MCB component in the score decomposition (15) of Theorem 2.23 decomposes into nonnegative uncon-
ditional and conditional components MCBu and MCBc , respectively, subject to
the mild condition that unconditional recalibration via translation is feasible.

Theorem 2.27. Let Assumption 2.25 hold, and suppose there is a constant c
such that X + c is unconditionally T-calibrated. Let Xurc = X + c and S̄urc =
E[S(Xurc , Y )], and define

MCBu = S̄ − S̄urc and MCBc = S̄urc − S̄rc .

Then
MCB = MCBu + MCBc ,
where MCBu ≥ 0 with equality if X is unconditionally T-calibrated, and MCBc ≥
0 with equality if Xrc = Xurc almost surely. If S is strictly consistent, then
MCBu = 0 only if X is unconditionally T-calibrated, and MCBc = 0 only if
Xrc = Xurc almost surely.
Proof. Immediate from Theorems 2.23 and 2.26, and the fact that conditional
recalibration of X and X + c yields the same Xrc .
In settings that are equivariant under translation, such as for expectiles,
quantiles, and Huber functionals when both X and Y are supported on the real
line, X can always be unconditionally recalibrated by adding a constant. Under
any canonical loss function S, the basic decomposition (15) then extends to

S̄ = MCBu + MCBc − DSC + UNC. (22)

For instance, when S(x, y) = (x − y)2 is the canonical loss for the mean func-
tional, MCBu = c2 is the squared unconditional bias. The forecasts in Figure 5
and Table 4 are free of unconditional bias, so MCBu = 0 and MCBc = MCB.
In all cases studied thus far, canonical loss functions are strictly consistent
(Ehm et al., 2016), and so MCBu = 0 if and only if the forecast is uncondition-
ally T-calibrated, and MCBc = 0 if and only if Xurc = Xrc almost surely. While
in other settings, such as when the outcomes are bounded, unconditional recali-
bration by translation might be counterintuitive (in principle) or impossible (in
practice), the statement of Theorem 2.27 continues to hold, and the above re-
sults can be refined to admit more general forms of unconditional recalibration.
We leave these and other ramifications to future work.

3. Empirical reliability diagrams and score decompositions: the CORP approach

We turn to empirical settings, where calibration checks, scores, and score de-
compositions address critical practical problems in both model diagnostics and
forecast evaluation. The most direct usage is in the evaluation of out-of-sample
predictive performance, where forecasts may either take the form of fully spec-
ified predictive distributions, or be single-valued point forecasts that arise, im-
plicitly or explicitly, as functionals of predictive distributions. Similarly, in model
diagnostics, where in-sample goodness-of-fit is of interest, the model might sup-
ply fully specified, parametric or non-parametric conditional distributions, or

single-valued regression output that is interpreted as a functional of an underlying, implicit or explicit, probability distribution. Prominent examples for
the latter setting include ordinary least squares regression, where the mean or
expectation functional is sought, and quantile regression.
In the case of fully specified predictive distributions, we work with tuples of
the form
(F1 , y1 ), . . . , (Fn , yn ), (23)
where Fi is a posited conditional CDF for the real-valued observation yi for
i = 1, . . . , n, which we interpret as a sample from an underlying population Q
in the prediction space setting of Section 2. In the case of stand-alone point
forecasts or regression output, we assume throughout that the functional T is
of the type stated in Assumption 2.19 and work with tuples of the form

    (x1 , y1 ), . . . , (xn , yn ),      (24)

where xi = T(Fi ) ∈ R derives explicitly or implicitly from a predictive distribution Fi for i = 1, . . . , n.
In the remainder of the section, we introduce empirical versions of T-reliability
diagrams (Definition 2.20) and score components (Definition 2.22) for samples
of the form (23) or (24), which allow for both diagnostic checks and inference
about an underlying population Q. While practitioners may think of our empir-
ical versions exclusively from diagnostic perspectives, we emphasize that they
can be interpreted as estimators of the population quantities and be analyzed
as such. A key feature of our approach is the use of nonparametric isotonic re-
gression via the pool-adjacent-violators algorithm, as proposed by Dimitriadis,
Gneiting and Jordan (2021) in the particular case of binary outcomes. The gen-
eralization that we discuss here is hinted at in the discussion section of their
paper.

3.1. The T-pool-adjacent-violators (T-PAV) algorithm

Our key tool and workhorse is a very general version of the classical pool-
adjacent-violators (PAV) algorithm for nonparametric isotonic regression (Ayer
et al., 1955; van Eeden, 1958). Historically, work on the PAV algorithm has
focused on the mean functional, as reviewed by Barlow et al. (1972), Robertson
and Wright (1980), and de Leeuw, Hornik and Mair (2009), among others. In
contrast, Jordan, Mühlemann and Ziegel (2022) study the PAV algorithm in
very general terms that accommodate our setting.
We rely on their work and describe the T-pool-adjacent-violators algorithm
based on tuples (x1 , y1 ), . . . , (xn , yn ) of the form (24), where without loss of
generality we may assume that x1 ≤ · · · ≤ xn . Furthermore, we let δi denote
the point measure in the outcome yi . More generally, for 1 ≤ k ≤ l ≤ n we let

    δk:l = (1/(l − k + 1)) ∑_{i=k}^{l} δi
be the associated empirical measure. Algorithm 1 describes the generation of an increasing sequence x̂1 ≤ · · · ≤ x̂n of recalibrated values, which by construction are conditionally T-calibrated with respect to the empirical measure associated with (x̂1 , y1 ), . . . , (x̂n , yn ). The algorithm rests on partitions of the index set {1, . . . , n} into groups Gk:l = {k, . . . , l} of consecutive integers. Successive groups are pooled iteratively if the value of the functional applied to the empirical measure associated with the preceding group exceeds the value associated with the subsequent group.
Algorithm 1: General T-PAV algorithm based on data of the form (24)
  Input: (x1 , y1 ), . . . , (xn , yn ) ∈ R² where x1 ≤ · · · ≤ xn
  Output: T-calibrated values x̂1 , . . . , x̂n
  partition into groups G1:1 , . . . , Gn:n and let x̂i = T(δi ) for i = 1, . . . , n
  while there are groups Gk:i and G(i+1):l such that x̂1 ≤ · · · ≤ x̂i and x̂i > x̂i+1 do
      merge Gk:i and G(i+1):l into Gk:l and let x̂i = T(δk:l ) for i = k, . . . , l
  end
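For illustration, the following sketch transcribes Algorithm 1 into R in a deliberately didactic (not optimized) manner; the function name pav_T and its arguments are our own, the argument functional maps a vector of pooled outcomes to T of the associated empirical measure, and efficient implementations are available through the work of Jordan, Mühlemann and Ziegel (2022).

# Didactic R sketch of Algorithm 1 (hypothetical helper, not an optimized implementation).
# x must be sorted increasingly; 'functional' maps a vector of pooled outcomes to T of
# its empirical measure, e.g., mean, or function(v) quantile(v, 0.1, type = 1).
pav_T <- function(x, y, functional = mean) {
  stopifnot(!is.unsorted(x), length(x) == length(y))
  n <- length(y)
  left <- 1:n; right <- 1:n                        # groups G_{k:l} as runs of indices
  val <- sapply(1:n, function(i) functional(y[i]))
  repeat {
    viol <- which(diff(val) < 0)                   # adjacent groups violating isotonicity
    if (length(viol) == 0) break
    i <- viol[1]                                   # merge the leftmost violating pair
    val[i] <- functional(y[left[i]:right[i + 1]])  # recompute T on the pooled outcomes
    right[i] <- right[i + 1]
    left <- left[-(i + 1)]; right <- right[-(i + 1)]; val <- val[-(i + 1)]
  }
  rep(val, times = right - left + 1)               # recalibrated values, one per observation
}

For the mean functional and strictly increasing forecast values, the output agrees with the nonparametric isotonic regression fit returned by stats::isoreg; for other functionals, Theorem 3.1 below guarantees the analogous optimality.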

The following result summarizes the remarkable properties of the T-PAV algorithm, as proved in Section 3.2 of Jordan, Mühlemann and Ziegel (2022).

Theorem 3.1 (Jordan, Mühlemann and Ziegel, 2022). Suppose that the functional T is as stated in Assumption 2.19. Then Algorithm 1 generates a sequence x̂1 , . . . , x̂n such that the empirical measure associated with (x̂1 , y1 ), . . . , (x̂n , yn ) is conditionally T-calibrated. This sequence is optimal with respect to any scoring function S of the form (17), in that

    (1/n) ∑_{i=1}^{n} S(x̂i , yi ) ≤ (1/n) ∑_{i=1}^{n} S(ti , yi )      (25)

for any non-decreasing sequence t1 ≤ · · · ≤ tn .
We note that for a functional T of interval type, the minimum on the left-hand
side of (25) is the same under the lower functional T− and the upper functional
T+ , respectively, as defined in Section 2.4. For customary functionals, such as
threshold (non) exceedance probabilities, quantiles, expectiles, and moments,
the optimality is universal, as functions of the form (17) exhaust the class of the
T-consistent scoring functions subject to mild conditions (Ehm et al., 2016).
While the PAV algorithm has been used extensively for the recalibration of
probabilistic classifiers (e.g., Flach, 2012), we are unaware of any extant work
that uses Algorithm 1 for forecast recalibration, forecast evaluation, or model
diagnostics in non-binary settings.

3.2. Empirical T-reliability diagrams

Recently, Dimitriadis, Gneiting and Jordan (2021) introduced the CORP ap-
proach for the estimation of reliability diagrams and score decompositions in
the case of probability forecasts for binary outcomes. In a nutshell, the acronym
CORP refers to an estimator that is Consistent under the assumption of iso-
tonicity for the population recalibration function and Optimal in both finite
sample and asymptotic settings, while facilitating Reproducibility, and being
based on the PAV algorithm. Here, we extend the CORP approach and employ
nonparametric isotonic T-regression via the T-PAV algorithm under Assump-
tion 2.19, where T is the lower or upper version of an identifiable functional, or
an identifiable singleton functional.
We begin by defining the empirical T-reliability diagram, which is a sample
version of the population diagram in Definition 2.20.
Definition 3.2. Let the functional T be as stated in Assumption 2.19, and suppose that x̂1 , . . . , x̂n originate from tuples (x1 , y1 ), . . . , (xn , yn ) with x1 ≤ · · · ≤ xn via Algorithm 1. Then the CORP empirical T-reliability diagram is the graph of the piecewise linear function that connects the points (x1 , x̂1 ), . . . , (xn , x̂n ) in the Euclidean plane.
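For illustration, and reusing the hypothetical pav_T helper sketched in Section 3.1, a CORP empirical quantile reliability diagram for simulated data might be drawn as follows; the quantile level, sample size, and data generating process are arbitrary choices of ours.

# Sketch: CORP empirical reliability diagram for the lower 0.10-quantile functional.
set.seed(1)
n  <- 400
mu <- rnorm(n)
x  <- mu + qnorm(0.10)                  # posited 0.10-quantile of N(mu, 1)
y  <- rnorm(n, mean = mu)               # outcomes
ord <- order(x); x <- x[ord]; y <- y[ord]
xhat <- pav_T(x, y, functional = function(v) quantile(v, 0.10, type = 1))
plot(x, xhat, type = "l",
     xlab = "quantile forecast", ylab = "recalibrated value")
abline(0, 1, lty = 2)                   # the diagonal indicates T-calibration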
A few scattered references in the literature on forecast evaluation have pro-
posed displays of recalibrated against original values for functionals other than
binary event probabilities: Figures 3 and 7 of Bentzien and Friederichs (2014)
and Figure 8 of Pohle (2020) consider quantiles, and Figures 2–5 of Satopää
and Ungar (2015) concern the mean functional. However, none of these papers
employ the PAV algorithm, and the resulting diagrams are subject to issues
of stability and efficiency, as illustrated by Dimitriadis, Gneiting and Jordan
(2021) in the case of binary outcomes.
For the CORP empirical T-reliability diagram to be consistent in the sense
of large sample convergence to the population version of Definition 2.20, the
assumption of isotonicity of the population recalibration function needs to be
invoked. As argued by Roelofs et al. (2022) and Dimitriadis, Gneiting and Jordan
(2021), such an assumption is natural, and practitioners tend to dismiss non-
isotonic recalibration functions as artifacts. Evidently, these arguments transfer
to arbitrary functionals, and any violations of the isotonicity assumption entail
horizontal segments in CORP reliability diagrams, thereby indicating a lack of
reliability. Large sample theory for CORP estimates of the recalibration func-
tion and the T-reliability diagram depends on the functional T, the type —
discrete or continuous — of the marginal distribution of the point forecast X,
and smoothness conditions. Mösching and Dümbgen (2020) establish rates of
uniform convergence in the cases of threshold (non) exceedance and quantile
functionals that complement classical theory (Barlow et al., 1972; Casady and
Cryer, 1976; Wright, 1984; Robertson, Wright and Dykstra, 1988; El Barmi and
Mukerjee, 2005; Guntuboyina and Sen, 2018).
In the case of binary outcomes, Bröcker and Smith (2007, p. 651) argue that
reliability diagrams ought to be supplemented by consistency bars for “imme-
diate visual evaluation as to just how likely the observed relative frequencies
are under the assumption that the predicted probabilities are reliable.” Dimitri-
adis, Gneiting and Jordan (2021) develop asymptotic and Monte Carlo based
methods for the generation of consistency bands to accompany a CORP relia-
bility diagram for dichotomous outcomes, and provide code in the form of the
reliabilitydiag package (Dimitriadis and Jordan, 2021) for R (R Core Team,
2021). The consistency bands quantify and visualize the variability of the em-
pirical reliability diagram under the respective null hypothesis, i.e., they show
the pointwise range of the CORP T-reliability diagram that we expect to see
under a calibrated forecast. Algorithms 2 and 3 in Appendix B generalize this
approach to produce consistency bands from data of the form (23) under the as-
sumption of auto-calibration. In the specific case of threshold calibration, where
the induced outcome is dichotomous, the assumptions of auto-calibration (in the
binary setting) and T-calibration (for the non-exceedance functional) coincide
(Gneiting and Ranjan, 2013, Theorem 2.11), and we use the aforementioned
algorithms to generate consistency bands (Figure 6, top row). Generally, auto-
calibration is a strictly stronger assumption than T-calibration, with ensuing
issues, which we discuss in Appendix B.1. Furthermore, to generate consistency
bands from data of the form (24), we cannot operate under the assumption of
auto-calibration.
As a crude yet viable alternative, we propose in Appendix B.2 a Monte Carlo
technique for the generation of consistency bands that is based on resampling
residuals. As in traditional regression diagnostics, the approach depends on the
assumption of independence between point forecasts and residuals. Figure 6
shows examples of T-reliability diagrams with associated residual-based 90%
consistency bands for the perfect, unfocused, and lopsided forecasts from Sec-
tion 2 for the mean functional (middle row) and the lower quantile functional
at level 0.10 (bottom row). For further discussion see Appendix B.2. In the
case of the mean functional, we add the scatter diagram for the original data
of the form (24), whereas in the other two cases, inset histograms visualize the
marginal distribution of the point forecast.
We encourage follow-up work on both Monte Carlo and asymptotic meth-
ods for the generation of consistency and confidence bands that are tailored to
specific functionals of interest, similar to the analysis by Dimitriadis, Gneiting
and Jordan (2021) in the basic case of probability forecasts for binary out-
comes.

3.3. Empirical score decompositions

In this section, we consider data (x1 , y1 ), . . . , (xn , yn ) of the form (24), where
implicitly or explicitly xi = T(Fi ) for a single-valued functional T. Let x̂1 , . . . , x̂n denote the respective T-PAV recalibrated values, and let x̂0 = T(F̂0 ), where F̂0 is the empirical CDF of the outcomes y1 , . . . , yn . Let

    Ŝ = (1/n) ∑_{i=1}^{n} S(xi , yi ),   Ŝrc = (1/n) ∑_{i=1}^{n} S(x̂i , yi ),   and   Ŝmg = (1/n) ∑_{i=1}^{n} S(x̂0 , yi )      (26)
denote the mean score of the point forecast at hand, the recalibrated point fore-
cast, and the functional T applied to the unconditional, marginal distribution
Fig 6. CORP empirical threshold (top, t = 1), mean (middle) and quantile (bottom, α = 0.10) reliability diagrams for the perfect (left), unfocused (middle), and lopsided (right) forecast from Examples 2.1 and 2.2 with 90% consistency bands and CORP score components under the associated canonical loss function based on samples of size 400.

of the outcome, respectively. If all quantities in (26) are finite, we refer to

    M̂CBS = Ŝ − Ŝrc ,   D̂SCS = Ŝmg − Ŝrc ,   and   ÛNCS = Ŝmg      (27)
as the miscalibration, discrimination and uncertainty components of the mean score Ŝ. Our next result generalizes Theorem 1 of Dimitriadis, Gneiting and Jordan (2021) and decomposes the mean score Ŝ into a signed sum of nonnegative, readily interpretable components.
Theorem 3.3. Suppose that the functional T satisfies the conditions in Assumption 2.19. Let the scoring function S be of the form (17), suppose that x̂1 , . . . , x̂n originate from tuples (x1 , y1 ), . . . , (xn , yn ) via Algorithm 1, and let all terms in (26) be finite. Then

    Ŝ = M̂CBS − D̂SCS + ÛNCS ,      (28)

where M̂CBS ≥ 0 with equality if x̂i = xi for i = 1, . . . , n, and D̂SCS ≥ 0 with equality if x̂i = x̂0 for i = 1, . . . , n. If S is strictly consistent, then M̂CBS = 0 only if x̂i = xi for i = 1, . . . , n and D̂SCS = 0 only if x̂i = x̂0 for i = 1, . . . , n.
Proof. Immediate from Theorem 3.1.
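As an illustration of the components in (27), and again assuming the hypothetical pav_T helper from Section 3.1, the CORP decomposition for mean forecasts under the canonical squared error loss might be computed along the following lines.

# Sketch: CORP components (27) for mean forecasts x under squared error.
corp_decomp <- function(x, y) {
  ord  <- order(x); x <- x[ord]; y <- y[ord]
  xhat <- pav_T(x, y, functional = mean)   # T-PAV recalibrated values
  x0   <- mean(y)                          # functional of the marginal distribution
  S    <- function(f, o) mean((f - o)^2)   # canonical loss for the mean functional
  c(MCB = S(x, y) - S(xhat, y),
    DSC = S(x0, y) - S(xhat, y),
    UNC = S(x0, y))
}

The mean score then equals MCB − DSC + UNC, in accordance with (28).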

Thus, CORP estimates of score components enjoy the same properties as the respective population quantities (Theorem 2.23, eq. (15)). This agreement
is not to be taken for granted, as the nonnegativity of the estimated compo-
nents cannot be guaranteed if approaches other than the T-PAV algorithm are
used for recalibration (Dimitriadis, Gneiting and Jordan, 2021, Supplementary
Section S5).
Recently, the estimation of calibration error has seen a surge of interest in
machine learning (Guo et al., 2017; Kuleshov, Fenner and Ermon, 2018; Ku-
mar, Liang and Ma, 2019; Nixon et al., 2019; Roelofs et al., 2022). Under
the natural assumption of isotonicity of the population recalibration function, M̂CBS is a consistent estimate of the population quantity MCBS , with canonical loss functions being natural choices for S. As noted, it is often preferable
to use the square root of the miscalibration component under squared error
as a measure of calibration error that can be interpreted on natural scales.
Asymptotic distributions for our estimators depend on the functional T, the
scoring function S, and regularity conditions. Large sample theory can leverage
extant theory for nonparametric isotonic regression, as hinted at in the previ-
ous section, though score components might show distinct asymptotic behavior.
Further development is beyond the scope of the present paper and strongly
encouraged.
In the remainder of the section, we assume that S is a canonical score and
drop the subscript in the score components. If there is a constant ĉ ∈ R such that the empirical measure in (x1 + ĉ, y1 ), . . . , (xn + ĉ, yn ) is unconditionally T-calibrated, let

    Ŝurc = (1/n) ∑_{i=1}^{n} S(xi + ĉ, yi ).      (29)
We then refer to

    M̂CBu = Ŝ − Ŝurc   and   M̂CBc = Ŝurc − Ŝrc

as the CORP unconditional and conditional miscalibration components of the mean canonical score, respectively. Under mild conditions, these estimates are
nonnegative and share properties of the respective population quantities in The-
orem 2.27.
Theorem 3.4. Let the conditions of Theorem 3.3 hold, and let S be a canonical loss function for T. Suppose there is a constant ĉ ∈ R such that the empirical measure in (x1 + ĉ, y1 ), . . . , (xn + ĉ, yn ) is unconditionally T-calibrated, and suppose that all terms in (29) are finite. Then

    M̂CB = M̂CBu + M̂CBc ,

where M̂CBu ≥ 0 and M̂CBc ≥ 0.
Proof. Immediate from Theorems 2.26 and 3.1 and the trivial fact that the
addition of a constant is a special case of an isotonic mapping.
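For the mean functional, the shift ĉ in Theorem 3.4 is simply the mean residual, so that the split of the miscalibration component might be sketched as follows; the function name and its output labels are ours.

# Sketch: unconditional and conditional CORP miscalibration under squared error.
corp_mcb_split <- function(x, y) {
  ord  <- order(x); x <- x[ord]; y <- y[ord]
  xhat <- pav_T(x, y, functional = mean)   # conditionally recalibrated values
  chat <- mean(y) - mean(x)                # x + chat is unconditionally mean calibrated
  S    <- function(f, o) mean((f - o)^2)
  c(MCB_u = S(x, y) - S(x + chat, y),      # equals chat^2 under squared error
    MCB_c = S(x + chat, y) - S(xhat, y))
}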
In the middle row of Figure 6, the extended CORP decomposition,

    Ŝ = M̂CBu + M̂CBc − D̂SC + ÛNC,      (30)

which estimates the population decomposition (22), is applied to the mean squared error (MSE). Likewise, the extended CORP decomposition of the canon-
ical score for quantiles, i.e., the piecewise linear quantile score (QS) from Table 3,
is shown in the bottom row. The top row concerns threshold calibration, and we
report the standard CORP decomposition (28) of the Brier score (BS) from (20).
While the assumptions of Theorem 3.3 are satisfied in this setting, the addition
of the constant ĉ may yield forecast values outside the unit interval, whence we
refrain from considering the refined decomposition in (30).
In this context, the distinction between out-of-sample forecast evaluation and
in-sample model diagnostics is critical. When evaluating out-of-sample forecasts,
both unconditional and conditional miscalibration are relevant. In contrast, in-
sample model fits frequently enforce unconditional calibration. For example,
if we fit a regression model with intercept by minimizing the canonical loss
for a functional T, Theorem 2.26 applied to the associated empirical measure
guarantees in-sample unconditional T-calibration. As special cases, this line of
reasoning yields classical results in ordinary least squares regression, and the
partitioning inequalities of quantile regression in Theorem 3.4 of Koenker and
Bassett (1978).

3.4. Skill scores and a universal coefficient of determination

Let us revisit the mean scores in (26) under the natural assumption that the terms in Ŝ and Ŝmg are finite and that Ŝmg is strictly positive. In out-of-sample forecast evaluation, the quantity

    Ŝskill = 1 − Ŝ/Ŝmg = (Ŝmg − Ŝ)/Ŝmg = (D̂SCS − M̂CBS)/ÛNCS      (31)
is known as skill score (Murphy and Epstein, 1989; Murphy, 1996; Gneiting
and Raftery, 2007; Jolliffe and Stephenson, 2012) and may attain both positive
and negative values. In particular, when S(x, y) = (x − y)2 is the canonical loss
function for the mean functional, Ŝskill coincides with the popular Nash-Sutcliffe
model efficiency coefficient (NSE; Nash and Sutcliffe, 1970; Moriasi et al., 2007).
A positive skill score indicates predictive performance better than the simplistic
unconditional reference forecast, whereas a negative skill score suggests that
we are better off using the simple reference forecast. Of course, it is possible,
and frequently advisable, to base skill scores on reference standards that are
more sophisticated than an unconditional, constant point forecast (Hyndman
and Koehler, 2006).
In contrast, if the goal is in-sample model diagnostics, the quantity in (31)
typically is nonnegative. As we demonstrate now, it constitutes a powerful gen-
eralization of the coefficient of determination, R2 , or variance explained in least
squares regression, and its close cousin, the R1 -measure in quantile regression
(Koenker and Machado, 1999). Specifically, we propose the use of

    R∗ = (D̂SCS − M̂CBS) / ÛNCS ,      (32)
as a universal coefficient of determination. In practice, one takes S to be a
canonical loss for the functional T at hand, and we drop the subscripts in this
case. The classical R2 measure arises when S(x, y) = (x − y)2 is the canonical
squared error loss function for the mean functional, and the R1 measure of
Koenker and Machado (1999) emerges when S(x, y) = 2 (1{x ≥ y} − α) (x − y)
is the canonical piecewise linear loss under the α-quantile functional. Of course,
in the case α = 1/2 of the median, the piecewise linear loss reduces to the absolute
error.
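As a quick numerical check of the nesting claim, note that by (28) and (31) the coefficient R∗ under squared error equals 1 − Ŝ/Ŝmg, which for least squares regression with intercept is the classical R². A minimal sketch, with simulated data of our own choosing, reads as follows.

# Sketch: R* under squared error coincides with the classical R^2 of lm().
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit     <- lm(y ~ x)
Sbar    <- mean((fitted(fit) - y)^2)       # mean canonical (squared error) loss
Sbar_mg <- mean((mean(y) - y)^2)           # loss of the constant marginal fit
Rstar   <- 1 - Sbar / Sbar_mg
all.equal(Rstar, summary(fit)$r.squared)   # TRUE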
In Figure 7, we present a numerical example on the toy data from Figure 1
in Kvålseth (1985). The straight lines show the linear (ordinary least squares)
mean and linear (Laplace) median regression fits, which Kvålseth (1985) sought
to compare. The piecewise linear broken curves illustrate the nonparametric
isotonic regression fits, as realized by the T-PAV algorithm, where T is the
mean and the lower and the upper median, respectively. As the linear regression
fits induce the same ranking of the point forecasts, they yield the same PAV-
recalibrated values that enter the terms in the score decomposition (27), and
thus they have identical discrimination components in (28), which equal 10.593
under squared error and 2.333 under absolute error, regardless of which isotonic
median is used. The uncertainty components, which equal 12.000 under squared
error, and 2.889 under absolute error, are identical as well, since they depend
on the observations only. Thus, the differences in R2 respectively R1 in Figure 7
stem from distinct miscalibration components. Of course, linear mean regression
is preferred under squared error, and linear median regression is preferred under
absolute error.
Various authors have discussed desiderata for a generally applicable defini-
tion of a coefficient of determination (Kvålseth, 1985; Nakagawa and Schielzeth,
2013) for the assessment of in-sample fit. In particular, such a coefficient ought
to be dimensionless and take values in the unit interval, with a value of 1 in-
dicating a perfect fit, and a value of 0 representing a complete lack of fit. The
Fig 7. Linear mean and linear median regression lines for toy example from Kvålseth (1985,
Figure 1), along with nonparametric isotonic mean and median regression fits. The isotonic
median regression fit is not unique and framed by the respective lower and upper functional.

universal coefficient of determination R∗ enjoys these properties under modest conditions.
Assumption 3.5. Suppose that the functional T is as stated in Assumption 2.19 with associated identification function V . Let the scoring function S be of the form (17), and suppose that x̂1 , . . . , x̂n in (26) originate from tuples (x1 , y1 ), . . . , (xn , yn ) via Algorithm 1. Furthermore, let the following hold.
(i) The terms contributing to Ŝ and Ŝmg in (26) are finite, and Ŝmg > 0.
(ii) The values x1 , . . . , xn have been fitted to y1 , . . . , yn by in-sample empirical loss minimization with respect to S, with any constant fit x1 = · · · = xn being admissible.
For example, suppose that T is the mean functional and S is the canonical
squared error scoring function. Then condition (i) is satisfied with the exception
of the trivial case where y1 = · · · = yn , and condition (ii) is satisfied under
linear (ordinary least squares) mean regression with intercept. Similarly, if T
is a quantile and S is the canonical piecewise linear loss function, then (i) is
satisfied except when y1 = · · · = yn , and (ii) is satisfied under linear quantile
regression with intercept. In this light, the following theorem covers the classical
settings for the R2 and R1 measures.
Theorem 3.6. Under Assumption 3.5 it holds that

    R∗ ∈ [0, 1]

with R∗ = 0 if xi = x̂0 for i = 1, . . . , n, and R∗ = 1 if xi = T(δi ) for i = 1, . . . , n.
Proof. The claim follows from Theorem 3.1, the trivial fact that a constant fit is
a special case of an isotonic mapping, and the assumed form (17) of the scoring
function.
We emphasize that Assumption 3.5 and Theorem 3.6 are concerned with, and
tailored to, in-sample model diagnostics. At the expense of technicalities, the
regularity conditions can be relaxed, but the details are tedious and we leave
them to subsequent work. The condition that any constant fit x1 = · · · = xn be
admissible is critical and cannot be relaxed.

3.5. Empirical examples

We now illustrate the use of reliability diagrams, score decompositions, skill scores, and the coefficient of determination R∗ for the purposes of forecast eval-
uation and model diagnostics.
In the basic setting of tuples (x1 , y1 ), . . . , (xn , yn ) of the form (24), the point
forecast xi represents the functional T of a posited distribution for yi . The most
prominent case of the mean functional and canonical squared error loss (19)
is illustrated in Figure 2, where point forecasts by Tredennick et al. (2021)
of (log-transformed) butterfly population size are assessed. The CORP mean
reliability diagram along with 90% consistency bands under the hypothesis of
mean calibration complements the scatter plot provided by Tredennick et al.
(2021, Figure 6). With a mean squared error (MSE) of 0.224, ridge regression
performs much better than the null model with an MSE of 0.262. The CORP
score decomposition shown in Figure 2 refines and supports the analysis.
We move on to discuss the more complex setting of tuples (F1 , y1 ), . . . , (Fn , yn )
of the form (23), where Fi is a posited distribution for yi (i = 1, . . . , n). As dis-
cussed in Section 2 and visualized in Figure 1, the traditional unconditional
notions of calibration, namely, probabilistic and marginal calibration, consti-
tute weak forms of reliability. For this very reason, we recommend that checks
for probabilistic and marginal calibration are given priority in this setting, much
in line with current practice. Typically, probabilistic calibration is checked by
plotting histograms of empirical probability integral transform (PIT) values
(Diebold, Gunther and Tay, 1998; Gneiting, Balabdaoui and Raftery, 2007),
though this practice is hindered by the need for binning. In Appendix B.3, we
discuss the PIT reliability diagram, a rarely used alternative that avoids binning
and retains the spirit of our CORP approach by plotting the CDF of the em-
pirical PIT values. Similarly, as we also discuss in Appendix B.3, the marginal
reliability diagram can be used to assess marginal calibration in the spirit of
the CORP approach. If the analysis indicates gross violations of probabilistic or
marginal calibration, we note from Section 2 and Figure 1 that key notions of
conditional calibration must be violated as well. Otherwise, we might proceed
to check stronger conditional notions of calibration, such as threshold, mean,
and quantile calibration.
To illustrate this process, we consider quarterly Bank of England forecasts of
consumer price index (CPI) inflation rates, as issued since 2004. The forecast
distributions, for which we give details and refer to extant analyses in Ap-
pendix C.2, are two-piece normal distributions that are communicated to the
public via fan charts. The forecasts are at prediction horizons up to six quarters
Fig 8. Calibration diagnostics for Bank of England forecasts of CPI inflation at a prediction
horizon of one quarter: (a) PIT reliability diagram, along with the empirical autocorrelation
functions of (b) original and (c) squared, centered PIT values, (d) marginal, (e) threshold,
and (f) 75%-quantile reliability diagram. If applicable, we show 90% consistency bands and
CORP score components under the associated canonical loss function, namely, the Brier score
(BS) and the piecewise linear quantile score (QS), respectively.
ahead in the time series setting, where k step ahead forecasts that are ideal
with respect to the canonical filtration show PIT values that are independent
at lags ≥ k + 1 in addition to being uniformly distributed (Diebold, Gunther
and Tay, 1998). However, as discussed in Appendix C.1, independent, uniformly
distributed PIT values do not imply auto-calibration, except in a special case.
Thus, calibration diagnostics beyond checks of the uniformity and independence
of the PIT are warranted.
In Figure 8, we consider forecasts one quarter ahead and show PIT and
marginal reliability diagrams, along with empirical autocorrelation functions
(ACFs) for the first two moments of the PIT. In part, the PIT reliability diagram
and the ACFs lie outside the respective 90% consistency bands. For a closer look,
we also plot the threshold reliability diagram at the policy target of 2% and the
lower α-quantile reliability diagram for α = 0.75. The deviations from reliability
remain minor, in stark contrast to calibration diagnostics at prediction horizons
k ≥ 4, for which we refer to Appendix C.2.
Figure 9 shows the standard CORP decomposition (28) of the Brier score
(BS) for the induced probability forecasts at the 2% target and the extended
Fig 9. Score decomposition (28) respectively (30) and skill score (31) for probability forecasts
of not exceeding the 2% inflation target (left) and 75%-quantile forecasts (right) induced by
Bank of England fan charts for CPI inflation, under the associated canonical scoring function.

CORP decomposition (30) of the piecewise linear quantile score for α-quantile
forecasts at level α = 0.75 and lead times up to six quarters ahead. In the lat-
ter case, the difference between MCB and MCBu equals the MCBc component.
Generally, the miscalibration components increase while the discrimination com-
ponents decrease with the lead time. Related results for the quantile functional
can be found in Pohle (2020, Table 5, Figures 7 and 8), where there is a notable
increase in the discrimination (resolution) component at the largest two lead
times, which is caused by counterintuitive decays in the recalibration functions.
In contrast, the regularizing constraint of isotonicity prevents overfitting in the
CORP approach.
The coefficient of determination or skill score R∗ decays with the prediction
horizon and becomes negative at lead times k ≥ 4. This observation suggests
that forecasts remain informative at lead times up to at most three quarters
ahead, in line with the substantive findings in Pohle (2020) and other extant
work, as hinted at in Appendix C.2.

4. Discussion

We have developed a comprehensive theoretical and methodological framework
for the analysis of calibration and reliability, serving the purposes of both (out-
of-sample) forecast evaluation and (in-sample) model diagnostics. A common
principle is that fitted or predicted distributions ought to be calibrated or reli-
able, ideally in the sense of auto-calibration, which stipulates that the outcomes
are random draws from the posited distributions. For general real-valued out-
comes, we have seen that auto-calibration is stronger than both classical un-
conditional and recently proposed conditional notions of calibration. We have
developed hierarchies of calibration in the spirit of Van Calster et al. (2016),
as highlighted in Figure 1, and proposed a generic notion of conditional cal-
ibration in terms of statistical functionals. Specifically, a posited distribution
is conditionally T-calibrated if the induced point forecast for the functional T
can be taken at face value. This concept continues to apply when stand-alone
point forecasts or regression output in terms of the functional T are considered,
so T-reliability diagrams and associated score decompositions can be used in
these settings. Importantly, our tools apply regardless of how forecasts are gen-
erated, be it through the use of traditional statistical regression models, modern
machine learning techniques, or even subjective human judgment.
We have adopted and generalized the nonparametric approach of Dimitri-
adis, Gneiting and Jordan (2021), who obtained consistent, optimally binned,
reproducible, and PAV based (CORP) estimators of T-reliability diagrams and
score components in the case of probability forecasts for binary outcomes. While
our tools apply in the much broader setting of identifiable functionals and real-
valued outcomes, the arguments put forth by Dimitriadis, Gneiting and Jordan
(2021) continue to apply, in that CORP estimators are bound to, simultaneously,
improve statistical efficiency, reproducibility (Stodden et al., 2016), and stabil-
ity (Yu and Kumbier, 2020). In a nutshell, the CORP approach is flexible, due
to its use of nonparametric regression for recalibration, and yet it avoids over-
fitting, owing to the regularizing constraint of isotonicity. Notably, the CORP
score decomposition yields a new, universal coefficient of determination, R∗ ,
that nests and generalizes the classical R2 in ordinary least squares (mean) re-
gression, and its cousin R1 in quantile regression. In independent work, Allen
(2021) also observes the link between skill scores, score decompositions, and the
coefficient of determination. We have illustrated the CORP approach on Bank
of England forecasts of inflation, along with a brief ecological example. Gneiting
et al. (2023) provide an in depth review of the particular case of forecasts in
the form of (one or multiple) quantiles, accompanied by case studies. Code in R
(R Core Team, 2021) for reproducing our results is available at https://github.
com/resinj/replication_GR21 (Gneiting and Resin, 2023).
Follow-up work on the CORP approach for specific functionals T is essen-
tial, including but not limited to the ubiquitous cases of quantiles and the mean
functional, where the newly developed tools can supplement classical approaches
to regression diagnostics, as hinted at in the ecological example. In particular,
we have applied a crude, all-purpose, residual-based permutation approach to
generate consistency bands for T-reliability diagrams under the hypothesis of
T-calibration. Clearly, this approach can be refined, and we anticipate vigorous
work on consistency and confidence bands, based on either resampling or large
sample theory, akin to the developments in Dimitriadis, Gneiting and Jordan
(2021) for probability forecasts of binary outcomes. Similarly, CORP estimates
of miscalibration components under canonical loss functions are natural candi-
dates for the quantification of calibration error in empirical work. Reliability
and discrimination ability are complementary attributes of point forecasts and
regression output, and discrimination can be assessed quantitatively via the re-
spective score component. When many forecasts are to be compared with each
other, scatter plots of CORP miscalibration (MCB) and discrimination (DSC)
components admit succinct visual displays of predictive performance. In this
type of display, forecasts with the same score or, equivalently, identical coeffi-
cient of determination, R∗ , gather on lines with unit slope, and multiple facets
of forecast quality can be assessed simultaneously, for a general alternative to
the widely used Taylor (2001) diagram.
Formal tests of hypotheses of calibration are critical in both specific ap-
plications, such as banking regulation (e.g., Nolde and Ziegel, 2017), and in
generic tasks, such as the assessment of goodness of fit in regression (Dimitri-
adis, Gneiting and Jordan, 2021, Section S2). In Appendix B.4, we comment
on this problem from the perspective of the theoretical and methodological
advances presented here. While specific developments need to be deferred to
future work, it is our belief that the progress in our understanding of notions
and hierarchies of calibration, paired with the CORP approach to estimating
reliability diagrams and score components, can spur a wealth of new and fruitful
developments in these directions.

Appendix A: Supporting calculations for Section 2

Here, we provide supporting computations and discussion for Examples 2.2
and 2.4, Definitions 2.9 and 2.24, Figures 4 and 5, and Table 4, along with
a discussion of the relation between probabilistic calibration and unconditional
quantile calibration, and a counterexample hinted at in the main text. For sub-
sequent use, the first three (non-centered) moments of the normal distribution
N (μ, σ 2 ) are μ, μ2 + σ 2 , and μ3 + 3μσ 2 . As in the main text, we let ϕ and Φ de-
note the density and the cumulative distribution function (CDF), respectively,
of a standard normal variable.

A.1. Unfocused forecast

For fixed a, b ∈ R, the function

    y → Φa (y − b) = (1/2) (Φ(y − b) + Φ(y − a − b))
is a CDF. The random CDF (2) for the unfocused forecast in Example 2.2
can be written as F (y) = Φη (y − μ), where η and μ are independent random
variables and η = ±η0 for some constant η0 > 0. Then the conditional CDF for
the outcome Y given the posited (non) exceedance probability F (t) at any fixed
threshold t ∈ R or, equivalently, given the quantile forecast F −1 (α) at any fixed
level α ∈ (0, 1) is
    Q(Y ≤ y | F(t) = α) = Q(Y ≤ y | F⁻¹(α) = t) = Q(Y ≤ y | μ = t − Φη⁻¹(α))
        = (∑_{s=±1} ϕ(t − Φsη0⁻¹(α)))⁻¹ ∑_{s=±1} ϕ(t − Φsη0⁻¹(α)) Φ(y − (t − Φsη0⁻¹(α))).

As F is symmetric, conditioning on the mean is the same as conditioning on the median. The second moment is m2(F) = 1 + μ² + μη + (1/2)η² ≥ 1 + (1/4)η², so that

    Q(Y ≤ y | m2(F) = m) = Q(Y ≤ y | μ = −(1/2)η ± √(m − 1 − (1/4)η²))
is a mixture of normal distributions. Similarly, the third moment is m3(F) = μ³ + (3/2)ημ² + 3((1/2)η² + 1)μ + (1/2)η(η² + 3) = f(μ; η), so that Q(Y ≤ y | m3(F) = m) = Q(Y ≤ y | f(μ; η) = m) also is a mixture of normal distributions. To compute the roots of the mapping x → f(x; η), we fix η at ±η0 and use a numeric solver (polyroot in R).
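For reference, a minimal R call along these lines, with variable names of our own choosing, recovers the real roots of f( · ; η) = m from the cubic coefficients stated above.

# Sketch: real roots mu solving f(mu; eta) = m for the third moment of the
# unfocused forecast; coefficients are passed to polyroot in increasing order.
roots_m3 <- function(m, eta) {
  cf <- c(0.5 * eta * (eta^2 + 3) - m,   # constant term
          3 * (0.5 * eta^2 + 1),         # coefficient of mu
          1.5 * eta,                     # coefficient of mu^2
          1)                             # coefficient of mu^3
  z <- polyroot(cf)
  Re(z[abs(Im(z)) < 1e-8])               # keep numerically real roots
}
roots_m3(m = 2, eta = 1.5)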
As regards the score decomposition (15) with S(x, y) = (x − y)² for the implied mean-forecast, m1(F) = μ + (1/2)η, the expected score of the recalibrated mean-forecast is

    S̄rc = E[ ( (∑_{s=±1} ϕ(m1(F) + (s/2)η0) (m1(F) + (s/2)η0)) / (∑_{s=±1} ϕ(m1(F) + (s/2)η0)) − Y )² ]
        = E[ ( (∑_{s=±1} ϕ(μ + (1/2)η + (s/2)η0) ((1/2)η + (s/2)η0)) / (∑_{s=±1} ϕ(μ + (1/2)η + (s/2)η0)) − (Y − μ) )² ]
        = η0² E[ ( ϕ(μ + η) / (ϕ(μ) + ϕ(μ + η)) )² ] + E[Y − μ]²
        = η0² E[Ψ²η0(μ)] + 1,

where we define Ψa(x) = ϕ(x + a)/(ϕ(x) + ϕ(x + a)) for a ∈ R and note that E[Ψ²η(μ) | η] = E[Ψ²η0(μ)]. The associated integral

    I(η0) = E[Ψ²η0(μ)] = ∫_{−∞}^{∞} ( ϕ(x + η0) / (ϕ(x) + ϕ(x + η0)) )² ϕ(x) dx

needs to be evaluated numerically.
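A one-dimensional quadrature, for instance via integrate in R, suffices for this purpose; the following minimal sketch uses names of our own choosing.

# Sketch: numerical evaluation of I(eta0).
I_eta <- function(eta0) {
  integrand <- function(x)
    (dnorm(x + eta0) / (dnorm(x) + dnorm(x + eta0)))^2 * dnorm(x)
  integrate(integrand, lower = -Inf, upper = Inf)$value
}
I_eta(1.5)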

A.2. Lopsided forecast

We proceed in analogy to the development for the unfocused forecast. For fixed a ∈ [0, 1] and b ∈ R, the function

    y → Φa (y − b) = (1 − a)Φ(y − b)1{y ≤ b} + ((1 + a)Φ(y − b) − a) 1{y > b}
is a CDF. The CDF for the lopsided forecast with random density (3) from
Example 2.2 can be written as F (y) = Φδ (y −μ), where δ and μ are independent
random variables and δ = ±δ0 for some δ0 ∈ (0, 1). As E[Φδ (y − μ) | μ] = Φ(y −
μ), the lopsided forecast is marginally calibrated. It fails to be probabilistically
calibrated since ZF = Φδ (Y − μ) has CDF

    Q(ZF ≤ u) = (1/2) ∑_{s=±1} [ (u/(1 − sδ0)) 1{u/(1 − sδ0) ≤ 1/2} + ((u + sδ0)/(1 + sδ0)) 1{u/(1 − sδ0) > 1/2} ]
for u ∈ (0, 1) by the law of total probability.
The conditional CDF for the outcome Y given the posited (non) exceedance probability F(t) at any fixed threshold t ∈ R or, equivalently, given the quantile forecast F⁻¹(α) at any fixed level α ∈ (0, 1) is

    Q(Y ≤ y | F(t) = α) = Q(Y ≤ y | F⁻¹(α) = t) = Q(Y ≤ y | μ = t − Φδ⁻¹(α))
        = (∑_{s=±1} ϕ(t − Φsδ0⁻¹(α)))⁻¹ ∑_{s=±1} ϕ(t − Φsδ0⁻¹(α)) Φ(y − (t − Φsδ0⁻¹(α))),

where Φa⁻¹(α) = Φ⁻¹(α/(1 − a)) if α ≤ (1/2)(1 − a) and Φa⁻¹(α) = Φ⁻¹((a + α)/(a + 1)) otherwise.
As F is a mixture of truncated normal distributions, its moments are mixtures
of the component moments, for which we refer to Orjebin (2014). The first
moment is m1 (F ) = μ + 2δϕ(0), so that
    Q(Y ≤ y | m1(F) = m) = Q(Y ≤ y | μ = m − 2δϕ(0))
        = (∑_{s=±1} ϕ(m − 2sδ0ϕ(0)))⁻¹ ∑_{s=±1} ϕ(m − 2sδ0ϕ(0)) Φ(y − (m − 2sδ0ϕ(0)))
is a mixture of normal distributions. Similarly, the second and third moments are
m2(F) = μ² + 1 + 4δϕ(0)μ ≥ 1 − 4δ²ϕ(0)² and m3(F) = μ³ + 3μ + 2δϕ(0)(3μ² + 2) = f(μ; δ), respectively, so that

    Q(Y ≤ y | m2(F) = m) = Q(Y ≤ y | μ = −2δϕ(0) ± √(4δ²ϕ(0)² − 1 + m)),
    Q(Y ≤ y | m3(F) = m) = Q(Y ≤ y | f(μ; δ) = m)
also admit expressions as mixtures of normal distributions. Again, we use a
numeric solver to find the roots of x → f (x; ±δ0 ).
As the implied mean-forecast, m1 (F ) = μ + 2δϕ(0), agrees with the implied
mean-forecast of the unfocused forecast with η = (8/π)1/2 δ, the terms in the
score decomposition (15) with S(x, y) = (x − y)2 derive from the respective
terms in the score decomposition for the unfocused forecast, as illustrated in
Figure 5.

A.3. Piecewise uniform forecast

Given any fixed index i ∈ {1, 2, 3}, let the tuple (p1(i), p2(i), p3(i); q1(i), q2(i), q3(i)) attain the value (1/2, 1/4, 1/4; 5/10, 1/10, 4/10) if i = 1, the value (1/4, 1/2, 1/4; 1/10, 8/10, 1/10) if i = 2, and the value (1/4, 1/4, 1/2; 4/10, 1/10, 5/10) if i = 3. Furthermore, let Pi be the CDF of a mixture of uniform measures on [0, 1], [1, 2], and [2, 3] with weights p1(i), p2(i), and p3(i), respectively. Similarly, let Qi be the CDF of the respective mixture with weights q1(i), q2(i), and q3(i), respectively.
The random CDF for the piecewise uniform forecast in Example 2.4 can
then be written as F (x) = Pι (x − μ), where the random variables ι and μ are
independent, and the integer-valued variable ι is such that

    (p1(ι), p2(ι), p3(ι); q1(ι), q2(ι), q3(ι)) = (p1, p2, p3; q1, q2, q3).

The conditional CDF for the outcome Y given the posited (non) exceedance probability F(t) at any fixed threshold t ∈ R or, equivalently, given the quantile forecast F⁻¹(α) at any fixed level α ∈ (0, 1) then is

    Q(Y ≤ y | F(t) = α) = Q(Y ≤ y | F⁻¹(α) = t) = Q(Y ≤ y | μ = t − Pι⁻¹(α))
        = (∑_{i=1,2,3} ϕ((t − Pi⁻¹(α))/c))⁻¹ ∑_{i=1,2,3} ϕ((t − Pi⁻¹(α))/c) Qi(y − (t − Pi⁻¹(α))),
where c is the standard deviation of μ, as defined in Example 2.4. The first moment of F is m1(F) = μ + 1 + (1/4)ι, so that

    Q(Y ≤ y | m1(F) = m) = Q(Y ≤ y | μ = m − 1 − (1/4)ι)
        = (∑_{i=1,2,3} ϕ((m − 1 − (1/4)i)/c))⁻¹ ∑_{i=1,2,3} ϕ((m − 1 − (1/4)i)/c) Qi(y − (m − 1 − (1/4)i))

is a mixture of shifted versions of Q1, Q2, and Q3. The associated first moment is the respective mixture of m + 3/20, m, and m − 3/20.
Given any integer k ≥ 0, let βk = ∑_{j=1,2,3} (j^(k+1) − (j − 1)^(k+1)) pj(ι). The second moment of F is m2(F) = μ² + β1μ + (1/3)β2, whence

    Q(Y ≤ y | m2(F) = m) = Q(Y ≤ y | μ = −(1/2)β1 ± √((1/4)β1² − (1/3)β2 + m))

also admits an expression in terms of mixtures of shifted versions of Q1, Q2, and Q3. Finally, the third moment of F is m3(F) = μ³ + (3/2)β1μ² + β2μ + (1/4)β3 = f(μ; ι), so that the conditional distribution Q(Y ≤ y | m3(F) = m) = Q(Y ≤ y | f(μ; ι) = m) and the associated third moment can be computed analogously.

A.4. Identification functions, unconditional calibration, and canonical loss

In this section, we demonstrate that Definitions 2.9 and 2.24 are unambiguous
and do not depend on the choice of the identification function, which is essen-
tially unique. To this end, we first contrast the notions of identification functions
in Fissler and Ziegel (2016) and Jordan, Mühlemann and Ziegel (2022). Fissler
and Ziegel (2016) call V : R×R → R a (strict F-)identification function if V (x, ·)
is integrable with respect to all F ∈ F for all x ∈ R. Jordan, Mühlemann and
Ziegel (2022) additionally require V to be increasing and left-continuous in its
first argument. Furthermore, there is a subtle difference in the way that the func-
tional is induced. While Fissler and Ziegel (2016) define the induced functional
as the set

    T0(F) = { x ∈ R : ∫ V(x, y) dF(y) = 0 },
Jordan, Mühlemann and Ziegel (2022) define it to be the closed interval T(F) = [T−(F), T+(F)], where T−(F) and T+(F) are defined in (7) and (8), respec-
tively. The approach by Jordan, Mühlemann and Ziegel (2022) allows for quan-
tiles to be treated in full generality and ensures that the interval T(F ) coincides
with the closure of T0 (F ) if the latter is nonempty.
In the setting of Fissler and Ziegel (2016), if V is an identification function,
then so is (x, y) → h(x)V (x, y) whenever h(x) = 0 for all x ∈ R. If the class F is
sufficiently rich, then any two locally bounded identification functions V and V
that induce a functional T0 of singleton type relate to each other in the stated
form almost everywhere on the interior of T0 (F) × R (Dimitriadis, Fissler and
Ziegel, 2023, Theorem 4), which implies that increasing identification functions
of prediction error form are unique up to a positive constant. The following
proposition provides an elementary proof under slightly different conditions that
are tailored to our setting. Notably, identification functions of prediction error
form induce functionals that are equivariant under translation by Proposition 4.7
of Fissler and Ziegel (2019), a result which can easily be transferred to the setting
of Jordan, Mühlemann and Ziegel (2022).
Proposition A.1. Let F be a convex class of probability measures such that
δy ∈ F for all y ∈ R. If the functional T is induced on F by an identification
function V (x, y) = v(x − y) of prediction error form, where v is increasing and
left-continuous with v(−r) < 0 and v(r) > 0 for some r > 0, then any other
identification function of the stated form that induces T is a positive multiple
of V .
Proof. Let V : (x, y) → v(x − y) and Ṽ : (x, y) → ṽ(x − y) induce the functionals T and T̃, respectively. We proceed to show that T = T̃ implies ṽ = h0 · v for some constant h0 > 0.
To this end, suppose that T = T̃, and let

    r− = sup{r : v(r) < 0} = T−(δ0) = T̃−(δ0) = sup{r : ṽ(r) < 0} > −∞,
    r+ = inf{r : v(r) > 0} = T+(δ0) = T̃+(δ0) = inf{r : ṽ(r) > 0} < ∞.

By left-continuity and monotonicity of v and ṽ, it follows that v(r) = ṽ(r) = 0 for r ∈ (r−, r+], v(r) < 0 and ṽ(r) < 0 for r < r−, and v(r) > 0 and ṽ(r) > 0 for r > r+.
Let h(r) = ṽ(r)/v(r) > 0 for r ∈ R \ [r−, r+]. If r < r− ≤ r+ < s, then ṽ(r) = h(r)v(r) < 0 and ṽ(s) = h(s)v(s) > 0. Assume h(r) < h(s), and let p ∈ (0, 1) be such that

    (1 − h(s)v(s)/(h(r)v(r)))⁻¹ < p < (1 − v(s)/v(r))⁻¹.

Then (1 − p)h(r)v(r) + ph(s)v(s) > 0 > (1 − p)v(r) + pv(s) and T̃+(pδ−s + (1 − p)δ−r) < 0 ≤ T−(pδ−s + (1 − p)δ−r), a contradiction. An analogous argument applies if we assume that h(r) > h(s), and we conclude that h(r) = h(s).
If r, s < r−, then h(r) = h(s) = h(t) for any t > r+ by the above line of reasoning. An analogous argument yields h(r) = h(s) for r, s > r+. Therefore, the function h is constant and ṽ(r) = h0 · v(r) for a constant h0 > 0 and all r ∈ R \ {r−}. Finally, we obtain ṽ(r−) = limr↑r− ṽ(r) = limr↑r− h0 · v(r) = h0 · v(r−) by left-continuity.
Hence, if we assume an identification function of type (i) in Assumption 2.8,
Definitions 2.9 and 2.24 do not depend on the choice of the identification func-
tion, as it is unique up to a positive constant. Trivially, the same holds true for
type (ii). To complete the argument that the definitions are unambiguous, the
following technical argument is needed.
Remark A.2. If a functional T of singleton type is identified by both an identification function V (x, y) = v(x − y) of type (i) and an identification function Ṽ(x, y) = x − T(δy ) of type (ii), then Ṽ is also of type (i). To confirm this claim, let z denote the unique value at which the sign of v changes, and note that z = T(δy ) − y for all y since V induces the functional T for each Dirac measure δy . Hence, T(δy ) = y + z and Ṽ(x, y) = x − y − z is of type (i).
We close this section with comments on the role of the class F. As expressed
by Assumption 2.8, we prefer to work with identification functions that elicit
the target functional T on a large, convex class F of probability measures, to
avoid unnecessary constraints on forecast(er)s. Furthermore, when evaluating
stand-alone point forecasts, the underlying predictive distributions typically are
implicit, and assumptions other than the existence of the functional at hand are
unwarranted and contradict the prequential principle. Evidently, if the class F
is sufficiently restricted, additional identification functions arise. For example,
the piecewise constant identification function associated with the median can
be used to identify the mean within any class of symmetric distributions.

A.5. Probabilistic calibration and unconditional quantile calibration

As noted in the main text, probabilistic calibration implies the unconditional
α-quantile calibration condition (10) at every level α ∈ (0, 1). To verify this
implication, it suffices to note that if F is probabilistically calibrated, then
α = Q(ZF ≤ α) ≤ Q(F (Y −) ≤ α) = Q(Y ≤ qα− (F )) and 1 − α = Q(ZF > α) ≤
Q(F (Y ) > α) = Q(Y ≥ qα+ (F )). As Example 2.14(b) demonstrates, the reverse
implication does not hold in general. However, Assumption 2.15 ensures the
equivalence of probabilistic calibration and unconditional α-quantile calibration
at every level α ∈ (0, 1).

A.6. Counterexample (Proposition 2.17(a))

As pointed out by Sahoo et al. (2021, p. 5), strong threshold calibration does not
imply auto-calibration. Here, we provide a simple example illustrating this fact
as Sahoo et al. (2021) do not present one. The example is similar in spirit to the
Fig 10. Same as the lower row of Figure 4 but with displays on original (rather than root-
transformed) scales: Moment reliability diagrams for point forecasts induced by (left) the
unfocused forecast with η0 = 1.5 and (middle) the lopsided forecast with δ0 = 0.7 from
Example 2.2, and (right) the piecewise uniform forecast with c = 0.5 from Example 2.4.

continuous forecast of Example 2.14(a) (as c → 0) but with strictly increasing distribution functions satisfying Assumption 2.15.
Let F be a mixture of uniform distributions on the intervals [0, 1], [1, 2], [2, 3],
and [3, 4] with weights p1 , p2 , p3 , and p4 , respectively, and let Y be from a
mixture with weights q1 , q2 , q3 , and q4 . Furthermore, let the tuple (p1 , p2 , p3 , p4 ;
q1 , q2 , q3 , q4 ) attain each of the values

    (4/10, 1/10, 4/10, 1/10; 16/25, 4/25, 4/25, 1/25),   (1/10, 4/10, 1/10, 4/10; 4/25, 16/25, 1/25, 4/25),
    (4/10, 1/10, 1/10, 4/10; 4/25, 1/25, 4/25, 16/25),   (1/10, 4/10, 4/10, 1/10; 1/25, 4/25, 16/25, 4/25)
with equal probability. The equal average of the distribution of the PIT condi-
tional on either forecast from the top row, and either forecast from the bottom
row, is uniform. As any nontrivial conditioning in terms of a threshold yields
a combination of two forecast cases, one from the top row and one from the
bottom row, the forecast F is strongly threshold calibrated.

A.7. Remarks on Figure 4

The root transforms in the moment reliability diagrams in the bottom row
of Figure 4 bring the first, second, and third moment to the same scale. The
peculiar dent in the reliability curve for the (third root of the) third moment of
the piecewise uniform forecast results from the transform, which magnifies small
deviations between x = m3 (F ) and xrc when x is close to zero. For comparison,
Figure 10 shows moment reliability diagrams for all three forecasts without
applying the root transform.

A.8. Counterexample (Theorem 2.26)

The statement in Theorem 2.26 does not hold under consistent scoring functions in general. For a counterexample, consider the empirical distribution of (x1 , y1 ), . . . , (xn , yn ), where xi = i and yi = xi + 10/9 for i = 1, . . . , 9, and x10 = y10 = −10. The respective mean-forecast X fails to be unconditionally mean calibrated, whereas the shifted version X + 1 is unconditionally mean calibrated. Nonetheless, the expected elementary score (16) for the mean functional (i.e., V(x, y) = x − y) with index η = −19/2 increases when X gets replaced with X + 1.

Appendix B: Consistency resamples and calibration tests


Monte Carlo based consistency bands for T-reliability diagrams can be gener-
ated from resamples, at any desired nominal level. The consistency bands then
show the pointwise range of the resampled calibration curves. For now, let us
assume that we have data (x1 , y1 ), . . . , (xn , yn ) of the form (24) along with m
resamples at hand, and defer the critical question of how to generate the resam-
ples.
Algorithm 2: Consistency bands for T-reliability curves based on resamples
  Input: resamples (x1 , y1(j) ), . . . , (xn , yn(j) ) for j = 1, . . . , m
  Output: α × 100% consistency band
  for j ∈ {1, . . . , m} do
      apply Algorithm 1 to obtain x̂1(j) , . . . , x̂n(j) from (x1 , y1(j) ), . . . , (xn , yn(j) )
  end
  for i ∈ {1, . . . , n} do
      let li and ui be the empirical quantiles of x̂i(1) , . . . , x̂i(m) at level α/2 and 1 − α/2
  end
  interpolate the point sets (x1 , l1 ), . . . , (xn , ln ) and (x1 , u1 ), . . . , (xn , un ) linearly, to obtain the lower and upper bound of the consistency band, respectively
Complementary to consistency bands, tests for the assumed type of calibra-
tion, as quantified by the functional T and a generic miscalibration measure
MCB, can be performed as usual. Specifically, we compute MCBj for each re-
sample j = 1, . . . , m, and, if r of the resampled measures MCB1 , . . . , MCBm are
less than or equal to the miscalibration measure computed from the original
data, we declare a Monte Carlo p-value of 1 − r/(m + 1).
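Assuming resamples generated by Algorithm 3 or 4 below and the hypothetical pav_T helper from Section 3.1, a direct R transcription of Algorithm 2 and of the Monte Carlo p-value just described might read as follows; the function names are ours, and the band level is specified as the central coverage.

# Sketch of Algorithm 2: pointwise consistency bands from a list of resampled outcomes.
consistency_band <- function(x, resamples, functional = mean, level = 0.90) {
  ord <- order(x); x <- x[ord]
  recal <- sapply(resamples, function(ystar) pav_T(x, ystar[ord], functional))
  list(x     = x,
       lower = apply(recal, 1, quantile, probs = (1 - level) / 2),
       upper = apply(recal, 1, quantile, probs = (1 + level) / 2))
}
# Monte Carlo p-value for a calibration test based on a miscalibration measure MCB.
mc_pvalue <- function(mcb_obs, mcb_resampled) {
  1 - sum(mcb_resampled <= mcb_obs) / (length(mcb_resampled) + 1)
}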

B.1. Consistency resamples under the hypothesis of auto-calibration

When working with original data of the form (23), we can generate resamples
under the hypothesis of auto-calibration in the obvious way, as follows.
Algorithm 3: Consistency resamples under the hypothesis of auto-calibration
  Input: (F1 , y1 ), . . . , (Fn , yn )
  Output: resamples (x1 , y1(j) ), . . . , (xn , yn(j) ) for j = 1, . . . , m
  for i ∈ {1, . . . , n} do
      let xi = T(Fi )
  end
  for j ∈ {1, . . . , m} do
      for i ∈ {1, . . . , n} do
          sample yi(j) from Fi
      end
  end

As noted, in the case of threshold calibration, the induced outcome is binary, whence the assumptions of auto-calibration and T-calibration coincide.
For other types of functionals, auto-calibration is a strictly stronger assumption
than T-calibration, and it is important to note that the resulting inferential
procedures may be confounded by forecast attributes other than T-calibration.
For illustration, let us return to the setting of Example 2.1 and suppose that,
conditionally on a standard normal variate μ, the outcome Y is normal with
mean μ and variance 1. Given any fixed σ > 0, the forecast Fσ = N (μ, σ 2 ) is
auto-calibrated if, and only if, σ = 1. However, if T is the mean or median func-
tional, then Fσ is T-calibrated under any σ > 0. Clearly, if we use Algorithm 3
to generate resamples, then the consistency bands generated by Algorithm 2
might be misleading with regard to the assessment of T-calibration. For exam-
ple, if σ < 1 the confidence bands tend to be narrow and might erroneously
suggest a lack of T-calibration, despite the forecast being T-calibrated.

B.2. Consistency resamples under the hypothesis of T-calibration

The issues just described call for an alternative to Algorithm 3. Residual-based approaches can be used to generate resamples under the weaker hypothesis of
T-calibration. In developing such a method, we restrict the discussion to single-
valued functionals T under which yi = T(δi ), which covers all cases of key
interest, such as the mean functional, lower or upper quantiles, and expectiles.
As is standard in regression diagnostics, residual-based approaches operate on
the basis of tuples (x1 , y1 ), . . . , (xn , yn ) of the form (24) under the assumptions
of independence between the point forecast, xi , and the residual, yi − xi , and
exchangeability of the residuals. For a discussion in the context of backtests in
banking regulation, see Example 3 of Nolde and Ziegel (2017).
Interestingly, Theorem 2.21 demonstrates that under these assumptions a
forecast is conditionally T-calibrated if, and only if, it is unconditionally T-
calibrated. Thus, we draw resamples in a two-stage procedure. First, we find
the constant c from Theorem 2.27 such that the empirical distribution of (x1 +
c, y1 ), . . . , (xn + c, yn ) or, equivalently, (x1 , y1 − c), . . . , (xn , yn − c), is uncon-
ditionally T-calibrated, and then we resample from the respective residuals, as follows.
Algorithm 4: Consistency resamples under the joint hypothesis of T-calibration and independence between point forecasts and residuals
  Input: (x1 , y1 ), . . . , (xn , yn )
  Output: resamples (x1 , y1(j) ), . . . , (xn , yn(j) ) for j = 1, . . . , m
  for i ∈ {1, . . . , n} do
      let ri = yi − xi
  end
  find c such that (x1 + c, y1 ), . . . , (xn + c, yn ) is unconditionally T-calibrated
  for j ∈ {1, . . . , m} do
      sample r̃1 , . . . , r̃n from {r1 , . . . , rn } with replacement
      for i ∈ {1, . . . , n} do
          let yi(j) = xi + r̃i − c
      end
  end
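For the mean functional, the constant c in Algorithm 4 is simply the mean residual, and the algorithm admits the following minimal R transcription (names ours); the resulting list of resampled outcomes can be passed to the consistency band sketch accompanying Algorithm 2.

# Sketch of Algorithm 4 for the mean functional: resample residuals under the joint
# hypothesis of mean calibration and independence of point forecasts and residuals.
resample_T_calibrated <- function(x, y, m = 1000) {
  r <- y - x
  chat <- mean(r)                 # x + chat is unconditionally mean calibrated
  replicate(m, x + sample(r, replace = TRUE) - chat, simplify = FALSE)
}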

As noted in the main text, the consistency bands for the threshold reliabil-
ity diagrams in Figures 6 and 8 have been generated by Algorithms 2 and 4.
This approach is similar to the Monte Carlo technique proposed by Dimitri-
adis, Gneiting and Jordan (2021) that applies in the case of (induced) binary
outcomes (only). However, unlike Dimitriadis, Gneiting and Jordan (2021), we
do not resample the forecasts themselves. To generate consistency bands for
the mean and quantile reliability diagrams in these figures, we apply Algo-
rithm 2 to m = 1000 resamples generated by Algorithm 4. Evidently, this
procedure is crude and relies on classical assumptions. Nonetheless, we believe
that in many practical settings, where visual tools for diagnostic checks of cal-
ibration are sought, the consistency bands thus generated provide useful guid-
ance.
Further methodological development on consistency and confidence bands
needs to be tailored to the specific functional T of interest, and follow-up work on
Monte Carlo techniques and large sample theory is strongly encouraged. Extant
asymptotic theory for nonparametric isotonic regression, as implemented by
Algorithm 1, is available for quantiles and the mean or expectation functional,
as developed and reviewed by Barlow et al. (1972), Casady and Cryer (1976),
Wright (1984), Robertson, Wright and Dykstra (1988), El Barmi and Mukerjee
(2005), and Mösching and Dümbgen (2020). This theory can be leveraged,
though with hurdles: rates of convergence depend on distributional assumptions,
limit distributions involve nuisance parameters that need to be estimated, and
the use of bootstrap methods might be impacted by the issues described by Sen,
Banerjee and Woodroofe (2010).

B.3. Reliability diagrams and consistency bands for probabilistic and marginal calibration

For the classical notions of unconditional calibration in Section 2.2, the CORP
approach does not apply directly, but its spirit can be retained and adapted.
As for probabilistic calibration, the prevalent practice is to plot histograms of
empirical probability integral transform (PIT) values, as proposed by Diebold,
Gunther and Tay (1998), Gneiting, Balabdaoui and Raftery (2007), and Czado,
Gneiting and Held (2009), though this practice is hindered by the necessity for
binning, as analyzed by Heinrich (2021) in the nearly equivalent setting of rank
histograms. The population version of our suggested alternative is the PIT re-
liability diagram, which is simply the graph of the CDF of the PIT ZF in (1).
The PIT reliability diagram coincides with the diagonal in the unit square if,
and only if, F is probabilistically calibrated. For tuples of the form (23) the
empirical PIT reliability diagram shows the empirical CDF of the (potentially
randomized) PIT values. This approach does not require binning and can be
interpreted in much the same way as a PIT histogram: An inverse S-shape corre-
sponds to a U-shape in histograms and indicates underdispersion of the forecast,
as typically encountered in practice. Evidently, this idea is not new and extant
implementations can be found in work by Pinson and Hagedorn (2012) and
Henzi, Ziegel and Gneiting (2021).
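
As a simple illustration (with hypothetical Gaussian predictive distributions; not part of the original replication code), the empirical PIT reliability diagram is just the empirical CDF of the PIT values plotted against the diagonal:

    set.seed(1)
    n   <- 400
    mu  <- rnorm(n)
    y   <- rnorm(n, mean = mu, sd = 1)
    pit <- pnorm(y, mean = mu, sd = 1)   # PIT values; randomization as in (1) is only
                                         # needed for discrete or mixed predictive CDFs
    plot(ecdf(pit), xlim = c(0, 1), main = "PIT reliability diagram",
         xlab = "PIT value", ylab = "empirical CDF")
    abline(0, 1, lty = 2)                # diagonal corresponds to probabilistic calibration
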
As regards marginal calibration, we define the population version of the
marginal reliability diagram as the point set

{(EQ[F(y)], Q(Y ≤ y)) ∈ [0, 1]² : y ∈ R}.

The marginal reliability diagram is concentrated on the diagonal in the unit
square if, and only if, F is marginally calibrated. For tuples of the form (23) the
empirical marginal reliability diagram is a plot of the empirical non-exceedance
probability (NEP) F0(y) = (1/n) Σ_{i=1}^n 1{y ≥ yi} against the average forecast
NEP F̄(y) = (1/n) Σ_{i=1}^n Fi(y) at the unique values y of the outcomes y1, . . . , yn,
and interpolated linearly in between. Of course, this idea is not new either and the
resulting diagram can be interpreted as a P-P plot.
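
A corresponding R sketch for the empirical marginal reliability diagram, again under hypothetical Gaussian predictive distributions and not part of the original replication code, plots the average forecast NEP against the empirical NEP at the observed outcomes:

    set.seed(1)
    n  <- 400
    mu <- rnorm(n)
    y  <- rnorm(n, mean = mu, sd = 1)
    ys <- sort(unique(y))
    F_bar <- sapply(ys, function(t) mean(pnorm(t, mean = mu, sd = 1)))  # average forecast NEP
    F_0   <- sapply(ys, function(t) mean(y <= t))                       # empirical NEP
    plot(F_bar, F_0, type = "l", xlim = c(0, 1), ylim = c(0, 1),
         xlab = "average forecast NEP", ylab = "empirical NEP")
    abline(0, 1, lty = 2)   # diagonal corresponds to marginal calibration
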
For marginal calibration diagrams, we obtain consistency bands under the
assumption of marginal calibration by drawing resamples y1^(j), . . . , yn^(j) from
F̄ = (1/n) Σ_{i=1}^n Fi, computing the respective marginal reliability curve, and
repeating over Monte Carlo replicates j = 1, . . . , m. Then we find consistency
bands in the spirit of Algorithm 2. For PIT reliability diagrams, a trivial tech-
nique applies as we may obtain consistency bands under the assumption of
probabilistic calibration by (re)sampling n independent standard uniform vari-
ates, computing the respective empirical CDF, and repeating over Monte Carlo
replicates. Evidently, there are alternatives based on empirical process theory
(Shorack and Wellner, 2009).
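
A sketch of the trivial technique for PIT reliability diagrams follows; the function name is ours, and pointwise quantiles are used for simplicity, whereas Algorithm 2 may construct the bands differently:

    pit_consistency_bands <- function(n, m = 1000, grid = seq(0, 1, 0.01), level = 0.9) {
      # each column holds the empirical CDF of n standard uniforms, evaluated on the grid
      sims  <- replicate(m, ecdf(runif(n))(grid))
      alpha <- (1 - level) / 2
      list(grid  = grid,
           lower = apply(sims, 1, quantile, probs = alpha),
           upper = apply(sims, 1, quantile, probs = 1 - alpha))
    }
    bands <- pit_consistency_bands(n = 400)   # 90% bands for a sample of size 400
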
Figure 11 illustrates PIT and marginal reliability diagrams on our customary
examples, along with 90% consistency bands based on m = 1000 Monte Carlo
replicates.

Fig 11. PIT (top) and marginal (bottom) reliability diagrams for the perfect (left), unfocused
(middle), and lopsided (right) forecast from Examples 2.1 and 2.2, along with 90% consistency
bands based on samples of size 400.

B.4. Testing hypotheses of calibration

While the explicit development of calibration tests exceeds the scope of our pa-
per, we believe that the results and discussion in Section 2 convey an important
general message: It is critical that the assessed notion of calibration be carefully
and explicitly specified. Throughout, we consider tests under the assumption of
independent, identically distributed data from a population. For extensions to
dependent samples, we refer to Strähl and Ziegel (2017), who generalized the
prediction space concept to allow for serial dependence, and point at methods
introduced by, e.g., Corradi and Swanson (2007), Knüppel (2015), and Bröcker
and Ben Bouallègue (2020).
The most basic case is that of tuples (x1 , y1 ), . . . , (xn , yn ) of the form (24),
where implicitly or explicitly xi = T(Fi ) for a single-valued functional T. We
first discuss tests of unconditional calibration. If the simplified condition (11) is
sufficient, a two-sided t-test based on v = (1/n) Σ_{i=1}^n V(xi, yi) can be used to test
for unconditional calibration. In the general case, two one-sided t-tests can be
used along with a Bonferroni correction. In the special case of quantiles, there
is no need to resort to the approximate t-tests, and exact binomial tests can be
used instead. Essentially, this special case is the setting of backtests for value-
at-risk reports in banking regulation, for which we refer to Nolde and Ziegel
(2017, Sections 2.1–2.2).
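
For illustration, minimal R versions of these tests might read as follows; the function names are ours, and V(x, y) = x − y for the mean and V(x, y) = 1{y ≤ x} − α for the α-quantile are the standard identification functions assumed here:

    # Two-sided t-test of unconditional mean calibration; H0: E[x - Y] = 0
    test_mean_calibration <- function(x, y) t.test(x - y, mu = 0)

    # Exact binomial test of unconditional calibration for an alpha-quantile forecast;
    # H0: P(Y <= x) = alpha
    test_quantile_calibration <- function(x, y, alpha)
      binom.test(sum(y <= x), length(y), p = alpha)
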
As noted earlier in the section, resamples generated under the hypothesis of
conditional T-calibration can readily be used to perform Monte Carlo tests for
the respective hypothesis, based on CORP score components that are computed
on the surrogate data. Alternatively, one might leverage extant large sample
theory for nonparametric isotonic regression (Barlow et al., 1972; Casady and
Cryer, 1976; Wright, 1984; Robertson, Wright and Dykstra, 1988; El Barmi and
Mukerjee, 2005; Mösching and Dümbgen, 2020). Independently of the use of
resampling or asymptotic theory, CORP based tests avoid the issues and in-
stabilities incurred by binning (Dimitriadis, Gneiting and Jordan, 2021, Section
S2) and may simultaneously improve efficiency and stability. In passing, we hint
at relations to the null hypothesis of Mincer-Zarnowitz regression (Krüger and
Ziegel, 2021) and tests of predictive content (Galbraith, 2003; Breitung and
Knüppel, 2021).
We move on to the case of fully specified distributions, where we work with
tuples (F1 , y1 ), . . . , (Fn , yn ) of the form (23), where Fi is a posited conditional
CDF for yi (i = 1, . . . , n). Tests for probabilistic calibration then amount to tests
for the uniformity of the (potentially randomized) PIT values. Wallis (2003) and
Wilks (2019, p. 769) suggest chi-square tests for this purpose, which depend on
binning, and thus are subject to the aforementioned instabilities. To avoid bin-
ning, we recommend the use of test statistics that operate on the empirical
CDF of the PIT values, such as the classical Kolmogorov–Smirnov statistic,
as suggested and used to test for PIT calibration by Noceti, Smith and Hodges
(2003) and Knüppel (2015), or, more generally, tests based on distance measures
between the empirical CDF of the PIT values, and the CDF of the standard
uniform distribution that arises under the hypothesis of probabilistic calibra-
tion. Recently proposed alternatives arise via e-values (Henzi and Ziegel, 2022).
Similarly, tests for marginal calibration can be based on resamples and distance
measures between F̄ and F0 , or leverage asymptotic theory.
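
As a minimal illustration of a binning-free test of probabilistic calibration, the classical Kolmogorov–Smirnov test can be applied to the PIT values directly; the wrapper below is a sketch, with pit assumed to be a numeric vector of (potentially randomized) PIT values:

    # H0: the PIT values are standard uniform, i.e., the forecast is probabilistically calibrated
    test_probabilistic_calibration <- function(pit) ks.test(pit, "punif")
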
In the distributional setting, arbitrarily many types of reliability can be
tested for, and all of the aforementioned tests for unconditional or conditional
T-calibration apply. Multiple testing needs to be accounted for properly, and
the development of simultaneous tests for various types of calibration would be
useful. In this context, let us recall from Theorem 2.16 that, subject to technical
conditions, CEP, threshold, and quantile calibration are equivalent, so that tests
for CEP calibration (Held, Rufibach and Balabdaoui, 2010; Strähl and Ziegel,
2017), quantile calibration, and threshold calibration assess identical hypotheses.

Appendix C: Time series settings and the Bank of England example

In typical time series settings, as exemplified by our analysis of Bank of England
forecasts in Section 3, the assumption of independent replicates of forecasts
and observations is too restrictive. While the diagnostic methods proposed in
our paper continue to apply, statistical inference requires care, as discussed by
Corradi and Swanson (2007) and Knüppel (2015), among other authors. Here,
we elucidate the role of uniform and independent probability integral transform
(PIT) values for calibration in time series settings, and give further details and
results for the Bank of England example.

C.1. The role of uniform and independent PITs

In a landmark paper, Diebold, Gunther and Tay (1998, p. 867) showed that a se-
quence of continuous predictive distributions Ft for a sequence Yt of observations
at time t = 0, 1, . . . results in a sequence of independent, uniformly distributed
PITs if Ft is ideal relative to the σ-algebra generated by past observations,
At = σ(Y0 , Y1 , . . . , Yt−1 ). This property does not depend on the continuity of
Ft and continues to hold under general predictive CDFs and the randomized
definition (1) of the PIT (Rüschendorf and de Valk, 1993, Theorem 3).
In the case of continuous predictive distributions, Tsyplakov (2011, Section 2)
noted without proof that if the forecasts Ft are based only on past observations,
i.e., if Ft is At -measurable, then the converse holds, namely, uniform and in-
dependent PITs arise only if Ft is ideal relative to At . The following result
formalizes Tsyplakov’s claim and proves it in the general setting, without any
assumption of continuity.
Theorem C.1. Let (Yt )t=0,1,... be a sequence of random variables, and let At =
σ(Y0 , . . . , Yt−1 ) for t = 0, 1, . . . . Furthermore, let (Ft )t=0,1,... be a sequence of
CDFs, such that Ft is At -measurable for t = 0, 1, . . . , and let (Ut )t=0,1,... be a
sequence of independent, uniformly distributed random variables, independent
of the sequence (Yt ). Then the sequence of randomized PITs, (Zt ) = (Ft (Yt −) +
Ut (Ft (Yt ) − Ft (Yt −))) is an independent sequence of uniform random variables
on the unit interval if, and only if, Ft is ideal relative to At , i.e., Ft = L(Yt | At )
almost surely for t = 0, 1, . . . .
The proof utilizes the following simple lemma.
Lemma C.2. Let X, Y, Z be random variables. If X = Z almost surely, then
E[Y | X] = E[Y | Z] almost surely.
Proof. Problem 14 of Breiman (1992, Chapter 4), which is proved by Schmidt
(2011, Satz 18.2.10), states that for random variables X1 and X2 such that
σ(Y, X1 ) is independent of σ(X2 ), E[Y | X1 , X2 ] = E[Y | X1 ] almost surely.
The statement of the lemma follows as E[Y | X] = E[Y | X, X − Z] = E[Y |
Z, X − Z] = E[Y | Z] almost surely, where the outer equalities hold by this result,
since X − Z = 0 almost surely, and the middle equality holds because σ(X, X − Z) =
σ(Z, X − Z).
Proof of Theorem C.1. Since Ft is measurable with respect to At, there exists
a measurable function ft : R^t → F such that Ft = ft(Y0, . . . , Yt−1) for each t by
the Doob–Dynkin Lemma (Schmidt, 2011, Satz 7.1.16).²

² Note that f0 is constant, and ft is not a random quantity but a fixed function that encodes
how the predictive distributions are generated from past observations. The σ-algebra on F,
which is implicitly used throughout, is given by

AF = σ({{F ∈ F : F(x) ∈ B} : x ∈ Q, B ∈ B(R)}),

where B(R) denotes the Borel σ-algebra on R. For each x ∈ Q there exists a measurable
function fx,t such that Ft(x) = fx,t(Y0, . . . , Yt−1) by the Doob–Dynkin Lemma, and ft is
essentially the countable (and hence measurable) collection (fx,t)x∈Q.

We define

Gt := ft(G0⁻¹(Z0), . . . , Gt−1⁻¹(Zt−1))

recursively for all t, and show the “only if” assertion by induction.
To this end, let t ≥ 0 and assume the induction hypothesis that Fi is ideal
relative to Ai for i = 0, . . . , t − 1. By Rüschendorf and de Valk (1993, Theorem
3(a)) and the construction of Gt , the induction hypothesis implies
(Y0, . . . , Yt−1) = (F0⁻¹(Z0), . . . , Ft−1⁻¹(Zt−1)) = (G0⁻¹(Z0), . . . , Gt−1⁻¹(Zt−1))

almost surely, where the last vector is σ(Z0 , . . . , Zt−1 )-measurable. By Lem-
ma C.2, it follows that
L(Zt | At) = L(Zt | σ(G0⁻¹(Z0), . . . , Gt−1⁻¹(Zt−1))) = U([0, 1])

almost surely, where the second equality stems from the fact that Zt is inde-
pendent of σ(G0⁻¹(Z0), . . . , Gt−1⁻¹(Zt−1)) ⊂ σ(Z0, . . . , Zt−1). This independence
implies that Ft is ideal relative to At because
Ft (y) = Q(Zt < Ft (y) | At ) ≤ Q(Yt ≤ y | At ) ≤ Q(Zt ≤ Ft (y) | At ) = Ft (y)
almost surely, and hence Ft (y) = Q(Yt ≤ y | At ) almost surely for all y ∈ Q,
thereby completing both the induction step and the claim for the base case
t = 0.
Evidently, the assumption that no information other than the history of the
time series itself has been utilized to construct the forecasts is very limiting. In
this light, it is not surprising that, while the “if” part of Theorem C.1 is robust,
the “only if” claim fails if Ft is allowed to use information beyond the canonical
filtration, even if that information is uninformative. A simple counterexample
is given by the unfocused forecast from Example 2.2, which is probabilistically
calibrated but fails to be auto-calibrated. Its PITs are nevertheless uniform and
independent, even for autoregressive variants (Tsyplakov, 2011, Section 6).
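
The following hypothetical AR(1) simulation, which is ours and not part of the original analysis, illustrates both directions of Theorem C.1: the ideal one-step-ahead forecast based on the canonical filtration yields PITs that look uniform and serially independent, whereas the probabilistically calibrated climatological forecast yields uniform but serially dependent PITs.

    set.seed(1)
    n_t <- 5000
    phi <- 0.8
    y   <- as.numeric(arima.sim(list(ar = phi), n = n_t))   # AR(1), unit innovation variance
    mu_ideal <- c(0, phi * y[-n_t])                         # ideal mean: phi * y_{t-1}
    sd_ideal <- c(1 / sqrt(1 - phi^2), rep(1, n_t - 1))     # stationary sd for the first step
    pit_ideal <- pnorm(y, mean = mu_ideal, sd = sd_ideal)          # ideal forecast PITs
    pit_clim  <- pnorm(y, mean = 0, sd = 1 / sqrt(1 - phi^2))      # climatological forecast PITs
    ks.test(pit_ideal, "punif")           # consistent with uniformity
    cor(pit_ideal[-1], pit_ideal[-n_t])   # near zero: serially independent
    cor(pit_clim[-1],  pit_clim[-n_t])    # clearly positive: uniform but dependent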

C.2. Details and further results for the Bank of England example

Bank of England forecasts of inflation rates are available within the data ac-
companying the quarterly Monetary Policy Report (formerly Inflation Report),
which is available online at https://www.bankofengland.co.uk/sitemap/
monetary-policy-report. The forecasts are visualized and communicated in the
form of fan charts that span prediction intervals at increasing forecast horizons,
and derive from two-piece normal forecast distributions. A detailed account of
the parametrizations for the two-piece normal distribution used by the Bank of
England can be found in Julio (2006), and we have implemented the formulas
in this reference. Historical quarterly CPI inflation rates are published by the
UK Office for National Statistics (ONS) and available online at https://www.
ons.gov.uk/economy/inflationandpriceindices/timeseries/d7g7.
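
For reference, a sketch of the two-piece normal CDF in terms of the mode m and the standard deviations s1 (below the mode) and s2 (above the mode) is given below; the mapping from the Bank of England's mode, uncertainty, and skewness parameters to (m, s1, s2) follows Julio (2006) and is not reproduced here, and the function name is ours.

    # CDF of the two-piece normal distribution with mode m and scale parameters
    # s1 (left of the mode) and s2 (right of the mode)
    ptpnorm <- function(q, m, s1, s2) {
      w <- 2 / (s1 + s2)
      ifelse(q <= m,
             w * s1 * pnorm(q, mean = m, sd = s1),
             w * s1 / 2 + w * s2 * (pnorm(q, mean = m, sd = s2) - 0.5))
    }
    # PIT values for the fan chart forecasts are then obtained as ptpnorm(y, m, s1, s2)
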
We consider forecasts of consumer price index (CPI) inflation based on mar-
ket expectations for future interest rates at prediction horizons of zero to six
quarters ahead, valid for the third quarter of 2005 up to the first quarter of 2020,
for a total of n = 59 quarters. These and earlier Bank of England forecasts of in-
flation rates have been checked for reliability by Wallis (2003), who considered
probabilistic calibration, by Clements (2004) in terms of probabilistic, mean,
and threshold calibration, by Galbraith and van Norden (2012), who consid-
ered probabilistic and mean calibration, by Strähl and Ziegel (2017) with focus
on conditional exceedance probability (CEP) calibration, and by Pohle (2020),
who considered quantile calibration. The 2% inflation target is discussed on
the Bank of England website at https://www.bankofengland.co.uk/monetary-
policy/inflation.
Figures 12–17 show calibration diagnostics for inflation forecasts at prediction
horizons of k ∈ {0, 2, 3, 4, 5, 6} quarters ahead, in the same format as Figure 8
in the main text, which concerns forecasts at a lead time of one quarter.

Acknowledgments

The authors would like to thank Sebastian Arnold, Fadoua Balabdaoui-Mohr,


Jonas Brehmer, Frank Diebold, Timo Dimitriadis, Uwe Ehret, Andreas Fink,
Tobias Fissler, Rafael Frongillo, Norbert Henze, Alexander Henzi, Alexander I.
Jordan, Kristof Kraus, Fabian Krüger, Sebastian Lerch, Michael Maier-Gerber,
Anja Mühlemann, Jim Pitman, Marc-Oliver Pohle, Roopesh Ranjan, Benedikt
Schulz, Ville Satopää, Daniel Wolffram and Johanna F. Ziegel, as well as anony-
mous reviewers, for helpful comments and discussion.

Fig 12. Same as Figure 8 in the main text but at a prediction horizon of zero quarters.

Fig 13. Same as Figure 12 but at a prediction horizon of two quarters.

Fig 14. Same as Figure 12 but at a prediction horizon of three quarters.



Fig 15. Same as Figure 12 but at a prediction horizon of four quarters.

Fig 16. Same as Figure 12 but at a prediction horizon of five quarters.



Fig 17. Same as Figure 12 but at a prediction horizon of six quarters.

Funding

Our research has been funded by the Klaus Tschira Foundation. Johannes Resin
gratefully acknowledges support from the German Research Foundation (DFG)
through grant number 502572912.

Supplementary Material

Replication code
(doi: 10.1214/23-EJS2180SUPP; .zip). R code for replication purposes.

References

Allen, S. (2021). Advanced Statistical Post-Processing of Ensemble Weather


Forecasts, PhD thesis, University of Exeter, UK.
Arnold, S. (2020). Isotonic Distributional Approximation, Master’s thesis,
Universität Bern, Switzerland.
Ayer, M., Brunk, H. D., Ewing, G. M., Reid, W. T. and Silvermann, E.
(1955). An empirical distribution function for sampling with incomplete in-
formation. Annals of Mathematical Statistics 26 641–647. MR0073895
Barlow, R. E., Bartholomew, D. J., Bremner, J. M. and Brunk, H. D.


(1972). Statistical Inference Under Order Restrictions: The Theory and Ap-
plication of Isotonic Regression. Wiley, New York. MR0326887
El Barmi, H. and Mukerjee, H. (2005). Inferences under a stochastic order-
ing constraint: The k-sample case. Journal of the American Statistical Asso-
ciation 100 252–261. MR2156835
Bashaykh, H. (2022). Statistical Assessment of Forecast Calibration, PhD the-
sis, University of Exeter, UK.
Bentzien, S. and Friederichs, P. (2014). Decomposition and graphical por-
trayal of the quantile score. Quarterly Journal of the Royal Meteorological
Society 140 1924–1934.
Breiman, L. (1992). Probability, SIAM Classics ed. Society for Industrial and
Applied Mathematics (SIAM), Philadelphia. MR1163370
Breitung, J. and Knüppel, M. (2021). How far can we forecast? Statistical
tests of the predictive content. Journal of Applied Econometrics 36 369–392.
MR4309589
Bröcker, J. (2009). Reliability, sufficiency, and the decomposition of proper
scores. Quarterly Journal of the Royal Meteorological Society 135 1512–1519.
Bröcker, J. and Ben Bouallègue, Z. (2020). Stratified rank histograms for
ensemble forecast verification under serial dependence. Quarterly Journal of
the Royal Meteorological Society 146 1976–1990.
Bröcker, J. and Smith, L. A. (2007). Increasing the reliability of reliability
diagrams. Weather and Forecasting 22 651–661.
Casady, R. J. and Cryer, J. D. (1976). Monotone percentile regression.
Annals of Statistics 4 532–541. MR0403050
Chung, Y., Neiswanger, W., Char, I. and Schneider, J. (2021). Beyond
pinball loss: Quantile methods for calibrated uncertainty quantification. In
Proceedings of the 35th Conference on Neural Information Processing Systems
(NeurIPS).
Clements, M. P. (2004). Evaluating the Bank of England density forecasts of
inflation. The Economic Journal 114 844–866.
Corradi, V. and Swanson, N. R. (2007). Predictive density and conditional
confidence interval accuracy tests. Journal of Econometrics 135 187–228.
MR2328400
Czado, C., Gneiting, T. and Held, L. (2009). Predictive model assessment
for count data. Biometrics 65 1254–1261. MR2756513
Dawid, A. P. (1984). Statistical theory: The prequential approach. Journal of
the Royal Statistical Society Series A 147 278–292. MR0763811
Dawid, A. P. (1986). Probability forecasting. In Encyclopedia of Statistical
Sciences, 7 210–218. Wiley-Interscience.
Dawid, A. P. (2016). Contribution to the discussion of “Of quantiles and ex-
pectiles: Consistent scoring functions, Choquet representations and forecast
rankings” by W. Ehm, T. Gneiting, A. Jordan and F. Krüger. Journal of the
Royal Statistical Society Series B 78 505–562. MR3506792
de Leeuw, J., Hornik, K. and Mair, P. (2009). Isotone optimization in R:
Pool-adjacent-violators algorithm (PAVA) and active set methods. Journal of
Statistical Software 32 1–24.


Diebold, F. X., Gunther, T. A. and Tay, A. S. (1998). Evaluating den-
sity forecasts with applications to financial risk management. International
Economic Review 39 863–883.
Dimitriadis, T., Fissler, T. and Ziegel, J. F. (2023). Osband’s principle
for identification functions. Statistical Papers. In press, https://doi.org/
10.1007/s00362-023-01428-x.
Dimitriadis, T., Gneiting, T. and Jordan, A. I. (2021). Stable reliability
diagrams for probabilistic classifiers. Proceedings of the National Academy of
Sciences of the United States of America 118 e2016191118. MR4275118
Dimitriadis, T. and Jordan, A. I. (2021). reliabilitydiag: Reliability diagrams
using isotonic regression. R package version 0.2.0, https://cran.r-project.
org/package=reliabilitydiag.
Ehm, W. and Ovcharov, E. Y. (2017). Bias-corrected score decomposition
for generalized quantiles. Biometrika 104 473–480. MR3698267
Ehm, W., Gneiting, T., Jordan, A. and Krüger, F. (2016). Of quan-
tiles and expectiles: Consistent scoring functions, Choquet representations
and forecast rankings. Journal of the Royal Statistical Society Series B 78
505–562. MR3506792
Fissler, T. and Holzmann, H. (2022). Measurability of functionals and
of ideal point forecasts. Electronic Journal of Statistics 16 5019–5034.
MR4490414
Fissler, T. and Pesenti, S. M. (2023). Sensitivity measures based on scor-
ing functions. European Journal of Operational Research 307 1408–1423.
MR4543545
Fissler, T. and Ziegel, J. F. (2016). Higher order elicitability and Osband’s
principle. Annals of Statistics 44 1680–1706. MR3519937
Fissler, T. and Ziegel, J. F. (2019). Order-sensitivity and equivariance of
scoring functions. Electronic Journal of Statistics 13 1166–1211. MR3935847
Flach, P. (2012). Machine Learning: The Art and Science of Algorithms that
Make Sense of Data. Cambridge University Press, Cambridge. MR3088204
Galbraith, J. W. (2003). Content horizons for univariate time-series forecasts.
International Journal of Forecasting 19 43–55.
Galbraith, J. W. and van Norden, S. (2012). Assessing gross domestic
product and inflation probability forecasts derived from Bank of England
fan charts. Journal of the Royal Statistical Society Series A 175 713–727.
MR2948371
Gneiting, T. (2011). Making and evaluating point forecasts. Journal of the
American Statistical Association 106 746–762. MR2847988
Gneiting, T., Balabdaoui, F. and Raftery, A. E. (2007). Probabilistic
forecasts, calibration and sharpness. Journal of the Royal Statistical Society
Series B 69 243–268. MR2325275
Gneiting, T. and Katzfuss, M. (2014). Probabilistic forecasting. Annual Re-
view of Statistics and Its Application 1 125–151.
Gneiting, T. and Raftery, A. E. (2007). Strictly proper scoring rules, pre-
diction, and estimation. Journal of the American Statistical Association 102
359–378. MR2345548
Gneiting, T. and Ranjan, R. (2013). Combining predictive distributions.
Electronic Journal of Statistics 7 1747–1782. MR3080409
Gneiting, T., Wolffram, D., Resin, J., Kraus, K., Bracher, J., Dimi-
triadis, T., Hagenmeyer, V., Jordan, A. I., Lerch, S., Phipps, K. and
Schienle, M. (2023). Model diagnostics and forecast evaluation for quantiles.
Annual Review of Statistics and Its Application 10 597–621.
Gneiting, T. and Resin, J., (2023). Supplement to “Regression diagnostics
meets forecast evaluation: conditional calibration, reliability diagrams, and
coefficient of determination”. DOI: 10.1214/23-EJS2180SUPP.
Guntuboyina, A. and Sen, B. (2018). Nonparametric shape-restricted regres-
sion. Statistical Science 33 568–594. MR3881209
Guo, C., Pleiss, G., Sun, Y. and Weinberger, K. Q. (2017). On cali-
bration of modern neural networks. In Proceedings of the 34th International
Conference on Machine Learning (ICML).
Gupta, C., Podkopaev, A. and Ramdas, A. (2020). Distribution-free binary
classification: Prediction sets, confidence intervals and calibration. In Pro-
ceedings of the 34th Conference on Neural Information Processing Systems
(NeurIPS).
Heinrich, C. (2021). On the number of bins in a rank histogram. Quarterly
Journal of the Royal Meteorological Society 147 544–556.
Held, L., Rufibach, K. and Balabdaoui, F. (2010). A score regression ap-
proach to assess calibration of continuous probabilistic predictions. Biometrics
66 1295–1305. MR2758518
Henzi, A., Ziegel, J. F. and Gneiting, T. (2021). Isotonic distributional
regression. Journal of the Royal Statistical Society Series B 83 963–993.
MR4349124
Henzi, A. and Ziegel, J. F. (2022). Valid sequential inference on probability
forecast performance. Biometrika 109 647–663. MR4472840
Holzmann, H. and Eulert, M. (2014). The role of the information set for
forecasting—with applications to risk management. Annals of Applied Statis-
tics 8 595–621. MR3192004
Hothorn, T., Kneib, T. and Bühlmann, P. (2014). Conditional transfor-
mation models. Journal of the Royal Statistical Society Series B 76 3–27.
MR3153931
Huber, P. J. (1964). Robust estimation of a location parameter. Annals of
Mathematical Statistics 35 73–101. MR0161415
Hyndman, R. J. and Koehler, A. B. (2006). Another look at measures of
forecast accuracy. International Journal of Forecasting 22 679–688.
Jolliffe, I. T. and Stephenson, D. B. (2012). Forecast Verification: A Prac-
titioner’s Guide in Atmospheric Science, second ed. Wiley, Chichester.
Jordan, A. I., Mühlemann, A. and Ziegel, J. F. (2022). Characteriz-
ing the optimal solutions to the isotonic regression problem for identifiable
functionals. Annals of the Institute of Statistical Mathematics 74 489–514.
MR4417369
Julio, J. M. (2006). The fan chart: The technical details of the new implemen-
tation. Banco de la República Colombia Bogotá, Borradores de Economía,


468.
Knüppel, M. (2015). Evaluating the calibration of multi-step-ahead density
forecasts using raw moments. Journal of Business & Economic Statistics 33
270–281. MR3337062
Koenker, R. and Bassett, G. (1978). Regression quantiles. Econometrica 46
33–50. MR0474644
Koenker, R. and Machado, J. A. F. (1999). Goodness of fit and related
inference processes for quantile regression. Journal of the American Statistical
Association 94 1296–1310. MR1731491
Krüger, F. and Ziegel, J. F. (2021). Generic conditions for forecast domi-
nance. Journal of Business & Economic Statistics 39 972–983. MR4319685
Kuleshov, V., Fenner, N. and Ermon, S. (2018). Accurate uncertainties for
deep learning using calibrated regression. In Proceedings of the 35th Interna-
tional Conference on Machine Learning (ICML).
Kumar, A., Liang, P. S. and Ma, T. (2019). Verified uncertainty calibra-
tion. In Proceedings of the 33rd Conference on Neural Information Processing
Systems (NeurIPS).
Kvålseth, T. (1985). Cautionary note about R2 . American Statistician 39
279–285.
Levi, D., Gispan, L., Giladi, N. and Fetaya, E. (2022). Evaluating and
calibrating uncertainty prediction in regression tasks. Sensors 22 5540.
Mason, S. J., Galpin, J. S., Goddard, L., Graham, N. E. and Rajart-
nam, B. (2007). Conditional exceedance probabilities. Monthly Weather Re-
view 135 363–372.
Mitchell, S., Potash, E., Barocas, S., D’Amour, A. and Lum, K. (2021).
Algorithmic fairness: Choices, assumptions, and definitions. Annual Review of
Statistics and Its Application 8 141–163. MR4243544
Moriasi, D. N., Arnold, J. G., Van Liew, M. W., Bingner, R. L.,
Harmel, R. D. and Veith, T. L. (2007). Model evaluation guidelines for
systematic quantification of accuracy in watershed simulations. Transactions
of the ASABE 50 885–900.
Mösching, A. and Dümbgen, L. (2020). Monotone least squares and isotonic
quantiles. Electronic Journal of Statistics 14 24–49. MR4047593
Murphy, A. H. (1996). General decomposition of MSE-based skill scores: Mea-
sures of some basic aspects of forecast quality. Monthly Weather Review 124
2353–2369.
Murphy, A. H. and Epstein, E. S. (1989). Skill scores and correlation coef-
ficients in model verification. Monthly Weather Review 117 572–581.
Murphy, A. H. and Winkler, R. L. (1987). A General Framework for Fore-
cast Verification. Monthly Weather Review 115 1330–1338.
Nakagawa, S. and Schielzeth, H. (2013). A general and simple method for
obtaining R2 from generalized linear mixed-effects models. Methods in Ecology
and Evolution 4 133–142.
Nash, J. E. and Sutcliffe, J. V. (1970). River flow forecasting through
conceptual models. Part I – A discussion of principles. Journal of Hydrology
10 282–290.
Nixon, J., Dusenberry, M. W., Zhang, L., Jerfel, G. and Tran, D.
(2019). Measuring calibration in deep learning. In Proceedings of Computer
Vision and Pattern Recognition (CVPR) Conference Workshops.
Noceti, P., Smith, J. and Hodges, S. (2003). An evaluation of tests of dis-
tributional forecasts. Journal of Forecasting 22 447–455.
Nolde, N. and Ziegel, J. F. (2017). Elicitability and backtesting: Per-
spectives for banking regulation. Annals of Applied Statistics 11 1833–1874.
MR3743276
Orjebin, E. (2014). A recursive formula for the moments of a truncated
univariate normal distribution. Working paper, https://people.smp.uq.
edu.au/YoniNazarathy/teaching_projects/studentWork/EricOrjebin_
TruncatedNormalMoments.pdf.
Patton, A. J. (2020). Comparing possibly misspecified forecasts. Journal of
Business & Economic Statistics 38 796–809. MR4154889
Pinson, P. and Hagedorn, R. (2012). Verification of the ECMWF ensem-
ble forecasts of wind speed against analyses and observations. Meteorological
Applications 19 484–500.
Pleiss, G., Raghavan, M., Wu, F., Kleinberg, J. and Weinberger, K. J.
(2017). On fairness and calibration. In Proceedings of the 31st Conference on
Neural Information Processing Systems (NIPS).
Pohle, M. O. (2020). The Murphy decomposition and the calibration-
resolution principle: A new perspective on forecast evaluation. Preprint,
arXiv:2005.01835.
Robertson, T. and Wright, F. T. (1980). Algorithms in order restricted
statistical inference and the Cauchy mean value property. Annals of Statistics
8 645–651. MR0568726
Robertson, T., Wright, F. T. and Dykstra, R. L. (1988). Order Restricted
Statistical Inference. Wiley, Chichester. MR0961262
Roelofs, R., Cain, N., Shlens, J. and Mozer, M. C. (2022). Mitigating
bias in calibration error estimation. In Proceedings of the 25th International
Conference on Artificial Intelligence and Statistics (AISTATS).
Rüschendorf, L. (2009). On the distributional transform, Sklar’s theorem,
and the empirical copula process. Journal of Statistical Planning and Infer-
ence 139 3921–3927. MR2553778
Rüschendorf, L. and de Valk, V. (1993). On regression representations of
stochastic processes. Stochastic Processes and their Applications 46 183–198.
MR1226406
Sahoo, R., Zhao, S., Chen, A. and Ermon, S. (2021). Reliable decisions with
threshold calibration. In Advances in Neural Information Processing Systems.
Satopää, V. and Ungar, L. (2015). Combining and extremizing real-valued
forecasts. Preprint, arXiv:1506.06405.
Satopää, V. A. (2021). Improving the wisdom of crowds with analysis of vari-
ance of predictions of related outcomes. International Journal of Forecasting
37 1728–1747.
Savage, L. J. (1971). Elicitation of personal probabilities and expectations.
Journal of the American Statistical Association 66 783–801. MR0331571


Schmidt, K. D. (2011). Maß und Wahrscheinlichkeit, revised ed. Springer,
Heidelberg. MR3524794
Sen, B., Banerjee, M. and Woodroofe, M. (2010). Inconsistency of
bootstrap: The Grenander estimator. Annals of Statistics 38 1953–1977.
MR2676880
Shorack, G. R. and Wellner, J. A. (2009). Empirical Processes with Ap-
plications to Statistics, SIAM Classics ed. Society for Industrial and Applied
Mathematics (SIAM), Philadelphia. MR3396731
Siegert, S. (2017). Simplifying and generalising Murphy’s Brier score de-
composition. Quarterly Journal of the Royal Meteorological Society 143
1178–1183.
Song, H., Diethe, T., Kull, M. and Flach, P. (2019). Distribution cali-
bration for regression. In Proceedings of the 36th International Conference on
Machine Learning (ICML).
Steinwart, I., Pasin, C., Williamson, R. and Zhang, S. (2014). Elicita-
tion and identification of properties. Journal of Machine Learning Research:
Workshop and Conference Proceedings 35 1–45.
Stodden, V., McNutt, M., Bailey, D. H., Deelman, E., Gil, Y., Han-
son, B., Heroux, M. A., Ioannidis, J. P. A. and Taufer, M. (2016). En-
hancing reproducibility for computational methods. Science 354 1240–1241.
Stoyanov, J. (2000). Krein condition in probabilistic moment problems.
Bernoulli 6 939–949. MR1791909
Strähl, C. and Ziegel, J. (2017). Cross-calibration of probabilistic forecasts.
Electronic Journal of Statistics 11 608–639. MR3619318
Taggart, R. (2022). Point forecasting and forecast evaluation with generalized
Huber loss. Electronic Journal of Statistics 16 201–231. MR4359360
Taylor, K. E. (2001). Summarizing multiple aspects of model performance in
a single diagram. Journal of Geophysical Research 106 7183–7192.
R Core Team (2021). R: A Language and Environment for Statistical Computing.
R Foundation for Statistical Computing, Vienna, Austria, https://
www.R-project.org/.
Tredennick, A. T., Hooker, G., Ellner, S. P. and Adler, P. B. (2021).
A practical guide to selecting models for exploration, inference, and prediction
in ecology. Ecology 102 e03336.
Tsyplakov, A. (2011). Evaluating density forecasts: A comment. Preprint,
http://dx.doi.org/10.2139/ssrn.1907799 .
Tsyplakov, A. (2013). Evaluation of probabilistic forecasts: Proper scoring
rules and moments. Preprint, http://dx.doi.org/10.2139/ssrn.2236605.
Tsyplakov, A. (2014). Theoretical guidelines for a partially informed forecast
examiner. Preprint, https://mpra.ub.uni-muenchen.de/67333/.
Van Calster, B., Nieboer, D., Vergouwe, Y., De Cock, B.,
Pencina, M. J. and Steyerberg, E. W. (2016). A calibration hierarchy for
risk models was defined: From utopia to empirical data. Journal of Clinical
Epidemiology 74 167–176.
van Eeden, C. (1958). Testing and Estimating Ordered Parameters of Proba-
bility Distributions, PhD thesis, University of Amsterdam, Netherlands.


Wallis, K. F. (2003). Chi-squared tests of interval and density forecasts, and
the Bank of England’s fan charts. International Journal of Forecasting 19
165–175.
Wilks, D. S. (2019). Indices of rank histogram flatness and their sampling
properties. Monthly Weather Review 147 763–769.
Wright, F. T. (1984). The asymptotic behavior of monotone regression esti-
mates. Canadian Journal of Statistics 12 229–236.
Yu, B. and Kumbier, K. (2020). Veridical data science. Proceedings of the Na-
tional Academy of Sciences of the United States of America 117 3920–3929.
Zhao, S., Ma, T. and Ermon, S. (2020). Individual calibration with ran-
domized forecasting. In Proceedings of the 37th International Conference on
Machine Learning (ICML).
Ziegel, J. F. (2016). Contribution to the discussion of “Of quantiles and ex-
pectiles: Consistent scoring functions, Choquet representations and forecast
rankings” by W. Ehm, T. Gneiting, A. Jordan and F. Krüger. Journal of the
Royal Statistical Society Series B: Methodological 78 505–562.
