1. Introduction
[Figure 1 diagrams: panel (a) shows the chain CC ⇔ QC ⇔ TC with downward implications to PC and MC; panel (b) shows CC → TC ← QC with implications down to PC and MC, hook arrows marking conjectured implications.]
Fig 1. Preview of key findings in Section 2.3: Hierarchies of calibration (a) for continuous,
strictly increasing cumulative distribution functions (CDFs) with common support and (b)
under minimal conditions, with auto-calibration (AC) being the strongest notion. Conditional
exceedance probability calibration (CC) is a conditional version of probabilistic calibration
(PC), whereas threshold calibration (TC) is a conditional version of marginal calibration
(MC). Quantile calibration (QC) differs from CC and TC in subtle ways. Strong threshold
calibration (STC) is a stronger notion of threshold calibration introduced by Sahoo et al.
(2021) for continuous CDFs. Hook arrows show conjectured implications.
(Ayer et al., 1955) to obtain consistent, optimally binned, reproducible, and PAV-based (CORP) estimates of T-reliability diagrams and score components, along with uncertainty quantification via resampling. In contrast to extant estimators, the CORP approach yields non-decreasing reliability diagrams and guarantees the nonnegativity of the estimated MCB and DSC components. The regularizing constraint of isotonicity avoids artifacts and overfitting. For in-sample model diagnostics, we introduce a generalized coefficient of determination R∗ that links to skill scores and nests both the classical variance explained or R² in least squares regression (Kvålseth, 1985) and its natural analogue R1 in quantile regression (Koenker and Machado, 1999). Subject to modest conditions, R∗ ∈ [0, 1], with values of 0 and 1 indicating uninformative and immaculate fits, respectively.
In forecast evaluation, reliability diagrams and score components serve to
diagnose and quantify performance on test samples. The most prominent case
arises when T is the mean functional and performance is assessed by the mean
squared error (MSE). As a preview of the diagnostic tools developed in this
paper, we assess point forecasts by Tredennick et al. (2021) of (log-transformed)
butterfly population size from a ridge regression and a null model. The CORP
mean reliability diagrams and MSE decompositions in Figure 2 show that, while
both models are reliable, ridge regression enjoys considerably higher discrimination ability.
The paper closes in Section 4, where we discuss our findings and provide a
roadmap for follow-up research. While Dimitriadis, Gneiting and Jordan (2021)
introduced the CORP approach in the nested case of probability forecasts for
binary outcomes, the setting of real-valued outcomes treated in this paper is
far more complex as it necessitates the consideration of statistical functionals in
general. Throughout, we link the traditional case of regression diagnostics and
(stand-alone) point forecast evaluation, where functionals such as conditional
means, moments, quantiles, or expectiles are modeled and predicted, to model
diagnostics and forecast evaluation in the fully distributional setting (Gneiting
and Katzfuss, 2014; Hothorn, Kneib and Bühlmann, 2014). Appendices A–C include material of a more specialized or predominantly technical character.
Fig 2. CORP mean reliability diagrams for point forecasts of (log-transformed) butterfly population size from the null model (left) and ridge regression (right) of Tredennick et al. (2021), along with 90% consistency bands and miscalibration (MCB), discrimination (DSC), and uncertainty (UNC) components of the mean squared error (MSE).
We consider the joint law of a posited distribution and the respective outcome in the technical setting of Gneiting and Ranjan (2013). Specifically, let (Ω, A, Q) be a prediction space, i.e., a probability space where the elements ω ∈ Ω correspond to realizations of the random triple

(F, Y, U),

where F is a CDF-valued random quantity (the predictive distribution), Y is the real-valued outcome, and U is a standard uniform random variable used for randomization. The prediction space setting formalizes the view of forecasts as random quantities that depend on stochastic parameters, and the outcomes as random elements (Murphy and Winkler, 1987; Gneiting, Balabdaoui and Raftery, 2007). In parts of our paper, it suffices to consider a point prediction space, where the elements of the probability space correspond to realizations of the random tuple (X, Y), where X is a point forecast (Ehm et al., 2016, eq. (20)).
Table 1
Key examples of identifiable functionals with associated parameters, identification function,
and generic type. For a similar listing see Table 1 in Jordan, Mühlemann and Ziegel (2022).
Functional | Parameters | Identification function | Type
Threshold (non)exceedance | t ∈ R | V(x, y) = x − 1{y ≤ t} | singleton
Mean | — | V(x, y) = x − y | singleton
Median | — | V(x, y) = 1{y < x} − 1/2 | interval
Moment of order n (mn) | n = 1, 2, . . . | V(x, y) = x − yⁿ | singleton
α-Expectile (eα) | α ∈ (0, 1) | V(x, y) = |1{y < x} − α| (x − y) | singleton
α-Quantile (qα) | α ∈ (0, 1) | V(x, y) = 1{y < x} − α | interval
Huber | α ∈ (0, 1), a, b > 0 | V(x, y) = |1{y < x} − α| κ_{a,b}(x − y) | interval
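To make the role of identification functions concrete, here is a minimal Python sketch (our own illustration; the function names and the synthetic data are not from the paper). It evaluates several identification functions from Table 1 and inspects the sample mean of V(xi, yi), which should be close to zero under unconditional calibration (cf. the t-tests discussed in Appendix B.4).

import numpy as np

# Identification functions from Table 1 (x: forecast value, y: outcome).
def v_mean(x, y):
    return x - y

def v_quantile(x, y, alpha):
    return (y < x).astype(float) - alpha

def v_expectile(x, y, alpha):
    return np.abs((y < x).astype(float) - alpha) * (x - y)

def v_threshold(x, y, t):
    # Here x is a posited (non)exceedance probability F(t).
    return x - (y <= t).astype(float)

# Synthetic check: X is the true conditional mean of Y, so the sample mean of
# v_mean is close to zero, while X is not the 0.75-quantile of Y given X.
rng = np.random.default_rng(1)
x = rng.normal(size=10_000)
y = x + rng.normal(size=10_000)
print(np.mean(v_mean(x, y)))            # close to 0
print(np.mean(v_quantile(x, y, 0.75)))  # close to -0.25, far from 0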
where the lower and upper bounds are given by the random variables

T−(F) = sup{ x : ∫ V(x, y) dF(y) < 0 }   (7)

and

T+(F) = inf{ x : ∫ V(x, y) dF(y) > 0 }.   (8)
Let T(F) = [T−(F), T+(F)] and T(FT) = [T−(FT), T+(FT)], where the boundaries are random variables. The proof of the first part is complete if we can show that T−(FT) = T−(F) and T+(FT) = T+(F).
Let ε > 0. By the definition of T+(F), we know that ∫ V(T+(F), y) dF(y) ≤ 0 and ∫ V(T+(F) + ε, y) dF(y) > 0. Using nested conditional expectations as above, the same inequalities hold almost surely when integrating with respect to FT. Hence, by the definition of T+(FT), we obtain T+(F) ≤ T+(FT) < T+(F) + ε. An analogous argument shows that T−(F) − ε ≤ T−(FT) ≤ T−(F) + ε, which completes the proof of the first part and shows that F is conditionally T-calibrated.
Finally, if F is conditionally T-calibrated, unconditional T-calibration follows
by taking nested expectations in the terms in the defining inequalities.
An analogous result is easily derived for CEP calibration.
Theorem 2.12. Under Assumption 2.6 for quantiles, auto-calibration implies
CEP calibration.
Proof. It holds that

Q(Y ≤ F−1(α) | F−1(α)) = E[ F(F−1(α)) | F−1(α) ]

almost surely for α ∈ (0, 1). As F is a version of L(Y | F), the nested expectation equals α almost surely by Proposition 2.1 of Rüschendorf (2009), which implies CEP calibration.
When evaluating full predictive distributions, it is natural to consider families
of functionals as in the subsequent definition, where part (a) is compatible with
the extant notion in (5).
Definition 2.13. A predictive distribution F is
(a) threshold calibrated if it is conditionally F (t)-calibrated for all t ∈ R;
(b) quantile calibrated if it is conditionally qα -calibrated for all α ∈ (0, 1);
(c) expectile calibrated if it is conditionally eα -calibrated for all α ∈ (0, 1);
(d) moment calibrated if it is conditionally n-th moment calibrated for all
integers n = 1, 2, . . .
While CEP, quantile, and threshold calibration are closely related notions,
they generally are not equivalent. For illustration, we consider predictive CDFs
in the spirit of Example 2.4.
Example 2.14.
(a) Let μ ∼ N (0, c2 ). Conditionally on μ, let F be a mixture of uniform
distributions on the intervals [μ, μ + 1], [μ + 1, μ + 2], [μ + 2, μ + 3], and
[μ + 3, μ + 4] with weights p1 , p2 , p3 , and p4 , respectively, and let Y be
from a mixture with weights q1 , q2 , q3 , and q4 . Furthermore, let the tuple
(p1, p2, p3, p4; q1, q2, q3, q4) attain each of the values

(1/2, 0, 1/2, 0; 3/4, 0, 1/4, 0),   (1/2, 0, 0, 1/2; 1/4, 0, 0, 3/4),
(0, 1/2, 1/2, 0; 0, 1/4, 3/4, 0),   (0, 1/2, 0, 1/2; 0, 3/4, 0, 1/4)

with equal probability. Then the continuous forecast F is threshold calibrated and CEP calibrated but fails to be quantile calibrated.
(b) Let the tuple (p1 , p2 , p3 ; q1 , q2 , q3 ) attain each of the values
(1/2, 1/4, 1/4; 5/10, 4/10, 1/10),   (1/4, 1/2, 1/4; 1/10, 5/10, 4/10),   and   (1/4, 1/4, 1/2; 4/10, 1/10, 5/10)
Fig 3. The equiprobable predictive distribution F picks the piecewise linear, partially (namely,
for y ≤ 2) identical CDFs F1 and F2 with equal probability. It is jointly CEP, quantile, and
threshold calibrated but fails to be auto-calibrated. For a very similar construction see Example
10 of Tsyplakov (2014).
where Assumption 2.15(ii) ensures that the events conditioned on have positive probability. Hence, quantile and threshold calibration are equivalent. The remaining implications are immediate from Theorem 2.5.
We conjecture that the statement of part (b) holds under Assumption 2.15(i) alone but are unaware of a measure-theoretic argument that serves to generalize the discrete reasoning in our proof. As indicated in panel (b) of Figure 1, we also conjecture that CEP or quantile calibration implies threshold calibration in general, though we have not been able to prove this implication, nor can we show that CEP or quantile calibration implies marginal calibration in general.
Strong threshold calibration as defined in (6) implies both CEP and threshold
calibration under Assumption 2.15, by arguments similar to those in the above
proof. The following result thus demonstrates that the hierarchies in panel (a)
and, with the aforementioned exceptions, in panel (b) of Figure 1 are complete,
with the caveat that hierarchies may collapse if the class F is sufficiently small,
as exemplified by Theorem 2.11 of Gneiting and Ranjan (2013).
Proposition 2.17. Even under Assumption 2.15 the following hold:
(a) Strong threshold calibration does not imply auto-calibration.
(b) Joint CEP, quantile, and threshold calibration does not imply strong thresh-
old calibration.
(c) Joint probabilistic and marginal calibration does not imply threshold cali-
bration.
(d) Probabilistic calibration does not imply marginal calibration.
(e) Marginal calibration does not imply probabilistic calibration.
Table 2
Properties of the forecasts in our examples. We note whether they are auto-calibrated (AC),
CEP calibrated (CC), quantile calibrated (QC), threshold calibrated (TC), probabilistically
calibrated (PC), or marginally calibrated (MC), and whether the involved distributions are
continuous and strictly increasing on a common support (CSI) as in Assumption 2.15(i).
Except for the auto-calibrated cases, the forecasts fail to be moment calibrated.
Source | Forecast Type
Example 2.1 | Perfect
Example 2.1 | Unconditional
Figure 3 | Equiprobable
Example 2.4 | Piecewise uniform as c → 0
Example 2.2 | Unfocused
Example 2.2 | Lopsided
Example 2.14 | Continuous
Example 2.14 | Discrete
[Check-mark entries for the CSI, AC, CC, QC, TC, PC, and MC columns not recovered.]
partially overlapping CDFs in Appendix A.6 yields part (a). As for part (c), we
return to the piecewise uniform forecast in Example 2.4, where for simplicity
we fix μ = 0. This forecast is probabilistically and marginally calibrated, but it
fails to be threshold calibrated because

Q(Y ≤ 3/2 | F(3/2) = 5/8) = 5/10 + (1/2) · (1/10) = 11/20 ≠ 5/8.
As for parts (d) and (e), we refer to the unfocused and lopsided forecasts from
Example 2.2 with μ = 0 fixed.
Clearly, further hierarchical relations are immediate. For example, given that probabilistic calibration does not imply marginal calibration, it does not imply threshold calibration or auto-calibration, either. We leave further discussion to future work but note that moment calibration implies neither probabilistic nor marginal calibration, as follows easily from classical results on the moment problem (e.g., Stoyanov, 2000). For an overview of calibration properties in our examples, see Table 2.
Fig 4. Threshold (top), quantile (middle), and moment (bottom) reliability diagrams for point
forecasts induced by (left) the unfocused forecast with η0 = 1.5 and (middle) the lopsided
forecast with δ0 = 0.7 from Example 2.2, and (right) the piecewise uniform forecast with
c = 0.5 from Example 2.4. Each display plots recalibrated against original values. Deviations
from the diagonal indicate violations of T-calibration. For details see Appendix A.
almost surely, a recalibrated version of X. Clearly, we can also define Xrc for a
stand-alone point forecast X, based on conceptualized distributions, by resorting
to the joint distribution of the random tuple (X, Y ), provided the right-hand
side of (12) is well defined and finite almost surely. The point forecast X is
conditionally T-calibrated, or simply T-calibrated, if X = Xrc almost surely.
Subject to Assumption 2.8, X is unconditionally T-calibrated if
For recent discussions of the particular cases of the mean or expectation and quantile functionals see, e.g., Nolde and Ziegel (2017, Sections 2.1–2.2), Patton (2020, Proposition 2), Krüger and Ziegel (2021, Definition 3.1), and Satopää (2021, Section 2).
To compare the posited functional X with its recalibrated version Xrc , we
introduce the T-reliability diagram.
Assumption 2.19. The functional T is a lower or upper version of an identifiable functional, or an identifiable functional of singleton type. The point forecast X is a random variable, and the recalibrated forecast Xrc = T(L(Y | X)) is well defined and finite almost surely.
Definition 2.20. Under Assumption 2.19, the T-reliability diagram is the graph
of a mapping x → T (L(Y | X = x)) on the support of X.
While technically the T-reliability diagram depends on the choice of a regular conditional distribution for the outcome Y, this issue is not a matter of practical relevance. Evidently, for a T-calibrated forecast the T-reliability diagram is concentrated on the diagonal. Conversely, deviations from the diagonal indicate violations of T-calibration and can be interpreted diagnostically, as illustrated in Figure 4 for threshold, quantile, and moment calibration. For a similar display in the specific case of mean calibration see Figure 1 of Pohle (2020).
In the setting of fully specified predictive distributions, the distinction between unconditional and conditional T-calibration is natural. Perhaps surprisingly, the distinction vanishes in the setting of stand-alone point forecasts if the associated identification function is of prediction error form and the forecast and the residual are independent.
Theorem 2.21. Let Assumption 2.19 hold, and suppose that the underlying
identification function V satisfies Assumption 2.8. Suppose furthermore that the
point forecast X and the generalized residual T(δY ) − X are independent. Then
X is conditionally T-calibrated if, and only if, it is unconditionally T-calibrated.
In view of (12) and (13), conditional and unconditional T-calibration are equivalent. In the case of an identification function of type (ii) in Assumption 2.8,
the above arguments continue to hold if we take v to be the identity map x → x
and b = 0.
For quantiles, expectiles, and Huber functionals, the identification function V
is of prediction error form and the generalized residual reduces to the standard
residual, X −Y . In particular, this observation applies in the case of least squares
regression, where T is the mean functional, and the forecast and the residual
have typically been assumed to be independent in the literature. We discuss the
statistical implications of Theorem 2.21 in Appendix B.
EF[S(t, Y)] ≤ EF[S(x, Y)],

and it suffices to consider the joint distribution of the tuple (X, Y). Following the lead of Dawid (1986) in the case of binary outcomes, and Ehm and Ovcharov (2017) and Pohle (2020) in the setting of point forecasts for real-valued outcomes, we consider the expected scores

S̄ = E[S(X, Y)],   S̄rc = E[S(Xrc, Y)],   and   S̄mg = E[S(x0, Y)]   (14)

for the forecast at hand, its recalibrated version, and the marginal reference forecast x0, respectively.
Definition 2.22. Let Assumption 2.19 hold, and let x0 = T(L(Y )) and the
expectations S̄, S̄rc , and S̄mg in (14) be well defined and finite. Then we refer to
Table 3
Canonical loss functions in the sense of Definition 2.24.
Functional | Parameter | Canonical loss
Moment of order n | n = 1, 2, . . . | S(x, y) = (x − yⁿ)²
α-Expectile | α ∈ (0, 1) | S(x, y) = 2 |1{x ≥ y} − α| (x − y)²
α-Quantile | α ∈ (0, 1) | S(x, y) = 2 (1{x ≥ y} − α) (x − y)
Table 4
Components of the decomposition (15) for the mean squared error (MSE) under
mean-forecasts induced by the predictive distributions in Examples 2.1 and 2.2. Uncertainty
(UNC) equals 2 irrespective of the forecast at hand. The term I(η0 ) is in integral form and
can be evaluated numerically. For details see Appendix A.
Predictive distribution | Mean-forecast | MSE | MCB | DSC
Perfect | μ | 1 | 0 | 1
Unconditional | 0 | 2 | 0 | 0
Unfocused | μ + η/2 | 1 + η0²/4 | (1/4 − I(η0)) η0² | 1 − I(η0) η0²
Lopsided | μ + √(2/π) δ | 1 + (2/π) δ0² | (1/4 − I(√(8/π) δ0)) (8/π) δ0² | 1 − I(√(8/π) δ0) (8/π) δ0²
Fig 5. Components of the decomposition (15) for the mean squared error (MSE) under mean-forecasts induced by the unfocused and lopsided predictive distributions from Example 2.2 and
Table 4, as functions of η0 ≥ 0 and δ0 ∈ (0, 1), respectively.
in Table 3. Since any selection incurs the same point forecast ranking, we refer
to the choice in Table 3 as the canonical loss function. The most prominent
example arises when T is the mean functional, where the ubiquitous quadratic
or squared error scoring function,
S(x, y) = (x − y)²,   (19)
is canonical. In this case, the UNC component equals the unconditional variance
of Y as x0 is simply the marginal mean μY of Y , and the MCB and DSC
components of the general score decomposition (15) are
MCB = E[(X − Xrc)²]   and   DSC = E[(Xrc − μY)²],
respectively. Note that here and in the following, we drop the subscript S whenever we use a canonical loss. Table 4 and Figure 5 provide explicit examples.
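As a quick numerical check of these formulas (our own illustration, with a hypothetical joint distribution that is not from the paper), consider X ~ N(0, 1) and Y | X ~ N(aX + b, 1), so that Xrc = aX + b is available in closed form and the identity MSE = MCB − DSC + UNC can be verified by simulation:

import numpy as np

rng = np.random.default_rng(7)
a, b, n = 0.8, 0.3, 1_000_000

x = rng.normal(size=n)                  # point forecast
y = a * x + b + rng.normal(size=n)      # outcome; E[Y | X] = a X + b
x_rc = a * x + b                        # recalibrated forecast, in closed form

mse = np.mean((x - y) ** 2)
mcb = np.mean((x - x_rc) ** 2)          # E(X - Xrc)^2 = (1 - a)^2 + b^2
dsc = np.mean((x_rc - y.mean()) ** 2)   # E(Xrc - mu_Y)^2 = a^2
unc = np.var(y)                         # Var(Y) = a^2 + 1

print(mse, mcb - dsc + unc)             # agree up to Monte Carlo error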
In the nested case of a binary outcome Y , where X and Xrc specify event
probabilities, the quadratic loss function reduces to the Brier score (Gneiting
and Raftery, 2007), and we refer to Dimitriadis, Gneiting and Jordan (2021) and
references therein for details on score decompositions. In the case of threshold
calibration, the point forecast x = F (t) is induced by a predictive distribution,
and the Brier score can be written as
S(x, y) = (F(t) − 1{y ≤ t})².   (20)
For both real-valued and binary outcomes, it is often preferable to use the square
root of the miscalibration component (MCB1/2 ) as a measure of calibration error
that can be interpreted on natural scales (e.g., Roelofs et al., 2022).
A canonical loss function for the Huber functional (Table 1) is given by
S(x, y) = 2 |1{x ≥ y} − α| ·
  { 2a|x − y| − a²,   x − y < −a,
    (x − y)²,         −a ≤ x − y ≤ b,
    2b|x − y| − b²,   x − y > b. }
Theorem 2.27. Let Assumption 2.25 hold, and suppose there is a constant c
such that X + c is unconditionally T-calibrated. Let Xurc = X + c and S̄urc =
E[S(Xurc, Y)], and define MCBu = S̄ − S̄urc and MCBc = S̄urc − S̄rc. Then

MCB = MCBu + MCBc,
where MCBu ≥ 0 with equality if X is unconditionally T-calibrated, and MCBc ≥
0 with equality if Xrc = Xurc almost surely. If S is strictly consistent, then
MCBu = 0 only if X is unconditionally T-calibrated, and MCBc = 0 only if
Xrc = Xurc almost surely.
Proof. Immediate from Theorems 2.23 and 2.26, and the fact that conditional
recalibration of X and X + c yields the same Xrc .
In settings that are equivariant under translation, such as for expectiles,
quantiles, and Huber functionals when both X and Y are supported on the real
line, X can always be unconditionally recalibrated by adding a constant. Under
any canonical loss function S, the basic decomposition (15) then extends to

S̄ = MCBu + MCBc − DSC + UNC.

For instance, when S(x, y) = (x − y)² is the canonical loss for the mean functional, MCBu = c² is the squared unconditional bias. The forecasts in Figure 5
and Table 4 are free of unconditional bias, so MCBu = 0 and MCBc = MCB.
In all cases studied thus far, canonical loss functions are strictly consistent (Ehm et al., 2016), and so MCBu = 0 if and only if the forecast is unconditionally T-calibrated, and MCBc = 0 if and only if Xurc = Xrc almost surely. While in other settings, such as when the outcomes are bounded, unconditional recalibration by translation might be counterintuitive (in principle) or impossible (in practice), the statement of Theorem 2.27 continues to hold, and the above results can be refined to admit more general forms of unconditional recalibration. We leave these and other ramifications to future work.
We turn to empirical settings, where calibration checks, scores, and score decompositions address critical practical problems in both model diagnostics and forecast evaluation. The most direct usage is in the evaluation of out-of-sample predictive performance, where forecasts may either take the form of fully specified predictive distributions, or be single-valued point forecasts that arise, implicitly or explicitly, as functionals of predictive distributions. Similarly, in model diagnostics, where in-sample goodness-of-fit is of interest, the model might supply fully specified, parametric or non-parametric conditional distributions, or
Our key tool and workhorse is a very general version of the classical pool-adjacent-violators (PAV) algorithm for nonparametric isotonic regression (Ayer et al., 1955; van Eeden, 1958). Historically, work on the PAV algorithm has
focused on the mean functional, as reviewed by Barlow et al. (1972), Robertson
and Wright (1980), and de Leeuw, Hornik and Mair (2009), among others. In
contrast, Jordan, Mühlemann and Ziegel (2022) study the PAV algorithm in
very general terms that accommodate our setting.
We rely on their work and describe the T-pool-adjacent-violators algorithm based on tuples (x1, y1), . . . , (xn, yn) of the form (24), where without loss of generality we may assume that x1 ≤ · · · ≤ xn. Furthermore, we let δi denote the point measure in the outcome yi. More generally, for 1 ≤ k ≤ l ≤ n we let

δ_{k:l} = (1/(l − k + 1)) Σ_{i=k}^{l} δi
(1/n) Σ_{i=1}^{n} S(x̂i, yi) ≤ (1/n) Σ_{i=1}^{n} S(ti, yi)   (25)
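For the mean functional, the T-PAV algorithm reduces to the classical PAV algorithm, where the pooled value of a block of outcomes is their average. The following minimal Python sketch is our own illustration (the general algorithm of Jordan, Mühlemann and Ziegel (2022) instead applies the functional T to the mixtures δ_{k:l}); it assumes the outcomes are already ordered by forecast value.

import numpy as np

def pav_mean(y):
    """Pool-adjacent-violators for the mean: given outcomes y ordered by
    forecast value, return the nondecreasing fit minimizing squared error."""
    # Each block stores [sum, count]; the pooled value is the block mean.
    blocks = []
    for yi in y:
        blocks.append([yi, 1])
        # Merge while the last two block means violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] >= blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    return np.concatenate([np.full(c, s / c) for s, c in blocks])

y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])   # outcomes, sorted by forecast value
print(pav_mean(y))                         # [1.  2.5 2.5 4.5 4.5]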
Recently, Dimitriadis, Gneiting and Jordan (2021) introduced the CORP approach for the estimation of reliability diagrams and score decompositions in the case of probability forecasts for binary outcomes. In a nutshell, the acronym
CORP refers to an estimator that is Consistent under the assumption of isotonicity for the population recalibration function and Optimal in both finite
sample and asymptotic settings, while facilitating Reproducibility, and being
based on the PAV algorithm. Here, we extend the CORP approach and employ nonparametric isotonic T-regression via the T-PAV algorithm under Assumption 2.19, where T is the lower or upper version of an identifiable functional, or an identifiable singleton functional.
We begin by defining the empirical T-reliability diagram, which is a sample
version of the population diagram in Definition 2.20.
Definition 3.2. Let the functional T be as stated in Assumption 2.19, and suppose that x̂1, . . . , x̂n originate from tuples (x1, y1), . . . , (xn, yn) with x1 ≤ · · · ≤ xn via Algorithm 1. Then the CORP empirical T-reliability diagram is the graph of the piecewise linear function that connects the points (x1, x̂1), . . . , (xn, x̂n) in the Euclidean plane.
A few scattered references in the literature on forecast evaluation have proposed displays of recalibrated against original values for functionals other than
binary event probabilities: Figures 3 and 7 of Bentzien and Friederichs (2014)
and Figure 8 of Pohle (2020) consider quantiles, and Figures 2–5 of Satopää
and Ungar (2015) concern the mean functional. However, none of these papers
employ the PAV algorithm, and the resulting diagrams are subject to issues
of stability and efficiency, as illustrated by Dimitriadis, Gneiting and Jordan
(2021) in the case of binary outcomes.
For the CORP empirical T-reliability diagram to be consistent in the sense of large sample convergence to the population version of Definition 2.20, the assumption of isotonicity of the population recalibration function needs to be invoked. As argued by Roelofs et al. (2022) and Dimitriadis, Gneiting and Jordan (2021), such an assumption is natural, and practitioners tend to dismiss non-isotonic recalibration functions as artifacts. Evidently, these arguments transfer to arbitrary functionals, and any violations of the isotonicity assumption entail horizontal segments in CORP reliability diagrams, thereby indicating a lack of reliability. Large sample theory for CORP estimates of the recalibration function and the T-reliability diagram depends on the functional T, the type — discrete or continuous — of the marginal distribution of the point forecast X, and smoothness conditions. Mösching and Dümbgen (2020) establish rates of
uniform convergence in the cases of threshold (non) exceedance and quantile
functionals that complement classical theory (Barlow et al., 1972; Casady and
Cryer, 1976; Wright, 1984; Robertson, Wright and Dykstra, 1988; El Barmi and
Mukerjee, 2005; Guntuboyina and Sen, 2018).
In the case of binary outcomes, Bröcker and Smith (2007, p. 651) argue that reliability diagrams ought to be supplemented by consistency bars for "immediate visual evaluation as to just how likely the observed relative frequencies
are under the assumption that the predicted probabilities are reliable." Dimitriadis, Gneiting and Jordan (2021) develop asymptotic and Monte Carlo based
methods for the generation of consistency bands to accompany a CORP reliability diagram for dichotomous outcomes, and provide code in the form of the
reliabilitydiag package (Dimitriadis and Jordan, 2021) for R (R Core Team,
2021). The consistency bands quantify and visualize the variability of the empirical reliability diagram under the respective null hypothesis, i.e., they show the pointwise range of the CORP T-reliability diagram that we expect to see under a calibrated forecast. Algorithms 2 and 3 in Appendix B generalize this approach to produce consistency bands from data of the form (23) under the assumption of auto-calibration. In the specific case of threshold calibration, where
the induced outcome is dichotomous, the assumptions of auto-calibration (in the
binary setting) and T-calibration (for the non-exceedance functional) coincide
(Gneiting and Ranjan, 2013, Theorem 2.11), and we use the aforementioned
algorithms to generate consistency bands (Figure 6, top row). Generally, auto-calibration is a strictly stronger assumption than T-calibration, with ensuing
issues, which we discuss in Appendix B.1. Furthermore, to generate consistency
bands from data of the form (24), we cannot operate under the assumption of
auto-calibration.
As a crude yet viable alternative, we propose in Appendix B.2 a Monte Carlo
technique for the generation of consistency bands that is based on resampling
residuals. As in traditional regression diagnostics, the approach depends on the
assumption of independence between point forecasts and residuals. Figure 6
shows examples of T-reliability diagrams with associated residual-based 90% consistency bands for the perfect, unfocused, and lopsided forecasts from Section 2 for the mean functional (middle row) and the lower quantile functional at level 0.10 (bottom row). For further discussion see Appendix B.2. In the
case of the mean functional, we add the scatter diagram for the original data
of the form (24), whereas in the other two cases, inset histograms visualize the
marginal distribution of the point forecast.
We encourage follow-up work on both Monte Carlo and asymptotic methods for the generation of consistency and confidence bands that are tailored to specific functionals of interest, similar to the analysis by Dimitriadis, Gneiting and Jordan (2021) in the basic case of probability forecasts for binary outcomes.
In this section, we consider data (x1, y1), . . . , (xn, yn) of the form (24), where implicitly or explicitly xi = T(Fi) for a single-valued functional T. Let x̂1, . . . , x̂n denote the respective T-PAV recalibrated values, and let x̂0 = T(F̂0), where F̂0 is the empirical CDF of the outcomes y1, . . . , yn. Let

S̄ = (1/n) Σ_{i=1}^{n} S(xi, yi),   S̄rc = (1/n) Σ_{i=1}^{n} S(x̂i, yi),   and   S̄mg = (1/n) Σ_{i=1}^{n} S(x̂0, yi)   (26)

denote the mean score of the point forecast at hand, the recalibrated point forecast, and the functional T applied to the unconditional, marginal distribution
Fig 6. CORP empirical threshold (top, t = 1), mean (middle) and quantile (bottom, α = 0.10)
reliability diagrams for the perfect (left), unfocused (middle), and lopsided (right) forecast
from Examples 2.1 and 2.2 with 90% consistency bands and CORP score components under
the associated canonical loss function based on samples of size 400.
S̄ = MCB_S − DSC_S + UNC_S,   (28)

where MCB_S ≥ 0 with equality if x̂i = xi for i = 1, . . . , n, and DSC_S ≥ 0 with equality if x̂i = x̂0 for i = 1, . . . , n.
If S is strictly consistent, then MCB_S = 0 only if x̂i = xi for i = 1, . . . , n, and DSC_S = 0 only if x̂i = x̂0 for i = 1, . . . , n.
Proof. Immediate from Theorem 3.1.
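Combining the PAV fit with (26) yields the CORP score decomposition (28) in a few lines of code. The sketch below is our own illustration for the mean functional under squared error; it reuses pav_mean from the sketch following (25).

import numpy as np

def corp_decomposition_mse(x, y):
    """CORP decomposition (28) of the MSE: returns (MSE, MCB, DSC, UNC).
    Sorts the paired data by forecast value before applying the PAV fit."""
    order = np.argsort(x)
    x, y = np.asarray(x)[order], np.asarray(y)[order]
    x_rc = pav_mean(y)                   # recalibrated forecasts via PAV
    x_mg = y.mean()                      # marginal (unconditional) mean forecast
    s    = np.mean((x - y) ** 2)         # mean score of the forecast at hand
    s_rc = np.mean((x_rc - y) ** 2)      # mean score of the recalibrated forecast
    s_mg = np.mean((x_mg - y) ** 2)      # mean score of the constant reference
    # MCB = s - s_rc, DSC = s_mg - s_rc, UNC = s_mg, so that
    # MSE = MCB - DSC + UNC holds by construction.
    return s, s - s_rc, s_mg - s_rc, s_mg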
We then refer to

MCB_u = S̄ − S̄urc   and   MCB_c = S̄urc − S̄rc.   (29)
Theorem 3.4. Let the conditions of Theorem 3.3 hold, and let S be a canonical loss function for T. Suppose there is a constant c ∈ R such that the empirical measure in (x1 + c, y1), . . . , (xn + c, yn) is unconditionally T-calibrated, and suppose that all terms in (29) are finite. Then

MCB = MCB_u + MCB_c,

where MCB_u ≥ 0 and MCB_c ≥ 0.
Proof. Immediate from Theorems 2.26 and 3.1 and the trivial fact that the
addition of a constant is a special case of an isotonic mapping.
In the middle row of Figure 6, the extended CORP decomposition,

S̄ = MCB_u + MCB_c − DSC + UNC,   (30)
Let us revisit the mean scores in (26) under the natural assumption that the terms in S̄ and S̄mg are finite and that S̄mg is strictly positive. In out-of-sample forecast evaluation, the quantity

S̄skill = 1 − S̄/S̄mg = (S̄mg − S̄)/S̄mg = (DSC_S − MCB_S)/UNC_S   (31)

is known as skill score (Murphy and Epstein, 1989; Murphy, 1996; Gneiting and Raftery, 2007; Jolliffe and Stephenson, 2012) and may attain both positive and negative values. In particular, when S(x, y) = (x − y)² is the canonical loss
function for the mean functional, Sskill coincides with the popular Nash-Sutcliffe
model efficiency coefficient (NSE; Nash and Sutcliffe, 1970; Moriasi et al., 2007).
A positive skill score indicates predictive performance better than the simplistic
unconditional reference forecast, whereas a negative skill score suggests that
we are better off using the simple reference forecast. Of course, it is possible,
and frequently advisable, to base skill scores on reference standards that are
more sophisticated than an unconditional, constant point forecast (Hyndman
and Koehler, 2006).
In contrast, if the goal is in-sample model diagnostics, the quantity in (31) typically is nonnegative. As we demonstrate now, it constitutes a powerful generalization of the coefficient of determination, R², or variance explained in least squares regression, and its close cousin, the R1-measure in quantile regression
(Koenker and Machado, 1999). Specifically, we propose the use of
R∗ = (DSC_S − MCB_S)/UNC_S   (32)
as a universal coefficient of determination. In practice, one takes S to be a
canonical loss for the functional T at hand, and we drop the subscripts in this
case. The classical R2 measure arises when S(x, y) = (x − y)2 is the canonical
squared error loss function for the mean functional, and the R1 measure of
Koenker and Machado (1999) emerges when S(x, y) = 2 (1{x ≥ y} − α) (x − y)
is the canonical piecewise linear loss under the α-quantile functional. Of course, in the case α = 1/2 of the median, the piecewise linear loss reduces to the absolute error.
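In code, R∗ is immediate from the CORP decomposition. The sketch below is our own illustration for the mean functional, reusing corp_decomposition_mse from the earlier sketch; since (DSC − MCB)/UNC = 1 − MSE/UNC by the decomposition, the function reproduces the classical in-sample R² for least squares fitted values.

import numpy as np

def r_star_mse(x, y):
    """Universal coefficient of determination (32) under squared error."""
    mse, mcb, dsc, unc = corp_decomposition_mse(x, y)
    return (dsc - mcb) / unc   # equals 1 - mse/unc by the decomposition

rng = np.random.default_rng(3)
x = rng.normal(size=500)
y = 2 * x + rng.normal(size=500)
fit = np.polyval(np.polyfit(x, y, 1), x)   # ordinary least squares fit
print(r_star_mse(fit, y))                  # classical R^2 of the linear fit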
In Figure 7, we present a numerical example on the toy data from Figure 1
in Kvålseth (1985). The straight lines show the linear (ordinary least squares)
mean and linear (Laplace) median regression fits, which Kvålseth (1985) sought
to compare. The piecewise linear broken curves illustrate the nonparametric
isotonic regression fits, as realized by the T-PAV algorithm, where T is the
mean and the lower and the upper median, respectively. As the linear regression fits induce the same ranking of the point forecasts, they yield the same PAV-recalibrated values that enter the terms in the score decomposition (27), and thus they have identical discrimination components in (28), which equal 10.593 under squared error and 2.333 under absolute error, regardless of which isotonic
median is used. The uncertainty components, which equal 12.000 under squared error and 2.889 under absolute error, are identical as well, since they depend on the observations only. Thus, the differences in R² and R1, respectively, in Figure 7 stem from distinct miscalibration components. Of course, linear mean regression
is preferred under squared error, and linear median regression is preferred under
absolute error.
Various authors have discussed desiderata for a generally applicable definition of a coefficient of determination (Kvålseth, 1985; Nakagawa and Schielzeth, 2013) for the assessment of in-sample fit. In particular, such a coefficient ought to be dimensionless and take values in the unit interval, with a value of 1 indicating a perfect fit, and a value of 0 representing a complete lack of fit. The
Fig 7. Linear mean and linear median regression lines for toy example from Kvålseth (1985,
Figure 1), along with nonparametric isotonic mean and median regression fits. The isotonic
median regression fit is not unique and framed by the respective lower and upper functional.
R∗ ∈ [0, 1],

with R∗ = 0 if xi = x̂0 for i = 1, . . . , n, and R∗ = 1 if xi = T(δi) for i = 1, . . . , n.
Proof. The claim follows from Theorem 3.1, the trivial fact that a constant fit is
a special case of an isotonic mapping, and the assumed form (17) of the scoring
function.
We emphasize that Assumption 3.5 and Theorem 3.6 are concerned with, and
tailored to, in-sample model diagnostics. At the expense of technicalities, the
regularity conditions can be relaxed, but the details are tedious and we leave
them to subsequent work. The condition that any constant fit x1 = · · · = xn be
admissible is critical and cannot be relaxed.
Fig 8. Calibration diagnostics for Bank of England forecasts of CPI inflation at a prediction
horizon of one quarter: (a) PIT reliability diagram, along with the empirical autocorrelation
functions of (b) original and (c) squared, centered PIT values, (d) marginal, (e) threshold,
and (f) 75%-quantile reliability diagram. If applicable, we show 90% consistency bands and
CORP score components under the associated canonical loss function, namely, the Brier score
(BS) and the piecewise linear quantile score (QS), respectively.
ahead in the time series setting, where k-step-ahead forecasts that are ideal with respect to the canonical filtration show PIT values that are independent at lags ≥ k + 1, in addition to being uniformly distributed (Diebold, Gunther
and Tay, 1998). However, as discussed in Appendix C.1, independent, uniformly
distributed PIT values do not imply auto-calibration, except in a special case.
Thus, calibration diagnostics beyond checks of the uniformity and independence
of the PIT are warranted.
In Figure 8, we consider forecasts one quarter ahead and show PIT and
marginal reliability diagrams, along with empirical autocorrelation functions
(ACFs) for the first two moments of the PIT. In part, the PIT reliability diagram
and the ACFs lie outside the respective 90% consistency bands. For a closer look,
we also plot the threshold reliability diagram at the policy target of 2% and the
lower α-quantile reliability diagram for α = 0.75. The deviations from reliability
remain minor, in stark contrast to calibration diagnostics at prediction horizons
k ≥ 4, for which we refer to Appendix C.2.
Figure 9 shows the standard CORP decomposition (28) of the Brier score
(BS) for the induced probability forecasts at the 2% target and the extended
Fig 9. Score decomposition (28) respectively (30) and skill score (31) for probability forecasts
of not exceeding the 2% inflation target (left) and 75%-quantile forecasts (right) induced by
Bank of England fan charts for CPI inflation, under the associated canonical scoring function.
CORP decomposition (30) of the piecewise linear quantile score for α-quantile forecasts at level α = 0.75 and lead times up to six quarters ahead. In the latter case, the difference between MCB and MCBu equals the MCBc component. Generally, the miscalibration components increase while the discrimination components decrease with the lead time.
can be found in Pohle (2020, Table 5, Figures 7 and 8), where there is a notable
increase in the discrimination (resolution) component at the largest two lead
times, which is caused by counterintuitive decays in the recalibration functions.
In contrast, the regularizing constraint of isotonicity prevents overfitting in the
CORP approach.
The coefficient of determination or skill score R∗ decays with the prediction
horizon and becomes negative at lead times k ≥ 4. This observation suggests
that forecasts remain informative at lead times up to at most three quarters
ahead, in line with the substantive findings in Pohle (2020) and other extant
work, as hinted at in Appendix C.2.
4. Discussion
In a display of the DSC component against the MCB component, forecasts with equal values of the coefficient of determination, R∗, gather on lines with unit slope, and multiple facets of forecast quality can be assessed simultaneously, yielding a general alternative to the widely used Taylor (2001) diagram.
Formal tests of hypotheses of calibration are critical both in specific applications, such as banking regulation (e.g., Nolde and Ziegel, 2017), and in generic tasks, such as the assessment of goodness of fit in regression (Dimitriadis, Gneiting and Jordan, 2021, Section S2). In Appendix B.4, we comment
on this problem from the perspective of the theoretical and methodological
advances presented here. While specific developments need to be deferred to
future work, it is our belief that the progress in our understanding of notions
and hierarchies of calibration, paired with the CORP approach to estimating
reliability diagrams and score components, can spur a wealth of new and fruitful
developments in these directions.
where we define Ψa(x) = ϕ(x + a)/(ϕ(x) + ϕ(x + a)) for a ∈ R and note that E[Ψ_η(μ)² | η] = E[Ψ_{η0}(μ)²]. The associated integral

I(η0) = E[Ψ_{η0}(μ)²] = ∫_{−∞}^{∞} ( ϕ(x + η0) / (ϕ(x) + ϕ(x + η0)) )² ϕ(x) dx

can be evaluated numerically.
We proceed in analogy to the development for the unfocused forecast. For fixed
a ∈ [0, 1] and b ∈ R, the function
is a CDF. The CDF for the lopsided forecast with random density (3) from
Example 2.2 can be written as F (y) = Φδ (y −μ), where δ and μ are independent
random variables and δ = ±δ0 for some δ0 ∈ (0, 1). As E[Φδ (y − μ) | μ] = Φ(y −
μ), the lopsided forecast is marginally calibrated. It fails to be probabilistically
calibrated since ZF = Φδ(Y − μ) has CDF

Q(ZF ≤ u) = (1/2) Σ_{s=±1} [ (u/(1 − sδ0)) · 1{u/(1 − sδ0) ≤ 1/2} + ((u + sδ0)/(1 + sδ0)) · 1{u/(1 − sδ0) > 1/2} ].
The conditional CDF for the outcome Y given the posited (non)exceedance probability F(t) at any fixed threshold t ∈ R or, equivalently, given the quantile forecast F−1(α) at any fixed level α ∈ (0, 1) is

Q(Y ≤ y | F(t) = α) = Q(Y ≤ y | F−1(α) = t) = Q(Y ≤ y | μ = t − Φ_δ^{−1}(α))
  = (1 / Σ_{s=±1} ϕ(t − Φ_{sδ0}^{−1}(α))) Σ_{s=±1} ϕ(t − Φ_{sδ0}^{−1}(α)) Φ(y − (t − Φ_{sδ0}^{−1}(α))),

where Φ_a^{−1}(α) = Φ−1(α/(1 − a)) if α ≤ (1/2)(1 − a) and Φ_a^{−1}(α) = Φ−1((a + α)/(a + 1)) otherwise.
As F is a mixture of truncated normal distributions, its moments are mixtures of the component moments, for which we refer to Orjebin (2014). The first moment is m1(F) = μ + 2δϕ(0), so that

Q(Y ≤ y | m1(F) = m) = Q(Y ≤ y | μ = m − 2δϕ(0))
  = (1 / Σ_{s=±1} ϕ(m − 2sδ0ϕ(0))) Σ_{s=±1} ϕ(m − 2sδ0ϕ(0)) Φ(y − (m − 2sδ0ϕ(0)))

is a mixture of normal distributions. Similarly, the second and third moments are m2(F) = μ² + 1 + 4δϕ(0)μ ≥ 1 − 4δ²ϕ(0)² and m3(F) = μ³ + 3μ + 2δϕ(0)(3μ² + 2) = f(μ; δ), respectively, so that

Q(Y ≤ y | m2(F) = m) = Q(Y ≤ y | μ = −2δϕ(0) ± √(4δ²ϕ(0)² − 1 + m)),
Q(Y ≤ y | m3(F) = m) = Q(Y ≤ y | f(μ; δ) = m)

also admit expressions as mixtures of normal distributions. Again, we use a numeric solver to find the roots of x → f(x; ±δ0).
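Any bracketing solver will do for this root-finding step, since f(·; δ) is strictly increasing for the relevant parameter values. A sketch (our own illustration, not the authors' code) using scipy:

import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def f(x, delta):
    # f(x; delta) = x^3 + 3x + 2*delta*phi(0)*(3x^2 + 2), i.e., the third
    # moment as a function of mu; strictly increasing for |delta| < 1.
    return x ** 3 + 3 * x + 2 * delta * norm.pdf(0) * (3 * x ** 2 + 2)

def root(m, delta, lo=-50.0, hi=50.0):
    # Unique x with f(x; delta) = m, found by bracketing.
    return brentq(lambda x: f(x, delta) - m, lo, hi)

delta0 = 0.7
print(root(2.0, delta0), root(2.0, -delta0))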
As the implied mean-forecast, m1 (F ) = μ + 2δϕ(0), agrees with the implied
mean-forecast of the unfocused forecast with η = (8/π)1/2 δ, the terms in the
score decomposition (15) with S(x, y) = (x − y)2 derive from the respective
terms in the score decomposition for the unfocused forecast, as illustrated in
Figure 5.
The conditional CDF for the outcome Y given the posited (non) exceedance
probability F (t) at any fixed threshold t ∈ R or, equivalently, given the quantile
forecast F −1 (α) at any fixed level α ∈ (0, 1) then is
Q(Y ≤ y | m1(F) = m) = Q(Y ≤ y | μ = m − 1 − (1/4)i)
  = (1 / Σ_{i=1,2,3} ϕ((m − 1 − (1/4)i)/c)) Σ_{i=1,2,3} ϕ((m − 1 − (1/4)i)/c) Q_i(y − (m − 1 − (1/4)i))
In this section, we demonstrate that Definitions 2.9 and 2.24 are unambiguous and do not depend on the choice of the identification function, which is essentially unique. To this end, we first contrast the notions of identification functions in Fissler and Ziegel (2016) and Jordan, Mühlemann and Ziegel (2022). Fissler and Ziegel (2016) call V : R × R → R a (strict F-)identification function if V(x, ·) is integrable with respect to all F ∈ F for all x ∈ R. Jordan, Mühlemann and Ziegel (2022) additionally require V to be increasing and left-continuous in its first argument. Furthermore, there is a subtle difference in the way that the functional is induced. While Fissler and Ziegel (2016) define the induced functional as the set
T0(F) = { x ∈ R : ∫ V(x, y) dF(y) = 0 },
r− = sup{r : v(r) < 0} = T−(δ0) = T̃−(δ0) = sup{r : ṽ(r) < 0} > −∞,
r+ = inf{r : v(r) > 0} = T+(δ0) = T̃+(δ0) = inf{r : ṽ(r) > 0} < ∞.

If r, s < r−, then h(r) = h(s) = h(t) for any t > r+ by the above line of reasoning. An analogous argument yields h(r) = h(s) for r, s > r+. Therefore, the function h is constant, and ṽ(r) = h0 · v(r) for a constant h0 > 0 and all r ∈ R \ {r−}. Finally, we obtain ṽ(r−) = lim_{r↑r−} ṽ(r) = lim_{r↑r−} h0 · v(r) = h0 · v(r−) by left-continuity.
Hence, if we assume an identification function of type (i) in Assumption 2.8, Definitions 2.9 and 2.24 do not depend on the choice of the identification function, as it is unique up to a positive constant. Trivially, the same holds true for type (ii). To complete the argument that the definitions are unambiguous, the following technical argument is needed.
Remark A.2. If a functional T of singleton type is identified by both an identification function V(x, y) = v(x − y) of type (i) and an identification function Ṽ(x, y) = x − T(δy) of type (ii), then Ṽ is also of type (i). To confirm this claim, let z denote the unique value at which the sign of v changes, and note that z = T(δy) − y for all y since V induces the functional T for each Dirac measure δy. Hence, T(δy) = y + z and Ṽ(x, y) = x − y − z is of type (i).
We close this section with comments on the role of the class F. As expressed
by Assumption 2.8, we prefer to work with identification functions that elicit
the target functional T on a large, convex class F of probability measures, to
avoid unnecessary constraints on forecast(er)s. Furthermore, when evaluating
stand-alone point forecasts, the underlying predictive distributions typically are
implicit, and assumptions other than the existence of the functional at hand are
unwarranted and contradict the prequential principle. Evidently, if the class F
is sufficiently restricted, additional identification functions arise. For example,
the piecewise constant identification function associated with the median can
be used to identify the mean within any class of symmetric distributions.
As pointed out by Sahoo et al. (2021, p. 5), strong threshold calibration does not imply auto-calibration. Here, we provide a simple example illustrating this fact, as Sahoo et al. (2021) do not present one. The example is similar in spirit to the
Fig 10. Same as the lower row of Figure 4 but with displays on original (rather than root-transformed) scales: Moment reliability diagrams for point forecasts induced by (left) the
unfocused forecast with η0 = 1.5 and (middle) the lopsided forecast with δ0 = 0.7 from
Example 2.2, and (right) the piecewise uniform forecast with c = 0.5 from Example 2.4.
with equal probability. The equal average of the distribution of the PIT conditional on either forecast from the top row, and either forecast from the bottom row, is uniform. As any nontrivial conditioning in terms of a threshold yields
a combination of two forecast cases, one from the top row and one from the
bottom row, the forecast F is strongly threshold calibrated.
The root transforms in the moment reliability diagrams in the bottom row
of Figure 4 bring the first, second, and third moment to the same scale. The
peculiar dent in the reliability curve for the (third root of the) third moment of
the piecewise uniform forecast results from the transform, which magnifies small
deviations between x = m3 (F ) and xrc when x is close to zero. For comparison,
Figure 10 shows moment reliability diagrams for all three forecasts without
applying the root transform.
The statement in Theorem 2.26 does not hold under consistent scoring functions in general. For a counterexample, consider the empirical distribution of
Monte Carlo based consistency bands for T-reliability diagrams can be generated from resamples, at any desired nominal level. The consistency bands then show the pointwise range of the resampled calibration curves. For now, let us assume that we have data (x1, y1), . . . , (xn, yn) of the form (24) along with m resamples at hand, and defer the critical question of how to generate the resamples.
Algorithm 2: Consistency bands for T-reliability curves based on resamples

Input: resamples (x1, y1^(j)), . . . , (xn, yn^(j)) for j = 1, . . . , m
Output: α × 100% consistency band
for j ∈ {1, . . . , m} do
    apply Algorithm 1 to obtain x̂1^(j), . . . , x̂n^(j) from (x1, y1^(j)), . . . , (xn, yn^(j))
end
for i ∈ {1, . . . , n} do
    let li and ui be the empirical quantiles of x̂i^(1), . . . , x̂i^(m) at levels (1 − α)/2 and (1 + α)/2
end
interpolate the point sets (x1, l1), . . . , (xn, ln) and (x1, u1), . . . , (xn, un) linearly, to obtain the lower and upper bounds of the consistency band, respectively
Complementary to consistency bands, tests for the assumed type of calibration, as quantified by the functional T and a generic miscalibration measure MCB, can be performed as usual. Specifically, we compute MCBj for each resample j = 1, . . . , m, and, if r of the resampled measures MCB1, . . . , MCBm are less than or equal to the miscalibration measure computed from the original data, we declare a Monte Carlo p-value of 1 − r/(m + 1).
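A compact Python sketch of Algorithm 2 for the mean functional (our own illustration; it reuses pav_mean from the sketch following (25) and takes the resampled outcomes as given):

import numpy as np

def consistency_band(x, y_resamples, level=0.9):
    """Algorithm 2 for the mean functional: pointwise consistency band for
    the T-reliability curve from resampled outcomes (one array per resample)."""
    order = np.argsort(x)
    x = np.asarray(x)[order]
    fits = np.array([pav_mean(np.asarray(y_j)[order]) for y_j in y_resamples])
    lo = np.quantile(fits, (1 - level) / 2, axis=0)
    hi = np.quantile(fits, (1 + level) / 2, axis=0)
    # Interpolate (x, lo) and (x, hi) linearly to draw the band.
    return x, lo, hi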
When working with original data of the form (23), we can generate resamples
under the hypothesis of auto-calibration in the obvious way, as follows.
As noted in the main text, the consistency bands for the threshold reliability diagrams in Figures 6 and 8 have been generated by Algorithms 2 and 4. This approach is similar to the Monte Carlo technique proposed by Dimitriadis, Gneiting and Jordan (2021) that applies in the case of (induced) binary outcomes (only). However, unlike Dimitriadis, Gneiting and Jordan (2021), we do not resample the forecasts themselves. To generate consistency bands for the mean and quantile reliability diagrams in these figures, we apply Algorithm 2 to m = 1000 resamples generated by Algorithm 4. Evidently, this procedure is crude and relies on classical assumptions. Nonetheless, we believe that in many practical settings, where visual tools for diagnostic checks of calibration are sought, the consistency bands thus generated provide useful guidance.
Further methodological development on consistency and confidence bands needs to be tailored to the specific functional T of interest, and follow-up work on Monte Carlo techniques and large sample theory is strongly encouraged. Extant asymptotic theory for nonparametric isotonic regression, as implemented by Algorithm 1, is available for quantiles and the mean or expectation functional, as developed and reviewed by Barlow et al. (1972), Casady and Cryer (1976), Wright (1984), Robertson, Wright and Dykstra (1988), El Barmi and Mukerjee (2005), and Mösching and Dümbgen (2020). This theory can be leveraged, though with hurdles: rates of convergence depend on distributional assumptions, limit distributions involve nuisance parameters that need to be estimated, and the use of bootstrap methods might be impacted by the issues described by Sen, Banerjee and Woodroofe (2010).
For the classical notions of unconditional calibration in Section 2.2, the CORP
approach does not apply directly, but its spirit can be retained and adapted.
As for probabilistic calibration, the prevalent practice is to plot histograms of
empirical probability integral transform (PIT) values, as proposed by Diebold,
Gunther and Tay (1998), Gneiting, Balabdaoui and Raftery (2007), and Czado,
Gneiting and Held (2009), though this practice is hindered by the necessity for
binning, as analyzed by Heinrich (2021) in the nearly equivalent setting of rank
histograms. The population version of our suggested alternative is the PIT reliability diagram, which is simply the graph of the CDF of the PIT ZF in (1).
The PIT reliability diagram coincides with the diagonal in the unit square if,
and only if, F is probabilistically calibrated. For tuples of the form (23) the
empirical PIT reliability diagram shows the empirical CDF of the (potentially
randomized) PIT values. This approach does not require binning and can be
interpreted in much the same way as a PIT diagram: An inverse S-shape corresponds to a U-shape in histograms and indicates underdispersion of the forecast,
as typically encountered in practice. Evidently, this idea is not new and extant
implementations can be found in work by Pinson and Hagedorn (2012) and
Henzi, Ziegel and Gneiting (2021).
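A sketch (our own illustration) of the computations behind an empirical PIT reliability diagram: compute randomized PIT values as in (1), then plot their empirical CDF against the diagonal. The left limit F(y−) is approximated numerically here, which is a simplification for illustration.

import numpy as np

def randomized_pit(cdf, y, rng):
    """Randomized PIT Z = F(y-) + U (F(y) - F(y-)) for one forecast CDF.
    `cdf` maps a value to F(value); F(y-) is approximated by a left limit."""
    eps = 1e-9
    lower, upper = cdf(y - eps), cdf(y)
    return lower + rng.uniform() * (upper - lower)

def pit_reliability(z):
    """Empirical PIT reliability diagram: the ECDF of the PIT values,
    to be plotted against the diagonal of the unit square."""
    z = np.sort(np.asarray(z))
    return z, np.arange(1, len(z) + 1) / len(z)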
As regards marginal calibration, we define the population version of the
marginal reliability diagram as the point set
Fig 11. PIT (top) and marginal (bottom) reliability diagrams for the perfect (left), unfocused
(middle), and lopsided (right) forecast from Examples 2.1 and 2.2, along with 90% consistency
bands based on samples of size 400.
While the explicit development of calibration tests exceeds the scope of our paper, we believe that the results and discussion in Section 2 convey an important
general message: It is critical that the assessed notion of calibration be carefully
and explicitly specified. Throughout, we consider tests under the assumption of
independent, identically distributed data from a population. For extensions to
dependent samples, we refer to Strähl and Ziegel (2017), who generalized the
prediction space concept to allow for serial dependence, and point at methods
introduced by, e.g., Corradi and Swanson (2007), Knüppel (2015), and Bröcker
and Ben Bouallègue (2020).
The most basic case is that of tuples (x1, y1), . . . , (xn, yn) of the form (24), where implicitly or explicitly xi = T(Fi) for a single-valued functional T. We first discuss tests of unconditional calibration. If the simplified condition (11) is sufficient, a two-sided t-test based on v̄ = (1/n) Σ_{i=1}^{n} V(xi, yi) can be used to test for unconditional calibration. In the general case, two one-sided t-tests can be used along with a Bonferroni correction. In the special case of quantiles, there is no need to resort to the approximate t-tests, and exact binomial tests can be used instead. Essentially, this special case is the setting of backtests for value-at-risk reports in banking regulation, for which we refer to Nolde and Ziegel (2017, Sections 2.1–2.2).
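Both tests are one-liners given the data; a sketch (our own illustration) using scipy:

import numpy as np
from scipy import stats

def t_test_unconditional(v):
    """Two-sided t-test of H0: E[V(X, Y)] = 0, based on v_i = V(x_i, y_i)."""
    return stats.ttest_1samp(v, popmean=0.0).pvalue

def binomial_test_quantile(x, y, alpha):
    """Exact binomial test for alpha-quantile forecasts: under unconditional
    calibration, the count of outcomes below the forecast is Binomial(n, alpha)."""
    k, n = int(np.sum(y < x)), len(y)
    return stats.binomtest(k, n, alpha).pvalue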
As noted earlier in the section, resamples generated under the hypothesis of conditional T-calibration can readily be used to perform Monte Carlo tests for
the respective hypothesis, based on CORP score components that are computed
on the surrogate data. Alternatively, one might leverage extant large sample
theory for nonparametric isotonic regression (Barlow et al., 1972; Casady and
Cryer, 1976; Wright, 1984; Robertson, Wright and Dykstra, 1988; El Barmi and
Mukerjee, 2005; Mösching and Dümbgen, 2020). Independently of the use of
resampling or asymptotic theory, CORP-based tests avoid the issues and instabilities incurred by binning (Dimitriadis, Gneiting and Jordan, 2021, Section
S2) and may simultaneously improve efficiency and stability. In passing, we hint
at relations to the null hypothesis of Mincer-Zarnowitz regression (Krüger and
Ziegel, 2021) and tests of predictive content (Galbraith, 2003; Breitung and
Knüppel, 2021).
We move on to the case of fully specified distributions, where we work with
tuples (F1 , y1 ), . . . , (Fn , yn ) of the form (23), where Fi is a posited conditional
CDF for yi (i = 1, . . . , n). Tests for probabilistic calibration then amount to tests
for the uniformity of the (potentially, randomized) PIT values. Wallis (2003) and
Wilks (2019, p. 769) suggest chi-square tests for this purpose, which depend on
binning, and thus are subject to the aforementioned instabilities. To avoid binning, we recommend the use of test statistics that operate on the empirical
CDF of the PIT values, such as the classical Kolmogorov–Smirnov statistic,
as suggested and used to test for PIT calibration by Noceti, Smith and Hodges
(2003) and Knüppel (2015), or, more generally, tests based on distance measures
between the empirical CDF of the PIT values, and the CDF of the standard
uniform distribution that arises under the hypothesis of probabilistic calibration. Recently proposed alternatives arise via e-values (Henzi and Ziegel, 2022).
Similarly, tests for marginal calibration can be based on resamples and distance
measures between F̄ and F0 , or leverage asymptotic theory.
In the distributional setting, arbitrarily many types of reliability can be tested for, and all of the aforementioned tests for unconditional or conditional T-calibration apply. Multiple testing needs to be accounted for properly, and the development of simultaneous tests for various types of calibration would be useful. In this context, let us recall from Theorem 2.16 that, subject to technical conditions, CEP, threshold, and quantile calibration are equivalent, so that tests for CEP calibration (Held, Rufibach and Balabdaoui, 2010; Strähl and Ziegel, 2017), quantile calibration, and threshold calibration assess identical hypotheses.
In a landmark paper, Diebold, Gunther and Tay (1998, p. 867) showed that a sequence of continuous predictive distributions Ft for a sequence Yt of observations
at time t = 0, 1, . . . results in a sequence of independent, uniformly distributed
PITs if Ft is ideal relative to the σ-algebra generated by past observations,
At = σ(Y0 , Y1 , . . . , Yt−1 ). This property does not depend on the continuity of
Ft and continues to hold under general predictive CDFs and the randomized
definition (1) of the PIT (Rüschendorf and de Valk, 1993, Theorem 3).
In the case of continuous predictive distributions, Tsyplakov (2011, Section 2)
noted without proof that if the forecasts Ft are based only on past observations,
i.e., if Ft is At-measurable, then the converse holds, namely, uniform and independent PITs arise only if Ft is ideal relative to At. The following result
formalizes Tsyplakov’s claim and proves it in the general setting, without any
assumption of continuity.
Theorem C.1. Let (Yt )t=0,1,... be a sequence of random variables, and let At =
σ(Y0 , . . . , Yt−1 ) for t = 0, 1, . . . . Furthermore, let (Ft )t=0,1,... be a sequence of
CDFs, such that Ft is At -measurable for t = 0, 1, . . . , and let (Ut )t=0,1,... be a
sequence of independent, uniformly distributed random variables, independent
of the sequence (Yt ). Then the sequence of randomized PITs, (Zt ) = (Ft (Yt −) +
Ut (Ft (Yt ) − Ft (Yt −))) is an independent sequence of uniform random variables
on the unit interval if, and only if, Ft is ideal relative to At , i.e., Ft = L(Yt | At )
almost surely for t = 0, 1, . . . .
The proof utilizes the following simple lemma.
Lemma C.2. Let X, Y, Z be random variables. If X = Z almost surely, then
E[Y | X] = E[Y | Z] almost surely.
Proof. Problem 14 of Breiman (1992, Chapter 4), which is proved by Schmidt
(2011, Satz 18.2.10), states that for random variables X1 and X2 such that
σ(Y, X1) is independent of σ(X2), E[Y | X1, X2] = E[Y | X1] almost surely.
As X − Z = 0 almost surely, σ(X − Z) is independent of any σ-algebra, and
σ(X, X − Z) = σ(Z, X − Z). The statement of the lemma thus follows as
E[Y | X] = E[Y | X, X − Z] = E[Y | Z, X − Z] = E[Y | Z] almost surely.
Proof of Theorem C.1. Since Ft is measurable with respect to At, there exists
a measurable function ft : Rt → F such that Ft = ft(Y0, . . . , Yt−1) for each t by
the Doob–Dynkin Lemma (Schmidt, 2011, Satz 7.1.16).² We define

Gt := ft(G0⁻¹(Z0), . . . , Gt−1⁻¹(Zt−1))

recursively for all t, and show the “only if” assertion by induction.

² Note that f0 is constant, and ft is not a random quantity but a fixed function
that encodes how the predictive distributions are generated from past observations.
The σ-algebra on F, which is implicitly used throughout, is given by
AF = σ({{F ∈ F : F(x) ∈ B} : x ∈ Q, B ∈ B(R)}), where B(R) denotes the
Borel σ-algebra on R. For each x ∈ Q there exists a measurable function fx,t
such that Ft(x) = fx,t(Y0, . . . , Yt−1) by the Doob–Dynkin Lemma, and ft is
essentially the countable (and hence measurable) collection (fx,t)x∈Q.
To this end, let t ≥ 0 and assume the induction hypothesis that Fi is ideal
relative to Ai for i = 0, . . . , t − 1. By Rüschendorf and de Valk (1993, Theorem
3(a)) and the construction of Gt, the induction hypothesis implies

(Y0, . . . , Yt−1) = (F0⁻¹(Z0), . . . , Ft−1⁻¹(Zt−1)) = (G0⁻¹(Z0), . . . , Gt−1⁻¹(Zt−1))

almost surely, where the last vector is σ(Z0, . . . , Zt−1)-measurable. By Lemma
C.2, it follows that

L(Zt | At) = L(Zt | σ(G0⁻¹(Z0), . . . , Gt−1⁻¹(Zt−1))) = U([0, 1])

almost surely, where the second equality stems from the fact that Zt is independent
of σ(G0⁻¹(Z0), . . . , Gt−1⁻¹(Zt−1)) ⊂ σ(Z0, . . . , Zt−1). This independence
implies that Ft is ideal relative to At because

Ft(y) = Q(Zt < Ft(y) | At) ≤ Q(Yt ≤ y | At) ≤ Q(Zt ≤ Ft(y) | At) = Ft(y)

almost surely, where the outer equalities use that Zt is conditionally standard
uniform given At and that Ft(y) is At-measurable, and the inner inequalities
hold since {Zt < Ft(y)} ⊆ {Yt ≤ y} ⊆ {Zt ≤ Ft(y)}. Hence Ft(y) = Q(Yt ≤ y | At)
almost surely for all y ∈ Q, thereby completing both the induction step and the
claim for the base case t = 0.
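As an informal numerical check of Theorem C.1, and not part of the formal
argument, the following R sketch simulates a Gaussian AR(1) process with an
arbitrarily chosen coefficient, issues the ideal one-step-ahead forecasts relative to
the canonical filtration, and inspects the resulting PIT sequence:

    # Informal check: ideal AR(1) forecasts yield iid uniform PITs (the "if"
    # direction of Theorem C.1); the coefficient 0.8 is an arbitrary choice.
    set.seed(1)
    n <- 2000
    y <- numeric(n)
    for (t in 2:n) y[t] <- 0.8 * y[t - 1] + rnorm(1)
    # The ideal forecast of Y_t given the past is N(0.8 * Y_{t-1}, 1); these
    # CDFs are continuous, so the PIT needs no randomization.
    pit <- pnorm(y[-1], mean = 0.8 * y[-n])
    ks.test(pit, "punif")          # uniformity
    cor(pit[-1], pit[-(n - 1)])    # lag-one correlation, near zero under independence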
Evidently, the assumption that no information other than the history of the
time series itself has been utilized to construct the forecasts is very limiting. In
this light, it is not surprising that, while the “if” part of Theorem C.1 is robust,
the “only if” claim fails if Ft is allowed to use information beyond the canonical
filtration, even if that information is uninformative. A simple counterexample
is given by the unfocused forecast from Example 2.2, which is probabilistically
calibrated but fails to be auto-calibrated. Its PIT nevertheless is uniform, and
it remains uniform and independent for autoregressive variants (Tsyplakov,
2011, Section 6).
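For concreteness, the following R sketch simulates an unfocused forecaster,
assuming the classical construction from the literature, namely F = (Φ(· − μ) +
Φ(· − μ − τ))/2 with μ ~ N(0, 1), Y | μ ~ N(μ, 1), and extraneous τ = ±1 with
equal probability; this construction is our assumption here and serves only as a
hypothetical stand-in for Example 2.2.

    # Hedged sketch of an unfocused forecaster, assuming the classical
    # construction F = (pnorm(x - mu) + pnorm(x - mu - tau)) / 2, tau = +/-1.
    set.seed(1)
    n   <- 100000
    mu  <- rnorm(n)
    y   <- rnorm(n, mean = mu)                  # Y | mu ~ N(mu, 1)
    tau <- sample(c(-1, 1), n, replace = TRUE)  # extraneous, uninformative randomness
    pit <- 0.5 * (pnorm(y - mu) + pnorm(y - mu - tau))
    ks.test(pit, "punif")  # uniform PIT values: probabilistic calibration holds,
                           # even though the forecast is not auto-calibrated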
C.2. Details and further results for the Bank of England example
Bank of England forecasts of inflation rates are available within the data accompanying
the quarterly Monetary Policy Report (formerly Inflation Report),
which is available online at https://www.bankofengland.co.uk/sitemap/monetary-policy-report.
The forecasts are visualized and communicated in the form
of fan charts that span prediction intervals at increasing forecast horizons, and
derive from two-piece normal forecast distributions. A detailed account of the
parametrizations of the two-piece normal distribution used by the Bank of
England can be found in Julio (2006), and we have implemented the formulas
in this reference. Historical quarterly CPI inflation rates are published by the
UK Office for National Statistics (ONS) and available online at
https://www.ons.gov.uk/economy/inflationandpriceindices/timeseries/d7g7.
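For orientation, the following minimal R sketch implements the CDF of a
two-piece normal distribution in the standard textbook parametrization, with
mode mu and distinct scales s1 and s2 to the left and right of the mode; the
conversion from the Bank of England's mode, uncertainty, and skewness parameters
follows Julio (2006) and is not reproduced here, and the parameter values
below are hypothetical.

    # Minimal sketch: CDF of a two-piece normal distribution in the standard
    # parametrization (mode mu, scale s1 left of the mode, s2 right of it).
    ptpn <- function(x, mu, s1, s2) {
      w <- s1 / (s1 + s2)  # total probability mass to the left of the mode
      ifelse(x <= mu,
             2 * w * pnorm((x - mu) / s1),
             (s1 - s2) / (s1 + s2) + 2 * (1 - w) * pnorm((x - mu) / s2))
    }
    ptpn(2.3, mu = 2.0, s1 = 0.6, s2 = 0.9)  # PIT of a hypothetical inflation outcome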
We consider forecasts of consumer price index (CPI) inflation based on market
expectations for future interest rates at prediction horizons of zero to six
quarters ahead, valid for the third quarter of 2005 up to the first quarter of 2020,
for a total of n = 59 quarters. These and earlier Bank of England forecasts of
inflation rates have been checked for reliability by Wallis (2003).
Fig 12. Same as Figure 8 in the main text but at a prediction horizon of zero quarters.

Acknowledgments
Funding
Our research has been funded by the Klaus Tschira Foundation. Johannes Resin
gratefully acknowledges support from the German Research Foundation (DFG)
through grant number 502572912.
Supplementary Material
Replication code
(doi: 10.1214/23-EJS2180SUPP; .zip). R code for replication purposes.
References
Gneiting, T. and Raftery, A. E. (2007). Strictly proper scoring rules, prediction,
and estimation. Journal of the American Statistical Association 102
359–378. MR2345548
Gneiting, T. and Ranjan, R. (2013). Combining predictive distributions.
Electronic Journal of Statistics 7 1747–1782. MR3080409
Gneiting, T., Wolffram, D., Resin, J., Kraus, K., Bracher, J., Dimitriadis, T.,
Hagenmeyer, V., Jordan, A. I., Lerch, S., Phipps, K. and Schienle, M. (2023).
Model diagnostics and forecast evaluation for quantiles. Annual Review of
Statistics and Its Application 10 597–621.
Gneiting, T. and Resin, J. (2023). Supplement to “Regression diagnostics
meets forecast evaluation: conditional calibration, reliability diagrams, and
coefficient of determination”. DOI: 10.1214/23-EJS2180SUPP.
Guntuboyina, A. and Sen, B. (2018). Nonparametric shape-restricted regression.
Statistical Science 33 568–594. MR3881209
Guo, C., Pleiss, G., Sun, Y. and Weinberger, K. Q. (2017). On calibration
of modern neural networks. In Proceedings of the 34th International
Conference on Machine Learning (ICML).
Gupta, C., Podkopaev, A. and Ramdas, A. (2020). Distribution-free binary
classification: Prediction sets, confidence intervals and calibration. In Proceedings
of the 34th Conference on Neural Information Processing Systems
(NeurIPS).
Heinrich, C. (2021). On the number of bins in a rank histogram. Quarterly
Journal of the Royal Meteorological Society 147 544–556.
Held, L., Rufibach, K. and Balabdaoui, F. (2010). A score regression approach
to assess calibration of continuous probabilistic predictions. Biometrics
66 1295–1305. MR2758518
Henzi, A., Ziegel, J. F. and Gneiting, T. (2021). Isotonic distributional
regression. Journal of the Royal Statistical Society Series B 83 963–993.
MR4349124
Henzi, A. and Ziegel, J. F. (2022). Valid sequential inference on probability
forecast performance. Biometrika 109 647–663. MR4472840
Holzmann, H. and Eulert, M. (2014). The role of the information set for
forecasting – with applications to risk management. Annals of Applied
Statistics 8 595–621. MR3192004
Hothorn, T., Kneib, T. and Bühlmann, P. (2014). Conditional transformation
models. Journal of the Royal Statistical Society Series B 76 3–27.
MR3153931
Huber, P. J. (1964). Robust estimation of a location parameter. Annals of
Mathematical Statistics 35 73–101. MR0161415
Hyndman, R. J. and Koehler, A. B. (2006). Another look at measures of
forecast accuracy. International Journal of Forecasting 22 679–688.
Jolliffe, I. T. and Stephenson, D. B. (2012). Forecast Verification: A
Practitioner's Guide in Atmospheric Science, second ed. Wiley, Chichester.
Jordan, A. I., Mühlemann, A. and Ziegel, J. F. (2022). Characterizing
the optimal solutions to the isotonic regression problem for identifiable
functionals. Annals of the Institute of Statistical Mathematics 74 489–514.
MR4417369
Julio, J. M. (2006). The fan chart: The technical details of the new implementation.
Working paper, Banco de la República, Colombia.
Nash, J. E. and Sutcliffe, J. V. (1970). River flow forecasting through conceptual
models. Part I: A discussion of principles. Journal of Hydrology 10
282–290.
Nixon, J., Dusenberry, M. W., Zhang, L., Jerfel, G. and Tran, D.
(2019). Measuring calibration in deep learning. In Proceedings of the Computer
Vision and Pattern Recognition (CVPR) Conference Workshops.
Noceti, P., Smith, J. and Hodges, S. (2003). An evaluation of tests of
distributional forecasts. Journal of Forecasting 22 447–455.
Nolde, N. and Ziegel, J. F. (2017). Elicitability and backtesting: Perspectives
for banking regulation. Annals of Applied Statistics 11 1833–1874.
MR3743276
Orjebin, E. (2014). A recursive formula for the moments of a truncated
univariate normal distribution. Working paper,
https://people.smp.uq.edu.au/YoniNazarathy/teaching_projects/studentWork/EricOrjebin_TruncatedNormalMoments.pdf.
Patton, A. J. (2020). Comparing possibly misspecified forecasts. Journal of
Business & Economic Statistics 38 796–809. MR4154889
Pinson, P. and Hagedorn, R. (2012). Verification of the ECMWF ensemble
forecasts of wind speed against analyses and observations. Meteorological
Applications 19 484–500.
Pleiss, G., Raghavan, M., Wu, F., Kleinberg, J. and Weinberger, K. Q.
(2017). On fairness and calibration. In Proceedings of the 31st Conference on
Neural Information Processing Systems (NIPS).
Pohle, M. O. (2020). The Murphy decomposition and the calibration-resolution
principle: A new perspective on forecast evaluation. Preprint,
arXiv:2005.01835.
Robertson, T. and Wright, F. T. (1980). Algorithms in order restricted
statistical inference and the Cauchy mean value property. Annals of Statistics
8 645–651. MR0568726
Robertson, T., Wright, F. T. and Dykstra, R. L. (1988). Order Restricted
Statistical Inference. Wiley, Chichester. MR0961262
Roelofs, R., Cain, N., Shlens, J. and Mozer, M. C. (2022). Mitigating
bias in calibration error estimation. In Proceedings of the 25th International
Conference on Artificial Intelligence and Statistics (AISTATS).
Rüschendorf, L. (2009). On the distributional transform, Sklar's theorem,
and the empirical copula process. Journal of Statistical Planning and Inference
139 3921–3927. MR2553778
Rüschendorf, L. and de Valk, V. (1993). On regression representations of
stochastic processes. Stochastic Processes and their Applications 46 183–198.
MR1226406
Sahoo, R., Zhao, S., Chen, A. and Ermon, S. (2021). Reliable decisions with
threshold calibration. In Advances in Neural Information Processing Systems.
Satopää, V. and Ungar, L. (2015). Combining and extremizing real-valued
forecasts. Preprint, arXiv:1506.06405.
Satopää, V. A. (2021). Improving the wisdom of crowds with analysis of
variance of predictions of related outcomes. International Journal of Forecasting
37 1728–1747.
Savage, L. J. (1971). Elicitation of personal probabilities and expectations.
Journal of the American Statistical Association 66 783–801.