USING LOGISTIC MODEL CALIBRATION TO ASSESS
THE QUALITY OF PROBABILITY PREDICTIONS

Frank E. Harrell, Jr.
Kerry L. Lee

Division of Biometry, Duke University Medical Center
Box 3363, Durham, North Carolina 27710, USA
SUMMARY

We used a logistic calibration model (Cox, 1958a) to partition a logarithmic scoring rule (used to assess the quality of probability predictions) into an index of discrimination and three indexes of unreliability. An index of overall quality that is not penalized for a prevalence correction is also proposed. Various tests for discrimination and unreliability arise immediately from these indexes. Power properties of a test for unreliability are studied.
1. INTRODUCTION
The assessment of predictive accuracy is of central importance in validating and comparing either subjective or model-based predictions of event outcomes. When one is predicting a continuous outcome measurement, using ordinary multiple linear regression for example, assessment of the quality of predictions can be carried out in a straightforward way using scatter diagrams, correlations (predicted with observed), and error estimates (predicted-observed outcomes). When, however, the prediction is the probability that a particular event will occur, assessment of predictive quality is much more difficult due to the binary nature of the outcome.
Two commonly used concepts of quality of predictions are the discrimination of a predictor (often called its refinement), i.e., the ability of the predictor to separate or rank-order observations with different outcomes, and its reliability (sometimes called validity or degree of being calibrated) - the "correctness" of predictions on an absolute scale. If, for example, a predictor assigned a 20% probability of disease for each of a homogeneous group of 100 patients and 20 patients were later diagnosed to have the disease, the predictions would be reliable. Even though reliability is a simpler concept, discrimination is easier to uniquely quantify. For example, we may calculate the concordance probability from the Wilcoxon-Mann-Whitney statistic - the proportion of pairs of subjects, one with and one without the outcome being predicted, such that the subject with the outcome had the higher predicted probability (see Harrell et al., 1982). We refer to this concordance measure as the c-index. This measure is equivalent to the area under a "receiver operating characteristic" curve (Hanley and McNeil, 1982) and is a linear translation of Somers' rank correlation between predicted probabilities and the binary event indicator, which in the absence of ties on predicted values is also equivalent to the Goodman-Kruskal rank correlation coefficient (Goodman and Kruskal, 1979).
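To make the c-index concrete, here is a minimal Python sketch (our illustration, not the authors' software; the function name and the use of NumPy are our own) that computes the concordance probability by comparing every pair of subjects with and without the outcome:

```python
import numpy as np

def c_index(y, p):
    """Concordance probability: the proportion of (case, control) pairs in
    which the subject with the outcome (y=1) had the higher predicted
    probability; ties on predictions count one-half."""
    y = np.asarray(y, dtype=int)
    p = np.asarray(p, dtype=float)
    cases, controls = p[y == 1], p[y == 0]
    # All pairwise differences between case and control predictions
    diff = cases[:, None] - controls[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

# Perfectly rank-ordered predictions give c = 1; useless ones give c = 0.5.
```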
Reliability is traditionally assessed by estimating the prevalence of the event in question for each level of predicted probability. This method works well when either only a few unique predictions are made or the sample is extremely large. When the predictions vary continuously from 0 to 1, some grouping of the probabilities is usually necessary. This may be accomplished by rounding predicted probabilities into intervals or by constructing quantile groups. Once grouping is done, a variety of goodness of fit tests are available for detecting unreliability (Lemeshow and Hosmer, 1982). A simple version of this grouped assessment is sketched below.
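The following sketch (again our own illustration, with hypothetical function names) forms quantile groups of the predictions and compares the mean prediction with the observed prevalence in each group:

```python
import numpy as np

def grouped_reliability(y, p, n_groups=10):
    """Compare mean predicted probability with observed event prevalence
    within quantile groups of the predictions."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    # Quantile cut points; duplicates removed in case of heavily tied predictions
    edges = np.unique(np.quantile(p, np.linspace(0, 1, n_groups + 1)))
    group = np.clip(np.digitize(p, edges[1:-1]), 0, len(edges) - 2)
    out = []
    for g in range(len(edges) - 1):
        mask = group == g
        if mask.any():
            out.append((p[mask].mean(), y[mask].mean(), int(mask.sum())))
    return out  # list of (mean prediction, observed prevalence, group size)
```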
The method of grouping predicted probabilities to assess reliability has several drawbacks when applied to predictions that range continuously from 0 to 1. The most serious of these is the fact that one's assessment of reliability can change significantly depending on how the groups are formed. In addition, when one wants to test whether the predictions are "significantly unreliable", i.e., whether the observed prevalence differs significantly from the predicted values, an ordinary chi-square goodness-of-fit test lacks power. If separate samples were used for deriving and assessing predictions, and the probabilities were divided into ten groups, the chi-square statistic has 9 degrees of freedom (d.f.), which has a critical value of 16.9 at the 5% level. If unreliability could be described with only 2 d.f., the critical value is reduced to 6.
Various indexes have been proposed for assessing the accuracy of probability forecasts [see DeGroot and Fienberg, 1982, Hilden, Habbema, and Bjerregaard, 1978, and Spiegelhalter, 1986 for detailed discussions and bibliographies]. One commonly used accuracy index is Brier's (1950) quadratic scoring rule. This index is a "proper scoring rule", meaning that a predictor optimizes the Brier index by predicting the true probability of the event in question. Brier's index has been decomposed into reliability and discrimination components (Hilden et al., 1978, Spiegelhalter, 1986, Yates, 1982, Blattenberger and Lad, 1985, DeGroot and Fienberg, 1982) and a test for reliability based on one decomposition has been proposed (Hilden et al., 1978).
Logarithmic scoring rules (Good, 1952) are also popular, and Cox (1958b) has presented one related test for reliability based on linear log odds alternatives to perfect reliability. In contrast to Brier's index, less work has been done to decompose logarithmic scoring rules into indexes of reliability and discrimination. The method presented in Section 2 allows decomposition of an overall quality measure into discrimination and various unreliability components, each chance-corrected, and admits straightforward likelihood ratio and score tests for significant discrimination and unreliability.
Throughout the discussion we assume that the rule generating the predictions is stochastically independent of the outcomes used to assess them. This is true if, for example, a regression model (e.g., a logistic model) was derived from a "training" sample and predictive accuracy was assessed on an independent "test" sample, which is often the only way to obtain an unbiased validation of the entire modeling process.
2. DESCRIBING ACCURACY USING CALIBRATION
Suppose that one could estimate the "calibration curve" - the relationship between the predicted probability and the true probability of the event. Given an estimate of this relationship, one way to quantify the unreliability of predictions is to measure what has to be done to make the calibration curve superimposed on the ideal curve (a 45 degree line). Discrimination, on the other hand, is related to whether or not the predictions are in any way related to the outcomes, i.e., whether or not the calibration curve is horizontal.
Thus the problem of quantifying unreliability and discrimination can be solved by estimating the relationship between predicted and observed values. Since observed values are binary, the relationship is stated in terms of the probability that the outcome occurs. The method of maximum likelihood can be used to estimate this relationship even when no two predictions are the same. The method only assumes that predictions are related to outcomes through a smooth curve that interpolates between different predictions.
For a simple predictor variable X, the logistic regression model (Cox, 1958a, 1966, Walker and Duncan, 1967) relates X to the probability of an event. Let the event or outcome variable be denoted by Y, where Y=1 when the event occurs and Y=0 otherwise. The model is as follows:

Prob(Y=1 | X) = 1 / (1 + exp[-(a + bX)]) ,   (2.1)
where exp(x) is e^x, the natural antilogarithm. Cox (1958b, 1970) proposed using the linear logistic model to relate "subjective probabilities" to "objective probabilities". Let the predicted probabilities of Y_1, Y_2, ..., Y_n = 1 be denoted by P_1, P_2, ..., P_n for n subjects, or cases. Let the true (calibrated) probabilities be denoted by P_1', P_2', ..., P_n'. We can estimate P_i' given P_i by estimating the relationship between P_i and Y_i. P_i is first transformed from a 0-1 scale to an unlimited scale to better fit the model. The logistic calibration model is:

Prob(Y_i=1 | P_i) = 1 / (1 + exp[-(a + bL_i)]) ,   (2.2)

where L_i = logit(P_i) = log[P_i/(1-P_i)]. The model can be restated as

P_i' = 1 / (1 + exp(-a) [P_i/(1-P_i)]^(-b)) .   (2.3)
Figure 1 shows the shape of calibration curves for various values of a and b. The ideal relationship (no calibration required) is found on the curve marked a=0, b=1.

--- Figure 1 About Here ---

Note that when a=0 and b=1, P_i' = P_i, and no calibration is required. When no slope calibration is required (b=1),
P_i' = P_i / [P_i + (1-P_i) exp(-a)] .   (2.4)
In this case, where only a prevalence adjustment is made, exp(a) is the odds ratio of the corrected to the uncorrected overall prevalence, and the calibration model is identical to the simplest form of Bayes' rule. It should be recognized that a simpler calibration model such as P_i' = a + bP_i cannot be used because this would allow P_i' to be less than 0 or greater than 1.
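For illustration, the calibration transform (2.3) is easy to apply directly. The following Python sketch (our own code with a hypothetical function name, NumPy assumed) reproduces the shrinkage example discussed in the next paragraph:

```python
import numpy as np

def calibrate(p, a, b):
    """Logistic calibration of predicted probabilities, equation (2.3):
    P' = 1 / (1 + exp(-a) * (p/(1-p))**(-b))."""
    p = np.asarray(p, dtype=float)
    return 1.0 / (1.0 + np.exp(-a) * (p / (1.0 - p)) ** (-b))

# With a=0 and b=1 the predictions are returned unchanged (no calibration
# needed); a=0, b=.5 shrinks .1 to .25 and .9 to .75.
```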
In many cases where a model has not been developed carefully on a dataset, predictions from the model will be found too extreme when they are validated in an independent sample because of overfitting the original dataset. For example, a probability of death of .1 may need to be calibrated to .25, and a prediction of .9 calibrated to .75. The corresponding logistic calibration for this example is obtained using a=0, b=.5 in equation (2.3). If predictions need to be shrunk symmetrically toward a probability of .5 and a predicted probability of P is calibrated to a value of P', the calibrating equation is derived from (2.3) using a=0, b=logit(P')/logit(P).

The parameters a and b can be estimated by maximizing the likelihood of the observed data (P_i, Y_i), i=1,2,...,n, or equivalently by minimizing -2 times the log-likelihood function,

L = -2 Σ_{i=1}^{n} [Y_i log(P_i') + (1-Y_i) log(1-P_i')]
  = -2 Σ_{i=1}^{n} [Y_i (a+bL_i) - log(1 + exp(a+bL_i))] .   (2.5)

L can be thought of as measuring information or quality of the predictions in relation to the outcomes, given a and b.
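The parameters can be estimated with any general-purpose optimizer. Below is a minimal sketch using SciPy (our own code, not the SAS macro of Section 6; function names are hypothetical) that minimizes the -2 log-likelihood (2.5):

```python
import numpy as np
from scipy.optimize import minimize

def neg2_loglik(params, logit_p, y):
    # Equation (2.5): L = -2 * sum[ Y_i*(a + b*L_i) - log(1 + exp(a + b*L_i)) ]
    a, b = params
    eta = a + b * logit_p
    return -2.0 * np.sum(y * eta - np.logaddexp(0.0, eta))

def fit_calibration(p, y):
    """Maximum likelihood estimates of (a, b) in the calibration model (2.2)."""
    p, y = np.asarray(p, dtype=float), np.asarray(y, dtype=float)
    logit_p = np.log(p / (1.0 - p))
    fit = minimize(neg2_loglik, x0=[0.0, 1.0], args=(logit_p, y), method="BFGS")
    return fit.x, fit.fun  # (a_hat, b_hat) and the minimized value L(a,b)
```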
3. DERIVATION OF ACCURACY INDEXES
The following notation will be used:

L(a,b) = minimum L over all a, b
L(a,1) = minimum L over all a subject to b=1
L(a,0) = minimum L over all a subject to b=0
       = -2 Σ [Y_i log(P̄) + (1-Y_i) log(1-P̄)]
L(0,1) = value of L at a=0, b=1
       = -2 Σ [Y_i log(P_i) + (1-Y_i) log(1-P_i)] ,

where P̄ = Σ Y_i / n. We compute the unreliability U of the predictions from the difference in quality of the uncalibrated predictions and the quality of slope- and intercept-calibrated predictions:

U = [L(0,1) - L(a,b) - 2]/n .   (3.1)

Since L(0,1) - L(a,b) is a likelihood ratio statistic for testing H_0: a=0, b=1 with an asymptotic chi-square distribution having expected value 2 if H_0 is true, U has expected value 0 if the predictions are reliable. Division by the sample size makes the range of U independent of n. U can be decomposed into U = U_1 + U_2, where U_1 is the unreliability due to the need for an overall prevalence correction (correction of intercept on the logit scale) and U_2 is unreliability due to the need for a slope correction given any needed prevalence correction:

U_1 = [L(0,1) - L(a,1) - 1]/n ,   (3.2)

U_2 = [L(a,1) - L(a,b) - 1]/n .   (3.3)

U_1 is the difference in quality of the uncalibrated predictions and the best intercept-calibrated predictions. U_2 is the difference in quality of the best intercept-calibrated predictor and the best slope- and intercept-calibrated predictor. The -1 term causes each index to have expected value 0 if the corresponding type of unreliability is truly absent. Large values of the unreliability indexes mean that the predictions are unreliable. Negative values indicate better reliability than one would expect by chance.
Likelihood ratio statistics are immediately available for testing each type of unreliability:

Null Hypothesis                            LR Statistic       Asymptotic d.f.
Significant total unreliability,           L(0,1) - L(a,b)    2    (3.4)
  H_0: a=0, b=1
Significant unreliability due to           L(0,1) - L(a,1)    1    (3.5)
  overall prevalence error,
  H_0: a=0 given b=1
Significant unreliability due to           L(a,1) - L(a,b)    1    (3.6)
  slope error given prevalence
  correction, H_0: b=1
Simple score tests are also available for testing the first two hypotheses above, avoiding the need for iterative calculations (Rao, 1973). A 2 d.f. asymptotic chi-square score statistic for H_0: a=0, b=1 is given by

s A sᵀ ,  where  s = [Σ(Y_i-P_i), ΣL_i(Y_i-P_i)]  and

A = [ ΣP_i(1-P_i)      ΣL_iP_i(1-P_i)  ]⁻¹
    [ ΣL_iP_i(1-P_i)   ΣL_i²P_i(1-P_i) ] .   (3.7)

A 1 d.f. test statistic for H_0: a=0 given b=1 is

[Σ(Y_i-P_i)]² / ΣP_i(1-P_i) .   (3.8)

These score tests turn out to be identical to those proposed by Cox (1958b). Cox also presented a test for whether predicted probabilities are overly dispersed even though they are correct on the average (H_0: b=1 given a=0).

The index of discrimination is derived by computing the difference in quality of the best constant predictor (one that on the average correctly predicts the overall prevalence of the event) and the best calibrated predictor:

D = [L(a,0) - L(a,b) - 1]/n .   (3.9)

D has expected value 0 if there is no discrimination (b=0). The likelihood ratio statistic for testing whether the predictions have any discriminatory ability (H_0: b=0) is L(a,0) - L(a,b), having asymptotically a chi-square distribution with 1 d.f. under H_0.

An overall summary index for the quality of predictions is derived by computing the difference in quality between the best constant predictor and the quality of the predictions as they stand (with no calibration):

Q = [L(a,0) - L(0,1) + 1]/n .   (3.10)

It can readily be seen that Q = discrimination - total unreliability = D - U. The summary index Q is a simple translation of the logarithmic scoring rule (see Good (1952), Cox (1970, Eq. 4.34), and Shapiro (1977)). The reference point for Q is the best constant predictor, whereas other authors used as reference a predictor having constant probability 0.5.
The value of Q is invariant with respect to the form of the calibrating model. This result follows from 1) when a=0 and b=1, P_i' = P_i, making the log-likelihood function (here L(0,1)) dependent only on the observed data, and 2) when b=0, the calibrated probabilities do not make use of the predicted probabilities, so that P_i' = P̄, the overall proportion of events that occurred (ΣY_i/n). Hence (3.10) reduces to

Q = (2/n) Σ [Y_i log(P_i/P̄) + (1-Y_i) log((1-P_i)/(1-P̄))] + 1/n .   (3.11)
An index of quality can also be constructed which does not penalize predictions for being wrong by a constant prevalence correction. This index is derived from the difference in quality of the best intercept-corrected predictions and the quality of the best constant predictor:

Q_2 = [L(a,0) - L(a,1)]/n = D - U_2 .   (3.12)
Since score statistics can be used to approximate likelihood ratio statistics, simpler unreliability indexes can be constructed by substituting (3.8) for L(0,1)-L(a,1) in (3.2) and (3.7) for L(0,1)-L(a,b) in (3.1). This approach has two disadvantages, though. First, score statistics are not additive as are partitions of log-likelihood. Second, score statistics may not adequately quantify information content for situations far from the null hypothesis.
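Putting the pieces of Section 3 together, the sketch below (our own code, assuming NumPy/SciPy, with hypothetical names) computes the four constrained log-likelihoods and the indexes U, U_1, U_2, D, and Q:

```python
import numpy as np
from scipy.optimize import minimize, minimize_scalar

def _L(a, b, logit_p, y):
    # -2 log-likelihood of the calibration model, equation (2.5)
    eta = a + b * logit_p
    return -2.0 * np.sum(y * eta - np.logaddexp(0.0, eta))

def accuracy_indexes(p, y):
    """U (3.1), U_1 (3.2), U_2 (3.3), D (3.9), and Q (3.10)."""
    p, y = np.asarray(p, dtype=float), np.asarray(y, dtype=float)
    n = len(y)
    lp = np.log(p / (1.0 - p))
    L_ab = minimize(lambda t: _L(t[0], t[1], lp, y), [0.0, 1.0]).fun
    L_a1 = minimize_scalar(lambda a: _L(a, 1.0, lp, y)).fun
    pbar = y.mean()  # best constant predictor is the overall prevalence
    L_a0 = -2.0 * np.sum(y * np.log(pbar) + (1 - y) * np.log(1 - pbar))
    L_01 = _L(0.0, 1.0, lp, y)
    return {"U":  (L_01 - L_ab - 2) / n,
            "U1": (L_01 - L_a1 - 1) / n,
            "U2": (L_a1 - L_ab - 1) / n,
            "D":  (L_a0 - L_ab - 1) / n,
            "Q":  (L_a0 - L_01 + 1) / n}   # note Q = D - U
```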
4. POWER OF TEST FOR UNRELIABILITY
The likelihood ratio test for total unreliability given in (3.4) is difficult to study because of the iterative calculations required. It has been shown in a similar situation that score tests have power functions equivalent to those of likelihood ratio tests (Lee et al., 1983). Therefore we study the power properties of the score test given by (3.7).

In general, E(Y_i) = P_i', and the score vector [Σ(Y_i-P_i), ΣL_i(Y_i-P_i)]ᵀ is asymptotically normal with mean vector and covariance matrix given respectively by
u = [ Σ(P_i'-P_i)    ]
    [ ΣL_i(P_i'-P_i) ] ,
                                        (4.1)
V = [ ΣP_i'(1-P_i')     ΣL_iP_i'(1-P_i')  ]
    [ ΣL_iP_i'(1-P_i')  ΣL_i²P_i'(1-P_i') ] .
It follows that the score statistic for testing H_0: a=0, b=1 has mean m and variance v given by

m = tr(AV) + uᵀAu ,
                                        (4.2)
v = 2 tr[(AV)²] + 4 uᵀAVAu ,

where A is the matrix inverse in (3.7). The distribution of (3.7) can be approximated by a scalar multiple of a non-central chi-square random variable, c chi²_2(λ), with 2 d.f. and noncentrality λ, by equating the first two moments of such a distribution to (4.2) (Johnson and Kotz, 1970), yielding

c = [m - (m² - v)^(1/2)] / 2 ,
                                        (4.3)
λ = m/c - 2 .

If chi²_{2,1-α} represents the 1-α quantile of a central chi-square distribution having 2 d.f., the power of an α-level test for unreliability can be approximated by

Prob[c chi²_2(λ) > chi²_{2,1-α}] = Prob[chi²_2(λ) > chi²_{2,1-α}/c] .   (4.4)
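The moment-matching approximation (4.1)-(4.4) can be sketched as follows (our code; SciPy's ncx2 supplies the noncentral chi-square distribution, and the function name is hypothetical):

```python
import numpy as np
from scipy.stats import chi2, ncx2

def approx_power(p, p_true, alpha=0.05):
    """Approximate power of the 2 d.f. score test (3.7) against true
    probabilities p_true, via the c*chi2_2(lambda) approximation."""
    p = np.asarray(p, dtype=float)
    pt = np.asarray(p_true, dtype=float)
    lp = np.log(p / (1.0 - p))
    X = np.column_stack([np.ones_like(lp), lp])
    u = X.T @ (pt - p)                                  # mean of score vector, (4.1)
    V = (X * (pt * (1 - pt))[:, None]).T @ X            # its covariance matrix
    A = np.linalg.inv((X * (p * (1 - p))[:, None]).T @ X)  # matrix inverse in (3.7)
    m = np.trace(A @ V) + u @ A @ u                     # (4.2)
    v = 2.0 * np.trace(A @ V @ A @ V) + 4.0 * u @ A @ V @ A @ u
    c = (m - np.sqrt(max(m * m - v, 0.0))) / 2.0        # (4.3)
    lam = max(m / c - 2.0, 0.0)
    crit = chi2.ppf(1.0 - alpha, df=2)                  # central chi-square quantile
    return ncx2.sf(crit / c, df=2, nc=lam)              # (4.4)
```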
To test the adequacy of this approximation as well as the adequacy of the central chi-square null distribution for (3.7), 2000 samples of varying sizes were simulated for each of a variety of setups in which only two distinct predictions were made. Here k observations were assigned a predicted probability of p_1 and k were assigned a probability of p_2. Corresponding actual population probabilities were p_1' and p_2'. Power was estimated by computing the fraction of score statistics exceeding chi²_{2,.95} = 5.99. Results for α=.05 are found in Table 1. It can be seen that the type I error is well controlled and that (4.4) appears to be a satisfactory power approximation.

--- Table 1 About Here ---
The power approximation in (4.4) can be used to estimate the sample size necessary to achieve a power of β for an α-level test of total unreliability. Suppose that k predictions are made at each of g probability levels, p_1, p_2, ..., p_g, and that the true probabilities are p_1', p_2', ..., p_g'. The total sample size is thus n = kg. Define l_j = logit(p_j). Then the quantities in (4.2) are given by

m = tr(A*V*) + k u*ᵀA*u* ,
                                        (4.5)
v = 2 tr[(A*V*)²] + 4k u*ᵀA*V*A*u* ,

where

A* = [ Σp_j(1-p_j)     Σl_jp_j(1-p_j)  ]⁻¹
     [ Σl_jp_j(1-p_j)  Σl_j²p_j(1-p_j) ] ,

V* = [ Σp_j'(1-p_j')     Σl_jp_j'(1-p_j')  ]
     [ Σl_jp_j'(1-p_j')  Σl_j²p_j'(1-p_j') ] ,   (4.6)

u* = [ Σ(p_j'-p_j)    ]
     [ Σl_j(p_j'-p_j) ] ,

and all summations are over j=1, 2, ..., g.
The following iterative algorithm (4.7) converges quickly to a solution for k:

Initialize k_last = 0, c = 1.
Loop X: Compute λ*, the noncentrality parameter of a chi²_2 distribution such
        that Prob[chi²_2(λ*) < chi²_{2,1-α}/c] = 1-β.
    Set m* = c(λ* + 2).
    Set k = [m* - tr(A*V*)] / (u*ᵀA*u*).
    Set c = [m - (m² - v)^(1/2)] / 2, where m and v are computed from (4.5)
        at the current k.
    If |k - k_last| < 1, stop.
    Set k_last = k; go to loop X.
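A direct transcription of this algorithm (as reconstructed above; our own code, with hypothetical names) solves for the number of predictions k per probability level:

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import chi2, ncx2

def k_per_level(p, p_true, alpha=0.05, beta=0.80, max_iter=100):
    """Iterate the algorithm of Section 4 to find k, the number of
    predictions per probability level, giving power beta at level alpha."""
    p = np.asarray(p, dtype=float)
    pt = np.asarray(p_true, dtype=float)
    lp = np.log(p / (1.0 - p))
    X = np.column_stack([np.ones_like(lp), lp])
    u = X.T @ (pt - p)                                  # u* of (4.6)
    V = (X * (pt * (1 - pt))[:, None]).T @ X            # V*
    A = np.linalg.inv((X * (p * (1 - p))[:, None]).T @ X)  # A*
    crit = chi2.ppf(1.0 - alpha, df=2)
    c, k_last = 1.0, 0.0
    for _ in range(max_iter):
        # lambda* such that Prob[chi2_2(lambda*) < crit/c] = 1 - beta
        lam = brentq(lambda l: ncx2.cdf(crit / c, 2, l) - (1.0 - beta), 1e-9, 1e4)
        m = c * (lam + 2.0)                             # invert lambda = m/c - 2
        k = (m - np.trace(A @ V)) / (u @ A @ u)         # invert (4.5) for k
        m = np.trace(A @ V) + k * (u @ A @ u)           # recompute (4.5) at this k
        v = 2.0 * np.trace(A @ V @ A @ V) + 4.0 * k * (u @ A @ V @ A @ u)
        c = (m - np.sqrt(max(m * m - v, 0.0))) / 2.0
        if abs(k - k_last) < 1.0:
            break
        k_last = k
    return int(np.ceil(k))
```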
5. EXAMPLES OF VALUES OF THE INDEXES
To help in interpreting the values of the indexes, consider a series of simple examples in which k subjects receive one prediction, p_1, and another k subjects receive a predicted probability of p_2. The first k subjects have an observed prevalence of the event of o_1 and the second k have a prevalence of o_2. The resulting calibration parameter estimates (a and b), accuracy indexes, and chi-square statistics are in Table 2 for k=100. For comparison, the c index is also given, along with a version of Brier's score defined by B = 1 minus the average of (Y_i - P_i)².

--- Table 2 About Here ---
Lines 1 and 2 in the table demonstrate the values of the indexes when there is perfect reliability and low to moderate discrimination, respectively. Similarly, lines 3-5 correspond to backwards predictions (e.g. predict .25 and .75, observe .75 and .25) with increasing discrimination. Total unreliability is statistically significant for lines 3-6, 8 and 9. Lines 6-9 are more typical examples of unreliability. The measure of overall quality, Q, is negative (lines 3-6) when the discrimination is not good enough to overcome serious unreliability. The index of discrimination, D, ranked the discrimination of the predictions in the same order as the absolute value of c-.5. The rankings of Q and B are very similar, but both differ from those of the absolute value of c-.5; they have rankings similar to those of c itself.
It appears that predictions for which U does not exceed about 0.05 are reliable for the most part. Statistical significance of U can also be used to quantify unreliability, although the power of this assessment depends on the sample size (significant unreliability is present at the α=.05 level if U > 3.99/n; for U_1 and U_2 the critical levels are 2.84/n). It can be shown that for this situation (k predictions at each of two probabilities), the unreliability index is given by

U = o_1 log(o_1/p_1) + (1-o_1) log[(1-o_1)/(1-p_1)]
  + o_2 log(o_2/p_2) + (1-o_2) log[(1-o_2)/(1-p_2)] - 2/n .   (5.1)
The analyst can use (5.1) to estimate acceptable levels of U for fixed p_1, p_2 by varying o_1 and o_2 and judging U by whether o_1 and o_2 are meaningfully different from p_1 and p_2. A plot of U with respect to o_1 and o_2 is shown in Figure 2 when k=100 (n=200) for the four combinations p_1=.25, p_2=.25; p_1=.45, p_2=.85; p_1=.6, p_2=.4; and p_1=.1, p_2=.9.
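Equation (5.1) is simple to evaluate over a grid of observed proportions, which is how contour plots like Figure 2 can be produced; a minimal sketch (ours, with hypothetical names):

```python
import numpy as np

def U_two_groups(o1, o2, p1, p2, n):
    """Unreliability index U of equation (5.1) for k = n/2 predictions at
    each of two probabilities p1 and p2, with observed prevalences o1, o2."""
    def term(o, p):  # one group's contribution to the log-likelihood ratio
        return o * np.log(o / p) + (1 - o) * np.log((1 - o) / (1 - p))
    return term(o1, p1) + term(o2, p2) - 2.0 / n

# Example: acceptable-U contours for p1=.25, p2=.75 are traced by varying
# o1 and o2 around (.25, .75), e.g. U_two_groups(.30, .70, .25, .75, n=200).
```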
To show examples of the values of the new indexes as well as the resulting estimates of the calibration or reliability curves when the predictions are continuous, the predictive accuracy of two logistic regression models was considered. For both models, the outcome variable was complete response to treatment of non-Hodgkin's lymphoma, and the predictions were developed (Harrell et al., 1985) using a training sample of 110 patients (50 with complete response, 60 without). These predictions were evaluated on a separate test sample of 116 patients. The first model was developed using a standard stepwise variable selection method with 25 candidate variables, far too many for only 50 cases of complete response. The second model used the incomplete principal component method, which effectively reduced the 25 variables to only 1. The accuracy indexes are found in Table 3. Corresponding p-values for significant unreliability or discrimination are in parentheses. Reliability plots using the estimated a and b may be found in Figure 1. The results indicate significant unreliability (both types) and little discriminatory ability for model 1, resulting in unacceptable predictions (Q = -.18). The extreme predictions from model 1 cannot be trusted, which is frequently the case when too many predictor variables are used with small sample sizes. Model 2 has moderate need for a prevalence correction (U_1 = .05) but not for a slope correction, and has better discrimination than model 1, resulting in far better overall quality (Q = .03 vs. -.18). This improvement in predictive accuracy is due to the data reduction resulting from fitting principal components.
--- Table 3 About Here ---
6. COMPUTER SOFTWARE

A SAS (1985) macro is available from the authors for calculating all of the indexes mentioned in this paper as well as for drawing the reliability plot. Another SAS program is available for power and sample size calculations based on (4.4) and (4.7).
7. CONCLUSIONS

We sought a method of assessing predictive quality having the following properties: (1) no grouping of predictions is required, (2) an overall measure of the quality of predictions can be formally decomposed into a simple sum of indexes of unreliability and discrimination, (3) the index of unreliability can be further decomposed into an index of unreliability due to the need for an overall prevalence (constant) correction and unreliability due to a more complicated correction, (4) the method yielded as a byproduct an index of overall predictive quality that was not penalized for a prevalence correction, and (5) the method automatically yields formal statistical tests (with reasonable power) for significant unreliability (and its two components) and for significant discriminatory ability. The logistic regression model, when used to calibrate predicted probabilities to observed outcomes, was useful in meeting these goals. The power approximation given in (4.4) is adequate for estimating the sample size needed to conduct studies such as those designed to test the diagnostic accuracy of physicians or probability models.
ACKNOWLEDGEMENTS

This research was supported by the National Center for Health Services Research, and by the National Library of Medicine and National Heart, Lung, and Blood Institute of the National Institutes of Health. We thank Ms. Cristy Vollmar for the careful typing of the manuscript, Barbara Pollock for providing expert technical assistance, and Robert Rosati and David Pryor for motivating our work.
REFERENCES

1. Blattenberger G, Lad F (1985): Separating the Brier score into calibration and refinement components: a graphical example. Am Statistician 39:26-32.
2. Brier GW (1950): Verification of forecasts expressed in terms of probability. Monthly Weather Review 78:1-3.
3. Cox DR (1958a): The regression analysis of binary sequences (with discussion). J Roy Statist Soc B 20:215-242.
4. Cox DR (1958b): Two further applications of a model for binary regression. Biometrika 45:562-565.
5. Cox DR (1966): Some procedures connected with the logistic qualitative response curve. In Research Papers in Statistics: Essays in Honour of J. Neyman's 70th Birthday, pp. 55-71, Ed. F.N. David. London: Wiley.
6. Cox DR (1970): The Analysis of Binary Data. London: Methuen, pp. 52-54.
7. DeGroot MH, Fienberg SE (1982): Assessing probability assessors: calibration and refinement. In Statistical Decision Theory and Related Topics III, Vol. 1. New York: Academic Press.
8. Good IJ (1952): Rational decisions. J Roy Statist Soc B 14:107-114.
9. Goodman LA, Kruskal WH (1979): Measures of Association for Cross Classifications. New York: Springer-Verlag.
10. Hanley JA, McNeil BJ (1982): The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143:29-36.
11. Harrell FE, Califf RM, Pryor DB, et al (1982): Evaluating the yield of medical tests. J Am Med Assoc 247:2543-2546.
12. Harrell FE, Lee KL, Matchar DB, Reichert TA (1985): Regression models for prognostic prediction: advantages, problems, and suggested solutions. Cancer Treatment Reports 69:1071-1077.
13. Hilden J, Habbema JDF, Bjerregaard B (1978): The measurement of performance in probabilistic diagnosis. III. Methods based on continuous functions of the diagnostic probabilities. Methods of Information in Medicine 17:238-246.
14. Johnson NL, Kotz S (1970): Distributions in Statistics: Continuous Univariate Distributions-2, pp. 165-166. New York: Wiley.
15. Lee KL, Harrell FE, Tolley HD, Rosati RA (1983): A comparison of test statistics for assessing the effects of concomitant variables in survival analysis. Biometrics 39:341-350.
16. Lemeshow S, Hosmer DW (1982): A review of goodness of fit statistics for use in the development of logistic regression models. Am J Epidemiology 115:92-106.
17. Rao CR (1973): Linear Statistical Inference and Its Applications, Second Edition, pp. 418-419. New York: Wiley.
18. SAS Institute (1985): SAS User's Guide: Basics, Version 5 Edition. Cary, NC: SAS Institute, Inc., pp. 643-727.
19. Shapiro AR (1977): The evaluation of clinical predictions: a method and initial application. New England J Med 296:1509-1514.
20. Spiegelhalter DJ (1986): Probabilistic prediction in patient management and clinical trials. Statistics in Medicine 5:421-433.
21. Walker SH, Duncan DB (1967): Estimation of the probability of an event as a function of several independent variables. Biometrika 54:167-179.
22. Yates JF (1982): External correspondence: decompositions of the mean probability score. Organizational Behavior and Human Performance 30:132-156.
Table 1
Simulated and Approximated Power of Score Test for Unreliability
α = .05
(columns: k; predicted probabilities p_1, p_2; true probabilities p_1', p_2'; simulated power; power by (4.4))
2
a
4
5
6
7
8
g
Tabla 2
Examples of Values of the Indexes
mh ym 8 by ue ed oP gg 8
A060 40 660 01-005 0-005 0 01 0 0k 8.08 60 76
125,75 28 JB 0 1-005 O05 0-01 0 26 52 27.75.81
M060 60 40 0) -1-.005 O .160 32 18 32 08 H-.12 40 72
2875 78 260 © 1.100 220 1.10 220,26 §2 -,83 ,25 56
10.99 .90 .19 0 © 3.600 703 3.50 703.73 147-2.00 10.27
40.70 60 90 99 1.49 18 37 005 2.19 89 12 25.07.70 .80
120,70 25 075 2798 013-005 0 004 3 28 $2.25 75 81
+25 70.25 690 7G 1.69 004 10 06 «1S 23-4? 9S 36 83 BF
2585.25 901.69 2.54.13 28 18 91 28 59 a7 95.13.03 .80Table 3
Comparing Predictive Accuracy of Two Logistic Regression Models

Quantity   Model 1        Model 2
a          -.7            -.5
b          .5             1.4
U_1        .14 (.0001)    .05 (.007)
U_2        .07 (.003)     .009 (.4)
U          .21 (.0001)    .06 (.018)
D          .03 (.045)     .08 (.002)
Q          -.18           .03
Q_2        -.04           .08

(p-values for the corresponding tests in parentheses)

Figure 1 Legend: Four logistic calibration (reliability) curves, including one for a reliable predictor (a=0, b=1); the labeled curves include a=-.7, b=.5 and a=-.5, b=1.4.

Figure 2 Legend: Contour graphs of U (given by 5.1) as a function of observed proportions o_1 and o_2. The center of each set is (p_1, p_2), the true probabilities. The contours correspond to U = 0 (inner contour), .01, .02, ..., .10 (outer contour).