USING LOGISTIC MODEL CALIBRATION TO ASSESS THE QUALITY OF PROBABILITY PREDICTIONS

Frank E. Harrell, Jr., Kerry L. Lee
Division of Biometry, Duke University Medical Center
Box 3363, Durham, North Carolina 27710, USA

SUMMARY

We used a logistic calibration model (Cox, 1958a) to partition a logarithmic scoring rule (used to assess the quality of probability predictions) into indexes of discrimination and three indexes of unreliability. An index of overall quality that is not penalized for a prevalence correction is also proposed. Various tests for discrimination and unreliability arise immediately from these indexes. Power properties of a test for unreliability are studied.

1. INTRODUCTION

The assessment of predictive accuracy is of central importance in validating and comparing either subjective or model-based predictions of event outcomes. When one is predicting a continuous outcome measurement, using ordinary multiple linear regression for example, assessment of the quality of predictions can be carried out in a straightforward way using scatter diagrams, correlations (predicted with observed), and error estimates (predicted minus observed outcomes). When, however, the prediction is the probability that a particular event will occur, assessment of predictive quality is much more difficult due to the binary nature of the outcome.

Two commonly used concepts of the quality of predictions are the discrimination of a predictor (often called its refinement), i.e., the ability of the predictor to separate or rank-order observations with different outcomes, and its reliability (sometimes called validity or degree of being calibrated), the "correctness" of predictions on an absolute scale. If, for example, a predictor assigned a 20% probability of disease to each of a homogeneous group of 100 patients and 20 patients were later diagnosed to have the disease, the predictions would be reliable.

Even though reliability is a simpler concept, discrimination is easier to quantify uniquely. For example, we may calculate the concordance probability from the Wilcoxon-Mann-Whitney statistic: the proportion of pairs of subjects, one with and one without the outcome being predicted, such that the subject with the outcome had the higher predicted probability (see Harrell et al., 1982). We refer to this concordance measure as the c-index. This measure is equivalent to the area under a "receiver operating characteristic" curve (Hanley and McNeil, 1982) and is a linear translation of Somers' rank correlation between predicted probabilities and the binary event indicator, which in the absence of ties on predicted values is also equivalent to the Goodman-Kruskal rank correlation coefficient (Goodman and Kruskal, 1979).

Reliability is traditionally assessed by estimating the prevalence of the event in question for each level of predicted probability. This method works well when either only a few unique predictions are made or the sample is extremely large. When the predictions vary continuously from 0 to 1, some grouping of the probabilities is usually necessary. This may be accomplished by rounding predicted probabilities into intervals or by constructing quantile groups. Once grouping is done, a variety of goodness-of-fit tests are available for detecting unreliability (Lemeshow and Hosmer, 1982).
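To make the two concepts concrete, the sketch below computes the c-index by pairwise comparison and carries out the traditional grouped reliability check just described. It is an editorial illustration, not part of the paper (whose own software, Section 6, is a SAS macro); the Python code and the simulated data are purely hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)       # illustrative simulated data, not from the paper
n = 500
p = rng.uniform(0.05, 0.95, n)       # predicted probabilities
y = rng.binomial(1, p)               # outcomes generated so the predictions are reliable

# c-index: proportion of (event, non-event) pairs in which the event
# received the higher predicted probability (ties counted as 1/2)
diff = p[y == 1][:, None] - p[y == 0][None, :]
c = (diff > 0).mean() + 0.5 * (diff == 0).mean()
print(f"c-index = {c:.3f}")

# traditional reliability check: observed prevalence within decile groups
cuts = np.quantile(p, np.linspace(0, 1, 11))
grp = np.digitize(p, cuts[1:-1])     # group labels 0..9
for g in range(10):
    m = grp == g
    print(f"group {g + 1}: mean predicted {p[m].mean():.2f}, observed {y[m].mean():.2f}")
```

Re-running the last loop with different group boundaries (say, equal-width intervals instead of deciles) can noticeably change the apparent reliability, which is the first of the drawbacks taken up next.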
The method of grouping predicted probabilities to assess reliability has several drawbacks when applied to predictions that range continuously from 0 to 1. The most serious of these is the fact that one's assessment of reliability can change significantly depending on how the groups are formed. In addition, when one wants to test whether the predictions are "significantly unreliable", i.e., whether the observed prevalence differs significantly from the predicted values, an ordinary chi-square goodness-of-fit test lacks power. If separate samples were used for deriving and assessing the predictions, and the probabilities were divided into ten groups, the chi-square statistic has 9 degrees of freedom (d.f.), which has a critical value of 16.9 at the 5% level. If unreliability could be described with only 2 d.f., the critical value is reduced to 6.
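The two critical values quoted here are quick to verify (an editorial check, not part of the paper):

```python
from scipy.stats import chi2

print(chi2.ppf(0.95, df=9))   # 16.92: critical value with ten groups (9 d.f.)
print(chi2.ppf(0.95, df=2))   # 5.99: critical value when unreliability needs only 2 d.f.
```

A real departure from reliability spread over 9 d.f. must therefore be considerably larger before it reaches significance than one concentrated in 2 d.f.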
Various indexes have been proposed for assessing the accuracy of probability forecasts [see DeGroot and Fienberg, 1982, Hilden, Habbema, and Bjerregaard, 1978, and Spiegelhalter, 1986 for detailed discussions and bibliographies]. One commonly used accuracy index is Brier's (1950) quadratic scoring rule. This index is a "proper scoring rule", meaning that a predictor optimizes the Brier index by predicting the true probability of the event in question. Brier's index has been decomposed into reliability and discrimination components (Hilden et al., 1978, Spiegelhalter, 1986, Yates, 1982, Blattenberger and Lad, 1985, DeGroot and Fienberg, 1982) and a test for reliability based on one decomposition has been proposed (Hilden et al., 1978). Logarithmic scoring rules (Good, 1952) are also popular, and Cox (1958b) has presented one related test for reliability based on linear log odds alternatives to perfect reliability. In contrast to Brier's index, less work has been done to decompose logarithmic scoring rules into indexes of reliability and discrimination.

The method presented in Section 2 allows decomposition of an overall quality measure into discrimination and various unreliability components, each chance-corrected, and admits straightforward likelihood ratio and score tests for significant discrimination and unreliability. Throughout the discussion we assume that predictions and outcomes are stochastically independent. This is true if, for example, a regression model (e.g., a logistic model) was derived from a "training" sample and predictive accuracy was assessed on an independent "test" sample, which is often the only way to obtain an unbiased validation of the entire modeling process.

2. DESCRIBING ACCURACY USING CALIBRATION

Suppose that one could estimate the "calibration curve", the relationship between the predicted probability and the true probability of the event. Given an estimate of this relationship, one way to quantify the unreliability of predictions is to measure what has to be done to make the calibration curve superimposed on the ideal curve (a 45 degree line). Discrimination, on the other hand, is related to whether or not the predictions are in any way related to the outcomes, i.e., whether or not the calibration curve is horizontal. Thus the problem of quantifying unreliability and discrimination can be solved by estimating the relationship between predicted and observed values. Since observed values are binary, the relationship is stated in terms of the probability that the outcome occurs. The method of maximum likelihood can be used to estimate this relationship even when no two predictions are the same. The method only assumes that predictions are related to outcomes through a smooth curve that interpolates between different predictions.

For a simple predictor variable X, the logistic regression model (Cox, 1958a, 1966, Walker and Duncan, 1967) relates X to the probability of an event. Let the event or outcome variable be denoted by Y, where Y=1 when the event occurs and Y=0 otherwise. The model is as follows:

    Prob(Y=1 | X) = 1 / (1 + exp[-(a + bX)]) ,   (2.1)

where exp(x) is e^x, the natural antilogarithm. Cox (1958b, 1970) proposed using the linear logistic model to relate "subjective probabilities" to "objective probabilities". Let the predicted probabilities that Y_1, Y_2, ..., Y_n = 1 be denoted by P_1, P_2, ..., P_n for n subjects, or cases. Let the true (calibrated) probabilities be denoted by P_1', P_2', ..., P_n'. We can estimate P_i' given P_i by estimating the relationship between P_i and Y_i. P_i is first transformed from a 0-1 scale to an unlimited scale to better fit the model. The logistic calibration model is:

    Prob(Y_i=1 | P_i) = 1 / (1 + exp[-(a + b L_i)]) ,   (2.2)

where L_i = logit(P_i) = log[P_i/(1-P_i)]. The model can be restated as

    P_i' = 1 / (1 + exp(-a) [P_i/(1-P_i)]^(-b)) .   (2.3)

Figure 1 shows the shape of calibration curves for various values of a and b. The ideal relationship (no calibration required) is found on the curve marked a=0, b=1.

--- Figure 1 About Here ---

Note that when a=0 and b=1, P_i' = P_i, and no calibration is required. When no slope calibration is required (b=1),

    P_i' = P_i / [P_i + (1-P_i) exp(-a)] .   (2.4)

In this case, where only a prevalence adjustment is made, exp(a) is the odds ratio of the corrected to the uncorrected overall prevalence, and the calibration model is identical to the simplest form of Bayes' rule. It should be recognized that a simpler calibration model such as P_i' = a + b P_i cannot be used because this would allow P_i' to be less than 0 or greater than 1.

In many cases where a model has not been developed carefully on a dataset, predictions from the model will be found too extreme when they are validated in an independent sample because of overfitting the original dataset. For example, a probability of death of .1 may need to be calibrated to .25, and a prediction of .9 calibrated to .75. The corresponding logistic calibration for this example is obtained using a=0, b=.5 in equation (2.3). If predictions need to be shrunk symmetrically toward a probability of .5 and a predicted probability of P is calibrated to a value of P*, the calibrating equation is derived from (2.3) using a=0, b = logit(P*)/logit(P).

The parameters a and b can be estimated by maximizing the likelihood of the observed data (P_i, Y_i, i=1,2,...,n), or equivalently by minimizing -2 times the log-likelihood function,

    L = -2 Σ [Y_i log(P_i') + (1-Y_i) log(1-P_i')]
      = -2 Σ [Y_i (a + b L_i) - log(1 + exp(a + b L_i))] .   (2.5)

L can be thought of as measuring the information or quality of the predictions in relation to the outcomes, given a and b.

3. DERIVATION OF ACCURACY INDEXES

The following notation will be used:

    L(a,b) = minimum L over all a, b
    L(a,1) = minimum L over all a subject to b=1
    L(a,0) = minimum L over all a subject to b=0
           = -2 Σ [Y_i log P̄ + (1-Y_i) log(1-P̄)]
    L(0,1) = value of L at a=0, b=1
           = -2 Σ [Y_i log P_i + (1-Y_i) log(1-P_i)] ,

where P̄ = Σ Y_i / n. We compute the unreliability U of the predictions from the difference in quality of the uncalibrated predictions and the quality of the slope- and intercept-calibrated predictions:

    U = [L(0,1) - L(a,b) - 2]/n .   (3.1)

Since L(0,1) - L(a,b) is a likelihood ratio statistic for testing H0: a=0, b=1 with an asymptotic chi-square distribution having expected value 2 if H0 is true, U has expected value 0 if the predictions are reliable. Division by the sample size makes the range of U independent of n.
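Everything above can be carried out by minimizing (2.5) directly. The sketch below (ours, with hypothetical simulated data; the names neg2loglik, Lg, etc. are our own) estimates a and b and evaluates U from (3.1):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logit, expit

def neg2loglik(params, Lg, y):
    """L of equation (2.5): -2 times the log likelihood of the calibration model."""
    a, b = params
    eta = a + b * Lg
    return -2.0 * np.sum(y * eta - np.logaddexp(0.0, eta))  # log(1+e^eta), stably

rng = np.random.default_rng(2)                 # illustrative data, not from the paper
p = rng.uniform(0.05, 0.95, 400)               # predicted probabilities
y = rng.binomial(1, expit(0.5 * logit(p)))     # truth has a=0, b=.5: predictions too extreme
Lg = logit(p)                                  # L_i = logit(P_i)
n = len(y)

fit = minimize(neg2loglik, x0=[0.0, 1.0], args=(Lg, y))
a_hat, b_hat = fit.x                           # calibration parameter estimates
L_ab = fit.fun                                 # L(a,b)
L_01 = neg2loglik([0.0, 1.0], Lg, y)           # L(0,1): the predictions as they stand
U = (L_01 - L_ab - 2) / n                      # unreliability index (3.1)
print(f"a = {a_hat:.2f}, b = {b_hat:.2f}, U = {U:.3f}")
```

With this setup the estimated slope should come out near .5 and U should be clearly positive; for a well-calibrated predictor U fluctuates around 0.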
U can be decomposed into U = U_1 + U_2, where U_1 is the unreliability due to the need for an overall prevalence correction (correction of the intercept on the logit scale) and U_2 is the unreliability due to the need for a slope correction given any needed prevalence correction:

    U_1 = [L(0,1) - L(a,1) - 1]/n   (3.2)
    U_2 = [L(a,1) - L(a,b) - 1]/n . (3.3)

U_1 is the difference in quality of the best uncalibrated predictor and an intercept-calibrated predictor. U_2 is the difference in quality of the best intercept-calibrated predictor and the best slope- and intercept-calibrated predictor. The -1 term causes each index to have expected value 0 if the corresponding type of unreliability is truly absent. Large values of the unreliability indexes mean that the predictions are unreliable. Negative values indicate better reliability than one would expect by chance.

Likelihood ratio statistics are immediately available for testing each type of unreliability:

                                                     Statistic        Asymptotic d.f.
    Significant total unreliability                  L(0,1) - L(a,b)        2   (3.4)
      (H0: a=0, b=1)
    Significant unreliability due to overall         L(0,1) - L(a,1)        1   (3.5)
      prevalence error (H0: a=0 | b=1)
    Significant unreliability due to slope error     L(a,1) - L(a,b)        1   (3.6)
      given prevalence correction (H0: b=1)

Simple score tests are also available for testing the first two hypotheses above, avoiding the need for iterative calculations (Rao, 1973). A 2 d.f. asymptotic chi-square score statistic for H0: a=0, b=1 is given by

    [Σ(Y_i-P_i), ΣL_i(Y_i-P_i)] [ ΣP_i(1-P_i)      ΣL_iP_i(1-P_i)   ]^-1 [ Σ(Y_i-P_i)    ]
                                [ ΣL_iP_i(1-P_i)   ΣL_i^2 P_i(1-P_i) ]    [ ΣL_i(Y_i-P_i) ]   (3.7)

A 1 d.f. test statistic for H0: a=0 | b=1 is

    [Σ(Y_i-P_i)]^2 / ΣP_i(1-P_i) .   (3.8)

These score tests turn out to be identical to those proposed by Cox (1958b).
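Both score statistics have closed forms and need no iteration. A minimal sketch (ours), regenerating the same illustrative data used earlier:

```python
import numpy as np
from scipy.special import logit, expit

rng = np.random.default_rng(2)                  # same illustrative setup as before
p = rng.uniform(0.05, 0.95, 400)
y = rng.binomial(1, expit(0.5 * logit(p)))      # predictions too extreme
Lg = logit(p)

w = p * (1 - p)
s = np.array([np.sum(y - p), np.sum(Lg * (y - p))])       # score vector
info = np.array([[np.sum(w),      np.sum(Lg * w)],
                 [np.sum(Lg * w), np.sum(Lg**2 * w)]])    # matrix inverted in (3.7)
x2_2df = s @ np.linalg.solve(info, s)                     # (3.7): H0 a=0, b=1 (2 d.f.)
x2_1df = s[0]**2 / np.sum(w)                              # (3.8): H0 a=0 | b=1 (1 d.f.)
print(f"2 d.f. score statistic {x2_2df:.1f}, 1 d.f. statistic {x2_1df:.1f}")
```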
Cox also presented a test for whether predicted probabilities are overly dispersed even though they are correct on the average (H0: b=1 | a=0).

The index of discrimination is derived by computing the difference in quality of the best constant predictor (one that on the average correctly predicts the overall prevalence of the event) and the best calibrated predictor:

    D = [L(a,0) - L(a,b) - 1]/n .   (3.9)

D has expected value 0 if there is no discrimination (b=0). The likelihood ratio statistic for testing whether the predictions have any discriminatory ability (H0: b=0) is L(a,0) - L(a,b), having asymptotically a chi-square distribution with 1 d.f. under H0.

An overall summary index for the quality of predictions is derived by computing the difference in quality between the best constant predictor and the quality of the predictions as they stand (with no calibration):

    Q = [L(a,0) - L(0,1) + 1]/n .   (3.10)

It can readily be seen that Q = discrimination - total unreliability = D - U. The summary index Q is a simple translation of the logarithmic scoring rule (see Good (1952), Cox (1970, Eq. 4.34), and Shapiro (1977)). The reference point for Q is the best constant predictor, whereas other authors used as reference a predictor having constant probability 0.5.

The value of Q is invariant with respect to the form of the calibrating model. This result follows from 1) when a=0 and b=1, P_i' = P_i, making the log-likelihood function (here L(0,1)) dependent only on the observed data, and 2) when b=0, the calibrated probabilities do not make use of the predicted probabilities, so that P_i' = P̄, the overall proportion of events that occurred (ΣY_i/n). Hence (3.10) reduces to

    Q = (2/n) Σ [Y_i log(P_i/P̄) + (1-Y_i) log((1-P_i)/(1-P̄))] + 1/n .   (3.11)

An index of quality can also be constructed which does not penalize predictions for being wrong by a constant prevalence correction. This index is derived from the difference between the quality of the best intercept-corrected predictions and the quality of the best constant predictor:

    Q_1 = [L(a,0) - L(a,1)]/n = D - U_2 .   (3.12)

Since score statistics can be used to approximate likelihood ratio statistics, simpler unreliability indexes can be constructed by substituting (3.8) for L(0,1) - L(a,1) in (3.2) and (3.7) for L(0,1) - L(a,b) in (3.1). This approach has two disadvantages, though. First, score statistics are not additive as are partitions of the log-likelihood. Second, score statistics may not adequately quantify information content for situations far from the null hypothesis.
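Given the four log-likelihood quantities, the whole decomposition is a few lines of arithmetic. The sketch below (ours) continues the variables of the Section 2 sketch (neg2loglik, Lg, y, n, L_ab, L_01), profiling (2.5) over the intercept with the slope pinned at 1 and at 0:

```python
from scipy.optimize import minimize_scalar

# one-parameter profiles of (2.5): minimize over a with b fixed
L_a1 = minimize_scalar(lambda a: neg2loglik([a, 1.0], Lg, y)).fun   # L(a,1)
L_a0 = minimize_scalar(lambda a: neg2loglik([a, 0.0], Lg, y)).fun   # L(a,0)

U  = (L_01 - L_ab - 2) / n       # total unreliability (3.1)
U1 = (L_01 - L_a1 - 1) / n       # prevalence-correction component (3.2)
U2 = (L_a1 - L_ab - 1) / n       # slope-correction component (3.3)
D  = (L_a0 - L_ab - 1) / n       # discrimination (3.9)
Q  = (L_a0 - L_01 + 1) / n       # overall quality (3.10); equals D - U
Q1 = (L_a0 - L_a1) / n           # quality without prevalence penalty (3.12)
print(f"U={U:.3f} (U1={U1:.3f}, U2={U2:.3f})  D={D:.3f}  Q={Q:.3f}  Q1={Q1:.3f}")
```

The likelihood ratio chi-square statistics of (3.4)-(3.6) fall out of the same differences before the chance corrections and the division by n are applied.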
4. POWER OF TEST FOR UNRELIABILITY

The likelihood ratio test for total unreliability given in (3.4) is difficult to study because of the iterative calculations required. It has been shown in a similar situation that score tests have power functions equivalent to those of likelihood ratio tests (Lee et al., 1983). Therefore we study the power properties of the score test given by (3.7).

In general, E(Y_i) = P_i', and the score vector [Σ(Y_i-P_i), ΣL_i(Y_i-P_i)]' is asymptotically normal with mean vector and covariance matrix given respectively by

    u = [ Σ(P_i'-P_i), ΣL_i(P_i'-P_i) ]' ,

    V = [ ΣP_i'(1-P_i')      ΣL_iP_i'(1-P_i')    ]
        [ ΣL_iP_i'(1-P_i')   ΣL_i^2 P_i'(1-P_i') ] .   (4.1)

It follows that the score statistic for testing H0: a=0, b=1 has mean m and variance v given by

    m = tr(AV) + u'Au
    v = 2 tr[(AV)^2] + 4 u'AVAu ,   (4.2)

where A is the matrix inverse in (3.7). The distribution of (3.7) can be approximated by a scale multiple of a non-central chi-square random variable, γχ²₂(λ), with 2 d.f. and noncentrality λ, by equating the first two moments of such a distribution to (4.2) (Johnson and Kotz, 1970), yielding

    γ = [m - (m^2 - v)^(1/2)]/2 ,   λ = m/γ - 2 .   (4.3)

If χ²_{2,1-α} represents the 1-α quantile of a central chi-square distribution having 2 d.f., the power of an α-level test for unreliability can be approximated by

    Prob[γχ²₂(λ) > χ²_{2,1-α}] = Prob[χ²₂(λ) > χ²_{2,1-α}/γ] .   (4.4)

To test the adequacy of this approximation as well as the adequacy of the central chi-square null distribution for (3.7), 2000 samples of varying sizes were simulated for each of a variety of setups in which only two distinct predictions were made. Here k observations were assigned a predicted probability of p_1 and k were assigned a probability of p_2. Corresponding actual population probabilities were p_1' and p_2'. Power was estimated by computing the fraction of score statistics exceeding χ²_{2,.95} = 5.99. Results for α=.05 are found in Table 1. It can be seen that the type I error is well controlled and that (4.4) appears to be a satisfactory power approximation.

--- Table 1 About Here ---

The power approximation in (4.4) can be used to estimate the sample size necessary to achieve a specified power for an α-level test of total unreliability. Suppose that k predictions are made at each of g probability levels, p_1, p_2, ..., p_g, and that the true probabilities are p_1', p_2', ..., p_g'. The total sample size is thus kg. Define L_j = logit p_j. Then the quantities in (4.2) are given by

    m = tr(A*V*) + k u*'A*u*
    v = 2 tr[(A*V*)^2] + 4k u*'A*V*A*u* ,   (4.5)

where, with all summations taken over j = 1, 2, ..., g,

    u* = [ Σ(p_j'-p_j), ΣL_j(p_j'-p_j) ]' ,

    V* = [ Σp_j'(1-p_j')      ΣL_j p_j'(1-p_j')   ]       A* = [ Σp_j(1-p_j)      ΣL_j p_j(1-p_j)   ]^-1
         [ ΣL_j p_j'(1-p_j')  ΣL_j^2 p_j'(1-p_j') ] ,          [ ΣL_j p_j(1-p_j)  ΣL_j^2 p_j(1-p_j) ]   .   (4.6)

The following iterative algorithm converges quickly to a solution for k:

    Initialize k_last = 0 and γ = 1.
    Loop X: Compute λ*, the noncentrality parameter of a χ²₂ distribution such that
              Prob[χ²₂(λ*) > χ²_{2,1-α}/γ] equals the desired power.
            Set m* = γ(λ* + 2).
            Set k = [m* - tr(A*V*)] / (u*'A*u*).
            Set γ = [m* - (m*^2 - v)^(1/2)]/2, where v = 2 tr[(A*V*)^2] + 4k u*'A*V*A*u*.
            If |k - k_last| < 1, stop.
            Set k_last = k and go to X.   (4.7)
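The power approximation (4.1)-(4.4) is direct to implement. A sketch (ours, in Python rather than the authors' SAS program; approx_power and its arguments are our own names):

```python
import numpy as np
from scipy.stats import chi2, ncx2
from scipy.special import logit

def approx_power(p, p_true, k, alpha=0.05):
    """Approximate power of the 2 d.f. score test via (4.1)-(4.4).
    p, p_true: predicted and true probabilities at g distinct levels;
    k: number of observations per level. Editorial sketch."""
    p, pt = np.asarray(p), np.asarray(p_true)
    L = logit(p)
    u = k * np.array([np.sum(pt - p), np.sum(L * (pt - p))])       # mean of score, (4.1)
    wt = pt * (1 - pt)
    V = k * np.array([[np.sum(wt),     np.sum(L * wt)],
                      [np.sum(L * wt), np.sum(L**2 * wt)]])        # covariance, (4.1)
    w = p * (1 - p)
    A = np.linalg.inv(k * np.array([[np.sum(w),     np.sum(L * w)],
                                    [np.sum(L * w), np.sum(L**2 * w)]]))
    AV = A @ V
    m = np.trace(AV) + u @ A @ u                                   # (4.2)
    v = 2 * np.trace(AV @ AV) + 4 * u @ A @ V @ A @ u
    g = (m - np.sqrt(m**2 - v)) / 2                                # gamma, (4.3); assumes m^2 >= v
    lam = m / g - 2                                                # noncentrality
    crit = chi2.ppf(1 - alpha, 2)
    return ncx2.sf(crit / g, 2, lam)                               # power, (4.4)

# e.g., predictions .25/.75 when the truth is .40/.60, 100 subjects per group
print(approx_power([0.25, 0.75], [0.40, 0.60], k=100))
```

Wrapping this function in a loop over k, increasing k until the approximated power reaches the target, reproduces the effect of the iterative algorithm (4.7) above.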
5. EXAMPLES OF VALUES OF THE INDEXES

To help in interpreting the values of the indexes, consider a series of simple examples in which k subjects receive one prediction, P_1, and another k subjects receive a predicted probability of P_2. The first k subjects have an observed prevalence of the event of O_1 and the second k have a prevalence of O_2. The resulting calibration parameter estimates (a and b), accuracy indexes, and chi-square statistics are in Table 2 for k=100. For comparison, the c-index is also given, along with a version of Brier's score defined by B = 1 minus the average of (P_i - Y_i)^2.

--- Table 2 About Here ---

Lines 1 and 2 in the table demonstrate the values of the indexes when there is perfect reliability and low to moderate discrimination, respectively. Similarly, lines 3-5 correspond to backwards predictions (e.g., predict .25 and .75, observe .75 and .25) with increasing discrimination. Total unreliability is statistically significant for lines 3-6, 8 and 9. Lines 6-9 are more typical examples of unreliability. The measure of overall quality, Q, is negative (lines 3-6) when the discrimination is not good enough to overcome serious unreliability. The index of discrimination, D, ranked the discrimination of the predictions in the same order as the absolute value of c-.5. The rankings of Q and B are very similar, but both differ from those of the absolute value of c-.5.

It appears that predictions for which U does not exceed about 0.05 are reliable for the most part. Statistical significance of U can also be used to quantify unreliability, although the power of this assessment depends on the sample size (significant unreliability is present at the α=.05 level if U > 3.99/n; for U_1 and U_2 the critical levels are 2.84/n). It can be shown that for this situation (k predictions at each of two probabilities), the unreliability index is given by

    U = O_1 log(O_1/P_1) + (1-O_1) log[(1-O_1)/(1-P_1)]
        + O_2 log(O_2/P_2) + (1-O_2) log[(1-O_2)/(1-P_2)] - 2/n .   (5.1)

The analyst can use (5.1) to estimate acceptable levels of U for fixed P_1, P_2 by varying O_1 and O_2 and judging U by whether O_1 and O_2 are meaningfully different from P_1 and P_2. A plot of U with respect to O_1 and O_2 is shown in Figure 2 when k=100 (n=200) for four combinations of (P_1, P_2).
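Formula (5.1) is easy to tabulate directly, in the spirit of Figure 2. The short sketch below (ours) evaluates it over a range of observed prevalences so one can see how far O_1 must drift from P_1 before U crosses the 0.05 guideline or the 3.99/n significance level:

```python
import numpy as np

def U_two_groups(O1, O2, P1, P2, n):
    """Unreliability index U of (5.1) for k = n/2 predictions at each of P1, P2."""
    def term(o, p):  # per-group binomial Kullback-Leibler contribution
        return o * np.log(o / p) + (1 - o) * np.log((1 - o) / (1 - p))
    return term(O1, P1) + term(O2, P2) - 2 / n

n = 200
for O1 in (0.25, 0.30, 0.35, 0.40):
    print(f"O1={O1:.2f}: U = {U_two_groups(O1, 0.75, 0.25, 0.75, n):.3f}")
# U > 3.99/n (= .02 here) signals significant total unreliability at the .05 level
```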
To show examples of the values of the new indexes, as well as the resulting estimates of the calibration or reliability curves when the predictions are continuous, the predictive accuracy of two logistic regression models was considered. For both models, the outcome variable was complete response to treatment of non-Hodgkin's lymphoma, and the predictions were developed (Harrell et al., 1985) using a training sample of 110 patients (50 with complete response, 60 without). These predictions were evaluated on a separate test sample of 116 patients. The first model was developed using a standard stepwise variable selection method with 25 candidate variables, far too many for only 50 cases of complete response. The second model used the incomplete principal component method, which effectively reduced the 25 variables to only 1. The accuracy indexes are found in Table 3. Corresponding p-values for significant unreliability or discrimination are in parentheses. Reliability plots using the estimated a and b may be found in Figure 1. The results indicate significant unreliability (of both types) and little discriminating ability for model 1, resulting in unacceptable predictions (Q = -.19). The extreme predictions from model 1 cannot be trusted, which is frequently the case when too many predictor variables are used with small sample sizes. Model 2 has a moderate need for a prevalence correction (U_1 = .05) but not for a slope correction, and has better discrimination than model 1, resulting in far better overall quality (Q = .03 vs. -.19). This improvement in predictive accuracy is due to the data reduction resulting from fitting principal components.

--- Table 3 About Here ---

6. COMPUTER SOFTWARE

A SAS (1985) macro is available from the authors for calculating all of the indexes mentioned in this paper as well as for drawing the reliability plot. Another SAS program is available for power and sample size calculations based on (4.4) and (4.7).

7. CONCLUSIONS

We sought a method of assessing predictive quality having the following properties: (1) no grouping of predictions is required; (2) an overall measure of the quality of predictions can be formally decomposed into a simple sum of indexes of unreliability and discrimination; (3) the index of unreliability can be further decomposed into an index of unreliability due to the need for an overall prevalence (constant) correction and unreliability due to a more complicated correction; (4) the method yields as a byproduct an index of overall predictive quality that is not penalized for a prevalence correction; and (5) the method automatically yields formal statistical tests (with reasonable power) for significant unreliability (and its two components) and for significant discriminatory ability. The logistic regression model, when used to calibrate predicted probabilities to observed outcomes, was useful in meeting these goals. The power approximation given in (4.4) is adequate for estimating the sample size needed to conduct studies such as those designed to test the diagnostic accuracy of physicians or probability models.

ACKNOWLEDGEMENTS

This research was supported by the National Center for Health Services Research, and by the National Library of Medicine and the National Heart, Lung, and Blood Institute of the National Institutes of Health. We thank Ms. Cristy Vollmar for the careful typing of the manuscript, Barbara Pollock for providing expert technical assistance, and Robert Rosati and David Pryor for motivating our work.

REFERENCES

1. Blattenberger G, Lad F (1985): Separating the Brier score into calibration and refinement components: a graphical exposition. Am Statistician 39:26-32.
2. Brier GW (1950): Verification of forecasts expressed in terms of probability. Monthly Weather Review 78:1-3.
3. Cox DR (1958a): The regression analysis of binary sequences (with discussion). J Roy Statist Soc B 20:215-242.
4. Cox DR (1958b): Two further applications of a model for binary regression. Biometrika 45:562-565.
5. Cox DR (1966): Some procedures connected with the logistic qualitative response curve. In Research Papers in Statistics: Essays in Honour of J. Neyman's 70th Birthday, pp. 55-71. Ed. F.N. David. London: Wiley.
6. Cox DR (1970): The Analysis of Binary Data. London: Methuen, pp. 52-54.
7. DeGroot MH, Fienberg SE (1982): Assessing probability assessors: calibration and refinement. In Statistical Decision Theory and Related Topics III, Vol. 1. New York: Academic Press.
8. Good IJ (1952): Rational decisions. J Roy Statist Soc B 14:107-114.
9. Goodman LA, Kruskal WH (1979): Measures of Association for Cross Classifications. New York: Springer-Verlag.
10. Hanley JA, McNeil BJ (1982): The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143:29-36.
11. Harrell FE, Califf RM, Pryor DB, et al (1982): Evaluating the yield of medical tests. J Am Med Assoc 247:2543-2546.
12. Harrell FE, Lee KL, Matchar DB, Reichert TA (1985): Regression models for prognostic prediction: advantages, problems, and suggested solutions. Cancer Treatment Reports 69:1071-1077.
13. Hilden J, Habbema JDF, Bjerregaard B (1978): The measurement of performance in probabilistic diagnosis. III. Methods based on continuous functions of the diagnostic probabilities. Methods of Information in Medicine 17:238-246.
14. Johnson NL, Kotz S (1970): Distributions in Statistics: Continuous Univariate Distributions-2, pp. 165-166. New York: Wiley.
15. Lee KL, Harrell FE, Tolley HD, Rosati RA (1983): A comparison of test statistics for assessing the effects of concomitant variables in survival analysis. Biometrics 39:341-350.
16. Lemeshow S, Hosmer DW (1982): A review of goodness of fit statistics for use in the development of logistic regression models. American Journal of Epidemiology 115:92-106.
17. Rao CR (1973): Linear Statistical Inference and Its Applications, Second Edition, pp. 418-419. New York: Wiley.
18. SAS Institute (1985): SAS User's Guide: Basics, Version 5 Edition. Cary, NC: SAS Institute, Inc., pp. 643-727.
19. Shapiro AR (1977): The evaluation of clinical predictions: a method and initial application. New England J Med 296:1509-1514.
20. Spiegelhalter DJ (1986): Probabilistic prediction in patient management and clinical trials. Statistics in Medicine 5:421-433.
21. Walker SH, Duncan DB (1967): Estimation of the probability of an event as a function of several independent variables. Biometrika 54:167-179.
22. Yates JF (1982): External correspondence: decompositions of the mean probability score. Organizational Behavior and Human Performance 30:132-156.

Table 1
Simulated and Approximated Power of the Score Test for Unreliability, α=.05
(Columns: k; predicted probabilities p_1, p_2; true probabilities p_1', p_2'; simulated power; power approximated by (4.4).)

Table 2
Examples of Values of the Indexes
(Nine example configurations of P_1, P_2, O_1, O_2 with k=100; columns give a, b, U, U_1, U_2, D, Q, the associated chi-square statistics, c, and B.)

Table 3
Comparing Predictive Accuracy of Two Logistic Regression Models
(Accuracy indexes for Models 1 and 2, with p-values for significant unreliability or discrimination in parentheses.)

Figure 1 Legend: Four logistic calibration (reliability) curves, including one for a reliable predictor (a=0, b=1); the curves shown include a=-.7, b=.5 and a=-.5, b=1.4.

Figure 2 Legend: Contour graphs of U (given by 5.1) as a function of the observed proportions O_1 and O_2. The center of each set is (P_1, P_2), the true probabilities. The contours correspond to U = 0 (inner contour), .01, .02, ..., .10 (outer contour).
