Table of Contents
Statistics for LHC Physics
    Goals of this page
    What is statistics?
    Hypothesis Testing and Limit Calculation
        Hypothesis Testing
        Upper Limit Calculation
        Reproducing Higgs Combination Tool results
Statistics for LHC Physics
Some basics of statistics
http://www2.warwick.ac.uk/fac/sci/physics/research/epp/events/seminars/cowan_warwick_2011.pdf
https://twiki.cern.ch/twiki/bin/viewauth/CMS/SWGuideCMSDataAnalysisSchoolLPC2016Statistics
This page is a very good exercise for understanding the more technical details; it will be most useful after going through the material below.
Goals of this page
Understanding the exact statistical concepts behind LHC data analysis has long been confusing and
difficult. The purpose of this page is to present these concepts through examples that are equivalent to, but
much simpler than, what is done in a real analysis, so that the reader can appreciate and hopefully understand
the concepts clearly. This is a sort of Wikipedia page, where people can refine the article further
by adding their knowledge.
What is statistics?
"Statistics is the science of information gathering, especially when the information arrives in little pieces
instead of big ones." (Bradley Efron)
Before going further, let's briefly discuss probability theory. It is important to first learn the basic
schools of thought in probability and statistics, because the interpretation of a result sometimes depends on the kind of
statistics you are using. Things will become clearer as we proceed. There are two main concepts of
probability that people deal with:
1.) Frequentist
2.) Bayesian
Frequentist: Frequentist probability, or frequentism, is a standard interpretation of probability; it defines an
event's probability as the limit of its relative frequency in a large number of trials.
A typical example: suppose a frequentist is asked to estimate the fraction of the population of Delhi that has
cancer (suppose Delhi's population is 10000000). He/she will say "I have 95% confidence that the fraction of
cancer patients in Delhi is between 10% and 20%". Roughly, what he/she means is that if you take many samples
of reasonably large size and build such an interval from each one, about 95% of those intervals
will contain the true fraction.
Bayesian: Bayesian statistics is a subset of the field of statistics in which the evidence about the true state of
the world is expressed in terms of degrees of belief or, more specifically, Bayesian probabilities.
A Bayesian, in contrast, would present his/her result as a probability after doing the study. He/she will say
there is a 95% probability that the fraction of cancer patients is between 10% and 20%. Most importantly, if the Bayesian
has some prior information before calculating the result, the result will change according to that prior belief.
For example, suppose the Bayesian learned from somebody that in the previous year the fraction was 40%.
He/she will then ask how plausible it is that the fraction has moved from 40% to 10%-20% within one year,
and will recalculate the numbers taking this prior belief into account.
Bayes' theorem provides a simple rule for combining those probabilities.
The rule is
P(A|B) = P(A)*P(B|A)/P(B)
More will be written on this topic later.
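As a small numerical illustration of the rule above, the sketch below updates a belief in a hypothesis A after seeing evidence B. The input numbers (a 1% prior and likelihoods of 95% and 5%) are invented purely for illustration and do not come from any measurement.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) from Bayes' theorem: P(A|B) = P(A) * P(B|A) / P(B)."""
    # Total probability of the evidence B, summed over "A true" and "A false".
    p_b = prior_a * p_b_given_a + (1.0 - prior_a) * p_b_given_not_a
    return prior_a * p_b_given_a / p_b


# Hypothetical inputs: 1% prior, B seen 95% of the time if A is true, 5% if not.
p = posterior(prior_a=0.01, p_b_given_a=0.95, p_b_given_not_a=0.05)
print(f"P(A|B) = {p:.3f}")
```

Even with strong evidence, the posterior stays modest because the prior was small; this is the sense in which a Bayesian result depends on the prior belief.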
Hypothesis Testing and Limit Calculation
In physics, we always need to conclude something based on the data we observe. Before it is
established, each theory is called a hypothesis, and in a "statistical test" we try to establish one theory and reject the other,
or we put conditions on some parameters of the theory by stating which regions are excluded. We will now
discuss all of this.
Two important terms in statistics are population and sample.
Population: The entire pool from which a statistical sample is drawn. The information obtained from the sample allows
statisticians to develop hypotheses about the larger population. Researchers gather information from a sample
because of the difficulty of studying the entire population. For example, when we do an experiment in CMS and
collect data, all the 8 TeV data, i.e. 19.6 fb^{-1}, is regarded as a sample from the infinitely large amount
of data which we could have collected by running the LHC for many, many years.
There are two types of statistical hypotheses.
Null hypothesis: The null hypothesis, denoted by $H_0$, is usually the hypothesis that sample observations result
purely from chance.
Alternative hypothesis: The alternative hypothesis, denoted by $H_1$ or $H_a$, is the hypothesis that sample
observations are influenced by some non-random cause.
Hypothesis Testing
Let's understand the concept of hypothesis testing through a simple example from a topic not related to physics.
The problem is:
A scientist has developed a new car. He claims that the engine will run continuously for 5 hours (300
minutes) on a single litre of petrol. From his stock of 2000 cars, the scientist selects a simple random sample
of 50 cars for testing. The cars run for an average of 295 minutes, with a standard deviation of 20 minutes.
Test the null hypothesis that the mean run time is 300 minutes against the alternative hypothesis that the mean
run time is not 300 minutes. Use a 0.05 level of significance. (Assume that run times for the population of
engines are normally distributed.)
Solution: Let's first try to understand the question.
Here the null hypothesis ($H_0$) is: $\mu = 300$
and the alternative hypothesis ($H_1$) is: $\mu \neq 300$
The scientist claims that the mean will be 300, but the result from one sample shows a mean of 295
under certain conditions (a sample of 50 cars with a standard deviation of 20). So our present
belief is that the mean is 300, and we want to see whether the experiment shows something different from this belief.
Now let's look at the experiment we have done. We took 50 cars, ran them, and found 50
different observations (295, 297, 300, 293, ...). If we take the average of all of these we get 295, with a standard
deviation of 20. Since we have not done the experiment on all 2000 cars, we cannot say
that 295 is the actual population mean (the mean you would get by testing all 2000 cars). So the
number 295 definitely has an uncertainty on it. In other words, 295 is the estimated mean, not the
real one.
If so, how do we get that uncertainty?
Well, in principle we could take many samples of 50 cars, like the one we have now, and calculate the mean of
each. If you plot the distribution of these means, it will be Gaussian, and the standard deviation of that
Gaussian is the error on the mean. But there is another way: we can calculate that error from our
observation of a single sample. We can use the theory of estimation to estimate the mean from a sample, together with the error
on the estimated mean.
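The repeated-sampling idea described above can be tried out numerically. The sketch below assumes, for illustration only, that the true population is Gaussian with mean 300 and standard deviation 20; it draws many samples of 50 run times and compares the spread of the sample means with sigma/sqrt(n). The number of repeated samples and the random seed are arbitrary choices.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

MU, SIGMA, N = 300.0, 20.0, 50   # assumed true mean, spread, and sample size
N_SAMPLES = 5000                 # number of repeated samples (arbitrary)

# Mean run time of each repeated sample of N cars.
means = [
    statistics.fmean(random.gauss(MU, SIGMA) for _ in range(N))
    for _ in range(N_SAMPLES)
]

spread_of_means = statistics.stdev(means)
expected = SIGMA / N ** 0.5      # sigma / sqrt(n)
print(f"spread of sample means: {spread_of_means:.2f}")
print(f"sigma / sqrt(n):        {expected:.2f}")
```

The two printed numbers should agree to within the statistical precision of the simulation, which is why a single sample is enough to quote an error on the mean.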
So in this case we have Sample = {295, 297, 300, 293, ...}. Let's denote this as Sample = {$x_1, x_2, ..., x_{50}$}. Now
somebody asks you to estimate, from this sample, a mean that is as close as possible to the real mean
(the mean from doing the experiment with all 2000 cars).
We know the distribution of the population is normal (given in the question), so the
points $x_1, x_2, ..., x_{50}$ will all follow a frequency distribution which is Gaussian/normal. The probability of getting
$x_i$ is
$P(x_i) = \mathrm{Gauss}(x_i; \mu, \sigma)$
The total probability is $L = \prod_{i=1}^{50} P(x_i)$. Since you want all these points to follow a Gaussian
frequency distribution, you maximize this function L (called the likelihood function) to get the best values
of $\mu$ and $\sigma$. You have to form two equations,
$\partial L/\partial\mu = 0$ and $\partial L/\partial\sigma = 0$, which give the estimated values of $\mu$ and $\sigma$. If you do the calculation you
get
Estimated $\hat{\mu} = \sum_{i=1}^{50} x_i / 50 = \bar{x}$, which is just the value given, ~295. Now we can ask what the
error on the estimated value of the mean is. That we can also calculate:
$\sigma_{\hat{\mu}}^2 = E[\hat{\mu}^2] - (E[\hat{\mu}])^2$, where E stands for average or expectation and $\sigma_{\hat{\mu}}^2$ is the variance of $\hat{\mu}$; it is not $\sigma^2$. If you
expand and calculate, you find
$\sigma_{\hat{\mu}}^2 = \sigma^2/50$
so the error on the mean, i.e. the error on the number 295, is $\sigma_{\hat{\mu}} = \sigma/\sqrt{50}$ = 20/sqrt(50) = 2.83.
Now come back to our original problem. What we have understood is that the current belief is $\mu = 300$, but we have
done an experiment with a sample of 50 and found the estimated $\hat{\mu}$ = 295 +/- 2.83 (1$\sigma$ uncertainty).
The question is: what can we say about the null hypothesis based on our observation? How can we judge
how close the observed mean (295) is to the believed mean (300)?
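For completeness, the maximisation can be written out explicitly. The sketch below works with ln L rather than L (both peak at the same place), writes n for the sample size (50 here), and assumes the observations are independent.

```latex
\ln L(\mu,\sigma) = -\frac{n}{2}\ln\!\left(2\pi\sigma^{2}\right)
                    - \frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(x_i-\mu)^{2}

\frac{\partial \ln L}{\partial \mu}
  = \frac{1}{\sigma^{2}}\sum_{i=1}^{n}(x_i-\mu) = 0
  \;\Rightarrow\;
  \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}

V[\hat{\mu}] = V\!\left[\frac{1}{n}\sum_{i=1}^{n} x_i\right]
  = \frac{1}{n^{2}}\sum_{i=1}^{n} V[x_i]
  = \frac{\sigma^{2}}{n}
  \;\Rightarrow\;
  \sigma_{\hat{\mu}} = \frac{\sigma}{\sqrt{n}} = \frac{20}{\sqrt{50}} \approx 2.83
```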
Well, let's proceed carefully now! We said estimated $\hat{\mu}$ (= $\bar{x}$) = 295 +/- 2.83. We do not know the actual
$\mu$ (i.e. the one calculated from all 2000 cars; it could be 296 or 294, etc.). But if we took more samples and
estimated $\bar{x}$ and $\sigma_{\bar{x}}$ for each (in actual practice we could not) and plotted the quantity
$t = (\bar{x} - \mu)/\sigma_{\bar{x}}$, we would get a distribution peaked at zero. This is a known distribution called Student's t-distribution,
and in fact it depends only on the number of degrees of freedom, i.e. $\nu$ = sample size - 1 = n - 1. Here n = 50.
t is called the test statistic for conducting the hypothesis test here. The height of the t-distribution at any
particular point gives the probability density for observing that value of the test statistic. It is used to quantify how close the sample mean
is to the population mean. When the sample mean is close to the population mean we get a
value close to zero, which is the most probable value.
Now remember that our believed actual $\mu$ is 300 and our observed/estimated $\hat{\mu}$ = $\bar{x}$ = 295, so the observed
value of the test statistic is
t = (295 - 300)/2.83 = -1.77. By calculating this, what we are saying is: we have a quantity called t which
is an indirect measure of how close the estimated $\hat{\mu}$ is to the believed $\mu$. If the believed $\mu$ is the actual one, then the
estimated $\hat{\mu}$ should lie close to it. In other words, the observed value of the test statistic should lie close to
zero.
Now, what is the probability of observing an estimated mean < 295 or > 305? Actually, the more correct question
is: what is the probability of observing t < -1.77 or t > 1.77? This tells us how close to or far from
the believed mean of 300 we are.
This probability is called the p-value: here, the p-value is the probability that a t-score with 49 degrees of freedom
is less than -1.77 or greater than 1.77.
More generally speaking, the p-value is the probability, under the assumption of a given hypothesis H, of finding data with equal or
greater incompatibility with the predictions of H. Here we are trying to get a
p-value for the null hypothesis ($H_0$).
If the p-value is small, the data are far from the $\mu$ = 300 hypothesis and we can reject this hypothesis. If the p-value is
large, the data are close to the $\mu$ = 300 hypothesis. How small the p-value has to be to reject the null hypothesis
is up to us. In standard practice in physics (say, for the Higgs discovery) we reject the null hypothesis
(the background hypothesis) if the p-value is less than $2.87 \times 10^{-7}$ (this corresponds to $5\sigma$; we will discuss it later).
This threshold is called the
significance level. Here, however, the question says to use a significance level of 0.05. This means that if the p-value is
less than 0.05, we will reject the null hypothesis.
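The link between a number of standard deviations and a p-value, as in the 5 sigma convention mentioned above, can be sketched as follows. This assumes the usual one-sided Gaussian tail convention; the function name is only illustrative.

```python
import math


def p_value_from_z(z):
    """One-sided Gaussian tail probability beyond z standard deviations."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


for z in (1.645, 3.0, 5.0):
    print(f"Z = {z}: p = {p_value_from_z(z):.3g}")
```

A one-sided threshold of 0.05 corresponds to roughly 1.645 sigma, while 5 sigma corresponds to a p-value of about 2.87e-7.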
So in this case we use the t Distribution Calculator to find P(t < -1.77) = 0.04 and P(t > 1.77) = 0.04. Thus,
the p-value = 0.04 + 0.04 = 0.08.
Interpret the results: since the p-value (0.08) is greater than the significance level (0.05), we cannot reject the null
hypothesis.
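The whole test can also be carried out without an online calculator. The sketch below uses only the Python standard library: it integrates the Student's t density numerically (Simpson's rule, an implementation choice made here) to obtain the two-sided p-value for the car example. The helper names are illustrative.

```python
import math


def t_pdf(t, dof):
    """Probability density of Student's t-distribution."""
    log_norm = (math.lgamma((dof + 1) / 2) - math.lgamma(dof / 2)
                - 0.5 * math.log(dof * math.pi))
    return math.exp(log_norm - (dof + 1) / 2 * math.log1p(t * t / dof))


def t_upper_tail(t, dof, steps=2000):
    """P(T > t) for t >= 0: Simpson's rule on [0, t] plus symmetry about 0."""
    h = t / steps
    total = t_pdf(0.0, dof) + t_pdf(t, dof)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, dof)
    return 0.5 - total * h / 3


# Numbers from the car example.
n, sample_mean, sample_std = 50, 295.0, 20.0
mu_null, alpha = 300.0, 0.05

std_error = sample_std / math.sqrt(n)            # error on the mean
t_obs = (sample_mean - mu_null) / std_error      # observed test statistic
p_value = 2 * t_upper_tail(abs(t_obs), n - 1)    # two-sided p-value

print(f"t = {t_obs:.2f}, p-value = {p_value:.3f}")
print("reject H0" if p_value < alpha else "cannot reject H0")
```

The p-value comes out close to the 0.08 quoted from the calculator, so the conclusion is the same.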
In general, different problems have different kinds of test statistic; for physics we will discuss
several varieties of test statistic. We will come to real physics problems later, but the basic concept is the same as what we
studied here.
Upper Limit Calculation
Example 1:
In the above example we could not reject the null hypothesis. Let's imagine for the moment that the null
hypothesis was actually false; then we failed to reject it because the data were insufficient. But wait, how do we know that the data
were insufficient? Well, we would have been able to reject the null hypothesis if P(t < -1.77) had been
0.025, because 0.025 + 0.025 = 0.05. To
get a smaller P(t < $t_{obs}$), the standard error has to be smaller. That is possible when the number of
observations is larger. In other words, with more data we will eventually be able to reject
the null hypothesis if it is false.
But what else can we say on the basis of the data?
Let's now calculate for which values of $\mu$ (the maximum or upper value and the minimum or lower value) we get an
observed value of the test statistic for which the p-value equals the significance level.
As we know, approximately,
P(t < -1.999) = 0.025
$(\bar{x} - \mu)/2.83 = -1.999$
$\mu$ = 300.65
$(\bar{x} - \mu)/2.83 = 1.999$
$\mu$ = 289.35
So based on our observation we have calculated the upper and lower limits at a confidence level of 95%,
because P(t < -1.999) + P(t > 1.999) = 0.05.
So now we can say that, at 95% confidence level, the values of $\mu$ outside the following interval are excluded:
[289.35, 300.65]
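The interval can be reproduced by solving for the critical t value numerically. The sketch below is standard-library Python; it finds the t value with 2.5% in each tail by bisection (an implementation choice made here) and converts it into limits on the mean. The exact critical value for 49 degrees of freedom comes out slightly above the rounded 1.999 quoted above, so the limits differ in the second decimal place.

```python
import math


def t_pdf(t, dof):
    """Probability density of Student's t-distribution."""
    log_norm = (math.lgamma((dof + 1) / 2) - math.lgamma(dof / 2)
                - 0.5 * math.log(dof * math.pi))
    return math.exp(log_norm - (dof + 1) / 2 * math.log1p(t * t / dof))


def t_upper_tail(t, dof, steps=2000):
    """P(T > t) for t >= 0: Simpson's rule on [0, t] plus symmetry about 0."""
    h = t / steps
    total = t_pdf(0.0, dof) + t_pdf(t, dof)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, dof)
    return 0.5 - total * h / 3


def t_critical(tail_prob, dof):
    """Find t such that P(T > t) = tail_prob, by bisection on [0, 20]."""
    lo, hi = 0.0, 20.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if t_upper_tail(mid, dof) > tail_prob:
            lo = mid      # tail still too large: move to larger t
        else:
            hi = mid
    return 0.5 * (lo + hi)


n, sample_mean, sample_std = 50, 295.0, 20.0
std_error = sample_std / math.sqrt(n)

t_crit = t_critical(0.025, n - 1)          # 2.5% in each tail, 95% in between
lower = sample_mean - t_crit * std_error
upper = sample_mean + t_crit * std_error
print(f"t_crit = {t_crit:.3f}, 95% interval = [{lower:.2f}, {upper:.2f}]")
```

The believed mean of 300 lies inside the interval, which is consistent with not being able to reject the null hypothesis at the 0.05 level.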
Example 2: Now let's try to understand the concept of an upper limit through another simple example.
Suppose you are doing a counting experiment. In this case the probability of observing a particular
number of events follows a Poisson distribution.
Now let's say the expected background is $b = 4.6$
and the observed number of events is $n_{obs} = 5$
$P(n \le n_{obs}; s, b) = \sum_{m=0}^{n_{obs}} [(s+b)^m e^{-(s+b)}]/m!$
Setting $P(n \le n_{obs}; s, b) = 0.05$ and solving for $s$ gives the upper limit on the Poisson mean of the signal at
the specified confidence level (95% here).
For the given problem we have
$0.05 = \sum_{m=0}^{5} [(s+4.6)^m e^{-(s+4.6)}]/m!$
Let's denote the solution of the above equation by $s_{up}$. So $s_{up}$ is the upper limit on the Poisson mean.
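The equation above has no closed-form solution, but it is easy to solve numerically. The sketch below assumes that the two numbers quoted in the example are an expected background of 4.6 events and an observed count of 5 (this reading of the example is an assumption), and solves for the signal mean by bisection using only the standard library.

```python
import math


def poisson_cdf(n, mean):
    """P(N <= n) for a Poisson distribution with the given mean."""
    term = math.exp(-mean)       # m = 0 term
    total = term
    for m in range(1, n + 1):
        term *= mean / m         # builds mean**m * exp(-mean) / m!
        total += term
    return total


def signal_upper_limit(n_obs, background, alpha=0.05):
    """Solve P(N <= n_obs; s + b) = alpha for s, by bisection on [0, 100]."""
    lo, hi = 0.0, 100.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if poisson_cdf(n_obs, mid + background) > alpha:
            lo = mid             # probability still too large: limit is at larger s
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Assumed inputs: 5 observed events on an expected background of 4.6.
s_up = signal_upper_limit(n_obs=5, background=4.6)
print(f"95% CL upper limit on the signal mean: {s_up:.2f}")
```

Any signal mean larger than the printed value would give 5 or fewer events less than 5% of the time, which is what excluding it at 95% confidence level means in this simple construction.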
Reproducing Higgs Combination Tool results
Set up the Higgs combination tool as shown below:
setenv SCRAM_ARCH slc6_amd64_gcc481
cmsrel CMSSW_7_1_5 ### must be a 7_1_X release >= 7_1_5; (7.0.X and 7.2.X are NOT supported eit
cd CMSSW_7_1_5/src
cmsenv
git clone https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit.git HiggsAnalysis/CombinedL
scram b -j 9
cd HiggsAnalysis/CombinedLimit/data/tutorials
Now I have a datacard of the simplest kind, shown below:
imax 1 number of bins
jmax 1 number of processes minus 1
kmax 0 number of nuisance parameters
----------------------------------------------------
bin of0j
observation 45
----------------------------------------------------
bin of0j of0j
process ggH qqWW
process 0 1
rate 20 30
----------------------------------------------------
After running the combination tool as follows, I get this output:
[bmahakud@lxplus0036 tutorials]$ combine -M Asymptotic DataCard.txt
>>> including systematics
>>> method used to compute upper limit is Asymptotic
>>> random number generator seed is 123456
TClass::TClass:0: RuntimeWarning: no dictionary for class stack<RooAbsArg*,deque<RooAbsArg*> > is
The signal model has no nuisance parameters. Please run the limit tool with no systematics (optio
To make things easier, I will assume you have done it.
Computing limit starting from observation
Will compute both limit(s) using minimizer Minuit2 with strategy 0 and tolerance 0.01
Median for expected limits: 0.603516
Sigma for expected limits: 0.307922
Restricting r to positive values.
Make global fit of real data
NLL at global minimum of data: 2.82412 (r = 0.750001)
Make global fit of asimov data
NLL at global minimum of asimov: 2.62233 (r = 0.00143079)
At r = 1.950001: q_mu = 9.53005 q_A = 28.02545 CLsb = 0.00101 CLb = 0.98634 CLs = 0.
At r = 1.350001: q_mu = 2.72502 q_A = 15.48876 CLsb = 0.04939 CLb = 0.98884 CLs = 0.
At r = 1.050001: q_mu = 0.73532 q_A = 10.16230 CLsb = 0.19558 CLb = 0.99011 CLs = 0.
At r = 1.200001: q_mu = 1.59107 q_A = 12.73279 CLsb = 0.10359 CLb = 0.98947 CLs = 0.
At r = 1.319842: q_mu = 2.47612 q_A = 14.92071 CLsb = 0.05779 CLb = 0.98897 CLs = 0.
At r = 1.343819: q_mu = 2.67317 q_A = 15.37176 CLsb = 0.05103 CLb = 0.98886 CLs = 0.
At r = 1.348615: q_mu = 2.71336 q_A = 15.46251 CLsb = 0.04976 CLb = 0.98884 CLs = 0.
-- Asymptotic --
Observed Limit: r < 1.3486
Expected 2.5%: r < 0.3100
Expected 16.0%: r < 0.4212
Expected 50.0%: r < 0.6035
Expected 84.0%: r < 0.8754
Expected 97.5%: r < 1.2165
Done in 0.00 min (cpu), 0.01 min (real)
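Calculation by hand:
I am taking reference from this paper
http://cds.cern.ch/record/1375842/files/ATL-PHYS-PUB-2011-011.pdf for reproducing the results by hand.
The test statistic here is
$q_\mu = -2 \ln[ L(\mu, \hat{\theta}_\mu) / L(\hat{\mu}, \hat{\theta}) ]$
Now, assuming no nuisance parameters,
$q_\mu = -2 \ln[ L(\mu) / L(\hat{\mu}) ]$
where $\hat{\mu}$ is the maximum likelihood estimator of the signal strength $\mu$ (called r in the output of the tool).
So now let's find $\hat{\mu}$ for this counting experiment.
For a counting experiment with one channel only, the likelihood is
$L(\mu) = \prod_i (\mu s_i + b_i)^{n_i} e^{-(\mu s_i + b_i)} / n_i!$; we can drop the subscript i, as we have only one bin in this calculation:
$L(\mu) = (\mu s + b)^{n_{obs}} e^{-(\mu s + b)} / n_{obs}!$ .......................(1)
To get $\hat{\mu}$ we set $dL/d\mu = 0$, where $n_{obs}$ is the number of observed events.
Solving this equation gives $\hat{\mu} = (n_{obs} - b)/s$, and putting this back into the likelihood gives
$L(\hat{\mu}) = n_{obs}^{n_{obs}} e^{-n_{obs}} / n_{obs}!$ .......................(2)
The test statistic, which is the profile likelihood ratio, then looks like this:
TS = $-2 \ln[ L(\mu)/L(\hat{\mu}) ]$ = $-2 \ln[ (\mu s + b)^{n_{obs}} e^{-(\mu s + b)} / (n_{obs}^{n_{obs}} e^{-n_{obs}}) ]$
= $-2 \ln[ ((\mu s + b)/n_{obs})^{n_{obs}} e^{n_{obs} - (\mu s + b)} ]$ .......................(3)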
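Equation (3) can be checked directly against the tool output. The sketch below evaluates it for the datacard numbers (signal rate 20, background rate 30, 45 observed events) and for an Asimov dataset in which the observation is set to the background expectation. The conversion to CLsb and CLb uses the standard asymptotic formulae, CLsb = 1 - Phi(sqrt(q_mu)) and CLb = Phi(sqrt(q_A) - sqrt(q_mu)); that the tool computes them this way is an assumption made here, which can be judged by comparing the printed values with the log above.

```python
import math

S, B, N_OBS = 20.0, 30.0, 45.0   # signal rate, background rate, observation (datacard)


def q_mu(mu, n):
    """Equation (3) for one Poisson bin; intended for B <= n <= mu*S + B."""
    expected = mu * S + B
    return -2.0 * (n * math.log(expected / n) + n - expected)


def phi(x):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def asymptotic_cls(mu):
    """Return (q_mu, q_A, CLsb, CLb, CLs) for signal strength mu."""
    q_obs = q_mu(mu, N_OBS)
    q_asimov = q_mu(mu, B)       # Asimov dataset: n equal to the background expectation
    cl_sb = 1.0 - phi(math.sqrt(q_obs))
    cl_b = phi(math.sqrt(q_asimov) - math.sqrt(q_obs))
    return q_obs, q_asimov, cl_sb, cl_b, cl_sb / cl_b


print("mu_hat =", (N_OBS - B) / S)   # maximum likelihood estimate (n_obs - b)/s
for mu in (1.95, 1.35, 1.05, 1.2):
    q_obs, q_a, cl_sb, cl_b, cl_s = asymptotic_cls(mu)
    print(f"At r = {mu}: q_mu = {q_obs:.5f} q_A = {q_a:.5f} "
          f"CLsb = {cl_sb:.5f} CLb = {cl_b:.5f} CLs = {cl_s:.5f}")
```

The best-fit value 0.75 agrees with the r reported for the global fit of real data, and the q_mu and q_A columns agree with the log to the printed precision.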
The distribution of this test statistic for our problem looks like the following plot:
[Plot of the test statistic distribution]
The code to generate the distributions and
the observed values of CLsb, CLs and CLb is linked below. It will not yield exact results, but the results
give an idea.
The code for producing the results is here:
https://github.com/bmahakud/LimitAndSignificance/blob/master/GenericTS.C
commands to run the code
and the results are as follows
Make global fit of asimov data
At r = 1.95 : q_mu = 9.53005 q_A = 25.9542 CLsb = 0.00108076 CLb = 0.973336
At r = 1.35 : q_mu = 2.72502 q_A = 16.223 CLsb = 0.0540263 CLb = 0.989736
At r = 1.05 : q_mu = 0.735322 q_A = 11.3574 CLsb = 0.191845 CLb = 0.969242
At r = 1.2 : q_mu = 1.59107 q_A = 13.7902 CLsb = 0.105511 CLb = 0.98232
At r = 1.31984 : q_mu = 2.47612 q_A = 15.4121 CLsb = 0.0720343 CLb = 0.987089
At r = 1.34382 : q_mu = 2.67317 q_A = 15.4121 CLsb = 0.0714571 CLb = 0.986043
At r = 1.34861 : q_mu = 2.71336 q_A = 15.4121 CLsb = 0.0697164 CLb = 0.98589
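For readers without ROOT, the toy-based approach can be sketched in plain Python. This is not the GenericTS.C macro linked above, only an independent illustration: it throws Poisson toys under the signal-plus-background and background-only hypotheses and counts how often the test statistic is at least as large as the observed one. The one-sided treatment (test statistic set to zero when the toy count exceeds the expectation, best-fit signal strength kept non-negative), the number of toys, and the seed are choices made here.

```python
import math
import random

random.seed(7)                    # fixed seed so the sketch is repeatable
S, B, N_OBS = 20.0, 30.0, 45      # datacard rates and observation


def q_mu(mu, n):
    """One-sided likelihood-ratio test statistic for one Poisson bin."""
    expected = mu * S + B
    if n >= expected:             # upward fluctuations do not count against mu
        return 0.0
    best = max(float(n), B)       # best-fit expectation with mu_hat >= 0
    return -2.0 * (n * math.log(expected / best) - expected + best)


def poisson(mean):
    """Draw one Poisson-distributed integer (Knuth's multiplication method)."""
    limit = math.exp(-mean)
    k, prod = 0, random.random()
    while prod > limit:
        k += 1
        prod *= random.random()
    return k


def cl_from_toys(mu, n_toys=10000):
    """Return (CLsb, CLb): fractions of toys with q_mu >= observed q_mu."""
    q_obs = q_mu(mu, N_OBS)
    n_sb = sum(q_mu(mu, poisson(mu * S + B)) >= q_obs for _ in range(n_toys))
    n_b = sum(q_mu(mu, poisson(B)) >= q_obs for _ in range(n_toys))
    return n_sb / n_toys, n_b / n_toys


cl_sb, cl_b = cl_from_toys(1.35)
print(f"At r = 1.35: CLsb = {cl_sb:.4f} CLb = {cl_b:.4f} CLs = {cl_sb / cl_b:.4f}")
```

Because the count is discrete and the number of toys is finite, the values fluctuate and need not match the asymptotic numbers exactly.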
This topic: CMS > StatisticsForLHCPhysics
Topic revision: r21 - 2016-10-24 - BibhuprasadMahakud1
Copyright © 2008-2018 by the contributing authors. All material on this
collaboration platform is the property of the contributing authors.