Advanced Statistics and Probability
What is the numerical measure of average variability of a data set around the mean – Standard Deviation
Which type of statistics helps us understand what the data looks like, without giving much information on the analysis of the data – Descriptive Statistics
How many different ways can four team members from the group of fifteen be
lined up for a photograph – 1365
Your team has 10 members and you need 3 of them for an app development.
How many possible combinations are there – 120
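The two counting answers above can be checked quickly in Python (a minimal sketch): an unordered selection is a combination, while an ordered line-up is a permutation.

    import math

    # Unordered selections (combinations): order does not matter.
    print(math.comb(10, 3))   # 120 ways to pick 3 of 10 team members
    print(math.comb(15, 4))   # 1365 ways to choose 4 members out of 15

    # Ordered arrangements (permutations): order matters, e.g. a photo line-up.
    print(math.perm(15, 4))   # 32760 ways to line up 4 of 15 in order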
In a hospital, 10% of patients have liver disease and 5% are alcoholics. Among those diagnosed with liver disease, 7% are alcoholics. If a patient is an alcoholic, what is the probability that they have liver disease? – 0.14
A watch manufacturer has two factories (FA, FB) and 60% of their watches are made at FA. It is known that 10% of the watches made at FA and 15% of those made at FB are defective. What is the probability that a selected defective watch was manufactured at FB? – 0.5
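Both conditional-probability answers above follow from Bayes' theorem; a minimal sketch of the two calculations using the numbers given in the questions:

    # Hospital: P(liver | alcoholic) = P(alcoholic | liver) * P(liver) / P(alcoholic)
    p_liver, p_alcoholic, p_alc_given_liver = 0.10, 0.05, 0.07
    print(p_alc_given_liver * p_liver / p_alcoholic)              # 0.14

    # Watches: P(FB | defective), using the law of total probability for P(defective)
    p_fa, p_fb = 0.60, 0.40
    p_def_given_fa, p_def_given_fb = 0.10, 0.15
    p_defective = p_fa * p_def_given_fa + p_fb * p_def_given_fb   # 0.12
    print(p_fb * p_def_given_fb / p_defective)                    # 0.5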
Central limit theorem and central tendency are the same thing – false
An essential component of the Central Limit Theorem is that – All of the above
Identify the variables that are continuous or discrete – Time and weight: continuous; Color and Country: discrete
What are the characteristics of a normal distribution – The mean lies at the center of the distribution
When the Standard Deviation in a Normal Distribution is higher, which of the
following is true? – peak is lower
There may be times when data is supposed to fit a normal distribution, but
does not. Which of the following could be reasons for this? – Outliers and
small sample size
If you reject H0 but H0 is true, what type of error has occurred? – Type I
Assuming innocence until proven guilty, a Type I error occurs when an
innocent person is found guilty – true
If a passing student is failed by an examiner, it is an example of – Type II error
Which variable represents the actual Type I error – p-value
A Type I error is also known as a – False Positive
The use of the laws of probability to make inferences and draw statistical
conclusions about populations based on sample data is referred to as –
Inferential Statistics
The p-value in statistical significance testing should be used to assess how
strong a relationship is. For example, if relationship A has a p=.04 and
relationship B has a p=.03 then you can conclude that relationship B is
stronger than relationship A – false
A confidence interval becomes narrower by increasing the – Degrees of Freedom
A door alarm works in 72 out of 100 cases and a surveillance camera works in 68 out of 100 cases. What is the probability of effective screening, keeping in mind that these two methods can be used together? – 0.49
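(As a worked check, assuming the two devices operate independently and that "effective screening" means both methods work together: 0.72 × 0.68 = 0.4896 ≈ 0.49.)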
Any hypothesis which is tested for the purpose of rejection under the
assumption that it is true is called – Null
An advertising agency wants to test the hypothesis that the proportion of
adults in a country who read a Sunday Magazine is 25 percent. The null
hypothesis is that the proportion reading the Sunday Magazine is – Equals
25%
In which examples could the binomial distribution be used – Modelling the number of successes in a fixed number of independent trials
A good way to get a small standard error is to use a – large sample
A statistician calculates a 95% confidence interval for the mean when the Standard Deviation is known. The confidence interval is Rs.18000 to Rs.22000; the sample mean is – Rs.20000
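(As a worked check, a confidence interval for the mean is symmetric about the sample mean, so the mean is the midpoint: (18000 + 22000) / 2 = Rs.20000, with a margin of error of Rs.2000.)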
The dividing point between the region where the null hypothesis is rejected
and the region where it is not rejected is said to be – Critical Value
Name    Age
Ram     21
Sandy   34
Jack    26
Here we considered the variable "age" only.
Let us try to solve this scenario with bivariate data analysis, in which we measure two variables for each observation.
Since we will tag the variables as x and y, let one observation be represented as (x, y).
Let x be the temperature and y be the corresponding length (expansion) of the iron rod.
From the observations, we are trying to analyse the thermal expansion of the iron rod with the change in temperature.
"Multivariate data analysis" provides powerful features with the help of which we can
analyse the above discussed factors.
Multivariate Statistics
Many times we deal with statistical problems which have many variables involved in the study. This can even grow to thousands of variables!
Multivariate statistics deals with the simultaneous observation and analysis of several variables and the relations between them. The main purpose is to find out how they interact with each other.
Categorization of variables
Dimensionality reduction
Cause-effect relationship
Categorization of variables:
Categorizing the variables makes it easy to run computations on the data.
Dimensionality reduction:
Reducing the complexity and number of variables without losing the properties of the variables is known as dimensionality reduction.
Cause-effect relationship:
The effect that one variable has on the value of another variable during the analysis is termed the cause-effect relationship.
For example, let there be a cube with known dimensions (length, width and height) from which we will find the unknown quantity, volume. This quantity, which depends on several variables, can be described as:
V = L * W * H
Applications are endless; you will explore them as you learn more about the subject.
Simpson’s Paradox
Sometimes a multivariate analysis produces a misleading result because of a lurking variable that is left unanalysed but seriously affects the relationship between the analysed variables.
Hence, lurking variables are those unanalysed variables that are not taken into consideration during data analysis but seriously affect the result of the analysis!
Lurking Variable!
"Lurking variable" takes its name from the word lurk which means remaining hidden
waiting for something to happen!
Hence, it is a kind of variable which remains hidden during the analysis of data but its
presence seriously affects the relationship between other variables. This makes lurking
variables to be identified and analysed in detail.
A situation might arise where we find a different expansion trend once we include the hidden variable in the analysis. Here, the hidden variable is whether the iron rod had been preheated some time earlier or not. If the iron rod was preheated earlier, there will be less expansion at a given temperature compared to the normal expansion at that temperature.
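A small numerical sketch of this effect (the figures below are invented for illustration): within each group the expansion rises with temperature, but pooling the groups reverses the apparent trend, because the preheated rods happen to be measured at higher temperatures.

    import numpy as np

    # Hypothetical (temperature, expansion) measurements for two hidden groups.
    not_preheated = np.array([[20, 1.0], [30, 1.5], [40, 2.0]])
    preheated     = np.array([[60, 0.4], [70, 0.9], [80, 1.4]])

    def corr(data):
        # Pearson correlation between temperature and expansion.
        return np.corrcoef(data[:, 0], data[:, 1])[0, 1]

    print(corr(not_preheated))   # +1.0: expansion rises with temperature
    print(corr(preheated))       # +1.0: same trend within this group
    pooled = np.vstack([not_preheated, preheated])
    print(corr(pooled))          # negative: the pooled trend reverses (Simpson's paradox)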
"The variable c denoted the temperature of the engine. He noted down the corresponding
temperature for each degree of throttle. Finally, he found that increase in the value of t and
decrease in the value of g resulted in the decrease of c."
Dependent variables refer to those variables whose values are measured as the outcome of the experiment and depend on the other variables. Here, c was the dependent variable.
Independent variables refer to those variables which act as inputs in the experiment. An independent variable is changed to different values to test its impact on the dependent variable! Here, t was the independent variable.
A boy is planning to make a paper aeroplane that he wants to send to his neighbour! He did extensive research and noted down several factors that could decide the trajectory of the paper plane, such as wing span, wind speed, weight, angle of inclination, humidity etc. Still, he is unable to decide which factor will have the greatest impact and which the least! What do you think could be the solution?
Problems like estimating the manufacturing cost, given the total number of units to be manufactured, can be handled with least-squares regression analysis.
We find the least-squares regression line by computing its slope and intercept so that the residuals are as small as possible. Residuals are the errors between the observed data points and the values predicted by the line; they arise because the data on the chart do not lie exactly on a straight line.
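As a minimal sketch of the manufacturing-cost idea above (the unit counts and costs below are invented for illustration), a least-squares line can be fitted and its residuals inspected with numpy:

    import numpy as np

    # Hypothetical data: units manufactured vs. total cost (in thousands).
    units = np.array([100, 200, 300, 400, 500])
    cost  = np.array([12.0, 21.5, 30.0, 41.0, 49.5])

    # Least-squares fit of cost = slope * units + intercept.
    slope, intercept = np.polyfit(units, cost, deg=1)
    predicted = slope * units + intercept
    residuals = cost - predicted            # errors between the data and the line

    print(round(slope, 4), round(intercept, 2))
    print(np.round(residuals, 2))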
The goal of partial least squares regression is to predict the dependent variables from the independent variables (also known as predictors), especially when the predictors are numerous and highly correlated. In statistical terms, it can be said that we predict Y from X.
Principal component analysis is a type of multivariate data analysis for reducing a large number of correlated variables into a set of values of uncorrelated variables. These components are also known as the principal modes of variation.
An Example
Let us discuss Principal component analysis with an example:
A research firm decides to perform principal component analysis on a Mars rover being designed by them! The following characteristics are studied:
You already know about dimensions, right? Yes, we are talking about the same 1-D, 2-D and 3-D! Now let us take a look at how this leads to principal components with the help of an example!
Volume can be represented on a 3-D plot as the product of length, width and height.
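A minimal sketch of the idea in code (the dataset below is invented): a correlated 3-D dataset is centred and decomposed with the SVD, and the first two principal components give an uncorrelated, lower-dimensional representation.

    import numpy as np

    # Invented 3-D data: length, width (correlated with length) and height.
    rng = np.random.default_rng(0)
    length = rng.normal(10, 2, size=50)
    width  = 0.5 * length + rng.normal(0, 0.5, size=50)
    height = rng.normal(3, 1, size=50)
    X = np.column_stack([length, width, height])

    # Centre the data; the right singular vectors are the principal directions.
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained_variance = S**2 / (len(X) - 1)

    print(Vt)                   # principal directions (modes of variation)
    print(explained_variance)   # variance captured by each component

    # Project onto the first two uncorrelated components (dimensionality reduction).
    scores = Xc @ Vt[:2].T
    print(scores.shape)         # (50, 2)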
Key to Treasure!
"Future is hidden in the past". Hence, we will explore some basics that is necessary to
understand the CDF:
For example, when two dice are rolled together, the sample space is:
X = {(1,1), (1,2), (1,3), (1,4), (1,5), (1,6), (2,1), (2,2), (2,3), (2,4), (2,5),
(2,6), (3,1), (3,2), (3,3), (3,4), (3,5), (3,6), (4,1), (4,2), (4,3), (4,4), (4,5),
(4,6), (5,1), (5,2), (5,3), (5,4), (5,5), (5,6), (6,1), (6,2), (6,3), (6,4), (6,5),
(6,6)}
A random variable may also be continuous: for example, if the height growth of a plant is studied, then the height can take an infinite number of values within an interval.
What is the probability that a randomly ordered pizza weighs between 0.20 and 0.30 kg? In terms of probability, if X denotes the weight of a randomly selected pizza in kg, then what is P(0.20 < X < 0.30)?
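The question does not state the distribution of X, so as a purely illustrative sketch assume X is uniform between 0.15 kg and 0.40 kg; the answer then comes from differencing the CDF defined just below, P(0.20 < X < 0.30) = F(0.30) − F(0.20):

    # Hypothetical assumption: pizza weight X ~ Uniform(0.15, 0.40) kg.
    LOW, HIGH = 0.15, 0.40

    def F(x):
        # CDF of a uniform distribution on [LOW, HIGH].
        if x < LOW:
            return 0.0
        if x > HIGH:
            return 1.0
        return (x - LOW) / (HIGH - LOW)

    print(F(0.30) - F(0.20))   # 0.4 under this assumed distribution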
F(y) = P(X ≤ y)
or, equivalently,
F(x) = P(X ≤ x)
Let us take an example of tossing two coins together. Then the sample space will be:
S = {HH, HT, TH, TT}
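For the coin-toss example, let X be the number of heads; a minimal sketch builds F(x) = P(X ≤ x) directly from the sample space:

    from fractions import Fraction
    from itertools import product

    # Sample space of tossing two fair coins; X = number of heads.
    sample_space = list(product("HT", repeat=2))    # [('H','H'), ('H','T'), ...]

    def F(x):
        # F(x) = P(X <= x): count outcomes with at most x heads.
        favourable = sum(outcome.count("H") <= x for outcome in sample_space)
        return Fraction(favourable, len(sample_space))

    print([F(x) for x in range(3)])   # [1/4, 3/4, 1]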
The Basics
Don't think much about the title "Kernel Density Estimation" now!
What is parameter?
What is estimation?
What is probability density function?
What is density estimation?
What is non-parametric density estimation?
What is Parameter?
A parameter is a measurable quantity telling something useful about the population! Ex:
mean, median, mode, standard deviation etc.
What is Estimation?
In simple words, estimation is concerned with estimating parameters. It is a conclusion that we draw about the population from the sample. There are two kinds of estimation:
Point estimation
Interval estimation
Point estimation:
Point estimation uses a single value, known as a "statistic", which represents the best conclusion for the unknown parameter.
Interval estimation:
Interval estimation represents the estimate with two numbers which act as an interval of plausible values for the unknown parameter.
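A minimal sketch with an invented sample, using the sample mean as the point estimate and a normal-approximation 95% interval as the interval estimate:

    import numpy as np

    # Invented sample; the population mean is the unknown parameter.
    sample = np.array([20.1, 19.4, 21.0, 20.7, 19.9, 20.3, 20.8, 19.6])

    point_estimate = sample.mean()                 # a single "best guess" value
    margin = 1.96 * sample.std(ddof=1) / np.sqrt(len(sample))   # 95%, normal approx.
    interval_estimate = (point_estimate - margin, point_estimate + margin)

    print(point_estimate)
    print(interval_estimate)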
Density Estimation - Explained!
The simplest way to estimate a probability density is with a histogram. The histogram is constructed by dividing the range of the data into sub-intervals known as "bins". Whenever a new data point falls in a sub-interval, one block is stacked on top of that bin. The width of the bins may be set to 1 and is kept constant during the plotting process.
Where this estimation process falls short is that the plot is not smooth, and it also depends on the width (and placement) of the bins.
The Better Way!
When the blocks are centered over the data points themselves, rather than over fixed bins, the refined histogram is known as a box kernel density estimate; here a discontinuous (box-shaped) kernel is used to build the estimate.
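A minimal sketch of both ideas (the data and bandwidth below are invented): np.histogram with density=True gives the histogram estimate, and a hand-rolled box-kernel estimate centres a block of width h on every data point.

    import numpy as np

    data = np.array([1.2, 1.9, 2.1, 2.4, 3.0, 3.1, 4.5])

    # Histogram density estimate: bin the data and normalise to integrate to 1.
    hist, edges = np.histogram(data, bins=6, range=(0, 6), density=True)

    # Box kernel density estimate: a block of width h centred on each data point.
    def box_kde(x_grid, data, h):
        density = np.empty_like(x_grid, dtype=float)
        for i, x in enumerate(x_grid):
            inside = np.abs(data - x) <= h / 2      # points whose block covers x
            density[i] = inside.sum() / (len(data) * h)
        return density

    x_grid = np.linspace(0, 6, 61)
    print(hist)
    print(box_kde(x_grid, data, h=1.0)[:10])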
When the probability distribution of an uncertain quantity is specified in the absence of data, i.e. a lack of evidence, then it is called the prior probability distribution, or simply the prior.
For example, finding the probability distribution of the persons attending a wedding party, right after the invitations are sent and before any responses arrive, can be called a prior probability.
Let the prior probabilities be:
P(M) = 0.3
P(F) = 0.7
Then, the likelihood function is:
L(θ | X) = P(X | θ)
Normalizing constant:
The normalizing constant is the total probability of the evidence, P(X), obtained by summing P(X | θ) P(θ) over all possible values of θ.
Posterior probability:
If the parameter θ and the evidence X of the experiment are given, then the posterior probability is defined as p(θ | X) = P(X | θ) p(θ) / P(X).
Puzzle Solved!
Now let's consider the definition. According to the definition of the posterior, we know:
p(θ | X) = P(X | θ) p(θ) / P(X)
Putting the values into this equation gives the posterior probability for each hypothesis.
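The worked figures of the original puzzle are not reproduced here, so as a minimal sketch take the prior P(M) = 0.3, P(F) = 0.7 from above and purely hypothetical likelihoods P(X | M) and P(X | F) for some observed evidence X:

    # Prior from the text and hypothetical likelihoods (assumed for illustration).
    prior = {"M": 0.3, "F": 0.7}
    likelihood = {"M": 0.6, "F": 0.2}

    # Normalizing constant: P(X) = sum of P(X | h) * P(h) over the hypotheses.
    p_x = sum(likelihood[h] * prior[h] for h in prior)

    # Posterior: p(h | X) = P(X | h) * P(h) / P(X).
    posterior = {h: likelihood[h] * prior[h] / p_x for h in prior}
    print(posterior)   # {'M': 0.5625, 'F': 0.4375}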
A random walk is the kind of process in which we cannot predict the outcome in advance, as it is the result of a series of random movements! It is the path created by random events occurring in succession.
For example, the path travelled by a water molecule is a random walk.
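As a quick illustration (a minimal sketch), a one-dimensional random walk can be simulated by cumulatively summing random ±1 steps:

    import numpy as np

    # Each step moves +1 or -1 with equal probability; the path is the running sum.
    rng = np.random.default_rng(42)
    steps = rng.choice([-1, 1], size=20)
    path = np.cumsum(steps)
    print(path)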
Consider the transitions between three states A, B and C. A Markov chain gives the probability of transitioning from one state to another. It is to be noted that a state can also transition / hop to itself: for example, state A can hop to state B or C, and state A can also hop back to state A.
If each transition out of a state is equally likely in this three-state representation, the probability of every transition is 1/3 (about 0.33).
Stochastic Process and Markov Chain
Let there be a sequence of events. If we want to deduce the outcome at any stage of the sequence, it will depend on some probability. This probability distribution over the paths / sequences is called a stochastic process.
Now, if this stochastic process is finite, i.e. the number of events in the sequence is finite, and the outcome at any stage depends only on the outcome of the previous stage, then such a stochastic process is termed a Markov chain!
For example, suppose that:
30% of the population already riding the train in a given year discontinues riding it the next year.
20% of the population not riding the train starts riding it the next year.
5,000 people currently ride the train.
10,000 people do not ride the train.
We have to find the distribution of riders and non-riders in the next year.
Scenario Analysed!
We should determine the number of people riding the train next year. According to the data:
Riders next year = 0.70 × 5000 + 0.20 × 10000 = 3500 + 2000 = 5500
Also,
Non-riders next year = 0.30 × 5000 + 0.80 × 10000 = 1500 + 8000 = 9500
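The same update can be checked by multiplying the current state vector by the transition matrix (a minimal sketch):

    import numpy as np

    # Rows: current state (rider, non-rider); columns: next state.
    P = np.array([[0.70, 0.30],    # riders: 70% keep riding, 30% stop
                  [0.20, 0.80]])   # non-riders: 20% start riding, 80% stay off
    state = np.array([5000, 10000])   # current riders and non-riders

    print(state @ P)   # [5500. 9500.]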
Hope you had a nice time completing the journey on Advanced Statistics and Probability!