Unit 3
Introduction
Descriptive analysis converts raw data into a form that is easy to understand and interpret, i.e., it rearranges, orders, and summarizes the data to provide insightful information about it.
Descriptive analysis is the type of data analysis that helps describe, show, or summarize data points in a constructive way so that meaningful patterns can emerge from the data.
It is one of the most important steps in statistical data analysis. It gives you an overview of the distribution of your data, helps you detect typos and outliers, and enables you to identify similarities among variables, preparing you for further statistical analysis.
1. Descriptive techniques often include constructing tables of counts and means, measures of dispersion such as the variance or standard deviation, and cross-tabulations or "crosstabs" that can be used to examine many disparate hypotheses. These hypotheses often highlight differences among subgroups.
2. Measures like segregation, discrimination, and inequality are studied using specialised descriptive techniques. Discrimination, for example, is measured with the help of audit studies or decomposition methods. Greater segregation between groups, or greater inequality of outcomes, need not be wholly good or bad in itself, but it is often considered a marker of unjust social processes; accurate measurement of these patterns across space and time is a prerequisite to understanding them.
But this also enters the province of measuring impacts, which requires different techniques. Differences in means often arise from random variation, and statistical inference is required to determine whether observed differences could have occurred merely by chance.
1. Measures of Frequency
In descriptive analysis, it’s essential to know how frequently a certain event or response occurs. Producing such counts or percentages is the prime purpose of measures of frequency.
For example, consider a survey where 500 participants are asked about their favourite IPL
team. A list of 500 responses would be difficult to consume and accommodate, but the data
can be made much more accessible by measuring how many times a certain IPL team was
selected.
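As a minimal sketch of such a frequency count, assuming a hypothetical list of survey responses (the team names and values below are illustrative only):

```python
# Count how often each team was chosen and report counts and percentages.
from collections import Counter

responses = ["CSK", "MI", "RCB", "CSK", "KKR", "MI", "CSK"]  # a small illustrative sample

frequency = Counter(responses)      # absolute count per team
total = len(responses)

for team, count in frequency.most_common():
    print(f"{team}: {count} responses ({count / total:.0%})")
```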
2. Measures of Central Tendency
In descriptive analysis, it’s also important to find the central (or average) tendency of the responses. Central tendency is measured with three averages — mean, median, and mode. As an example, consider a survey in which the weight of 1,000 people is measured. In this case, the mean would be an excellent descriptive metric for summarizing the mid-value of the data.
3. Measures of Dispersion
Descriptive analysis also measures how spread out the responses are, using statistics such as the range, variance, and standard deviation (covered in detail later in this unit).
4. Measures of Position
Descriptive analysis also involves identifying the position of a single value or response in relation to the others. Measures such as percentiles and quartiles are very useful here.
Apart from this, if you’ve collected data on multiple variables, you can use bivariate or multivariate descriptive statistics to study whether there are relationships between them.
In bivariate analysis, you simultaneously study the frequency and variability of two different variables to see whether they vary together in a pattern. You can also compare the central tendency of the two variables before carrying out further statistical analysis.
Multivariate analysis is the same as bivariate analysis but is carried out for more than two variables. The following two methods are used for bivariate analysis.
1. Contingency table
In a contingency table, each cell represents a combination of the two variables. Typically, the independent variable (e.g., gender) is listed along the vertical axis and the dependent variable (e.g., number of activities) along the horizontal axis. You read across the table to see how the independent and dependent variables relate to each other.
Group    0–4    5–8    9–12    13–16    17+
Men       33     68      37       23     22
Women     36     48      44       83     25
A table showing a tally of the two genders by number of activities
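The tally above can be reproduced as a small pandas DataFrame; with raw per-respondent records, pandas.crosstab would build the same table directly. The counts here are taken from the table above.

```python
# Rebuild the gender-by-activities contingency table as a DataFrame.
import pandas as pd

table = pd.DataFrame(
    {"0–4": [33, 36], "5–8": [68, 48], "9–12": [37, 44],
     "13–16": [23, 83], "17+": [22, 25]},
    index=["Men", "Women"],
)
print(table)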
2. Scatter plots
A scatter plot is a chart that lets you see the relationship between two (or sometimes three) variables. It is a visual rendition of the strength of a relationship.
In a scatter plot, one variable is plotted along the x-axis and the other along the y-axis; each observation is denoted by a point in the chart.
The scatter plot shows the hours of sleep needed per day by age.
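A chart like the one described above can be sketched with matplotlib; the age and sleep-hour values below are illustrative placeholders, not survey data.

```python
# Plot illustrative sleep requirements against age as a scatter plot.
import matplotlib.pyplot as plt

ages = [1, 5, 10, 16, 25, 40, 65, 80]
sleep_hours = [14, 11, 10, 9, 8, 7.5, 7, 7.5]

plt.scatter(ages, sleep_hours)
plt.xlabel("Age (years)")
plt.ylabel("Hours of sleep per day")
plt.title("Hours of sleep needed per day by age")
plt.show()
```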
Advantages of Descriptive Analysis
● A high degree of objectivity and neutrality on the part of the researchers is one of the main advantages of descriptive analysis. Researchers need to be extra vigilant because descriptive analysis shows the characteristics of the extracted data, and if the data does not match the observed trends, a large amount of data may have to be discarded.
● Descriptive analysis is considered broader than other quantitative methods and provides a wider picture of an event or phenomenon. It can use any number of variables, or even a single variable, to conduct descriptive research.
● This type of analysis is considered a good method for collecting information that describes relationships as they naturally occur and presents the world as it exists. This makes the analysis very realistic, since the trends are derived from the real-life behaviour of the data.
● It is useful for identifying variables and new hypotheses that can be further analyzed through experimental and inferential studies. The margin for error is small because the trends are taken directly from the properties of the data.
● This type of study gives the researcher the flexibility to use both quantitative and qualitative data in order to discover the properties of the population.
For example, researchers can use both case studies (a qualitative analysis) and correlation analysis to describe a phenomenon in their own ways. Using case studies to describe people, events, and institutions enables the researcher to understand the behaviour and patterns of the group concerned to the maximum extent.
● In the case of surveys, which are one of the main types of descriptive analysis, the researcher tends to gather data points from a relatively large number of samples, unlike experimental studies, which generally need smaller samples.
This is a clear advantage of the survey method over other descriptive methods: it enables researchers to study larger groups of individuals with ease. If surveys are properly administered, they give a broader and cleaner description of the unit under research.
Data Visualization
What is Data Visualization?
Data visualization is the graphical representation of datasets and information. Data
visualization is an umbrella term for visualizing all types of data through charts, graphs, and
maps.
The ultimate goal is to visually represent your data in an accessible and easy-to-understand
manner. Visualizing data is a fundamental step in understanding trends, uncovering patterns,
and tracking outliers.
There is no question that datasets are growing at a tremendous rate. Combine this with increasingly advanced IT systems, and you have a colossal pain point for businesses.
The fact that data is becoming so overwhelming for organizations has spurred the rise
of AIOps. AIOps helps businesses with various use cases such as:
● Predictive alerting
● Prioritizing events
● Predictive outages
2. Greater Accessibility
We mentioned how large datasets are now accessible for a greater number of users. This
demonstrates that data visualization is a key factor in data democratization.
Data visualization tools help simplify complex data points and present them in highly digestible ways. This growing accessibility can help upskill employees and make businesses more efficient.
Traditional means of sifting through data were meticulous and time-consuming. Data
visualization helps businesses discover insights at a much faster rate than prior to the advent
of visualization tools.
Speed is central here. This growing scalability means that business leaders have more room
to be granular in their analysis. If data is mapped out more quickly, IT teams and data
scientists have more time to draw more complex insights from their well-organized
databases.
Prior to data democratization, gaps in communication were all too common for enterprises
and businesses alike. Boiling down and explaining advanced insights can be difficult without
a common understanding of what the datasets behind these insights mean. With modern data
visualization software, such as Tableau and Microsoft Power BI, data analysis is broadened
to virtually any department within your organization.
Here are examples of various forms of data visualization and their use cases:
Bar graphs: These types of graphs are best utilized to compare aspects of different groups or
to track those aspects over time. Bar graphs are best used when changes are rather large.
Line graphs: One of the most popular and fundamental forms of data visualization are line
graphs, which are used to track changes over short and long periods of time. Line graphs are
particularly useful to highlight smaller changes.
Graphs are, for the most part, rather modular. For example, a line graph might track bounce rates over time, in contrast to a bar graph representing page load times.
Pie Charts: Another fundamental form of data visualization, pie charts are effective for
comparing parts of a whole. Because they are not placed on an X-Y plot, tracking data over
time is not possible with a pie chart.
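A minimal matplotlib sketch of the three chart types discussed above, using made-up values:

```python
# Draw a bar graph, a line graph, and a pie chart side by side.
import matplotlib.pyplot as plt

fig, (bar_ax, line_ax, pie_ax) = plt.subplots(1, 3, figsize=(12, 4))

# Bar graph: comparing a quantity across groups
bar_ax.bar(["Group A", "Group B", "Group C"], [120, 95, 140])
bar_ax.set_title("Bar graph")

# Line graph: tracking a value over time
line_ax.plot([2019, 2020, 2021, 2022], [42, 47, 45, 51], marker="o")
line_ax.set_title("Line graph")

# Pie chart: parts of a whole
pie_ax.pie([45, 30, 25], labels=["Yes", "No", "Undecided"], autopct="%1.0f%%")
pie_ax.set_title("Pie chart")

plt.tight_layout()
plt.show()
```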
These charts are very basic examples of data visualization. Many modern tools are designed
to open up complex methods to the everyday user. For instance, here’s one way a
management trainee in the finance industry used data visualization software for their needs:
“Power BI is a widely used software in our organization where we deal with a huge amount
of raw data and process it to gather actionable insights. It helps us to visualize scattered and
unfiltered information efficiently and easy to understand manner…Overall, I would say that
this is a must-have software for any enterprise that directly witnesses a lot of data being
gathered to formulate strategies and plan of actions” – Management Trainee in finance
industry, review of Microsoft Power BI at Gartner Peer Insights.
Scatter Plots: A slightly more advanced data visualization method is the scatter plot. Scatter plots are an effective way to explore the relationship between two variables across multiple sets of data. A typical example is a scatter plot mapping profitability in various American cities, where cities with greater profitability are drawn as larger circles.
Exploratory Data Analysis Using Data Visualization Techniques
Descriptive Statistics Defined
Descriptive statistics describe, show, and summarize the basic features of a dataset found in a
given study, presented in a summary that describes the data sample and its measurements. It
helps analysts to understand the data better.
Descriptive statistics represent the available data sample and do not include theories,
inferences, probabilities, or conclusions. That’s a job for inferential statistics.
Here’s a simple example. Assume a dataset of 2, 3, 4, 5, and 6, which sums to 20. The dataset’s mean is 4, arrived at by dividing the sum by the number of values (20 divided by 5 equals 4).
Analysts often use charts and graphs to present descriptive statistics. If you stood outside of a
movie theater, asked 50 members of the audience if they liked the film they saw, then put
your findings on a pie chart, that would be descriptive statistics. In this example, descriptive
statistics measure the number of yes and no answers and show how many people in this
specific theater liked or disliked the movie. If you tried to come up with any other
conclusions, you would be wandering into inferential statistics territory, but we'll later cover
that issue.
Finally, political polling is considered a descriptive statistic, provided it’s just presenting
concrete facts (the respondents’ answers), without drawing any conclusions. Polls are
relatively straightforward: “Who did you vote for President in the recent election?”
Descriptive statistics break down into several types, characteristics, or measures. Some
authors say that there are two types. Others say three or even four.
Datasets consist of a distribution of scores or values. Statisticians use graphs and tables to
summarize the frequency of every possible value of a variable, rendered in percentages or
numbers. For instance, if you held a poll to determine people’s favorite Beatle, you’d set up
one column with all possible variables (John, Paul, George, and Ringo), and another with the
number of votes.
Measures of central tendency estimate a dataset's average or center, finding the result using
three methods: mean, mode, and median.
Mean: The mean is also known as “M” and is the most common method for finding averages.
You get the mean by adding all the response values together and dividing the sum by the
number of responses, or “N.” For instance, say someone is trying to figure out how many
hours a day they sleep in a week. The data set would be the hour entries (e.g.,
6, 8, 7, 10, 8, 4, 9), and the sum of those values is 52. There are seven responses, so N = 7. You
divide the value sum of 52 by N, or 7, to find M, which in this instance is approximately 7.43.
Mode: The mode is just the most frequent response value. Datasets may have any number of
modes, including “zero.” You can find the mode by arranging your dataset's order from the
lowest to highest value and then looking for the most common response. So, in using our
sleep study from the last part: 4,6,7,8,8,9,10. As you can see, the mode is eight.
Median: Finally, we have the median, defined as the value in the precise center of the dataset.
Arrange the values in ascending order (like we did for the mode) and look for the number in
the set’s middle. In this case, the median is eight.
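These three values can be verified with Python’s statistics module:

```python
# Mean, mode, and median of the sleep data used above.
import statistics

hours = [6, 8, 7, 10, 8, 4, 9]

print(statistics.mean(hours))    # 52 / 7 ≈ 7.43
print(statistics.mode(hours))    # 8, the most frequent value
print(statistics.median(hours))  # 8, the middle value once sorted
```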
The measure of variability gives the statistician an idea of how spread out the responses are.
The spread has three aspects — range, standard deviation, and variance.
Range: Use range to determine how far apart the most extreme values are. Start by
subtracting the dataset’s lowest value from its highest value. Once again, we turn to our sleep
study: 4,6,7,8,8,9,10. We subtract four (the lowest) from ten (the highest) and get six. There’s
your range.
Standard Deviation: This aspect takes a little more work. The standard deviation (s) is your dataset’s average amount of variability, showing you how far each score lies from the mean. The larger your standard deviation, the greater your dataset’s variability. Follow these six steps:
1. List each score and find the mean.
2. Subtract the mean from each score to get the deviation.
3. Square each deviation.
4. Add up all the squared deviations.
5. Divide the sum of the squared deviations by N - 1.
6. Take the square root of that result.
Using the sleep data (mean ≈ 7.43):
Score   Deviation from mean   Squared deviation
4       4 - 7.43 = -3.43      11.76
6       6 - 7.43 = -1.43      2.04
7       7 - 7.43 = -0.43      0.18
8       8 - 7.43 = 0.57       0.33
8       8 - 7.43 = 0.57       0.33
9       9 - 7.43 = 1.57       2.47
10      10 - 7.43 = 2.57      6.61
When you divide the sum of the squared deviations by 6 (N - 1): 23.72/6, you get 3.95, and the square root of that result is about 1.99. As a result, we now know that each score deviates from the mean by an average of roughly 1.99 points.
Variance: Variance reflects the dataset’s degree of spread. The greater the degree of data spread, the larger the variance relative to the mean. You can get the variance by simply squaring the standard deviation. Using the above example, we square 1.99 and arrive at approximately 3.95.
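The range, standard deviation, and variance of the same sleep data can be verified in the same way (statistics.stdev and statistics.variance use the N - 1 divisor):

```python
# Range, sample standard deviation, and sample variance of the sleep data.
import statistics

hours = [6, 8, 7, 10, 8, 4, 9]

print(max(hours) - min(hours))     # range: 10 - 4 = 6
print(statistics.stdev(hours))     # ≈ 1.99
print(statistics.variance(hours))  # ≈ 3.95
```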
Univariate descriptive statistics are helpful when it comes to summarizing huge amounts of
numerical data as well as revealing patterns in the raw data. Patterns discovered in univariate
data may be described using central tendency (mean, mode, and median), as well as
dispersion: variance, range, quartiles, standard deviations, maximum, and minimum.
When dealing with univariate data, you have numerous options for describing it, including:
● Histograms
● Bar Charts
● Pie Charts
● Frequency Polygon
Bivariate statistics examine the connection between two variables; they may be descriptive or inferential. In other words, bivariate statistics investigate how one variable compares to another or how one variable impacts another.
Bivariate descriptive statistics include studying (comparing) two variables at the same time in
order to see whether there is a link between them. By convention, the columns represent the
independent variable and the rows represent the dependent variable.
Sampling Methods
One of the most common methods used for analyzing and measuring data on a large scale is
Sampling. There are various types of Sampling and Sampling methods used in statistical
analysis.
What is Sampling?
With the help of Sampling, an arbitrary section of a population is taken as a sample for
analysis. It helps analysts to make inferences about an entire population quicker than the
manual observation strategy.
So, for statistical analysis of a large population, it is common practice to take a sample. Sampling thus makes a study much more efficient and cost-effective, which underlines its importance in statistics.
There are different Sampling techniques, each applying a unique strategy to gain knowledge about a broad set of near-homogeneous elements.
Different Types of Sampling Methods
Sampling methods can be broadly categorized into two types – random or probability
Sampling methods and non-random or non-probability Sampling methods.
1) Random or probability Sampling methods can be further subdivided into 2 types, i.e., unrestricted (simple) random Sampling and restricted random Sampling.
2) Restricted random Sampling can be further classified as systematic Sampling, stratified Sampling, and cluster Sampling.
3) Meanwhile, non-random or non-probability Sampling consists of 3 types: judgment Sampling, quota Sampling, and convenience Sampling. You can get a clear understanding of the various methods of Sampling and their types from the outline below –
Restricted Random Sampling
● Systematic Sampling
● Stratified Sampling
● Cluster Sampling
Non-Random Sampling
● Judgment Sampling
● Quota Sampling
● Convenience Sampling
Random or Probability Sampling
Among the different types of Sampling in statistics, random or probability Sampling method
deserves mention. In the case of random or probability Sampling methods, every individual
element or observation has an equal chance to be selected as samples.
In this method, there should be no scope of bias or any pattern when drawing a
selected group of elements for observation.
As per the law of statistical regularity, a random or probable sample of an adequate
size which has been taken from a large population tends to have the same features and
characteristics as those of the entire population as a whole.
In a population of 1,000 people, each person has a one-in-a-thousand probability of being selected for a sample. Random probability Sampling restricts population bias and ensures that all individuals in the population have an equal opportunity of being included in the sample.
Random or probability Sampling can be broken down into 4 types: simple random Sampling, systematic Sampling, stratified Sampling, and cluster Sampling.
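As a rough sketch of how some of these probability Sampling schemes can be implemented, the following uses Python’s random module on a toy population of ID numbers; the sample size and the urban/rural strata split are purely illustrative.

```python
# Illustrative simple random, systematic, and stratified sampling.
import random

population = list(range(1, 1001))   # 1,000 individuals identified by number
random.seed(42)                     # for reproducibility

# Simple random sampling: every individual has an equal chance of selection.
simple_sample = random.sample(population, 50)

# Systematic sampling: pick a random start, then every k-th individual.
k = len(population) // 50
start = random.randrange(k)
systematic_sample = population[start::k]

# Stratified sampling: sample proportionally from each stratum.
strata = {"urban": population[:600], "rural": population[600:]}
stratified_sample = [
    person
    for members in strata.values()
    for person in random.sample(members, len(members) * 50 // len(population))
]
```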
Estimation Techniques
Estimation techniques are used to forecast the cost and effort involved in following a particular course of action, such as the implementation of a solution.
Estimation techniques help organizations make strategic business decisions by analyzing a number of factors.
Once all these factors are analyzed, the result of the estimation can be shown as a single number; but if the results are described as a range, with minimum and maximum values along with a probability, it may be easier for the stakeholders to understand.
This minimum-to-maximum range is called a confidence interval, and it expresses the level of uncertainty in the estimation results.
The greater the uncertainty, the wider the confidence interval will be.
1. Methods: there are different estimation methods that can be used for specific situations, but it is very important that the stakeholders involved in the estimation have a shared understanding of the elements that are to be estimated.
This shared understanding is usually achieved with the help of a decomposition tool such as a work breakdown structure, which helps break down a complex problem into simpler pieces.
When creating and communicating an estimate, the constraints and assumptions also need to be clearly communicated to avoid any confusion or misunderstanding.
• Top-down: in this technique, the solution is analyzed from the top level and then broken down into lower levels, which are summed up.
For example, when analyzing organizational budgets, the total cost of each department is first identified and then split across the individual units in each of those departments.
• Bottom-up: this technique examines the organization from the lower levels, builds up the estimated individual costs or effort, and then adds them across all elements to provide an overall estimate.
For example, using the same budgetary example from above, the budgets for the individual units in each department would first be identified, then summed up to their departments, and then summed up to the whole organization.
• Parametric estimation: this technique uses a calibrated parametric model of the element being estimated.
The estimators identify previous projects or solutions and analyze their costs, which provides an estimate of what the current solution or project might cost.
It is vital that the organization uses its own historical records to calibrate any parametric model, because those values reflect the abilities of its employees and the processes used to perform the work.
• Rough order of magnitude (ROM): this method uses a high-level estimate of the cost of the project or solution.
It is usually used when there is little information to work with and is heavily dependent on the estimation skills of the estimators.
• Rolling wave: this technique involves continuous estimation of the project throughout its lifecycle.
It is based on the idea that as the estimators’ knowledge grows, they will be able to give better-defined estimates for the next phase of the project.
• Delphi: this technique uses a mix of expert judgment and historical information.
It depends on historical records from the organization, which are used to adjust the estimates. The process involves creating initial estimates, sharing those estimates with the stakeholders, and continuously refining them until they are accepted by all the stakeholders.
• PERT: in this technique, each element of the estimate is given three values, which are:
(1) an optimistic value, representing the best case,
(2) a pessimistic value, representing the worst case, and
(3) a most likely value, which, as the name states, is the most likely value.
Then a PERT value for each estimated element is computed as a weighted average, using this formula:
PERT estimate = (optimistic + 4 × most likely + pessimistic) / 6
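A small helper illustrating the weighted average above; the three input figures in the example are made up.

```python
# PERT three-point estimate: (optimistic + 4 * most likely + pessimistic) / 6.
def pert_estimate(optimistic: float, most_likely: float, pessimistic: float) -> float:
    """Return the PERT weighted-average estimate for one element."""
    return (optimistic + 4 * most_likely + pessimistic) / 6

# Example: a task estimated at 20 hours best case, 30 hours most likely,
# 52 hours worst case (illustrative figures).
print(pert_estimate(20, 30, 52))  # 32.0
```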
2. Accuracy of the estimate: this indicates how close the estimate is expected to be to the actual result. It can be calculated as the ratio of the width of the confidence interval to its mean value, expressed as a percentage.
When there is little information, such as early in the development of a solution approach, a rough order of magnitude (ROM) estimate can be used; it is expected to have a wide range of likely values and a high level of uncertainty.
ROM estimates are usually no more accurate than -50% to +50%, but a definitive estimate, which is more accurate, can be made as soon as more information is available.
Definitive estimates that are used for forecasting timelines, final budgets, and resource needs should ideally be accurate to within 10% or less.
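A minimal sketch of this accuracy measure, assuming the “mean value” is taken as the midpoint of the interval; the 44-to-54-hour range reuses the team example given later in this unit.

```python
# Accuracy of an estimate: width of the confidence interval as a
# percentage of its midpoint.
def estimate_accuracy(lower: float, upper: float) -> float:
    midpoint = (lower + upper) / 2
    return (upper - lower) / midpoint * 100

# Example: an estimate ranging from 44 to 54 hours.
print(f"{estimate_accuracy(44, 54):.1f}% of the midpoint")  # ≈ 20.4%
```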
3. Sources of information: estimators can use historical information from previous experience with the element being estimated to calculate the estimate.
There are also some other sources of information that might be helpful, including the following:
• Analogous situations: this uses estimates from a similar initiative in the organization’s industry, for example a competitor, to estimate the element in question.
• Organization history: this involves using historical records from similar projects in the
organization, especially if the same people and resources would be used to perform the work.
• Expert judgment: this involves using the expertise of individuals who are knowledgeable about the element being estimated.
It relies on the knowledge of those who have done similar work in the past, and this includes both internal and external people.
When using external experts, estimators should consider the relevant skills and abilities of those doing the estimation.
4. Precision and reliability of estimates: when numerous estimates are made for a specific attribute, the resulting estimate can be taken as an average of those estimates.
By analyzing measures of variability such as the variance, estimators can agree on a final estimate.
To show the degree of accuracy and precision, an estimate is often shown as a range of values with a confidence interval, i.e., its probability level.
For example, if a team estimated that some task would take 50 hours, a 90% confidence interval might be 44 to 54 hours, depending on the individual estimates given.
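As a rough sketch of how such a range might be derived, the following applies a normal approximation to hypothetical individual estimates; with so few values, a t-based interval would be somewhat wider.

```python
# Combine individual team estimates into a mean and an approximate 90% interval.
import statistics

team_estimates = [46, 48, 50, 52, 54]          # hours, one per team member (illustrative)
mean = statistics.mean(team_estimates)
margin = 1.645 * statistics.stdev(team_estimates) / len(team_estimates) ** 0.5  # z for 90%

print(f"Estimate: {mean:.0f} hours, approx. 90% interval: "
      f"{mean - margin:.0f} to {mean + margin:.0f} hours")
```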
Team estimates are usually more accurate than those of a single individual, especially if the team members are the people who would do the actual work.
Field experts can also provide estimates, especially for sensitive projects such as those used to fulfil industry regulations.
Estimation techniques have both strengths and limitations, which include the following:
Strengths
● Estimates provide a justification for an allocated budget, time frame, or the magnitude of a set of elements.
● If projects are planned without the use of estimates, it could lead to inadequate budgets and unrealistic time frames.
● If there is limited information or knowledge about a project, a rough estimate can initially be created. As more information becomes available, this estimate can be refined over time to improve its accuracy and help ensure the project’s success.
Limitations
• Estimates are only as accurate as the knowledge level of the estimators. If the estimators are
novices, their estimates can be way off the mark.
• Using just one estimation method may lead to unrealistic expectations of the project’s feasibility.
Probability Distributions
A probability distribution is a statistical function that describes all the possible values and likelihoods that a random variable can take within a given range. This range is bounded by the minimum and maximum possible values, but precisely where each possible value is likely to fall on the probability distribution depends on a number of factors. These factors include the distribution's mean (average), standard deviation, skewness, and kurtosis.
Perhaps the most common probability distribution is the normal distribution, or "bell curve," although several other distributions are commonly used. Typically, the data-generating process of some phenomenon will dictate its probability distribution; for a continuous variable, the function that describes this distribution is called the probability density function.
Probability distributions can also be used to create cumulative distribution functions (CDFs), which add up the probabilities of occurrence cumulatively and always start at zero and end at 100%.
Academics, financial analysts and fund managers alike may determine a particular stock's
probability distribution to evaluate the possible expected returns that the stock may yield in
the future. The stock's history of returns, which can be measured from any time interval, will
likely be composed of only a fraction of the stock's returns, which will subject the analysis
to sampling error. By increasing the sample size, this error can be dramatically reduced.
The most commonly used distribution is the normal distribution, which is used frequently in
finance, investing, science, and engineering. The normal distribution is fully characterized by
its mean and standard deviation, meaning the distribution is not skewed and does exhibit
kurtosis. This makes the distribution symmetric and it is depicted as a bell-shaped curve
when plotted. The standard normal distribution has a mean (average) of zero and a standard deviation of 1.0, with a skew of zero and kurtosis = 3. In a normal distribution, approximately 68% of the data will fall within +/- one standard deviation of the mean, approximately 95% within +/- two standard deviations, and 99.7% within +/- three standard deviations. Unlike the binomial distribution, the normal distribution is continuous, meaning that all possible values are represented (as opposed to just 0 and 1 with nothing in between).
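The 68/95/99.7 figures above can be checked with Python’s statistics.NormalDist:

```python
# Probability mass within 1, 2, and 3 standard deviations of the mean.
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

for k in (1, 2, 3):
    coverage = std_normal.cdf(k) - std_normal.cdf(-k)
    print(f"Within +/- {k} standard deviation(s): {coverage:.1%}")
# Within +/- 1 standard deviation(s): 68.3%
# Within +/- 2 standard deviation(s): 95.4%
# Within +/- 3 standard deviation(s): 99.7%
```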
Stock returns are often assumed to be normally distributed but in reality, they exhibit kurtosis
with large negative and positive returns seeming to occur more than would be predicted by a
normal distribution. In fact, because stock prices are bounded by zero but offer a potentially
unlimited upside, the distribution of stock returns has been described as log-normal. This
shows up on a plot of stock returns with the tails of the distribution having a greater
thickness.
Probability distributions are often used in risk management as well to evaluate the probability
and amount of losses that an investment portfolio would incur based on a distribution of
historical returns. One popular risk management metric used in investing is value-at-risk
(VaR). VaR yields the minimum loss that can occur given a probability and time frame for a
portfolio. Alternatively, an investor can get a probability of loss for an amount of loss and
time frame using VaR. Misuse and overreliance on VaR has been implicated as one of the major causes of the 2008 financial crisis.
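As an illustration of the historical-simulation approach to VaR, the sketch below takes the 5th percentile of a set of made-up daily returns as a 95% one-day VaR; real implementations use far longer return histories and portfolio weights.

```python
# Historical-simulation VaR: the loss at the 5th percentile of observed returns.
returns = [-0.021, 0.004, 0.013, -0.008, 0.006, -0.015, 0.009, 0.002, -0.030, 0.011,
           0.005, -0.004, 0.018, -0.012, 0.007, 0.001, -0.006, 0.010, -0.002, 0.003]

sorted_returns = sorted(returns)
index = int(0.05 * len(sorted_returns))   # position of the 5th percentile
var_95 = -sorted_returns[index]           # loss is the negated return
print(f"95% one-day VaR: {var_95:.1%} of the portfolio value")
```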
As a simple example of a probability distribution, consider the number observed when rolling two standard six-sided dice. Each die has a 1/6 probability of rolling any single number from one through six, but the sum of the two dice forms a probability distribution in which seven is the most common outcome (1+6, 6+1, 5+2, 2+5, 3+4, 4+3), while two and twelve are far less likely (1+1 and 6+6 respectively).
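A short sketch that enumerates this distribution by brute force:

```python
# Count how many of the 36 equally likely dice pairs produce each total.
from collections import Counter
from itertools import product

sums = Counter(a + b for a, b in product(range(1, 7), repeat=2))

for total in sorted(sums):
    print(f"{total:2d}: {sums[total]}/36 = {sums[total] / 36:.3f}")
```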