
UNIT–II: Statistical and Distribution Theory - Complete Guide

Topic 1: Discrete Random Variables — Basic Concepts


Introduction: What is a Random Variable?
Imagine you are conducting a survey on how many cups of coffee a person drinks in
a day. You ask 100 people. The answers you get—0, 1, 2, 3 cups—are not predictable
beforehand; they are random. A Random Variable is simply a mathematical way to
represent these uncertain numerical outcomes.

 It's a variable: Its value is not fixed.


 It's random: Its value is determined by chance.
 It's an outcome: It translates the results of a random process (like surveying) into
numbers.

In data analytics, we use random variables to model things we measure but cannot
perfectly predict:

 Number of website visitors in an hour.


 Number of defective items in a production batch.
 Customer churn (0 for stayed, 1 for left).

There are two main types:

1. Discrete Random Variable (DRV): Outcomes you can count. They are distinct and
separate values. (e.g., 0, 1, 2, 3 cups of coffee).
2. Continuous Random Variable (CRV): Outcomes you can measure. They can take
any value within an interval. (e.g., the exact time spent on a website: 2.35 minutes,
5.17 minutes, etc.).

This topic focuses on understanding the discrete type.

Formal Definition and Notation


A Discrete Random Variable (DRV) is a function that assigns a unique numerical value
to each outcome in a sample space of a random experiment. The key is that the set
of possible values is finite or countably infinite.

Notation:

 We denote the random variable itself by a capital letter, like X.


 We denote the specific possible values it can take by lowercase letters, like x.
 Example: Let X be the "number of heads when flipping a coin twice".
o The possible outcomes are: {TT, TH, HT, HH}.
o The random variable X assigns a number to each:

 TT -> X = 0
 TH -> X = 1
 HT -> X = 1
 HH -> X = 2
o So, the possible values x for X are {0, 1, 2}.

Properties and Rules


For a number to be a valid value of a discrete random variable, it must be associated
with a probability. These probabilities must follow two fundamental rules:

1. Non-Negativity: The probability for any specific value must be between 0 and 1.
o 0 ≤ P(X = x) ≤ 1 for all possible values x.
2. Sum to One: The sum of the probabilities for all possible values must equal 1. This
makes sense because something must happen.
o Σ P(X = x) = 1 for all x.
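These two rules are easy to verify in code. A minimal sketch (the probabilities here are illustrative, borrowed from the cart example later in this topic):

```python
# Hypothetical PMF for a discrete random variable X
pmf = {0: 0.15, 1: 0.40, 2: 0.30, 3: 0.12, 4: 0.03}

# Rule 1: Non-Negativity — every probability lies between 0 and 1
non_negative = all(0 <= p <= 1 for p in pmf.values())

# Rule 2: Sum to One — the probabilities must total 1
total = sum(pmf.values())
sums_to_one = abs(total - 1.0) < 1e-9  # tolerate floating-point error

print(non_negative, sums_to_one)  # a valid PMF prints: True True
```

A check like this is a useful sanity test whenever probabilities are estimated from data.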

Real-Life Data Analytics Example


Scenario: An e-commerce analyst wants to model the number of items customers
add to their cart in a single session.

 Random Variable (X): Number of items added to cart.


 Possible Values (x): {0, 1, 2, 3, 4+} (Often, we group larger values into a "4 or more"
category for simplicity).
 Dataset: After observing 10,000 shopping sessions, the analyst counts the frequency
of each value.
Number of Items (x) Number of Sessions Probability P(X = x)

0 1500 1500/10000 = 0.15

1 4000 4000/10000 = 0.40

2 3000 3000/10000 = 0.30

3 1200 1200/10000 = 0.12

4+ 300 300/10000 = 0.03

Total 10,000 1.00

This table is the core of the discrete random variable. It lists every possible outcome
and its probability.
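The probability column in the table is simply each frequency divided by the total number of sessions. A quick sketch of that computation:

```python
# Observed frequencies from 10,000 shopping sessions (from the table above)
counts = {'0': 1500, '1': 4000, '2': 3000, '3': 1200, '4+': 300}

total_sessions = sum(counts.values())  # 10000

# Estimated probability for each value: frequency / total
probabilities = {x: n / total_sessions for x, n in counts.items()}

for x, p in probabilities.items():
    print(f"P(X = {x}) = {p:.2f}")

# The estimated probabilities sum to 1, as required
print(f"{sum(probabilities.values()):.2f}")  # 1.00
```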
Step-by-Step Calculation of Probabilities
From the table above, we can now answer probabilistic questions:

1. What is the probability a random customer adds exactly 2 items?


o Look directly at the table: P(X = 2) = 0.30 or 30%.
2. What is the probability a customer adds more than 1 item?
o This means X = 2 OR X = 3 OR X = 4+.
o In probability, "OR" means we add the probabilities.
o P(X > 1) = P(X=2) + P(X=3) + P(X=4+) = 0.30 + 0.12 + 0.03 = 0.45 or 45%.
3. What is the probability a customer adds at least 1 item?
o This is the opposite of adding 0 items.
o P(X ≥ 1) = 1 - P(X=0) = 1 - 0.15 = 0.85 or 85%.
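All three calculations reduce to simple operations on the PMF table; a sketch (the 4+ bucket is represented here as the value 4):

```python
pmf = {0: 0.15, 1: 0.40, 2: 0.30, 3: 0.12, 4: 0.03}  # 4 stands for "4+"

# 1. Exactly 2 items: read the PMF directly
p_exactly_2 = pmf[2]  # -> 0.30

# 2. More than 1 item: add the probabilities for x = 2, 3, 4+
p_more_than_1 = sum(p for x, p in pmf.items() if x > 1)  # -> 0.45

# 3. At least 1 item: use the complement of x = 0
p_at_least_1 = 1 - pmf[0]  # -> 0.85

print(p_exactly_2, p_more_than_1, p_at_least_1)
```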

Visualization: Bar Plot


The best way to visualize a Discrete Random Variable is a bar plot (or
a histogram with discrete bins). The height of each bar represents the probability of
that outcome.

import matplotlib.pyplot as plt

# Data from our table


number_of_items = [0, 1, 2, 3, 4]
probabilities = [0.15, 0.40, 0.30, 0.12, 0.03]
labels = ['0', '1', '2', '3', '4+'] # Label for the x-axis

plt.figure(figsize=(8, 5))
plt.bar(number_of_items, probabilities, color='skyblue', edgecolor='black', t
ick_label=labels)
plt.title('Probability Distribution: Items Added to Cart')
plt.xlabel('Number of Items')
plt.ylabel('Probability')
plt.ylim(0, 0.5) # Set y-axis limit from 0 to 0.5 for better readability
plt.grid(axis='y', alpha=0.4)
plt.show()
Applications in Data Analytics
Understanding DRVs is crucial for:

 Customer Behavior Modeling: Predicting purchases, clicks, logins.


 Quality Control: Counting defects, errors, or failures in a process.
 Risk Assessment: Modeling the number of insurance claims, loan defaults, or system
failures.
 A/B Testing: Comparing the number of conversions between two website variants.

Case Study: E-Commerce Cart Abandonment


Business Problem: An e-commerce site has a high cart abandonment rate. They
want to understand if the number of items in the cart influences the likelihood of
abandonment.

Analysis:

1. The analyst defines X as the number of items in the cart at the time of abandonment
or purchase.
2. They calculate two probability distributions:
o One distribution for abandoned carts.
o One distribution for completed purchases.
3. By comparing these two distributions (e.g., using side-by-side bar charts), they might
discover that carts with only 1 item are abandoned 70% of the time, while carts with
3+ items are only abandoned 20% of the time.

Insight & Action: This suggests the business should create promotions or incentives
(like "add 2 more items for free shipping") to encourage customers with small carts
to add more items, potentially reducing abandonment.
Key Takeaways

 A Discrete Random Variable (DRV) counts outcomes and has distinct, separate
values.
 Its behavior is defined by its probability distribution—a list of all possible values
and their corresponding probabilities.
 The probabilities must be non-negative and sum to 1.
 Bar charts are the ideal way to visualize them.
 They are the foundation for modeling count-based events in business and analytics.

Common Pitfalls and Practice Questions

Pitfalls:

 Assuming Independence: Just because you can model something with a DRV
doesn't mean the events are independent. (e.g., adding one item might make you
more likely to add another).
 Ignoring the "Long Tail": In analytics, many events (like purchases) have a few very
high values. Grouping these into a "4+" category is practical, but be aware it hides
detail.

Practice Questions:

1. Define a DRV for the "sum of two dice rolls." List all its possible values.
2. In the cart example, what is P(X ≤ 2)?
3. You survey 10 people on how many smartphones they own. Is this a DRV or CRV?
Why?
4. If P(X=0) = 0.1, P(X=1)=0.3, P(X=2)=0.4, what must P(X=3) be?

(Answers: 1. Values 2-12. 2. P(X<=2) = P(0)+P(1)+P(2)=0.15+0.40+0.30=0.85.


3. DRV, because you count whole phones. 4. 1 - (0.1+0.3+0.4) = 0.2)
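The dice answer and the missing-probability answer can both be checked with a few lines of brute-force Python:

```python
from itertools import product

# Question 1: possible values of the sum of two dice rolls
sums = {a + b for a, b in product(range(1, 7), repeat=2)}
print(sorted(sums))  # [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

# Question 4: the remaining probability must bring the total to 1
p_3 = 1 - (0.1 + 0.3 + 0.4)
print(round(p_3, 1))  # 0.2
```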
Topic 2: Probability Mass Functions (PMF)
Introduction: Bridging from DRVs to PMFs
In the previous topic, we described a Discrete Random Variable (DRV) using a table
of values and their probabilities. A Probability Mass Function (PMF) is the formal
mathematical function that defines this table. It is the rule that assigns a probability
to each possible outcome of a discrete random variable.

Think of it this way:

 The concept of a DRV is the idea of counting uncertain outcomes.


 The PMF is the specific recipe or formula that tells you the probability of each count.

For data analysts, the PMF is a crucial tool. It provides a complete description of the
random variable's behavior, allowing us to calculate any probability we need and to
understand the likelihood of different scenarios before they happen.

Formal Definition and Properties


The Probability Mass Function (PMF) of a discrete random variable X is the
function p(x) that gives the probability that X takes exactly the value x:

p(x) = P(X = x)
For a function to be a valid PMF, it must satisfy the two fundamental rules of
probability we saw earlier:

1. Non-Negativity: The probability for every possible value x must be zero or
positive.

p(x) ≥ 0 for all x

2. Normalization: The sum of probabilities over all possible values x must equal 1.
This ensures that the probability of something happening is 100%.

Σ p(x) = 1, summed over all possible values x

How to Read and Use a PMF: A Simple Example


Let's take the classic example: flipping a fair coin twice. We defined our random
variable X as the number of heads.

 Sample Space: {TT, TH, HT, HH}


 Possible Values of X: {0, 1, 2}
The PMF is the function that gives us the probability for each value:

 p(0) = P(X = 0) = P(TT) = 1/4
 p(1) = P(X = 1) = P({TH, HT}) = 2/4 = 1/2
 p(2) = P(X = 2) = P(HH) = 1/4
We can express this PMF in a table:

x p(x)

0 0.25

1 0.50

2 0.25

Sum 1.0
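This PMF can be derived mechanically by enumerating the sample space and counting heads; a sketch:

```python
from itertools import product
from collections import Counter

# Sample space of two fair coin flips: TT, TH, HT, HH (all equally likely)
outcomes = list(product('HT', repeat=2))

# X = number of heads in each outcome
head_counts = Counter(o.count('H') for o in outcomes)

# PMF: p(x) = (# outcomes with x heads) / (# outcomes)
pmf = {x: n / len(outcomes) for x, n in sorted(head_counts.items())}
print(pmf)  # {0: 0.25, 1: 0.5, 2: 0.25}
```

The same enumeration approach scales to three or more flips, which is exactly how the practice question at the end of this topic can be solved.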

Real-Life Data Analytics Example: User Logins


Scenario: A data analyst at a social media company wants to model the number of
times a user logs into the app on a given day. After analyzing historical data, they
determine the following PMF:

Logins (x) Probability p(x)

0 0.20

1 0.35

2 0.25

3 0.15

4 0.05

Sum 1.00

This PMF is a data-driven model. It tells the company that:

 20% of users don't log in on a typical day.


 35% log in exactly once.
 Only 5% of users are highly engaged, logging in 4 times.
Step-by-Step Calculation of Probabilities using PMF
Using the PMF table above, we can answer complex business questions through
simple arithmetic.

1. What is the probability a randomly selected user logs in at least once?


o "At least once" means X ≥ 1.
o We find this by adding the probabilities for all values x ≥ 1.
o P(X ≥ 1) = p(1) + p(2) + p(3) + p(4) = 0.35 + 0.25 + 0.15 + 0.05 = 0.80.
o Alternatively, P(X ≥ 1) = 1 − P(X < 1) = 1 − p(0) = 1 − 0.20 = 0.80.
2. What is the probability a user logs in an odd number of times?
o The odd number outcomes are X = 1 and X = 3.
o P(X is odd) = p(1) + p(3) = 0.35 + 0.15 = 0.50.
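Both questions follow the same pattern: sum p(x) over the values x that satisfy a condition. A small helper function makes this explicit (PMF values taken from the login table above):

```python
pmf = {0: 0.20, 1: 0.35, 2: 0.25, 3: 0.15, 4: 0.05}

def prob(event):
    """Sum the PMF over all values x for which event(x) is True."""
    return sum(p for x, p in pmf.items() if event(x))

print(prob(lambda x: x >= 1))      # at least once  -> 0.80
print(prob(lambda x: x % 2 == 1))  # odd login count -> 0.50
```

Any event about X ("at most 2", "between 1 and 3", and so on) can be answered by passing a different condition.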
Visualization: Stem Plots and Bar Charts
The classic way to visualize a PMF is a stem plot (also called a spike plot). It places a
dot (or circle) at the probability for each value and draws a line from the x-axis to
that dot, emphasizing that these are discrete points.

A bar chart is also perfectly appropriate and is often used in business contexts for its
clarity.

import matplotlib.pyplot as plt

# Data from the User Logins PMF
logins = [0, 1, 2, 3, 4]
probability = [0.20, 0.35, 0.25, 0.15, 0.05]

# Create a figure with two subplots to compare visualizations
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Subplot 1: Stem Plot (statistically traditional)
ax1.stem(logins, probability, basefmt=' ')
ax1.set_title('Stem Plot of User Login PMF')
ax1.set_xlabel('Number of Daily Logins (x)')
ax1.set_ylabel('Probability p(x)')
ax1.set_ylim(0, 0.4)

# Subplot 2: Bar Chart (business intuitive)
ax2.bar(logins, probability, color='skyblue', edgecolor='black', alpha=0.7)
ax2.set_title('Bar Chart of User Login PMF')
ax2.set_xlabel('Number of Daily Logins (x)')
ax2.set_ylabel('Probability p(x)')
ax2.set_ylim(0, 0.4)

plt.tight_layout()
plt.show()

(This code would create a side-by-side comparison of a stem plot and a bar chart for
the same PMF.)

Applications in Data Analytics: Beyond Description


PMFs are not just for description; they are the foundation for prediction and
decision-making.

 Resource Planning: A cloud company uses the PMF of server requests per second to
determine how many servers to have running to handle load without over-
provisioning.
 Customer Segmentation: Users can be segmented based on their "number of
purchases" PMF. Marketing campaigns are then tailored to each segment (e.g., re-
engagement campaigns for the x=0 group).
 Risk Modeling: In finance, the PMF of "number of defaulting loans" in a portfolio is
used to calculate potential losses.

Case Study: Forecasting Support Ticket Volume


Business Problem: An IT support team needs to staff appropriately for the next day.
They need to predict how many tickets will come in.

Analysis:
1. The analyst reviews historical data and builds a PMF for the number of daily support
tickets (X).
2. The PMF might show:
o p(10) = 0.05 (a very slow day)
o p(25) = 0.20 (a typical day)
o p(40) = 0.02 (a very busy day)
3. The team calculates the expected value (a concept we'll cover later) from this PMF,
which is a weighted average. Let's say E[X] = 26 tickets.
4. They also look at the probability of high-volume
events: P(X > 35) = p(36) + p(37) + ... = 0.08.

Insight & Action: The team decides to schedule enough staff to handle 26 tickets
comfortably. However, because there's an 8% chance of getting more than 35 tickets,
they also create an on-call list for such high-volume scenarios. The PMF allows for
both optimal planning and contingency planning.

Key Takeaways

 A Probability Mass Function (PMF) p(x) is the function that gives the
probability that a discrete random variable X is exactly equal to x.
 It must satisfy: p(x) ≥ 0 and Σ p(x) = 1 (summed over all x).
 It is most commonly presented as a table or visualized with a stem plot or bar
chart.
 The power of the PMF lies in its ability to answer any probabilistic question about
the random variable through simple addition.
 It is a foundational tool for descriptive analytics, forecasting, and risk assessment.

Common Pitfalls and Practice Questions


Pitfalls:

 Confusing PMF with PDF: The PMF is for discrete variables (points). The Probability
Density Function (PDF) is for continuous variables (areas under a curve). This is a
critical distinction.
 Ignoring the Sample Space: The PMF is only defined for the possible values
of X; p(x) = 0 for any value x not in that list.

Practice Questions:

1. Is the function p(x) = x/10 for x = 1, 2, 3, 4 a valid PMF? Why or why not?
2. Using the support ticket PMF values below, what is the probability of having a typical
or slow day (X ≤ 25)?
o p(10) = 0.05, p(25) = 0.20, p(40) = 0.02
3. Create a PMF table for the random variable "the number of tails" when a fair coin is
tossed three times. (Hint: List all 8 outcomes in the sample space first).

(Answers: 1. Calculate sum: 1/10 + 2/10 + 3/10 + 4/10 = 10/10 = 1. And all p(x)
> 0. So, yes, it is valid. 2. P(X<=25) is not just p(10)+p(25). We are missing the
probabilities for all other values between 0 and 25. This highlights the pitfall of
needing a complete PMF. 3. Possible outcomes: {HHH, HHT, HTH, THH, HTT,
THT, TTH, TTT}. X = number of tails: {0, 1, 1, 1, 2, 2, 2, 3}. Therefore: p(0)=1/8,
p(1)=3/8, p(2)=3/8, p(3)=1/8.)
Topic 3: Continuous Random Variables — Basic Concepts
Introduction: The World is Continuous
In the previous topics, we dealt with outcomes you can count (number of items,
logins, heads). But many things we measure in data analytics are not counts; they
are measurements. These measurements can take on any value within an interval.

Consider:

 The exact time a customer spends on your website (e.g., 2.357 minutes).
 The height of a user (e.g., 175.4 cm).
 The annual revenue of a customer (e.g., $243,561.78).
 The temperature of a server CPU (e.g., 67.3°C).

These are not whole numbers. They can be infinitely precise. A Continuous Random
Variable (CRV) is used to model these types of outcomes. The key difference from a
Discrete Random Variable (DRV) is that a CRV can take on any value in a continuous
interval.

The Fundamental Difference: Probability at a Point


This is the most important conceptual leap. For a Discrete Random Variable (DRV),
we can calculate the probability that it takes on a specific value, like P(X = 2).

For a Continuous Random Variable (CRV), the probability of it taking on any single,
exact value is zero.

P(X = x) = 0 for any specific value x


Why? Because there are infinitely many possible values (e.g., between 2.3 and 2.4
minutes, you have 2.31, 2.311, 2.3111, etc.). The probability of landing on one specific
value out of infinity is effectively zero.

This doesn't mean the value is impossible; it just means we must think about
probability differently. For CRVs, we only calculate probability for intervals.
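This can be seen numerically: shrink an interval around a point and the probability it carries shrinks toward zero. A sketch using the exponential session-time model that appears later in this topic (mean of 10 minutes, so the CDF is 1 − e^(−t/10)):

```python
import math

def session_cdf(t, mean=10.0):
    """CDF of an exponential session-time model: P(T <= t) = 1 - e^(-t/mean)."""
    return 1 - math.exp(-t / mean) if t >= 0 else 0.0

# Probability that T lands in ever-smaller intervals around t = 5 minutes
for width in (1, 0.1, 0.01, 0.001):
    p = session_cdf(5 + width / 2) - session_cdf(5 - width / 2)
    print(f"P(5 - {width/2} <= T <= 5 + {width/2}) = {p:.6f}")

# The interval probability shrinks toward 0 as the interval shrinks:
# in the limit, P(T = 5) = 0 exactly.
```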

From Probability Mass to Probability Density


Since we can't talk about probability at a point, we introduce a new concept: density.
Instead of a Probability Mass Function (PMF), we use a Probability Density Function
(PDF), denoted by f(x).

Think of it like this:

 PMF (Discrete): The probability p(x) is the mass (the actual probability) assigned to
the point x.
 PDF (Continuous): The function f(x) gives the density of the probability at the
point x. It tells us how "packed" or "dense" the probability is around that point. To
find the actual probability, we must find the area under the PDF curve over an
interval.

The Probability Density Function (PDF) and its Rules


The PDF, f(x), for a continuous random variable X is a function that must satisfy two
conditions, analogous to the rules for a PMF:

1. Non-Negativity: The density is never negative. A negative density wouldn't make


sense.

f(x) ≥ 0 for all x

2. Normalization: The total area under the entire curve of the PDF must be exactly 1.
This represents the fact that the probability of X taking some value is 100%.

∫ f(x) dx = 1, integrated over the entire real line

The probability that X falls between two points a and b is the area under the PDF
curve between those points.

P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx
Real-Life Data Analytics Example: Website Session Duration
Scenario: A web analyst is modeling the time users spend on a website. This is a
continuous random variable T (time in minutes).

They determine that the time follows a distribution with the following PDF (this is a
simplified example):

f(t) = (1/10) e^(−t/10) for t ≥ 0


This is a known distribution (the exponential distribution) used for modeling waiting
times.

Question: What is the probability that a randomly selected user spends between 5
and 10 minutes on the site?

Answer: We need to find the area under the f(t) curve from t=5 to t=10.

P(5 ≤ T ≤ 10) = ∫₅¹⁰ (1/10) e^(−t/10) dt
We would calculate this integral to find the exact probability. (Spoiler: The result is
approximately e^{-0.5} - e^{-1} ≈ 0.6065 - 0.3679 = 0.2386 or 23.86%).
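That result can be checked without doing the calculus by hand, either with the closed-form answer or by numerical integration; a sketch:

```python
import math
from scipy.integrate import quad

# PDF from the example: f(t) = (1/10) e^(-t/10) for t >= 0
f = lambda t: (1 / 10) * math.exp(-t / 10)

# Numerically integrate f from t = 5 to t = 10
area, _ = quad(f, 5, 10)

# Closed form: e^(-0.5) - e^(-1)
exact = math.exp(-0.5) - math.exp(-1)

print(round(area, 4), round(exact, 4))  # 0.2387 0.2387
```

Both approaches agree: roughly a 24% chance a session lasts between 5 and 10 minutes.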

Visualization: The Area Under the Curve


This is best understood visually. The probability is not a height but an area.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import expon

# Create a range of time values from 0 to 30 minutes
t = np.linspace(0, 30, 500)

# Calculate the PDF f(t) for each value
pdf = expon.pdf(t, scale=10)  # scale = mean waiting time

# Create the plot
plt.figure(figsize=(10, 6))
plt.plot(t, pdf, 'b-', linewidth=2, label='PDF: f(t)')

# Shade the area for P(5 <= T <= 10)
t_shade = np.linspace(5, 10, 100)
pdf_shade = expon.pdf(t_shade, scale=10)
plt.fill_between(t_shade, pdf_shade, color='skyblue', alpha=0.7,
                 label='P(5 ≤ T ≤ 10)')

# Add labels and title
plt.title('Probability Density Function (PDF) of Website Session Time')
plt.xlabel('Time (minutes)')
plt.ylabel('Density f(t)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

This code would produce a graph with a decaying curve. The shaded area between 5
and 10 minutes represents the 23.86% probability we calculated. The height of the
curve at any point is the density, not the probability.

Applications in Data Analytics


Understanding CRVs is essential for analyzing metric data:
 Operations & Logistics: Modeling delivery times, wait times in a queue, machine
failure intervals.
 Finance: Modeling stock returns, asset prices, and exchange rates (often using log-
normal distributions).
 Quality Control: Measuring the diameter of a manufactured bolt, the volume of
liquid in a bottle, the strength of a material.
 User Behavior Analytics: Analyzing session duration, time between app opens,
revenue per user.

Case Study: Optimizing Server Response Time


Business Problem: A SaaS company needs to guarantee that their server response
time is under 200 milliseconds (ms) for 99% of requests to meet their Service Level
Agreement (SLA).

Analysis:

1. The engineer defines X as the server response time, a continuous random variable.
2. They collect a massive sample of response times and plot a histogram. They discover
the data is right-skewed—most responses are fast, but there's a long tail of slower
responses.
3. They fit a theoretical PDF (e.g., a log-normal or gamma distribution) to this data. This
PDF model, f(x), describes the pattern of their response times.
4. The question becomes: What is the value c such that the area under the PDF
from 0 to c is 0.99?
P(X ≤ c) = ∫₀ᶜ f(x) dx = 0.99

The value c that satisfies this is the 99th percentile.

Insight & Action: The calculated 99th percentile is 190 ms. This means 99% of
requests are handled in under 190 ms, which safely meets the 200 ms SLA. The
model also shows that improving performance for the slowest 1% of requests would
be very challenging, helping the team set realistic goals.

Key Takeaways

 Continuous Random Variables (CRVs) model measurable, uncountably infinite


outcomes (time, height, weight, revenue).
 The probability of a CRV taking any exact value is zero (P(X=x)=0). We can only
calculate probabilities for intervals.
 The Probability Density Function (PDF), f(x), describes the distribution.
Its height gives density, and areas under it give probabilities.
 The total area under the entire PDF curve is always equal to 1.
 Understanding CRVs and PDFs is crucial for working with the vast majority of
business metrics.
Common Pitfalls and Practice Questions
Pitfalls:

 Interpreting PDF height as probability: This is the most common error. The
value f(x) is a density, not a probability. Only an area gives a probability.
 Forgetting P(X=x)=0: Asking for the probability of an exact value in a continuous
setting is meaningless.

Practice Questions:

1. Why is P(X = 5) zero for a continuous random variable?


2. If the PDF f(x) is very high at a point x=10, what does that tell you?
3. For a valid PDF, the area under the curve must equal ____.
4. If a customer's spending is modeled by a CRV, can we calculate the probability they
spend exactly $50? Can we calculate the probability they spend between $50 and
$100?

(Answers: 1. Because there are infinitely many possible values, making the
probability of any single one effectively zero. 2. It tells you that values around
x=10 are very dense or likely; a small interval around 10 will have a high
probability. 3. 1. 4. No, P(Spend = $50) = 0. Yes, we can calculate P(50 ≤ Spend
≤ 100) by finding the area under the PDF between 50 and 100.)
Topic 4: Probability Density Functions (PDF)
Introduction: The "Density" in Probability
In the previous topic, we established that for continuous random variables,
probability is measured as the area under a curve. The Probability Density Function
(PDF) is the mathematical function that defines this curve. It is the continuous analog
of the Probability Mass Function (PMF).

Think of the PDF not as a graph of probabilities, but as a graph of relative


likelihood. The height of the PDF at any point x, given by f(x), tells us how "dense"
the probability is in the immediate vicinity of x. A higher density means that values
near x are more likely to occur, and a small interval around x will have a larger
probability than an interval of the same width where the density is lower.

For data analysts, the PDF is a powerful model. Once we have a PDF that fits our
data, we can answer any question about the probability of future events within that
continuum.

Formal Definition and the Integral Calculus Connection


Formally, for a continuous random variable X, the PDF f(x) is the function that
satisfies the following equation for any interval [a, b]:

P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx

This equation is the heart of the PDF. It states that the probability that X falls
between a and b is the definite integral (the area under the curve) of the PDF
from a to b.

The two defining properties of a PDF are:

1. f(x) ≥ 0 for all x (non-negativity).
2. ∫ f(x) dx = 1 over the entire real line (the total area is 1).
The Critical Difference: PMF vs. PDF
This is a crucial distinction for beginners to grasp.

 Applies to: the PMF applies to Discrete Random Variables (DRVs); the PDF applies
to Continuous Random Variables (CRVs).
 Gives: the PMF gives probability directly (p(x) is the probability that X = x); the
PDF gives density (f(x) is not a probability).
 Calculation: with a PMF, the probability of an event is found by summing PMF
values, P(X ∈ A) = Σ p(x); with a PDF, the probability of an interval is found by
integrating the PDF, P(X ∈ A) = ∫ f(x) dx.
 Value at a point: a PMF value p(x) can be between 0 and 1; for a PDF, P(X = x) = 0,
while f(x) itself can be greater than 1.

Key Insight: A PDF value f(x) can be greater than 1, as long as the total area under
the curve is 1. For example, a very tall and very narrow PDF "spike" can have a height
much greater than 1, but its width is so small that its area is a tiny probability.
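A uniform distribution on a narrow interval makes this concrete: on [0, 0.1] the density is a constant 1/0.1 = 10, far above 1, yet the total area is 10 × 0.1 = 1. A sketch:

```python
from scipy.stats import uniform

# Uniform distribution on [0, 0.1]: f(x) = 1/0.1 = 10 on that interval
narrow = uniform(loc=0, scale=0.1)

print(narrow.pdf(0.05))  # density at 0.05 is 10.0 — far greater than 1
print(narrow.cdf(0.1))   # total probability over [0, 0.1] is 1.0

# A density above 1 is perfectly valid; only the AREA must not exceed 1.
```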

Real-Life Data Analytics Example: Customer Service Call Times

Scenario: A call center's data analyst is modeling the duration of customer service
calls. The data is continuous and is well-modeled by a probability distribution with
the following PDF:

f(x) = 0.1 e^(−0.1x) for x ≥ 0

where x is the call length in minutes. This is the PDF of the exponential
distribution, which is commonly used to model time until an event (like the end of a
call).

Question: What is the probability that a randomly selected call lasts between 5 and
10 minutes?

Answer: We find the area under the PDF curve between 5 and 10.

P(5 ≤ X ≤ 10) = ∫₅¹⁰ 0.1 e^(−0.1x) dx
Step-by-Step Calculation (Using Calculus and Python)
Let's solve the integral from the previous page.

1. Calculus Solution:

∫ 0.1 e^(−0.1x) dx = −e^(−0.1x) + C

Therefore,

∫₅¹⁰ 0.1 e^(−0.1x) dx = [−e^(−0.1x)]₅¹⁰ = (−e^(−1)) − (−e^(−0.5)) = e^(−0.5) − e^(−1)
≈ 0.6065 − 0.3679 = 0.2386

So, there is approximately a 23.86% chance a call lasts between 5 and 10 minutes.

2. Python Solution (No Calculus Required!):


In practice, data analysts use statistical software to calculate these integrals.

from scipy.stats import expon

# Define the exponential distribution with rate parameter lambda = 0.1
lambda_param = 0.1
dist = expon(scale=1/lambda_param)  # scale = 1/lambda

# Calculate P(5 <= X <= 10)
prob = dist.cdf(10) - dist.cdf(5)
print(f"P(5 <= X <= 10) = {prob:.4f} or {prob*100:.2f}%")

This code uses the Cumulative Distribution Function (CDF), which we will cover next.
The output will be:
P(5 <= X <= 10) = 0.2387 or 23.87%

Visualization: Interpreting the PDF Curve


The graph of the PDF makes the calculation intuitive. The probability is the shaded
area.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import expon

# Create the plot
fig, ax = plt.subplots(figsize=(10, 6))
lambda_param = 0.1
x = np.linspace(0, 40, 1000)
pdf = expon.pdf(x, scale=1/lambda_param)
ax.plot(x, pdf, 'b-', label='PDF: f(x) = 0.1e^{-0.1x}')

# Shade the area for P(5 <= X <= 10)
x_shade = np.linspace(5, 10, 100)
pdf_shade = expon.pdf(x_shade, scale=1/lambda_param)
ax.fill_between(x_shade, pdf_shade, color='skyblue', alpha=0.7,
                label='P(5 ≤ X ≤ 10) ≈ 0.24')

# Annotate the graph
ax.set_ylim(0, 0.11)
ax.set_title('PDF of Customer Service Call Duration')
ax.set_xlabel('Call Length (minutes)')
ax.set_ylabel('Probability Density f(x)')
ax.legend()
ax.grid(True, alpha=0.3)
plt.show()
This visualization clearly shows the area representing our calculated probability. The
peak of the PDF is at x=0, showing that very short calls are the most "dense" or
common, which is a realistic scenario for a call center (e.g., calls that are wrong
numbers or quick questions).

Applications in Data Analytics: The Power of Modeling


Fitting a PDF to data is a core task in analytics. It allows us to:

 Predict Probabilities: What is the probability a user session will last more than 30
minutes?
 Identify Outliers: Values where the PDF is extremely low (e.g., in the tails of a normal
distribution) are rare and might be outliers worth investigating (e.g., fraudulent
transactions).
 Simulate Real-World Processes: PDFs are used in Monte Carlo simulations to
model complex systems like stock markets or queue waiting times by generating
random variables that follow the PDF's pattern.
 Calculate Expected Values: The mean of a distribution (a crucial concept for
reporting "average" values) is calculated using the PDF: E[X] = ∫ x f(x) dx.
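For the call-duration PDF used earlier in this topic, f(x) = 0.1 e^(−0.1x), that expected-value integral can be evaluated numerically; a sketch:

```python
import math
from scipy.integrate import quad

# Exponential PDF with rate 0.1 (the call-duration model from this topic)
f = lambda x: 0.1 * math.exp(-0.1 * x)

# E[X] = integral of x * f(x) over the support [0, infinity)
mean, _ = quad(lambda x: x * f(x), 0, math.inf)

print(round(mean, 4))  # 10.0 — the mean of an exponential is 1/rate
```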

Case Study: Setting SLAs for Cloud Service Response Time


Business Problem: A cloud infrastructure company needs to define a Service Level
Agreement (SLA) that promises response times for an API. They want to promise a
maximum response time that they can meet 99.9% of the time.

Analysis:

1. They collect a huge sample of API response times (X).


2. They plot a histogram and observe the data is not symmetric but is right-skewed.
3. They determine that a Lognormal Distribution PDF is the best fit for their data. This
PDF is characterized by a shape parameter σ and a scale parameter derived from μ.
4. The question becomes: find the value t such that the probability P(X ≤ t) = 0.999.
This value t is the 99.9th percentile.
5. Using the fitted PDF, they calculate this integral:

∫₀ᵗ f(x) dx = 0.999

The solution t is found to be 620 milliseconds.

Insight & Action: The company can confidently set its SLA to 620 ms. This means
they are contractually promising that 99.9% of all API responses will be faster than
620 ms. The PDF model provides a rigorous, data-driven foundation for this critical
business decision.
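Percentile lookups like this are one line with SciPy's inverse CDF (the ppf method). The lognormal parameters below are hypothetical, chosen only to illustrate the mechanics; in practice they would come from fitting the response-time data:

```python
from scipy.stats import lognorm

# Hypothetical fitted lognormal model of API response time (milliseconds).
# The shape (sigma) and scale values are illustrative, not real fitted values.
sigma, scale = 0.5, 100.0
model = lognorm(s=sigma, scale=scale)

# The SLA threshold t with P(X <= t) = 0.999 is the 99.9th percentile
t = model.ppf(0.999)
print(f"99.9th percentile ≈ {t:.0f} ms")

# Sanity check: evaluating the CDF at t recovers 0.999
print(round(model.cdf(t), 3))  # 0.999
```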

Key Takeaways

 The Probability Density Function (PDF) defines the shape of the distribution for a
continuous random variable.
 f(x) is a density, not a probability. Probability is given by the area under the PDF
curve over an interval.
 The total area under any PDF is always exactly 1.
 PDFs are used to calculate probabilities, identify outliers, simulate processes,
and calculate summary statistics like the mean and variance.
 Choosing the right PDF model (e.g., Normal, Exponential, Lognormal) for your data is
a key skill in data analytics.

Common Pitfalls and Practice Questions

Pitfalls:

 "f(5) = 0.4, so the probability is 40%." INCORRECT. f(5) is the density. The
probability at a point is zero.
 Ignoring the support of the PDF. The PDF is only defined over a specific range
(e.g., x >= 0 for the exponential distribution). Outside this range, f(x) = 0.

Practice Questions:

1. If f(x) is the PDF of rainfall in a day, what does ∫₀¹ f(x)dx represent?
2. Can the value of a PDF be greater than 1? Why or why not?
3. True or False: The probability that a continuous random variable equals 5 is
always f(5).
4. A PDF is defined as f(x) = c * x for 0 <= x <= 4, and 0 otherwise. What must the
value of c be to make this a valid PDF? (Hint: Total area must be 1).

(Answers: 1. The probability that rainfall is between 0 and 1 unit. 2. Yes, as long
as the area under the curve is 1. A very tall, very narrow spike can have a height
>1. 3. False. P(X=5)=0. f(5) is the density. 4. Solve ∫₀⁴ (c*x) dx = 1. ∫₀⁴ c*x dx =
c * [x²/2]₀⁴ = c * (8 - 0) = 8c. Set 8c = 1, so c = 1/8.)
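The last answer can be checked numerically. A quick sketch using scipy's quad integrator — the density and bounds mirror the practice question:

```python
from scipy.integrate import quad

# Candidate constant from the practice answer
c = 1 / 8

# f(x) = c*x on [0, 4]; a valid PDF must have total area 1
area, _ = quad(lambda x: c * x, 0, 4)
print(area)  # ≈ 1
```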
Topic 5: Cumulative Distribution Functions (CDF)
Introduction: The "Running Total" of Probability
The Cumulative Distribution Function (CDF) is a fundamental concept in probability
theory that provides a different perspective on the distribution of a random variable.
While the PMF (for discrete) and PDF (for continuous) give you the probability or
density at a point, the CDF gives you the probability that a random variable takes a
value less than or equal to a specific number. It is, in essence, a "running total" of
probabilities.

For data analysts, the CDF is incredibly useful because:

 It is defined for both discrete and continuous random variables.


 It allows for easy calculation of probabilities for intervals.
 It provides a way to compute percentiles and medians.
 Its graph can be used to quickly understand the distribution of data.

Formal Definition and Properties


For any random variable X (discrete or continuous), the Cumulative Distribution
Function (CDF) is defined as:

F(x) = P(X ≤ x)
This function has the following properties:

1. F(x) is non-decreasing: if x₁ < x₂, then F(x₁) ≤ F(x₂).
2. F(x) → 0 as x → −∞, and F(x) → 1 as x → ∞.
3. F(x) is right-continuous (for continuous random variables, it is continuous).

For a discrete random variable, the CDF is a step function. For a continuous random
variable, the CDF is a continuous function.

CDF for Discrete Random Variables: The Step Function


For a discrete random variable, the CDF is found by summing the PMF up to the
point x:

F(x) = P(X ≤ x) = Σ p(k), summing over all k ≤ x

where p(k) is the PMF at k.

Example: Consider a discrete random variable X with PMF:

p(0) = 0.2, p(1) = 0.5, p(2) = 0.3

Then the CDF is:
 F(−0.5) = P(X ≤ −0.5) = 0
 F(0) = P(X ≤ 0) = p(0) = 0.2
 F(0.5) = P(X ≤ 0.5) = p(0) = 0.2
 F(1) = P(X ≤ 1) = p(0) + p(1) = 0.7
 F(1.5) = P(X ≤ 1.5) = p(0) + p(1) = 0.7
 F(2) = P(X ≤ 2) = p(0) + p(1) + p(2) = 1.0
 F(3) = 1.0
Notice how the CDF remains constant between integers and jumps at the integer
values.
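This step behaviour is easy to reproduce in code — a short sketch of the example above, where the CDF at x is just the sum of the PMF over all k ≤ x:

```python
import numpy as np

# PMF from the example: p(0) = 0.2, p(1) = 0.5, p(2) = 0.3
values = np.array([0, 1, 2])
pmf = np.array([0.2, 0.5, 0.3])

def F(x):
    """CDF: sum of p(k) over all k <= x."""
    return pmf[values <= x].sum()

print(F(-0.5), F(0.5), F(1.5), F(3))  # ≈ 0, 0.2, 0.7, 1
```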

CDF for Continuous Random Variables: The Smooth Curve


For a continuous random variable, the CDF is the integral of the PDF from −∞ to x:

F(x) = P(X ≤ x) = ∫ f(t) dt  (integrated from −∞ to x)

where f(t) is the PDF.

The CDF for a continuous random variable is a continuous, non-decreasing function


that goes from 0 to 1.

Example: Consider an exponential distribution with PDF f(x) = λe^(−λx) for x ≥ 0.
Then the CDF is:

F(x) = ∫₀ˣ λe^(−λt) dt = 1 − e^(−λx)
Real-Life Data Analytics Example: Customer Waiting Times
Scenario: A company models the waiting time (in minutes) for customer service calls
using an exponential distribution with rate parameter λ = 0.2 (so the mean waiting
time is 5 minutes). The CDF is:

F(x) = 1 − e^(−0.2x)
Questions:

1. What is the probability that a customer waits less than 3 minutes?

P(X ≤ 3) = F(3) = 1 − e^(−0.2×3) = 1 − e^(−0.6) ≈ 1 − 0.5488 = 0.4512

2. What is the probability that a customer waits more than 10 minutes?

P(X > 10) = 1 − F(10) = 1 − (1 − e^(−0.2×10)) = e^(−2) ≈ 0.1353

3. What is the probability that a customer waits between 5 and 8 minutes?

P(5 ≤ X ≤ 8) = F(8) − F(5) = (1 − e^(−1.6)) − (1 − e^(−1.0)) = e^(−1.0) − e^(−1.6) ≈ 0.3679 − 0.2019 = 0.1660
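These three hand calculations can be confirmed with scipy; note that scipy parameterizes the exponential by scale = 1/λ rather than by λ itself:

```python
from scipy.stats import expon

lam = 0.2
wait = expon(scale=1/lam)  # scipy uses scale = 1/lambda

p_less_3 = wait.cdf(3)                # P(X <= 3)
p_more_10 = wait.sf(10)               # sf(x) = 1 - cdf(x), i.e. P(X > 10)
p_5_to_8 = wait.cdf(8) - wait.cdf(5)  # P(5 <= X <= 8)

print(p_less_3, p_more_10, p_5_to_8)  # ≈ 0.4512, 0.1353, 0.1660
```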

Visualization: Reading the CDF


The graph of the CDF allows us to visually estimate probabilities and percentiles.

For a discrete variable, the CDF is a step function. For a continuous variable, it is an S-
shaped curve (for the exponential, it is a rising curve that approaches 1).

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import expon

# Example: Exponential distribution CDF


lambda_param = 0.2
x = np.linspace(0, 20, 100)
cdf = expon.cdf(x, scale=1/lambda_param) # scale = 1/lambda

plt.figure(figsize=(10, 6))
plt.plot(x, cdf, 'b-', label='CDF: F(x) = 1 - e^{-0.2x}')
plt.title('CDF of Customer Waiting Time')
plt.xlabel('Waiting Time (minutes)')
plt.ylabel('Cumulative Probability F(x)')
plt.grid(True, alpha=0.3)

# Highlight F(3) and F(10)


plt.axvline(3, color='r', linestyle='--')
plt.axhline(y=expon.cdf(3, scale=5), color='r', linestyle='--')
plt.text(3, 0, '3', color='r', ha='center', va='top')
plt.text(0, expon.cdf(3, scale=5), f'{expon.cdf(3, scale=5):.2f}', color='r',
ha='right', va='center')

plt.axvline(10, color='g', linestyle='--')


plt.axhline(y=expon.cdf(10, scale=5), color='g', linestyle='--')
plt.text(10, 0, '10', color='g', ha='center', va='top')
plt.text(0, expon.cdf(10, scale=5), f'{expon.cdf(10, scale=5):.2f}', color='g', ha='right', va='center')

plt.legend()
plt.show()

This code will plot the CDF of the exponential distribution. The dashed lines show
how to find the cumulative probability for x=3 and x=10.
The Inverse CDF (Percentile Function)
The inverse of the CDF, also called the quantile function or percentile function, is
incredibly useful. For a probability p, the inverse CDF gives the value x such
that F(x) = p. This is how we find percentiles.

Example: In the waiting time example, what is the 90th percentile of waiting times?
That is, find x such that F(x) = 0.9.

0.9 = 1 − e^(−0.2x)  ⟹  e^(−0.2x) = 0.1  ⟹  −0.2x = ln(0.1)  ⟹  x = −ln(0.1)/0.2 ≈ 2.3026/0.2 ≈ 11.513

So, 90% of customers wait less than about 11.5 minutes.

In Python, we can use the ppf (percent point function) method:

p = 0.9
percentile_90 = expon.ppf(p, scale=5) # scale = 1/lambda = 5
print(f"The 90th percentile is {percentile_90:.2f} minutes.")

Applications in Data Analytics


The CDF is used in many analytical contexts:

 Percentile Calculations: Understanding the distribution of values, such as the 95th


percentile of response times.
 Comparing Distributions: Plotting empirical CDFs (ECDFs) of two datasets allows for
easy comparison without the binning bias of histograms.
 Hypothesis Testing: The Kolmogorov-Smirnov test compares the CDF of a sample
to a theoretical CDF to check if the sample comes from that distribution.
 Risk Analysis: In finance, the CDF of returns is used to calculate Value at Risk (VaR).

Case Study: Analyzing Income Distribution


Business Problem: A policy analyst wants to understand income inequality in a
region. They have income data for a large sample of households.

Analysis:

1. They compute the empirical CDF of the income data.


2. They can then read off percentiles directly from the CDF:
o The median income (50th percentile).
o The 10th percentile and the 90th percentile to see the spread.
3. They might compare the CDF of income from different years to see how the
distribution has changed.
4. The Gini coefficient (a measure of inequality) can be calculated from the CDF.

The CDF provides a complete picture of the income distribution, allowing the analyst
to make statements like: "The bottom 20% of households have incomes below
$30,000" and "The top 10% have incomes above $150,000".

Summary and Key Takeaways

 The Cumulative Distribution Function (CDF) F(x) = P(X ≤ x) gives the
probability that a random variable is less than or equal to x.
 It is defined for both discrete and continuous random variables.
 For discrete variables, the CDF is a step function; for continuous variables, it is
continuous.
 The CDF can be used to compute probabilities for intervals and to find percentiles via
the inverse CDF.
 The CDF is a powerful tool for data analysis, enabling percentile analysis, distribution
comparison, and risk assessment.

Common Pitfalls:

 Confusing the CDF with the PDF/PMF. Remember: the CDF accumulates probability.
 For discrete variables, the CDF is right-continuous, but note that it has jumps.

Practice Questions:

1. If F(10) = 0.75 for a continuous random variable, what is P(X > 10)?
2. For a discrete random variable, if the PMF is p(1) = 0.3, p(2) = 0.5, p(3) = 0.2,
what is F(2.5)?
3. True or False: The CDF of a continuous random variable is always continuous.
4. If the CDF of a distribution is F(x) = x² for 0 ≤ x ≤ 1, what is the PDF?

(Answers: 1. P(X > 10) = 1 − F(10) = 0.25. 2. F(2.5) = P(X ≤ 2.5) = p(1) + p(2) = 0.8.
3. True. 4. The PDF is the derivative: f(x) = 2x for 0 ≤ x ≤ 1.)
Topic 6: Binomial Distribution
Introduction: Modeling Success in Fixed Trials
The Binomial Distribution is one of the most fundamental discrete probability
distributions in statistics and data analytics. It models scenarios where we perform a
fixed number of independent experiments, each with the same probability of success,
and count the number of successes.

Key characteristics of a Binomial experiment:

1. Fixed number of trials (n): The number of experiments is predetermined.


2. Independent trials: The outcome of one trial does not affect others.
3. Two possible outcomes: Each trial results in success or failure.
4. Constant probability (p): The probability of success remains the same for each trial.

Real-world examples:

 Counting the number of defective items in a production batch


 Tracking how many users click on an ad out of those who view it
 Measuring the number of successful sales calls in a day

Mathematical Foundation and PMF Formula


The Probability Mass Function (PMF) of a Binomial random variable X with
parameters n (number of trials) and p (probability of success) is:

P(X = k) = C(n, k) · pᵏ · (1 − p)ⁿ⁻ᵏ

Where:

 C(n, k) = n! / (k! (n − k)!) is the binomial coefficient
 k is the number of successes (0 ≤ k ≤ n)
 p is the probability of success on a single trial
 1 − p is the probability of failure

The binomial coefficient counts the number of ways to achieve k successes in n trials.

Properties and Characteristics


Key properties of the Binomial distribution:

 Mean (Expected value): E[X] = np
 Variance: Var(X) = np(1 − p)
 Standard Deviation: σ = √(np(1 − p))
 Shape: The distribution is symmetric when p = 0.5, right-skewed when p < 0.5, and
left-skewed when p > 0.5
 Mode: The most likely value of k is either ⌊(n+1)p⌋ or ⌊(n+1)p⌋ − 1
These properties provide quick insights into the distribution's behavior without
detailed calculations.
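These formulas can be verified against scipy for any parameter values; n = 20 and p = 0.3 below are arbitrary example choices:

```python
from scipy.stats import binom

n, p = 20, 0.3  # arbitrary example parameters
mean, var = binom.stats(n, p, moments='mv')

print(mean, var)  # mean = np = 6, variance = np(1-p) = 4.2
```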

Real-World Analytics Example: A/B Testing Conversion Rates


Scenario: An e-commerce company runs an A/B test on their website. They show a
new page design (Variant B) to 1000 visitors and track conversions. The historical
conversion rate is 5%.

Questions:

1. What's the probability of getting exactly 50 conversions?


2. What's the probability of getting at least 60 conversions?
3. How likely is Variant B to outperform the historical rate?

Analysis:
This is a binomial scenario with:

 n = 1000 trials (visitors)


 p = 0.05 (probability of conversion)
 X = number of conversions

We can use the binomial PMF to answer these questions.

Step-by-Step Probability Calculations


Let's calculate the probability of exactly 50 conversions:

P(X = 50) = C(1000, 50) · (0.05)⁵⁰ · (0.95)⁹⁵⁰

While this calculation is complex manually, we can use Python:
from scipy.stats import binom

n = 1000
p = 0.05
k = 50

prob_exact = binom.pmf(k, n, p)
print(f"P(X = 50) = {prob_exact:.6f}")

For "at least 60 conversions":

P(X ≥ 60) = 1 − P(X ≤ 59)

In Python:

prob_at_least_60 = 1 - binom.cdf(59, n, p)
print(f"P(X ≥ 60) = {prob_at_least_60:.6f}")
Visualization: Understanding Distribution Shape
The shape of the binomial distribution changes based on n and p. Let's visualize
different scenarios:

import matplotlib.pyplot as plt


import numpy as np
from scipy.stats import binom

# Create subplots for different parameter values


fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Different parameter combinations


params = [(10, 0.3), (10, 0.5), (20, 0.3), (50, 0.1)]

for i, (n, p) in enumerate(params):
    ax = axes[i//2, i%2]
    k_values = np.arange(0, n+1)
    probs = binom.pmf(k_values, n, p)

    ax.bar(k_values, probs, alpha=0.7)
    ax.set_title(f'n = {n}, p = {p}')
    ax.set_xlabel('Number of Successes')
    ax.set_ylabel('Probability')
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

These visualizations show how the distribution becomes more symmetric as n


increases, and how the skewness changes with p.

Applications in Data Analytics and Business


The Binomial distribution has wide-ranging applications:
1. Quality Control: Monitoring defect rates in manufacturing
2. Marketing: Predicting campaign response rates
3. Finance: Modeling loan default probabilities
4. Healthcare: Estimating treatment success rates
5. Technology: Predicting system reliability and failure rates

In each case, the binomial model helps make data-driven decisions by quantifying
uncertainty.

Case Study: Email Marketing Campaign Performance


Business Problem: A company sends a marketing email to 10,000 subscribers.
Historically, the open rate is 15%. They want to know:

1. The expected number of opens


2. The probability of exceeding 1,600 opens
3. The range of likely outcomes

Analysis:
Using the binomial model with n = 10,000, p = 0.15:

1. Expected opens: E[X] = np = 10,000 × 0.15 = 1,500


2. Probability of more than 1,600 opens: P(X > 1,600) = 1 - P(X ≤ 1,600)
3. 95% probability range: Using normal approximation (valid for large n)
from scipy.stats import binom, norm
import numpy as np

n = 10000
p = 0.15

# Exact binomial calculation


prob_more_than_1600 = 1 - binom.cdf(1600, n, p)
print(f"Probability of more than 1600 opens: {prob_more_than_1600:.4f}")

# Normal approximation
mu = n * p
sigma = np.sqrt(n * p * (1-p))
# Using continuity correction: P(X > 1600) ≈ P(Z > (1600.5 - μ)/σ)
z = (1600.5 - mu) / sigma
prob_approx = 1 - norm.cdf(z)
print(f"Normal approximation: {prob_approx:.4f}")
Probability of more than 1600 opens: 0.0026
Normal approximation: 0.0024

Normal Approximation to Binomial Distribution


For large n, the binomial distribution can be approximated by a normal distribution
with mean μ = np and variance σ² = np(1-p). This is known as the De Moivre-Laplace
theorem.
The approximation is reasonable when:

 np ≥ 10
 n(1-p) ≥ 10

With continuity correction:

P(X ≤ k) ≈ P(Z ≤ (k + 0.5 − np) / √(np(1 − p)))
This approximation simplifies calculations for large n where exact binomial
computation becomes difficult.
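A quick numerical check of the approximation, reusing the A/B-test parameters from earlier (n = 1000, p = 0.05); the cutoff k = 59 is just an illustrative choice:

```python
import numpy as np
from scipy.stats import binom, norm

n, p, k = 1000, 0.05, 59
mu = n * p                        # 50, so np >= 10 holds
sigma = np.sqrt(n * p * (1 - p))

exact = binom.cdf(k, n, p)                 # exact binomial P(X <= 59)
approx = norm.cdf((k + 0.5 - mu) / sigma)  # normal approx. with continuity correction

print(exact, approx)  # the two values agree closely
```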

Summary and Key Takeaways

 The Binomial distribution models the number of successes in n independent trials


 It's characterized by two parameters: n (number of trials) and p (success probability)
 Mean: np, Variance: np(1-p)
 Applicable in various business contexts including A/B testing, quality control, and risk
assessment
 For large n, can be approximated by a normal distribution
 Understanding this distribution is crucial for interpreting count data and making
probabilistic predictions

Common Pitfalls:

 Assuming independence when trials are correlated


 Applying to non-binary outcomes
 Using when p changes between trials
 Applying normal approximation without checking np ≥ 10 and n(1-p) ≥ 10

Practice Questions:

1. In 20 coin flips, what's the probability of exactly 10 heads?


2. If a basketball player has 80% free throw accuracy, what's the probability they make
at least 8 of 10 shots?
3. When is normal approximation appropriate for a binomial distribution?
4. How does the variance change as p approaches 0 or 1?

(Answers: 1. binom.pmf(10, 20, 0.5) ≈ 0.1762; 2. 1 - binom.cdf(7, 10, 0.8) ≈


0.6778; 3. When np ≥ 10 and n(1-p) ≥ 10; 4. Variance decreases as p approaches
0 or 1, reaching 0 at the extremes)
Topic 7: Poisson Distribution
Introduction: Modeling Rare Events Over Time and Space

The Poisson Distribution is a fundamental discrete probability distribution that


models the number of events occurring in a fixed interval of time or space, when
these events happen with a known constant mean rate and independently of the
time since the last event. Unlike the Binomial distribution which counts successes in a
fixed number of trials, the Poisson distribution counts occurrences in a continuous
interval.

Key characteristics of a Poisson process:

1. Events are independent: The occurrence of one event does not affect the
probability of another event
2. Constant average rate: The average rate (events per unit time/space) is constant
3. No simultaneous events: Two events cannot occur at exactly the same instant

Real-world examples:

 Number of customers arriving at a store per hour


 Number of emails received per day
 Number of website visitors per minute
 Number of defects in a manufactured product

Mathematical Foundation and PMF Formula

The Probability Mass Function (PMF) of a Poisson random variable X with parameter
λ (lambda), representing the average rate of events, is:

P(X = k) = (e^(−λ) · λᵏ) / k!

Where:

 k is the number of occurrences (k = 0, 1, 2, ...)
 λ is the average number of events per interval
 e is Euler's number (approximately 2.71828)
 k! is the factorial of k

The Poisson distribution has the unique property that its mean and variance are both
equal to λ:

 Mean: E[X] = λ
 Variance: Var(X) = λ
 Standard Deviation: σ = √λ
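The equal mean and variance can be confirmed with scipy; λ = 15 here anticipates the call-center example that follows:

```python
from scipy.stats import poisson

lam = 15
mean, var = poisson.stats(mu=lam, moments='mv')

print(mean, var)  # both equal lambda
```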

Relationship to Binomial Distribution

The Poisson distribution can be derived as a limiting case of the Binomial distribution
when:

 The number of trials n approaches infinity


 The probability of success p approaches zero
 The product np remains constant (np = λ)

This relationship is particularly useful because:

1. It provides intuition about when Poisson approximation is appropriate


2. It allows us to use Poisson distribution for rare events (small p) with large n
3. The approximation is good when n ≥ 20 and p ≤ 0.05, and excellent when n ≥ 100
and np ≤ 10
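This limiting behaviour is easy to see numerically — a sketch with arbitrary rare-event parameters (n = 1000, p = 0.005, so λ = np = 5):

```python
import numpy as np
from scipy.stats import binom, poisson

n, p = 1000, 0.005  # large n, small p: lambda = np = 5
k = np.arange(0, 16)

binom_pmf = binom.pmf(k, n, p)
pois_pmf = poisson.pmf(k, n * p)

max_gap = np.max(np.abs(binom_pmf - pois_pmf))
print(max_gap)  # small: the two PMFs are nearly identical in this regime
```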

Real-World Analytics Example: Call Center Operations

Scenario: A call center receives an average of 15 calls per hour. The management
wants to understand:

1. The probability of receiving exactly 20 calls in an hour


2. The probability of receiving more than 25 calls in an hour
3. The staffing requirements to handle peak loads

Analysis:
This is a classic Poisson scenario with:

 λ = 15 (average calls per hour)


 X = number of calls in a given hour

We can use the Poisson PMF and CDF to answer these operational questions.

Step-by-Step Probability Calculations

Let's calculate the probability of receiving exactly 20 calls:

P(X = 20) = (e^(−15) · 15²⁰) / 20!
Using Python for calculation:

from scipy.stats import poisson

lambda_val = 15 # Average calls per hour


k = 20 # Number of calls we're interested in
# Probability of exactly 20 calls
prob_exact = poisson.pmf(k, lambda_val)
print(f"P(X = 20) = {prob_exact:.4f}")

# Probability of more than 25 calls


prob_more_than_25 = 1 - poisson.cdf(25, lambda_val)
print(f"P(X > 25) = {prob_more_than_25:.4f}")
P(X = 20) = 0.0418
P(X > 25) = 0.0062

Visualization: Understanding the Poisson Distribution Shape

The shape of the Poisson distribution changes based on the value of λ. Let's visualize
different scenarios:

import matplotlib.pyplot as plt


import numpy as np
from scipy.stats import poisson

# Create subplots for different lambda values


fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Different lambda values to demonstrate shape changes


lambda_values = [1, 4, 10, 25]

for i, lam in enumerate(lambda_values):
    ax = axes[i//2, i%2]
    k_values = np.arange(0, 3*lam)  # show up to 3 times lambda
    probs = poisson.pmf(k_values, lam)

    ax.bar(k_values, probs, alpha=0.7, color='skyblue')
    ax.set_title(f'Poisson Distribution (λ = {lam})')
    ax.set_xlabel('Number of Events')
    ax.set_ylabel('Probability')
    ax.grid(True, alpha=0.3)

    # Mark the mean (λ)
    ax.axvline(lam, color='red', linestyle='--', alpha=0.7, label=f'Mean (λ = {lam})')
    ax.legend()

plt.tight_layout()
plt.show()

This visualization shows how the distribution:

 Becomes more symmetric as λ increases


 Has its peak near λ
 Spreads out as λ increases (variance = λ)
Applications in Data Analytics and Business

The Poisson distribution has extensive applications across various domains:

1. Operations Management:
o Modeling customer arrivals in queues
o Inventory management for perishable goods
o Service capacity planning
2. Quality Control:
o Counting defects in manufacturing processes
o Monitoring rare events in production lines
3. Telecommunications:
o Modeling call arrivals in networks
o Predicting message traffic
4. Healthcare:
o Modeling patient arrivals in emergency rooms
o Counting rare disease occurrences
5. Finance:
o Modeling rare market events
o Counting transactions in high-frequency trading

Case Study: Website Traffic Analysis

Business Problem: An e-commerce website averages 500 visitors per hour during
peak times. The infrastructure team needs to:

1. Determine the probability of traffic spikes


2. Plan server capacity to handle load
3. Set up automatic scaling triggers

Analysis:
Using the Poisson model with λ = 500:

1. Expected visitors: E[X] = 500 per hour


2. Variance: Var(X) = 500
3. Standard deviation: σ = √500 ≈ 22.36

For capacity planning, we might want to know the 99th percentile:

lambda_val = 500

# Find the 99th percentile


percentile_99 = poisson.ppf(0.99, lambda_val)
print(f"99th percentile: {percentile_99} visitors per hour")

# Probability of exceeding 550 visitors


prob_exceed_550 = 1 - poisson.cdf(550, lambda_val)
print(f"P(X > 550) = {prob_exceed_550:.6f}")
99th percentile: 553.0 visitors per hour
P(X > 550) = 0.012898

This analysis helps the team set appropriate capacity limits and scaling policies.

Poisson Process and Time Between Events

An important related concept is the exponential distribution, which models the time
between events in a Poisson process. If events follow a Poisson process with rate λ,
then the time between events follows an exponential distribution with parameter λ.

Key relationships:

 If X ~ Poisson(λ) [count of events in unit time]


 Then Y ~ Exponential(λ) [time between events]
This relationship is crucial for:

 Modeling waiting times between events


 Reliability engineering (time between failures)
 Queueing theory (time between arrivals)
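This relationship can be illustrated with a small simulation: generate exponential inter-arrival times, then count how many events land in each unit-length interval — the counts should behave like Poisson draws. The rate λ = 2 and the sample sizes below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(42)
lam = 2.0  # events per unit time

# Exponential gaps between events -> cumulative arrival times
gaps = rng.exponential(scale=1/lam, size=200_000)
arrivals = np.cumsum(gaps)

# Count events falling in each unit-length interval (keep fully covered intervals)
counts = np.bincount(arrivals.astype(int))[:50_000]

print(counts.mean(), counts.var())  # both should be close to lambda = 2
```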

Summary and Key Takeaways

 The Poisson distribution models the number of events in fixed intervals of time or
space
 It's characterized by a single parameter λ (the average rate)
 Mean and variance are both equal to λ
 Arises as a limit of the Binomial distribution for rare events
 Widely applicable in operations, quality control, and service industries
 Related to the exponential distribution for modeling time between events

Common Pitfalls:

 Assuming events are independent when they may be correlated


 Applying when the rate λ is not constant
 Using for non-count data
 Confusing with the exponential distribution (which models time between events)

Practice Questions:

1. If a store averages 8 customers per hour, what's the probability of exactly 10


customers in an hour?
2. How does the variance change as λ increases?
3. When is Poisson approximation to Binomial appropriate?
4. If events occur at a rate of 2 per minute, what distribution models the time between
events?

(Answers: 1. poisson.pmf(10, 8) ≈ 0.0993; 2. Variance increases with λ; 3. When


n ≥ 20, p ≤ 0.05, and np ≤ 10; 4. Exponential distribution with λ = 2)
Topic 8: Conditional Distributions
Introduction: The Concept of Conditional Probability in Distributions

Conditional distributions represent one of the most powerful concepts in probability


and statistics, allowing us to understand how the probability distribution of one
variable changes when we have information about another variable. While standard
distributions describe overall behavior, conditional distributions help us make more
precise, context-aware predictions.

In data analytics, we rarely examine variables in complete isolation. We want to know


things like:

 How does the distribution of customer spending change when we know their age
group?
 What is the probability distribution of website conversion rates for users from
different geographic regions?
 How does the failure rate of equipment change based on operating conditions?

Conditional distributions provide the mathematical framework to answer these types


of questions by showing how knowing one piece of information (the conditioning
variable) changes our expectations about another variable.

Formal Definition and Mathematical Foundation

For two random variables X and Y, the conditional distribution of Y given X = x is


denoted as P(Y = y | X = x) for discrete variables or f(y | x) for continuous variables.

The formal definition builds on the concept of conditional probability:

For discrete variables:

P(Y = y | X = x) = P(X = x, Y = y) / P(X = x)
For continuous variables:

f(y | x) = f(x, y) / f_X(x)
Where:

 P(X = x, Y = y) is the joint probability mass function


 f(x, y) is the joint probability density function
 P(X = x) and f_X(x) are the marginal distributions
The key insight is that conditional distributions are proportional to joint distributions
but normalized by the probability of the conditioning event.
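For discrete variables this normalization is just element-wise division of the joint table by the marginal — a minimal sketch with a made-up joint PMF:

```python
import numpy as np

# Hypothetical joint PMF P(X = x, Y = y): rows are x-values, columns are y-values
joint = np.array([[0.10, 0.15, 0.05],
                  [0.20, 0.30, 0.20]])

marginal_x = joint.sum(axis=1)             # P(X = x), summing out Y
conditional = joint / marginal_x[:, None]  # P(Y = y | X = x), row by row

print(conditional)
print(conditional.sum(axis=1))  # each row is a valid distribution: sums to 1
```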

Relationship to Joint and Marginal Distributions

Understanding conditional distributions requires grasping their relationship with joint


and marginal distributions:

1. Joint Distribution: The complete probability distribution of both variables together


2. Marginal Distribution: The distribution of one variable ignoring the other
3. Conditional Distribution: The distribution of one variable given specific knowledge
about the other

These three concepts form a fundamental triangle in multivariate statistics:

 From joint to marginal: Summing/integrating over the other variable


 From joint to conditional: Dividing by the appropriate marginal
 From conditional and marginal to joint: Multiplying them together

This relationship is captured in the fundamental formula:

f(x, y) = f(y | x) · f_X(x) = f(x | y) · f_Y(y)
Real-World Analytics Example: Customer Segmentation

Scenario: An e-commerce company wants to understand how spending behavior (Y)


differs across age groups (X). The company has collected data that allows them to
model:

 Marginal distribution of age groups: P(X = x)


 Joint distribution of age and spending: P(X = x, Y = y)
 Therefore, they can compute conditional spending distributions: P(Y = y | X = x)

This enables precise questions like:


"What is the probability distribution of spending for customers in the 25-34 age
group?"

Analysis:
Let's say the data shows:

 P(Age = 25-34) = 0.30 (marginal)


 For this age group, the conditional spending distribution might be:
o P(Spending = Low | Age = 25-34) = 0.40
o P(Spending = Medium | Age = 25-34) = 0.45
o P(Spending = High | Age = 25-34) = 0.15
This conditional distribution provides much more specific information than the
overall spending distribution across all age groups.

Step-by-Step Calculation Example

Let's work through a concrete example with discrete data:

Suppose we have data on 1000 customers showing their device type (Mobile or
Desktop) and conversion status (Converted or Not):

                Mobile   Desktop   Total
Converted        120      180       300
Not Converted    380      320       700
Total            500      500      1000

The conditional distribution of conversion given device type is:

P(Converted | Mobile) = 120/500 = 0.24


P(Not Converted | Mobile) = 380/500 = 0.76

P(Converted | Desktop) = 180/500 = 0.36


P(Not Converted | Desktop) = 320/500 = 0.64

This shows desktop users have a 50% higher conversion rate (36% vs 24%),
information that would be masked if we only looked at the overall conversion rate of
30%.
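The same arithmetic from the table, done with numpy:

```python
import numpy as np

# Counts from the table: rows = device (Mobile, Desktop), columns = (Converted, Not)
counts = np.array([[120, 380],
                   [180, 320]])

cond = counts / counts.sum(axis=1, keepdims=True)  # P(outcome | device)
overall = counts[:, 0].sum() / counts.sum()        # marginal conversion rate

print(cond[:, 0])  # conversion rates by device: 0.24 (Mobile), 0.36 (Desktop)
print(overall)     # 0.3 overall
```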

Visualization of Conditional Distributions

Effective visualization of conditional distributions often involves comparative plots:

import matplotlib.pyplot as plt


import numpy as np

# Data from our conversion example


devices = ['Mobile', 'Desktop']
conversion_rates = [0.24, 0.36]
non_conversion_rates = [0.76, 0.64]

# Create grouped bar chart


x = np.arange(len(devices))
width = 0.35

fig, ax = plt.subplots(figsize=(10, 6))


bars1 = ax.bar(x - width/2, conversion_rates, width, label='Converted', color='green')
bars2 = ax.bar(x + width/2, non_conversion_rates, width, label='Not Converted', color='red')

ax.set_xlabel('Device Type')
ax.set_ylabel('Probability')
ax.set_title('Conditional Distribution of Conversion by Device Type')
ax.set_xticks(x)
ax.set_xticklabels(devices)
ax.legend()

# Add value labels on bars


for bar in bars1 + bars2:
    height = bar.get_height()
    ax.annotate(f'{height:.2f}',
                xy=(bar.get_x() + bar.get_width() / 2, height),
                xytext=(0, 3),  # 3 points vertical offset
                textcoords="offset points",
                ha='center', va='bottom')

plt.tight_layout()
plt.show()

This visualization clearly shows how the conversion probability distribution changes
conditional on the device type.

Conditional Expectation and Variance

Beyond the full conditional distribution, we often want summary statistics:

Conditional Expectation: E[Y | X = x]


 The expected value of Y given that X = x
 For each possible x, this gives a different expected value
 In regression analysis, we model E[Y | X] as a function of X

Conditional Variance: Var(Y | X = x)

 Measures how much Y varies around its conditional expectation


 Important for understanding prediction uncertainty

For example, in our conversion scenario:

 E[Conversion | Device = Mobile] = 0.24


 E[Conversion | Device = Desktop] = 0.36

This conditional expectation helps allocate marketing resources more effectively.

Applications in Machine Learning and Predictive Modeling

Conditional distributions form the theoretical foundation for many machine learning
algorithms:

1. Classification Algorithms: Essentially model P(Y = class | X = features)


2. Regression Models: Estimate E[Y | X = x], the conditional expectation
3. Bayesian Methods: Update prior distributions to posterior distributions using
conditional probability
4. Hidden Markov Models: Use conditional distributions for state transitions
5. Recommendation Systems: Model P(preference | user features, item features)

In all these applications, we're not just modeling variables in isolation, but how they
relate to each other conditionally.

Case Study: Risk Assessment in Lending

Business Problem: A bank wants to assess default risk more accurately by


conditioning on multiple borrower characteristics.

Analysis:
Instead of using a single overall default probability, the bank develops conditional
default probabilities:

 P(Default | Credit Score = 650, Income = $50,000, Loan Amount = $200,000)


 P(Default | Credit Score = 750, Income = $100,000, Loan Amount = $100,000)

By developing sophisticated models of the conditional distribution of default given


borrower characteristics, the bank can:
 Price loans more accurately
 Make better lending decisions
 Manage portfolio risk more effectively
 Comply with regulatory requirements for risk-based pricing

This conditional approach is far superior to using overall average default rates, which
would treat all borrowers as identical.

Summary and Key Takeaways

 Conditional distributions show how the probability distribution of one variable


changes when we know the value of another variable
 They are calculated from joint distributions divided by marginal distributions
 Conditional expectations (E[Y|X=x]) are fundamental to prediction and regression
 Visualization of conditional distributions reveals relationships that might be hidden in
marginal distributions
 Applications span customer analytics, risk management, machine learning, and
decision making
 Understanding conditional distributions is crucial for moving from simple descriptive
statistics to predictive modeling

Common Pitfalls:

 Confusing conditional and marginal probabilities (base rate fallacy)
 Assuming independence when conditional dependence exists
 Extrapolating beyond the range of the conditioning variable
 Ignoring the variability in conditional distributions

Practice Questions:

1. If P(A = a, B = b) = 0.15 and P(B = b) = 0.30, what is P(A = a | B = b)?
2. In our conversion example, what is E[Conversion | Device = Mobile]?
3. Why might conditional variances differ from marginal variances?
4. How are conditional distributions used in classification algorithms?

(Answers: 1. 0.15/0.30 = 0.50; 2. 0.24; 3. Because the conditioning variable
might explain some of the variation; 4. They model P(class | features) directly)

Topic 9: Normal Distribution and Related Distributions
Introduction: The Ubiquitous Bell Curve

The Normal Distribution, also known as the Gaussian distribution, is arguably the
most important probability distribution in statistics and data analytics. Its
characteristic bell-shaped curve appears throughout nature, science, and human
phenomena, making it a fundamental tool for understanding and modeling
continuous data.

Why the normal distribution is so prevalent:

 Central Limit Theorem: The sum of many independent random variables tends
toward a normal distribution, regardless of their original distributions
 Natural phenomena: Many biological, physical, and social measurements follow
normal distributions (heights, test scores, measurement errors)
 Analytical convenience: Mathematical properties make it tractable for statistical
inference and hypothesis testing

In data analytics, the normal distribution serves as:

 A model for many naturally occurring phenomena
 A foundation for statistical inference and hypothesis testing
 A benchmark for comparing other distributions
 A building block for more complex statistical models

Mathematical Foundation and PDF Formula

The probability density function (PDF) of the normal distribution with mean μ and
standard deviation σ is:

f(x) = (1 / (σ√(2π))) · e^(−½ · ((x − μ) / σ)²)

Where:

 μ is the mean (determines the center of the distribution)
 σ is the standard deviation (determines the spread of the distribution)
 π is the mathematical constant pi (≈3.14159)
 e is the base of the natural logarithm (≈2.71828)

Key properties:

 Symmetry: The distribution is perfectly symmetric about the mean
 Mean, median, mode coincidence: All three measures of central tendency are equal
 Asymptotic: The curve approaches but never touches the x-axis
 Total area: The total area under the curve equals 1
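The PDF formula above can be checked numerically with only the standard library. A small sketch; the parameters μ = 175 and σ = 7 anticipate the height example later in this topic:

```python
import math

def normal_pdf(x, mu, sigma):
    """Normal PDF evaluated directly from the formula above."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -0.5 * ((x - mu) / sigma) ** 2
    return coeff * math.exp(exponent)

# Density of a N(175, 7^2) distribution at the peak and one sd out
print(normal_pdf(175, 175, 7))  # peak height = 1/(7*sqrt(2*pi)) ~ 0.0570
print(normal_pdf(182, 175, 7))  # one standard deviation away ~ 0.0346
```

The same values are returned by scipy.stats.norm.pdf(x, mu, sigma), which is the more convenient choice in practice.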
The Standard Normal Distribution and Z-Scores

The standard normal distribution is a special case with μ = 0 and σ = 1. Its PDF
simplifies to:

φ(z) = (1 / √(2π)) · e^(−z² / 2)

Any normal distribution can be transformed to the standard normal distribution
using the z-score transformation:

z = (x − μ) / σ

This transformation is crucial because:

 It allows us to compare values from different normal distributions
 It enables the use of standard normal tables (z-tables)
 It simplifies probability calculations

The z-score represents how many standard deviations a value is from the mean,
providing a standardized measure of relative position.
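A minimal sketch of the z-score transformation; the height and exam-score figures are hypothetical, chosen so both values sit exactly one standard deviation above their respective means:

```python
# Z-scores put values from different normal distributions on a common scale.
# Hypothetical comparison: a 182 cm height where heights ~ N(175, 7^2),
# versus an 85-point exam score where scores ~ N(75, 10^2).
def z_score(x, mu, sigma):
    return (x - mu) / sigma

z_height = z_score(182, 175, 7)
z_exam = z_score(85, 75, 10)

# Both values are exactly 1 standard deviation above their means,
# so they occupy the same relative position in their distributions.
print(z_height, z_exam)  # 1.0 1.0
```

Even though 182 and 85 are on completely different scales, their z-scores show they are equally "unusual" within their own distributions.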

Real-World Analytics Example: Height Distribution

Scenario: A clothing retailer wants to understand the distribution of adult male
heights in their target market to optimize inventory planning.

Data characteristics:

 Heights are normally distributed
 Mean height: μ = 175 cm
 Standard deviation: σ = 7 cm

Business questions:

1. What percentage of the population is between 168 cm and 182 cm?
2. What height represents the 95th percentile?
3. How do these insights inform clothing size distribution?

Analysis:
This is a classic application of the normal distribution where we can use z-scores and
standard normal properties to answer these questions.
Step-by-Step Probability Calculations

Let's calculate the percentage of the population between 168 cm and 182 cm:

1. Convert to z-scores:

z₁ = (168 − 175) / 7 = −1.0
z₂ = (182 − 175) / 7 = 1.0

2. Find probabilities using the standard normal distribution:

P(−1.0 ≤ Z ≤ 1.0) = P(Z ≤ 1.0) − P(Z ≤ −1.0)

Using Python for calculation:

python
from scipy.stats import norm

# Calculate probability between z = -1 and z = 1
prob = norm.cdf(1) - norm.cdf(-1)
print(f"Percentage between 168cm and 182cm: {prob*100:.1f}%")

# Find 95th percentile height
percentile_95 = norm.ppf(0.95) * 7 + 175
print(f"95th percentile height: {percentile_95:.1f} cm")

This reveals that approximately 68% of the population falls within this range, and the
95th percentile height is about 186.5 cm.

Visualization: The Bell Curve and Empirical Rule

The normal distribution's properties are beautifully captured in the empirical rule (68-
95-99.7 rule):

 Approximately 68% of values fall within ±1σ of the mean
 Approximately 95% of values fall within ±2σ of the mean
 Approximately 99.7% of values fall within ±3σ of the mean

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm

# Create visualization
mu, sigma = 175, 7
x = np.linspace(mu - 4*sigma, mu + 4*sigma, 1000)
pdf = norm.pdf(x, mu, sigma)

plt.figure(figsize=(12, 6))
plt.plot(x, pdf, 'b-', linewidth=2)
plt.title('Normal Distribution of Adult Male Heights')
plt.xlabel('Height (cm)')
plt.ylabel('Probability Density')

# Shade different regions
plt.fill_between(x, pdf, where=(x >= mu-sigma) & (x <= mu+sigma),
color='lightblue', alpha=0.5, label='μ ± σ (68%)')
plt.fill_between(x, pdf, where=(x >= mu-2*sigma) & (x <= mu+2*sigma),
color='blue', alpha=0.3, label='μ ± 2σ (95%)')
plt.fill_between(x, pdf, where=(x >= mu-3*sigma) & (x <= mu+3*sigma),
color='darkblue', alpha=0.2, label='μ ± 3σ (99.7%)')

plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

This visualization clearly demonstrates the empirical rule and helps understand the
concentration of values around the mean.

Applications in Data Analytics

The normal distribution underpins many analytical techniques:

1. Hypothesis Testing: Many tests (t-tests, z-tests) assume normality or rely on the
central limit theorem
2. Quality Control: Process capability analysis uses normal distributions to assess
manufacturing processes
3. Risk Management: Value at Risk (VaR) calculations in finance often assume normal
returns
4. Forecasting: Prediction intervals often rely on normal distribution assumptions
5. Machine Learning: Many algorithms assume features are normally distributed or
perform better after normalization

The central limit theorem is particularly important as it justifies the use of normal-
based inference even when the underlying data isn't perfectly normal, provided
sample sizes are sufficiently large.
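The central limit theorem can be illustrated with a quick simulation (assuming NumPy is available, as in the plotting example above): means of samples from a heavily skewed exponential distribution already behave approximately normally at n = 50.

```python
import numpy as np

# CLT sketch: sample means of a skewed (exponential) distribution
# look approximately normal for moderate sample sizes.
rng = np.random.default_rng(42)
n, n_samples = 50, 10_000

# Each row is one sample of n exponential draws (mean 1, sd 1)
draws = rng.exponential(scale=1.0, size=(n_samples, n))
sample_means = draws.mean(axis=1)

# The CLT predicts mean ~ 1 and sd ~ 1/sqrt(n) ~ 0.141 for the sample means
print(f"mean of sample means: {sample_means.mean():.3f}")
print(f"sd of sample means:   {sample_means.std():.3f} (CLT: {1/np.sqrt(n):.3f})")
```

Plotting a histogram of sample_means would show the familiar bell shape, even though the underlying exponential distribution is strongly right-skewed.
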
Related Distributions

Several important distributions are related to the normal distribution:

1. t-Distribution:
o Similar shape to normal but with heavier tails
o Used when sample sizes are small and population variance is unknown
o Approaches normal distribution as degrees of freedom increase
2. Chi-Square Distribution:
o Distribution of sum of squared standard normal variables
o Used in goodness-of-fit tests and tests of independence
o Right-skewed with shape depending on degrees of freedom
3. F-Distribution:
o Ratio of two chi-square distributed variables
o Used in ANOVA and regression analysis
o Right-skewed with two parameters for degrees of freedom

These distributions form the foundation of many statistical tests and are essential for
advanced analytics.
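These relationships can be verified numerically. A sketch (assuming NumPy and SciPy are available) that builds a chi-square sample from squared standard normals and shows the t-distribution's critical values approaching the normal's as degrees of freedom grow:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Chi-square: the sum of k squared standard normal variables follows chi2(k),
# whose mean is k
k = 5
z = rng.standard_normal((100_000, k))
chi_sq = (z ** 2).sum(axis=1)
print(f"empirical mean {chi_sq.mean():.2f} vs theoretical mean {k}")

# t-distribution: heavier tails than the normal, converging as df grows
for df in (5, 30, 1000):
    print(f"97.5% critical value, t(df={df}): {stats.t.ppf(0.975, df):.3f}")
print(f"97.5% critical value, normal:      {stats.norm.ppf(0.975):.3f}")  # 1.960
```

The shrinking gap between the t and normal critical values is why the t-distribution matters mainly for small samples.
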

Case Study: Quality Control in Manufacturing

Business Problem: A manufacturer produces bolts with a target diameter of 10mm.
Due to natural variation, diameters are normally distributed with σ = 0.02mm. The
quality team needs to determine acceptable tolerance limits that will include 99% of
production.

Analysis:

1. The diameter follows N(10, 0.02²)
2. We need to find values a and b such that P(a ≤ X ≤ b) = 0.99
3. By symmetry, we choose a and b such that P(X ≤ a) = 0.005 and P(X ≥ b) = 0.005

Using Python:
python
from scipy.stats import norm

# Find critical values for 99% interval
lower_limit = norm.ppf(0.005, 10, 0.02)
upper_limit = norm.ppf(0.995, 10, 0.02)

print(f"99% of bolts will have diameters between {lower_limit:.3f}mm and {upper_limit:.3f}mm")

This analysis helps set manufacturing tolerances and informs quality control
procedures. The team might set acceptance limits at these values and implement
statistical process control to monitor production.

Summary and Key Takeaways

 The normal distribution is characterized by its bell shape, symmetry, and defined by μ
and σ
 The standard normal distribution (μ=0, σ=1) serves as a reference for all normal
distributions
 The empirical rule provides quick probability estimates for intervals around the mean
 Z-scores allow standardization and comparison across different normal distributions
 Related distributions (t, χ², F) extend the utility of the normal distribution to various
statistical applications
 Understanding the normal distribution is essential for statistical inference, hypothesis
testing, and many analytical techniques

Common Pitfalls:

 Assuming normality without verification
 Applying normal-based methods to highly skewed or non-normal data
 Confusing the standard normal distribution with other normal distributions
 Misinterpreting z-scores as probabilities rather than standardized values

Practice Questions:

1. If test scores are N(75, 10²), what percentage of students scored above 90?
2. What z-score corresponds to the 25th percentile?
3. When would you use a t-distribution instead of a normal distribution?
4. How does the chi-square distribution relate to the normal distribution?

(Answers: 1. P(X > 90) = 1 − Φ((90 − 75)/10) = 1 − Φ(1.5) ≈ 6.68%; 2. z = Φ⁻¹(0.25)
≈ −0.6745; 3. When the sample size is small and the population variance is unknown; 4.
The chi-square distribution is the distribution of a sum of squared standard
normal variables.)
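The numerical answers above can be verified with the standard library alone, using the identity Φ(z) = ½(1 + erf(z/√2)):

```python
import math

# Q1: scores ~ N(75, 10^2); P(X > 90) = 1 - Phi(1.5)
phi_1_5 = 0.5 * (1 + math.erf(1.5 / math.sqrt(2)))  # standard normal CDF at 1.5
p_above_90 = 1 - phi_1_5
print(f"P(score > 90) = {p_above_90:.4f}")  # 0.0668

# Q2: check that z = -0.6745 is the 25th percentile: Phi(-0.6745) should be ~0.25
phi_q = 0.5 * (1 + math.erf(-0.6745 / math.sqrt(2)))
print(f"Phi(-0.6745) = {phi_q:.4f}")  # 0.2500
```
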
