Distributions

The document provides an overview of distributions in Python, focusing on random variables, probability mass/density functions, and cumulative distribution functions. It details functionalities for generating random numbers, calculating probability density, cumulative distribution, and percentile point functions for various distributions, particularly the normal distribution. The document also includes examples of generating random numbers and visualizing distributions using libraries like NumPy, Matplotlib, and SciPy.

Uploaded by

sugaz

DISTRIBUTIONS IN PYTHON

1) Random variable (RV):-

A random variable maps each outcome of a sample space to the real line, so that
every outcome corresponds to exactly one real number.
Eg: In a coin toss experiment the outcomes [H,T] are mapped to [0,1]. If the outcomes
are already real valued (the throw of a die), no mapping is needed. These
are examples of discrete random variables.
A continuous RV maps the outcomes of a continuous phenomenon to intervals on the
real line (Eg: sensor readings).
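To make the mapping concrete, here is a small standard-library sketch of the coin toss example. The names `coin_rv` and `toss` are made up for this illustration; only the H to 0, T to 1 assignment comes from the notes.

```python
import random

# Mapping from sample-space outcomes to real numbers (H -> 0, T -> 1, as in the notes)
coin_rv = {"H": 0, "T": 1}

def toss(rng):
    """Draw one outcome from the sample space and return its real-valued image."""
    outcome = rng.choice(["H", "T"])
    return coin_rv[outcome]

rng = random.Random(0)  # seeded only so the sketch is repeatable
samples = [toss(rng) for _ in range(10)]
print(samples)  # ten values, each either 0 or 1
```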
2) Probability mass/density function:-

For a discrete RV, the probability mass function (pmf) assigns a probability to
every discrete outcome of the sample space.
Eg: The coin toss sample space [H,T] is mapped to [0,1]. X is a discrete RV whose outcome can
be 0 or 1:
P(X = 0) = 0.5, P(X = 1) = 0.5
For a continuous RV, the probability density function (pdf) f(x) assigns a probability
to every interval on the real line. That probability is the area under the curve of f(x):
$P(a < x < b) = \int_a^b f(x)\,dx$
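The area interpretation can be checked numerically. The sketch below assumes the standard normal density and integrates it between a = -1 and b = 1. Both the endpoints and the use of `scipy.integrate.quad` are choices made for this illustration, not something the notes prescribe.

```python
from scipy import integrate, stats

a, b = -1.0, 1.0  # interval endpoints chosen for illustration

# quad returns (integral estimate, estimated absolute error)
area, abs_err = integrate.quad(stats.norm.pdf, a, b)

print(area)  # roughly 0.6827 for the standard normal
```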
3) Cumulative distribution function (cdf):-

This is the probability that the RV 'x' lies in the interval −∞ < x < b:
$F(b) = P(-\infty < x < b) = \int_{-\infty}^{b} f(x)\,dx$
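For the standard normal distribution this integral has a well-known closed form in terms of the error function, F(b) = (1 + erf(b/√2))/2. A minimal standard-library sketch follows; the function name `std_normal_cdf` is invented for this illustration.

```python
import math

def std_normal_cdf(b):
    """CDF of the standard normal, written with the error function."""
    return 0.5 * (1.0 + math.erf(b / math.sqrt(2.0)))

print(std_normal_cdf(-1.0))  # about 0.1587
print(std_normal_cdf(0.0))   # 0.5, half of the probability lies below the mean
```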
Every distribution in scipy.stats has four main functionalities

1. ‘rvs’: generating random numbers from a distribution
2. ‘pdf’: probability density function (‘pmf’, the probability mass function, for discrete distributions)
3. ‘cdf’: cumulative distribution function
4. ‘ppf’: percent point function (inverse cumulative distribution function)

Distribution root name

norm- normal distribution
binom- binomial distribution
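As a quick orientation, the sketch below calls each of the four functionalities once on the two root names listed. It uses "frozen" distribution objects (parameters fixed at creation), which is a convenience assumed here; the notes themselves pass `loc`, `scale`, `n` and `p` on every call. The variable names are illustrative.

```python
import scipy.stats

std_norm = scipy.stats.norm(loc=0, scale=1)   # frozen standard normal
coin10 = scipy.stats.binom(n=10, p=0.5)       # frozen binomial: 10 fair tosses

draws = std_norm.rvs(size=5, random_state=0)  # rvs: random numbers
density_at_0 = std_norm.pdf(0)                # pdf: density at x = 0
below_0 = std_norm.cdf(0)                     # cdf: P(X <= 0)
median = std_norm.ppf(0.5)                    # ppf: inverse of the cdf

prob_5_heads = coin10.pmf(5)                  # discrete distributions use pmf instead of pdf

print(density_at_0, below_0, median, prob_5_heads)
```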

In this lecture we are going to work with the standard normal distribution, i.e.
mean = 0 and standard deviation = 1.
NORMAL DISTRIBUTION
In [1]: # Importing numpy for numerical operations
import numpy as np

# Importing matplotlib and seaborn for visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Importing scipy.stats for statistical computations
import scipy.stats
Setting figure size

In [2]: plt.rcParams["figure.figsize"] = (7,7)


GENERATING RANDOM NUMBERS
In [3]: v1= scipy.stats.norm.rvs(loc=0,scale=1,size=10)
print('Mean',v1.mean())
print('Std Dev',v1.std())

Mean 0.3253569359912848
Std Dev 1.2820175997433112
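In [4]: # Note: distplot is deprecated since seaborn 0.11; histplot or displot are the suggested replacements
sns.distplot(v1)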

Out[4]: <matplotlib.axes._subplots.AxesSubplot at 0x1cfe3a42048>


In [5]: v2= scipy.stats.norm.rvs(loc=0,scale=1,size=100)
print('Mean',v2.mean())
print('Std Dev',v2.std())

Mean 0.020713894103135778
Std Dev 0.9415726759437645

In [6]: sns.distplot(v2)

Out[6]: <matplotlib.axes._subplots.AxesSubplot at 0x1cfe6b01648>


In [7]: v3= scipy.stats.norm.rvs(loc=0,scale=1,size=1000)
print('Mean',v3.mean())
print('Std Dev',v3.std())

Mean 0.02103248207417688
Std Dev 0.9865003013211904

In [8]: sns.distplot(v3)

Out[8]: <matplotlib.axes._subplots.AxesSubplot at 0x1cfe6b87e48>
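The three runs above suggest that the sample mean and standard deviation move toward 0 and 1 as the sample grows. The sketch below repeats the experiment in a loop. The seed (42) and the extra size of 100000 are arbitrary choices for this illustration, and `ddof=1` is used to get the sample standard deviation rather than the default population formula used in the cells above.

```python
import scipy.stats

for size in (10, 100, 1000, 100000):
    sample = scipy.stats.norm.rvs(loc=0, scale=1, size=size, random_state=42)
    # ddof=1 gives the sample standard deviation; the default ddof=0 is the population formula
    print(size, sample.mean(), sample.std(ddof=1))
```

Passing `random_state` makes each run repeatable, which the unseeded cells above are not.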


PROBABILITY DENSITY FUNCTION
In [9]: scipy.stats.norm.pdf(-1)

Out[9]: 0.24197072451914337
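The value in Out[9] can be reproduced from the closed-form density of the standard normal, f(x) = exp(-x²/2)/√(2π). A standard-library sketch follows; the helper name `std_normal_pdf` is invented for this illustration.

```python
import math

def std_normal_pdf(x):
    """Density of the standard normal: exp(-x^2 / 2) / sqrt(2*pi)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

print(std_normal_pdf(-1.0))  # about 0.24197, in line with Out[9]
```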
In [10]: scipy.stats.norm.pdf(np.arange(-3,-1,0.01),loc = 0,scale= 1)

Out[10]: array([0.00443185, 0.00456659, 0.00470496, 0.00484703, 0.0049929 ,
0.00514264, 0.00529634, 0.0054541 , 0.00561598, 0.0057821 ,
0.00595253, 0.00612738, 0.00630673, 0.00649068, 0.00667932,
0.00687277, 0.0070711 , 0.00727444, 0.00748287, 0.00769651,
0.00791545, 0.00813981, 0.00836969, 0.0086052 , 0.00884645,
0.00909356, 0.00934664, 0.0096058 , 0.00987115, 0.01014283,
0.01042093, 0.0107056 , 0.01099694, 0.01129507, 0.01160014,
0.01191224, 0.01223153, 0.01255811, 0.01289213, 0.0132337 ,
0.01358297, 0.01394006, 0.01430511, 0.01467825, 0.01505962,
0.01544935, 0.01584758, 0.01625445, 0.0166701 , 0.01709467,
0.0175283 , 0.01797113, 0.01842331, 0.01888498, 0.01935628,
0.01983735, 0.02032836, 0.02082943, 0.02134071, 0.02186237,
0.02239453, 0.02293735, 0.02349099, 0.02405557, 0.02463127,
0.02521822, 0.02581658, 0.02642649, 0.0270481 , 0.02768157,
0.02832704, 0.02898466, 0.02965458, 0.03033696, 0.03103193,
0.03173965, 0.03246027, 0.03319392, 0.03394076, 0.03470094,
0.03547459, 0.03626187, 0.03706291, 0.03787786, 0.03870686,
0.03955004, 0.04040755, 0.04127953, 0.04216611, 0.04306742,
0.0439836 , 0.04491477, 0.04586108, 0.04682264, 0.04779957,
0.04879202, 0.04980009, 0.0508239 , 0.05186358, 0.05291923,
0.05399097, 0.0550789 , 0.05618314, 0.05730379, 0.05844094,
0.05959471, 0.06076517, 0.06195242, 0.06315656, 0.06437766,
0.06561581, 0.06687109, 0.06814357, 0.06943331, 0.07074039,
0.07206487, 0.07340681, 0.07476626, 0.07614327, 0.07753789,
0.07895016, 0.08038011, 0.08182778, 0.08329319, 0.08477636,
0.08627732, 0.08779607, 0.08933262, 0.09088698, 0.09245913,
0.09404908, 0.0956568 , 0.09728227, 0.09892547, 0.10058637,
0.10226492, 0.1039611 , 0.10567483, 0.10740608, 0.10915477,
0.11092083, 0.11270421, 0.1145048 , 0.11632253, 0.1181573 ,
0.120009 , 0.12187754, 0.12376279, 0.12566464, 0.12758295,
0.1295176 , 0.13146843, 0.1334353 , 0.13541806, 0.13741654,
0.13943057, 0.14145997, 0.14350455, 0.14556413, 0.1476385 ,
0.14972747, 0.1518308 , 0.15394829, 0.1560797 , 0.15822479,
0.16038333, 0.16255506, 0.16473972, 0.16693704, 0.16914676,
0.17136859, 0.17360225, 0.17584743, 0.17810384, 0.18037116,
CUMULATIVE DISTRIBUTION FUNCTION
In [11]: scipy.stats.norm.cdf(x=-1,loc = 0,scale= 1)

Out[11]: 0.15865525393145707

In [12]: 1-scipy.stats.norm.cdf(x=-1,loc = 0,scale= 1)

Out[12]: 0.8413447460685429
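The complement `1 - cdf` is also available directly as the survival function `sf`, which SciPy provides for its distributions. A short sketch comparing the two routes (variable names are illustrative):

```python
import scipy.stats

right_tail_via_cdf = 1 - scipy.stats.norm.cdf(x=-1, loc=0, scale=1)
right_tail_via_sf = scipy.stats.norm.sf(x=-1, loc=0, scale=1)  # survival function = 1 - cdf

print(right_tail_via_cdf, right_tail_via_sf)  # both about 0.8413
```

For probabilities far out in the upper tail, `sf` is usually the more accurate of the two because it avoids subtracting a number very close to 1.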
VISUALIZING DISTRIBUTIONS
[Figure: shaded normal curves in panels titled Left Tail, Right Tail, Bounded and Tails]
LEFT TAIL
In [13]: plt.fill_between(x=np.arange(-3,3,0.01),
y1= scipy.stats.norm.pdf(np.arange(-3,3,0.01)),
facecolor='blue',
alpha=0.5)
plt.vlines(x=-1,ymin=0,ymax=0.24,linestyles='dashed')

Out[13]: <matplotlib.collections.LineCollection at 0x1cfe6c49388>


In [14]: scipy.stats.norm.cdf(x=-1,loc = 0,scale= 1)

Out[14]: 0.15865525393145707

The cdf gives the area to the left of -1, i.e. P(Z <= -1)


In [15]: prob_under_neg1 =scipy.stats.norm.cdf(x=-1,loc = 0,scale= 1)
print(prob_under_neg1)

0.15865525393145707
Plotting the cdf value on the normal distribution
In [16]: plt.fill_between(x=np.arange(-3,-1,0.01),
y1= scipy.stats.norm.pdf(np.arange(-3,-1,0.01)) ,
facecolor='blue',edgecolor='black',
alpha=0.5)
plt.fill_between(x=np.arange(-1,3,0.01),
y1= scipy.stats.norm.pdf(np.arange(-1,3,0.01)) ,
facecolor='red',edgecolor='black',
alpha=0.5)
plt.text(x=-2, y=0.03, s= round(prob_under_neg1,3))
plt.vlines(x=-1,ymin=0,ymax=0.24,linestyles='dashed')

Out[16]: <matplotlib.collections.LineCollection at 0x1cfe6d08e88>


RIGHT TAIL
In [17]: plt.fill_between(x=np.arange(-3,3,0.01),
y1= scipy.stats.norm.pdf(np.arange(-3,3,0.01)) ,
facecolor='blue',
alpha=0.5)
plt.vlines(x=-1,ymin=0,ymax=0.24,linestyles='dashed')

Out[17]: <matplotlib.collections.LineCollection at 0x1cfe6cce408>


The area to the right of -1, i.e. P(Z >= -1), is 1 minus the cdf at -1
In [18]: prob_over_neg1 =1-prob_under_neg1
print(prob_over_neg1)

0.8413447460685429
In [19]: plt.fill_between(x=np.arange(-3,-1,0.01),
y1= scipy.stats.norm.pdf(np.arange(-3,-1,0.01)) ,
facecolor='red',edgecolor='black',
alpha=0.5)
plt.fill_between(x=np.arange(-1,3,0.01),
y1= scipy.stats.norm.pdf(np.arange(-1,3,0.01)) ,
facecolor='blue',edgecolor='black',
alpha=0.5)
plt.text(x=0.5, y=0.03, s= round(prob_over_neg1,3))
plt.vlines(x=-1,ymin=0,ymax=0.24,linestyles='dashed')

Out[19]: <matplotlib.collections.LineCollection at 0x1cfe6dffdc8>


BOUNDED
In [20]: plt.rcParams["figure.figsize"] = (5,5)
plt.fill_between(x=np.arange(-3,-1,0.01),
y1= scipy.stats.norm.pdf(np.arange(-3,-1,0.01)) ,
facecolor='red',edgecolor='black',
alpha=0.5)
plt.fill_between(x=np.arange(-1,1,0.01),
y1= scipy.stats.norm.pdf(np.arange(-1,1,0.01)) ,
facecolor='blue',edgecolor='black',
alpha=0.5)
plt.fill_between(x=np.arange(1,3,0.01),
y1= scipy.stats.norm.pdf(np.arange(1,3,0.01)) ,
facecolor='red',edgecolor='black',
alpha=0.5)
plt.vlines(x=-1,ymin=0,ymax=0.24,linestyles='dashed')
plt.vlines(x=1,ymin=0,ymax=0.24,linestyles='dashed')

Out[20]: <matplotlib.collections.LineCollection at 0x1cfe6e836c8>


In [21]: prob_over_pos1 =1-scipy.stats.norm.cdf(x=1,loc = 0,scale= 1)
print(prob_over_pos1)
print(prob_under_neg1)
between_prob=1-(prob_under_neg1+prob_over_pos1)
print(between_prob)

0.15865525393145707
0.15865525393145707
0.6826894921370859
TAILS
In [22]: tails_prob=prob_under_neg1+prob_over_pos1
print(tails_prob)

0.31731050786291415
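The bounded and tail probabilities can also be written as a difference of two cdf values and, by symmetry of the normal curve, as twice the left tail. The sketch below applies both shortcuts for cutoffs of 1, 2 and 3 standard deviations. The loop and the `results` dictionary are additions for illustration; only the k = 1 case appears in the notes.

```python
import scipy.stats

results = {}
for k in (1, 2, 3):
    # P(-k < Z < k) as a difference of two cdf values
    between = scipy.stats.norm.cdf(k) - scipy.stats.norm.cdf(-k)
    # By symmetry, both tails together are twice the left tail
    tails = 2 * scipy.stats.norm.cdf(-k)
    results[k] = (between, tails)
    print(k, round(between, 4), round(tails, 4))
```

For k = 1 this reproduces the values 0.6827 and 0.3173 computed in the cells above.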
INVERSE CUMULATIVE DISTRIBUTION FUNCTION
In [23]: plt.fill_between(x=np.arange(-3,3,0.01),
y1= scipy.stats.norm.pdf(np.arange(-3,3,0.01)) ,
facecolor='blue',
alpha=0.5)

plt.vlines(x=-1,ymin=0,ymax=0.24,linestyles='dashed')

Out[23]: <matplotlib.collections.LineCollection at 0x1cfe6e3fc88>


In [24]: q_val=scipy.stats.norm.cdf(x=-1,loc = 0,scale= 1)
print(q_val)

0.15865525393145707

In [25]: scipy.stats.norm.ppf(q=q_val,loc = 0,scale= 1)

Out[25]: -1.0
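Because `ppf` is the inverse of `cdf`, applying one after the other should return the starting point. The sketch below checks this round trip on a few arbitrary x values and then uses `ppf` the way it is commonly used in practice, to find a cutoff for a given probability. The specific x values and the 0.975 level are chosen for illustration.

```python
import numpy as np
import scipy.stats

xs = np.array([-2.0, -1.0, 0.0, 1.5])
qs = scipy.stats.norm.cdf(xs)          # x -> probability
xs_back = scipy.stats.norm.ppf(qs)     # probability -> x

print(np.allclose(xs, xs_back))        # the round trip recovers the inputs

# A typical use of ppf: the cutoff that leaves 2.5% in the upper tail
z_975 = scipy.stats.norm.ppf(0.975)
print(z_975)                           # about 1.96
```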
BINOMIAL DISTRIBUTION
No. of successes in 10 tosses of a fair coin
n- no. of tosses
p- probability of success

In [26]: scipy.stats.binom.rvs(n=10,p=0.5)

Out[26]: 5
A coin is tossed 10 times and the experiment is repeated 5 times. In each experiment
the number of successes is recorded

In [27]: scipy.stats.binom.rvs(size=5,n=10,p=0.5,random_state=0)

Out[27]: array([5, 6, 5, 5, 5])


Now let us repeat the experiment 15 times

In [28]: scipy.stats.binom.rvs(size=15,n=10,p=0.5,random_state=0)

Out[28]: array([5, 6, 5, 5, 5, 6, 5, 7, 8, 5, 6, 5, 5, 7, 3])
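Each number returned by `binom.rvs` counts the successes in one experiment of 10 tosses. The sketch below spells that out by simulating the individual tosses with NumPy and summing each row. It is an illustration of what a binomial draw represents, not how SciPy generates the numbers internally, and the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)          # seed chosen arbitrarily

n_tosses, n_experiments = 10, 15
# Each row is one experiment; each entry is one toss (1 = success, 0 = failure)
tosses = rng.integers(0, 2, size=(n_experiments, n_tosses))
successes = tosses.sum(axis=1)          # one binomial(10, 0.5) draw per row

print(successes)
```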


Let us visualize this distribution
In [29]: ax=sns.distplot(scipy.stats.binom.rvs(size=15,n=10,p=0.5,random_state=0),
kde=False,hist_kws={"color": 'b','alpha':0.8})
# default color is blue

# Labels for x and y axis
ax.set(xlabel='No. of successes',ylabel='Frequency of success in a given trial')

Out[29]: [Text(0, 0.5, 'Frequency of success in a given trial'),
Text(0.5, 0, 'No. of successes')]
In [30]: ax=sns.distplot(scipy.stats.binom.rvs(size=15,n=10,p=0.5,random_state=0),
kde=True,hist_kws={"color": 'b','alpha':0.8})
# default color is blue

# Labels for x and y axis
ax.set(xlabel='No. of successes',ylabel='Frequency of success in a given trial')

Out[30]: [Text(0, 0.5, 'Frequency of success in a given trial'),
Text(0.5, 0, 'No. of successes')]
Now increase the no. of experiments to 100 and then to 500
In [31]: ax=sns.distplot(scipy.stats.binom.rvs(size=100,n=10,p=0.5,random_state=0),
kde=False,hist_kws={"color": 'b','alpha':0.8})
# default color is blue

# Labels for x and y axis
ax.set(xlabel='No. of successes',ylabel='Frequency of success in a given trial')

Out[31]: [Text(0, 0.5, 'Frequency of success in a given trial'),
Text(0.5, 0, 'No. of successes')]
In [32]: ax=sns.distplot(scipy.stats.binom.rvs(size=100,n=10,p=0.5,random_state=0),
kde=True,hist_kws={"color": 'b','alpha':0.8})
# default color is blue

# Labels for x and y axis
ax.set(xlabel='No. of successes',ylabel='Frequency of success in a given trial')

Out[32]: [Text(0, 0.5, 'Frequency of success in a given trial'),
Text(0.5, 0, 'No. of successes')]
In [33]: ax=sns.distplot(scipy.stats.binom.rvs(size=500,n=10,p=0.5,random_state=0),
kde=True,hist_kws={"color": 'b','alpha':0.8})
# default color is blue

# Labels for x and y axis
ax.set(xlabel='No. of successes',ylabel='Frequency of success in a given trial')

Out[33]: [Text(0, 0.5, 'Frequency of success in a given trial'),
Text(0.5, 0, 'No. of successes')]
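The histograms look more regular as the number of experiments grows because the observed frequencies approach the theoretical probabilities. The sketch below measures that without plotting, by comparing relative frequencies against `binom.pmf`. The largest size (100000) and the `max_gap` measure are additions made for this illustration.

```python
import numpy as np
import scipy.stats

n, p = 10, 0.5
k = np.arange(n + 1)
theoretical = scipy.stats.binom.pmf(k=k, n=n, p=p)

for size in (15, 100, 500, 100000):
    draws = scipy.stats.binom.rvs(n=n, p=p, size=size, random_state=0)
    # Relative frequency of each possible count 0..n
    empirical = np.bincount(draws, minlength=n + 1) / size
    # Largest disagreement between observed and theoretical probability
    max_gap = np.abs(empirical - theoretical).max()
    print(size, max_gap)
```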
PROBABILITY MASS FUNCTION
Probability of seeing a given number of successes

Probability of seeing exactly nine heads in 20 tosses, i.e. X = 9

In [34]: scipy.stats.binom.pmf(n=20,p=0.5,k=9)

Out[34]: 0.16017913818359344
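The same number follows from the binomial formula P(X = k) = C(n, k) p^k (1 - p)^(n - k). A standard-library sketch with n = 20, k = 9 and p = 0.5 as in the cell above (the variable names are illustrative):

```python
import math

n, k, p = 20, 9, 0.5
# Binomial pmf: C(n, k) * p^k * (1 - p)^(n - k)
prob_nine_heads = math.comb(n, k) * p**k * (1 - p)**(n - k)

print(math.comb(n, k))    # 167960 ways to place 9 heads among 20 tosses
print(prob_nine_heads)    # about 0.16018
```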
Probability of seeing nine or fewer heads i.e. X<=9

In [35]: scipy.stats.binom.pmf(n=20,p=0.5,k=0)+\
scipy.stats.binom.pmf(n=20,p=0.5,k=1)+\
scipy.stats.binom.pmf(n=20,p=0.5,k=2)+\
scipy.stats.binom.pmf(n=20,p=0.5,k=3)+\
scipy.stats.binom.pmf(n=20,p=0.5,k=4)+\
scipy.stats.binom.pmf(n=20,p=0.5,k=5)+\
scipy.stats.binom.pmf(n=20,p=0.5,k=6)+\
scipy.stats.binom.pmf(n=20,p=0.5,k=7)+\
scipy.stats.binom.pmf(n=20,p=0.5,k=8)+\
scipy.stats.binom.pmf(n=20,p=0.5,k=9)

Out[35]: 0.4119014739990231

In [36]: k_range=np.arange(0,10)

scipy.stats.binom.pmf(n=20,p=0.5,k=k_range)

sum(scipy.stats.binom.pmf(n=20,p=0.5,k=k_range))

Out[36]: 0.4119014739990231
This running sum of the pmf is exactly what the cumulative distribution function returns
In [37]: scipy.stats.binom.cdf(n=20,p=0.5,k=9)

Out[37]: 0.41190147399902316
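The relationship holds at every k, not only at k = 9. The sketch below checks it over the whole range with a cumulative sum and then uses the complement to get the probability of ten or more heads. That last quantity is not computed in the notes and is included only as an illustration.

```python
import numpy as np
import scipy.stats

n, p = 20, 0.5
k = np.arange(n + 1)

pmf = scipy.stats.binom.pmf(k=k, n=n, p=p)
cdf = scipy.stats.binom.cdf(k=k, n=n, p=p)

# The running total of the pmf reproduces the cdf at every k
print(np.allclose(np.cumsum(pmf), cdf))

# Complement: probability of ten or more heads
prob_ten_or_more = 1 - cdf[9]
print(prob_ten_or_more)   # about 0.5881
```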
