{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "d28d89a7",
   "metadata": {},
   "source": [
    "### <img src='./fig/vertical_COMILLAS_COLOR.jpg' style= 'width:70mm'>\n",
    "\n",
    "<h1 style='font-family: Optima;color:#ecac00'>\n",
    "Máster en Big Data. Tecnología y Analítica Avanzada (MBD).\n",
    "<a class=\"tocSkip\">\n",
    "</h1>\n",
    "\n",
    "<h1 style='font-family: Optima;color:#ecac00'>\n",
    "Introducción al Análisis Estadístico con Lenguajes de Programación para Machine Learning (IAELPML). 2023-2024.\n",
    "</h1>\n",
    "    \n",
    "<h1 style='font-family: Optima;color:#ecac00'>\n",
    "03 Basic Probability\n",
    "<a class=\"tocSkip\">    \n",
    "</h1>  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "905c03ea",
   "metadata": {},
   "source": [
    "## <span style='background:yellow; color:red'> Remember:</span>\n",
    "\n",
    "+ Navigate to your `IAELPML` folder in the console  \n",
    "+ Execute `git pull origin main` to update the code\n",
    "+ **Do not modify the files in that folder**, copy them elsewhere"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "349ec85d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Standard Data Science Libraries Import\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "import scipy as scp"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1c85b7a9",
   "metadata": {},
   "source": [
    "# Populations and Samples"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "265492f8",
   "metadata": {},
   "source": [
    "+ The main goal of Statistics is to obtain reliable and useful information about a **population** of interest using **samples** from that population. The term population is used here in a broad sense to refer to any collection of individual entities, not just living creatures. A population can be the set of vehicles with license plates from the year 2015, the purchase orders received by a company in a given month, or the hummingbirds that visit a bird feeder during a certain week somewhere in Costa Rica.  \n",
    "\n",
    "+ Trying to get all the information of interest from all the individuals in a population is often impossible or pointless: too difficult, too expensive, too time consuming or in many cases, too harmful for the individuals in the population if the sampling process causes some degree of damage. That is where Statistics comes into play. Can we use some samples from the population to *infer* or *predict* what we want to know? In that sense a related question is: is the sample a good representation of the population?\n",
    "\n",
    "+ **Inference** is the part of Statistics dealing with those questions, making them formal and providing answers with a mathematically sound basis.\n",
    "\n",
    "![](fig/011-inferencePopulationSample.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c81a5039",
   "metadata": {},
   "source": [
    "## Simple Random Samples with Python"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b5a5ab5f",
   "metadata": {},
   "source": [
    "+ In the study of a population we are usually interested in certain properties or characteristics of the individuals that may differ from one individual to another. These properties are the *variables of interest to us*. When sampling a population we observe the values of those variables for a sample of individuals from that population. \n",
    "\n",
    "+ In order for the sampling process to provide us with a representative sample, we often take what is called a **simple random sample**. That is, we choose individuals from the population at random so that:\n",
    "    + All individuals in the population are *equally likely* to be chosen for the sample.\n",
    "    + We sample with *replacement*. That is, an individual can be chosen twice for the same sample. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "563516a9",
   "metadata": {},
   "source": [
    "+ Let us see a very simple example, using synthetic data. In the following code we create a `Population` data set of `N = 158000` individuals. The example is loosely inspired by the number of passengers that pass through the Madrid airport on a given day, and the variable `Ages` represents the (entirely fictitious) ages of those passengers. This example is unrealistic in that we know the age of each and every passenger. The blue dashed line marks the *population* mean age. In this case we can ask Python to tell us its value, but where would be the fun in doing that? Instead, let us try to come up with a good guess of that mean."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "ab03c822",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   Ages\n",
      "0    62\n",
      "1    52\n",
      "2    29\n",
      "3     7\n",
      "4    31\n",
      "5    13\n",
      "6    31\n",
      "7    15\n",
      "8    30\n",
      "9    17\n"
     ]
    },
    {
     "data": {
      "image/png": "",
      "text/plain": [
       "<Figure size 1200x850 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "sns.set(rc={'figure.figsize':(12, 8.5)})\n",
    "\n",
    "rng = np.random.default_rng(2022) # seed for reproducibility\n",
    "\n",
    "N = 158000\n",
    "Ages = np.rint(2 * rng.chisquare(df = 13, size = N)).astype(int)\n",
    "Population = pd.DataFrame({'Ages':Ages})\n",
    "\n",
    "print(Population.head(10))\n",
    "\n",
    "sns.histplot(data = Population, x = 'Ages', bins=15, color = \"orange\")\n",
    "getPlot = plt.axvline(x = Population.Ages.mean(), linewidth = 4, linestyle='--')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "56410afc",
   "metadata": {},
   "source": [
    "+ Guessing the population mean is just the type of question that we expect Statistics to answer. So we will take random samples from the population in order to make our guess. Let us begin by taking a sample of the ages of 20 individuals."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "be617a7b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        Ages\n",
      "112031    30\n",
      "87747     53\n",
      "84458     33\n",
      "43231     42\n",
      "122826    13\n",
      "93717     24\n",
      "51518     18\n",
      "36356     36\n",
      "71485     26\n",
      "45449     22\n",
      "126797    27\n",
      "30217     22\n",
      "75464     26\n",
      "4398      30\n",
      "35343     17\n",
      "8810      57\n",
      "109881    43\n",
      "124932    29\n",
      "66740     24\n",
      "153307    20\n",
      "\n",
      "\n",
      " Sample mean =  Ages    29.6\n",
      "dtype: float64\n"
     ]
    }
   ],
   "source": [
    "n = 20 # sample size\n",
    "\n",
    "sample = Population.sample(n, replace=True) # simple random sample: with replacement\n",
    "\n",
    "print(sample) # The index shows the row labels in the original data set\n",
    "\n",
    "print(\"\\n\" * 2, \"Sample mean = \", sample.mean())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "73c550e5",
   "metadata": {},
   "source": [
    "+ Execute the above cell a few times. Every time you will get a different sample and a new sample mean:  \n",
    "\n",
    "$$\\bar x = \\dfrac{x_1 + x_2 +\\cdots + x_{20}}{20}$$\n",
    "\n",
    "+ How different can these sample means be from one another and from the population mean? How *bad* can a sample be? Recall that we are sampling with replacement. Therefore, since there is a passenger with `age = 2`, the following is a perfectly legitimate sample:\n",
    "\n",
    "$$x_1 = 2,\\, x_2 = 2,\\,  \\ldots \\, , x_{20} = 2$$\n",
    "\n",
    "+ This sample would make us guess that the population mean is 2, which is way off from the true value (look at the histogram). In that sense this is a very, very bad sample (as bad as they get).   \n",
    "\n",
    "+ But remember, *we are taking random samples (with replacement)*. How probable is it that we get this particularly bad sample **at random**?\n",
    "\n",
    "+ This leads us to one of the most crucial steps in understanding the inner workings of Statistics. To answer the question at the end of the last paragraph we need to answer these two questions:\n",
    "    + how many different samples exist?\n",
    "    + how is the *sample mean* distributed over those samples? In other words, how many *good* and *bad* samples exist when it comes to guessing the mean of the population?\n",
    "    \n",
    "  Let us begin with the first question. The number of different samples (ordered lists of 20 passengers, with repetitions allowed) is this inconceivably large number:\n",
    "  \n",
    "  $$158000^{20} \\approx 9.4003005\\times 10^{103}$$ \n",
    "  \n",
    "  To put this in perspective, the number of stars in the universe is estimated to be less than $10^{40}$. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d8e86b3c",
   "metadata": {},
   "source": [
    "+ Now let us consider the question about the **distribution of the sample mean** in that huge sample space. There are too many samples in this example to check the mean of every sample one by one and decide whether it is a good or a bad sample (meaning how close its mean is to the true population mean). But we can take a large number of samples (a big *sample of samples*) and look at the distribution of their sample means.  \n",
    "\n",
    "+ We already did this in an exercise in Session 02 (we told you then it was a very important exercise), so you may want to take a look at that exercise to recall what happened there. The difference between the two situations is that in Session 02 we were sampling from a population where all the values had the same probability. Here our starting point is the population described by the histogram above, where obviously not all values are equally likely.\n",
    "\n",
    "+ The code to analyze that *sample of samples* is this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "5d75699f",
   "metadata": {},
   "outputs": [],
   "source": [
    "n = 20 # the common size to all samples\n",
    "\n",
    "n_samples = 10000 # we take a large number of samples with replacement, each of size 20"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "f0bfea67",
   "metadata": {},
   "outputs": [],
   "source": [
    "# %%timeit\n",
    "sample_means = pd.DataFrame([Population.sample(n, replace=True).mean() for item in range(n_samples)], \n",
    "                            columns=['Ages'])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ea18a8ef",
   "metadata": {},
   "source": [
    "+ Now let us look at the first sample means:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "64328473",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Ages</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>27.20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>24.30</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>24.35</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>22.80</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>25.90</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>26.10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>29.15</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>28.20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>31.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>26.60</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    Ages\n",
       "0  27.20\n",
       "1  24.30\n",
       "2  24.35\n",
       "3  22.80\n",
       "4  25.90\n",
       "5  26.10\n",
       "6  29.15\n",
       "7  28.20\n",
       "8  31.00\n",
       "9  26.60"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sample_means.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7d6b1007",
   "metadata": {},
   "source": [
    "+ To understand their distribution and answer the question about the proportion of good and bad samples we use a density plot. And to get a deeper insight we also plot the density curve for the original population (blue for the samples, orange for the population): "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "9cc29bfb",
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "image/png": "",
      "text/plain": [
       "<Figure size 800x800 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "sns.set(rc={'figure.figsize':(8, 8)})\n",
    "sample_means.plot.density()\n",
    "Population.Ages.plot.density()\n",
    "plt.axvline(x = Population.Ages.mean(), linewidth = 2, linestyle='--')\n",
    "getPlot = plt.xlim((0, 100))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f4f89c81",
   "metadata": {},
   "source": [
    "+ **Exercise:** rerun the above cells changing the sample size $n$ and see how it affects the results. Try, e.g., larger samples such as $n = 150$ and smaller ones such as $n = 10$."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d18990d7",
   "metadata": {},
   "source": [
    "+ The above graphic is possibly **the most important graph in the whole course**. Note these three things:\n",
    "\n",
    "    + The *mean of the sample means* is equal to the population mean (up to a small simulation error, since we only look at a finite number of samples). Make sure you understand this!\n",
    "    + There are very few *really bad samples*. The height of the blue curve is almost equal to zero if you move away from its maximum in either direction. In particular, the probability of randomly choosing and being misled by one of those very bad samples is *extremely* low.\n",
    "    + The distribution of the sample means is a bell-shaped curve and it is noticeably different from zero only in a narrow interval centered at the population mean. The spread of the sample means is much smaller than the spread of the original population.\n",
    "    \n",
    "+ The blue bell-shaped curve contains a very precise and useful answer to our second question about the distribution of the sample means. And the important thing here is that the same kind of bell-shaped answer appears for essentially any choice of the initial population, provided its variance is finite and the sample size is large enough. This is the content of the **Central Limit Theorem**, one of the most important results in Statistics. But to make these ideas precise we need the vocabulary of Probability. "
   ]
  },
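  {
   "cell_type": "markdown",
   "id": "7a3c9e10",
   "metadata": {},
   "source": [
    "+ The first and third observations above can also be checked numerically. The following is a minimal, self-contained sketch and not part of the original analysis: it rebuilds a population with the same recipe used for `Ages`, it uses NumPy directly instead of pandas, and the names ending in `_demo` are chosen here only for illustration. The last line relies on a standard result: for samples of size $n$ taken with replacement, the standard deviation of the sample mean equals the population standard deviation divided by $\\sqrt{n}$.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "rng_demo = np.random.default_rng(2022)  # seed for reproducibility\n",
    "\n",
    "# Synthetic population, built with the same recipe as the Ages example\n",
    "N_demo = 158000\n",
    "ages_demo = np.rint(2 * rng_demo.chisquare(df=13, size=N_demo)).astype(int)\n",
    "\n",
    "n_demo = 20            # size of each sample\n",
    "n_samples_demo = 10000 # number of samples\n",
    "\n",
    "# Each row is one sample of size n_demo, taken with replacement\n",
    "samples_demo = rng_demo.choice(ages_demo, size=(n_samples_demo, n_demo), replace=True)\n",
    "means_demo = samples_demo.mean(axis=1)  # one sample mean per row\n",
    "\n",
    "print('Population mean:         ', ages_demo.mean())\n",
    "print('Mean of the sample means:', means_demo.mean())\n",
    "print('Population std:          ', ages_demo.std())\n",
    "print('Std of the sample means: ', means_demo.std())\n",
    "print('Population std / sqrt(n):', ages_demo.std() / np.sqrt(n_demo))\n",
    "```\n",
    "\n",
    "  If the observations above hold, the first two printed values should be very close to each other, and so should the last two."
   ]
  },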
  {
   "cell_type": "markdown",
   "id": "e8f618b0",
   "metadata": {},
   "source": [
    "+ **Exercise:** \n",
    "    + Use the following commands to generate a new population:\n",
    "    ```python\n",
    "    rng = np.random.default_rng(2022) # seed for reproducibility\n",
    "    N = 100000\n",
    "    x = np.concatenate((1.5 * rng.normal(loc = -2, size = N), rng.normal(loc = 0.5, size = N)))\n",
    "    Population2 = pd.DataFrame({'x':x})\n",
    "    ```\n",
    "    + Plot the density curve of this population.\n",
    "    + Repeat the above steps: take a large number (tens of thousands) of size-20 samples (with replacement) of this population. Compute the sample means for all those samples and use graphs to study the distribution of the sample mean. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "1251a65a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# %load \"./exclude/S03-001.py\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a98bc083",
   "metadata": {},
   "source": [
    "# Basic Probability"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c86dc388",
   "metadata": {},
   "source": [
    "+ To understand or even state the Central Limit Theorem we need to learn a minimal amount of the language  of Probability Theory. \n",
    "\n",
    "+ And the first step as we enter this world of Probability is the realization that our intuition of probabilities is usually quite poor. We will begin using examples from simple gambling games such as cards, dice, etc. because historically this is also the context where Probability was born."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "26ad1c48",
   "metadata": {},
   "source": [
    "## The problems of the *Chevalier de Méré*\n",
    "\n",
    "+ [A. Gombaud, Chevalier de Méré](https://en.wikipedia.org/wiki/Antoine_Gombaud) was an amateur mathematician from the 17th century who became interested in understanding gambling and applying Mathematics to it. He himself made no major contributions to solving the problems he raised, but he managed to get the attention of well known mathematicians such as Pascal, Fermat and later Laplace. In particular he got them thinking about the solution of these two famous seminal problems. When playing dice, what is more likely?\n",
    "\n",
    "  + To get at least one six in four rolls of a single dice or\n",
    "  + to get at least one double six in 24 rolls of two dice.\n",
    "  \n",
    "+ The gamblers of that time reasoned that:\n",
    "\n",
    "  (a) The probability of rolling a six in a single dice roll is 1/6. Therefore, in four rolls the probability is\n",
    "  $$\\dfrac{1}{6} + \\dfrac{1}{6} + \\dfrac{1}{6} + \\dfrac{1}{6} = \\dfrac{4}{6} = \\dfrac{2}{3}$$\n",
    "  (b) The probability of rolling a double six in a roll of two dice is 1/36 (because there are 36 equally likely results). Therefore, in 24 rolls the probability is\n",
    "  $$\\dfrac{1}{36} + \\dfrac{1}{36} + \\cdots + \\dfrac{1}{36} = \\dfrac{24}{36} = \\dfrac{2}{3}$$\n",
    "  \n",
    "  Thus it would seem that both bets are equally likely to win. This kind of reasoning was being used in gambling salons to establish the amount that the gamblers should win or lose in these games. Legend has it that Gombaud kept track of the results of these games for a long time and became aware that the results did not match the above reasoning. The probabilities of winning and losing seemed wrong to him.\n",
    "\n",
    "+ Instead of having you discover Probability by losing your money at gambling, let us use Python.\n",
    "![](https://media.giphy.com/media/3ohjUS2N88LGAjLypO/giphy.gif)  \n",
    "\n",
    "+ **Exercise:** Let $N = 100000$. \n",
    "\n",
    "  (a) Simulate $N$ plays of the first de Méré game (that is, $4\\cdot N$ dice rolls in groups of four) and obtain a table of the relative frequencies of winning and losing in this game.  \n",
    "  (b) Do the same for the second de Méré game, but this time each play consists of 24 rolls of two dice (that is, $24\\cdot N$ rolls of two dice in total).  \n",
    "  (c) Compare the relative frequencies of winning in (a) and (b) with our naive estimate of $2/3$.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "f362093e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# %load \"./exclude/S03-002a.py\"\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "c5591a90",
   "metadata": {},
   "outputs": [],
   "source": [
    "# %load \"./exclude/S03-002b.py\"\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "87eddf0a",
   "metadata": {},
   "source": [
    "## The Birthday Paradox\n",
    "\n",
    "+ This is another experiment that illustrates how bad our probabilistic intuition usually is. If we have a room (a large hall) with 367 or more people in it, then we are certain that there are two people in that room whose birthdays coincide. If the number of people present is smaller, a coincidence is no longer certain, and its probability decreases as the number of people decreases. What is the smallest number of people in a room that leads to a coincidence probability bigger than 50%?\n",
    "\n",
    "+ Let us use Python to answer the question. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "00bcfd8a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# %load \"./exclude/S03-003.py\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fb2e0e9e",
   "metadata": {},
   "source": [
    "## Laplace's Method \n",
    "\n",
    "+ This was historically the first result that made it possible to address the systematic computation of probabilities. But as we will see, its practical application is not always simple and it is also lacking from the theoretical perspective.\n",
    "\n",
    "+ We need some language to state the rule:\n",
    "\n",
    "  (a) We will consider a random experiment with $n$ **elementary possible results** (no two of them can occur simultaneously)     $$a_1, a_2, \\ldots, a_n$$ \n",
    "  which *must occur with the same probability*. That means, in an informal approach, that their relative frequencies are the same when the experiment is conducted a very large number of times. For example, when throwing a fair dice, the set of elementary possible results with the same probability is the set of numbers from 1 to 6.  \n",
    "  (b) A **random event** is any subset $A$ of the set of elementary results. In the dice example, $A$ could be \"obtaining an even number\", which means $A = \\{2, 4, 6\\}$.\n",
    "\n",
    "+ **Laplace's Method:** To compute the probability of $A$ using this method we form a fraction. The numerator is the number of elementary results that belong to $A$ (those for which $A$ occurs) and the denominator is the total number $n$ of elementary results.\n",
    "$$\n",
    "\\quad\\\\\n",
    "P(A) = \\dfrac{\\text{number of elementary results that belong to }A}{n} = \\dfrac{\\#A}{n}\n",
    "\\quad\\\\\n",
    "$$\n",
    "In the example of the dice this implies $P(A) = P(\\text{even result}) = \\dfrac{3}{6} = \\dfrac{1}{2}$.\n",
    "\n",
    "## Applying Laplace's Method\n",
    "\n",
    "+ Using Laplace's Method and Combinatorics (don't panic) to count the number of elementary results in $A$ \n",
    "  ![](https://media.giphy.com/media/l0XtbC8EniiuwAEOQn/giphy-downsized-large.gif)  \n",
    "  it is possible to answer questions such as:\n",
    "    + What is the probability that the sum of the results when throwing two dice equals 7?\n",
    "    + What is the probability of getting exactly one six when throwing three dice?\n",
    "    + If a box contains 20 cards with the numbers 1 to 20 printed on them and we choose two cards at random: what is the probability that those cards are precisely the cards with numbers 1 and 20? Does the probability depend on whether you pull the two cards at once or one after the other (without replacement in the second case)? And what happens if there is replacement?\n",
    "\n",
    "+ Problems like these illustrate the fact that in order to make an effective use of Laplace's method you need to get *good at counting*. And believe me, counting is one of the hardest things to do in Mathematics.\n",
    "\n",
    "+ It is also important to understand that Laplace's Method can not be considered a *definition* of Probability. For starters, it would be a circular definition, because it relies on the elementary results being *equally probable*. And more importantly it fails to provide the answer in situations where intuition leads to a clear solution, such as the following:\n",
    "+ Pick a real number $x$ at random in the interval $[0, 1]$. What is the probability that $1/3 \\leq x \\leq 2/3$? What is your intuition telling you (shouting, actually)? But now try to apply our method. How many possible values of $x$ do we have? An infinite number. And how many of those belong to $[1/3, 2/3]$? Again, an infinite number. Thus we are left with $\\infty / \\infty$, which in this context is useless. However our intuition is clear, and we can use Python to check that it is good.\n",
    "\n",
    "+ **Exercise:** use NumPy to select tens or hundreds of thousands of points $x$ at random in the $[0, 1]$ interval. Now find the relative frequency of the event $1/3 \\leq x \\leq 2/3$.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "c3daa038",
   "metadata": {},
   "outputs": [],
   "source": [
    "# %load \"./exclude/S03-004.py\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b15ca359",
   "metadata": {},
   "source": [
    "+ Laplace's Method was not designed to deal with continuous variables and phenomena. Thus, we need a more general definition of Probability."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9097e1d7",
   "metadata": {},
   "source": [
    "## Axiomatic Theory of Probability \n",
    "\n",
    "+ The theoretical details quickly become involved. But keeping things informal and simple, there are three basic ingredients in our vision of Probability:  \n",
    "\n",
    "  (a) First we need a *sample space* $\\Omega$ which is the set of all possible results of an experiment. You may also think that $\\Omega$ is the *population of interest* in the same sense that we have been using that term until now.   \n",
    "    \n",
    "  (b) A *random event* is  (almost) any subset of $\\Omega$ (we exclude those subsets which are *too weird* in a mathematically precise sense).   \n",
    "    \n",
    "  (c) A *probability function* that we denote with the letter $P$. This function must assign a number $P(A)$ to every random event $A$ in $\\Omega$. And to be a probability the function $P$ should have these three properties (the *axioms of probability*):\n",
    "    \n",
    "    1. $P(\\Omega) = 1$\n",
    "    2. For every random event $A$, we have $0\\leq P(A)\\leq 1$.\n",
    "    3. If $A_1$ and $A_2$ are two random events then \n",
    "    $$P(A_1\\cup A_2) = P(A_1) + P(A_2) - P(A_1 \\cap A_2)$$\n",
    "\n",
    "+ Here $A_1\\cup A_2$ is the *union* of the random events while $A_1 \\cap A_2$ is their *intersection*, as illustrated by the Venn diagram below (exchange probabilities with areas and imagine that the area of the rectangle is 1).\n",
    "![](fig/03-fig02-DiagramaVennInterseccionSucesos.png)\n"
   ]
  },
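  {
   "cell_type": "markdown",
   "id": "4b8d2f31",
   "metadata": {},
   "source": [
    "+ As a quick check of property 3, here is a worked example with a single roll of a fair dice (the two events are chosen here just for illustration). Take $A_1 = \\{2, 4, 6\\}$ (even result) and $A_2 = \\{4, 5, 6\\}$ (result greater than 3), so that $A_1\\cap A_2 = \\{4, 6\\}$ and $A_1\\cup A_2 = \\{2, 4, 5, 6\\}$. Then\n",
    "$$P(A_1\\cup A_2) = \\dfrac{3}{6} + \\dfrac{3}{6} - \\dfrac{2}{6} = \\dfrac{4}{6} = \\dfrac{2}{3}$$\n",
    "which agrees with counting the four elementary results in $A_1\\cup A_2$ directly. Subtracting $P(A_1\\cap A_2)$ corrects for the results 4 and 6, which would otherwise be counted twice."
   ]
  },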
  {
   "cell_type": "markdown",
   "id": "a9e9728a",
   "metadata": {},
   "source": [
    "## Further Properties of Probability \n",
    "\n",
    "+ In the next session we will see concrete and useful examples of how to construct those probability functions, both for discrete and continuous problems.\n",
    "\n",
    "+ The probability of the *empty or null event* $\\emptyset$ is $0$; that is $P(\\emptyset)=0.$  \n",
    "+ Two events $A_1$ and $A_2$ are said to be *disjoint or mutually exclusive* if they have an empty intersection; thus they can not occur simultaneously. If that is the case:\n",
    "$$P(A_1\\cup A_2) = P(A_1) + P(A_2)$$\n",
    "+  If $A$ is a random event, the *complementary event* $A^c$ is defined as \"$A$ does not occur\". And we always have:\n",
    "$$P(A^c)=1-P(A)$$\n",
    "\n",
    "+ If $A\\subset B$ (that is, $A$ is a subset of $B$) then\n",
    "$$P(A)\\leq P(B)$$ \n",
    "\n",
    "+ **Exercise:** Find the probability that a four-digit number (like a PIN) has some repeated digit (two or more digits are equal). Use Python to check your result."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "0137c2cd",
   "metadata": {},
   "outputs": [],
   "source": [
    "# %load \"./exclude/S03-005.py\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b11ca757",
   "metadata": {},
   "source": [
    "# Conditional Probability and Independence\n",
    "\n",
    "## Conditional Probability\n",
    "\n",
    "+ The idea of *conditional probability* tries to capture the fact that *additional information* may alter our estimation of the probability $P(A)$ of an event.  \n",
    "\n",
    "+ *Example.* What is the probability of getting an even number as the result of a single roll of a dice? The answer is clearly 0.5. But what if I told you, without revealing the result, that the outcome is a number greater than 3? Would you still think that the probability is 0.5?\n",
    "\n",
    "+ In situations like this we want to update the probability assigned to $A$ *using the knowledge that another event $B$ has certainly occurred*. This updated probability is called the **probability of $A$ conditioned on $B$ (or given $B$)** and it is denoted by $P(A\\,|\\,B)$. The definition, which can be justified through Laplace's method and requires $P(B) > 0$, is this:\n",
    "$$P(A\\,|\\,B) = \\dfrac{P(A \\cap B)}{P(B)}$$\n",
    "\n",
    "+ **Example (continued):** In the example of a single dice roll:\n",
    "$$P(\\text{even result}|\\text{result > 3}) = \n",
    "\\dfrac{P(\\text{even result and also result > 3})}{P(\\text{result > 3})} = \n",
    "\\dfrac{2/6}{3/6} =\\dfrac{2}{3} \n",
    "$$\n",
    "\n",
    "+ **Example:** The [Monty Hall Problem (link in Spanish)](http://docentes.educacion.navarra.es/msadaall/geogebra/figuras/azar_monty.htm) is a well known example of how additional information alters our estimation of probabilities, and it is also used to illustrate that probabilistic intuition tends to be weak in most of us ([see also](https://www.geogebra.org/m/wa5qtjpp))."
   ]
  },
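  {
   "cell_type": "markdown",
   "id": "c5e7a942",
   "metadata": {},
   "source": [
    "+ The dice example above can also be checked by simulation. This is a minimal sketch (the variable names are chosen here only for illustration): we roll a fair dice many times, keep only the rolls where the result is greater than 3, and look at the relative frequency of even results among them. We also compute the quotient that appears in the definition, using relative frequencies in place of probabilities.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "rng_demo = np.random.default_rng(2022)  # seed for reproducibility\n",
    "\n",
    "n_rolls = 100000\n",
    "rolls = rng_demo.integers(1, 7, size=n_rolls)  # fair dice: integers from 1 to 6\n",
    "\n",
    "is_even = (rolls % 2 == 0)  # event A: even result\n",
    "greater_3 = (rolls > 3)     # event B: result greater than 3\n",
    "\n",
    "# Relative frequency of A among the rolls where B occurred\n",
    "p_A_given_B = is_even[greater_3].mean()\n",
    "\n",
    "# The same quantity computed as in the definition: P(A and B) / P(B)\n",
    "p_ratio = (is_even & greater_3).mean() / greater_3.mean()\n",
    "\n",
    "print('P(even) is approximately', is_even.mean())\n",
    "print('P(even | result > 3) is approximately', p_A_given_B)\n",
    "print('P(even and result > 3) / P(result > 3) is approximately', p_ratio)\n",
    "```\n",
    "\n",
    "  The last two values coincide, and both should be close to $2/3$ rather than to $1/2$."
   ]
  },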
  {
   "cell_type": "markdown",
   "id": "071c2195",
   "metadata": {},
   "source": [
    "## Independence\n",
    "\n",
    "\n",
    "+ Random events $A$ and $B$ are **independent** if knowing for certain that $B$ has occurred does not affect our estimation of the probability of $A$ happening. That is, independence means that $P(A\\,|\\,B) = P(A)$. There is another expression for this that makes the symmetry of the definition apparent:\n",
    "$$\n",
    "A\\text{ and }B\\text{ are independent means }P(A\\cap B) = P(A)\\,P(B)\n",
    "$$\n",
    "\n",
    "+ **Make sure that you see the difference between \"independent\" and \"mutually exclusive\".** Mutually exclusive events with positive probabilities can never be independent: if one of them occurs, the other one certainly does not (for Spanish speakers, remember that *incompatible* means mutually exclusive). \n",
    "\n",
    "+ This notion of independence is a mathematical abstraction. Real world phenomena almost always fail to be independent in this strict mathematical sense. Later in the course we will discuss how to deal with the problem of independence in practice. "
   ]
  },
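  {
   "cell_type": "markdown",
   "id": "d9f04b67",
   "metadata": {},
   "source": [
    "+ With the complement rule and independence we can revisit the two de Méré games. The computation below is a sketch that assumes fair dice and independent rolls. In each game, losing means that *none* of the rolls is successful, and under independence the probability of that is a product:\n",
    "$$P(\\text{at least one six in 4 rolls}) = 1 - \\left(\\dfrac{5}{6}\\right)^{4} = \\dfrac{671}{1296} \\approx 0.5177$$\n",
    "$$P(\\text{at least one double six in 24 rolls}) = 1 - \\left(\\dfrac{35}{36}\\right)^{24} \\approx 0.4914$$\n",
    "Under these assumptions the first bet is slightly favorable and the second one slightly unfavorable, and neither probability is close to the naive value $2/3$. The naive sum fails because events such as \"six in the first roll\" and \"six in the second roll\" are not mutually exclusive, so their probabilities can not simply be added."
   ]
  },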
  {
   "cell_type": "markdown",
   "id": "a96ea664",
   "metadata": {},
   "source": [
    "# Bayes Rule"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b3df4b9a",
   "metadata": {},
   "source": [
    "## Total Probability\n",
    "\n",
    "+ This result can be used to obtain the probability of an event that can occur due to one of several mutually exclusive mechanisms or pathways, as illustrated in the following example.\n",
    "\n",
    "+ **Example:** a factory has three different machines producing the same component of a product, under these conditions:\n",
    "  \n",
    "  (a) Each component comes from precisely one of the machines.  \n",
    "  (b) Each one of the machines outputs a known and fixed fraction of the total component production.  \n",
    "  (c) Each one of the machines has a known rate of defective components.\n",
    " \n",
    " If all of this information is known, what is the total rate of defective components for the factory as a whole?\n",
    " ![](./fig/03-fig03-ProbabilidadTotal.png)\n",
    "\n",
    "\n",
    "+ Let $A$ be the event \"the component is defective\". We are trying to obtain $P(A)$. And for $i = 1, 2, 3$ let $M_i$ be the event \"the component comes from machine $i$\".\n",
    "\n",
    "+ We assume (from (b) above) that the probabilities $P(M_1), P(M_2), P(M_3)$ are known. Similarly (from (c)) we assume that the *conditional probabilities* $P(A|M_1), P(A|M_2), P(A|M_3)$ are known as well.\n",
    "\n",
    "+ In a situation such as this one the **Total Probability Theorem** says that:\n",
    "$$\n",
    "\\quad\\\\\n",
    "P(A) = P(A\\,|\\,M_1) P(M_1) + P(A\\,|\\,M_2) P(M_2) + P(A\\,|\\,M_3) P(M_3)\n",
    "\\quad\\\\\n",
    "$$\n",
    "Note that the sum contains one term for each machine / pathway."
   ]
  },
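  {
   "cell_type": "markdown",
   "id": "totalprob-sketch",
   "metadata": {},
   "source": [
    "+ **Worked example (a sketch with made-up numbers, chosen only for illustration):** suppose that the machines produce 50%, 30% and 20% of the components, so that $P(M_1) = 0.5$, $P(M_2) = 0.3$, $P(M_3) = 0.2$, and that their defective rates are 1%, 2% and 5%, so that $P(A\\,|\\,M_1) = 0.01$, $P(A\\,|\\,M_2) = 0.02$, $P(A\\,|\\,M_3) = 0.05$. Then the Total Probability Theorem gives:\n",
    "$$\n",
    "P(A) = 0.01\\cdot 0.5 + 0.02\\cdot 0.3 + 0.05\\cdot 0.2 = 0.005 + 0.006 + 0.010 = 0.021\n",
    "$$\n",
    "That is, with these assumed figures 2.1% of the components produced by the factory would be defective."
   ]
  },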
  {
   "cell_type": "markdown",
   "id": "110821bf",
   "metadata": {},
   "source": [
    "## Bayes Theorem\n",
    "\n",
    "+ This theorem applies to situations just like the one we have seen for the Total Probability, but its mission is to answer the *reversed question*:  \n",
    "\n",
    "  Knowing that the component is defective, what is the probability that is comes e.g. from machine $M_1$?\n",
    "  \n",
    "+ The goal now is therefore to obtain $P(M_1 | A)$, and Bayes Theorem provides the answer:  \n",
    "$$\n",
    "\\quad\\\\\n",
    "P(M_1\\,|\\, A) = \n",
    "\\dfrac{P(A\\,|\\, M_1) P(M_1)}{P(A\\,|\\, M_1) P(M_1) + P(A\\,|\\, M_2) P(M_2)+ P(A\\,|\\, M_3) P(M_3)} =\n",
    "\\dfrac{P(A\\,|\\, M_1) P(M_1)}{P(A)}\n",
    "\\quad\\\\\n",
    "$$\n",
    "Please note the last expression: thanks to *Total Probability*, the denominator is $P(A)$."
   ]
  },
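  {
   "cell_type": "markdown",
   "id": "bayes-sketch",
   "metadata": {},
   "source": [
    "+ **Worked example (a sketch with made-up numbers, chosen only for illustration):** suppose that $P(M_1) = 0.5$, $P(M_2) = 0.3$, $P(M_3) = 0.2$ and that the defective rates are $P(A\\,|\\,M_1) = 0.01$, $P(A\\,|\\,M_2) = 0.02$, $P(A\\,|\\,M_3) = 0.05$. Bayes Theorem then gives:\n",
    "$$\n",
    "P(M_1\\,|\\,A) = \\dfrac{0.01\\cdot 0.5}{0.01\\cdot 0.5 + 0.02\\cdot 0.3 + 0.05\\cdot 0.2} = \\dfrac{0.005}{0.021} = \\dfrac{5}{21} \\approx 0.238\n",
    "$$\n",
    "With these assumed figures, machine 1 makes half of all the components but is responsible for fewer than a quarter of the defective ones. The same computation for machine 3 gives $P(M_3\\,|\\,A) = 0.010 / 0.021 = 10/21 \\approx 0.476$."
   ]
  },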
  {
   "cell_type": "markdown",
   "id": "cb8822a4",
   "metadata": {},
   "source": [
    "+ **Example / Exercise:** the hardest part when using Bayes theorem tends to be retrieving the required information from the data we are provided with.  \n",
    "Suppose that a hospital has to operating theaters or rooms. In the first one $OR_1$ there have been incidents in 20% of the procedures, while in the second one $OR_2$ the incident rate is just 4%. Assume that the number of operations is the same in both operation rooms. When evaluating the hospital performance we choose at random the report describing one procedure and discover that there was an incident during that procedure. What is the probability that the report corresponds to a procedure conducted in $OR_1$?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "72028796",
   "metadata": {},
   "source": [
    "+ **Example / Exercise:** we will be using a [data set called *spam*](https://raw.githubusercontent.com/mbdfmad/fmad2223/main/data/spam.csv), a classical example containing information about a sample of email messages. The table contains information about thousands of messages. Each row shows (corresponding to a message) the relative frequency of appearance in the body of that  message of the words from a certain set. That set of words appears as header of the first 57 columns in this table (that is, there is a column for each word).  The last column classifies the message as spam / no spam. \n",
    "\n",
    "  Use Python to read the data and answer this questions:\n",
    "\n",
    "  + What is the probability that a message picked at random is spam?\n",
    "  + What is the probability that a message picked at random contains the word *order*?\n",
    "  + Knowing that a message is spam, what is the probability that it contains the word *order*?\n",
    "  + Finally, using Bayes Theorem, knowing that a message contains the word *order*, what is the probability that is indeed spam?\n",
    "  \n",
    "  This is a very simple approach to the problem of detecting spam. But combined with the *Naive Bayes* method of Machine Learning, methods like this one were the basis of the first generation of antispam filters. \n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5fbd6f1c",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.17"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}