1. Concepts & Definitions
1.1. Experiment, observation, and sample space
1.2. Sample space: Venn and Tree diagram
1.3. Simple and composite events
1.4. Three definitions of probability
1.5. Law of large numbers and its consequences
1.6. Frequency and empirical probability
2. Problem & Solution
2.4. Frequency of categories from tables
2.5. Simple and marginal probabilities
2.6. Conditional probabilities
Two events A and B are independent if the occurrence of one of them does not affect the probability of occurrence of the other. Then the following equations must hold.
P(A|B) = P(A)
P(B|A) = P(B)
To better understand the concept of independent events, let's revisit the experiment described in previous sections.
The next figure illustrates an experiment that collects data about product flow through a certain process at a port and classifies each product according to two categories: flow direction (importation or exportation) and process time (≤ 5 hours or > 5 hours).
The next table summarizes the result after the collection of one hundred values for the process.
Before computing probabilities, it is necessary to define the following events for a product picked at random from the process data in the table:
• I - event of finding an importation product,
• E - event of finding an exportation product,
• L - event of finding a product that took less than or equal to five hours in the process,
• G - event of finding a product that took more than five hours in the process.
Remember the computations for simple or marginal probability from Track 04 - Section 2.5:
P(I) = (15+4)/100 = 19/100 = 0.19
P(E) = (45+36)/100 = 0.81
P(L) = (15+45)/100 = 0.60
P(G) = (4+36)/100 = 0.40
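As a minimal sketch, the marginal probabilities above can be reproduced with pandas. The four counts (15, 45, 4, 36) are the ones used in the manual computations; the names `port_counts`, `Import`, `Export`, `LE5h`, and `GT5h` are illustrative labels assumed here, not names taken from the course notebook.

```python
import pandas as pd

# Counts as used in the manual computations (labels are assumed for illustration)
port_counts = pd.DataFrame(
    {"LE5h": [15, 45], "GT5h": [4, 36]},
    index=["Import", "Export"],
)

total = port_counts.to_numpy().sum()       # grand total of observations

# Marginal probabilities: row totals and column totals over the grand total
p_flow = port_counts.sum(axis=1) / total   # P(I), P(E)
p_time = port_counts.sum(axis=0) / total   # P(L), P(G)

print(p_flow.round(2).to_dict())
print(p_time.round(2).to_dict())
```

Row totals give the probabilities of the flow events (I, E), and column totals give the probabilities of the process-time events (L, G).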
Remember the computation for conditional probability from Track 04 - Section 2.6:
Given that event L occurred, consider only the products with a process time of at most 5 hours:
P(I|L) = (15)/60 = 0.25
P(E|L) = (45)/60 = 0.75
Given that event G occurred, consider only the products with a process time of more than 5 hours:
P(I|G) = (4)/40 = 0.10
P(E|G) = (36)/40 = 0.90
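The four conditional probabilities above can also be obtained in one step by dividing each column of the count table by its own total. This is a sketch under the same assumptions as the manual computation; the labels `LE5h`, `GT5h`, `Import`, and `Export` are illustrative, not taken from the course notebook.

```python
import pandas as pd

# Counts as used in the manual computations (labels are assumed for illustration)
port_counts = pd.DataFrame(
    {"LE5h": [15, 45], "GT5h": [4, 36]},
    index=["Import", "Export"],
)

# Divide every column by its own total: each column then holds
# P(flow | process-time class), and each column sums to 1.
p_flow_given_time = port_counts / port_counts.sum(axis=0)

print(p_flow_given_time.round(2))
```

Conditioning on a process-time event restricts attention to one column, which is why the divisor is the column total (60 or 40) rather than the grand total.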
Now, to verify whether events I and G, or E and G, are independent, the following equations should be checked:
P(I|G) = P(I) or P(G|I) = P(G)?
P(E|G) = P(E) or P(G|E) = P(G)?
It is possible to obtain the following:
P(I|G) = 0.10 and P(I) = 0.19, which means G and I are not independent events in terms of probability of occurrence.
P(E|G) = 0.90 and P(E) = 0.81, which means G and E are not independent events in terms of probability of occurrence.
Now, to verify whether events I and L, or E and L, are independent, the following equations should be checked:
P(I|L) = P(I) or P(L|I) = P(L)?
P(E|L) = P(E) or P(L|E) = P(L)?
It is possible to obtain the following:
P(I|L) = 0.25 and P(I) = 0.19, which means L and I are not independent events in terms of probability of occurrence.
P(E|L) = 0.75 and P(E) = 0.81, which means L and E are not independent events in terms of probability of occurrence.
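The checks above compare only one of the two equations in each pair. The sketch below evaluates both directions, for instance P(I|G) against P(I) and P(G|I) against P(G), using plain Python. The helper names `marginal`, `conditional`, and `independent`, and the tolerance value, are assumptions made for illustration.

```python
import math

# Joint counts from the port example (keys are illustrative labels)
counts = {("I", "L"): 15, ("I", "G"): 4, ("E", "L"): 45, ("E", "G"): 36}
total = sum(counts.values())

def marginal(event):
    """P(event): counts of all cells containing the event, over the total."""
    return sum(n for cell, n in counts.items() if event in cell) / total

def conditional(a, b):
    """P(a|b) = count(a and b) / count(b), for a and b from different categories."""
    joint = sum(n for cell, n in counts.items() if a in cell and b in cell)
    given = sum(n for cell, n in counts.items() if b in cell)
    return joint / given

def independent(a, b, tol=1e-9):
    """True when P(a|b) equals P(a) up to a small tolerance."""
    return math.isclose(conditional(a, b), marginal(a), abs_tol=tol)

for a, b in [("I", "G"), ("G", "I"), ("E", "L"), ("L", "E")]:
    print(f"P({a}|{b}) = {conditional(a, b):.2f}, P({a}) = {marginal(a):.2f}, "
          f"independent: {independent(a, b)}")
```

Both directions lead to the same conclusion for a given pair of events, so in practice checking one of the two equations is enough.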
The following steps serve as a guideline for using simple and conditional probabilities to verify, with Python code, whether two events are independent:
Load the notebook with the commands developed in step 2.6 (click on the link):
https://colab.research.google.com/drive/1T7SaI0gEDf11FEKMB7cXIPNRyHOyaTkC?usp=sharing
Consider a second, simple situation at a port, where importation containers are classified by whether an issue was found, grouped according to their ProductCode (another option would be to group them by origin, using the port's historical records). For didactic purposes, suppose the container type can be either Vegetable or FoodProduct. Let's work with the products classified by ProductCode and by whether or not they have issues:
Before computing probabilities, it is necessary to define the following events for a product picked at random from the process data in the table:
• V - event of finding a vegetable product,
• F - event of finding a food product,
• N - event of finding a product without an issue,
• I - event of finding a product with an issue.
Recall from previous sections how to compute the simple probabilities of these events:
P(V) = (1900+100)/5000 = 2000/5000 = 0.40
P(F) = (2900+100)/5000 = 0.60
P(N) = (4800)/5000 = 0.96
P(I) = (200)/5000 = 0.04
Recall from previous sections how to compute the conditional probabilities of these events:
P(N|V) = (1900)/2000 = 0.95
P(I|V) = (100)/2000 = 0.05
P(N|F) = (2900)/3000 ≈ 0.97
P(I|F) = (100)/3000 ≈ 0.03
Now several questions about probability independence can be answered. To verify whether events N and V, or I and V, are independent, the following equations should be checked:
P(N|V) = P(N) or P(V|N) = P(V)?
P(I|V) = P(I) or P(V|I) = P(V)?
It is possible to obtain the following:
P(N|V) = 0.95 and P(N) = 0.96, which means N and V are not independent events in terms of probability of occurrence.
P(I|V) = 0.05 and P(I) = 0.04, which means I and V are not independent events in terms of probability of occurrence.
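The second equation of each pair, P(V|N) = P(V) and P(V|I) = P(V), can be checked in the same way. The sketch below conditions on the issue status by dividing each row of the count table by its row total. The variable names `counts`, `given_status`, `p_v`, `p_v_given_n`, and `p_v_given_i` are illustrative assumptions.

```python
import pandas as pd

# Same counts as in the example (rows: issue status, columns: product type)
counts = pd.DataFrame(
    [[1900, 2900], [100, 100]],
    index=["NoIssues", "Issues"],
    columns=["Vegetable", "FoodProduct"],
)

total = counts.to_numpy().sum()             # 5000 products
p_v = counts["Vegetable"].sum() / total     # P(V) = 2000 / 5000

# Divide each row by its row total to condition on the issue status
given_status = counts.div(counts.sum(axis=1), axis=0)
p_v_given_n = given_status.loc["NoIssues", "Vegetable"]   # 1900 / 4800
p_v_given_i = given_status.loc["Issues", "Vegetable"]     # 100 / 200

print(f"P(V)   = {p_v:.4f}")
print(f"P(V|N) = {p_v_given_n:.4f}")
print(f"P(V|I) = {p_v_given_i:.4f}")
```

Neither conditional value equals P(V), which agrees with the conclusion reached from P(N|V) and P(I|V).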
The previous manual calculations can also be carried out in Python, as in the following code:
import pandas as pd

# Contingency table: rows are issue status, columns are product type
col = ['Vegetable', 'FoodProduct']
row = ['NoIssues', 'Issues']
dat = [[1900, 2900], [100, 100]]
df_impp = pd.DataFrame(data=dat, index=row, columns=col)

# Append the row totals (column 'SumI') and the column totals (row 'SumP')
df_impp.loc[:,'SumI'] = df_impp.sum(axis=1)
df_impp.loc['SumP',:] = df_impp.sum(axis=0)

# Marginal probabilities: each total divided by the grand total
PNI = df_impp.loc['NoIssues','SumI']/df_impp.loc['SumP','SumI']
PI = df_impp.loc['Issues','SumI']/df_impp.loc['SumP','SumI']
PV = df_impp.loc['SumP','Vegetable']/df_impp.loc['SumP','SumI']
PF = df_impp.loc['SumP','FoodProduct']/df_impp.loc['SumP','SumI']
print("PNI = ",PNI)
print("PI = ",PI)
print("PV = ",PV)
print("PF = ",PF)
The following probability values will appear:
PNI = 0.96
PI = 0.04
PV = 0.4
PF = 0.6
The following code computes the conditional probabilities from their definition, dividing each cell count by the total of the conditioning column:
# PVN = P(NoIssues|Vegetable), PFN = P(NoIssues|FoodProduct)
PVN = df_impp.loc['NoIssues','Vegetable']/df_impp.loc['SumP','Vegetable']
PFN = df_impp.loc['NoIssues','FoodProduct']/df_impp.loc['SumP','FoodProduct']
# PVI = P(Issues|Vegetable), PFI = P(Issues|FoodProduct)
PVI = df_impp.loc['Issues','Vegetable']/df_impp.loc['SumP','Vegetable']
PFI = df_impp.loc['Issues','FoodProduct']/df_impp.loc['SumP','FoodProduct']
print("P(NoIssues|veg) = ",PVN)
print("P(Issues|veg) = ",PVI)
print("P(NoIssues|food) = ",PFN)
print("P(Issues|food) = ",PFI)
The following probability values will appear (shown here rounded to two decimal places):
P(NoIssues|veg) = 0.95
P(Issues|veg) = 0.05
P(NoIssues|food) = 0.97
P(Issues|food) = 0.03
The following code compares the conditional and marginal probabilities to verify whether the events 'No Issues' and 'Vegetables', and 'Issues' and 'Vegetables', are independent:
if PVN == PNI:
    print("The events 'No issues' and 'vegetables' are independent!")
else:
    print("The events 'No issues' and 'vegetables' are not independent!")
print('P(NoIssues|veg) = ',PVN)
print('P(NoIssues) = ',PNI)
The following output will appear:
The events 'No issues' and 'vegetables' are not independent!
P(NoIssues|veg) = 0.95
P(NoIssues) = 0.96
if PVI == PI:
    print("The events 'Issues' and 'vegetables' are independent!")
else:
    print("The events 'Issues' and 'vegetables' are not independent!")
print('P(Issues|veg) = ',PVI)
print('P(Issues) = ',PI)
The following output will appear:
The events 'Issues' and 'vegetables' are not independent!
P(Issues|veg) = 0.05
P(Issues) = 0.04
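The exact comparison of floating-point numbers is adequate for the single divisions used in this example, but it can give a misleading answer when a probability is built from several arithmetic steps, such as a sum of joint probabilities. As a hedged sketch, the comparison can be made with a tolerance using `math.isclose`. The values below are hypothetical, chosen only to expose the rounding effect, and the tolerance of 1e-9 is an assumption.

```python
import math

# Hypothetical values chosen only to expose float rounding; not course data.
p_a = 0.1 + 0.2        # marginal built by adding two joint probabilities
p_a_given_b = 0.3      # conditional probability

print(p_a == p_a_given_b)                             # exact comparison: False
print(math.isclose(p_a, p_a_given_b, abs_tol=1e-9))   # tolerance-based: True
```

With real data, the tolerance should reflect how much difference is considered negligible for the problem at hand.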
The Python code for the previous calculations is available in the following notebook:
https://colab.research.google.com/drive/1q91Pat5iO7d6-2QMfm73lSjTk1ngK40N?usp=sharing