The Book of Why: The New Science of Cause and Effect

By Judea Pearl and Dana Mackenzie

Introduction: Mind Over Data

  • Humans are unique among species in that we ask "why?"
  • Causal inference is the serious attempt to answer "why?" such as the effect of a medicine, tax law, or policy
  • Until recently, science has not given a language to answer this question
  • Mathematical equations are not sufficient. For example, the equation relating atmospheric pressure to a barometer reading is symmetric, so it could equally mean that the pressure causes the barometer reading to change or that the barometer reading causes the pressure to change
  • Every statistics student learns early "correlation is not causation," but statistics itself does not answer what causation is
  • P(L | do(D)) represents the effect of drug D on lifespan L if the patient is made to take the drug
  • This differs from P(L|D), the traditional statistical expression for the probability of lifespan L given that the patient is observed taking drug D. The two are not equivalent.
  • Causal inference can predict the outcome of an intervention even without doing it.
  • A counterfactual is reasoning based on possibilities contrary to what happened.
  • People can make good judgments about this, but statistics does not have methods for it
  • In the 1980s Pearl believed that artificial intelligence needed a way to represent and understand causal relationships to achieve human-like intelligence
  • The inference engine takes in assumptions, queries, and data
  • The inference engine outputs: an answer of whether the query is answerable given the assumptions, an estimand (a recipe for generating the answer from data), and an estimate (computed from the data)
  • "Data are dumb" because they do not have a way to understand or represent cause and effect, except in the case of an RCT
  • "You are smarter than your data"
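The gap between seeing and doing can be illustrated with a small simulation. Everything below (the hidden "health" confounder and all the probabilities) is invented for illustration, not taken from the book:

```python
import random

random.seed(42)
N = 200_000

# Observational world: a hidden confounder (health) influences both
# whether a patient takes the drug and whether they live long.
takers_lived = takers = 0
for _ in range(N):
    healthy = random.random() < 0.5
    takes = random.random() < (0.8 if healthy else 0.2)  # healthy self-select
    lives = random.random() < (0.9 if healthy else 0.4) + (0.05 if takes else 0.0)
    if takes:
        takers += 1
        takers_lived += lives

p_see = takers_lived / takers  # P(L | D): survival among observed drug takers

# Interventional world: do(D=1) forces everyone to take the drug,
# erasing the arrow from health to drug-taking.
lived = 0
for _ in range(N):
    healthy = random.random() < 0.5
    lived += random.random() < (0.9 if healthy else 0.4) + 0.05

p_do = lived / N  # P(L | do(D)): survival if everyone is made to take the drug

print(f"P(L|D)     = {p_see:.2f}")  # inflated by self-selection of the healthy
print(f"P(L|do(D)) = {p_do:.2f}")   # the true causal effect of the drug
```

The observed conditional probability comes out well above the interventional one, even though the drug's true effect is the same small +0.05 in both worlds.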

Chapter 1: The Ladder of Causation

  • Data (also called facts) are glued together with cause-and-effect relationships
  • Most of our knowledge consists of these relationships, not the facts themselves
  • When considering a strategy, such as hunting a mammoth, a person draws upon the causal relationships and compares different options, such as taking a different number of hunters
  • In this way, a person can imagine unseen worlds
  • The three levels of causation are: association (seeing one thing increases the likelihood of another), intervention (doing one thing causes another to happen), and counterfactual (imagining a different action and its outcome)
  • On the first level of causation, good predictions can be made without understanding the causes
  • Traditional statistics has no language to articulate intervention questions such as "What would happen if we...?"
  • As a symbolic representation, the query "What is the probability we will sell floss at a given price if we increase the price of toothpaste?" is written P(floss | do(toothpaste))
  • The "do operator" is unique to causal inference
  • Alan Turing proposed a practical test called the imitation game to test whether a computer could think like a person.
  • If people interacting with it believed it spoke like a person, then the computer passed the test
  • Today chatbots compete in contests based on this test
  • The mini-Turing test, which is a different test, is to give the computer a causal model of reality and then ask it to answer queries
  • Pearl presents the causal scenario of a court order to execute a prisoner. The captain tells two shooters to fire at the same time. If either fires, the prisoner dies. This scenario is represented in a causal diagram.
  • With a causal model of this scenario, we can answer a question about an event that never happened: what if shooter A decides on his own to fire? There is no data for this (it is a counterfactual), but we can still answer what happens using the diagram by erasing the arrow into shooter A (from the captain) and setting A to true. The arrow from shooter A to the prisoner's death remains.
  • Even a diagram without numbers (i.e., data) can be used to answer many kinds of queries.
  • Philosophers and statisticians have struggled with causal inference, including how to define the causal relationship between two variables.
  • Confounding is a common problem: there may be a variable C that causes both A and B. However, confounding is itself a causal concept, so there is no way to define it in probabilistic terms. We need another language.
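The firing-squad scenario above can be encoded as a tiny deterministic model. The function and variable names here are my own sketch, not the book's notation:

```python
def firing_squad(court_order, a_do=None):
    """Deterministic causal model of the firing-squad scenario.

    a_do=None : shooter A obeys the captain (no intervention)
    a_do=True : do(A=True) -- erase the captain->A arrow, force A to fire
    """
    captain = court_order                  # captain signals iff court orders
    a = captain if a_do is None else a_do  # shooter A (arrow possibly erased)
    b = captain                            # shooter B always obeys the captain
    death = a or b                         # either shot kills the prisoner
    return death

# No court order: the prisoner lives.
print(firing_squad(court_order=False))             # False
# Counterfactual: A fires on his own despite no order -- prisoner dies,
# because the A -> death arrow remains even after the captain's is erased.
print(firing_squad(court_order=False, a_do=True))  # True
```

No numbers or data are involved; the diagram's structure alone answers the query, as the chapter emphasizes.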

Chapter 2: From Buccaneers to Guinea Pigs: The Genesis of Causal Inference

  • In 1877 Francis Galton presented his quincunx to the Royal Institution of Great Britain. Today it is known as Plinko. Balls are dropped in at one point, and as they fall, they hit pegs that divide their paths. They settle into the bottom slots in a bell-shaped distribution with great regularity. Galton did this to make a point about heredity.
  • However, while heredity influences human stature, stature does not spread out generation after generation like the quincunx: there are not many seven-foot or three-foot-tall people. So Galton made anthropometric measurements of fathers and their sons. (He also studied eminence, thinking it was heritable.) He discovered that taller-than-average fathers had tall but less exceptionally tall sons, which he called regression towards mediocrity. Later it was called regression towards the mean. This was the origin of linear regression.
  • Remarkably, the reverse relationship also held: taller-than-average sons had tall but less exceptionally tall fathers, even though the temporal order defies a causal explanation. This is not the paradox it seems, because the tallest fathers and the tallest sons are different populations. Regression itself carries no information about causation.
  • Later Karl Pearson developed the correlation coefficient.
  • Galton started looking for a causal explanation, but ended up developing correlation. Pearson, who believed in a philosophy called positivism, worked to remove causal inference from statistics.
  • In reality, success comes from talent and luck, but luck is not heritable.
  • While studying pigmentation of guinea pigs, Sewall Wright developed path diagrams to answer causal questions. His diagram included developing factors before birth, environmental factors after birth, and genetic factors.
  • He also developed a causal diagram for the birth weight of pups that accounted for the confounding effect of litter size, and he produced a biologically meaningful result (3.34 grams per day). The previous number, 5.66 grams per day, was not biologically meaningful because it was confounded.
  • Developing a causal diagram requires scientists to draw upon their expert knowledge, while statistics requires the statistician only to follow canned procedures. Some prefer "mere" statistics because it seems more objective.
  • Similarly, Bayesian statistics requires a prior belief, and it has also struggled to achieve mainstream status. In Bayesian reasoning, new information is combined with prior beliefs to come up with revised beliefs.
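Galton's two-way regression effect is easy to reproduce in simulation. The population mean, spread, and father-to-son slope below are invented for illustration:

```python
import random

random.seed(1)
MEAN, SD, SLOPE = 69.0, 2.5, 0.5  # assumed height mean/sd (inches) and slope

pairs = []
for _ in range(100_000):
    father = random.gauss(MEAN, SD)
    # Sons regress halfway toward the mean, with noise chosen so that
    # sons have the same marginal spread as fathers (no quincunx blow-up).
    noise_sd = SD * (1 - SLOPE ** 2) ** 0.5
    son = MEAN + SLOPE * (father - MEAN) + random.gauss(0, noise_sd)
    pairs.append((father, son))

sons_of_tall = [s for f, s in pairs if f > MEAN + 3]
fathers_of_tall = [f for f, s in pairs if s > MEAN + 3]

avg_son = sum(sons_of_tall) / len(sons_of_tall)
avg_father = sum(fathers_of_tall) / len(fathers_of_tall)
# Both averages land between MEAN and MEAN + 3: tall but less exceptional.
print(f"sons of tall fathers average  {avg_son:.1f}")
print(f"fathers of tall sons average  {avg_father:.1f}")
```

The symmetry falls out of the joint distribution alone, which is exactly why regression by itself says nothing about causal direction.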

Chapter 3: From Evidence to Causes: Reverend Bayes Meets Mr. Holmes

  • Induction is the process of developing hypotheses from data. Sherlock Holmes was famous for it, and induction is a central issue in artificial intelligence (AI).
  • Bonaparte is a software program, based on Bayesian networks, that was used to match human remains after the 2014 crash of Malaysia Airlines Flight 17. When a victim's DNA was not available, it could use the DNA of multiple family members.
  • In the 18th century, Reverend Thomas Bayes developed a formula for inverse probability.
  • A forward probability reasons from a known cause to its effect; the Bayesian method starts with an observed effect and reasons back to the cause, which is more difficult.
  • An example of forward probability: given a table's length, what is the probability that a billiard ball shot onto it will come to rest within X feet of the left-hand edge?
  • The corresponding inverse probability: we observe the resting place of the ball and want to infer the length of the table.
  • Pearl developed Bayesian networks, modeled after neural networks and the human brain. They pass conditional probabilities in one direction and likelihood ratios in the other.
  • Bayes developed the most simple kind of Bayesian Network
  • There are three basic junctions in a Bayesian network
  • The first junction, A -> B -> C, is called a chain or mediation. An example is fire -> smoke -> alarm: it is only through smoke that the alarm triggers, and the mediator (smoke) screens off information about A (fire) from C (alarm). In other words, the alarm would sound just the same if the smoke came from a non-fire source.
  • The second junction, A <- B -> C, is called a fork, and B is a common cause or confounder of A and C. An example: a child's age influences both shoe size and reading ability, so shoe size and reading ability are correlated without any causal connection between them.
  • The third junction, A -> B <- C, is called a collider. A and C are independent, but when we condition on B being high (such as selecting celebrities), we may observe a negative correlation between beauty (A) and talent (C).
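The collider junction is the least intuitive of the three, and it shows up clearly in simulated data. The "celebrity" threshold below is an arbitrary assumption of mine:

```python
import random

random.seed(7)
N = 100_000

def corr(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Beauty and talent are generated independently: no causal link at all.
beauty = [random.gauss(0, 1) for _ in range(N)]
talent = [random.gauss(0, 1) for _ in range(N)]
print(f"overall correlation:   {corr(beauty, talent):+.2f}")  # near zero

# Collider: celebrity status depends on beauty + talent being high.
celebs = [(b, t) for b, t in zip(beauty, talent) if b + t > 2]
cb = [b for b, _ in celebs]
ct = [t for _, t in celebs]
# Conditioning on the collider induces a spurious negative correlation:
# among celebrities, the less beautiful must be the more talented.
print(f"celebrity correlation: {corr(cb, ct):+.2f}")  # strongly negative
```

Nothing causal connects beauty and talent here; the negative correlation is manufactured entirely by selecting on their common effect.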

Chapter 4: Confounding and Deconfounding: Or, Slaying the Lurking Variable

Chapter 5: The Smoke-Filled Debate: Clearing the Air

Chapter 6: Paradoxes Galore!

Chapter 7: Beyond Adjustment: The Conquest of Mount Intervention

Chapter 8: Counterfactuals: Mining Worlds That Could Have Been

Chapter 9: Mediation: The Search for a Mechanism

Chapter 10: Big Data, Artificial Intelligence, and the Big Questions

Collection of causal relationships and correlations

This section is a collection of examples throughout the book

  • Chapter 3
    • Fire -> Smoke -> Alarm
    • Shoe size <- Child's age -> Reading Ability