The Book of Why: Introduction

This post continues my notes on Judea Pearl and Dana Mackenzie’s The Book of Why. It covers “Introduction: Mind Over Data.”

If I could sum up the message of this book in one pithy phrase, it would be that you are smarter than your data. Data do not understand causes and effects; humans do.

Pearl positions his argument by stating that “data are profoundly dumb,” that decades of statistical work assuming relationships could be understood from the data was misguided, and that we need to discipline data with causal thinking. “Data can tell you that the people who took a medicine recovered faster than those who did not take it, but they can’t tell you why.”

Causal inference is a “new science” that has “changed the way we distinguish facts from fiction.” Pearl claims causal inference unifies all past approaches to separating fact from fiction into a single framework that is only twenty years old. The framework is “a simple mathematical language to articulate causal relationships.” Pearl claims this mathematical language is completely new to human history. Never before have humans developed a language in which they can mathematically communicate information about causal relationships. (They came close around the development of statistics, with the work of Sewall Wright in the 1920s, and others, but didn’t quite get there.) Equations of the past specify relationships, but they do not encode causality.

The human mind is the center of causal inference, but we can program a computer to become an “artificial scientist” using the same causal inference as human minds have used for thousands of years. “Once we really understand the logic behind causal thinking, we could emulate it on modern computers.” (Pearl is a professor of computer science. It makes sense he takes the discussion of causal inference toward computing and data science.)

Pearl claims what has happened in the past two decades is that scientists discovered problems requiring a language that encodes causal relationships. Without this need, no language had been developed. Facing this need, scientists got to work creating the language as a tool to solve the new problem. “Scientific tools are developed to meet scientific needs.”

He claims the book introduces a new “calculus of causation” consisting of two languages:

  1. Causal diagrams to express what we know
  2. Symbolic “language of queries” to express what we want to know

Causal diagrams describe the data generation process, “the cause-effect forces that operate in the environment and shape the data generated.”

The symbolic language of queries expresses the question we want to answer. For example, if we want to know whether multinational corporations’ foreign direct investments cause conflict in countries where investments are made, we can encode this in the symbolic language as P(C | do(FDI)), which can be read as, “What is the probability (P) of conflict (C) if multinational firms are made to do foreign direct investment?” The “made to do” component is crucial here, because use of do indicates control over which firms do and don’t engage in FDI. If there is no control over FDI and firms instead choose whether they engage in FDI, the symbolic language can communicate this as P(C | FDI). Absence of the do operator indicates absence of experimental control and possible presence of various selection effects that could confound causal analysis. P(C | do(FDI)) could be completely different from P(C | FDI), and the difference can be thought of as the difference between doing and seeing, respectively.

One of the greatest achievements of the calculus of causation is allowing researchers to approximate P(C | do(FDI)), for which experimental data are extremely rare, from P(C | FDI), for which observational data are extremely common.

We can approximate doing from seeing using counterfactual reasoning.
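
A small simulation can make the doing/seeing gap concrete. This is a minimal sketch, not something from the book: it assumes a single hypothetical confounder (an "unstable" country flag) that makes firms less likely to invest and conflict more likely, and every probability in it is invented for illustration. It compares P(C | FDI) in observational data with P(C | do(FDI)) under a simulated intervention, then recovers the interventional quantity from the observational data alone using backdoor adjustment, one technique from Pearl's framework. The adjustment works here only because the confounder is assumed to be the sole common cause and to be measured.

```python
import random


def simulate(n=200_000, seed=0, force_fdi=None):
    """Draw n hypothetical countries as (unstable, fdi, conflict) tuples.

    force_fdi=None  -> firms choose for themselves (seeing).
    force_fdi=True  -> every country is made to receive FDI (doing).
    All probabilities are made-up illustration values.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        unstable = rng.random() < 0.5  # hypothetical confounder
        if force_fdi is None:
            # Assumption: firms tend to avoid unstable countries.
            fdi = rng.random() < (0.2 if unstable else 0.7)
        else:
            fdi = force_fdi
        # Assumption: instability raises conflict a lot, FDI raises it a little.
        p_conflict = (0.5 if unstable else 0.1) + (0.05 if fdi else 0.0)
        conflict = rng.random() < p_conflict
        rows.append((unstable, fdi, conflict))
    return rows


def mean(values):
    values = list(values)
    return sum(values) / len(values)


observational = simulate()

# Seeing: P(C | FDI), conflict rate among countries where firms chose to invest.
seeing = mean(c for _, fdi, c in observational if fdi)

# Doing: P(C | do(FDI)), conflict rate when FDI is imposed everywhere.
doing = mean(c for _, _, c in simulate(force_fdi=True))

# Backdoor adjustment using observational data only:
# sum over z of P(C | FDI, Z=z) * P(Z=z).
adjusted = sum(
    mean(c for z, fdi, c in observational if fdi and z == level)
    * mean(z == level for z, _, _ in observational)
    for level in (True, False)
)

print(f"P(C | FDI)            = {seeing:.3f}")
print(f"P(C | do(FDI))        = {doing:.3f}")
print(f"adjusted from seeing  = {adjusted:.3f}")
```

With these invented numbers the observational rate comes out near 0.24, while the interventional and adjusted rates both come out near 0.35. Seeing understates the risk because firms in this toy world select into stable countries.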

Pearl then introduces causal inference engines, which take three inputs and produce three outputs. The inputs are:

  • Assumptions
  • Queries
  • Data

The outputs are:

  • Decision: Can the query be answered given the causal model and assuming perfect data?
  • Estimand: Mathematical formula to generate the answer from any hypothetical data
  • Estimate: The answer and some measure of the certainty of the answer
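
The three inputs and three outputs can be sketched in code. What follows is a toy illustration under strong assumptions, not Pearl's algorithm: the function name `toy_engine`, the dictionary encoding of the diagram, and the data are all hypothetical. The decision step checks only one sufficient condition (every direct cause of the treatment is observed, in which case adjusting for those direct causes is valid), so it will answer "no" to some queries a full engine could answer. It also omits the measure of certainty that the estimate is supposed to carry.

```python
from collections import Counter


def toy_engine(parents, observed, treatment, outcome, data):
    """Toy engine for the query P(outcome=1 | do(treatment=1)).

    parents:  dict mapping each node to its direct causes (the assumptions).
    observed: set of variable names present in the data.
    data:     list of dicts with 0/1 values (the data).
    """
    pa = sorted(parents.get(treatment, []))

    # Decision: sufficient-but-not-necessary check (see the lead-in).
    if outcome in pa or not all(p in observed for p in pa):
        return {"decision": False, "estimand": None, "estimate": None}

    # Estimand: a formula that could be applied to any data set.
    if pa:
        names = ", ".join(pa)
        estimand = (f"sum over {names} of "
                    f"P({outcome} | {treatment}, {names}) * P({names})")
    else:
        estimand = f"P({outcome} | {treatment})"

    # Estimate: plug the data into the estimand, stratum by stratum.
    def stratum(row):
        return tuple(row[p] for p in pa)

    estimate = 0.0
    for key, count in Counter(stratum(row) for row in data).items():
        treated = [row[outcome] for row in data
                   if row[treatment] == 1 and stratum(row) == key]
        if not treated:
            # No treated units in this stratum, so the formula cannot be filled in.
            return {"decision": True, "estimand": estimand, "estimate": None}
        estimate += (sum(treated) / len(treated)) * (count / len(data))
    return {"decision": True, "estimand": estimand, "estimate": estimate}


def rows(z, fdi, conflicts, total):
    """Build `total` hypothetical rows, the first `conflicts` of which have C=1."""
    return [{"Z": z, "FDI": fdi, "C": 1 if i < conflicts else 0}
            for i in range(total)]


# Assumed diagram: Z -> FDI, Z -> C, FDI -> C (Z is a hypothetical confounder).
graph = {"FDI": ["Z"], "C": ["Z", "FDI"]}
data = rows(1, 1, 2, 4) + rows(1, 0, 3, 6) + rows(0, 1, 2, 8) + rows(0, 0, 0, 2)

result = toy_engine(graph, {"Z", "FDI", "C"}, "FDI", "C", data)
blocked = toy_engine(graph, {"FDI", "C"}, "FDI", "C", data)  # Z unmeasured

print(result)
print(blocked["decision"])
```

On this made-up data the engine answers "yes" and returns an estimate of 0.375 when Z is observed. When Z is left out of the observed set it answers "no" before looking at the data at all, which mirrors the point that the causal model comes before the data.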

Pearl notes the importance of the causal model coming before data. He is a critic of current arguments in artificial intelligence that causality can come from data, rather than be specified in advance by theory.


Pearl’s book on counterfactual reasoning is a bit late, despite its claims to be the first to describe this “causal calculus.” Predecessors include:

  • Morgan, S. L., & Winship, C. (2007). Counterfactuals and causal inference: Methods and principles for social research (Analytical Methods for Social Research) (1st ed.). Cambridge, England: Cambridge University Press.
  • Angrist, J. D., & Pischke, J.-S. (2009). Mostly harmless econometrics: An empiricist’s companion. Princeton, NJ: Princeton University Press.

There’s a bit of squeakiness in the counterfactual reasoning approach to causation that makes it seem a bit less of an advance over prior ways of thinking than Pearl portrays. The squeakiness arises from counterfactuals being imaginary. They are supposed to represent what would have happened in the absence of an action. For example, take a multinational corporation that engages in foreign direct investment. Observe whether the investment is associated with conflict in the country into which the investment is made. The counterfactual for this is whether conflict would have occurred if all else had happened exactly the same except the firm had not engaged in foreign direct investment.

This assumes the human mind can somehow know what the counterfactual would have been. But it can’t. We can never observe or confirm counterfactuals. They are fundamentally assumptions about what the world would have looked like in the absence of some event that actually did happen in the world.

We are right back to where we were before the causal calculus, to a world in which arguments about causation are not based on evidence but on untestable claims about the nature of alternative worlds. We’re back to the authority of the speaker being critical to whether the causal argument is credible, and that is what research methods requiring data were supposed to get us away from. Claiming the human mind is ultimately what determines causation, rather than data, risks reversing the scientific revolution’s rejection of faith and authority as bases for claims.