DAG & Causal Inference Intuition
Source: The Effect 2E
1. Designing Research
The Goal of Research At its core, research is about answering questions regarding how the world works. A strong research question must be well-defined, understandable, and answerable. However, coming up with the question is only the first step; you must also design a way to find a reliable answer.
Empirical vs. Quantitative Research Instead of just philosophically reasoning through a problem, the text focuses on empirical research, which attempts to answer questions using structured observations from the real world. Specifically, the focus is on quantitative empirical research, which uses numbers and data sets rather than qualitative measures like interviews.
The Challenge of Observation A major headache for researchers is that the raw numbers we observe rarely tell us exactly what we want to know. For example, if your research question is “does adding an additional highway lane reduce traffic congestion?”, you cannot simply compare existing two-lane highways to three-lane highways. This is because you are missing a “what if” scenario: you don’t have a way to see how much congestion there would have been if you had made that exact same two-lane highway one lane wider.
Why Research Design Matters Without a proper research design, the data can actively mislead you. If you blindly compared highway lanes to traffic, you might find that three-lane highways have more congestion. Without good design, you might assume lanes cause traffic, missing the fact that busy routes are simply the ones most likely to get expanded.
The text points to nutrition studies as a prime example of poor research design. Because it is so incredibly difficult to isolate the effect of one specific food from everything else a person eats or does, the field often suffers from shaky research designs. This leads to the contradictory and constantly changing news headlines we see about what is “healthy” to eat.
The Solution Well-designed research is specifically built to answer the question it sets out to answer. By carefully designing your analysis, you can figure out what data to collect and exactly what to do with those numbers to find the truth. This chapter sets the stage for the rest of the book, where Part I teaches the principles of building these designs, and Part II provides a mathematical “toolbox” for conducting causal research without running direct experiments.
2. Research Questions
What Makes a Good Research Question? In quantitative empirical research, a good research question must meet two primary criteria: it must be answerable, and finding that answer must improve our understanding of how the world works.
- It can be answered: This means it is possible to find real-world evidence that would provide a believable answer. For example, “what is the best James Bond movie?” is too subjective to be answerable. However, “which era of Bond movies had the highest ticket sales?” can be definitively answered with data.
- It informs theory: Once answered, the question should tell us something broader than the specific data points by informing a theory—an explanation of why things happen. A strong research question takes you from a theory to a hypothesis, which is a specific, testable statement about what you expect to observe in your data. If the data supports or contradicts your hypothesis, it helps validate or challenge your underlying theory.
The Trap of Data Mining You might wonder why we need to start with a specific question when we have massive amounts of data at our fingertips. Why not just look for interesting patterns? The text refers to this approach as “data mining”.
While data mining is highly valuable for finding patterns and making predictions, it is a poor tool for explaining why the world works the way it does.
- Mistaking Prediction for Causality: Data mining only looks at what is in the data. For instance, it might reveal that the proportion of people wearing shorts is a fantastic predictor of ice cream sales. However, shorts don’t cause people to eat ice cream; hot weather causes both. Data mining is easily fooled by these correlations.
- Lacking Abstraction: Data mining fails to grasp underlying concepts. If an AI looks at chairs, it sees flat surfaces on top of vertical posts, but it lacks “chair theory”—the understanding that chairs are designed for sitting. This is why generative models like ChatGPT sometimes draw hands with the wrong number of fingers; they are matching patterns without actually understanding “hand theory”.
- False Positives: If you check 100 random variables without a guiding question, something will eventually look connected purely by random chance. Starting with a targeted question prevents you from falling for these false positives.
Evaluating Your Question Research questions generally stem from curiosity, working backward from a theory, or even an opportunistic dataset. To know if you have a workable question, the author recommends checking a few key areas:
- Consider Potential Results: Imagine the possible answers your data could give you. If you can’t draw an interesting conclusion about your theory regardless of whether the result is positive or negative, your question and theory aren’t linked well enough.
- Consider Feasibility and Scale: Do you actually have access to the data required, and can you reasonably complete the research within your time and resource constraints?
- Keep It Simple: Don’t bundle massive concepts into a single question. Asking “what are the determinants of social mobility?” is too broad to answer well. Asking “is birth location a determinant of social mobility?” is much more achievable.
3. Describing Variables
The Building Blocks: What is a Variable? Before you can find causality, you need to know how to describe the data you actually have. In empirical research, a variable is simply a set of observations of the same measurement, such as the monthly incomes of 433 people or the color of 532 flowers.
The text outlines several main types of variables:
- Continuous: Can take any numerical value, including fractions (e.g., income of $34,123.32).
- Count: Whole numbers counting how many times something happened (e.g., number of business mergers).
- Ordinal: Categories that have a clear order, but the distance between them isn’t measured (e.g., “elementary,” “middle,” and “high school” education).
- Categorical: Groupings with no inherent order (e.g., flower color). A highly useful sub-type is the binary variable, which only takes two values, like “yes” or “no”.
Distributions: How Often Do Things Happen? To understand a variable, you look at its distribution, which describes how often different values occur.
- For categorical variables, this is easily shown in a frequency table or a bar graph (e.g., 47% of colleges award 2-year degrees, 31% award 4-year degrees).
- For continuous variables, you use a histogram (which groups ranges of numbers into “bins”) or a density plot (a smooth curve where the area under the curve represents the probability of the variable falling in that range).
Summarizing the Data Because a full distribution graph can be overwhelming, researchers summarize it using specific numbers:
- Central Tendency (The Middle): The mean (average) calculates a representative value by weighting all observations. The median (the 50th percentile) finds a representative observation. The text notes that the median is often better when you have massive outliers. If Jeff Bezos walks into a room, the mean wealth skyrockets, but the median wealth stays roughly the same, making the median a better reflection of the “typical” person.
- Variation (The Spread): This describes how wide the distribution is. Researchers use the range (max minus min), or the Interquartile Range (IQR), the 75th percentile minus the 25th percentile, which shows how spread out the middle 50% of the data is. Another common measure is the standard deviation, which tells you roughly how far from the mean your data points are, on average.
- Skew (The Lean): Many real-world variables, like income, have a “right skew,” meaning most people are clustered near the bottom, but a long “tail” stretches to the right for billionaires. Researchers often apply a log transformation to skewed data to pull those extreme values in and make the distribution symmetrical and easier to work with.
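A quick sketch in Python ties these summaries together. The income numbers here are entirely made up for illustration; the point is how one outlier separates the mean from the median, and how a log transformation tames the skew:

```python
import math
import statistics

# Hypothetical right-skewed "income" sample with one extreme outlier.
incomes = [28_000, 31_000, 34_000, 36_000, 40_000,
           45_000, 52_000, 60_000, 75_000, 1_000_000]

mean_income = statistics.mean(incomes)      # dragged far up by the outlier
median_income = statistics.median(incomes)  # stays near the "typical" person

# IQR: spread of the middle 50% of the data (75th minus 25th percentile).
q1, _, q3 = statistics.quantiles(incomes, n=4)
iqr = q3 - q1

# A log transformation pulls the long right tail in; on the log scale the
# mean and median nearly agree, so the distribution is far more symmetric.
log_incomes = [math.log(x) for x in incomes]
log_gap = abs(statistics.mean(log_incomes) - statistics.median(log_incomes))
```

Here the mean is more than triple the median, while on the log scale the two land close together.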
Theoretical Distributions and the “Truth” The most philosophical concept in the chapter is the difference between your sample data and the actual “truth” of the world.
- Notation: In statistics, English/Latin letters (like \(x\)) represent your messy, real-world data, while Greek letters (like \(\mu\) or \(\sigma\)) represent the lofty, perfect truth.
- The Goal: We don’t actually care about the specific 1,000 people we surveyed; we care about using that sample to estimate the theoretical distribution—the true distribution of everyone in the world. The larger your sample size, the closer your data gets to matching this perfect theoretical distribution.
- Hypothesis Testing: Because we can never truly see the theoretical distribution, we use our observed data to ask: “If the truth was actually \(X\), how incredibly unlikely would it be for me to get the data I just collected?” If the chances are too low (the p-value), we reject that theoretical distribution as being the truth.
4. Describing Relationships
What is a Relationship? While Chapter 3 focused on describing a single variable in isolation, Chapter 4 moves to the core of most research questions: the relationship between variables. Fundamentally, a relationship tells you what learning about one variable tells you about another. Relationships can be positive (they move up together), negative (one goes up while the other goes down), or null (they have nothing to do with each other).
A basic way to visualize the relationship between two continuous variables is a scatterplot, which plots every single data point. However, the author warns that scatterplots carry a psychological trap: they strongly tempt you to assume the variable on the x-axis causes the variable on the y-axis, even when it doesn’t.
Conditional Distributions and Means To mathematically describe a relationship, researchers look at conditional distributions—the distribution of one variable given the specific value of another. For example, the general probability that someone is a woman is about 50%, but the probability conditional on their name being Sarah is much higher.
In quantitative research, this is most often done using a conditional mean, which calculates the expected average of one variable for a specific value of another. For continuous variables (where a specific number might only have one observation), researchers might cut the data into bins, or use a LOESS curve. A LOESS curve estimates “local averages” to draw a flexible, curvy, and smooth shape through the data points.
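The binning approach can be sketched in a few lines of Python (the temperature/sales numbers are simulated and purely illustrative):

```python
import random
import statistics

random.seed(0)

# Simulated (hypothetical) data: ice cream sales rise with temperature, plus noise.
data = []
for _ in range(500):
    temp = 50 + 50 * random.random()         # degrees, in [50, 100)
    sales = 2 * temp + random.gauss(0, 10)
    data.append((temp, sales))

# Cut temperature into 10-degree bins and take the mean of sales within
# each bin: a conditional mean, E[sales | temperature falls in this bin].
bins = {}
for temp, sales in data:
    lower_edge = int(temp // 10) * 10        # 50, 60, 70, 80, or 90
    bins.setdefault(lower_edge, []).append(sales)

conditional_means = {b: statistics.mean(v) for b, v in sorted(bins.items())}
```

A LOESS curve does conceptually the same thing, but with smooth, overlapping local averages instead of hard bin edges.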
Line-Fitting and Regression Instead of just relying on local averages, a very common approach is regression, which involves fitting a specific shape (usually a line) to the entire dataset.
- Ordinary Least Squares (OLS): This is the most famous line-fitting method. It picks a line that minimizes the sum of squared residuals (a residual is the difference between an actual data point and the prediction made by the line).
- Slope and Correlation: The resulting line has a slope that cleanly describes the relationship (e.g., a one-unit increase in X is associated with a four-unit increase in Y). This concept is also the basis for correlation, which rescales the OLS slope to a number between -1 and 1 to tell you exactly how strong the relationship is, regardless of the units being measured.
- Nonlinear Regression: OLS doesn’t have to fit a perfectly straight line; it can fit curves if you include squared terms or logarithms. For binary outcome variables (like “yes or no”), researchers use specific nonlinear functions (like logit or probit) so the prediction line doesn’t impossibly guess that someone has a 110% or -10% chance of doing something.
Controlling for Variables (The Magic of Residuals) The most powerful concept in the chapter involves using residuals to isolate relationships. Because a residual is the part of a variable that is not explained by your prediction line, it represents the part of the data that has nothing to do with the variable you just measured.
If you want to know if wearing shorts is related to eating ice cream, you have to account for the fact that hot weather causes both. By calculating the conditional mean of ice cream based on temperature, and shorts based on temperature, you can extract the residuals for both. Finding the relationship between those two sets of residuals is called controlling for a variable, which mathematically washes out all the variation associated with that third variable. In practice, you don’t have to calculate this by hand; multivariable regression does it automatically simply by adding the new variable to the equation.
5. Identification
The Data Generating Process (DGP) Before we can figure out what causes what, we have to acknowledge that the data we observe doesn’t just appear out of nowhere; it is created by underlying laws and mechanisms. This is the data generating process (DGP). For example, the movement of the planets (the data) is generated by gravity and momentum (the DGP).
The DGP contains parts we already know and parts we want to learn. In empirical research, our goal is to use the parts of the DGP we do know to block out noise and isolate the parts we don’t know.
Isolating Your Variation Raw data usually just shows us correlations, which means variables are moving together, but it doesn’t tell us why. The author uses the example of avocado sales: a graph might show that when prices are high, fewer avocados are sold. But is this because high prices made buyers want fewer avocados, or because a flooded market made sellers drop the price?
Variables move around for all sorts of reasons, and this movement is called variation. To answer a specific research question, you have to dig through the messy raw data to find the specific variation that matters to you. For instance, if you assume avocado sellers only change their prices at the beginning of the month, you can look purely at within-month variation to isolate the behavior of the buyers, tossing out the variation caused by the sellers.
What is Identification? Identification is the process of figuring out what part of the variation in your data actually answers your research question. You do this by systematically closing off every other possible explanation for the relationship you see.
The text uses the analogy of a dog repeatedly escaping a house. The owners lock the front door, then the back door, then the doggie door, and finally the windows. Only when every other possible exit is sealed, and the dog still gets out, can they definitively identify the basement window as the escape route. In research, you use your knowledge of the DGP to identify your answer by closing off alternate statistical explanations.
To successfully identify an effect, you must:
- Use theory and context to figure out what the DGP looks like.
- Identify alternate reasons your data might look the way it does.
- Find statistical ways to block out those alternate reasons.
The Danger of Alternate Explanations If you fail to properly map out the DGP, your research will be ruined by alternate explanations. The text highlights a massive study linking alcohol consumption to mortality. While the study controlled for things like smoking and age, it couldn’t perfectly account for unmeasured parts of the DGP, like underlying risk-taking behavior. Because risk-taking causes people to drink and causes people to die earlier, it provides an alternate explanation for the data. If your methods aren’t identifying the true causal effect, they can lead to absurd conclusions—like a joke analysis that used similar methods to “prove” that drinking alcohol turns you into a man.
Context and Omniscience The major takeaway of the chapter is that you cannot do good causal research without deep contextual knowledge of the environment you are studying. You must understand the rules, the institutions, and the people to map out the DGP.
However, no researcher is omniscient. We will never know every alternate explanation. Good research isn’t about being perfect; it’s about learning the context deeply, acknowledging your assumptions, actively looking for gaps in your knowledge, and making your errors as small as possible.
6. Causal Diagrams
What is Causality? Before drawing diagrams, the chapter explicitly defines what it means for something to be “causal.” In empirical research, \(X\) causes \(Y\) if, by intervening to change \(X\), the distribution of \(Y\) changes as a result.
- Correlation vs. Causation: Changing someone’s pants to shorts won’t cause them to eat ice cream, so that relationship is non-causal. But intervening to flip a light switch does change the probability that the light comes on, making it causal.
- Weasel Words: The text warns researchers to watch out for phrases like “linked to” or “associated with”. These are “weasel words” that don’t technically claim causality but are often written to trick the reader into hearing a causal relationship.
Building Causal Diagrams A causal diagram (pioneered by computer scientist Judea Pearl) is a visual representation of the data generating process (DGP). It is built using only two components:
- Nodes: These represent the variables in your DGP.
- Arrows: These represent the causal relationships, pointing from the cause to the effect.
Crucially, you must include all relevant variables in your diagram, even if they are unobserved or unmeasurable. For instance, if you don’t know why shorts and ice cream are correlated, you must draw a “latent variable” (like “U1”, representing an unknown factor like temperature) with arrows pointing to both.
The Power of Omission Perhaps the most counterintuitive lesson of the chapter is that what is missing from the diagram is just as important as what is on it.
- If you leave a variable off the diagram, you are making an assumption that it is not an important part of the DGP.
- If you leave an arrow out between two variables, you are assuming they have no direct causal relationship.
These omissions are your identifying assumptions. If your assumptions are wrong, your research design won’t work. Therefore, creating a diagram is a difficult balancing act: you must omit enough variables to make the diagram simple and useful, but you cannot omit variables that act as major alternative explanations for your data.
Answering Your Research Question Once built, you can use the causal diagram to map out exactly what variation you need to answer your research question.
- It helps you identify the direct effect (an arrow straight from \(X\) to \(Y\)) and any indirect effects (where \(X\) causes \(Z\), which then causes \(Y\)).
- It visually highlights your “alternate explanations.” For example, if a diagram shows that local politics causes both an increase in police and an increase in crime, you instantly know that you must statistically block out the effect of local politics to isolate the true effect of police on crime.
Moderators The chapter concludes by addressing moderating variables—variables that don’t necessarily cause an outcome directly, but change the strength of the effect one variable has on another (e.g., having a uterus “moderates” the effect of taking a fertility drug). While strict causal diagrams don’t allow arrows to point at other arrows, the author notes that researchers often “cheat” for the sake of clarity by drawing an interaction node (like \(X \times Z\)) to explicitly show a moderating effect on the graph.
7. Drawing Causal Diagrams
Translating the World into a Diagram While Chapter 6 introduced what causal diagrams are, Chapter 7 focuses on how to actually draw one yourself. Because a causal diagram represents the Data Generating Process (DGP) for your specific research question, you have to start by thoroughly researching your topic.
The text outlines a systematic process for building your diagram, using an example study about the effect of taking online classes (the treatment variable) on dropping out of college (the outcome variable).
1. Brainstorming Variables and Connections You start by putting your treatment and outcome variables on the board. Then, you list every factor that might cause either the treatment, the outcome, or both.
- In the online class example, this includes demographics (Race, Gender, Age), Socioeconomic Status (SES), Available Time, Work Hours, Internet Access, and past Academics.
- You must also connect these variables to each other (e.g., Age causes SES, and SES causes Internet Access).
- If two variables are correlated but neither causes the other, you must add an unobserved “latent variable” (like U1) to serve as a common cause.
2. The Art of Simplification If you include every possible variable, your diagram will become an unreadable mess. The author compares a messy causal diagram to asking for directions to a gas station and being handed a giant, hyper-detailed atlas; sometimes, having too much information makes it useless.
To hit the “golden mean” of being simple but accurate, you should apply four tests to prune your diagram:
- Unimportance: If a variable only has a tiny, negligible effect (like the presence of a quiet cafe nearby encouraging someone to take an online class), you can leave it off.
- Redundancy: If multiple variables occupy the exact same space—meaning they have the exact same arrows coming in and going out—you can combine them. For example, if Race and Gender behave identically on your graph, you can combine them into a single “Demographics” node.
- Mediators: If a variable only exists as a middle-man (A \(\rightarrow\) B \(\rightarrow\) C), you can often remove it and just draw an arrow straight from A to C. Warning: Do not remove a mediator if your research question is specifically trying to figure out why A causes C.
- Irrelevance: You can drop variables that aren’t on any path between your treatment and outcome.
3. Avoiding Cycles One strict rule of causal diagrams is that they cannot contain cycles or feedback loops (e.g., A \(\rightarrow\) B \(\rightarrow\) A). If you have a cycle, you are implying a variable causes itself, which makes it mathematically impossible to isolate an effect.
While the real world has feedback loops (like two people punching each other back and forth), you can fix cycles on a diagram by incorporating time. If you punch me at time \(t\), it doesn’t cause me to throw the first punch; it causes me to punch you back at time \(t+1\). Time only moves in one direction, which breaks the cycle.
4. Defending Your Assumptions Finally, the chapter reminds us that every variable or arrow you leave off the graph is a massive assumption. Writing down a diagram requires sticking your neck out and making claims about how the world works.
Because we aren’t omniscient, assumptions are rarely perfectly right or wrong; they are just more or less probable. Your job is to put yourself in the shoes of a “critical reader” and provide prior research, logic, or data to convince them that your assumptions are reasonable.
8. Causal Paths & Closing Back Doors
What is a Path? In Chapter 7, we learned how to draw causal diagrams. Chapter 8 focuses on how to actually use those diagrams by tracing paths. A path is simply the set of nodes and arrows you travel along when “walking” from one variable to another on your diagram.
To figure out how to isolate your causal effect, you must first map out every single path that starts at your treatment variable and ends at your outcome variable. Why? Because every path represents a different story or alternate explanation for why your treatment and outcome are statistically related in the real world.
Good Paths, Bad Paths, Front Doors, and Back Doors Once you have listed all the paths between your treatment and outcome, you must categorize them:
- Good Paths vs. Bad Paths: A Good Path is a path that describes a reason why your treatment and outcome are related that actually answers your research question. A Bad Path is an alternate explanation that is unrelated to your research question.
- Front Doors vs. Back Doors: A front door path is a path where every single arrow points away from the treatment. These are usually your Good Paths, as they represent the treatment directly or indirectly causing the outcome. A back door path has at least one arrow pointing towards the treatment. These are usually your Bad Paths.
Open Paths, Closed Paths, and Identification The core rule of causal identification is this: If you want to identify the answer to your research question, you must Close all the Bad Paths while leaving all the Good Paths Open.
- Open Path: A path is Open if all the variables along the path have variation in the data. If a Bad Path is open, it will contaminate your data with an alternate explanation.
- Closed Path: A path is Closed if at least one variable on the path has no variation. You can intentionally close a Bad Path by controlling for a variable on that path (for example, by statistically adjusting for it or only looking at data where that variable is constant).
By finding a way to control for at least one variable on every single Bad Path, without accidentally controlling for a variable on a Good Path, you effectively “close the back doors” and isolate your true causal effect.
The Trap of Colliders There is a massive trap in this process: Colliders. A variable is a collider on a path if the arrows on both sides of it point directly at it (e.g., Treatment \(\rightarrow\) Collider \(\leftarrow\) Outcome).
- Colliders Close Paths by Default: Because the collider is purely an effect of the variables around it, it doesn’t cause anything else on that path. Therefore, a path with a collider is naturally blocked, or Closed by default. You don’t need to control for anything to close this Bad Path!
- Controlling for a Collider Opens the Path: If you accidentally control for a collider (or select a sample based on it), the path opens back up. Controlling for a collider forces the two variables pointing at it to become statistically related, introducing a brand new alternate explanation into your data.
- The “College Student” Collider Trap: In the general population, having good grades and being a star athlete are completely unrelated reasons for being admitted to college (Grades \(\rightarrow\) Admitted to College \(\leftarrow\) Athlete). However, if you accidentally “control” for the collider by restricting your data to only include college students, you force a mathematical relationship to appear between the two previously unrelated causes. Because you are only looking at people who made it in, knowing that a student lacks one cause (good grades) instantly tells you they must have the other (star athlete), artificially opening up a brand new causal path.
Testing Your Diagram (Placebo Tests) Finally, you can use paths to test if your diagram is actually a good representation of the real world. You can look at two variables on your diagram (other than treatment and outcome) and figure out what you would need to control for to close all paths between them. Once you do that, the relationship between those two variables in your data should be zero. Checking the data to see if that relationship is actually zero is called a placebo test. If they are still related, it means your causal diagram is missing an important path.
9. Finding Front Doors
The Problem with Back Doors In Chapter 8, we learned that to identify a causal effect, you have to close every single “back door” path (alternate explanation) by controlling for variables. Chapter 9 admits a harsh truth: closing all back doors is incredibly difficult. It requires you to accurately map out every possible confounder and perfectly measure them. If an alternate explanation relies on an unmeasurable variable—like a person’s “business skill” or “intellectual ability”—you can’t control for it, leaving the back door wide open.
Instead of exhaustively hunting down and closing every back door, Chapter 9 introduces an alternative approach: isolating front doors. If we can find a way to isolate just the variation in our treatment variable that is free of back doors, we don’t have to worry about controlling for all those messy unobservable variables.
Randomized Controlled Experiments The cleanest way to isolate a front door is through a randomized controlled experiment. If a researcher randomly assigns a treatment (like a medicine, or a lottery to get into a charter school), they are artificially creating variation in the treatment that has absolutely nothing to do with a person’s background, income, or latent ability.
Because the treatment was determined purely by a random coin flip or lottery, all back doors are closed by default. If you focus exclusively on the variation driven by the lottery, any difference in the outcome must be caused by the treatment.
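A rough simulation shows why (all effect sizes are invented): unobserved ability drives both treatment choice and the outcome, so the naive observational gap is inflated, while a coin-flip assignment recovers the true effect.

```python
import random
import statistics

random.seed(4)
n = 10000
true_effect = 1.0

def mean_gap(treated, outcome):
    """Difference in mean outcomes between treated and untreated groups."""
    t = [y for d, y in zip(treated, outcome) if d]
    c = [y for d, y in zip(treated, outcome) if not d]
    return statistics.mean(t) - statistics.mean(c)

ability = [random.gauss(0, 1) for _ in range(n)]

# Observational world: high-ability people choose treatment more often,
# so the back door Ability -> Treatment stays open.
chose = [random.random() < (0.7 if a > 0 else 0.3) for a in ability]
y_obs = [true_effect * d + 2 * a + random.gauss(0, 1)
         for d, a in zip(chose, ability)]

# Experimental world: a coin flip assigns treatment, severing that arrow.
assigned = [random.random() < 0.5 for _ in range(n)]
y_rct = [true_effect * d + 2 * a + random.gauss(0, 1)
         for d, a in zip(assigned, ability)]

naive_gap = mean_gap(chose, y_obs)    # badly overstated
rct_gap = mean_gap(assigned, y_rct)   # close to the true effect of 1.0
```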
Natural Experiments and Exogenous Variation Explicit randomized experiments aren’t always possible, but sometimes the real world does the randomizing for us in what is called a natural experiment. To find a natural experiment, you look for a source of exogenous variation—variation that comes from “outside” the DGP and isn’t caused by any other variable in your model.
- Example: A researcher wants to know if heavy air pollution causes people to drive more. This is tangled with back doors because driving causes pollution. However, the researcher can use wind direction as a source of exogenous variation. Wind direction isn’t caused by traffic, but a west-blowing wind randomly pushes pollution into the city. By looking only at the changes in pollution driven by the wind, the researcher isolates a clean front door without the back doors.
Natural experiments have the benefit of reflecting realistic human behavior (unlike clinical lab settings) and usually offer much larger sample sizes. However, unlike pure random lotteries, natural experiments might still have a few back doors you need to close (e.g., controlling for the season when looking at the wind), and you have to work hard to convince skeptical readers that your variation is truly exogenous.
The “Front Door Method” The chapter concludes with a very specific, fascinating (though rarely used in practice) statistical trick called the Front Door Method. It is used when a bad path simply cannot be closed, but there is a fully observable “middle-man” (mediator) directly between the treatment and the outcome.
- Example: Imagine trying to find the causal effect of Smoking on Cancer. There are unmeasurable lifestyle back doors we cannot close. However, we can measure the exact amount of “Tar in Lungs” (the mediator).
- Because there are no back doors between Smoking and Tar, we can calculate Smoking \(\rightarrow\) Tar.
- Because we can control for Smoking, we can close the back doors and calculate Tar \(\rightarrow\) Cancer.
- By chaining those two isolated effects together, we successfully identify the total causal effect of Smoking on Cancer, entirely bypassing the unmeasurable back doors.
The “Front Door” Collider: The path (Treatment \(\leftarrow\) Confounder \(\rightarrow\) Outcome \(\leftarrow\) Mediator) is safely blocked by default. The Outcome variable acts as a natural collider on this path, because both the confounder and the mediator point directly at it (Confounder \(\rightarrow\) Outcome \(\leftarrow\) Mediator).
10. Treatment Effects
Chapter 10 shatters a convenient fiction used in earlier chapters: the idea that a treatment has one single, universal effect on everybody. In the real world, particularly in the social sciences, treatments affect different people differently, which is known as a heterogeneous treatment effect.
Because everyone has their own individual response to a treatment, researchers are actually looking at a treatment effect distribution. How you summarize that distribution depends entirely on your research design, and you don’t always get the straightforward average you might want.
Here is a breakdown of the different types of treatment effects you might estimate:
The Standard Averages
- Average Treatment Effect (ATE): This is the mean of the entire treatment effect distribution. It tells you what would happen, on average, if you forced the treatment on absolutely everyone in the population.
- Conditional Average Treatment Effect (CATE): This is the average effect calculated only for a specific subset of people, such as the effect of a medicine exclusively among women.
- Average Treatment on the Treated (ATT) & Untreated (ATUT): These measure the average effect specifically for the people who actually received the treatment, or those who did not. This distinction is crucial because when people get to choose whether or not to take a treatment, the ones who choose it are usually the ones who benefit from it the most, which is why the ATT and ATUT differ.
Weighted Averages Sometimes your mathematical model calculates an average that includes everyone, but counts certain individuals more heavily than others.
- Variance-Weighted Average Treatment Effect: When you control for variables to close back doors, you shut out certain types of variation. The math will automatically place more weight on the people who still have a lot of variation left in their treatment status.
- Intent-to-Treat (ITT): If you randomly assign a treatment but people don’t perfectly comply, the ITT measures the effect of assigning the treatment, rather than the effect of the treatment itself.
- Local Average Treatment Effect (LATE): Common in natural experiments, this is a weighted average that heavily favors the “compliers”—the people whose treatment status was most strongly changed by the exogenous variation (like winning a lottery).
- Marginal Treatment Effect (MTE): The treatment effect for the next person who would receive the treatment if the program or treatment rates were expanded. It focuses specifically on the people who are “just on the margin” of deciding whether to get treated or not. The text notes that while it can be tricky to calculate in the real world, it is the exact effect you want to look at if you are trying to answer the specific policy question: “Should we treat more people?”.
Which Effect Do You Get? You don’t usually get to pick which effect you calculate; it is a consequence of your research design and where your variation is coming from. The chapter offers a few rules of thumb:
- True randomization in a representative sample gives you the ATE.
- Randomization within a specific group gives you a CATE.
- Assuming an untreated group perfectly mimics what the treated group would look like gives you the ATT.
- Isolating variation driven by an exogenous variable (like a natural experiment) gives you a LATE.
Why Does This Matter? Paying attention to whose effect you are measuring is crucial because you want your data to match the real-world policy or intervention you are considering. If you are considering rolling out a policy to the entire population, you need the ATE. But if you are just considering expanding an opt-in program to a few more people, the ATE might mislead you, and you would be much better served by looking at the ATT or the ATUT.
Example: Imagine we have four individuals with their own secret, individual treatment effects (the difference between their treated and untreated outcomes):
- Alfred (Male): Effect = 1
- Brianna (Female): Effect = 4
- Chizue (Female): Effect = 3
- Diego (Male): Effect = 2
1. Average Treatment Effect (ATE) The ATE is the straightforward mean if you forced the treatment on absolutely everyone.
- Calculation: (1 + 4 + 3 + 2) / 4 = 2.5.
2. Conditional Average Treatment Effect (CATE) The CATE isolates the average effect for a specific subset of people, such as only looking at the men.
- Calculation: (Alfred’s 1 + Diego’s 2) / 2 = 1.5.
3. Average Treatment on the Treated (ATT) This measures the effect only for the people who actually ended up receiving the treatment.
- Scenario: Imagine only Alfred and Chizue choose to get treated.
- Calculation: (Alfred’s 1 + Chizue’s 3) / 2 = 2.
4. Average Treatment on the Untreated (ATUT) This predicts what the effect would have been for the people who did not receive the treatment.
- Scenario: Imagine a population of 1000 Alfreds (Effect=1) and 1000 Briannas (Effect=4). If 600 Alfreds and 400 Briannas are left untreated, we average their hidden effects.
- Calculation: (600 × 1 + 400 × 4) / 1000 = 2.2.
5. Variance-Weighted Average Treatment Effect
- Scenario: We have a massive sample where 50% of all Briannas get treated (high variance: \(0.5 \times 0.5 = 0.25\)) but 90% of Diegos get treated (low variance: \(0.9 \times 0.1 = 0.09\)).
- Calculation: The math weights Brianna’s effect (4) much more heavily than Diego’s (2) because we see her both ways more often: (0.25 × 4 + 0.09 × 2) / (0.25 + 0.09) ≈ 3.47.
6. Intent-to-Treat (ITT) This measures the effect of assigning treatment, rather than actually taking it, which averages in the people who ignore their assignment (a weight of 0).
- Scenario: We have 2 Chizues (who always follow instructions) and 2 Diegos (who refuse to take the treatment even if assigned). We randomly assign one of each to the treatment group.
- Calculation: The two Chizues get their full effect weighted by 1. The two Diegos get their effect weighted by 0. We divide by the total number of people (4): (3 × 1 + 3 × 1 + 2 × 0 + 2 × 0) / 4 = 1.5.
7. Local Average Treatment Effect (LATE) Used in natural experiments, this scales the ITT back up by dividing it by the proportion of people who actually changed their behavior (the compliers).
- Scenario: Following the ITT example above, being assigned to treatment only increased actual treatment rates by 50 percentage points (since Chizue complied but Diego did not).
- Calculation: We take the ITT (1.5) and divide it by the response rate (0.5): 1.5 / 0.5 = 3. Notice how this perfectly recovers Chizue’s individual effect (3), because she is the only “complier” driving the variation!
8. Marginal Treatment Effect (MTE)
- Scenario: Imagine that currently, only Alfred (Effect = 1) and Chizue (Effect = 3) are taking the treatment. Policymakers are considering expanding the treatment program to reach more people. Diego is “on the fence” about the treatment and will take it if the program expands even slightly, whereas Brianna is completely uninterested and won’t take it regardless of the expansion.
- Calculation: Because Diego is the very next person who would get treated if the treatment expands, he represents the “margin.” Therefore, the Marginal Treatment Effect is simply Diego’s individual effect: 2.
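The worked example above can be reproduced in a few lines of plain Python, mirroring the numbers in the text:

```python
# The four individuals' hidden treatment effects from the example above.
effects = {"Alfred": 1, "Brianna": 4, "Chizue": 3, "Diego": 2}

ate = sum(effects.values()) / 4                        # 2.5
cate_men = (effects["Alfred"] + effects["Diego"]) / 2  # 1.5
att = (effects["Alfred"] + effects["Chizue"]) / 2      # 2.0, if those two opt in

# ATUT: 600 untreated Alfreds and 400 untreated Briannas
atut = (600 * effects["Alfred"] + 400 * effects["Brianna"]) / 1000  # 2.2

# Variance-weighted: Brianna treated 50% (var 0.25), Diego 90% (var 0.09)
vwate = (0.25 * effects["Brianna"] + 0.09 * effects["Diego"]) / (0.25 + 0.09)

# ITT and LATE from the two-Chizue / two-Diego scenario (compliers weighted 1,
# refusers weighted 0, divided by everyone)
itt = (2 * effects["Chizue"] * 1 + 2 * effects["Diego"] * 0) / 4    # 1.5
late = itt / 0.5                                                    # 3.0
```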
11. Causality with Less Modeling
The Generic Model and “Stuff” When researchers cannot possibly map out or measure every back door, they often rely on a completely generic causal diagram: Treatment causes Outcome, and there is a back door through unmeasured “Stuff”. To deal with this “Stuff” without actually measuring it, researchers use a few clever strategies, aside from natural experiments and instrumental variables:
- Catch-All Controls (Fixed Effects): Sometimes, controlling for a single variable effectively controls for an infinite number of unmeasured variables. For example, if you control for the specific house someone grew up in (comparing them only to their siblings), you automatically control for their neighborhood, city, state, and upbringing in one fell swoop.
- Comparable Control Groups (Regression Discontinuity): Instead of explicitly controlling for variables, researchers try to find a control group that is theoretically identical to the treated group on average. For example, rather than comparing communities that received government funds to those that didn’t, a researcher might compare communities that barely won the funding to communities that barely lost, assuming their baseline needs and traits are the same.
- Comparing Changes Over Time (Difference in Differences): Instead of assuming two groups are identical in their absolute levels of “Stuff,” researchers might look at changes over time. If the unmeasured “Stuff” doesn’t change over time, comparing the change in the treatment group to the change in the control group bypasses those back doors completely.
Testing Your Assumptions Because “less modeling” means relying on stronger, less-verifiable assumptions (like assuming your control group is perfectly comparable), Chapter 11 provides ways to check if your assumptions are fatally wrong:
- Robustness Tests: These are tests used to see if you can disprove your own assumptions. For example, if your research design assumes that the control group and treatment group have the exact same trends over time, you can test that assumption by looking at their trends before the treatment happened to see if they actually match.
- Placebo Tests: This involves analyzing a “fake” treatment. You pretend a treatment happened somewhere it didn’t, or to a group that didn’t receive it. If your statistical model finds that this “placebo” had a significant causal effect, it means your model is picking up on a bad path, not a true effect.
- Partial Identification: Instead of making massive, strict assumptions to get one single, exact number for your treatment effect, you can make weaker, more plausible assumptions to get a range of possibilities. For instance, if you know an unmeasured variable (like risk-taking) biases your result positively, you can confidently state that your true effect is “no higher than 5%,” effectively bounding the effect rather than pinpointing it.
- The Gut Check: The last line of defense is your own common sense. If you do all the math and find that drinking one extra glass of water a week extends your lifespan by 20 years, your result is completely unbelievable. If the result is practically impossible, it means you have a wrong assumption somewhere, even if you can’t find it.
12. Opening the Toolbox
From Concept to Execution The first half of the book was entirely conceptual, focusing on how to think about the Data Generating Process (DGP) and how to identify causal paths. Chapter 12 marks a turning point: the second half of the book is about execution. It introduces a “toolbox” of statistical methods that researchers commonly use to perform their analyses.
The Power of Template Diagrams In the real world, causal diagrams are incredibly complex, which makes it practically impossible to measure and close every single back door. To get around this, researchers rely on template causal diagrams, an approach the text attributes to the researcher Jason Abaluck.
Instead of trying to solve a messy real-world diagram from scratch, you ask yourself: does my DGP happen to match one of these pre-established templates? If your context fits the template, you can use the corresponding “toolbox” method as a shortcut to identification. For example, if your diagram has a variable that causes the treatment, and every path from that variable to the outcome goes through the treatment, you match the “Instrumental Variables” template. Once you prove your DGP fits a template, your research design problem is solved, and you can move on to the statistical calculations.
The Structure of the Toolbox Chapters 16 through 21 will cover these specific template designs. Chapter 12 explains that each of these upcoming chapters will be split into three sub-chapters:
- How Does It Work?: A conceptual look at the method, its required causal diagram, and how it manipulates data.
- How Is It Performed?: The actual execution of the method, which is usually done using regression.
- How the Pros Do It: Important caveats and extensions. Because actual research moves too fast for textbooks to keep up, this section highlights the kinds of things working professionals think about when using these methods.
Code Examples Finally, the chapter establishes that all methods in the toolbox will include practical code examples in R, Stata, and Python. It also introduces a custom data package called causaldata (created by the author) that contains all the datasets you need to practice running these models yourself.
Chapters 13–15: The Foundational Tools
- Chapter 13 (Regression), Chapter 14 (Matching), and Chapter 15 (Simulation).
- The text notes that these chapters precede the specific “template” research design chapters (which span Chapters 16 through 21).
- Instead of introducing new causal diagram structures, these chapters establish the foundational statistical techniques needed to actually execute a research design. For example, Chapter 12 explicitly notes that the later “template” methods are almost always executed using the regression techniques introduced in Chapter 13. Similarly, matching is introduced as a way to close back door paths by picking a set of untreated observations with similar variable values.
Chapters 22–23: The Leftovers and the Messy Details
- Chapter 22 (A Gallery of Rogues: Other Methods). As the title implies, this chapter covers a collection of alternative statistical and causal inference methods that don’t fit neatly into the main “toolbox” templates highlighted earlier in the book.
- Chapter 23 (Under the Rug). This chapter addresses the complex, messy statistical issues that researchers often try to avoid or “sweep under the rug.” For instance, the text mentions that Chapter 23 tackles how to deal with “fat tails”—a frustrating data problem where heavily skewed distributions have extreme observations far away from the mean that happen with surprising regularity, making standard models difficult to apply.
13. Regression
- Fundamentals of OLS
Regression is the most common way to fit a line or shape to explain variation in a dependent variable (Y) using predictor variables (X). In causal research, we use it to estimate the relationship between two variables while controlling for others.
Fitting a specific shape to your data involves a fundamental trade-off:
- Pros: It uses variation efficiently and provides a shape that is easy to explain and interpret.
- Cons: You lose “interesting” variation that doesn’t fit the shape. If you choose a shape that doesn’t match the true population relationship (functional form misspecification), your results will be biased.
Residuals vs. Errors Students often confuse these two terms. The residual is what we can see; the error is the theoretical truth.
| Feature | Residual | Error (\(\epsilon\)) |
|---|---|---|
| Level | Sample-level (Observable) | Population-level (Theoretical) |
| Definition | The difference between the actual observed Y and the OLS prediction (\(\hat{Y}\)). | The difference between the actual Y and the prediction from the “true” population model. |
| Role | Used to estimate the variance of the unobservable error. | Contains everything that causes Y but is not included in the model. |
The OLS Objective OLS aims to find the best linear approximation of the relationship. It does this by minimizing the Sum of Squared Residuals (SSR)—the sum of the prediction errors after they have been squared.
Regression Interpretation To be a precise researcher, use both the “one-unit change” mantra and the more precise “comparison” phrasing:
- The Template: “Adjusting for Z, a one-unit change in X is linearly associated with a \(\beta_1\)-unit change in Y.”
- The Comparison: “Comparing two observations with the same value of Z, the one with an X value one unit higher will, on average, have a Y value \(\beta_1\) units higher.”
- Linking Regression to Causal Diagram
Translating DAGs to Equations
- Identify the Outcome: Set as the dependent variable (Y).
- Identify the Treatment: Set as the primary independent variable (X).
- Identify Backdoor Paths: Find variables that cause both X and Y (confounders).
- Assign Controls: Add variables (Z) to the right-hand side to close those backdoor paths.
- Equation Construction: \(Y = \beta_0 + \beta_1 X + \beta_2 Z + \epsilon\).
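A quick simulated check of the recipe above. The coefficient values (0.8, 2.0, 1.5) are invented for illustration; the point is that adding the confounder Z to the right-hand side moves the estimate back to the true effect.

```python
import numpy as np

# Simulate the diagram Z -> X, Z -> Y, X -> Y with a true effect of 2.0 for X.
rng = np.random.default_rng(0)
n = 50_000
Z = rng.normal(size=n)                      # confounder on the back door
X = 0.8 * Z + rng.normal(size=n)            # treatment
Y = 2.0 * X + 1.5 * Z + rng.normal(size=n)  # outcome

def coef_on_x(y, *regressors):
    """OLS with an intercept; returns the coefficient on the first regressor."""
    M = np.column_stack((np.ones(len(y)),) + regressors)
    return np.linalg.lstsq(M, y, rcond=None)[0][1]

naive = coef_on_x(Y, X)        # back door open: biased well above 2.0
adjusted = coef_on_x(Y, X, Z)  # adding Z closes the back door: about 2.0
```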
The Exogeneity Assumption For \(\beta_1\) to be a causal effect, X must be uncorrelated with \(\epsilon\). We use biological terminology here:
- Endogenous: “From within the system.” X is correlated with \(\epsilon\) (often due to an omitted variable), meaning the system determines X and Y simultaneously.
- Exogenous: “From outside the system.” X is uncorrelated with \(\epsilon\), allowing us to treat its variation like an external shock.
Model Selection Rule
- Must Include: Variables that close backdoor paths (to avoid Omitted Variable Bias).
- Can Include: Variables that cause Y but are unrelated to X. While not necessary for identification, they are beneficial because they shrink \(\sigma\) (the standard deviation of the error term), thereby reducing your standard errors and increasing precision.
- Must Exclude: Colliders (controlling them opens bad paths) and mediators (controlling them closes the front door you want open).
- Hypothesis Testing
The Sampling Distribution OLS coefficients follow a normal distribution centered on the true population parameter. We estimate the width of this distribution (the Standard Error) using the residuals found in our sample.
Factors that Shrink Standard Errors:
- Lower Error Variance (\(\sigma^2\)): A model that predicts Y more accurately.
- Higher Variation in X: A wider range of X makes the slope easier to “see.”
- Larger Sample Size (n): More data points reduce the influence of random noise.
Significance Metrics
- Standard Error (SE): The standard deviation of the sampling distribution.
- t-statistic: The coefficient divided by its SE (\(\beta / SE\)). If |t| > 1.96, the result is significant at the 95% level.
- p-value: The probability of seeing a result as extreme as ours if the null (\(\beta = 0\)) were true.
Reading Regression Tables
- Significance Stars: Typically * p < 0.1, ** p < 0.05, *** p < 0.01.
- \(R^2\) vs. Adjusted \(R^2\): \(R^2\) shows the share of variance explained. Adjusted \(R^2\) penalizes the addition of unnecessary variables, counting only the variance explained “above and beyond” what you’d get by adding random noise.
- Pedagogical Warning: Do not fixate on \(R^2\) for causal research. A low \(R^2\) just means there is a lot of Y left to explain; it doesn’t mean your estimate of the effect of X is wrong.
- Discrete Variables
Binary Variables OLS interprets the coefficient on a 0/1 variable as the difference in means between the “1” group and the “0” (reference) group.
Categorical Data & The Dummy Variable Trap When a variable has multiple categories (e.g., North, South, East, West), you create binary “dummies” for each. However, you must drop one category to serve as the Reference Category.
- The “Lurking” Variable: If you include all categories, you fall into the Dummy Variable Trap. This happens because there is an “all-1s” variable lurking in the constant term. Since the sum of all dummies also equals an all-1s variable, OLS faces perfect multicollinearity and cannot pick a unique estimate.
Joint F-Tests To see if a categorical variable matters as a whole, don’t look at the individual stars. Use a Joint F-test to test if all the category coefficients are simultaneously zero.
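A small numeric sketch of the reference-category logic and the trap. The region means (10, 12, 9, 11) are invented; with “North” dropped, each dummy coefficient recovers that region’s difference in means from North, and including all four dummies produces the rank deficiency the trap describes.

```python
import numpy as np

# Hypothetical data: outcome differs by region (North 10, South 12, East 9, West 11).
rng = np.random.default_rng(0)
cats = ["North", "South", "East", "West"]
region = np.array(cats)[rng.integers(0, 4, size=400)]
means = {"North": 10.0, "South": 12.0, "East": 9.0, "West": 11.0}
y = np.array([means[r] for r in region]) + rng.normal(scale=0.1, size=400)

# Drop "North" as the reference category; each coefficient is a difference in means.
X = np.column_stack([np.ones(len(y))] +
                    [(region == c).astype(float) for c in ["South", "East", "West"]])
coef = np.linalg.lstsq(X, y, rcond=None)[0]
# coef[0] is North's mean; coef[1] is South minus North (about +2), and so on.

# The trap: with all four dummies, the dummy columns sum to the intercept column,
# so the matrix is rank-deficient and OLS has no unique solution.
X_trap = np.column_stack([np.ones(len(y))] +
                         [(region == c).astype(float) for c in cats])
rank = np.linalg.matrix_rank(X_trap)   # only 4 columns' worth of information, not 5
```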
- Nonlinearities
Polynomial Models When relationships are curvy, we use terms like \(X^2\) or \(X^3\).
- Interpretation: Individual coefficients are meaningless. You must calculate the marginal effect using the derivative: \(\frac{\partial Y}{\partial X} = \beta_1 + 2\beta_2X + 3\beta_3X^2\). This shows that the effect of X depends on its current value.
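The derivative rule is easy to see with made-up quadratic coefficients (\(\beta_1 = 2\), \(\beta_2 = -0.5\), chosen only for illustration):

```python
# Hypothetical quadratic fit: Y = 4 + 2*X - 0.5*X^2, so b1 = 2, b2 = -0.5.
b1, b2 = 2.0, -0.5

def marginal_effect(x):
    # dY/dX = b1 + 2*b2*x: the effect of X depends on where you evaluate it
    return b1 + 2 * b2 * x

marginal_effect(1.0)   # 1.0: Y is still rising here
marginal_effect(3.0)   # -1.0: past the peak, Y is now falling
```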
Logarithmic Transformations Logs linearize multiplicative relationships and manage skew.
| Model | Transformation | Interpretation of \(\beta_1\) |
|---|---|---|
| Level-Log | ln(X) | A 1% increase in X is associated with a (\(\beta_1 / 100\))-unit change in Y. |
| Log-Level | ln(Y) | A 1-unit increase in X is associated with a (\(\beta_1 \times 100\))% change in Y. |
| Log-Log | ln(X) and ln(Y) | A 1% increase in X is associated with a \(\beta_1\)% change in Y. |
Note: The interpretations above are approximations based on \(1+c \approx e^c\). They break down for changes larger than 10%.
Alternative Transformations
- Inverse Hyperbolic Sine (asinh): Recommended over log(x+1) for skewed data with zeros. It behaves like a log for large values but is defined at zero.
- Winsorizing: “Squishing” the ends of a distribution to limit the influence of outliers.
- Standardizing (z-scores): Centering variables at mean 0 with SD 1 to interpret effects in “standard deviation units.”
- Log(x+1): Simply adding 1 to the variable before taking the log to artificially handle zeros. The author refers to this as a “kludge” that ruins the nice properties of the logarithm, and strongly recommends using the Inverse Hyperbolic Sine (asinh) instead.
- Square Root: Like the logarithm, it squishes down big positive outliers, but it naturally accepts zeros. However, it doesn’t compress extreme values nearly as much as a log does (e.g., \(\sqrt{10000} = 100\), whereas \(\ln(10000) \approx 9.2\)), and you lose the nice percentage-change interpretation.
- Discontinuous Transformations (Categorical Indicators): Used when the effect of a continuous variable shifts sharply at a specific, meaningful threshold rather than changing smoothly. For example, you might transform “Age” into a binary “Above 21 / Below 21” indicator when studying drunk driving.
- Bunching Indicators: Used when a continuous variable has a massive cluster of observations exactly at a single “corner value” (e.g., a sample where many people work exactly 0 hours). You add a binary indicator specifically for being at that exact bunching point to account for the fact that this group might be fundamentally different from the rest of the sample.
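The asinh-versus-log-versus-square-root comparison above is quick to verify with the standard library:

```python
import math

# asinh(x) = ln(x + sqrt(x^2 + 1)): log-like for big x, but defined at zero
math.asinh(0)        # 0.0, where a plain log would blow up
math.log(10_000)     # about 9.21
math.asinh(10_000)   # about 9.90, roughly log(x) + log(2) for large x
math.sqrt(10_000)    # 100.0: squishes outliers far less than a log does
```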
- Interaction Term (Moderation)
Definition: Used when the effect of X on Y depends on another variable Z. Math: \(Y = \beta_0 + \beta_1 X + \beta_2 Z + \beta_3 (X \times Z) + \epsilon\). Effect of X: \(\text{Effect} = \beta_1 + \beta_3 Z\).
The “Fishing” Warning Interactions are “noisy” and require significantly more power—a rule of thumb is that you need 16 times the sample size to estimate an interaction with the same precision as a main effect. Researchers often “fish” for interactions (p-hacking) only after failing to find a main effect. Be skeptical of sub-group effects that weren’t hypothesized in advance.
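With invented coefficients (\(\beta_1 = 0.5\), \(\beta_3 = 0.2\), chosen for illustration), the “effect depends on Z” formula looks like this:

```python
# Hypothetical fitted model: Y = b0 + b1*X + b2*Z + b3*(X*Z)
b1, b3 = 0.5, 0.2

def effect_of_x(z):
    # With an interaction, "the" effect of X is a function of Z
    return b1 + b3 * z

effect_of_x(0)   # 0.5: the effect when Z = 0 (b1 alone)
effect_of_x(5)   # 1.5: three times larger when Z = 5
```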
- Nonlinear Regression
The Problem with OLS for Binary Outcomes Using OLS for binary outcomes is called the Linear Probability Model (LPM). It is flawed because it can predict probabilities >1 or <0, and assumes a constant effect (no matter how close you are to the “ceiling” of 100%).
Probit and Logit These models use Link Functions to ensure predictions stay between 0 and 1. While LPM is “wrong,” Probit and Logit have a weakness of their own: if heteroskedasticity is present, their coefficient estimates themselves become inconsistent, a problem that robust standard errors do not fix.
Average Marginal Effects (AME) Logit/Probit coefficients are in “index units” (logits/probits), which are unreadable. We use Average Marginal Effects, which calculate the change in probability for every individual in the sample and then average those changes. This is superior to “Marginal Effects at the Mean,” which describes a non-existent “average” person.
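A hand-rolled sketch of the AME-versus-at-the-mean distinction, using made-up logit coefficients (in practice your regression package computes this for you):

```python
import numpy as np

# Made-up logit coefficients; x is a standard-normal regressor.
rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
b0, b1 = -1.0, 0.8
p = 1 / (1 + np.exp(-(b0 + b1 * x)))   # each person's predicted probability

# AME: compute dP/dx = b1 * p * (1 - p) for everyone, then average the changes
ame = float(np.mean(b1 * p * (1 - p)))

# "At the mean" instead: plug in one fictional average person (x is mean-zero here)
p_bar = 1 / (1 + np.exp(-b0))
mem = b1 * p_bar * (1 - p_bar)
```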
- Standard Errors
Standard OLS assumes errors are Independent and Identically Distributed (IID). If they aren’t, your SEs are wrong.
- Heteroskedasticity: Variance of \(\epsilon\) changes with X. Use Robust Standard Errors (specifically HC3).
- Autocorrelation: Errors are correlated over time. Use Newey-West (HAC) errors.
- Spatial Correlation: Errors are correlated by geography. Use Conley Spatial Standard Errors.
- Clustered SEs: Errors are correlated within groups (e.g., students in a school). The rule of thumb: Cluster at the level of treatment.
- The Bootstrap: Resampling your data with replacement thousands of times to estimate the sampling distribution of an estimator when the math is too complex for a formula.
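A minimal bootstrap sketch for an estimator where a textbook formula happens to exist (the mean of a skewed sample, with invented data), so the two answers can be compared:

```python
import numpy as np

# Bootstrap SE of a sample mean, checked against the analytic formula.
rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=500)   # a skewed, invented sample

# Resample with replacement many times and re-estimate each time
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(2_000)
])
boot_se = boot_means.std()                    # spread of the re-estimates
analytic_se = data.std() / np.sqrt(len(data)) # classic SE of a mean
```

For a sample mean the two agree closely; the bootstrap earns its keep when no such formula exists.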
- Penalized Regression (LASSO & Ridge)
When you map out a Data Generating Process, you might find yourself with an overwhelming number of potential control variables. If you include dozens or hundreds of controls in a standard OLS regression, it creates a collinearity nightmare and causes “overfitting”—where the model bends to fit the noise in your specific sample rather than the true underlying relationship. Penalized regression solves this by acting as an automatic model selection procedure.
How it Works Instead of just minimizing the sum of squared residuals like standard OLS, penalized regression tries to achieve two goals at once: it minimizes the residuals and it applies a mathematical penalty based on the size of the regression coefficients (\(\beta\)). This forces the model to work on a “coefficient budget,” meaning it will only give a variable a large coefficient if it significantly improves the model’s predictions. The severity of this penalty is controlled by a tuning parameter (\(\lambda\)), which is typically chosen using cross-validation.
The Two Main Types
- LASSO (L1-norm): This method penalizes the sum of the absolute values of the coefficients. LASSO is highly popular because it acts as a “selection operator”—it doesn’t just shrink coefficients; it forces the least useful ones to exactly zero, effectively dropping those variables from the model entirely.
- Ridge Regression (L2-norm): This method penalizes the sum of the squared coefficients. It shrinks coefficients down to reduce noise, but unlike LASSO, it rarely drops them to exactly zero.
Standardizing Before Penalizing Before running a penalized regression, you must standardize all of your variables (subtracting the mean and dividing by the standard deviation). Because the penalty function targets the absolute size of the coefficients, it is highly sensitive to the scale of your variables. A variable measured in tiny units naturally requires a massive coefficient, which means LASSO would unfairly penalize and drop it just because of its scale. Standardizing ensures all variables are on the exact same playing field before the penalty is applied.
Using LASSO for Causal Inference Because LASSO actively shrinks coefficients to prioritize out-of-sample prediction, the coefficient estimates it spits out are inherently biased. If your goal is to find a true causal effect rather than just predict an outcome, you should use LASSO strictly as a variable selector. First, run LASSO to see which control variables it decides to keep. Then, drop the penalty and run a standard, unpenalized OLS regression using only those surviving variables to get your final, unbiased causal estimate.
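A stylized sketch of the select-then-refit workflow. It leans on a simplification: with standardized, (approximately) orthogonal regressors, the LASSO solution reduces to soft-thresholding the one-variable-at-a-time OLS slopes, which lets us skip a real solver; the coefficients and the hand-picked penalty are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000
X = rng.normal(size=(n, 3))   # already standardized, roughly orthogonal
# One strong control, one useless one, one with a tiny effect (all invented):
y = 2.0 * X[:, 0] + 0.0 * X[:, 1] + 0.02 * X[:, 2] + rng.normal(size=n)

def soft_threshold(b, lam):
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

# With orthogonal standardized columns, LASSO = soft-thresholded OLS slopes.
ols_slopes = X.T @ y / (X ** 2).sum(axis=0)
lasso = soft_threshold(ols_slopes, 0.15)   # penalty picked by hand, not by CV
kept = np.nonzero(lasso)[0]                # LASSO as a variable selector

# Post-LASSO: drop the penalty and refit plain OLS on the survivors.
post = np.linalg.lstsq(X[:, kept], y, rcond=None)[0]
# lasso[0] is shrunk (about 1.85); post[0] is back near the true 2.0
```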
- Advanced Regression Concerns
- Sample Weights (WLS): You must use weights when your sample is not perfectly drawn from the population or when using aggregated data.
- Inverse sample probability weights: Used to correct for non-random sampling by weighting individuals by the inverse of the probability that they were included in the sample, heavily weighting those who were less likely to be surveyed.
- Frequency weights: Used when you have aggregated data (like classroom averages). It counts each aggregate observation by the number of individual observations it represents.
- Inverse variance weights: Used to weight observations (like aggregated means or effects from meta-analyses) by their precision, giving higher weight to estimates that have a lower variance.
- Collinearity: Occurs when predictor variables in your model are highly correlated with each other, which makes estimates noisy and severely inflates your standard errors.
- This is tested using a Variance Inflation Factor (VIF). As a rough rule of thumb, a VIF above 10 typically indicates problematic collinearity.
- If you have many variables measuring the exact same underlying concept, a common solution to avoid collinearity is using dimension reduction techniques (like latent factor analysis or principal components analysis) to distill them down to their shared parts.
- Measurement Error: This occurs when you use an imprecise measurement or a proxy for the true “latent” variable you actually want in your model.
- Classical measurement error occurs when the error is completely random. If this happens to your independent variable, it will systematically shrink your coefficient estimate toward zero (often called attenuation bias).
- Non-classical measurement error occurs when the error is related to the true value (e.g., low-income businesses underreporting income more than high-income ones to avoid taxes). This is highly unpredictable and can bias your results in any direction.
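Attenuation bias is easy to demonstrate by simulation (all numbers invented): when the noise variance equals the signal variance, the slope shrinks by the factor Var(x) / (Var(x) + Var(noise)) = 1/2.

```python
import numpy as np

# Classical measurement error on X attenuates the slope toward zero.
rng = np.random.default_rng(0)
n = 100_000
x_true = rng.normal(size=n)
y = 3.0 * x_true + rng.normal(size=n)   # true slope: 3.0
x_noisy = x_true + rng.normal(size=n)   # noise variance equals signal variance

clean = np.polyfit(x_true, y, 1)[0]       # about 3.0
attenuated = np.polyfit(x_noisy, y, 1)[0] # about 1.5 = 3.0 * 1/(1+1)
```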
- Important Details and Nuances
- Regression Equation Subscripts: Regression equations frequently use subscripts (like \(i\) for individuals, \(t\) for time periods, or \(g\) for groups) to explicitly show the dimensions across which the data varies.
- Root Mean Squared Error (RMSE): Often included in regression tables alongside \(R^2\), the RMSE (or residual standard error) estimates the standard deviation of the error term, showing the average prediction error of the model.
- Polynomial Best Practices: As a rule of thumb, you should almost never go beyond a cubic polynomial (\(X^3\)), because higher orders bend to fit random noise and cause overfitting. You can graphically check if you have included enough polynomial terms by plotting the model’s residuals against the \(X\) variable to ensure there is no obvious shape left over.
- Warnings on Statistical Significance:
- An insignificant result does not mean the effect is zero; it just means you don’t have the evidence to prove it isn’t zero.
- Never “p-hack” by repeatedly changing your analysis just to find a statistically significant result; this renders your tests incorrect and meaningless.
- Statistical significance is not the same as practical significance; a result can be statistically significant but too tiny to matter in the real world.
14. Matching
1. The Core Concept: Weighted Averages
The ultimate goal of matching is to generate a set of weights for your observations. By assigning different weights to different observations, you force the treated and control groups to be comparable. For example, if your treated group is 80% men and 20% women, but your raw control group is 50/50, matching will assign higher weights to the men in the control group so that the weighted control group is also an 80/20 split. Once the groups are balanced, you simply compare the weighted mean of their outcomes to find the treatment effect.
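A toy version of the 80/20 example above. The outcome values (5.0 for men, 7.0 for women) are invented; the weights are simply target share divided by current share, and they push the control group's composition to match the treated group's.

```python
# Reweight a 50/50 control group to match an 80/20 treated group.
controls = [("M", 5.0)] * 50 + [("F", 7.0)] * 50   # (sex, outcome), invented
w = {"M": 0.8 / 0.5, "F": 0.2 / 0.5}               # target share / current share

total_w = sum(w[sex] for sex, _ in controls)
male_share = sum(w[sex] for sex, _ in controls if sex == "M") / total_w
weighted_mean = sum(w[sex] * y for sex, y in controls) / total_w
# male_share is now 0.8, matching the treated group; weighted_mean is 5.4
```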
2. Key Decisions in Matching
When designing a matching procedure, a researcher must make several choices about what makes two observations “similar”:
- The Matching Criterion: You can use distance matching (finding observations with similar values of the matching variables) or propensity score matching (finding observations that had an equal probability of receiving the treatment).
- Selecting vs. Weighting: You can either strictly select matches (an observation is either “in” and gets a weight of 1, or “out” and gets a weight of 0) or construct a weighted sample (every control observation gets a weight based on how close it is to a treated observation).
- Number of Matches: If selecting matches, you must decide how many controls to match to each treated observation. You can use one-to-one matching, k-nearest-neighbor matching (picking the top k matches), or radius matching (picking all matches within a certain distance). More matches decrease variance (noise) but increase bias (because you are letting in worse matches).
- Replacement: You must decide whether to match with replacement (a single control observation can be matched to multiple treated observations) or without replacement. Matching with replacement reduces bias by ensuring the best matches are always used, but increases variance.
- Calculating Weights (If Constructing a Weighted Sample): If you choose the weighted sample approach instead of strictly selecting matches, you must determine how the weights will be generated and how they will decay with distance. There are two main methods for doing this:
- Kernel-Based Matching: Used primarily with distance matching, this method applies a mathematical kernel function (such as the Epanechnikov kernel) to produce the weights. It assigns the highest possible weight to matches with zero difference, and the weight smoothly declines as the distance between the matching variables grows. Bad matches eventually receive a weight of zero.
- Inverse Probability Weighting (IPW): Used specifically with propensity scores, this approach does not calculate weights based on distance. Instead, it weights each observation by the inverse of the probability that it received the exact treatment it actually got. For example, a treated observation with an estimated 60% probability of treatment (.6 propensity score) receives a weight of 1/.6, while an untreated observation with that same .6 propensity score receives a weight of 1/(1-.6).
- The Worst Acceptable Match: You must define a caliper or bandwidth, which is the maximum distance a match can be before it is rejected. If a treated observation has no controls within the caliper, it is dropped from the data.
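The two weight-generating schemes above can be sketched as small functions. The Epanechnikov kernel formula is standard, but the bandwidth and propensity values here are illustrative:

```python
# Kernel-based weight: highest at zero distance, smoothly declining,
# exactly zero once the distance exceeds the bandwidth.
def epanechnikov(distance, bandwidth):
    u = distance / bandwidth
    return 0.75 * (1 - u ** 2) if abs(u) <= 1 else 0.0

# Inverse probability weight: 1/p for treated observations,
# 1/(1 - p) for untreated ones, where p is the propensity score.
def ipw_weight(treated, propensity):
    return 1 / propensity if treated else 1 / (1 - propensity)
```

So a treated observation with a .6 propensity score gets weight 1/.6 ≈ 1.67, and an untreated observation with the same score gets 1/.4 = 2.5, matching the example in the text.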
3. Handling Multiple Variables
Matching on one variable is easy, but matching on dozens of variables introduces the curse of dimensionality—the more variables you add, the harder it becomes to find any two observations that are similar on all of them. Chapter 14 details several ways to aggregate multiple variables into a single matching measure:
- Mahalanobis Distance: This method standardizes all matching variables and divides out their covariance, ensuring that highly correlated variables (like “beard length” and “score on a masculinity index”) don’t accidentally double-count the same underlying trait (like “being male”).
- Coarsened Exact Matching (CEM): This method places continuous variables into “bins” (e.g., grouping ages 20-29 together) and only accepts matches that are exactly the same across all bins. It guarantees perfect balance but often forces researchers to drop massive amounts of data if exact matches cannot be found.
- Entropy Balancing: Instead of aggregating variables, this method mathematically forces the control group to match the treatment group’s descriptive statistics (like means and variances) and relies on a computer algorithm to search for the exact weights that satisfy those restrictions.
- Propensity Scores & Inverse Probability Weighting (IPW): A propensity score summarizes all matching variables into a single number: the estimated probability that an observation would get treated. With IPW, you weight each observation by the inverse of the probability that it received the treatment it actually received. This gives massive weights to control observations that look like they should have been treated, making the control group mimic the treatment group.
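To make the Mahalanobis idea concrete, here is a minimal two-variable version that inverts the 2x2 covariance matrix by hand (purely illustrative; real implementations handle any number of variables):

```python
import math

# Mahalanobis distance between points a and b given a 2x2 covariance
# matrix [[s11, s12], [s12, s22]]. Dividing out the covariance keeps
# correlated variables from double-counting the same underlying trait.
def mahalanobis2(a, b, cov):
    (s11, s12), (_, s22) = cov
    det = s11 * s22 - s12 * s12
    inv00, inv01, inv11 = s22 / det, -s12 / det, s11 / det
    d0, d1 = a[0] - b[0], a[1] - b[1]
    q = d0 * inv00 * d0 + 2 * d0 * inv01 * d1 + d1 * inv11 * d1
    return math.sqrt(q)
```

With an identity covariance this reduces to ordinary Euclidean distance; with strong positive correlation, a gap along the correlated direction counts for less, since it reflects one trait rather than two.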
4. Assumptions and Checking Your Work
Matching does not magically identify causal effects without assumptions. It relies heavily on:
- Conditional Independence: The assumption that the variables you matched on are sufficient to close all back doors.
- Common Support: The assumption that there is substantial overlap between the treated and control groups. If there are treated observations with traits that never appear in the control group, you lack common support and must trim or drop those observations.
- Balance: The assumption that your matching procedure actually worked. Researchers test this using a balance table (checking if the difference-in-means between treated and control groups is zero after matching) and by plotting the overlaid distributions of the variables to ensure their shapes match perfectly.
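A balance table boils down to comparing (weighted) means across groups after matching; a toy sketch with invented ages and weights:

```python
from statistics import mean

# Toy balance check: difference in means of a matching variable between
# the treated group and the weighted control group. All numbers invented.
treated_age = [30, 35, 40, 45]
control_age = [30, 36, 39, 45]
control_wt = [1.0, 1.0, 1.0, 1.0]  # weights produced by the matching step

weighted_control_mean = (
    sum(w * a for w, a in zip(control_wt, control_age)) / sum(control_wt)
)
diff = mean(treated_age) - weighted_control_mean
# A well-balanced match should leave this difference near zero.
```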
5. Estimation and “Doubly Robust” Methods
Once you have your matched weights, you can simply compare the weighted means of the outcome variable. However, calculating standard errors is complex because the standard errors must account for the uncertainty introduced during the matching process itself. Bootstrap standard errors are highly recommended if you used a weighting approach, though they fail if you used a strict “selecting matches” approach.
Matching can also be combined with regression. In a Doubly Robust Estimation, you use matching to generate inverse probability weights, and then apply those weights while running a regression. It is “doubly robust” because it gives you two chances to be right: the causal effect is properly identified as long as either your matching model or your regression model is correctly specified.
6. Choosing Your Treatment Effect
One of the greatest advantages of matching over regression is how easily it lets you choose your treatment effect average.
- Average Treatment on the Treated (ATT): Make the control group look like the treated group.
- Average Treatment on the Untreated (ATUT): Reverse the process and make the treated group look like the control group.
- Average Treatment Effect (ATE): Weight both groups to look like the overall population.
15. Simulation
Chapter 15, titled “Simulation,” introduces a highly practical tool for researchers. While theoretical statisticians rely on complex mathematical proofs to demonstrate why an estimator works, applied researchers can use computer simulations to build a fake world, run their estimator, and visually verify if it actually gets the right answer.
Because you get to play the role of the creator in a simulation, you know exactly what the “true” causal effect is, allowing you to test whether your statistical tools can successfully find it.
1. The Anatomy of a Simulation
Running a simulation requires writing code (in R, Stata, or Python) to execute a repetitive loop. The basic anatomy includes:
- Creating Data: You use random number generators to create data that perfectly mimics your assumed Data Generating Process (DGP).
- Estimation: You run your statistical model (like a regression) on that fake data and store the result (like a coefficient or a standard error).
- Iteration: You use a “for loop” to repeat this exact process hundreds or thousands of times, generating a new fake dataset and a new estimate each time.
- Evaluation: You look at the distribution of all your saved estimates. The mean of this distribution tells you if your estimator is biased (does it match the “true” effect you baked into the data?), and the standard deviation of this distribution approximates your estimator’s standard error.
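The four-step anatomy above can be written as a short loop. This sketch bakes a true effect of 2 into an invented DGP and checks that OLS-by-hand recovers it:

```python
import random
from statistics import mean, stdev

# Minimal simulation: create data, estimate, iterate, evaluate.
# The DGP (true effect 2, standard normal noise) is invented.
random.seed(1)
TRUE_EFFECT = 2.0
estimates = []
for _ in range(500):                                  # iteration
    x = [random.gauss(0, 1) for _ in range(100)]      # creating data
    y = [TRUE_EFFECT * xi + random.gauss(0, 1) for xi in x]
    xbar, ybar = mean(x), mean(y)                     # estimation (OLS slope)
    beta = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
            / sum((xi - xbar) ** 2 for xi in x))
    estimates.append(beta)

bias = mean(estimates) - TRUE_EFFECT                  # evaluation
spread = stdev(estimates)  # approximates the estimator's standard error
```

With this DGP the bias comes out near zero and the spread near 0.1, which is what the textbook standard-error formula predicts for a sample of 100.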
2. Why We Simulate
Beyond just proving that standard methods work, simulation is the primary tool researchers reach for to solve complex problems:
- Testing and Comparing Estimators: You can invent a brand-new estimator and test if it works, or you can run a “horse race” between two different methods to see which one gives you a more precise or less biased result on the same data.
- Breaking Things (Sensitivity Analysis): What happens if your assumptions are wrong? You can intentionally build a fake dataset that violates a core assumption (like leaving a back door open) and turn the “strength” of that violation up and down. This tells you exactly how bad an assumption violation has to be before it completely ruins your results.
- Power Analysis: Statistical power is your ability to successfully detect an effect that is actually there (the true-positive rate). By simulating data with a specific effect size, you can figure out the minimum sample size you need to reliably detect it, or conversely, the minimum detectable effect you can hope to find given the sample size you already have.
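A power analysis by simulation follows directly from the loop structure above: simulate a known effect, test for it repeatedly, and count the detection rate. Effect size, noise, and sample size here are all made-up illustrations:

```python
import random
from statistics import mean

# Power-analysis sketch: how often does a naive t-test on the OLS slope
# (|t| > 1.96) detect a true effect of 0.2 with n = 200?
random.seed(2)

def detect_once(n, effect):
    x = [random.gauss(0, 1) for _ in range(n)]
    y = [effect * xi + random.gauss(0, 1) for xi in x]
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    beta = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    resid = [yi - ybar - beta * (xi - xbar) for xi, yi in zip(x, y)]
    se = (sum(r * r for r in resid) / (n - 2) / sxx) ** 0.5
    return abs(beta / se) > 1.96

power = mean(detect_once(200, 0.2) for _ in range(300))
```

Rerunning this with different `n` values until `power` clears a target (commonly 0.8) gives the minimum sample size; holding `n` fixed and varying `effect` gives the minimum detectable effect.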
3. The Bootstrap (Simulating with Existing Data)
The chapter finishes by detailing a very special kind of simulation: the bootstrap. While standard simulation requires you to invent the DGP to create fake data, the bootstrap uses your actual, existing dataset.
By randomly resampling your own data with replacement thousands of times, you can physically build a sampling distribution. Because you didn’t create the data, the bootstrap can’t tell you the “true” causal effect, but it is effectively magic for calculating standard errors when standard mathematical formulas fail.
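The resampling procedure is short enough to sketch in full; the "observed" data here is simulated as a stand-in for a real dataset:

```python
import random
from statistics import mean, stdev

# Bootstrap sketch: resample the observed data with replacement many
# times; the spread of the resampled statistics is the standard error.
random.seed(3)
data = [random.gauss(10, 2) for _ in range(50)]  # stand-in for real data

boot_means = []
for _ in range(1000):
    resample = random.choices(data, k=len(data))  # with replacement
    boot_means.append(mean(resample))

bootstrap_se = stdev(boot_means)
# For a simple mean this lands close to the textbook formula
# stdev(data) / sqrt(n); the bootstrap shines when no such formula exists.
```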
The chapter also details a few advanced variants of the bootstrap for complex data structures:
- Cluster/Block Bootstrap: Used when observations are grouped together (like students in a classroom) or when tracking the same person over time (panel/time-series data). Instead of resampling individual rows of data, it resamples entire clusters or blocks of time to preserve their internal correlations.
- Wild Bootstrap: A method that resamples and randomly multiplies the residuals of a model rather than the observations. It is highly popular because it works exceptionally well for fixing standard errors when you have heteroskedasticity or a very small number of clusters.
16. Fixed Effects
Chapter 16, titled “Fixed Effects,” introduces a powerful research design that allows you to control for variables you cannot even measure or see, as long as those variables remain constant over time.
Up until now, the book has focused on identifying causal effects by explicitly measuring and controlling for variables on back doors. But in the real world, you will inevitably run into unobserved confounders (like a person’s upbringing, genetic traits, or innate ability) that you cannot measure. Fixed effects provides a mathematical shortcut to control for all of them at once.
1. The Core Concept: Within vs. Between Variation
Data containing multiple observations of the same individual (or country, company, etc.) over time contains two types of variation:
- Between Variation: How individuals differ from each other on average. For example, the fact that you are generally taller than your friend is between variation.
- Within Variation: How an individual changes relative to their own average over time. For example, the fact that you are taller today than you were five years ago is within variation.
Fixed effects operates by mathematically throwing away all the between variation. By strictly isolating within variation, you effectively control for everything about an individual that does not change over time, whether you measured it or not. If a person’s geographic birthplace never changes, then looking only at how their life changes relative to their own baseline means their birthplace is held completely constant (controlled for).
2. How It Works Mechanically
Statistically, a fixed effects model is just an Ordinary Least Squares (OLS) regression where the intercept (\(\beta_0\)) is allowed to vary for each individual. Instead of fitting one single line for the whole dataset, it fits parallel lines for each individual, moving the line up or down to match each person’s unique baseline.
Researchers usually estimate this in one of two ways:
- Binary Controls: Adding a separate binary control variable (a dummy variable) for every single individual in the dataset.
- De-meaning (Absorbing): Calculating the average of \(X\) and \(Y\) for each individual, subtracting that mean from their actual \(X\) and \(Y\) values, and running a regression on those residuals.
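The de-meaning approach can be sketched directly. This toy panel (two invented people, three periods) has a within effect of exactly 2, even though the two people's averages would suggest a negative relationship:

```python
from statistics import mean

# De-meaning sketch: subtract each person's own averages of X and Y,
# leaving only within variation, then run OLS on the residuals.
panel = {
    "ann": {"x": [1, 2, 3], "y": [10, 12, 14]},
    "bob": {"x": [5, 6, 7], "y": [3, 5, 7]},
}

x_within, y_within = [], []
for person in panel.values():
    xm, ym = mean(person["x"]), mean(person["y"])
    x_within += [xi - xm for xi in person["x"]]
    y_within += [yi - ym for yi in person["y"]]

# OLS slope through the de-meaned data: the pure within effect (2 here).
beta_within = (sum(x * y for x, y in zip(x_within, y_within))
               / sum(x * x for x in x_within))
```

A pooled regression on the raw data would be dragged toward the negative between relationship (bob has higher x but lower y on average); throwing away that between variation is exactly what fixed effects does.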
3. Interpreting the Results
Because fixed effects removes between-group variation, you must change how you interpret the regression coefficient. You cannot say “an individual with one more unit of \(X\) has \(\beta_1\) more units of \(Y\).” Instead, the interpretation must be explicitly relative to the group’s own average: “For a given individual, when \(X\) is one unit higher than it typically is for that individual, we expect \(Y\) to be \(\beta_1\) higher than it typically is for that individual.”
4. Two-Way Fixed Effects If you have data spanning multiple time periods, you can include multiple sets of fixed effects at the same time. The most common approach is Two-Way Fixed Effects, which includes fixed effects for individual and fixed effects for year.
- The individual fixed effects control for everything constant about a person across all time.
- The year fixed effects control for everything constant about a time period across all people (e.g., a nationwide recession that affected everyone equally in 2009).
5. Fixed Effects vs. Random Effects
The chapter contrasts Fixed Effects with a related method called Random Effects.
- Unlike fixed effects, random effects assumes the individual intercepts come from a specific mathematical distribution (like a normal distribution).
- While random effects provides tighter standard errors and lets you measure between-variation, it relies on a fatal assumption for causal inference: it assumes that the unobserved individual traits (\(\beta_i\)) are completely unrelated to your treatment variable (\(X\)). Because social science variables are almost always related, economists generally prefer fixed effects.
- However, modern statisticians often use Multi-level models (Hierarchical models) to get the best of both worlds, explicitly separating the “within” effects from the “between” effects while modeling the individual traits directly.
6. Advanced Pro-Tips
The chapter concludes with a few warnings for using fixed effects in practice:
- Clustered Standard Errors: Because you have multiple observations per individual, their error terms are likely correlated. You should generally cluster your standard errors at the level of your fixed effect, assuming there is treatment effect heterogeneity.
- Nonlinear Models: Standard fixed effects methods fail catastrophically if you try to use them with nonlinear models like Logit or Probit (a phenomenon called the “incidental parameters problem”). If you have a binary outcome variable, you either have to use a Linear Probability Model (OLS) anyway, or seek out specialized conditional logit software packages.
17. Event Studies
Chapter 17, titled “Event Studies,” explores what is likely the oldest and most intuitive causal inference design in existence. The core idea is simple: at a specific moment in time, an event occurs that turns a treatment from “off” to “on.” By comparing what happens before the event to what happens after, you can estimate the causal effect of the treatment.
However, because the real world is messy, a simple before-and-after comparison is rarely enough.
The Core Challenge: Time as a Back Door
If you look at a causal diagram for an event study, the primary back door is Time. Things naturally change over time. If a company introduces a new product and sales go up a month later, was it the product, or was it just the start of the holiday shopping season?
To close this back door, you cannot simply look at the raw data after the event. You must calculate a counterfactual prediction—an estimate of what would have happened if the event had never occurred. The causal effect is the deviation between what actually happened and your prediction of what would have happened.
Because time is the main confounding variable, event studies are generally intended for a short horizon (a short post-event window). The further out you get from the event, the more likely it is that other unrelated time factors will creep in and ruin your prediction.
Approaches to Predicting the Counterfactual
There are three main conceptual ways to figure out what the world would have looked like without the event:
- Ignore It: You can simply assume that without the event, the outcome would have stayed exactly the same as it was before. This only works if your pre-event data is incredibly flat/stable, or if you are looking at an extremely tiny window of time (like second-by-second stock prices) where normal trends don’t have time to matter.
- Predict After Using Before: You use the pre-event data to identify a trend (e.g., sales were steadily growing at 2% a month), and you extrapolate that exact same trend into the post-event period to serve as your baseline.
- Predict After Using After: You look at how your outcome relates to other variables before the event. For example, if your company’s stock usually rises by 0.5% every time the overall market rises by 1%, you can look at the actual market performance after the event to predict what your stock should have done, and then subtract that from the stock’s actual performance.
How Event Studies Are Performed
The chapter details two main statistical approaches for running an event study:
1. The Finance Approach (Abnormal Returns)
Event studies are massively popular in finance because stock prices react immediately to new information, making it easy to pinpoint an event and use a short horizon. The steps are:
- Use an “estimation period” before the event to build a prediction model. This could be a means-adjusted model (average past returns), a market-adjusted model (using the overall market return), or a risk-adjusted model (using a regression to see how the stock normally moves with the market).
- Subtract the predicted return from the actual return during the “observation period” to get the Abnormal Return (AR).
- If the AR spikes after the event, that represents the causal effect.
2. The Regression Approach (Segmented Regression)
If your event causes a long-lasting change or a shift in a trend, you can use a segmented regression. You regress your outcome on the time period, a binary indicator for being “After” the event, and an interaction term between the two (\(Time \times After\)).
- The coefficient on the “After” variable tells you if there was an immediate jump in the outcome right when the event happened.
- The coefficient on the interaction term tells you if the slope of the trend changed after the event (e.g., sales were flat, but after the event they started growing steadily).
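One way to see what the two coefficients capture is to fit the pre- and post-event lines separately: the intercept gap at the event time is the “After” jump, and the slope gap is the interaction term. The data here is invented (flat sales of 100, then a jump of 10 and growth of 3 per period):

```python
from statistics import mean

# Simple OLS line fit: returns (intercept, slope).
def ols(xs, ys):
    xbar, ybar = mean(xs), mean(ys)
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    return ybar - b * xbar, b

pre_t, pre_y = [0, 1, 2, 3], [100, 100, 100, 100]
post_t, post_y = [4, 5, 6, 7], [110, 113, 116, 119]

(a0, b0), (a1, b1) = ols(pre_t, pre_y), ols(post_t, post_y)
event_time = 4
jump = (a1 + b1 * event_time) - (a0 + b0 * event_time)  # immediate jump
slope_change = b1 - b0                                   # change in trend
```

Here `jump` recovers the built-in level shift of 10 and `slope_change` the built-in trend change of 3, matching what the “After” and interaction coefficients would report.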
Advanced Considerations for the Pros
- Multiple Affected Groups: If the event affects many groups at once, you can average their data together, run separate event studies for each, or use a regression model with fixed effects for the groups and time periods. A popular method sets the time period just before the event as the baseline (reference category) so you can easily see the effect spike in the periods following it.
- Autocorrelation: Time-series data is notoriously “sticky”—what happens today is highly dependent on what happened yesterday. If you don’t adjust your standard errors for autocorrelation, you are highly likely to find a statistically significant effect even when there isn’t one. Advanced forecasters use models like ARMA or ARIMA to account for these moving averages and autoregressive traits.
- The Joint-Test Problem & Placebo Tests: Any event study is technically testing two things at once: the actual effect of the event, and whether your prediction model was correct. If your prediction model is wrong, your estimated effect will be wrong. To test your model, researchers use placebo tests: they run the exact same event study on unaffected groups, or on random fake dates. If the model detects a “significant effect” on a fake date where nothing happened, you know your underlying prediction model is flawed.
18. Difference-in-Differences
Chapter 18, titled “Difference-in-Differences,” covers one of the most popular and historically significant research designs in causal inference. While Event Studies (Chapter 17) look at how a single group changes from “before” to “after” a treatment, Difference-in-Differences (DID) adds a crucial component: an untreated comparison group.
By bringing in a group that never receives the treatment, DID solves the main problem of an event study—the fact that things naturally change over time anyway.
1. The Core Concept: Comparing Within Variation Across Groups
DID identifies a causal effect by mathematically closing two back doors at once: group differences and time trends.
- First, it isolates the within variation for both the treated and untreated groups, meaning it compares each group to itself over time. This automatically controls for any fixed differences between the groups.
- Second, it compares the treated group’s within variation to the untreated group’s within variation. The change in the untreated group acts as a baseline representing how much change we would have expected in the treated group if no treatment had occurred.
- The actual DID calculation is simply: (Treated After − Treated Before) − (Untreated After − Untreated Before). By doing this, it isolates the Average Treatment on the Treated (ATT)—the effect of the treatment specifically for the groups that actually received it.
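The 2x2 calculation is simple enough to verify by hand; group means here are made up:

```python
# The two-by-two difference-in-differences arithmetic, with invented
# group means: the treated group rose by 12, the untreated by 7.
treated_before, treated_after = 20.0, 32.0
untreated_before, untreated_after = 18.0, 25.0

did = (treated_after - treated_before) - (untreated_after - untreated_before)
# Under parallel trends, the untreated group's change of 7 is what the
# treated group would have done anyway, leaving an ATT estimate of 5.
```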
2. The Crucial Assumption: Parallel Trends
For DID to work, you must be able to trust your untreated group. This relies on the parallel trends assumption: If no treatment had occurred, the gap between the treated and untreated groups would have remained constant in the post-treatment period.
Because this assumption relies on an unobservable counterfactual, you can never prove it is true. However, you can provide suggestive evidence that it is plausible using two main checks:
- Test of Prior Trends: You look at the data before the treatment occurred. If the treated and untreated groups were trending at the exact same rate leading up to the treatment, it is much more plausible that they would have continued to trend together.
- Placebo Tests: You use only the pre-treatment data and pretend the treatment happened at a random, fake date. If your DID model detects an “effect” on a date where nothing actually happened, it tells you that the groups’ trends are naturally diverging, meaning parallel trends is likely violated.
- A warning on functional form: Parallel trends is sensitive to how you measure your data. If parallel trends holds for your raw outcome variable \(Y\), it mathematically cannot hold if you transform that variable into \(\ln(Y)\) (and vice versa).
3. Estimation: Two-Way Fixed Effects and Dynamics
Traditionally, DID is estimated using a Two-Way Fixed Effects (TWFE) regression model: \(Y = \alpha_g + \alpha_t + \beta_1 Treated + \epsilon\). By including fixed effects for the group \((\alpha_g)\) and the time period (\(\alpha_t\)), the coefficient on the \(Treated\) indicator (\(\beta_1\)) gives you the DID estimate.
Researchers often expand this into a Dynamic Treatment Effect (or Event Study DID) model. Instead of just grouping time into one massive “After” period, they interact the treatment with each specific time period. This allows you to see if an effect fades out over time or takes a while to kick in, and it also explicitly plots the pre-treatment periods to visually prove there were no prior trends.
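A dynamic DID boils down to computing one DID per period, each relative to the reference period just before treatment. A toy sketch with invented group means, where treatment turns on between periods 2 and 3:

```python
# Dynamic (event-study) DID sketch: one estimate per period, relative to
# the last pre-treatment period. All group means are invented.
treated = {1: 20, 2: 21, 3: 30, 4: 34}    # treated group mean by period
untreated = {1: 18, 2: 19, 3: 21, 4: 22}  # untreated group mean by period

base = 2  # reference period just before treatment
effects = {
    t: (treated[t] - treated[base]) - (untreated[t] - untreated[base])
    for t in treated if t != base
}
# effects[1] near zero supports parallel trends in the pre-period;
# effects[3] and effects[4] trace out how the effect evolves afterward.
```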
4. Advanced Concerns (How the Pros Do It)
The chapter finishes by covering the massive recent shifts in how econometricians view DID:
- The “Secret Shame” of Rollout Designs: For decades, researchers used the standard TWFE regression when different groups received treatment at different times (staggered rollout). Econometricians recently realized this is mathematically flawed. TWFE accidentally uses already-treated groups as control groups for newly-treated groups, which heavily biases the result if the treatment effect changes over time.
- Modern Rollout Estimators: To fix the staggered rollout problem, researchers now use specialized estimators. The Callaway and Sant’Anna method estimates a separate effect for each specific group-time cohort and carefully averages them together. Other fixes include Wooldridge’s extended TWFE (adding a massive amount of interaction terms for cohorts) and Borusyak et al.’s Imputation estimator (using untreated data to explicitly predict counterfactuals for the treated data).
- Control Variables: Adding controls to a DID regression is dangerous. You should only add them if you believe parallel trends only holds conditionally on those variables (e.g., age or income influences a group’s trajectory). You must never control for variables measured after the treatment, as the treatment may have caused them, opening up collider bias or closing your front doors.
- Triple-Differences (DIDID): If you are worried that parallel trends is violated because of some unmeasured shock, you can find a group that shouldn’t be affected by the policy (like an exempt industry) and run DID on them. By subtracting their fake DID effect from your real DID effect, you “net out” the confounding trends.
19. Instrumental Variables
Chapter 19 covers Instrumental Variables (IV), a research design that takes a fundamentally different approach to identifying causal effects than regression or matching. While those methods “chip away” at bad variation by explicitly controlling for back doors, IV acts more like pouring concrete into a mold by isolating only the specific variation you want.
IV attempts to mimic a randomized controlled experiment using observational data. It works by finding a real-world source of randomness—the instrument (\(Z\))—that strongly affects your treatment (\(X\)) but has absolutely no back doors to the outcome (\(Y\)).
The Mechanics of IV
If you have a treatment that is hopelessly entangled with unmeasurable back doors (like trying to measure the effect of wealth on lifespan), IV bypasses those back doors by looking only at a specific source of wealth. For example, if you use “winning the lottery” as your instrument, you isolate only the variation in wealth driven by the lottery, completely ignoring the variation driven by hard work, inheritance, or crime.
Statistically, IV does this by splitting your treatment into two parts:
- It uses the instrument to predict the treatment.
- It throws away the unexplained part of the treatment (which is contaminated by back doors).
- It looks only at the relationship between the clean, instrument-explained variation in the treatment and the outcome.
The Three Key Assumptions
For an instrumental variable to work, you must be able to justify three major assumptions:
- Relevance: The instrument must actually predict the treatment. If the relationship between the instrument and the treatment is too small, you have a weak instrument. Weak instruments cause your estimates to swing wildly and introduce severe bias. Relevance is tested using a first-stage F-statistic.
- Validity (The Exclusion Restriction): The instrument must have no open back doors of its own, and the only way it can affect the outcome is through the treatment. If there is any other path from the instrument to the outcome, the instrument is invalid. Validity is notoriously hard to prove and relies heavily on domain knowledge and theory.
- Monotonicity: The instrument must push everyone in the sample in the exact same direction (or have no effect on them). If an instrument makes most people more likely to get treated but makes a few people less likely (these people are called “defiers”), the math breaks down and ruins the estimate.
What Does IV Estimate? (LATE)
Because IV entirely throws away the data of people who do not respond to the instrument (the “always-takers” and “never-takers”), it does not give you the Average Treatment Effect (ATE) for the whole population.
Instead, it calculates a Local Average Treatment Effect (LATE). This is a weighted average where an individual’s treatment effect is weighted by how strongly they actually respond to the instrument (the “compliers”). This means that using a different instrument will give you a different treatment effect estimate, because it will pull from a different group of compliers.
Estimation and Advanced Concerns
- Two-Stage Least Squares (2SLS): This is the most common way to estimate IV. The first stage is a regression that uses the instrument to predict the treatment. The second stage uses those predicted values to estimate the effect on the outcome.
- Generalized Method of Moments (GMM): When your model is “overidentified” (you have more instruments than treatment variables), GMM is often preferred over 2SLS because it produces more precise estimates, especially when the data has heteroskedasticity.
- Nonlinear Models: If your treatment or outcome is binary, you cannot simply use a probit or logit model in the first stage of a 2SLS estimation (an error known as the “forbidden regression” because the math fails). Instead, econometricians use specialized approaches like the control function approach, treatment effect regressions, or bivariate probit models.
- Fixing Weak Instruments: If your instrument is weak but you believe it is valid, you can use methods like Limited-Information Maximum Likelihood (LIML) to reduce bias, or report Anderson-Rubin confidence intervals to honestly reflect the uncertainty in your estimate.
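The 2SLS logic can be sketched by hand on simulated data. In this invented DGP an unmeasured confounder u opens a back door from x to y, the instrument z is clean, and the true effect of x on y is 2:

```python
import random
from statistics import mean

# Hand-rolled two-stage least squares. OLS is biased upward by the back
# door through u; the IV estimate using z is not.
random.seed(4)
n = 5000
u = [random.gauss(0, 1) for _ in range(n)]  # unmeasured confounder
z = [random.gauss(0, 1) for _ in range(n)]  # instrument
x = [zi + ui + random.gauss(0, 1) for zi, ui in zip(z, u)]
y = [2 * xi + 3 * ui + random.gauss(0, 1) for xi, ui in zip(x, u)]

def slope(xs, ys):
    xbar, ybar = mean(xs), mean(ys)
    return (sum((a - xbar) * (b - ybar) for a, b in zip(xs, ys))
            / sum((a - xbar) ** 2 for a in xs))

pi = slope(z, x)                 # first stage: predict x from z
x_hat = [pi * zi for zi in z]    # keep only instrument-explained variation
beta_iv = slope(x_hat, y)        # second stage: near the true effect of 2
beta_ols = slope(x, y)           # contaminated: drifts up toward 3
```

(Note: in real applications the second-stage standard errors must be corrected for the fact that x_hat is itself estimated, which is why dedicated 2SLS software is used rather than two literal regressions.)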
20. Regression Discontinuity
Chapter 20 explores Regression Discontinuity (RD), a research design that identifies causal effects in situations where a treatment is assigned discontinuously based on a specific threshold. By comparing observations just barely on either side of this threshold, researchers can reasonably assume that the two groups are practically identical except for the treatment, allowing them to isolate its causal effect.
1. Core Terminology
To understand RD, you must define three key components of the design:
- Running variable (or forcing variable): The variable that determines whether an observation receives treatment.
- Cutoff: The specific threshold value of the running variable that dictates treatment assignment (e.g., a test score of 95, an income of $75,000, or a geographic border).
- Bandwidth: The range of data around the cutoff that the researcher considers comparable enough to use in the estimation.
2. Sharp vs. Fuzzy Regression Discontinuity
- Sharp RD: This occurs when the cutoff perfectly determines treatment, meaning the probability of treatment jumps instantly from 0% to 100% at the line. This is estimated using a local regression (often a linear or low-order polynomial) where the running variable is centered at zero. The treatment effect is simply the jump in the intercepts between the line fit to the left of the cutoff and the line fit to the right.
- Fuzzy RD: This occurs when being on one side of the cutoff only changes the probability of receiving treatment (e.g., passing a retirement age threshold increases the probability of retiring by 30%, but not everyone retires immediately). Fuzzy RD is estimated using Instrumental Variables (IV). Being past the cutoff serves as an instrument for actual treatment, meaning the jump in the outcome is divided (scaled) by the jump in the treatment rate.
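The sharp RD recipe above can be sketched directly: center the running variable at the cutoff, fit a line on each side within the bandwidth, and read off the jump in intercepts. The DGP here is invented (a treatment effect of 5 at a cutoff of 50):

```python
from statistics import mean

# Sharp RD sketch on a deterministic toy DGP:
# outcome = 10 + 1*(running - cutoff), plus 5 if past the cutoff.
cutoff, bandwidth = 50, 10
data = [(r, 10 + (r - cutoff) + (5 if r >= cutoff else 0))
        for r in range(30, 71)]

def fit_line(points):
    xs, ys = zip(*points)
    xbar, ybar = mean(xs), mean(ys)
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    return ybar - b * xbar, b  # (intercept, slope)

# Keep only observations within the bandwidth, centered at the cutoff.
inside = [(r - cutoff, y) for r, y in data if abs(r - cutoff) <= bandwidth]
left = fit_line([(x, y) for x, y in inside if x < 0])
right = fit_line([(x, y) for x, y in inside if x >= 0])
rd_effect = right[0] - left[0]  # jump in intercepts at the cutoff
```

Here `rd_effect` recovers the built-in jump of 5. Real data is noisy, so practice adds kernel weights (heavier near the cutoff) and the automated bandwidth selection discussed below.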
3. What Does RD Estimate?
Because RD throws away data that is far from the cutoff to avoid back doors, it strictly estimates a Local Average Treatment Effect (LATE). It provides the average treatment effect only for those individuals who are right at the margin of the cutoff. While this is highly useful for deciding whether to slightly expand or shrink a policy, it tells you nothing about the treatment effect for people with running variables far away from the cutoff.
4. Key Assumptions and How to Test Them
RD is celebrated because it closes unobservable back doors without needing control variables, but it relies heavily on two main assumptions:
- Smoothness at the Cutoff: You must assume that if the treatment had not occurred, the outcome would not have jumped. If another policy kicks in at the exact same cutoff, or if the populations are fundamentally different right at the cutoff for an unrelated reason, the design is ruined.
- Placebo Tests: You can test this by running your RD model on a control variable that the treatment should not affect. If your placebo variable jumps at the cutoff, your design is flawed.
- No Manipulation: People must not be able to precisely manipulate their running variable to force themselves over the line (e.g., a timekeeper shaving two seconds off a runner’s time so they make the team).
- McCrary Density Test: You can test for manipulation by plotting the density distribution of the running variable. If there is a massive unnatural spike or dip in the number of people just over the cutoff, it implies manipulation.
5. Advanced Concerns (How the Pros Do It)
The chapter concludes by covering the complexities researchers face when the simple RD setup breaks down:
- Regression Kink: Sometimes a policy doesn’t cause a jump in an outcome, but rather a change in its slope (e.g., unemployment benefits cap out at a certain income level, kinking the treatment amount). A regression kink design looks for a change in the slope of the relationship between the outcome and the running variable at the cutoff.
- Misbehaving Running Variables: The running variable must be continuous and smooth. If it is too granular (e.g., income measured in $10,000 bins) or exhibits nonrandom heaping (e.g., doctors rounding birth weights to the nearest 100 grams), it creates bias. Researchers handle heaping by using a “donut-hole” RD that actively drops the heaped observations right at the cutoff.
- Multiple Cutoffs: A design might feature different cutoffs across different states, cutoffs that shift over time, or a treatment that requires crossing two running variables (e.g., passing both a math and an English threshold). Specialized software like the rdmulti package is used to handle these multi-dimensional borders.
- Bandwidth Selection: Choosing the bandwidth involves an efficiency-robustness tradeoff. A wider bandwidth gives you more data (better precision/efficiency) but introduces more bias from non-comparable observations. Modern practice relies on automated “optimal bandwidth selection” algorithms and cross-validation rather than guessing.
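The efficiency-robustness tradeoff in bandwidth selection can be seen in a toy simulation. This sketch uses made-up data (a true jump of 2 plus curvature) and plain linear fits on each side of the cutoff, rather than a real local-polynomial RD estimator:

```python
# Sketch of the bandwidth tradeoff: estimate an RD jump by fitting a line on each
# side of the cutoff within a bandwidth. Data are simulated (true jump = 2) with
# curvature, so a wide bandwidth picks up bias while a narrow one stays close.

def linfit_intercept(points):
    """OLS intercept of y = a + b*x for a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    b = sum((x - mx) * (y - my) for x, y in points) / sum((x - mx) ** 2 for x, _ in points)
    return my - b * mx

def rd_jump(data, bandwidth):
    left = [(x, y) for x, y in data if -bandwidth <= x < 0]
    right = [(x, y) for x, y in data if 0 <= x <= bandwidth]
    return linfit_intercept(right) - linfit_intercept(left)

# Simulated running variable and outcome: a jump of 2 at the cutoff, plus curvature.
xs = [i / 20 for i in range(-40, 41)]  # grid from -2 to 2
data = [(x, 2 * (x >= 0) + 0.5 * x + 0.5 * x ** 3) for x in xs]

print(rd_jump(data, 0.5))  # narrow bandwidth: close to the true jump of 2
print(rd_jump(data, 2.0))  # wide bandwidth: noticeably biased by the curvature
```

The narrow bandwidth uses fewer observations (less precision in real, noisy data) but stays near the truth; the wide bandwidth drags in far-away, non-comparable observations and its estimate is badly biased.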
21. Partial Identification
Chapter 21, titled “Partial Identification,” introduces a paradigm shift in how we think about causal inference. Up until this point, the book has focused on point identification—making a set of assumptions strong enough to pin down a single, exact number for a causal effect.
However, making those assumptions often requires us to confidently claim things that are highly implausible, like asserting that an unmeasured variable has an effect of exactly zero. Partial identification (also known as set identification or sensitivity analysis) allows us to relax those strong, brittle assumptions and replace them with weaker, more believable assumptions. The trade-off is that instead of getting a single number, we get an identified set—a range of plausible estimates.
1. The Core Concept: Bounding the Effect
The algebra analogy perfectly captures partial identification: if \(x + y = 10\), and we assume \(y = 6\), we can point identify \(x\) as exactly 4. But what if we don’t know exactly what \(y\) is? If we make a weaker assumption that \(y\) is somewhere between 4 and 7, we can partially identify \(x\) as being between 3 and 6.
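The algebra analogy translates directly into code:

```python
# The algebra analogy in code: x + y = 10. A point assumption pins x down;
# an interval assumption on y only bounds x.

total = 10

# Point identification: assume y is exactly 6.
x_point = total - 6
print(x_point)  # 4

# Partial identification: assume only that 4 <= y <= 7.
y_low, y_high = 4, 7
x_bounds = (total - y_high, total - y_low)
print(x_bounds)  # (3, 6)
```

The weaker assumption buys believability at the cost of precision: we no longer know x, only the set it must live in.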
The chapter demonstrates this using a sample selection problem: measuring the effect of a $1,000 cash transfer on school attendance. Because some people drop out of the study, we suffer from collider bias. However, even if we know absolutely nothing about the missing people, we know that the share of their kids in school must be between 0% and 100%. Plugging those logical extremes into our math gives us a worst-case and best-case scenario, bounding the true effect between a 4 and 34 percentage-point increase. It’s a wide range, but it proves the effect is positive without making any dangerous assumptions.
2. Sensitivity Analysis in Regression (Handling Omitted Variables)
It is incredibly difficult to confidently claim you have controlled for every back door in a regression. Sensitivity analysis allows us to ask: “How bad would an omitted variable have to be to completely erase our estimated effect?”.
- Cinelli and Hazlett’s Method: This approach calculates the bias introduced by an omitted variable based on its partial \(R^2\)—how much residual variation the omitted variable explains in the treatment, and how much it explains in the outcome.
- Benchmarking: Since we can’t measure the unobserved variables, we benchmark them against the variables we did measure. You can calculate the strength of a known, powerful control variable (like the severity of a crime) and ask: “Even if there is an unmeasured back door out there just as strong as this variable, would my effect still be positive?”. If the answer is yes, your results are robust.
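A minimal sketch of the bias-bound arithmetic, with hypothetical partial \(R^2\) values and residual standard deviations (this follows the Cinelli and Hazlett bias-bound formula as I understand it; check their paper or the sensemakr package for the authoritative version):

```python
import math

# Hedged sketch of the Cinelli-Hazlett omitted-variable bias bound. All inputs
# (partial R-squared values and residual standard deviations) are hypothetical.

def bias_bound(r2_yz, r2_dz, sd_y_res, sd_d_res):
    """Worst-case |bias| from an unobserved confounder Z that explains
    r2_yz of the residual outcome variation and r2_dz of the residual
    treatment variation."""
    return math.sqrt(r2_yz * r2_dz / (1 - r2_dz)) * (sd_y_res / sd_d_res)

# Benchmarking: suppose a confounder "as strong as" our best observed control,
# which (hypothetically) has partial R-squared of 0.10 with both outcome and treatment.
b = bias_bound(r2_yz=0.10, r2_dz=0.10, sd_y_res=2.0, sd_d_res=1.0)
print(round(b, 3))

# If our (hypothetical) estimate exceeds this bound, it survives such a confounder.
estimate = 0.5
print(estimate - b > 0)  # True: the result is robust to a confounder of this strength
```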
3. Rosenbaum Bounds for Matching
In Chapter 14, we learned that matching assumes we have closed back doors by pairing treated and untreated individuals with the exact same probability of being treated (an odds ratio of 1). Rosenbaum bounds relax this by allowing for unobserved differences between the matched pairs.
- The Gamma (\(\Gamma\)) Parameter: This parameter represents the worst-case odds ratio of treatment assignment due to unobserved variables. We allow \(\Gamma\) to be greater than 1.
- Interpretation: Because odds ratios are hard to intuit, we convert \(\Gamma\) into a probability. A \(\Gamma\) of 1.5 translates to a 60% probability. This allows us to say: “Even if unobserved factors made it so that within our ‘perfect’ matched pairs, we could successfully guess who was treated 60% of the time, our treatment effect would still be statistically significant”.
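The conversion from odds ratio to probability is just \(p = \Gamma/(1+\Gamma)\):

```python
# Converting a Rosenbaum Gamma (odds ratio) into a "guess who was treated"
# probability: p = Gamma / (1 + Gamma).

def gamma_to_prob(gamma):
    return gamma / (1 + gamma)

print(gamma_to_prob(1.0))  # 0.5 -> perfectly matched pairs, a coin flip
print(gamma_to_prob(1.5))  # 0.6 -> the 60% figure from the text
```

A \(\Gamma\) of 1 means treatment within matched pairs is as good as a coin flip; larger values quantify how much hidden bias the result can tolerate.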
4. Design-Specific Partial Identification (How the Pros Do It)
The pros frequently apply partial identification to the most fragile assumptions of advanced quasi-experimental designs.
- Honest Difference-in-Differences: Difference-in-Differences requires the massive assumption of exact parallel trends. “Honest DID” relaxes this by allowing the pre-treatment gap between groups to change slightly, calculating a range of adjusted effects based on a plausible range of parallel-trend violations.
- Regression Discontinuity with Manipulation: RD assumes no one manipulates their running variable to sneak over the cutoff. If they do, new bounding methods allow researchers to specify a maximum amount of manipulation and see how the effect holds up.
- Imperfect Instrumental Variables: IV requires a perfectly valid instrument with zero open back doors to the outcome. Partial identification can bound the IV estimate by assuming the validity violation isn’t perfectly zero, but is restricted to a certain size or direction.
22. Other Methods
Chapter 22, titled “A Gallery of Rogues: Other Methods,” serves as a tour of the causal inference frontier. Because research design is constantly evolving, this chapter briefly introduces methods that are relatively new, actively developing, or too mathematically complex to cover in detail.
Instead of deep dives and code, it provides a conceptual overview of several cutting-edge techniques to start you on a journey of exploration.
1. Advanced Template Designs
These are comprehensive research designs that can be applied to observational data in a variety of settings, much like Difference-in-Differences (DID) or Regression Discontinuity.
- Synthetic Control: This is effectively a highly disciplined variant of DID used when only a single group receives a treatment. Instead of relying on the unprovable “parallel trends” assumption, synthetic control actively forces pre-treatment trends to match. It does this by using a matching algorithm to assign weights to a “donor pool” of untreated groups, constructing a “synthetic” version of the treated group that perfectly mimics its pre-treatment trajectory. The effect is the difference between the actual treated group and the synthetic control group after the treatment starts.
- Matrix Completion: This method relies heavily on the “potential outcomes” framework and treats causal inference as a missing data problem. Imagine a spreadsheet where rows are individuals and columns are time periods. For a treated person, we know their outcome when treated, but their counterfactual (what would have happened if they were untreated) is a missing value (a “?”). Matrix completion uses machine learning (regularized regression) to analyze the patterns in the data from untreated periods and untreated individuals to accurately predict and fill in those “?”s, giving us our counterfactuals to compare against.
- Causal Discovery: The rest of the book insists that you must draw your causal diagram using theory and intuition before looking at the data. Causal discovery uses algorithms to let the data draw the diagram for you. Algorithms like SGS test for conditional relationships to figure out which variables are directly connected by lines. Then, they look for colliders (situations where A and B are unrelated until you control for C) to figure out which direction the arrows should point.
- Double Machine Learning (DML): When you control for a variable in regression, you are effectively removing the variation in the treatment and the outcome explained by those controls, and then looking at the relationship between the residuals. DML asks: why must linear regression be the tool that finds those residuals? Instead, DML uses machine learning (like random forests or neural networks) to predict the treatment and the outcome from the controls. This can vastly outperform standard regression when you have hundreds of control variables or highly complex, non-linear relationships.
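The synthetic control idea above can be sketched with a two-unit donor pool. All numbers here are made up, and a grid search stands in for the constrained least-squares problem real implementations solve over many donors:

```python
# Minimal synthetic-control sketch: grid-search the donor weight that best
# matches the treated unit's pre-treatment outcomes. All numbers are made up.

treated_pre = [3, 4, 5]   # treated unit, periods before treatment
donor_a_pre = [2, 3, 4]   # donor pool
donor_b_pre = [6, 7, 8]

def pre_fit_error(w):
    """Sum of squared pre-period gaps for synthetic = w*A + (1-w)*B."""
    return sum((t - (w * a + (1 - w) * b)) ** 2
               for t, a, b in zip(treated_pre, donor_a_pre, donor_b_pre))

# Grid search over the weight simplex.
best_w = min((i / 100 for i in range(101)), key=pre_fit_error)
print(best_w)  # 0.75: this synthetic control reproduces the pre-trend exactly

# Post-treatment effect = actual treated outcome minus its synthetic counterpart.
treated_post, donor_a_post, donor_b_post = 10, 5, 9
synthetic_post = best_w * donor_a_post + (1 - best_w) * donor_b_post
print(treated_post - synthetic_post)  # 4.0
```

Because the weights are chosen to match the pre-treatment trajectory, any post-treatment gap between the treated unit and its synthetic twin is attributed to the treatment.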
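The matrix-completion bullet can also be made concrete. This sketch builds a made-up rank-1 outcome panel, blanks out one "counterfactual" cell, and fills it by repeatedly taking the best rank-1 approximation (a bare-bones hard-impute loop, without the regularization real methods use):

```python
import numpy as np

# Matrix-completion sketch: treat the missing counterfactual as a "?" in a
# low-rank matrix and fill it by iterating a rank-1 SVD approximation.
# The matrix is a made-up rank-1 outcome panel (units x periods).

truth = np.outer([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])  # rank-1 "potential outcomes"
observed = truth.copy()
observed[2, 2] = 0.0                                # pretend this entry is missing

filled = observed.copy()
for _ in range(200):
    u, s, vt = np.linalg.svd(filled, full_matrices=False)
    approx = s[0] * np.outer(u[:, 0], vt[0])        # best rank-1 approximation
    filled[2, 2] = approx[2, 2]                     # update only the missing cell

print(round(filled[2, 2], 2))  # recovers the true value 18.0
```

The pattern in the observed rows and columns pins down what the missing cell "should" be, which is exactly the counterfactual we want.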
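The residual-on-residual logic behind DML (the Frisch-Waugh-Lovell idea) can be sketched as follows. Here a plain linear fit plays the role of the predictor; DML would swap in a flexible ML learner plus cross-fitting. Data are made up, with a true effect of 3:

```python
# Residual-on-residual sketch of the DML idea, using OLS as the stand-in learner.

def fit_predict(xs, ys):
    """OLS of y = a + b*x, returning predicted values."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return [a + b * x for x in xs]

control = [0, 1, 2, 3]
treatment = [w + e for w, e in zip(control, [1, -1, 1, -1])]   # control + independent variation
outcome = [3 * x + 2 * w for x, w in zip(treatment, control)]  # true treatment effect: 3

# Step 1: predict treatment and outcome from the controls; keep the residuals.
rx = [x - p for x, p in zip(treatment, fit_predict(control, treatment))]
ry = [y - p for y, p in zip(outcome, fit_predict(control, outcome))]

# Step 2: regress outcome residuals on treatment residuals.
effect = sum(a * b for a, b in zip(rx, ry)) / sum(a * a for a in rx)
print(effect)  # recovers the true effect of 3
```

With linear fits this reproduces ordinary multiple regression exactly; the point of DML is that steps 1 and 2 still identify the effect when the prediction step is handed to an arbitrary ML model.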
2. Modeling Heterogeneous Effects
We know from Chapter 10 that treatment effects are rarely identical for everyone. These machine-learning-based methods help us map out the entire distribution of how effects vary across a sample.
- Causal Forests: This is a modification of a machine learning prediction tool called a “random forest.” A normal random forest splits data into smaller and smaller branches (like “Age > 65” vs. “Age < 65”) to find the best way to accurately predict an outcome. A causal forest splits the data into branches based on how different the estimated treatment effect is on either side of the split. By the end of the process, the algorithm provides a customized treatment effect estimate (with standard errors) for every single individual in the data.
- Sorted Effects: A much simpler approach. If you estimate a model that naturally allows for individual differences in effects (like a model with lots of interactions, or a logit/probit model), you simply calculate the specific treatment effect for every individual in your sample. Then, you sort them from lowest to highest to visualize the entire range. This allows you to easily compare the characteristics of the people who were most affected by a policy against those who were least affected.
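The sorted-effects recipe is simple enough to sketch directly. The logit coefficients below are assumed, not estimated, purely to illustrate the compute-then-sort step:

```python
import math

# Sorted-effects sketch: from a hypothetical, already-estimated logit model with a
# treatment-by-age interaction, compute each person's own treatment effect, then
# sort to see the full range.

def logit_prob(treated, age, b0=-1.0, b_t=0.5, b_age=0.02, b_t_age=0.03):
    """Predicted probability from made-up logit coefficients (assumed, not estimated)."""
    z = b0 + b_t * treated + b_age * age + b_t_age * treated * age
    return 1 / (1 + math.exp(-z))

ages = [20, 30, 40, 50, 60]
effects = [logit_prob(1, a) - logit_prob(0, a) for a in ages]

# Sort from least to most affected and inspect the spread.
for age, eff in sorted(zip(ages, effects), key=lambda t: t[1]):
    print(age, round(eff, 3))
```

Once the effects are sorted, you can compare the characteristics (here, age) of the most-affected and least-affected individuals.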
3. Structural Estimation
In most of the book, we used theory to figure out which variables to control for in a standard regression model (a “reduced form” approach). Structural estimation takes theory much more literally.
If economic or scientific theory gives you a specific mathematical equation for how the world works (like the physics equation for gravity: \(F = G \frac{m_1 m_2}{r^2}\)), structural estimation skips the generic linear regression entirely. Instead, it uses advanced statistics (like maximum likelihood) to estimate the exact theoretical parameters (like the gravitational constant \(G\)) directly from the data. While mathematically grueling, this allows researchers to answer highly complex counterfactual questions, like predicting exactly what would happen if a completely new policy were introduced that doesn’t even exist in the data yet.
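A toy version of the structural approach: suppose theory says waiting times follow an Exponential(\(\lambda\)) distribution. Rather than running a regression, we estimate the structural parameter \(\lambda\) by maximum likelihood (a grid search here, with made-up data; real work uses numerical optimizers and far richer models):

```python
import math

# Toy structural estimation: theory supplies the functional form (Exponential),
# and maximum likelihood recovers its parameter directly from the data.

data = [1.0, 2.0, 3.0, 2.0]  # made-up waiting times

def log_likelihood(lam):
    return sum(math.log(lam) - lam * x for x in data)

grid = [i / 1000 for i in range(1, 2001)]  # candidate lam values in (0, 2]
lam_hat = max(grid, key=log_likelihood)
print(lam_hat)  # matches the closed-form MLE, 1/mean(data) = 0.5
```

With the structural parameter in hand, you can simulate counterfactuals the data never contained, such as waiting times under a policy that changes \(\lambda\).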
23. Under the Rug
Chapter 23, titled “Under the Rug,” explores the uncomfortable assumptions and concerns that are present in almost any causal inference study, but which researchers frequently ignore or brush aside. Rather than presenting a new research design, this chapter acts as a survey of the messy realities of working with data, highlighting the limitations of our methods and offering tools to grapple with them.
1. Model Uncertainty
Even if you are confident in your causal diagram, you will inevitably face model uncertainty—uncertainty about which specific statistical model is the “right” one to estimate. For instance, you might be unsure about which specific control variables to include or how to define them.
- The Fix: Instead of hand-picking a single “best” model, you can estimate lots of different models with varying combinations of controls and show the entire distribution of the estimates. Alternatively, you can use Bayesian model averaging, which weights the results of different candidate models based on how likely each model is to be true given the data.
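The "estimate lots of models" fix can be sketched with simulated data and two candidate controls, looping over every control combination (numpy's least-squares solver stands in for a regression package):

```python
import itertools
import numpy as np

# Model-uncertainty sketch: estimate the treatment coefficient under every
# combination of two candidate controls and look at the spread of estimates.
# Data are simulated; the point is the procedure, not the numbers.

rng = np.random.default_rng(0)
n = 200
w1 = rng.normal(size=n)
w2 = rng.normal(size=n)
x = 0.5 * w1 + rng.normal(size=n)            # treatment, confounded by w1
y = 2.0 * x + 1.0 * w1 + rng.normal(size=n)  # true effect of x is 2

controls = {"w1": w1, "w2": w2}
estimates = {}
for subset in itertools.chain.from_iterable(
        itertools.combinations(controls, k) for k in range(3)):
    cols = [np.ones(n), x] + [controls[name] for name in subset]
    coef, *_ = np.linalg.lstsq(np.column_stack(cols), y, rcond=None)
    estimates[subset] = coef[1]              # coefficient on x

for spec, est in estimates.items():
    print(spec, round(est, 3))
```

Specifications that omit the confounder w1 give visibly larger estimates, and showing the whole distribution of estimates is more honest than hand-picking one specification.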
2. Measurement and Validity
Data does not fall from the sky; it is generated, collected, and cleaned by humans, which introduces biases.
- Construct Validity: Does your data actually measure the theoretical concept you want it to? For example, simply asking someone to rate their “trust” on a scale of 1 to 10 might actually measure how important they think trust is, rather than their actual trusting behavior. Psychometricians address this by using multiple different measures of a concept and statistically extracting the shared underlying factor.
- The Observer Effect: People modify their behavior when they know they are being studied. Subjects might try to give you the answer they think you want, act in ways that make them look good, or even intentionally mess up the data.
- Processing Data: Data cleaning requires highly subjective human decisions, such as deciding which observations to drop or how to group variables into bins. Different researchers cleaning the exact same data can arrive at completely different sample sizes and estimates.
3. Missing Data
It is incredibly common to have datasets where observations are missing values for certain variables. How you handle this depends on why the data is missing:
- Types of Missing Data: Data can be Not in Universe (the question doesn’t apply to them), Missing Completely at Random (a complete fluke), Missing at Random (missingness is unrelated to the value itself after accounting for other variables), or Missing Not at Random (the missingness is directly related to the missing value, like a heavy person refusing to report their weight).
- How to Handle It:
- Listwise deletion (dropping anyone with missing data) is the default in most software, but it can introduce severe bias unless the data are Missing Completely at Random.
- Multiple Imputation predicts the missing values multiple times using the other variables in the dataset, adding random noise each time to reflect uncertainty, and then averages the estimates together.
- Other advanced methods include Full Information Maximum Likelihood (letting incomplete observations partially contribute to the model) and the Expectation-Maximization (EM) algorithm.
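A bare-bones sketch of the multiple-imputation loop. The data are made up, and the "predictive model" here is just the mean and spread of the complete cases; real implementations predict from the other variables:

```python
import random
import statistics

# Multiple-imputation sketch: fill each missing value several times from a
# predictive model plus random noise, analyze each completed dataset, then
# pool the estimates.

random.seed(42)
observed = [4.0, 5.0, 6.0, 5.5, 4.5]  # complete cases (made up)
n_missing = 2                          # two respondents didn't answer

mu = statistics.mean(observed)
sigma = statistics.stdev(observed)

pooled = []
for m in range(50):  # 50 imputed datasets
    imputed = [random.gauss(mu, sigma) for _ in range(n_missing)]
    completed = observed + imputed
    pooled.append(statistics.mean(completed))

print(round(statistics.mean(pooled), 2))  # pooled estimate across imputations
```

The random noise added at each draw is the crucial part: imputing the same best guess every time would understate uncertainty, while the spread across the 50 analyses reflects how little we know about the missing values.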
4. SUTVA (Stable Unit Treatment Value Assumption)
SUTVA is the foundational assumption that we actually know what “treatment” is. This assumption is violated in two main ways:
- Treatment means different things for different people. For example, “workplace diversity training” might mean watching a 10-minute video at one company, but attending a week-long intensive seminar at another.
- Spillovers. Your outcome is affected by whether other people get treated. If a neighboring city gets a tax incentive that boosts its economy, that economic growth might spill over into your city regardless of whether your city was treated.
5. Nonexistent Moments (Fat Tails)
All the statistical methods in this book assume that things like the “mean” and “variance” of a distribution actually exist. However, in power law (fat-tailed) distributions, extreme observations are so massive and common that the theoretical mean and variance are mathematically undefined. Variables like income, wealth, city sizes, and Spotify streams often follow power laws. If you try to calculate a sample mean for fat-tailed data, the estimate will jump around wildly depending on whether you happen to sample a “super-big” observation.
- The Fix: Researchers often use a logarithm transformation to rein in extreme values, though this sometimes isn’t enough. Advanced fixes include quantile regression (predicting the median instead of the mean) or estimating the power-law distribution directly using maximum likelihood.
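To see why the mean can fail to exist, consider a Pareto tail with exponent 1 (density proportional to \(x^{-2}\) on \([1,\infty)\)). The mean truncated at a cap \(M\) works out analytically to \(\ln(M)/(1 - 1/M)\), which grows without bound as the cap rises:

```python
import math

# Why the mean can fail to exist: for a power-law tail with density
# proportional to x**-2 on [1, inf), the truncated mean keeps growing
# as you raise the truncation point, instead of settling down.

def truncated_mean(cap):
    """E[X | X <= cap] for density proportional to x**-2 on [1, cap]."""
    return math.log(cap) / (1 - 1 / cap)

for cap in [10, 1000, 10 ** 6]:
    print(cap, round(truncated_mean(cap), 2))
# The "mean" depends entirely on how big an observation you allow in:
# larger samples keep finding bigger values, so the sample mean never converges.
```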
6. The Treatment Mystery
The chapter concludes with a philosophical puzzle: If you have successfully closed all back doors and found observations that are truly comparable in every way, why did one get the treatment and the other didn’t?. For example, if you compare identical twins to isolate the effect of education, you assume they share the exact same background, genetics, and demographics. But if they are so identical, why did one go to college and the other didn’t? Unless the treatment was explicitly randomized, whatever unobserved difference caused them to get different levels of education might also act as an unclosed back door.