Formal framework for quantitative Root Cause Analysis

Dominik Janzing @ Amazon Research

Abstract:
Asking for the “root cause(s)” of a singular event is at the heart of human attempts to understand what happened. Nevertheless, we were not able to find a satisfactory formalization of “Root Cause Analysis (RCA)” for our business. We have therefore proposed a framework for RCA of anomalies [1] and distribution change [2] that looks sufficiently general for a wide range of use cases based on structural equation models or graphical causal models. It quantifies, as a percentage, the extent to which each node contributes to the respective event, based on well-defined statistical and causal principles (also available as open source code in DoWhy [3]). I will, however, also mention cases where our implicit notion of root causes seems to rely on normative expectations rather than on statistics and causality alone.

References:

[1] Kailash Budhathoki, Lenon Minorics, Patrick Bloebaum, Dominik Janzing: Causal structure-based root cause analysis of outliers, ICML 2022

[2] Kailash Budhathoki, Dominik Janzing, Patrick Bloebaum, Hoiyi Ng: Why did the distribution change? AISTATS 2021

[3] DoWhy: An end-to-end library for causal inference, https://py-why.github.io/dowhy/v0.8/

Bio:

Research interests:

- novel causal inference methods and their foundation
- physics of causality and information flow
- notions of complexity and their application in machine learning
- statistical methods
- statistical physics, in particular the link between causality and the second law of thermodynamics
I founded the group "causal inference" together with Bernhard Schölkopf: https://webdav.tuebingen.mpg.de/causality/

We (Jonas Peters, Bernhard Schölkopf, and I) have written a book on causal inference: https://mitpress.mit.edu/9780262037310/

I worked on quantum information theory for many years and I'm still interested in it; my current causality research is strongly influenced by the paradigm that information is physical. In 2003, I started a project on causal inference together with the student Xiaohai Sun at the Universitaet Karlsruhe (now KIT), which later resulted in a joint project with the MPI for Biological Cybernetics and thus became the beginning of the causality group.

Summary:

Root cause analysis (RCA): decomposition of a dynamic system into individual components
- Graphical models: language of causal models
- Causal Bayesian Network: decomposition of distribution into directed graph of conditional distributions
- Functional Causal Model: independent structural equations/functions for each directed graph node, with independent noise variables
- Key idea: each graph node represents an independent mechanism (represented by conditional distribution or function)
- Different from analyzing treatment effects:
  - Root cause is the earliest cause in a chain of events that leads to a type of event
    - Finding root causes partitions the causal graph into semi-independent components
  - Treatment effects can be computed for any event in the chain
- Contribution of mechanisms
  - Take causal graph
  - Replace some node with a baseline mechanism
  - Compute the impact of the replacement action on some output quantity
  - Keep replacing to find the residual impact of the other nodes, after accounting for the impact of previously replaced nodes
  - The result depends on the chosen order and the chosen baseline
  - Contributions need to sum up to something meaningful
  - Baselines:
    - RCA of Outliers: replace anomalous noise with normal value
    - Distribution Change: replace existing conditional distribution with baseline conditional
    - Intrinsic Causal Contribution: Set random noise of node to a constant
  - Intuition:
    - A node's anomalous value may be the root cause; it then propagates through the normal causal mechanisms of its causal children (probed by changing values)
    - If a node has an anomalous mechanism, normal values of its causal parents propagate anomalously (probed by changing the noise or the function at the node)
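
The replacement procedure under "Contribution of mechanisms" can be sketched on a toy graph. This is only an illustrative sketch, not the DoWhy implementation: the graph (A -> B, A -> T, B -> T), the mechanisms, the baselines, and all function names are assumptions made up for this example.

```python
from itertools import permutations

# Hypothetical toy graph: A -> B, A -> T, B -> T (names are illustrative).
PARENTS = {"A": [], "B": ["A"], "T": ["A", "B"]}
TOPOLOGICAL_ORDER = ["A", "B", "T"]

# Current mechanisms: each node is a function of its parents' values.
current = {
    "A": lambda: 2.0,
    "B": lambda a: a + 1.0,
    "T": lambda a, b: a * b,
}

# Assumed baseline mechanisms for the nodes we want to attribute to.
baseline = {
    "A": lambda: 1.0,
    "B": lambda a: a,
}


def evaluate(mechanisms, target="T"):
    """Propagate values through the graph and return the target value."""
    values = {}
    for node in TOPOLOGICAL_ORDER:
        parent_values = (values[p] for p in PARENTS[node])
        values[node] = mechanisms[node](*parent_values)
    return values[target]


def sequential_contributions(order):
    """Replace mechanisms one by one and record each change in the target."""
    mechanisms = dict(current)
    previous = evaluate(mechanisms)
    contributions = {}
    for node in order:
        mechanisms[node] = baseline[node]   # swap in the baseline mechanism
        updated = evaluate(mechanisms)
        contributions[node] = previous - updated
        previous = updated
    return contributions


for order in permutations(["A", "B"]):
    print(order, sequential_contributions(order))
```

In this toy model the per-node numbers differ between the two orders, while their sum is the same in both: it always equals the gap between the current and the all-baseline value of T. That is the sense in which the result depends on the order but the contributions still add up to something meaningful.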
RCA of outliers
- Take the probability of the unlikely event of interest
- Rewrite the graph as just a function of the noise terms
- Compute the sensitivity of the event to the value of the noise term, in some sequence after the prior nodes’ impact has been considered
- Average over all possible orders
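
The steps above can be illustrated with a small discrete model in which all probabilities are computed exactly by enumeration. This is a hedged sketch, not the method as implemented in DoWhy: the chain X -> Y -> T, the additive mechanisms, the noise distribution, and the observed noise values are assumptions chosen for illustration. The event is "T is at least as large as observed", and a node's contribution is the change in the log-probability of that event when its noise is held at the observed value instead of being redrawn.

```python
from itertools import permutations, product
from math import log

# Assumed noise distribution, shared by all nodes: value -> probability.
NOISE_DIST = {0: 0.90, 1: 0.09, 5: 0.01}
NODES = ["X", "Y", "T"]


def target(noise):
    """Chain X -> Y -> T written purely as a function of the noise terms."""
    x = noise["X"]
    y = x + noise["Y"]
    return y + noise["T"]


# Hypothetical observation: the noise at X is anomalous.
observed = {"X": 5, "Y": 0, "T": 1}
threshold = target(observed)


def tail_probability(fixed):
    """P(T >= threshold) when nodes in `fixed` keep their observed noise
    and the noise of every other node is redrawn from NOISE_DIST."""
    free = [n for n in NODES if n not in fixed]
    prob = 0.0
    for values in product(NOISE_DIST, repeat=len(free)):
        noise = {n: observed[n] for n in fixed}
        noise.update(zip(free, values))
        if target(noise) >= threshold:
            weight = 1.0
            for v in values:
                weight *= NOISE_DIST[v]
            prob += weight
    return prob


def shapley_contributions():
    """Average each node's log-probability change over all orderings."""
    orders = list(permutations(NODES))
    totals = {n: 0.0 for n in NODES}
    for order in orders:
        fixed = []
        previous = log(tail_probability(fixed))
        for node in order:
            fixed.append(node)
            current = log(tail_probability(fixed))
            totals[node] += current - previous
            previous = current
    return {n: totals[n] / len(orders) for n in NODES}


contributions = shapley_contributions()
outlier_score = -log(tail_probability([]))
print(contributions)
print(outlier_score)
```

With these assumed numbers the contributions sum to the outlier score (the negative log tail probability), X receives the largest share, and Y, whose noise was unremarkable, receives a negative share.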
RCA of distribution changes
- We observe the joint distribution of data has changed
- Want to assign responsibility to nodes
- Observe: KL divergence between the two joint distributions decomposes into per-node conditional divergences
  - Caveat: each conditional KL term is averaged over the distribution of the node's parents, so it still depends on them
- Intuition: most distribution changes are local
  - Look for the few nodes that have changed
  - Use conditional independence tests to find them
- Idea: attribute variance in target variable to causal nodes
  - Compare the variance of the variable conditional on some nodes to the variance conditional on one fewer causal node
  - Highlights impact of the removed node on the variance
  - Average over possible orderings of removals
- Intuition:
  - Attribution replaces mechanisms, not values
  - Want to know which part of change is due to a single node’s mechanism, not its parents
  - Don’t want to attribute contribution to nodes that just propagate anomalies
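
Two of the points above, the per-node KL decomposition and the "replace mechanisms, not values" attribution, can be checked numerically on a two-node model X -> T with binary variables. This is a minimal sketch under stated assumptions: the probabilities are made up, the target quantity P(T=1) is an arbitrary choice, and the function names are hypothetical rather than taken from any library.

```python
from itertools import permutations
from math import log

# Mechanisms: "X" holds P(X=1); "T" holds P(T=1 | X=x). Numbers are made up.
old = {"X": 0.2, "T": {0: 0.1, 1: 0.5}}
new = {"X": 0.6, "T": {0: 0.1, 1: 0.7}}


def bernoulli_kl(p, q):
    """KL divergence of Bernoulli(p) from Bernoulli(q)."""
    return p * log(p / q) + (1 - p) * log((1 - p) / (1 - q))


def joint(mech):
    """Joint table P(x, t) implied by the two mechanisms."""
    table = {}
    for x in (0, 1):
        px = mech["X"] if x == 1 else 1 - mech["X"]
        for t in (0, 1):
            pt = mech["T"][x] if t == 1 else 1 - mech["T"][x]
            table[(x, t)] = px * pt
    return table


def joint_kl(p_mech, q_mech):
    """KL divergence between the two joint distributions."""
    p, q = joint(p_mech), joint(q_mech)
    return sum(p[k] * log(p[k] / q[k]) for k in p)


# Per-node terms; the T term is weighted by the NEW distribution of its parent.
kl_x = bernoulli_kl(new["X"], old["X"])
kl_t = sum(
    (new["X"] if x == 1 else 1 - new["X"])
    * bernoulli_kl(new["T"][x], old["T"][x])
    for x in (0, 1)
)


def target_quantity(mech):
    """P(T=1) under the given mechanisms."""
    return (1 - mech["X"]) * mech["T"][0] + mech["X"] * mech["T"][1]


def shapley_change():
    """Swap old mechanisms for new ones, averaged over both orderings."""
    nodes = ["X", "T"]
    orders = list(permutations(nodes))
    totals = dict.fromkeys(nodes, 0.0)
    for order in orders:
        mech = dict(old)
        previous = target_quantity(mech)
        for node in order:
            mech[node] = new[node]          # replace the mechanism, not a value
            current = target_quantity(mech)
            totals[node] += current - previous
            previous = current
    return {n: totals[n] / len(orders) for n in nodes}


print(joint_kl(new, old), kl_x + kl_t)
print(shapley_change())
```

The two printed KL values agree up to floating-point error, and the attributions add up to the total change in P(T=1). In this toy model, most of the change is credited to the mechanism at X, even though it is observed at T.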
Limitations of RCA
- Typically can talk about unusual behavior
- Can’t say much about how things should work
- To capture those, need to specify baseline distributions about what should be considered normal behavior for a given node