Dominik Janzing @ Amazon Research
Abstract:
Asking for the “root cause(s)” of a singular event is at the heart of human attempts to understand what happened. Nevertheless, we were not able to find a satisfactory formalization of “Root Cause Analysis (RCA)” for our business. We have therefore proposed a framework for RCA of anomalies [1] and distribution change [2] that looks sufficiently general for a wide range of use cases based on structural equation models or graphical causal models. It quantifies, as a percentage, to what extent each node contributes to the respective event, based on well-defined statistical and causal principles (also available as open source code in DoWhy [3]). I will, however, also mention cases where our implicit notion of root causes seems to rely on normative expectations rather than on statistics and causality alone.
References:
[1] Kailash Budhathoki, Lenon Minorics, Patrick Bloebaum, Dominik Janzing: Causal structure-based root cause analysis of outliers, ICML 2022
[2] Kailash Budhathoki, Dominik Janzing, Patrick Bloebaum, Hoiyi Ng: Why did the distribution change? AISTATS 2021
[3] DoWhy: An end-to-end library for causal inference, https://py-why.github.io/dowhy/v0.8/
Bio:
Research interests:
novel causal inference methods and their foundation
physics of causality and information flow
notions of complexity and their application in machine learning
statistical methods
statistical physics, in particular the link between causality and the second law of thermodynamics.
I founded the group "causal inference" together with Bernhard Schölkopf: https://webdav.tuebingen.mpg.de/causality/
We (Jonas Peters, Bernhard Schölkopf, and I) have written a book on causal inference: https://mitpress.mit.edu/9780262037310/
I worked on quantum information theory for many years and am still interested in it; my current causality research is strongly influenced by the paradigm that information is physical. In 2003, I started a project on causal inference together with the student Xiaohai Sun at the Universitaet Karlsruhe (now KIT), which later resulted in a joint project with the MPI for Biological Cybernetics and thus became the beginning of the causality group.
Summary:
Root cause analysis (RCA): decomposition of dynamic system into individual components
Graphical models: language of causal models
Causal Bayesian Network: decomposition of distribution into directed graph of conditional distributions
Functional Causal Model: independent structural equations/functions for each directed graph node, with independent noise variables
Key idea: each graph node represents an independent mechanism (represented by conditional distribution or function)
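As a minimal sketch of the functional causal model idea above, the following Python snippet encodes a hypothetical three-node chain X -> Y -> Z. The graph, the linear coefficients, and the names MECHANISMS, PARENTS, ORDER, and evaluate are illustrative assumptions and do not come from the talk. The point is only that each node has its own mechanism and its own noise term, so one mechanism can be swapped while the others stay untouched.

```python
# Hypothetical toy graph X -> Y -> Z (an assumption for illustration).
# Each node has its own mechanism f(parents, noise).
MECHANISMS = {
    "X": lambda parents, noise: noise,
    "Y": lambda parents, noise: 2.0 * parents["X"] + noise,
    "Z": lambda parents, noise: parents["Y"] + noise,
}
PARENTS = {"X": [], "Y": ["X"], "Z": ["Y"]}
ORDER = ["X", "Y", "Z"]  # a topological order of the graph


def evaluate(noise, mechanisms=MECHANISMS):
    """Compute all node values from one noise value per node."""
    values = {}
    for node in ORDER:
        parent_values = {p: values[p] for p in PARENTS[node]}
        values[node] = mechanisms[node](parent_values, noise[node])
    return values


# Replace only the mechanism of Y; X and Z keep their original mechanisms.
changed = dict(MECHANISMS, Y=lambda parents, noise: 0.5 * parents["X"] + noise)

noise = {"X": 1.0, "Y": 0.0, "Z": 0.0}
print(evaluate(noise))           # {'X': 1.0, 'Y': 2.0, 'Z': 2.0}
print(evaluate(noise, changed))  # {'X': 1.0, 'Y': 0.5, 'Z': 0.5}
```

The value of Z changes in the second call even though the mechanism of Z is the same; only what it receives from Y differs.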
Different from analyzing treatment effects:
Root cause is the earliest cause in a chain of events that leads to a type of event
Finding root causes partitions the causal graph into semi-independent components
Treatment effects can be computed for any event in the chain
Contribution of mechanisms
Take causal graph
Replace some node with a baseline mechanism
Compute the impact of the replacement action on some output quantity
Keep replacing to find the residual impact of other nodes, after accounting for impact of previously replaced nodes
The result depends on chosen order and the baseline we choose
Contributions need to sum up to something meaningful
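The replacement procedure above can be sketched on a deliberately tiny example. The output function, the observed and baseline settings, and all function names below are hypothetical. The sketch shows two things from the notes: the per-node contributions depend on the replacement order, and for every order they sum to the same total (observed output minus all-baseline output). Averaging over orders, shown last, is one common way to remove the order dependence.

```python
from itertools import permutations


def output(setting):
    # Hypothetical output quantity with an interaction between two components.
    return setting["a"] * setting["b"]


observed = {"a": 3.0, "b": 4.0}  # assumed observed setting
baseline = {"a": 1.0, "b": 1.0}  # assumed baseline setting


def contributions_for_order(order):
    """Replace nodes by their baseline one at a time, in the given order."""
    current = dict(observed)
    contrib = {}
    for node in order:
        before = output(current)
        current[node] = baseline[node]
        contrib[node] = before - output(current)
    return contrib


def averaged_contributions():
    """Average each node's contribution over all replacement orders."""
    orders = list(permutations(observed))
    total = {node: 0.0 for node in observed}
    for order in orders:
        for node, value in contributions_for_order(order).items():
            total[node] += value
    return {node: total[node] / len(orders) for node in total}


print(contributions_for_order(("a", "b")))  # {'a': 8.0, 'b': 3.0}
print(contributions_for_order(("b", "a")))  # {'b': 9.0, 'a': 2.0}
print(averaged_contributions())             # {'a': 5.0, 'b': 6.0}
```

In each case the contributions add up to 11.0, which is output(observed) minus output(baseline).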
Baselines:
RCA of Outliers: replace anomalous noise with normal value
Distribution Change: replace existing conditional distribution with baseline conditional
Intrinsic Causal Contribution: Set random noise of node to a constant
Intuition:
A node’s anomalous value may be the root cause, which then propagates through the normal causal mechanisms of its causal children (probed by changing values)
If a node has an anomalous mechanism, normal values of its causal parents propagate anomalously (probed by changing the noise or the function at the node)
RCA of outliers
Take the probability of unlikely event of interest
Rewrite graph as just a function of noise terms
Compute the sensitivity of the event to the value of the noise term, in some sequence after the prior nodes’ impact has been considered
Average over all possible orders
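A rough sketch of these steps, under strong simplifying assumptions: the target is a hypothetical linear function of three independent standard normal noise terms, so the tail probability has a closed form. The coefficients, the observed noise values, and the function names are made up for illustration, and this is not the DoWhy implementation. Each node's contribution is the change in the log probability of the event when its noise is fixed to the observed value, averaged over all orders.

```python
import math
from itertools import permutations

# Assumed model: target = 2*N_X + N_Y + N_Z with each noise term ~ N(0, 1).
COEF = {"X": 2.0, "Y": 1.0, "Z": 1.0}
observed_noise = {"X": 3.0, "Y": 0.5, "Z": -0.25}  # N_X is the unusual one
threshold = sum(COEF[k] * observed_noise[k] for k in COEF)  # observed target


def tail_probability(fixed):
    """P(target >= threshold) when the noise terms in `fixed` are held at
    their observed values and all others are redrawn from N(0, 1)."""
    mean = sum(COEF[k] * observed_noise[k] for k in fixed)
    var = sum(COEF[k] ** 2 for k in COEF if k not in fixed)
    if var == 0.0:
        return 1.0  # every noise term fixed: the observed event is certain
    z = (threshold - mean) / math.sqrt(var)
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def outlier_contributions():
    """Average log-probability change per node over all orders."""
    orders = list(permutations(COEF))
    contrib = {k: 0.0 for k in COEF}
    for order in orders:
        fixed = []
        for node in order:
            before = math.log(tail_probability(fixed))
            fixed.append(node)
            after = math.log(tail_probability(fixed))
            contrib[node] += (after - before) / len(orders)
    return contrib


contrib = outlier_contributions()
outlier_score = -math.log(tail_probability([]))
print(contrib)
print(outlier_score)
```

By construction the contributions sum to the outlier score, the negative log probability of the event, and in this toy setting the node X receives the largest share.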
RCA of distribution changes
We observe the joint distribution of data has changed
Want to assign responsibility to nodes
Observe: KL divergence between two distributions decomposes into per-node divergences
Caveat: conditional KL divergences involve the distribution of the parents
Intuition: most distribution changes are local
Look for the few nodes that have changed
Use conditional independence tests to find them
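The decomposition mentioned above can be checked numerically on a small discrete example. The two-node graph X -> Y, the probability tables, and the variable names are assumptions chosen for illustration. Only the mechanism of Y differs between the two distributions, so the per-node term for X is zero and the whole change is assigned to Y. The term for Y is weighted by the distribution of its parent X, which is the caveat noted above.

```python
import math


def kl(p, q):
    """KL divergence between two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Assumed binary graph X -> Y. P and Q share the distribution of X;
# only the conditional of Y given X = 0 has changed.
p_x = [0.5, 0.5]
q_x = [0.5, 0.5]
p_y_given_x = [[0.9, 0.1], [0.2, 0.8]]
q_y_given_x = [[0.6, 0.4], [0.2, 0.8]]


def joint(p_x_table, p_y_table):
    """Flatten P(x) * P(y | x) into a list over (x, y)."""
    return [p_x_table[x] * p_y_table[x][y] for x in range(2) for y in range(2)]


kl_joint = kl(joint(p_x, p_y_given_x), joint(q_x, q_y_given_x))
kl_x = kl(p_x, q_x)
# Conditional term: per-parent-value divergence averaged over P(x).
kl_y = sum(p_x[x] * kl(p_y_given_x[x], q_y_given_x[x]) for x in range(2))

print(kl_joint, kl_x + kl_y)  # the two numbers agree up to rounding
```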
Idea: attribute variance in target variable to causal nodes
Compare variance of variable conditional on some nodes to variance conditional on one fewer causal node
Highlights impact of the removed node on the variance
Average over possible orderings of removals
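A small exact sketch of this variance attribution, assuming that "conditioning on a node" means conditioning on that node's noise term. The target function, the noise terms taking values -1 or +1 with equal probability, and all names are hypothetical. The example is chosen so that order matters: knowing N1 alone does not reduce the variance, but knowing N1 after N2 does.

```python
from itertools import permutations, product

NODES = ("N1", "N2", "N3")
VALUES = (-1.0, 1.0)  # each noise term is -1 or +1 with equal probability


def target(n):
    # Hypothetical target with an interaction between N1 and N2.
    return n["N1"] * n["N2"] + n["N3"]


def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


def expected_conditional_variance(given):
    """Average over values of `given` of Var(target | given)."""
    free = [n for n in NODES if n not in given]
    settings = list(product(VALUES, repeat=len(given)))
    total = 0.0
    for fixed_values in settings:
        fixed = dict(zip(given, fixed_values))
        outcomes = [
            target({**fixed, **dict(zip(free, free_values))})
            for free_values in product(VALUES, repeat=len(free))
        ]
        total += variance(outcomes)
    return total / len(settings)


def variance_contributions():
    """Average variance reduction per node over all orderings."""
    orders = list(permutations(NODES))
    total = {n: 0.0 for n in NODES}
    for order in orders:
        given = []
        for node in order:
            before = expected_conditional_variance(tuple(given))
            given.append(node)
            total[node] += before - expected_conditional_variance(tuple(given))
    return {n: total[n] / len(orders) for n in NODES}


print(expected_conditional_variance(()))  # 2.0
print(variance_contributions())           # {'N1': 0.5, 'N2': 0.5, 'N3': 1.0}
```

The contributions sum to the total variance of the target, 2.0.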
Intuition:
Attribution replaces mechanisms, not values
Want to know which part of change is due to a single node’s mechanism, not its parents
Don’t want to attribute contribution to nodes that just propagate anomalies
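To illustrate why attribution looks at mechanisms rather than values, here is a minimal sketch on a hypothetical additive-noise chain X -> Y -> Z. The structural equations, the observed sample, and the threshold of 3 standard deviations are assumptions for illustration. All three observed values look unusual, but after subtracting what the parents explain, only the noise of X is unusual; Y and Z merely propagate the anomaly.

```python
import math

# Assumed chain: X = N_X, Y = 2*X + N_Y, Z = Y + N_Z, each noise ~ N(0, 1).
observed = {"X": 5.0, "Y": 10.25, "Z": 10.0}

# Recover each node's noise by removing the part explained by its parent.
noise = {
    "X": observed["X"],
    "Y": observed["Y"] - 2.0 * observed["X"],
    "Z": observed["Z"] - observed["Y"],
}

# Marginal standard deviations implied by the assumed model:
# Var(X) = 1, Var(Y) = 4 + 1, Var(Z) = 5 + 1.
marginal_std = {"X": 1.0, "Y": math.sqrt(5.0), "Z": math.sqrt(6.0)}

THRESHOLD = 3.0  # arbitrary cutoff in standard deviations (an assumption)

unusual_value = [n for n in observed if abs(observed[n]) / marginal_std[n] > THRESHOLD]
unusual_noise = [n for n in noise if abs(noise[n]) > THRESHOLD]

print(unusual_value)  # ['X', 'Y', 'Z']
print(unusual_noise)  # ['X']
```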
Limitations of RCA
Typically can talk about unusual behavior
Can’t say much about how things should work
To capture that, need to specify baseline distributions describing what should be considered normal behavior for a given node