Supplement to Submission 9679 Author Response
"WATCH: Weighted Adaptive Testing for Changepoint Hypotheses via Weighted Conformal Martingales"
(S0) Revised Figure 1
We have revised Figure 1 in our paper to more clearly highlight how WATCH responds to, and distinguishes among, (a) benign covariate shifts (in $X$), (b) harmful covariate shifts (in $X$), and (c) harmful concept shifts (in $Y\mid X$).
Figure 1: Each column represents a data shift scenario: the top row is a simulated shift example and the bottom row shows WATCH's response, averaged over 20 random seeds. WATCH raises an alarm to retrain the AI/ML once the WCTM (blue) exceeds its alarm threshold; meanwhile, an $X$-CTM (gray)---a standard CTM that only depends on inputs $X$, and thus only detects covariate shifts---dynamically initiates the WCTM's adaptation phase and aids in root-cause analysis. In (a), the $X$-CTM starts the WCTM's adaptation phase, which allows the WCTM to avoid raising an unnecessary alarm. In (b), the extreme covariate shift causes the WCTM to raise an alarm, indicating that the covariate shift is too severe to be adapted to. In (c), the illustrated concept shift causes WATCH to raise an alarm, but without the $X$-CTM detecting a shift in covariates $X$---this allows WATCH to diagnose the root-cause of the alarm as a concept shift in $Y\mid X$.
(S1) What Makes a Covariate Shift Mild or Extreme (or In-Between)?
(S1.1) Ablation Studies on Shift Magnitude for Synthetic Data Example
Figure S1.1: Ablation study on the synthetic-data example, illustrating WATCH's performance under different magnitudes of covariate shift (in the input $X$ distribution). Each row corresponds to a specific magnitude of covariate shift and shows WATCH's response in terms of coverage (prediction safety), interval widths (prediction informativeness), and WCTMs (monitoring criteria for alarms). The post-changepoint test points are sampled from the full source distribution with probabilities proportional to $\exp(\lambda\,|x-18|)$; larger values of $\lambda$ thus correspond to more severe covariate shift toward extreme (and particularly toward large) values of the input $X$. Experiments are averaged over 20 seeds.
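For concreteness, the exponential-tilting scheme described in the caption can be sketched as follows; this is an illustrative stand-in (function and parameter names are ours, not from our codebase):

```python
import math
import random

def sample_tilted(x_source, lam, n, seed=None):
    """Draw n post-changepoint test points from the source pool with
    probability proportional to exp(lam * |x - 18|) (exponential tilting).
    lam = 0 recovers uniform sampling from the source; larger lam
    concentrates samples on extreme values of the input x."""
    rng = random.Random(seed)
    # Subtract the max exponent before exponentiating for numerical stability.
    m = max(lam * abs(x - 18.0) for x in x_source)
    weights = [math.exp(lam * abs(x - 18.0) - m) for x in x_source]
    return rng.choices(x_source, weights=weights, k=n)
```

Varying `lam` across rows of Figure S1.1 thus interpolates continuously between no covariate shift and a severe one.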
(S1.2) Pseudocode for How Proposed Methods Handle "Extreme" Covariate Shifts Differently
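A minimal runnable sketch (not our actual implementation; all class and attribute names here are illustrative) of the control flow: the X-CTM only *initiates* the WCTM's adaptation phase, and the WCTM alone decides whether to raise an alarm, which yields different outcomes for mild, extreme, and concept shifts:

```python
class Watch:
    """Illustrative sketch of WATCH's alarm/adaptation logic, given the
    current X-CTM and WCTM martingale values at each monitoring step."""

    def __init__(self, adapt_threshold=20.0, alarm_threshold=100.0):
        self.adapt_threshold = adapt_threshold  # X-CTM level starting adaptation
        self.alarm_threshold = alarm_threshold  # WCTM level triggering retraining
        self.adapting = False

    def step(self, xctm_value, wctm_value):
        if not self.adapting and xctm_value > self.adapt_threshold:
            # Covariate shift suspected: start estimating density-ratio
            # weights so the WCTM can adapt rather than alarm.
            self.adapting = True
        if wctm_value > self.alarm_threshold:
            if self.adapting:
                # X-CTM fired too: the shift in X is too extreme to adapt to.
                return "alarm: harmful covariate shift"
            # WCTM fired without the X-CTM: shift in Y|X (concept shift).
            return "alarm: concept shift"
        if self.adapting:
            # Mild/benign case: the weights absorb the shift; no alarm raised.
            return "adapting (no alarm)"
        return "monitoring"
```

Under this flow, a mild covariate shift ends in the "adapting (no alarm)" state, while extreme covariate shifts and concept shifts both alarm but are distinguished by whether the X-CTM fired.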
(S2) Further Ablation Experiments
(S2.1) Ablation Studies on WCTM with vs. without X-CTM
Figure S2.1: Ablation experiments comparing the main proposed WCTM (blue), whose adaptation is initiated dynamically by the X-CTM (gray), to a "WCTM (Deploy-Adapt)" whose adaptation process begins at deployment time $t=0$. Each row is a data shift setting, with the corresponding coverage, widths, and martingale values. Standard CP coverage and widths are reported for baseline comparison (but standard CTMs are omitted for clarity in the martingale plots). In the mild/benign covariate shift setting, the pink WCTM (Deploy-Adapt) initially increases for longer before it is able to adapt to the shift: beginning its adaptation too early biased its performance prior to the true changepoint. For extreme/harmful covariate shifts, however, the pink WCTM (Deploy-Adapt) raises an alarm even faster than the blue WCTM (X-CTM-Adapt); the two WCTM methods perform nearly identically in detecting concept shift.
(S2.2) Ablation Studies on Density-Ratio Weight Estimator
Figure S2.2: Selected synthetic-data example where logistic regression is a misspecified probabilistic classifier for distinguishing between pre- and post-changepoint data, but where a neural network (MLP) is able to accurately discriminate between the same pre- and post-changepoint data. That is, in this example the pre- and post-changepoint data are not linearly separable in the input $X$ domain, so logistic regression cannot reliably discriminate between them, and is therefore unable to reliably estimate density-ratio weights via probabilistic classification. As a result, the changepoint causes a large increase in coverage, despite some adaptation (decreasing interval widths); the estimator's misspecification thus causes the WCTM to raise an alarm, indicating that the covariate shift cannot be adapted to by that estimator. In contrast, the MLP estimator is able to adapt appropriately, maintaining target coverage, improving interval sharpness, and avoiding unnecessary alarms.
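The classification-based density-ratio estimation being ablated here reduces to the standard identity converting a classifier's pre-vs-post probability into a weight; a minimal sketch (helper name is ours, and the choice of classifier producing `p_post` is exactly what Figure S2.2 varies):

```python
def density_ratio_weight(p_post, n_pre, n_post):
    """Convert a probabilistic classifier's output p_post = P(post | x)
    into a density-ratio weight w(x) ~ q(x)/p(x) via the standard
    classification-based identity, correcting for class imbalance
    between the n_pre pre-changepoint and n_post post-changepoint points."""
    p_post = min(max(p_post, 1e-6), 1 - 1e-6)  # clip for numerical safety
    return (n_pre / n_post) * p_post / (1.0 - p_post)
```

When the classifier is misspecified (e.g., logistic regression on non-linearly-separable data), `p_post` is inaccurate and the resulting weights fail to correct the shift, which is the failure mode shown in the figure.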
(S2.3) Ablation Studies on Betting Function
Figure S2.3: Ablation experiment on the betting function used for X-CTMs and WCTMs, on three settings of the synthetic-data example. The "Composite" Jumper betting function is the betting function used in all other experiments; it is an average of Simple Jumper betting functions over the “jumping parameters” $J \in \{0.0001, 0.001, 0.01, 0.1, 1\}$, and here we set the Simple Jumper baseline to $J=0.01$. See Vovk et al. (2021) for pseudocode and exposition of the Simple Jumper algorithm. $J=1$ means conservatively spreading bets across all options to avoid cumulative losses (its capital in fact remains constant), while smaller $J$ encourages “doubling down” on bets that were previously successful. The CTMs with Composite betting are thus lower bounded at $M_t = 0.2$ (one fifth of the constant $J=1$ component), whereas those with Simple betting can continually decrease, resulting in slightly delayed detection relative to Composite.
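A sketch of the Simple and Composite Jumper betting martingales, following our reading of Vovk et al. (2021) (function names are ours); capital is split over the three betting functions $f_\epsilon(p) = 1 + \epsilon(p - 1/2)$, $\epsilon \in \{-1, 0, 1\}$, with a fraction $J$ of capital redistributed uniformly at every step:

```python
def simple_jumper(p_values, J=0.01):
    """Simple Jumper test martingale (after Vovk et al., 2021): returns
    the martingale path M_1, ..., M_T given a stream of p-values."""
    capital = {-1: 1 / 3, 0: 1 / 3, 1: 1 / 3}
    path = []
    for p in p_values:
        total = sum(capital.values())
        for eps in capital:
            # "Jump": redistribute a fraction J of capital uniformly.
            capital[eps] = (1 - J) * capital[eps] + (J / 3) * total
        for eps in capital:
            # Bet with f_eps(p) = 1 + eps * (p - 1/2).
            capital[eps] *= 1 + eps * (p - 0.5)
        path.append(sum(capital.values()))
    return path

def composite_jumper(p_values, Js=(0.0001, 0.001, 0.01, 0.1, 1)):
    """Composite Jumper used in our experiments: the average of Simple
    Jumper martingales over the jumping parameters Js."""
    paths = [simple_jumper(p_values, J) for J in Js]
    return [sum(vals) / len(Js) for vals in zip(*paths)]
```

Note that with $J=1$ the three capitals are equalized before every bet and the betting functions sum to 3, so that component stays constant at 1; averaging over the five values of $J$ therefore bounds the Composite martingale below by $0.2$.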
(S3) In-depth Analysis of Image Classification Experiments
(a) 1st column: no corruption; (b) 2nd column: minor benign corruption; (c) 3rd column: benign corruption; (d) 4th column: extreme corruption
Figure S3: These results supplement Figure 3 in the main paper, demonstrating the coverage rate and prediction-set size under four different corruption scenarios. In the multi-class classification setting, we adopt metrics different from those used in the regression experiments, following Romano et al. (2020), and measure the prediction-set size and coverage rate (defined as the proportion of true classes contained in the prediction sets, NOT conformal coverage) as principled risk metrics for distinguishing benign from harmful shifts. We increased the size of the validation set and the number of samples visualized to yield more robust performance estimates and a clearer view of the trajectories; the results are averaged over a window of size 200, while all other configurations remain unchanged from the original setting. As discussed in our paper, we mixed test samples (target, corrupted) with validation samples (source, clean) to improve the estimation of weights for CTMs. Thus, under corrupted scenarios, the “starting points” before the changepoints for the WCTM and CTM differ, as the mixture allows the validation set to contain corrupted data; however, this difference is not clearly reflected in the martingale paths. Overall, the models initially exhibit relatively high classification performance in the clean setting, while the standard CTM's metrics rapidly decline to a lower level under all corruption conditions. Although the WCTM adapts to changes in the benign scenarios, it eventually exhibits severe metric changes under extreme shifts as well, consistent with the results in Figure 3.
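The windowed metrics described above can be sketched as follows (an illustrative helper, not our plotting code): coverage is the fraction of true labels contained in the prediction sets, and both metrics are averaged over a sliding window of the most recent 200 test points.

```python
from collections import deque

def rolling_metrics(pred_sets, true_labels, window=200):
    """Rolling coverage rate (fraction of true labels contained in the
    prediction sets) and average prediction-set size over a sliding
    window. `pred_sets` is a sequence of sets of candidate labels."""
    hits = deque(maxlen=window)
    sizes = deque(maxlen=window)
    coverage_path, size_path = [], []
    for s, y in zip(pred_sets, true_labels):
        hits.append(1.0 if y in s else 0.0)
        sizes.append(len(s))
        coverage_path.append(sum(hits) / len(hits))
        size_path.append(sum(sizes) / len(sizes))
    return coverage_path, size_path
```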
Experiment Details
References
Romano, Y., Sesia, M., & Candès, E. (2020). Classification with valid and adaptive coverage. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020).
Vovk, V., Petej, I., Nouretdinov, I., Ahlberg, E., Carlsson, L., & Gammerman, A. (2021, September). Retrain or not retrain: Conformal test martingales for change-point detection. In Conformal and Probabilistic Prediction and Applications (pp. 191-210). PMLR.