Chromium Try Flakes detects test flakes on CQ and reports each flaky test to Findit as soon as it is detected.
Flake Analyzer uses historical build artifacts from Chromium Waterfall to rerun the test many times (n > 100) and calculates the pass rate at different Waterfall build points as shown below.
Starting from the build in which flakiness was first detected and working backwards, Flake Analyzer uses a variant of exponential backward search with slightly increasing step sizes to pick the next build point at which to rerun the test. Once it finds a build in which the test is stable (98%+ passing or failing), Flake Analyzer switches to searching forward linearly to narrow the regression range down to a single build point on Waterfall.
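The search described above can be sketched roughly as follows. This is a minimal illustration rather than Findit's actual code: the function names, the `pass_rate_at` callback (standing in for rerunning the test at a build point), and the choice to grow the step by one per iteration are all assumptions; only the 98% stability threshold comes from the text.

```python
from typing import Callable, Optional

STABLE_THRESHOLD = 0.98  # from the text: 98%+ passing or failing counts as stable


def is_stable(pass_rate: float) -> bool:
    """A test is stable if it almost always passes or almost always fails."""
    return pass_rate >= STABLE_THRESHOLD or pass_rate <= 1.0 - STABLE_THRESHOLD


def find_regression_build(
    flaky_build: int,
    pass_rate_at: Callable[[int], float],  # hypothetical: reruns the test at a build
    min_build: int = 0,
) -> Optional[int]:
    """Returns the earliest flaky build found, or None if no stable build exists."""
    step = 1
    build = flaky_build
    last_flaky = flaky_build
    stable_build = None

    # Backward phase: jump back with a step that grows a little each time.
    while build > min_build:
        build = max(build - step, min_build)
        if is_stable(pass_rate_at(build)):
            stable_build = build
            break
        last_flaky = build
        step += 1  # assumption: "slightly increasing" modeled as +1 per jump

    if stable_build is None:
        return None  # flaky all the way back; no regression point identified

    # Forward phase: walk linearly from the stable build to the first flaky one.
    for candidate in range(stable_build + 1, last_flaky):
        if not is_stable(pass_rate_at(candidate)):
            return candidate
    return last_flaky
```

For example, if a test passes 100% of the time before build 90 and 70% of the time from build 90 onward, starting the search at build 100 visits builds 99, 97, 94, 90 and 85 going backwards, then walks forward from 86 and reports build 90.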
Advantages of the above approach:
Note: For flakes on CQ, Flake Analyzer maps the test step on the CQ trybot to the test step on the corresponding Waterfall buildbot. However, for release builds, CQ trybots compile with DCHECK on, while the corresponding Waterfall buildbots compile with it off. Thus, Flake Analyzer might not support flaky tests on CQ that trigger a DCHECK.
When a regression range is identified for a reproducible flake, step detection is used to estimate how reliable the range is, expressed as a confidence score. For a range with sufficient confidence (> 0.6), Flake Analyzer triggers a series of try-jobs to compile and rerun the test at commits in the range and identify the exact culprit as shown below; otherwise, Flake Analyzer bails out with just the regression range.
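To make the confidence gate concrete, here is a small sketch. The text does not say how Flake Analyzer computes its step-detection confidence, so the score below (the fraction of variance in the pass rates explained by a single step at the split point) is a stand-in chosen for illustration, and `step_confidence` and `should_run_try_jobs` are hypothetical names. Only the 0.6 threshold comes from the text.

```python
from statistics import fmean
from typing import Sequence

CONFIDENCE_THRESHOLD = 0.6  # from the text: try-jobs run only above this value


def step_confidence(pass_rates: Sequence[float], split: int) -> float:
    """Fraction of variance explained by one step at `split` (illustrative only).

    `pass_rates` is ordered by build point; `split` is the index of the first
    build point after the suspected regression.
    """
    if not 0 < split < len(pass_rates):
        raise ValueError("split must leave at least one point on each side")
    before, after = pass_rates[:split], pass_rates[split:]
    overall_mean = fmean(pass_rates)
    total = sum((r - overall_mean) ** 2 for r in pass_rates)
    if total == 0:
        return 0.0  # completely flat series: no step to detect
    before_mean, after_mean = fmean(before), fmean(after)
    within = sum((r - before_mean) ** 2 for r in before)
    within += sum((r - after_mean) ** 2 for r in after)
    return 1.0 - within / total


def should_run_try_jobs(pass_rates: Sequence[float], split: int) -> bool:
    """True if the range is trusted enough to narrow down with try-jobs."""
    return step_confidence(pass_rates, split) > CONFIDENCE_THRESHOLD
```

With this stand-in score, a clean drop such as `[1.0, 1.0, 1.0, 0.7, 0.7, 0.7]` split at index 3 scores close to 1 and proceeds to try-jobs, while a noisy series with no clear step scores near 0 and the analysis would stop at the regression range.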
In many cases, the regression is caused by changes that modify the file containing the test or related files. Findit therefore also performs heuristic analysis on the regression range to suggest culprits, and try-jobs confirm those suggestions. When heuristic results are unavailable or incorrect, Flake Analyzer bisects the regression range to identify the culprit quickly.
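The combination of heuristic suspects and bisection might look like the sketch below. It is an illustration under stated assumptions, not the real implementation: `find_culprit` and the `is_flaky_at` callback (standing in for a try-job that compiles a commit and reruns the test) are hypothetical, and the commit list is assumed to run from a known-stable commit to a known-flaky one.

```python
from typing import Callable, Sequence


def find_culprit(
    commits: Sequence[str],
    is_flaky_at: Callable[[str], bool],  # hypothetical stand-in for a try-job
    suspects: Sequence[str] = (),
) -> str:
    """Returns the first commit at which the test is flaky.

    Assumes `commits` is ordered oldest to newest, the test is stable at
    commits[0], and the test is flaky at commits[-1].
    """
    index = {commit: i for i, commit in enumerate(commits)}

    # Check heuristic suspects first: a suspect is confirmed if the test is
    # flaky at that commit but stable at the commit just before it.
    for suspect in suspects:
        i = index.get(suspect)
        if i is None or i == 0:
            continue
        if is_flaky_at(suspect) and not is_flaky_at(commits[i - 1]):
            return suspect

    # No confirmed suspect: bisect between the stable and flaky ends.
    lo, hi = 0, len(commits) - 1  # invariant: stable at lo, flaky at hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_flaky_at(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]
```

A correct suspect costs two try-jobs, while bisection needs roughly log2 of the range size, which is why it serves as a fast fallback when the heuristic suggestions miss.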