Overview

Detect Flaky Tests

Chromium Waterfall

Upon a test failure, Findit reruns the failed test for 30 iterations in a Swarming task, using the identical test binary & data from Waterfall. The test is considered flaky if any iteration passes.
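The rule above can be sketched in a few lines of Python. This is only an illustration, not Findit's code: `classify_failure` and the `run_test_once` callable are hypothetical names, and the real reruns happen inside a Swarming task rather than in-process.

```python
from typing import Callable


def classify_failure(run_test_once: Callable[[], bool],
                     iterations: int = 30) -> str:
    """Rerun a failed test and classify it.

    run_test_once is a hypothetical callable that returns True when the
    test passes. The default of 30 iterations matches the text above.
    """
    results = [run_test_once() for _ in range(iterations)]
    # A single pass among the reruns is enough to call the test flaky.
    return "flaky" if any(results) else "consistent failure"
```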
Commit Queue (CQ)

Chromium Try Flakes detects test flakes on CQ, and reports flaky tests to Findit upon detection.

Identify the Regression Range

Flake Analyzer uses historical build artifacts from Chromium Waterfall to rerun the test many times (n > 100) and calculates the pass rate at different Waterfall build points as shown below.

Starting from the build in which flakiness was first detected and working backwards, Flake Analyzer uses a variant of exponential backward search with slightly increasing step sizes to pick the next build point at which to rerun the test. Once it finds a build in which the test is stable (98%+ passing or failing), Flake Analyzer switches to searching forward linearly to narrow the regression range down to a single build point on Waterfall.
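A minimal sketch of this search follows, under stated assumptions: `pass_rate_at` is a hypothetical callable standing in for "rerun the test many times at this build and return the pass rate", build numbers are consecutive integers, and the step grows by one build per iteration. The text only says the step sizes are "slightly increasing", so the exact schedule here is a guess.

```python
from typing import Callable, Optional, Tuple

STABLE_THRESHOLD = 0.98  # "98%+ passing or failing", per the text above


def is_stable(pass_rate: float) -> bool:
    """A test is stable if it almost always passes or almost always fails."""
    return pass_rate >= STABLE_THRESHOLD or pass_rate <= 1.0 - STABLE_THRESHOLD


def find_regression_range(
    flaky_build: int,
    pass_rate_at: Callable[[int], float],
    earliest_build: int = 0,
) -> Optional[Tuple[int, int]]:
    """Return (last stable build, first flaky build), or None if no
    stable build is found before running out of history."""
    step = 1
    upper = flaky_build          # oldest build known to be flaky so far
    current = flaky_build - step
    lower = None                 # newest build known to be stable
    # Backward phase: step sizes 1, 2, 3, ... (assumed schedule).
    while current >= earliest_build:
        if is_stable(pass_rate_at(current)):
            lower = current
            break
        upper = current
        step += 1
        current -= step
    if lower is None:
        return None
    # Forward phase: walk build by build from the stable point.
    for build in range(lower + 1, upper):
        if not is_stable(pass_rate_at(build)):
            return (build - 1, build)
    return (upper - 1, upper)
```

For example, if the test became flaky at build 90 and flakiness was detected at build 100, this sketch probes builds 99, 97, 94, 90 and 85 going backwards, then 86 through 89 going forwards, and returns `(89, 90)`.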

Advantages of the above approach:

Fast:
- No compile is needed, because test binary & data are pre-built by Waterfall.
- Test binaries & data are cached and hot on the Swarming bots.
A last known good revision is not needed, and none is available anyway:
- When flakiness is detected, the culprit is more likely to have been introduced in a recent build cycle, and the exponential backward search concentrates on recent builds first.
- With a hard-coded good revision (e.g. bad revision - 5000), bisecting in the range might find an earlier culprit instead of the most recent one if the test experienced multiple flake regressions.

Note: For flakes on CQ, Flake Analyzer maps the test step on the CQ trybot to the test step on the corresponding Waterfall buildbot. However, for release builds, CQ trybots compile with DCHECK on, while the corresponding Waterfall buildbots compile with it off. Thus, Flake Analyzer might not be able to analyze flaky tests on CQ whose failures are triggered by a DCHECK.

Identify the Exact Culprit

When a regression range is identified for a reproducible flake, step detection is used to compute a confidence score for how reliable the range is. For a range with sufficient confidence (> 0.6), Flake Analyzer triggers a series of try-jobs to compile and rerun the test at commits in the range and identify the exact culprit as shown below; otherwise, Flake Analyzer bails out with just the regression range.

In many cases, the regression is caused by changes that modify the file containing the test or related files. Findit also performs heuristic analysis on the regression range to suggest culprits, whose results are confirmed by try-jobs. When heuristic results are unavailable or incorrect, Flake Analyzer bisects the regression range to identify the culprit quickly.
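The bisection fallback can be sketched as a standard binary search over the commits in the regression range. The names are hypothetical: `is_flaky_at` stands in for a try-job that compiles the commit and reruns the test. The sketch assumes the commits are ordered oldest to newest, that the commit just before the range is stable, and that the last commit in the range is flaky. The heuristic-suggestion step is left out.

```python
from typing import Callable, Sequence


def bisect_culprit(commits: Sequence[str],
                   is_flaky_at: Callable[[str], bool]) -> str:
    """Return the first commit in the range at which the test is flaky."""
    if not commits:
        raise ValueError("regression range must contain at least one commit")
    lo, hi = 0, len(commits) - 1  # invariant: culprit index is in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if is_flaky_at(commits[mid]):
            hi = mid       # culprit is at mid or earlier
        else:
            lo = mid + 1   # culprit is after mid
    return commits[lo]
```

Each probe corresponds to one compile-and-rerun try-job, so a range of n commits needs about log2(n) try-jobs instead of n.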
