The Chromium Projects

Except as otherwise noted, the content of this page is licensed under a Creative Commons Attribution 2.5 license, and examples are licensed under the BSD License.

HandlingLayoutTestExpectations

This spec was never implemented. It remains here for historical reasons.

Handling Layout Test Expectations: Rebaselining and layout test failures

Executive Summary

We have a lot of layout test failures. For each failure, we have no good way of tracking whether anyone has looked at the test output lately, or whether the output is still broken or should be rebaselined. We just went through a week of rebaselining, and we stand a good chance of needing to do it again in a few months, having lost all of the knowledge that was captured last week.

So, I propose a way to capture the current "broken" output from failing tests, and to version control them so that we can tell when a test's output changes from one expected failing result to another. Such a change may reflect that there has been a regression, or that the bug has been fixed and the test should be rebaselined.

Changes

  1. We modify the layout test scripts to check for 'foo-bad' as well as 'foo-expected'. If the output of test foo does not match 'foo-expected', we check whether it matches 'foo-bad'. If it does, we treat it as we treat test failures today, except that there is no need to save the failed test result (since a version of the output is already checked in). Note that although a "-bad" baseline resembles a baseline for a different platform, we cannot model it as just another platform, since we need up to N different "-bad" versions, one for each supported platform that a test fails on.
  2. We check in a set of '*-bad' baselines based on the current output of the failing tests. In theory, they should all be legitimate.
  3. We modify the test scripts to also report regressions from the *-bad baselines. In the cases where we know the failing test is also flaky or nondeterministic, we can indicate that as "NDFAIL" in test expectations to distinguish it from a regular deterministic "FAIL".
  4. We modify the rebaselining tools to handle "*-bad" output as well as "*-expected".
  5. Just like we require each test failure to be associated with a bug, we require each "*-bad" output to be associated with a bug - normally (always?) the same bug. The bug should contain comments about what the difference is between the broken output and the expected output, and why it's different, e.g., something like "Note that the text is in two lines in the -bad output, and it should be all on the same line without wrapping."
  6. The same approach can be used here to justify platform-specific variances in output, if we decide to become even more picky about this, but I suggest we learn to walk before we try to run.
  7. Eventually (?) we modify the layout test scripts themselves to fail if the *-bad baselines aren't matched.
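
The lookup order described in steps 1 and 3 can be sketched roughly as follows. This is only an illustration of the proposal, not code from the layout test scripts: the names classify, find_baseline and Result, the ".txt" suffix, and the "platform directory first, generic directory second" layout are all assumptions made for the example.

```python
from enum import Enum
from pathlib import Path
from typing import Optional


class Result(Enum):
    PASS = "pass"                    # output matches foo-expected
    EXPECTED_FAIL = "expected fail"  # output matches the checked-in foo-bad
    NEW_FAILURE = "new failure"      # output matches neither baseline


def find_baseline(test: str, suffix: str, platform: str, root: Path) -> Optional[Path]:
    """Return the platform-specific baseline if present, else the generic one.

    The directory layout here is hypothetical.
    """
    for directory in (root / platform, root):
        candidate = directory / f"{test}-{suffix}.txt"
        if candidate.exists():
            return candidate
    return None


def classify(test: str, actual: str, platform: str, root: Path) -> Result:
    """Compare actual output with foo-expected first, then with foo-bad."""
    expected = find_baseline(test, "expected", platform, root)
    if expected is not None and expected.read_text() == actual:
        return Result.PASS
    bad = find_baseline(test, "bad", platform, root)
    if bad is not None and bad.read_text() == actual:
        # Known failure: nothing new to save, the output is already checked in.
        return Result.EXPECTED_FAIL
    # The output changed relative to both baselines: either a regression,
    # or a fix that means the test should be rebaselined.
    return Result.NEW_FAILURE
```

Because each platform directory can hold its own 'foo-bad' file, this arrangement allows one "-bad" version per platform that a test fails on, which is why "-bad" is modelled as a suffix rather than as another platform.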

Discussion

Advantages:

  • We'll be able to notice when test output changes, potentially increasing our fix rate and letting us catch new regressions

Disadvantages:

  • There may be a lot more work to maintain the baselines
  • Disk storage in the repository will increase as we save the results (especially for PNGs)
  • This perhaps sends the wrong message that it's okay for tests to fail

Note however that the overhead should trend to zero as we get closer to zero test failures. Realistically, we should expect there to be expected failures for quite some time, and so anything we can do to work with them more effectively is probably a good thing.

Also, I am working on a separate change to store the expected images on the server, so that they don't have to be pulled down locally into the tree. 

Background

First, let's think about how one writes tests. Typically, there are two approaches. 

The first, and most popular these days, is to write a self-contained test that checks its own output and simply announces "pass" or "fail". This is in fact the recommended way to write tests in WebKit, and is how xUnit-style tests are usually written.

The second is to separate the test from the output, and to use a driver that checks the output against an expected result (or baseline) to determine if the test passed or failed. This is how run_webkit_tests works.

Most people prefer the first approach because you have fewer files to maintain, and the purpose and correctness of the test is more obvious. However, in some cases (e.g., pixel tests in the renderer), this simply isn't possible (or, at least, practical).
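
To make the contrast concrete, here is a minimal sketch of the two styles. The function names and the trivial string-joining "feature under test" are invented for illustration, and run_against_baseline is a hypothetical stand-in for a driver, not the actual run_webkit_tests logic.

```python
from typing import Callable


def self_checking_test() -> str:
    """First approach: the test inspects its own result and reports pass/fail."""
    result = "-".join(["a", "b", "c"])
    return "pass" if result == "a-b-c" else "fail"


def output_only_test() -> str:
    """Second approach: the test only emits output and makes no judgement."""
    return "-".join(["a", "b", "c"]) + "\n"


def run_against_baseline(test: Callable[[], str], baseline: str) -> str:
    """Hypothetical driver: compare the emitted output with a stored baseline."""
    return "pass" if test() == baseline else "fail"
```

In the second style the notion of "correct" lives entirely in the baseline, so the same test can be paired with different baselines without being edited.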

Both approaches, however, have drawbacks, both in the normal case and in the "we expect this test to fail" case.

In the first approach, a problem arises if there are actually multiple "correct" answers. One example is a JavaScript test whose expected output differs between V8 and JSC (e.g., if you were testing implementation-dependent features like stack traces). There are three workarounds for this, all of which are weak. The first is to rewrite the implementation to conform. Generally this is a good idea, but sometimes it might not be desirable or even possible. The second is to rewrite the test to accept either output. This is somewhat fragile, and requires the test to know about every possible implementation. The third is to not run the test, and instead to copy and paste it into a new test and then adjust the new test's expected output to match. Both of the latter approaches lead to confusion (which version is "correct", and why?) and maintenance issues (if something changes, do I have to modify both versions? What should the "correct" output be?).
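
The fragility of the second workaround can be seen in a small sketch. The engine names and output strings below are placeholders invented for the example; they are not real V8 or JSC output.

```python
# Placeholder outputs keyed by a made-up engine name. A self-checking test
# that accepts "either" output has to enumerate every variant it knows about.
ACCEPTED_OUTPUTS = {
    "engine_a": "Error at line 3",
    "engine_b": "line 3: Error",
}


def check(actual: str) -> str:
    """Pass if the output matches any known implementation's variant."""
    return "pass" if actual in ACCEPTED_OUTPUTS.values() else "fail"
```

A third implementation that prints, say, "Error (line 3)" fails this test until someone edits the test itself, which is the maintenance cost the paragraph above describes.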

The second approach, obviously, is just the second workaround to the first approach, codified into different files. The second approach is perhaps preferable where multiple correct results are the norm, rather than the exception, which is why we use it to compare PNGs across multiple platforms.

Now, in the case where tests fail, one can actually view the failure as a different kind of "correct" - i.e., we know the output is wrong, but it's an "expected" diff, and in most cases, we want to know if the diff changes from what we expect. Perhaps we actually fixed the bug? Perhaps we introduced a new bug? In fact, one could argue that platform-specific baselines are "expected" wrong baselines.

Tracking "expected diffs" introduces its own woes - what if the diff output is not deterministic? Or, and more importantly, how do you distinguish "expected wrong diff" from "expected right diff"? 

Lastly, one could argue that we should spend more time fixing the bugs that cause the diffs, and less time tracking diffs :) Unfortunately, it's a lot faster to baseline expected diffs than it is to fix them :(