Sheriffs should have a strong bias towards actions that keep the tree green and then open. For main waterfall bots that are on the Commit Queue, if a simple revert can fix the problem, the sheriff should revert first and ask questions later. For bots that are not covered by the Commit Queue, if the author is online, it's fine to ask them to fix the problem asynchronously (it shouldn't be blocking anyone, and it's not the author's fault, as the change landed through the CQ).
sheriff-o-matic.appspot.com is a tool to automate the work of sheriffing. It shows you just the failures and it automatically intersects regression ranges across the bots for you. Currently it only works for the Chromium tree. Making it work for other trees is a small amount of work: crbug.com/409693.
More on sheriff-o-matic and what to do if it's down: https://sites.google.com/a/chromium.org/dev/developers/tree-sheriffs/sheriff-o-matic.
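To illustrate what "intersecting regression ranges across the bots" means, here is a minimal sketch. It is not sheriff-o-matic's actual implementation: the function name, the integer revision numbers, and the bot names are all hypothetical, and it assumes every bot is failing because of the same change.

```python
from typing import Iterable, Optional, Tuple

# A regression range is (last_passing_revision, first_failing_revision):
# the culprit landed after the first revision and at or before the second.
Range = Tuple[int, int]


def intersect_regression_ranges(ranges: Iterable[Range]) -> Optional[Range]:
    """Return the narrowest range consistent with every bot.

    Returns None when there are no ranges or when the ranges do not overlap,
    which suggests the bots are failing for different reasons.
    """
    ranges = list(ranges)
    if not ranges:
        return None
    last_pass = max(r[0] for r in ranges)   # latest revision known to be good
    first_fail = min(r[1] for r in ranges)  # earliest revision known to be bad
    if last_pass >= first_fail:
        return None
    return (last_pass, first_fail)


# Hypothetical per-bot ranges, for illustration only.
per_bot = {
    "linux": (290100, 290140),
    "mac": (290120, 290160),
    "win": (290110, 290135),
}
print(intersect_regression_ranges(per_bot.values()))  # (290120, 290135)
```

The narrower the intersected range, the fewer CLs a sheriff has to consider as revert candidates.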
Troopers know more about maintaining the buildbot masters and slaves themselves. They're the people to look for when the bots need an OS update, a machine goes offline, checkouts are failing repeatedly, and so on.
Gardeners are watchers of particular component interactions. They generally watch a component's release or development and move the version included forward when it is compatible. This is called "rolling DEPS", and consists of committing a CL that changes the (Subversion) revision number in some
Of particular interest to the Chromium projects are the gardeners who watch the interaction between Skia and Chromium, and those who watch the interaction of Chromium and ChromiumOS.
See Sheriff Details: Chromium for how to revert changes, disable tests, etc.
The tree status can be closed or open. These status levels control the activity of the commit queue. If the tree is open, the commit queue runs as normal.
Annotate the tree status with information about what is known about the status of build failures. For example, automatic closure messages such as...
... should be changed to:
... to indicate that committer 'johnd' has been notified of the problem and is looking into it. Once a fix has been checked in, sheriffs often use status:
... to indicate that a fix/revert has been checked in and the tree will likely be opened soon. Alternatively, if the sheriffs decided to revert first and ask questions later, then the tree status should be changed to:
If the tree has been closed for an extended time, particularly if the breakage covered more than one working timezone (US Pacific, US Eastern, Europe, Asia), it is considered best practice to communicate what was needed to fix the breakage. That way the next sheriff knows what's been happening, and people in other timezones know what to do next time it breaks the same way.
If the fix was simple, it can be listed in the tree-open status message, such as...
If a more detailed fix was needed, send email to the chromium-dev mailing list explaining what happened. It's a good idea to CC the current and upcoming sheriffs too.
Sometimes you just need to clobber (i.e. force a full, clean rebuild of) some class of bots (win, mac+ninja, linux asan using make, etc.). You can do this by landing a landmine change. Docs are here: Chromium Clobber Landmines.
Note: if a specific CL is causing bots to break unless they are clobbered, that CL should be reverted first and fixed to avoid this.
To retry the last build, you can force a build.
For Chromium only, if you check the "Clobber" checkbox, it will also delete the build output directory before redoing the compile.
Note: If this is not a builder (no compile step), then doing a clobber won't do anything. You need to clobber the "Builder" first.
There is an option to stop a build, but do not use it! If you stop the build during the update step, the bot is going to be hosed for sure. Again, don't use this option, and if you feel like using it, talk to the troopers first.
Install the buildbot error extension to more quickly isolate errors on stdio pages. See Useful extensions for chromium developers for more information.
See the Pre Flight Queue documentation.
The authoritative list is on Google Calendar. Here's how to add the sheriff calendar to yours:
To see who the sheriff is, click an event and look at the guest list. (Yes, it would be nice if it showed the people in the event title, but then there's the issue of the event title and the guest list getting out of sync -- no easy answer.) To find when a specific person is going to be sheriff, use Google Calendar's advanced search box (click the down-triangle in the main search box), select the appropriate sheriff calendar, and type the person's username into the "Who" box.
APAC sheriffs: Note that the day shown in the calendar is the California / PST day. The actual day you will sheriff (in your local time) will be the day after. See "APAC and time zones" below.
The script/process that updates the calendars can be found in svn://svn.chromium.org/chrome-internal/trunk/tools/build/scripts/tools/sheriff.
There are two ways to see who the current sheriff is:
Please note that if it is Sunday PST / Monday APAC, there won't officially be a sheriff on duty, but whoever is rostered on Friday in APAC should be sheriffing (see "APAC and time zones" below).
If you need to swap shifts with someone, add them to the rotation so that the buildbot and other tools display the proper people as sheriffs. To do this:
This guideline is under discussion. See this thread on chromium-dev. The following is the practice that has been adopted by the Sydney office.
If you are sheriffing from an APAC time zone (e.g., Tokyo or Sydney), please note that the calendar shows times in the California / PST time zone, which is one day behind. You should sheriff when it is that day in California (i.e., one day later in your local time zone).
Infra is aware that this is awkward, and has plans to fix it eventually.
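As a worked example of the day offset described above, the following sketch converts an APAC local time to the California calendar day. The helper names are hypothetical, and the fixed UTC offsets are a simplifying assumption that ignores daylight saving time; real tooling would want a proper time zone database.

```python
from datetime import date, datetime, timedelta, timezone

# Fixed offsets for illustration only (assumption): daylight saving is ignored.
CALIFORNIA = timezone(timedelta(hours=-8), "PST")
SYDNEY = timezone(timedelta(hours=10), "AEST")


def calendar_day_for(local_time: datetime, calendar_tz: timezone = CALIFORNIA) -> date:
    """Return the California date that a timezone-aware local time falls on."""
    return local_time.astimezone(calendar_tz).date()


def local_shift_day(calendar_day: date) -> date:
    """Apply the guideline above: sheriff one day later in local time."""
    return calendar_day + timedelta(days=1)


# Tuesday morning in Sydney is still Monday in California.
tuesday_morning = datetime(2014, 9, 16, 9, 0, tzinfo=SYDNEY)
print(calendar_day_for(tuesday_morning))    # 2014-09-15
print(local_shift_day(date(2014, 9, 15)))   # 2014-09-16
```

So an APAC sheriff whose calendar entry falls on Monday, 15 September would sheriff on Tuesday, 16 September local time.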