Anonymized complaints dataset

The full complaints dataset is available via Google BigQuery, which support SQL queries and data exports. You can access it here; you need a Google account.

For now, please contact to get your Google account added to the access list.

Data fields

The complaints data is basically a single table, [serfr0-1000:public.comp]. Each row is a single, anonymized complaint. (serfr0-1000 is the behind-the-scenes name of in Google Cloud.)

The columns in that table contain data as follows:

  • Timestamp: a UTC timestamp, accurate to the second, of when the user complained.
  • Speedbrakes: whether the user ticked the 'heard speedbrakes' tickbox
  • Loudness: 0=undefined, 1=loud, 2=LOUD, 3=INSANE
  • Activity: whichever activity the user picked from the dropdown (if any)
  • User: a SHA512_256 fingerprint of the user's email address, as a hex string
  • City: the name of the city in the user's profile
  • Zip: the 5-digit ZIP code in the user's profile
  • DatePST: the timestamp's date, in Pacific time, as a %Y-%m-%d formatted string
  • HourPST: the timestamp's hour of day, in Pacific time, in the range 00 to 23.

The following fields relate to the flight identified for the complaint; if no flight was identified, then they will be empty or zero. (If the user manually entered a flightnumber, then some but not all of them will be empty or zero.)

  • FlightKey: a "unique" ID for this particular flight, based on the flightnumber and the date (Pacific time) when we noticed the flight. (See notes below for more on this field.)
  • FlightNumber: the IATA schedule flight number for this flight (e.g. AA123, with a two-letter airline code)
  • AirlineCode: the two-letter IATA airline code (e.g. AA), as extracted from the FlightNumber. It gets its own field to help with aggregate queries.
  • Origin,Destination: the three-letter airport codes (e.g. SFO) for the flight.
  • EquipType: the type of aircraft (e.g. B744, for a (4th generation ?) Boeing 747).
  • Lat,Long: the position of the flight at the time the complaint was filed, as best as we could figure out. (See anonymization notes below for more on this data.)
  • PressureAltitude: the altitude of the aircraft, in feet. There are many ways to describe the altitude or height of an aircraft; this is the pressure altitude. Depending on the atmospherics, it can be as much as 400 feet different from the physical height above sea level.
  • Groundspeed: the speed of the aircraft, in knots. Note that aircraft regulations are almost always expressed in airspeed, not groundspeed, and there is no sensible way to convert from one to the other without very detailed knowledge of the winds around the aircraft.

A few notes on the data

    • Users who have unticked the 'share my data' box on their profile are omitted from this dataset; as of June 2016, ~1% of all profiles had unticked this box.
    • Totals derived from anonymized data are thus a tiny bit smaller than the totals displayed on stop.jetnoise itself.
  • The dataset is updated once per day, just after midnight, with the previous day's complaints.
    • Realtime updates would be interesting, but would be complex, as users often revise or delete complaints, or add them in later on.
  • The date in the FlightKey field is very approximate.
    • It's basically "the date, in Pacific Time, when we first saw this flight show up via ADSB/MLAT".
    • A flight that normally shows up at 23:30, but is delayed by an hour into the next day, makes this pretty screwy; that next day will then see both the 00:30 and 23:30 versions of the flight, grouped under the same key.
    • This is a surprisingly hard problem to fix robustly, without access to industry data sources :/
    • For now, take any analysis based on this key as, at best, 80% accurate.
    • Caveats aside: you can use this key to JOIN against the [] table, which has waypoints and arrival procedure (as well as a bunch of identifiers) for mostly all flights we've been able to track down
  • The flight's lat/long values are approximate
    • They can be wrong by up to 4KM . That's because of timing issues (when the complaint button gets pressed; when the data streams report new positions), and also inaccuracies in the data itself (MLAT is always fuzzy, and ADSB can be misconfigured by a few hundred metres)
  • Flights can be misidentified
    • I'd guess that 5% of the complaints with flightnumbers pick out the wrong flight.
  • BigQuery does everything in UTC, all the time.
    • If you use the builtin SQL operator DATE(<timestamp>), you'll get the UTC date, which isn't that helpful. Same goes for the HOUR(<timestamp>) operator.
    • Instead, you should use the DatePST and HourPST columns, which are in America/Los_Angeles time (and in fact respect the shifts in/out of daylight savings, despite being called PST)

How the data is anonymized

You can review the anonymizing code on github. It works like this:

  • Names (e.g. Jane Doe) are removed
  • Email addresses are removed
    • They are combined with a secret global salt, and SHA512_256 hashed into a string, which is used as a unique userID. This allows counts of unique users, or number of complaints per user, to be generated, without exposing the actual email address
  • The postal address is mostly cleared
    • The street address is removed
    • The City (e.g. "Scotts Valley") and the 5-digit ZIP (e.g. "95066") are kept, so that some basic per-city / per-ZIP reporting can be done.
  • The timestamp is kept unchanged (it is accurate to the second)
  • The details of the flight (flightnumber, position etc.) are kept unchanged.
    • Note: this includes the lat/long of the flight, as best we could tell, at the time the complaint was made. With enough instances, this data builds a rough approximation to the physical location of the user.

Example queries

Count complaints and unique users, for each day:

  COUNT(datepst) AS NumComplaints,
  COUNT(DISTINCT(user)) AS UniqueUsers,
FROM [serfr0-1000:public.comp]
GROUP BY datepst

Count complaints and unique users by city, on a particular date:

  COUNT(city) AS NumComplaints,
  COUNT(DISTINCT(user)) AS NumUsers
  datepst = '2016-06-20'
ORDER BY NumComplaints DESC

There are a bunch of views already defined; each of them has a backing query that you can look at, too.