Data Sets

This website is built around a collection of data sets that are used throughout. The data sets can be downloaded here, and are also linked directly in a data folder called 'sneddon' on NZ Grapher.

You can go straight to them with the links below.

COOKIES.CSV

COOKIES data set in NZ Grapher

The data set contains information about the chocolate chip cookies being produced at two factories, an old factory and a new factory. The new factory was designed to improve consistency of the product produced.

The two factories use the same recipe for the cookies.

The company expects cookies to weigh 12 grams on average (and puts 17 cookies in a 200 gram packet).

Data is available for 2500 cookies produced on March 23 2018, and the company assures us that this day was no different to all the other days in March for the factories' production.

Variables:

  • FACTORY: categorical variable:

    • "OLD" - cookie produced in old factory

    • "NEW" - cookie produced in new factory

  • WEIGHT: numerical variable:

    • the weight in grams of the sampled cookie

  • CHIP_COUNT: numerical variable:

    • the number of chocolate chips in the cookie

This data set doesn't have enough variables to use in an assessment - there isn't enough variety in the potential questions that could be asked.

GULLS.CSV

GULLS data set in NZ Grapher

Ecologists undertook a population survey of seagulls in the Auckland region. They caught, measured and released red-billed gulls at four sites: two west coast locations and two east coast locations. The survey was conducted in summer (February 2017) and in winter (August 2017) at all four sites. None of the locations is a major breeding colony.

Background information about red-billed gulls is available at NZ Birds Online Red-Billed Gull. You can also find information about other New Zealand birds at this site.

Only red-billed gulls were measured (there are several other species of gulls in New Zealand).

Variables:

  • WEIGHT: numerical variable:

    • the weight in grams of the seagull

  • LENGTH: numerical variable:

    • the length in centimetres of the seagull

  • LOCATION: categorical variable:

    • Piha, Muriwai, Maraetai or Waitawa

  • COAST: categorical variable:

    • "WEST" (either Piha or Muriwai)

    • "EAST" (either Maraetai or Waitawa)

  • SEASON: categorical variable:

    • "SUMMER" (data collected in 2017 Jan/Feb)

    • "WINTER" (data collected in 2017 Jul/Aug)

  • SEX: categorical variable:

    • "MALE"

    • "FEMALE"

NEWCARS.CSV

NEWCARS data set in NZ Grapher

Many details are kept on the characteristics of new cars sold in New Zealand. The data set contains a simple random sample of 2000 new cars sold in 2018 that meet these criteria:

  • petrol engine

  • hatchback, saloon or station wagon

The variable for car colour has been simplified to list red, black, white, silver and grey cars, with all other colours (blue, green, etc) called "OTHER".
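
The colour simplification described above could be sketched roughly as follows. This is only an illustration, not the code used to build NEWCARS.CSV: the function name `basic_colour` is hypothetical, and it assumes the original register records colours as plain names such as "Blue" or "RED".

```python
# The five colours kept as their own categories in BASIC_COLOUR.
BASIC_COLOURS = {"RED", "BLACK", "WHITE", "SILVER", "GREY"}


def basic_colour(colour):
    """Map a raw colour name to one of the six BASIC_COLOUR values.

    Hypothetical helper: assumes the raw colour is a simple text name.
    """
    colour = colour.strip().upper()
    # Anything that is not one of the five kept colours becomes "OTHER".
    return colour if colour in BASIC_COLOURS else "OTHER"


print(basic_colour("Silver"))
print(basic_colour("green"))
```

Any colour outside the five kept values, such as blue or green, ends up in the single "OTHER" group.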

The full data set is available here:

https://nzta.govt.nz/resources/new-zealand-motor-vehicle-register-statistics/new-zealand-vehicle-fleet-open-data-sets/

(This full data set has many other vehicle types, such as sports cars, motorcycles, trucks, buses, and vehicles with hybrid, diesel and electric engines.)

Variables:

  • BASIC_COLOUR: categorical variable:

    • one of RED, BLACK, WHITE, SILVER, GREY, OTHER

  • BODY_TYPE: categorical variable:

    • the data set has been restricted to three types of car:

    • HATCHBACK

    • SALOON

    • STATION WAGON

  • CC_RATING: numerical variable:

    • the volume of the engine, in cubic centimetres

  • GROSS_VEHICLE_MASS: numerical variable:

    • the weight of the car, in kilograms

  • MAKE: categorical variable:

    • the make of the vehicle, for example "FORD" or "MASERATI"

    • take care to choose groups with enough data if using this variable

  • MODEL: categorical variable:

    • the particular model of that make of car - 150 different models, with only nine occurring 50 times or more

    • COROLLA, SWIFT and RAV4 are the most common values
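
One way to check which MAKE or MODEL groups have enough data is to count how often each value occurs and keep only the larger groups. The sketch below is a minimal example under stated assumptions: the function name `groups_with_enough_data` is hypothetical, the default threshold of 50 simply echoes the "50 times or more" figure above, and the example list is made up for illustration rather than taken from NEWCARS.CSV.

```python
from collections import Counter


def groups_with_enough_data(values, minimum=50):
    """Return {value: count} for values occurring at least `minimum` times."""
    counts = Counter(values)
    return {value: n for value, n in counts.items() if n >= minimum}


# Made-up illustrative values, NOT real rows from the data set.
example_makes = ["TOYOTA"] * 60 + ["FORD"] * 55 + ["MASERATI"] * 2
print(groups_with_enough_data(example_makes))
```

In practice the list of values would come from the MAKE or MODEL column of the downloaded file.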

SCRABBLE.CSV

The full data set can be downloaded here, and will load in Excel. NZ Grapher is unable to cope with the full data set.

Five different samples from the full data set, each of 2000 words, are available.

SCRABBLE-SET1 data set in NZ Grapher

SCRABBLE-SET2 data set in NZ Grapher

SCRABBLE-SET3 data set in NZ Grapher

SCRABBLE-SET4 data set in NZ Grapher

SCRABBLE-SET5 data set in NZ Grapher

The full data set contains all 267,751 allowable words in the board game Scrabble.

Different letters score different points when used. In particular, four rare letters (J, X, Q and Z) score 8 or 10 points, and other letters score between 1 and 5 points.

This data set treats only the letters A, E, I, O and U as vowels.

This data set also checks to see whether a word has a double letter (like "COOPER" but not "BOB").

Variables:

  • WORD: the word allowed to be played in the board game

    • words longer than 15 letters cannot fit on the board

    • 1-letter words are not allowed

  • VOWELS: numerical variable:

    • the number of vowels (A, E, I, O, U) in the word.

    • Y and other very rare 'vowel' sound letters are not counted

  • LENGTH: numerical variable:

    • the number of letters in the word

  • VOWELRATIO: numerical variable:

    • the ratio of vowels to letters (between 0 and 1), rounded to 2 decimal places

  • HALF: categorical variable:

    • whether the word is in FIRST half of the alphabet (starts A-M) or SECOND half of the alphabet (starts N-Z)

  • DOUBLES: categorical variable:

    • DOUBLE if the word contains one or more double letters (like the O's in BOOKLET)

    • NONE otherwise

  • RARE: categorical variable:

    • RARES: if the word contains one or more of J, Q, X and Z

    • NONE: otherwise

  • SCORE: numerical variable:

    • the total number of points scored by the tiles in the word

    • the highest scoring word, RAZZAMATAZZES, scores 51 points; 99.9% of words score under 35
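
The variables above can be reproduced for a single word with a short sketch like the one below. It is an illustration, not the code used to build SCRABBLE.CSV: the function name `describe_word` is hypothetical, the tile values are the standard English Scrabble letter values, blank tiles and bonus squares are ignored, and the data set's exact rounding rule for VOWELRATIO is assumed to match Python's `round`.

```python
# Standard English Scrabble tile values (assumed; not read from the data set).
TILE_POINTS = {
    **dict.fromkeys("AEILNORSTU", 1),
    **dict.fromkeys("DG", 2),
    **dict.fromkeys("BCMP", 3),
    **dict.fromkeys("FHVWY", 4),
    "K": 5,
    **dict.fromkeys("JX", 8),
    **dict.fromkeys("QZ", 10),
}


def describe_word(word):
    """Compute the data set's variables for one word (hypothetical helper)."""
    word = word.upper()
    length = len(word)
    # Only A, E, I, O and U count as vowels; Y is not counted.
    vowels = sum(1 for ch in word if ch in "AEIOU")
    # A double letter means the same letter twice in a row (OO in BOOKLET).
    has_double = any(a == b for a, b in zip(word, word[1:]))
    has_rare = any(ch in "JQXZ" for ch in word)
    return {
        "WORD": word,
        "VOWELS": vowels,
        "LENGTH": length,
        "VOWELRATIO": round(vowels / length, 2),
        "HALF": "FIRST" if word[0] <= "M" else "SECOND",
        "DOUBLES": "DOUBLE" if has_double else "NONE",
        "RARE": "RARES" if has_rare else "NONE",
        "SCORE": sum(TILE_POINTS[ch] for ch in word),
    }


print(describe_word("BOOKLET"))
print(describe_word("RAZZAMATAZZES")["SCORE"])
```

For RAZZAMATAZZES the sketch gives a SCORE of 51, matching the figure quoted above, and BOB is reported as NONE for DOUBLES because its two B's are not adjacent.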

Please note that the COOKIES.CSV and GULLS.CSV data sets have been creatively fabricated.

NEWCARS.CSV and SCRABBLE.CSV (and its subsets) are real data as described.