In this sub-module you will learn about:
Using Google Sheets
The differences between Google Sheets and Python
Writing code in Python
Analyzing a dataset from the Federal Aviation Administration (FAA) Wildlife Strike Database containing reported wildlife strikes with aircraft in North Carolina.
Data analysis helps us answer questions like: How many different species are in the data set? Are there times of the year where most strikes occur? How frequently do land-based animals get impacted by this issue?
To complete this section...
Download a dataset and a data dictionary from the FAA Wildlife Strike Database using the instructions provided in the video or the written instructions below.
The dataset for this section is from the FAA Wildlife Strike Database and contains records of reported wildlife strikes with aircraft in North Carolina.
Click on the arrow to the right to see the written instructions.
A common access point for large datasets is an online database. Often, an online database will provide a search interface that allows users to filter a dataset and access or download only the data they need for their research.
The dataset you will download from the FAA Wildlife Strike Database is contained in a common tabular data file type known as a csv file, a type of file consisting of rows and columns of data separated by commas (comma separated values = csv).
Navigate to the FAA Wildlife Strike Database.
Click the “Search the Database” button to access the database search form.
Select “NC” from the “State” dropdown menu in the “Search Criteria” section.
Make sure the “Allow compression to .zip file” button is checked.
Click the “Submit” button, you will see a table containing the first 10 records of the dataset when the search is complete.
Click the “Export to Excel” button.
Navigate to the folder where this file was downloaded, the file should have a name like “Wildlife_Export_1272020.zip” where the numbers indicate the date on which the data was downloaded.
Extract the file from the downloaded zipped folder (see zip files for more information on this type of file).
On a Mac, double click on the zipped folder. On Windows, right-click the zipped folder and select "Extract All"
The extracted file is a csv file (what is a csv file?) where the numbers indicate the date on which the data was downloaded.
A data dictionary explains the contents of a dataset. For the dataset you just downloaded, the data dictionary will provide full definitions of abbreviations used in field names (column headers) and coded values used in field observations (the value in a specific cell).
In the “Search Criteria” section on the same page as the database search form find the “Note” text providing information about the dataset you just generated.
In the last sentence of that text click the “here” link to download the data dictionary for this dataset as a pdf file (Alternatively, you can download the data dictionary using this direct link to the data dictionary pdf file).
Learn more about data dictionaries and the importance of describing data on the Smithsonian Libraries’ Describing Your Data: Data Dictionaries page.
Now we will use the data we just downloaded. To complete this section....
Use the instructions provided in the video or the written instructions below to:
Import data into Google Sheets
Rename columns headers
Delete columns
Combine categories
New to Google Sheets and Microsoft Excel? Start with these resources:
Google Sheets Quick Reference Guide (2 pages)
Google Sheets Essential Training (a series of short videos)
Note: How to access LinkedIn Learning as an NC State Community Member
Click on the arrow to the right to see the written instructions.
You can also use Excel if you prefer.
Unzip the download data (on a Mac, double click on the ZIP folder. On a Windows, right-click the ZIP folder and select "Extract All")
Create a new Google Sheet
Click on File > Import and select the data you just downloaded.
If you want to create a new spreadsheet for the data, don't change any settings. If you would like to add the data to the current spreadsheet, select "Append to current sheet"
Click on the green "Import data" button
Often, the dataset you download in its original form doesn't fit your data analysis needs exactly. You may want to make surface-level changes to this "raw" dataset so that it's easier to work with. This process is called "cleaning" your data, or preparing it for data analysis.
It is important to remember that as much as possible, cleaning data should not change the values in any way.
1. Renaming column headers
In this data, the columns are named using a code. We can use the data dictionary to match up the column header names with what they stand for to make it easier for us to read.
Find the column labeled REG (use ctrl+F in Windows or command+F on a Mac to search in the data)
Search in the data dictionary to see that "REG" is the column for the aircraft registration number.
Double-click on the cell to change the name from "REG" to "AIRCRAFT_REGISTRATION"
Try this again by using the data dictionary to rename "FLT".
Remember, when naming column headers, we want to avoid using spaces or special characters (! , * are examples of special characters) so that it is easy for the computer to read.
2. Removing unnecessary columns
There may be columns in the dataset that you do not plan to use. You can delete those so that the size of the file is smaller and easier to work with.
Find the column labeled "COMMENT" in column CP
Click on the column label (CP, above the column's data) to highlight the entire column.
Right-click on the column, or click on the down arrow in the column label, then select "Delete column."
3. Combining categories
Sometimes the dataset is more specific than we need. In this case, unknown birds are listed several different ways: "Unknown bird" or "Unknown bird - small" or "Unknown bird - medium" or "Unknown bird - large" or "Unknown bird or bat." This may be helpful, but for our dataset, we want them all labeled "Unknown Species" for a more general category.
Find the column labeled "SPECIES"
Click on the column label (CE, above the column's data) to highlight the entire column.
Use ctrl+F in Windows or command+F on a Mac to search only in the highlighted data. We do not want to change data in any other column.
Click on the three dots to the right of the search box to use Find and Replace.
In the Find box, type "Unknown bird - small" and in the Replace with box, type "Unknown species"
Click Find at the bottom. This will highlight all of the matches for what you entered in the Find box.
Next, click Replace all.
Repeat this process on "Unknown bird - medium," "Unknown bird - large," "Unknown bird or bat", and "Unknown bird" so that they all read "Unknown species."
Google Sheets and Microsoft Excel have a tool called PivotTables that help you easily calculate, summarize, and analyze data. You see comparisons, patterns, and trends in your data and then use those visualizations in reports or papers.
To complete this section...
Use PivotTables to summarize your data in three separate sheets using the instructions provided in the video or the written instructions below.
Click on the arrow to the right to see the written instructions.
Summarize your data using pivot tables:
Select the entire “SPECIES” column (Column CE)
Navigate to the menu bar > select “Data” > select “Pivot table.”
Select “Create” in the “Create pivot table” pop-up. (“Data range” refers to the Column you’ve selected)
Rename the new worksheet by double-clicking in the tab labelled “Pivot Table 1” and replacing it with “unique_species.”
Open the Pivot Table editor on the right by clicking on any of the shaded cells.
In the Pivot Table editor, select “Add” next to “Rows” and select the header that you are summarizing (in this example, select “SPECIES”).
In the Pivot Table editor, select “Add” next to “Values” and select “SPECIES.” Make sure “Summarize by” has “COUNTA” selected. COUNTA returns the number of values (including letters, characters, and numbers) in a dataset, whereas COUNT returns only the number of NUMERIC values.
Navigate back to Rows, and under “Order,” select “Descending”, and under “Sort by,” select COUNTA of SPECIES. This will sort the table by the number of unique species, from greatest to least, instead of alphabetically.
Calculate the number of unique species:
Select a cell a few columns after your pivot table. Insert the following calculation:
=COUNTUNIQUE(Sheet1!CE:CE)
This calculation will return the number “204.” It is referencing column CE in the worksheet “Sheet1” and calculating the unique entries. Please note that it is counting the header “SPECIES” as part of this, so the accurate number of unique species is “203.”
Summarize the number of incidents per year by creating a pivot table. You will be repeating the steps above, but using the “INCIDENT_YEAR” column instead. As a reminder, the steps are:
Select the entire “INCIDENT_YEAR” column by selecting column D.
Navigate to the menu bar > select “Data” > select “Pivot table.”
Select “Create” in the “Create pivot table” pop-up.
Rename the new worksheet by double-clicking in the tab labelled “Pivot Table 1” and replacing it with “incident_year.”
Open the Pivot Table editor on the right by clicking on any of the shaded cells.
In the Pivot Table editor, select “Add” next to “Rows” and select “INCIDENT_YEAR,” then select “Add” next to “Values” and select “INCIDENT_YEAR.” Make sure “Summarize by” has “COUNTA” selected.
Confirm that below “Rows”, the settings for “Order,” are “Descending”, and “Sort by” are “INCIDENT_YEAR.”
Create a horizontal bar chart to visualize incidents per year:
Select the entire pivot table, except for the last two rows. Notice that the last row has a grand total and the row before that is empty.
Navigate to the menu bar > select “Insert” > select “Chart.”
Use the ‘Chart Editor’ on the right to customize your chart:
Under the “Setup” tab, navigate to “Chart Type” and scroll down to “Bar” and select the first horizontal bar chart.
Under the “Customize” tab, you can select “Chart & axis titles” to change the labeling. You can also double-click into each label on the chart to edit it there.
This analysis will use a regular expression to find species IDs that match land animals. Regular expressions ("regex" for short) are used in many data analysis applications, including Google Sheets and Python. They use specialized syntax to help you match patterns, or find other strings or sets of strings. Regex can be tricky, but a useful tool to learn about when you need to do more advanced data manipulation. To learn more about regex: https://regexone.com/
In this example, we will look at a column of species IDs. International Civil Aviation Organization (ICAO) codes tell us that land animals will have an ID that begins with 1 or 2, so we will write a regular expression that finds any string of characters that begins with 1 or 2:
Create a new column “IS_LAND” to identify whether the animal is a land or flying animal. To create a new column, navigate to the “SPECIES_ID” column, right-click in the cell, and select “Insert 1 right.” This will create a blank column on the right-side of “SPECIES_ID.”
Rename the column “IS_LAND.”
Use a regular expression to evaluate whether the species is a land or flying animal. According to the International Civil Aviation Organization (ICAO) codes, animals have a SPECIES_ID that begins with the number 1 or the number 2. To evaluate whether the SPECIES_ID begins with 1 or 2 enter the following regular expression in the first cell under the new column “IS_LAND”:
=regexmatch(CD2, "^[1-2]")
"regexmatch" is the formula that indicates we will be using a regular expression. In Regex, ^ indicates the beginning of a string and [1-2] indicates a range of numbers.
To perform this calculation on the rest of the column, double click on the right-bottom corner of the cell. It will populate the rest of the column with the calculation.
Use a pivot table to compare the number of land animals versus flying animals. If the cell says True, it is a land animal. If it says False, it is a flying animal. You will be repeating the steps above, but using the “IS_LAND” column.
So far, you have been using Google Sheets. Many people use spreadsheet applications like Google Sheets or Microsoft Excel to store, analyze, and visualize data. They are relatively easy to learn and use and have built-in advanced data analysis capabilities. Spreadsheet applications use a graphic user interface with drop-down menus and toolbars, so they can be more familiar to many of us.
Python is an open-source programming language that can also be used for data analysis purposes. There can be a steep learning curve at first, but it is a powerful tool. Python and other programming languages work well for:
Doing research with large datasets
Automating or repeating the same processes many times
Showing your work so others can reproduce your research
To complete this section...
Watch the video below about how to open and use a Google Colaboratory (Colab) notebook.
Review the Google Colab Python notebook: Exploring FAA Wildlife Strikes... to try the same data analysis in Python.
We won't be teaching you all about how to use Python in this module, instead we created a Google Colab notebook with the data analysis already prepared. You can read through the text blocks and see the outcomes of code blocks by pressing the "play" icon to the left of the box. Then you can write your own lines of code in the "Try it yourself" blocks.
Python is one many programming languages that you could use to do data analysis. Programming languages give you more flexibility and control over the analysis that you're doing. Find more in this section's Google Colab Python notebook: Exploring FAA Wildlife Strikes....
FAA Wildlife Hazard Mitigation page to learn more about wildlife strikes and hazards.
Smithsonian Libraries’ Describing Your Data: Data Dictionaries page to learn more about data dictionaries and the importance of describing data.
LinkedIn Learning training video Google Sheets: PivotTables video to learn more about Pivot tables in Google Sheets.
Note: How to access LinkedIn Learning as an NC State Community Member.
Learn more about regular expressions with RegexOne.