Big Data Sheets
05
Big Data
We live in the information age with an exponential growth of data. In 2010 Eric Schmidt, the CEO of Google, said, "There were five exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days." In 2019, the World Economic Forum estimated that "the entire digital universe is expected to reach 44 zettabytes by 2020."
How much is an Exabyte or Zettabyte? Here is a visualization and a table from the same article at the World Economic Forum.
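To get a feel for these units, the arithmetic can be sketched in a few lines of Python. This is only a rough illustration that assumes the decimal (SI) definitions, where each unit is 1,000 times the previous one; the variable names are our own.

```python
# Decimal (SI) byte units: each unit is 1,000 times the previous one.
UNITS = ["kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte"]
BYTES_PER = {name: 1000 ** (i + 1) for i, name in enumerate(UNITS)}

# The 44 zettabytes estimated for 2020, expressed in exabytes.
zettabytes_2020 = 44
exabytes_2020 = zettabytes_2020 * BYTES_PER["zettabyte"] // BYTES_PER["exabyte"]

# How many times the "five exabytes through 2003" figure fits into that amount.
batches_of_five_exabytes = exabytes_2020 // 5

print(f"1 exabyte   = {BYTES_PER['exabyte']:,} bytes")
print(f"1 zettabyte = {BYTES_PER['zettabyte']:,} bytes")
print(f"44 zettabytes = {exabytes_2020:,} exabytes")
print(f"That is {batches_of_five_exabytes:,} times the five exabytes created through 2003")
```

In other words, one zettabyte is a thousand exabytes, so the 2020 estimate is thousands of times larger than everything recorded through 2003.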
Everything is Quantifiable
We live in the era of Big Data, which refers to data sets that are too large to fit on a normal computer or be processed by a standard spreadsheet or database program. Large data sets are difficult to process using a single computer and may require parallel systems (multiple computers working together to run an algorithm). Scalability of systems is an important consideration when working with large data sets, as the computational capacity of a system affects how data sets can be processed and stored.
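The idea behind a parallel system can be sketched on a single machine: split the data into chunks, let several workers each process one chunk, and then combine the partial results. The sketch below is an illustration only. It uses threads on one computer as a stand-in for multiple computers, and the function names `chunk` and `parallel_sum` and the data are made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor

def chunk(values, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]

def parallel_sum(values, workers=4, chunk_size=250):
    """Sum each chunk on a separate worker, then combine the partial sums."""
    pieces = chunk(values, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(sum, pieces))
    return sum(partial_sums)

data = list(range(1, 1001))   # made-up data: the numbers 1 through 1000
total = parallel_sum(data)
print(total)
```

Each worker only ever sees its own chunk, which is why this approach can scale to data sets that are too large for one computer to handle at once.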
We will explore Big Data through a number of videos from the PBS documentary, The Human Face of Big Data. Start by watching this short video (2:31): Everything Is Quantifiable.
Data Science
The field of Data Science deals with extracting information from large data sets and visualizing the results. The size of a data set affects the amount and quality of information that can be extracted from it. From this information, further analysis may yield knowledge or even wisdom. Tables, diagrams, text, and other visual tools can be used to communicate insight and knowledge gained from data. We often think of data, information, knowledge, and wisdom as forming a pyramid, known as the DIKW pyramid.
Data provide opportunities for identifying trends, making connections, and addressing problems. Computing enables new methods of deriving information from data, driving monumental change across many disciplines, from art to business to science. Keep the DIKW pyramid in mind as you watch the short 3-minute video, Learning Revealed: Acquiring Language.
Impacts of Big Data
Careful analysis of data can help us solve many problems. Watch the following 4-minute video to see how tracking data on The Smallest Heartbeat can help save a child's life.
Bias in Data
The path from data to information to knowledge is not always straightforward. Bias can be introduced into the collection and analysis of data with dangerous results. Care must be taken when collecting and analyzing data. Problems of bias are often caused by the type or source of data that is being collected. Bias is not eliminated by simply collecting more data.
Joy Buolamwini from the MIT Media Lab studies the impact of bias in face recognition systems. Watch the following video about her research.
Gender Shades - MIT Media Lab
Analyzing Global Health Data
Large data sets can be overwhelming to analyze, but software tools can help people extract information, identify trends, make connections, and solve problems with data. Software tools, such as the interactive graph below from Google, let you process data interactively to gain insight and knowledge.
Go to this interactive data set and answer the questions on your AnswerDoc.
In ordinary speech, the words "data" and "information" are used interchangeably. But in computing, these words have specific technical meanings.
Data are the values that computers receive from various sources, including human activity, sensors, etc.
Information is the humanly-useful patterns extracted from data.
Data provide opportunities for identifying trends, making connections, and addressing problems. Information is the result of analyzing that data.
The data given in the graph above let us answer some questions but not others. We can, for example, answer questions about how patterns of fertility and life expectancy differ from one continent to another, but not questions about how life expectancy is affected by the jobs people do, because the data displayed don't include jobs.
When looking at visualizations, consider:
What does the data show? - fact
Why might that be the case? - opinion
Be careful when making assumptions about data:
Correlation does not equal Causation.
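Correlation is a particular kind of information, namely a dependence between two variables. In the first picture here, for example, as one variable goes up the other goes down (a negative correlation). It is also a correlation when both variables move in the same direction, so that as one goes up or down the other does too (a positive correlation). Finding a correlation does not, by itself, tell us that one variable causes the other to change.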
Insight is a meaningful conclusion drawn from analyzing information.
negative correlation
positive correlation
no correlation
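The three cases pictured above can also be measured with a number. A common measure is the Pearson correlation coefficient, which ranges from -1 (perfect negative correlation) through 0 (no linear relationship) to +1 (perfect positive correlation). Here is a minimal sketch in Python; the function name `pearson` and all of the data values are made up for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables vary together around their means.
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable varies on its own.
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariation / (spread_x * spread_y)

x = [1, 2, 3, 4, 5]
falling = [10, 8, 6, 4, 2]     # goes down as x goes up
rising = [3, 5, 7, 9, 11]      # goes up as x goes up
unrelated = [1, 3, 2, 3, 1]    # no overall upward or downward trend

print("negative correlation:", round(pearson(x, falling), 2))
print("positive correlation:", round(pearson(x, rising), 2))
print("no correlation:      ", round(pearson(x, unrelated), 2))
```

Even a coefficient close to +1 or -1 only tells us the two variables move together. It says nothing about whether one causes the other.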
Often, a single source does not contain the data needed to draw a conclusion; it may be necessary to combine data from a variety of sources. As you found using visualization software with the fertility and life expectancy data, sometimes a pattern you discover in one data set can just raise another question for research such as, "Are either of these things correlated with median income in the country?" To answer this question, you could find an economic database, download some data, and look for additional correlations. There can be several cycles of seeing something in the data and collecting more data to examine before you have what seems like a reliable insight about causation.
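Combining sources usually means matching up records that describe the same thing, such as the same country. The sketch below shows one simple way this could be done in Python. The country names and all of the numbers are invented for illustration and do not come from any real data set.

```python
# Source 1 (made-up values): life expectancy in years, keyed by country.
life_expectancy = {"Country A": 81.2, "Country B": 64.5, "Country C": 73.0}

# Source 2 (made-up values): median income in US dollars, keyed by country.
median_income = {"Country A": 42000, "Country B": 3100, "Country D": 15500}

# Keep only the countries that appear in both sources, and pair up their values.
combined = {
    country: (life_expectancy[country], median_income[country])
    for country in life_expectancy
    if country in median_income
}

for country, (years, income) in sorted(combined.items()):
    print(f"{country}: life expectancy {years}, median income ${income:,}")
```

Notice that Country C and Country D drop out because each appears in only one source. Gaps like this are one reason combining data sets takes care.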
Working with Spreadsheets
1. Make a copy of this spreadsheet. This data set shows butterfly specimens captured and tagged in the Guanacaste National Park in Costa Rica. Look through the data and notice that the first column (herbivore species) is the species of each butterfly that was tagged. The last columns show the latitude and longitude where each butterfly was tagged. The first row is metadata that describes the data in each column.
Metadata is data about data. It can be associated with the primary data, and changes and deletions made to metadata do not change the primary data. Metadata allows data to be structured and organized and is used for finding, organizing and managing information.
Metadata can increase the effective use of data or data sets by providing additional information about various aspects of that data.
In short, metadata:
Can be changed without impacting the primary data
Is used for finding, organizing, and managing information
Increases effective use of data by providing extra information
Allows data to be structured and organized
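The difference between metadata and primary data can be illustrated with a tiny table in Python. This is a sketch only: the column labels other than "herbivore species" and all of the values except the species name are made up for the example.

```python
# A made-up miniature data set: one header row plus two data rows.
header = ["herbivore species", "sex", "wingspan"]    # metadata
rows = [
    ["Astraptes SENNOV", "female", 55.0],            # primary data
    ["Astraptes SENNOV", "male", 53.5],
]

rows_before = [row[:] for row in rows]   # copy of the data for comparison

# Change the metadata: give the second column a more descriptive label.
header[1] = "sex of specimen"

# The primary data is untouched by the metadata change.
data_unchanged = rows == rows_before
print("Data unchanged:", data_unchanged)

# The metadata lets us find a column by name instead of by position.
species_column = header.index("herbivore species")
species = [row[species_column] for row in rows]
print(species)
```

Renaming a column label changes how we find and describe the data, but the measurements themselves stay the same.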
2. Set up first rows. First select the first row by clicking the number 1 on the left, then select Bold.
Insert a row for future operations by right-clicking the 1 and selecting:
+ Insert 1 row below
Then freeze the column headers by clicking View → Freeze → 2 rows.
3. Formulas and Functions. Each box in the spreadsheet is called a cell. Every cell in the spreadsheet is identifiable by its column letter and row number. For example, cell A3 refers to the box at column A and row 3 below and contains the data "Astraptes SENNOV", which is a butterfly species.
We can manipulate numeric data in a spreadsheet by using formulas and functions built into the spreadsheet software. Typing = in a cell signals the start of a formula like =K3+K4 or a function like =SUM(K3,K4). These functions can take a list of cells or a range of cells such as K3:K5, which is equivalent to the list K3, K4, K5. There are many built-in functions in standard spreadsheet software, but the most commonly used ones are SUM, AVERAGE, COUNT, MAX, and MIN.
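If you have done some programming, it may help to see roughly what these five functions compute. Here is a sketch in Python, where a list stands in for the range K3:K5. The three values are made up and are not taken from the butterfly spreadsheet.

```python
# Made-up stand-ins for the values in cells K3, K4, and K5.
k3_to_k5 = [52.0, 55.5, 57.5]

total = sum(k3_to_k5)               # like =SUM(K3:K5)
count = len(k3_to_k5)               # like =COUNT(K3:K5) when every cell holds a number
average = total / count             # like =AVERAGE(K3:K5)
largest = max(k3_to_k5)             # like =MAX(K3:K5)
smallest = min(k3_to_k5)            # like =MIN(K3:K5)

print("SUM:", total)
print("COUNT:", count)
print("AVERAGE:", average)
print("MAX:", largest)
print("MIN:", smallest)
```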
Here is a tutorial that reviews how to use functions in Google Sheets.
Let's use a formula to calculate the average wingspan of the butterflies in our spreadsheet. Column K contains the wingspan measurement of each butterfly.
In cell K2, the empty row you inserted, type the formula: =AVERAGE(K3:K89)
This will average the data in column K rows 3-89. You could select the data that you want instead of typing in the cell references. When you hit enter, it will compute the average, 54.63.
TClark hint: if you do not know the last row of your data, you can leave the row number off the end of the range, so this formula would give you the same result: =AVERAGE(K3:K)
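The open-ended range works because AVERAGE skips blank cells, so the empty rows past the end of the data do not affect the result. A rough Python picture of that behavior is below. The helper name `average_from_row` and the values are made up, and `None` stands in for a blank cell.

```python
# Made-up column K: rows 1 and 2 are blank here, rows 3 to 5 hold
# measurements, and the rows after that are blank.
column_k = [None, None, 50.0, 54.0, 58.0, None, None]   # index 0 is row 1

def average_from_row(column, first_row):
    """Average the numeric cells from `first_row` (1-based) to the end
    of the column, skipping blank cells."""
    cells = column[first_row - 1:]
    numbers = [c for c in cells if isinstance(c, (int, float))]
    return sum(numbers) / len(numbers)

open_ended = average_from_row(column_k, 3)      # like =AVERAGE(K3:K)
explicit = sum(column_k[2:5]) / 3               # like =AVERAGE(K3:K5)

print(open_ended, explicit)
```

Both approaches give the same answer here, because the extra cells included by the open-ended range are all blank.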
You can control the decimal precision with the precision buttons in the toolbar at the top, which decrease or increase the number of decimal places shown.
The buttons next to them convert the numbers to currency or percentage format.
4. Sort and Filter: You can sort and filter columns to find information and extract patterns from the data. To sort by species, click on the A at the top of column A to select the column, and then from the Data menu, choose Sort sheet→ Sort sheet by column A (A to Z). You may also right-click on the A column header and select from the drop down menu on column A.
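Note: do not choose Sort range, as that will sort only the selected column and leave the other columns in place, so the values in each row will no longer belong together.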
You can also filter data to show only the data you need. Click on column E or any column that you want to filter, and then click on Data/Create a Filter or the filter funnel icon to turn on filtering. Click on the filter icon created in E1 and uncheck Blanks and male, to leave just the female values. Click on OK to see the filtered data.
Turn off filtering by clicking on the filter funnel icon or from the Data menu to go back to seeing all the data.
To help, here's a sorting and filtering tutorial.
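The same two operations can be sketched in Python to show what the spreadsheet is doing behind the scenes. The rows below are made up for illustration; only the species name "Astraptes SENNOV" comes from the spreadsheet, and "Species B" and "Species C" are placeholders.

```python
# Made-up rows standing in for a few lines of the butterfly sheet.
specimens = [
    {"species": "Species C", "sex": "male"},
    {"species": "Astraptes SENNOV", "sex": "female"},
    {"species": "Species B", "sex": ""},          # blank value
    {"species": "Species B", "sex": "female"},
]

# Sort whole rows by species (A to Z), so each row's values stay together.
by_species = sorted(specimens, key=lambda row: row["species"])

# Filter: keep only the rows where the sex column is "female".
females = [row for row in by_species if row["sex"] == "female"]

for row in females:
    print(row["species"], row["sex"])
```

Filtering builds a new, shorter list and leaves the original rows alone, just as turning off the filter in the spreadsheet brings all of the data back.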
5. Charts: Let's make a chart to visualize some of the data in this spreadsheet.
Click on the A heading in the first column (herbivore species).
From the Insert menu at the top, select Chart. You will see a bar chart of the different species found in column A.
Once you are finished designing your chart, you can click on the dots in the top right corner of the chart to copy the image or move it to its own sheet.
Charts can help us answer questions such as "Which species is the most common?"
Investigate the many chart options available. Try a pie chart like below. Here's more information about different charts in Google Sheets and a tutorial on comparing charts.
TClark note: you might need to make sure the Aggregate option in the Chart editor side-panel is checked.
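To chart a column of species names, the spreadsheet first has to count how many times each name appears. That counting step can be sketched in Python as follows. The list of names is made up, with "Species B" and "Species C" as placeholders, and the text "chart" is just a rough stand-in for the real bar chart.

```python
from collections import Counter

# Made-up stand-in for column A of the butterfly sheet.
species_column = [
    "Astraptes SENNOV", "Species B", "Astraptes SENNOV",
    "Species C", "Astraptes SENNOV", "Species B",
]

# Count how many times each species appears.
counts = Counter(species_column)

# The most common species and its count.
most_common_species, most_common_count = counts.most_common(1)[0]

# A simple text bar chart, most common species first.
for name, n in counts.most_common():
    print(f"{name:<18} {'#' * n} ({n})")

print("Most common:", most_common_species)
```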
6. Select one data set from this folder (not the US Presidents). Then perform the following actions:
Insert a row and then freeze the header
Calculate 2 numbers based on numerical columns (SUM, SUMIF, AVERAGE, COUNT, COUNTA, COUNTIF, ADD, MINUS, MULTIPLY, MAX, MIN, full list)
Sort data based on one column (a different column than the one the original data is sorted by)
Filter based on one column
Create a chart that allows you to answer an insightful question. Note: you can select 2 columns and graph them together.
Still Curious?
Here's a nice visualization of student debt that was put together by the New York Times.
Reddit maintains a Data is Beautiful site that has lots of visualizations of interesting data sets. Browse through that collection.