Big Data Sheets

05

Big Data

We live in the information age with an exponential growth of data. In 2010 Eric Schmidt, the CEO of Google, said, "There were five exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days." In 2019, the World Economic Forum estimated that "the entire digital universe is expected to reach 44 zettabytes by 2020."



How much is an Exabyte or Zettabyte? Here is a visualization and a table from the same article at the World Economic Forum.
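
To put these units in perspective, the arithmetic can be sketched in a few lines of Python. This is only an illustration: it assumes the decimal (SI) definitions, where an exabyte is 10^18 bytes and a zettabyte is 10^21 bytes, and the variable names are made up for this example.

```python
# Decimal (SI) storage units, expressed in bytes (an assumption of this sketch).
EXABYTE = 10**18
ZETTABYTE = 10**21

# The 44 zettabytes quoted above, expressed in exabytes.
digital_universe_bytes = 44 * ZETTABYTE
digital_universe_exabytes = digital_universe_bytes // EXABYTE

# How many 5-exabyte collections (everything created through 2003) fit into that?
copies_of_pre_2003_data = digital_universe_bytes // (5 * EXABYTE)

print(digital_universe_exabytes)  # 44000
print(copies_of_pre_2003_data)    # 8800
```

In other words, under these definitions 44 zettabytes is 8,800 times the five exabytes mentioned in the quote.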

Big Data - Video

Big Data - Slides

D05-Big Data

Everything is Quantifiable

We live in the era of Big Data, a term that refers to data sets that are too large to fit on a normal computer or to be processed by a standard spreadsheet or database program. Large data sets are difficult to process using a single computer and may require parallel systems (multiple computers working together to run an algorithm). Scalability is an important consideration when working with large data sets, because the computational capacity of a system affects how data sets can be processed and stored.
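
The idea of a parallel system can be sketched on a small scale. In the example below, threads on one machine stand in for the multiple computers a real system would use; the function names and the choice of four workers are assumptions made for illustration. The data is split into chunks, each worker sums one chunk, and the partial results are combined.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(values, chunk_count):
    """Split a list into roughly equal consecutive chunks."""
    chunk_size = max(1, -(-len(values) // chunk_count))  # ceiling division
    return [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]

def parallel_sum(values, worker_count=4):
    """Sum each chunk on a separate worker, then combine the partial sums."""
    chunks = split_into_chunks(values, worker_count)
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        partial_sums = list(pool.map(sum, chunks))
    return sum(partial_sums)

measurements = list(range(1, 1_000_001))
total = parallel_sum(measurements)
print(total)  # 500000500000
```

The same split, process, and combine pattern is what lets a system scale: adding workers shortens the time each one spends on its share of the data.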

We will explore Big Data through a number of videos from the PBS documentary, The Human Face of Big Data. Start by watching this short video (2:31): Everything Is Quantifiable.

Data Science

The field of Data Science deals with extracting information from and visualizing the results of manipulating large data sets. The size of a data set affects the amount and quality of information that can be extracted from it. From this information, further analysis may yield knowledge or even wisdom. Tables, diagrams, text, and other visual tools can be used to communicate insight and knowledge gained from data. We often think of data, information, knowledge and wisdom forming a pyramid.

Data provide opportunities for identifying trends, making connections, and addressing problems. Computing enables new methods of deriving information from data, driving monumental change across many disciplines, from art to business to science. Keep the DIKW (data, information, knowledge, wisdom) pyramid in mind as you watch the short 3-minute video, Learning Revealed: Acquiring Language.

Impacts of Big Data

Careful analysis of data can help us solve many problems. Watch the following 4-minute video to see how tracking data on The Smallest Heartbeat can help save a child's life.

Bias in Data

The path from data to information to knowledge is not always straightforward. Bias can be introduced into the collection and analysis of data with dangerous results. Care must be taken when collecting and analyzing data. Problems of bias are often caused by the type or source of data that is being collected. Bias is not eliminated by simply collecting more data.
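
A small simulation can show why more data does not remove bias. Everything in this sketch is hypothetical: a population made of two equally sized groups, A and B, with different values, and a collection process that picks group A 90% of the time. The larger sample gives a steadier answer, but it settles on the wrong value.

```python
import random

# Hypothetical population: two equally sized groups with different values.
GROUP_VALUES = {"A": 10.0, "B": 20.0}
TRUE_MEAN = 15.0  # 50% group A and 50% group B

def biased_sample_mean(sample_size, share_of_group_a, rng):
    """Average a sample in which group A is picked with the given probability."""
    total = 0.0
    for _ in range(sample_size):
        group = "A" if rng.random() < share_of_group_a else "B"
        total += GROUP_VALUES[group]
    return total / sample_size

rng = random.Random(42)  # fixed seed so the run is repeatable
small_estimate = biased_sample_mean(100, share_of_group_a=0.9, rng=rng)
large_estimate = biased_sample_mean(100_000, share_of_group_a=0.9, rng=rng)

print(small_estimate, large_estimate, TRUE_MEAN)
```

With group A oversampled at 90%, the estimate drifts toward 11 rather than the true mean of 15, no matter how many samples are collected. Fixing the result requires fixing the source of the data, not the amount.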

Joy Buolamwini from the MIT Media Lab studies the impact of bias in face recognition systems. Watch the following video about her research.

Gender Shades - MIT Media Lab

Analyzing Global Health Data

Large data sets can be overwhelming to analyze, but software tools can help people extract information, identify trends, make connections, and solve problems with data. Interactive tools, such as the graph below from Google, let you explore data to gain insight and knowledge.

Go to this interactive data set and answer the questions on your AnswerDoc.

In ordinary speech, the words "data" and "information" are used interchangeably. But in computing, these words have specific technical meanings.

Data provide opportunities for identifying trends, making connections, and addressing problems. Information is the result of analyzing that data.

The data given in the graph above let us answer some questions but not others. We can, for example, answer questions about how patterns of fertility and life expectancy differ from one continent to another, but not questions about how life expectancy is affected by the jobs people do, because the data displayed don't include jobs.

When looking at visualizations, consider:

Be careful when making assumptions about data:

[Figures: negative correlation, positive correlation, no correlation]
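
The three patterns named above can be put into numbers with a correlation coefficient. The sketch below computes Pearson's r by hand; the data values are made up purely to produce one example of each pattern. A value near 1 means a positive correlation, near -1 a negative correlation, and near 0 no linear correlation.

```python
from math import sqrt

def pearson_correlation(xs, ys):
    """Return Pearson's r for two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Made-up values, chosen only to illustrate the three patterns.
x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # y goes up as x goes up
falling = [10, 8, 6, 4, 2]   # y goes down as x goes up
unrelated = [2, 4, 3, 4, 2]  # no linear trend

print(round(pearson_correlation(x, rising), 2))     # 1.0
print(round(pearson_correlation(x, falling), 2))    # -1.0
print(round(pearson_correlation(x, unrelated), 2))  # 0.0
```

Even a coefficient of exactly 1 only says that two measurements move together. It does not say that one causes the other.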

Often, a single source does not contain the data needed to draw a conclusion; it may be necessary to combine data from a variety of sources. As you found using visualization software with the fertility and life expectancy data, sometimes a pattern you discover in one data set raises another question for research, such as, "Are either of these things correlated with median income in the country?" To answer this question, you could find an economic database, download some data, and look for additional correlations. There can be several cycles of seeing something in the data and collecting more data to examine before you have what seems like a reliable insight about causation.
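
Combining sources usually means matching records on something they share, such as the country name. Here is a minimal sketch of that step. The country labels and all of the figures are invented placeholders, not real statistics, and the function name is hypothetical.

```python
# Made-up figures for illustration only; not real statistics.
life_expectancy_by_country = {"Country A": 81.2, "Country B": 64.5, "Country C": 73.0}
median_income_by_country = {"Country A": 42000, "Country C": 9500, "Country D": 15000}

def combine_sources(first, second):
    """Keep only the keys present in both sources and pair up their values."""
    shared_keys = sorted(first.keys() & second.keys())
    return {key: (first[key], second[key]) for key in shared_keys}

combined = combine_sources(life_expectancy_by_country, median_income_by_country)
for country, (life_expectancy, median_income) in combined.items():
    print(country, life_expectancy, median_income)
# Country A 81.2 42000
# Country C 73.0 9500
```

Notice that Country B and Country D drop out because each appears in only one source. Gaps like this are common when merging data, and they are one more place where bias can creep in.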

Working with Spreadsheets



2. Set up first rows. First select the first row by clicking the number 1 on the left, then select Bold.

Insert a row for future operations by right-clicking the 1 and selecting Insert 1 row below.

Then freeze the column headers by clicking View → Freeze → 2 rows.

3. Formulas and Functions. Each box in the spreadsheet is called a cell. Every cell in the spreadsheet is identifiable by its column letter and row number. For example, cell A3 refers to the box at column A and row 3, and it contains the data "Astraptes SENNOV", which is a butterfly species.
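
To make the addressing scheme concrete, here is a short sketch that takes a cell reference apart into its column letters and row number. The function names are hypothetical and are not part of any spreadsheet software; the sketch only mirrors the naming rule described above.

```python
import re

def parse_cell_reference(reference):
    """Split a reference such as 'A3' into (column letters, row number)."""
    match = re.fullmatch(r"([A-Z]+)([1-9][0-9]*)", reference.upper())
    if match is None:
        raise ValueError(f"not a cell reference: {reference!r}")
    return match.group(1), int(match.group(2))

def column_number(letters):
    """Convert column letters to a 1-based number: A -> 1, Z -> 26, AA -> 27."""
    number = 0
    for letter in letters:
        number = number * 26 + (ord(letter) - ord("A") + 1)
    return number

column, row = parse_cell_reference("A3")
print(column, row)          # A 3
print(column_number("K"))   # 11
print(column_number("AA"))  # 27
```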

We can manipulate numeric data in a spreadsheet by using formulas and functions built into the spreadsheet software. Typing = in a cell signals the start of a formula like =K3+K4 or a function like =SUM(K3,K4). These functions can take a list of cells or a range of cells such as K3:K5, which is equivalent to the list K3, K4, K5. There are many built-in functions in standard spreadsheet software, but the most commonly used ones are SUM, AVERAGE, COUNT, MAX, and MIN.
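
For readers who know a little Python, the sketch below shows how a range such as K3:K5 expands into a list of cells, and what the five common functions compute over that list. The cell contents are hypothetical, and the helper only handles ranges within a single column.

```python
import re
from statistics import mean

def expand_range(cell_range):
    """Expand a single-column range such as 'K3:K5' into ['K3', 'K4', 'K5']."""
    match = re.fullmatch(r"([A-Z]+)(\d+):([A-Z]+)(\d+)", cell_range)
    if match is None or match.group(1) != match.group(3):
        raise ValueError("this sketch only handles single-column ranges")
    column = match.group(1)
    first_row, last_row = int(match.group(2)), int(match.group(4))
    return [f"{column}{row}" for row in range(first_row, last_row + 1)]

# Hypothetical cell contents, used only to show the five functions.
cells = {"K3": 50.0, "K4": 60.0, "K5": 55.0}
values = [cells[name] for name in expand_range("K3:K5")]

print(expand_range("K3:K5"))  # ['K3', 'K4', 'K5']
print(sum(values))            # SUM     -> 165.0
print(mean(values))           # AVERAGE -> 55.0
print(len(values))            # COUNT   -> 3
print(max(values))            # MAX     -> 60.0
print(min(values))            # MIN     -> 50.0
```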

Here is a tutorial that reviews how to use functions in Google Sheets.

Let's use a formula to calculate the average wingspan of the butterflies in our spreadsheet. Column K contains the wingspan measurement of each butterfly.

In cell K2, in the empty row you inserted, type the formula: =AVERAGE(K3:K89)

This will average the data in column K, rows 3 through 89. You could also select the data that you want instead of typing in the cell references. When you press Enter, it will compute the average, 54.63.

TClark hint: if you do not know the last row of your data, you can always leave the last row number blank, so this formula would give you the same result: =AVERAGE(K3:K)
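
The open-ended range works because AVERAGE skips blank cells and text, so trailing empty rows do not change the answer. A rough Python equivalent is sketched below. The column contents and the header name "wingspan" are made up, and the function is a stand-in for illustration, not how the spreadsheet is actually implemented.

```python
def average_from_row(column_cells, first_row):
    """Average the numeric cells from first_row (1-based) to the end of the column.

    Blank cells (None) and text are skipped, mirroring how AVERAGE treats them.
    """
    numbers = [
        cell for cell in column_cells[first_row - 1:]
        if isinstance(cell, (int, float)) and not isinstance(cell, bool)
    ]
    return sum(numbers) / len(numbers)

# Hypothetical column K: header in row 1, formula cell in row 2, data from row 3.
column_k = ["wingspan", None, 52.0, 57.0, None, 56.0, None, None]
print(average_from_row(column_k, first_row=3))  # 55.0
```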

You can control the decimal precision with the decimal-place buttons in the toolbar at the top, which either decrease or increase the number of visible decimals.

The buttons near it will auto-convert the numbers into currency or percentage.
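
These buttons change how a number is displayed, not the value stored in the cell. The same distinction exists in Python, where formatting produces a new piece of text and leaves the number alone. The values below are hypothetical.

```python
average_wingspan = 54.6321  # hypothetical unrounded value

two_decimals = f"{average_wingspan:.2f}"  # fewer visible decimals
as_currency = f"${1234.5:,.2f}"           # currency-style display
as_percent = f"{0.256:.1%}"               # percentage-style display

print(two_decimals)      # 54.63
print(as_currency)       # $1,234.50
print(as_percent)        # 25.6%
print(average_wingspan)  # 54.6321 (the stored value is unchanged)
```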

4. Sort and Filter: You can sort and filter columns to find information and extract patterns from the data. To sort by species, click on the A at the top of column A to select the column, and then from the Data menu, choose Sort sheet → Sort sheet by column A (A to Z). You may also right-click on the column A header and choose the same option from the drop-down menu.

Note: do not choose Sort range, as that will sort only the selected column and leave the other columns where they are, so the values in each row no longer belong together.
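
The difference between the two sort options is easy to see in a tiny example. The rows below are hypothetical placeholders, with each tuple standing for one spreadsheet row.

```python
# Hypothetical rows: (species, wingspan). Each tuple is one spreadsheet row.
rows = [("Species C", 80), ("Species A", 30), ("Species B", 100)]

# "Sort sheet": whole rows move together, so each wingspan stays with its species.
sorted_sheet = sorted(rows, key=lambda row: row[0])

# "Sort range" on column A alone: species are reordered but wingspans stay put.
species_only = sorted(species for species, _ in rows)
wingspans_unmoved = [wingspan for _, wingspan in rows]
sorted_range = list(zip(species_only, wingspans_unmoved))

print(sorted_sheet)  # [('Species A', 30), ('Species B', 100), ('Species C', 80)]
print(sorted_range)  # [('Species A', 80), ('Species B', 30), ('Species C', 100)]
```

In the second result every species is paired with the wrong wingspan, which is exactly the kind of silent error that sorting a single column can introduce.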

You can also filter data to show only the data you need. Click on column E or any column that you want to filter, and then click on Data/Create a Filter or the filter funnel icon to turn on filtering. Click on the filter icon created in E1 and uncheck Blanks and male, to leave just the female values. Click on OK to see the filtered data. 

Turn off filtering by clicking on the filter funnel icon or from the Data menu to go back to seeing all the data.
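
Filtering hides rows without deleting them, which is why turning the filter off brings everything back. The sketch below shows the same idea in Python. The rows, the column name "sex", and the function name are assumptions for illustration.

```python
# Hypothetical rows; the column name "sex" and its values are assumptions.
rows = [
    {"species": "Species A", "sex": "female"},
    {"species": "Species B", "sex": "male"},
    {"species": "Species C", "sex": ""},
    {"species": "Species D", "sex": "female"},
]

def filter_rows(all_rows, column, allowed_values):
    """Return the rows whose value in `column` is one of `allowed_values`."""
    return [row for row in all_rows if row[column] in allowed_values]

# Equivalent to unchecking Blanks and male, leaving only female.
female_rows = filter_rows(rows, "sex", {"female"})
print([row["species"] for row in female_rows])  # ['Species A', 'Species D']
print(len(rows))  # 4 (the original data is untouched)
```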

To help, here's a sorting and filtering tutorial.

5. Charts: Let's make a chart to visualize some of the data in this spreadsheet.

6. Select one data set from this folder (not the US Presidents). Then perform the following actions:

Data05 Sheets Overview.webm

Still Curious?