Section 7.12
Introduction to ARC Challenge #4 & Sprint #1
Learning Goals
DAT-2.A: Students will practice identifying trends and patterns in data sets.
DAT-2.A: Students will describe what information can be extracted from data.
DAT-2.B: Students will describe what information can be extracted from metadata.
DAT-2.C: Students will identify and solve challenges associated with processing data.
DAT-2.D: Students will extract information from data using a program.
DAT-2.E: Students will explain how programs can be used to gain insight and knowledge from data.
DAT-2.A.1: Information is the collection of facts and patterns extracted from data.
DAT-2.A.3: Digitally processed data may show correlation between variables. A correlation found in data does not necessarily indicate that a causal relationship exists. Additional research is needed to understand the exact nature of the relationship.
DAT-2.A.4: Often, a single source does not contain the data needed to draw a conclusion. It may be necessary to combine data from a variety of sources to formulate a conclusion.
DAT-2.B.1: Metadata are data about data. For example, the piece of data may be an image, while the metadata may include the date of creation or the file size of the image.
DAT-2.B.2: Changes and deletions made to metadata do not change the primary data.
DAT-2.B.3: Metadata are used for finding, organizing, and managing information.
DAT-2.B.4: Metadata can increase the effective use of data or data sets by providing additional information.
DAT-2.B.5: Metadata allows data to be structured and organized.
DAT-2.C.1: The ability to process data depends on the capabilities of the users and their tools.
DAT-2.C.2: Data sets pose challenges regardless of size, such as:
the need to clean data
incomplete data
invalid data
the need to combine data sources
DAT-2.D.2: Tables, diagrams, text, and other visual tools can be used to communicate insight and knowledge gained from data.
DAT-2.D.3: Search tools are useful for efficiently finding information.
DAT-2.D.4: Data filtering systems are important tools for finding information and recognizing patterns in data.
DAT-2.D.5: Programs such as spreadsheets help efficiently organize and find trends in information.
DAT-2.D.6: Some processes that can be used to extract or modify information from data include the following:
transforming every element of a data set, such as doubling every element in a list, or adding a parent’s email to every student record
combining or comparing data in some way, such as adding up a list of numbers, or finding the student who has the highest GPA
visualizing a data set through a chart, graph, or other visual representation
DAT-2.E.1: Programs are used in an iterative and interactive way when processing information to allow users to gain insight and knowledge about data.
DAT-2.E.2: Programmers can use programs to filter and clean digital data, thereby gaining insight and knowledge.
DAT-2.E.3: Combining data sources, clustering data, and classifying data are parts of the process of using programs to gain insight and knowledge from data.
DAT-2.E.4: Insight and knowledge can be obtained from translating and transforming digitally represented information.
DAT-2.E.5: Patterns can emerge when data are transformed using programs.
Objectives and General Description
ARC Challenge #4 is a data analysis project that spans both Unit 4 (Data Representation) and Unit 5 (Big Data). A Big Data analytics firm has hired the students to investigate and analyze an industry concern. Students will determine the concern, write questions that need to be answered, identify the data points they need, design a collection method, collect real-life data, and then analyze the results. Students will visually present their findings and interpretations of the data and address concerns that arise from collecting it (privacy, storage, security). The project combines designing, collecting, filtering, cleaning, and analyzing the collected data. It should be started early in Unit 4 and will be completed at the end of the semester for the semester showcase.
This challenge has four parts:
Part A: Class Data Analysis. This part uses the data your class has supplied over the past week. You will use this data to practice making predictions, identifying patterns, and summarizing results in a visualization.
Part B: Design a citizen science project. Your team will use a design thinking approach to create a crowdsourced research project relating to your topic/innovation.
Part C: Collect Data. The team will implement the citizen science project to collect data from a wide variety of stakeholders.
Part D: Analyze the data and make recommendations. The team will clean, filter, sort, and analyze the data to make research-based recommendations.
Sprint 1 will be devoted to Part A.
Sprints 2 & 3 will focus on Parts B, C & D. Sprints 2 and 3 will be completed in Unit 5.
Activities
Activity 7.12.1 (40 minutes)
Introduction of Google Trends
Data analysis requires multiple steps. The raw data must be collected from a large, representative group. This data must be cleaned, filtered and organized. Then analysts look for patterns and trends. Finally, a decision must be made regarding how to communicate the findings. Most people don't enjoy reading spreadsheets of numbers. Analysts decide what visual format will best convey their findings.
Let’s do some background research first. Have students go to Google Trends and explore it for about five minutes. Encourage them to think of a question of their own. For example, is there a correlation between being a dog owner and eating pizza? Have them enter those search terms in Google Trends. Do they think there is a correlation?
Google Trends is a search trends feature that shows how frequently a given search term is entered into Google's search engine relative to the site's total search volume over a given period of time.
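For teachers who want to show what "relative to total search volume" can mean in practice, here is a minimal Python sketch. It assumes a simplified model in which each period's share of searches is scaled so the busiest period equals 100. The function name and all counts are hypothetical and invented for illustration; the real service processes its data in ways not shown here.

```python
def relative_interest(term_counts, total_counts):
    """Scale a term's share of all searches to a 0-100 index per period."""
    # Share of all searches that used the term in each period.
    shares = [term / total for term, total in zip(term_counts, total_counts)]
    peak = max(shares)
    if peak == 0:
        # The term was never searched, so every period scores 0.
        return [0 for _ in shares]
    # The busiest period becomes 100; every other period is relative to it.
    return [round(100 * share / peak) for share in shares]


# Hypothetical weekly counts, invented for illustration only.
pizza_searches = [120, 150, 300, 180]
all_searches = [10_000, 10_000, 12_000, 9_000]

print(relative_interest(pizza_searches, all_searches))
```

Note that the week with the most searches for the term is not automatically the week with the highest index: the index depends on the term's share of all searches in that week.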
For the teacher - How Trends works.
Google Trends allows you to see the topics people are—or aren’t—following, practically in real time. Journalists can use this information to explore potential story ideas, and can also feature Trends data within news stories to illustrate a general level of interest in, say, a political candidate, social issue or event.
The Google Trends homepage (google.com/trends) features clustered topics that Google detects are related and trending together on either Search, Google News, or YouTube. Trending Stories are collected based on Google’s Knowledge Graph technology, which gathers search information from those three Google platforms to detect when stories are trending based on the relative spike in volume and the absolute volume of searches.
The Featured insights at the top of the new Google Trends homepage are curated by News Lab to highlight additional data patterns or interesting trends.
Class Discussion: What do you think this could be used for? How does it work? How many search terms can you put in? What type of information do you get? Is all the data exactly from the search terms? Or, is there other data associated with it? What types of tables, diagrams, and other visual tools does Google Trends use to communicate data? Have students explain how programs can be used to gain insight and knowledge from data.
Important concepts to emphasize:
The processed data may show a correlation between variables. However, it does not necessarily show a causal relationship. Additional research would be required to understand the exact relationship.
Google Trends is one source of information.
Introduce the term metadata. Metadata is data that defines or describes other data. For example, the piece of data may be an image, while the metadata may include the date of creation or the file size of the image. Other examples include time stamps and the name of the author.
Use this Metadata Worksheet to show an example of finding metadata in an image. If you want students to complete the worksheet, you will need to assign 2.3: Daily Video 2 “Extracting Information from Data” on AP Classroom so that they can finish answering the last few questions. Below are four key concepts that you will want students to learn:
Changes and deletions made to metadata do not change the collected responses associated with it. For example, adjusting the date a picture was taken does not change the picture itself.
It can help organize, find, and manage information. Examples include: figuring out the day of the week and / or the time the question was answered, searching for responses within a certain time frame, etc.
It provides additional information about the collected responses that can help confirm or disprove trends in the collected data set. For example, you might notice that people who respond in the morning are more positive about certain questions than those who respond in the evening.
It allows information to be structured and organized based on other factors than the collected responses. Examples include: organizing data by age, organizing data by time of day, organizing data by earliest to latest respondents.
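The idea of organizing responses by their metadata can be sketched in a few lines of Python. This is only an illustration: the field names `answer` and `submitted`, the sample rows, and the noon cutoff are all assumptions, not part of the class form.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical responses: "answer" is the data, "submitted" is metadata.
responses = [
    {"answer": 4, "submitted": "2024-03-04 08:15"},
    {"answer": 5, "submitted": "2024-03-04 09:40"},
    {"answer": 2, "submitted": "2024-03-04 19:05"},
    {"answer": 3, "submitted": "2024-03-05 20:30"},
]


def average_by_time_of_day(rows):
    """Average the answers in each group, grouping only by timestamp metadata."""
    groups = defaultdict(list)
    for row in rows:
        hour = datetime.strptime(row["submitted"], "%Y-%m-%d %H:%M").hour
        # Assumed cutoff: anything before noon counts as morning.
        label = "morning" if hour < 12 else "afternoon_or_evening"
        groups[label].append(row["answer"])
    return {label: sum(vals) / len(vals) for label, vals in groups.items()}


print(average_by_time_of_day(responses))
```

The answers themselves are never modified. Only the time stamp is used to decide which group each answer belongs to.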
Google Trends uses metadata in addition to the search terms. What pieces of information are from metadata? Is this helpful?
Google Trends takes all the search terms stored in its data sets and analyzes them along with the metadata to see if there is any correlation. Google has taken on the hard part of studying huge quantities of data and made the results available online. It is an abstraction of the data analysis process.
Activity 7.12.2 (2 - 3 hours)
Introduction of ARC Challenge #4 and Sprint 1
Let students know that ARC Challenge #4 is focused on data analysis. It has four parts as described in the general description above. Sprint 1 will focus only on Part A.
Part A is designed to allow students to practice cleaning, filtering and analyzing raw (or dirty) data to answer questions. They will be using the data that you have collected with the forms from Section 4.6. You will need to have the spreadsheet posted so that each team has access to it. You should delete the names column so that the students don't see who the responses belong to.
For Sprint #1, the students only need to use files 1 - 4. However, if you want them to see the whole project, you can give them access to the teacher folder.
Part A detailed description with instructions for the students.
Summary of project: The teams are going to develop two questions that they think the data will answer. Examples: Does a lower amount of sleep correlate with a bad mood the next day? Are students happier on Fridays? Then they will analyze the data to see if it supports their hypotheses, and they will create a visualization of their results.
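Teams who want to go beyond a spreadsheet chart could check a question like the sleep and mood example with a short program. The sketch below computes a Pearson correlation coefficient by hand. The variable names and the sample values are hypothetical, and mood is assumed to have been recorded on a 1 to 5 scale.

```python
from math import sqrt


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables move together, and how much each varies alone.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical cleaned data, invented for illustration only.
hours_of_sleep = [5, 6, 7, 8, 9]
mood_score = [2, 3, 3, 4, 5]  # assumed scale: 1 (bad) to 5 (great)

r = pearson(hours_of_sleep, mood_score)
print(f"correlation: {r:.2f}")
```

A value near +1 or -1 suggests that the two variables move together, and a value near 0 suggests that they do not. Either way, the number alone does not show that one causes the other.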
Students complete the KWL Chart and Sprint 1 Assignments. Once Sprint 1 has been approved, the teams can begin working.
Things to watch out for...
Students often don't realize that they will need to manually change some of the data. They are getting a "dirty" spreadsheet. The data is raw and untouched. Some students may have entered nonsense answers. Some students may have typed in "4" and others may have typed in "four". The intention was the same, but the analyst will need to clean that up so that the data can be filtered. Let students hit this roadblock themselves instead of forewarning them. It will be more meaningful when they realize for themselves that there are problems with the data.
Similarly, students have to adjust data sets and analysis when there are blank or incomplete responses. This becomes a big problem when there is a noticeable number of those responses, because trends cannot be identified from mostly missing information. Metaphorically, that's like trying to guess what a whole picture is when you can only see 1/8 of the image.
Invalid data, meaning data that does not correspond to the question asked, needs to be removed before the data set can be analyzed. Since it does not answer the question, it should not be counted in the results. A simple example is someone providing a street address when the form asked for their email address. It's not the information asked for and cannot be used to email someone.
Students can group responses together. For example, the options of a "meh" mood and a bad mood may be interpreted as the same thing. Or, students may want to group responses together into categories (e.g., TV and video games in one category, outdoor sports in another). They have the flexibility to interpret the data.
Students may not have Excel or Sheets experience. They may need to watch some tutorials to learn how to filter, enter formulas, create a chart, etc. This is an excellent opportunity to learn those skills.
A key takeaway is that the ability to process data depends on the capabilities of the users and their tools.
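For teams that choose to clean the data with a program instead of by hand, the issues above can be sketched in Python. This is a minimal illustration under stated assumptions: the column names `sleep` and `mood`, the word list, the 0 to 24 hour range, and the mood groupings are all hypothetical choices, not features of the class spreadsheet.

```python
# Assumed spellings students might type instead of digits.
WORD_TO_NUMBER = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}

# Assumed grouping: the analyst decides that "meh" counts as a bad mood.
MOOD_GROUPS = {"great": "good", "good": "good", "meh": "bad", "bad": "bad"}


def clean_hours(raw):
    """Return hours of sleep as a number, or None if blank or invalid."""
    text = str(raw).strip().lower()
    if text == "":
        return None  # incomplete response
    if text in WORD_TO_NUMBER:
        return float(WORD_TO_NUMBER[text])  # "four" becomes 4.0
    try:
        value = float(text)
    except ValueError:
        return None  # nonsense answer such as "asdf"
    # Reject values that cannot be hours of sleep in one day.
    return value if 0 <= value <= 24 else None


def group_mood(raw):
    """Map a mood answer to a category, or None if it is not recognized."""
    return MOOD_GROUPS.get(str(raw).strip().lower())


def clean_rows(rows):
    """Keep only rows where both answers are present and valid."""
    cleaned = []
    for row in rows:
        hours = clean_hours(row["sleep"])
        mood = group_mood(row["mood"])
        if hours is None or mood is None:
            continue  # drop blank or invalid responses
        cleaned.append({"sleep": hours, "mood": mood})
    return cleaned


# Hypothetical raw responses, invented for illustration only.
raw_rows = [
    {"sleep": "4", "mood": "Meh"},
    {"sleep": "four", "mood": "bad"},
    {"sleep": "", "mood": "good"},
    {"sleep": "asdf", "mood": "great"},
    {"sleep": " 8 ", "mood": "Good"},
]

for row in clean_rows(raw_rows):
    print(row)
```

Every rule in this sketch is a judgment call made by the analyst, such as treating "meh" as bad or dropping a row instead of guessing a missing value. A different team could reasonably choose different rules and reach different results.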
Discussions to incorporate
How much power do data analysts have to control the data? We often think of data as being very objective. However, when we clean the data, the opportunity for bias and subjectivity enters the data set.
What types of questions require less cleaning? What makes a "good" survey question?
How much data do you need?
What were the demographics of the responders? Was this a representative study? Do you think the results are accurate?
Did any teams ask the same questions but interpret the results differently?
What implications does this activity have for how we interpret the news?