Students will practice identifying trends and patterns in data sets. (DAT-2.A)
Students will describe what information can be extracted from data. (DAT-2.A)
Students will describe what information can be extracted from metadata. (DAT-2.B)
Students will identify and solve challenges associated with processing data. (DAT-2.C)
ARC Challenge #4 is a data analysis project that spans both Unit 4 (Data Representation) and Unit 5 (Big Data). A Big Data analytics firm has hired the students to investigate and analyze an industry concern. Students will determine the concern, write the questions that need to be answered, identify the data points they need, design and carry out the collection of real-life data, and then analyze the results. Students will present their findings and interpretations visually and address the concerns that arise from collecting their data (privacy, storage, security). The project combines designing, collecting, filtering, cleaning, and analyzing the data. It should be started early in Unit 4 and completed at the end of the semester for the semester showcase.
ARC Challenge #4 has four parts:
Part A: Class data analysis. You will use the data your class has supplied over the past week to practice making predictions, identifying patterns, and summarizing results in a visualization.
Part B: Design a citizen science project. Your team will use a design thinking approach to create a crowdsourced research project relating to your topic/innovation.
Part C: Collect Data. The team will implement the citizen science project to collect data from a wide variety of stakeholders.
Part D: Analyze the data and make recommendations. The team will clean, filter, sort, and analyze the data to make research-based recommendations.
Sprint 1 will be devoted to Part A.
Sprints 2 and 3 will focus on Parts B, C, and D and will be completed in Unit 5.
Activity 4.12.1 (20 minutes)
Introduction of Google Trends
Data analysis requires multiple steps. The raw data must be collected from a large, representative group. This data must be cleaned, filtered and organized. Then analysts look for patterns and trends. Finally, a decision must be made regarding how to communicate the findings. Most people don't enjoy reading spreadsheets of numbers. Analysts decide what visual format will best convey their findings.
Let’s do some background research first. Go to Google Trends and explore it for about five minutes. Encourage students to think of a question of their own. For example, is there a correlation between being a dog owner and eating pizza? Enter those search terms into Google Trends. Do you think there is a correlation?
Google Trends is a search trends feature that shows how frequently a given search term is entered into Google's search engine relative to the site's total search volume over a given period of time.
What do you think this could be used for? How does it work? How many search terms can you enter? What type of information do you get? Does all of the data come directly from the search terms, or is other data associated with them?
Important concepts to emphasize:
The processed data may show a correlation between variables. However, it does not necessarily show a causal relationship. Additional research would be required to understand the exact relationship.
Google Trends is one source of information.
Google Trends uses metadata in addition to the search terms. What pieces of information come from metadata? Is this helpful?
Google Trends takes all the search terms stored in its data sets and analyzes them along with the metadata to see if there is any correlation. It does the hard work of studying huge quantities of data and makes the results available online. It is an abstraction of the data analysis process.
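To make the correlation point above concrete, here is a minimal sketch of how a correlation between two series can be measured. The function name `pearson_r` and both lists of weekly interest scores are invented for illustration; they are not real Google Trends data, and this is not how Google Trends itself computes anything.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one series is constant")
    return cov / sqrt(var_x * var_y)

# Made-up weekly interest scores (0-100) for two search terms.
dog_interest = [40, 45, 50, 55, 60, 65]
pizza_interest = [30, 34, 41, 44, 52, 55]

r = pearson_r(dog_interest, pizza_interest)
print(f"r = {r:.2f}")
```

A value of r near 1 or -1 means the two series move together closely, and a value near 0 means they do not. Even a value very close to 1 says nothing about whether one search interest causes the other; both could be rising for an unrelated reason.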
Activity 4.12.2 (2 - 3 hours)
Introduction of ARC Challenge #4 and Sprint 1
Let students know that ARC Challenge #4 is focused on data analysis. It has four parts as described in the general description above. Sprint 1 will focus only on Part A.
Part A is designed to allow students to practice cleaning, filtering and analyzing raw (or dirty) data to answer questions. They will be using the data that you have collected with the forms from Section 4.6. You will need to have the spreadsheet posted so that each team has access to it. You should delete the names column so that the students don't see who the responses belong to.
For Sprint #1, the students only need to use files 1 - 4. However, if you want them to see the whole project, you can give them access to the teacher folder.
Part A detailed description with instructions for the students.
Summary of project: Each team develops two questions that they think the data can answer. Examples: Does less sleep correlate with a bad mood the next day? Are more students happy on Fridays? The team then analyzes the data to see whether it supports their hypotheses and creates a visualization of the results.
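As a rough illustration of how a team might check the sleep question above once the data is clean, the sketch below compares how often a bad mood is reported by students who slept less than a chosen number of hours versus everyone else. The sample responses, the function name `bad_mood_share_by_sleep`, and the 7-hour threshold are all assumptions made for this example; the real class spreadsheet will have its own columns and values.

```python
from collections import defaultdict

# Hypothetical, already-cleaned responses: (hours_of_sleep, mood_next_day).
responses = [
    (5, "bad"), (6, "bad"), (6, "good"), (7, "good"),
    (8, "good"), (8, "bad"), (9, "good"), (5, "bad"),
]

def bad_mood_share_by_sleep(rows, threshold=7):
    """Return the share of 'bad' moods for short sleepers and for everyone else."""
    counts = defaultdict(lambda: {"bad": 0, "total": 0})
    for hours, mood in rows:
        # The threshold is an assumption; teams should choose and justify their own.
        group = "short" if hours < threshold else "enough"
        counts[group]["total"] += 1
        if mood == "bad":
            counts[group]["bad"] += 1
    return {group: c["bad"] / c["total"] for group, c in counts.items()}

shares = bad_mood_share_by_sleep(responses)
for group, share in shares.items():
    print(f"{group}: {share:.0%} reported a bad mood")
```

The two shares can go straight into a bar chart. With only a handful of responses per group, a difference like this is a hint to investigate further, not proof.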
Students complete the KWL chart and Sprint 1 assignments. Once Sprint 1 has been approved, the teams can begin working.
Things to watch out for...
There is metadata (timestamp) on the spreadsheets. They can use that to figure out the day of the week, the time the question was answered, etc.
Students often don't realize that they will need to change some of the data manually. They are getting a "dirty" spreadsheet. The data is raw and untouched. Some students may have entered nonsense answers. Some students may have typed "4" and others "four". The intention was the same, but the analysts will need to clean that up so that the data can be filtered. Let students hit this roadblock themselves instead of forewarning them. It will be more meaningful when they discover the problems with the data on their own.
Students can group responses together. For example, a "meh" mood and a bad mood may be interpreted as the same thing. Or, students may want to group responses into categories (e.g., TV and video games in one category, outdoor sports in another). They have the flexibility to interpret the data.
Students may not have Excel or Sheets experience. They may need to watch some tutorials to learn how to filter, enter formulas, create a chart, etc. This is an excellent opportunity to learn those skills.
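For teachers who want to see the cleaning issues above in one place, here is a minimal sketch of the same steps done in code instead of a spreadsheet: converting number words to numbers, reading the day of the week from the timestamp metadata, and grouping mood answers. The column names ("Timestamp", "Hours of sleep", "Mood"), the timestamp format, the mood groupings, and the sample rows are all assumptions for illustration; check them against your actual form export.

```python
from datetime import datetime
from math import isfinite

NUMBER_WORDS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}
# One possible grouping; another team could reasonably group "meh" differently.
MOOD_GROUPS = {"bad": "low", "meh": "low", "good": "high", "great": "high"}
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def clean_number(raw):
    """Turn '4', ' four ', or 'Four' into 4.0; return None for nonsense answers."""
    text = raw.strip().lower()
    if text in NUMBER_WORDS:
        return float(NUMBER_WORDS[text])
    try:
        value = float(text)
    except ValueError:
        return None
    return value if isfinite(value) else None

def weekday_from_timestamp(raw, fmt="%m/%d/%Y %H:%M:%S"):
    """Read the day of the week from a timestamp (the format is an assumption)."""
    return DAY_NAMES[datetime.strptime(raw.strip(), fmt).weekday()]

def clean_row(row):
    """Return a cleaned copy of one response, or None if it cannot be used."""
    hours = clean_number(row["Hours of sleep"])
    mood_group = MOOD_GROUPS.get(row["Mood"].strip().lower())
    if hours is None or mood_group is None:
        return None
    return {
        "day": weekday_from_timestamp(row["Timestamp"]),
        "hours": hours,
        "mood_group": mood_group,
    }

# Made-up raw responses, including one nonsense answer.
raw_rows = [
    {"Timestamp": "1/19/2024 8:05:00", "Hours of sleep": "four", "Mood": "Meh"},
    {"Timestamp": "1/19/2024 8:07:12", "Hours of sleep": "4", "Mood": "bad"},
    {"Timestamp": "1/22/2024 8:01:30", "Hours of sleep": "banana", "Mood": "good"},
]
cleaned = [c for c in (clean_row(r) for r in raw_rows) if c is not None]
print(cleaned)
```

Note that dropping the "banana" row and merging "meh" with "bad" are both judgment calls made by whoever wrote the cleaning rules, which is exactly the kind of choice students will face in the spreadsheet.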
Discussions to incorporate
How much power do data analysts have to control the data? We often think of data as being very objective. However, when we clean the data, the opportunity for bias and subjectivity enters the data set.
What types of questions require less cleaning? What makes a "good" survey question?
How much data do you need?
What were the demographics of the respondents? Was this a representative sample? Do you think the results are accurate?
Did any teams ask the same questions but interpret the results differently?
What implications does this activity have for how we interpret the news?