Lectures will take place in NSH 1305.
Lecture slides are available at https://github.com/jmankoff/data (fork it to easily keep an up-to-date copy). To fork, create a free GitHub account, open the URL above, and click 'Fork'; from then on the repository will show up in your account. GitHub has a GUI that you can download, which will update your copy when you request it. However, you should create a separate place for your own work and use this repository only to see my version of each project.
Readings are linked off this page or available on Blackboard. Discussion posts are not required for optional readings.
Description: An overview exploring the hype around Data Science, the different perspectives that are needed on a team that works with data, and the pipeline involved in working with data.
[1 Introduction][Slides describing Byte 1]
Homework: Byte 1 Assigned
Description: How do we decide what questions to ask of the data?
Description: Discussion of the properties of data, and a practical overview of XML, JSON, SQL, etc.
[3 Data Storage][Slides describing Byte 2]
Readings: Optional: Stonebraker & Hellerstein: What Goes Around Comes Around
(a historical view of different classes of data modeling)
Homework: Byte 1 Due; Byte 2 Assigned
Description: Sampling issues; Pros and cons of different sources of data; Practical overview of APIs and OAuth
[4 Acquiring Data][4 Practical Access to Data]
Description: The four Cs (Correctness, Coherence, Completeness, and AcCountability); Case studies: Mouse data; Location Data
[5 Data Quality][ZipFound.csv][5 Case Study Location]
Readings:
Description: Transforming data; Stem & Leaf plots; Boxplots; Histograms and distributions and their implications
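As a rough illustration of the stem-and-leaf plots mentioned above, here is a minimal Python sketch. It is not taken from the lecture slides: the function name `stem_and_leaf` and the sample values are made up for this example, and it assumes non-negative integers with the tens digit as the stem and the ones digit as the leaf.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return stem-and-leaf rows for non-negative integers.

    Assumption for this sketch: stem = value // 10, leaf = value % 10.
    """
    groups = defaultdict(list)
    for v in sorted(values):
        groups[v // 10].append(v % 10)

    rows = []
    # Include empty stems so that gaps in the distribution stay visible.
    for stem in range(min(groups), max(groups) + 1):
        leaves = "".join(str(leaf) for leaf in groups.get(stem, []))
        rows.append(f"{stem:>2} | {leaves}")
    return rows


# Made-up sample data, for illustration only.
sample = [12, 15, 21, 24, 24, 38, 40, 41, 47]
for row in stem_and_leaf(sample):
    print(row)
```

Read sideways, the rows work like a histogram with a bin width of 10, while still showing every individual value.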
[6 Exploration][Description of Byte 3][Data for StemLeaf]
Readings:
Description: Overview of key concepts in information visualization and a brief discussion of D3.
Readings:
Description: Discussion of 4 studies that influenced the design of the Stepgreen.org website.
Readings:
Slides: [8 Information Visualization Case Study]
Description: Daniel Neill is a faculty member at CMU's Heinz College, where he directs the Event and Pattern Detection Laboratory. He is also associated with the Machine Learning Department and Robotics Institute. His research is focused on novel statistical and computational methods for discovery of emerging events and other relevant patterns in complex and massive datasets, applied to real-world policy problems ranging from medicine and public health to law enforcement and security. Slides available on Blackboard.
Readings: New Directions in Artificial Intelligence for Public Health Surveillance. Neill, D.
Description: I'll discuss some of the approaches to and challenges with large-volume geographic data. I'll be joined by some of our team members to show some interesting examples drawn from census data and satellite imagery. No slides.
Sites visited:
Explorables site (includes links to some of the below)
AirNow air quality exploration
Oil and gas drilling in selected states of the U.S. (limited to states we've scraped data from. If you want to join the scraping effort let me know!)
Whole-Earth Time-lapse (be sure to zoom out and into other spots)
EVA 3-d high-dimensional data exploration
Speaker Bio: Randy Sargent holds dual appointments at Carnegie Mellon University and Google. As Visiting Scientist in Google’s Earth Engine team, Randy helps research and develop time-lapse explorable maps, including a recently released global Landsat timelapse mosaicked from 29 years of Landsats 4, 5, and 7. As Senior Systems Scientist in Carnegie Mellon University's CREATE Lab, Randy works with a team to develop ways to explore and visualize big data – massive time-series data from the BodyTrack self-tracking project, and terapixel-scale zoomable and explorable videos of diverse subjects such as plants growing, or a simulation of the universe from big bang to present.
Prior to CMU and Google, Randy helped develop planetary rover software in NASA Ames’s Intelligent Robotics Group, and founded/co-founded three successful technology companies. Randy received his BS in Computer Science from MIT, and his MS from the MIT Media Lab, where he developed the Programmable Brick, a research prototype for LEGO Mindstorms.
Description: Discussed the overall difference between frequency-based hypothesis testing and process-based (Bayesian) hypothesis testing, along with limitations and tests for special situations (multiple comparisons, violated assumptions, and so on).
[10 Introductory Statistics][Project 1]
Homework: Byte 4 Due; First Project Assigned
Readings:
Optional Readings:
Description: Discussion of the t-test, the math and assumptions underlying it, and the process for using it. Discussion of correlation and regression, the math and assumptions underlying them, and how to use them. Also discussed limitations and paradoxes (such as Simpson's paradox).
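To make Simpson's paradox concrete, here is a small Python sketch. The counts are invented for illustration and are not from the lecture or any real study; the labels "A", "B", "easy cases", and "hard cases" are likewise hypothetical. Option A has the higher success rate within each group, yet option B has the higher rate once the groups are pooled, because A was mostly tried on the hard cases.

```python
from fractions import Fraction

# Hypothetical (successes, trials) counts, invented for illustration only.
counts = {
    "easy cases": {"A": (9, 10), "B": (80, 100)},
    "hard cases": {"A": (30, 100), "B": (2, 10)},
}


def rate(successes, trials):
    """Exact success rate as a fraction."""
    return Fraction(successes, trials)


def pooled_rate(option):
    """Success rate for one option after merging all groups."""
    successes = sum(group[option][0] for group in counts.values())
    trials = sum(group[option][1] for group in counts.values())
    return rate(successes, trials)


# Within each group, A beats B ...
for name, group in counts.items():
    print(name, "A:", float(rate(*group["A"])), "B:", float(rate(*group["B"])))

# ... but in the pooled data the ordering reverses.
print("pooled", "A:", float(pooled_rate("A")), "B:", float(pooled_rate("B")))
```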
[Slides]
[Project 1 Meeting Signup (9-12:45)]
Readings:
Description: Discussed the basic process by which classifiers are trained and used, and some of the metrics used to evaluate their success. Talked about the importance of having a train/test set that is separate from the data you experiment on.
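The train/test separation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the lecture's own code: the data is synthetic, the "classifier" is a toy one-dimensional threshold, and the helper names (`train_test_split`, `fit_threshold`, `evaluate`) are made up for this example and use only the standard library.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and hold out a fraction as the test set."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]


def fit_threshold(train):
    """Toy 1-D classifier: threshold halfway between the two class means."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def evaluate(threshold, rows):
    """Compute accuracy, precision, and recall for the threshold classifier."""
    tp = sum(1 for x, y in rows if x > threshold and y == 1)
    fp = sum(1 for x, y in rows if x > threshold and y == 0)
    fn = sum(1 for x, y in rows if x <= threshold and y == 1)
    tn = sum(1 for x, y in rows if x <= threshold and y == 0)
    return {
        "accuracy": (tp + tn) / len(rows),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Synthetic data: class 0 centered at 0.0, class 1 centered at 3.0.
rng = random.Random(42)
data = [(rng.gauss(0.0, 1.0), 0) for _ in range(50)]
data += [(rng.gauss(3.0, 1.0), 1) for _ in range(50)]

train, test = train_test_split(data)
threshold = fit_threshold(train)   # fitted on the training rows only
metrics = evaluate(threshold, test)  # scored on rows the model never saw
print(metrics)
```

The point of the design is that `fit_threshold` never sees the held-out rows, so the reported metrics estimate performance on new data and do not simply reward memorizing the data you experimented on.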
Readings:
Slides: [Slides]
Description: Meetings with Project I Groups to discuss project progress and goals.
Description: Discussion of Decision Trees, Naïve Bayes, and Regression
http://pdf.aminer.org/001/202/088/evaluating_learning_algorithms_composed_by_a_constructive_meta_learning_scheme.pdf
Homework: Byte 5 Assigned
Description: We talked about infrastructure issues for big data.
[Slides]
Readings:
Description: Discussion about Visualization of Big Data & Byte 6 option 1.
[Slides][Description of Byte 6 -- big data]
Readings:
Description: Discussion of social network analytics
[Slides][Description of Byte 6 -- social networking]
Readings:
Homework: Byte 5 Due; Byte 6 Assigned
Description:
[Slides]
Readings: None
Homework: Byte 6 Due
Description: The focus of the lecture was on the issues faced when crowdsourcing. Slides available on Blackboard.
Readings: Tomasic et al: Motivating Contribution in a Participatory Sensing System via Quid-Pro-Quo. To Appear in CSCW 2014. [Blackboard]
Bio: Aaron Steinfeld is an associate research professor in the Robotics Institute at Carnegie Mellon and the co-director of the Rehabilitation Engineering Research Center on Accessible Public Transportation. He earned his Ph.D., M.S. and B.S. in industrial & operations engineering from the University of Michigan (1999, 1994 and 1993, respectively) and completed a postdoctoral position at the University of California, Berkeley (2000). Steinfeld’s interests focus on constrained user interfaces and operator assistance, predominantly in the realms of human-robot interaction, rehabilitation, transportation and intelligent systems. He is interested in how to enable timely and appropriate interaction when interfaces are restricted through design, tasks, the environment, time pressures, and/or user abilities. He works on the Tiramisu project.
Tiramisu Transit is a crowd-powered transit information system developed by researchers to improve users' transit experiences and transit accessibility. With Tiramisu - literally Italian for "pick me up" - anyone waiting at a bus stop with a smartphone can see which buses or light rail vehicles are due to arrive next and, thanks to the signals from riders already aboard, get an idea of how long they have to wait. When a rider first activates the app, Tiramisu displays the nearest stops and a list of buses or light rail vehicles that are scheduled to arrive. The list includes arrival times, based either on historical data for that route or on real-time reports from riders. When the desired vehicle arrives, the user indicates the level of "fullness" and then presses a button, allowing their phone to share an ongoing GPS trace with the Tiramisu server. Once aboard, the rider can use Tiramisu to find out which stop is next and to report problems, positive experiences and suggestions.
Description: Discussion of intelligible machine learning.
[Lecture Slides: TBD]
Readings [tentative]: