Collecting, Analyzing And Interacting With Data

Class Calendar 2014

Lectures (Tentative Schedule, lectures in Blue are final)

Lectures will take place in NSH 1305

Lecture slides are available at https://github.com/jmankoff/data. To easily keep an up-to-date copy, create a free GitHub account, open the URL above, and click 'Fork'; from then on the repository will show up in your account. GitHub has a GUI client that you can download, which will update your copy when you request it. However, you should create a separate place for your own work and use this repository only to see my version of each project.

Readings are linked off this page or available on Blackboard. Discussion posts are not required for optional readings.

Tues 1/13 Introduction & Overview of Data Science Pipeline

Description: An overview exploring the hype around Data Science, the different perspectives that are needed on a team that works with data, and the pipeline involved in working with data.

[1 Introduction][Slides describing Byte 1]

Homework: Byte 1 Assigned

Thurs 1/15 Scoping Projects; Asking good Questions

Description: How do we decide what questions to ask of the data?

[2 Asking Questions]

Tues 1/20 Structured vs Unstructured Data

Description: Discussion of the properties of data, and a practical overview of XML, JSON, SQL, etc.
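As a rough illustration of the structured vs. semi-structured distinction (not taken from the lecture slides), the sketch below holds the same facts first as nested JSON and then as rows in a SQL table. The record, the table name `pets`, and its columns are invented for this example.

```python
import json
import sqlite3

# Hypothetical record, invented for illustration.
raw = '{"name": "Ada", "pets": [{"kind": "cat", "age": 3}, {"kind": "dog", "age": 5}]}'

# Semi-structured view: nested JSON parsed into dicts and lists.
person = json.loads(raw)
pet_kinds = [pet["kind"] for pet in person["pets"]]

# Structured view: the same facts flattened into a relational table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (owner TEXT, kind TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [(person["name"], pet["kind"], pet["age"]) for pet in person["pets"]],
)
oldest = conn.execute(
    "SELECT kind, age FROM pets ORDER BY age DESC LIMIT 1"
).fetchone()

print(pet_kinds)
print(oldest)
```

The JSON form keeps the nesting (a person owns a list of pets), while the SQL form trades nesting for a fixed schema that can be queried declaratively.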

[3 Data Storage][Slides describing Byte 2]

Readings: Optional: Stonebraker & Hellerstein: What Goes Around Comes Around

(a historical view of different classes of data modeling)

Homework: Byte 1 Due; Byte 2 Assigned

Thurs 1/22 Acquiring Data

Description: Sampling issues; Pros and cons of different sources of data; Practical overview of APIs and OAuth

[4 Acquiring Data][4 Practical Access to Data]

Tues 1/27 Understanding and Cleaning your Data

Description: The four Cs (Correctness, Coherence, Completeness, and AcCountability); Case studies: Mouse data; Location Data

[5 Data Quality][ZipFound.csv][5 Case Study Location]

Readings:

McCallum: Bad Data: Chapter 7 [on blackboard]
Dasu: Data Glitches: Monsters in your Data
Optional: Dasu & Loh: Statistical Distortion: Consequences of Data Cleaning
Optional: Raman, V., & Hellerstein, J. M. (2001, September). Potter's wheel: An interactive data cleaning system. In VLDB (Vol. 1, pp. 381-390).

Thurs 1/29 Visualizing and Exploring your Data

Description: Transforming data; Stem & Leaf plots; Boxplots; Histograms and distributions and their implications
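For readers who have not seen a stem-and-leaf plot, here is a minimal sketch, assuming non-negative integer data with tens as stems and units as leaves. The function name `stem_and_leaf` and the sample values are hypothetical and are not the class data set. The quartiles that a boxplot would draw are computed with the standard library's default (exclusive) method.

```python
from collections import defaultdict
import statistics


def stem_and_leaf(values):
    """Return stem-and-leaf rows (stem = tens digit(s), leaf = units digit).

    Assumes non-negative integers; stems with no values are skipped.
    """
    rows = defaultdict(list)
    for v in sorted(values):
        rows[v // 10].append(v % 10)
    return [
        f"{stem:>2} | {' '.join(str(leaf) for leaf in leaves)}"
        for stem, leaves in sorted(rows.items())
    ]


# Hypothetical sample values, invented for illustration.
data = [12, 15, 21, 23, 23, 30, 34, 47]

plot = stem_and_leaf(data)
for row in plot:
    print(row)

# The three quartile cut points a boxplot is built around.
q1, med, q3 = statistics.quantiles(data, n=4)
print("Q1, median, Q3:", q1, med, q3)
```

Unlike a histogram, the stem-and-leaf rows keep every original value visible, which is why they are handy for small data sets.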

[6 Exploration][Description of Byte 3][Data for StemLeaf]

Readings:

Pearson: Mining Imperfect Data: Chapter 1 [on blackboard]
Optional: Gelman: Exploratory Data Analysis for Complex Models (read the article, not just the blog post)

Tues 2/3: Information Visualization Overview & Introduction to D3

Description: Overview of key concepts in information visualization and a brief discussion of D3.

[7 Visualization]

Readings:

Ware: Visual Thinking for Design: Chapter 9 (The Dance of Meaning)
Hullman, J., & Diakopoulos, N. (2011). Visualization rhetoric: Framing effects in narrative visualization. Visualization and Computer Graphics, IEEE Transactions on, 17(12), 2231-2240.
Optional: Ware: Visual Thinking for Design: Chapter 1 (Visual Queries)
Optional: Tufte: Beautiful Evidence: The Fundamental Principles of Analytical Design
Optional: D3: Data-Driven Documents: Michael Bostock, Vadim Ogievetsky and Jeffrey Heer

Homework: Byte 2 Due; Byte 3 Assigned

Thursday 2/5 Information Visualization Case Study: StepGreen

Description: Discussion of four studies that influenced the design of the StepGreen.org website.

Readings:

Optional: Mankoff, J., Fussell, S. R., Dillahunt, T., Glaves, R., Grevet, C., Johnson, M., ... & Setlock, L. D. (2010, May). StepGreen.org: Increasing Energy Saving Behaviors via Social Networks. In ICWSM.

Slides: [8 Information Visualization Case Study]

Tues 2/10 Guest Lecture by Daniel Neill

Description: Daniel Neill is a faculty member at CMU's Heinz College, where he directs the Event and Pattern Detection Laboratory. He is also associated with the Machine Learning Department and Robotics Institute. His research is focused on novel statistical and computational methods for discovery of emerging events and other relevant patterns in complex and massive datasets, applied to real-world policy problems ranging from medicine and public health to law enforcement and security. Slides available on Blackboard.

Readings: New Directions in Artificial Intelligence for Public Health Surveillance. Neill, D.

Thurs 2/12 Mobile Data & Map Data

Description: Discussion of what can be accomplished with mobile data collection. Description of Byte 4 (Mobile Byte or Map Byte)

Slides: [9 Mobile][Slides for Byte 4]

Reading: Wang, R., Chen, F., Chen, Z., Li, T., Harari, G., Tignor, S., ... & Campbell, A. T. (2014, September). Studentlife: assessing mental health, academic performance and behavioral trends of college students using smartphones. In Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing (pp. 3-14). ACM. [Video]

Homework: Byte 3 Due Friday 2/14; Byte 4 Assigned

Tues 2/17 Large-volume Geographic Data: [Guest lecture by Randy Sargent]

Description: I'll discuss some of the approaches to and challenges with large-volume geographic data. I'll be joined by some of our team members to show some interesting examples drawn from census data and satellite imagery. No Slides

Sites visited:

Explorables site (includes links to some of the below)

Gigapan Obama Inauguration

Racial dot map of the U.S.

AirNow air quality exploration

Lights at Night

Oil and gas drilling in selected states of the U.S. (limited to states we've scraped data from. If you want to join the scraping effort let me know!)

A year of fires

Wind map of Earth

Time Machine

Whole-Earth Time-lapse (be sure to zoom out and into other spots)

EVA 3-d high-dimensional data exploration

Speaker Bio: Randy Sargent holds dual appointments at Carnegie Mellon University and Google. As Visiting Scientist in Google’s Earth Engine team, Randy helps research and develop time-lapse explorable maps, including a recently-released global Landsat timelapse mosaicked from 29 years of Landsats 4, 5, and 7. As Senior Systems Scientist in Carnegie Mellon University's CREATE Lab, Randy works with a team to develop ways to explore and visualize big data – massive time-series data from the BodyTrack self-tracking project, and terapixel-scale zoomable and explorable videos of diverse subjects such as plants growing, or a simulation of the universe from big bang to present.

Prior to CMU and Google, Randy helped develop planetary rover software in NASA Ames’s Intelligent Robotics Group, and founded/co-founded three successful technology companies. Randy received his BS in Computer Science from MIT, and his MS from the MIT Media Lab, where he developed the Programmable Brick, a research prototype for LEGO Mindstorms.

Thurs 2/19 Overview of Statistical Hypothesis Testing

Description: Discussed the overall difference between frequency-based hypothesis testing and process-based (Bayesian) hypothesis testing, as well as limitations and tests for special situations (multiple comparisons, violated assumptions, and so on).
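To make the multiple-comparisons issue concrete, the sketch below applies a Bonferroni correction, one common (and conservative) adjustment; whether the lecture used this particular correction is an assumption. The p-values are hypothetical and the function name `bonferroni` is invented for this example.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag which tests stay significant after a Bonferroni correction.

    Each p-value is compared against alpha divided by the number of tests.
    """
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


# Hypothetical p-values from four separate tests, invented for illustration.
p_values = [0.001, 0.012, 0.03, 0.2]

uncorrected = [p <= 0.05 for p in p_values]
corrected = bonferroni(p_values)

# Chance of at least one false positive across m independent tests at
# level alpha, if every null hypothesis is true.
m = len(p_values)
familywise_error = 1 - (1 - 0.05) ** m

print(uncorrected)
print(corrected)
print(round(familywise_error, 3))
```

With four tests, the third p-value (0.03) passes the usual 0.05 cutoff but not the corrected cutoff of 0.0125.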

[10 Introductory Statistics][Project 1]

Homework: Byte 4 Due; First Project Assigned

Readings:

Nuzzo, R. (2014). Statistical errors. Nature, 506(13), 150-152.
Hart, A. (2000). Towards better research: a discussion of some common mistakes in statistical analyses. Complementary therapies in medicine, 8(1), 37-42. [ON BLACKBOARD]
Not your median patient: How a climate scientist faced cancer (John Ungar Zussman, August 11, 2010)

Optional Readings:

(Optional) Dunlop, M. D., & Baillie, M. (2009). Paper Rejected (p> 0.05): An Introduction to the Debate on Appropriateness of Null-Hypothesis Testing. International Journal of Mobile Human Computer Interaction (IJMHCI), 1(3), 86-93.
(Optional) Colquhoun, D. (2014). An investigation of the false discovery rate and the misinterpretation of p-values. Royal Society Open Science, 1(3), 140216.
(Optional) Ioannidis, J. P. (2005). Why most published research findings are false. PLoS medicine, 2(8), e124.

Tues 2/24 T-Tests, Correlation and Regression

Description: Discussion of the t-test, the math and assumptions underlying it, and the process for using it. Discussion of correlation and regression, the math and assumptions underlying them, and how to use them. Also discussed limitations and paradoxes (such as Simpson's paradox).
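Simpson's paradox is easiest to see with numbers. The sketch below uses illustrative success counts for two treatments split into two subgroups; the counts and the names `counts`, `rate`, and `overall` are chosen for this example and are not taken from the lecture slides.

```python
# (successes, trials) per treatment and subgroup; illustrative counts only.
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def rate(successes, trials):
    """Success rate as a fraction."""
    return successes / trials


def overall(treatment):
    """Success rate with the subgroups pooled together."""
    successes = sum(s for s, _ in counts[treatment].values())
    trials = sum(n for _, n in counts[treatment].values())
    return rate(successes, trials)


for group in ("small", "large"):
    a = rate(*counts["A"][group])
    b = rate(*counts["B"][group])
    print(f"{group}: A={a:.3f} B={b:.3f}")

print(f"overall: A={overall('A'):.3f} B={overall('B'):.3f}")
```

Treatment A has the higher rate within each subgroup, yet B has the higher rate once the subgroups are pooled, because A was applied mostly to the harder "large" cases.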

[Slides]

[Project 1 Meeting Signup (9-12:45)]

Readings:

(Optional) Carte, T. A., & Russell, C. J. (2003). In pursuit of moderation: Nine common errors and their solutions. Mis Quarterly, 479-501.

Thurs 2/26 Finish Stats & Start Classification Basics & Metrics

Description: Discussed the basic process by which classifiers are trained and used, and some of the metrics used to evaluate their success. Talked about the importance of having a train/test set that is separate from the data you experiment on.
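A minimal sketch of both ideas, assuming binary labels coded as 0 and 1: a shuffled hold-out split, and three common metrics computed from the confusion counts. The function names and the label vectors are hypothetical; a real project would more likely use a library such as scikit-learn.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and hold out a fraction as the test set."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def precision_recall_accuracy(y_true, y_pred):
    """Compute precision, recall, and accuracy for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return precision, recall, accuracy


# Hypothetical data: 20 row ids, plus true and predicted labels for 8 test rows.
train, test = train_test_split(range(20))
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

print(len(train), len(test))
print(precision_recall_accuracy(y_true, y_pred))
```

The test rows are set aside once and never used while tuning the model, so the metrics estimate performance on data the classifier has not seen.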

Readings:

Slides: [Slides]

Tues 3/3 Project I Meetings: 9-12:45

Description: Meetings with Project I Groups to discuss project progress and goals.

Link to sign up

Thurs 3/5 Classification Algorithms

[Slides] [Slides for Byte 5]

Description: Discussion of Decision Trees, Naïve Bayes, and Regression
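As a small illustration of how a decision tree picks a split, the sketch below scores every candidate threshold on one numeric feature by weighted Gini impurity and keeps the best one. Gini is one common splitting criterion; whether the lecture used it (rather than, say, information gain) is an assumption, and the data and function names are invented.

```python
def gini(labels):
    """Gini impurity of a list of binary 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)  # same as 1 - p**2 - (1 - p)**2


def best_threshold(xs, ys):
    """Return (threshold, weighted impurity) for the best split x <= threshold."""
    best = (None, float("inf"))
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue  # a split must put rows on both sides
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (t, score)
    return best


# Hypothetical feature values and labels, invented for illustration.
xs = [1, 2, 3, 10, 11, 12]
ys = [0, 0, 0, 1, 1, 1]

print(best_threshold(xs, ys))
```

A full decision tree repeats this search recursively on each side of the chosen split, across all features.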

http://pdf.aminer.org/001/202/088/evaluating_learning_algorithms_composed_by_a_constructive_meta_learning_scheme.pdf

Homework: Byte 5 Assigned

3/9-3/13: Spring Break

Tues 3/17 Project 1 Poster Session

Thursday 3/19 Infrastructure for Big Data

Description: We talked about infrastructure issues for big data

[Slides]

Readings:

Critical Questions for Big Data, danah boyd & Kate Crawford, Information, Communication & Society, 15:5, 662-679, 2012. [on blackboard]
Fisher, D., DeLine, R., Czerwinski, M., & Drucker, S. (2012). Interactions with big data analytics. interactions, 19(3), 50-59.
(optional) Morstatter, F., Pfeffer, J., Liu, H., & Carley, K. M. (2013). Is the sample good enough? comparing data from twitter's streaming api with twitter's firehose. arXiv preprint arXiv:1306.5204.

Tues 3/24: Visualizing Big Data

Description: Discussion about Visualization of Big Data & Byte 6 option 1.

[Slides][Description of Byte 6 -- big data]

Readings:

(optional) A Review of Overview+Detail, Zooming, and Focus+Context Interfaces, ANDY COCKBURN, AMY KARLSON, BENJAMIN B. BEDERSON
(optional) Liu, Z., Jiang, B., & Heer, J. (2013, June). imMens: Real‐time Visual Querying of Big Data. In Computer Graphics Forum (Vol. 32, No. 3pt4, pp. 421-430). Blackwell Publishing Ltd.
(optional) Getting Started with Google Big Query

Thurs 3/26: Social Network Analytics

Description: Discussion of social network analytics

[Slides][Description of Byte 6 -- social networking]

Readings:

HW: Byte 5 Due; Byte 6 Assigned

Tues 3/31: Guest Lecture: Afsaneh Doryab (Tracking Individual Behavior)

Description:

[Slides]

Readings: None

Thurs 4/2: Final Project Planning (Individual Meetings 9-12)

Link to sign up

Tues 4/7: Guest Lecture: Aaron Steinfeld (Tiramisu)

HW: Byte 6 Due

Description: The focus of the lecture was on the issues faced when crowdsourcing. Slides available on Blackboard.

Readings: Tomasic et al: Motivating Contribution in a Participatory Sensing System via Quid-Pro-Quo. To Appear in CSCW 2014. [Blackboard]

Bio: Aaron is an associate research professor in the Robotics Institute at Carnegie Mellon and the co-director of the Rehabilitation Engineering Research Center on Accessible Public Transportation. He earned his Ph.D., M.S. and B.S. in industrial & operations engineering from the University of Michigan (1999, 1994 and 1993, respectively) and completed a postdoctoral position at the University of California, Berkeley (2000). Steinfeld’s interest is focused around constrained user interfaces and operator assistance, predominantly in the realms of human-robot interaction, rehabilitation, transportation and intelligent systems. He is interested in how to enable timely and appropriate interaction when interfaces are restricted through design, tasks, the environment, time pressures, and/or user abilities. He works on the Tiramisu project.

Tiramisu Transit is a crowd-powered transit information system developed by researchers to improve users' transit experiences and transit accessibility. With Tiramisu - literally Italian for "pick me up" - anyone waiting at a bus stop with a smartphone can see which buses or light rail vehicles are due to arrive next and, thanks to the signals from riders already aboard, get an idea of how long they have to wait. When a rider first activates the app, Tiramisu displays the nearest stops and a list of buses or light rail vehicles that are scheduled to arrive. The list includes arrival times, based either on historical data for that route or on real-time reports from riders. When the desired vehicle arrives, the user indicates the level of "fullness" and then presses a button, allowing their phone to share an ongoing GPS trace with the Tiramisu server. Once aboard, the rider can use Tiramisu to find out which stop is next and to report problems, positive experiences and suggestions.

Thurs 4/9: Guest Lecture by Anind Dey

Description: Discussion of Intelligible machine learning

[Lecture Slides: TBD]

Readings [tentative]:

Brian Y. Lim, Anind K. Dey: Evaluating Intelligibility Usage and Usefulness in a Context-Aware Application. HCI (5) 2013: 92-101

Tues 4/14 9-12: Final Project Planning [Sign up links]

Thurs 4/16: No Class (Carnival)

Tues 4/21: Guest Lecture by Nikola Banovic on Extracting Routines

Thurs 4/23: Final Exam Review Session

Tues 4/28: Final Project Presentations [select timeslots]

Thurs 4/30: Final Project Presentations [select timeslots]

Final Exam: Take Home (to be discussed further in class)

Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77-84.
Eun Kyoung Choe, Nicole B. Lee, Bongshin Lee, Wanda Pratt, Julie A. Kientz, Understanding Quantified-Selfers’ Practices in Collecting and Exploring Personal Data. To Appear (CHI 2014)
Optional: Take a look at the following 'show and tell' talks: http://quantifiedself.com/2013/07/mark-wilson-on-synthesizing-data/ and http://quantifiedself.com/2013/12/chris-bartley-understanding-chronic-fatigue/
