Collecting, Analyzing And Interacting With Data

Class Calendar 2013

Lectures (Tentative Schedule)

Lectures will take place in NSH 3002

Lecture slides are available at https://github.com/jmankoff/data. To easily keep an up-to-date copy, you will need a (free) GitHub account: open the URL above and click 'fork', and from then on it will show up in your repository. GitHub has a GUI that you can download, which will update your copy when you request it. However, you should create a separate place for your own work and only use this to see my version of each project.

Readings are linked off this page or available on Blackboard. Discussion posts are not required for optional readings.

Tues 1/14 Introduction & Overview of Data Science Pipeline

Description: An overview exploring the hype around Data Science, the different perspectives that are needed on a team that works with data, and the pipeline involved in working with data.

[Slides][Slides describing Byte 1]

Homework: Byte 1 Assigned

Thurs 1/16 Scoping Projects; Asking good Questions

Description: How do we decide what questions to ask of the data?

[Slides]

Tues 1/21 Structured vs Unstructured Data

Description: Discussion of properties of data, and a practical overview of XML, JSON, SQL, etc.

[Slides][Slides describing Byte 2]

Readings: Optional: Stonebraker & Hellerstein: What Goes Around Comes Around

(a historical view of different classes of data modeling)

Homework: Byte 1 Due; Byte 2 Assigned

Thurs 1/23 Acquiring Data

Description: Sampling issues; Pros and cons of different sources of data; Practical overview of APIs and OAuth

[Slides]

Tues 1/28 Understanding and Cleaning your Data

Description: The four Cs (Correctness, Coherence, Completeness, and AcCountability); Case studies: Mouse data; Location Data

[Slides][ZipFound.csv][Case Study]

Readings:

McCallum: Bad Data: Chapter 7 [on blackboard]
Dasu: Data Glitches: Monsters in your Data
Optional: Dasu & Loh: Statistical Distortion: Consequences of Data Cleaning

Thurs 1/30 Visualizing and Exploring your Data

Description: Transforming data; Stem & Leaf plots; Boxplots; Histograms and distributions and their implications

[Slides][boxplot_demo.py][Description of Byte 3]

Readings:

Pearson: Mining Imperfect Data: Chapter 1 [on blackboard]
Optional: Gelman: Exploratory Data Analysis for Complex Models (read the article, not just the blog post)

Tues 2/4 Guest Lecture: Golan Levin NOTE: CFA 111

Golan Levin is a professor of Art Practice at CMU. He teaches the class "Interactive Art and Computational Design". He says on his website: "I am an artist and educator living in Pittsburgh. I teach at Carnegie Mellon University, where I also direct the STUDIO for Creative Inquiry, an interdisciplinary arts-research center.

"I create interactive artifacts and experiences with a variety of collaborators. I also blog, tweet, and publish writings.

"Feel free to contact me, or join my low-traffic mailing list."

If you missed this wonderful lecture, you can find more on the visualization page of his website for his parallelism class.

HW: Byte 2 Due date is 2/4 (was incorrect on this page)

HW: Byte 3 Assigned

Thurs 2/6 Information Visualization & Introduction to D3

Description: Overview of key concepts in information visualization and a brief discussion of D3.

[Slides]

Readings:

Ware: Visual Thinking for Design: Chapter 9 (The Dance of Meaning)
Optional: Ware: Visual Thinking for Design: Chapter 1 (Visual Queries)
Optional: Tufte: Beautiful Evidence: The Fundamental Principles of Analytical Design
Optional: D3: Data-Driven Documents: Michael Bostock, Vadim Ogievetsky and Jeffrey Heer

Tues 2/11 Quantified Self: Guest lecture by Anne Wright

Description: Anne Wright is a member of the CREATE Lab and the brains behind fluxstream.org. She was kind enough to share her slides with us, and they are posted on Blackboard.

~~HW: Byte 3 Due; Byte 4 Assigned~~

Readings:

Eun Kyoung Choe, Nicole B. Lee, Bongshin Lee, Wanda Pratt, Julie A. Kientz, Understanding Quantified-Selfers’ Practices in Collecting and Exploring Personal Data. To Appear (CHI 2014)
Optional: Take a look at the following 'show and tell' talks: http://quantifiedself.com/2013/07/mark-wilson-on-synthesizing-data/ and http://quantifiedself.com/2013/12/chris-bartley-understanding-chronic-fatigue/

Thurs 2/13 Mobile Data & Map Data

Description: Description of Byte 4 (Mobile Byte or Map Byte)

Slides: continuation of 2/6 I think; [Slides for Byte 4]

Homework: Byte 3 Due Friday 2/14; Byte 4 Assigned

Tues 2/18 Large-volume Geographic Data: [Guest lecture by Randy Sargent]

Description: I'll discuss some of the approaches to and challenges with large-volume geographic data. I'll be joined by some of our team members to show some interesting examples drawn from census data and satellite imagery. No Slides

Speaker Bio: Randy Sargent holds dual appointments at Carnegie Mellon University and Google. As Visiting Scientist in Google’s Earth Engine team, Randy helps research and develop time-lapse explorable maps, including a recently-released global Landsat timelapse mosaicked from 29 years of Landsats 4, 5, and 7. As Senior Systems Scientist in Carnegie Mellon University's CREATE Lab, Randy works with a team to develop ways to explore and visualize big data – massive time-series data from the BodyTrack self-tracking project, and terapixel-scale zoomable and explorable videos of diverse subjects such as plants growing, or a simulation of the universe from big bang to present.

Prior to CMU and Google, Randy helped develop planetary rover software in NASA Ames’s Intelligent Robotics Group, and founded/co-founded three successful technology companies. Randy received his BS in Computer Science from MIT, and his MS from the MIT Media Lab, where he developed the Programmable Brick, a research prototype for LEGO Mindstorms.

Thurs 2/20 Testing for Differences

Description: Discussion of the t-test, the math and assumptions underlying it, and the process for using it. Also discussed limitations and tests for special situations (multiple comparisons, when assumptions are violated, and so on).

[Slides]

Homework: Byte 4 Due; First Project Assigned
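
As a rough illustration of the t-test discussed in this lecture, the sketch below computes the test statistic by hand using only the Python standard library. It assumes Welch's variant (which does not require equal variances), which may differ from the version shown in the slides. The function name welch_t and the sample values are made up for this example.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for Welch's t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # statistics.variance is the sample variance (it divides by n - 1).
    se_a = statistics.variance(sample_a) / n_a
    se_b = statistics.variance(sample_b) / n_b
    # Difference in means, scaled by the combined standard error.
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical measurements for two groups (not course data).
group_a = [5.1, 4.9, 5.6, 5.8, 5.2, 5.5]
group_b = [4.2, 4.8, 4.5, 4.1, 4.9, 4.4]
t_stat, dof = welch_t(group_a, group_b)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

Turning t and df into a p-value requires the t distribution, which the standard library does not provide; a statistics package would normally be used for that step.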

Tues 2/25 Correlation and Regression

Description: Discussion of correlation and regression, the math and assumptions underlying them, and how to use them. Also discussed limitations and paradoxes (such as Simpson's paradox).

[Slides]
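
To make Simpson's paradox concrete, here is a small sketch with illustrative counts (they are not taken from the course materials). Treatment A has the higher success rate within each subgroup, yet treatment B looks better once the subgroups are pooled, because the two treatments were applied to the subgroups in very different proportions. The names counts, rate, and pooled are hypothetical.

```python
# (successes, trials) for each treatment within each subgroup; illustrative numbers.
counts = {
    "small": {"A": (81, 87), "B": (234, 270)},
    "large": {"A": (192, 263), "B": (55, 80)},
}


def rate(successes, trials):
    """Success rate as a fraction between 0 and 1."""
    return successes / trials


def pooled(treatment):
    """Sum successes and trials for one treatment across all subgroups."""
    successes = sum(counts[group][treatment][0] for group in counts)
    trials = sum(counts[group][treatment][1] for group in counts)
    return successes, trials


# Within every subgroup, A has the higher success rate...
a_wins_each_group = all(
    rate(*counts[group]["A"]) > rate(*counts[group]["B"]) for group in counts
)
# ...but after pooling the subgroups, B has the higher rate.
b_wins_pooled = rate(*pooled("B")) > rate(*pooled("A"))

for group, by_treatment in counts.items():
    for treatment, (successes, trials) in by_treatment.items():
        print(f"{group:>6} {treatment}: {rate(successes, trials):.3f}")
for treatment in ("A", "B"):
    print(f"pooled {treatment}: {rate(*pooled(treatment)):.3f}")
```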

Thurs 2/27 Getting Quality Data: [Guest Lecture by Bill Thies]

Description: The class will focus on “getting quality data”. Bill will touch on response bias, UIs for data collection, some experiences gained from partners (e.g., in http://datadev.acmdev.org/) and also some related work (e.g., DEV 2012 paper on detecting interviewer fabrication of data). Slides are on Blackboard.

Readings: Critical Questions for Big Data, danah boyd & Kate Crawford, Information, Communication & Society, 15:5, 662-679, 2012. [on blackboard]

Bio: I am a researcher in the Technologies for Emerging Markets Group at Microsoft Research India. My research focuses on building appropriate information and communication technologies that contribute to the socio-economic development of low-income communities (ICT4D). This work often encompasses human-computer interaction (HCI), online education, mHealth, crowdsourcing, and other areas. Previously I worked on programming languages and compilers, for multicore processors as well as microfluidic chips. I received all of my degrees from the Massachusetts Institute of Technology, where I completed a Ph.D. in computer science in 2009.

Slides: Posted on Blackboard

Tues 3/4 Classification Basics & Metrics

Description: Discussed the basic process by which classifiers are trained and used, and some of the metrics used to evaluate their success. Talked about the importance of having a train/test set that is separate from the data you experiment on.

[Slides]
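
The following is a minimal sketch of the two ideas in this lecture: holding out a test set that is kept separate from the data you experiment on, and computing a few common evaluation metrics for a binary classifier. It uses only the standard library. The helper names train_test_split and binary_metrics, the 25% test fraction, and the labels are assumptions made for illustration, not the course's own code.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when a class is never predicted or present.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical true labels and classifier predictions on a held-out test set.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)

train, test = train_test_split(range(8))
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible, so the same rows stay in the test set each time the experiment is rerun.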

Thurs 3/6 Project 1 Presentations

3/7-3/14: Spring Break

Tues 3/18 Classification Algorithms

Description: Discussion of Decision Trees, Naïve Bayes, and Regression

[Slides][Slides for Byte 5]

HW: Byte 5 Assigned

Thurs 3/20 Guest Lecture by: Daniel Neill

Description: Daniel Neill is a faculty member at CMU's Heinz College, where he directs the Event and Pattern Detection Laboratory. He is also associated with the Machine Learning Department and Robotics Institute. His research is focused on novel statistical and computational methods for discovery of emerging events and other relevant patterns in complex and massive datasets, applied to real-world policy problems ranging from medicine and public health to law enforcement and security. Slides available on Blackboard.

Readings: Please look over http://www.bizjournals.com/chicago/news/2013/12/26/carnegie-mellon-smells-a-rat-and-chicago-is.html and read New Directions in Artificial Intelligence for Public Health Surveillance. Instead of posting to the discussion board, please prepare for class by working together to edit and fill in this shared document.

Tues 3/25: Infrastructure Issues for Big Data

Description: We talked about infrastructure issues for big data and regression.

[Slides]

Reading: (optional) Getting Started with Google Big Query

Thurs 3/27: Guest Lecture: Aaron Steinfeld (Tiramisu)

Description: The focus of the lecture was on the issues faced when crowdsourcing. Slides available on Blackboard.

Readings: Tomasic et al: Motivating Contribution in a Participatory Sensing System via Quid-Pro-Quo. To Appear in CSCW 2014. [Blackboard] [Discussion of readings]

Bio: Aaron is an associate research professor in the Robotics Institute at Carnegie Mellon and the co-director of the Rehabilitation Engineering Research Center on Accessible Public Transportation. He earned his Ph.D., M.S. and B.S. in industrial & operations engineering from the University of Michigan (1999, 1994 and 1993, respectively) and completed a postdoctoral position at the University of California, Berkeley (2000). Steinfeld’s interest is focused around constrained user interfaces and operator assistance, predominantly in the realms of human-robot interaction, rehabilitation, transportation and intelligent systems. He is interested in how to enable timely and appropriate interaction when interfaces are restricted through design, tasks, the environment, time pressures, and/or user abilities. He works on the Tiramisu project.

Tiramisu Transit is a crowd-powered transit information system developed by researchers to improve users' transit experiences and transit accessibility. With Tiramisu - literally Italian for "pick me up" - anyone waiting at a bus stop with a smartphone can see which buses or light rail vehicles are due to arrive next and, thanks to the signals from riders already aboard, get an idea of how long they have to wait. When a rider first activates the app, Tiramisu displays the nearest stops and a list of buses or light rail vehicles that are scheduled to arrive. The list includes arrival times, based either on historical data for that route or on real-time reports from riders. When the desired vehicle arrives, the user indicates the level of "fullness" and then presses a button, allowing their phone to share an ongoing GPS trace with the Tiramisu server. Once aboard, the rider can use Tiramisu to find out which stop is next and to report problems, positive experiences and suggestions.

HW: Byte 5 Due; Byte 6 Assigned

Tues 4/1: Review; Additional CAP and Regression Discussion

Description: Revisiting CAP in a little more depth and a clearer presentation of regression.

[Slides][Description of Byte 6]

Readings: None

Thurs 4/3: Guest Lecture: Harry Hochheiser

Description: Harry introduced the problem of bad metadata and the role of ontologies in supporting data analysis.

[Slides]

Readings: None

Speaker Bio: My research has covered a range of topics, including human-computer interaction, information visualization, bioinformatics, universal usability, security, privacy, and public policy implications of computing systems. I have published more than 40 peer-reviewed journal and conference papers and two book chapters. At Towson University, I was an investigator on NSF-funded projects in computer security in introductory computer science classes and computational thinking. I am currently working on the development of highly-interactive, user-centered systems for finding and exploring biomedical datasets, with specific applications ranging from basic research data to electronic health records. I have been a member of the Executive Committee of the Association for Computing Machinery's US Public Policy Committee (USACM) since 2004, and I am co-author of Research Methods in Human-Computer Interaction (Wiley, 2010).

HW: Byte 6 Due; Final Project Assigned

Tues 4/8 Final Project Planning (Individual Meetings 9-12)

Thurs 4/10: No Class (Carnival)

Tues 4/15: Natural Language Processing

Description: Discussion of basic principles of Natural Language Processing

[Lecture Slides: TBD]

Thurs 4/17: Dealing with Uncertainty in Data or Social Network Analysis

Description:

[Lecture Slides: TBD]

Readings:

HW:

Tues 4/22: Final Project Discussions (Individual Meetings 9-12)

Thurs 4/24: Review Session (for Final Exam)

Take Home Final Handed Out

Tues 4/29: No Class (Final)

Thurs 5/1: No Class (Final)

Finals Slot (5/12): 5:30-8:30 Final Project Presentations Location: DH 1112
