Class Calendar 2013
Lectures (Tentative Schedule)
Lectures will take place in NSH 3002
Lecture slides are available at https://github.com/jmankoff/data. To keep an up-to-date copy easily, create a free GitHub account, open the URL above, and click 'fork'; from then on the repository will show up in your account. GitHub has a GUI client that you can download, which will update your copy when you request it. However, you should create a separate place for your own work and use this repository only to see my version of each project.
Readings are linked off this page or available on Blackboard. Discussion posts are not required for optional readings.
Tues 1/14 Introduction & Overview of Data Science Pipeline
Description: An overview exploring the hype around Data Science, the different perspectives that are needed on a team that works with data, and the pipeline involved in working with data.
[Slides][Slides describing Byte 1]
Homework: Byte 1 Assigned
Thurs 1/16 Scoping Projects; Asking good Questions
Description: How do we decide what questions to ask of the data?
Tues 1/21 Structured vs Unstructured Data
Description: Discussion of properties of data, and a practical overview of XML, JSON, SQL, etc.
[Slides][Slides describing Byte 2]
Readings: Optional: Stonebraker & Hellerstein: What Goes Around Comes Around
(a historical view of different classes of data modeling)
Homework: Byte 1 Due; Byte 2 Assigned
Thurs 1/23 Acquiring Data
Description: Sampling issues; Pros and cons of different sources of data; Practical overview of APIs and OAuth
Tues 1/28 Understanding and Cleaning your Data
Description: The four Cs (Correctness, Coherence, Completeness, and AcCountability); Case studies: Mouse data; Location Data
[Slides][ZipFound.csv][Case Study]
Readings:
- McCallum: Bad Data: Chapter 7 [on blackboard]
- Dasu: Data Glitches: Monsters in your Data
- Optional: Dasu & Loh: Statistical Distortion: Consequences of Data Cleaning
Thurs 1/30 Visualizing and Exploring your Data
Description: Transforming data; Stem & Leaf plots; Boxplots; Histograms and distributions and their implications
[Slides][boxplot_demo.py][Description of Byte 3]
Readings:
- Pearson: Mining Imperfect Data: Chapter 1 [on blackboard]
- Optional: Gelman: Exploratory Data Analysis for Complex Models (read the article, not just the blog post)
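As a small illustration of what a boxplot summarizes, here is a minimal sketch that computes the five-number summary and flags points outside the whiskers. It is not taken from boxplot_demo.py; the function names, the "inclusive" quartile method, and the 1.5 x IQR whisker rule are assumptions chosen for illustration (quartile conventions differ between tools).

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the numbers a boxplot draws."""
    data = sorted(values)
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR beyond the quartiles."""
    _, q1, _, q3, _ = five_number_summary(values)
    spread = q3 - q1  # interquartile range
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


if __name__ == "__main__":
    sample = list(range(1, 10)) + [100]  # made-up data with one extreme value
    print(five_number_summary(sample))
    print(iqr_outliers(sample))
```

Comparing the summary with and without the extreme value shows why the median and quartiles are more robust than the mean when the data contain glitches.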
Tues 2/4 Guest Lecture: Golan Levin NOTE: CFA 111
Golan Levin is a professor of Art Practice at CMU. He teaches the class "Interactive Art and Computational Design". He says on his website: "I am an artist and educator living in Pittsburgh. I teach at Carnegie Mellon University, where I also direct the STUDIO for Creative Inquiry, an interdisciplinary arts-research center.
I create interactive artifacts and experiences with a variety of collaborators. I also blog, tweet, and publish writings.
Feel free to contact me, or join my low-traffic mailing list."
If you missed this wonderful lecture, you can find more on the visualization page of his website for his parallelism class.
HW: Byte 2 Due date is 2/4 (was incorrect on this page)
HW: Byte 3 Assigned
Thurs 2/6 Information Visualization & Introduction to D3
Description: Overview of key concepts in information visualization and a brief discussion of D3
Readings:
- Ware: Visual Thinking for Design: Chapter 9 (The Dance of Meaning)
- Optional: Ware: Visual Thinking for Design: Chapter 1 (Visual Queries)
- Optional: Tufte: Beautiful Evidence: The Fundamental Principles of Analytical Design
- Optional: D3: Data-Driven Documents: Michael Bostock, Vadim Ogievetsky and Jeffrey Heer
Tues 2/11 Quantified Self: Guest lecture by Anne Wright
Description: Anne Wright is a member of the CREATE lab and the brains behind fluxstream.org. She was kind enough to share her slides with us, and they are posted on Blackboard.
HW: Byte 3 Due; Byte 4 Assigned
Readings:
- Eun Kyoung Choe, Nicole B. Lee, Bongshin Lee, Wanda Pratt, Julie A. Kientz, Understanding Quantified-Selfers’ Practices in Collecting and Exploring Personal Data. To Appear (CHI 2014)
- Optional: Take a look at the following 'show and tell' talks: http://quantifiedself.com/2013/07/mark-wilson-on-synthesizing-data/ and http://quantifiedself.com/2013/12/chris-bartley-understanding-chronic-fatigue/
Thurs 2/13 Mobile Data & Map Data
Description: Description of Byte 4 (Mobile Byte or Map Byte)
Slides: continuation of 2/6 I think; [Slides for Byte 4]
Homework: Byte 3 Due Friday 2/14; Byte 4 Assigned
Tues 2/18 Large-volume Geographic Data: [Guest lecture by Randy Sargent]
Description: I'll discuss some of the approaches to and challenges with large-volume geographic data. I'll be joined by some of our team members to show some interesting examples drawn from census data and satellite imagery. No Slides
Speaker Bio: Randy Sargent holds dual appointments at Carnegie Mellon University and Google. As Visiting Scientist in Google’s Earth Engine team, Randy helps research and develop time-lapse explorable maps, including a recently-released global Landsat timelapse mosaicked from 29 years of Landsats 4, 5, and 7. As Senior Systems Scientist in Carnegie Mellon University's CREATE Lab, Randy works with a team to develop ways to explore and visualize big data – massive time-series data from the BodyTrack self-tracking project, and terapixel-scale zoomable and explorable videos of diverse subjects such as plants growing, or a simulation of the universe from big bang to present.
Prior to CMU and Google, Randy helped develop planetary rover software in NASA Ames’s Intelligent Robotics Group, and founded/co-founded three successful technology companies. Randy received his BS in Computer Science from MIT, and his MS from the MIT Media Lab, where he developed the Programmable Brick, a research prototype for LEGO Mindstorms.
Thurs 2/20 Testing for Differences
Description: Discussion of the t-test, the math and assumptions underlying it, and the process for using it. Also discussed limitations and tests for special situations (multiple comparisons, when assumptions are violated, and so on).
Homework: Byte 4 Due; First Project Assigned
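To make the mechanics of the test concrete, here is a minimal sketch of a two-sample t statistic. It uses Welch's variant (which does not assume equal variances) as one example of a test for a special situation; that choice, the function name, and the made-up samples are assumptions for illustration and may differ from what was shown in lecture. Turning t and the degrees of freedom into a p-value needs a t-distribution table or a statistics library, which is left out here.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t, degrees_of_freedom) for Welch's two-sample t-test."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # Sample variances (n - 1 in the denominator).
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    n_a, n_b = len(sample_a), len(sample_b)

    # Squared standard error contributed by each group.
    se2_a, se2_b = var_a / n_a, var_b / n_b

    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df


if __name__ == "__main__":
    group_a = [1, 2, 3, 4, 5]  # made-up measurements
    group_b = [2, 3, 4, 5, 6]
    print(welch_t(group_a, group_b))
```

The statistic is the difference in means scaled by its standard error, so a large |t| requires either a large difference or little noise relative to the sample sizes.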
Tues 2/25 Correlation and Regression
Description: Discussion of correlation and regression, the math and assumptions underlying them, and how to use them. Also discussed limitations and paradoxes (such as Simpson's paradox).
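Simpson's paradox is easiest to see with numbers. The sketch below computes Pearson's r by hand on two made-up groups: within each group the correlation is perfectly negative, yet pooling the groups gives a positive correlation. The data and the function name are invented for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Two hypothetical subgroups, each with a downward trend.
group_a_x, group_a_y = [1, 2, 3], [3, 2, 1]
group_b_x, group_b_y = [6, 7, 8], [8, 7, 6]

if __name__ == "__main__":
    print("group A:", pearson_r(group_a_x, group_a_y))
    print("group B:", pearson_r(group_b_x, group_b_y))
    # Pooling hides the grouping variable and reverses the sign.
    print("pooled: ", pearson_r(group_a_x + group_b_x, group_a_y + group_b_y))
```

The reversal happens because group membership is related to both x and y; ignoring it mixes the between-group trend with the within-group trend.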
Thurs 2/27 Getting Quality Data: [Guest Lecture by Bill Thies]
Description: The class will focus on “getting quality data”. Bill will touch on response bias, UIs for data collection, some experiences gained from partners (e.g., in http://datadev.acmdev.org/) and also some related work (e.g., DEV 2012 paper on detecting interviewer fabrication of data). Slides are on Blackboard.
Readings: Critical Questions for Big Data, danah boyd & Kate Crawford, Information, Communication & Society, 15:5, 662-679, 2012. [on blackboard]
Bio: I am a researcher in the Technologies for Emerging Markets Group at Microsoft Research India. My research focuses on building appropriate information and communication technologies that contribute to the socio-economic development of low-income communities (ICT4D). This work often encompasses human-computer interaction (HCI), online education, mHealth, crowdsourcing, and other areas. Previously I worked on programming languages and compilers, for multicore processors as well as microfluidic chips. I received all of my degrees from the Massachusetts Institute of Technology, where I completed a Ph.D. in computer science in 2009.
Slides: Posted on Blackboard
Tues 3/4 Classification Basics & Metrics
Description: Discussed the basic process by which classifiers are trained and used, and some of the metrics used to evaluate their success. Talked about the importance of having a train/test set that is separate from the data you experiment on.
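As a rough illustration of the two ideas in this lecture, the sketch below holds out a test set before any experimentation and computes a few common metrics for a binary classifier. The function names, the 25% test fraction, and the labels are assumptions made up for this example; no particular classifier is implied.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle rows and return (train, test); the test set is held out."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for labels coded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


if __name__ == "__main__":
    train, test = train_test_split(range(8))
    print(len(train), len(test))
    # Made-up true labels and predictions for eight examples.
    print(binary_metrics([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0]))
```

Metrics should be reported on the held-out rows only; tuning against the test set makes the reported numbers optimistic.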
Thurs 3/6 Project 1 Presentations
3/7-3/14: Spring Break
Tues 3/18 Classification Algorithms
Description: Discussion of Decision Trees, Naïve Bayes, and Regression
HW: Byte 5 Assigned
Thurs 3/20 Guest Lecture: Daniel Neill
Description: Daniel Neill is a faculty member at CMU's Heinz College, where he directs the Event and Pattern Detection Laboratory. He is also associated with the Machine Learning Department and Robotics Institute. His research is focused on novel statistical and computational methods for discovery of emerging events and other relevant patterns in complex and massive datasets, applied to real-world policy problems ranging from medicine and public health to law enforcement and security. Slides available on Blackboard.
Readings: Please look over http://www.bizjournals.com/chicago/news/2013/12/26/carnegie-mellon-smells-a-rat-and-chicago-is.html and read New Directions in Artificial Intelligence for Public Health Surveillance. Instead of posting to the discussion board, please prepare for class by working together to edit and fill in this shared document.
Tues 3/25: Infrastructure Issues for Big Data
Description: We talked about infrastructure issues for big data and regression.
Reading: (optional) Getting Started with Google BigQuery
Thurs 3/27: Guest Lecture: Aaron Steinfeld (Tiramisu)
Description: The focus of the lecture was on the issues faced when crowdsourcing. Slides available on Blackboard.
Readings: Tomasic et al: Motivating Contribution in a Participatory Sensing System via Quid-Pro-Quo. To Appear in CSCW 2014. [Blackboard] [Discussion of readings]
Bio: Aaron is an associate research professor in the Robotics Institute at Carnegie Mellon and the co-director of the Rehabilitation Engineering Research Center on Accessible Public Transportation. He earned his Ph.D., M.S. and B.S. in industrial & operations engineering from the University of Michigan (1999, 1994 and 1993, respectively) and completed a postdoctoral position at the University of California, Berkeley (2000). Steinfeld’s interest is focused around constrained user interfaces and operator assistance, predominantly in the realms of human-robot interaction, rehabilitation, transportation and intelligent systems. He is interested in how to enable timely and appropriate interaction when interfaces are restricted through design, tasks, the environment, time pressures, and/or user abilities. He works on the Tiramisu project.
Tiramisu Transit is a crowd-powered transit information system developed by researchers to improve users' transit experiences and transit accessibility. With Tiramisu - literally Italian for "pick me up" - anyone waiting at a bus stop with a smartphone can see which buses or light rail vehicles are due to arrive next and, thanks to the signals from riders already aboard, get an idea of how long they have to wait. When a rider first activates the app, Tiramisu displays the nearest stops and a list of buses or light rail vehicles that are scheduled to arrive. The list includes arrival times, based either on historical data for that route or on real-time reports from riders. When the desired vehicle arrives, the user indicates the level of "fullness" and then presses a button, allowing their phone to share an ongoing GPS trace with the Tiramisu server. Once aboard, the rider can use Tiramisu to find out which stop is next and to report problems, positive experiences and suggestions.
HW: Byte 5 Due; Byte 6 Assigned
Tues 4/1: Review; Additional CAP and Regression Discussion
Description: Revisiting CAP in a little more depth and a clearer presentation of regression.
[Slides][Description of Byte 6]
Readings: None
Thurs 4/3: Guest Lecture: Harry Hochheiser
Description: Harry introduced the problem of bad metadata and the role of ontologies in supporting data analysis.
Readings: None
Speaker Bio: My research has covered a range of topics, including human-computer interaction, information visualization, bioinformatics, universal usability, security, privacy, and public policy implications of computing systems. I have published more than 40 peer-reviewed journal and conference papers and two book chapters. At Towson University, I was an investigator on NSF-funded projects in computer security in introductory computer science classes and computational thinking. I am currently working on the development of highly-interactive, user-centered systems for finding and exploring biomedical datasets, with specific applications ranging from basic research data to electronic health records. I have been a member of the Executive Committee of the Association for Computing Machinery's US Public Policy Committee (USACM) since 2004, and I am co-author of Research Methods in Human-Computer Interaction (Wiley, 2010).
HW: Byte 6 Due; Final Project Assigned
Tues 4/8 Final Project Planning (Individual Meetings 9-12)
Thurs 4/10: No Class (Carnival)
Tues 4/15: Natural Language Processing
Description: Discussion of basic principles of Natural Language Processing
[Lecture Slides: TBD]
Thurs 4/17: Dealing with Uncertainty in Data or Social Network Analysis
Description:
[Lecture Slides: TBD]
Readings:
HW: