Lectures (Tentative Schedule, lectures in Blue are tentative)
Lectures will take place in NSH 3002
Lecture slides are available at https://github.com/jmankoff/data. To easily keep an up-to-date copy, create a (free) GitHub account, open the URL above, and click 'fork'; from then on the fork will show up among your repositories. GitHub has a downloadable GUI that will update your copy when you request it. However, you should create a separate place for your own work and use this repository only to see my version of each project.
Readings are linked off this page or available on Blackboard. Discussion posts are not required for optional readings.
Tuesday 1/17: Introduction & Overview of Data Science Pipeline [Jen]
Description: An overview exploring the hype around Data Science, the different perspectives that are needed on a team that works with data, and the pipeline involved in working with data.
Learning Goals:
Homework: Byte 1 Assigned; Byte 3 setup Assignment
Slides: [Introduction]
Reading: the hardest parts of data science.
Thursday 1/19: Scoping Projects; Asking Good Questions & Selecting Data Sources [Nikola]
Description: How do we decide what questions to ask of the data; Pros and cons of different sources of data
Case study based, includes mobile data.
Learning Goals: Learn how to ask a question that can be answered with data and explain how the question being answered affects the rest of the pipeline.
Reading:
- ProactiveTasks: the short of mobile device use sessions. Nikola Banovic, Christina Brant, Jennifer Mankoff, and Anind K. Dey. In Proceedings of the 16th international conference on Human-computer interaction with mobile devices & services (MobileHCI '14). ACM, New York, NY, USA, 243-252. PDF
- Understanding the Challenges of Mobile Phone Usage Data. Karen Church, Denzil Ferreira, Nikola Banovic, and Kent Lyons. In Proceedings of the 17th International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI '15). ACM, New York, NY, USA, 504-514. PDF
Slides: [Asking Questions]
Tuesday 1/24: Structured vs Unstructured Data [Jen]
Description: Discussion of properties of data; practical overview of XML, JSON, SQL, etc.; practical overview of APIs and OAuth.
Learning Goals:
Readings: Required: Google's Introduction to (semi)-structured data.
Required: [on Canvas] Chapter 1 of Data Modeling Essentials (read sections 1.3, 1.4, 1.6 & 1.11)
Optional: Stonebraker & Hellerstein: What Goes Around Comes Around Pages 1-2 (sections I and II); Section V (The Entity-Relationship Era); IX (The Object-Relational Era); X (Semi-Structured Data) (a historical view of different classes of data modeling)
Slides: [Data Storage & Data Structures]
Homework: Byte 1 Due; Byte 3 install due; Byte 2 Assigned
Thursday 1/26: Theory and Practice of Data Cleaning [Nikola]
Description: The four Cs (Correctness, Coherence, Completeness, and AcCountability); Practical overview of survey question design issues
Learning Goals: Understand and describe the four Cs of data quality, and explain causes of and fixes for quality issues for each of them.
Readings: Reading Question: List and briefly discuss one example of how bad data can affect the data pipeline.
- McCallum: Bad Data: Chapter 7 [on Canvas]
- Sean Kandel, Andreas Paepcke, Joseph Hellerstein, and Jeffrey Heer. 2011. Wrangler: interactive visual specification of data transformation scripts. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '11). ACM, New York, NY, USA, 3363-3372. PDF
- Optional: Dasu: Data Glitches: Monsters in your Data
- Optional: Dasu & Loh: Statistical Distortion: Consequences of Data Cleaning
- Optional: Raman, V., & Hellerstein, J. M. (2001, September). Potter's wheel: An interactive data cleaning system. In VLDB (Vol. 1, pp. 381-390).
Homework: Byte 1 Peer Grading Due
Slides: [Data Quality]
Tuesday 1/31: Data Sampling: Acquiring the right data
Description: Discussion of data sampling issues
Slides: [github]
Thursday 2/2: Exploring Imperfect Data: Plots and Distributions [Jen]
Description: Transforming data; Stem & Leaf plots; Boxplots; Histograms and distributions and their implications
Readings: Reading Question: When is it valuable to read raw data without plots, and how can plotting your data help you identify which data to read?
Pearson: Mining Imperfect Data: Chapter 1
Optional: Gelman: Exploratory Data Analysis for Complex Models (read the article, discussions of article optional)
Slides: [Exploratory Visualization]
Tuesday 2/7: Big Data of One [Nikola]
Description: Big Data and how it relates to Big Data of One--people's personal data collected using mobile devices and other forms of sensed data.
Learning Goals: Be able to define Big Data and list major challenges in collecting and consuming Big Data.
Reading: Reading Question: Pick and briefly discuss one challenge that people in the quantified-self movement face when trying to understand their data.
Slides: [Big Data & Human Issues]
Homework: Byte 2 Due
Thursday 2/9: Information Visualization Overview [Nikola]
Description: Overview of key concepts in information visualization; Testing visualizations; StepGreen Case Study; Description of Byte 3 (Mobile/Visualization Byte)
Learning Goals:
Readings: Reading Question: Post a link or screenshot of a data visualization, and analyze how it addresses Tufte's six principles.
- Tufte: Beautiful Evidence: The Fundamental Principles of Analytical Design [Canvas]
- S. Carpendale, "Evaluating Information Visualizations", in Information Visualization: Human-Centered Issues and Perspectives, (Editors: A. Kerren, J. Stasko, J.-D. Fekete, C. North), Springer, 2008, pp. 19-45. [Canvas]
- Optional: Mankoff, J., Fussell, S. R., Dillahunt, T., Glaves, R., Grevet, C., Johnson, M., ... & Setlock, L. D. (2010, May). StepGreen.org: Increasing Energy Saving Behaviors via Social Networks. In ICWSM.
Slides: [Overview of Info Viz Case Study on Location]
Homework: Byte 2 Peer Grading Due; Byte 3 (Mobile/Visualization Byte) Assigned
Tuesday 2/14: Perception and Information Visualization [Jen]; Guide to Byte 3 [Nikola]
Description: Overview of human perceptual factors affecting information visualization and a brief discussion of D3
Readings: Reading Question: Although pattern detection is typically simpler with a graphical interface, are we missing out on interesting numerical relationships by allowing both the machine and the human analyst to focus only on what they do "best"?
- Ware: Visual Thinking for Design: Chapter 9 (The Dance of Meaning) [on Canvas]
- Satyanarayan, Arvind, et al. "Vega-lite: A grammar of interactive graphics." IEEE Transactions on Visualization and Computer Graphics 23.1 (2017): 341-350.
- Optional: Ware: Visual Thinking for Design: Chapter 1 (Visual Queries)
Slides: [Perception & Info Viz][Byte 3 Overview]
Thursday 2/16: The Role of Narrative in Visualization [Jen]
Readings:
- Hullman, J., & Diakopoulos, N. (2011). Visualization rhetoric: Framing effects in narrative visualization. Visualization and Computer Graphics, IEEE Transactions on, 17(12), 2231-2240.
- E. Segel and J. Heer, "Narrative Visualization: Telling Stories with Data", IEEE Trans. on Visualization and Computer Graphics, Vol. 16, No. 6, Nov.-Dec. 2010, pp. 1139-1148
- Optional: J.S. Yi, Y.A. Kang, J.T. Stasko and J.A. Jacko, "Toward a Deeper Understanding of the Role of Interaction in Information Visualization", IEEE Transactions on Visualization and Computer Graphics, Vol. 13, No. 6, Nov/Dec 2007, pp. 1224-1231.
Potential Reading Questions:
- In what ways can factors external to the visualization itself, such as internalized knowledge and conventions at the individual and community level, interact with the rhetorical strategies used in a narrative visualization to influence interpretation?
- How do communicative and explorative rhetorical strategies effectively work together in a narrative visualization?
- Section 2.3 in the Hullman paper mentions how subtle changes in framing may influence, or otherwise solicit, a particular opinion from the user. Can you find any examples of visualizations that do this?
Slides: [Narrative Visualization Design]
Tuesday 2/21: Byte 3 Help Day [Nikola]
Description: In class help with Byte 3.
Readings (in lieu of Big Data Quality and Sampling):
- Fisher, D., DeLine, R., Czerwinski, M., & Drucker, S. (2012). Interactions with big data analytics. interactions, 19(3), 50-59.
- Morstatter, F., Pfeffer, J., Liu, H., & Carley, K. M. (2013). Is the sample good enough? comparing data from twitter's streaming api with twitter's firehose. arXiv preprint arXiv:1306.5204.
- (optional) Getting Started with Google Big Query
Slides: [Byte 3 Part 1] [Byte 3 Part 2]
Thursday 2/23: Visualizing Big Data [Jen]
Description: Discussion about Visualization of Big Data
Readings:
- (optional) Liu, Z., Jiang, B., & Heer, J. (2013, June). imMens: Real‐time Visual Querying of Big Data. In Computer Graphics Forum (Vol. 32, No. 3pt4, pp. 421-430). Blackwell Publishing Ltd.
- (optional) A Review of Overview+Detail, Zooming, and Focus+Context Interfaces. Andy Cockburn, Amy Karlson, and Benjamin B. Bederson
Slides: [Visualizing Big Data]
Homework: Byte 3 due; midterm project assigned here [requires iterative design]
Homework: Byte 3 Peer Grading Due
Thursday 3/2: Iterative Design [Nikola]
Description: Discussion of HCI principles, rapid prototyping, and getting feedback from end users.
Readings:
- (Required) Buxton, Bill. Sketching user experiences: getting the design right and the right design. Morgan Kaufmann, 2010. [Canvas]
- (Required) Nielsen, Jakob. "Iterative user-interface design." Computer 26, no. 11 (1993): 32-41. https://www.nngroup.com/articles/iterative-design/
- (Optional) Nielsen, Jakob. "Discount usability: 20 years." Jakob Nielsen's Alertbox. https://www.nngroup.com/articles/discount-usability-20-years/
Tuesday 3/7: Guest Lecture (Medical Informatics, Adam Perer)
Friday 3/10-Sunday 3/19 Spring Break (No Classes)
Tuesday 3/21: Classification Basics & Algorithms [Jen]
Description: Discussed the basic process by which classifiers are trained and used, and the importance of having a train/test set that is separate from the data you experiment on. Mentioned accuracy. Introduced some algorithms used in Byte 4 (including algorithms ultimately useful with larger data sets).
Readings:
Homework: Byte 4 Assigned; Discuss Byte 4 (Interactive Machine Learning)
Slides: [github]
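A minimal sketch of the held-out test set idea from the description above, using only the Python standard library. The toy data, the 80/20 split, the fixed seed, and the helper names (split_train_test, train_majority_classifier, accuracy) are illustrative assumptions, not course materials or the Byte 4 setup.

```python
import random
from collections import Counter


def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and hold out test_fraction of them for testing."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def train_majority_classifier(train_rows):
    """'Train' by remembering the most common label in the training rows."""
    majority_label = Counter(label for _, label in train_rows).most_common(1)[0][0]
    return lambda features: majority_label


def accuracy(classifier, rows):
    """Fraction of rows whose label the classifier predicts correctly."""
    return sum(classifier(x) == y for x, y in rows) / len(rows)


# Hypothetical toy data: (feature, label) pairs.
data = [(x, "high" if x >= 3 else "low") for x in range(10)]
train, test = split_train_test(data)
model = train_majority_classifier(train)  # only ever sees the training rows
print("held-out accuracy:", accuracy(model, test))
```

The majority-class model is deliberately trivial; the point is only that the rows used for evaluation are never seen during training.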
Thursday 3/23: Midterm Project Presentations
Tuesday 3/28: Classification Metrics & Practical Null Hypothesis Testing [Nikola]
Description: Discussion about how to compare algorithms and what metrics to use (accuracy, precision and recall, kappa, f-score). Introduce practical null hypothesis testing (e.g., t-tests) as a rough check on whether differences are real.
Learning goals: Be able to choose the best algorithm that will generalize to unseen data.
Readings: No readings for this lecture.
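A rough sketch of how the metrics named in the description (accuracy, precision and recall, kappa, f-score) can be computed for a binary classifier from a confusion matrix. The labels and the function names below are hypothetical and chosen only for illustration; kappa here is Cohen's kappa, which is one common reading of "kappa".

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (tp, fp, fn, tn) for binary labels."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(actual, predicted, positive=1):
    """Compute accuracy, precision, recall, F1 and Cohen's kappa."""
    tp, fp, fn, tn = confusion_counts(actual, predicted, positive)
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Chance agreement: probability that prediction and truth match by luck,
    # given how often each side says "positive" and "negative".
    p_chance = (((tp + fp) / n) * ((tp + fn) / n)
                + ((fn + tn) / n) * ((fp + tn) / n))
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance != 1 else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "kappa": kappa}


# Hypothetical labels: 4 positives and 6 negatives, with two mistakes.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
for name, value in classification_metrics(actual, predicted).items():
    print(f"{name}: {value:.3f}")
```

Kappa is lower than accuracy here because it discounts the agreement a classifier would reach by chance given the class balance.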
Thursday 3/30: Usable ML [Jen]
Slides: [Usable ML Slides: Blackboard]
Readings: All optional (since I'm so late :):
Slides: [GitHub]
Tuesday 4/4: Classification and Regression Algorithms and Classification of Big Data [Nikola]
Description: Overview of different algorithms and their applications. Considerations for classification of Big Data.
Thursday 4/6: Guest Lecture: Mayank Goel
Homework: Byte 4 Due; Final Projects Assigned.
Reading: de Greef, Lilian, et al. "Bilicam: using mobile phones to monitor newborn jaundice." Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing. ACM, 2014.
Tuesday 4/11: Integrating Classification into Interactive Systems [Nikola]
Description: Getting labels. Making predictions. Assessing accuracy over time. Real world prediction problems.
[Slides]
Homework: Final Project Proposals Due on paper in class
Thursday 4/13: Final Project Meetings [Sign Up]
Tuesday 4/18: Causality, Bayesian Inference & Statistical Hypothesis Testing [Nikola]
Description: Discussed the overall difference between frequency-based hypothesis testing and process-based (Bayesian) hypothesis testing, along with limitations and tests for special situations (multiple comparisons, violated assumptions, and so on). Also discussed paradoxes such as Simpson's paradox.
Readings:
Optional Readings:
- Nuzzo, R. (2014). Statistical errors. Nature, 506(13), 150-152.
- Hart, A. (2000). Towards better research: a discussion of some common mistakes in statistical analyses. Complementary therapies in medicine, 8(1), 37-42. [ON BLACKBOARD]
- Colquhoun, D. (2014). An investigation of the false discovery rate and the misinterpretation of p-values. Royal Society Open Science, 1(3), 140216.
- Ioannidis, J. P. (2005). Why most published research findings are false. PLoS medicine, 2(8), e124.
- http://www.nature.com/news/statistics-p-values-are-just-the-tip-of-the-iceberg-1.17412?WT.ec_id=NATURE-20150430
- Not your median patient: How a climate scientist faced cancer (John Ungar Zussman, August 11, 2010)
HW: Project Part I assigned
Slides: [github]
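A small sketch of Simpson's paradox, mentioned in the description above. The counts follow the often-quoted kidney-stone treatment example and should be treated as illustrative; the variable and function names are assumptions made for this sketch and do not come from the lecture slides.

```python
# (successes, patients) per treatment within each subgroup.
groups = {
    "small stones": {"A": (81, 87), "B": (234, 270)},
    "large stones": {"A": (192, 263), "B": (55, 80)},
}


def success_rate(successes, total):
    """Proportion of successful outcomes."""
    return successes / total


def pooled(groups, treatment):
    """Add up successes and patients for one treatment across all subgroups."""
    successes = sum(g[treatment][0] for g in groups.values())
    total = sum(g[treatment][1] for g in groups.values())
    return successes, total


for name, g in groups.items():
    print(f"{name}: A={success_rate(*g['A']):.3f}  B={success_rate(*g['B']):.3f}")

print(f"pooled: A={success_rate(*pooled(groups, 'A')):.3f}  "
      f"B={success_rate(*pooled(groups, 'B')):.3f}")
```

Treatment A has the higher success rate inside each subgroup, yet B looks better once the subgroups are pooled, because A was applied mostly to the harder (large stone) cases.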
Thursday 4/20: No Class (Carnival)
Tuesday 4/25: Finish Causality, further discussion of Bayesian Inference; Introduction of Regression [Nikola]
Description: Discussion of causality and regression, the math and assumptions underlying regression, and how to use it.
Readings:
Slides: [github]
Homework: Byte 4 (Machine Learning with Big Data) due. Discussion of Final Project.
Tuesday 5/2: Final Exam Review
Finals Period: Final Project Presentations
Things cut from the class:
- Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77-84.
- Wang, R., Chen, F., Chen, Z., Li, T., Harari, G., Tignor, S., ... & Campbell, A. T. (2014, September). Studentlife: assessing mental health, academic performance and behavioral trends of college students using smartphones. In Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing (pp. 3-14). ACM. [Video]
- B. Lee, C. Plaisant, C. Sims Parr, J.-D. Fekete, N. Henry, "Task Taxonomy for Graph Visualization", Proc. of BELIV '06, April '06, pp. 1-5.
- A. Perer, B. Shneiderman, "Balancing Systematic and Flexible Exploration of Social Networks," IEEE Trans. on Visualization and Computer Graphics, Vol. 12, No. 5, Sep.-Oct. 2006, pp. 693-700
- F. Viegas, S. Golder, and J. Donath, "Visualizing Email Content: Portraying Relationships from Conversational Histories", Proceedings of CHI 2006, Montreal, Canada, April 2006, pp. 979-988.
- M. Wattenberg and J. Kriss, "Designing for Social Data Analysis," IEEE Transactions on Visualization and Computer Graphics Vol. 12, No. 4, Jul.-Aug. 2006, pp. 549-557.
- http://v.isits.in/