Lectures will take place in NSH 3002
Lecture slides are available at https://github.com/jmankoff/data (to easily keep an up-to-date copy, create a free GitHub account, open the URL above, and click 'Fork'; from then on the repository will show up in your account. GitHub also has a GUI client you can download that will pull updates when you request them. However, you should create a separate place for your own work and use this repository only to see my version of each project).
Readings are linked off this page or available on Blackboard. Discussion posts are not required for optional readings.
Description: An overview exploring the hype around Data Science, the different perspectives that are needed on a team that works with data, and the pipeline involved in working with data.
Homework: Byte 1 assigned; Byte 3 setup assigned
Slides: [github]
Reading: the hardest parts of data science.
fine
Slides: [github]
fine
Description: Discussion of properties of data; practical overview of XML/JSON/SQL/etc.; practical overview of APIs and OAuth
Readings:
Required: Stonebraker & Hellerstein, "What Goes Around Comes Around": pages 1-2 (Sections I and II), Section V (The Entity-Relationship Era), Section IX (The Object-Relational Era), and Section X (Semi-Structured Data). (A historical view of different classes of data modeling.)
Slides: [github]
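To make the JSON/API portion concrete, here is a minimal sketch of fetching and parsing JSON in Python. The URL and field names are hypothetical placeholders rather than an API used in the course (a real API would typically also require OAuth or an API key):

```python
import json
import urllib.request

# Hypothetical endpoint; substitute any JSON-returning API you have access to.
URL = "https://api.example.com/items?limit=5"

with urllib.request.urlopen(URL) as response:
    data = json.loads(response.read().decode("utf-8"))

# JSON maps directly onto Python dicts and lists, so once parsed
# you can walk the structure like any other Python object.
for item in data.get("items", []):
    print(item.get("name"), item.get("value"))
```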
Wednesday 1/20: Byte 1 Due; Byte 3 install due;
where to introduce? before viz
Description: Transforming data; stem-and-leaf plots; boxplots; histograms and distributions and their implications
Readings:
Homework: Byte 2 Assigned
Slides: [github]
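As a quick illustration of the plot types listed above, a minimal matplotlib sketch using made-up data (not course data):

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up sample: 200 values drawn from a right-skewed distribution.
rng = np.random.default_rng(0)
values = rng.lognormal(mean=0.0, sigma=0.5, size=200)

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(8, 3))

# A histogram shows the shape of the distribution (here, skewed right).
ax_hist.hist(values, bins=20)
ax_hist.set_title("Histogram")

# A boxplot summarizes the same data: median, quartiles, and outliers.
ax_box.boxplot(values)
ax_box.set_title("Boxplot")

plt.tight_layout()
plt.show()
```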
Friday 1/22: Byte 1 Peer Grading Due
xx where to introduce?
Description: The four Cs (Correctness, Coherence, Completeness, and AcCountability); Practical overview of survey question design issues
Readings:
Slides: [github]
For suggested "reading":
* I will mention (and play a demo video of) a tool called OpenRefine, a free, open-source tool for both cleaning and integration.
* If I have time, I hope to also show a video of "Wrangler" from Jeff Heer's group: http://vis.stanford.edu/papers/wrangler
That paper is good to read regardless: http://dl.acm.org/citation.cfm?id=1979444
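OpenRefine and Wrangler are interactive tools, but the same completeness/correctness checks can be scripted. A minimal pandas sketch, with a hypothetical file and hypothetical column names:

```python
import pandas as pd

# Hypothetical survey export; substitute your own file and columns.
df = pd.read_csv("survey_responses.csv")

# Completeness: how many values are missing in each column?
print(df.isna().sum())

# Correctness: flag ages outside a plausible range.
bad_ages = df[(df["age"] < 0) | (df["age"] > 120)]
print(f"{len(bad_ages)} rows with implausible ages")

# Coherence: the same category spelled several ways is a common problem
# that tools like OpenRefine cluster automatically.
print(df["gender"].str.strip().str.lower().value_counts())
```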
Slides: See blackboard
Description: Overview of key concepts in information visualization; Testing visualizations; StepGreen Case Study
Readings:
Reading question: Post a link or screenshot of a data visualization, and analyze how it addresses Tufte's six principles.
Slides: [github]
Description: Overview of human perceptual factors affecting information visualization and a brief discussion of D3
Readings:
Possible reading question: Although pattern detection is typically simpler with a graphical interface, are we missing out on interesting numerical relationships by allowing both the machine and the human analyst to focus only on what they do "best"?
Readings:
Optional:
Potential Reading Questions:
Homework: Byte 3 due; Byte 4 (visualizing your data) assigned
Slides: [github]
xx maybe do a midterm project here?
Description: Case study of data cleaning; Discussion of data sampling issues
Slides: [github]
Description: Discussion of what can be accomplished with mobile data collection and other forms of sensed data; description of Byte 3 (Mobile Byte).
Reading: ProactiveTasks: the short of mobile device use sessions. Nikola Banovic, Christina Brant, Jennifer Mankoff, and Anind K. Dey. In Proceedings of the 16th international conference on Human-computer interaction with mobile devices & services (MobileHCI '14). ACM, New York, NY, USA, 243-252. PDF
maybe add data streams
Description: Infrastructure issues for big data; Sampling and Quality
Readings:
Homework: Byte 2 due; Byte 3 Mobile Assigned
Slides: [github]
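One concrete technique for sampling from data that arrives as a stream or is too large to hold in memory is reservoir sampling; a minimal sketch, not necessarily the exact algorithm covered in lecture:

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Replace an existing element with probability k / (i + 1),
            # which keeps every item equally likely to end up in the sample.
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# Example: sample 5 values from a "stream" of a million integers.
print(reservoir_sample(range(1_000_000), 5))
```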
Readings: None
Title: Advanced common-sense: Making sense of data that is invisible, ugly, or incomplete
This talk will be informal, and based on my own experience making sense of data from large-scale education applications. Through this talk, I want to remind you of data issues that are not emphasized in traditional data-processing pipelines; e.g., how do you estimate data that is hard to get? How do you sanity-check data or run simple experiments that validate your hypotheses? Much of this is "common sense," and should be used in combination with other techniques learned in class.
Description: Discussion about Visualization of Big Data
Readings:
Slides: [github]
Description: In class work day/office hours [outcome: Probably don't repeat :]
Readings:
Description: I'll discuss some of the approaches to and challenges with large-volume geographic data. I'll be joined by some of our team members to show some interesting examples drawn from census data and satellite imagery.
Speaker Bio: Randy Sargent holds dual appointments at Carnegie Mellon University and Google. As Visiting Scientist in Google’s Earth Engine team, Randy helps research and develop time-lapse explorable maps, including a recently released global Landsat timelapse mosaicked from 29 years of Landsats 4, 5, and 7. As Senior Systems Scientist in Carnegie Mellon University's CREATE Lab, Randy works with a team to develop ways to explore and visualize big data – massive time-series data from the BodyTrack self-tracking project, and terapixel-scale zoomable and explorable videos of diverse subjects such as plants growing, or a simulation of the universe from the big bang to the present.
Prior to CMU and Google, Randy helped develop planetary rover software in NASA Ames’s Intelligent Robotics Group, and founded/co-founded three successful technology companies. Randy received his BS in Computer Science from MIT, and his MS from the MIT Media Lab, where he developed the Programmable Brick, a research prototype for LEGO Mindstorms.
xx maybe touch on some of this in week one case study to prep for byte 1
Description: Discussed the overall difference between frequency-based hypothesis testing and process-based (Bayesian) hypothesis testing, along with limitations and tests for special situations (multiple comparisons, violated assumptions, and so on). Also discussed limitations and paradoxes (such as Simpson's paradox).
Readings:
Reading Questions:
1) Why is it important to estimate the likelihood of an outcome in the population and how might you do that?
2) What are some examples of things that you might have data about for which process knowledge is your best option, rather than the frequency analysis typical of most statistics (e.g., randomized clinical trials, t-tests, etc.)?
Optional Readings:
HW: Project Part I assigned
Slides: [github]
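As a small illustration of the paradoxes mentioned above, here are the classic kidney-stone numbers often used to demonstrate Simpson's paradox: treatment A does better within each subgroup, yet worse overall.

```python
# Classic kidney-stone figures used to illustrate Simpson's paradox:
# treatment A wins within each subgroup, yet loses in the aggregate.
groups = {
    "small stones": {"A": (81, 87), "B": (234, 270)},
    "large stones": {"A": (192, 263), "B": (55, 80)},
}

totals = {"A": [0, 0], "B": [0, 0]}
for group, arms in groups.items():
    for arm, (successes, n) in arms.items():
        totals[arm][0] += successes
        totals[arm][1] += n
        print(f"{group:12s} {arm}: {successes}/{n} = {successes / n:.0%}")

for arm, (successes, n) in totals.items():
    print(f"overall      {arm}: {successes}/{n} = {successes / n:.0%}")
```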
Guest speaker: Mike Blackhurst, University Center for Social and Urban Research, University of Pittsburgh
Description: Discussion of causality and regression, the math and assumptions underlying regression, and how to use it.
Readings:
Slides: [github]
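A minimal sketch of fitting and inspecting an ordinary least squares regression on synthetic data (statsmodels assumed to be available; not the course's own example):

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data: y depends linearly on x, plus noise.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=100)

# Add an intercept column, fit OLS, and inspect coefficients and fit quality.
X = sm.add_constant(x)
model = sm.OLS(y, X).fit()
print(model.summary())  # coefficients, standard errors, R^2, diagnostics
```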
Description: Discussed the basic process by which classifiers are trained and used, and some of the metrics used to evaluate their success. Talked about the importance of keeping a test set separate from the data you train and experiment on.
Readings:
Slides: [ ]
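A minimal scikit-learn sketch of the train/test separation and evaluation metrics described above, using a built-in toy dataset rather than course data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Hold out a test set BEFORE any experimentation, so evaluation stays honest.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)

# Precision, recall, and F1 on data the classifier has never seen.
print(classification_report(y_test, clf.predict(X_test)))
```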
xx can we add a machine learning assignment...
[Slides]
Description: Discussion of Decision Trees, Naïve Bayes, and Regression
[Big Data Slides][Usable ML Slides: Blackboard]
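For reference, a minimal sketch comparing two of the classifiers named above on a toy dataset (scikit-learn assumed; not the course's exact examples):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Cross-validated accuracy gives a quick, if rough, comparison.
for name, clf in [("decision tree", DecisionTreeClassifier(max_depth=3)),
                  ("naive Bayes", GaussianNB())]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.2f}")
```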
Things cut from the class: