Lectures will take place in NSH 3002
Lecture slides are available at https://github.com/jmankoff/data (to easily keep an up-to-date copy, create a free GitHub account, open the URL above, and click 'Fork'; from then on the repository will show up in your account. GitHub also has a GUI client you can download that will pull updates when you request them. However, you should create a separate place for your own work and use this repository only to see my version of each project).
Readings are linked off this page or available on Blackboard. Discussion posts are not required for optional readings.
Description: An overview exploring the hype around Data Science, the different perspectives that are needed on a team that works with data, and the pipeline involved in working with data.
Homework: Byte 1 assigned; Byte 3 setup assigned
Slides: [github]
Reading: the hardest parts of data science.
fine
Slides: [github]
fine
Description: Discussion of properties of data; practical overview of XML/JSON/SQL/etc.; practical overview of APIs and OAuth
Readings:
Required: Stonebraker & Hellerstein, "What Goes Around Comes Around": pages 1-2 (Sections I and II), Section V (The Entity-Relationship Era), Section IX (The Object-Relational Era), and Section X (Semi-Structured Data). (A historical view of different classes of data modeling.)
Slides: [github]
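To make the JSON/API portion concrete, here is a minimal sketch of fetching and parsing JSON in Python. The URL and field names are hypothetical placeholders rather than an API used in the course (a real API would typically also require OAuth or an API key):

```python
import json
import urllib.request

# Hypothetical endpoint; substitute any JSON-returning API you have access to.
URL = "https://api.example.com/items?limit=5"

with urllib.request.urlopen(URL) as response:
    data = json.loads(response.read().decode("utf-8"))

# JSON maps directly onto Python dicts and lists, so once parsed
# you can walk the structure like any other Python object.
for item in data.get("items", []):
    print(item.get("name"), item.get("value"))
```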
Wednesday 1/20: Byte 1 Due; Byte 3 install due;
where to introduce? before viz
Description: Transforming data; stem-and-leaf plots; boxplots; histograms and distributions and their implications
Readings:
Homework: Byte 2 Assigned
Slides: [github]
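As a quick illustration of the plot types listed above, a minimal matplotlib sketch using made-up data (not course data):

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up sample: 200 values drawn from a right-skewed distribution.
rng = np.random.default_rng(0)
values = rng.lognormal(mean=0.0, sigma=0.5, size=200)

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(8, 3))

# A histogram shows the shape of the distribution (here, skewed right).
ax_hist.hist(values, bins=20)
ax_hist.set_title("Histogram")

# A boxplot summarizes the same data: median, quartiles, and outliers.
ax_box.boxplot(values)
ax_box.set_title("Boxplot")

plt.tight_layout()
plt.show()
```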
Friday 1/22: Byte 1 Peer Grading Due
xx where to introduce?
Description: The four Cs (Correctness, Coherence, Completeness, and AcCountability); Practical overview of survey question design issues
Readings:
Slides: [github]
For suggested "reading":
* I will mention (and play a demo video of) a tool called OpenRefine, a free, open-source tool for both cleaning and integration.
* If I have time, I hope to also show a video of "Wrangler" from Jeff Heer's group: http://vis.stanford.edu/papers/wrangler
That paper is good to read regardless: http://dl.acm.org/citation.cfm?id=1979444
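OpenRefine and Wrangler are interactive tools, but the same completeness/correctness checks can be scripted. A minimal pandas sketch, with a hypothetical file and hypothetical column names:

```python
import pandas as pd

# Hypothetical survey export; substitute your own file and columns.
df = pd.read_csv("survey_responses.csv")

# Completeness: how many values are missing in each column?
print(df.isna().sum())

# Correctness: flag ages outside a plausible range.
bad_ages = df[(df["age"] < 0) | (df["age"] > 120)]
print(f"{len(bad_ages)} rows with implausible ages")

# Coherence: the same category spelled several ways is a common problem
# that tools like OpenRefine cluster automatically.
print(df["gender"].str.strip().str.lower().value_counts())
```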
Slides: See blackboard
Description: Overview of key concepts in information visualization; Testing visualizations; StepGreen Case Study
Readings:
Reading question: Post a link or screenshot of a data visualization, and analyze how it addresses Tufte's six principles.
Slides: [github]
Description: Overview of human perceptual factors affecting information visualization and a brief discussion of D3
Readings:
Possible reading question: Although pattern detection is typically simpler with a graphical interface, are we missing out on interesting numerical relationships by allowing both the machine and the human analyst to focus only on what they do "best"?
Readings:
Optional:
Potential Reading Questions:
Homework: Byte 3 due; Byte 4 (visualizing your data) assigned
Slides: [github]
xx maybe do a midterm project here?
Description: Case study of data cleaning; Discussion of data sampling issues
Slides: [github]
Description: Discussion of what can be accomplished with mobile data collection and other forms of sensed data; description of Byte 3 (Mobile Byte).
Reading: ProactiveTasks: the short of mobile device use sessions. Nikola Banovic, Christina Brant, Jennifer Mankoff, and Anind K. Dey. In Proceedings of the 16th international conference on Human-computer interaction with mobile devices & services (MobileHCI '14). ACM, New York, NY, USA, 243-252. PDF
maybe add data streams
Description: Infrastructure issues for big data; Sampling and Quality
Readings:
Homework: Byte 2 due; Byte 3 Mobile Assigned
Slides: [github]
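One concrete technique for sampling from data that arrives as a stream or is too large to hold in memory is reservoir sampling; a minimal sketch, not necessarily the exact algorithm covered in lecture:

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Replace an existing element with probability k / (i + 1),
            # which keeps every item equally likely to end up in the sample.
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# Example: sample 5 values from a "stream" of a million integers.
print(reservoir_sample(range(1_000_000), 5))
```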
Readings: None
Title: Advanced common-sense: Making sense of data that is invisible, ugly, or incomplete
This talk will be informal, and based on my own experience making sense of data from large-scale education applications. Through this talk, I want to remind you of data issues that are not emphasized in traditional data-processing pipelines; e.g., how do you estimate data that is hard to get? How do you sanity-check data or run simple experiments that validate your hypotheses? Much of this is "common sense," and should be used in combination with other techniques learned in class.
Description: Discussion about Visualization of Big Data
Readings:
Slides: [github]
Description: In class work day/office hours [outcome: Probably don't repeat :]
Readings:
Description: I'll discuss some of the approaches to and challenges with large-volume geographic data. I'll be joined by some of our team members to show some interesting examples drawn from census data and satellite imagery.
Speaker Bio: Randy Sargent holds dual appointments at Carnegie Mellon University and Google. As Visiting Scientist in Google’s Earth Engine team, Randy helps research and develop time-lapse explorable maps, including a recently released global Landsat timelapse mosaicked from 29 years of Landsats 4, 5, and 7. As Senior Systems Scientist in Carnegie Mellon University's CREATE Lab, Randy works with a team to develop ways to explore and visualize big data – massive time-series data from the BodyTrack self-tracking project, and terapixel-scale zoomable and explorable videos of diverse subjects such as plants growing, or a simulation of the universe from the big bang to the present.
Prior to CMU and Google, Randy helped develop planetary rover software in NASA Ames’s Intelligent Robotics Group, and founded/co-founded three successful technology companies. Randy received his BS in Computer Science from MIT, and his MS from the MIT Media Lab, where he developed the Programmable Brick, a research prototype for LEGO Mindstorms.
xx maybe touch on some of this in week one case study to prep for byte 1
Description: Discussed the overall difference between frequency-based hypothesis testing and process-based (Bayesian) hypothesis testing, along with limitations and tests for special situations (multiple comparisons, violated assumptions, and so on). Also discussed limitations and paradoxes (such as Simpson's paradox).
Readings:
Reading Questions:
1) Why is it important to estimate the likelihood of an outcome in the population and how might you do that?
2) What are some examples of things that you might have data about for which process knowledge is your best option, rather than the frequency analysis typical of most statistics (e.g., randomized clinical trials, t-tests, etc.)?
Optional Readings:
HW: Project Part I assigned
Slides: [github]
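As a small illustration of the paradoxes mentioned above, here are the classic kidney-stone numbers often used to demonstrate Simpson's paradox: treatment A does better within each subgroup, yet worse overall.

```python
# Classic kidney-stone figures used to illustrate Simpson's paradox:
# treatment A wins within each subgroup, yet loses in the aggregate.
groups = {
    "small stones": {"A": (81, 87), "B": (234, 270)},
    "large stones": {"A": (192, 263), "B": (55, 80)},
}

totals = {"A": [0, 0], "B": [0, 0]}
for group, arms in groups.items():
    for arm, (successes, n) in arms.items():
        totals[arm][0] += successes
        totals[arm][1] += n
        print(f"{group:12s} {arm}: {successes}/{n} = {successes / n:.0%}")

for arm, (successes, n) in totals.items():
    print(f"overall      {arm}: {successes}/{n} = {successes / n:.0%}")
```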
Guest speaker: Mike Blackhurst, University Center for Social and Urban Research, University of Pittsburgh
Description: Discussion of causality and regression, the math and assumptions underlying regression, and how to use it.
Readings:
Slides: [github]
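A minimal sketch of fitting and inspecting an ordinary least squares regression on synthetic data (statsmodels assumed to be available; not the course's own example):

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data: y depends linearly on x, plus noise.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=100)

# Add an intercept column, fit OLS, and inspect coefficients and fit quality.
X = sm.add_constant(x)
model = sm.OLS(y, X).fit()
print(model.summary())  # coefficients, standard errors, R^2, diagnostics
```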
Description: Discussed the basic process by which classifiers are trained and used, and some of the metrics used to evaluate their success. Talked about the importance of keeping a test set separate from the data you train and experiment on.
Readings:
Slides: [ ]
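A minimal scikit-learn sketch of the train/test separation and evaluation metrics described above, using a built-in toy dataset rather than course data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Hold out a test set BEFORE any experimentation, so evaluation stays honest.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)

# Precision, recall, and F1 on data the classifier has never seen.
print(classification_report(y_test, clf.predict(X_test)))
```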
xx can we add a machine learning assignment...
[Slides]
Description: Discussion of Decision Trees, Naïve Bayes, and Regression
[Big Data Slides][Usable ML Slides: Blackboard]
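For reference, a minimal sketch comparing two of the classifiers named above on a toy dataset (scikit-learn assumed; not the course's exact examples):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Cross-validated accuracy gives a quick, if rough, comparison.
for name, clf in [("decision tree", DecisionTreeClassifier(max_depth=3)),
                  ("naive Bayes", GaussianNB())]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.2f}")
```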
Things cut from the class: