Topics
Students in this class will develop a toolkit for working with real datasets, applying machine learning techniques, performing exploratory data analysis, and generating effective visualizations.
The following topics will be explored:
Data management: extract, transform and load (ETL), data cleaning.
Exploration of a single variable: visualizing distributions, summary statistics, outliers and errors, robust statistics.
Machine learning: feature extraction and selection, survey of methods, use of scikit-learn.
Exploration of variable relationships: scatterplots, correlation, linear regression, non-linear relationships.
Statistical analysis: hypothesis testing and estimation using a simulation-based approach.
Collaborative exploration: formulating questions, performing analysis, communicating results, visualization.
Most student teams will use Python and related technologies, but some projects might use other languages.
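As a rough illustration of the simulation-based approach to hypothesis testing listed among the topics, a permutation test for a difference in means might be sketched in Python as follows. The function name `permutation_test`, its parameters, and the sample numbers are illustrative assumptions, not course material; only the standard library is used.

```python
import random


def permutation_test(group_a, group_b, n_iter=10_000, seed=0):
    """Estimate a two-sided p-value for a difference in means by shuffling group labels.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same estimate
    n_a = len(group_a)
    n_b = len(group_b)

    # Observed test statistic: absolute difference between the group means.
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)

    pooled = list(group_a) + list(group_b)
    at_least_as_extreme = 0
    for _ in range(n_iter):
        # Under the null hypothesis the labels are exchangeable, so reshuffle them.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            at_least_as_extreme += 1

    # Fraction of shuffles producing a difference at least as large as the observed one.
    return at_least_as_extreme / n_iter


if __name__ == "__main__":
    # Made-up measurements for two groups, purely for demonstration.
    control = [12, 15, 11, 14, 13, 16]
    treated = [18, 17, 20, 16, 19, 21]
    print("estimated p-value:", permutation_test(control, treated))
```

A small estimated p-value suggests that a difference as large as the observed one would rarely arise from random relabeling alone. The same resampling idea extends to estimation, for example bootstrap confidence intervals.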