Preparing for the Workshop
The technical background needed for the Datathon challenge will be delivered through pre-recorded videos and structured notebooks. We’ve prepared instructional materials at a range of depths. We ask that participants review the materials that complement their backgrounds before coming to the workshop. Depending on your background, you may find that the materials cover concepts you already know, or that only some of them cover concepts that are new to you. The best way to see if you need to brush up on a concept is to take the concept quiz (which follows every topic) or run through the exploratory Deepnote notebooks (which follow every 2-3 topics).
About the WiDS 2025 Datathon
The WiDS Datathon is hosted on Kaggle, an online platform for data science competitions and community activities. In this year’s datathon challenge, participants will look at how imaging-based measurements of the female brain track with diagnoses of neurological or neuropsychiatric diseases.
Resources for Exploring and Understanding the Data
The following are resources for getting started with Kaggle competitions and with working with this year’s data:
Getting Started with Kaggle (Video), Getting Started with Kaggle Tutorial Blog Post
WiDS Data Science Tutorials
WiDS Datathon Skill Building Tutorial Kit by Keiko Kamei
Data Science Modules by UC Berkeley Professor
Exploratory Data Analysis (Video)
Join Slack Workspace here: WiDS Cambridge 2025 Datathon
Technical Background for the Datathon Challenge
It would be helpful for participants to have some familiarity with Python programming. You can familiarize yourself with Python by completing online tutorials, for example: https://www.learnpython.org. In particular, it’s helpful to be able to manipulate data using pandas DataFrames, perform basic operations with NumPy arrays, and make use of basic plotting functions (e.g. line chart, histogram, scatter plot, bar chart) from matplotlib:
pandas Basics
numpy Basics
matplotlib Basics
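As a rough self-check on the skills listed above, the following minimal sketch filters and summarizes a pandas DataFrame, does some array arithmetic with NumPy, and draws a histogram and a scatter plot with matplotlib. The column names and values are made up for illustration and are not taken from the datathon data.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up toy data, purely for illustration (NOT the datathon data).
df = pd.DataFrame({
    "participant_id": ["p1", "p2", "p3", "p4", "p5", "p6"],
    "age": [8, 11, 9, 14, 12, 10],
    "score": [3.1, 4.7, 3.8, 6.2, 5.0, 4.1],
    "group": ["a", "b", "a", "b", "b", "a"],
})

# pandas: select rows with a boolean mask, then summarize by group.
older = df[df["age"] >= 10]
mean_score_by_group = df.groupby("group")["score"].mean()

# NumPy: element-wise arithmetic and reductions on an array.
scores = df["score"].to_numpy()
standardized = (scores - scores.mean()) / scores.std()

# matplotlib: a histogram and a scatter plot side by side.
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].hist(df["score"], bins=4)
axes[0].set_xlabel("score")
axes[0].set_ylabel("count")
axes[1].scatter(df["age"], df["score"])
axes[1].set_xlabel("age")
axes[1].set_ylabel("score")
fig.tight_layout()
# In a notebook the figure renders when the cell finishes;
# in a plain script, call plt.show() to display it.
```

If each step here reads naturally to you, the linked basics are probably safe to skim; if not, the tutorials above are a good place to start.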
For this workshop, we will be using Deepnote, a free cloud computing service that comes with pre-installed machine learning tools. Deepnote lets you work on data science projects in real time and in one place with your friends and colleagues. It allows you to create and share documents that contain live code, equations, visualizations and narrative text. You can familiarize yourself with the interface of Deepnote notebooks by reading the following tutorials (remember, you don’t need to install anything!):
Deepnote: the modern way to teach Data Science
Deepnote: Documentation
Finally, we will be working with random variables in this workshop and will be reasoning about them through their distributions, but we’ll only need a bit of familiarity with these concepts:
Probability distributions in python
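To give a sense of the level of familiarity meant here, the sketch below draws samples from a normal distribution with NumPy and compares sample statistics against the distribution's parameters. The parameter values, the sample size, and the choice of a normal distribution are arbitrary assumptions for illustration.

```python
import numpy as np

# Arbitrary illustrative parameters for a normal random variable X.
mu, sigma = 2.0, 0.5

# A seeded generator makes the draws reproducible.
rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=mu, scale=sigma, size=100_000)

# Sample statistics should land close to the true parameters.
sample_mean = samples.mean()
sample_std = samples.std(ddof=1)

# Empirical estimate of P(X <= mu + sigma): the fraction of samples
# at or below one standard deviation above the mean.
# For a normal distribution the exact value is about 0.841.
prob_est = np.mean(samples <= mu + sigma)

print(f"sample mean: {sample_mean:.3f}")
print(f"sample std:  {sample_std:.3f}")
print(f"P(X <= mu + sigma) estimate: {prob_est:.3f}")
```

The idea to take away is that a distribution can be explored through its samples: averages of samples approximate expectations, and fractions of samples approximate probabilities.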
Skills Check for Participants
You might find the following Deepnote notebook useful for getting a sense of the types of computational tasks that we will be performing during the workshop:
Introduction to Deepnote Notebook and python Libraries for Data Science
Instructions:
Open the notebook. In the drop-down menu next to the project name, WiDS Datathon 2024, choose Duplicate to make your own copy of the project.
If you want to share your work with others: under ‘Share’ (upper right-hand side of your screen), change the permissions to “Public access: On” and share the link, or invite specific collaborators.
The materials will be broken down into a sequence of bite-sized concepts. Each concept will be introduced in a short 10-20 minute video; following each video, there will be a short concept-check quiz for the viewer to test their understanding. For each topic, we have selected some supplementary readings that may be helpful. After every two or three topics, there will be a Deepnote notebook with starter code and experiments to help you further explore these topics.
These materials are taken from DSC6232 Machine Learning and Computational Statistics, an intensive summer data science course run by IACS and the University of Rwanda. You can find the complete set of lecture materials on the course website.
Topics on Regression
Topics on Uncertainty, Variance and Bias
Topics on Neural Networks for Regression
Topics on Transforming and Manipulating Data