REU in Data Science at Harvey Mudd College


Data science is one of the fastest-growing fields of professional work and research. The REU in Data Science at Harvey Mudd College is a 10-week summer research program for undergraduates interested in data science methods and data-heavy STEM careers. Student participants will join research projects and receive additional training in data science methods and professional skills. As part of a larger group of summer research students, REU participants will have opportunities to get to know the Claremont Colleges and the larger Los Angeles metropolitan region and to enjoy social events on and off campus.

What we offer:

  • A fully funded 10-week summer research program (limited travel cost support available)
  • Mentoring and research guidance by experienced and passionate faculty in the life and environmental sciences, mathematics, computer science, and engineering
  • Supplementary training workshops in data science and professional skills such as time and project management, public speaking, and more
  • Advising on graduate schools and STEM careers
  • A rich social program with peers in and around the Claremont Colleges

Important Dates:

  • Call for applications: January 2019
  • Last day to apply: February 28, 2019
  • Decisions are made on a rolling basis
  • Start date of the REU: May 28, 2019 (with arrival on May 27, 2019)
  • End date of the REU: August 2, 2019 (with departure on August 2 or 3, 2019)

To Apply:

  • You must be a U.S. citizen or permanent resident to be eligible to participate in the REU.
  • Please complete the application form and email 2 letters of recommendation and your (un)official academic transcript to Tanja Srebotnjak with the subject line: REU in Data Science. You can also ask your recommenders to email their letters directly to Tanja Srebotnjak.


This year's projects and advisors:

Attention: new projects by Prof. de Pillis and Prof. Adolph have been added (02-04-2019, 01-28-2019)

Prevalence and Propagation of “Fake News” (Prof. Susan Martonosi, Mathematics Dept.)

The prevalence and propagation of “fake news” have garnered international attention following the 2016 U.S. presidential election. The mechanisms by which fake and/or biased news articles are propagated are an active area of research, particularly as social media outlets such as Facebook are increasingly being asked to play an active role in fake news detection and deterrence. This proposed research project will build on last year's work to further develop data and a probability model for the likelihood that a given user, whose beliefs lie on a continuum, will share a news article characterized by an observable bias and level of truthfulness. Using that probability model, we will develop a framework that determines the optimal distribution of bias and truthfulness of articles produced by a malicious agent to maximize propagation within a population with a known belief distribution. This work will provide insights into the optimal characteristics of biased and/or “fake” news, which can then be used within a game-theoretic framework to develop defensive strategies. The data science student researchers will assist in validating our models against publicly available social media data.

Invisible Cyclists and Road Network Analysis (Prof. Paul Steinberg, HSA Dept. and Prof. Srebotnjak, Hixon Center and Engineering Dept.)

Active forms of transportation, such as walking and bicycling, have many documented benefits, including improved public health outcomes, reduced pollution and traffic, and enhanced revenues for local businesses. Social justice has emerged as a major theme within active transportation research. Cycling is not merely a recreational activity but a vital transportation option for those who lack access to automobiles, particularly the poor but also undocumented workers, children, and the elderly. Latinos and African Americans report the strongest interest in cycling, yet they often lack access to bike lanes, suffer higher accident rates, and are rarely represented in policymaking, giving rise to the term "invisible cyclists." This project entails a transportation needs assessment of underrepresented populations in Claremont to help city officials adopt an equitable and inclusive approach to sustainable transportation planning, including for the extension of the Gold Line commuter rail project. The student hired through the REU will engage in data collection through in-person surveys as well as statistical and spatial analysis of the results. Prerequisites include fluent (ideally native) Spanish, training in statistical analysis, and ideally experience with Geographic Information Systems (GIS).

Analysis of RNA-seq Data (Prof. Daniel Stoebel, Biology Dept., Prof. Danae Schulz, Biology Dept., and Prof. Jo Hardin, Math Dept. Pomona College)

Biologists can measure the transcription of all genes in the genome using a technique called RNA-seq. This technique uses modern high-throughput sequencing to sequence all of the RNA isolated from a group of cells. The sequence data is then analyzed to measure the level of expression of each gene. Further analysis is then used to, for example, cluster genes and/or growth conditions, or to determine which genes differ in their levels of expression across conditions. This project focuses on the analysis of time-course RNA-seq data. The project is motivated by two experimental RNA-seq data sets. The first is for the parasite Trypanosoma brucei, which causes sleeping sickness in humans and is transmitted from person to person by the tsetse fly. As the parasite cycles between the human bloodstream and insect environments, it changes the expression levels of around one third of the genes in its genome. The second data set is for E. coli, which changes the expression of its genes in response to the transition from exponential growth to starvation. Students will be required to identify appropriate methods for normalizing the data, an essential and experiment-specific first step for all RNA-seq analysis. They will then use current data science approaches to analyze time-course data, including clustering methods to identify groups of genes with similar expression patterns and methods for identifying genes and gene networks for problems of both unknown and pre-specified biological structure. It will be important for the students to assess the appropriateness of each bioinformatic tool to the problem at hand. After gaining a thorough understanding of the process and methods, the students will be able to address the biological research questions. Desired skills/background: Course work in statistics, familiarity with R, and an interest in biological problems.
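As a rough illustration of the normalize-then-compare workflow described above, the following Python sketch applies counts-per-million (CPM) scaling and a log transform to a made-up count matrix, then compares genes by the shape of their time-course profiles. This is only an assumption-laden toy: CPM is one simple normalization among many and is not necessarily the method used in this project, and the function names and the data are hypothetical.

```python
import numpy as np

def counts_per_million(counts):
    """Scale each sample (column) so that its counts sum to one million."""
    counts = np.asarray(counts, dtype=float)
    library_sizes = counts.sum(axis=0, keepdims=True)
    return counts / library_sizes * 1e6

def standardized_profiles(counts, pseudocount=1.0):
    """Return log2(CPM + pseudocount), centered and scaled per gene (row)."""
    log_cpm = np.log2(counts_per_million(counts) + pseudocount)
    centered = log_cpm - log_cpm.mean(axis=1, keepdims=True)
    spread = log_cpm.std(axis=1, keepdims=True)
    spread[spread == 0] = 1.0  # flat genes stay at zero instead of dividing by zero
    return centered / spread

# Toy data (made up): 4 genes (rows) measured at 3 time points (columns).
toy_counts = np.array([
    [10, 20, 40],
    [100, 200, 400],
    [50, 25, 10],
    [500, 250, 100],
])

cpm = counts_per_million(toy_counts)
profiles = standardized_profiles(toy_counts)

# Genes whose expression changes in the same way over time have highly
# correlated profiles, regardless of their absolute expression level.
similarity = np.corrcoef(profiles)
```

A similarity matrix like this could serve as input to a clustering method; which normalization and clustering approach is appropriate depends on the experiment.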

Spatial modeling of the climatic ecology and geographic range of a desert lizard (Prof. Steve Adolph, Biology Dept.)

This project will combine mathematical models of population dynamics of the desert lizard Xantusia vigilis with spatial and temporal datasets on precipitation and temperature. We will use GIS and other spatial statistical methods to define the climatic niche and geographical range of this lizard species. We will then couple this spatial model with local models of population dynamics to predict spatial variation in population dynamics. Ultimately, we would like to develop a predictive model for how this species will respond in space and time to predicted climate change in California.

Dimension Reduction and Pseudo Time Analysis of Large Scale Biological Cell Differentiation Data (Prof. Lisette de Pillis, Mathematics Dept.)

Understanding the processes of cell differentiation and proliferation is crucial to developing a deeper knowledge of ailments like cancer. Techniques in genomic analysis yield very high-dimensional data sets, but in order to gain insight into the developmental behaviors hidden inside these large data sets, specialized mathematical and computational techniques must be created and applied.

The first REU project team investigated a set of dimension reduction techniques thought to be particularly relevant for this project. These techniques included Principal Component Analysis (PCA), Diffusion Maps, t-distributed Stochastic Neighbor Embedding (t-SNE), and Uniform Manifold Approximation and Projection (UMAP). The mathematics behind each of these approaches was explored, and the resulting algorithms were applied to biological data sets to help determine which combinations of approaches could be most effective at providing the critical information needed for characterizing cell differentiation. This year's project team will be able to take the next step in the process, which will involve investigating approaches intended for analyzing and interpreting large-scale time series data. Preliminary work has also begun on solving partial differential equations on the graphs derived from these dimension reduction methods. The long-term goal of this project is to develop new mathematical descriptions of a continuous phenotype space.
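To give a flavor of the simplest of the techniques named above, here is a minimal Python sketch of PCA computed with a singular value decomposition. It is not the project team's code: the `pca` helper and the synthetic "cells by genes" matrix are assumptions made purely for illustration.

```python
import numpy as np

def pca(data, n_components=2):
    """Project the rows of `data` (samples x features) onto the top principal components."""
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    scores = centered @ components.T
    explained = singular_values**2 / np.sum(singular_values**2)
    return scores, components, explained[:n_components]

# Synthetic stand-in for expression data: 200 "cells" and 50 "genes" that
# vary mostly along a single hidden direction, plus unit-variance noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
loadings = rng.normal(size=(1, 50))
toy = 5.0 * (latent @ loadings) + rng.normal(size=(200, 50))

scores, components, explained = pca(toy, n_components=2)
```

Here `scores` holds the two-dimensional coordinates of each synthetic cell, and `explained` gives the fraction of total variance captured by each component. Nonlinear methods such as t-SNE and UMAP are typically used through dedicated libraries rather than written by hand.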

Past Projects:

  • Invisible cyclists and road network analysis
  • Computing for active transportation
  • Sports analytics
  • Brain tumor detection
  • Predicting human behavior from smartphone data
  • Data dimensionality reduction methods