Research Projects
(These are subject to change and will be updated as more come in.)
PI: Prof. Leon Bergen
Neural language models such as BERT (Devlin et al., 2019), Transformer-XL (Dai et al., 2019), and GPT-3 (Brown et al., 2020) have achieved success in both text prediction and downstream tasks such as question answering, text classification, and summarization. The strong performance of these models raises scientific questions about the knowledge they have acquired; in particular, whether their linguistic knowledge is as abstract and general as that of humans. However, previous work has shown that different methods of probing these models’ knowledge (probing methods) produce conflicting results, which has limited our ability to draw strong conclusions about what these models know. Our objective is to map out the space of probing methods for neural language models and understand why these methods produce varying results. We will determine which methods have the greatest internal and external validity.
The student will learn how to set up computational experiments on neural language models, and how to use these experiments to gain scientific understanding of models.
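One common probing method scores minimal pairs: the model is asked to assign a probability to a grammatical sentence and to a nearly identical ungrammatical variant, and the probe counts how often the grammatical one wins. A toy sketch of that comparison logic follows; the sentences and scores are hypothetical stand-ins for what a real language model would produce (e.g., summed token log-probabilities).

```python
# Toy sketch of minimal-pair probing. In a real experiment, `model_scores`
# would be replaced by log-probabilities from a neural language model;
# the hand-specified numbers here are stand-ins for illustration only.

def score(sentence, model_scores):
    """Return the (stand-in) log-probability the 'model' assigns."""
    return model_scores[sentence]

def probe_minimal_pairs(pairs, model_scores):
    """Fraction of pairs where the grammatical variant scores higher."""
    correct = sum(
        score(good, model_scores) > score(bad, model_scores)
        for good, bad in pairs
    )
    return correct / len(pairs)

# Hypothetical subject-verb agreement pairs (grammatical, ungrammatical).
pairs = [
    ("the keys are on the table", "the keys is on the table"),
    ("the dog runs", "the dog run"),
]
model_scores = {
    "the keys are on the table": -12.3,
    "the keys is on the table": -15.1,
    "the dog runs": -8.2,
    "the dog run": -9.7,
}
print(probe_minimal_pairs(pairs, model_scores))  # 1.0
```

Different probing methods (minimal pairs, diagnostic classifiers, attention analysis) can disagree on the same model, which is exactly the variation this project aims to explain.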
PI: Prof. Kristen Vaccaro
In this project we are designing around microaggressions. This year, students have approached this problem from a number of directions. Some have worked on machine learning: finding datasets of microaggressions, identifying data quality issues, and training an NLP model. Building on prior work, this effort is particularly interested in modeling intersectionally targeted microaggressions. Others have done design work, exploring a variety of different tools or systems we could build. In summer 2023, we plan to use the models to characterize the use of microaggressions in different online communities: which online spaces are more or less toxic? We also plan to conduct human-subjects experiments to determine whether the tools and interventions we have designed actually reduce microaggression use.
Depending on student interest, you might choose to be more involved in modeling or human-subjects experiments. Both will involve extensive data analysis.
PI: Prof. Gary Cottrell
In collaboration with William Gerwick at the Scripps Institution of Oceanography, we have been developing systems to speed up structure determination from the NMR spectra of small molecules extracted from natural products (NPs) (Zhang et al., 2017; Li et al., 2020; Reher et al., 2020). Approximately 70% of all approved drugs are NPs, their analogues, or a chemical modification of an existing NP (Newman & Cragg, 2016). In addition to these academic and societal benefits, natural products research provides a powerful incentive for the conservation and sustainable use of biodiversity and biodiverse habitats (Kursar et al., 2006). A bottleneck in this research is determining the structure of a new molecule: its NMR spectrum can be measured readily, but it then takes a skilled researcher approximately two weeks to infer the structure from the spectrum. Our goal is to learn a mapping from the NMR spectra of natural products (sometimes called the “fingerprint” of a molecule) to their structures, and we have been developing deep learning techniques to do this.
We are developing improvements over our previous methods to produce the structure of a molecule directly as a SMILES string. The student would learn how to train deep networks for this task.
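SMILES strings encode molecular structure as text (e.g., "CCO" for ethanol), which is what makes structure prediction amenable to sequence models: the network's output is a sequence of SMILES tokens. As a minimal illustration, assuming a simple vocabulary, a SMILES string can be split into model-ready tokens with a short regular expression that keeps two-letter element symbols and bracketed atoms intact; a real pipeline would use a fuller grammar than this sketch.

```python
import re

# Illustrative SMILES tokenizer: bracketed atoms ([nH], [O-]) and the
# two-letter elements Cl and Br stay single tokens; everything else is
# split character by character. A production tokenizer would cover more
# element symbols; this sketch only shows the idea.
SMILES_TOKEN = re.compile(r"\[[^\]]*\]|Cl|Br|.")

def tokenize_smiles(smiles):
    """Split a SMILES string into a token sequence for a sequence model."""
    return SMILES_TOKEN.findall(smiles)

print(tokenize_smiles("CCO"))      # ['C', 'C', 'O']  (ethanol)
print(tokenize_smiles("ClCCBr"))   # ['Cl', 'C', 'C', 'Br']
print(tokenize_smiles("c1cc[nH]c1"))  # aromatic ring with a bracketed atom
```

The ordering of alternatives in the pattern matters: bracketed atoms and two-letter symbols must be tried before the catch-all `.` so that "Cl" is not split into "C" and "l".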
PI: Prof. Taylor Berg-Kirkpatrick
Digital Humanities is an interdisciplinary field at the intersection of computational disciplines (e.g. computer science, statistics) and humanistic ones (e.g. literature, history, bibliography). ‘Print and Probability’ is a digital humanities project about uncovering the history of printed books from the Early Modern period (roughly 1500-1800 AD). The invention of the printing press allowed for the circulation of ideas, but authoritarian rule and the menace of persecution meant that potentially controversial books and pamphlets were often printed clandestinely. As a result, the history and origin of important artifacts have been lost. ‘Print and Probability’ will use AI to identify and track unique stamps and marks in secretly printed Early Modern books in order to uncover which printing houses and compositors were responsible for their production. For example, John Milton’s Areopagitica: A Speech for the Liberty of Unlicensed Printing to the Parliament of England famously argued for freedom of the press and is considered a progenitor of the First Amendment. Yet, ironically, it was printed secretly in 1644 because no printing house was willing to publicly claim ownership for fear of persecution. As a result, important historical relationships are shrouded in mystery. Who printed Areopagitica? Our early work on this project has used AI to provide new evidence for attributing Areopagitica to two known printing houses in 17th-century England. We used computer vision techniques to identify unique imprints (due to bent or damaged character stamps) across a large collection of books with known printers. Then, by using probabilistic inference to align these ‘fingerprints’ with those found in Areopagitica, we are able to automatically predict likely attributions.
This project involves multiple core AI disciplines and connects with exciting historical questions that serve to illustrate the broader applications of AI outside traditional computing disciplines. First, as a point of entry to the project, teachers will have the opportunity to learn about and develop components that have a low barrier to entry (e.g. manually inspecting Early Modern documents, developing page-segmentation algorithms). As the project continues into its second and third weeks, and once teachers have a grounding in the core research questions, the project can expand into more advanced AI techniques (e.g. automatic image classification, language modeling for text). Teachers will be involved in using optical character recognition and computer vision pipelines to help extract data from images of Early Modern books, and in sorting through images of historical artifacts in order to find tell-tale signatures and evaluate our attribution system. Teachers will develop code in PyTorch and aid in designing experiments and in collecting and managing datasets. A background in History and Early Modern Studies is not required; teachers will be introduced to relevant historical knowledge during the core project period.
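At its simplest, matching a damaged character stamp across books means comparing small character images and finding the closest one. The project's actual pipeline uses learned features and probabilistic alignment; the toy sketch below only illustrates the matching idea with made-up 3x3 binary "imprints" and a plain pixel-difference score.

```python
# Toy sketch of imprint matching: compare small binary character images by
# the fraction of differing pixels. The 3x3 images below are hypothetical;
# the real system works on scanned Early Modern pages with learned features.

def pixel_distance(img_a, img_b):
    """Fraction of pixels that differ between two same-sized binary images."""
    flat_a = [p for row in img_a for p in row]
    flat_b = [p for row in img_b for p in row]
    diffs = sum(a != b for a, b in zip(flat_a, flat_b))
    return diffs / len(flat_a)

def best_match(query, candidates):
    """Index of the candidate imprint closest to the query imprint."""
    return min(range(len(candidates)),
               key=lambda i: pixel_distance(query, candidates[i]))

# Hypothetical imprints: candidate 1 is identical to the query.
query = [[1, 0, 1], [0, 1, 0], [1, 0, 1]]
candidates = [
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
    [[1, 0, 1], [0, 1, 0], [1, 0, 1]],
]
print(best_match(query, candidates))  # 1
```

Raw pixel distance breaks down under rotation, inking variation, and scan noise, which is precisely why the project needs more robust computer vision and probabilistic inference.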
PI: Prof. Nuno Vasconcelos
While a wide spectrum of automated tools has been created for building deep learning models over the past decade, dataset collection has remained a largely manual process, with little systematic effort to account for bias in raw data or human annotations. The goal of this project is to build an iterative framework for dataset collection, annotator teaching, and model training. Under this unified framework, new examples are automatically selected for human annotation, cleaned for label bias, and added to the dataset progressively. Neural network models are trained on each iteration of the data, and model explanation techniques are used to create teaching examples that reduce the bias of crowdsourced annotators. The framework aims to produce datasets that are optimal for machine learning under multiple objectives, including classification accuracy and fairness. The research is connected to topics such as active learning and human-in-the-loop AI systems. The project aims for a top-tier conference publication.
Student Responsibilities: Software development in Python, Linux and at least one popular deep learning framework such as PyTorch. Students will also learn basics in computer vision and natural language processing.
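The select-annotate-retrain loop described above resembles classic uncertainty sampling from active learning: on each iteration, the examples the current model is least sure about are sent to human annotators. A toy sketch of that selection step, with a stand-in model in place of a trained network:

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted class distribution (higher = less sure)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_annotation(unlabeled, predict, batch_size):
    """Pick the unlabeled examples with the most uncertain predictions.

    `predict` maps an example to class probabilities. Here it is a toy
    stand-in; in the actual framework it would be the current neural network.
    """
    ranked = sorted(unlabeled, key=lambda x: entropy(predict(x)), reverse=True)
    return ranked[:batch_size]

# Hypothetical model: confident on even-numbered examples, unsure on odd ones.
def toy_predict(example_id):
    return [0.5, 0.5] if example_id % 2 else [0.95, 0.05]

print(select_for_annotation([0, 1, 2, 3], toy_predict, 2))  # [1, 3]
```

This sketch covers only the selection objective; the project's framework additionally cleans labels for bias and generates teaching examples for annotators, steps not shown here.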
Analyzing and extending the impact of a first-year mentoring and academic program in computing
PIs: Niha Bhaskar, Amari Lewis, Kristen Vaccaro, Joe Politz, and Mia Minnes.
The CSE Peer-Led Academic Cohort Experience (PACE, https://pace.ucsd.edu/) program provides weekly mentoring and research exploration sessions for first-year CSE students at UCSD. The 2022-2023 academic year was its first implementation. This year we explored topics including how the choice of training data can introduce bias in machine learning, Bluetooth signal tracking and its impact on privacy, and block-based programming applications in CS education and industry robotics. In summer 2023, we plan to analyze quantitative and qualitative data collected on students' experience in the program over this year, and to design new research-based modules for cohorts of students in the upcoming school year to explore.
Research-focused opportunity: Analyze interviews, survey responses, and other data from students’ experience in the first year of the program to inform improvements and understand what was effective (and not) about the program.
Curriculum development opportunity: Create research-based activity modules on cutting-edge computing topics and the intersection of computing and society for the next group of students.
PI: Prof. Nuno Vasconcelos
Brachytherapy is a treatment in which a radioactive source is used to deliver radiation internally to treat cancers such as cervical cancer. Currently, clinicians manually tune treatment parameters to customize the radiation to individual patients' anatomy. This process can take over an hour, which is problematic because patients are waiting in discomfort, and often under sedation, while it occurs. Deep learning can identify anatomical features that relate to ideal, customized radiation treatments by learning from past patient imaging and treatment data. In this project, we will generate new networks and inputs and/or modify existing networks to accurately predict radiation treatment parameters. The end goal is to automate the treatment customization process to ensure that high-quality radiation treatments can be produced in a matter of minutes with a single button click. This project will involve working with a team of medical physicists (including Dr. Sandra Meyers), radiation oncologists, and electrical engineers, and is a collaboration between the Vasconcelos and Meyers labs. The project aims for a top-tier conference or journal publication.
Student Responsibilities: Software development in Python, Linux and at least one popular deep learning framework such as PyTorch.