One of the most important techniques in data analysis is nearest neighbor search. The idea is simple: I have a set of data points, each with some known property. When a new data point comes along, I guess that its property matches that of the closest point in my input set. The computational problem is therefore to find the data point closest to a given query point. This is a classic, well-studied problem for which many data structures are known. One fascinating observation from modern AI systems is that data can be represented in many different ways, and each representation induces its own notion of distance. With several ways of measuring distance, we can build several data structures, and there are several natural ways of combining multiple distances into a single distance. The goal will thus be to efficiently combine the data structures for the individual distances into a larger structure that represents the combined distance. The naive approach would be to rebuild everything from scratch, but it should be possible to do better.
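To fix ideas, here is a minimal Python sketch of the basic setup: two illustrative metrics, a weighted-sum combination (one common way to merge distances), and a brute-force linear-scan query. The metrics, weight, and example data are invented for illustration, not choices made by the project.

    import math

    # Two illustrative metrics on points represented as tuples of floats.
    def euclidean(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    def manhattan(p, q):
        return sum(abs(a - b) for a, b in zip(p, q))

    # A weighted sum of metrics (with nonnegative weights) is again a metric,
    # which is what makes this kind of combination well defined.
    def combined(p, q, w=0.5):
        return w * euclidean(p, q) + (1 - w) * manhattan(p, q)

    def nearest_neighbor(query, points, dist):
        """The naive approach: one linear scan, O(n) distance evaluations."""
        return min(points, key=lambda p: dist(query, p))

    data = [(0.0, 0.0), (1.0, 2.0), (3.0, 1.0)]
    print(nearest_neighbor((0.9, 1.8), data, combined))   # -> (1.0, 2.0)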
Students will learn about metric tree data structures, which extend binary search trees into higher dimensions, and how these structures are implemented and analyzed.
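For concreteness, one classic metric tree is the vantage-point tree. Below is a minimal sketch under illustrative assumptions (Euclidean distance, a median-radius split); the project will not necessarily use this exact structure.

    class VPNode:
        def __init__(self, point, radius, inside, outside):
            self.point, self.radius = point, radius
            self.inside, self.outside = inside, outside

    def build(points, dist):
        """Build a vantage-point tree: pick a vantage point, then split the
        rest by the median distance to it (the metric analogue of a BST pivot)."""
        if not points:
            return None
        vp, rest = points[0], points[1:]
        if not rest:
            return VPNode(vp, 0.0, None, None)
        dists = [dist(vp, p) for p in rest]
        radius = sorted(dists)[len(dists) // 2]
        inside = [p for p, d in zip(rest, dists) if d <= radius]
        outside = [p for p, d in zip(rest, dists) if d > radius]
        return VPNode(vp, radius, build(inside, dist), build(outside, dist))

    def query(node, q, dist, best=(None, float("inf"))):
        """Nearest neighbor of q; the triangle inequality justifies skipping
        the far side whenever the query ball cannot cross the split boundary."""
        if node is None:
            return best
        d = dist(q, node.point)
        if d < best[1]:
            best = (node.point, d)
        near, far = ((node.inside, node.outside) if d <= node.radius
                     else (node.outside, node.inside))
        best = query(near, q, dist, best)
        if abs(d - node.radius) <= best[1]:
            best = query(far, q, dist, best)
        return best

    dist = lambda p, q: sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    tree = build([(0.0, 0.0), (1.0, 2.0), (3.0, 1.0), (2.0, 2.0)], dist)
    print(query(tree, (0.9, 1.8), dist))   # -> ((1.0, 2.0), ~0.224)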
Then, the students will work with faculty and graduate students to design and analyze algorithms that combine these structures.
We will also implement the algorithms in Python and evaluate them experimentally.
Web: https://www.ise.ncsu.edu/people/arescobe/
The objective of this research is to make optimization software more reliable. Optimization helps solve real-world problems in fields like science, engineering, and business, but the software commonly used in practice often gives inconsistent results due to numerical issues. To address these challenges, specialized solvers equipped with mixed-precision algorithms have been developed. This project will explore such open-source software in the context of mixed-integer linear programming. Students will work on modifying the solvers' underlying algorithms to balance efficiency and accuracy, and then test the improvements by benchmarking the updated software on a variety of real-world optimization problems across different domains. Tasks will include learning about the underlying algorithms and their implementation within the software, identifying computational bottlenecks, developing and coding improved algorithmic subroutines, and analyzing the impact of these modifications.
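As one illustration of the efficiency/accuracy tradeoff, the sketch below shows classical mixed-precision iterative refinement for a linear system using NumPy/SciPy. The solvers studied in the project implement far more sophisticated variants; the matrix here is random toy data.

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    def mixed_precision_solve(A, b, iters=5):
        """Iterative refinement: do the O(n^3) factorization once in cheap
        float32, then drive down the error with float64 residuals, which
        cost only O(n^2) per step."""
        lu = lu_factor(A.astype(np.float32))
        x = lu_solve(lu, b.astype(np.float32)).astype(np.float64)
        for _ in range(iters):
            r = b - A @ x                                  # accurate residual
            x += lu_solve(lu, r.astype(np.float32)).astype(np.float64)
        return x

    rng = np.random.default_rng(0)
    A = rng.standard_normal((200, 200))
    b = rng.standard_normal(200)
    x = mixed_precision_solve(A, b)
    print(np.linalg.norm(b - A @ x))   # far smaller than a pure-float32 solve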
Machine learning methods, particularly neural networks, are increasingly applied to scientific and engineering problems governed by physical laws, yet standard training procedures often rely on loss functions and numerical operations that are poorly conditioned and weakly correlated with the true solution error. These issues can lead to unstable training and inefficient or unreliable results. This project will build on recent advances in robust variational physics-informed neural networks (RVPINNs), which reformulate neural network training using variational principles and residual-based error estimators. A key computational component of these methods is the repeated solution of structured linear systems—such as Gram matrix problems arising from Riesz representations of residuals—that strongly influence both stability and computational cost. The project will explore open-source implementations of RVPINNs and related physics-informed learning frameworks, focusing on improving the underlying numerical linear algebra subroutines. Prior coursework or experience in numerical methods, particularly finite element methods, is a plus.
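To make the computational kernel concrete: in RVPINN-style methods the loss at each step involves a quantity of the form r^T G^{-1} r for a fixed Gram matrix G, so factoring G once and reusing the factorization is the kind of linear-algebra improvement the project targets. The sketch below uses SciPy's Cholesky routines on a random symmetric positive definite stand-in for G; all names and sizes are illustrative.

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    # Hypothetical stand-in for the Gram matrix of a fixed test-function
    # basis: symmetric positive definite and constant across training steps.
    rng = np.random.default_rng(0)
    B = rng.standard_normal((50, 50))
    G = B @ B.T + 50 * np.eye(50)

    chol = cho_factor(G)                 # factor once: O(n^3)

    def residual_norm_sq(r):
        """Dual norm r^T G^{-1} r of a residual vector, reusing the cached
        Cholesky factor so each training step costs only O(n^2)."""
        return float(r @ cho_solve(chol, r))

    for step in range(3):                # stand-in for a training loop
        r = rng.standard_normal(50)      # stand-in for the tested PDE residual
        print(step, residual_norm_sq(r))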
Website: https://zguo32.wordpress.ncsu.edu/
Plant biosynthetic gene clusters (BGCs) encode pathways for specialized metabolites that are important for plant defense, signaling, and potential pharmaceutical applications. Unlike microbial BGCs, plant BGCs are more fragmented, sparsely annotated, and embedded in large, complex genomes. In this project, we will build on an existing pipeline that represents plant genomes as sequences of Pfam protein domains and scores candidate BGC loci. The main research question is how to use modern Transformer-based sequence models to better capture long-range context and to classify plant BGCs by functional type (e.g., specialized vs primary metabolism, or putative pathway classes), under weak and noisy supervision from GO/KEGG and microbial reference data.
Depending on background and interest, the REU student may: (i) help prepare and organize plant genomic data into token sequences (Pfam domains or gene-level units) suitable for Transformer models; (ii) implement and experiment with Transformer architectures for tasks such as BGC vs non-BGC detection and BGC type classification; and (iii) design evaluation metrics and visualization tools (e.g., attention maps, motif analyses) to interpret what the model learns about plant BGC organization. The project sits at the intersection of algorithms, machine learning, and computational biology. Prior coursework in data structures/algorithms and programming experience in Python are expected; familiarity with basic machine learning is helpful but can be developed during the project.
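For orientation, here is a minimal PyTorch sketch of item (ii): an encoder-only Transformer that classifies a sequence of Pfam-domain tokens as BGC vs non-BGC. The vocabulary size, model dimensions, and two-class head are placeholder choices, and positional encodings are omitted for brevity.

    import torch
    import torch.nn as nn

    class DomainTransformer(nn.Module):
        """Toy encoder-only classifier over Pfam-domain token sequences.
        All hyperparameters are placeholders, not values from the project."""
        def __init__(self, vocab_size=8000, d_model=128, nhead=4, num_layers=2):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model, padding_idx=0)
            layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers)
            self.head = nn.Linear(d_model, 2)

        def forward(self, tokens):           # tokens: (batch, seq_len) int IDs
            mask = tokens == 0               # ignore padding positions
            h = self.encoder(self.embed(tokens), src_key_padding_mask=mask)
            h = h.masked_fill(mask.unsqueeze(-1), 0.0).sum(1)
            h = h / (~mask).sum(1, keepdim=True).clamp(min=1)  # mean pooling
            return self.head(h)              # logits: BGC vs non-BGC

    model = DomainTransformer()
    batch = torch.randint(1, 8000, (4, 64))  # 4 fake loci, 64 domain tokens each
    print(model(batch).shape)                # torch.Size([4, 2])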
Verifiable & Real-Time-Safe Control on F1Tenth
This project involves autonomous driving and racing using F1Tenth: a tenth-scale autonomous racing platform. We use racing as an avenue to explore the intersection of real-time systems, control, and verifiable safety. To address the speed/safety tradeoff, we're combining runtime assurance with time-aware solvers to ensure that the vehicle is always in a safe state while preventing failures due to schedule overruns. To support real-world use, we are developing robust switching logic that preserves formal safety guarantees under dynamic environments and uncertainty. We intend for this framework to be used with both classic control methods as well as black-box reinforcement learning models. Relevant tasks and areas include formal modeling, online and offline verification, scheduling theory, parallel programming, and reinforcement learning. We develop and test on both physical cars and in simulation, and we leverage the Linux scheduler (SCHED_DEADLINE) and ROS (Robot Operating System).
The student should have experience with C/C++, Python, and Linux.
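To illustrate the switching idea (not the project's actual implementation), here is a minimal simplex-style runtime-assurance sketch in Python; the controllers, one-step predictor, and safety margin are all hypothetical placeholders.

    SAFE_DISTANCE = 0.5   # placeholder safety margin (meters)

    def is_safe(predicted_state):
        """Placeholder check; verified logic would bound reachable states."""
        return predicted_state["obstacle_gap"] > SAFE_DISTANCE

    def performant_controller(state):
        return {"steer": 0.1, "throttle": 0.9}   # fast, unverified (e.g., RL)

    def safe_controller(state):
        return {"steer": 0.0, "throttle": 0.0}   # conservative, verified

    def predict(state, action):
        """Placeholder one-step model of how an action changes the state."""
        return {"obstacle_gap": state["obstacle_gap"] - action["throttle"] * 0.1}

    def control_step(state, deadline_missed):
        # Fall back whenever the fast controller's action looks unsafe OR the
        # real-time scheduler reports an overrun (the time-aware part).
        action = performant_controller(state)
        if deadline_missed or not is_safe(predict(state, action)):
            action = safe_controller(state)
        return action

    print(control_step({"obstacle_gap": 0.55}, deadline_missed=False))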
Web: https://www.csc.ncsu.edu/people/rychirko
Chirkova's research group conducts foundational research on enabling the use of data and knowledge to accelerate data-driven decisions. The research outcomes have broad applicability to systems that focus on data processing and/or on enhancing the quality of data and knowledge. Currently, the team is getting ready to start work on an NSF project dedicated to building a prototype open knowledge network (Proto-OKN), with the project kick-off date scheduled for October 2023. The role of Chirkova's team in the project is to develop prototype algorithms for creating the interconnecting technical “fabric” needed to link the knowledge graphs connected within the Proto-OKN project. The rest of the multi-university Proto-OKN team will test, revise, and refine these prototype algorithms and deploy the outcomes as project deliverables. Within this project, REU students working with Chirkova's team can focus on (i) studying the state of the art for connecting knowledge-graph data, (ii) implementing the most promising algorithms from the literature, and (iii) suggesting and discussing with the rest of the team changes to the implementations that would allow the team to use the algorithms not only as baseline approaches but also as a source of ideas for advancing the state of the art. This work will give REU students the experience of working on a multi-university research team focused on connecting knowledge graphs for the envisioned Proto-OKN prototype open knowledge network.
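To give a flavor of the baseline end of item (ii), the toy sketch below proposes cross-graph links by string similarity between entity labels; the two miniature "knowledge graphs" and the threshold are invented for illustration, and real linking algorithms exploit much richer graph context.

    from difflib import SequenceMatcher

    # Toy knowledge graphs: entity ID -> human-readable label.
    # Real Proto-OKN graphs would be RDF with far richer structure.
    kg_a = {"a1": "North Carolina State University", "a2": "Raleigh, NC"}
    kg_b = {"b1": "NC State University", "b2": "City of Raleigh"}

    def link_entities(kg1, kg2, threshold=0.6):
        """Baseline label matching: propose a cross-graph 'sameAs' link when
        normalized string similarity of the labels exceeds a threshold."""
        links = []
        for id1, lab1 in kg1.items():
            for id2, lab2 in kg2.items():
                score = SequenceMatcher(None, lab1.lower(), lab2.lower()).ratio()
                if score >= threshold:
                    links.append((id1, id2, round(score, 2)))
        return links

    print(link_entities(kg_a, kg_b))   # finds the university pair only

Note that this baseline already misses the Raleigh pair, which is exactly why the more sophisticated algorithms from the literature matter.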
Website: https://chinholee.github.io/
Pseudorandomness
Randomness plays a central role in algorithms, cryptography, and complexity theory, yet truly random bits are often expensive or unavailable in practice. Pseudorandomness studies how to efficiently construct deterministic objects that behave like random ones for a wide range of computational tasks. This REU project will introduce students to the theory of pseudorandomness and its deep connections to computational complexity, coding theory, and combinatorics.
Depending on students’ backgrounds and interests, projects may focus on developing randomness-efficient algorithms, constructing pseudorandom generators for specific computational models (such as small-space or low-depth circuits), or exploring applications to derandomization and hardness-versus-randomness tradeoffs.
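As a small self-contained taste of the derandomization theme (an illustrative exercise, not project code), the Python sketch below expands an O(log n)-bit seed into pairwise-independent bits and uses it to deterministically find a cut of size at least m/2 in a graph with m edges.

    def inner(a, b):
        """GF(2) inner product of the bit vectors of integers a and b."""
        return bin(a & b).count("1") % 2

    def prg(seed, n):
        """Expand a short seed into n pairwise-independent bits: bit i is
        the inner product of the seed with the distinct nonzero vector i+1."""
        return [inner(seed, i + 1) for i in range(n)]

    def derandomized_max_cut(n, edges):
        """Each edge is cut with probability exactly 1/2 over a random seed,
        so the average cut is m/2 and the best of the ~2n seeds is >= m/2."""
        k = (n + 1).bit_length()          # seed length ~ log2(n)
        return max((prg(seed, n) for seed in range(2 ** k)),
                   key=lambda side: sum(side[u] != side[v] for u, v in edges))

    edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
    side = derandomized_max_cut(4, edges)
    print(side, sum(side[u] != side[v] for u, v in edges))   # cut size >= 3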
The project emphasizes the development of core proof techniques in theoretical computer science, including basic Fourier analysis on Boolean functions and combinatorial arguments. Students will gain experience reading research papers, formulating precise definitions, and working toward open-ended research questions. The goal of the project is to prepare participants for graduate study in theoretical computer science while contributing to ongoing research in pseudorandomness.