One of the most important techniques in data analysis is nearest neighbor search. The idea is simple enough: I have a set of data points, each with some known property. When a new data point comes along, I guess that its property will match that of the closest point in my input set. The computational problem is therefore to find the closest data point to a given query point. This is a classic and very well-studied problem for which many data structures are known. One fascinating observation from modern AI systems is that there are many different ways to represent data, and these lead to different notions of distance. With several ways of measuring distance, we can have several data structures, and there are several ways of combining multiple distances into a single distance. The goal will be to efficiently combine the data structures for the individual distances into a larger structure that represents the combined distance. The naive approach would be to reconstruct everything from scratch, but it should be possible to do better.
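As a point of reference, here is a minimal Python sketch of the naive approach: a linear scan under a combined distance. The two component metrics (Euclidean and Manhattan) and the weighted-sum combination are illustrative assumptions, not the project's chosen setup.

```python
# Minimal sketch: naive nearest-neighbor search under a combined distance.
# The two component metrics (Euclidean, Manhattan) and the weighted-sum
# combination are illustrative assumptions, not the project's final choice.
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def combined(p, q, w=0.5):
    # One simple way to merge two metrics: a convex combination.
    return w * euclidean(p, q) + (1 - w) * manhattan(p, q)

def nearest_neighbor(points, labels, query, dist=combined):
    # Linear scan: the baseline that tree structures aim to beat.
    best = min(range(len(points)), key=lambda i: dist(points[i], query))
    return labels[best]

data = [(0.0, 0.0), (1.0, 1.0), (3.0, 2.0)]
labels = ["red", "red", "blue"]
print(nearest_neighbor(data, labels, (2.5, 2.0)))  # -> "blue"
```

The data structures studied in this project aim to answer the same query without examining every point.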
Students will learn about metric tree data structures that extend binary search trees into higher dimensions. They will learn how these are implemented and analyzed.
Then, the students will work with faculty and graduate students to design and analyze algorithms that combine these structures.
We will also look into implementations in Python and experimental evaluation.
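To give a flavor of what a metric tree looks like, here is a compact vantage-point (VP) tree sketch in Python. The Euclidean metric and median-radius split are standard textbook choices used only for illustration; they are not necessarily the structures or parameters the project will settle on.

```python
# Compact vantage-point (VP) tree sketch: one standard metric-tree variant.
# The metric and median split are textbook defaults, used here only to
# illustrate how the binary-search-tree idea generalizes to metric spaces.
import math
from statistics import median

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

class VPNode:
    def __init__(self, points):
        self.vantage = points[0]
        rest = points[1:]
        if not rest:
            self.radius, self.inside, self.outside = 0.0, None, None
            return
        ds = [dist(self.vantage, p) for p in rest]
        self.radius = median(ds)
        inner = [p for p, d in zip(rest, ds) if d <= self.radius]
        outer = [p for p, d in zip(rest, ds) if d > self.radius]
        self.inside = VPNode(inner) if inner else None
        self.outside = VPNode(outer) if outer else None

def search(node, query, best=None):
    # Recursively visit subtrees, pruning balls that cannot hold a closer point.
    if node is None:
        return best
    d = dist(query, node.vantage)
    if best is None or d < best[0]:
        best = (d, node.vantage)
    if d <= node.radius:
        best = search(node.inside, query, best)
        if d + best[0] > node.radius:      # outer ball may still hold a closer point
            best = search(node.outside, query, best)
    else:
        best = search(node.outside, query, best)
        if d - best[0] <= node.radius:     # inner ball may still hold a closer point
            best = search(node.inside, query, best)
    return best

tree = VPNode([(0, 0), (1, 1), (3, 2), (5, 5), (2, 4)])
print(search(tree, (2.6, 2.1)))  # -> (approx 0.41, (3, 2))
```

The same skeleton carries over to other metrics, which is what makes these trees a natural starting point for the combined distances described above.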
Web: https://www.ise.ncsu.edu/people/arescobe/
The objective of this research is to make optimization software more reliable. Optimization helps solve real-world problems in fields like science, engineering, and business, but the software commonly used in practice often gives inconsistent results due to numerical issues. To address these challenges, specialized solvers equipped with mixed-precision algorithms have been developed. This project will explore such open-source software in the context of mixed-integer linear programming. Students will work on modifying the underlying algorithms to balance efficiency and accuracy and then test the improvements by benchmarking the updated software on a variety of real-world optimization problems across different domains. Tasks will include learning about the underlying algorithms and their implementation within the software, identifying computational bottlenecks, developing and coding improved algorithmic subroutines, and analyzing the impact of the resulting modifications.
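As a toy illustration of the mixed-precision idea, the sketch below applies iterative refinement to a dense linear system: a cheap low-precision solve is corrected using residuals computed in higher precision. This is only an analogy for intuition; the project targets the internals of open-source mixed-integer programming solvers, not this standalone routine.

```python
# Toy illustration of the mixed-precision idea (iterative refinement) on a
# dense linear system. Real MIP solvers apply related techniques inside their
# LP relaxation and factorization code; this is only an analogy.
import numpy as np

def mixed_precision_solve(A, b, iters=3):
    A32, b32 = A.astype(np.float32), b.astype(np.float32)
    x = np.linalg.solve(A32, b32).astype(np.float64)    # cheap low-precision solve
    for _ in range(iters):
        r = b - A @ x                                    # residual in float64
        dx = np.linalg.solve(A32, r.astype(np.float32))  # correction, low precision
        x = x + dx.astype(np.float64)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50)) + 50 * np.eye(50)      # well-conditioned example
x_true = rng.standard_normal(50)
b = A @ x_true

x32 = np.linalg.solve(A.astype(np.float32), b.astype(np.float32))
x_mp = mixed_precision_solve(A, b)
print("float32 only error:", np.max(np.abs(x32 - x_true)))
print("mixed precision err:", np.max(np.abs(x_mp - x_true)))
```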
Automated algorithms are increasingly used to make important decisions across various domains, but they face growing distrust due to issues like bias and lack of interpretability. To address these challenges, there is rising interest in integrating the collective intelligence of diverse groups (i.e., “the wisdom of the crowd”) into decision-making processes. Hybrid systems that combine crowd wisdom with machine learning (ML) could leverage the complementary strengths of human judgment and computational power to create more trustworthy and effective solutions. This project focuses on developing algorithms for creating and deploying such systems. Tasks will include deploying ML algorithms, developing and coding new algorithms that aggregate human inputs and integrate crowd wisdom into the decision-making pipeline, and designing intuitive online activity interfaces.
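One simple aggregation rule, sketched below, blends weighted crowd votes with a model prediction. The per-worker reliabilities and the model's weight are illustrative placeholders, since the project would estimate or learn such quantities from data.

```python
# Minimal sketch: blend crowd votes with a model prediction via weighted voting.
# The weights (per-worker reliability, model weight) are illustrative
# assumptions; the project would learn or estimate them from data.
from collections import defaultdict

def aggregate(crowd_votes, worker_reliability, model_label, model_weight=2.0):
    """crowd_votes: {worker_id: label}; returns the label with the largest
    total weight across workers and the ML model."""
    scores = defaultdict(float)
    for worker, label in crowd_votes.items():
        scores[label] += worker_reliability.get(worker, 0.5)
    scores[model_label] += model_weight
    return max(scores, key=scores.get)

votes = {"w1": "approve", "w2": "reject", "w3": "approve"}
reliability = {"w1": 0.9, "w2": 0.6, "w3": 0.7}
print(aggregate(votes, reliability, model_label="reject"))  # -> "reject"
```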
Website: https://zguo32.wordpress.ncsu.edu/
Cancer remains one of the leading causes of mortality worldwide, highlighting the urgent need for innovative tools to accelerate drug development and enable personalized treatment strategies. This project designs deep learning algorithms that integrate genomic profiles, drug molecular features, and clinical data into a robust multimodal predictive framework. The methodology incorporates a Transformer-based algorithm framework and a CNN enhanced with attention layers to effectively capture cancer cell line features and drug molecular characteristics. This AI-driven framework not only enhances predictive accuracy for anti-cancer drug response but also establishes a foundation for broader applications in precision oncology and drug discovery. The tasks include: (1) Algorithm Development: Collaborate in designing and optimizing deep learning architectures, including Transformers and CNNs, with a focus on efficiently encoding genomic and chemical information; experiment with model configurations to improve performance on multimodal datasets. (2) Data Preprocessing and Feature Design: Develop pipelines for data filtering, preprocessing genomic data, and selecting biologically relevant features to enhance model training.
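A compact PyTorch sketch of the multimodal idea appears below: one Transformer-style encoder for genomic profiles, a small CNN for drug feature vectors, and a late-fusion prediction head. The input sizes, layer widths, and fusion rule are placeholder assumptions rather than the project's final architecture, and the attention-enhanced CNN is simplified here to a plain convolutional stack.

```python
# Compact PyTorch sketch of a multimodal drug-response predictor. Input sizes,
# layer widths, and the late-fusion head are placeholder assumptions, not the
# project's final architecture.
import torch
import torch.nn as nn

class DrugResponseModel(nn.Module):
    def __init__(self, n_genes=1000, drug_feat_dim=256, d_model=128):
        super().__init__()
        # Transformer-style encoder over a projected gene-expression profile.
        self.gene_proj = nn.Linear(n_genes, d_model)
        self.gene_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
            num_layers=2)
        # 1-D CNN over drug molecular feature vectors (e.g., fingerprints).
        self.drug_cnn = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(d_model // 16))
        self.head = nn.Sequential(
            nn.Linear(d_model + 16 * (d_model // 16), 64), nn.ReLU(),
            nn.Linear(64, 1))  # predicted response (e.g., a sensitivity score)

    def forward(self, gene_expr, drug_feats):
        g = self.gene_proj(gene_expr).unsqueeze(1)   # (B, 1, d_model)
        g = self.gene_encoder(g).squeeze(1)          # (B, d_model)
        d = self.drug_cnn(drug_feats.unsqueeze(1))   # (B, 16, d_model // 16)
        d = d.flatten(1)
        return self.head(torch.cat([g, d], dim=1))

model = DrugResponseModel()
out = model(torch.randn(8, 1000), torch.randn(8, 256))
print(out.shape)  # torch.Size([8, 1])
```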
A deep learning genome-mining strategy for PlantBGC Prediction.
This project aims to develop novel algorithmic approaches utilizing deep learning for the detection and annotation of biosynthetic gene clusters (BGCs) in plants. BGCs play a critical role in the production of bioactive compounds. Due to the complexity of plant genomes and the diversity of BGC structures, traditional computational biology methods face challenges in scalability and accuracy. Our approach focuses on designing efficient language-model-based algorithms and machine learning models. These algorithms leverage advanced pattern matching, sequence alignment optimization, and graph traversal techniques to accurately identify BGCs from genomic sequences. The tasks include: (1) Algorithm Development: Assist in designing and implementing BGC detection algorithms based on plant genomic sequences, focusing on improving accuracy and efficiency. (2) Machine Learning Integration: Develop and fine-tune models to predict and classify BGCs using genomic and proteomic data, applying state-of-the-art machine learning techniques. (3) Data Analysis and Performance Evaluation: Preprocess large genomic datasets and benchmark the developed algorithms against existing tools, emphasizing runtime efficiency and prediction accuracy.
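As a rough baseline for intuition, the sketch below scans an ordered list of gene domain annotations and flags windows dense in biosynthesis-associated domains, which is the flavor of signal many BGC detectors exploit. The domain set, window size, and threshold are invented placeholders, not the trained models the project will build.

```python
# Toy sketch of sliding-window BGC candidate detection over an ordered list of
# gene domain annotations. The "biosynthetic" domain set, window size, and
# threshold are illustrative placeholders, not the project's trained model.
BIOSYNTHETIC_DOMAINS = {"PKS", "NRPS", "terpene_synthase", "P450"}

def candidate_windows(gene_domains, window=10, min_hits=3):
    """gene_domains: list of domain labels in chromosomal order.
    Returns (start, end) index ranges that look like BGC candidates."""
    hits = [d in BIOSYNTHETIC_DOMAINS for d in gene_domains]
    candidates = []
    for start in range(0, max(1, len(hits) - window + 1)):
        if sum(hits[start:start + window]) >= min_hits:
            candidates.append((start, start + window))
    return candidates

genes = ["kinase", "PKS", "P450", "transporter", "NRPS", "kinase",
         "ribosomal", "transporter", "kinase", "kinase", "kinase", "kinase"]
print(candidate_windows(genes))  # windows covering the PKS/P450/NRPS stretch
```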
Web: https://www.csc.ncsu.edu/people/rychirko
Chirkova's research group has been doing foundational research in enabling the use of data and knowledge to accelerate data-driven decisions. The research outcomes have broad applicability to systems that focus on data processing and/or on enhancing the quality of data and knowledge. Currently, the team is getting ready to start working on an NSF project dedicated to building a prototype open knowledge network (Proto-OKN), with the project kick-off date scheduled for October 2023. The role of Chirkova's team in the project is the development of prototype algorithms for creating the interconnecting technical “fabric” needed to link the knowledge graphs to be connected within the Proto-OKN project. The rest of the multi-university Proto-OKN team will be testing, revising, and refining these prototype algorithms, and deploying the outcomes as project deliverables. Within this project, REU students working with Chirkova's team can focus on (i) studying the state of the art for connecting knowledge-graph data, (ii) implementing the most promising algorithms from the literature, and (iii) suggesting and discussing with the rest of the team changes to the implementations that would allow the team to use the algorithms not only as baseline approaches but also as a source of ideas for advancing the state of the art. These directions of work will give the REU students the experience of working on a multi-university research team focused on connecting knowledge graphs for the envisioned Proto-OKN prototype open knowledge network.
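For intuition, the sketch below shows one baseline way to propose links between entities in two knowledge graphs, by comparing labels with Jaccard token overlap. The similarity measure and threshold are illustrative stand-ins for the published linking algorithms the students would implement and refine.

```python
# Baseline sketch: link entities across two knowledge graphs by comparing their
# labels with Jaccard token overlap. The threshold and similarity measure are
# illustrative stand-ins for the published linking algorithms the project
# would implement and refine.
def jaccard(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def link_entities(kg_a, kg_b, threshold=0.5):
    """kg_a, kg_b: {entity_id: label}. Returns candidate same-as links."""
    links = []
    for id_a, label_a in kg_a.items():
        for id_b, label_b in kg_b.items():
            score = jaccard(label_a, label_b)
            if score >= threshold:
                links.append((id_a, id_b, round(score, 2)))
    return links

kg_a = {"A:1": "North Carolina State University", "A:2": "Wake County"}
kg_b = {"B:7": "north carolina state university", "B:9": "Durham County"}
print(link_entities(kg_a, kg_b))  # -> [('A:1', 'B:7', 1.0)]
```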
Website: http://www.pengg-robotics.com/
In this project, students will work individually to implement a complete software system for an artificial intelligence (AI)-powered robot. This project provides an opportunity to deepen the understanding of robotic algorithms while also exploring new robotic capabilities available in the Robot Operating System (ROS).
The student is required to implement an autonomous robot control algorithm on a real physical robot using ROS Noetic. The project should focus on enabling the robot to operate autonomously based on sensor inputs, AI-based decision-making, or other advanced robotic techniques.
An omnidirectional Triton robot will be provided for implementation. The robot follows a classic two-level computing architecture:
Nvidia Jetson Nano: Serves as the main computer to execute ROS and AI algorithms, including deep learning-based methods.
Arduino: Handles real-time control tasks and will be treated as a black box in this project.
The robot is equipped with:
RPLIDAR (2D LiDAR)
Intel Realsense D435 (color-depth camera with IMU)
Wheel encoders
LED ring for visual communication
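For orientation, here is a minimal rospy sketch of the kind of sensor-driven control loop the project asks for: stop and turn when the 2D LiDAR reports an obstacle ahead. The topic names (/scan, /cmd_vel) are conventional defaults and the distance threshold is a placeholder; the Triton's actual topics and tuning may differ.

```python
#!/usr/bin/env python3
# Minimal rospy sketch of a sensor-driven control loop for ROS Noetic: stop and
# turn when the 2D LiDAR reports an obstacle ahead. Topic names (/scan,
# /cmd_vel) are conventional defaults and the 0.5 m threshold is a placeholder;
# the Triton's actual configuration may differ.
import math
import rospy
from sensor_msgs.msg import LaserScan
from geometry_msgs.msg import Twist

class SimpleAvoider:
    def __init__(self):
        self.cmd_pub = rospy.Publisher("/cmd_vel", Twist, queue_size=1)
        rospy.Subscriber("/scan", LaserScan, self.on_scan, queue_size=1)

    def on_scan(self, scan):
        # Consider only finite ranges roughly in front of the robot (+/- 30 deg).
        front = [r for i, r in enumerate(scan.ranges)
                 if math.isfinite(r) and
                 abs(scan.angle_min + i * scan.angle_increment) < math.radians(30)]
        cmd = Twist()
        if front and min(front) < 0.5:
            cmd.angular.z = 0.5      # obstacle ahead: rotate in place
        else:
            cmd.linear.x = 0.2       # path clear: move forward slowly
        self.cmd_pub.publish(cmd)

if __name__ == "__main__":
    rospy.init_node("simple_avoider")
    SimpleAvoider()
    rospy.spin()
```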
Website: https://www.csc.ncsu.edu/people/jajenni3
Regular expressions have been the "go to" technique for software developers and sophisticated computer users who need to do pattern-based text search. Like the "goto" statement, regexes should be considered harmful. They are subject to myriad well-documented semantic and performance errors, and Unicode support is generally quite poor. In this project, building on prior work using a PEG-like grammar to define search patterns, we will design optimization algorithms for a pattern compiler. Our compiler transforms a PEG-like grammar into byte-code for a high-performance text-matching virtual machine. As for any compiler optimization, we seek ones that (1) preserve correctness of the input grammar, (2) reduce the CPU time needed in typical use cases, and (3) do not significantly increase CPU time in atypical cases or space requirements in general.
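As a toy illustration of the byte-code approach and of a peephole-style optimization, the sketch below implements a tiny matching virtual machine and a pass that fuses runs of single-character instructions into one literal-string instruction. The instruction set is invented for illustration and is not the project's actual byte-code.

```python
# Toy sketch of a pattern-matching byte-code VM plus one peephole optimization:
# fusing runs of single-character CHAR instructions into a LIT (literal string)
# instruction. The instruction set here is invented for illustration and is not
# the project's actual byte-code.
def run(program, text):
    """Each instruction is a (opcode, argument) pair; returns True on a match
    anchored at the start of `text`."""
    pos = 0
    for op, arg in program:
        if op == "CHAR":
            if pos >= len(text) or text[pos] != arg:
                return False
            pos += 1
        elif op == "LIT":
            if not text.startswith(arg, pos):
                return False
            pos += len(arg)
        elif op == "ANY":
            if pos >= len(text):
                return False
            pos += 1
    return True

def fuse_chars(program):
    """Peephole pass: collapse consecutive CHAR instructions into one LIT."""
    out, run_chars = [], []
    for op, arg in program + [("END", None)]:
        if op == "CHAR":
            run_chars.append(arg)
            continue
        if run_chars:
            out.append(("LIT", "".join(run_chars)) if len(run_chars) > 1
                       else ("CHAR", run_chars[0]))
            run_chars = []
        if op != "END":
            out.append((op, arg))
    return out

prog = [("CHAR", "a"), ("CHAR", "b"), ("CHAR", "c"), ("ANY", None), ("CHAR", "e")]
opt = fuse_chars(prog)
print(opt)                                    # [('LIT', 'abc'), ('ANY', None), ('CHAR', 'e')]
print(run(prog, "abcde"), run(opt, "abcde"))  # True True (same semantics)
```

The optimized program does less per-instruction dispatch while matching exactly the same strings, which is the correctness-preserving, typical-case-speedup trade-off described above.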
The REU students working on this project will:
Help design text search workloads that use Unicode text data (in several languages) as their input. (Note that reading in languages other than English is not required.)
Benchmark our existing codebase on the designed workloads, and extend the benchmarking to our earlier work and to other prior work as well.
Develop an understanding of CPU caching, branch prediction, and SIMD instructions sufficient to design compiler optimizations meant to reduce the CPU time needed for the chosen workloads.
Implement promising optimizations in C99, and benchmark the results to understand the effect of the new algorithm(s).
Time permitting, find the Pareto frontier in the 2-d space of code size and CPU time needed for each workload relative to the aggressiveness of the optimizations employed. This step would include prior work developing parameterized optimizations as well as new work done in the project.
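A small sketch of the Pareto-frontier computation mentioned in the last item, assuming each optimization setting yields one (code size, CPU time) measurement per workload; the numbers below are invented placeholders.

```python
# Small sketch: compute the Pareto frontier over (code size, CPU time) points,
# one point per optimization setting per workload. The data below are invented
# placeholders; real points would come from the benchmarking steps above.
def pareto_frontier(points):
    """points: list of (code_size, cpu_time). Keep points not dominated by any
    other point (dominated = another point is <= in both and < in one)."""
    frontier = []
    for i, (s, t) in enumerate(points):
        dominated = any(
            (s2 <= s and t2 <= t) and (s2 < s or t2 < t)
            for j, (s2, t2) in enumerate(points) if j != i)
        if not dominated:
            frontier.append((s, t))
    return sorted(frontier)

measurements = [(120, 9.1), (150, 7.4), (150, 8.0), (200, 7.3), (250, 7.35)]
print(pareto_frontier(measurements))  # [(120, 9.1), (150, 7.4), (200, 7.3)]
```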