Mission Statement
Our mission is to empower genomics researchers with an intelligent LLM designed to streamline and simplify their workflows.
In genomics research today, there are hundreds of specialized tools, each built for a specific step in the research process. While these tools are powerful, they are often hard to use and have steep learning curves. This slows research and makes it more cumbersome, as researchers must spend considerable time just learning how to operate the tools.
"I had to fly to Arizona for three days to learn how to use another tool”
-Microbiologist
"The tool isn’t user-friendly; it takes me a couple of iterations to get it right"
-Researcher
As a proof of concept for the feasibility of this vision, we've engineered a model capable of performing select genomics research workflows powered by natural language processing.
The GUE dataset was used throughout the project as the sole training and evaluation source. The dataset comes directly from the DNABERT-2 paper and includes sequence data from humans, mice, yeast, and fungi. GUE is largely balanced across all tasks and uses a fixed sequence length for each task, ranging from 70 to 1,000 base pairs. We primarily used four binary classification tasks (transcription factor prediction in mice, transcription factor prediction in humans, promoter site prediction, and epigenetic mark prediction), as well as two multi-class classification tasks (splice site prediction and COVID variant prediction).
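To make the data setup concrete, here is a minimal sketch of loading one GUE task split, assuming the splits are distributed as CSV files with sequence and label columns; the file path and column names below are illustrative, not a guaranteed layout of the release.

```python
# Minimal sketch of loading one GUE task split.
# Assumes CSV files with "sequence" and "label" columns (paths and column
# names are illustrative assumptions).
import pandas as pd
from torch.utils.data import Dataset


class GUETask(Dataset):
    def __init__(self, csv_path):
        df = pd.read_csv(csv_path)
        self.sequences = df["sequence"].tolist()  # fixed-length DNA strings (70-1,000 bp)
        self.labels = df["label"].tolist()        # integer class ids

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        return self.sequences[idx], self.labels[idx]


# Example: one binary task split (hypothetical path)
# train_set = GUETask("GUE/tf_human_0/train.csv")
```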
The model combines a state-of-the-art DNA encoder, DNABERT-2, with an LLM decoder to take in sequence data and questions and output sensible, English-language responses. To connect the two models, it leverages a Querying Transformer (Q-Former), a lightweight transformer trained to attend to the encoder's outputs, extract the most salient information, and project it into vectors usable by the LLM.
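The sketch below illustrates the idea of that bridge: a small set of learned query vectors cross-attends to the frozen encoder's hidden states and is projected into the LLM's embedding space. The real Q-Former stacks several such blocks and shares weights with a language head, so the dimensions, layer count, and module names here are simplifying assumptions rather than the project's exact architecture.

```python
# Simplified sketch of the bridge between DNABERT-2 and the LLM decoder.
# Learned queries cross-attend to encoder states, then get projected into the
# LLM's token-embedding space to act as a "soft prompt".
import torch
import torch.nn as nn


class DNAQFormerBridge(nn.Module):
    def __init__(self, enc_dim=768, llm_dim=4096, num_queries=32, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, enc_dim))
        self.cross_attn = nn.MultiheadAttention(enc_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(enc_dim, llm_dim)  # map queries into the LLM embedding space

    def forward(self, encoder_states):
        # encoder_states: (batch, seq_len, enc_dim) hidden states from DNABERT-2
        batch = encoder_states.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        attended, _ = self.cross_attn(q, encoder_states, encoder_states)
        return self.proj(attended)  # (batch, num_queries, llm_dim)


# The projected query vectors are prepended to the embedded question tokens
# before being fed to the LLM decoder.
```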
We evaluated various models with the goal of approaching the performance of the DNABERT-2 encoder across genomics tasks, focusing on the GUE dataset. This included binary classification tasks such as transcription factor prediction in mice and humans, promoter site prediction, and epigenetic mark prediction, as well as multi-class classification tasks such as splice site prediction and COVID variant prediction. Predictions were made by selecting the candidate answer with the lowest (top-1) perplexity, scored with the Matthews correlation coefficient (MCC) for binary tasks and F1 for multi-class tasks, and compared against the standalone DNABERT-2 results.
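A hedged sketch of that evaluation step is shown below: the model scores each candidate answer string, the one with the lowest perplexity (equivalently, lowest mean cross-entropy) becomes the prediction, and standard scikit-learn metrics are computed over the resulting labels. The macro averaging for F1 is an assumption, not a detail stated above.

```python
# Sketch of perplexity-based prediction and metric computation.
# losses_per_candidate holds one mean cross-entropy loss per candidate answer
# string; lower loss means lower perplexity, so argmin gives the prediction.
import torch
from sklearn.metrics import matthews_corrcoef, f1_score


def choose_by_perplexity(losses_per_candidate):
    return int(torch.tensor(losses_per_candidate).argmin())


def score(preds, labels, binary=True):
    if binary:
        return matthews_corrcoef(labels, preds)      # MCC for binary tasks
    return f1_score(labels, preds, average="macro")  # F1 for multi-class tasks (macro is assumed)
```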
We first measured the performance of the standalone, out-of-the-box LLM. We then trained the LLM together with DNABERT-2, with only a linear projection between the encoder and decoder being trained. Both the standalone LLM and the linear-projection model failed to infer any genomic information, with F1 and MCC values at or near 0 for all tasks. This highlighted the need for more advanced techniques to transfer genomic syntax knowledge. We then tested a Q-Former model and a parameter-efficient fine-tuning (LoRA) model, each trained and evaluated on a single genomic classification task. We found that LoRA generally performed better, at the cost of longer training time.
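For the LoRA variant, a minimal sketch of attaching low-rank adapters to the LLM decoder with the PEFT library looks like the following; the decoder checkpoint, rank, alpha, and target modules are illustrative choices, not the project's exact configuration.

```python
# Sketch of parameter-efficient fine-tuning (LoRA) on the LLM decoder.
# Checkpoint name and hyperparameters below are placeholders.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder decoder

lora_cfg = LoraConfig(
    r=8,                                  # low-rank dimension (assumed)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections only (assumed)
    task_type="CAUSAL_LM",
)

llm = get_peft_model(llm, lora_cfg)
llm.print_trainable_parameters()  # only the low-rank adapters are updated
```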
We then trained the Q-Former model and the LoRA model on multiple binary tasks at once and evaluated them on all of the tasks they were trained on. Training the Q-Former in this manner removed any syntactic genomic language knowledge the model had acquired. Training the LoRA model on multiple binary tasks simultaneously, by contrast, significantly improved performance, nearing encoder-level metrics for epigenetic mark prediction, promoter prediction, and transcription factor prediction. Overall, our study demonstrates the feasibility of transferring DNA encoder knowledge to LLMs using multi-modal vision-language architectures, allowing a single model to handle multiple genomics tasks effectively.
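One way to realize that multi-task setup is to pool examples from several binary GUE tasks into a single shuffled training stream, pairing each sequence with a task-specific question so one model sees all tasks at once. The prompts and task names in this sketch are illustrative assumptions, and `task_datasets` stands in for datasets like the `GUETask` loader sketched earlier.

```python
# Sketch of mixing several binary tasks into one training stream.
# task_datasets maps a task name to an iterable of (sequence, label) pairs,
# e.g. the GUETask datasets sketched above (names and prompts are illustrative).
from torch.utils.data import ConcatDataset, DataLoader

PROMPTS = {
    "promoter":   "Is this sequence a promoter region? Answer yes or no.",
    "tf_human":   "Does this human sequence contain a transcription factor binding site?",
    "epigenetic": "Does this sequence carry the epigenetic mark of interest?",
}


def with_prompt(dataset, task):
    # Wrap each (sequence, label) pair with its task-specific question.
    return [(seq, PROMPTS[task], label) for seq, label in dataset]


mixed = ConcatDataset([with_prompt(ds, task) for task, ds in task_datasets.items()])
loader = DataLoader(mixed, batch_size=16, shuffle=True)  # shuffling interleaves the tasks
```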
Team
Immanuel Abdi
Matthew Mollerus
Jason Rudianto
Resources