Learning and Predicting Transcriptome Architecture
Abstract:
Linking DNA sequence to genomic function remains one of the grand challenges in genetics and genomics. Here, we combine large-scale single-molecule transcriptome sequencing of diverse cancer cell lines with cutting-edge machine learning to build LoRNA-SH, an RNA foundation model that learns how the nucleotide sequence of unspliced pre-mRNA dictates transcriptome architecture—the relative abundances and molecular structures of mRNA isoforms. Owing to its use of the StripedHyena architecture, LoRNA-SH handles extremely long sequence inputs at base-pair resolution (~65 kilobase pairs), allowing for quantitative, zero-shot prediction of all aspects of transcriptome architecture, including isoform abundance, isoform structure, and the impact of DNA sequence variants on transcript structure and abundance. We anticipate that our public data release and the accompanying frontier model will accelerate many aspects of RNA biotechnology. More broadly, we envision the use of LoRNA-SH as a foundation for fine-tuning of any transcriptome-related downstream prediction task, including cell-type specific gene expression, splicing, and general RNA processing.
Bio:
Dr. Hani Goodarzi is an Arc Core Investigator and Associate Professor at UCSF. His research combines novel discovery platforms and frontier AI models to reveal molecular mechanisms of cancer progression. Dr. Goodarzi has made key scientific contributions at the intersection of AI, RNA biology, and cancer research, spanning both diagnostics and therapeutics. He has also co-founded several companies in this field, including Exai Bio, a next-generation liquid biopsy company, and Vevo Tx, a platform drug-discovery company leveraging AI to design better drugs. His work has been recognized with prestigious awards, including the Vilcek Prize for Creative Promise and the AACR-MPM Transformative Cancer Research Award. Dr. Goodarzi was previously honored with the Martin and Rose Wachtel Award in Cancer Research and named an American Cancer Society scholar.
Summary:
The genome, its transcription into mRNA, and translation into proteins
The same gene can be transcribed into different protein isoforms (sometimes hundreds per gene)
The various isoforms occur at different abundances
Gene sequence: exon-intron-exon-intron-exon-...
The transcription/splicing machinery removes the
Intron segments,
Splices together the exons into a single sequence, and
Then generates the encoded protein
Different proteins:
Subsequences of exons
Some exons may be skipped
Complex dependencies between which exons are included (e.g. if one is included, another is always excluded)
Lots of work has gone into deciphering the behavior of this mechanism; a toy splicing sketch follows below
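As a toy illustration of this combinatorics (not from the talk; the exon sequences below are made up), keeping the first and last exon fixed and optionally skipping internal ones already yields multiple isoforms per gene:

```python
# Toy splicing model: enumerate isoforms produced by skipping any subset of
# internal exons. Exon sequences are hypothetical; real genes have many more
# exons and far richer inclusion/exclusion dependencies.
from itertools import combinations

exons = ["ATGGCC", "GGATTT", "CCTAAA", "TGGTAA"]

def spliced_isoforms(exons):
    """Yield every isoform that keeps the first and last exon and
    includes any subset of the internal exons, in order."""
    internal = range(1, len(exons) - 1)
    for k in range(len(exons) - 1):
        for kept in combinations(internal, k):
            yield exons[0] + "".join(exons[i] for i in kept) + exons[-1]

for iso in spliced_isoforms(exons):
    print(iso)   # 4 isoforms from 2 internal exons (2^2 subsets)
```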
Sequence-to-function mapping
RNA foundation models
Borzoi/Enformer, BigRNA, Nucleotide Transformer (NucT), HyenaDNA
Need a model for joint representation for
Gene sequence
Associated RNA isoform architecture
At nucleotide resolution
Challenge: data availability
There are many more DNA foundation models because far less RNA data is available
Conventional mRNA sequencing works at the granularity of short segments (the mRNA is fragmented and then read)
Hard to infer the full mRNA sequence from these segments
Many possible mRNA sequences per gene due to transcription/splicing variants, exon skips, etc.
Now it's possible to sequence full-length mRNAs (e.g. via PacBio Sequel II)
Used new instruments to do both long-read and short-read scans
Short-read: much higher throughput (~400 base pairs per read)
Long-read: better ability to connect segments (~3.5k base pairs per read); helps stitch short reads into long sequences
Different human cell lines
2 biological replicates
26 cell lines
+ 149 human, 46 mouse long-read samples from various studies
Total ~95M full-length reads
520k unique inferred transcripts
LoRNA-SH model
An RNA foundation model trained on long-read sequencing data
Align each mRNA with its DNA
Encode both, marking the
base pairs from the mRNA as one set of tokens, and
the DNA that is not transcribed into mRNA using different tokens
Different tokens for DNA before/after the transcript and
untranscribed DNA (introns) in the middle of the mRNA
Enables the model to see where transcription starts/ends and where introns are dropped in the middle of the transcript (see the sketch below)
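A rough sketch of this joint encoding, assuming each nucleotide gets a region-dependent token; the region labels and the `tokenize` helper are illustrative, not LoRNA-SH's actual vocabulary:

```python
# Illustrative tokenizer: each base is tagged by where it sits relative to
# the transcript, so a model can see transcription boundaries and introns.
# Region names are assumptions, not LoRNA-SH's real token vocabulary.
def tokenize(dna, exon_spans):
    """Emit (nucleotide, region) pairs for one locus.
    exon_spans: sorted half-open (start, end) intervals of transcribed exons."""
    tx_start, tx_end = exon_spans[0][0], exon_spans[-1][1]
    tokens = []
    for i, base in enumerate(dna):
        if i < tx_start:
            region = "upstream DNA"     # before transcription starts
        elif i >= tx_end:
            region = "downstream DNA"   # after transcription ends
        elif any(s <= i < e for s, e in exon_spans):
            region = "exon"             # retained in the mature mRNA
        else:
            region = "intron"           # transcribed, then spliced out
        tokens.append((base, region))
    return tokens

print(tokenize("AACGTACGTTT", [(2, 5), (7, 9)]))
```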
Model
Layers: Embedding, Hyena (like attention but with sub-quadratic computational cost), Attention
256 latent dimensions
7M parameters
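A minimal PyTorch sketch of such a stack; the depthwise long convolution is a crude stand-in for Hyena's implicit long filters (real StripedHyena blocks differ), and apart from the 256-dimensional latent everything here is an assumption:

```python
# Tiny striped stack: embedding -> gated long-convolution blocks -> attention.
import torch
import torch.nn as nn

class HyenaLikeBlock(nn.Module):
    def __init__(self, dim, kernel_size=128):
        super().__init__()
        # Depthwise conv ~ one long filter per channel, sub-quadratic in length.
        self.conv = nn.Conv1d(dim, dim, kernel_size, groups=dim,
                              padding=kernel_size - 1)
        self.gate = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                       # x: (batch, seq, dim)
        h = self.norm(x)
        conv = self.conv(h.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        return x + conv * torch.sigmoid(self.gate(h))   # gated residual

class TinyStripedModel(nn.Module):
    def __init__(self, vocab_size=16, dim=256, n_hyena=4, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.hyena = nn.ModuleList(HyenaLikeBlock(dim) for _ in range(n_hyena))
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)  # next-token logits

    def forward(self, tokens):                  # tokens: (batch, seq)
        x = self.embed(tokens)
        for block in self.hyena:
            x = block(x)
        mask = torch.triu(torch.ones(x.size(1), x.size(1), dtype=torch.bool,
                                     device=x.device), diagonal=1)  # causal
        x = x + self.attn(x, x, x, attn_mask=mask, need_weights=False)[0]
        return self.head(x)

logits = TinyStripedModel()(torch.randint(0, 16, (2, 64)))
print(logits.shape)   # torch.Size([2, 64, 16])
```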
Training:
Next-token prediction, like a language model
Cross-entropy loss
Evaluated via perplexity (the exponential of the cross-entropy loss)
Model accuracy increases with model size/compute time but there’s a point of diminishing returns where the model overfits
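In code, the objective is the standard language-model recipe (reusing the `TinyStripedModel` sketch above):

```python
# Next-token training objective: predict token t+1 from tokens <= t.
import torch
import torch.nn.functional as F

def lm_loss(model, tokens):
    """Mean cross-entropy over all next-token predictions."""
    logits = model(tokens[:, :-1])              # (batch, T-1, vocab)
    targets = tokens[:, 1:]                     # inputs shifted left by one
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))

tokens = torch.randint(0, 16, (2, 64))
loss = lm_loss(TinyStripedModel(), tokens)
print(loss.item(), torch.exp(loss).item())      # loss, perplexity = exp(loss)
```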
Applications
Can predict the likelihood of a given possible isoform
Zero-shot
Tests on new tasks
Compare the model’s predicted likelihood of an isoform to its abundance in a cell
These are correlated: Pearson correlation ≈ 0.15
Other models’ (HyenaDNA, NucT) probabilities have no correlation
Indicates that the model can track the isoform abundances that occurred in the cells when they were harvested for analysis
Correlations should improve if cell’s state is included in the model’s context
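A hedged sketch of that zero-shot evaluation: score each isoform by its summed next-token log-likelihood and correlate the scores with measured abundance. The model, isoforms, and abundances below are random stand-ins so the sketch runs; they are not the paper's data or API:

```python
import torch
import torch.nn.functional as F
from scipy.stats import pearsonr

@torch.no_grad()
def log_likelihood(model, tokens):
    """Sum of per-position next-token log-probabilities for one sequence."""
    logp = F.log_softmax(model(tokens[:, :-1]), dim=-1)   # (1, T-1, vocab)
    return logp.gather(-1, tokens[:, 1:, None]).sum().item()

model = TinyStripedModel()                                # toy stand-in
isoforms = [torch.randint(0, 16, (1, 64)) for _ in range(50)]
abundances = torch.rand(50).tolist()                      # fake measurements

scores = [log_likelihood(model, iso) for iso in isoforms]
r, _ = pearsonr(scores, abundances)   # the talk reports r ~ 0.15 on real data
print(f"Pearson r = {r:.2f}")
```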
Prediction of trapped exons
Identification of mRNA sequences that have never been seen but are plausible
Model performance: Pearson correlation ≈ 0.2
Prediction of Pathogenic Non-coding Variants
Embeddings:
Given isoform: get its embedding
Cluster them to map their global structure
Embeddings of PAS (polyadenylation signal) tokens correlate with isoform abundance: Pearson correlation ≈ 0.5
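One plausible recipe for the clustering step (random vectors stand in for real per-isoform embeddings; neither the pooling nor the clustering choices are claimed to match the paper's):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for mean-pooled per-isoform embeddings: (n_isoforms, 256).
embeddings = rng.normal(size=(1000, 256))

coords = PCA(n_components=2).fit_transform(embeddings)    # 2-D global map
labels = KMeans(n_clusters=8, n_init=10).fit_predict(embeddings)
print(coords.shape, np.bincount(labels))                  # map + cluster sizes
```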
Deep concept learning
Sequence likelihoods encode semantic information about the transcriptome
Model assigns high likelihood to
Gene sites that mark the start/end of transcription
Start/end sites of common exons
Captures splice site donor/acceptor sequences
Learns the cis-regulatory code underlying splicing
Identifies the sites that bind to common transcription control factors
Found some new candidate sites (some factors control transcription only in certain contexts)
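A sketch of how such sites could be read off a per-position log-likelihood profile (again reusing the toy model above; the peak-picking rule is purely illustrative):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def position_logp(model, tokens):
    """Per-position log-probability the model assigns to the observed token."""
    logp = F.log_softmax(model(tokens[:, :-1]), dim=-1)    # (1, T-1, vocab)
    return logp.gather(-1, tokens[:, 1:, None]).squeeze()  # (T-1,)

tokens = torch.randint(0, 16, (1, 64))
profile = position_logp(TinyStripedModel(), tokens)
# Positions of unusually high confidence; on real data these would be
# compared against annotated transcription boundaries and splice sites.
peaks = (profile > profile.mean() + profile.std()).nonzero().squeeze(-1)
print(peaks.tolist())
```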
Synthetic RNAs from LoRNA-SH's sequence generation (see the sampling sketch after this list)
Exon and intron lengths are similar to what is found in real mRNA
Distribution of codon usage also similar
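A standard autoregressive sampling loop, sketched against the toy model above (temperature, prompt handling, and the follow-up checks are assumptions, not the paper's generation procedure):

```python
import torch

@torch.no_grad()
def sample(model, prompt, n_steps=256, temperature=1.0):
    """Extend a prompt one token at a time by sampling from the model."""
    tokens = prompt.clone()                       # (1, T0)
    for _ in range(n_steps):
        logits = model(tokens)[:, -1] / temperature
        nxt = torch.multinomial(torch.softmax(logits, dim=-1), 1)
        tokens = torch.cat([tokens, nxt], dim=1)
    return tokens

generated = sample(TinyStripedModel(), torch.randint(0, 16, (1, 8)))
print(generated.shape)   # decoded exon/intron lengths and codon usage would
                         # then be compared against real transcripts
```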