Learning and Predicting Transcriptome Architecture
Abstract:
Linking DNA sequence to genomic function remains one of the grand challenges in genetics and genomics. Here, we combine large-scale single-molecule transcriptome sequencing of diverse cancer cell lines with cutting-edge machine learning to build LoRNA-SH, an RNA foundation model that learns how the nucleotide sequence of unspliced pre-mRNA dictates transcriptome architecture—the relative abundances and molecular structures of mRNA isoforms. Owing to its use of the StripedHyena architecture, LoRNA-SH handles extremely long sequence inputs at base-pair resolution (~65 kilobase pairs), allowing for quantitative, zero-shot prediction of all aspects of transcriptome architecture, including isoform abundance, isoform structure, and the impact of DNA sequence variants on transcript structure and abundance. We anticipate that our public data release and the accompanying frontier model will accelerate many aspects of RNA biotechnology. More broadly, we envision the use of LoRNA-SH as a foundation for fine-tuning of any transcriptome-related downstream prediction task, including cell-type specific gene expression, splicing, and general RNA processing.
Bio:
Dr. Hani Goodarzi is an Arc Core Investigator and Associate Professor at UCSF. His research combines novel discovery platforms and frontier AI models to reveal molecular mechanisms of cancer progression. Dr. Goodarzi has made key scientific contributions at the intersection of AI, RNA biology, and cancer research, spanning both diagnostics and therapeutics. He has also co-founded several companies in this field, including Exai Bio, a next-generation liquid biopsy company, and Vevo Tx, a platform drug-discovery company leveraging AI to design better drugs. His work has been recognized with prestigious awards, including the Vilcek Prize for Creative Promise and the AACR-MPM Transformative Cancer Research Award. Dr. Goodarzi was previously honored with the Martin and Rose Wachtel Award in Cancer Research and named an American Cancer Society scholar.
Summary:
The genome, its transcription into mRNA, and translation into proteins
The same gene can be transcribed into different protein isoforms (sometimes hundreds per gene)
The various isoforms occur at different abundances
Gene sequence: exon-intron-exon-intron-exon-...
The transcription/splicing machinery removes the
Intron segments,
Splices together the exons into a single sequence, and
Then generates the encoded protein
Different proteins:
Subsequences of exons
Some exons may be skipped
Complex dependencies between which exons are included (e.g. if one is included, another is always excluded)
Lots of work has gone into deciphering the behavior of this mechanism; a toy splicing sketch follows below
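As a toy illustration of this combinatorics (not from the talk; the exon sequences below are made up), keeping the first and last exon fixed and optionally skipping internal ones already yields multiple isoforms per gene:

```python
# Toy splicing model: enumerate isoforms produced by skipping any subset of
# internal exons. Exon sequences are hypothetical; real genes have many more
# exons and far richer inclusion/exclusion dependencies.
from itertools import combinations

exons = ["ATGGCC", "GGATTT", "CCTAAA", "TGGTAA"]

def spliced_isoforms(exons):
    """Yield every isoform that keeps the first and last exon and
    includes any subset of the internal exons, in order."""
    internal = range(1, len(exons) - 1)
    for k in range(len(exons) - 1):
        for kept in combinations(internal, k):
            yield exons[0] + "".join(exons[i] for i in kept) + exons[-1]

for iso in spliced_isoforms(exons):
    print(iso)   # 4 isoforms from 2 internal exons (2^2 subsets)
```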
Sequence-to-function mapping
RNA foundation models
Borzoi/Enformer, BigRNA, Nucleotide Transformer (NucT), HyenaDNA
Need a model for joint representation for
Gene sequence
Associated RNA isoform architecture
At nucleotide resolution
Challenge: data availability
There are many more DNA foundation models because far less RNA data is available
Conventional mRNA sequencing works at the granularity of short segments (the mRNA is fragmented and then read)
Hard to infer the full mRNA sequence from these segments
Many possible mRNA sequences per gene due to transcription/splicing variants, exon skips, etc.
Now it's possible to sequence full-length mRNAs (e.g. via PacBio Sequel II)
Used new instruments to do both long-read and short-read scans
Short-read: much higher throughput (~400 base pairs per read)
Long-read: better ability to connect segments (~3.5k base pairs per read); helps stitch short reads into long sequences
Different human cell lines
2 biological replicates
26 cell lines
+ 149 human, 46 mouse long-read samples from various studies
Total ~95M full-length reads
520k unique inferred transcripts
LoRNA-SH model
An RNA foundation model trained on long-read sequencing data
Align each mRNA with its DNA
Encode both, marking the
base pairs from the mRNA as one set of tokens, and
the DNA that is not transcribed into mRNA using different tokens
Different tokens for DNA before/after the transcript and
untranscribed DNA (introns) in the middle of the mRNA
Enables the model to see where transcription starts/ends and where introns are dropped in the middle of the transcript (see the sketch below)
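A rough sketch of this joint encoding, assuming each nucleotide gets a region-dependent token; the region labels and the `tokenize` helper are illustrative, not LoRNA-SH's actual vocabulary:

```python
# Illustrative tokenizer: each base is tagged by where it sits relative to
# the transcript, so a model can see transcription boundaries and introns.
# Region names are assumptions, not LoRNA-SH's real token vocabulary.
def tokenize(dna, exon_spans):
    """Emit (nucleotide, region) pairs for one locus.
    exon_spans: sorted half-open (start, end) intervals of transcribed exons."""
    tx_start, tx_end = exon_spans[0][0], exon_spans[-1][1]
    tokens = []
    for i, base in enumerate(dna):
        if i < tx_start:
            region = "upstream DNA"     # before transcription starts
        elif i >= tx_end:
            region = "downstream DNA"   # after transcription ends
        elif any(s <= i < e for s, e in exon_spans):
            region = "exon"             # retained in the mature mRNA
        else:
            region = "intron"           # transcribed, then spliced out
        tokens.append((base, region))
    return tokens

print(tokenize("AACGTACGTTT", [(2, 5), (7, 9)]))
```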
Model
Layers: Embedding, Hyena (like attention but with sub-quadratic computational cost), Attention
256 latent dimensions
7M parameters
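A minimal PyTorch sketch of such a stack; the depthwise long convolution is a crude stand-in for Hyena's implicit long filters (real StripedHyena blocks differ), and apart from the 256-dimensional latent everything here is an assumption:

```python
# Tiny striped stack: embedding -> gated long-convolution blocks -> attention.
import torch
import torch.nn as nn

class HyenaLikeBlock(nn.Module):
    def __init__(self, dim, kernel_size=128):
        super().__init__()
        # Depthwise conv ~ one long filter per channel, sub-quadratic in length.
        self.conv = nn.Conv1d(dim, dim, kernel_size, groups=dim,
                              padding=kernel_size - 1)
        self.gate = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                       # x: (batch, seq, dim)
        h = self.norm(x)
        conv = self.conv(h.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        return x + conv * torch.sigmoid(self.gate(h))   # gated residual

class TinyStripedModel(nn.Module):
    def __init__(self, vocab_size=16, dim=256, n_hyena=4, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.hyena = nn.ModuleList(HyenaLikeBlock(dim) for _ in range(n_hyena))
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)  # next-token logits

    def forward(self, tokens):                  # tokens: (batch, seq)
        x = self.embed(tokens)
        for block in self.hyena:
            x = block(x)
        mask = torch.triu(torch.ones(x.size(1), x.size(1), dtype=torch.bool,
                                     device=x.device), diagonal=1)  # causal
        x = x + self.attn(x, x, x, attn_mask=mask, need_weights=False)[0]
        return self.head(x)

logits = TinyStripedModel()(torch.randint(0, 16, (2, 64)))
print(logits.shape)   # torch.Size([2, 64, 16])
```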
Training:
Next-token prediction, like a language model
Cross-entropy loss
Evaluated via perplexity (the exponential of the cross-entropy loss)
Model accuracy increases with model size/compute time but there’s a point of diminishing returns where the model overfits
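In code, the objective is the standard language-model recipe (reusing the `TinyStripedModel` sketch above):

```python
# Next-token training objective: predict token t+1 from tokens <= t.
import torch
import torch.nn.functional as F

def lm_loss(model, tokens):
    """Mean cross-entropy over all next-token predictions."""
    logits = model(tokens[:, :-1])              # (batch, T-1, vocab)
    targets = tokens[:, 1:]                     # inputs shifted left by one
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))

tokens = torch.randint(0, 16, (2, 64))
loss = lm_loss(TinyStripedModel(), tokens)
print(loss.item(), torch.exp(loss).item())      # loss, perplexity = exp(loss)
```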
Applications
Can predict the likelihood of a given possible isoform
Zero-shot
Tests on new tasks
Compare the model’s predicted likelihood of an isoform to its abundance in a cell
These are correlated: Pearson correlation ≈ 0.15
Other models’ (HyenaDNA, NucT) probabilities have no correlation
Indicates that the model can track the isoform abundances that occurred in the cells when they were harvested for analysis
Correlations should improve if cell’s state is included in the model’s context
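A hedged sketch of that zero-shot evaluation: score each isoform by its summed next-token log-likelihood and correlate the scores with measured abundance. The model, isoforms, and abundances below are random stand-ins so the sketch runs; they are not the paper's data or API:

```python
import torch
import torch.nn.functional as F
from scipy.stats import pearsonr

@torch.no_grad()
def log_likelihood(model, tokens):
    """Sum of per-position next-token log-probabilities for one sequence."""
    logp = F.log_softmax(model(tokens[:, :-1]), dim=-1)   # (1, T-1, vocab)
    return logp.gather(-1, tokens[:, 1:, None]).sum().item()

model = TinyStripedModel()                                # toy stand-in
isoforms = [torch.randint(0, 16, (1, 64)) for _ in range(50)]
abundances = torch.rand(50).tolist()                      # fake measurements

scores = [log_likelihood(model, iso) for iso in isoforms]
r, _ = pearsonr(scores, abundances)   # the talk reports r ~ 0.15 on real data
print(f"Pearson r = {r:.2f}")
```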
Prediction of trapped exons
Identification of mRNA sequences that have never been seen but are plausible
Model performance: Pearson correlation ≈ 0.2
Prediction of Pathogenic Non-coding Variants
Embeddings:
Given isoform: get its embedding
Cluster them to map their global structure
Embeddings of PAS (polyadenylation signal) tokens correlate with isoform abundance: Pearson correlation ≈ 0.5
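One plausible recipe for the clustering step (random vectors stand in for real per-isoform embeddings; neither the pooling nor the clustering choices are claimed to match the paper's):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for mean-pooled per-isoform embeddings: (n_isoforms, 256).
embeddings = rng.normal(size=(1000, 256))

coords = PCA(n_components=2).fit_transform(embeddings)    # 2-D global map
labels = KMeans(n_clusters=8, n_init=10).fit_predict(embeddings)
print(coords.shape, np.bincount(labels))                  # map + cluster sizes
```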
Deep concept learning
Sequence likelihoods encode semantic information about the transcriptome
Model assigns high likelihood to
Gene sites that mark the start/end of transcription
Start/end sites of common exons
Captures splice site donor/acceptor sequences
Learns the cis-regulatory code underlying splicing
Identifies the sites that bind to common transcription control factors
Found some new candidate sites (some factors control transcription only in certain contexts)
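A sketch of how such sites could be read off a per-position log-likelihood profile (again reusing the toy model above; the peak-picking rule is purely illustrative):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def position_logp(model, tokens):
    """Per-position log-probability the model assigns to the observed token."""
    logp = F.log_softmax(model(tokens[:, :-1]), dim=-1)    # (1, T-1, vocab)
    return logp.gather(-1, tokens[:, 1:, None]).squeeze()  # (T-1,)

tokens = torch.randint(0, 16, (1, 64))
profile = position_logp(TinyStripedModel(), tokens)
# Positions of unusually high confidence; on real data these would be
# compared against annotated transcription boundaries and splice sites.
peaks = (profile > profile.mean() + profile.std()).nonzero().squeeze(-1)
print(peaks.tolist())
```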
Synthetic RNAs from LoRNA-SH's sequence generation (see the sampling sketch after this list)
Exon and intron lengths are similar to what is found in real mRNA
Distribution of codon usage also similar
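A standard autoregressive sampling loop, sketched against the toy model above (temperature, prompt handling, and the follow-up checks are assumptions, not the paper's generation procedure):

```python
import torch

@torch.no_grad()
def sample(model, prompt, n_steps=256, temperature=1.0):
    """Extend a prompt one token at a time by sampling from the model."""
    tokens = prompt.clone()                       # (1, T0)
    for _ in range(n_steps):
        logits = model(tokens)[:, -1] / temperature
        nxt = torch.multinomial(torch.softmax(logits, dim=-1), 1)
        tokens = torch.cat([tokens, nxt], dim=1)
    return tokens

generated = sample(TinyStripedModel(), torch.randint(0, 16, (1, 8)))
print(generated.shape)   # decoded exon/intron lengths and codon usage would
                         # then be compared against real transcripts
```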