Transcriptome Annotation and Reference Preparation
I began by assembling a comprehensive reference annotation that integrates long non-coding RNAs (lncRNAs) with previously annotated small open reading frames (smORFs). The lncRNA annotation was obtained from the GENCODE repository (release 47, GRCh37 mapping), and the smORF annotations were taken from Martinez et al. (2020). After download, the “chr” prefixes were removed from the sequence names in the lncRNA GTF so that they matched the naming used in the reference genome. The lncRNA and smORF files were then concatenated to produce a unified GTF. I supplied the merged GTF and the lncRNA GTF, together with the Homo sapiens GRCh37 primary-assembly FASTA from Ensembl, directly to STAR’s genomeGenerate run mode to build custom transcriptome indices.
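The prefix removal and concatenation described above can be sketched as follows. This is a minimal illustration rather than the script actually used: the function names are hypothetical, and it assumes both inputs are standard tab-separated GTF files whose first column is the sequence name.

```python
def strip_chr_prefix(gtf_lines):
    """Yield GTF lines with a leading 'chr' removed from the seqname column."""
    for line in gtf_lines:
        # Comment/header lines and blank lines are passed through unchanged.
        if line.startswith("#") or not line.strip():
            yield line
            continue
        seqname, sep, rest = line.partition("\t")
        if seqname.startswith("chr"):
            seqname = seqname[3:]
        yield seqname + sep + rest


def merge_gtfs(lncrna_lines, smorf_lines):
    """Concatenate prefix-stripped lncRNA records with smORF records.

    Comment lines from the second file are dropped so that header lines
    do not end up in the middle of the merged annotation.
    """
    merged = list(strip_chr_prefix(lncrna_lines))
    merged.extend(line for line in smorf_lines if not line.startswith("#"))
    return merged
```

One caveat: a plain prefix strip turns the GENCODE name chrM into M, whereas Ensembl names the mitochondrial sequence MT. Whether the original processing handled this case is not stated, so the sketch leaves it alone.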
Ribosome Profiling Data Pre-processing and Alignment
Three raw ribosome profiling (ribo-seq) datasets, one from each of three studies, were downloaded from the NCBI Gene Expression Omnibus (GEO). They represent viral infection experiments involving influenza and SARS-CoV-2 and comprise 24 ribo-seq FASTQ files in total (Machkovech et al., 2019; Razooky et al., 2017; Finkel et al., 2021). Raw sequencing reads underwent quality control and adapter trimming using standard preprocessing tools (e.g., FastQC for quality assessment and Trim Galore! for adapter removal). For alignment, I used the STAR aligner (version 2.7.3a) with the custom index built from the reference described above. STAR was run with stringent settings: at most two mismatches per read, a single permitted alignment per read (no multimapping), and a threshold on matched length. Aligned reads were output as coordinate-sorted BAM files, and BAM integrity was verified using Samtools (version 1.10).
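The alignment settings named above correspond to existing STAR options, and a command line could be assembled roughly as shown here. This is a sketch under assumptions: the function name, paths, thread count, and the use of gzipped input are hypothetical. The matched-length threshold is left out because its value is not given in the text (STAR provides options such as --outFilterMatchNmin for this purpose).

```python
def build_star_command(genome_dir, fastq, out_prefix, threads=8, gzipped=True):
    """Return a STAR argument list reflecting the stated alignment settings."""
    cmd = [
        "STAR",
        "--runThreadN", str(threads),        # assumed thread count
        "--genomeDir", genome_dir,           # custom index directory
        "--readFilesIn", fastq,
        "--outFileNamePrefix", out_prefix,
        "--outFilterMismatchNmax", "2",      # at most two mismatches per read
        "--outFilterMultimapNmax", "1",      # keep uniquely mapping reads only
        "--outSAMtype", "BAM", "SortedByCoordinate",
    ]
    if gzipped:
        # Assumption: the trimmed FASTQ files are gzip-compressed.
        cmd += ["--readFilesCommand", "zcat"]
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True) on a machine where STAR is installed.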
RiboCode-based ORF Detection
To delineate translation events, the aligned ribo-seq datasets were analyzed using RiboCode, a tool designed to identify actively translated open reading frames (ORFs) from ribosome occupancy. For each sample, a configuration file was generated that specified sample-specific parameters, namely the range of ribosome-protected fragment (RPF) lengths (26–34 nucleotides) and the corresponding P-site offsets. These parameters were optimized based on the known properties of the ribo-seq libraries. RiboCode then assessed genome-wide occupancy patterns and annotated each predicted ORF with transcript and gene identifiers, ORF coordinates, statistical significance (combined p-value), and the predicted amino acid sequence.
Candidate Micropeptide Filtering and ORF Counting
From the complete set of ORF predictions, candidate micropeptides were identified by applying two filtering criteria: a maximum ORF length of 150 amino acids and a statistical significance threshold (combined p-value < 0.05). This filtering yielded a list of high-confidence candidates for micropeptide translation. To quantify ribosome occupancy across these candidate ORFs, I used the ORFcount tool. ORFcount was configured to mitigate biases at ORF boundaries by excluding the first five codons at the 5′ end and the last three codons at the 3′ end of each ORF, provided that the ORF exceeded 30 codons in length. Additionally, only reads with lengths between 25 and 35 nucleotides were counted.
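The filtering and counting rules above can be written as three small predicates. This sketch only illustrates the logic and does not reproduce RiboCode or ORFcount internals. The dictionary keys "AAseq" and "pval_combined" are assumed column names for the prediction table, and all function names are hypothetical.

```python
MAX_AA = 150      # maximum micropeptide length in amino acids
P_CUTOFF = 0.05   # combined p-value threshold


def is_candidate(orf, max_aa=MAX_AA, p_cutoff=P_CUTOFF):
    """Return True if an ORF record passes both micropeptide filters.

    `orf` is a dict with assumed keys 'AAseq' and 'pval_combined'.
    """
    aa_len = len(orf["AAseq"].rstrip("*"))  # ignore a trailing stop symbol
    return aa_len <= max_aa and float(orf["pval_combined"]) < p_cutoff


def counted_codon_window(n_codons, first_exclude=5, last_exclude=3,
                         min_codons_for_trim=30):
    """Return the (start, end) codon range, 0-based and half-open, that is counted.

    Codons are trimmed only when the ORF is longer than `min_codons_for_trim`.
    """
    if n_codons > min_codons_for_trim:
        return first_exclude, n_codons - last_exclude
    return 0, n_codons


def read_length_ok(length, shortest=25, longest=35):
    """Return True if a read length falls inside the accepted RPF window."""
    return shortest <= length <= longest
```

For example, a 100-codon ORF is counted over codons 5 to 96 (window (5, 97)), while a 30-codon ORF is counted in full because it does not exceed the 30-codon limit.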
Differential Expression Analysis
To assess changes in translation under viral infection, I integrated the ORF quantification results across samples for two of the datasets. For each dataset (Finkel SARS-CoV-2, Razooky influenza), a full outer join was performed on the ORFcount output files using the common ORF_ID field. This produced dataset-specific count matrices, which were then filtered to remove non-standard rows (e.g., those beginning with “__”). The resulting count matrices were exported as CSV files.
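A minimal sketch of this join, assuming each ORFcount output is a headerless, two-column, tab-separated file (ORF_ID, count). The function names are hypothetical. Filling ORFs that are absent from a sample with 0 is also an assumption, since the text does not say how missing values were handled.

```python
import csv


def parse_count_lines(lines):
    """Parse 'ORF_ID<TAB>count' lines into a dict, skipping '__' summary rows."""
    counts = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or fields[0].startswith("__"):
            continue
        counts[fields[0]] = int(fields[1])
    return counts


def outer_join(sample_counts):
    """Full outer join of per-sample counts on ORF_ID.

    `sample_counts` maps sample name -> {ORF_ID: count}.
    Returns (header, rows); missing entries are filled with 0 (assumption).
    """
    samples = sorted(sample_counts)
    orf_ids = sorted(set().union(*(c.keys() for c in sample_counts.values())))
    header = ["ORF_ID"] + samples
    rows = [[orf] + [sample_counts[s].get(orf, 0) for s in samples]
            for orf in orf_ids]
    return header, rows


def write_matrix_csv(path, header, rows):
    """Export the joined count matrix as a CSV file."""
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(header)
        writer.writerows(rows)
```

Dropping the “__” rows while parsing gives the same matrix as removing them after the join, because those rows never share an ORF_ID with a real ORF.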
For each dataset, the DESeq2 (v1.40.2) R package was used for differential expression analysis. In the Finkel SARS-CoV-2 dataset, sample labels were derived from file naming conventions: “uninf” indicated uninfected controls, and all remaining “hpi” (hours post-infection) samples were grouped as “infected.” Similarly, for the Razooky influenza dataset, only the ribosome-protected fragment (RPF) samples were retained, and of these only those that were either “uninfected” or “infected” with the PR8 strain. For each analysis, a metadata table was constructed to assign each sample a categorical condition (“uninfected” vs. “infected”), and DESeq2 was used to create a DESeqDataSet object with the design formula set to the condition factor.
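The labelling rules above could be expressed as follows. This is a hedged sketch: the “uninf” and “hpi” tokens come from the text, but the Razooky tokens (“rpf”, “uninfected”, “pr8”), the case-insensitive matching, and the function names are assumptions made for illustration.

```python
def label_finkel_sample(filename):
    """Map a Finkel file name to 'uninfected' or 'infected'."""
    name = filename.lower()
    if "uninf" in name:
        return "uninfected"
    if "hpi" in name:
        return "infected"
    raise ValueError(f"cannot infer condition from {filename!r}")


def label_razooky_sample(filename):
    """Return a condition for retained Razooky samples, or None to drop them.

    Assumed tokens: 'rpf' marks ribosome-protected fragment libraries,
    'uninfected' marks controls, and 'pr8' marks PR8-infected samples.
    """
    name = filename.lower()
    if "rpf" not in name:
        return None
    # Test 'uninfected' first, because it contains the substring 'infected'.
    if "uninfected" in name:
        return "uninfected"
    if "pr8" in name:
        return "infected"
    return None


def build_metadata(filenames, labeller):
    """Return (sample, condition) rows for every sample the labeller keeps."""
    rows = []
    for name in filenames:
        condition = labeller(name)
        if condition is not None:
            rows.append((name, condition))
    return rows
```

The resulting (sample, condition) rows map onto the colData table that DESeq2 expects, with sample names matching the count-matrix columns.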
The DESeq2 workflow involved estimation of size factors and dispersion parameters, followed by fitting a generalized linear model for each ORF. The contrasts were specified to compare infected against uninfected conditions. Differential expression was then assessed by extracting results that satisfied a false discovery rate-adjusted p-value (padj) threshold of <0.05 and an absolute log₂ fold change greater than 1. These filtered ORFs represent the high-confidence candidates for differential translation under viral infection.
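The final thresholding step can be illustrated on a DESeq2 results table exported to CSV. This sketch assumes the standard DESeq2 result columns padj and log2FoldChange, read as strings, and treats missing values (written as NA) as not significant. The helper names are hypothetical.

```python
import math


def to_float(value):
    """Convert a CSV field to float, returning None for NA, empty, or NaN."""
    if value is None:
        return None
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    return None if math.isnan(number) else number


def is_significant(row, padj_cutoff=0.05, lfc_cutoff=1.0):
    """Return True if padj < cutoff and |log2FoldChange| > cutoff."""
    padj = to_float(row.get("padj"))
    lfc = to_float(row.get("log2FoldChange"))
    if padj is None or lfc is None:
        return False
    return padj < padj_cutoff and abs(lfc) > lfc_cutoff


def filter_results(rows):
    """Keep only the rows that pass both thresholds."""
    return [row for row in rows if is_significant(row)]
```

Both comparisons are strict, matching the stated criteria, so an ORF with padj equal to 0.05 or an absolute log₂ fold change equal to 1 is excluded.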