Project Workflow
Our final design consisted of four parts:
generation of the operon table
a pipeline for cleaning and processing translation efficiency datasets for use as training data
feature generation
the predictive model itself, built with Elastic Net
The development of our machine learning model followed a structured approach to predict bacterial translation efficiency for optimizing protein production, with particular focus on balancing predictive performance with model efficiency and interpretability.
Our training data were processed from five RNA sequencing and ribosome profiling datasets from E. coli MG1655 under normal growth conditions. The source datasets we selected to process are GSE90056, GSE56372, and GSE103421. To reduce complexity, we used pre-processed read-per-gene tables either provided directly by the studies associated with each dataset or processed by TranslatomeDB.
Translation efficiency data were collected from multiple publicly available datasets containing ribosome profiling and RNA sequencing data from E. coli MG1655 under normal growth conditions. These datasets provided the foundation for calculating translation efficiency as the ratio of ribosome profiling reads to RNA-seq reads per gene, following established methodologies in the field.
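Written out, the per-gene translation efficiency used throughout the pipeline is the ratio of the normalized ribosome profiling signal to the normalized mRNA signal (here expressed with the RPKM values computed after filtering, as described in the pipeline below):

```latex
\mathrm{TE}_g \;=\; \frac{\mathrm{RPKM}^{\mathrm{Ribo}}_{g}}{\mathrm{RPKM}^{\mathrm{RNA}}_{g}}
```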
TranslatomeDB
The dataset processing code is consolidated in a single file, reprocess_data_batch.ipynb. This notebook automatically searches for pre-processed RNA-seq and ribosome profiling read-per-gene samples for each dataset, following the same file format as the .quant files produced by TranslatomeDB; filters genes by pre-defined filter parameters; re-normalizes; and calculates the final translation efficiency per gene.
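A minimal sketch of the batch-discovery step is shown below. It assumes each dataset directory holds TranslatomeDB-style .quant files with one row per gene; the directory layout and column names ("gene", and tab-separated values) are placeholders, not the exact format handled in reprocess_data_batch.ipynb.

```python
from pathlib import Path
import pandas as pd

def load_quant_files(dataset_dir: str) -> dict[str, pd.DataFrame]:
    """Collect every .quant file under a dataset directory, one dataframe per sample."""
    samples = {}
    for path in Path(dataset_dir).glob("**/*.quant"):
        df = pd.read_csv(path, sep="\t")            # assumed tab-separated layout
        samples[path.stem] = df.set_index("gene")   # assumed gene-identifier column
    return samples

# Hypothetical paths, one call per sequencing type and dataset.
rna_samples = load_quant_files("data/GSE90056/rna_seq")
ribo_samples = load_quant_files("data/GSE90056/ribo_profiling")
```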
Our data processing pipeline contained the following key steps:
Prepare each dataset into four easy-to-process dataframes
Filter raw counts with three filters - the primary goal of the filters is to remove genes with low read counts and genes with highly variable read counts, which indicate biases from sources such as technical variation or differences in incubation conditions
Calculate RPKM after filtering
Use the new RPKM values to calculate TE (a sketch of the RPKM and TE steps follows this list)
Normalize TE to remove sample bias - the cross-dataset normalization method was selected using scatterplots of all samples plotted against a chosen reference sample. The primary goal of this normalization is to reduce noise from sample-to-sample variation.
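As a minimal sketch of the RPKM and TE steps above, the snippet below assumes `ribo` and `rna` are dataframes of filtered raw counts indexed by gene (one column per sample) and `gene_lengths` is a Series of gene lengths in base pairs; these names are placeholders for the actual variables in reprocess_data_batch.ipynb.

```python
import pandas as pd

def rpkm(counts: pd.DataFrame, gene_lengths: pd.Series) -> pd.DataFrame:
    """Reads Per Kilobase of transcript per Million mapped reads."""
    per_million = counts.sum(axis=0) / 1e6          # library size per sample, in millions
    rpk = counts.div(gene_lengths / 1e3, axis=0)    # normalize by gene length (kb)
    return rpk.div(per_million, axis=1)             # normalize by sequencing depth

ribo_rpkm = rpkm(ribo, gene_lengths)
rna_rpkm = rpkm(rna, gene_lengths)

# Translation efficiency: ribosome profiling signal relative to mRNA abundance.
te = ribo_rpkm / rna_rpkm
```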
Filter thresholds were chosen based on how the violin plots of mean and maximum value per gene changed and on the final correlation coefficient heatmap results. By applying quantile normalization, we reduced technical bias and sample-to-sample variation across datasets (a sketch of this normalization step follows the figures below).
Distribution before filter
Distribution after filter
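A minimal sketch of the quantile normalization step is shown below. It assumes `te` is a gene-by-sample dataframe of translation efficiency values; this is a generic rank-based implementation, not necessarily the exact code used in the notebook.

```python
import numpy as np
import pandas as pd

def quantile_normalize(df: pd.DataFrame) -> pd.DataFrame:
    """Map every sample's values onto a shared reference distribution."""
    # Reference distribution: mean across samples of the sorted values.
    reference = np.sort(df.values, axis=0).mean(axis=1)
    # Replace each value with the reference value at its within-sample rank.
    ranks = df.rank(method="first").astype(int) - 1   # 0-based rank positions
    out = df.copy()
    for col in df.columns:
        out[col] = reference[ranks[col].to_numpy()]
    return out

te_normalized = quantile_normalize(te)
```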
Feature extraction was performed in a single Jupyter notebook, extracting_features.ipynb. Our feature engineering process incorporated multiple biological descriptors of translation efficiency that have been validated by prior research and experimentation. The features implemented span several categories of translational regulation (a combined sketch of several of these features follows the list below):
Sequence-based features: Gene length was computed as the difference between a gene's end and start positions. GC content was computed as the number of guanine and cytosine nucleotides divided by the total number of nucleotides in the gene.
Codon usage features: CAI was calculated by defining a dictionary of codon weights based on usage frequency in E. coli, splitting a given DNA sequence into codons, retrieving the corresponding weights, and computing their geometric mean. TAI was calculated with the same procedure but a different set of codon weights, since CAI and TAI are both geometric means of per-codon weights.
Structural features: Gene sequences were passed to the MFE calculator function provided by the ViennaRNA package to calculate the minimum free energy of mRNA secondary structures.
Regulatory elements: RBS sequences were extracted from genomic data and compared against a generated, idealized Shine-Dalgarno PPM to evaluate how closely they match an optimal RBS. A Position Probability Matrix (PPM) and a Position-Specific Scoring Matrix (PSSM) were generated to analyze the DNA sequences.
Positional features: The first twenty amino acids were found by converting the first 60 nucleotides into their corresponding amino acids. Features specific to translational ramps included polarity scores, CAI, and TAI calculated specifically for the first 20 codons.
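The combined sketch below illustrates a few of these features, assuming input sequences are coding-strand DNA strings. The weight table is a hypothetical subset; the actual notebook defines weights for all sense codons (usage frequency for CAI, tRNA availability for TAI).

```python
import math

# Hypothetical subset of codon weights, for illustration only.
EXAMPLE_WEIGHTS = {"AAA": 1.00, "AAG": 0.25, "GCG": 1.00, "GCC": 0.36}

def gc_content(seq: str) -> float:
    """Fraction of G and C nucleotides in the gene sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def codon_index(seq: str, weights: dict) -> float:
    """Geometric mean of per-codon weights; the same form serves CAI and TAI."""
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    w = [weights[c] for c in codons if c in weights]
    return math.exp(sum(math.log(x) for x in w) / len(w))

def mfe(seq: str) -> float:
    """Minimum free energy of the predicted mRNA secondary structure."""
    import RNA  # ViennaRNA Python bindings
    _structure, energy = RNA.fold(seq)
    return energy
```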
In addition to feature extraction, feature correlation and statistical analyses were carried out. Pearson, Spearman, and Kendall's tau coefficients were calculated using pandas' .corr() method to assess the relationships between each feature and the target translation efficiency values.
Relationship between translation efficiency and features
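As a minimal sketch of this correlation analysis, the snippet below assumes `features` is a dataframe of engineered features plus a "TE" column holding the target values (names are placeholders).

```python
import pandas as pd

# One column of feature-vs-TE correlations per method.
correlations = pd.DataFrame({
    method: features.corr(method=method)["TE"].drop("TE")
    for method in ("pearson", "spearman", "kendall")
})
print(correlations.sort_values("spearman", ascending=False))
```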
After evaluating alternatives using a decision matrix that considered interpretability, predictive accuracy, and overfitting mitigation, we selected Elastic Net Cross-Validation as our final model. Elastic Net combines LASSO (L1) and Ridge (L2) regularization, which allows it to perform feature selection while also retaining the useful correlated features we need.
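For reference, scikit-learn's Elastic Net minimizes the following objective over the coefficient vector w, where n is the number of samples, alpha scales the overall penalty, and the L1 ratio rho interpolates between pure Ridge (rho = 0) and pure LASSO (rho = 1):

```latex
\min_{w}\;\frac{1}{2n}\,\lVert y - Xw\rVert_2^2
\;+\;\alpha\,\rho\,\lVert w\rVert_1
\;+\;\frac{\alpha\,(1-\rho)}{2}\,\lVert w\rVert_2^2
```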
Our machine learning model development included several key components:
Model Selection: Elastic Net CV was chosen for our design because it effectively balances feature selection and regularization. Unlike LASSO, which aggressively eliminates features, Elastic Net retains useful correlated features, improving model stability and performance.
Hyperparameter Optimization: We created lists of candidate values for alpha and the L1 ratio. The model iterated through these candidates, and the combination that produced the best results was used for the final model and reported results. In this way, we optimized the alpha and l1_ratio values our model used.
Data Preprocessing: By scaling the data, we ensured that all datasets had values on comparable scales, so a small set of genes with extremely high values could not disproportionately influence the model. Scaling ensured that all genes and their translation efficiency values contributed equally to how the model trained.
Model Evaluation: The design was evaluated using a structured testing methodology focused on accuracy, interpretability, and robustness. We reported Mean Squared Error (MSE) and the selected L1 ratio, with R-squared (R²) as the main metric of interest for analyzing model performance (a sketch of the training and evaluation loop follows this list).
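The sketch below ties these components together, assuming `X` holds the engineered features and `y` the normalized translation efficiency values; the candidate alpha and l1_ratio grids are illustrative, not the exact values we searched.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features so no single high-magnitude feature dominates training.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# ElasticNetCV iterates over the candidate alphas and l1_ratios internally,
# keeping the combination with the best cross-validated performance.
model = ElasticNetCV(
    alphas=np.logspace(-4, 1, 30),
    l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 1.0],
    cv=5,
).fit(X_train_s, y_train)

pred = model.predict(X_test_s)
print("best alpha:", model.alpha_, "best l1_ratio:", model.l1_ratio_)
print("MSE:", mean_squared_error(y_test, pred), "R²:", r2_score(y_test, pred))
```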