Our best Elastic Net Cross-Validation model achieved an R² of 0.41, indicating that it explains approximately 41% of the variance in genes' translation efficiencies. The model achieved a mean squared error (MSE) of 0.58, a substantial improvement over earlier iterations. While these results leave considerable room for improvement, they demonstrate meaningful predictive capability for translation efficiency.
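For reference, both metrics follow their standard definitions; a minimal sketch with scikit-learn, where `y_true` and `y_pred` are placeholders for the held-out measurements and the model's predictions on the test split:

```python
# Hedged sketch of the held-out evaluation behind the reported metrics;
# y_true and y_pred are assumed arrays for the test split, not shown here.
from sklearn.metrics import mean_squared_error, r2_score

r2 = r2_score(y_true, y_pred)              # ~0.41 for our best model
mse = mean_squared_error(y_true, y_pred)   # ~0.58 for our best model
print(f"R^2 = {r2:.2f}, MSE = {mse:.2f}")
```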
The model underwent iterative refinement, with performance improvements achieved through feature selection, hyperparameter tuning, and data preprocessing. Of the datasets tested, GSE90056 performed best; most of the others showed quality issues for our application, highlighting the importance of high-quality input data for model performance.
Figure: Model performance with individual features removed.
The final model used Elastic Net Cross-Validation, which combines the LASSO (L1) and Ridge (L2) penalties to balance feature selection against coefficient shrinkage. This approach lets the model handle correlated features while still performing feature selection, making it well suited to our datasets of interdependent biological variables.
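A minimal sketch of this setup, assuming a pandas DataFrame `features` (one column per feature listed below) and a Series `te` of the corresponding translation efficiencies; the hyperparameter grid is illustrative rather than the exact one we used:

```python
# Sketch of the Elastic Net Cross-Validation setup; `features` and `te`
# are assumed inputs (feature table and translation efficiencies).
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(features)  # put features on a common scale
y = te.to_numpy()

# l1_ratio sweeps the LASSO/Ridge mix; alphas are chosen by cross-validation.
model = ElasticNetCV(
    l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 1.0],
    n_alphas=100,
    cv=5,
    max_iter=10000,
)
model.fit(X, y)

print("selected alpha:", model.alpha_)        # small alpha => weak shrinkage
print("selected l1_ratio:", model.l1_ratio_)
print("retained features:", np.sum(model.coef_ != 0))
```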
The model incorporated a diverse set of features, including mRNA secondary structure free energy, codon bias indices (CAI, TAI), gene length, GC content, intergenic distance, ribosome binding site scores, and gene expression rates. The very small alpha value selected by cross-validation suggests that the model retained most features and applied minimal shrinkage, consistent with the biological relevance of the included variables.
TAI exhibited the highest correlation with translation efficiency, showing the strongest linear relationship among all features with a Pearson correlation of 0.32. This finding aligns with previous research demonstrating that codon bias closely correlates with tRNA availability and translation efficiency.
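The per-feature screen behind this comparison can be sketched as a simple Pearson correlation loop, again assuming the `features` DataFrame and `te` Series from the sketch above:

```python
# Hedged sketch of the per-feature Pearson correlation screen.
from scipy.stats import pearsonr

correlations = {
    name: pearsonr(features[name], te)[0]  # pearsonr returns (r, p-value)
    for name in features.columns
}
for name, r in sorted(correlations.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: r = {r:.2f}")  # e.g. TAI comes out near r = 0.32
```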
Minimum free energy (MFE) was ranked as the most important feature in the Shapley analysis despite its lower correlation score, suggesting that MFE captures nonlinear relationships with translation efficiency that correlation metrics do not fully reflect. Previous studies indicate that MFE is a strong determinant of translation efficiency, potentially even stronger than CAI.
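A common way to compute such Shapley values for a linear model is the `shap` package; the sketch below, which assumes the fitted `model` and scaled matrix `X` from the earlier sketch, illustrates the ranking procedure rather than our exact pipeline:

```python
# Sketch of the Shapley-value feature ranking; assumes the fitted
# ElasticNetCV `model`, scaled matrix `X`, and `features` DataFrame above.
import numpy as np
import shap

explainer = shap.LinearExplainer(model, X)
shap_values = explainer.shap_values(X)

# Rank features by mean absolute Shapley value across all genes.
importance = np.abs(shap_values).mean(axis=0)
for name, score in sorted(zip(features.columns, importance),
                          key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```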
Multiple factors contribute to translation efficiency regulation rather than a single mechanism. The AU content, CAI, and TAI for the first 51 bases all demonstrated relatively high correlation scores, indicating that the structure and sequence of the beginning of genes play an important role in determining translation efficiency. These findings support the concept of translational ramps at the beginning of genes.
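As an illustration, the 5'-end (ramp) features can be computed over a fixed 51-base window; the sketch assumes coding sequences in a dict `cds` mapping gene names to DNA strings, and shows only the AU-content feature (CAI and TAI additionally require codon usage and tRNA abundance tables):

```python
# Illustrative computation of a 5'-end sequence feature; `cds` is an
# assumed dict of gene name -> coding DNA sequence.
RAMP_LEN = 51  # window size matching the ramp region discussed above

def au_content(seq: str) -> float:
    """Fraction of A/T (A/U in the mRNA) in the sequence."""
    seq = seq.upper()
    return (seq.count("A") + seq.count("T")) / len(seq)

ramp_au = {gene: au_content(seq[:RAMP_LEN]) for gene, seq in cds.items()}
```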
GC content and gene length also emerged as important determinants, ranking 2nd and 4th in the Shapley analysis. Higher GC content stabilizes mRNA through stronger base-pairing, while shorter genes allow more efficient ribosome recycling.
Figure: Shapley values for each feature, calculated after training the model on these features.
While our model shows predictive capability, its overall predictive power remains limited, leaving a significant fraction of the variance in translation efficiency unexplained. No single feature dominated the rankings, suggesting that translation efficiency is shaped by complex feature-feature interactions that the model may not have fully captured.
Dataset quality emerged as a critical factor, with significant technical biases and noise observed across datasets. When samples are plotted against one another, many data points fall far from the y = x line, indicating noise and technical bias that remain in our translation efficiency datasets and suggesting that further data processing may be needed.
The model was trained and tested exclusively on E. coli data, limiting its applicability to other species without additional training on species-specific datasets. Runtime also remains significant, with processing times on the order of minutes to an hour, indicating opportunities for computational optimization.
Cross-dataset normalization challenges were observed, with substantial sample-to-sample variation requiring quantile normalization to reduce noise from technical bias and variation across different experimental conditions.
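A minimal sketch of the quantile normalization step, assuming a DataFrame `te_matrix` with genes as rows and samples as columns; each sample is mapped onto a shared reference distribution built from the per-rank means:

```python
# Hedged sketch of cross-sample quantile normalization; `te_matrix` is an
# assumed genes-by-samples DataFrame of translation efficiency values.
import numpy as np
import pandas as pd

def quantile_normalize(df: pd.DataFrame) -> pd.DataFrame:
    ranks = df.rank(method="first").astype(int)
    # Reference distribution: mean of each sorted column across samples.
    reference = pd.DataFrame(np.sort(df.to_numpy(), axis=0)).mean(axis=1)
    # Replace each value with the reference value at its within-sample rank.
    return ranks.apply(lambda col: col.map(lambda r: reference.iloc[r - 1]))

te_normalized = quantile_normalize(te_matrix)
```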