Figure 1. General schema of the methodology. Created with BioRender.com.
We used the COSMIC database to obtain wild-type and cancer-associated SPHK1 sequences. A total of 69 sequences were collected and aligned using Python (code can be found here). A multiple sequence alignment (MSA) was then performed with Clustal Omega to identify conserved regions and quantify mutation frequencies across all samples. Positions with recurrent variants (hotspots) were selected for further structural and functional analysis.
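The hotspot step can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the toy sequences, and the 0.5 cutoff are hypothetical, and it assumes the sequences are already aligned to equal length with `-` marking gaps. Positions are alignment columns, which match reference residue numbers only if the reference row contains no gaps.

```python
from collections import Counter


def mutation_frequencies(reference, variants):
    """Map each alignment column (1-based) to the fraction of variant
    sequences whose residue differs from the reference at that column."""
    if any(len(v) != len(reference) for v in variants):
        raise ValueError("all aligned sequences must have the same length")
    counts = Counter()
    for seq in variants:
        for pos, (ref_aa, var_aa) in enumerate(zip(reference, seq), start=1):
            # Ignore gap columns so only true substitutions are counted.
            if ref_aa != "-" and var_aa != "-" and ref_aa != var_aa:
                counts[pos] += 1
    total = len(variants)
    return {pos: n / total for pos, n in counts.items()}


def hotspots(freqs, min_fraction):
    """Return the columns mutated in at least `min_fraction` of the variants."""
    return sorted(pos for pos, f in freqs.items() if f >= min_fraction)


# Toy aligned sequences (hypothetical, not real SPHK1 data).
reference = "MDPAGGPRG"
variants = ["MDPAGGPRG", "MDPVGGPRG", "MDPVGGPKG", "MDPAGG-RG"]

freqs = mutation_frequencies(reference, variants)
print(hotspots(freqs, min_fraction=0.5))
```

In practice the cutoff would be chosen from the distribution of frequencies across the 69 sequences rather than fixed in advance.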
Protein sequences for the human, mouse, rat, and chicken SPHK1 orthologs were retrieved from UniProt in FASTA format (Supplementary Figures 1, 4-6) and aligned with Clustal Omega to assess evolutionary conservation across species. The conservation of each residue was used as an indicator of its functional importance and to contextualize the mutations observed in prostate cancer. To complement the conservation analysis, we used the Ensembl Variant Effect Predictor (VEP) to assess the functional relevance of the identified mutations. This included two predictive algorithms:
SIFT, which predicts whether amino acid substitutions affect protein function.
PolyPhen, which predicts the possible impact of substitutions on protein structure and function.
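The cross-species conservation check described above can be sketched in a few lines of Python. This is an illustrative assumption rather than the method actually used: the score is simply the fraction of sequences sharing the most common residue in each alignment column, the function name is hypothetical, and the short sequences are toy stand-ins for the four aligned orthologs.

```python
from collections import Counter


def column_conservation(aligned):
    """Return, for each alignment column, the fraction of sequences that
    carry the most common non-gap residue in that column."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*aligned):
        residues = [aa for aa in column if aa != "-"]
        if not residues:
            scores.append(0.0)  # column made up entirely of gaps
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        # Dividing by the number of sequences penalizes columns with gaps.
        scores.append(top_count / len(aligned))
    return scores


# Toy aligned fragments (hypothetical, not real SPHK1 orthologs).
orthologs = {
    "human": "MDPAG",
    "mouse": "MEPAG",
    "rat": "MEPAG",
    "chicken": "M-PSG",
}

scores = column_conservation(list(orthologs.values()))
print(scores)
```

A mutation falling in a column with a score near 1.0 affects a residue that is preserved across species, which is one reason to weight it more heavily alongside the SIFT and PolyPhen predictions.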
Wild-type and mutated SPHK1 protein structures were predicted with AlphaFold through the ColabFold platform. The models were visualized in PyMOL and colored by pLDDT confidence score and by an N→C residue gradient. Mutated residues were labeled, and local structural features such as conformation, packing, and flexibility were examined. Sequence coverage plots were used to verify the reliability of each modeled region (Supplementary Figure 3). This structural approach helps generate mechanistic hypotheses about how mutations could influence SPHK1 behavior.
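Model confidence around a mutated residue can also be checked numerically. AlphaFold-style PDB files store the per-residue pLDDT in the B-factor column, so it can be read with a small fixed-column parser. The sketch below is a hedged illustration only: the function names, the toy pLDDT values, the ±1 residue window, and the cutoff of 70 (a commonly used boundary for low-confidence regions) are assumptions, not parameters reported in this study.

```python
def plddt_per_residue(pdb_text):
    """Read pLDDT values from the B-factor column (columns 61-66) of the
    CA atom records in an AlphaFold-style PDB file."""
    plddt = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            res_seq = int(line[22:26])         # residue number
            plddt[res_seq] = float(line[60:66])  # pLDDT stored as B-factor
    return plddt


def local_plddt(plddt, position, window):
    """Mean pLDDT over the residues within `window` of `position`."""
    values = [v for res, v in plddt.items() if abs(res - position) <= window]
    return sum(values) / len(values)


def toy_ca_line(res_seq, plddt):
    """Build one fixed-column CA record (hypothetical helper for the demo)."""
    return (f"ATOM  {res_seq:5d}  CA  ALA A{res_seq:4d}    "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{plddt:6.2f}")


# Toy five-residue model with made-up confidence values.
toy_values = [90.0, 80.0, 40.0, 60.0, 95.0]
pdb_text = "\n".join(toy_ca_line(i, v) for i, v in enumerate(toy_values, start=1))

plddt = plddt_per_residue(pdb_text)
low_confidence = sorted(res for res, v in plddt.items() if v < 70.0)
print(local_plddt(plddt, position=3, window=1), low_confidence)
```

A mutation that sits in a low-confidence stretch calls for more caution when interpreting the predicted packing or conformation around it.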