We tested multiple classifiers, including Random Tree, Decision Stump, Random Forest, and J48. Results are similar across classifiers, but J48 is slightly superior to the other algorithms. For example, as shown in Table 1, J48 achieved a model accuracy of 93.10%, versus 91.03% for Random Tree. Similarly, the root mean squared error (RMSE) of J48 is 0.18, slightly lower than the 0.19 of Random Tree. Additionally, the execution time of the J48 model is shorter than that of Random Tree.
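The two metrics compared above can be sketched as follows. The text does not state how RMSE was computed for a classification task, so this sketch assumes one common convention: the error between each predicted class-probability vector and the one-hot encoding of the true class, averaged over instances and classes. The function names and the toy labels are hypothetical and are not taken from the study.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label equals the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def classification_rmse(y_true, prob_rows, classes):
    """RMSE between one-hot targets and predicted class probabilities.

    Assumed convention: squared errors are averaged over every
    (instance, class) pair before taking the square root.
    """
    if len(y_true) != len(prob_rows) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_error = 0.0
    for label, probs in zip(y_true, prob_rows):
        for cls in classes:
            target = 1.0 if cls == label else 0.0
            squared_error += (probs[cls] - target) ** 2
    return math.sqrt(squared_error / (len(y_true) * len(classes)))


# Hypothetical toy predictions, for illustration only.
toy_classes = ["CF-causing", "Non CF-causing"]
toy_true = ["CF-causing", "CF-causing", "Non CF-causing", "Non CF-causing"]
toy_pred = ["CF-causing", "Non CF-causing", "Non CF-causing", "Non CF-causing"]
toy_probs = [
    {"CF-causing": 1.0, "Non CF-causing": 0.0},
    {"CF-causing": 0.0, "Non CF-causing": 1.0},
    {"CF-causing": 0.0, "Non CF-causing": 1.0},
    {"CF-causing": 0.0, "Non CF-causing": 1.0},
]

print(accuracy(toy_true, toy_pred))
print(classification_rmse(toy_true, toy_probs, toy_classes))
```

Under this convention, a classifier can have the same accuracy as another but a lower RMSE if its probability estimates are better calibrated, which is why the two metrics are reported side by side.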
The machine learning pipeline that we set up and trained successfully predicts the variant final determination, or the cystic fibrosis disease state, from data that quantify the number and frequency of alleles in the CFTR gene, as recorded in the CFTR2 dataset. Our final model predicts disease state with an accuracy of 93%. Thus, our results demonstrate promising predictive accuracy, suggesting that genetic variation in the CFTR gene could serve as a valuable biomarker for identifying individuals most likely to have cystic fibrosis.
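To make the prediction task concrete, the sketch below fits a decision stump, the simplest of the classifier families named earlier, on a single numeric feature. It is a stand-in for illustration and not the J48 model used in the study; the feature values, the two-class labels, and the function names are all hypothetical.

```python
from collections import Counter


def majority_label(labels):
    """Most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(values, labels):
    """Fit a one-feature, one-threshold classifier.

    Tries the midpoint between each pair of adjacent distinct feature
    values and keeps the threshold with the fewest training errors.
    Returns (threshold, label_if_at_or_below, label_if_above).
    """
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    best = None
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        left_label = majority_label(left)
        right_label = majority_label(right)
        errors = sum(lab != left_label for lab in left)
        errors += sum(lab != right_label for lab in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    return best[1:]


def predict_stump(stump, value):
    """Predict the label for one feature value."""
    threshold, left_label, right_label = stump
    return left_label if value <= threshold else right_label


# Hypothetical allele counts and labels, for illustration only.
toy_counts = [2, 5, 8, 120, 200, 350]
toy_labels = ["other", "other", "other",
              "CF-causing", "CF-causing", "CF-causing"]

stump = fit_stump(toy_counts, toy_labels)
print(stump)
print(predict_stump(stump, 10))
```

A tree learner such as J48 applies the same kind of threshold split recursively and over several features, which is how it can separate four disease states rather than two.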
The results of our model show that it distinguishes between CF-causing variants, non CF-causing variants, variants of unknown significance, and variants of varying clinical consequences. Because cystic fibrosis diagnosis is often complicated by the high genetic complexity of the disease, these results demonstrate that allele data can be leveraged to capture this underlying complexity. These findings highlight the potential of machine learning approaches in genetic risk assessment and personalized medicine, paving the way for more targeted and effective interventions for individuals with genetic disorders such as cystic fibrosis. Further validation and refinement of the model with larger datasets and additional genetic markers could enhance its clinical utility in the future.
We calculated the average number of alleles in the CFTR gene across all variants within each of the four disease states. Our findings indicate that CF-causing variants are associated with a substantially higher average number of alleles within the population, averaging 187 alleles per variant. This observation suggests that a high number of alleles carrying a given variant within a population may be indicative of more severe or clinically relevant mutations that contribute to CF development.
In contrast, non CF-causing variants have a much lower average number of alleles, at 47 alleles per variant. This disparity in allele number between CF-causing and non CF-causing variants underscores the importance of the CFTR gene's genetic variability within a population in determining disease susceptibility and severity. Interestingly, variants of varying clinical consequences fall between these two extremes, with an average of 81 alleles per variant. This intermediate allele count suggests a continuum of genetic variability and clinical impact, where variants with higher allele counts may have a greater potential to influence disease manifestation. Furthermore, variants of unknown significance display the lowest number of alleles, averaging 14 alleles per variant. While the clinical implications of these variants remain uncertain, their lower allele count may indicate a lesser degree of genetic variability or impact on CF-related phenotypes.
Overall, these data highlight the complex relationship between allele count in the CFTR gene and disease severity, suggesting that the number of alleles could serve as a potential biomarker for predicting CF risk and prognosis. However, further research is necessary to validate these findings, as our study is subject to limitations. Importantly, despite the high group average, some CF-causing variants have only one allele. Thus, further studies with larger sample sizes and a more comprehensive analysis are needed to provide a more nuanced understanding of the genetic landscape of CF.
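The per-group averaging described above, together with the caveat that a high group mean can hide variants with a single allele, can be sketched as follows. The record layout (a determination label paired with an allele count), the function name, and every value in the toy data are hypothetical; they are not drawn from the CFTR2 dataset.

```python
from collections import defaultdict
from statistics import mean


def summarize_allele_counts(records):
    """Group (determination, allele_count) pairs and summarize each group.

    Returns {determination: {"mean": ..., "min": ..., "n": ...}} so that
    the group average can be read alongside its smallest member.
    """
    groups = defaultdict(list)
    for determination, allele_count in records:
        groups[determination].append(allele_count)
    return {
        determination: {
            "mean": mean(counts),
            "min": min(counts),
            "n": len(counts),
        }
        for determination, counts in groups.items()
    }


# Hypothetical toy records, for illustration only.
toy_records = [
    ("CF-causing", 1),
    ("CF-causing", 29),
    ("CF-causing", 300),
    ("Non CF-causing", 40),
    ("Non CF-causing", 60),
]

summary = summarize_allele_counts(toy_records)
for determination, stats in summary.items():
    print(determination, stats)
```

Reporting the minimum (or a median) next to the mean makes the skew visible: in the toy data, one heavily represented variant pulls the CF-causing mean far above a variant observed only once.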
We determined the ten most common cystic fibrosis mutations, ranked by allele frequency, among the 800 different mutations included in the CFTR2 dataset. Each patient can carry one or two mutated alleles of the CFTR gene, inherited from their mother, their father, or both. The alleles within the population were grouped by mutation variant. Among the 88,000 patients, approximately 69.7%, 2.5%, 2.1%, 1.6%, 1.3%, 1.2%, 0.9%, 0.9%, 0.8%, and 0.8% carry an allele with the F508del, G542X, G551D, N1303K, R117H, W1282X, R553X, 621+1G->T, 1717-1G->A, and 3849+10kbC->T mutation, respectively.
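The ranking step can be sketched as below. The text does not specify whether the denominator is the number of alleles or the number of patients, so this sketch assumes frequency is a variant's allele count divided by a total supplied by the caller. The function name and the placeholder variant names and counts are hypothetical.

```python
def top_allele_frequencies(allele_counts, total, n=10):
    """Return the n variants with the highest frequency.

    allele_counts maps variant name -> number of alleles observed.
    total is the denominator (an assumption: total alleles counted).
    The result is a list of (variant, frequency) pairs, highest first.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    frequencies = {
        name: count / total for name, count in allele_counts.items()
    }
    ranked = sorted(
        frequencies.items(), key=lambda item: item[1], reverse=True
    )
    return ranked[:n]


# Hypothetical placeholder counts, for illustration only.
toy_counts = {"variant_A": 70, "variant_B": 20, "variant_C": 10}

for name, frequency in top_allele_frequencies(toy_counts, total=100, n=2):
    print(f"{name}: {frequency:.1%}")
```

Passing the denominator explicitly keeps the choice visible: the same counts give different percentages depending on whether they are divided by alleles or by patients.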
The mutation variant name describes the location of the mutation on the 6,128-nucleotide CFTR sequence, as well as the type of mutation (deletion, insertion, single nucleotide polymorphism, etc.). The F508del mutation is characterized by the loss of the phenylalanine residue at position 508 of the CFTR protein through the deletion of three base pairs, resulting in misfolding [1]. The G542X, W1282X, and R553X mutations introduce a premature stop codon, which truncates the CFTR protein [2-4]. Two of the ten mutations cause ATP-gating defects: the G551D mutation consists of a substitution at position 551 that abolishes ATP-dependent gating, while R117H causes the loss of a hydrogen bond at position 117 that leaves the gate permanently open [5,6]. N1303K consists of a substitution at position 1303 that disrupts folding of the CFTR protein [7]. The 621+1G->T, 1717-1G->A, and 3849+10kbC->T mutations are single-nucleotide substitutions that disrupt CFTR protein function. These mutations are mainly designated as CF-causing in the CFTR2 dataset, with the exception of R117H, which is designated as "varying clinical consequences".
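The way a variant name encodes position and mutation type can be illustrated with a small parser. This is a simplified sketch that handles only the three name shapes appearing in the text (residue deletion, residue change with X marking a stop, and a nucleotide change written as a position plus or minus an offset); it is not a general nomenclature parser, and the function name and returned field names are hypothetical.

```python
import re

# One-letter residue, position, then "del" (e.g. F508del).
RESIDUE_DELETION = re.compile(r"^([A-Z])(\d+)del$")
# One-letter residue, position, replacement; X marks a stop (e.g. G542X).
RESIDUE_CHANGE = re.compile(r"^([A-Z])(\d+)([A-Z])$")
# Position, signed offset (optionally in kb), ref->alt base (e.g. 621+1G->T).
NUCLEOTIDE_CHANGE = re.compile(r"^(\d+)([+-])(\d+)(kb)?([ACGT])->([ACGT])$")


def parse_variant_name(name):
    """Split a variant name into its kind, position, and change."""
    match = RESIDUE_DELETION.match(name)
    if match:
        residue, position = match.groups()
        return {"kind": "residue deletion", "residue": residue,
                "position": int(position)}

    match = RESIDUE_CHANGE.match(name)
    if match:
        residue, position, replacement = match.groups()
        kind = "premature stop" if replacement == "X" else "residue substitution"
        return {"kind": kind, "residue": residue,
                "position": int(position), "replacement": replacement}

    match = NUCLEOTIDE_CHANGE.match(name)
    if match:
        anchor, sign, offset, kb, ref, alt = match.groups()
        distance = int(offset) * (1000 if kb else 1)
        return {"kind": "nucleotide substitution", "anchor": int(anchor),
                "offset": distance if sign == "+" else -distance,
                "ref": ref, "alt": alt}

    raise ValueError(f"unrecognized variant name: {name!r}")


for example in ["F508del", "G542X", "G551D", "621+1G->T", "1717-1G->A"]:
    print(example, parse_variant_name(example))
```

The deletion pattern is checked first so that a name ending in "del" is never mistaken for a residue change.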