Data Acquisition:
We obtained the most recent data on cystic fibrosis transmembrane conductance regulator (CFTR) variants from the Clinical and Functional Translation of CFTR (CFTR2) database [1]. The database is updated continually as clinical and functional information on additional CFTR variants is analyzed, so we used the most recent release, dated April 2023. Notably, the database draws on information from 88,000 patients living in the United States, Canada, and Europe. The data may therefore reflect regional and ethnic variability in variant distribution.
The dataset includes 804 variants, each identified by its cDNA name, protein name, and legacy name, together with its number of alleles in CFTR2, its allele frequency in CFTR2, and its variant final determination. The variant final determination takes one of four categories: CF-causing, varying clinical consequence, non CF-causing, and unknown significance. Across the entire dataset, 719 variants are CF-causing, 49 are of varying clinical consequence, 25 are non CF-causing, and 11 are of unknown significance.
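As a quick arithmetic check on the category counts above, the following minimal sketch confirms that they sum to the stated total and reports each category's share. The variable names are illustrative only; the counts are the ones reported in the text.

```python
# Category counts as reported in the text above.
counts = {
    "CF-causing": 719,
    "varying clinical consequence": 49,
    "non CF-causing": 25,
    "unknown significance": 11,
}

total = sum(counts.values())  # should equal the 804 variants stated above
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {counts[label]} variants ({share:.1%})")
```

CF-causing variants make up roughly 89% of the dataset, so the categories are strongly imbalanced, which is worth keeping in mind when interpreting classification results.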
Additionally, the number of alleles per variant in the CFTR2 database ranges from 1 to 199061, with a mean of 173.7. The allele frequency is calculated across the entire CFTR2 database and ranges from 0.000007 to 0.697, with a mean of 0.001.
Data Pre-Processing:
We randomly shuffled the data before splitting them into a training dataset and a testing dataset. The training dataset comprised about 75% of the data and the testing dataset about 25%. We excluded the variant cDNA, protein, and legacy names so that the analysis focused on the number of alleles, the allele frequency, and the variant final determination.
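The shuffle-and-split step can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used: the function name `shuffle_and_split`, the seed, and the placeholder rows are hypothetical, and the study does not state how the shuffle was performed or seeded.

```python
import random

def shuffle_and_split(rows, train_fraction=0.75, seed=0):
    """Shuffle a copy of rows, then cut it into (train, test) parts."""
    shuffled = list(rows)                      # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)      # seeded for reproducibility (assumption)
    cut = int(len(shuffled) * train_fraction)  # index separating train from test
    return shuffled[:cut], shuffled[cut:]

# Placeholder rows standing in for (number of alleles, allele frequency, determination).
# The values are dummies, not real CFTR2 data.
rows = [(i, 0.0, "placeholder") for i in range(804)]

train, test = shuffle_and_split(rows)
print(len(train), len(test))
```

With 804 rows and a fraction of 0.75, this cut yields 603 training rows and 201 testing rows; the exact counts used in the study are not stated.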
Training a Machine Learning Framework:
We used the Waikato Environment for Knowledge Analysis (Weka) software for all machine learning [2]. We designed a simple classification framework to predict a nominal variable (variant final determination) from two numerical variables (number of alleles in CFTR2 and allele frequency in CFTR2).
Specifically, after loading the training data into Weka, we tested multiple classifiers and selected the J48 algorithm for its effectiveness on predictive tasks, especially with relatively small datasets. J48, Weka's implementation of the C4.5 algorithm, is a decision tree classifier that recursively splits the data on the attribute that best separates the classes. It is less susceptible to overfitting than more complex algorithms, in part because it prunes the tree after growing it.
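To illustrate how a C4.5-style tree chooses a split on a numerical attribute, the sketch below scores candidate thresholds by gain ratio (information gain divided by split information). It is a simplified illustration and not Weka's J48 code: it omits pruning, minimum leaf sizes, and the other refinements J48 applies, and the function names and toy values are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split value <= threshold vs. value > threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # a split that leaves one side empty carries no information
    w_left, w_right = len(left) / len(labels), len(right) / len(labels)
    info_gain = entropy(labels) - (w_left * entropy(left) + w_right * entropy(right))
    split_info = -(w_left * math.log2(w_left) + w_right * math.log2(w_right))
    return info_gain / split_info

def best_threshold(values, labels):
    """Try every observed value except the largest as a candidate threshold."""
    candidates = sorted(set(values))[:-1]
    return max(candidates, key=lambda t: gain_ratio(values, labels, t))

# Toy data for illustration only (not CFTR2 values): one attribute, two classes.
values = [1, 2, 3, 10, 20, 30]
labels = ["x", "x", "x", "y", "y", "y"]

threshold = best_threshold(values, labels)
print(threshold, gain_ratio(values, labels, threshold))
```

In this toy case the chosen threshold separates the two classes perfectly. A full tree learner would repeat the same selection within each resulting subset until a stopping rule is met.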
Next, we chose 10-fold cross-validation as our test option, meaning that the training data were divided into 10 folds and each fold was held out once for evaluation while the model was trained on the remaining nine. Then, we began training. Our workflow is shown below.
Testing the Machine Learning Framework:
We tested the trained machine learning model on the test data that we had previously held out from the initial dataset. The inputs were the number of alleles in CFTR2 and the allele frequency in CFTR2. We then compared the predicted variant final determination with the ground truth.
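Comparing predictions with the ground truth can be sketched as an accuracy score plus a confusion count. This is a minimal illustration assuming the predictions have been exported as a list of labels. The function name `evaluate` and the example labels are hypothetical and do not represent results from this study.

```python
from collections import Counter

def evaluate(predicted, actual):
    """Return (accuracy, confusion), where confusion counts (actual, predicted) pairs."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be non-empty and of equal length")
    confusion = Counter(zip(actual, predicted))
    correct = sum(n for (a, p), n in confusion.items() if a == p)
    return correct / len(actual), confusion

# Made-up labels for illustration only; these are not study results.
actual = ["CF-causing", "CF-causing", "non CF-causing", "varying clinical consequence"]
predicted = ["CF-causing", "CF-causing", "CF-causing", "varying clinical consequence"]

accuracy, confusion = evaluate(predicted, actual)
print(f"accuracy = {accuracy:.2f}")
for (a, p), n in sorted(confusion.items()):
    print(f"actual={a!r} predicted={p!r}: {n}")
```

The per-pair counts show which categories are confused with one another, which a single accuracy figure can hide when one category dominates.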