Cystic fibrosis (CF) is a genetic disorder that impairs chloride transport across cell membranes in mucus-secreting organs such as the lungs, pancreas, and digestive system [1]. CF is caused by mutations in the gene that encodes the cystic fibrosis transmembrane conductance regulator (CFTR) protein, located on chromosome 7 [1]. CFTR mutations fall into six classes that produce symptoms of varying severity (Figure 1).
CF is inherited in an autosomal recessive pattern: the disease occurs when both copies (alleles) of the CFTR gene on chromosome 7 carry a mutation that impairs CFTR function [1]. One allele is inherited from each parent, and the two alleles may carry the same mutation or two different mutations. More than 2,000 different CFTR mutations have been found in patients diagnosed with CF, and symptoms vary by mutation. Some CFTR mutations produce no CFTR at all (Class I), some code for CFTR proteins that never reach the cell membrane (Class II), and others encode mutated CFTR proteins that are expressed on the cell membrane (Class III-IV) [1]. Patients with no CFTR at the cell membrane, as in Class I and II, tend to have more severe symptoms. Patients who express CFTR at the membrane but have limited chloride transport can have moderate to mild symptoms.
Data repositories provide extensive genomic information on CFTR mutations, including allele frequency, allele number, and the corresponding disease condition. The dataset classifies each CFTR mutation as CF-causing, Non CF-causing, of varying clinical consequence, or of unknown significance. Our work will predict which of these categories a CFTR mutation falls under, based on its allele number and allele frequency. We aim to accomplish this by training, and then validating, multiple machine learning models.
Train a machine learning algorithm on 75% of the data to predict the disease classification of a variant from its allele frequency.
Evaluate the performance of the trained algorithm on the remaining 25% of the data.
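The two aims above can be illustrated with a minimal sketch. It assumes each variant is reduced to two numeric features (allele frequency and allele number) and one of the four category labels. The records are synthetic placeholders generated only so that the example runs; they are not real CFTR data and imply nothing about how allele frequency relates to disease category. The k-nearest-neighbors classifier is a stand-in chosen for brevity, not the method this work commits to, and all function names are hypothetical.

```python
import math
import random
from collections import Counter

# Category labels taken from the dataset description above.
LABELS = ["CF-causing", "Non CF-causing",
          "Varying clinical consequence", "Unknown significance"]

def make_synthetic_variants(n_per_class=40, seed=0):
    """Generate toy (allele_frequency, allele_number, label) records.

    ASSUMPTION: purely synthetic data. Each class gets an arbitrary
    frequency range so that the toy problem is learnable.
    """
    rng = random.Random(seed)
    rows = []
    for idx, label in enumerate(LABELS):
        for _ in range(n_per_class):
            allele_frequency = 10 ** rng.gauss(-5 + idx, 0.3)
            allele_number = rng.randint(1000, 150000)
            rows.append((allele_frequency, allele_number, label))
    return rows

def split_rows(rows, train_fraction=0.75, seed=0):
    """Shuffle the records and split them into training and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def features(row):
    """Log-scale both features, since they span several orders of magnitude."""
    allele_frequency, allele_number, _ = row
    return (math.log10(allele_frequency), math.log10(allele_number))

def knn_predict(train, query, k=5):
    """Predict a label by majority vote among the k nearest training records."""
    point = features(query)
    nearest = sorted(train, key=lambda r: math.dist(features(r), point))[:k]
    votes = Counter(r[2] for r in nearest)
    return votes.most_common(1)[0][0]

def accuracy(train, test, k=5):
    """Fraction of held-out records whose predicted label matches the true one."""
    correct = sum(knn_predict(train, r, k) == r[2] for r in test)
    return correct / len(test)

rows = make_synthetic_variants()
train, test = split_rows(rows)            # 75% train, 25% test
print(len(train), len(test))
print(f"held-out accuracy: {accuracy(train, test):.2f}")
```

Holding out the 25% test set before any training keeps the evaluation independent of the data the classifier has seen. With real data, a stratified split may be preferable if the four categories are unevenly represented.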