The protein structures encoded by these two genes were predicted with HHpred and MODELLER, both run in the browser through the MPI Bioinformatics Toolkit. [6] HHpred was used to build sequence profiles from multiple sequence alignments. The multiple sequence alignment obtained from HHpred was then used to create a PIR file, which was passed to MODELLER to predict the 3D protein structure.
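As an illustration of the PIR step, the sketch below formats aligned sequences as PIR entries of the kind MODELLER reads: a `>P1;code` line, a description line with ten colon-separated fields, and the aligned sequence terminated by `*`. The function name `pir_record`, the codes `tmpl` and `target`, the residue range, and the toy sequences are all hypothetical; the Toolkit generates this file itself, so this is only meant to show what the file contains.

```python
def pir_record(code, aligned_seq, kind="sequence", first="", last="", chain="", width=75):
    """Format one PIR alignment entry (hypothetical helper, not a Toolkit function).

    kind is "structureX" for a template with a known structure
    and "sequence" for the target to be modelled.
    """
    header = f">P1;{code}"
    # Ten fields: type, code, first residue, first chain, last residue,
    # last chain, protein name, source, resolution, R-factor.
    fields = [kind, code, str(first), chain, str(last), chain, "", "", "", ""]
    description = ":".join(fields)
    # The aligned sequence (gaps written as '-') must end with '*'.
    body = aligned_seq + "*"
    lines = [body[i:i + width] for i in range(0, len(body), width)]
    return "\n".join([header, description, *lines])


# Toy two-entry alignment: one template and one target (made-up sequences).
alignment_text = "\n\n".join([
    pir_record("tmpl", "MKTALLIA", kind="structureX", first=1, last=8, chain="A"),
    pir_record("target", "MKT-LLVA"),
])
print(alignment_text)
```

The two entries must have the same aligned length, since MODELLER reads them column by column.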
Variant data for both genes were taken from ExAC, an exome database that contains data from 60,706 unrelated individuals. [7] ExAC contains extensive data on gene variants, ranging from SNPs to CNVs. Each variant in each gene was analyzed to determine whether it could be related to cleft lip/palate. ExAC calculates a filtering allele frequency for each variant, which indicates how plausibly the variant can contribute to a disease: a variant that is too common in the population is unlikely to cause a rare disorder. For a variant to be identified as a possible disease propagator, the filtering allele frequency has to be less than the maximum credible allele frequency for the disease of interest. The maximum credible allele frequency is calculated from the following values:
Disease prevalence: 1/1000 for cleft lip/palate
Genetic heterogeneity: 3% (0.03), based on the most pathogenic allele variant for CL/P (HYAL2 chr3:g.50320047T>C) [8]
Penetrance: how often mutations in the gene lead to the observed phenotype. This is difficult to assess and was therefore set to 0.50.
Max credible population allele frequency = prevalence × genetic heterogeneity × (1 / penetrance) = (1/1000) × 0.03 × (1/0.50) = 6e-5
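The threshold above can be reproduced in a few lines. This is a minimal sketch, assuming the product form implied by the listed values (prevalence × heterogeneity ÷ penetrance); the function names and the example filtering allele frequencies are hypothetical and are not taken from ExAC.

```python
def max_credible_af(prevalence, heterogeneity, penetrance):
    """Maximum credible population allele frequency.

    Assumed form: prevalence * heterogeneity / penetrance, which
    reproduces the 6e-5 threshold quoted in the text.
    """
    if not 0 < penetrance <= 1:
        raise ValueError("penetrance must be in (0, 1]")
    return prevalence * heterogeneity / penetrance


def is_candidate(filtering_af, threshold):
    """Keep a variant only if its filtering AF is below the threshold."""
    return filtering_af < threshold


threshold = max_credible_af(prevalence=1 / 1000, heterogeneity=0.03, penetrance=0.50)
print(f"{threshold:.1e}")  # 6.0e-05

# Made-up filtering allele frequencies, for illustration only.
for faf in (1e-5, 1e-3):
    print(faf, is_candidate(faf, threshold))
```

Because penetrance sits in the denominator, lowering it raises the threshold and lets more variants through, so the choice of 0.50 directly controls how strict the filter is.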
Data from the ExAC browser was parsed using the Harvard ExAC browser API. [9]
The gene variants were then analyzed for their effect on cancer. Data were taken from the GDC cancer data portal and examined. [10] The GDC portal has information on the mutations found in each gene in cancer patients. In the portal, mutations in a gene are characterized by their effect on the gene's function, which is determined by the amino acid change caused by the mutation. The amino acid changes were collected for each patient, and I examined whether each mutation fell in a conserved region. The principle behind this is that a mutation in a conserved part of the gene may cause a loss of function. Conserved regions were identified using the software BioEdit. First, CLPTM1 and CLPTM1L were searched with BLAST and a multiple sequence alignment was performed. This alignment was then loaded into BioEdit to determine the conserved regions.
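The conservation check can be sketched as follows. This is an illustration only: it uses a strict-identity criterion (a column is conserved when every aligned sequence carries the same residue), which is an assumption and not necessarily the criterion BioEdit applies. The function names and the toy alignment are hypothetical.

```python
def conserved_columns(alignment):
    """Return the 0-based alignment columns where all sequences share one residue."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")
    return {
        col
        for col in range(length)
        if alignment[0][col] != "-" and len({seq[col] for seq in alignment}) == 1
    }


def residue_to_column(aligned_seq, residue_number):
    """Map a 1-based residue number of the ungapped sequence to its alignment column."""
    seen = 0
    for col, char in enumerate(aligned_seq):
        if char != "-":
            seen += 1
            if seen == residue_number:
                return col
    raise ValueError("residue number exceeds sequence length")


def in_conserved_region(alignment, ref_index, residue_number):
    """True if the mutated residue of the reference sequence lies in a conserved column."""
    col = residue_to_column(alignment[ref_index], residue_number)
    return col in conserved_columns(alignment)


# Toy alignment (made-up sequences); the first row is the reference protein.
toy_alignment = [
    "MKT-LLVA",
    "MKTALLIA",
    "MRT-LLVA",
]
print(in_conserved_region(toy_alignment, 0, 4))  # residue 4 (L) is identical in all rows
print(in_conserved_region(toy_alignment, 0, 2))  # residue 2 (K) differs in the third row
```

The mapping step matters because mutation positions from the portal count residues in the ungapped protein, while conservation is defined on alignment columns that include gaps.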
Protein-protein interactions of CLPTM1 and CLPTM1L were analyzed using STRING and GeneMANIA. The resulting gene network was examined to see which other genes they interact with and how. STRING and GeneMANIA gave a rough estimate of coexpression, neighborhood, and homology.