Data Preparation & Pre-processing


1.  Handling Missing Values : 

   Simple Imputation

   - We identify missing values by examining each column of the dataset for null or NaN values using Python’s pandas library. In our dataset, the attributes Patient Id, Genes in mother’s side, Paternal gene, Blood cell count (mcL), Patient First Name, Father’s name, Location of institute, and Status are free of missing or null values. The remaining attributes contain missing values.
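The per-column check described above can be sketched as follows. This is a minimal illustration on a hypothetical toy frame, not the project's actual code: the sample values are invented, and the median fill is an assumption, since the text names simple imputation without stating which fill strategy was used.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real dataset has many more columns and rows.
df = pd.DataFrame({
    "Patient Id": ["P1", "P2", "P3", "P4"],
    "Patient Age": [2.0, np.nan, 6.0, 12.0],
    "Mother's age": [np.nan, 23.0, 41.0, np.nan],
})

# Count null/NaN entries per column.
missing_counts = df.isna().sum()

# Split the columns into those that are complete and those with gaps.
complete_columns = missing_counts[missing_counts == 0].index.tolist()
incomplete_columns = missing_counts[missing_counts > 0].index.tolist()

# Simple imputation (assumed strategy): fill numeric gaps with the column median.
imputed = df.copy()
for col in incomplete_columns:
    imputed[col] = imputed[col].fillna(imputed[col].median())

print(missing_counts)
print(imputed)
```

The median is less sensitive to outliers than the mean, which is one reason it is a common default for numeric columns. Categorical columns would need a different fill, such as the most frequent value.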


2.  Dealing with Duplicates : 

   Duplicate Detection and Removal

   - This is a crucial step in the analysis. Here the important features (attributes) are selected, and unimportant or irrelevant features are dropped. In our dataset there are several features (attributes) that do not contribute to gene disorder prediction.

The features (attributes) ‘patient Id’, ‘patient first name’, ‘family name’, ‘father’s name’, ‘institute name’, ‘location of institute’, ‘place of birth’, and ‘parental consent’ can be dropped because they contribute little or nothing to predicting the genetic disorders. The features ‘test 1’, ‘test 2’, ‘test 3’, ‘test 4’, ‘test 5’, and ‘autopsy shows birth defect (if applicable)’ are dropped because of their low feature importance: each holds the same value throughout its column.
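A minimal sketch of this step, assuming a pandas DataFrame with hypothetical sample rows: exact duplicate rows are removed first, then identifier columns and columns that hold a single value are dropped. The column names are shortened stand-ins for the attributes listed above.

```python
import pandas as pd

# Hypothetical toy data; the last row is an exact duplicate of the third.
raw = pd.DataFrame({
    "Patient Id": ["P1", "P2", "P3", "P3"],
    "Patient First Name": ["A", "B", "C", "C"],
    "Patient Age": [2, 6, 12, 12],
    "Test 1": [0, 0, 0, 0],
    "Test 2": [0, 0, 0, 0],
})

# Remove rows that are exact copies of an earlier row.
deduplicated = raw.drop_duplicates().reset_index(drop=True)

# Identifier-like columns carry no predictive signal for the disorder.
identifier_columns = ["Patient Id", "Patient First Name"]

# A column with a single distinct value cannot help separate classes.
constant_columns = [
    col for col in deduplicated.columns
    if deduplicated[col].nunique(dropna=False) <= 1
]

reduced = deduplicated.drop(columns=identifier_columns + constant_columns)
print(reduced)
```

Duplicates are removed before the identifier columns are dropped. After that point, two different patients with identical feature values would look like duplicates.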


3.  Encoding Categorical Variables :

    Label Encoding

   - We encoded the Disorder Subclass attribute. It contains the values ‘Leber’s hereditary optic neuropathy’, ‘Diabetes’, ‘Leigh syndrome’, ‘Cancer’, ‘Cystic fibrosis’, ‘Tay-Sachs’, ‘Hemochromatosis’, ‘Mitochondrial’, and ‘Alzheimer’s’, which are mapped to the values 0, 1, 2, 3, 4, 5, 6, 7, and 8 respectively.
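The mapping above can be written as an explicit dictionary. This is a sketch under assumptions: the exact spelling of the class strings in the dataset and the column name `Disorder Subclass` are taken from the description, not verified against the data.

```python
import pandas as pd

# Class names in the order given in the text; spelling is assumed.
subclasses = [
    "Leber's hereditary optic neuropathy",
    "Diabetes",
    "Leigh syndrome",
    "Cancer",
    "Cystic fibrosis",
    "Tay-Sachs",
    "Hemochromatosis",
    "Mitochondrial",
    "Alzheimer's",
]

# Assign 0..8 by position in the list.
label_map = {name: code for code, name in enumerate(subclasses)}

# Hypothetical sample of the target column.
df = pd.DataFrame({
    "Disorder Subclass": [
        "Leigh syndrome",
        "Leber's hereditary optic neuropathy",
        "Alzheimer's",
    ]
})

# Values missing from label_map would become NaN, which exposes spelling mismatches.
df["Disorder Subclass Code"] = df["Disorder Subclass"].map(label_map)
print(df)
```

An explicit dictionary keeps the codes in the stated order. scikit-learn's `LabelEncoder` would instead assign codes by sorted class name, so it would not reproduce this mapping.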



4.  Feature Selection :

    Correlation Analysis

The correlation between two features (attributes) can be positive, negative, or zero. With a positive correlation coefficient, the value of one feature tends to increase as the value of the other increases. With a negative correlation coefficient, the value of one feature tends to decrease as the value of the other increases.

For example, the correlation between ‘Patient age’ and ‘Mother’s age’ is slightly negative (-0.0074). A value this close to zero means the two features are essentially linearly uncorrelated; at most, there is a negligible tendency for mother’s age to decrease as patient age increases.
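To illustrate the three cases, the sketch below computes a Pearson correlation matrix with pandas on hypothetical toy columns. The column names and values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy columns chosen to show each kind of correlation.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "rises_with_x": [2, 4, 6, 8, 10],   # perfectly positive
    "falls_with_x": [10, 8, 6, 4, 2],   # perfectly negative
    "v_shaped": [2, 1, 0, 1, 2],        # no linear relationship
})

# Pearson correlation is the default method of DataFrame.corr().
corr = df.corr()
print(corr.round(4))

# List feature pairs whose absolute correlation exceeds an illustrative threshold.
threshold = 0.9
strong_pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(strong_pairs)
```

The `v_shaped` column depends on `x`, but its linear correlation with `x` is zero. A coefficient near zero therefore rules out only a linear relationship, not every relationship.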