Real-world data sets often contain missing values for some features, either because those values were unavailable or were simply not recorded; such entries are typically represented by NA or a blank.
Because the proportion of blank values is relatively low (12%), I removed the rows containing them from the data set.
In the data set, many values were unavailable (NA). Counting the NA values yielded the following findings:
Out of 3390 subjects, 11% (~404) had at least one NA value for a feature. There were two options: either (a) remove all affected patients or (b) impute the NA values. Since ~404 patients (11% of the total) had NA values, removing them could lead to significant information loss. I therefore imputed the data using scikit-learn's imputation transformer, replacing each unavailable value with the mean of its feature.
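The mean-imputation step described above can be sketched as follows; the toy matrix stands in for the study's feature matrix, and the values are purely illustrative:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix standing in for the study's data; np.nan marks NA values.
X = np.array([
    [39.0, np.nan, 195.0],
    [46.0, 20.0,  np.nan],
    [48.0, 10.0,  245.0],
])

# Replace each NA with the mean of its column (feature), as described above.
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)
```

After fitting, every NA is replaced by the corresponding feature mean, so no rows need to be dropped.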
The data set has been used in different ways on a single algorithm to study its sensitivity to data quality. It is randomly split in a ratio of 20:80 between the prediction set and the training set, respectively.
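The 20:80 split can be reproduced with scikit-learn's `train_test_split`; the features and labels here are synthetic stand-ins with the same number of subjects (3390), and the `random_state` value is an arbitrary choice for reproducibility:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 3390 subjects, 3 features, binary CHD label.
rng = np.random.default_rng(0)
X = rng.normal(size=(3390, 3))
y = rng.integers(0, 2, size=3390)

# 80% for training, 20% held out for prediction, as in the split above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```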
The prediction model is trained by analyzing existing data, which is possible because the data set records whether or not each subject has heart disease; this process is known as supervised learning. The trained model is then used to predict CHD within the first 10 years after the subject's examination.
Logistic regression is a classification algorithm used to assign observations to a discrete set of classes. It predicts a binary dependent variable from a set of predictor (independent) variables by calculating the probability of a binary outcome, e.g., pass/fail.
On running logistic regression, it is noticeable that the model predicts 0 (the absence of CHD within 10 years) more often, because the training data contains a significantly larger number of subjects without CHD. This approach can work, but it has two flaws. First, because there is little variance in the training data, the results lean toward a particular outcome. Second, binary classification is not suitable for every case, as it is difficult to categorize the CHD risk of all subjects as simply 0 or 1.
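The bias toward the majority class can be demonstrated on synthetic data; the ~85/15 class balance and the weakly informative features below are illustrative assumptions, not the study's actual data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, imbalanced labels: ~85% of subjects are 0 (no CHD).
rng = np.random.default_rng(1)
n = 1000
y = (rng.random(n) < 0.15).astype(int)

# Features only weakly correlated with the label, plus noise.
X = y[:, None] * 0.5 + rng.normal(size=(n, 2))

model = LogisticRegression().fit(X, y)
preds = model.predict(X)

# The majority class 0 dominates the model's predictions.
frac_zero = (preds == 0).mean()
```

With weak predictors and a dominant negative class, the fitted model labels most subjects 0, mirroring the behavior described above.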
To overcome the variance problem, the data was manipulated. It was observed that changing the distribution of the number of cigarettes smoked and of diabetes affected the outcome: increasing these features increased the predicted probability of CHD.
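The effect of increasing such a feature on the predicted probability can be sketched as follows; the single "cigarettes per day" feature, the synthetic label rule, and the probe values are all illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: one feature ("cigarettes per day") strongly
# associated with the CHD label, so the fitted coefficient is positive.
rng = np.random.default_rng(3)
cigs = rng.uniform(0, 40, size=500)
y = (cigs + rng.normal(scale=5.0, size=500) > 20).astype(int)

model = LogisticRegression().fit(cigs[:, None], y)

# Increasing the feature value raises the predicted CHD probability.
p_low = model.predict_proba([[5.0]])[0, 1]
p_high = model.predict_proba([[35.0]])[0, 1]
```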
For classification problems, accuracy is a vital performance measure of the classifier. The accuracy of this model is 85%.
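Given the class imbalance noted earlier, it is worth checking what accuracy a trivial all-negative classifier would achieve; the 850/150 class counts below are illustrative numbers chosen to mirror that imbalance, not the study's exact figures:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Deterministic illustration: 850 negatives, 150 positives.
y_true = np.array([0] * 850 + [1] * 150)

# A model that always predicts "no CHD" for everyone...
y_pred = np.zeros_like(y_true)

# ...still scores 85% accuracy, so accuracy alone can be misleading
# on imbalanced data.
acc = accuracy_score(y_true, y_pred)
```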
The study uses Google Colab and Kaggle.