Exploring Applied Machine Learning Techniques for Challenging Data Problems: An In-Depth Analysis and Implementation of Models

Advances in technology are driving rapid growth across many sectors. One product of these advances is machine learning, which is now applied in fields such as healthcare. This study develops a program for identifying and predicting hepatitis. It uses three models, linear regression, logistic regression and Naïve Bayes classification, to predict the disease. The aim is to identify the disease stage so that treatment can be planned more easily.

Introduction

Advances in technology have transformed many sectors of society. Technology improves day by day and helps address problems in areas such as healthcare, education and many others (Ahad, Paiva, Tripathi, & Feroz, 2020). It has made life easier for everyone, and the support it gives professionals in their regular duties is especially valuable. Even people who cannot read use technology frequently in daily life, for example mobile phones and other electronic devices.

One such advance is machine learning, a method that applies complex models and algorithms to analyse data and make predictions. In healthcare, machine learning is used to identify diseases and to predict their progression (Ghazal, et al., 2021). In this study we analyse a Hepatitis C prediction data set (sourced from Kaggle.com). Machine learning makes it possible to combine different variables to predict the level or progression of the disease without compromising the precision of the risk prediction. Because the disease is non-linear in nature, it is difficult to develop a risk prediction model for Hepatitis C (Saad, Gómez-Aguilar, & Almadiy, 2020).

Hepatitis is considered one of the major diseases spread by a virus. The virus is easily transmitted from an infected person to others, and a large number of patients are fighting the disease. The number of affected people has reached 17 million worldwide and is increasing day by day. Hepatitis C causes about 400,000 deaths worldwide each year. The hepatitis C virus (HCV) attacks the liver and causes inflammation. The infection has both acute and chronic stages. Acute hepatitis, a brief episode of the disease, occurs during the first six months after HCV infection is identified and can progress to severe or chronic hepatitis after six months, leading to serious illness (Khan, Soh, Maenner, Thompson, & Nelson, 2019). When HCV attacks the liver, the innate immune system releases inflammatory chemicals. To heal the injury, these inflammatory molecules stimulate the liver to produce fibrous protein. The infection may last only a few weeks if it is diagnosed and treated early, but it can also persist for life. Accurate identification of the disease and its stage is therefore very important for treatment, and machine learning can help a great deal here (Khan, et al., 2018).

Machine learning is well suited to learning the progression of hepatitis C. It learns automatically from past observations supplied to the program as data. The basic purpose is therefore to build a program that identifies and predicts the progression of the disease. Machine learning offers several benefits. First, it can be more accurate than predictions based on human judgement or experience. It does not require human help to draw conclusions, because the results are based on previous data. It is also inexpensive and can be incorporated into any learning procedure. However, it requires large amounts of labelled data, without which prediction is difficult (Butt, et al., 2021).

Machine learning is a critical component of the health-care revolution. For prevention strategies, machine learning approaches such as Naïve Bayes, decision trees, logistic regression, linear regression and others can be used to predict a person's HCV risk. This allows the patient to receive therapy early in the disease's life cycle, preventing it from worsening. This report compares the performance metrics of several classification approaches in machine learning applied to a Hepatitis C data set. Definitive diagnosis of the hepatitis C stage has been investigated using an Artificial Neural Network, which has several benefits such as a higher detection rate, simple design, the ability to cope with small samples, and good generalization (Konerman, et al., 2019).

Previous research shows that different models, such as the Artificial Neural Network, can be used. In this research we use three main models: linear regression, logistic regression and Naïve Bayes classification. These models help to identify the stage of the disease and to predict its progression. Linear regression and logistic regression in particular can be used to predict the disease, so that the proper steps can be followed and preventive measures taken as soon as possible. There is not yet a vaccine for the disease, so prevention remains the best solution.

The present research is based on the following research questions:

1. How is machine learning used in healthcare?

2. Can the disease stage be identified accurately?

3. Can the progression of the disease be predicted with precision?

Exploratory Data Analysis

Missing Plot

The CHOL column contains 1.63% missing values and the ALP column contains 2.93% missing values. No other column in the data set contains missing values.
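The percentages above are simply the count of missing entries in a column divided by the number of rows. The following is a minimal sketch of that calculation in plain Python. The function name `missing_percentages` and the toy rows are assumptions made for illustration; they are not the study's code or data.

```python
def missing_percentages(rows):
    """Return the percentage of missing (None) values for each column.

    `rows` is a list of dicts that all share the same keys.
    """
    if not rows:
        return {}
    n_rows = len(rows)
    return {
        column: 100.0 * sum(1 for row in rows if row.get(column) is None) / n_rows
        for column in rows[0]
    }


# Hypothetical toy rows, NOT the real Hepatitis C data set.
toy_rows = [
    {"CHOL": 3.2, "ALP": 52.5},
    {"CHOL": None, "ALP": 70.3},
    {"CHOL": 4.8, "ALP": None},
    {"CHOL": 5.2, "ALP": 74.7},
]

toy_missing = missing_percentages(toy_rows)
print(toy_missing)  # {'CHOL': 25.0, 'ALP': 25.0}
```

Applied to the real data set, the same ratio would give the column percentages shown in the missing plot.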

Data visualization

Bar Plot: This plot shows the frequencies of the two categorical variables. In Category, blood donors were the most frequent, followed by Cirrhosis, Hepatitis and Fibrosis. In Sex, male patients were more frequent than female patients.

Histogram Plot: The Age, ALB, CHE, CHOL and PROT variables were approximately normally distributed, while the other variables were skewed to the left or right.

Density Plot: The density plots of the numeric and integer variables show a mix of bimodal, skewed and normal shapes.

Box Plot: In this plot, the Category variable was compared with all other variables, as shown below.

Correlation Plot: In this graph every variable was compared with every other variable to see whether any relationship exists between them. A positive value means that as one variable increases, the other tends to increase as well.
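Each cell of a correlation plot is usually a Pearson correlation coefficient between two variables. The sketch below assumes that this is the measure behind the plot, which the text does not state. The function name `pearson` and the toy values are illustrative only.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical toy values: one perfectly increasing pair, one perfectly decreasing pair.
r_positive = pearson([1, 2, 3, 4], [2, 4, 6, 8])
r_negative = pearson([1, 2, 3, 4], [8, 6, 4, 2])
print(r_positive, r_negative)
```

A value near +1 means the two variables rise together, a value near -1 means one falls as the other rises, and a value near 0 indicates no linear relationship.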

Model implementation & Validation

This chapter implements the models for predicting hepatitis C cases. The original data was split into a training set and a test set. The training data is used to fit the models, and the fitted models are then applied to the test data to assess the accuracy of their predictions. Linear regression, logistic regression and Naïve Bayes classification are the three models used. After the experiment, we assess the accuracy and validity of the results.
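The split into training and test sets can be sketched as a seeded random shuffle followed by a cut. The text does not give the split ratio, so the 80/20 ratio, the seed and the function name `split_train_test` below are all assumptions.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of `rows` and split it into (train, test) lists.

    test_fraction=0.2 is an assumed ratio; the report does not state one.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical toy "rows": ten record identifiers.
all_rows = list(range(10))
train_rows, test_rows = split_train_test(all_rows)
print(len(train_rows), len(test_rows))  # 8 2
```

Fixing the seed keeps the same records in each set across runs, which makes the accuracy figures of the three models comparable.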

Linear Regression

In this model, hepatitis C was treated as the dependent variable and all other variables as independent variables. Age had a significant negative association with hepatitis C. ALP and ALT also had negative and statistically significant effects on the dependent variable. CHE and GGT had positive effects on the dependent variable. The model fitness was 0.22.
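If the model fitness reported above is the coefficient of determination R² (an assumption, since the text does not name the measure), it can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below fits a least-squares line with a single predictor for brevity, whereas the study's model used several predictors. The function names and toy values are illustrative.

```python
def fit_simple_ols(x, y):
    """Least-squares fit of y = intercept + slope * x for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical toy values, NOT taken from the Hepatitis C data set.
x_toy = [1, 2, 3, 4, 5]
y_toy = [2, 4, 5, 4, 5]
intercept, slope = fit_simple_ols(x_toy, y_toy)
fitted = [intercept + slope * v for v in x_toy]
r2 = r_squared(y_toy, fitted)
print(round(slope, 3), round(intercept, 3), round(r2, 3))  # 0.6 2.2 0.6
```

Under this reading, a value of 0.22 would mean that the predictors explain roughly 22% of the variance in the dependent variable.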

Logistic Regression

The confusion matrix from the logistic regression showed 2 true positives and 3 true negatives. The model accuracy was very low, at 2.109.
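Accuracy is derived from the confusion matrix as the number of correct predictions (true positives plus true negatives) divided by the total number of predictions, so it lies between 0 and 1, or between 0% and 100%. The sketch below shows the calculation. The text does not report the false positive and false negative counts, so the labels here are hypothetical toy values and the function names are illustrative.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary outcome."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            counts["tp"] += 1
        elif p == positive:
            counts["fp"] += 1
        elif a == positive:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def accuracy(counts):
    """Share of all predictions that were correct."""
    return (counts["tp"] + counts["tn"]) / sum(counts.values())


# Hypothetical labels (1 = hepatitis C, 0 = no hepatitis C), NOT the study's results.
actual_labels = [1, 1, 0, 0, 0, 1, 0, 0]
predicted_labels = [1, 0, 0, 0, 1, 1, 0, 0]
cm = confusion_counts(actual_labels, predicted_labels)
acc = accuracy(cm)
print(cm, acc)  # {'tp': 2, 'fp': 1, 'tn': 4, 'fn': 1} 0.75
```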

Naïve Bayes Classification

For the Naïve Bayes classification model, the klaR library was used. First, missing values were removed from the data set, and the data was then split into training and test sets. The full model was then fitted and a confusion matrix was produced from it. The results are given below. The model fitness was 97%.
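To illustrate what a Naïve Bayes classifier does with numeric features, the following is a minimal from-scratch sketch. It assumes Gaussian class-conditional densities and is written in Python for illustration only. It is not the klaR implementation used in the analysis, and the function names, class labels and feature values are hypothetical.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(features, labels):
    """Estimate per-class log prior, feature means and feature variances."""
    rows_by_class = defaultdict(list)
    for row, label in zip(features, labels):
        rows_by_class[label].append(row)
    model = {}
    for label, rows in rows_by_class.items():
        columns = list(zip(*rows))
        means = [sum(col) / len(col) for col in columns]
        # Population variance with a small floor to avoid division by zero.
        variances = [
            max(sum((v - m) ** 2 for v in col) / len(col), 1e-9)
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(len(rows) / len(features)), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior for one feature row."""
    best_label, best_score = None, -math.inf
    for label, (log_prior, means, variances) in model.items():
        score = log_prior
        for value, mean, var in zip(row, means, variances):
            # Log of the normal density, added per feature (the "naive" independence step).
            score += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical two-feature toy data, NOT the Hepatitis C data set.
toy_features = [[1.0, 2.0], [1.2, 1.8], [0.9, 2.1], [3.0, 4.0], [3.2, 3.9], [2.9, 4.2]]
toy_labels = ["donor", "donor", "donor", "hepatitis", "hepatitis", "hepatitis"]
nb_model = fit_gaussian_nb(toy_features, toy_labels)
pred_a = predict_gaussian_nb(nb_model, [1.1, 2.0])
pred_b = predict_gaussian_nb(nb_model, [3.1, 4.0])
print(pred_a, pred_b)  # donor hepatitis
```

The classifier scores each class by adding the log prior to the per-feature log densities and picks the class with the highest total. Predictions on the test set can then be tabulated against the true labels to form a confusion matrix.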
