SYSTEM DESIGN
In this project, we used the random forest and XGBoost machine learning algorithms for PD detection. Random forest is applied to spiral drawings to measure deflections in handwriting, while XGBoost is applied to the voice dataset to measure voice deflections in people affected by Parkinson's disease.
The datasets were collected from the Kaggle repository. The voice dataset contains 195 records in total: 147 from people affected by Parkinson's disease and 48 from healthy people. The spiral dataset was also collected from the Kaggle repository; its training set contains data from 72 persons (36 Parkinson's, 36 healthy) and its testing set contains data from 30 persons (15 Parkinson's, 15 healthy).
The main aim of this step is to study and understand the nature of the data acquired in the previous step and to assess its quality. Real-world data generally contains noise and missing values and may be in an unusable format that cannot be fed directly to a machine learning model. Data pre-processing is therefore required to clean the data and make it suitable for a machine learning model, which also increases the model's accuracy and efficiency. Identifying duplicate records in the dataset and removing them is also done in this step. In this project, data pre-processing means applying a min-max scaler, which rescales every feature value into the range -1 to 1, and reshaping all spiral images to the desired size.
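As a rough illustration of this pre-processing step, the sketch below loads the voice records with pandas, removes duplicates, and rescales every feature into the range -1 to 1 with scikit-learn's MinMaxScaler. The file name parkinsons.csv and the name/status column names are assumptions about the Kaggle export, not confirmed by the source.

    # Pre-processing sketch: load the voice records, drop duplicates, scale to [-1, 1]
    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler

    df = pd.read_csv("parkinsons.csv")      # hypothetical file name
    df = df.drop_duplicates()               # remove duplicate records

    X = df.drop(columns=["name", "status"], errors="ignore")   # speech measurements
    y = df["status"]                                           # 1 = PD, 0 = healthy

    scaler = MinMaxScaler(feature_range=(-1, 1))               # scale into [-1, 1]
    X_scaled = scaler.fit_transform(X)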
Features are selected using a ranking method. The voice feature dataset typically suffers from the curse of dimensionality, which increases training time and degrades classification accuracy. It is therefore important to reduce dimensionality by selecting only the relevant features. A ranking-based feature selection technique has been implemented using XGBoost.
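A minimal sketch of the ranking-based selection, assuming the scaled matrix X_scaled and labels y from the previous snippet; keeping the ten highest-ranked features is an illustrative choice rather than the project's tuned value.

    # Rank features by XGBoost importance and keep the top ones
    import numpy as np
    from xgboost import XGBClassifier

    ranker = XGBClassifier(n_estimators=100, eval_metric="logloss")
    ranker.fit(X_scaled, y)

    order = np.argsort(ranker.feature_importances_)[::-1]   # highest importance first
    top_k = order[:10]                                       # keep the 10 best-ranked features
    X_selected = X_scaled[:, top_k]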
The XGBoost classification algorithm is the classifier used to classify the disease based on the speech measurements and the spiral drawings given as input [19]. A classification algorithm is a supervised learning technique used to identify the category of new observations on the basis of training data: the program learns from the given dataset or observations and then classifies new observations into one of a number of classes or groups.
4.2.1 Detecting PD using Voice Deflections
Voice deflection has been used as a measure to detect PD because it is the most ubiquitous symptom, observed in the majority of patients. Different algorithms are applied to measure deflections in the voice in order to identify the disease. The voice dataset was collected from the Kaggle website; each column holds a separate speech measurement that aids in the diagnosis of PD, and the status column indicates the class: 1 indicates disease and 0 indicates a healthy condition.
Ø Loading dataset: This is the first step in voice deflection analysis. The dataset downloaded from the Kaggle repository is first loaded into the working directory.
Ø Applying min-max scaler: This step immediately follows loading the dataset. The min-max scaler rescales each value of the dataset into the range -1 to 1; this is a pre-processing technique.
Ø Splitting: The entire dataset is split into training and testing sets: 60% of the dataset is used to train the model and 40% is used to test it, as sketched after this list.
Ø Applying machine learning algorithms: This is the last step. Multiple algorithms are applied to the training and test sets and their accuracies are measured to choose the best algorithm; a comparison sketch follows the algorithm descriptions below.
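The 60/40 split described above could look like the following sketch, assuming X_selected and y from the feature-selection snippet; the random seed and stratification are illustrative choices.

    # Hold out 40% of the records for testing, 60% for training
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        X_selected, y, test_size=0.40, random_state=42, stratify=y)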
K-Nearest Neighbours (KNN) is one of the simplest machine learning methods in the realm of supervised learning, and it is quite simple to use and comprehend. KNN is commonly described as a lazy-learner and non-parametric algorithm. Its main objective here is classification; however, it does not handle complicated problems well.
Logistic regression is one of the most widely used supervised learning methods and is applied to categorical data. On the basis of a collection of independent variables, it is used to forecast a categorical outcome: YES or 1 indicates the presence of the disease, while NO or 0 indicates its absence. It is mostly used when the output is of the binary YES/NO variety.
The support vector machine (SVM) is a machine learning technique that can solve both classification and regression problems, and it gives accurate results compared with other techniques. In real time, however, the amount of data is likely to grow, so relying on this model alone is not a great decision.
The term "extreme gradient boosting" is "XGBoost." XGBoost is now the most popular machine-learning algorithm. When XGBoost is put up against other algorithms from the existing model, it consistently produces excellent accuracy. This results in an accuracy of 93.4%, which is very impressive. Based on the results of the decision tree, the algorithm operates. A boosting method called XGBoost builds decision trees based on the weights as shown in fig.7 describes the methodology of the XGBoost classifier.
4.2.2 Detecting PD using Spiral Drawings
In this approach, the person is asked to draw spiral drawings, and these drawings are tested against a previously trained model to detect PD. The spiral drawing dataset was taken from the Kaggle repository, and the input image should be in the form of fig. 8 shown below. The images were collected from several public datasets: the HandPD dataset, the Parkinson's Drawings dataset, and the improved spiral test employing a digitized graphics tablet for monitoring PD were combined into a single dataset.
For feature extraction from spiral drawings, we have used the HOG technique.
HOG is a structural descriptor that captures and quantifies changes in local gradients in the input image. HOG can therefore naturally quantify how the directions of both spirals and waves change, and it can also capture whether these drawings have more of a “shake” to them, as we might expect from a Parkinson's patient. The resultant feature vector is then used to train the classifier.
To quantify an image using a Histogram of Oriented Gradients (HOG) descriptor, we need to follow these steps.
· Pre-Processing : First, we need to preprocess the image by converting it to grayscale and applying any necessary normalization or filtering.
· Cell size and block size : Next, we need to decide on the cell size and block size to be used in the HOG computation. The cell size defines the spatial resolution of the HOG feature, while the block size defines the region over which normalization is performed.
· Gradient computation : We compute the gradients of the image in the x and y directions using a derivative filter (such as Sobel). This gives us the gradient magnitudes and orientations at each pixel.
· Orientation binning : We then quantize the gradient orientations into a set of predefined bins (usually 9) and accumulate the gradient magnitudes into each bin.
· Block normalization : Finally, we group the cells into overlapping blocks and normalize the HOG histograms within each block using a normalization function (such as the L2 norm). This helps to make the descriptor more robust to lighting and contrast changes.
· Descriptor Calculation : The final HOG descriptor is calculated by concatenating the normalized histograms from all of the cells within a block, and then moving the block across the entire image to compute the descriptor for the entire image.
The output of this process is a feature vector that represents the HOG descriptor of the input image. This feature vector can be used as input to a machine learning algorithm for tasks such as object detection, pedestrian detection, or facial recognition. The descriptor captures the local gradient information in the image and is invariant to illumination and contrast changes, making it a useful feature for computer vision applications. Fig. 9 describes the steps involved in the spiral dataset analysis.
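A minimal sketch of the HOG quantification described above, using OpenCV for the pre-processing and scikit-image's hog function for the descriptor. The 200x200 input size, the Otsu thresholding step, and the cell/block sizes are illustrative assumptions rather than the project's exact parameters.

    # Quantify one spiral drawing with a HOG descriptor
    import cv2
    from skimage.feature import hog

    image = cv2.imread("spiral.png")                    # hypothetical file name
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)      # pre-processing: grayscale
    gray = cv2.resize(gray, (200, 200))                 # fixed input size
    gray = cv2.threshold(gray, 0, 255,
                         cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)[1]

    features = hog(gray,
                   orientations=9,                      # orientation binning (9 bins)
                   pixels_per_cell=(10, 10),            # cell size
                   cells_per_block=(2, 2),              # block size
                   block_norm="L2-Hys",                 # block normalization
                   transform_sqrt=True)
    # `features` is the flattened descriptor used to train the classifier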
Loading Dataset : This is the first step in the spiral drawing analysis. The dataset downloaded from the Kaggle repository is first loaded into the working directory.
Resizing the images : Because the spiral drawings are not all the same size, in order to obtain accurate results we have to resize the spiral images to a fixed size. The following steps have to be followed to resize the images (a sketch is given after the list).
1) Load the image into memory using an image processing library, such as Pillow or OpenCV.
2) Determine the desired size of the resized image. This can be done either by specifying the width and height in pixels or by specifying a scaling factor that indicates how much the image should be scaled up or down.
3) Use an image resizing function to resize the image. The exact function will depend on the library you are using, but typical options include resize() in Pillow or resize() in OpenCV. These functions resample the image pixels to the new size; note that when both width and height are specified, the aspect ratio is only preserved if one dimension is computed from the other using a scaling factor.
4) Save the resized image to disk in a suitable file format, such as JPEG or PNG.
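The resizing steps above can be sketched with OpenCV as follows; the 200x200 target size and the file names are placeholders.

    # Resize a spiral image to a fixed size and save it back to disk
    import cv2

    image = cv2.imread("input_spiral.png")               # 1) load into memory
    target_size = (200, 200)                             # 2) desired width and height
    resized = cv2.resize(image, target_size,
                         interpolation=cv2.INTER_AREA)   # 3) resample the pixels
    cv2.imwrite("resized_spiral.png", resized)           # 4) save to disk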
Labelling : Spiral images have to be labelled as healthy or Parkinson's. This label serves as the target output for training.
Generating NumPy array : The features and labels are appended to the data and labels lists respectively. Finally, data and labels are converted to NumPy arrays and returned conveniently in a tuple, as sketched below.
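A sketch of the labelling and array-generation steps, assuming the spiral images are stored in healthy/ and parkinson/ sub-folders (a hypothetical layout) and reusing the illustrative HOG parameters from the earlier sketch.

    # Build the (data, labels) NumPy arrays from a folder of labelled spiral images
    import os
    import cv2
    import numpy as np
    from skimage.feature import hog
    from sklearn.preprocessing import LabelEncoder

    def build_dataset(root_dir):
        data, labels = [], []
        for label in ("healthy", "parkinson"):           # folder name doubles as the label
            folder = os.path.join(root_dir, label)
            for filename in os.listdir(folder):
                image = cv2.imread(os.path.join(folder, filename))
                gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
                gray = cv2.resize(gray, (200, 200))      # fixed size before HOG
                features = hog(gray, orientations=9, pixels_per_cell=(10, 10),
                               cells_per_block=(2, 2), block_norm="L2-Hys")
                data.append(features)                    # append the feature vector
                labels.append(label)                     # append the matching label
        encoded = LabelEncoder().fit_transform(labels)   # "healthy"/"parkinson" -> 0/1
        return np.array(data), np.array(encoded)         # return as a tuple of arrays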