Final Year Project

Development of an Intelligent Screening System for Chronic Leukemia

Artificial Intelligence

ABSTRACT

The accuracy of the conventional diagnostic procedure for Leukemia can be reduced by factors such as the tiredness and emotional state of the expert, because the procedure is performed manually by a hematologist or pathologist. Given the growing number of cases and the importance of early diagnosis for chronic Leukemia, an automated intelligent screening system for chronic Leukemia is needed. The development procedure consists of five main stages: image segmentation, feature extraction, feature selection, classification and a Graphical User Interface (GUI). A total of 548 nuclei were extracted from 100 images (50 samples of chronic myeloid leukemia, CML, and 50 samples of chronic lymphocytic leukemia, CLL) and used for the analysis. Development begins with the Image Segmentation stage, which involves Colour Thresholding, Gradient Edge Detection and Convex Area Filtering to remove artefacts and prepare the image for the feature extraction stage. A total of 28 geometrical, colour and textural features were extracted. To improve the performance of the classifier, feature selection was applied to select the dominant features. The Genetic Algorithm (GA), ReliefF (RfF) and Neighbourhood Components Analysis (NCA) algorithms were used for feature selection. Based on the results, the features selected by RfF provided the highest overall accuracy (99.4%). To ensure the reliability and accuracy of the classifier selected for this system, three types of classifier were evaluated. Each classifier was optimized before selection to ensure that it fitted the chronic Leukemia data. A weightage scoring method was used to select the best classifier among k-nearest neighbour (kNN), Support Vector Machine (SVM) and a Multilayer Perceptron (MLP) network trained with the Levenberg-Marquardt (LM) algorithm; the weightage overcomes the problem of the evaluation parameters contributing unevenly. The classifier with the highest score was selected for implementation in the screening system. Based on the results, MLP was selected in this study. The last stage was creating the GUI, which was done with GUIDE in MATLAB R2017b.

Problem Statements

1. The incidence of Leukemia is growing, both globally and in Malaysia.

The Leukemia & Lymphoma Society (LLS) reports that approximately every 3 minutes, one person in the United States (US) is diagnosed with a blood cancer. An estimated combined total of 174,250 people in the US were expected to be diagnosed with Leukemia, lymphoma or myeloma in 2018. In addition, statistics from MIMS Malaysia indicate that blood cancer ranked fourth among cancers in Malaysia in 2016. According to World Life Expectancy, the Leukemia death rate in Malaysia is 4.18 per 100,000, which ranks the country 75th out of 183 countries.

2. The accuracy of the conventional diagnosis procedure for Leukemia is uncertain.

The conventional early-stage diagnosis is performed manually: a human expert must keep screening a thousand blood slide samples under the microscope over a long period. The accuracy is therefore reduced by factors such as the tiredness or emotional state of the expert.

Objectives

To explore and evaluate suitable feature extraction algorithms.
To identify and evaluate a suitable intelligent classifier for identifying chronic Leukemia cells.
To develop an intelligent screening system based on the selected procedure.

Methodology

The study consists of 5 main steps: Image Acquisition, Image Segmentation, Feature Extraction, Feature Selection and Classification.

Image Acquisition

In this study, the slide images of chronic leukemia were provided by Hospital Universiti Sains Malaysia (HUSM). The slides were viewed under a Leica microscope at 40× magnification, captured using an Infinity 2 camera and saved in bitmap format at 800×600 resolution. A total of 100 slide images were analyzed in this study: 50 for CML and 50 for CLL. From these slide images, a total of 548 cell nuclei were segmented: 291 for CML and 257 for CLL.

Image Segmentation

Image segmentation consists of two steps: colour segmentation and filtering. Colour segmentation consists of Colour Thresholding and Gradient Edge Detection, while filtering consists of Convex Area Filtering.

Colour Thresholding

Threshold values have to be identified to separate the relevant information from the irrelevant information.
The pixel values of each RGB component, and of the total of the RGB components, were identified.
This step focused only on the background, RBC and WBC nucleus regions of the image, for both CML and CLL.
The threshold values were determined using the range and mean of the pixel values.
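
As a rough illustration of this step, the sketch below summarises sampled pixel values by range and mean and then applies the threshold values reported in the Result section (117 for the G component, 300 to 450 for the RGB total). How the two thresholds were combined, and the direction of the G comparison, are not stated in this study, so the rule in is_nucleus_pixel is an assumption. The function names and the sample pixels are hypothetical.

```python
# Threshold values reported in the Result section of this study.
G_THRESHOLD = 117
RGB_SUM_RANGE = (300, 450)

def summarise(values):
    """Range and mean of sampled pixel values: (min, max, mean)."""
    return min(values), max(values), sum(values) / len(values)

def is_nucleus_pixel(r, g, b):
    """Assumed rule: G at or below the threshold AND RGB total inside the range."""
    total = r + g + b
    return g <= G_THRESHOLD and RGB_SUM_RANGE[0] <= total <= RGB_SUM_RANGE[1]

def threshold_image(image):
    """Binary mask (1 = candidate nucleus) for an image stored as rows of (R, G, B) tuples."""
    return [[1 if is_nucleus_pixel(*px) else 0 for px in row] for row in image]

# Tiny illustrative image with made-up pixel values (not taken from the slides).
sample = [
    [(230, 225, 228), (120, 70, 150)],  # bright background, purple nucleus
    [(200, 140, 150), (110, 90, 160)],  # pinkish RBC, purple nucleus
]
mask = threshold_image(sample)
g_stats = summarise([60, 90, 117])
```

In this toy image only the two purple pixels pass both tests, so the mask keeps the right-hand column.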

Gradient Edge Detection

The Sobel, Prewitt and Roberts operators were applied to separate the cells in the image into individual objects.
The operator that produced the clearest boundary between cells was selected.
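
The three operators differ only in the small kernels used to estimate the horizontal and vertical gradients. The sketch below uses the standard textbook kernels and computes the gradient magnitude on a tiny greyscale image stored as a list of rows. It is a plain-Python illustration, not the MATLAB routine used in this study, and the function names are hypothetical.

```python
import math

# Standard horizontal/vertical kernel pairs for each operator.
KERNELS = {
    "sobel":   ([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]],
                [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]),
    "prewitt": ([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],
                [[-1, -1, -1], [0, 0, 0], [1, 1, 1]]),
    "roberts": ([[1, 0], [0, -1]],
                [[0, 1], [-1, 0]]),
}

def correlate(image, kernel):
    """Slide the kernel over the image (valid region only, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(kernel[i][j] * image[y + i][x + j]
                 for i in range(kh) for j in range(kw))
             for x in range(out_w)]
            for y in range(out_h)]

def gradient_magnitude(image, operator):
    """Combine the two directional responses into one edge-strength map."""
    kx, ky = KERNELS[operator]
    gx, gy = correlate(image, kx), correlate(image, ky)
    return [[math.hypot(a, b) for a, b in zip(row_x, row_y)]
            for row_x, row_y in zip(gx, gy)]

# 4x4 test image with a vertical step edge between columns 1 and 2.
step = [[0, 0, 10, 10] for _ in range(4)]
edges = {name: gradient_magnitude(step, name) for name in KERNELS}
```

On this step edge the 3×3 Sobel and Prewitt kernels respond in every output position, while the 2×2 Roberts kernels respond only at the pixel pair that straddles the edge, which gives a thinner boundary.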

Convex Area Filtering

After colour segmentation, the image still contained some artefacts that might affect the performance of the next stage.
To set up this filter, the convex area of each cell in the 100 samples was collected and tabulated.
The filtering range was determined by the minimum and maximum values of the convex area.
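
A minimal sketch of such a filter is given below, assuming each segmented object is available as a list of (x, y) pixel coordinates. The convex area is approximated here by the polygon area of the convex hull of those coordinates, which may differ slightly from a pixel-counting definition. The range of 2500 to 9000 is the one reported in the Result section; the function names and the test shapes are hypothetical.

```python
def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def polygon_area(vertices):
    """Shoelace formula for the area of a simple polygon."""
    n = len(vertices)
    twice_area = sum(vertices[i][0] * vertices[(i + 1) % n][1]
                     - vertices[(i + 1) % n][0] * vertices[i][1]
                     for i in range(n))
    return abs(twice_area) / 2

def filter_by_convex_area(objects, low=2500, high=9000):
    """Keep only the objects whose convex area lies within [low, high]."""
    return [obj for obj in objects
            if low <= polygon_area(convex_hull(obj)) <= high]

# Made-up objects: a small speck, a nucleus-sized blob and an oversized clump.
speck = [(0, 0), (10, 0), (10, 10), (0, 10)]
cell = [(0, 0), (60, 0), (60, 60), (0, 60), (30, 30)]
clump = [(0, 0), (100, 0), (100, 100), (0, 100)]
kept = filter_by_convex_area([speck, cell, clump])
```

Only the middle object (convex area 3600) falls inside the range, so the speck and the clump are discarded as artefacts.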

Feature Extraction

Feature extraction is a stage that transforms the large input data into a reduced representation, a set of features. Different types of input result in different types of feature. In this study, geometrical, colour and textural features were used to distinguish CML from CLL. A total of 28 features were extracted from these categories.

Feature Selection

A feature selection algorithm is used to reduce the dimension of the feature space and improve classification performance. The feature selection algorithms applied in this study were the Genetic Algorithm (GA), the ReliefF algorithm (RfF) and the Neighbourhood Components Analysis (NCA) algorithm.

Optimization of GA

Selection Function = Tournament Selection (tournament size 2)
Population Size = 50
Generations = 100
Mutation Rate = 0.1
Crossover Rate = 0.5
Fitness Function = kNN (k-value = 3, distance function = Euclidean)
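
To show how these settings fit together, the sketch below runs a small GA over binary feature masks. Several details are assumptions, because this study does not state them: the fitness is taken to be the leave-one-out accuracy of the kNN, crossover is single-point, the mutation rate is applied per feature, and the best mask seen so far is remembered. The function names and the synthetic data are hypothetical. The defaults follow the listed settings, but the demonstration uses a much smaller population and fewer generations to keep it fast.

```python
import random

def knn_accuracy(X, y, mask, k=3):
    """Leave-one-out accuracy of kNN (Euclidean) using only the masked features."""
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return 0.0
    correct = 0
    for i, xi in enumerate(X):
        dists = sorted((sum((xi[j] - xo[j]) ** 2 for j in cols), y[o])
                       for o, xo in enumerate(X) if o != i)
        votes = [label for _, label in dists[:k]]
        correct += max(set(votes), key=votes.count) == y[i]
    return correct / len(X)

def ga_select(X, y, pop_size=50, generations=100, mutation_rate=0.1,
              crossover_rate=0.5, seed=0):
    """Return (best feature mask, its fitness) found by the GA."""
    rng = random.Random(seed)
    n = len(X[0])
    pop = [[rng.random() < 0.5 for _ in range(n)] for _ in range(pop_size)]
    best, best_fit = None, -1.0
    for _ in range(generations):
        fits = [knn_accuracy(X, y, ind) for ind in pop]
        for ind, fit in zip(pop, fits):
            if fit > best_fit:
                best, best_fit = ind[:], fit

        def tournament():
            a, b = rng.sample(range(pop_size), 2)  # tournament size 2
            return pop[a] if fits[a] >= fits[b] else pop[b]

        children = []
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            if rng.random() < crossover_rate:      # single-point crossover
                cut = rng.randrange(1, n)
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Flip each gene with probability mutation_rate.
            child = [(not g) if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        pop = children
    return best, best_fit

# Synthetic data: feature 0 separates the two classes, the others are noise.
data_rng = random.Random(1)
X = [[c + data_rng.gauss(0, 0.1), data_rng.random(), data_rng.random(), data_rng.random()]
     for c in (0, 1) for _ in range(10)]
y = [c for c in (0, 1) for _ in range(10)]
mask, score = ga_select(X, y, pop_size=10, generations=5)
```

The returned mask marks which of the candidate features are kept, and score is the kNN fitness of that subset.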

Optimization of RfF

Categorical predictors flag = Off
Distance scaling factor = Infinity
Prior probabilities for each class = Empirical
Number of observations for computing weights = All

Optimization of NCA

Method for fitting the model = Exact
Solver type = Limited memory Broyden-Fletcher-Goldfarb-Shanno (LBFGS) algorithm
Maximum number of iterations = 1000
Convergence tolerance on the step size = 0
Width of the kernel = 1
Prior probabilities for each class = Uniform

Selection of Feature Selection algorithm

After the optimal arguments and parameters were set, all 28 features were fed into each algorithm and the results were recorded accordingly.

Step 1: Select a classifier and feed the selected features into the classifier

Step 2: Record the evaluation parameters (training, testing and overall accuracy, to one decimal place) and tabulate them in Appendix B

Step 3: Repeat Steps 1 and 2 another 9 times

Step 4: Calculate the average accuracies for each set of selected features

Step 5: Compare the accuracies and assign marks by ranking (for example, 1 refers to the lowest accuracy while 5 refers to the highest accuracy)

Step 6: Calculate the total score of each feature selection algorithm and select the one with the highest score.
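
Steps 4 to 6 can be sketched as follows. The accuracies in the example are invented purely to show the arithmetic (two runs, three feature sets and two criteria instead of the ten runs and five feature sets used in this study). The function names are hypothetical and ties are not handled.

```python
def average(values):
    """Step 4: mean accuracy over the repeated runs."""
    return sum(values) / len(values)

def rank_marks(avg_by_set):
    """Step 5: mark 1 for the lowest average accuracy, up to n for the highest."""
    ordered = sorted(avg_by_set, key=avg_by_set.get)
    return {name: mark for mark, name in enumerate(ordered, start=1)}

def total_scores(results):
    """Step 6: sum the marks over all criteria.

    results[criterion][feature_set] is a list of accuracies, one per run.
    """
    totals = {}
    for runs_by_set in results.values():
        averages = {name: average(runs) for name, runs in runs_by_set.items()}
        for name, mark in rank_marks(averages).items():
            totals[name] = totals.get(name, 0) + mark
    return totals

# Invented accuracies (%) for three hypothetical feature sets A, B and C.
results = {
    "testing": {"A": [90.0, 92.0], "B": [95.0, 97.0], "C": [93.0, 93.0]},
    "overall": {"A": [91.0, 93.0], "B": [98.0, 98.0], "C": [96.0, 94.0]},
}
totals = total_scores(results)
selected = max(totals, key=totals.get)
```

Here set B ranks highest on both criteria, so it collects 3 + 3 = 6 marks and is selected.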

Classification

To ensure the performance of the classifier in this system, three types of classifier were chosen for evaluation: k-Nearest Neighbour (kNN), Support Vector Machine (SVM) and a Multilayer Perceptron (MLP) network.

Division of Data

The data were divided so that the same data would be used to train and test all three classifiers.
The 548 cell nuclei (291 for CML and 257 for CLL) were divided into two sets, a training set and a testing set, with a stated percentage of 70% : 30%, corresponding to 466 data (246 for CML and 220 for CLL) and 82 data (45 for CML and 37 for CLL), respectively.

Optimization of kNN

Step 1: Run the kNN algorithm with k-value of 1

Step 2: Record the evaluation parameter (testing accuracy with one decimal place)

Step 3: Repeat Steps 1 and 2 with the other k-values: 3, 5, 7, 9, 11, 13, 15

Step 4: Compare the testing accuracy for each k-value

Step 5: Select the k-value with the highest testing accuracy

Optimization of SVM

Step 1: Start the optimization with the first parameter (Box Constraint) at the minimum value in its range

Step 2: Record the evaluation parameter (testing accuracy with one decimal place)

Step 3: Repeat Steps 1 and 2 for each remaining value in the range (for example, 0.01 for Box Constraint)

Step 4: Compare the testing accuracy for each Box Constraint value

Step 5: Select the value with the highest accuracy

Step 6: Repeat Steps 1 to 5 for the next parameter (for example, the polynomial order if the kernel function is Polynomial)

Optimization of MLP

Step 1: Run the algorithm with 1 hidden node

Step 2: Record the evaluation parameters (testing accuracy with one decimal place, and Mean Square Error, MSE, expressed in exponential form with six decimal places)

Step 3: Repeat Steps 1 and 2 for the other numbers of hidden nodes (5, 10, 15, 20, 25, 30)

Step 4: Compare the results and select the number of hidden nodes that gives high testing accuracy and low MSE

*The same optimization steps were used for the Learning Rate (0.1, 0.2, 0.3, 0.4, 0.5) and the Number of Epochs (1, 5, 10, 15, 20)

Evaluation of Classifier

The evaluation parameters were the testing accuracy for both CML and CLL, the difference in testing accuracy between CML and CLL (the accuracies for CML and CLL were calculated from the confusion matrix) and the performance error.
Each classifier ran the relevant algorithm with its optimal settings, and the evaluation parameters were recorded for analysis.
A weightage scoring method was used, in which marks were given by ranking the classifiers on each evaluation parameter (3 marks for the best performance and 1 mark for the worst).
The score of a classifier for each evaluation parameter was obtained by multiplying the mark by the weightage.
The final score of a classifier was determined by adding up its scores.
The classifier with the highest final score was selected and used in this screening system.
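
The scoring can be illustrated with the weightages reported in the Result section (0.5, 0.3 and 0.2). The metric values below are invented solely to show the arithmetic. The function and key names are hypothetical, and the assumption is that a higher testing accuracy is better while a smaller accuracy difference and a smaller performance error are better.

```python
# Weightages reported in the Result section of this study.
WEIGHTS = {"testing_accuracy": 0.5,
           "accuracy_difference": 0.3,
           "performance_error": 0.2}
# Assumed direction of each criterion.
HIGHER_IS_BETTER = {"testing_accuracy": True,
                    "accuracy_difference": False,
                    "performance_error": False}

def marks_for(metric, values):
    """Rank the classifiers on one metric: worst gets 1 mark, best gets 3."""
    ordered = sorted(values, key=values.get,
                     reverse=not HIGHER_IS_BETTER[metric])
    return {name: mark for mark, name in enumerate(ordered, start=1)}

def final_scores(metrics):
    """metrics[metric][classifier] -> weighted final score per classifier."""
    totals = {}
    for metric, values in metrics.items():
        for name, mark in marks_for(metric, values).items():
            totals[name] = totals.get(name, 0.0) + mark * WEIGHTS[metric]
    return {name: round(total, 2) for name, total in totals.items()}

# Invented metric values, chosen only to illustrate the calculation.
metrics = {
    "testing_accuracy":    {"kNN": 95.1, "SVM": 97.6, "MLP": 98.8},
    "accuracy_difference": {"kNN": 4.0,  "SVM": 2.5,  "MLP": 1.0},
    "performance_error":   {"kNN": 0.05, "SVM": 0.03, "MLP": 0.01},
}
scores = final_scores(metrics)
selected = max(scores, key=scores.get)
```

Because the weightages sum to 1, a classifier that ranks best on every criterion scores 3 × (0.5 + 0.3 + 0.2) = 3, and one that ranks worst on every criterion scores 1.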

Result

Image Segmentation

The threshold value for the G component was determined from the WBC nucleus region only.
The Colour Thresholding threshold value for both CML and CLL was 117.
The threshold for the total value of the RGB components was set between 300 and 450.
The Roberts operator was selected because it produced an image with a low noise density.
A convex area range of 2500 to 9000 was applied in Convex Area Filtering.

Feature Selection

RfF Group 1 score 1
RfF Group 2 score 7
RfF Group 3 score 12
GA score 11
NCA score 9

Therefore, the feature set of RfF Group 3 was selected and applied.

Classification

Weightage ranking scoring was applied to evaluate the classifiers.
The weightages for Testing Accuracy, Difference in Testing Accuracy (between CML and CLL) and Performance Error were 0.5, 0.3 and 0.2 respectively.
The final score for each classifier was computed.
kNN scored 1 mark, SVM scored 2 marks and MLP scored 3 marks.

Therefore, MLP was chosen.

Graphical User Interface (GUI) - VISIO
