According to the American Cancer Society, melanoma is the least common but most aggressive form of skin cancer. Melanoma onset can often be detected by analyzing recently appeared skin lesions; the variables generally considered important for detection are asymmetry, border, color, diameter, and evolution over time. Currently, diagnosis relies on physical examinations conducted by medical professionals, and predictive models based solely on images of skin lesions have generally proven less accurate. My goal in this project is to develop a computational model that, given an image of a patient's skin lesion, produces a likelihood of malignant melanoma, and to determine which derived variables correlate most strongly with disease presence. Furthermore, because I compute the variables through my own analysis before feeding them into the algorithm, my model should have a faster run time and require less computing power than existing models for various types of skin cancer.
First, I acquired images from datasets available in the International Skin Imaging Collaboration Archive. I then performed image segmentation to prepare for pixel extraction, using the watershed algorithm to obtain a spectral image that highlights regions of high concentration or contrast. Afterwards, I extracted numerical coefficients for multiple variables. Of the important variables noted above, I focused on color and diameter. Because the software I used has limited spatial recognition of asymmetrical objects such as skin lesions, I captured the effect of diameter with a different novel method: computing the total area, or size, of each lesion. For color, I tested two novel methods: a summation of RGB values and an average of all RGB values. I fed these variables, both individually and in combination, into a Gaussian Naïve Bayes classification algorithm to test their effectiveness as indicators after training.
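The pipeline above, which segments the lesion, extracts a size coefficient plus summed and averaged RGB coefficients, and trains a Gaussian Naïve Bayes classifier, can be sketched as follows. This is a minimal illustration on synthetic images, not the study's actual code: it stands in for the watershed step with a simple intensity threshold (a real pipeline might use `skimage.segmentation.watershed`), and the threshold, image sizes, and lesion shapes are assumptions for demonstration only.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def lesion_features(image, threshold=128):
    """Segment a lesion (assumed darker than surrounding skin) with a
    simple intensity threshold, then compute the three coefficients:
    lesion area, summed RGB, and average RGB."""
    gray = image.mean(axis=2)
    mask = gray < threshold                # lesion pixels are darker
    area = int(mask.sum())                 # size coefficient (pixel count)
    rgb_sum = float(image[mask].sum())     # summation coefficient
    rgb_mean = float(image[mask].mean())   # average coefficient
    return area, rgb_sum, rgb_mean

# Synthetic stand-in data: dark "lesion" patches on lighter backgrounds,
# with class 1 lesions drawn larger than class 0 lesions.
rng = np.random.default_rng(0)
X, y = [], []
for label in (0, 1):
    for _ in range(20):
        img = rng.integers(180, 255, (64, 64, 3)).astype(float)
        side = rng.integers(8, 16) if label == 0 else rng.integers(20, 32)
        img[:side, :side] = rng.integers(20, 100, (side, side, 3))
        X.append(lesion_features(img))
        y.append(label)

# Train the Gaussian Naive Bayes classifier on the extracted coefficients.
clf = GaussianNB().fit(X, y)
```

The coefficients can also be passed individually (a single-column feature matrix) to compare each indicator on its own, as the study does.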
My statistical tests led to the conclusion that the average RGB coefficient was the better color indicator and the highest-performing indicator overall. On its own, it achieved the highest area under the curve (AUC), 0.737, and when combined with the size coefficient it achieved the highest accuracy, 0.833. The summation coefficient had lower accuracy in every test and AUC values near 0.50, indicating difficulty distinguishing true positives from false positives. The size coefficient performed poorly in every scenario except the aforementioned combination with the average RGB value. The strong performance of the average RGB value can be attributed to its success at dampening the impact of outlier pixels, which carry unrealistically high RGB values. Future studies could include testing with more datasets, experimenting with new methods for calculating color and size, and devising novel coefficients for other variables such as asymmetry and border thickness.