Methods
For this project, Google Colab was used to write Python code with the data analysis library pandas. Breast cancer classification data was obtained from the Breast Cancer Wisconsin (Diagnostic) Data Set [3] from 2016. This dataset contains 569 total observations: 357 from benign tumors and 212 from malignant tumors. Each observation has 13 attributes, including radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension.
Classifiers that were evaluated include: Decision Tree with Gini Index, Decision Tree with Entropy, K-Nearest Neighbors with Weighted Voting, K-Nearest Neighbors with Majority Voting, Naive Bayes, and Support Vector Machines.
Before evaluating each classifier, the dataset was split into 5 parts for 5-fold cross-validation. In each iteration, one part was held out as the test set while the remaining 4 parts were used for training. The errors from the 5 folds were then averaged to estimate each algorithm's overall error.
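The cross-validation procedure described above can be sketched with scikit-learn's `KFold`. The random feature matrix and the depth-5 decision tree here are placeholders standing in for the real dataset and whichever classifier is being evaluated:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Placeholder data standing in for the 569-row Wisconsin dataset.
rng = np.random.default_rng(0)
X = rng.random((100, 10))
y = rng.integers(0, 2, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_errors = []
for train_idx, test_idx in kf.split(X):
    # One fold is held out as the test set; the other 4 train the model.
    clf = DecisionTreeClassifier(max_depth=5, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    fold_errors.append(np.mean(clf.predict(X[test_idx]) != y[test_idx]))

# Average the per-fold errors to get the overall error estimate.
overall_error = float(np.mean(fold_errors))
```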
Further methods for each algorithm type are listed below:
Decision Tree with Gini Index [4]
The sklearn class tree.DecisionTreeClassifier was used to generate a model with criterion='gini'
max_depth = 5
Decision Tree with Entropy [4]
The sklearn class tree.DecisionTreeClassifier was used to generate a model with criterion='entropy'
max_depth = 5
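The two decision-tree configurations can be sketched as below. Note that scikit-learn expects the criterion strings in lowercase; sklearn's bundled copy of the same Wisconsin diagnostic dataset is used here as a stand-in for the data loaded in the project:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Same depth limit for both trees; only the split criterion differs.
gini_tree = DecisionTreeClassifier(criterion="gini", max_depth=5, random_state=0)
entropy_tree = DecisionTreeClassifier(criterion="entropy", max_depth=5, random_state=0)

gini_tree.fit(X, y)
entropy_tree.fit(X, y)
```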
K-Nearest Neighbors with Weighted Voting
Multiple K values were used to determine the best hyperparameter (1, 3, 13, 25, 50, 100)
Weighted voting was used in which neighbors that are closer have a greater impact on the classification
Data was scaled to the range 0 to 1 to avoid one attribute dominating the distance calculation
numpy's linalg.norm was used to compute the (Euclidean) distance measure
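A minimal sketch of the weighted-voting variant, assuming inverse-distance weighting and a min-max scaling step (the function names here are illustrative, not from the project's code):

```python
import numpy as np

def minmax_scale(X):
    """Scale each attribute to [0, 1] so no single attribute dominates."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)

def knn_predict_weighted(X_train, y_train, x, k):
    """Weighted-vote KNN: closer neighbors get a larger say (1/distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every point
    nearest = np.argsort(dists)[:k]               # indices of the k closest neighbors
    weights = 1.0 / (dists[nearest] + 1e-12)      # small epsilon avoids divide-by-zero
    votes = np.zeros(2)                           # two classes: benign / malignant
    for idx, w in zip(nearest, weights):
        votes[y_train[idx]] += w
    return int(np.argmax(votes))
```

For example, a query point near two class-0 neighbors is assigned class 0 even when a distant class-1 point is among the k neighbors, because the close neighbors carry far more weight.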
K-Nearest Neighbors with Majority Voting
Multiple K values were used to determine the best hyperparameter (1, 3, 13, 25, 50, 100)
Majority voting was used, in which all neighbors have the same impact on the classification regardless of distance
Data was scaled to the range 0 to 1 to avoid one attribute dominating the distance calculation
numpy's linalg.norm was used to compute the (Euclidean) distance measure
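The K sweep for the majority-voting variant can be sketched with scikit-learn's `KNeighborsClassifier` standing in for a hand-rolled implementation (`weights="uniform"` gives equal votes; `weights="distance"` would give the weighted variant), again using sklearn's bundled copy of the dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)

# 5-fold cross-validated error for each candidate K value.
errors = {}
for k in (1, 3, 13, 25, 50, 100):
    model = make_pipeline(
        MinMaxScaler(),  # scale attributes to [0, 1] before computing distances
        KNeighborsClassifier(n_neighbors=k, weights="uniform"),
    )
    errors[k] = 1.0 - cross_val_score(model, X, y, cv=5).mean()
```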
Naive Bayes
The sklearn class GaussianNB was used to build the model
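A minimal sketch of the Naive Bayes model, again using sklearn's bundled copy of the dataset as a stand-in:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)

# GaussianNB models each attribute as normally distributed within each class.
nb = GaussianNB().fit(X, y)
predictions = nb.predict(X)
```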
Support Vector Machines
The sklearn class svm.SVC was used to build the model
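A minimal sketch of the support vector machine, assuming the default RBF kernel and a standardization step (SVMs are sensitive to feature scale; the source does not specify the kernel or scaling used):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Standardize features, then fit an SVM with the default RBF kernel.
svm = make_pipeline(StandardScaler(), SVC()).fit(X, y)
```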
Evaluation:
The error rate for each classifier was calculated
ROC curves were plotted for the decision trees, Naive Bayes, and Support Vector Machines
For K-nearest neighbors, plots of the error rate as a function of the K value were displayed
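The ROC computation for the curve-based evaluation can be sketched as follows, using Naive Bayes as a representative classifier and a simple hold-out split (the source does not specify how scores were obtained for the curves):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Predicted probability of the positive class drives the ROC curve.
scores = GaussianNB().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, _ = roc_curve(y_te, scores)
roc_auc = auc(fpr, tpr)
# matplotlib's plt.plot(fpr, tpr) would then draw the curve itself.
```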