Methods

For this project, Google Colab was used to write Python code with the Pandas data analysis library. Breast cancer classification data was obtained from the Breast Cancer Wisconsin (Diagnostic) Data Set [3] from 2016. This dataset contains 569 total observations: 357 from benign tumors and 212 from malignant tumors. Each observation has ten real-valued attributes: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension.
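
The same dataset is bundled with scikit-learn, so a quick sketch (not the authors' exact code) can verify the class balance reported above; note that sklearn encodes malignant as 0 and benign as 1.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load the WDBC data as a pandas DataFrame (569 rows, 30 numeric columns
# plus a "target" column; sklearn encodes 0 = malignant, 1 = benign)
data = load_breast_cancer(as_frame=True)
df = data.frame

counts = df["target"].value_counts()
print(len(df), counts[1], counts[0])  # → 569 357 212
```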

The classifiers evaluated were: Decision Tree with Gini Index, Decision Tree with Entropy, K-Nearest Neighbors with Weighted Voting, K-Nearest Neighbors with Majority Voting, Naive Bayes, and Support Vector Machines.

Before using each classifier, the dataset was split into 5 parts for 5-fold cross-validation. In each iteration, one part was held out as test data while the remaining 4 parts were used for training. The per-fold errors were then averaged to estimate overall algorithm performance.
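
A minimal sketch of this procedure, using sklearn's KFold to generate the 5 splits (the tree classifier here is just a placeholder for any of the models below):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

errors = []
for train_idx, test_idx in kf.split(X):
    # One fold held out for testing, the other four used for training
    clf = DecisionTreeClassifier(max_depth=5)
    clf.fit(X[train_idx], y[train_idx])
    errors.append(1 - clf.score(X[test_idx], y[test_idx]))

print(np.mean(errors))  # average error across the 5 folds
```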

Further methods for each algorithm type are listed below:

Decision Tree with Gini Index [4]

  • The sklearn tree module was used to generate a model with criterion='gini' (sklearn expects the lowercase criterion name)

  • max_depth = 5

Decision Tree with Entropy [4]

  • The sklearn tree module was used to generate a model with criterion='entropy' (sklearn expects the lowercase criterion name)

  • max_depth = 5
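
The two tree variants differ only in the split criterion; a sketch of both, using a single held-out split rather than the full 5-fold procedure for brevity:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for criterion in ("gini", "entropy"):
    # max_depth=5 matches the setting used in this project
    tree = DecisionTreeClassifier(criterion=criterion, max_depth=5)
    tree.fit(X_tr, y_tr)
    print(criterion, 1 - tree.score(X_te, y_te))  # held-out error rate
```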

K-Nearest Neighbors with Weighted Voting

  • Multiple K values (1, 3, 13, 25, 50, 100) were tried to select the best hyperparameter

  • Weighted voting was used, in which closer neighbors have a greater impact on the classification

  • Data was scaled to the range 0 to 1 so that no single attribute dominated the distance calculation

  • numpy.linalg.norm was used as the (Euclidean) distance measure
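
A from-scratch sketch of this variant (not the authors' exact code): attributes are min-max scaled to 0 to 1, distances come from numpy.linalg.norm, and each neighbor's vote is weighted by the inverse of its distance.

```python
import numpy as np

def minmax_scale(X):
    # Rescale each attribute to the range 0-1
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)

def knn_weighted(X_train, y_train, x, k):
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]               # indices of the k neighbors
    weights = 1.0 / (dists[nearest] + 1e-9)       # closer -> larger weight
    votes = {}
    for label, w in zip(y_train[nearest], weights):
        votes[label] = votes.get(label, 0.0) + w
    return max(votes, key=votes.get)              # class with most vote weight
```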

K-Nearest Neighbors with Majority Voting

  • Multiple K values (1, 3, 13, 25, 50, 100) were tried to select the best hyperparameter

  • Majority voting was used, in which all neighbors have the same impact on the classification regardless of distance

  • Data was scaled to the range 0 to 1 so that no single attribute dominated the distance calculation

  • numpy.linalg.norm was used as the (Euclidean) distance measure
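
The majority-voting variant differs from the weighted one only in the vote: every one of the k neighbors counts equally. A minimal sketch:

```python
import numpy as np
from collections import Counter

def knn_majority(X_train, y_train, x, k):
    dists = np.linalg.norm(X_train - x, axis=1)  # same distance measure
    nearest = np.argsort(dists)[:k]
    # Each of the k neighbors gets exactly one vote
    return Counter(y_train[nearest]).most_common(1)[0][0]
```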

Naive Bayes

  • The sklearn class GaussianNB was used to build the model
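
A minimal sketch of this model on a single held-out split (the project itself used 5-fold cross-validation):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# GaussianNB assumes each attribute is normally distributed within a class
nb = GaussianNB().fit(X_tr, y_tr)
print(1 - nb.score(X_te, y_te))  # error rate on the held-out data
```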

Support Vector Machines

  • The sklearn class SVC (from the svm module) was used
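
A sketch of the SVM model; scaling the inputs first (consistent with the 0 to 1 scaling used for KNN above) is standard practice for SVMs, though the source does not state whether it was done here:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Min-max scale the attributes, then fit a default (RBF-kernel) SVC
svm = make_pipeline(MinMaxScaler(), SVC())
svm.fit(X_tr, y_tr)
print(1 - svm.score(X_te, y_te))  # error rate on the held-out data
```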

Evaluation:

  • The error rate for each classifier was calculated

  • ROC curves were plotted for the decision trees, Naive Bayes, and Support Vector Machines

  • For K-Nearest Neighbors, the error rate was plotted as a function of the K value
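
A sketch of the ROC evaluation (shown here with Naive Bayes as the example model): predicted class probabilities feed sklearn's roc_curve, and auc summarizes the curve as a single number.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Probability of the positive (benign) class for each test observation
scores = GaussianNB().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

fpr, tpr, _ = roc_curve(y_te, scores)  # points on the ROC curve
print(auc(fpr, tpr))                   # area under the ROC curve
```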