Accuracy Results

Accuracy Results of Stacking ensemble - Training Data set

The table below summarizes the training data set performance of all the base learners that were evaluated for use in our stacking meta-learner setup.

All the models evaluated had a prediction accuracy around the 80% mark. The neural network classifier has a slightly lower accuracy of 78%, but it stands apart from the other classifiers in its ratio of false positives (FP) to false negatives (FN). Its FP rate (classifying humans as bots) is considerably low, while its FN rate (classifying bots as humans) is higher than that of the other models. This behavior of the neural network is consistent across data sets. The best-performing base classifier on the training data set was the AdaBoost classifier.
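The FP/FN ratios above can be read directly off a confusion matrix. The sketch below assumes the convention bots = positive class, so an FP is a human classified as a bot and an FN is a bot classified as a human; the labels are illustrative, not our actual predictions.

```python
# Reading FP/FN counts from a confusion matrix (0 = human, 1 = bot).
# The label vectors here are illustrative placeholders.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]

# For binary labels, sklearn returns [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(fp, fn)  # 1 false positive (human -> bot), 1 false negative (bot -> human)
```

A low FP count with a higher FN count, as with the neural network, means the model errs on the side of calling accounts human.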

Prediction Accuracy score of the Stacking Meta-Learner on TRAIN data set - 83%
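A minimal sketch of the stacking setup follows. The specific base learners, their hyper-parameters, and the synthetic data are assumptions standing in for our actual feature matrix and final configuration, not the exact implementation.

```python
# Sketch of a stacking meta-learner: base learners' out-of-fold predictions
# feed a final logistic-regression meta-learner. Data is a placeholder.
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Placeholder data standing in for the bot-vs-human feature matrix.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

base_learners = [
    ("logit_cv", LogisticRegressionCV(max_iter=1000)),
    ("rf", RandomForestClassifier(random_state=0)),
    ("ada", AdaBoostClassifier(random_state=0)),
    ("nn", MLPClassifier(max_iter=1000, random_state=0)),
]
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(max_iter=1000))
stack.fit(X_train, y_train)
val_acc = stack.score(X_val, y_val)
```

Because the meta-learner is trained on cross-validated predictions of the base models, it can weight each base learner differently on different data sets.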

Prediction accuracy comparison of various classification models - TRAIN data set

Bar plot shows comparison based on MEAN CROSS VALIDATION SCORES

Accuracy Results of Topic Modeling based classification - Training Data set

The topic-modeling-based classification approach performed extremely well on the training data set, with a prediction accuracy score of 99%, which led us to suspect over-fitting. This was confirmed by a mean accuracy of only 70% from a 10-fold cross-validation test. We believe the cause is the limited size of the training data set; a larger training corpus covering more diverse topics should help us train the model better and avoid over-fitting.

Accuracy Results of Stacking ensemble - Validation Data set

The table below summarizes the validation data set performance of all the base learners that were evaluated for use in our stacking meta-learner setup.

The performance of all the models is comparable to their performance on the training data set. This gives us confidence that these models will also generalize well to the test data set. Note also that the best-performing model on the validation data set, based on CV accuracy, is Logistic CV, which was not the case with the training data set. This suggests that adopting a stacking-based technique will compensate for variations in individual classifier performance across data sets.

Prediction Accuracy score of the Stacking Meta-Learner on VALIDATION data set - 82%

Prediction accuracy comparison of various classification models - VALIDATION data set

Bar plot shows comparison based on MEAN CROSS VALIDATION SCORES

Accuracy Results of Topic Modeling based classification - Validation Data set

A prediction accuracy score of 70% was achieved with the validation data set. This was expected from the observations during the 10-fold cross validation testing on the training data set.

Accuracy Results of Stacking ensemble - TEST Data set

The table below summarizes the test data set performance of all the base learners that were evaluated for use in our stacking meta-learner setup. The bar plots for the test data set use prediction accuracy scores instead of the mean CV scores used for the training and validation data sets.
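Test-set scoring differs from the train/validation comparisons in that it is a single prediction-accuracy score on held-out data rather than a mean CV score. A minimal sketch, with a placeholder model and data in place of our actual classifiers:

```python
# Single prediction-accuracy score on a held-out test split,
# as used for the test-set bar plot. Data is a placeholder.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

rf = RandomForestClassifier(random_state=1).fit(X_train, y_train)
test_acc = accuracy_score(y_test, rf.predict(X_test))
print(round(test_acc, 2))
```

Since the test split is never touched during training or model selection, this single score is an unbiased estimate of generalization performance.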

Although QDA was not included in the final set of base learners, we have added it to the summary list to indicate its performance on the test data set. The Random Forest classifier performed the best among all the base learners evaluated. The test prediction accuracy score of the meta-learner is 81%, which is comparable with the training data set score and indicates that the model has generalized well across data sets.

Prediction Accuracy score of the Stacking Meta-Learner on TEST data set - 81%

Prediction accuracy comparison of various classification models - TEST data set

Bar plot shows comparison based on PREDICTION ACCURACY SCORES

Accuracy Results of Topic Modeling based classification - TEST Data set

A prediction accuracy score of 85% was achieved with the independent TEST data set. This is comparable with the performance achieved by the stacking-ensemble classification approach.