Approach and Models

Approach

The primary principle behind our modeling approach was to avoid relying on a single type of classification model. Instead, we used a stacking-based ensembling technique that combines the predictions of multiple base learners, drawn from several categories of models, through a meta-learner. This helps reduce variance and improves predictive capability by avoiding over-reliance on any one model. We included varied types of models so that the final predictions generalize well and do not overfit the training data, since the volume of training data was comparatively small.

The meta-learner used in our approach is a custom learner that outputs the majority prediction of the base learners. This allows for greater generalization across the base predictors and compensates for weaker learners. The meta-learner can be replaced with a different classification model by training that model on the predictions of the base learners.
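The majority-vote behaviour of the meta-learner can be sketched as follows. This is a minimal illustration rather than the implementation used in the project: the function name majority_vote, the example predictions, and the tie-breaking rule (an exact tie goes to class 0, Human) are all assumptions.

```python
import numpy as np

def majority_vote(base_predictions):
    """Combine binary (0/1) predictions from several base learners.

    base_predictions: array-like of shape (n_learners, n_samples).
    Returns an array of shape (n_samples,) holding the majority class.
    Assumption: an exact tie is resolved in favour of class 0 (Human).
    """
    preds = np.asarray(base_predictions)
    votes_for_bot = preds.sum(axis=0)  # number of learners voting 1 per sample
    return (votes_for_bot * 2 > preds.shape[0]).astype(int)

# Three hypothetical base learners predicting for four users
example = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
]
print(majority_vote(example))  # [0 1 0 0]
```

With an odd number of base learners a tie cannot occur, so the tie-breaking rule only matters when the learner count is even.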

All the classification models used in the stacking ensemble performed sentiment analysis on the tweet messages and computed polarity and subjectivity scores for use in the classification process. We also wanted to explore topic-based modeling and use it to classify between the user classes. We therefore trained a separate classification model, outside the base learners used in the stacking ensemble, that classifies users based on the topics discussed in their tweets. The hashtags used within the individual messages were used to arrive at the topics.

Stacking based Ensemble Technique

Models

The following models were trained, evaluated, and tuned for consideration as part of the final stacked model.

  • Trivial classifier with manual prediction (Baseline Model)
  • Logistic Regression CV Classifier
  • Principal Component Analysis (PCA) through Logistic regression to evaluate predictor significance
  • Linear Discriminant Analysis (LDA) Classifier
  • Quadratic Discriminant Analysis (QDA) Classifier
  • K Nearest Neighbor (kNN) Classifier
  • Gradient Boosting Classifier
  • Decision Tree Classifier
  • Random Forest Ensemble Classifier
  • Adaptive Boosting Ensemble Classifier
  • Support Vector Machine using SVC
  • Sequential Neural Network
  • Topic Based Modeling using LinearSVC classifier

Each model was fit on the scaled training data set. Accuracy scores and 10-fold cross-validation scores were then generated to assess the model's performance on the training data set.

Trivial Classifier with manual prediction (Baseline Model)

A trivial classifier that predicts the Human (0) class for any input was implemented as the baseline model, to measure the effectiveness of a model that classifies every user as human. This resulted in a prediction accuracy of ~67% on both the training and validation data sets. This was taken as the baseline accuracy against which all the other models were compared.
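As a rough illustration of this baseline, the sketch below always predicts class 0 and scores the result against a made-up label vector. The labels and the helper name trivial_predict are hypothetical and are not the project data or code.

```python
import numpy as np

def trivial_predict(X):
    """Predict Human (0) for every row of X, ignoring the features."""
    return np.zeros(len(X), dtype=int)

# Hypothetical labels: 4 humans (0) and 2 bots (1)
y = np.array([0, 0, 1, 0, 1, 0])
X = np.zeros((len(y), 3))  # feature values are irrelevant to this classifier

baseline_accuracy = float(np.mean(trivial_predict(X) == y))
print(round(baseline_accuracy, 2))  # 0.67
```

The accuracy of such a classifier equals the share of the majority class, so a baseline of ~67% implies that roughly two thirds of the labelled users are human.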

Logistic Regression CV Classifier

The first classification model chosen after the baseline model was a logistic regression classifier with cross-validation. The logistic regression model was instantiated with L2 (ridge) regularization to avoid overfitting.
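A minimal sketch of this kind of setup with scikit-learn is shown below. The data is a synthetic stand-in with 19 features, and the split size, the Cs grid, and the inner fold count are assumptions chosen for illustration, not the values used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the user features (assumption, not project data)
X, y = make_classification(n_samples=400, n_features=19, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scale using statistics from the training split only
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)

# L2 is the default penalty; Cs=10 tries ten regularization strengths
logreg = LogisticRegressionCV(Cs=10, cv=5, max_iter=1000)
logreg.fit(X_train_s, y_train)

train_acc = logreg.score(X_train_s, y_train)
val_acc = logreg.score(X_val_s, y_val)
cv_scores = cross_val_score(logreg, X_train_s, y_train, cv=10)  # 10-fold CV
print(train_acc, val_acc, cv_scores.mean())
```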

Training prediction accuracy was ~81% and validation prediction accuracy was ~83%. The following plot shows the model's performance on the training data set and its 10-fold cross-validation performance.

Principal Component Analysis (PCA) Dimensionality Reduction

Although a PCA-based classification model with reduced dimensions was not going to be part of our stacking group, we wanted to evaluate how much of the variance in the data is explained as the number of feature dimensions is reduced, and at what number of components at least 90% of the variance is explained. Based on the analysis at each dimension, the plot below shows that with 4 components close to 100% of the variance is explained.

Plot showing the dimension and the variance ratio achieved with each dimension

A logistic regression model was fit on the PCA data set generated with 4 principal components. Although the PCA analysis showed that more than 95% of the variance was explained with 4 components, the prediction accuracy of this model was lower than that of the logistic regression model.
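The explained-variance analysis can be sketched as below. The data is synthetic, so the component count it reports will differ from the 4 components found on the project data. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled feature matrix (assumption)
X, _ = make_classification(n_samples=400, n_features=19, random_state=0)
X_s = StandardScaler().fit_transform(X)

pca = PCA().fit(X_s)  # keep all components to inspect the full spectrum
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 90%
n_components_90 = int(np.argmax(cumulative >= 0.90)) + 1
print(n_components_90, cumulative[n_components_90 - 1])
```

Plotting cumulative against the component index gives the kind of dimension versus variance-ratio curve described in this section.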

Training prediction accuracy was ~80% and validation prediction accuracy was ~82%. The following plot shows the model's performance on the training data set and its 10-fold cross-validation performance.

Linear Discriminant Analysis (LDA) Classifier

An LDA classifier with the 'svd' solver was used, and no dimensionality reduction was done. Since the LDA implementation is a closed-form solution, there was little scope for tuning hyperparameters. The same was true of the QDA classifier.

Training prediction accuracy was ~81% and validation prediction accuracy was ~82%. The following plot shows the model's performance on the training data set and its 10-fold cross-validation performance.

Quadratic Discriminant Analysis (QDA) Classifier

Training prediction accuracy was ~77% and validation prediction accuracy was ~80%. The following plot shows the model's performance on the training data set and its 10-fold cross-validation performance.

K Nearest Neighbor (kNN) Classifier

To determine the optimal k, multiple kNN classifier models with increasing k values were fit on the training data, and the prediction accuracy scores and the mean of the 5-fold CV scores were plotted for each k. The k value (13) at which the classifier performed best was chosen for further evaluation with the validation data set.
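A sketch of this k selection loop is given below, assuming synthetic data and an illustrative candidate range of odd k values. The project's actual range of k is not stated, so the range here is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled training data (assumption)
X, y = make_classification(n_samples=300, n_features=19, random_state=0)
X_s = StandardScaler().fit_transform(X)

k_values = range(1, 26, 2)  # odd k from 1 to 25; the range is an assumption
mean_cv = {}
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    mean_cv[k] = cross_val_score(knn, X_s, y, cv=5).mean()  # mean 5-fold score

best_k = max(mean_cv, key=mean_cv.get)  # k with the highest mean CV score
print(best_k, mean_cv[best_k])
```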

Training prediction accuracy was ~82% and validation prediction accuracy was ~82%. The following plot shows the model's performance on the training data set and its 10-fold cross-validation performance.

Gradient Boosting Classifier

An ensemble technique based on gradient boosting for classification was evaluated. An early stopping approach was adopted when determining the optimal number of estimators for the classifier. The maximum number of iterations to proceed without any accuracy improvement (tolerance of 0.001) was set to 5, which enables early stopping when there is no significant improvement in classification accuracy from one iteration to the next. A learning rate of 0.01 was used, and a validation fraction of 0.2 was set aside to perform in-sample validation.

To determine the optimal number of estimators, a cross-validation-based approach was taken, with the hyperparameters other than the estimator count set to the values above. The estimator count was increased in increments of 32 (2^5) so that performance could be validated as the estimator count grows. Based on the analysis, an optimal estimator count of 192 was used for further evaluation of the classifier on the validation data set.
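The configuration described here maps onto scikit-learn's GradientBoostingClassifier roughly as sketched below. The learning rate, tolerance, patience, and validation fraction are taken from the text. The synthetic data, the 5-fold setting, the upper limit of the sweep, and the helper name make_gbc are assumptions. Note that scikit-learn's built-in early stopping monitors the loss on the held-out validation fraction rather than accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data (assumption)
X, y = make_classification(n_samples=300, n_features=19, random_state=0)

def make_gbc(n_estimators):
    """Build a classifier with the early-stopping settings from the text."""
    return GradientBoostingClassifier(
        n_estimators=n_estimators,
        learning_rate=0.01,
        validation_fraction=0.2,  # held out internally for early stopping
        n_iter_no_change=5,       # stop after 5 rounds without improvement
        tol=0.001,
        random_state=0,
    )

candidate_counts = range(32, 193, 32)  # 32, 64, ..., 192
mean_cv = {
    n: cross_val_score(make_gbc(n), X, y, cv=5).mean()
    for n in candidate_counts
}
best_n = max(mean_cv, key=mean_cv.get)

gbc = make_gbc(best_n).fit(X, y)
# n_estimators_ is the number of trees actually fit after early stopping
print(best_n, gbc.n_estimators_)
```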

Training prediction accuracy was ~84% and validation prediction accuracy was ~80%. The following plot shows the model's performance on the training data set and its 10-fold cross-validation performance.

Decision Tree Classifier

The prediction accuracy of a decision tree classifier on the training set increases with depth, but at the cost of severe overfitting. The optimal depth was therefore selected based on the mean cross-validation score at each depth. Based on that analysis, a maximum tree depth of 3 was chosen. A decision tree with this depth was then further evaluated with the validation data set.
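The trade-off between training accuracy and cross-validated accuracy can be seen with a short sweep over depth. This is a sketch on synthetic, deliberately noisy data. The depth range of 1 to 10 and the 5-fold setting are assumptions, and the depth it selects need not match the depth of 3 chosen for the project data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise so that deep trees overfit (assumption)
X, y = make_classification(
    n_samples=300, n_features=19, flip_y=0.1, random_state=0
)

results = {}
for depth in range(1, 11):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    cv_mean = cross_val_score(tree, X, y, cv=5).mean()
    train_acc = tree.fit(X, y).score(X, y)  # accuracy on the data it was fit on
    results[depth] = (train_acc, cv_mean)

# Pick the depth with the best mean cross-validation score
best_depth = max(results, key=lambda d: results[d][1])
for depth, (train_acc, cv_mean) in results.items():
    print(depth, round(train_acc, 3), round(cv_mean, 3))
```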

Training prediction accuracy was ~82% and validation prediction accuracy was ~79%. The following plot shows the model's performance on the training data set and its 10-fold cross-validation performance.

Random Forest Classifier

Random Forest was the second ensemble technique that we evaluated for our data set. To determine the optimal number of estimators, we followed an approach similar to the one used to determine the optimal depth of the decision tree classifier. Instead of incrementing the number of estimators by 1, however, we increased it in increments of 32 (2^5) so that classifier performance could be analyzed over a wide range of estimator counts. Based on the analysis, an estimator count of 416 was chosen for the Random Forest classifier for further evaluation.

Training prediction accuracy was ~83% and validation prediction accuracy was ~81%. The following plot shows the model's performance on the training data set and its 10-fold cross-validation performance.

Adaptive Boosting Classifier

Adaptive Boosting (AdaBoost) was the third ensemble technique that we evaluated for our data set. To determine the optimal number of estimators, we followed an approach similar to the one used for Random Forest. The number of estimators was gradually increased in increments of 32 (2^5) to assess the prediction accuracy and the cross-validation scores as the estimator count grows. We observed that the prediction accuracy increased with the number of estimators, but at the cost of added complexity and overfitting. The mean cross-validation scores also increased with the number of estimators, but only gradually beyond a certain point. Balancing model complexity, the risk of overfitting the training data set, and classification performance, an estimator count of 736 was chosen for the AdaBoost classifier. A learning rate of 0.001 was chosen based on trial runs with varying rates.

Training prediction accuracy was ~83% and validation prediction accuracy was ~80%. The following plot shows the model's performance on the training data set and its 10-fold cross-validation performance.

Support Vector Machine through SVC

We also evaluated SVM models for our classification, using the SVC (Support Vector Classification) implementation. We were not sure which kernel type to use from choices such as 'rbf' and 'linear', nor which penalty factor C or gamma setting was optimal. So, we ran a randomized hyperparameter search using RandomizedSearchCV in scikit-learn to determine the optimal hyperparameters such as the kernel, the gamma setting, and the penalty factor C. The best estimator from the randomized search was used for evaluation with our training and validation data sets.
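A randomized search of this kind might look like the sketch below. The search space (two kernels, a log-uniform range for C, and the two built-in gamma settings), the number of sampled candidates, and the synthetic data are assumptions, since the project's actual search space is not listed.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the scaled training data (assumption)
X, y = make_classification(n_samples=200, n_features=19, random_state=0)

# Hypothetical search space; lists are sampled uniformly, C log-uniformly
param_distributions = {
    "kernel": ["rbf", "linear"],
    "C": loguniform(1e-2, 1e2),
    "gamma": ["scale", "auto"],
}

search = RandomizedSearchCV(
    SVC(), param_distributions, n_iter=10, cv=5, random_state=0
)
search.fit(X, y)

best_svc = search.best_estimator_  # refit on all of X, y with the best params
print(search.best_params_, search.best_score_)
```

Unlike an exhaustive grid search, RandomizedSearchCV samples only n_iter parameter combinations, which keeps the cost fixed regardless of how large the search space is.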

The best estimator obtained from the randomized search achieved a training prediction accuracy of ~83% and a validation prediction accuracy of ~82%. The following plot shows the model's performance on the training data set and its 10-fold cross-validation performance.

Neural Network

A sequential neural network implemented using Keras was used to perform 2-class classification. The architecture had an input layer with 19 nodes, one for each of the individual features, three hidden layers with 50 nodes each, and an output layer with a single node for 2-class classification. We used dropout regularization for the input layer and L2 regularization for all the hidden layers. The ReLU activation function was used in the input and hidden layers, and the sigmoid activation function was used in the output layer, as it was a 2-class problem.

We used the Stochastic Gradient Descent (SGD) optimizer with the binary cross-entropy loss function. The model was trained on the training data set for 2000 epochs using a batch size of 128. A validation split of 0.5 was used. The plot below shows the performance of the model and the reduction of the model's loss over the training epochs.

Training prediction accuracy was ~78% and Validation prediction accuracy was ~81%.

Model Accuracy and Loss statistics during the training process

Ensemble using a Stacking meta-learner

Across all the categories of classifiers evaluated so far, no specific classifier type stands out, and all the classifiers have a prediction accuracy of around 80%. All the classifiers also performed better than the trivial baseline model, which was set as the minimum performance measure.

Since there was no stand-out classifier and performance was comparable in many cases, we went forward with a stacking approach, using a custom meta-learner that acts on the predictions of the classifier models from the previous stage and generalizes the prediction based on the majority achieved across all the classification models. This gives us a chance to achieve higher prediction accuracy and to generalize well to varying test data sets.

We decided not to include the trivial model used for baselining; the PCA-based logistic regression classifier, since it was evaluated primarily to explain feature variance; or the QDA classifier, as its approach is similar to that of the LDA model when compared with the other categories of classifiers. The final set of base learners used in the stacking process was: Logistic Regression CV, LDA, kNN with k=13, Decision Tree with max depth=3, Gradient Boosting classifier with 192 estimators, Random Forest classifier with 416 estimators, AdaBoost classifier with 736 estimators, Support Vector Machine using SVC, and the Neural Network classifier. The ensemble illustration used to explain the stacking technique in the modeling approach section captures the classifiers used.

The final stacking prediction accuracy on the training data set was 83%, with a comparable validation accuracy of 82%. This gives us confidence that the model will generalize well to the test data set.

Topic Modeling based Classification

For topic modeling, we extracted the topics discussed in each user's tweet messages based on the hashtags used in those messages. These were further processed using natural language processing libraries such as spaCy and NLTK to cleanse, process, and vectorize the tokenized topics before feeding them into the classification model. The classification model used was LinearSVC. The predictors comprised the modeled topics extracted from the user tweet messages, and the response variable was the pre-classified bot or human class.
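The hashtag-to-classifier pipeline can be sketched in a few lines. This is a simplified illustration, not the project's pipeline: the spaCy and NLTK cleansing steps are omitted, TF-IDF is assumed as the vectorizer, the helper name hashtags_as_text is hypothetical, and the tweets and labels are made up.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def hashtags_as_text(tweets):
    """Join the lower-cased hashtags found in a user's tweets into one string."""
    tags = []
    for tweet in tweets:
        tags.extend(tag.lower() for tag in re.findall(r"#(\w+)", tweet))
    return " ".join(tags)

# Made-up users: (tweets, label) with 0 = human, 1 = bot
users = [
    (["Great game tonight #NBA #Lakers", "Coffee first #mondays"], 0),
    (["Win a free phone #giveaway #free", "Click now #free #promo"], 1),
    (["Hiking all weekend #nature #mondays"], 0),
    (["Limited offer #promo #giveaway"], 1),
]
docs = [hashtags_as_text(tweets) for tweets, _ in users]
labels = [label for _, label in users]

# Vectorize the hashtag "documents" and fit a linear SVM on them
topic_model = make_pipeline(TfidfVectorizer(), LinearSVC())
topic_model.fit(docs, labels)

new_user = hashtags_as_text(["Huge #giveaway today #promo"])
prediction = topic_model.predict([new_user])
print(new_user, prediction)
```

Because each distinct hashtag becomes its own feature, the feature space can easily grow larger than the number of users, which makes it easy for a linear model to fit the training set almost perfectly.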

Training prediction accuracy was ~99% and validation prediction accuracy was ~71%. The large gap between the two indicates that this model overfits the training data. The following plot shows the model's performance on the training data set and its 10-fold cross-validation performance.