Conclusion

We set out to detect Twitter bot users based on their profiles and tweeting patterns. To achieve this, we took a supervised classification approach that draws on the information in the Twitter user profile and the user's historical tweets.

Tweets generated by a user, or tweets on a given topic, can be extracted through the Twitter developer API. A key challenge at the start, however, was identifying the users from whom the training and test data sets would be formed, and deciding how to label them as human or bot. The literature and references from Botometer helped us kick-start the data collection process. The bot repository hosted as part of the Botometer project had multiple pre-classified data sets that were reasonably recent and sufficient in volume to train the models. Once we got past this step, the next challenge was to determine how many historical tweets to collect per user for analysis. This decision was largely driven by the rate limits imposed on Twitter developer API subscription accounts.

Once we had decided on the training data set, the subsequent phases of analyzing, understanding and exploring the data to find characteristic differences between the human and bot user classes were an interesting journey. We uncovered many valuable attributes and metadata in the Twitter API feeds that helped us with effective feature extraction and engineering. As part of feature engineering, we performed sentiment analysis on the tweet messages to capture differences between the patterns used by human and bot users. We also investigated classification based on topic modeling, and for that purpose we used the hashtags in the tweet messages to build our topic modeling classifier.

The next step was to decide on the classification models, and any natural language processing based models, that we could use to achieve our objective. Ours was a two-class classification problem, and there were many options to evaluate. We decided to evaluate all categories of classification models: logistic regression, nearest neighbors, decision trees, neural networks and ensemble methods. As we trained each of the models, we observed that they achieved similar accuracy scores. We adopted proper measures throughout, including cross-validation and grid search based hyperparameter tuning and model selection. However, since we were not sure which of these models would generalize well to external test data sets, we decided to adopt a stacking-based ensemble approach that acts on the predictions from all the base learners we had tuned and evaluated. We expected this to give our model the best chance of generalizing well to external test data sets.
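
The setup described above can be sketched roughly as follows. This is a minimal illustration and not our actual pipeline: it assumes scikit-learn is available, uses synthetic data from make_classification in place of the engineered user features, and the parameter grids, the choice of base learners and the helper name tune are placeholders chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic two-class data stands in for the per-user features.
X, y = make_classification(
    n_samples=400, n_features=12, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)


def tune(estimator, param_grid):
    """Grid-search one base learner with 5-fold cross-validation (hypothetical helper)."""
    search = GridSearchCV(estimator, param_grid, cv=5, scoring="accuracy")
    search.fit(X_train, y_train)
    return search.best_estimator_


# Tuned base learners; the grids are illustrative, not the ones from our experiments.
base_learners = [
    ("logreg", tune(LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]})),
    ("knn", tune(KNeighborsClassifier(), {"n_neighbors": [3, 5, 9]})),
    ("tree", tune(DecisionTreeClassifier(random_state=0), {"max_depth": [3, 5, None]})),
    ("forest", tune(RandomForestClassifier(random_state=0), {"n_estimators": [50, 100]})),
]

# Single-layer stack: the meta learner is fit on cross-validated
# predictions of the base learners, which are then refit on all training data.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)

stack_accuracy = stack.score(X_test, y_test)
print(f"held-out accuracy: {stack_accuracy:.3f}")
```

Training the meta learner on cross-validated predictions, rather than on predictions the base learners make for data they were fit on, is what keeps the stack from simply rewarding the base learner that overfits most.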

Alongside the stacking-based classification approach, which already incorporated sentiment analysis as engineered features, we also wanted to evaluate a topic modeling based approach to classifying the users. To do this, we used the hashtags from the tweet messages to build our topic model and performed classification based on it. The topic modeling based classification approach performed comparatively well, but we feel that with more training data and tuning we could further improve the performance of this model.
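
As an illustration of the hashtag preparation step, the sketch below collects the hashtags of each user's tweets into one token list per user, which is the kind of input a topic model would then consume. It is a simplified example under stated assumptions: the tweets are hypothetical, the pattern #(\w+) is only an approximation of Twitter's hashtag rules, and the names hashtag_document and sample_tweets are invented for the example.

```python
import re
from collections import Counter

# Simplified hashtag pattern (assumption): '#' followed by word characters.
HASHTAG_RE = re.compile(r"#(\w+)")


def hashtag_document(tweets):
    """Return the lowercased hashtags from one user's tweets, in order of appearance."""
    tags = []
    for text in tweets:
        tags.extend(tag.lower() for tag in HASHTAG_RE.findall(text))
    return tags


# Hypothetical tweets, grouped by user.
sample_tweets = {
    "user_a": ["Loving the #Sunset tonight", "New post on #python and #DataScience"],
    "user_b": ["Win a free phone! #giveaway #free", "#giveaway ends soon #Free"],
}

# One hashtag "document" per user, plus per-user hashtag frequencies.
documents = {user: hashtag_document(tweets) for user, tweets in sample_tweets.items()}
tag_counts = {user: Counter(tags) for user, tags in documents.items()}

for user, counts in tag_counts.items():
    print(user, counts.most_common(3))
```

Lowercasing merges variants such as #Free and #free, so repeated use of the same hashtag by one account shows up as a higher count in that user's document.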

Based on the analysis, evaluation and model training described above, we are confident that our models will generalize well to external data sets and classify human versus bot users with high accuracy.

Future Work

Explore further the possibility of using natural language processing to perform time series analysis of the tweet messages, identify patterns, and detect user types based on those patterns.
Explore and evaluate Recurrent Neural Networks (RNNs) for natural language processing of the tweet messages, covering both pattern and topic analysis.
Our current implementation of stacking has a single layer, with a set of base learners feeding into the meta learner. We would like to evaluate a multi-layer stacking setup in which we can also incorporate our topic modeling based classifiers.
The scope of our current work was limited to classifying human versus bot users; we did not attempt to distinguish between different types of bot users, such as spam bots and social bots. We can extend the current work to identify specific classes of bots in addition to the human versus bot classification.
