Classifying Cross-lingual Abusive Contents in Social Media: A Comparative Study of MultiClass Category Using Various Machine Learning Models
Authors: M. A. I. Rafi, T. Islam, O. F. Shikdar, T. R. Sakib, M. Sakib, G. Hossain and M. M. Hossain
Abstract— The proliferation of social media platforms has highlighted the need for effective text classification techniques, particularly for distinguishing offensive from non-offensive content. This study focuses on classifying abusive and non-abusive social media texts in Bangla, English, and Banglish (Bangla words written in English letters). We address two tasks: offensive text detection and multi-class categorization into Religion, Sports, Crime, Entertainment, and Politics, which provides deeper insight into the content. The dataset used for this study is balanced and contains 15,000 data points across the three languages combined. For the classification tasks, we employed various models, including Traditional Machine Learning (Decision Tree (DT) and Support Vector Machine (SVM)), Bayesian methods (Bernoulli Naive Bayes (NBN) and Multinomial Naive Bayes (MNB)), Ensemble techniques (Random Forest (RF) and eXtreme Gradient Boosting (XGBoost)), Deep Learning approaches (Convolutional Neural Networks (CNN) and Long Short-Term Memory networks (LSTM)), and Transformer models (Generative Pre-trained Transformer (GPT-2) and Bidirectional Encoder Representations from Transformers (BERT)) for labeling and categorization. Additionally, vectorization and tokenization were used in feature engineering to handle the diverse linguistic elements found in the dataset. Among all tested models, Logistic Regression (LR) achieves the highest accuracy on both tasks: 88.60% in distinguishing abusive from non-abusive texts and 81.30% in multi-class categorization. A comparative evaluation against Bayesian, ensemble, deep learning, and transformer models highlights the efficiency and simplicity of LR in handling diverse linguistic structures.
Keywords— Social Media, Abusive Language, Machine Learning, Deep Learning, Convolutional Neural Networks, Transformer Models.
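As a rough illustration of the vectorization-plus-Logistic-Regression idea summarized in the abstract, the following minimal sketch trains a binary classifier on character n-gram counts. It is not the authors' pipeline: the choice of character trigrams, the hand-written gradient-descent trainer, the hyperparameters, and the four toy sentences are all assumptions made for illustration, and every function name is hypothetical.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def vectorize(text, n=3):
    """Sparse bag of character n-grams as {ngram: count} (assumed feature choice)."""
    return Counter(char_ngrams(text, n))

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def score(weights, bias, feats):
    """Linear score: bias plus the weighted sum of the n-gram counts."""
    return bias + sum(weights.get(f, 0.0) * v for f, v in feats.items())

def train_logreg(samples, labels, epochs=200, lr=0.1):
    """Fit binary logistic regression with plain stochastic gradient descent."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for feats, y in zip(samples, labels):
            err = sigmoid(score(weights, bias, feats)) - y  # gradient of log-loss w.r.t. score
            bias -= lr * err
            for f, v in feats.items():
                weights[f] = weights.get(f, 0.0) - lr * err * v
    return weights, bias

def predict(weights, bias, feats):
    """Return 1 (abusive) if the predicted probability is at least 0.5, else 0."""
    return 1 if sigmoid(score(weights, bias, feats)) >= 0.5 else 0

# Made-up toy examples (1 = abusive, 0 = non-abusive); NOT taken from the paper's dataset.
train = [
    ("you are an idiot", 1),
    ("tumi ekta gadha", 1),
    ("have a nice day", 0),
    ("tumi kemon acho", 0),
]
X = [vectorize(text) for text, _ in train]
y = [label for _, label in train]
weights, bias = train_logreg(X, y)
train_predictions = [predict(weights, bias, feats) for feats in X]
```

Character n-grams are used here only because they work without a language-specific tokenizer on mixed Bangla, English, and Banglish input; a word-level or TF-IDF representation would slot into `vectorize` in the same way.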