Leveraging Machine Learning Models in Developing a Web-Based Multilingual Hate Speech Detection System for Cebuano, Tagalog, and English on Social Media
Keywords:
Hate Speech Detection, Multilingual NLP, Machine Learning, Hyperparameter Tuning

Abstract
In this study, we explore the development of a web-based, multilingual hate speech detection system that supports Cebuano, Tagalog, and English. We integrated both traditional machine learning models and transformer-based deep learning approaches to assess their effectiveness in identifying hate speech in social media comments across various contexts. Specifically, we evaluated Naïve Bayes, Decision Tree, Support Vector Machine (SVM), Random Forest, mBERT, and XLM-RoBERTa. To prepare the data, we applied a series of preprocessing steps including tokenization, stemming, stopword removal, and TF-IDF vectorization. Feature relevance was enhanced through Chi-Square filtering, and we addressed class imbalance using the Synthetic Minority Over-sampling Technique (SMOTE), which improved recall rates for underrepresented classes. Among the traditional models, the fine-tuned SVM achieved 92.1% accuracy, while Random Forest reached 93.3%, with particularly strong recall for Cebuano and English texts. Meanwhile, the transformer-based models yielded superior performance after hyperparameter tuning: mBERT achieved 96.1% accuracy with an F1-score of 0.97, and XLM-RoBERTa obtained 95.4% accuracy with an F1-score of 0.96. These results highlight the value of combining Chi-Square feature selection, SMOTE balancing, and fine-tuning strategies to optimize multilingual hate speech detection. Despite these advancements, our findings also reveal ongoing challenges related to class imbalance, as reflected in the macro F1-scores, even for the transformer-based models. Overall, we demonstrate that a well-tuned hybrid approach can provide an efficient and scalable solution for multilingual hate speech detection in diverse digital environments.