Ensemble-Based Text Classification for Spam Detection

Meng Zhang

Abstract


This research proposes an ensemble-based approach to spam detection in digital communication, addressing the escalating challenge posed by unsolicited messages, commonly known as spam. The exponential growth of online platforms has made effective information filtering systems necessary for maintaining security and efficiency. The proposed approach has three main components: feature extraction, classifier selection, and decision fusion. For feature extraction, word embeddings are explored as a way to represent text messages effectively. Multiple classifiers, including recurrent neural network (RNN) variants such as LSTM and GRU, are evaluated to identify the best performers for spam detection. The ensemble model then combines the strengths of the individual classifiers to achieve higher accuracy, precision, and recall. The proposed approach is evaluated with widely accepted metrics on benchmark datasets to assess its generalizability and robustness. The experimental results demonstrate that the ensemble-based approach outperforms the individual classifiers, offering an efficient solution for combating spam messages. Integrating this approach into existing spam filtering systems can contribute to better online communication, an improved user experience, and stronger cybersecurity, mitigating the impact of spam in the digital landscape.
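The decision fusion and evaluation steps described above can be illustrated with a minimal Python sketch. The abstract does not state which fusion rule is used, so soft voting (averaging the spam probabilities produced by each classifier) is assumed here. The function names, the 0.5 threshold, and the toy probabilities standing in for LSTM and GRU outputs are all hypothetical and are not results from the paper.

```python
from typing import Dict, List, Sequence


def soft_vote(prob_by_model: Sequence[Sequence[float]],
              threshold: float = 0.5) -> List[int]:
    """Fuse per-model spam probabilities by averaging, then threshold.

    prob_by_model[m][i] is model m's spam probability for message i.
    Returns 1 (spam) or 0 (not spam) for each message.
    """
    n_models = len(prob_by_model)
    n_messages = len(prob_by_model[0])
    fused = []
    for i in range(n_messages):
        mean_prob = sum(model[i] for model in prob_by_model) / n_models
        fused.append(1 if mean_prob >= threshold else 0)
    return fused


def evaluate(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, and recall with spam (1) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical toy values, standing in for the outputs of two trained
# recurrent classifiers (assumed here to be an LSTM and a GRU).
labels = [1, 0, 0, 1]
lstm_probs = [0.9, 0.2, 0.6, 0.45]
gru_probs = [0.7, 0.6, 0.3, 0.8]

fused = soft_vote([lstm_probs, gru_probs])
metrics = evaluate(labels, fused)
print(fused, metrics)
```

In this toy case each model misclassifies at least one message on its own, while the averaged probabilities recover all four labels. This only illustrates why fusion can help; it says nothing about the gains reported in the paper.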







DOI: https://doi.org/10.31449/inf.v48i6.5246

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.