Semantic Feature Engineering with LSA-SVM for Cyberbullying Comment Classification on Instagram
Abstract
Social media is now an essential part of everyday life. Instagram is one of the most popular platforms and is used for many purposes, including gaining popularity. However, the platform is also frequently a venue for abusive and impolite comments, a behavior known as cyberbullying. Detecting and classifying cyberbullying comments on Instagram is therefore an important step in cyberbullying prevention. Text classification poses several challenges that must be overcome for a model to succeed, such as polysemy, the curse of dimensionality, and the choice of text representation for feature extraction. This study implements a hybrid feature engineering technique that combines TF-IDF word weighting with Latent Semantic Analysis (LSA) to reduce feature dimensionality and capture the semantic meaning of the data, and uses a Support Vector Machine (SVM) to classify comments as bullying or non-bullying. The proposed method, in which the LSA matrix is formed from the data of only one of the classes, achieved an accuracy of 98%. In comparison, the conventional method, which uses TF-IDF and an LSA matrix formed from the data of both classes, achieved an accuracy of only 84%. These results indicate that the proposed method is more effective than the baseline approach.
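The pipeline described in the abstract (TF-IDF weighting, LSA fitted on a single class, then an SVM) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses scikit-learn, the toy comments are invented, the choice of the bullying class as the LSA source and the values `n_components=2` and `C=100.0` are hypothetical, and projecting every comment onto the single-class LSA space is only one possible reading of the method.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Toy comments invented for illustration; they are NOT from the paper's dataset.
bullying = [
    "you are ugly and stupid",
    "nobody likes you stupid loser",
    "ugly loser go away",
    "you stupid ugly loser",
]
non_bullying = [
    "great photo love the colors",
    "congrats on the new album",
    "love this song great work",
    "nice colors great photo",
]
texts = bullying + non_bullying
labels = [1] * len(bullying) + [0] * len(non_bullying)  # 1 = bullying

# Step 1: TF-IDF word weighting over the whole corpus.
vectorizer = TfidfVectorizer()
tfidf_all = vectorizer.fit_transform(texts)

# Step 2: fit LSA (truncated SVD) on the TF-IDF rows of ONE class only
# (assumed here to be the bullying class), then project every comment
# into that low-dimensional semantic space.
lsa = TruncatedSVD(n_components=2, random_state=0)  # hypothetical size
lsa.fit(vectorizer.transform(bullying))
features = lsa.transform(tfidf_all)

# Step 3: train a linear SVM on the LSA features.
# C=100.0 is a hypothetical setting chosen for this tiny separable toy set.
clf = SVC(kernel="linear", C=100.0)
clf.fit(features, labels)


def classify(comment: str) -> int:
    """Return 1 if the comment is predicted as bullying, else 0."""
    vec = lsa.transform(vectorizer.transform([comment]))
    return int(clf.predict(vec)[0])
```

In this toy setup the non-bullying comments share no vocabulary with the bullying class, so they project to approximately the origin of the LSA space while bullying comments do not. That separation is a property of the invented data and should not be read as evidence for the accuracy figures reported above.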
DOI: https://doi.org/10.31449/inf.v49i15.6992

This work is licensed under a Creative Commons Attribution 3.0 License.