Gender Classification on Twitter Based on Feeds and User Descriptions Using Xlnet-Fasttext

Vito Rozaan Alandeta, Derwin Suhartono

Abstract


Gender falsification in social media content is an increasingly troubling problem: users often hide their true gender identity or pose as members of a different gender. This can have negative consequences, including the spread of disinformation, discrimination, and online security risks. To address this problem, this research proposes a text classification-based solution for identifying gender falsification in social media texts. The method extracts linguistic features from texts, such as word usage, sentence structure, and language patterns, that can provide clues to the author's gender. To this end, the research introduces a new transformer-based approach that uses XLNet modified with additional FastText embeddings. The modification is made to the embedding section and can improve XLNet's understanding of text context when carrying out text classification. For gender classification based on Twitter feeds, baseline XLNet achieves fairly good performance, with accuracy, precision, recall, and F1-score of 0.704, 0.770, 0.598, and 0.674, respectively, while XLNet-FastText achieves 0.714, 0.770, 0.609, and 0.680. For gender classification based on user account descriptions, baseline XLNet achieves accuracy, precision, recall, and F1-score of 0.705, 0.771, 0.598, and 0.674, respectively, while XLNet-FastText achieves 0.724, 0.751, 0.6324, and 0.686.
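The four scores reported in the abstract are standard binary classification metrics. As a minimal sketch of how they relate, the Python below computes them from confusion-matrix counts. The function names and the example counts are hypothetical and are not taken from the paper; which gender is treated as the positive class is likewise an assumption.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts.

    tp/fp/fn/tn are counts for an assumed positive class (hypothetical here).
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = f1_from(precision, recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0


# Hypothetical counts, chosen only for illustration.
example = classification_metrics(tp=60, fp=20, fn=40, tn=80)

# Precision and recall reported for XLNet-FastText on Twitter feeds.
feeds_f1 = f1_from(0.770, 0.609)
```

With the precision and recall reported for XLNet-FastText on Twitter feeds (0.770 and 0.609), `f1_from` gives a value that rounds to 0.680, matching the F1-score stated in the abstract.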






DOI: https://doi.org/10.31449/inf.v48i20.5761

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.