Performance of Malware Detection Classifier Using Genetic Programming in Feature Selection

Heba Harahsheh, Mohammad Shraideh, Saleh Sharaeh


Malicious software, commonly referred to as malware, is software that affects or harms computers, servers, or networks. As the volume and complexity of malware have rapidly increased, cybersecurity requires detection systems that can identify malware and examine the behavior of its new features. Because traditional techniques are less efficient at detecting new malware, machine learning techniques are used to detect malware rapidly and intelligently and to improve detection performance. In this study, we developed a malware detection model that applies machine learning classifiers after a new feature selection technique based on genetic programming. We also compared the performance of all classifiers using the most recent feature selection techniques. The results show that Random Forest, Random Forest (4), and Random Tree give the best values in all experiments, while Hoeffding Tree and Decision Stump give lower F1-score and accuracy values in all experiments. The proposed feature selection method, GPMP, gives slightly better values than the Filter-based method: the accuracy and F1-score are 0.881066 and 0.867546 for GPMP, and 0.877624 and 0.862894 for Filter-based, respectively. The experimental results also reveal that GPMP used fewer features than Filter-based, which affected the computation and complexity of the model.
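For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how accuracy and F1-score can be computed for a binary malware/benign classifier. The function names, the label encoding (1 for malware, 0 for benign), and the toy labels are assumptions made for illustration; they are not taken from the study's data or code.

```python
# Illustrative sketch only: labels and names are hypothetical, not from the study.

def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Fraction of all samples that were classified correctly."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, written as 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy example with assumed encoding: 1 = malware, 0 = benign.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print("accuracy:", accuracy(tp, fp, fn, tn))
print("F1-score:", f1_score(tp, fp, fn))
```

Accuracy counts every correct decision equally, whereas the F1-score ignores true negatives, so the two can diverge when the malware and benign classes are imbalanced.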






Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.