A Comparative Analysis of Machine Learning Algorithms to Build a Predictive Model for Detecting Diabetes Complications

Ali A. Abaker, Fakhreldeen A. Saeed


Diabetes complications have a significant impact on patients’ quality of life. The objective of this study was to predict which patients were more likely to have a complicated health condition at the time of admission, so that medical interventions could be introduced early. The data were 644 electronic health records from Alsukari Hospital, collected from January 2018 to April 2019. We used the following machine learning methods: logistic regression, random forest, and k-nearest neighbor (KNN). Logistic regression performed better than the other algorithms, achieving an accuracy of 81%, a recall of 81%, and an F1 score of 75%. Attributes such as infection years, swelling, diabetic ketoacidosis, and diabetic septic foot were significant in predicting diabetes complications. This model can help identify patients who require additional care to limit complications, and can help practitioners decide whether a patient should be hospitalized or sent home. Furthermore, we used the sequential feature selection (SFS) algorithm, which reduced the features to six, fewer than in any model previously built to predict diabetes complications. The primary goal of this study was achieved: the model has fewer attributes, which makes it simple and understandable, and it also performs better.
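The sequential feature selection step mentioned in the abstract can be illustrated with a minimal sketch of greedy forward selection. This is not the study's implementation: the scorer (leave-one-out accuracy of a 1-nearest-neighbor classifier), the function names, and the toy data are all assumptions chosen to keep the example self-contained.

```python
from typing import Callable, List, Sequence

Row = Sequence[float]


def loo_1nn_accuracy(X: List[Row], y: List[int], features: List[int]) -> float:
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier
    restricted to the given feature indices (hypothetical scorer)."""
    correct = 0
    for i, xi in enumerate(X):
        best_j, best_d = -1, float("inf")
        for j, xj in enumerate(X):
            if i == j:
                continue  # hold out the sample being classified
            d = sum((xi[f] - xj[f]) ** 2 for f in features)
            if d < best_d:
                best_j, best_d = j, d
        correct += int(y[best_j] == y[i])
    return correct / len(X)


def sequential_forward_selection(
    X: List[Row],
    y: List[int],
    k: int,
    score: Callable[[List[Row], List[int], List[int]], float] = loo_1nn_accuracy,
) -> List[int]:
    """Greedily add the feature that most improves the score until k are chosen."""
    selected: List[int] = []
    remaining = list(range(len(X[0])))
    while len(selected) < k and remaining:
        # Try each unused feature together with those already selected.
        best_f = max(remaining, key=lambda f: score(X, y, selected + [f]))
        selected.append(best_f)
        remaining.remove(best_f)
    return selected


# Toy data (invented): feature 0 separates the classes, features 1 and 2 do not.
X_toy = [
    [0.0, 5.0, 1.0],
    [0.1, 1.0, 0.0],
    [0.2, 4.0, 1.0],
    [1.0, 5.1, 0.0],
    [1.1, 1.1, 1.0],
    [1.2, 4.1, 0.0],
]
y_toy = [0, 0, 0, 1, 1, 1]

best_single = sequential_forward_selection(X_toy, y_toy, k=1)
print("selected feature indices:", best_single)
```

In the study's setting, k would be six and the scorer would be whichever classifier and validation scheme the authors used; the abstract does not specify either, so both are left as parameters here.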

DOI: https://doi.org/10.31449/inf.v45i1.3111

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.