An Automated Python Script for Data Cleaning and Labeling using Machine Learning Technique

Matthew Abiola Oladipupo, Princewill Chima Obuzor, Babatunde Joseph Bamgbade, Abidemi Emmanuel Adeniyi, Kazeem M. Olagunju, Sunday Adeola Ajagbe

Abstract


Anyone in an organization who works with data needs that data to be clean and free of noise. Because data warehouses store and continually update enormous amounts of data from several sources, some of those sources are likely to contain inaccurate data. The noise, inefficiency, and poor characterization of the vast amount of available data, together with the resulting insensitivity and inefficiency of manual data cleaning and labeling, have made the data ambiguous to present and the information difficult to assess. This study identified a gap in the development of better data analysis methods, which motivated the creation of a Python script that cleans and labels data automatically. First, a financial dataset was obtained from the Kaggle dataset repository. Second, a machine learning (ML) technique was developed in Python to automate the cleaning of the financial dataset; this covers data ingestion, handling incomplete data, handling anomalies, one-hot and label encoding, extraction of date and time values, and data normalization. Third, an unsupervised ML method (k-means) was implemented to automate the labeling of the financial dataset; this includes the elbow method, k-means clustering, modeling of "age" versus "arrival," dimensionality reduction with PCA, visualization, and labeling of the dataset using the resulting clusters. Finally, the automatically cleaned and labeled dataset was evaluated empirically by comparing the cleaned dataset before and after PCA was applied. The results show that the developed ML technique not only improved the quality of the audit data used in this study but also labeled the data after cleaning it and removing the noisy portions and incomplete records, as shown by the k-means clustering result and the grouping by PCA.
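The cleaning and labeling steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' script: the function names (clean, elbow_inertias, label), the synthetic data, and the column names "income", "segment", and "opened" are hypothetical, and the specific choices (median and mode imputation, IQR clipping for anomalies, z-score normalization, two principal components) are assumptions made only for illustration. It assumes pandas and scikit-learn are installed.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Expand dates, impute, clip outliers, one-hot encode, normalize."""
    df = df.copy()
    # Date/time extraction: replace each datetime column by numeric parts.
    for col in df.select_dtypes(include="datetime").columns:
        df[f"{col}_year"] = df[col].dt.year
        df[f"{col}_month"] = df[col].dt.month
        df = df.drop(columns=col)
    numeric = df.select_dtypes(include="number").columns
    categorical = [c for c in df.columns if c not in numeric]
    # Incomplete data: median for numeric columns, mode for categorical ones.
    df[numeric] = df[numeric].fillna(df[numeric].median())
    for col in categorical:
        df[col] = df[col].fillna(df[col].mode().iloc[0])
    # Anomalies: clip numeric values to the 1.5 * IQR fences (an assumption).
    q1, q3 = df[numeric].quantile(0.25), df[numeric].quantile(0.75)
    iqr = q3 - q1
    df[numeric] = df[numeric].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr, axis=1)
    # One-hot encoding of categorical columns.
    df = pd.get_dummies(df, columns=categorical, dtype=float)
    # Normalization: z-score scaling of the numeric columns.
    df[numeric] = StandardScaler().fit_transform(df[numeric])
    return df


def elbow_inertias(X: pd.DataFrame, k_values) -> list:
    """Within-cluster sum of squares for each k, for the elbow method."""
    return [
        KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        for k in k_values
    ]


def label(df: pd.DataFrame, k: int) -> pd.DataFrame:
    """Attach a k-means cluster label and two PCA coordinates to each row."""
    out = df.copy()
    out["cluster"] = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(df)
    coords = PCA(n_components=2).fit_transform(df)
    out["pc1"], out["pc2"] = coords[:, 0], coords[:, 1]
    return out


# Synthetic stand-in for a financial dataset (hypothetical columns and values).
rng = np.random.default_rng(0)
n = 60
raw = pd.DataFrame({
    "age": np.r_[rng.normal(30, 3, n // 2), rng.normal(60, 3, n // 2)],
    "income": np.r_[rng.normal(20_000, 2_000, n // 2),
                    rng.normal(80_000, 5_000, n // 2)],
    "segment": ["retail"] * (n // 2) + ["corporate"] * (n // 2),
    "opened": pd.date_range("2020-01-01", periods=n, freq="D"),
})
raw.loc[3, "age"] = np.nan      # a missing numeric value
raw.loc[5, "segment"] = None    # a missing categorical value
raw.loc[7, "income"] = 1e7      # an implausible outlier

cleaned = clean(raw)
inertias = elbow_inertias(cleaned, range(1, 6))
labeled = label(cleaned, k=2)
print(inertias)
print(labeled["cluster"].value_counts())
```

Plotting the inertias against k and choosing the point where the curve flattens is the usual way to read the elbow; the value k=2 here is fixed by hand because the synthetic data was built with two groups.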


DOI: https://doi.org/10.31449/inf.v47i6.4474

This work is licensed under a Creative Commons Attribution 3.0 License.