An Automated Python Script for Data Cleaning and Labeling using Machine Learning Technique
DOI: https://doi.org/10.31449/inf.v47i6.4474
This work is licensed under a Creative Commons Attribution 3.0 License.