Big Data Clustering Techniques Challenged and Perspectives: Review

Fouad Hammadi Awad, Murtadha M. Hamad

Abstract


Clustering is a critical data mining and analysis technique for big data. Adapting clustering algorithms to large amounts of data is difficult, and big data brings new challenges of its own. Because big data sets can reach petabytes in size and clustering methods have high processing costs, the challenge is how to apply clustering techniques to big data within a reasonable time. The purpose of this work is to investigate the history and advancement of clustering platforms and techniques for handling big data, from the earliest proposed techniques to today's novel solutions. The methodology and specific issues involved in building an effective clustering mechanism are presented and evaluated, followed by a discussion of the options for enhancing clustering algorithms. A brief literature review of recent advances in clustering techniques is presented to address the main characteristics and drawbacks of each solution. In addition, an example of clustering a big data set is presented to give a further overview of the clustering techniques.







DOI: https://doi.org/10.31449/inf.v47i6.4445

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.