Big Data Clustering Techniques Challenged and Perspectives: Review
Abstract
Clustering is a critical data mining and analysis technique for big data. Adapting clustering algorithms to large volumes of data is difficult, and big data brings new challenges of its own. Because big data sets can reach petabytes in size and clustering methods have high processing costs, the challenge is to overcome this cost and apply clustering techniques to big data in a timely manner. The purpose of this work is to investigate the history and advancement of clustering platforms and techniques for handling big data, from the earliest proposed techniques to today's novel solutions. The methodology and specific issues involved in building an effective clustering mechanism are presented and evaluated, followed by a discussion of the options for enhancing clustering algorithms. A brief literature review of recent advances in clustering techniques is presented to address the main characteristics and drawbacks of each solution. In addition, an example of clustering a big data set is presented to give a further overview of the clustering techniques.
DOI: https://doi.org/10.31449/inf.v47i6.4445
This work is licensed under a Creative Commons Attribution 3.0 License.