Parallel Fuzzy Rough Support Vector Machine for Data Classification in Cloud Environment

Arindam Chaudhuri

Abstract


Classification of data is widely used as an effective and efficient means of conveying knowledge and information to users. The primary focus has always been on techniques for extracting useful knowledge from data such that returns are maximized. With the emergence of huge datasets, existing classification techniques often fail to produce desirable results. The challenge lies in analyzing and understanding the characteristics of massive datasets by retrieving useful geometric and statistical patterns. We propose a supervised parallel fuzzy rough support vector machine (PFRSVM) for data classification in a cloud environment. PFRSVM is an in-stream data classification engine that adheres to the fundamental rules of stream processing. Classification is performed by PFRSVM using a hyperbolic tangent kernel. The fuzzy rough set model reduces sensitivity to noisy samples and handles imprecision in the training samples, which makes the results more robust. The membership function is a function of the center and radius of each class in feature space and is represented with a kernel. It plays an important role in sampling the decision surface. The success of PFRSVM depends on choosing appropriate parameter values. The training samples may be linearly or nonlinearly separable, and different input points make different contributions to the decision surface. The algorithm is parallelized in order to reduce training time. The system is built on a support vector machine library using the Hadoop implementation of MapReduce. The algorithm is tested on large datasets to check its feasibility and convergence. The performance of the classifier is also assessed in terms of the number of support vectors. The challenges encountered in implementing big data classification in machine learning frameworks are also discussed. The experiments are carried out on the cloud environment available at University of Technology and Management, India. The results are illustrated for Gaussian RBF and Bayesian kernels.
The variability in prediction and generalization of PFRSVM is examined with respect to the values of the parameter C. PFRSVM effectively resolves the effects of outliers and the problems of class imbalance and class overlap, generalizes to unseen data, and relaxes the dependency between features and labels. The average classification accuracy of PFRSVM is better than that of the other classifiers for both Gaussian RBF and Bayesian kernels. The experimental results on both synthetic and real datasets demonstrate the superiority of the proposed technique. PFRSVM is scalable and reliable, and is characterized by order independence, computational transaction, failure recovery, atomic transactions, fault tolerance and high availability, as exhibited through various experiments.
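
The abstract describes the membership function only as a function of the center and radius of each class in feature space, expressed through a kernel. The following is a minimal sketch of one common construction of that kind, not the formula used in the paper: it assumes a Gaussian RBF kernel, takes the class center to be the mean of the mapped samples, takes the radius to be the largest distance to that center, and assigns membership 1 - d / (r + delta). The names `rbf_kernel`, `class_memberships`, `gamma` and `delta` are hypothetical and chosen for illustration.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def class_memberships(points, kernel=rbf_kernel, delta=1e-3):
    """Fuzzy memberships for the samples of ONE class (illustrative assumption).

    The squared distance of sample i to the class center in feature space
    is computed with kernel evaluations only:
        d_i^2 = k(x_i, x_i) - (2/n) * sum_j k(x_i, x_j)
                + (1/n^2) * sum_j sum_l k(x_j, x_l)
    """
    n = len(points)
    gram = [[kernel(p, q) for q in points] for p in points]
    mean_all = sum(sum(row) for row in gram) / (n * n)

    dists = []
    for i in range(n):
        mean_row = sum(gram[i]) / n
        d2 = gram[i][i] - 2.0 * mean_row + mean_all
        # Guard against tiny negative values caused by rounding.
        dists.append(math.sqrt(max(d2, 0.0)))

    radius = max(dists)  # class radius: farthest sample from the center
    # Samples near the center get membership close to 1; the farthest
    # sample gets a small positive membership because delta > 0.
    return [1.0 - d / (radius + delta) for d in dists]


if __name__ == "__main__":
    # Three clustered samples and one far-away sample of the same class.
    sample = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (3.0, 3.0)]
    for point, m in zip(sample, class_memberships(sample)):
        print(point, round(m, 4))
```

Under these assumptions, a sample far from its class center receives a low membership and therefore contributes less to the decision surface, which is the mechanism the abstract credits for robustness to noisy samples and outliers.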

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.