A Novel Algorithm for Clustering and Feature Selection of High Dimensional Datasets

Thulasi Bikku, Alapati Padma Priya

Member, IAENG, Department of CSE

Vignan's Nirula Institute of Technology and Science for Women, Palakaluru Road, Guntur, Andhra Pradesh 522005, India

Corresponding Author Email: thulasi.jntua@gmail.com, padmapriyaalapati01@gmail.com
26 December 2017
15 January 2018
31 September 2017

The exponential growth of complex, heterogeneous, dynamic, and unbounded data, produced by a range of fields including health, genomics, materials science, climatology, and social networks, poses significant challenges for data processing and for achieving the desired execution speed. Clustering is the task of grouping objects so that items in the same group (cluster) are more similar to each other than to those in other groups. It is an exploratory data analysis technique that partitions a dataset into several groups. Many clustering methods are available, and different algorithms suit different types of data; k-means is among the most widely used clustering algorithms. Big data analytics involves many important data mining tasks, including clustering, which organizes data into meaningful clusters based on the similarity or dissimilarity among objects. Big data environments are used to process and analyse very large volumes of data (gigabytes to terabytes). For cluster analysis, the k-means algorithm is commonly executed on Hadoop with MapReduce to analyse high-dimensional datasets. In big data analytics, clustering is applied when unlabelled data must be grouped into clusters. However, the conventional k-means algorithm does not work well with the Hadoop framework and the MapReduce programming model, so the algorithm must be modified to improve the performance of data analysis. We therefore propose and implement a new clustering algorithm that improves on conventional k-means. Experiments are performed on a benchmark dataset to assess the feasibility and effectiveness of our algorithm.
The proposed approach first improves the quality of the data by removing outlier points from the dataset, and then applies the bi-part technique to perform the clustering. The proposed clustering algorithm is implemented using the Hadoop framework and the MapReduce programming model. Finally, its performance is evaluated and compared with that of the conventional k-means clustering technique. The results show that the proposed approach is effective and improves the accuracy of cluster formation by removing these inefficiencies. The proposed work can therefore be applied in big data environments to improve clustering performance.
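
The two steps described above (outlier removal followed by a bi-part split) can be illustrated with a minimal single-machine sketch. It is not the authors' Hadoop/MapReduce implementation. It assumes that "bi-part" means bisecting k-means, and it assumes a simple outlier rule (drop points whose distance to the global centroid exceeds the mean distance plus `z` standard deviations), because the abstract does not state the actual criterion. All function names and the toy data are hypothetical.

```python
import math
from statistics import mean, pstdev


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(mean(axis) for axis in zip(*points))


def sse(points):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(points)
    return sum(math.dist(p, c) ** 2 for p in points)


def remove_outliers(points, z=2.0):
    """Drop points farther from the global centroid than mean + z * std.

    Assumption: this distance rule is a stand-in for the paper's filter.
    """
    c = centroid(points)
    dists = [math.dist(p, c) for p in points]
    cutoff = mean(dists) + z * pstdev(dists)
    return [p for p, d in zip(points, dists) if d <= cutoff]


def two_means(points, iters=20):
    """Split one cluster in two with plain k-means (k=2), seeded deterministically."""
    c = centroid(points)
    a = max(points, key=lambda p: math.dist(p, c))  # farthest from the centroid
    b = max(points, key=lambda p: math.dist(p, a))  # farthest from the first seed
    left, right = list(points), []
    for _ in range(iters):
        left = [p for p in points if math.dist(p, a) <= math.dist(p, b)]
        right = [p for p in points if math.dist(p, a) > math.dist(p, b)]
        if not left or not right:
            break
        new_a, new_b = centroid(left), centroid(right)
        if (new_a, new_b) == (a, b):  # assignments have stabilised
            break
        a, b = new_a, new_b
    return left, right


def bisecting_kmeans(points, k):
    """Repeatedly bisect the cluster with the largest SSE until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        worst = max(clusters, key=sse)
        if sse(worst) == 0:  # only identical points remain; nothing to split
            break
        clusters.remove(worst)
        clusters.extend(part for part in two_means(worst) if part)
    return clusters


# Toy data: two tight groups plus one distant point.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (100, 100)]
clean = remove_outliers(data)
clusters = bisecting_kmeans(clean, k=2)
```

In a MapReduce setting, the assignment of points to the two candidate centres would typically be the map step and the centroid update the reduce step; the sketch keeps both in one process for readability.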


K-means, Classification, Clustering, MapReduce, Big data, Accuracy, Hadoop

1. Introduction
2. Algorithm for Traditional K-Means
3. Related Work
4. Proposed Algorithm (KHDD)
5. Experimental Results
6. Conclusion and Future Work
