
An Ensemble Clustering for Mining High-Dimensional Biological Big Data

Dewan Md. Farid | Ann Nowe | Bernard Manderick

Computational Modeling Lab, Department of Computer Science, Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussels, Belgium

OPEN ACCESS

https://www.witpress.com/elibrary/dne-volumes/11/3/1204

Abstract:

Clustering high-dimensional biological big data is an extremely challenging task, as the data space is often too large and too noisy. Conventional clustering methods can be inefficient and ineffective on such data, because traditional distance measures may be dominated by noise in many dimensions. An additional challenge in biological big data is that we need to find not only the clusters of instances (genes), but also, for each cluster, a set of features (conditions) that manifest the cluster. In this paper, we propose an ensemble clustering approach with feature selection and grouping for clustering high-dimensional biological big data. It uses two well-established clustering methods: (a) k-means clustering and (b) similarity-based clustering. The approach selects the most relevant features in the dataset and groups them into subsets of features to overcome the problems associated with traditional clustering methods. We also apply biclustering to each cluster generated by the ensemble clustering, using mean squared residue scores to find sub-matrices in the biological data. We applied the proposed clustering method to unlabeled genomic data (148 Exome datasets) of Brugada syndrome to discover previously unknown data patterns. Experiments verify that the proposed method achieves high-performance clustering results on high-dimensional biological big data.
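The mean squared residue score mentioned in the abstract can be illustrated with a short sketch. It assumes the usual Cheng and Church definition, in which the residue of an entry is its value minus its row mean and column mean within the sub-matrix, plus the overall sub-matrix mean; the abstract does not state which variant the paper uses. The function name `mean_squared_residue` and the toy matrix are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def mean_squared_residue(
    matrix: Sequence[Sequence[float]],
    rows: Sequence[int],
    cols: Sequence[int],
) -> float:
    """Mean squared residue of the sub-matrix selected by rows and cols.

    Assumed definition (Cheng and Church style):
    H(I, J) = mean over (i, j) of (a_ij - a_iJ - a_Ij + a_IJ)^2.
    """
    if not rows or not cols:
        raise ValueError("rows and cols must be non-empty")

    # Means over the selected sub-matrix: per row, per column, and overall.
    row_mean = {i: sum(matrix[i][j] for j in cols) / len(cols) for i in rows}
    col_mean = {j: sum(matrix[i][j] for i in rows) / len(rows) for j in cols}
    overall = sum(row_mean[i] for i in rows) / len(rows)

    # Residue of an entry: what remains after removing row and column effects.
    total = 0.0
    for i in rows:
        for j in cols:
            residue = matrix[i][j] - row_mean[i] - col_mean[j] + overall
            total += residue * residue
    return total / (len(rows) * len(cols))


# Toy expression matrix (illustrative values only, not data from the paper).
# The first three columns differ between rows by a constant shift.
toy = [
    [1.0, 2.0, 3.0, 9.0],
    [2.0, 3.0, 4.0, 0.0],
    [4.0, 5.0, 6.0, 5.0],
]

coherent = mean_squared_residue(toy, rows=[0, 1, 2], cols=[0, 1, 2])
noisy = mean_squared_residue(toy, rows=[0, 1, 2], cols=[0, 1, 2, 3])
print(coherent, noisy)
```

A score near zero indicates that the selected genes vary coherently across the selected conditions, so adding the unrelated fourth column raises the score. A biclustering step would typically search for large sub-matrices whose score stays below a chosen threshold.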

Keywords:

biclustering, biological big data, Brugada syndrome, clustering, high-dimensional data
