A Computational Experience for Automatic Feature Selection on Big Data Frameworks

Y. Orenes, A. Rabasa, A. Pérez-Martín, J.J. Rodríguez-Sala, J. Sánchez-Soriano

Miguel Hernández University of Elche, Spain

Pages: 168-177
DOI: https://doi.org/10.2495/DNE-V11-N3-168-177

OPEN ACCESS

Abstract: 

Classification rule systems are among the predictive analytics techniques used in Big Data problems, where datasets with millions of rows and dozens of variables (attributes) are common. A classification rule system consists of a set of rules, each with an antecedent (a variable or set of variables, numeric or nominal) and a consequent (the target variable, which must be nominal). When the antecedent variables are numeric, many classification rule generation algorithms rely on traditional automatic feature selection methods based on well-established techniques such as discriminant analysis or cluster analysis. In this paper, the authors compare their own feature selection and classification method, RBS (originally designed to handle only nominal variables), with classical feature selection methods. After formally defining the method, the paper presents the design of a computational experience that allows a qualitative and quantitative comparison of the adapted RBS with other feature selection methods. Finally, the optimal conditions for applying each method are discussed and future research directions in automatic feature selection are identified.

Keywords: 

big data, classification rule systems, feature selection
