This paper applies the idea of cloud computing, following the MapReduce model, to address the poor suitability of the traditional Bayesian classification algorithm for large-scale data, greatly improving classification speed. The algorithm is also improved to match the characteristics of parallel execution. Adding a combined synonym and word-frequency filtering step reduces the dimensionality of the feature vectors and lowers the number of false positives. Particular keywords are then weighted to enhance classification accuracy. Finally, experiments on the Hadoop cloud computing platform show that the traditional text classification algorithm, once parallelized on Hadoop, achieves a better speedup, and that the improved algorithm raises classification accuracy.
Cloud computing, Text classification, Parallel, Hadoop
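The pipeline summarized in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' implementation: the map and reduce phases are simulated with plain Python functions rather than run on Hadoop, and the synonym table, keyword weights, frequency threshold, sample documents, and the choice to apply keyword weights as multipliers on the log-likelihood are all assumptions made for illustration.

```python
import math
from collections import defaultdict

# Hypothetical illustration data; none of these values come from the paper.
SYNONYMS = {"automobile": "car"}   # map each synonym to one canonical term
KEYWORD_WEIGHTS = {"goal": 2.0}    # assumed extra weight for chosen keywords
MIN_FREQ = 2                       # assumed word-frequency filter threshold
DOCS = [
    ("sports", "goal match goal team"),
    ("sports", "team match win"),
    ("autos", "car engine automobile"),
    ("autos", "engine car wheel"),
]


def normalize(text):
    """Lowercase, split on whitespace, and replace synonyms."""
    return [SYNONYMS.get(w, w) for w in text.lower().split()]


def map_phase(label, text):
    """Mapper: emit ((label, term), 1) for every term of one document."""
    return [((label, term), 1) for term in normalize(text)]


def reduce_phase(pairs):
    """Reducer: sum the values emitted for each (label, term) key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return counts


def train(docs):
    """Simulate map and reduce over (label, text) pairs; build the model."""
    pairs = [p for label, text in docs for p in map_phase(label, text)]
    counts = reduce_phase(pairs)
    # Word-frequency filtering: keep terms seen at least MIN_FREQ times.
    totals = defaultdict(int)
    for (_, term), n in counts.items():
        totals[term] += n
    vocab = {t for t, n in totals.items() if n >= MIN_FREQ}
    class_docs = defaultdict(int)
    for label, _ in docs:
        class_docs[label] += 1
    class_terms = defaultdict(int)
    for (label, term), n in counts.items():
        if term in vocab:
            class_terms[label] += n
    return counts, vocab, class_docs, class_terms


def classify(model, text):
    """Return the label with the highest weighted log-posterior."""
    counts, vocab, class_docs, class_terms = model
    n_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        score = math.log(class_docs[label] / n_docs)  # class prior
        for term in normalize(text):
            if term not in vocab:
                continue  # filtered-out terms carry no evidence
            # Laplace-smoothed term likelihood for this class.
            p = (counts.get((label, term), 0) + 1) / (
                class_terms[label] + len(vocab))
            score += KEYWORD_WEIGHTS.get(term, 1.0) * math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Calling `classify(train(DOCS), "automobile engine")` maps "automobile" to "car" before scoring. In an actual Hadoop job the mapper and reducer would run on separate nodes over partitions of the corpus; here they only show which key-value pairs each phase would handle.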