An Improved Parallel Bayesian Text Classification Algorithm

Panpan Shen, Hao Wang, Zhouqing Meng, Zhenyu Yang, Zhaoping Zhi, Ran Jin, Aimin Yang

School of Computer Science and Information Technology, Zhejiang Wanli University, Ningbo, China

Corresponding Author Email: 1172340155@qq.com
31 March 2016



Drawing on cloud computing, this paper parallelizes the traditional Bayesian classification algorithm with the MapReduce model to address its poor suitability for large-scale data, which greatly improves classification speed. The algorithm is further improved to take advantage of the characteristics of the parallel version. A combined approach of synonym handling and word-frequency filtering reduces the dimensionality of the feature vectors and the number of false positives. Selected keywords are then weighted to improve classification accuracy. Finally, experiments on the Hadoop cloud computing platform show that the traditional text classification algorithm achieves good speedup once parallelized, and that the improved algorithm raises classification accuracy.
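The pipeline summarized above can be illustrated with a minimal single-process sketch. It is not the authors' implementation: the map and reduce steps are simulated in plain Python rather than run on Hadoop, synonym handling is assumed to mean folding synonyms into one canonical term, and the names `SYNONYMS`, `KEYWORD_WEIGHTS`, `min_freq`, and all function names are hypothetical. The classifier is a multinomial naive Bayes with Laplace smoothing, which is an assumption about the variant used.

```python
import math
from collections import Counter, defaultdict

# Hypothetical synonym table and keyword weights (not taken from the paper).
SYNONYMS = {"soccer": "football", "film": "movie"}
KEYWORD_WEIGHTS = {"football": 2.0}


def map_doc(label, text):
    """Map step: emit ((label, term), 1) after folding synonyms together."""
    for token in text.lower().split():
        yield (label, SYNONYMS.get(token, token)), 1


def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each (label, term) key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


def train(docs, min_freq=1):
    """Count terms per class, then drop terms rarer than min_freq overall."""
    pairs = [kv for label, text in docs for kv in map_doc(label, text)]
    counts = reduce_counts(pairs)
    term_freq = Counter()
    for (_, term), n in counts.items():
        term_freq[term] += n
    # Word-frequency filtering: this is where the vector dimension shrinks.
    vocab = {term for term, n in term_freq.items() if n >= min_freq}
    per_class = defaultdict(Counter)
    for (label, term), n in counts.items():
        if term in vocab:
            per_class[label][term] = n
    priors = Counter(label for label, _ in docs)
    return vocab, per_class, priors


def classify(text, vocab, per_class, priors):
    """Return the label with the highest weighted log-posterior score."""
    total_docs = sum(priors.values())
    best_label, best_score = None, -math.inf
    for label in priors:
        class_total = sum(per_class[label].values())
        score = math.log(priors[label] / total_docs)
        for token in text.lower().split():
            term = SYNONYMS.get(token, token)
            if term not in vocab:
                continue
            # Laplace-smoothed likelihood of the term given the class.
            p = (per_class[label][term] + 1) / (class_total + len(vocab))
            # Keyword weighting: weighted terms count more in the score.
            score += KEYWORD_WEIGHTS.get(term, 1.0) * math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


docs = [
    ("sports", "football match goal"),
    ("sports", "soccer goal team"),
    ("cinema", "movie actor film"),
    ("cinema", "film director scene"),
]
vocab, per_class, priors = train(docs, min_freq=2)
print(sorted(vocab))
print(classify("soccer team goal", vocab, per_class, priors))
```

In a real MapReduce job the pairs emitted by `map_doc` would be shuffled by key across nodes before reduction, so the counting step is the part that would benefit from parallel execution.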


Cloud computing, Text classification, Parallel, Hadoop

1. Introduction
2. Naive Bayes Classification Algorithm and Its Parallelization
3. Classification Algorithms in Cloud Computing Environment
4. Experimental Results and Analysis on Cloud Platform
5. Conclusions
