On Bayesian Text Classification Learning Under the MapReduce Framework

Posted on:2013-08-12

Degree:Master

Type:Thesis

Country:China

Candidate:J Wei

Full Text:PDF

GTID:2268330374963190

Subject:Management Science and Engineering

Abstract/Summary:

Text classification is an important foundation for information retrieval and text mining. Although various text classification algorithms have been applied successfully in many domains, relying on any single one of them is liable to yield classifiers with poor performance and weak generalization ability. Ensemble learning exploits the differences among multiple individual classifiers and thereby effectively improves both classification performance and generalization ability. However, with the rapid growth of network data and the increasing variety of application types, traditional system architectures have difficulty meeting the demands of massive data storage, processing, and learning. Google's MapReduce parallel programming model offers a high level of abstraction that encapsulates the low-level details of parallel execution and gives programmers a simple programming framework.

After analyzing the additivity of Naïve Bayes, this thesis designs a Bayesian text classification algorithm that is based on the MapReduce parallel programming model and improved with TF-IDF. The algorithm uses five MapReduce functions to carry out classifier training and text classification. Experiments on Hadoop, the open-source implementation of MapReduce, show that classifiers built with this algorithm offer high data capacity, high efficiency, and good classification performance.

The thesis also focuses on ensemble learning. It combines Bagging, which is inherently parallel, with the Bayesian text classification algorithm to propose a distributed ensemble Bayesian text classifier based on MapReduce. Training first selects random training subsets in order to break the stability of Naïve Bayes, then uses the distributed Bayesian text classifier as the base learner on each subset, producing multiple base classifiers. Classification involves two steps: first, the base classifiers run in parallel to generate intermediate results; second, unweighted voting over these intermediate results yields the final result. The experimental results show that this algorithm not only improves classification performance effectively but also offers high reliability, efficiency, and scalability.
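The pipeline described in the abstract can be illustrated with a small, single-process Python sketch. It is not the thesis's implementation: it simulates the map and reduce steps in memory rather than on Hadoop, it omits the TF-IDF improvement, and it does not reproduce the five MapReduce functions the thesis mentions. All function names (`map_doc`, `reduce_counts`, `train_nb`, `classify_nb`, `train_bagged`, `classify_bagged`) are hypothetical, and Laplace smoothing is an assumption made here for the sake of a runnable example.

```python
import math
import random
from collections import Counter, defaultdict


def map_doc(label, tokens):
    """Map step: emit one count per document and one count per (label, word)."""
    yield ("__doc__", label), 1
    for word in tokens:
        yield (label, word), 1


def reduce_counts(pairs):
    """Reduce step: sum the values that share a key (counts are additive)."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


def train_nb(docs):
    """Train a multinomial Naive Bayes model from (label, tokens) pairs."""
    totals = reduce_counts(
        pair for label, tokens in docs for pair in map_doc(label, tokens)
    )
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    for (first, second), n in totals.items():
        if first == "__doc__":
            class_docs[second] = n          # documents per class
        else:
            word_counts[first][second] = n  # word occurrences per class
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_docs, word_counts, vocab


def classify_nb(model, tokens):
    """Return the class with the highest log-posterior (Laplace smoothing)."""
    class_docs, word_counts, vocab = model
    n_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        counts = word_counts[label]
        denom = sum(counts.values()) + len(vocab)
        score = math.log(class_docs[label] / n_docs)
        for word in tokens:
            if word in vocab:  # words never seen in training are skipped
                score += math.log((counts[word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def train_bagged(docs, n_models, seed=0):
    """Bagging: train each base classifier on a bootstrap sample of docs."""
    rng = random.Random(seed)
    return [
        train_nb([rng.choice(docs) for _ in docs]) for _ in range(n_models)
    ]


def classify_bagged(models, tokens):
    """Unweighted voting over the base classifiers' predictions."""
    votes = Counter(classify_nb(model, tokens) for model in models)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    docs = [("sports", ["goal", "match", "team"])] * 20
    docs += [("tech", ["code", "server", "data"])] * 20
    models = train_bagged(docs, n_models=5)
    print(classify_bagged(models, ["goal", "team"]))
```

Because the per-class counts are plain sums, the reduce step can combine partial counts from any number of data partitions, which is what makes this style of training a natural fit for MapReduce. Each base classifier in `train_bagged` is independent of the others, so in a distributed setting the bootstrap models could be trained and queried in parallel before the vote.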

Keywords/Search Tags:

Text Categorization, Ensemble Learning, MapReduce, Naive Bayes, Hadoop

Related items

1	An Implementation Of Text Categorization System Based On Hadoop
2	Research And Implementation Of Automatic Text Classification Based On Hadoop
3	Design And Implementation Of Text Classification System Based On Hadoop Platform
4	Text Categorization Research Based On TAN Model
5	Massive Academic Resources Classification Research For Personalized Recommender
6	The Research On Text Categorization Technology Based On Hierarchical Categorization And Ensemble Learning
7	Research On Classification For Text With Natural Group
8	Research And Application Of Patent Map Service System
9	A Study On Text Categorization Based On Machine Learning
10	Researches About Transfer Learning Algorithm Based On Ensemble Selection Methods