On Bayesian Text Classification Learning Under the MapReduce Framework

Posted on: 2013-08-12; Degree: Master; Type: Thesis
Country: China; Candidate: J Wei; Full Text: PDF
GTID: 2268330374963190; Subject: Management Science and Engineering
Abstract/Summary:
Text classification is an important foundation for information retrieval and text mining. Although various text classification algorithms have been applied successfully in many domains, relying on a single algorithm is liable to produce classifiers with poor performance and poor generalization ability. Ensemble learning exploits the differences among multiple individual classifiers and thereby effectively improves their classification performance and generalization ability. However, with the rapid growth of network data and the increasing variety of applications, traditional system architectures struggle to meet the demands of massive data storage, processing and learning. Google's MapReduce parallel programming model offers a high level of abstraction: it encapsulates the low-level details of parallel execution and gives programmers a simple programming framework.
After analyzing the additivity of Naive Bayes, the thesis designs a Bayesian text classification algorithm that is based on the MapReduce parallel programming model and improved with TF-IDF. The algorithm uses five MapReduce functions to carry out classifier training and text classification. Experiments on Hadoop, an open-source implementation of MapReduce, show that classifiers built with this algorithm offer high data capacity, efficiency and performance. The thesis also studies ensemble learning: it combines Bagging, which is inherently parallel, with the Bayesian text classification algorithm to propose a distributed ensemble Bayesian text classifier built on MapReduce.
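The idea of training Naive Bayes through MapReduce jobs and weighting terms with TF-IDF can be illustrated with a minimal in-memory sketch. It is not the thesis's algorithm: the abstract does not say what its five MapReduce functions are, so the three jobs below, the smoothed IDF formula, the Laplace smoothing, and all function names (`run_job`, `map_class_term`, `train`, `classify`) are assumptions made for illustration. `run_job` only simulates the map, shuffle and sum-reduce phases in one process rather than on Hadoop.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def run_job(records, mapper):
    """Simulate one MapReduce job: map every record, then sum values per key."""
    totals = defaultdict(float)
    for record in records:
        for key, value in mapper(record):
            totals[key] += value
    return dict(totals)


def map_class_term(record):
    # Emit ((label, term), count) for each distinct term in one labelled document.
    label, text = record
    for term, count in Counter(tokenize(text)).items():
        yield (label, term), count


def map_doc_freq(record):
    # Emit (term, 1) once per document that contains the term.
    _, text = record
    for term in set(tokenize(text)):
        yield term, 1


def map_class_count(record):
    # Emit (label, 1) so the reducer yields the number of documents per class.
    label, _ = record
    yield label, 1


def train(records):
    """records: list of (label, text) pairs. Returns a dict holding the model."""
    tf = run_job(records, map_class_term)
    df = run_job(records, map_doc_freq)
    class_docs = run_job(records, map_class_count)
    n_docs = len(records)
    # Smoothed IDF (an assumed variant); it is always >= 1, so weights stay positive.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    weight = {(c, t): n * idf[t] for (c, t), n in tf.items()}
    class_total = defaultdict(float)
    for (c, _), w in weight.items():
        class_total[c] += w
    return {
        "weight": weight,
        "class_total": dict(class_total),
        "class_docs": class_docs,
        "n_docs": n_docs,
        "vocab": set(df),
    }


def classify(model, text):
    """Return the label with the highest log-posterior; unknown terms are skipped."""
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n in model["class_docs"].items():
        score = math.log(n / model["n_docs"])
        for term in tokenize(text):
            if term not in model["vocab"]:
                continue
            w = model["weight"].get((label, term), 0.0)
            # Laplace-smoothed likelihood computed from TF-IDF weights.
            score += math.log((w + 1) / (model["class_total"][label] + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Every statistic here is a sum over documents, which is why each job reduces with a plain per-key sum and why the work could in principle be split across many mappers.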
Training this ensemble classifier first draws random training subsets to break the stability of Naive Bayes, and then uses the distributed Bayesian text classification algorithm as the base learner on each subset, so that multiple base classifiers are produced simultaneously. Classifying text with the ensemble involves two steps: first, the base classifiers run in parallel to generate intermediate results; second, unweighted voting over the intermediate results yields the final result. Experimental results show that this algorithm not only improves classification performance effectively but also offers high reliability, efficiency and scalability.
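The Bagging procedure described above (random training subsets, one base classifier per subset, unweighted voting) can be sketched as follows. This is a minimal single-process illustration under stated assumptions, not the thesis's implementation: the subsets are bootstrap samples drawn with replacement, the base classifiers are trained one after another rather than as parallel MapReduce jobs, and `train_fn` and `predict_fn` are hypothetical callables standing in for any base learner.

```python
import random
from collections import Counter


def bootstrap_sample(records, rng):
    """Draw len(records) items with replacement (one Bagging training subset)."""
    return [rng.choice(records) for _ in records]


def train_bagging(records, train_fn, n_classifiers=5, seed=0):
    """Train n_classifiers base models, each on its own bootstrap sample.

    On a cluster each call to train_fn could run as an independent job;
    here the calls run sequentially for simplicity.
    """
    rng = random.Random(seed)
    return [train_fn(bootstrap_sample(records, rng)) for _ in range(n_classifiers)]


def vote(models, predict_fn, text):
    """Unweighted majority vote over the base classifiers' predictions.

    Ties go to the label that was predicted first, because Counter keeps
    insertion order and most_common preserves it among equal counts.
    """
    votes = Counter(predict_fn(model, text) for model in models)
    return votes.most_common(1)[0][0]
```

Any base learner with a train function and a predict function can be plugged in, for example `models = train_bagging(docs, train_fn)` followed by `vote(models, predict_fn, "some text")`.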
Keywords/Search Tags: Text Categorization, Ensemble Learning, MapReduce, Naive Bayes, Hadoop