Text Categorization Research Based On TAN Model

Posted on: 2010-06-19
Degree: Master
Type: Thesis
Country: China
Candidate: J Liu
Full Text: PDF
GTID: 2178360278452477
Subject: Computer Science and Technology
Abstract/Summary:
With the development of information technology, and especially the spread of the Internet, the volume of electronic text is growing rapidly, which has made text categorization an important research area. Although Bayesian methods are simple, intuitive, and stable in performance, current text categorization algorithms based on Bayesian models are mainly confined to Naive Bayes. Naive Bayes classifies poorly because its attribute independence assumption prevents it from expressing dependence among attributes. General Bayesian networks can express this dependence, but the complexity of learning them makes them impractical for text categorization. TAN (Tree-Augmented Naive Bayes) combines the simplicity of Naive Bayes with the ability of Bayesian networks to express dependence among attributes, a compromise between learning efficiency and accurate description of the correlations among attributes. At present, little text categorization research is based on the TAN model, and the existing TAN text categorization models have some defects. This thesis therefore studies several issues in TAN-based text categorization.

On the one hand, this thesis studies the existing TAN text categorization model BL-TAN in depth and points out three problems. First, BL-TAN does not take into account features that do not appear in the text. To address this, the thesis combines TAN with the multivariate Bernoulli text categorization model of Naive Bayes and proposes the first improved algorithm, BNL-TAN. Experimental results show that BNL-TAN has better classification performance than BL-TAN. Second, BL-TAN ignores word frequency information, which is very important for features. To address this, the thesis combines TAN with the multinomial model of Naive Bayes and proposes the second improved algorithm, MUL-TAN. Experimental results show that MUL-TAN performs significantly better than BNL-TAN. Finally, BL-TAN, BNL-TAN, and MUL-TAN all share a threshold selection problem. Drawing on the search-and-score idea of traditional Bayesian network learning, the thesis uses the strategies of "fixed network" and "sequential searching" and proposes an automatic TAN text categorization framework, ATAN, which removes the threshold selection problem entirely. Experimental results show that ATAN achieves exactly the same classification performance as the methods whose best thresholds are selected manually.

On the other hand, this thesis studies the framework and main learning methods of ensemble learning and proposes three TAN-based ensemble models. All three use voting to produce the combined decision but differ in how they generate the individual classifiers. AdaM1-TAN combines TAN with the AdaBoost.M1 algorithm, learning different individual classifiers by repeatedly adjusting the weight distribution over the training set. EBag-TAN extends the idea of bagging, obtaining diverse individual classifiers by randomly selecting the root variable when the weighted undirected tree is converted to a directed tree during TAN structure construction. FRS-TAN uses a feature-set-based ensemble method, which randomly selects feature subsets and obtains different individual classifiers by learning on those subsets. In the experiments, the three ensemble models are applied to text categorization; the thesis compares their classification performance and analyzes the experimental results.
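
The structure-construction step mentioned above (a weighted undirected tree converted to a directed tree from a chosen root) can be illustrated with a minimal sketch of the standard TAN procedure: weight each feature pair by class-conditional mutual information, take a maximum spanning tree, then direct the edges away from a root. This is an illustration under assumptions, not the thesis's code: the function names, the binary word-presence features, and the toy data are all hypothetical, and the thresholding of BL-TAN, the search in ATAN, and the ensemble voting are not reproduced.

```python
import math
from collections import Counter
from itertools import combinations


def conditional_mutual_information(X, y, i, j):
    """Estimate I(X_i; X_j | C) from raw counts (no smoothing)."""
    n = len(y)
    n_c = Counter(y)
    n_ic = Counter((row[i], c) for row, c in zip(X, y))
    n_jc = Counter((row[j], c) for row, c in zip(X, y))
    n_ijc = Counter((row[i], row[j], c) for row, c in zip(X, y))
    total = 0.0
    for (a, b, c), count in n_ijc.items():
        # P(a,b,c) * log( P(a,b|c) / (P(a|c) * P(b|c)) ), written with counts
        ratio = count * n_c[c] / (n_ic[(a, c)] * n_jc[(b, c)])
        total += (count / n) * math.log(ratio)
    return total


def maximum_spanning_tree(num_features, weight):
    """Kruskal's algorithm on the complete feature graph; returns undirected edges."""
    parent = list(range(num_features))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    edges = sorted(combinations(range(num_features), 2),
                   key=lambda e: weight[e], reverse=True)
    tree = []
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:  # the edge joins two components, so it creates no cycle
            parent[ru] = rv
            tree.append((u, v))
    return tree


def orient_tree(tree, root):
    """Direct every edge away from `root`; returns {feature: parent feature or None}."""
    neighbours = {}
    for u, v in tree:
        neighbours.setdefault(u, []).append(v)
        neighbours.setdefault(v, []).append(u)
    parent_of = {root: None}
    stack = [root]
    while stack:
        u = stack.pop()
        for v in neighbours.get(u, []):
            if v not in parent_of:
                parent_of[v] = u
                stack.append(v)
    return parent_of


def learn_tan_structure(X, y, root=0):
    """Return the feature-to-feature parents of a TAN; the class node is an implicit parent of all."""
    d = len(X[0])
    weight = {(i, j): conditional_mutual_information(X, y, i, j)
              for i, j in combinations(range(d), 2)}
    return orient_tree(maximum_spanning_tree(d, weight), root)


# Hypothetical toy data: three binary word-presence features, two classes.
# Feature 1 always equals feature 0; feature 2 is independent of both within each class.
X = [[1, 1, 0], [1, 1, 1], [0, 0, 0], [0, 0, 1],
     [1, 1, 0], [0, 0, 1], [1, 1, 1], [0, 0, 0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

for root in (0, 1, 2):
    print(root, learn_tan_structure(X, y, root))
```

Changing `root` leaves the undirected tree unchanged but alters which feature conditions on which, which is the source of diversity that the EBag-TAN description relies on.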
Keywords/Search Tags: Text Categorization, TAN, Ensemble Learning, Bayesian Network, Naive Bayes