Text Categorization Research Based On TAN Model

Posted on:2010-06-19

Degree:Master

Type:Thesis

Country:China

Candidate:J Liu

GTID:2178360278452477

Subject:Computer Science and Technology

Abstract/Summary:

With the development of information technology, and especially the spread of the Internet, the volume of electronic text is growing rapidly, which has made text categorization an important research area. Although Bayesian methods are simple, intuitive, and stable in performance, current text categorization algorithms based on Bayesian models are mainly confined to the Naive Bayes method. The classification performance of Naive Bayes is limited by its attribute independence assumption, which leaves it unable to express dependence among attributes. By contrast, general Bayesian networks can express this dependence, but the complexity of learning them makes them impractical for text categorization. TAN (Tree-Augmented Naive Bayes) combines the simplicity of Naive Bayes with the ability of Bayesian networks to express dependence among attributes, and so offers a reasonable compromise between learning efficiency and an accurate description of the correlations among attributes. At present, little text categorization research is based on the TAN model, and the existing TAN text categorization models have some defects. This thesis therefore studies several issues in TAN-based text categorization.

On the one hand, this thesis studies the existing TAN text categorization model BL-TAN in depth and identifies three problems. First, BL-TAN does not take into account the features that do not appear in a text. To address this, the thesis combines TAN with the multi-variate Bernoulli text categorization model of Naive Bayes and proposes a first improved algorithm, BNL-TAN. Experimental results show that BNL-TAN has better classification performance than BL-TAN. Second, BL-TAN ignores word frequency information, which is very important for features. To address this, the thesis combines TAN with the multinomial model of Naive Bayes and proposes a second improved algorithm, MUL-TAN. Experimental results show that MUL-TAN performs significantly better than BNL-TAN. Finally, BL-TAN, BNL-TAN, and MUL-TAN all share a threshold selection problem. Drawing on the search-and-score approach of traditional Bayesian network learning, the thesis uses the strategies of "fixed network" and "sequential searching" to propose an automatic TAN text categorization framework, ATAN, which eliminates the threshold selection problem entirely. Experimental results show that ATAN achieves exactly the same classification performance as methods whose thresholds are manually set to their best values.

On the other hand, this thesis studies the framework and primary learning methods of ensemble learning and proposes three TAN-based ensemble models. All three use voting to combine the outputs of the individual classifiers; they differ in how the individual classifiers are generated. AdaM1-TAN combines TAN with the AdaBoost.M1 algorithm, learning different individual classifiers by continually adjusting the weight distribution over the training set. EBag-TAN extends the idea of the bagging algorithm, obtaining varied individual classifiers by randomly selecting the root variable when the weighted undirected tree is converted into a directed tree during construction of the TAN structure. FRS-TAN uses a feature-set-based ensemble method, which randomly selects feature subsets and obtains different individual classifiers by learning from them. In the experiments, these three ensemble models are applied to text categorization; the thesis compares their classification performance and analyzes the experimental results.
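For readers unfamiliar with TAN, the structure-learning step mentioned above can be sketched as follows. This is a minimal illustration of the textbook TAN construction (conditional mutual information between feature pairs, a maximum weighted spanning tree, then orientation away from a chosen root), not the thesis's BL-TAN, BNL-TAN, MUL-TAN, or ATAN algorithms, whose details the abstract does not give. The function names, the unsmoothed count-based estimates, and the toy data are all assumptions made for illustration. The `root` argument marks the point at which, according to the abstract, EBag-TAN would select a root variable at random.

```python
import math
from collections import Counter
from itertools import combinations


def conditional_mutual_information(X, y, i, j):
    """Estimate I(X_i; X_j | C) from raw counts (no smoothing; illustrative only)."""
    n = len(y)
    n_c = Counter(y)
    n_ic = Counter((row[i], c) for row, c in zip(X, y))
    n_jc = Counter((row[j], c) for row, c in zip(X, y))
    n_ijc = Counter((row[i], row[j], c) for row, c in zip(X, y))
    cmi = 0.0
    for (a, b, c), n_abc in n_ijc.items():
        # P(a, b, c) * log( P(a, b | c) / (P(a | c) * P(b | c)) )
        ratio = n_abc * n_c[c] / (n_ic[(a, c)] * n_jc[(b, c)])
        cmi += (n_abc / n) * math.log(ratio)
    return cmi


def tan_parents(X, y, root=0):
    """Return {feature: parent feature or None} for a TAN tree rooted at `root`.

    The class variable is implicitly an additional parent of every feature.
    """
    d = len(X[0])
    weight = {}
    for i, j in combinations(range(d), 2):
        w = conditional_mutual_information(X, y, i, j)
        weight[(i, j)] = weight[(j, i)] = w

    # Prim's algorithm for a maximum weighted spanning tree, grown from `root`.
    # Recording the in-tree endpoint as the parent orients every edge away
    # from the root, which turns the undirected tree into a directed one.
    parent = {root: None}
    while len(parent) < d:
        u, v = max(
            ((u, v) for u in parent for v in range(d) if v not in parent),
            key=lambda edge: weight[edge],
        )
        parent[v] = u
    return parent


# Hypothetical toy data: three binary term-presence features, two classes.
# Feature 1 always copies feature 0; feature 2 is unrelated to both.
X_toy = [
    (0, 0, 0), (1, 1, 0), (0, 0, 1), (1, 1, 1),
    (0, 0, 0), (1, 1, 1), (0, 0, 1), (1, 1, 0),
]
y_toy = [0, 0, 0, 0, 1, 1, 1, 1]

print(tan_parents(X_toy, y_toy, root=0))
```

Changing `root` changes the direction of the edges but not which feature pairs are linked (apart from ties between equal weights), so different roots give different conditional probability tables over the same dependencies.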

Keywords/Search Tags:

Text Categorization, TAN, Ensemble Learning, Bayesian Network, Naïve Bayes

Related items

1	Correlation Between The Text Classification. Word
2	A Study On Text Categorization Based On Machine Learning
3	Massive Academic Resources Classification Research For Personalized Recommender
4	Text Categorization On Machine Learning Algorithm
5	The Study Of Chinese Text Categorization Based On Naïve Bayes
6	The Research Of Tibetan Text Classification Algorithms For The Analysis Of Network Public Opinion
7	Research On XML Text Categorization Based On Bayesian Classifier
8	Chinese WEB Document Automatic Categorization
9	The Research On Text Categorization Technology Based On Hierarchical Categorization And Ensemble Learning
10	Study And Realization Of Text Categorization In Chinese Speech Recognition Results