
Research On The Ensemble Classification Algorithm Of Web Text

Posted on: 2010-12-28
Degree: Master
Type: Thesis
Country: China
Candidate: Q K Ruan
Full Text: PDF
GTID: 2178330338482181
Subject: Computer application technology
Abstract/Summary:
With the development of Internet technologies, the amount of information on the Internet is growing exponentially, and it is very difficult for users to find what they want in this mass of information. One important research focus is how to handle this large volume of online documents. Text classification assigns information extracted from the Internet to categories for convenient retrieval. This thesis mainly studies algorithms related to text classification and hypertext classification.

Firstly, this thesis introduces the general development and some techniques of information classification. It then analyses and compares the performance of some typical classification algorithms, providing the basic theoretical support for text classification and hypertext classification.

Secondly, in the research on text classification, this thesis studies and analyses Bayes classification, which is based on probability theory. This classification is built on the assumption of attribute independence. However, there are many relations between text attributes, and text information contains much "noise", so the assumption is not met. Using Tree Augmented Naive Bayes (TAN), this thesis proposes an ensemble classification algorithm with Bayes (ECB). K-means clustering is used to extract independent attribute subsets, TAN classifiers are constructed on these subsets, and these classifiers are combined into an ensemble. Experiments are carried out on the 20 Newsgroups and mini-newsgroups data sets, and the results show that ECB achieves more robust performance.

Thirdly, in the research on hypertext classification, this thesis studies hypertext information rules and analyses the classification performance obtained with these rules. By extracting the hypertext rules, several classifiers are constructed, and a neural network is used to combine their results. An ensemble hypertext meta-information classification (EHC) is thus proposed. This classification can integrate structural information effectively. The experiments show that EHC achieves better performance than using only a single rule.
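
As a rough illustration of the subset-ensemble idea behind ECB, the following minimal Python sketch trains one Bayes classifier per attribute subset and averages their posterior probabilities. It is not the thesis's implementation: plain multinomial naive Bayes stands in for TAN, the attribute subsets are written by hand instead of being found by K-means clustering, and the names `SubsetNaiveBayes` and `ensemble_predict`, as well as the toy documents, are hypothetical.

```python
import math
from collections import Counter, defaultdict


class SubsetNaiveBayes:
    """Multinomial naive Bayes that only looks at one subset of the attributes.

    Assumption: plain naive Bayes is used here as a stand-in for TAN.
    """

    def __init__(self, vocab_subset):
        self.vocab = set(vocab_subset)

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            for tok in tokens:
                if tok in self.vocab:
                    self.word_counts[label][tok] += 1
        return self

    def posteriors(self, tokens):
        """Return P(class | tokens), using only tokens inside this subset."""
        total_docs = sum(self.class_counts.values())
        log_scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)  # Laplace smoothing
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                if tok in self.vocab:
                    score += math.log((counts[tok] + 1) / denom)
            log_scores[label] = score
        # Normalise the log scores into probabilities (stable softmax).
        top = max(log_scores.values())
        exp = {c: math.exp(s - top) for c, s in log_scores.items()}
        z = sum(exp.values())
        return {c: v / z for c, v in exp.items()}


def ensemble_predict(models, tokens):
    """Combine the subset classifiers by averaging their posteriors."""
    combined = Counter()
    for model in models:
        for label, p in model.posteriors(tokens).items():
            combined[label] += p / len(models)
    return max(combined, key=combined.get)


# Toy data (hypothetical); each document is a list of tokens.
docs = [
    ["goal", "match", "team"],
    ["team", "score", "goal"],
    ["cpu", "memory", "disk"],
    ["disk", "cpu", "kernel"],
]
labels = ["sport", "sport", "tech", "tech"]

# Hand-written attribute subsets; the thesis derives them with K-means.
subsets = [
    ["goal", "match", "cpu", "memory"],
    ["team", "score", "disk", "kernel"],
]
models = [SubsetNaiveBayes(s).fit(docs, labels) for s in subsets]
```

With these toy inputs, `ensemble_predict(models, ["goal", "team"])` selects `"sport"` and `ensemble_predict(models, ["cpu", "kernel"])` selects `"tech"`. Averaging posteriors is only one possible combination rule; the abstract does not state which rule ECB uses.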
Keywords/Search Tags: Text classification, Hypertext classification, Ensemble algorithm, Meta-information