
Algorithm Of Text Classification And Its Application

Posted on: 2005-12-01
Degree: Master
Type: Thesis
Country: China
Candidate: Y Peng
Full Text: PDF
GTID: 2168360125958750
Subject: Computer application technology
Abstract/Summary:
With the development of Internet technology, the amount of information on the Internet is increasing exponentially. One important research question is how to handle this large volume of online documents. Information classification is a crucial part of information processing: its task is to sort information extracted from the Internet into categories for convenient retrieval. This thesis studies algorithms for text classification and hypertext classification.

The thesis first introduces the general development and some techniques of information classification. It then analyzes and compares the performance of several typical classification algorithms, which provides the theoretical basis for the work on text classification and hypertext classification.

In text classification, the thesis focuses on improving semi-supervised classification algorithms. High classification accuracy requires a large labeled training set, yet labeled documents are scarce; the thesis addresses this conflict in two ways. First, to enlarge the training set, it proposes an EM_SVM classification algorithm based on an analysis of the traditional SVM algorithm and the EM_NB algorithm. Experimental results show that, with the same number of labeled documents, EM_SVM, which uses unlabeled documents during training, performs better than SVM, and that it achieves higher classification accuracy than EM_NB on small data sets. Second, to improve the classifier training method, the thesis presents a new cooperative training classification algorithm in which TFIDF and NB classifiers cooperate to combine labeled and unlabeled documents. Experimental results show that the new algorithm has higher classification accuracy and lower average error than comparable algorithms.

In hypertext classification, the thesis concentrates on combining and synthesizing the rules for using hypertext information. Hypertext is varied, and using any single rule gives unstable performance. After analyzing different rules for using hypertext, the thesis presents a new hypertext classification algorithm based on co-weighting multiple kinds of information. Experimental results show that the new algorithm performs better than using any single kind of hypertext information alone.
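
As a rough illustration of how unlabeled documents can enter training, the following Python sketch runs a generic expectation-maximization loop around a multinomial Naive Bayes classifier, which is the general idea the name EM_NB suggests. It is not the thesis's EM_SVM algorithm, and it may differ from the EM_NB variant the thesis evaluates. The function names, the toy documents, the Laplace smoothing, and the iteration count are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_nb(docs, weights, classes, vocab):
    """M-step: fit multinomial Naive Bayes from softly labeled documents.

    docs is a list of token lists; weights is a parallel list of
    {class: probability} dicts (hard labels are just 0/1 weights).
    """
    prior = {c: 1.0 for c in classes}          # Laplace smoothing on class mass
    word = {c: Counter() for c in classes}     # fractional word counts per class
    for tokens, w in zip(docs, weights):
        for c in classes:
            prior[c] += w[c]
            for t in tokens:
                word[c][t] += w[c]
    total = sum(prior.values())
    log_prior = {c: math.log(prior[c] / total) for c in classes}
    log_like = {}
    for c in classes:
        denom = sum(word[c].values()) + len(vocab)   # Laplace smoothing on words
        log_like[c] = {t: math.log((word[c][t] + 1.0) / denom) for t in vocab}
    return log_prior, log_like


def posterior(tokens, model, classes):
    """E-step for one document: class probabilities under the current model."""
    log_prior, log_like = model
    scores = {
        c: log_prior[c] + sum(log_like[c][t] for t in tokens if t in log_like[c])
        for c in classes
    }
    top = max(scores.values())                 # subtract max for numerical stability
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}


def em_nb(labeled, unlabeled, classes, iterations=5):
    """Train on labeled docs, then alternate E-steps and M-steps with unlabeled docs."""
    vocab = {t for tokens, _ in labeled for t in tokens}
    vocab |= {t for tokens in unlabeled for t in tokens}
    lab_docs = [tokens for tokens, _ in labeled]
    lab_w = [{c: 1.0 if c == y else 0.0 for c in classes} for _, y in labeled]
    model = train_nb(lab_docs, lab_w, classes, vocab)
    for _ in range(iterations):
        soft = [posterior(tokens, model, classes) for tokens in unlabeled]
        model = train_nb(lab_docs + unlabeled, lab_w + soft, classes, vocab)
    return model


def predict(tokens, model, classes):
    p = posterior(tokens, model, classes)
    return max(classes, key=lambda c: p[c])


# Toy corpus (hypothetical): two labeled documents, four unlabeled ones.
classes = ["sport", "finance"]
labeled = [(["goal", "match"], "sport"), (["stock", "market"], "finance")]
unlabeled = [
    ["goal", "league", "team"],
    ["match", "team", "league"],
    ["stock", "shares", "bank"],
    ["market", "bank", "shares"],
]
model = em_nb(labeled, unlabeled, classes)
```

In this toy setup the words "bank", "shares", "team", and "league" never occur in the labeled documents, so a classifier trained on the labeled documents alone has no evidence about them. After the EM iterations, their co-occurrence with labeled-class words in the unlabeled documents lets `predict(["bank", "shares"], model, classes)` favor the finance class.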
Keywords/Search Tags:Information classification, Text classification, Hypertext classification, Cooperative training, Co-weighting hypertext information