
Research On High-Performance Text Categorization

Posted on: 2007-07-09  Degree: Doctor  Type: Dissertation
Country: China  Candidate: S B Tan  Full Text: PDF
GTID: 1118360185995689  Subject: Computer system architecture
Abstract/Summary:
The rapid growth of text information on the Internet poses new demands on text classification with respect to both accuracy and speed: classifiers must not only become more accurate but also run faster. To meet this challenge, the author conducts extensive research from the perspectives of feature selection and learning algorithms, and achieves significant progress.

Drawing on attribute reduction based on the discernibility matrix in rough set theory, the author proposes several rough-set-based text feature selection algorithms, namely DB1, DB2, and LDB. Experimental results indicate that DB2 and LDB achieve accuracy nearly equal to that of Information Gain, and can even surpass Information Gain when the number of selected features is small. Meanwhile, the running time of DB2 and LDB is about the same as that of Document Frequency, Mutual Information, or CHI Statistics, and markedly lower than that of Information Gain. (A sketch of the Information Gain baseline appears after this abstract.)

The "No Free Lunch" theorem indicates that no pattern classification algorithm is inherently superior to all others; every such algorithm suffers from "classifier bias" to some extent, because every classifier rests on some hypothesis (model). In general, this bias raises both training-set and test-set error rates. The author therefore uses misclassified examples to revise the classifier model online, which is the basic idea of the "DragPushing" strategy. Applying this strategy to three base classifiers, the Centroid Classifier, the Naïve Bayes Classifier, and the Nearest Neighbor Classifier, yields three refined classifiers: RCC, RNB, and RKNN. Among the three, RCC performs best: experimental results indicate that its precision approaches that of the state-of-the-art SVM, while its running time scales linearly with the size of the training set and is therefore far lower than that of SVM. (A sketch of the centroid refinement follows below.)

However, the DragPushing strategy reduces only the empirical error, not the generalization error. To address this, the author requires not only that the similarity (sim1) of each training example to its true class exceed its similarity (sim2) to every other class, but also that a margin of at least a given width separate sim1 from sim2. The proposed algorithm therefore refines the classifier model using not only the misclassified examples but also the small-margin examples (a margin-aware sketch is also given below). The experimental results indicate this algorithm can not...
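For concreteness, the following is a minimal Python sketch of the Information Gain criterion that the abstract uses as its main baseline. The proposed DB1/DB2/LDB algorithms depend on discernibility-matrix details not given in the abstract, so they are not reproduced here; all function and variable names below are illustrative, not from the dissertation.

    import math
    from collections import Counter, defaultdict

    def information_gain_scores(docs, labels):
        """Score each term by information gain over the class labels.

        docs: list of token sets, one per document.
        labels: class label of each document.
        Returns a dict {term: IG(term)} where IG(t) = H(C) - H(C|t).
        """
        n = len(docs)
        class_counts = Counter(labels)
        # Entropy of the class distribution, H(C).
        h_c = -sum((c / n) * math.log2(c / n) for c in class_counts.values())

        # For each term, count documents of each class that contain it.
        term_class = defaultdict(Counter)
        term_df = Counter()
        for doc, y in zip(docs, labels):
            for t in doc:
                term_class[t][y] += 1
                term_df[t] += 1

        def cond_entropy(counts, total):
            if total == 0:
                return 0.0
            return -sum((c / total) * math.log2(c / total)
                        for c in counts.values() if c > 0)

        scores = {}
        for t, df in term_df.items():
            present = term_class[t]
            absent = Counter({y: class_counts[y] - present[y] for y in class_counts})
            # H(C|t): weighted entropy over documents with and without t.
            h_given_t = (df / n) * cond_entropy(present, df) \
                      + ((n - df) / n) * cond_entropy(absent, n - df)
            scores[t] = h_c - h_given_t
        return scores

Keeping the top-scoring terms from such a ranking is the standard filter-style selection that DB2 and LDB are compared against in accuracy and running time.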
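The DragPushing idea can be illustrated on the centroid classifier: for each misclassified training example, the true-class centroid is dragged toward the example and the wrongly predicted centroid is pushed away. The sketch below assumes tf-idf row vectors and cosine-style similarity; the step size eta and the number of passes epochs are illustrative assumptions, not values from the dissertation.

    import numpy as np

    def refine_centroids(X, y, centroids, eta=0.1, epochs=5):
        """DragPushing-style refinement of a centroid classifier (sketch).

        X: (n_docs, n_terms) tf-idf matrix; y: integer class labels;
        centroids: (n_classes, n_terms) initial class centroids.
        """
        C = centroids.copy()
        for _ in range(epochs):
            for i in range(X.shape[0]):
                xi = X[i]
                # Similarity of the example to each normalized centroid.
                norms = np.linalg.norm(C, axis=1) + 1e-12
                sims = (C @ xi) / norms
                pred = int(np.argmax(sims))
                if pred != y[i]:
                    C[y[i]] += eta * xi   # drag: pull the true centroid closer
                    C[pred] -= eta * xi   # push: move the wrong centroid away
        return C

Because each pass touches every training example once, the refinement cost grows linearly with the training-set size, which is consistent with the scaling behavior the abstract reports for RCC.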
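The margin-based refinement changes only the update condition: an example triggers an update not just when it is misclassified, but whenever sim1 - sim2 falls below a margin. A sketch under the same assumptions as above (the margin value is likewise an illustrative assumption):

    import numpy as np

    def refine_centroids_margin(X, y, centroids, eta=0.1, margin=0.05, epochs=5):
        """Margin-aware DragPushing variant (sketch): update on misclassified
        and small-margin examples alike."""
        C = centroids.copy()
        for _ in range(epochs):
            for i in range(X.shape[0]):
                xi = X[i]
                norms = np.linalg.norm(C, axis=1) + 1e-12
                sims = (C @ xi) / norms
                sim_true = sims[y[i]]                # sim1: true class
                rival = int(np.argmax(np.delete(sims, y[i])))
                if rival >= y[i]:                    # map back to original index
                    rival += 1
                # Update whenever the margin between sim1 and the best
                # rival similarity (sim2) is too small, including sim1 < sim2.
                if sim_true - sims[rival] < margin:
                    C[y[i]] += eta * xi
                    C[rival] -= eta * xi
        return C

Note that setting margin to 0 recovers the plain DragPushing update in the previous sketch, so this variant strictly extends the set of examples used to refine the model.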
Keywords/Search Tags: Feature Selection, Feature Abstraction, Text Classification, Text Mining, Machine Learning, Information Retrieval