
A Study On Chinese Text Categorization

Posted on: 2009-06-08
Degree: Master
Type: Thesis
Country: China
Candidate: X G Wang
Full Text: PDF
GTID: 2178360242976832
Subject: Electronics and Communications Engineering
Abstract/Summary:
With the rapid development and spread of the Internet, the volume of electronic text information has grown enormously. Organizing and processing this mass of document data so that users can find the information they need quickly, accurately, and completely is a major challenge for information science and technology. As a key technology for organizing and processing large document collections, text classification can greatly reduce information disorder and makes it convenient for users to locate the information they require. Moreover, text classification has broad application prospects as a technical basis for information filtering, information retrieval, search engines, text databases, digital libraries, and so on.

This thesis studies text classification and its related technologies. Several methods and techniques are presented with the aim of improving speed, precision, and stability. The thesis systematically summarizes techniques for word segmentation, feature selection, classification algorithms, and performance evaluation in Chinese text categorization, and discusses three Chinese text categorization methods: the Bayesian method, k-Nearest Neighbor (kNN), and AdaBoost. The author implements three corresponding text classifiers: a naive Bayes classifier, a kNN classifier, and an AdaBoost classifier. These three classifiers are integrated into one text categorization system of high practical value.

Because of weaknesses in kNN prediction, an improved kNN with more satisfactory performance is put forward, drawing on latent semantics, feature aggregation, a semantic-chain text attribute factor, and iterative kNN.

Particular emphasis is placed on AdaBoost; the thesis discusses both its theory and an application instance. The naive Bayes classifier is an effective text categorization method, but because it is a stable classifier, it is hard to improve its performance through a Boosting procedure.
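To make the naive Bayes approach discussed above concrete, here is a minimal multinomial naive Bayes text classifier sketch in plain Python. The thesis gives no code, so the function names, the add-one (Laplace) smoothing choice, and the toy spam/ham data below are illustrative assumptions, not the author's implementation:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Train a multinomial naive Bayes model.
    docs: list of token lists; labels: parallel list of class names."""
    class_counts = Counter(labels)                 # class -> number of docs
    word_counts = defaultdict(Counter)             # class -> word -> count
    vocab = set()
    for tokens, y in zip(docs, labels):
        word_counts[y].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(model, tokens):
    """Return the class maximizing log P(class) + sum log P(word | class)."""
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for c, n_c in class_counts.items():
        total = sum(word_counts[c].values())
        lp = math.log(n_c / n_docs)                # log prior
        for w in tokens:
            # Laplace (add-one) smoothing over the vocabulary
            lp += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

# Hypothetical toy data for illustration only.
docs = [["cheap", "pills"], ["meeting", "today"],
        ["cheap", "meeting", "pills"], ["today", "report"]]
labels = ["spam", "ham", "spam", "ham"]
model = train_nb(docs, labels)
print(predict_nb(model, ["cheap", "pills"]))       # prints "spam"
```

In a real Chinese text categorization setting, the token lists would come from a word segmentation step and the vocabulary would be pruned by a feature selection method, both of which the thesis surveys separately.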
The main problem in a Boosting procedure that uses the naive Bayes classifier as its base classifier is therefore how to break that stability. Three methods for breaking the stability of the naive Bayes classifier are given: the first changes the samples of the training set, the second adopts a randomly selected feature group, and the third creates a different feature set in each iteration of the Boosting procedure by using a different method to extract text features. The three methods have their respective advantages and disadvantages, but all of them are more accurate and effective than the original naive Bayes classifier.

The experimental results make clear that the AdaBoost classifier is the most satisfactory of the three: the naive Bayes classifier is simple and fast, and the kNN classifier's performance is acceptable. All three classifiers are suitable for Chinese text categorization.
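The second stability-breaking method, training each Boosting round on a randomly selected feature group, can be sketched as a simplified AdaBoost.M1-style loop. The weak learner here is a sample-weighted naive Bayes restricted to a random vocabulary subset; the function names, subset fraction, and reweighting details are illustrative assumptions rather than the thesis' exact procedure:

```python
import math
import random
from collections import defaultdict

def weighted_nb(docs, labels, weights, features):
    """Sample-weighted naive Bayes restricted to a feature subset."""
    priors = defaultdict(float)                    # class -> total weight
    wc = defaultdict(lambda: defaultdict(float))   # class -> word -> weight
    for toks, y, w in zip(docs, labels, weights):
        priors[y] += w
        for t in toks:
            if t in features:
                wc[y][t] += w
    total_w = sum(priors.values())

    def predict(toks):
        best, best_lp = None, float("-inf")
        for c in priors:
            tot = sum(wc[c].values())
            lp = math.log(priors[c] / total_w)
            for t in toks:
                if t in features:                  # ignore out-of-subset words
                    lp += math.log((wc[c][t] + 1) / (tot + len(features)))
            if lp > best_lp:
                best, best_lp = c, lp
        return best
    return predict

def adaboost_nb(docs, labels, rounds=5, subset_frac=0.6, seed=0):
    """AdaBoost over naive Bayes, destabilized via random feature groups."""
    rng = random.Random(seed)
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        # Random feature group: the trick that breaks naive Bayes' stability.
        feats = set(rng.sample(vocab, max(1, int(subset_frac * len(vocab)))))
        h = weighted_nb(docs, labels, w, feats)
        preds = [h(d) for d in docs]
        err = sum(wi for wi, p, y in zip(w, preds, labels) if p != y)
        if err >= 0.5:                             # weak learner too weak; skip
            continue
        err = max(err, 1e-10)                      # avoid log(0) on perfect fit
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, h))
        # Reweight: increase weight of misclassified samples, then normalize.
        w = [wi * math.exp(alpha if p != y else -alpha)
             for wi, p, y in zip(w, preds, labels)]
        s = sum(w)
        w = [wi / s for wi in w]

    def predict(toks):
        votes = defaultdict(float)
        for alpha, h in ensemble:
            votes[h(toks)] += alpha                # alpha-weighted vote
        return max(votes, key=votes.get)
    return predict
```

The other two stability-breaking methods would replace the `feats` line: the first resamples the training set according to the weights, and the third swaps in a different feature extraction method each round.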
Keywords/Search Tags:Feature Selection, Text Categorization, Naive Bayesian Text Classifier, K Nearest Neighbor Classifier, Adaboost Classifier Algorithm