
Research on Word Segmentation and Feature Selection in Chinese Text Classification

Posted on: 2012-08-04
Degree: Master
Type: Thesis
Country: China
Candidate: Y Li
Full Text: PDF
GTID: 2178330335950376
Subject: Computer application technology
Abstract/Summary:
Chinese automatic text classification is a process in which the computer constructs a discriminant formula and a classification algorithm according to certain classification rules, and then assigns each text to a pre-defined category. From the point of view of computer technology, text classification is a process of inducing and applying knowledge.

Since the 1990s, machine learning methods such as support vector machines, K-nearest neighbor, and naive Bayes have been applied to automatic text classification. Chinese text categorization has produced many research results, but shortcomings remain. In the text pre-processing stage, ambiguous words are segmented with either the reverse maximum matching algorithm or the maximum matching algorithm, and neither is accurate enough. In the feature selection stage, the traditional TFIDF weighting algorithm does not calculate the weights of feature items accurately enough, so its accuracy needs to be improved.

Building on existing research, we study Chinese automatic text classification and its related techniques. We propose a two-way matching algorithm and improve the TFIDF weighting algorithm. The main work of this paper is as follows:

In the text pre-processing stage, we propose a two-way matching algorithm to improve the segmentation of ambiguous words. The improved algorithm applies the reverse maximum matching algorithm (RMM for short) and the maximum matching algorithm (MM for short) in sequence. Where the results of the two algorithms differ, we delete the phrases; where they agree, we retain them. The experimental results show that the improved algorithm raises accuracy, precision, and F1 value by almost 3%.

In the feature selection stage, we improve the traditional TFIDF weighting algorithm.
We analyze the traditional TFIDF algorithm for weighting feature items and find two problems. On the one hand, it does not consider how feature items are distributed between categories, so feature items that are spread uniformly across categories, and are therefore unimportant for classification, are given high weights. On the other hand, it does not consider how feature items are distributed among the texts within a category, so feature items that appear in only a few texts of a category are given high weights. In this paper, we use information entropy to measure the uncertainty of feature items in the corpus. Experimental results show that the improved algorithm is feasible and effective.

Using LIBSVM as the classifier and the public corpus provided by Dr. Li Ronglu of Fudan University, we compare the improved algorithms with the existing ones by experiment. Using the confusion matrix, performance evaluation, and a comparison of the experimental results, we show that combining the two-way matching algorithm with the improved TFIDF gives higher classification precision, recall, and F1 value than combining either MM or RMM with the traditional TFIDF. Thus the algorithm in this paper is feasible and effective.

In this paper, we improve the algorithms in both the preprocessing stage and the feature selection stage. The research in this thesis can be applied to digital libraries, information filtering, text database management, and other fields, so it has both theoretical and practical value.
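The two-way matching step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the `max_len` window, and the toy dictionary are hypothetical, and "delete the phrases where the results differ" is interpreted here as keeping only the words that MM and RMM segment at identical positions.

```python
def forward_max_match(text, vocab, max_len=4):
    """MM: scan left to right, taking the longest dictionary word each time."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


def reverse_max_match(text, vocab, max_len=4):
    """RMM: scan right to left, taking the longest dictionary word each time."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                words.append(piece)
                j -= size
                break
    words.reverse()
    return words


def to_spans(words):
    """Convert a segmentation into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def two_way_match(text, vocab, max_len=4):
    """Keep only words that MM and RMM cut at the same positions (assumed reading)."""
    mm = forward_max_match(text, vocab, max_len)
    rmm = reverse_max_match(text, vocab, max_len)
    agreed = to_spans(mm) & to_spans(rmm)
    return [text[s:e] for s, e in sorted(agreed)]
```

With the toy dictionary {"研究", "研究生", "生命", "的", "起源"} and the input "研究生命的起源", MM yields 研究生 / 命 / 的 / 起源 while RMM yields 研究 / 生命 / 的 / 起源, so this sketch keeps only 的 and 起源 and drops the ambiguous span.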
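The abstract does not give the improved TFIDF formula, so the following Python sketch only illustrates the general idea of measuring a feature item's uncertainty across categories with information entropy. The function names and the `entropy_factor` discount are hypothetical and are not the thesis's weighting scheme.

```python
import math
from collections import Counter


def category_entropy(term, docs):
    """Shannon entropy (in bits) of a term's occurrence counts across categories.

    docs is a list of (category, tokens) pairs. A term spread evenly over
    all categories has high entropy; a term confined to one category has 0.
    """
    counts = Counter()
    for category, tokens in docs:
        n = tokens.count(term)
        if n:
            counts[category] += n
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_factor(term, docs):
    """Hypothetical discount in [0, 1]: near 0 for evenly spread terms.

    Assumes at least two distinct categories in docs. One could multiply a
    TFIDF weight by this factor; that combination is an assumption.
    """
    num_categories = len({category for category, _ in docs})
    return 1.0 - category_entropy(term, docs) / math.log2(num_categories)
```

In a two-category corpus, a term that occurs equally often in both categories gets a factor of 0, while a term that occurs in only one category gets a factor of 1, which matches the abstract's observation that uniformly distributed feature items should not receive high weights.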
Keywords/Search Tags:Text classification, Chinese word segmentation, Feature selection, TFIDF