χ² Statistics-based Chinese Text Categorization Feature Selection Method

Posted on:2010-03-11

Degree:Master

Type:Thesis

Country:China

Candidate:T Xiao

GTID:2208360275952215

Subject:Computer software and theory

Abstract/Summary:

With the development of the World Wide Web, the number of documents on the Internet is growing rapidly. Finding the information one needs in this mass of material is like looking for a needle in a haystack. How can useful information be obtained from such a large volume of complex text? Text categorization is one of the most important approaches, and feature selection methods and classification algorithms are its main research directions. Text feature selection is an important part of text categorization and directly affects classification precision.

Building on a comprehensive analysis of feature selection methods for text categorization, this paper focuses on the χ² statistics feature selection method. The traditional χ² statistic has two limitations: 1) It considers only the document frequency of a feature across all texts and ignores how often the feature occurs within a single text, so it is unreliable for features with low document frequency. A term that appears frequently in only a few documents of a category, such as a specialist term, may contribute most to categorization and represent the category well, yet the traditional χ² approach does not take this case into account. 2) A term may appear frequently in other classes but not in the specified class. Such a term clearly cannot represent the specified class, yet the traditional χ² approach does not take this case into account either.

To overcome these shortcomings, this paper combines criteria from traditional statistical methods, such as document frequency and class accuracy, to improve the χ² approach.
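
For reference, the traditional χ² score discussed above is usually computed from a 2×2 table of document counts as N(AD − CB)² / ((A+C)(B+D)(A+B)(C+D)). The following is a minimal sketch of that standard formula, not the thesis's own code; the function name and the example counts are made up for illustration. Because the inputs are document counts only, the number of times a term occurs inside a document cannot influence the score (limitation 1), and because the difference AD − CB is squared, a term concentrated outside the category scores exactly as high as one concentrated inside it (limitation 2).

```python
def chi_square(a, b, c, d):
    """Traditional chi-square score of a term for one category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that do not contain the term
    d: documents outside the category that do not contain the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table (e.g. the term occurs in every document).
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


# Hypothetical counts: 50 documents in the category, 50 outside it.
# Term found in 40 in-category documents and 5 outside documents.
inside = chi_square(a=40, b=5, c=10, d=45)
# Mirror case: term found in 5 in-category documents and 40 outside.
outside = chi_square(a=5, b=40, c=45, d=10)

print(inside, outside)  # both scores are identical
```

In this made-up example both terms receive the same score, although only the first one is a plausible representative of the category.
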
Feature terms that appear frequently in one category represent that category well, so frequency is taken into account. A helpful feature term should appear mostly in one category rather than in all categories, so concentration among categories is taken into account. A feature term that is evenly distributed among the documents of a category is helpful to that category, so distribution within categories is taken into account.

The other contribution of this paper is a Chinese text categorization system consisting of three parts: word segmentation, feature selection, and text categorization. The parts are independent but share a consistent interface, so each part can conveniently use the others, and a change to one part is transparent to the rest. This makes it easy to improve one part without affecting the others.

A comparative experiment was carried out to verify the effectiveness of the improved χ² approach. The results show that the improved χ² approach is superior to the traditional χ² approach in feature selection, which verifies the effectiveness and feasibility of the improved χ² approach.
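
The abstract names three criteria (frequency, concentration among categories, distribution within a category) but does not give their formulas. The sketch below shows one plausible way to measure each of them on a tokenized corpus. The definitions, the function name `term_measures`, and the toy corpus are all assumptions made for illustration, not the thesis's method, and how the values would be combined with χ² is left open.

```python
def term_measures(docs, term, category):
    """Illustrative measures for one term and one category.

    docs: list of (category_label, list_of_tokens) pairs.
    Returns (frequency, concentration, coverage).
    """
    in_cat = [tokens for label, tokens in docs if label == category]

    # Frequency: total occurrences of the term inside the category.
    freq_in_cat = sum(tokens.count(term) for tokens in in_cat)

    # Concentration: share of all occurrences of the term that fall
    # inside the category (1.0 means the term occurs nowhere else).
    freq_all = sum(tokens.count(term) for _, tokens in docs)
    concentration = freq_in_cat / freq_all if freq_all else 0.0

    # Distribution: share of the category's documents that contain
    # the term at least once.
    coverage = (
        sum(term in tokens for tokens in in_cat) / len(in_cat)
        if in_cat else 0.0
    )
    return freq_in_cat, concentration, coverage


# Toy corpus (already segmented into tokens), purely hypothetical.
docs = [
    ("sports", ["goal", "match", "goal"]),
    ("sports", ["match", "team"]),
    ("finance", ["market", "goal"]),
    ("finance", ["stock", "market"]),
]

frequency, concentration, coverage = term_measures(docs, "goal", "sports")
print(frequency, concentration, coverage)
```

Here "goal" occurs twice in the sports category, two thirds of its occurrences fall in that category, and it appears in half of the category's documents.
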

Keywords/Search Tags:

Text categorization, Feature selection, χ² approach, Chinese text categorization

Related items

1	An Improved Approach To CHI In Feature Selection Of Chinese Text Categorization
2	Research And Implementation On Web Chinese Text Categorization Technology
3	Research And Implementation Of The Automatic Chinese Text Categorization
4	Research On Chinese Text Categorization Algorithms Based On Technology Text
5	Design And Realization Of Automated Text Categorization System For Chinese Documents Based On Relevancy
6	Research And Implementation Of Chinese Text Categorization Methods Based On Tree-like Keywords Set
7	The Research And Implementation Of Automatic Text Categorization For Chinese Web Documents
8	A Study On Text Categorization Based On Machine Learning
9	The Studies On Chinese Text Categorization Based On PSO And SVM
10	Study On Chinese Text Categorization