Study On Feature Selection Of Chinese Document Categorization

Posted on: 2007-11-30
Degree: Master
Type: Thesis
Country: China
Candidate: S M Peng
Full Text: PDF
GTID: 2178360185974910
Subject: Computer system architecture
Abstract/Summary:
With the rapid development of networks and information technology, people have access to more and more knowledge. However, it is difficult to locate a specific piece of knowledge quickly in such a vast body of information. Knowledge classification techniques emerged in response to this contradiction and attracted wide attention as soon as they appeared. Document classification, as one form of knowledge classification, has become an active research topic.

Feature selection is an important issue in document classification. This paper mainly studies the traditional TFIDF algorithm and finds that it has two limitations: 1) It does not take into account the inter-category distribution of feature terms. If a feature term is evenly distributed among categories, it contributes almost nothing to classification; conversely, if a feature term is concentrated in one category and rarely appears in the others, it represents that category well. Traditional TFIDF captures neither case. 2) It does not take into account the inner-category distribution of feature terms. If a feature term is evenly distributed within a category, it represents that category well; if it appears in only a few documents of the category and not in the rest, it cannot represent the category.

In response to these shortcomings, this paper proposes a measure that improves TFIDF by using the inter-category and inner-category distribution information of feature terms. Variance is a statistic that describes the spread of a random variable, and it is used here to describe the inter-category distribution of a feature term. If the variance is small, the feature term is evenly distributed among categories and contributes little to classification, so the variance is used to decrease the weight of that term. The inner-category distribution of a feature term is described by its inner-category variance. In contrast to the inter-category case, the smaller this variance, the better the term represents the category, so the weight of the term should be increased.

The other contribution of this paper is the application of a genetic algorithm to feature selection. We do not adopt the traditional idea that selection is done in every document, but adopt the...
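
The variance-based adjustment of TFIDF described in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual formula, so everything specific here is an assumption made for illustration: the function names, the choice to measure inter-category distribution as the variance of per-category document shares, the choice to measure inner-category distribution as the variance of relative term frequency within the document's category, and the combination `tfidf * inter_variance / (1 + inner_variance)`. It is not the thesis's method, only one way such a weighting might look.

```python
import math
from collections import Counter
from statistics import pvariance


def tfidf(docs):
    """Classic TF-IDF for tokenized documents: tf(t, d) * log(N / df(t))."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in Counter(doc).items()}
        for doc in docs
    ]


def inter_category_variance(term, docs, labels):
    """Variance, across categories, of the share of documents containing `term`."""
    total = Counter(labels)
    hits = Counter(lab for doc, lab in zip(docs, labels) if term in doc)
    return pvariance([hits[c] / total[c] for c in total])


def inner_category_variance(term, docs, labels, category):
    """Variance of the term's relative frequency over documents of one category."""
    freqs = [doc.count(term) / len(doc) for doc, lab in zip(docs, labels) if lab == category]
    return pvariance(freqs)


def adjusted_weights(docs, labels):
    """Hypothetical combination (an assumption, not the thesis formula):
    scale TF-IDF up with inter-category variance and down with inner-category variance."""
    result = []
    for doc_weights, lab in zip(tfidf(docs), labels):
        result.append({
            t: w * inter_category_variance(t, docs, labels)
            / (1.0 + inner_category_variance(t, docs, labels, lab))
            for t, w in doc_weights.items()
        })
    return result


# Toy corpus (invented for illustration): "new" is spread evenly over both
# categories, while "ball" occurs in every sport document and nowhere else.
docs = [
    ["ball", "team", "new"],
    ["ball", "goal", "win"],
    ["stock", "bank", "new"],
    ["stock", "market", "rise"],
]
labels = ["sport", "sport", "finance", "finance"]

plain = tfidf(docs)[0]
adjusted = adjusted_weights(docs, labels)[0]
for term in ("ball", "team", "new"):
    print(term, round(plain[term], 4), round(adjusted[term], 4))
```

In this toy corpus, plain TFIDF gives "ball" and "new" the same weight in the first document, because both occur once and both appear in two of the four documents. Under the assumed adjustment, "new" drops to zero because its inter-category variance is zero, and "ball" outranks "team" because it is spread evenly across the sport documents.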
Keywords/Search Tags: Feature selection, Feature vector, Vector space model, Genetic algorithm