Study On Feature Selection Of Chinese Document Categorization

Posted on:2007-11-30

Degree:Master

Type:Thesis

Country:China

Candidate:S M Peng

Full Text:PDF

GTID:2178360185974910

Subject:Computer system architecture

Abstract/Summary:

With the rapid development of networks and information technology, people have access to more and more knowledge. However, it is difficult to locate a specific piece of knowledge quickly in the vast world of information. Knowledge classification techniques arose in response to this contradiction and attracted wide attention as soon as they emerged. Document classification, as one form of knowledge classification, has likewise become an active research topic.

Feature selection is an important issue in document classification. This paper mainly studies the traditional TFIDF algorithm and finds that it has several limitations:

1) It does not take into account the inter-category distribution of feature terms. A feature term that is evenly distributed among categories contributes almost nothing to classification; conversely, a feature term that is concentrated in one category and rarely appears in the others is a good representative of that category. Traditional TFIDF captures neither case.

2) It does not take into account the inner-category distribution of feature terms. A feature term that is evenly distributed across the documents of a category is a good representative of that category, whereas a feature term that appears in only a few documents of the category and not in the rest cannot represent it.

In response to these shortcomings, this paper proposes a measure that improves TFIDF by using the inter-category and inner-category distribution information of feature terms. Variance, an index that describes how the values of a random variable are spread, is used to describe the inter-category distribution of a feature term. If the variance is small, the feature term is evenly distributed among categories and contributes little to classification, so the variance is used to decrease the weight of that term. The inner-category distribution of a feature term is described by its inner-category variance. In contrast to the inter-category case, the smaller the inner-category variance of a feature term, the better it represents the category, so its weight should be increased.

The other work of this paper is to apply a genetic algorithm to feature selection. We do not adopt the traditional idea that selection is done in every document, but adopt the...
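
The variance-based adjustment to TFIDF described above can be illustrated with a minimal Python sketch. The abstract does not give the actual formula, so everything specific here is an assumption made for illustration: the function name `variance_adjusted_weights`, the use of raw term counts, the per-category mean count as the term-frequency part, and the combination rule (multiply by the inter-category variance, divide by one plus the inner-category variance). It should not be read as the author's method.

```python
import math
from collections import Counter, defaultdict
from statistics import pvariance


def variance_adjusted_weights(docs):
    """Return {(category, term): weight} for a list of (category, tokens) pairs.

    Hypothetical combination rule (NOT taken from the thesis):
        weight = mean_count_in_category * idf * inter_var / (1 + inner_var)
    """
    n_docs = len(docs)
    df = Counter()                      # number of documents containing each term
    counts_by_cat = defaultdict(list)   # category -> list of per-document term counts
    for category, tokens in docs:
        df.update(set(tokens))
        counts_by_cat[category].append(Counter(tokens))

    weights = {}
    for term, doc_freq in df.items():
        idf = math.log(n_docs / doc_freq)
        # Mean count of the term per document, computed separately for each category.
        cat_mean = {
            cat: sum(c[term] for c in rows) / len(rows)
            for cat, rows in counts_by_cat.items()
        }
        # Inter-category variance: near zero when the term is spread evenly
        # over the categories, so multiplying by it lowers the weight.
        inter_var = pvariance(list(cat_mean.values()))
        for cat, rows in counts_by_cat.items():
            # Inner-category variance: small when the term occurs evenly in the
            # documents of this category, so dividing by (1 + it) raises the weight.
            inner_var = pvariance([c[term] for c in rows])
            weights[(cat, term)] = cat_mean[cat] * idf * inter_var / (1.0 + inner_var)
    return weights


# Toy corpus, invented for illustration only.
toy_corpus = [
    ("sports", ["ball", "report"]),
    ("sports", ["ball", "goal", "goal", "goal"]),
    ("sports", ["ball"]),
    ("finance", ["bank", "report"]),
    ("finance", ["bank"]),
    ("finance", ["bank", "stock"]),
]
weights = variance_adjusted_weights(toy_corpus)
```

On this toy corpus, "report" occurs equally often in both categories, so its inter-category variance and therefore its weight are zero. "ball" and "goal" have the same mean count in the sports category, but "ball" occurs once in every sports document while "goal" is concentrated in a single one, so "ball" receives the higher weight for sports under this assumed rule.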

Keywords/Search Tags:

Feature selection, Feature vector, Vector space model, Genetic algorithm

Related items

1	An Improved Approach To Weighting Chinese Terms Using Information Gain
2	Research On Image Retrieval Method Based On Multi-Feature
3	A Study On Feature Selection Algorithms Based On Support Vector Machine And Its Application
4	Research On Feature Selection Of Text Classification
5	Research Of Chinese Page Automatic Classification Based On Vector Space Model
6	Improvement And Application To Weighting Terms Based On Text Classification
7	Text Representation Model And Feature Selection Algorithm
8	A Feature Selection Method Based On Multi-Objective Genetic Algorithm And Support Vector Machines
9	Study On Intelligent Optimization Algorithms And Its Application For Classification Problems
10	Research On Feature Extraction, Selection And Classification Algorithms For Pulmonary CAD