
Feature Selection Methods For Text Categorization

Posted on: 2016-08-18
Degree: Master
Type: Thesis
Country: China
Candidate: Y J Chen
Full Text: PDF
GTID: 2308330479991051
Subject: Computer technology
Abstract/Summary:
With the rapid development of the Internet and the explosive growth of information resources, it has become very difficult for people to efficiently find the information they need in such a huge volume of data. If computers can give users appropriate support in categorizing and retrieving information, this difficulty can be greatly reduced. Because most information is in the form of text, classifying vast amounts of text to improve retrieval efficiency has become an active research area in recent years.

A major challenge in text categorization is the high dimensionality of the feature space, which has a strong impact on overall classification performance. Selecting effective features can therefore greatly improve the performance and precision of text categorization. However, although feature selection has been studied extensively, it remains difficult to choose a suitable method when facing a new task.

In this thesis, several feature selection algorithms are introduced and compared with each other, with a focus on six of the more significant ones. Experiments are carried out on several open text categorization corpora, and they lay the groundwork for the feature selection algorithms proposed afterwards.

On the basis of this comparative analysis, the thesis proposes a new feature selection algorithm: feature selection based on the categorization membership of features (FMFS). It addresses a weakness of traditional algorithms, which do not adequately consider how feature frequency is distributed between classes and within a class. The experimental results show that FMFS performs better than the other classical feature selection algorithms.

Imbalanced text is common in many fields. Traditional feature selection methods usually assume by default that the class distribution of the training samples is balanced or close to balanced. Because of the differences in the spatial distribution of the data, they therefore pay more attention to the features of large categories and ignore the features of small categories, so classification results for small categories are not ideal. To alleviate this problem, the thesis proposes a second feature selection algorithm, Strengthen Feature Category Information feature selection (SFCI). It uses three information factors: the category distribution of the data, and the feature frequency distributions between classes and within a class. In this way it avoids the interference caused by differences in category distribution. The experimental results show that SFCI more effectively improves classification performance, both for small categories and overall.
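The abstract does not name the six classical algorithms it compares, and it does not give formulas for FMFS or SFCI. As an illustration of what a classical feature selection step for text categorization can look like, the following is a minimal sketch of chi-square term scoring, a widely used method that is assumed here only as an example. The function names `chi_square_scores` and `select_top_k` and the toy corpus are hypothetical and do not come from the thesis.

```python
from collections import Counter


def chi_square_scores(docs, labels):
    """Score each term by its maximum chi-square statistic over all classes.

    docs   : list of token lists, one per document
    labels : list of class labels, aligned with docs
    Illustrative only; this is not the FMFS or SFCI algorithm.
    """
    n = len(docs)
    class_sizes = Counter(labels)   # documents per class
    df = Counter()                  # documents containing each term
    df_class = Counter()            # (term, class) -> documents containing term

    for tokens, label in zip(docs, labels):
        for term in set(tokens):    # count document frequency, not term frequency
            df[term] += 1
            df_class[(term, label)] += 1

    scores = {}
    for term, n_term in df.items():
        best = 0.0
        for cls, n_cls in class_sizes.items():
            a = df_class[(term, cls)]   # in class, contains term
            b = n_term - a              # other classes, contains term
            c = n_cls - a               # in class, lacks term
            d = n - n_term - c          # other classes, lacks term
            denom = (a + c) * (b + d) * (a + b) * (c + d)
            if denom:
                best = max(best, n * (a * d - b * c) ** 2 / denom)
        scores[term] = best
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring terms; ties are broken alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy corpus (hypothetical): two balanced classes of two documents each.
docs = [
    ["goal", "match", "team"],
    ["team", "goal", "win"],
    ["stock", "market", "win"],
    ["market", "price", "stock"],
]
labels = ["sport", "sport", "finance", "finance"]

scores = chi_square_scores(docs, labels)
print(select_top_k(scores, 4))
```

On this toy corpus, the term "win" appears equally often in both classes and receives a score of zero, while terms confined to one class score highest. A score of this kind is computed from per-class document counts, so with strongly imbalanced classes the ranking can be dominated by the large categories, which is the kind of bias the abstract says SFCI is designed to reduce.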
Keywords/Search Tags: Text Categorization, Feature Selection, Category Differentiation Ability, Imbalance Text