Improvement On Mutual Information In Feature Selection Based On Composite Ratio

Posted on: 2015-01-05
Degree: Master
Type: Thesis
Country: China
Candidate: K Lu
Full Text: PDF
GTID: 2268330428469438
Subject: Computer application technology
Abstract/Summary:
Text classification remains an open and important research direction in information processing and is widely used. It involves several stages: text segmentation, feature selection, building the feature model, training the classifier, and classifying the text. Feature selection is one of the most important of these stages, because whether suitable features are selected strongly affects classification performance. This thesis targets the shortcomings of mutual information as a feature selection method and proposes ideas and methods for improving it. The main work is as follows:

1. The thesis first introduces text categorization and examines each of its stages and the related technology. It focuses on mutual information feature selection, describes the disadvantages of the traditional mutual information method, and proposes corresponding improvements.

2. Traditional mutual information considers only the document frequency of terms in the text collection; it ignores term frequency and the correlation between text categories. To address this, the thesis introduces a composite ratio into mutual information, so that term frequency, the correlation between categories, and other important information are taken into account. It also uses a balance factor to handle positive and negative correlation: the factor adjusts the proportion between the two so that the effect of negative correlation is considered. Experiments show that the improved mutual information feature selection method improves classification results.

3. Traditional mutual information also ignores the semantic information of terms. To address this, the thesis uses the semantic dictionary HowNet to build a table named "conception-domain". If a word from the text exists in the table, it is replaced by its domain value, which has a more general meaning. In this way, more semantic information is added to the selected features, and redundancy between features is reduced to some extent. Experiments with the improved mutual information show that the method effectively improves accuracy.
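The two ingredients described above can be illustrated with a minimal sketch. It is not the thesis's method: the abstract does not give the composite-ratio or balance-factor formulas, so the scoring function below is the classical document-frequency mutual information, log(P(t, c) / (P(t) P(c))), which is the baseline the thesis sets out to improve. The `CONCEPT_DOMAIN` entries, the function names, and the toy documents are all hypothetical; a real table would be built from HowNet.

```python
import math

# Hypothetical "conception-domain" table (assumption: real entries would
# come from HowNet; these four are invented for illustration only).
CONCEPT_DOMAIN = {
    "goalkeeper": "sports",
    "striker": "sports",
    "stock": "finance",
    "bond": "finance",
}


def generalize(tokens, table=CONCEPT_DOMAIN):
    """Replace each token found in the table with its more general domain value."""
    return [table.get(tok, tok) for tok in tokens]


def mutual_information(docs, labels, term, category):
    """Classical document-frequency MI: log(P(t, c) / (P(t) * P(c))).

    Only document counts are used, so how often a term occurs inside a
    document has no effect on the score.
    """
    n = len(docs)
    # Documents of the category that contain the term.
    both = sum(1 for d, l in zip(docs, labels) if term in d and l == category)
    # Documents containing the term, in any category.
    df_term = sum(1 for d in docs if term in d)
    # Documents belonging to the category.
    n_cat = sum(1 for l in labels if l == category)
    if both == 0 or df_term == 0 or n_cat == 0:
        # log(0): the term never co-occurs with the category.
        return float("-inf")
    return math.log((both * n) / (df_term * n_cat))


# Toy corpus (invented): two sports documents, two finance documents.
raw_docs = [
    ["goalkeeper", "saves"],
    ["striker", "scores"],
    ["stock", "falls"],
    ["bond", "rises"],
]
labels = ["sports", "sports", "finance", "finance"]

# Apply the table first, then score features on the generalized documents.
docs = [set(generalize(d)) for d in raw_docs]
score = mutual_information(docs, labels, "sports", "sports")
```

In this toy corpus, "goalkeeper" and "striker" collapse into the single feature "sports", which is the kind of redundancy reduction the table is meant to provide. The scoring function counts each document once per term, which is the document-frequency limitation the composite ratio is introduced to address.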
Keywords/Search Tags: Text classification, Feature selection, Mutual information, Composite ratio, Semantic