The Research Of Text Feature Weighting Method Based On Information Entropy

Posted on: 2013-02-09; Degree: Master; Type: Thesis
Country: China; Candidate: C Feng
GTID: 2248330371976583; Subject: Computer software and theory
Abstract/Summary:
Finding useful and accurate information in a large collection of complex documents is a pressing problem in information processing, and text classification is one effective way to address it. Feature weighting is widely used as a simple and effective way to improve text classification performance. Among the many feature weighting methods, TF/IDF (term frequency-inverse document frequency) is one of the most popular. However, TF/IDF does not take into account how a feature is distributed across the document set.

To address this issue, the thesis proposes an improved TF/IDF algorithm based on information entropy. Drawing on information theory, the algorithm treats the document set as an information source that follows a certain probability distribution. The information entropy of a feature then measures its discriminating power for text classification (its classification ability). In other words, the importance of each feature to the classification task can be measured by its information entropy, and the weight of a feature should be adjusted according to its entropy value.

To verify the validity of the proposed algorithm, we run experiments on three aspects: the impact of different forms of the corpus, the impact of the number of features, and the impact of different classification methods. We compare the proposed algorithm with the traditional TF/IDF algorithm and with other improved algorithms. The results show that the proposed algorithm outperforms both the traditional TF/IDF algorithm and the other improved algorithms on the macro-averaged F1 (Mac-Avg F1) and micro-averaged F1 (Mic-Avg F1) measures. The proposed algorithm also performs well on imbalanced data sets.
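
The abstract does not give the thesis's actual weighting formula, so the following is only a minimal sketch of one plausible way to adjust TF/IDF by information entropy, not the author's method. It assumes that a feature's entropy is computed over the classes of the documents containing it, that this entropy is normalized by its maximum and turned into a factor (1 - normalized entropy), and that a smoothed IDF (log(N/df) + 1) is used. The function name `entropy_weighted_tfidf` and all variable names are hypothetical.

```python
import math
from collections import Counter


def entropy_weighted_tfidf(docs, labels):
    """Return one {term: weight} dict per document.

    docs   -- list of token lists
    labels -- class label of each document (same length as docs)

    Assumed weight: tf * (log(N / df) + 1) * (1 - H(term) / H_max),
    where H(term) is the entropy of the class distribution of the
    documents that contain the term. This is an illustrative choice.
    """
    n_docs = len(docs)
    classes = sorted(set(labels))

    # Document frequency of each term, overall and per class.
    df = Counter()
    df_by_class = {c: Counter() for c in classes}
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            df_by_class[label][term] += 1

    # Maximum possible entropy (uniform spread over all classes).
    max_entropy = math.log2(len(classes)) if len(classes) > 1 else 1.0

    def class_entropy(term):
        # Entropy of the class distribution among documents with this term.
        h = 0.0
        for c in classes:
            p = df_by_class[c][term] / df[term]
            if p > 0:
                h -= p * math.log2(p)
        return h

    weighted = []
    for tokens in docs:
        counts = Counter(tokens)
        row = {}
        for term, count in counts.items():
            tf = count / len(tokens)
            idf = math.log(n_docs / df[term]) + 1.0  # smoothed IDF (assumption)
            factor = 1.0 - class_entropy(term) / max_entropy
            row[term] = tf * idf * factor
        weighted.append(row)
    return weighted
```

Under these assumptions, a term spread evenly over all classes gets a factor of 0 and contributes nothing, while a term concentrated in one class keeps its full TF/IDF weight. The sketch uses raw per-class document counts, so on imbalanced data one might instead normalize by class size; the abstract does not say how the thesis handles this.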
Keywords/Search Tags:text categorization, feature weighting, information entropy, TF/IDF