
Chinese Keyword Extraction Method Based On Word Span And Its Application In Text Classification

Posted on: 2012-06-14
Degree: Master
Type: Thesis
Country: China
Candidate: J Xie
Full Text: PDF
GTID: 2218330368993575
Subject: Management Science and Engineering
Abstract/Summary:
Keyword extraction plays an important role in automated text processing and has become a critical technology. If keywords can be identified in massive text resources and the texts classified according to their content, effective text management can be achieved. At present, the most widely used keyword extraction methods are statistical, owing to their conceptual simplicity and ease of practical application. However, they rely too heavily on word-frequency statistics, so the extracted keywords generally include noise words that are frequent but not critical.

Addressing this noise problem in Chinese keyword extraction, this thesis investigates in depth how to improve the traditional statistics-based method, as well as how to apply keywords to text classification. The main work is summarized as follows:

(1) To improve the accuracy of keyword extraction, this thesis proposes a new keyword extraction method for Chinese text that analyzes word span in order to identify and filter noise words accurately. Experiments show that the approach improves the accuracy of keyword extraction and performs stably across various texts.

(2) Feature dimension reduction in text classification refers to the selection of feature terms, with the purpose of reducing the dimension of the feature space. Because the number of candidate terms is large and the selection calculation is complex, this thesis uses the keyword extraction method to filter out terms that have low weight within a single text, reducing the number of terms that enter feature selection. The experimental results show that this lowers the computational cost of feature reduction without losing valuable terms.

(3) For the weight calculation of terms in the vector space model, the classical TF*IDF method is improved to TW*IDF*CHI, where term frequency (TF) is replaced by the weight of the term (TW) in the single text, and CHI is added to account for the relationship between terms and categories. The experimental results show that this method can improve classification performance.
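The abstract does not give the formulas behind word span or TW*IDF*CHI, so the following Python sketch is only an illustration under stated assumptions, not the thesis's actual method. It assumes that word span is the distance between a term's first and last occurrence divided by the text length, that TW is relative frequency multiplied by span, that IDF is the natural log of (number of documents / document frequency), and that CHI is the standard chi-square statistic over a term/category contingency table. All function names are hypothetical.

```python
import math
from collections import Counter


def word_span(tokens):
    """Return {term: span}. ASSUMPTION: span = (last - first + 1) / len(tokens),
    where first/last are the token positions of the term's first and last occurrence."""
    first, last = {}, {}
    for i, tok in enumerate(tokens):
        first.setdefault(tok, i)  # keep only the first position seen
        last[tok] = i             # overwrite until the final occurrence
    n = len(tokens)
    return {t: (last[t] - first[t] + 1) / n for t in first}


def term_weights(tokens):
    """Hypothetical in-document weight: TW = relative frequency * span.
    Frequent terms confined to a small region get a lower weight."""
    n = len(tokens)
    counts = Counter(tokens)
    spans = word_span(tokens)
    return {t: (counts[t] / n) * spans[t] for t in counts}


def chi_square(a, b, c, d):
    """Chi-square statistic for a term/category 2x2 table (assumed meaning of CHI).
    a: docs in the category containing the term
    b: docs outside the category containing the term
    c: docs in the category without the term
    d: docs outside the category without the term"""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def tw_idf_chi(tw, n_docs, doc_freq, chi):
    """Combine the three factors; IDF = ln(n_docs / doc_freq) is an assumption."""
    idf = math.log(n_docs / doc_freq)
    return tw * idf * chi
```

With the token list `["x", "x", "x", "y", "z", "w", "v", "y"]`, this hypothetical TW ranks `y` above `x` even though `x` is more frequent, because every occurrence of `x` falls in one local cluster. That is the kind of high-frequency noise word the abstract describes.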
Keywords/Search Tags: Word span, Keyword extraction, Text classification, Feature dimension reduction, Weight calculation