Chinese Keyword Extraction Method Based On Word Span And Its Application In Text Classification

Posted on:2012-06-14

Degree:Master

Type:Thesis

Country:China

Candidate:J Xie

Full Text:PDF

GTID:2218330368993575

Subject:Management Science and Engineering

Abstract/Summary:

Keyword extraction plays an important role in automated text processing and has become a critical technology. If keywords can be identified in massive text resources and the texts classified according to their content, effective text management becomes possible. At present, the most widely used keyword extraction methods are statistical, because they are conceptually simple and convenient to apply in practice. However, they rely too heavily on word-frequency statistics, so the extracted keywords often include noise words that are frequent but not important.

Addressing this noise problem in Chinese keyword extraction, this paper studies how to improve the traditional statistical method and how to apply the extracted keywords to text classification. The main work of this paper is summarized as follows:

(1) To improve the accuracy of keyword extraction, this paper proposes a new keyword extraction method for Chinese text based on the analysis of word span, which is used to identify and filter out noise words. Experiments show that this approach improves the accuracy of keyword extraction and performs stably across various texts.

(2) Feature dimension reduction in text classification refers to the selection of feature terms in order to reduce the dimension of the feature space. Because the number of candidate terms is large and the selection computation is expensive, this paper uses the keyword extraction method to filter out terms that have low weight within a single text, which reduces the number of terms that enter feature selection. The experimental results show that this reduces the computational cost of feature reduction without losing valuable terms.

(3) For the weight calculation of terms in the vector space model, the classical method, TF*IDF, is improved to TW*IDF*CHI, where term frequency (TF) is replaced by the weight of the term (TW) within a single text, and CHI is added to account for the relationship between terms and categories. The experimental results show that this method improves classification performance.
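The TW*IDF*CHI weighting in item (3) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not give the exact formulas, so the sketch assumes that CHI is the usual chi-square statistic of a term/category contingency table, that IDF is the plain logarithm of (number of documents / document frequency), and that TW is a per-document keyword weight supplied by the caller. The function names `chi_square`, `idf`, and `tw_idf_chi` are hypothetical.

```python
import math


def chi_square(a, b, c, d):
    """Chi-square statistic for one term and one category (assumed form of CHI).

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that do not contain the term
    d: documents outside the category that do not contain the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table (an empty row or column): no association measurable.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def idf(n_docs, doc_freq):
    """Inverse document frequency, assumed here to be log(N / df)."""
    if doc_freq <= 0:
        raise ValueError("doc_freq must be positive")
    return math.log(n_docs / doc_freq)


def tw_idf_chi(tw, n_docs, doc_freq, chi):
    """Combine the in-document term weight (TW) with IDF and CHI.

    tw is assumed to come from a keyword extraction step; how the thesis
    computes it is not stated in the abstract.
    """
    return tw * idf(n_docs, doc_freq) * chi


if __name__ == "__main__":
    # A term that occurs in all 10 documents of a category and nowhere else.
    chi = chi_square(a=10, b=0, c=0, d=10)
    weight = tw_idf_chi(tw=0.5, n_docs=100, doc_freq=10, chi=chi)
    print(chi, weight)
```

Under these assumptions, a term spread evenly across categories gets a chi-square value of zero and therefore a zero weight, which is one way the CHI factor could reflect the relationship between terms and categories.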

Keywords/Search Tags:

Word span, Keyword extraction, Text classification, Feature dimension reduction, Weight calculation

Related items

1	Method Of Webpage Keyword Extraction Based On Word Span
2	Research And Application Of Text Feature Reduction And Classification Rule Extraction
3	Research On Text Classification Based On Feature Selection And Feature Weighting Algorithm
4	Research On Chinese Text Classification Based On Keyword Strategy And CNN
5	Research On Feature Dimension Reduction In Text Classification
6	Research On Text Classification Based On Rough Set
7	Study On Feature Extraction Based On Maximizing The Distance Between Classes
8	Research And Application Of Feature Dimension Reduction Algorithm In Text Classification
9	The Research And Application Of Text Association Rule Mining Method
10	Sort Of Facing Pages Keyword Weight Calculation