Research on Statistic-based Methods for Automatic Keyphrase Extraction from Chinese Texts

Posted on: 2010-12-21
Degree: Master
Type: Thesis
Country: China
Candidate: Y Han
GTID: 2178360275959253
Subject: Computer application technology
Abstract/Summary:
Keyword extraction is an important technique in text information processing and underpins automatic summarization, classification, subject extraction, and patent search and analysis. Keyphrases are meant to capture the main topics of a given document, and many journals require authors to provide a list of keywords for their articles. We call these keyphrases rather than keywords because they are often phrases of two or more words, although single words are also included. This thesis proposes an integrated phrase extension and identification method based on left and right neighbors, justified theoretically by the phrase advantage, an account summarized from linguistics, cognitive psychology, and computational linguistics. On the basis of integrated phrases, we approach keyphrase extraction as a supervised learning task using a decision tree. A document is treated as a set of words, which the learning algorithm must learn to classify as positive or negative examples of a component word of a keyphrase. The experiments apply the C4.5 decision tree induction algorithm to this learning task to extract candidate keywords, which are then combined into keyphrases on the basis of integrated phrases. Experimental results show that the proposed method performs well. Because the text features are obtained by analyzing scientific and technical literature, this approach is limited to scientific papers; for other kinds of text, not enough training data are available. Therefore, an automatic keyphrase extraction algorithm for Chinese documents based on complex networks is also proposed. Each document is treated as a semantic network.
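
As an illustration of the left-and-right-neighbor idea behind integrated phrases mentioned above, the following is a minimal sketch, not the thesis's actual procedure. It assumes that a candidate word sequence counts as a complete phrase when the words appearing immediately to its left and to its right are both varied, measured here by neighbor entropy. The function names, the `threshold` value, and the toy token list are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def neighbor_entropy(counts):
    """Shannon entropy (in bits) of a neighbor-frequency Counter."""
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())

def neighbor_stats(tokens, max_len=3):
    """Collect left and right neighbor counts for every n-gram up to max_len."""
    left, right = defaultdict(Counter), defaultdict(Counter)
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if i > 0:
                left[gram][tokens[i - 1]] += 1
            if i + n < len(tokens):
                right[gram][tokens[i + n]] += 1
    return left, right

def is_integrated(gram, left, right, threshold=1.0):
    """Assumed criterion: both neighbor distributions must be diverse."""
    l = neighbor_entropy(left.get(gram, Counter()))
    r = neighbor_entropy(right.get(gram, Counter()))
    return l >= threshold and r >= threshold

# Toy, already-segmented text: "complex network" recurs in varied contexts.
tokens = "a complex network b complex network c complex network d".split()
left, right = neighbor_stats(tokens)

print(is_integrated(("complex", "network"), left, right))  # varied on both sides
print(is_integrated(("complex",), left, right))  # always followed by "network"
```

In this toy input, "complex" on its own fails the test because its right neighbor is always "network", which suggests that the phrase should be extended to "complex network".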
Degree centrality and betweenness centrality are combined into an integrated network feature value that is used to extract keyphrases, which performs better than using either network feature alone. Network separation is proposed to deal with the network connectivity problem, and betweenness computed on the basis of boundary nodes is proposed to reduce computational complexity. The results indicate that these two methods are effective. The algorithm shows good performance and offers guidance for extracting keyphrases from documents using complex networks.
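
The combination of degree and betweenness centrality described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's algorithm: the word network is assumed to be a co-occurrence graph over segmented words, the integrated value is assumed to be a weighted sum with a hypothetical weight `alpha`, and betweenness is computed with the standard Brandes algorithm rather than the boundary-node method the abstract mentions. Network separation is not covered.

```python
from collections import defaultdict, deque

def build_graph(tokens, window=1):
    """Undirected co-occurrence graph: link words at most `window` apart."""
    adj = defaultdict(set)
    for i, w in enumerate(tokens):
        for v in tokens[i + 1:i + window + 1]:
            if v != w:
                adj[w].add(v)
                adj[v].add(w)
    return dict(adj)

def betweenness(adj):
    """Brandes' algorithm for an unweighted, undirected graph (unnormalized)."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        order, queue = [], deque([s])
        preds = {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)   # number of shortest paths from s
        dist = dict.fromkeys(adj, -1)   # BFS distance from s
        sigma[s], dist[s] = 1, 0
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):       # accumulate dependencies back toward s
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return {v: b / 2 for v, b in bc.items()}  # each pair was counted twice

def combined_scores(adj, alpha=0.5):
    """Assumed integrated value: alpha * degree + (1 - alpha) * betweenness."""
    n = len(adj)
    bc = betweenness(adj)
    max_bc = max(bc.values()) or 1.0
    return {v: alpha * len(adj[v]) / max(n - 1, 1)
               + (1 - alpha) * bc[v] / max_bc for v in adj}

# Toy segmented text; with window=1 the graph is a simple chain of words.
tokens = ["text", "keyphrase", "network", "centrality", "document"]
graph = build_graph(tokens)
scores = combined_scores(graph)
print(sorted(scores, key=scores.get, reverse=True))
```

In this chain, the three inner words all have the same degree, so degree alone cannot rank them. Adding betweenness separates the middle word from the others, which is one way to see why a combined value can discriminate better than a single feature.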
Keywords/Search Tags: Keyphrase Extraction, Integrated Phrase, OOV, Left and Right Neighbor, C4.5 Decision Tree, Small-world Network, Text Feature