
The Research Of Keywords Extraction Algorithm In Text Mining

Posted on: 2014-10-05; Degree: Master; Type: Thesis
Country: China; Candidate: L F Wang; Full Text: PDF
GTID: 2268330401482725; Subject: Computer application technology
Abstract/Summary:
With the continuing development of information technology, large amounts of text are stored in computer-readable form, and information in many fields is growing explosively. How to extract useful information for readers quickly and accurately from this mass of information is therefore an important problem. Keyword extraction is an effective means of addressing it. Keywords condense the subject information of a text, enabling readers to grasp its important content quickly and improving the efficiency of information access, so keyword extraction has clear practical significance.

Keyword extraction is one of the core technologies in the field of text mining. Text is the main carrier of information, yet the vast majority of texts are not supplied with keywords. Existing keyword extraction algorithms do not adequately handle word sense disambiguation, redundant expression of synonyms, over-fitting during classifier training, or lexical chains that fail to express the semantic structure of a text accurately. This thesis therefore proposes two improved methods based on semantic analysis, which mine the latent expression of a text's theme at the semantic level. The proposed methods better address word sense disambiguation and build lexical chains that express the semantic structure of a text accurately and comprehensively. They also avoid redundant expression of synonyms, an effect that becomes more pronounced the more synonyms an article contains. The main work of this thesis is as follows.

1. Keyword extraction algorithm based on a semantic dictionary and lexical chains

The semantic dictionary Tongyici Cilin has a simple coding scheme, and its synonym groups are richer and easier to interpret semantically than those of other knowledge bases. Lexical chains, in turn, express the semantic structure and the multiple topics of a text well.
Therefore, a complete keyword extraction algorithm based on a semantic dictionary and lexical chains is proposed, named KETCLC (Keyword Extraction based on Tongyici Cilin and Lexical Chain). It analyzes the characteristics of Tongyici Cilin and of lexical chains, combines the two to exploit their joint advantages, and improves the quality of keyword extraction through preprocessing, polysemy disambiguation, synonym merging, lexical chain construction, feature selection, and improved weight computation.

2. Keyword extraction algorithm based on semantic expansion integrated with lexical chains

At present, lexical chains are built from either semantic similarity values or semantic relevancy values, each computed independently. Such chains cannot express the associations and semantic relations between words accurately and fully, which reduces both the accuracy with which the theme of the article is expressed and the quality of keyword extraction. Therefore, a complete keyword extraction algorithm based on semantic expansion integrated with lexical chains is proposed, named KESELC (Keyword Extraction based on Semantic Expansion integrated with Lexical Chain). From the perspective of semantic analysis, it calculates semantic similarity and semantic relevancy based on Tongyici Cilin, then defines a semantic expansion degree, together with a method for calculating it, that takes both into account. Finally, it integrates semantic expansion with lexical chains to extract keywords. The method is well suited to mining words whose frequency is low but which contribute significantly to the article.

The experimental results show that both of the above methods take full account of semantic knowledge.
The extracted keywords avoid redundant expression and cover the subjects of the article accurately and comprehensively. Both keyword extraction methods perform excellently on Chinese texts.
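
To make the idea of synonym merging and lexical chains concrete, the following is a minimal Python sketch and not the thesis's KETCLC or KESELC implementation. It assumes a tiny hand-made synonym table (`TOY_SYNONYM_GROUPS`, with invented group ids that are not real Tongyici Cilin codes), uses English tokens for readability, and scores each chain simply by its total frequency. The function names `build_chains` and `extract_keywords` are hypothetical. Disambiguation, feature selection, and the semantic expansion degree are omitted.

```python
from collections import Counter, defaultdict

# Hypothetical toy synonym groups standing in for Tongyici Cilin entries.
# The group ids are invented for illustration and are not real Cilin codes.
TOY_SYNONYM_GROUPS = {
    "car": "G1", "automobile": "G1", "vehicle": "G1",
    "engine": "G2", "motor": "G2",
    "road": "G3",
}


def build_chains(tokens, groups):
    """Put tokens that share a synonym group into one chain (word -> count)."""
    chains = defaultdict(Counter)
    for tok in tokens:
        # Words missing from the dictionary form their own single-word chain.
        key = groups.get(tok, "word:" + tok)
        chains[key][tok] += 1
    return chains


def extract_keywords(tokens, groups, top_k=2):
    """Rank chains by total frequency; return each chain's most frequent word."""
    chains = build_chains(tokens, groups)
    # Ties between chains are broken alphabetically to keep output stable.
    ranked = sorted(chains.values(), key=lambda c: (-sum(c.values()), min(c)))
    return [min(c, key=lambda w: (-c[w], w)) for c in ranked[:top_k]]


tokens = "car engine automobile road car motor vehicle".split()
keywords = extract_keywords(tokens, TOY_SYNONYM_GROUPS)
print(keywords)  # ['car', 'engine']
```

In this toy input, "automobile" and "vehicle" each occur only once, but merging them with "car" gives their chain a total weight of 4, and only one representative word is reported for the whole group. This is the sense in which merging synonyms avoids redundant keywords.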
Keywords/Search Tags: Tongyici Cilin, Lexical Chain, Keyword Extraction, Semantic Analysis, Semantic Expansion