
Keyword Extraction Based on Statistics and Collaborative Filtering

Posted on: 2016-08-31
Degree: Master
Type: Thesis
Country: China
Candidate: H C Li
GTID: 2348330488957088
Subject: Engineering
Abstract/Summary:
With the development of the Internet, vast amounts of information are created on the network every day. The explosive growth of text in particular has become an important issue in natural language processing, and accurately finding the information people need in this flood is a pressing problem. Retrieving from such large volumes of text first requires effective and accurate keyword extraction from documents. Keyword extraction is therefore an important research topic with wide application, including information retrieval, document classification, information feedback systems, and automatic summarization.
This thesis focuses on keyword extraction algorithms. First, based on the structural characteristics of Chinese text, an improved word segmentation algorithm built on the ICTCLAS word segmentation system is proposed. Then the statistical characteristics of documents are analyzed and discussed. Four features are selected from the common statistical features: word frequency, part of speech, word position, and word length. A formula is proposed to calculate a statistical feature score for each word, and keywords are selected by comparing these scores. In addition to statistical features, this thesis also considers the similarity between documents, and proposes a keyword extraction algorithm based on collaborative filtering. The algorithm first trains on existing documents that already have keywords and uses a collaborative filtering algorithm to calculate the similarity between the document whose keywords are to be extracted and the documents that already have keywords. It then takes the keywords of the highly similar documents as candidate keywords.
Finally, the algorithm calculates the composite statistical score of each candidate keyword in the document and selects the keywords accordingly. The keyword extraction algorithm based on statistical features and the one based on collaborative filtering are then combined. When the database contains many content-related documents, the combined algorithm uses the collaborative filtering approach; when it contains only a few, it uses the statistical approach. Experiments show that the combined algorithm is more widely applicable.
This thesis also discusses the values of the parameters that appear in the algorithm and compares the performance of several different algorithms. Lastly, it describes a validation tool designed for verifying and analyzing the algorithm, and introduces the tool's framework and functional modules.
Keywords/Search Tags: Keyword extraction, collaborative filtering, Chinese word segmentation
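
The combined procedure outlined in the abstract could be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's actual method: the abstract gives neither the scoring formula nor the parameter values, so the weights, the part-of-speech table, the thresholds, and all function names below are hypothetical. Cosine similarity over term counts stands in for the unspecified collaborative filtering similarity, and the input is assumed to be already segmented into (word, part-of-speech) pairs.

```python
import math
from collections import Counter

# Hypothetical weights: the abstract does not give the real formula or values.
POS_WEIGHT = {"n": 1.0, "v": 0.6, "a": 0.4}  # noun, verb, adjective
W_FREQ, W_POS, W_LOC, W_LEN = 0.4, 0.2, 0.2, 0.2


def statistical_scores(tokens):
    """Score each distinct word from frequency, part of speech, position, length.

    tokens: list of (word, pos_tag) pairs in document order.
    """
    if not tokens:
        return {}
    counts = Counter(word for word, _ in tokens)
    max_count = max(counts.values())
    max_len = max(len(word) for word in counts)
    first_index, pos_of = {}, {}
    for i, (word, pos) in enumerate(tokens):
        first_index.setdefault(word, i)
        pos_of.setdefault(word, pos)
    scores = {}
    for word, count in counts.items():
        freq = count / max_count
        pos = POS_WEIGHT.get(pos_of[word], 0.0)
        loc = 1.0 - first_index[word] / len(tokens)  # earlier words score higher
        length = len(word) / max_len
        scores[word] = W_FREQ * freq + W_POS * pos + W_LOC * loc + W_LEN * length
    return scores


def cosine_similarity(a, b):
    """Cosine similarity between two term-count mappings."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def extract_keywords(tokens, labeled_docs, k=3, sim_threshold=0.3, min_neighbors=2):
    """Pick up to k keywords for a segmented document.

    labeled_docs: list of (term_counts, keywords) for documents whose
    keywords are already known.
    """
    scores = statistical_scores(tokens)
    counts = Counter(word for word, _ in tokens)
    neighbors = [
        keywords
        for vector, keywords in labeled_docs
        if cosine_similarity(counts, vector) >= sim_threshold
    ]
    pool = set(scores)  # default: rank every word by its statistical score
    if len(neighbors) >= min_neighbors:
        # Enough similar documents: restrict to their keywords found in this text.
        candidates = {w for kws in neighbors for w in kws if w in scores}
        if candidates:
            pool = candidates
    return sorted(pool, key=lambda w: (-scores[w], w))[:k]
```

With few or no similar labeled documents, `extract_keywords` falls back to ranking all words by the statistical score alone, which mirrors the switching rule the abstract describes.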