
Statistic-based Automatic Keyphrase Extraction and Summarization from Multi-document

Posted on: 2011-03-04
Degree: Master
Type: Thesis
Country: China
Candidate: Y G Zhang
GTID: 2178360305976431
Subject: Computer application technology
Abstract/Summary:
Keyphrase extraction and summarization are important technologies in intelligent information processing, with applications in document processing tasks such as clustering, classification, search engines, and topic detection and tracking (TDT). The keyphrases and summary of a multi-document set reflect its topic in the form of phrases and sentences.

In this thesis, we first introduce phrase recognition, whose quality directly affects the results of keyphrase and summary extraction. We then focus on three systems for extracting keyphrases and summaries. The S-MMR system applies a single-document extraction method to multi-document extraction. The G-HITS system evaluates the importance of sentences and terms using a link analysis algorithm. Both systems adopt MMR (maximal marginal relevance) to avoid extracting redundant information.

Whereas these two systems use MMR only to suppress similar information, we propose a co-clustering-based keyphrase and summarization extraction system that builds on S-MMR and G-HITS, combines them with a co-clustering algorithm, and makes full use of similar information. We first construct a directed graph of sentences and a directed graph of the terms contained in those sentences, and convert these graphs into equivalent weight matrices. On these matrices we run a co-clustering algorithm that computes the weights of sentences and terms and clusters them. In this process, weight calculation and clustering interact with each other, and the weights and clusters of sentences and of terms also interact and reinforce each other, finally reaching the globally optimal weights. The results show that the keyphrases and summaries extracted by the proposed system are of high quality and are obtained efficiently.
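The link-analysis scoring and MMR redundancy filtering described above can be illustrated with a small sketch. It is not the thesis's implementation: the function names (`hits_scores`, `mmr_select`), the bag-of-terms sentence representation, the Jaccard similarity, and the trade-off parameter `lam` are all assumptions chosen for illustration. Sentences and terms reinforce each other in a HITS-style iteration, and MMR then picks high-scoring sentences while penalizing overlap with those already chosen.

```python
import math


def hits_scores(sent_terms, iterations=50):
    """HITS-style mutual reinforcement on a sentence-term bipartite graph.

    sent_terms: list of sets, one set of terms per sentence (assumed input format).
    Returns (sentence_weights, term_weights).
    """
    terms = sorted(set().union(*sent_terms))
    sent_w = [1.0] * len(sent_terms)
    term_w = {t: 1.0 for t in terms}
    for _ in range(iterations):
        # A term is important if it occurs in important sentences.
        term_w = {
            t: sum(sent_w[i] for i, s in enumerate(sent_terms) if t in s)
            for t in terms
        }
        norm = math.sqrt(sum(v * v for v in term_w.values())) or 1.0
        term_w = {t: v / norm for t, v in term_w.items()}
        # A sentence is important if it contains important terms.
        sent_w = [sum(term_w[t] for t in s) for s in sent_terms]
        norm = math.sqrt(sum(v * v for v in sent_w)) or 1.0
        sent_w = [v / norm for v in sent_w]
    return sent_w, term_w


def jaccard(a, b):
    """Set-overlap similarity, used here as a stand-in redundancy measure."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mmr_select(sent_terms, scores, k, lam=0.5):
    """Greedy MMR: trade importance (scores) against similarity to picked sentences."""
    selected = []
    candidates = list(range(len(sent_terms)))
    while candidates and len(selected) < k:
        def mmr(i):
            redundancy = max(
                (jaccard(sent_terms[i], sent_terms[j]) for j in selected),
                default=0.0,
            )
            return lam * scores[i] - (1 - lam) * redundancy

        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy input: sentences 0 and 1 are duplicates, so MMR should keep only one of them.
sentences = [
    {"keyphrase", "extraction", "summary"},
    {"keyphrase", "extraction", "summary"},
    {"clustering", "graph"},
    {"keyphrase", "graph"},
]
sent_w, term_w = hits_scores(sentences)
picked = mmr_select(sentences, sent_w, k=2, lam=0.5)
```

With a lower `lam`, redundancy is penalized more heavily, so near-duplicate sentences are less likely to appear together in the selection.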
Experiments on the DUC2004 dataset show that, under ROUGE evaluation, the proposed system achieves an Average-F value of 38.459% for ROUGE-1 and 0.09382% for ROUGE-2; both values are higher than those obtained with the other methods.
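For readers unfamiliar with the metric, the following is a simplified sketch of how a ROUGE-N F-score is computed from n-gram overlap. It is an assumption-laden illustration, not the official ROUGE toolkit used in DUC evaluations: it handles a single reference, tokenizes on whitespace, and applies no stemming or stopword removal. The name `rouge_n` is hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, f_score) for clipped n-gram overlap."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


p1, r1, f1 = rouge_n("the cat sat on the mat", "the cat lay on the mat", n=1)
p2, r2, f2 = rouge_n("the cat sat on the mat", "the cat lay on the mat", n=2)
```

In this toy pair, five of six unigrams and three of five bigrams match, which is why ROUGE-2 is typically much lower than ROUGE-1 for the same summary.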
Keywords/Search Tags: Information extraction, Keyphrase, Multi-document summarization, Co-clustering, Terminology, Natural Language Processing