
Semantic Hierarchical Clustering Based Multi-document Summarization Research

Posted on: 2015-01-07  Degree: Master  Type: Thesis
Country: China  Candidate: L Hu  Full Text: PDF
GTID: 2298330422982040  Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet, the amount of information on the network is expanding rapidly. Even multiple documents on the same topic contain a large share of similar or duplicated content, and after duplicate documents are removed there is still too much content and much similar information. It is becoming increasingly difficult to extract the main content quickly and accurately from this complex and redundant information. Multi-document summarization is a good technique for solving this problem: it extracts the main information from multiple documents on the same topic and presents it to users as a summary.

This paper analyzes several multi-document summarization techniques, and studies the semantic relations between words and a variety of clustering methods within the sentence-clustering-based approach to multi-document summarization. It then proposes a multi-document summarization method based on semantic hierarchical clustering. The main contributions are as follows:
(1) Based on context, we perform word sense disambiguation using a semantic dictionary (WordNet or HowNet) to reduce the adverse impact of word ambiguity on the clustering analysis.
(2) We propose a sub-topic extraction method based on semantic hierarchical clustering. We first find the semantic concepts by clustering words. We then build the vector space model of the sentences and cluster the sentences to obtain the sub-topics. This method can reduce the adverse impact of skew between vector components.
(3) We propose a sentence clustering method that combines density clustering and hierarchical clustering. This lets us discover as many sub-topics as possible and so improves the coverage of the main information in the summary.
(4) We propose a sentence extraction and ordering method so that the summary contains as much of the main information as possible. At the same time, ordering the sentences by importance and logical structure improves the readability of the summary.
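The two-stage idea in contribution (2) can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The `CONCEPT_OF` map is a hypothetical, hand-written stand-in for the word clustering step (the WordNet/HowNet similarity is not reproduced), building sentence vectors over concepts is an assumption about how the vector space model is defined, and the sentence step uses plain average-link agglomerative clustering with an assumed similarity `threshold` rather than the combined density and hierarchical method of contribution (3).

```python
import math
from collections import Counter

# Hypothetical word-to-concept map. In the thesis this would come from
# clustering words with a semantic dictionary (WordNet or HowNet).
CONCEPT_OF = {
    "car": "vehicle", "automobile": "vehicle", "truck": "vehicle",
    "crash": "accident", "collision": "accident",
    "rain": "weather", "storm": "weather", "flood": "weather",
}

def concept_vector(sentence):
    """Represent a sentence as a bag of concepts; unknown words are dropped."""
    tokens = sentence.lower().split()
    return Counter(CONCEPT_OF[t] for t in tokens if t in CONCEPT_OF)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def cluster_sentences(sentences, threshold=0.5):
    """Average-link agglomerative clustering; returns lists of sentence indices."""
    vectors = [concept_vector(s) for s in sentences]
    clusters = [[i] for i in range(len(sentences))]

    def avg_sim(c1, c2):
        total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        best_sim, (i, j) = max(
            (avg_sim(clusters[i], clusters[j]), (i, j))
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best_sim < threshold:
            break  # remaining clusters are treated as distinct sub-topics
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

sentences = [
    "the car crash",
    "an automobile collision",
    "heavy rain and flood",
    "storm brings rain",
]
subtopics = cluster_sentences(sentences)
```

In this toy input, "car crash" and "automobile collision" share no surface words, yet they map to the same concepts and fall into one sub-topic. This is the kind of effect that clustering on semantic concepts rather than raw words is meant to achieve.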
We evaluate the method with ROUGE on the English corpus of DUC2004. The results show that it ranked 4th in ROUGE-1, 2nd in ROUGE-2, 1st in ROUGE-3, 3rd in ROUGE-4, and 5th in both ROUGE-L and ROUGE-W-1.2. This indicates that the multi-document summarization method based on semantic hierarchical clustering achieves good results on recall, precision and readability. In addition, based on this method, we build a multi-document summarization system for news that covers English and Chinese. The quality of its summaries shows that the method can be used in practical systems.
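For readers unfamiliar with the metric, the following is a simplified Python sketch of ROUGE-N recall (clipped n-gram overlap with the reference summaries). It is not the official ROUGE toolkit used for DUC evaluations: it assumes whitespace tokenization and omits stemming, stop-word handling and jackknifing, and the function names are illustrative.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, references, n):
    """Matched reference n-grams divided by total reference n-grams."""
    cand_counts = ngrams(candidate.lower().split(), n)
    matched = 0
    total = 0
    for ref in references:
        ref_counts = ngrams(ref.lower().split(), n)
        # Clip each match by how often the n-gram occurs in the candidate.
        matched += sum(min(count, cand_counts[gram])
                       for gram, count in ref_counts.items())
        total += sum(ref_counts.values())
    return matched / total if total else 0.0

candidate = "the cat sat on the mat"
references = ["the cat is on the mat"]
rouge_1 = rouge_n_recall(candidate, references, 1)  # 5 of 6 unigrams matched
rouge_2 = rouge_n_recall(candidate, references, 2)  # 3 of 5 bigrams matched
```

Higher-order variants such as ROUGE-3 and ROUGE-4 reward longer matching word sequences, so they are more sensitive to phrase-level agreement with the reference summaries than ROUGE-1.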
Keywords/Search Tags: multi-document summarization, semantic hierarchical clustering, word clustering, semantic concept, sentence clustering