Research On The Method Of Multi-document Summarization Based On Topic Model

Posted on: 2014-02-22  Degree: Master  Type: Thesis
Country: China  Candidate: Q F Li  Full Text: PDF
GTID: 2248330398452620  Subject: Computer Science and Technology
Abstract/Summary:
The Internet is developing rapidly, and every field is experiencing an information explosion. Electronic text is gradually replacing traditional handwritten text, and more and more information is stored electronically on the Internet. As a result, the Internet has become an important channel through which people obtain information. At the same time, it is flooded with redundant information, so people face two problems: how to find useful information quickly and accurately in this mass of data, and how to read new information quickly. Automatic summarization techniques are a good tool for solving these problems.

Automatic summarization is the creation, by a machine, of a shortened version of one or more original texts. Multi-document summarization is a natural language processing technology that extracts information from multiple texts written about the same topic, according to a given compression ratio.

This paper studies extractive multi-document summarization based on topic models. Sentence selection plays an important role in extractive multi-document summarization: the selected sentences are expected to cover most of the content of the articles while containing little redundancy. Ranking sentences appropriately according to these two measures is therefore a crucial research problem. This paper proposes ranking the topics of a topic model, and then using the topic ranking together with document structure information to rank sentences.

The work in this paper includes:

(1) Topic rank and sentence rank. This paper models the document collection using CTM. We propose the TopicRank algorithm to rank the topics. The CorrSum algorithm then uses the ranked topics to rank sentences and to guide the extraction of sentences from the corpus. Experimental results on the DUC 2002 data set demonstrate the effectiveness of the CorrSum algorithm.

(2) Ranking sentences using document structure information. We discuss two common document structures. The first is the title-content structure, for which we propose the Titled-LDA algorithm to rank sentences. Since the title of a document indicates its content, Titled-LDA establishes a topic model for each document's title and another for its content, and then integrates the two models. The second is the segment structure, for which we use a segmented topic model (STM) to discover the latent topic structure of each document and its segments. We propose another algorithm, StmSum, to rank sentences. Experimental results on the DUC 2002 data set demonstrate the effectiveness of the algorithm.

(3) For CET-4 and CET-6 reading comprehension, this paper proposes a new evaluation criterion.
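
The abstract does not give the details of TopicRank or CorrSum, so the following is only a minimal sketch of the general idea of ranking sentences by coverage of highly ranked topics while penalizing redundancy. It is not the thesis's algorithm: it substitutes a simple greedy, MMR-style selection, and the names `select_sentences`, `sentence_topics`, `topic_weights` and `redundancy_penalty` are hypothetical. It assumes each sentence already has a topic distribution from some topic model and that each topic already has a rank weight.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_sentences(sentence_topics, topic_weights, k, redundancy_penalty=0.5):
    """Greedily pick up to k sentence indices (hypothetical helper, not CorrSum).

    sentence_topics: one topic distribution per sentence (assumed given).
    topic_weights:   one importance weight per topic (assumed given by a topic ranking).
    Relevance is the topic-weighted sum of a sentence's topic distribution;
    redundancy is its highest cosine similarity to any already chosen sentence.
    """
    relevance = [
        sum(w * p for w, p in zip(topic_weights, dist)) for dist in sentence_topics
    ]
    chosen = []
    candidates = list(range(len(sentence_topics)))
    while candidates and len(chosen) < k:

        def score(i):
            redundancy = max(
                (cosine(sentence_topics[i], sentence_topics[j]) for j in chosen),
                default=0.0,
            )
            return relevance[i] - redundancy_penalty * redundancy

        best = max(candidates, key=score)
        chosen.append(best)
        candidates.remove(best)
    return chosen


# Toy data: sentences 0 and 1 are near-duplicates about topic 0; sentence 2 covers topic 1.
toy_sentence_topics = [[0.9, 0.1], [0.85, 0.15], [0.1, 0.9]]
toy_topic_weights = [0.7, 0.3]
print(select_sentences(toy_sentence_topics, toy_topic_weights, k=2))
```

With the penalty set to zero the selection reduces to ranking by relevance alone, which in the toy data picks the two near-duplicate sentences; with the default penalty the second pick shifts to the sentence that covers the other topic.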
Keywords/Search Tags: multi-document summarization, topic model, topic rank, sentence rank, document structure