The Research Of Topic Based Multi-document Summarization

Posted on: 2012-08-03
Degree: Master
Type: Thesis
Country: China
Candidate: D P Yue
Full Text: PDF
GTID: 2218330362460291
Subject: Computer Science and Technology
Abstract/Summary:
The rapid development of internet technology has led to exponential growth in the available literature and knowledge. Multi-document summarization helps readers extract the important information from large numbers of documents while reducing the time and effort spent reading, which makes it especially valuable today. News reports are now commonly organized by topic: one event is presented as the leading event, together with other events that are connected to it in one way or another. Documents organized in this topic-based way describe a series of news events and their context clearly, which makes them easier for users to query and read, and this organization is widely used. This thesis studies topic-based multi-document summarization.

Unlike other kinds of multi-document sets, the documents in a topic-based set are highly related to each other in content; their information is highly redundant and contains little irrelevant material. Exploiting these features, which ordinary document sets lack, makes it possible to generate better summaries from topic-based document sets.

Focusing on the topic features of topic-based document sets, this thesis improves on a classic summarization algorithm. There are two main improvements: distinguishing between seminal and non-seminal events, and adding a time attribute to sentences.

The experiments use topic-based news report sets, and the thesis proposes a topic-based multi-document summarization method built on the MMR (maximal marginal relevance) summarization algorithm. Because seminal and non-seminal events play different roles, they are treated differently when extracting topic keywords. When calculating the similarity between sentences, the time-sensitive nature of the news corpus is taken into account: each sentence is assigned a time property, which makes it possible to include time in the similarity measure. Sentence ordering also takes the time property of sentences into account, and uses two different approaches for two different document organizational structures.

The TDT4 corpus is used to test and evaluate the summarization method. Compared with two baseline systems, the topic-based multi-document summarization system achieves better results.
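The abstract does not give the thesis's formulas, so the following is only a minimal sketch of how MMR selection with a time-aware sentence similarity might look. The exponential time decay, the `decay` and `lam` values, the bag-of-words cosine measure, and all function and field names (`time_aware_sim`, `mmr_select`, `day`, `bow`) are assumptions made for illustration, not the author's method.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def time_aware_sim(s1, s2, decay=0.1):
    """Content similarity damped by the gap in days between two sentences.

    The exponential damping is an assumption; the thesis only states that
    time is included in the similarity measure.
    """
    gap = abs(s1["day"] - s2["day"])
    return cosine(s1["bow"], s2["bow"]) * math.exp(-decay * gap)


def mmr_select(sentences, topic_bow, k=2, lam=0.7):
    """Greedy MMR: prefer sentences relevant to the topic keywords while
    penalizing redundancy with sentences that are already selected."""
    selected = []
    candidates = list(range(len(sentences)))
    while candidates and len(selected) < k:

        def score(i):
            relevance = cosine(sentences[i]["bow"], topic_bow)
            redundancy = max(
                (time_aware_sim(sentences[i], sentences[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    # Order the chosen sentences chronologically (one possible ordering).
    return sorted(selected, key=lambda i: sentences[i]["day"])


def make_sentence(text, day):
    """Build a toy sentence record: raw text, bag of words, day offset."""
    return {"text": text, "bow": Counter(text.lower().split()), "day": day}


# Toy data, invented for this example only.
sentences = [
    make_sentence("earthquake strikes the coastal city", 0),
    make_sentence("earthquake strikes the coastal city", 0),
    make_sentence("rescue teams reach the coastal city", 2),
]
topic = Counter("earthquake coastal city rescue".split())
summary = mmr_select(sentences, topic, k=2)
for i in summary:
    print(sentences[i]["day"], sentences[i]["text"])
```

In this toy run the duplicated sentence is skipped in favor of the later rescue report, because its redundancy penalty against the first selected sentence outweighs its slightly higher topic relevance.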
Keywords/Search Tags: Multi-document summarization, topic, Natural Language Processing, news, Topic Detection and Tracking