Multi-document Summarization Based On Basic Element

Posted on:2008-10-09

Degree:Doctor

Type:Dissertation

Country:China

Candidate:D X Liu

Full Text:PDF

GTID:1118360215498501

Subject:Computer software and theory

Abstract/Summary:

With the rapid growth of online information, finding and describing textual information effectively becomes more and more important. Although a search engine makes it convenient to obtain a large number of documents, users still bear the tedious burden of reading all of them. Automatic text summarization can alleviate users' browsing burden by providing a condensed version of the original text. Multi-document summarization, which aims to extract the major or user-interested information from a given set of documents, plays a vital role in Information Retrieval (IR) and has become a hot topic in Natural Language Processing (NLP). In this dissertation, we investigate four key issues of multi-document summarization: human behavior patterns in content unit selection; content selection based on sentence extraction; ordering strategies for extracted sentences; and an evaluation model for the "coherence" of a summary. The primary research and achievements can be summarized as follows:

1. A content unit selection strategy based on the Basic Element (BE) is proposed. By analyzing the correlation between the frequency of a BE in the original document collection and the probability that it is chosen as summary content, this thesis studies the potential behavior patterns in manual summarization. Statistical analysis on the dataset for DUC 2004 task 2 shows that manual summaries prefer BEH (BE Head) or BEHM (BE Head or Modifier) units that have high frequency in the BE document collection.

2. The influence of user-given topics on content unit selection in user-focused summarization is analyzed. The analysis on the DUC 2005 dataset reveals that human summarizers use the given topic as a reference: first, they find the sentences (key sentences) in the original document collection that contain the content units of the given topic, and treat the content units around these key sentences as candidates; then they select the high-frequency content units for the final summary.

3. A sentence extraction method based on BE vector clustering is proposed, in which the BE is employed as the content unit. Evaluation results on DUC 2004 task 2 show that this method is better than one that uses the word as the content unit. Moreover, this thesis presents an adaptive method that automatically detects the number of clusters, and a global search strategy that extracts the representative sentence from each cluster, so that the decision about which sentences appear in the summary is made from a global perspective. The results show that automatically detecting the number of clusters is superior to fixing the summary length or fixing the number of clusters arbitrarily. In addition, choosing sentences from each cluster from a global perspective is better than directly extracting the centroid sentence of each cluster. Because the number of clusters is difficult to determine, another sentence extraction approach, based on a Genetic Algorithm (GA), is proposed to avoid this problem. This approach treats sentence extraction as a Knapsack problem and employs the GA to find a sub-optimal solution.

4. A hybrid model for sentence reordering is proposed. This model integrates four kinds of relations between extracted sentences: chronological, location, dependence, and topic relations. We construct a directed graph over the extracted sentences, with sentences as vertices and the relations between sentences as edges. The sentences are then reordered using an extension of the PageRank method. Experiments on the datasets for DUC 2005 tasks 2 and 5 indicate that this hybrid model is better than the other reference models in both performance and robustness.

5. A BE-Relation-Grid based evaluation model for "coherence" is proposed. In this model, we view the BE as the content unit and the "relation" part of the BE as the grammatical role of the content unit. Content coherence is then measured by the BE relation transition probabilities in the BE-Relation-Grid. In our experiments, the manual summaries in the DUC 2005 dataset serve as training data and the machine-generated summaries as test data; the performance of the model is the correlation between the scores it assigns and those given by humans. The Pearson correlation coefficient is 0.408 when BEs are used as content units (restricted to BEs whose relation is "subj", "obj", "conj" or "nn"), an increase of about 66% over the result of the entity grid model presented in the literature. The experimental results illustrate that the evaluation model based on the BE-Relation-Grid captures semantic and structural information better.
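To make the idea of "relation transition probability" in the BE-Relation-Grid model more concrete, the following is a minimal Python sketch. It is not the dissertation's implementation. It assumes the grid is built the way the entity grid is: one row per sentence, one column per content unit, and each cell holding the unit's relation in that sentence, or "-" when the unit is absent. The function names, the "-" symbol, the rule of keeping the first relation seen for a unit in a sentence, and the toy (unit, relation) pairs are all hypothetical; real input would come from a BE extraction tool.

```python
from collections import Counter

# Relations kept as grid symbols, as listed in the abstract.
RELATIONS = ("subj", "obj", "conj", "nn")
ABSENT = "-"  # assumed marker for "unit does not occur in this sentence"


def build_grid(sentences):
    """Build a relation grid from a summary.

    `sentences` is a list of sentences, each a list of
    (content_unit, relation) pairs. The result maps each content
    unit to its column: one relation symbol per sentence.
    """
    units = sorted({u for s in sentences for u, r in s if r in RELATIONS})
    grid = {u: [ABSENT] * len(sentences) for u in units}
    for i, sentence in enumerate(sentences):
        for unit, relation in sentence:
            # Assumption: keep the first kept relation seen per sentence.
            if relation in RELATIONS and grid[unit][i] == ABSENT:
                grid[unit][i] = relation
    return grid


def transition_probabilities(grid):
    """Relative frequency of each relation transition between
    adjacent sentences, pooled over all columns of the grid."""
    counts = Counter()
    for roles in grid.values():
        counts.update(zip(roles, roles[1:]))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {transition: c / total for transition, c in counts.items()}


# Hand-written toy summary (hypothetical, not parser output).
SENTENCES = [
    [("earthquake", "subj"), ("city", "obj")],
    [("earthquake", "subj"), ("rescuers", "nn")],
    [("earthquake", "subj"), ("city", "obj")],
]

if __name__ == "__main__":
    probs = transition_probabilities(build_grid(SENTENCES))
    for transition, p in sorted(probs.items()):
        print(transition, round(p, 3))
```

In this sketch, a unit that keeps the same relation across adjacent sentences (here "earthquake" as "subj") raises the share of the ("subj", "subj") transition. How the dissertation turns such probabilities into a final coherence score is not stated in the abstract, so that step is left out.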

Keywords/Search Tags:

Multi-document Summarization, Basic Element, Content Unit Selection, Sentence Extraction, Sentence Ordering, Content Coherence

Related items

1	Research On Key Technologies Of Chinese Multi-Document Summarization
2	Research On Summary Sentence Selection And Ordering In Query-focused Multi-document Summarization
3	Research And Application Of Multi-document Automatic Summarization
4	Research On Some Key Technologies Of Sentence Ordering For Information Fusion
5	Research On The Topic-oriented Summarization For Web Documents
6	Chinese Query-Focused Multi-document Summarization Based On Cloud Model
7	The Approach For Event-based Multi-document Automatic Summarization
8	Research On EBM Multi-Document Summarization Technique
9	Research Of Web Multi-document Automatic Summarization
10	Research On Extractive Multi-document Summarization Using Supervised Deep Learning