Multi-document Summarization Based On Basic Element

Posted on: 2008-10-09
Degree: Doctor
Type: Dissertation
Country: China
Candidate: D X Liu
GTID: 1118360215498501
Subject: Computer software and theory
Abstract/Summary:
With the rapid growth of online information, it is becoming increasingly important to find and describe textual information effectively. Although a search engine makes it easy to retrieve a large number of documents, users still face the tedious burden of reading all of them. Automatic text summarization can alleviate users' browsing burden by providing a condensed version of the original text. Multi-document summarization, which aims to extract the major or user-relevant information from a given set of documents, plays a vital role in Information Retrieval (IR) and has become a hot topic in Natural Language Processing (NLP).

In this dissertation, we investigate four key issues of multi-document summarization: human behavior patterns in content unit selection; content selection based on sentence extraction; ordering strategies for extracted sentences; and an evaluation model for the "coherence" of a summary. The primary research and achievements can be summarized as follows:

1. A content unit selection strategy based on the Basic Element (BE) is proposed. By analyzing the correlation between the frequency of a BE in the original document collection and the probability of its being chosen as summary content, this thesis studies the underlying behavior pattern in manual summarization. Statistical analysis of the dataset for DUC 2004 task 2 shows that manual summaries prefer BEH (BE Head) or BEHM (BE Head or Modifier) units that occur with high frequency in the BE document collection.

2. The influence of user-given topics on content unit selection in user-focused summarization is analyzed. The analysis of the DUC 2005 dataset reveals that human summarizers use the given topic as a reference. First, they find the sentences (key sentences) in the original document collection that contain content units from the given topic, and treat the content units around these key sentences as candidates. Then they select the high-frequency content units among these candidates for the final summary.

3. A sentence extraction method based on BE vector clustering is proposed, in which the BE is employed as the content unit. Evaluation results on DUC 2004 task 2 show that this method outperforms the one that uses words as content units. Moreover, this thesis presents an adaptive method to detect the number of clusters automatically and a global search strategy to extract the representative sentence from each cluster, so that the choice of which sentences appear in the summary is made from a global perspective. The results show that automatically detecting the number of clusters is superior to fixing the summary length or fixing the number of clusters arbitrarily. In addition, choosing sentences from each cluster from a global perspective is better than directly extracting the centroid sentence of each cluster. Because the number of clusters is difficult to determine, another sentence extraction approach, based on a Genetic Algorithm (GA), is proposed to avoid this problem. This approach treats sentence extraction as a Knapsack problem and employs the GA to find a sub-optimal solution.

4. A hybrid model for sentence reordering is proposed. This model integrates four kinds of relations between extracted sentences: chronological relation, location relation, dependence relation and topic relation. We construct a directed graph over the extracted sentences, with sentences as vertices and the relations between sentences as edges. The sentences are then reordered using an extension of the PageRank method. Experiments on the datasets for DUC 2005 task 2 and 5 indicate that this hybrid model is better than the other reference models in both performance and robustness.

5. A BE-Relation-Grid based evaluation model for "coherence" is proposed. In this model, we view the BE as the content unit and the "relation" part of the BE as the grammatical role of the content unit. Content coherence is then measured by the BE relation transition probabilities in the BE-Relation-Grid. Our experiments use the manual summaries in the DUC 2005 dataset as training data and the machine-generated summaries as test data; the performance of the model is measured by the correlation between the scores it assigns and those assigned by humans. The Pearson correlation coefficient is 0.408 when BEs are selected as content units (restricted to BEs whose relation is "subj", "obj", "conj" or "nn"), which is about 66% higher than the result of the entity grid model reported in the literature. The experimental results illustrate that the evaluation model based on the BE-Relation-Grid captures semantic and structural information better.
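
The transition probabilities mentioned in item 5 can be illustrated with a minimal sketch. It assumes the usual entity-grid layout (one row per sentence, one column per content unit, each cell holding the unit's relation in that sentence or "-" when the unit is absent) and estimates the probability of each length-2 relation transition down the columns. The function name `transition_probabilities` and the toy grid are hypothetical and are not taken from the dissertation; the actual model may differ in how it builds the grid and uses the probabilities.

```python
from collections import Counter


def transition_probabilities(grid, n=2):
    """Estimate relation-transition probabilities from a grid.

    grid: list of rows, one per sentence; each row holds the relation
    of every content unit in that sentence ('-' means absent).
    Returns a dict mapping each length-n transition to its relative
    frequency over all columns.
    """
    if not grid:
        return {}
    counts = Counter()
    for col in range(len(grid[0])):
        # Read one content unit's relations down the sentences.
        column = [row[col] for row in grid]
        for i in range(len(column) - n + 1):
            counts[tuple(column[i:i + n])] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {t: c / total for t, c in counts.items()}


# Hypothetical summary: 3 sentences, 2 content units (assumed data).
grid = [
    ["subj", "-"],
    ["obj", "nn"],
    ["-", "nn"],
]
probs = transition_probabilities(grid)
```

In a setup like this, the resulting probability vector would serve as the feature representation of a summary, which a scoring model could then compare against human coherence judgments.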
Keywords/Search Tags: Multi-document Summarization, Basic Element, Content Unit Selection, Sentence Extraction, Sentence Ordering, Content Coherence