
Research On Key Techniques Of Multiple Documents Automatic Summarization

Posted on: 2008-03-07; Degree: Doctor; Type: Dissertation
Country: China; Candidate: Y D Xu; Full Text: PDF
GTID: 1118360245496623; Subject: Computer application technology
Abstract/Summary:
Multi-document automatic summarization extracts important information, or information of interest to a user, from a set of texts on the same topic and automatically generates a summary of fixed length. It is an applied technique that draws on several research fields, including linguistics, computational linguistics, artificial intelligence, and information systems, so progress in multi-document summarization can in turn contribute to those fields. A workable multi-document summarization system also has practical value for improving the speed and precision of web information processing.

This dissertation therefore studies generic multi-document automatic summarization based on discourse structure. We first study the discourse relations between pairs of text units: similarity relations between units from different documents, extraction of temporal information and identification of temporal relations between events, identification of rhetorical structure, and extraction of hierarchical topics. We then propose a multi-document rhetorical structure (MRS) for representing a document set. By representing the relationships between text units at different levels of granularity, together with the occurrence and development of events along the time dimension, this structure fuses information from multiple documents in parallel while preserving the original information of the related document set. Finally, we propose a series of algorithms based on MRS for summary sentence extraction, summary ordering, and summary generation. The work consists of four parts.

First, we study temporal information extraction and temporal semantic calculation for Chinese, as well as temporal reasoning and the identification of temporal relations between events. Temporal information in text is very important for anchoring nodes, identifying key events, ordering events, and revising summary content. Based on how Chinese text expresses temporal information, we decompose each time-bearing temporal phrase into small elements that each carry a single meaning and are easy to extract, and then combine these elements into temporal expressions using integration rules. During this process, the final temporal semantic value and the temporal relations between events are calculated.

Second, we study how to calculate the similarity of text units. Semantic similarity relations exist between units from different documents, and they are an important cue for finding important summary sentences. Because the semantic similarity of text units cannot be calculated with strategies designed for full-document similarity, we propose a unit similarity method based on multi-feature fusion. It mines as many useful features as possible and fuses them automatically by machine learning, which avoids the information loss caused by traditional text representations that rely on words or concepts alone. We use a logistic regression model to automatically fit the relation between the features and text unit similarity. This model fits well, makes it easy to add new features or remove existing ones, and is therefore more extensible.

Third, because automatic topic identification is a key technique in summarization, we propose the notion of hierarchical topics, based on an analysis of the topic distribution and topic boundaries of a text set, and use a hierarchical tree in place of the traditional single-layer topic structure. We believe this reflects the true content of a text set more effectively. Concretely, we use a hierarchical clustering algorithm to build the hierarchical topic tree and identify the inflection point of a density curve to obtain the clustering threshold automatically.

Fourth, building a sound formal representation of the text set is the foundation for the subsequent work. In describing cross-document structure theory (CST), Dragomir R. Radev proposed two basic data structures: the cube and the graph. The cube structure considers the influence of temporal information on topic identification in a text set. The graph structure divides the relationships between text units into multiple fine-grained rhetorical relationships. Inspired by this idea, we propose a multi-document rhetorical structure (MRS) and design a series of algorithms based on it for summary sentence extraction, summary ordering, and summary generation. MRS consists of nodes, which represent text units, and links, which represent the relations between those units. The links carry rhetorical relations, which determine the importance of a unit within its text, and similarity relations, which capture the similarity between a unit and all related nodes from other documents. The temporal information of a unit reflects the occurrence and development of the event that the node describes. Combining these factors therefore determines the importance of a node in the whole document set.

Finally, we propose an evaluation scheme for multi-document automatic summarization in which each single standard summary sentence in the text set is extended to a set of standard summary sentences, which makes the measured precision and redundancy of summaries more reasonable. Our experimental results show that the MRS-based multi-document summarization system can generate summaries of good quality.
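
The feature-fusion idea from the second part can be illustrated with a minimal Python sketch, in which a small logistic regression maps pair-level features to a similarity score. The two features (word-overlap Jaccard and length ratio), the whitespace tokenizer, the toy training pairs, and all function names are assumptions made for illustration only; the abstract does not state which features or training data the dissertation uses, and Chinese text would need a proper word segmenter.

```python
import math


def tokenize(text):
    # Assumption: whitespace tokenization of English toy data.
    return text.lower().split()


def pair_features(unit_a, unit_b):
    """Return a feature vector for a pair of text units (hypothetical features)."""
    tokens_a, tokens_b = tokenize(unit_a), tokenize(unit_b)
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    jaccard = len(set_a & set_b) / len(union) if union else 0.0
    longest = max(len(tokens_a), len(tokens_b))
    length_ratio = min(len(tokens_a), len(tokens_b)) / longest if longest else 0.0
    return [jaccard, length_ratio]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, features):
    # Logistic regression: sigmoid of a weighted sum of the features.
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def train(samples, labels, learning_rate=0.5, epochs=2000):
    """Fit weights by stochastic gradient descent on the logistic loss."""
    weights, bias = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            error = predict(weights, bias, features) - label
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


# Toy labeled pairs (invented for this sketch): 1 = similar, 0 = not similar.
TRAINING_PAIRS = [
    ("the flood hit the city on monday", "on monday the flood hit the city", 1),
    ("rescue teams arrived at the scene", "rescue teams reached the scene quickly", 1),
    ("the flood hit the city on monday", "stock prices rose sharply", 0),
    ("rescue teams arrived at the scene", "the election results were announced", 0),
]

WEIGHTS, BIAS = train(
    [pair_features(a, b) for a, b, _ in TRAINING_PAIRS],
    [label for _, _, label in TRAINING_PAIRS],
)


def similarity(unit_a, unit_b):
    """Score a pair of text units with the fitted model (value between 0 and 1)."""
    return predict(WEIGHTS, BIAS, pair_features(unit_a, unit_b))


print(similarity("the flood hit the city", "the flood hit the town"))
print(similarity("the flood hit the city", "markets closed higher today"))
```

In this sketch, adding or removing a feature only changes the vector returned by `pair_features`; the training loop stays the same, which mirrors the extensibility the abstract attributes to the logistic regression model.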
Keywords/Search Tags: Multiple Documents Automatic Summarization, Temporal Information Processing, Text Units Similarity, Hierarchical Topic Identification, Multi-Document Rhetorical Structure