Research On Document Summarization Algorithms And Their Applications

Posted on: 2012-06-30
Degree: Doctor
Type: Dissertation
Country: China
Candidate: F Jin
GTID: 1118330362467997
Subject: Computer Science and Technology
Abstract/Summary:
Document summarization is one way to address the information overload problem caused by the "information explosion". It can also be used to produce concise text for widely used portable devices with small, low-resolution screens, such as cell phones, to help mitigate the inconvenience of reading large sections of text. In this thesis, we focus on several key issues in document summarization: the ranking and selection of text units, structure-aware summary generation, sentence compression, and gene text summarization. The major contributions of this thesis are as follows.

Firstly, we present a systematic comparative study of the two key problems in extractive summarization: the ranking and the selection of text units. Experimental results on standard datasets show the superiority of pairwise and listwise learning-to-rank methods, as well as of Integer Linear Programming (ILP) based selection strategies. We then propose a joint learning framework for summarization that combines generalized perceptron learning with ILP, and experimental results show the effectiveness of this approach. We also study the performance upper bound of extractive methods.

Secondly, we present a novel diversification framework that generates structure-aware summaries under the guidance of pre-defined aspects. The proposed framework attempts to maximize the expected satisfaction of all aspects during summary generation. The aspects and the given document collection are modeled through Labeled Latent Dirichlet Allocation (Labeled LDA). The importance of each aspect and the relevance of each sentence to each aspect are then calculated by probabilistic inference.

Thirdly, we propose a Markov Logic Network based sentence compression method, which compresses English sentences by deleting unimportant words. Through first-order logic formulae, the method incorporates local linguistic features and captures global dependencies between word deletion operations to determine whether a word should be removed. Experimental results on both written and spoken news corpora show that the proposed approach outperforms two state-of-the-art methods.

Finally, we present GeneSum, a gene text summarization system that automatically extracts representative sentences from biomedical documents. The system employs the ListNet learning-to-rank method and incorporates various text features and biological data resources to rank sentences. An Integer Linear Programming based method is then used to choose salient sentences to construct a summary. We evaluate the system on a large dataset of 7,294 genes and conduct an in-depth analysis of the test results.
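The selection step mentioned above (choosing ranked sentences under a summary length limit) can be illustrated with a minimal sketch. This is not the thesis's ILP formulation, which the abstract does not spell out. It assumes the simplest possible objective: maximize the sum of per-sentence scores subject to a word budget, with no redundancy constraints. Under that assumption the problem is a 0-1 knapsack, solved here exactly by dynamic programming rather than by an ILP solver. The function name `select_sentences` and the example scores and lengths are hypothetical.

```python
from typing import List, Sequence, Tuple


def select_sentences(
    scores: Sequence[float], lengths: Sequence[int], budget: int
) -> Tuple[float, List[int]]:
    """Choose sentence indices that maximize total score within a word budget.

    Assumption: sentences are scored independently (no redundancy term),
    so the selection reduces to a 0-1 knapsack problem.
    """
    n = len(scores)
    # best[i][b]: best total score using the first i sentences within b words.
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]  # option 1: skip sentence i-1
            if lengths[i - 1] <= b:
                # option 2: take sentence i-1 if it fits in the budget
                take = best[i - 1][b - lengths[i - 1]] + scores[i - 1]
                if take > best[i][b]:
                    best[i][b] = take

    # Walk the table backwards to recover which sentences were taken.
    chosen: List[int] = []
    b = budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= lengths[i - 1]
    chosen.reverse()
    return best[n][budget], chosen


if __name__ == "__main__":
    # Hypothetical ranker scores and sentence lengths (in words).
    total, picked = select_sentences(
        scores=[0.9, 0.6, 0.5, 0.4], lengths=[8, 4, 4, 3], budget=10
    )
    print(picked, total)
```

In this toy input, picking sentences greedily by score would take only sentence 0 (8 words) and then find nothing else that fits. The exact selection instead returns sentences 1 and 2, whose combined score of about 1.1 is higher. That gap is one reason exact selection can outperform greedy selection.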
Keywords/Search Tags: document summarization, learning to rank, structure-aware summarization, sentence compression, gene text summarization