
Multi-document Summarization Based On Patterns With Wildcards And Probabilistic Topic Modeling

Posted on: 2017-01-07
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J P Qiang
Full Text: PDF
GTID: 1318330512468660
Subject: Computer application technology
Abstract/Summary:
With the rapid development of information technology, a huge amount of electronic documents are available online, such as Web news, scientific literature, digital books, email, and microblogging. How to effectively organize and manage such a vast amount of text data, and how to present the needed information to users conveniently, have become challenges in the field of intelligent information processing. Therefore, now more than ever, users need access to robust text summarization systems, which can effectively condense the information found in a large number of documents into a short, readable synopsis, or summary. In recent years, with the rapid development of e-commerce and social networks, we can also obtain a large amount of short texts, e.g., book reviews, movie reviews, online chat messages, and product introductions. A short text often contains useful information that can help to learn hidden topics among texts. Meanwhile, only very limited word co-occurrence information is available in short texts compared with long texts, so traditional multi-document summarization algorithms cannot work well on them. Thus, how to generate a summary from multiple documents has important research and practical value.

In this thesis, we study multi-document summarization (MDS) on long texts and on short texts, and propose several multi-document summarization algorithms based on patterns with wildcards and probabilistic topic modeling. Our main contributions are as follows.

(1) A novel pattern-based model for generic multi-document summarization is proposed. There are two main categories of multi-document summarization methods: term-based and ontology-based. A term-based method cannot deal with the problems of polysemy and synonymy. An ontology-based approach addresses these problems by taking into account the semantic information of document content, but constructing an ontology requires a great deal of manpower. To overcome these open problems, this thesis presents a pattern-based model for generic multi-document summarization, which exploits closed patterns to extract the most salient sentences from a document collection and to reduce redundancy in the summary. Our method computes the weight of each sentence in the document collection by accumulating the weights of the closed patterns covered by that sentence, and iteratively selects the sentence that has the highest weight and low similarity to the previously selected sentences, until the length limit is reached. Our method combines the advantages of the term-based and ontology-based models while avoiding their weaknesses. Empirical studies on the benchmark DUC2004 datasets demonstrate that our pattern-based method significantly outperforms state-of-the-art methods.
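The following Python sketch is purely illustrative and is not the thesis implementation: it assumes the closed patterns have already been mined and weighted, checks pattern coverage by simple term containment (ignoring wildcard and gap constraints), and uses Jaccard overlap as a stand-in for the sentence similarity measure; the names sentence_weight, overlap, summarize and the sim_threshold parameter are hypothetical.

# Illustrative sketch only: greedy summary construction driven by closed-pattern
# weights, with a simple overlap check to reduce redundancy.

def sentence_weight(sentence_terms, pattern_weights):
    """Sum the weights of the closed patterns that the sentence covers."""
    return sum(w for pattern, w in pattern_weights.items()
               if set(pattern) <= sentence_terms)

def overlap(a, b):
    """Jaccard similarity between two sentences; a stand-in for the actual
    similarity measure."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def summarize(sentences, pattern_weights, length_limit, sim_threshold=0.5):
    """Iteratively add the highest-weight sentence that is not too similar to
    the already selected sentences, until the length limit is reached."""
    ranked = sorted(sentences, reverse=True,
                    key=lambda s: sentence_weight(set(s.split()), pattern_weights))
    summary, length = [], 0
    for s in ranked:
        n_words = len(s.split())
        if length + n_words > length_limit:
            continue
        if all(overlap(s, t) < sim_threshold for t in summary):
            summary.append(s)
            length += n_words
    return summary

The greedy loop mirrors the description above: rank sentences by accumulated pattern weight, then admit a sentence only if it adds little redundancy and fits within the length budget.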
(2) A new MDS paradigm called user-aware multi-document summarization is proposed. The aim of MDS is to meet the demands of users, and readers' comments carry implicit information about what users care about. Therefore, the summaries generated from the reports of an event should be salient with respect to not only the reports but also the comments. Recently, Bayesian models have been successfully applied to multi-document summarization, showing state-of-the-art results in summarization competitions. News articles are often long texts, while tweets and news comments are typically short texts; in this thesis, a corpus that includes both short and long texts is referred to as heterogeneous text. Long-text topic modeling views each text as a mixture of probabilistic topics, whereas short-text topic modeling adopts the simple assumption that each text is sampled from only one latent topic. For heterogeneous texts, neither the methods developed only for long texts nor those developed only for short texts can generate satisfying results. In this thesis, we present an innovative method to discover latent topics from a heterogeneous corpus containing both long and short texts, and then apply the learned topics to summary generation. Experiments on real-world datasets validate the effectiveness of the proposed model in comparison with other state-of-the-art models.

(3) A new short-text topic model based on word embeddings is proposed. Existing methods such as probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA) cannot handle short texts very well, since only very limited word co-occurrence information is available in them. Building on recent results in word embeddings, which learn semantic representations of words from a large corpus, we introduce a novel method, Embedding-based Topic Modeling (ETM), to learn latent topics from short texts. ETM not only alleviates the problem of very limited word co-occurrence information by aggregating short texts into long pseudo-texts, but also utilizes a Markov Random Field regularized model that gives correlated words a better chance of being put into the same topic. Experiments on real-world datasets validate the effectiveness of our model compared with state-of-the-art models.
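As a rough illustration of the pseudo-text idea only (not the ETM implementation), the sketch below clusters short texts by their averaged word embeddings and concatenates each cluster into a long pseudo-text, from which richer word co-occurrence statistics can be estimated; the Markov Random Field regularization described above is omitted, pretrained embeddings are assumed to be available as a word-to-vector mapping, and the names text_vector, build_pseudo_texts and the n_pseudo parameter are hypothetical.

# Illustrative sketch only: aggregate short texts into long pseudo-texts by
# clustering their averaged word embeddings, so that a standard topic model can
# observe richer word co-occurrence. MRF regularization is not shown.

import numpy as np
from sklearn.cluster import KMeans

def text_vector(text, embeddings, dim):
    """Average the embeddings of the in-vocabulary words of a short text."""
    vecs = [embeddings[w] for w in text.split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def build_pseudo_texts(short_texts, embeddings, dim, n_pseudo=50):
    """Cluster short texts into n_pseudo groups and concatenate each group
    into one long pseudo-text."""
    X = np.vstack([text_vector(t, embeddings, dim) for t in short_texts])
    labels = KMeans(n_clusters=n_pseudo, n_init=10).fit_predict(X)
    groups = [[] for _ in range(n_pseudo)]
    for text, label in zip(short_texts, labels):
        groups[label].append(text)
    return [" ".join(g) for g in groups]

The resulting pseudo-texts could then be fed to a conventional topic model, while ETM itself additionally couples correlated words through the MRF regularizer described above.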
Keywords/Search Tags:Multi-document Summarization, Wildcards, Sequential Pattern, Topic Modeling, Short text, Heterogeneous Text, Word Embeddings