
The Study On Extractive Multidocument Summarization

Posted on: 2012-03-21
Degree: Master
Type: Thesis
Country: China
Candidate: H M Shao
Full Text: PDF
GTID: 2218330338963714
Subject: Computer system architecture
Abstract/Summary:
With the prevalence of the Internet and personal mobile devices, people have to deal with large amounts of information every day. They need tools to help them extract the important information from large collections of documents, which has motivated intensive research on automatic summarization techniques in recent years. In this thesis we focus on the extractive, generic multi-document summarization task. We study one family of methods based on a generative topic model of textual data, LDA, and propose another family of methods based on a learned ranking function that can identify the best summaries encountered during search. The main contributions of our work are:

1. We study the automatic multi-document summarization problem in a greedy framework, where sentence selection is reduced to measuring the contribution of each sentence to the thematic construction of the summary. As topic models have gained increasing attention in this field, we concentrate on scoring sentences under the probabilistic topic model LDA (Latent Dirichlet Allocation). We develop consistent probabilistic representations of the relations between texts and topics based on the output of LDA, and propose two scoring methods using these representations. Evaluation results on the DUC 2002 test set using ROUGE metrics show the pertinence of these probabilities and the effectiveness of our scoring methods. In addition, sentence length, an important factor in document summarization, is also studied. (A sketch of such a scoring scheme is given after this list.)

2. We propose a new idea of using a learned prediction function to search for high-quality summaries. Traditional methods use heuristic quality measures for summaries, explicitly or implicitly, and these measures lack objectivity. We argue that, without an objective summary-quality prediction function, the search for good summaries in these techniques is essentially blind. We discuss the possible forms of the learned prediction function and analyze their feasibility. We describe the desired properties of the underlying quality features used in learning, and quantify them for the convenience of feature selection. We also discuss possible uses of the learned prediction function.

3. We design an approach to learn the prediction function and construct a high-quality summary search system called RBSS. Instead of learning a regression function to predict scores, we learn a ranking function by borrowing learning-to-rank techniques from the IR and machine learning communities. The learned ranking function can order lists of summaries according to their quality, and these orderings help to find the genuinely good summaries during the search. We present four ranking features based on unigram and co-occurrence information, which are considered robust to changes in the source texts. We also give a stochastic method for constructing the training set that works well in practice. Based on this work, we design RBSS using genetic programming: in RBSS, the fitness of an individual is determined by its position in the ordering given by the learned ranking function at that moment. Evaluation results show that summaries generated by our approach obtain higher ROUGE-1 scores than the best systems in DUC 2002. (The pairwise ranking and the rank-driven genetic search are sketched below.)
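The abstract does not reproduce the two LDA-based scoring formulas, so the following is only a minimal sketch of one plausible scheme: a sentence is scored by how closely its inferred topic mixture matches the document set's topic mixture. The names phi, theta_doc, infer_topic_mixture, and score_sentence are illustrative, not taken from the thesis.

```python
import numpy as np

def infer_topic_mixture(word_ids, phi, theta_doc):
    """Approximate a sentence's topic mixture from the LDA topic-word
    matrix phi (K x V) and the document-level mixture theta_doc (K,):
    take the posterior over topics for each word, then average."""
    # P(topic k | word w) is proportional to theta_doc[k] * phi[k, w]
    post = theta_doc[:, None] * phi[:, word_ids]      # shape: K x len(sentence)
    post /= post.sum(axis=0, keepdims=True) + 1e-12
    return post.mean(axis=1)                          # shape: (K,)

def score_sentence(word_ids, phi, theta_doc):
    """Score a sentence by the negative KL divergence between its topic
    mixture and the document mixture; higher means more representative."""
    theta_sent = infer_topic_mixture(word_ids, phi, theta_doc)
    kl = np.sum(theta_sent * np.log((theta_sent + 1e-12) / (theta_doc + 1e-12)))
    return -kl
```

In the greedy framework described in contribution 1, sentences would be ranked by this score and added to the summary until the length budget is reached; the sentence-length factor studied in the thesis could be folded in as an additional penalty term.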
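The learned ranking function of contribution 3 is described only abstractly here. A minimal pairwise (RankNet-style) sketch is shown below, assuming each candidate summary is already represented by a feature vector (for instance, the four unigram and co-occurrence features mentioned above) and that reference-based ROUGE scores are available for training; the actual learning-to-rank algorithm and training-set construction used in the thesis may differ.

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression

def train_pairwise_ranker(features, quality):
    """Reduce 'summary A is better than summary B' pairs to binary
    classification on feature differences, then fit a linear model.
    features: (n, d) array; quality: (n,) reference-based scores."""
    X_pairs, y_pairs = [], []
    for i, j in combinations(range(len(quality)), 2):
        if quality[i] == quality[j]:
            continue
        diff = features[i] - features[j]
        label = 1 if quality[i] > quality[j] else 0
        X_pairs.append(diff);  y_pairs.append(label)
        X_pairs.append(-diff); y_pairs.append(1 - label)   # symmetric pair
    clf = LogisticRegression().fit(np.array(X_pairs), np.array(y_pairs))
    return clf.coef_.ravel()                               # linear ranking weights

def rank_candidates(features, w):
    """Order candidate summaries (best first) by the learned linear score."""
    return np.argsort(-(features @ w))
```

The key property exploited during search is only the ordering the ranker induces over a list of candidate summaries, not the absolute scores, which is why a ranking function rather than a regression function is learned.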
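RBSS itself is described as a genetic-programming-based searcher whose fitness comes from the ranking position assigned by the learned ranker to each individual in the current population. The sketch below substitutes a simple genetic algorithm over binary sentence-selection vectors to illustrate that idea; rbss_style_search and score_fn are hypothetical names, and the actual RBSS representation and operators are not given in this abstract.

```python
import numpy as np

def rbss_style_search(n_sents, score_fn, pop_size=40, generations=200, seed=0):
    """Genetic search over binary sentence-selection vectors.
    score_fn(individual) -> learned ranking score of the encoded summary
    (a stand-in for the trained ranker). Fitness is the individual's rank
    position within the current population, mirroring rank-based fitness."""
    rng = np.random.default_rng(seed)
    pop = rng.integers(0, 2, size=(pop_size, n_sents))
    best, best_score = None, -np.inf

    for _ in range(generations):
        scores = np.array([score_fn(ind) for ind in pop])
        if scores.max() > best_score:
            best, best_score = pop[scores.argmax()].copy(), scores.max()

        # rank-based fitness: the best individual gets the largest value
        fitness = scores.argsort().argsort() + 1

        # fitness-proportional selection of parents
        probs = fitness / fitness.sum()
        parents = pop[rng.choice(pop_size, size=pop_size, p=probs)]

        # one-point crossover on consecutive parent pairs
        children = parents.copy()
        for k in range(0, pop_size - 1, 2):
            cut = rng.integers(1, n_sents)
            children[k, cut:], children[k + 1, cut:] = (
                parents[k + 1, cut:].copy(), parents[k, cut:].copy())

        # bit-flip mutation
        flips = rng.random(children.shape) < 0.02
        pop = np.where(flips, 1 - children, children)

    return best
```

A summary-length budget, as required by DUC-style evaluation, would have to be enforced either inside score_fn or by a repair step that drops sentences from over-length individuals.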
Keywords/Search Tags: Multi-document Summarization, LDA, Learning to Rank, Summary Quality Prediction Function, RBSS