Design And Implementation Of Multi-Documents Clustering And Summarization On Single-Event News

Posted on: 2015-02-11
Degree: Master
Type: Thesis
Country: China
Candidate: D J Zhang
GTID: 2268330428961662
Subject: Computer technology
Abstract/Summary:
With the proliferation of online news sites, readers are flooded with large volumes of unsorted news and find it increasingly hard to keep up with the pace at which information is updated. There is therefore an urgent need for a news-browsing system that not only gathers articles from major news sites but also classifies and summarizes them. Such an aggregation tool would save considerable time, because readers could quickly focus on what interests them and obtain a refined list of articles to read.
Building on prior research on topic detection and multi-document summarization, this thesis builds a prototype system that integrates single-event clustering and summarization. The system has three main parts: news classification, single-event clustering, and multi-document summarization of single events. The main work of this thesis covers the following two aspects.
First, this thesis implements the core algorithms of the single-event clustering module. After an in-depth study of LDA theory, it combines the VSM and LDA models to compute the similarity between two news articles. Based on this combined similarity, we implement a KNN classifier with similarity-weighted voting to classify the news set. The combined similarity is also used in the Single-Pass algorithm, which clusters the classified news into single events. Experiments were conducted to evaluate the effect of the improved KNN and Single-Pass algorithms.
Second, this thesis builds a multi-document summarization system, described as follows. In the text representation stage, we introduce HowNet into the traditional VSM model to compute the semantic similarity between words, and words whose similarity exceeds a fixed threshold are merged as synonyms, which yields an improved VSM model. The remaining computation is based on this improved VSM.
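The single-event clustering step described above (Single-Pass driven by a similarity that mixes VSM and LDA) can be illustrated with a minimal sketch. The abstract does not state how the two similarities are combined, so the linear interpolation with weight `lam`, the `threshold` value, the max-over-members cluster similarity, and the `vsm`/`lda` dictionary keys are all assumptions made here for illustration, not the thesis's actual formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def combined_similarity(doc_a, doc_b, lam=0.5):
    """Assumed linear mix of VSM (term-weight) and LDA (topic) cosine similarity."""
    vsm_sim = cosine(doc_a["vsm"], doc_b["vsm"])
    lda_sim = cosine(doc_a["lda"], doc_b["lda"])
    return lam * vsm_sim + (1.0 - lam) * lda_sim


def single_pass(docs, threshold=0.6, lam=0.5):
    """Process documents once, in order: join the most similar cluster or open a new one."""
    clusters = []  # each cluster is a list of document indices
    for i, doc in enumerate(docs):
        best_cluster, best_sim = None, -1.0
        for cluster in clusters:
            # Assumption: cluster similarity is the highest similarity to any member.
            sim = max(combined_similarity(doc, docs[j], lam) for j in cluster)
            if sim > best_sim:
                best_cluster, best_sim = cluster, sim
        if best_cluster is not None and best_sim >= threshold:
            best_cluster.append(i)
        else:
            clusters.append([i])
    return clusters
```

Here each document is a dictionary holding a term-weight vector under `vsm` and a topic distribution under `lda`; calling `single_pass(docs)` returns clusters as lists of document indices. Because Single-Pass reads the documents only once, the result depends on input order and on the chosen threshold.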
In the sentence weighting stage, we combine several sentence features with LexRank to compute sentence weights and rank the sentences accordingly. In the sentence extraction stage, MMR is used to produce a non-redundant summary. Finally, we apply some simple ordering rules to output the summary sentences.
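The MMR extraction step can be sketched as a greedy loop that trades a sentence's weight against its similarity to sentences already selected. This is the generic MMR criterion, not necessarily the exact variant used in the thesis. The function name `mmr_select`, the trade-off parameter `lam`, and the use of a precomputed similarity matrix are assumptions for illustration.

```python
def mmr_select(weights, similarity, k, lam=0.7):
    """Greedily pick up to k sentence indices by Maximal Marginal Relevance.

    weights[i]       -- relevance score of sentence i (e.g. a LexRank-based weight)
    similarity[i][j] -- similarity between sentences i and j
    lam              -- assumed trade-off: 1.0 means relevance only
    """
    selected = []
    candidates = list(range(len(weights)))
    while candidates and len(selected) < k:

        def mmr_score(i):
            # Redundancy is the highest similarity to any already selected sentence.
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            return lam * weights[i] - (1.0 - lam) * redundancy

        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

The returned indices are in selection order, so a separate ordering step (such as the simple ordering rules mentioned above) would still be needed before the sentences are output.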
Keywords/Search Tags: multi-document summarization, KNN, Single-Pass, Latent Dirichlet Allocation, LexRank