
Study On Multi-Document Summarization Algorithm Based On Fusing Topic Sentences Semantic

Posted on: 2017-11-07
Degree: Master
Type: Thesis
Country: China
Candidate: Z P Liu
Full Text: PDF
GTID: 2348330509453998
Subject: Computer software and theory
Abstract/Summary:
With the exponential growth of text on the Internet, Natural Language Processing has become a major research topic, driven by the need for intelligent machine processing of massive amounts of text. Current international research hotspots in Natural Language Processing, such as Machine Translation, Sentiment Analysis, Semantic Analysis, and Document Summarization, have made considerable progress. Among them, multi-document summarization gives users a way to organize large volumes of information and to extract the important information quickly and effectively. It aims to extract the important thematic content from a set of texts on similar topics and present it to the user as a short, concise, readable text, which improves the efficiency with which users process information.

In research on multi-document summarization, thematic relations and semantic information are critical to understanding the text. The LDA model is a generative model of text; through Gibbs sampling, the words and sentences in a corpus are mapped to a latent topic dimension, revealing the hidden topics. Syntactic parsing of the summary helps reduce the complexity of nested modifiers, and a new information structure can enrich the diversity of expression in summary sentences, which contributes to research on redundancy elimination in automatic summarization. This thesis focuses on generating multi-document summary sentences by combining thematic relations with semantic fusion. The key work and innovations are as follows:

First, a general summarization algorithm framework based on integer linear programming is designed. The algorithm optimizes the selection of important semantic information under each topic and fuses and assembles it into a new summary sentence, taking into account polishing the candidate summary sentences, padding auxiliary information for sentence components, and rewriting noun phrases and verb phrases.
Its significance lies in improving the information coverage and readability of the generated multi-document abstractive summaries.

Second, a topic sentence clustering algorithm, T-means, is proposed by improving the LDA model and the k-means algorithm. It exploits the consistency between the number of latent topics in the clusters of a large document set and the number of latent topics in the sentence set. T-means addresses the problem of estimating the best number of topics for the LDA model, and introduces a new model for computing topic importance, which selects the representative topic dimensions of important sentences as the initial cluster centers. The topic sentence clustering is then completed.

Third, the proposed algorithm is compared with other summarization algorithms in several experiments on the public DUC 2003 and DUC 2004 datasets from the Document Understanding Conferences, and the experimental results are discussed in detail. The results indicate that the ROUGE scores of the proposed multi-document abstractive summarization algorithm are clearly better than those of extractive and compression-based summarization in terms of information richness and readability.
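
The abstract does not give the details of T-means, so the following is only a minimal sketch of the general idea it describes: running k-means over sentence-topic distributions, with the initial cluster centers taken from sentences that represent the most important topics. The seeding rule (topic importance approximated by total topic weight, one highest-loading sentence per topic), the function names `seed_centers` and `kmeans`, and the toy numbers are all assumptions made for illustration, not the thesis's actual method.

```python
import math


def seed_centers(theta, k):
    """Pick k initial centers from sentence-topic rows (assumed seeding rule).

    For each of the k heaviest topics, take the sentence that loads most
    strongly on that topic.
    """
    n_topics = len(theta[0])
    # Total weight of each topic over all sentences; a stand-in for "importance".
    mass = [sum(row[t] for row in theta) for t in range(n_topics)]
    top_topics = sorted(range(n_topics), key=lambda t: mass[t], reverse=True)[:k]
    return [list(max(theta, key=lambda row: row[t])) for t in top_topics]


def kmeans(theta, centers, max_iter=100):
    """Plain Lloyd iterations starting from the given centers."""
    labels = None
    for _ in range(max_iter):
        # Assign every sentence to its nearest center (Euclidean distance).
        new_labels = [
            min(range(len(centers)), key=lambda c: math.dist(row, centers[c]))
            for row in theta
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each center to the mean of its members.
        for c in range(len(centers)):
            members = [row for row, lab in zip(theta, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy sentence-topic distributions (made-up numbers, three topics).
theta = [
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.10, 0.85, 0.05],
    [0.05, 0.90, 0.05],
    [0.10, 0.10, 0.80],
    [0.05, 0.15, 0.80],
]
centers = seed_centers(theta, k=3)
labels = kmeans(theta, centers)
print(labels)
```

In this sketch the number of clusters is simply set equal to the number of topics, mirroring the consistency between topic counts mentioned above. In practice the rows of `theta` would come from an LDA model fitted to the sentence set rather than from hand-written values.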
Keywords/Search Tags: Multi-document automatic summarization, Topic sentence clustering, Latent Dirichlet Allocation, Information fusion