
Research On Multi-Document Summarization Based On Topic Modeling And Semantic Analysis

Posted on: 2016-07-01
Degree: Master
Type: Thesis
Country: China
Candidate: Y Liu
Full Text: PDF
GTID: 2298330467991802
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of technology, the amount of information people receive each day keeps increasing. Alongside the messages people actually care about, it contains a great deal of redundant information. Multi-document summarization can help people extract the useful content from this large volume of information. Topic models are the current mainstream approach to multi-document summarization, but most researchers in this area use flat topic models, in which the topics have no hierarchical relations.

The hierarchical Latent Dirichlet Allocation (hLDA) model proposed by Blei et al. can not only uncover the latent topics underlying a set of documents but also organize those topics into a hierarchy. This hierarchical structure expresses topic features better when the contents of multiple documents need to be summarized. However, because hLDA modeling results vary, their quality cannot be guaranteed: even with the same parameter settings and the same corpus, the results show some randomness.

Therefore, building on existing work, this thesis summarizes the experimental process of Chinese multi-document summarization using hierarchical topic modeling and semantic analysis. We propose automatic evaluation methods for unsupervised multi-document hLDA modeling results and verify their validity through manual evaluation. We then use the automatic evaluation of hLDA modeling results to adjust the hyperparameters of the model in order to optimize the results. Finally, the thesis compares the hLDA modeling results with those of other models through automatic and manual evaluation. This comparison verifies the superiority of hLDA in Chinese text clustering and confirms the effectiveness of the automatic evaluation method.
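As a rough illustration of why hLDA output forms a hierarchy and can differ from run to run, the sketch below samples document paths from a nested Chinese restaurant process, the tree prior used by Blei et al. for hLDA. It covers only the prior, not hLDA inference or the method of this thesis; the function names, the `gamma` parameter name, and the default values are assumptions chosen for the example.

```python
import random


def sample_path(tree, depth, gamma, rng):
    """Sample one root-to-leaf path from a nested Chinese restaurant process.

    `tree` maps a node (a tuple of child indices from the root) to a list
    holding the number of documents that passed through each of its children.
    """
    node = ()
    path = [node]
    for _ in range(depth - 1):
        counts = tree.setdefault(node, [])
        # An existing child i is chosen with probability counts[i] / total,
        # a new child with probability gamma / total.
        total = sum(counts) + gamma
        r = rng.random() * total
        choice = len(counts)  # default: open a new child
        acc = 0.0
        for i, c in enumerate(counts):
            acc += c
            if r < acc:
                choice = i
                break
        if choice == len(counts):
            counts.append(0)
        counts[choice] += 1
        node = node + (choice,)
        path.append(node)
    return path


def build_tree(num_docs, depth=3, gamma=1.0, seed=0):
    """Assign each of `num_docs` documents a path in a tree of fixed depth."""
    rng = random.Random(seed)
    tree = {}
    paths = [sample_path(tree, depth, gamma, rng) for _ in range(num_docs)]
    return tree, paths
```

Calling `build_tree(20, seed=1)` and `build_tree(20, seed=2)` will generally yield different tree shapes for the same number of documents, which is one source of the run-to-run variation described above. Fixing the seed makes a single run reproducible.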
In addition, this thesis conducts comparative experiments on four aspects of preprocessing: the word segmentation method, whether stop words are removed, how duplicate sentences are handled, and whether a user dictionary is added with synonym replacement. The aim is to find preprocessing steps better suited to hLDA modeling for Chinese multi-document summarization.

This work was supported by the National Natural Science Foundation of China under the projects "hLDA based Chinese multi-document summarization" (project approval number: 61202247) and "On the management of uncertainties in Web2.0 user generated content" (project approval number: 71231002).
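The preprocessing choices compared in the thesis can be pictured as switches on a single pipeline. The following is a minimal sketch under stated assumptions, not the thesis's implementation: sentences are assumed to be already segmented into tokens by some upstream segmenter, and the function name `preprocess`, its parameters, and the order of the steps are hypothetical.

```python
def preprocess(sentences, stop_words=frozenset(), synonyms=None, dedupe=True):
    """Apply optional preprocessing steps to pre-segmented sentences.

    sentences:  list of token lists (segmentation is assumed done upstream)
    stop_words: tokens to drop; an empty set keeps every token
    synonyms:   mapping from a token to its canonical replacement
    dedupe:     if True, keep only the first copy of each identical sentence
    """
    synonyms = synonyms or {}
    seen = set()
    result = []
    for tokens in sentences:
        # Map synonyms to one canonical form, then drop stop words.
        tokens = [synonyms.get(t, t) for t in tokens]
        tokens = [t for t in tokens if t not in stop_words]
        key = tuple(tokens)
        if dedupe and key in seen:
            continue
        seen.add(key)
        if tokens:  # skip sentences left empty after filtering
            result.append(tokens)
    return result
```

Toggling each argument corresponds to one experimental condition, so the same corpus can be passed through several configurations before modeling and the results compared.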
Keywords/Search Tags:hierarchical latent Dirichlet allocation, topic modeling, automatic evaluation methods, super-parameter adjusting