Chinese And English Automatic Summarization Based On Topic Modeling

Posted on: 2012-06-04
Degree: Master
Type: Thesis
Country: China
Candidate: M H Zhang
Full Text: PDF
GTID: 2218330368991829
Subject: Computer application technology
Abstract/Summary:
With the rapid development of computer technology and the Internet, the volume of available information is growing explosively, and people's demand for locating information precisely gives a strong impetus to research in natural language processing (NLP). Meanwhile, as research on cross-document information fusion progresses, multi-document summarization (MDS) has become a hot research topic. It can be used in question answering, search engines, topic detection, and other applications.

In this paper, we analyze the existing methods for automatic multi-document summarization in depth and apply a topic model to sentence salience determination. In addition, we use a dynamic model to control redundancy. Finally, we implement an automatic multi-document summarization system based on these methods. Experimental results on the TAC2008 and TAC2009 corpora show that the system achieves good ROUGE performance.

This paper mainly analyzes the two key technologies of multi-document summarization: sentence salience determination and redundancy control. For sentence salience determination, we propose a sentence topic feature based on topic modeling. The results show that the topic feature plays a significant role in MDS, and that combining it with other traditional features further improves system performance. For redundancy control, we use dynamic modeling, and on this basis we also design an update dynamic model for the update summarization task. With the update dynamic model, the summary effectively avoids redundancy with previously seen (history) content. The results on the TAC2008 corpus show that combining the two strategies (sentence salience determination and redundancy control) yields better system performance. In the update summarization task in particular, our result is better than the best result among the submitted systems.
Finally, this paper also evaluates the system on a Chinese corpus before and after adding the topic model and the dynamic model. The results show that topic modeling and dynamic modeling are likewise effective on the Chinese corpus. However, the results for Chinese MDS are clearly worse than those for English MDS. One possible reason is that the Chinese corpus needs more preprocessing, which can affect the performance of the whole system.
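The interplay of the two strategies described above can be illustrated with a minimal Python sketch. It is not the system described in the thesis: the abstract does not specify the topic feature or the dynamic model, so the sketch assumes unigram probabilities as a stand-in for salience and a SumBasic-style update (squaring the weight of every word already covered) as a stand-in for redundancy control. The function names and the `history` parameter are hypothetical.

```python
from collections import Counter


def word_weights(sentences):
    """Unigram probabilities over all sentences (assumed stand-in for a topic feature)."""
    counts = Counter(w for s in sentences for w in s)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def salience(sentence, weights):
    """Average weight of the words in one tokenized sentence."""
    if not sentence:
        return 0.0
    return sum(weights.get(w, 0.0) for w in sentence) / len(sentence)


def summarize(sentences, max_sentences, history=()):
    """Greedily pick sentence indices, discounting words that are already covered.

    sentences: list of tokenized sentences (lists of words).
    history:   tokenized sentences from earlier summaries (update task).
    """
    weights = word_weights(sentences)
    # Update task: words the reader has already seen start out discounted.
    for s in history:
        for w in set(s):
            if w in weights:
                weights[w] **= 2
    chosen = []
    candidates = list(range(len(sentences)))
    while candidates and len(chosen) < max_sentences:
        best = max(candidates, key=lambda i: salience(sentences[i], weights))
        chosen.append(best)
        candidates.remove(best)
        # Redundancy control: squaring a probability below 1 lowers it,
        # so sentences repeating these words score less in later rounds.
        for w in set(sentences[best]):
            weights[w] **= 2
    return chosen


if __name__ == "__main__":
    docs = [
        ["storm", "hits", "coast"],
        ["storm", "hits", "coast"],
        ["rescue", "teams", "arrive"],
    ]
    print(summarize(docs, 2))
    print(summarize(docs, 2, history=[["rescue", "teams", "arrive"]]))
```

In this toy input, the duplicate sentence is skipped in the plain setting because its words were discounted after the first pick, while passing the rescue sentence as history pushes that sentence down instead.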
Keywords/Search Tags: topic modeling, multi-document summarization, latent Dirichlet allocation, natural language processing