Multi-document Summarization Based On HLDA Hierarchical Topic Model

Posted on: 2013-03-19
Degree: Master
Type: Thesis
Country: China
Candidate: H Y Liu
GTID: 2248330371467410
Subject: Computer technology
Abstract/Summary:
With the development of the Internet and the arrival of the knowledge economy, the pace of life keeps accelerating, and people demand faster and higher-quality access to information. Multi-document summarization technology emerged in response and has become a focus of research in Natural Language Processing. Multi-document summarization aims to generate a fluent, concise, and readable summary with broad information coverage from a set of articles on related topics. It meets the need for quick access to information and speeds up information transmission.

This thesis studies a multi-document summarization method based on the hierarchical Latent Dirichlet Allocation (hLDA) topic model and sentence compression. hLDA is a representative fully generative probabilistic model. It can mine topics from large-scale discrete data, adapt automatically as the data set grows, organize the topics into a hierarchical tree, and identify abstraction relationships among topics, thereby achieving a deeper semantic analysis. Once the hierarchical topic model is built, sentences are assigned to different paths. Sentences assigned to the same path share common topics and are strongly related semantically, which makes sub-topics easy to identify.

The major work of this thesis covers the following aspects:

1) Building a hierarchical topic tree with the sentence as the basic processing unit. In the hLDA model, each node in the tree represents a topic, and a topic is composed of different words. A path is generated by selecting nodes from the root to a leaf. Topics near the root are more abstract, and topics near the leaves are more specific. All sentences are automatically assigned to paths, and the sentences assigned to the same path form a cluster, which is the foundation for sentence extraction.

2) The sentence extraction strategy. Based on the results of hierarchical clustering, we extract sentences from different clusters according to the importance of their topics. To weight a sentence, we mainly consider its similarity to the title, its informativeness, its level of abstraction, its length, and other features. Because the summary length is limited, we apply sentence compression to improve the quality and readability of the summary.

3) The evaluation method for multi-document summarization. We evaluate all summaries on the data published by TAC 2010, using the collection of original documents and the expert summaries as references. The experimental results demonstrate the effectiveness of our method.
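The extraction strategy described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation. It assumes each sentence already carries a path assignment from an hLDA sampler (not shown), groups sentences by path, and picks the best-scoring sentence from each cluster. Only two of the listed features are modeled, title similarity and length. The function names, the weights `w_title` and `w_len`, the `target_len` value, and the use of cluster size as a stand-in for topic importance are all assumptions made for illustration. Sentence compression is omitted.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score(sentence, title, target_len=20, w_title=0.7, w_len=0.3):
    """Weighted sum of two features (weights are illustrative assumptions)."""
    tokens = tokenize(sentence)
    title_sim = cosine(Counter(tokens), Counter(tokenize(title)))
    # Length feature: 1.0 at target_len, falling off linearly, floored at 0.
    length_fit = max(0.0, 1.0 - abs(len(tokens) - target_len) / target_len)
    return w_title * title_sim + w_len * length_fit


def extract(sentences, paths, title, max_words=100):
    """Pick the best sentence per path cluster within a word budget.

    `paths` holds one hashable path label per sentence, assumed to come
    from an hLDA sampler that is not part of this sketch.
    """
    clusters = defaultdict(list)
    for sentence, path in zip(sentences, paths):
        clusters[path].append(sentence)
    # Assumption: larger clusters stand in for more important topics.
    ordered = sorted(clusters.values(), key=len, reverse=True)
    summary, used = [], 0
    for cluster in ordered:
        best = max(cluster, key=lambda s: score(s, title))
        n_words = len(tokenize(best))
        if used + n_words <= max_words:
            summary.append(best)
            used += n_words
    return summary


sentences = [
    "The flood destroyed many homes in the city.",
    "Flood waters rose quickly after the storm.",
    "Officials promised aid for the victims.",
]
# Hypothetical root-to-leaf paths, written as tuples of node ids.
paths = [(0, 1, 3), (0, 1, 3), (0, 2, 4)]
title = "City flood destroyed homes"
print(extract(sentences, paths, title))
```

Taking one sentence per path keeps the summary spread across sub-topics, which is the reason the abstract gives for clustering by path before extraction.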
Keywords/Search Tags: multi-document summarization, hierarchical Latent Dirichlet Allocation, topic model, sentence compression