Chinese Multi-Document Summarization Based on the hLDA Hierarchical Topic Model

Posted on: 2014-01-14
Degree: Master
Type: Thesis
Country: China
Candidate: P A Liu
GTID: 2248330398470707
Subject: Computer Science and Technology
Abstract/Summary:
About 800 EB of new content is generated on the Internet every day, which means that about 1.68 billion DVD discs would be needed to store all of it, and a large portion of this daily information is in the form of text. Given this volume of text, it is an urgent and important task to provide an effective representation mechanism for text information and to help users browse and grasp the themes of the content as quickly as possible. The essence of accessing the topics of a large body of text is to reduce the dimensionality of multiple texts on similar topics, find the core topics closely related to the topic description, and present the user with a short, readable summary. This task can be divided into two sub-tasks. The first is to find the topics contained in the documents. The second is to find a way to form a short, readable summary.

For the first sub-task, we introduce the hLDA (hierarchical Latent Dirichlet Allocation) topic model to discover the latent topics and their hierarchical relationships in a large text corpus. hLDA is a Bayesian nonparametric probabilistic model. It avoids the linear growth in the number of latent topics with corpus size that occurs in the LDA topic model, and it learns the topics and their hierarchical relationships automatically from the text data. From the viewpoint of dimension reduction, hLDA reduces multiple related documents from a high-dimensional bag-of-words representation to a low-dimensional representation over the topics of these documents. hLDA uses the nCRP (nested Chinese Restaurant Process) to model the hierarchical tree structure of topics in a document set. Under hLDA, a document may contain multiple topics, and these topics lie on a single path in the hierarchy tree. This path can also be shared by other documents. With this hLDA modeling process, we can implement topic discovery and topic clustering.

The second sub-task is completed in two steps.
First, we adopt a summary sentence extraction method based on the hierarchical topic model. The principles of sentence extraction are as follows:
1. The topic contained in the sentence to be extracted must be of high importance.
2. The sentence must be strongly representative of the topic it belongs to.
3. The words in the sentence to be extracted must be at a higher level of abstraction.

Second, for the sake of human readability, we sort and polish the sentences extracted in the first step. For sorting, we use a generic sentence ordering method based on time. It selects a certain time as a reference point, resolves the other, relative time expressions to absolute times against that reference, and sorts the sentences accordingly.

Based on an analysis of hLDA topic model theory, we first verify the superiority of text clustering based on the hLDA topic model through comparison tests, then extract sentences by multi-feature fusion, and finally generate the summary. Analysis of the experimental results shows the effectiveness and practical applicability of this method.
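
To illustrate how the nested Chinese Restaurant Process mentioned above lets documents share paths in a topic tree, the following is a minimal sketch that only samples paths from the nCRP prior. It is not the thesis's implementation and does not perform hLDA inference. The function name `sample_ncrp_paths`, the fixed tree `depth`, and the concentration parameter `gamma` are assumptions introduced here for illustration.

```python
import random
from collections import defaultdict


def sample_ncrp_paths(num_docs, depth, gamma, seed=0):
    """Draw one root-to-leaf path per document from an nCRP prior.

    Hypothetical helper for illustration: `depth` is the assumed fixed
    number of levels, `gamma` the assumed concentration parameter.
    """
    rng = random.Random(seed)
    # children[node] maps child id -> number of documents routed through it
    children = defaultdict(dict)
    next_id = 1  # node 0 is the root, shared by every document
    paths = []
    for _ in range(num_docs):
        node, path = 0, [0]
        for _ in range(depth - 1):
            counts = children[node]
            total = sum(counts.values())
            # Existing child c is chosen with probability counts[c] / (total + gamma),
            # a brand-new child with probability gamma / (total + gamma).
            r = rng.uniform(0, total + gamma)
            chosen = None
            for child, n in counts.items():
                if r < n:
                    chosen = child
                    break
                r -= n
            if chosen is None:
                chosen = next_id
                next_id += 1
            counts[chosen] = counts.get(chosen, 0) + 1
            node = chosen
            path.append(node)
        paths.append(path)
    return paths


if __name__ == "__main__":
    for doc_id, path in enumerate(sample_ncrp_paths(5, 3, gamma=1.0)):
        print(doc_id, path)
```

In this sketch, popular branches attract more documents, so related documents tend to share the upper part of a path, while `gamma` controls how readily new branches (new topics) are opened. A full hLDA sampler would additionally condition these choices on the words of each document.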
Keywords/Search Tags: Chinese Multi-document Summarization, Hierarchical topic model, nested Chinese Restaurant Process, Bayesian nonparametric