Chinese Multi-Document Summarization Based On hLDA Hierarchical Topic Model

Posted on:2014-01-14

Degree:Master

Type:Thesis

Country:China

Candidate:P A Liu

GTID:2248330398470707

Subject:Computer Science and Technology

Abstract/Summary:

About 800 EB of new content is generated on the Internet every day, which would take about 1.68 billion DVD discs to store, and a large portion of this daily information is text. Given this volume of text, it is an urgent and important task to provide an effective representation mechanism that helps users browse and grasp the main themes of the content as quickly as possible. The essence of accessing the topics of a large body of text is to reduce the dimensionality of multiple texts on a similar topic, find the core topics closely related to the topic description, and present the user with a short, readable summary. This task can be divided into two sub-tasks. One is to find the topics contained in the documents. The other is to find a way to form a short, readable summary.

For the first sub-task, we introduce the hLDA (hierarchical Latent Dirichlet Allocation) topic model to explore the latent topics and their hierarchical relationships in a large text corpus. hLDA is a Bayesian nonparametric probabilistic model. It avoids the linear growth of the number of latent topics with the size of the corpus that occurs in the LDA topic model, and it learns the topics and their hierarchical relationships automatically from the text data. From the viewpoint of dimension reduction, hLDA maps multiple related documents from a high-dimensional bag-of-words representation to a low-dimensional representation over the topics of these documents. hLDA uses the nCRP (nested Chinese Restaurant Process) to model the tree-structured hierarchy of topics in a document set. Under hLDA, a document may contain multiple topics, and these topics lie on a single path in the hierarchy tree. This path can also be shared by other documents. With this modeling process, we can implement topic discovery and topic clustering.

For the second sub-task, this thesis proceeds in two steps.
First, we choose a summary sentence extraction method based on the hierarchical topic model. The principles of sentence extraction are as follows:
1. The topic contained in the sentence to be extracted must be of high importance.
2. The sentence must be strongly representative of the topic it belongs to.
3. The words in the sentence to be extracted must be at a higher level of abstraction.

Second, for the sake of human readability, we sort and polish the sentences extracted in step one. For sorting, we use a generic sentence ordering method that orders sentences by time. It selects a certain time as a reference point, resolves the other, relative time expressions to absolute times against that reference, and sorts the sentences accordingly.

Based on an analysis of hLDA topic model theory, we first verify the superiority of text clustering based on the hLDA topic model through a comparison test, then extract sentences by multi-feature fusion, and finally generate the summary. Analysis of the experimental results shows the effectiveness and practical applicability of this method.
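
As a rough illustration of the nested Chinese Restaurant Process mentioned in the abstract, the sketch below draws a root-to-leaf path for each document from the nCRP prior alone. It is not the thesis's implementation: the function name sample_ncrp_path, the concentration parameter gamma, the fixed tree depth of 3, and the dictionary-of-counts tree representation are all assumptions made for this example, and the word likelihood that full hLDA inference would also use is left out.

```python
import random
from collections import defaultdict


def sample_ncrp_path(tree, depth, gamma, rng):
    """Draw one root-to-leaf path of `depth` nodes and update visit counts.

    `tree` maps a node (a tuple of child indices from the root) to a list
    of counts, one per existing child: how many earlier documents chose it.
    (Hypothetical representation, chosen only for this sketch.)
    """
    node = ()  # the root is the empty tuple
    path = [node]
    for _ in range(depth - 1):
        counts = tree[node]
        total = sum(counts) + gamma
        r = rng.random() * total
        # Existing child k is chosen with probability counts[k] / total;
        # a brand-new child is chosen with probability gamma / total.
        choice = len(counts)
        acc = 0.0
        for k, c in enumerate(counts):
            acc += c
            if r < acc:
                choice = k
                break
        if choice == len(counts):
            counts.append(0)  # open a new branch under this node
        counts[choice] += 1
        node = node + (choice,)
        path.append(node)
    return path


rng = random.Random(0)
tree = defaultdict(list)
# Assign 20 hypothetical documents to paths in a 3-level topic tree.
paths = [sample_ncrp_path(tree, depth=3, gamma=1.0, rng=rng) for _ in range(20)]
for p in paths[:5]:
    print(p)
```

Because popular branches attract more documents while a new branch always remains possible, documents sharing a path end up sharing topics, and the number of branches is not fixed in advance. A larger gamma in this sketch makes new branches more likely.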

Keywords/Search Tags:

Chinese Multi-document Summarization, Hierarchical topic model, nested Chinese Restaurant Process, Bayesian nonparametric

Related items

1	Chinese Query-Focused Multi-document Summarization Based On Cloud Model
2	Research On Parsing And Multi-Document Summarization Based On Generative Probabilistic Models
3	Research On Key Technologies Of Chinese Multi-Document Summarization
4	Chinese Multi-document Automatic Summarization Extraction Based On The Combination Of LDA And TextRank
5	Research And Implementation On Chinese Web Pages Summarization
6	A Study Of Chinese Multi-document Summarization Based On Adaptive Clustering Algorithm
7	Multi-Document Automatic Summarization Of Chinese
8	Design And Realization For Automatic Summarization System Of Search Engine On Chinese Web Document Of Science And Technology
9	Research On Key Technologies Of Chinese Multi-Document Summarization
10	The Research And Implementation Of Single-document Chinese Text Summarization System