
Research On Parsing And Multi-Document Summarization Based On Generative Probabilistic Models

Posted on: 2010-05-29
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X Yang
GTID: 1118360302983794
Subject: Computer system architecture
Abstract/Summary:
With the rapid growth of text information on the World Wide Web, natural language processing has become a research hotspot, since it is a key technique that lets computers exploit that information. This thesis addresses two problems in natural language processing: syntactic parsing, a problem on the theoretical side, and automatic multi-document summarization, a problem on the applied side. Syntactic parsing is a key technique for natural language processing, and many applications, such as automatic summarization, machine translation and information retrieval, rely on parsing results. Research on parsing can also deepen the understanding of human language, so parsing is significant both theoretically and practically. Multi-document summarization gives users a clear and concise overview, which helps them process documents quickly, and it receives more and more attention as the number of documents on the Web increases dramatically.
The key problem of natural language parsing is to select the optimal parse tree for a single sentence, and the key problem of automatic multi-document summarization is to select a number of optimal summary sentences. To address both selection problems, generative probabilistic models are used to model both syntactic parsing and multi-document summarization. For parsing, this thesis studies the grammar system, the parsing algorithm and the parsing model; for multi-document summarization, it studies sentence ordering. Experiments are conducted to validate and analyze the effects of the models.
This work was mainly supported by grants from the National Natural Science Foundation and the Natural Science Foundation of Shandong Province.
A new grammar system and a parsing algorithm for it are proposed, and the pruning rule and the new structural information integrated into the parsing model for this grammar are studied. For extractive automatic summarization, a sentence modeling method based on a probabilistic topic model is given to find the latent topics underlying the corpus. Based on the topic model, sentence scoring and redundancy reduction are also studied.
The major research contents and innovations of this dissertation cover the following four aspects.
1 A Binary Combinatorial Grammar, which describes syntax via word combination, is proposed
Dependency grammar has become a research hotspot in grammar representation for natural language processing: because it easily expresses the relations between a headword and its modifiers, it is more suitable for information retrieval than phrase structure grammar. However, lacking internal structure, dependency grammar cannot explicitly express complex syntactic structure, which hinders the identification of grammatical structure.
To address the lack of relative collocation strength between phrases and the lack of internal structure representation in dependency grammar, a Binary Combinatorial Grammar (BCG), which represents syntactic structure through the combination of adjacent headwords, is proposed in this thesis. Based on the combinatorial characteristics of words in BCG, a local priority between adjacent binary relations is introduced into the grammar to describe the relative collocation strength between phrases and to restrict the order of combination.
Introducing internal nodes helps to express and recognize syntactic structure, and introducing local priorities to express relative collocation strength helps to restrict the generation of illegal structures.
2 A syntactic parsing algorithm based on local priority is proposed
The parsing algorithm is an important component of syntactic parsing and directly affects its accuracy and efficiency. For the Binary Combinatorial Grammar proposed in this thesis, the local priority defined in the grammar is integrated into the BCG parsing algorithm as the pruning rule. A BCG parsing algorithm based on local priority is obtained by improving the traditional CYK (Cocke, Younger, and Kasami) chart-parsing algorithm. A parsing experiment was carried out with manually collected grammar rules, and the results show that both the number of resulting parse trees and the time spent by the improved CYK algorithm are significantly lower than those of the traditional CYK algorithm.
3 A parsing model based on restriction of nested level is proposed
In constructing a probabilistic model, a major problem is how to use information about the syntactic structure of the sentence. Current studies mainly consider the degree of dominance and the length of the dependency chain. Because language is harder to understand when there are more nested modifiers, the generation of parse trees is restricted by introducing constraint information on the nested level of modifiers into the generative probabilistic parsing model. This enhances the ability to identify syntactic structure and partially avoids the generation of illegal structures. Building on the priority-based CYK chart algorithm, a BCG parsing model that integrates the restriction of nested level is given.
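The idea of using a priority as a pruning rule inside a CYK chart can be illustrated with a minimal sketch. The abstract does not define the BCG rules or the exact local-priority condition, so everything below is an assumption made for illustration: the toy category labels, the integer priorities attached to rules, the function name cyk_count, and the pruning condition itself (a rule may only combine subtrees whose top rule has an equal or higher priority, so tighter collocations are built first).

```python
from collections import defaultdict

LEAF = float("inf")  # assumption: single words bind tighter than any rule


def cyk_count(words, lexicon, rules, use_priority=False):
    """Count the parse trees of each label that span the whole sentence.

    lexicon maps a word to its set of labels; rules is a list of
    (parent, left, right, priority) binary rules. Each chart cell maps
    (label, priority of the rule at the top of the subtree) to a tree count.
    """
    n = len(words)
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        for label in lexicon[word]:
            chart[i][i + 1][(label, LEAF)] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            k = i + span
            for j in range(i + 1, k):  # split point between the two children
                for parent, left, right, prio in rules:
                    for (l_label, l_prio), l_count in chart[i][j].items():
                        if l_label != left:
                            continue
                        for (r_label, r_prio), r_count in chart[j][k].items():
                            if r_label != right:
                                continue
                            # Hypothetical pruning rule (not from the thesis):
                            # skip the combination if either child was built
                            # by a rule of lower priority than this one.
                            if use_priority and min(l_prio, r_prio) < prio:
                                continue
                            chart[i][k][(parent, prio)] += l_count * r_count
    totals = defaultdict(int)
    for (label, _), count in chart[0][n].items():
        totals[label] += count
    return dict(totals)


# Toy data, invented for this sketch.
WORDS = ["old", "men", "and", "women"]
LEXICON = {"old": {"ADJ"}, "men": {"NP"}, "and": {"CONJ"}, "women": {"NP"}}
RULES = [
    ("NP", "ADJ", "NP", 2),     # modifier attachment: high priority
    ("NP", "NP", "CONJP", 1),   # coordination: low priority
    ("CONJP", "CONJ", "NP", 1),
]

print(cyk_count(WORDS, LEXICON, RULES))
print(cyk_count(WORDS, LEXICON, RULES, use_priority=True))
```

On this toy sentence the plain chart yields two NP trees and the pruned chart keeps one, which mirrors, at a very small scale, the reported reduction in the number of result parse trees. The real BCG priorities are defined between adjacent binary relations of headwords and may behave differently from this per-rule simplification.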
In the parsing experiment, a BCG treebank was built by converting a dependency treebank to BCG form; the syntactic relations between words and the priority information between relations were acquired from it, and the parameters of the probabilistic model were then estimated on the treebank. A Chinese parsing experiment was conducted on the BCG treebank, and the results show that the parsing model using the restriction of nested level achieves higher accuracy on BCG parsing than the parsing model based on the degree of dominance. The influence of local priority was also investigated. The results show that the restrictions of local priority and nested level can effectively avoid the generation of illegal structures.
4 An automatic multi-document summarization method based on the generative probabilistic topic model LDA is presented
Latent Dirichlet Allocation (LDA) is used for sentence modeling to capture latent topic information. Two sentence-scoring methods are proposed based on the word distributions p(w|z) of each topic and the topic distributions p(z|s) of each sentence, both acquired from the LDA model. Sentences that score highly and have little topic overlap with already selected sentences are chosen as summary sentences. In the English summarization experiment, the generic multi-document summarization data provided by the DUC 2002 conference was used as test data, and the ROUGE metrics were used for automatic evaluation.
Evaluated with ROUGE, both proposed methods surpassed the word-frequency-based and other LDA-based summarization systems on all ROUGE scores, with the probabilistic generative model outperforming all other models on every ROUGE metric.
Further work includes the following aspects. To describe the parsing algorithm more accurately, the labels of binary relations are to be integrated into the parsing algorithm as parsing context. In the syntactic parsing model, other useful structural and topic information is to be added to improve parsing accuracy. In multi-document summarization, a syntactic topic model will be used for sentence modeling, so that both syntactic and thematic information are taken into account to improve the summarization results.
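The sentence selection described in aspect 4 can be sketched as follows. The abstract does not give the formulas of the two scoring methods, so this is only one plausible reading, under stated assumptions: a sentence is scored by the average over its words of the sum over topics of p(w|z) p(z|s), and topic overlap with already selected sentences is measured by cosine similarity between their topic distributions. The function names, the max_overlap threshold and the toy probabilities are all hypothetical, and the distributions are taken as given instead of being estimated by an LDA implementation.

```python
import math


def sentence_score(word_ids, p_w_given_z, p_z_given_s):
    """Mean over the sentence's words of sum_z p(w|z) * p(z|s)."""
    total = 0.0
    for w in word_ids:
        total += sum(p_w_given_z[z][w] * p_z_given_s[z]
                     for z in range(len(p_z_given_s)))
    return total / len(word_ids)


def cosine(u, v):
    """Cosine similarity of two topic distributions."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_sentences(sentences, p_w_given_z, topic_mix, k, max_overlap=0.9):
    """Greedily pick up to k high-scoring sentences with little topic overlap.

    sentences: list of word-id lists; topic_mix[i] is p(z|s) for sentence i.
    max_overlap is an assumed threshold on cosine similarity.
    """
    scores = [sentence_score(s, p_w_given_z, topic_mix[i])
              for i, s in enumerate(sentences)]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = []
    for i in ranked:
        if len(chosen) == k:
            break
        # Skip a sentence whose topics overlap too much with a chosen one.
        if all(cosine(topic_mix[i], topic_mix[j]) <= max_overlap for j in chosen):
            chosen.append(i)
    return chosen


# Toy distributions, invented for this sketch: 2 topics, 4 vocabulary words.
P_W_GIVEN_Z = [
    [0.50, 0.40, 0.05, 0.05],  # topic 0
    [0.05, 0.05, 0.50, 0.40],  # topic 1
]
SENTENCES = [[0, 1], [0, 0, 1], [2, 3]]               # word ids per sentence
TOPIC_MIX = [[0.9, 0.1], [0.85, 0.15], [0.2, 0.8]]    # p(z|s) per sentence

print(select_sentences(SENTENCES, P_W_GIVEN_Z, TOPIC_MIX, k=2))
```

With these toy numbers the second sentence scores almost as high as the first but covers nearly the same topics, so the selection skips it and returns sentences 0 and 2. In practice p(w|z) and p(z|s) would come from a trained LDA model, and the threshold would need tuning.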
Keywords/Search Tags: local priority, nested level, syntactic parsing, topic model, multi-document automatic summarization