Research On Key Technologies Of Chinese Multi-Document Summarization

Posted on: 2012-01-28  Degree: Master  Type: Thesis
Country: China  Candidate: Y Xiong  Full Text: PDF
GTID: 2178330335960304  Subject: Signal and Information Processing
Abstract/Summary:
Multi-document summarization is an important branch of natural language processing. It aims to extract the important information from a group of documents that share a similar topic and to generate a concise summary with good coverage, which helps readers obtain and use information quickly. In this thesis, we study Chinese multi-document summarization based on sub topics: the whole document set is first divided into several sub topics, then the most important sentences are selected from each sub topic and organized in a logical order. The work comprises two subtasks: sub topic clustering and sentence selection from sub topics.
For sub topic clustering, we cluster the document set into several groups according to semantic similarity. This involves three questions: how to represent the document information, how to calculate similarity, and which clustering method performs best. For the first, this thesis studies two critical steps: computing the similarity of words, and removing irrelevant words with the PPMI method. For the second, we use two methods to measure the similarity between two paragraphs: the traditional VSM-based statistical method, and a semantic similarity method based on the shortest paragraph. For the third, we test an improved K-means clustering method and a hierarchical clustering method and compare their clustering accuracy.
For the sentence selection task, we hold that summary sentences must meet two requirements: (1) each sentence is itself important; (2) the sentences together contain as little redundant information as possible. We therefore take sentence position, sentence length, and lexical information into account, weight these features in certain proportions, and combine them into an overall score that measures the importance of a sentence. Lexical information mainly refers to the sub topic keywords; we test tfidf-pos and hypothesis testing for keyword extraction and merge their results as the final output.
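The weighted combination of position, length, and lexical features could be sketched as follows. This is only an illustration: the function name `sentence_score`, the weights, the `ideal_len` parameter, and the three feature formulas are assumptions, since the abstract does not state the actual values or formulas used in the thesis.

```python
def sentence_score(position, n_sentences, tokens, keywords,
                   weights=(0.3, 0.2, 0.5), ideal_len=20):
    """Combine position, length, and keyword features into one score.

    The weights, ideal_len, and the feature formulas are illustrative
    assumptions, not the values used in the thesis.
    """
    w_pos, w_len, w_lex = weights
    # Position feature: earlier sentences score higher (1.0 for the first).
    pos_feat = 1.0 - position / max(n_sentences, 1)
    # Length feature: peaks at ideal_len and decays linearly on both sides.
    len_feat = max(0.0, 1.0 - abs(len(tokens) - ideal_len) / ideal_len)
    # Lexical feature: share of tokens that are sub topic keywords.
    lex_feat = sum(t in keywords for t in tokens) / len(tokens) if tokens else 0.0
    return w_pos * pos_feat + w_len * len_feat + w_lex * lex_feat
```

Here `tokens` is a segmented sentence (a list of words), `position` is its zero-based index in the document, and `keywords` is the merged keyword set for the sub topic.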
In this thesis, sentence selection is divided into two steps. Step 1: in each sub topic, the sentences are sorted in descending order of score, and a certain percentage of them is selected. Step 2: the sentence contributing the least new information is removed, one at a time, until the total length of the remaining sentences reaches the target.
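A minimal sketch of the two-step selection for one sub topic is given below. The abstract does not define how "new information" is measured, so this sketch assumes it is the number of distinct words a sentence contains that no other kept sentence contains; the names `select_sentences`, `keep_ratio`, and `target_len` are likewise hypothetical.

```python
def select_sentences(scored, keep_ratio=0.5, target_len=100):
    """Two-step selection for one sub topic.

    scored: list of (score, tokens) pairs. keep_ratio, target_len, and the
    novelty measure are assumptions made for illustration.
    """
    # Step 1: sort by score in descending order and keep the top share.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    n_keep = max(1, int(len(ranked) * keep_ratio))
    kept = [tokens for _, tokens in ranked[:n_keep]]

    def new_info(i):
        # Distinct words of sentence i that no other kept sentence contains.
        others = set()
        for j, toks in enumerate(kept):
            if j != i:
                others.update(toks)
        return len(set(kept[i]) - others)

    # Step 2: remove the least novel sentence until the length target is met.
    while len(kept) > 1 and sum(len(t) for t in kept) > target_len:
        worst = min(range(len(kept)), key=new_info)
        kept.pop(worst)
    return kept
```

Length is counted in tokens here; a character count would work the same way for Chinese text.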
Keywords/Search Tags: multi-document summarization, sub topic, semantic similarity, PPMI, sentence-removing