Research On Key Technologies Of Chinese Multi-Document Summarization

Posted on:2012-01-28

Degree:Master

Type:Thesis

Country:China

Candidate:Y Xiong

GTID:2178330335960304

Subject:Signal and Information Processing

Abstract/Summary:

Multi-document summarization is an important branch of natural language processing. It aims to extract the important information from a group of documents that share a similar topic and to generate a concise summary with good coverage, which helps readers obtain and use information quickly.

In this thesis, we study Chinese multi-document summarization based on sub topics: the document set is first divided into several sub topics, the most important sentences are then selected from each sub topic, and the selected sentences are organized in a logical order. The work comprises two subtasks: sub topic clustering and sentence selection from sub topics.

For sub topic clustering, we cluster the document set into several groups according to semantic similarity. This involves three questions: how to represent the document information, how to calculate similarity, and which clustering method performs best. For the first, this thesis studies two critical steps: computing the similarity of words and removing irrelevant words with the PPMI method. For the second, we use two methods to measure the similarity between two paragraphs: the traditional VSM-based statistical method and a semantic similarity method based on the shortest paragraph. For the third, we test an improved K-means clustering method and a hierarchical clustering method and compare their clustering accuracy.

For sentence selection, we hold that summary sentences must meet two requirements: (1) each sentence is itself important, and (2) the sentences together contain as little redundant information as possible. We therefore take sentence position, sentence length, and lexical information into account, weight these features in set proportions, and combine them into an overall score that measures the importance of a sentence. Lexical information mainly refers to the sub topic keywords; we test tfidf-pos and hypothesis testing for keyword extraction and merge their results as the final output.
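
The weighted sentence score described above can be illustrated with a minimal sketch. The abstract does not give the actual feature definitions or weights, so everything here is an assumption made for illustration: the function name `score_sentence`, the weights `(0.3, 0.2, 0.5)`, the reciprocal position score, the capped length score, and the keyword-ratio lexical score.

```python
def score_sentence(tokens, position, keywords,
                   weights=(0.3, 0.2, 0.5), target_len=20):
    """Combine position, length, and lexical features into one score.

    tokens    -- list of words in the sentence (already segmented)
    position  -- 0-based index of the sentence in its document
    keywords  -- set of sub topic keywords
    weights   -- hypothetical proportions for (position, length, lexical)
    target_len -- hypothetical length at which the length score saturates
    """
    if not tokens:
        return 0.0
    w_pos, w_len, w_lex = weights
    # Earlier sentences score higher (assumed form: 1 / (1 + position)).
    position_score = 1.0 / (1 + position)
    # Longer sentences score higher, capped at target_len tokens.
    length_score = min(len(tokens), target_len) / target_len
    # Fraction of tokens that are sub topic keywords.
    lexical_score = sum(t in keywords for t in tokens) / len(tokens)
    return w_pos * position_score + w_len * length_score + w_lex * lexical_score
```

With these assumed weights, a four-token sentence at position 0 that contains two keywords and uses `target_len=4` gets 0.3 + 0.2 + 0.25 = 0.75.
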
In this thesis, sentence selection is divided into two steps. Step 1: within each sub topic, the sentences are sorted in descending order of score and a certain percentage of them is selected. Step 2: the sentence that contributes the least new information is removed, one at a time, until the total length of the remaining sentences reaches the target.
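
The two steps can be sketched as follows. This is an illustration under stated assumptions, not the thesis's implementation: the names `select_top` and `prune_redundant`, the `ratio` parameter, and the definition of "contribution to new information" (the number of words that appear in no other remaining sentence) are all hypothetical, and length is measured in tokens.

```python
from collections import Counter


def select_top(scored, ratio=0.5):
    """Step 1: keep the top `ratio` of sentences by score.

    scored -- list of (score, tokens) pairs for one sub topic
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    keep = max(1, int(len(ranked) * ratio))
    return [tokens for _, tokens in ranked[:keep]]


def prune_redundant(sentences, target_len):
    """Step 2: drop the least novel sentence until the length fits.

    sentences  -- list of token lists
    target_len -- maximum total number of tokens in the summary
    """
    remaining = [list(s) for s in sentences]
    while len(remaining) > 1 and sum(len(s) for s in remaining) > target_len:
        # Count in how many remaining sentences each word occurs.
        counts = Counter(w for s in remaining for w in set(s))

        def novelty(sentence):
            # Assumed measure: words found in no other remaining sentence.
            return sum(1 for w in set(sentence) if counts[w] == 1)

        least = min(range(len(remaining)), key=lambda i: novelty(remaining[i]))
        del remaining[least]
    return remaining
```

For example, given `[["cat", "sat", "mat"], ["cat", "sat"], ["dog", "ran", "fast"]]` and `target_len=6`, the second sentence has no word of its own, so it is removed first and the remaining six tokens meet the target.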

Keywords/Search Tags:

multi-document summarization, sub topic, semantic similarity, PPMI, sentence-removing

Related items

1	Research And Application Of Multi-document Automatic Summarization
2	Study On Multi-Document Summarization Algorithm Based On Fusing Topic Sentences Semantic
3	Research On Automatic Multi-document Summarization Based On Statistics And Semantic Analysis
4	Research And Implementation Of Topic-based Multi-Document Summarization
5	Research On Key Technologies Of Chinese Multi-Document Summarization
6	The Approach For Event-based Multi-document Automatic Summarization
7	Sentence Extraction For Multi-Document Summarization Based On Topic Model And Semantics
8	Research Of Web Multi-document Automatic Summarization
9	Research On The Topic-oriented Summarization For Web Documents
10	Multi-document Summarization Based On HLDA Hierarchical Topic Model