A Study Of Chinese Multi-document Summarization Based On Adaptive Clustering Algorithm

Posted on:2009-08-23

Degree:Master

Type:Thesis

Country:China

Candidate:H S Xiao

Full Text:PDF

GTID:2178360245958449

Subject:Computer software and theory

Abstract/Summary:


With the spread of the Internet, the amount of information available to people grows day by day. To help users obtain useful information quickly and accurately, automatic summarization is becoming increasingly important, and multi-document summarization has become a research focus in natural language processing. Multi-document automatic summarization lets users extract useful information from a set of documents more conveniently and effectively.

At present, an established method in Chinese multi-document summarization is to rank all sentences in the document set by a combination of features and to extract summary sentences in rank order. This method is simple. However, as summarization targets diversify, users demand broader topic coverage, and this method has difficulty balancing topic coverage against redundancy. Another method extracts sentences from the different subtopics contained in the document set, using document clustering to find the subtopics. However, most clustering algorithms require the number of clusters to be specified manually, so they cannot accurately reflect the actual structure of the document set, which degrades summary quality.

To address this problem, this paper proposes a Chinese multi-document summarization scheme that applies an improved K-means clustering algorithm. The main research work is as follows:

(1) We propose a strategy for automatically discovering subtopics in the document set. In multi-document automatic summarization, clustering algorithms are usually used to find subtopics. In this paper, the improved K-means clustering algorithm determines the number of clusters, and thereby the subtopics, automatically from statistical information about the entire document set. The advantage of this strategy is that it does not rely on a person's subjective experience to determine the subtopics.
(2) The initial cluster centers are determined from statistical information about the entire document set. This makes the discovery of subtopic centers more objective and more rational, and makes the sentences extracted from each subtopic on the basis of its centroid more representative.
(3) For text vectors, we optimize the vector space model (VSM) with linguistic tools, which remedies its blurry features and high dimensionality.
(4) A Chinese multi-document automatic summarization system has been designed and implemented. Experiments show that the summaries generated by the system are of good quality, which verifies the feasibility of the method.
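
The clustering-based scheme described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify how the improved K-means derives the number of clusters or the initial centers, so the heuristic below is an assumption made only for illustration: a sentence seeds a new cluster when its similarity to every existing seed falls below the mean pairwise similarity of the whole set. The function names (vectorize, pick_initial_centers, cluster, summarize) and the toy English sentences are hypothetical; real Chinese text would first need word segmentation, and the thesis's optimized VSM is replaced here by plain term counts.

```python
import math
from collections import Counter


def vectorize(sentence):
    # Plain term-count vector (a stand-in for the thesis's optimized VSM).
    return Counter(sentence.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def mean_vector(vectors):
    total = Counter()
    for v in vectors:
        total.update(v)
    return Counter({t: c / len(vectors) for t, c in total.items()})


def pick_initial_centers(vectors):
    # ASSUMED heuristic, not the thesis's method: the threshold is the mean
    # pairwise similarity of the whole set, so k is never supplied by hand.
    n = len(vectors)
    if n < 2:
        return list(vectors)
    pairs = [cosine(vectors[i], vectors[j])
             for i in range(n) for j in range(i + 1, n)]
    threshold = sum(pairs) / len(pairs)
    centers = [vectors[0]]
    for v in vectors[1:]:
        if max(cosine(v, c) for c in centers) < threshold:
            centers.append(v)
    return centers


def cluster(vectors, max_iter=20):
    # Standard K-means loop using cosine similarity instead of distance.
    centers = pick_initial_centers(vectors)
    labels = []
    for _ in range(max_iter):
        new_labels = [max(range(len(centers)),
                          key=lambda k: cosine(v, centers[k]))
                      for v in vectors]
        if new_labels == labels:
            break
        labels = new_labels
        for k in range(len(centers)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centers[k] = mean_vector(members)
    return labels, centers


def summarize(sentences):
    # Pick the sentence closest to each centroid, kept in original order.
    vectors = [vectorize(s) for s in sentences]
    labels, centers = cluster(vectors)
    chosen = []
    for k, center in enumerate(centers):
        members = [i for i, lab in enumerate(labels) if lab == k]
        if members:
            chosen.append(max(members, key=lambda i: cosine(vectors[i], center)))
    return [sentences[i] for i in sorted(chosen)]


sentences = [
    "flood waters damaged hundreds of houses",
    "hundreds of houses were damaged by flood waters",
    "rescue teams delivered food and clean water",
    "volunteers and rescue teams delivered food",
    "officials promised funds for rebuilding",
]
summary = summarize(sentences)
for line in summary:
    print(line)
```

On this toy input the sketch yields one sentence per discovered subtopic, which reflects the trade-off the abstract describes: coverage comes from sampling every cluster, and redundancy is limited by taking only the sentence nearest each centroid.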

Keywords/Search Tags:

Multi-document summarization, K-means clustering, subtopic discovery, sentence extraction


Related items

1	The Approach For Event-based Multi-document Automatic Summarization
2	Research And Application Of Multi-document Automatic Summarization
3	Multi-document Summarization Based On Improved Fuzzy C-means Clustering Algorithm
4	Research On EBM Multi-Document Summarization Technique
5	Research On Automatic Multi-document Summarization Based On Statistics And Semantic Analysis
6	Research On Summary Sentence Selection And Ordering In Query-focused Multi-document Summarization
7	Research On Key Technologies Of Chinese Multi-Document Summarization
8	Sentence Extraction For Multi-Document Summarization Based On Topic Model And Semantics
9	Multi-document Summarization Based On Basic Element
10	Statistic-based Automatic Keyphrase Extraction And Summarization From Multi-document