
A Study Of Chinese Multi-document Summarization Based On Adaptive Clustering Algorithm

Posted on: 2009-08-23 | Degree: Master | Type: Thesis
Country: China | Candidate: H S Xiao | Full Text: PDF
GTID: 2178360245958449 | Subject: Computer software and theory
Abstract/Summary:
With the spread of the Internet, the amount of information available to people grows by the day. To help users obtain useful information quickly and accurately, automatic summarization is becoming increasingly important, and multi-document summarization has become a research focus in natural language processing. It lets users extract useful information from a set of documents more conveniently and effectively.

A mature approach in Chinese multi-document summarization ranks all sentences in the document set by a combination of features and extracts summary sentences in rank order. The approach is simple, but as summarization goals diversify, users demand better topic coverage, and it struggles to balance topic coverage against redundancy. A second approach extracts sentences from the different subtopics contained in the document set, using document clustering to find those subtopics. However, most clustering algorithms require the number of clusters to be specified manually, so they cannot accurately reflect the actual structure of the document set, which degrades summary quality.

To address this problem, this thesis proposes a Chinese multi-document summarization scheme that applies an improved K-means clustering algorithm. The main research work is as follows:

(1) We propose a strategy for automatically discovering subtopics in the document set. Clustering algorithms are commonly used to find subtopics in multi-document summarization. Here, the improved K-means algorithm determines the number of clusters, and thus the subtopics, automatically from statistical information about the whole document set. The advantage of this strategy is that it does not rely on subjective human experience to determine the subtopics.
(2) The initial cluster centers are also determined from statistical information about the whole document set. This makes the discovery of subtopic centers more objective and more reasonable, and makes the sentences extracted from each subtopic on the basis of its centroid more representative.
(3) For text vectors, we optimize the vector space model (VSM) with linguistic tools, remedying its defects of fuzzy features and high dimensionality.
(4) A Chinese multi-document automatic summarization system has been designed and implemented. Experiments show that the summaries generated by the system are of good quality, which verifies the feasibility of the method.
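The abstract does not say how the improved K-means picks the number of clusters or the initial centers, so the following is only a minimal sketch of the general idea under stated assumptions, not the thesis's algorithm. It assumes sentences are already word-segmented into token lists, uses plain TF-IDF as the VSM, seeds centers with a farthest-first heuristic that starts from the sentence closest to the overall mean vector, and selects k by the mean silhouette score (a swapped-in criterion). All function names are hypothetical.

```python
import math
from collections import Counter

def normalize(vec):
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)

def tfidf_vectors(sentences):
    """Map tokenized sentences to unit-length sparse TF-IDF vectors (plain VSM)."""
    n = len(sentences)
    df = Counter(t for s in sentences for t in set(s))
    return [normalize({t: c * (math.log(n / df[t]) + 1.0)
                       for t, c in Counter(s).items()}) for s in sentences]

def dot(a, b):
    # Cosine similarity when both vectors have unit length.
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def centroid(vectors):
    total = {}
    for v in vectors:
        for t, w in v.items():
            total[t] = total.get(t, 0.0) + w
    return normalize(total)

def initial_centers(vectors, k):
    # Assumption: start near the overall mean, then add the least similar sentences.
    overall = centroid(vectors)
    chosen = [max(range(len(vectors)), key=lambda i: dot(vectors[i], overall))]
    while len(chosen) < k:
        rest = [i for i in range(len(vectors)) if i not in chosen]
        chosen.append(min(rest, key=lambda i: max(dot(vectors[i], vectors[c])
                                                  for c in chosen)))
    return [vectors[i] for i in chosen]

def kmeans(vectors, k, iters=20):
    centers, labels = initial_centers(vectors, k), None
    for _ in range(iters):
        new_labels = [max(range(k), key=lambda j: dot(v, centers[j])) for v in vectors]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centers[j] = centroid(members)
    return labels, centers

def silhouette(vectors, labels):
    """Mean silhouette score using cosine distance (1 - similarity)."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        return -1.0
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:  # singleton cluster scores 0 by convention
            scores.append(0.0)
            continue
        def mean_dist(members):
            return sum(1.0 - dot(vectors[i], vectors[j]) for j in members) / len(members)
        a = mean_dist(own)
        b = min(mean_dist(m) for other, m in clusters.items() if other != lab)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

def extract_summary(sentences, k_max=4):
    """Return (k, labels, indices of one representative sentence per subtopic)."""
    if len(sentences) < 3:
        raise ValueError("need at least 3 sentences to compare cluster counts")
    vectors = tfidf_vectors(sentences)
    best = None
    for k in range(2, min(k_max, len(vectors) - 1) + 1):
        labels, centers = kmeans(vectors, k)
        score = silhouette(vectors, labels)
        if best is None or score > best[0]:
            best = (score, k, labels, centers)
    _, k, labels, centers = best
    picked = []
    for j in range(k):
        members = [i for i, lab in enumerate(labels) if lab == j]
        if members:  # take the sentence closest to the subtopic centroid
            picked.append(max(members, key=lambda i: dot(vectors[i], centers[j])))
    return k, labels, sorted(picked)
```

Calling `extract_summary` on a list of token lists returns the chosen cluster count, the cluster label of each sentence, and the index of the sentence nearest each centroid. English tokens work for a quick trial. For Chinese text, a word segmenter would have to run first, which this sketch leaves out.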
Keywords/Search Tags:Multi-document summarization, K-means clustering, subtopic discovery, sentence extraction