
Multi-document Summarization Based On Improved Fuzzy C-means Clustering Algorithm

Posted on: 2010-03-02 | Degree: Master | Type: Thesis
Country: China | Candidate: Z X Hao | Full Text: PDF
GTID: 2178360332457853 | Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Internet around the world, the amount of information online keeps increasing. People urgently need a way to extract useful information from this mass of data, eliminate redundancy, and combine what remains coherently. Automatic multi-document summarization aims to solve exactly this problem. Information repeated across documents is placed in the summary only once; other relevant information is extracted according to its importance and the compression ratio. Following the sub-topic approach, the sentences in a collection of documents on the same theme are regrouped by similarity into sub-topics, each representing one aspect of the information. The summary is obtained by extracting and ordering summary sentences from each sub-topic.

Sentence similarity calculation is very important in multi-document summarization; its precision directly affects the determination of sub-topics and the generation of the summary. This thesis describes sentence similarity algorithms based on word weights, latent semantic analysis, semantic distance, and semantic dependency. It then employs a multi-feature fusion algorithm that combines keyword weight, semantic distance, and semantic dependency features to calculate the similarity between sentences. This makes the description of a sentence more comprehensive and the calculation result more accurate.

To find the potential sub-topics of a document collection, the thesis proposes an improved fuzzy c-means clustering algorithm that reflects the fuzziness with which a Chinese sentence can belong to different classes. To weaken the influence of improper initialization on fuzzy c-means, the algorithm employs a voting method that combines a threshold-based hierarchical clustering algorithm and a sample density algorithm. The outcome is used to initialize the prototype matrix and partition matrix of the fuzzy c-means algorithm, which then generates the sub-topics for multi-document summarization. The sub-topics are sorted by importance, and the summary sentence set is generated by dynamically extracting summary sentences. Finally, the extracted sentences are ordered by combining sentence ordering algorithms based on document frame and topic position. Experiments show that the method achieves better clustering results than the other clustering algorithms used in multi-document summarization. The information coverage and fluency of the generated summaries are also satisfactory.
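
As a rough illustration of the clustering step described above, the following is a minimal sketch of the standard fuzzy c-means iteration, not the improved algorithm of the thesis. It assumes each sentence has already been mapped to a numeric feature vector, and it takes the initial prototypes as an argument, standing in for the voting-based initialization, which is not reproduced here. The names `fuzzy_c_means`, `_memberships`, `points`, and `init_centers` are hypothetical.

```python
import math


def _memberships(point, centers, m):
    """Membership degrees of one point in every cluster (they sum to 1)."""
    dists = [math.dist(point, c) for c in centers]
    # A point lying exactly on a prototype belongs fully to that prototype.
    zeros = [j for j, d in enumerate(dists) if d == 0.0]
    if zeros:
        return [1.0 / len(zeros) if j in zeros else 0.0
                for j in range(len(centers))]
    exponent = 2.0 / (m - 1.0)
    inverse = [d ** -exponent for d in dists]
    total = sum(inverse)
    return [v / total for v in inverse]


def fuzzy_c_means(points, init_centers, m=2.0, max_iter=100, tol=1e-6):
    """Standard fuzzy c-means with fuzzifier m > 1.

    init_centers is the initial prototype matrix; the thesis obtains it by
    voting, whereas this sketch simply receives it (an assumption).
    Returns (centers, memberships), where memberships[i][j] is the degree
    to which points[i] belongs to cluster j.
    """
    centers = [list(c) for c in init_centers]
    for _ in range(max_iter):
        u = [_memberships(p, centers, m) for p in points]
        new_centers = []
        for j, old in enumerate(centers):
            weights = [row[j] ** m for row in u]
            total = sum(weights)
            if total == 0.0:
                # No point has any membership here; keep the old prototype.
                new_centers.append(old)
                continue
            new_centers.append([
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(len(old))
            ])
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    # Recompute memberships so they match the final prototypes.
    return centers, [_memberships(p, centers, m) for p in points]


# Toy data: two well-separated groups of 2-D "sentence vectors".
points = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
init_centers = [[0.0, 0.0], [1.0, 1.0]]
centers, memberships = fuzzy_c_means(points, init_centers)
```

Each row of `memberships` gives a sentence's graded degree of belonging to every sub-topic, which is the property that motivates a fuzzy rather than a hard clustering. Because the iteration only reaches a local optimum, the result depends on `init_centers`, which is why the initialization matters.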
Keywords/Search Tags: Multi-document Summarization, Sub-topic, Fuzzy C-means, Hierarchical Clustering, Sample Density