A Study Of Chinese Text Summarization Based On Adaptive Clustering Algorithm

Posted on: 2006-09-25
Degree: Master
Type: Thesis
Country: China
Candidate: P Hu
GTID: 2168360152995233
Subject: Computer software and theory
Abstract/Summary:
Automatic summarization is an important research topic in natural language processing, and it is attracting growing attention from researchers worldwide. First, automatic summarization can partly compensate for the shortcomings of traditional information retrieval when dealing with information overload. Second, it can relieve users of much of the burden of browsing.

Many problems remain in research on Chinese document summarization. For instance, many researchers adopt the traditional summarization method, which extracts relevant sentences from the entire text according to each sentence's score. However, such methods do not take the document's thematic structure into account, so the summaries they generate cover only the main themes while neglecting the others, and sometimes have a high level of redundancy. In addition, when developing a practical automatic summarization system, dimensionality reduction of the various linguistic units is a fundamental and important step.

In this thesis, we propose a Chinese summarization method based on an adaptive clustering algorithm. The method adopts four key technologies:

Key technology one: Feature vector representation of various linguistic units based on unsupervised feature extraction
Key technology two: Discovery of latent themes based on an adaptive clustering algorithm
Key technology three: Selection of representative sentences from different themes using theme-sentence similarity calculation
Key technology four: Quantitative evaluation of a summary's redundancy based on representation entropy

We chose thirty different genres of documents from the Modern Chinese Corpus of the State Language Commission as experimental samples, and applied both the proposed method and a traditional baseline method to them. The experimental results indicate that the proposed method is more effective and efficient when dealing with various genres of documents, because it can balance the thematic coverage and redundancy of the generated summary to a certain degree.
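
As an illustration of key technology three only, the following is a minimal Python sketch of theme-sentence similarity selection. It is not the thesis's implementation: it assumes that sentence feature vectors and theme (cluster) labels have already been produced by some earlier step, that a theme is represented by the centroid of its sentence vectors, and that similarity is measured by cosine. The names `cosine`, `centroid`, and `select_representatives` and the toy vectors are hypothetical and do not come from the thesis.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def select_representatives(sentence_vectors, theme_labels):
    """Pick, for each theme, the sentence most similar to the theme centroid.

    Assumption: theme_labels[i] is the cluster label of sentence i, produced
    by a clustering step that is not shown here.
    Returns the chosen sentence indices in document order.
    """
    themes = defaultdict(list)
    for index, label in enumerate(theme_labels):
        themes[label].append(index)

    chosen = []
    for members in themes.values():
        center = centroid([sentence_vectors[i] for i in members])
        best = max(members, key=lambda i: cosine(sentence_vectors[i], center))
        chosen.append(best)
    return sorted(chosen)


# Toy data (made up for illustration): five sentences in two themes.
vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.6, 0.4, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.2, 0.8],
]
labels = [0, 0, 0, 1, 1]
summary_indices = select_representatives(vectors, labels)
print(summary_indices)
```

Taking one sentence per theme is what lets a summary built this way cover minor themes as well as the main ones, while drawing near-duplicate sentences from the same theme only once.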
Keywords/Search Tags: automatic summarization, thematic discovery, unsupervised feature extraction, clustering, representation entropy