Research On Chinese Automatic Summarization System

Posted on: 2009-03-21
Degree: Master
Type: Thesis
Country: China
Candidate: H X Zhu
GTID: 2178360245475970
Subject: Computer application technology
Abstract/Summary:
Automatic summarization is an important research topic in natural language processing. In recent years, with the rapid development of the Internet, the amount of available information has increased sharply and the volume of literature has grown exponentially. As one way to relieve information overload, automatic summarization is becoming increasingly valuable: it helps speed up information retrieval and reduces the time spent browsing.

Automatic summarization is closely tied to semantics, yet traditional statistical summarization extracts sentences by building a vector space model from word-frequency statistics. The basic assumption of the vector space model is that terms are independent of one another. In real text this does not hold: because language is diverse, the same concept is often expressed in many different forms, so the terms obtained by word segmentation can be strongly related rather than fully independent. In addition, an article generally has one overall theme, but the author often illustrates that theme from several sides. If the summary is extracted only according to the importance of each sentence within the whole article, the result tends to cover the subtopic that is most densely distributed and to ignore the others, so the summary is incomplete.

To address these problems, this thesis combines statistical methods with semantic knowledge, proposes an approach to automatic summarization based on concept counting and text structure partition, and implements a prototype system. The specific work is as follows. First, the thesis reviews the history and current state of automatic summarization research in China and abroad, and introduces the relevant theory, including the vector space model, Chinese lexical analysis, and the evaluation of automatic summarization. Second, concept counting based on HIT IR-Lab Tongyici Cilin (Extended) is introduced into automatic summarization, and a maximum matching algorithm is used to give a preliminary resolution of word ambiguity during concept acquisition. Third, so that the summary covers the main content of the original, an algorithm for topic partition is proposed that considers both the similarity of adjacent paragraphs and the average similarity of consecutive paragraphs. Finally, an automatic summarization system based on concept counting and text structure partition is implemented. To make the evaluation of the system more objective, impartial, and reasonable, an evaluation standard was designed around the characteristics of the evaluation corpus.

To verify the validity and feasibility of the summarization method, comparative experiments were carried out on three approaches: the traditional method, the concept counting method, and the proposed method. The experimental results show that the proposed method reflects the content structure of an article effectively and performs better than the traditional method, and that its advantage becomes more pronounced as the length of the summary increases. It is applicable to both long and short articles. In addition, a comparison with other available summarization tools shows that the proposed method approaches the automatic summarization function of the IRLab-NLPML system from HIT IR-Lab and surpasses the corresponding function built into Word.
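
The two ideas named above, concept counting and topic partition, can be illustrated with a minimal sketch. It is not the thesis's implementation: the word-to-concept table `TOY_CONCEPTS` is a hypothetical stand-in for Tongyici Cilin, the `threshold` value is an arbitrary assumption, word sense disambiguation is omitted, and the partition uses only adjacent-paragraph similarity (the thesis also considers the average similarity of consecutive paragraphs).

```python
import math
from collections import Counter

# Hypothetical toy thesaurus (word -> concept code); a stand-in for a real
# synonym dictionary such as Tongyici Cilin, which is NOT used here.
TOY_CONCEPTS = {
    "car": "C_VEHICLE",
    "automobile": "C_VEHICLE",
    "engine": "C_ENGINE",
    "motor": "C_ENGINE",
    "soup": "C_FOOD",
    "broth": "C_FOOD",
    "salt": "C_SEASONING",
}


def concept_vector(tokens):
    """Count concepts instead of surface words; unknown words map to themselves."""
    return Counter(TOY_CONCEPTS.get(token, token) for token in tokens)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def partition(paragraphs, threshold=0.3):
    """Group paragraph indices into topic blocks.

    A new block starts wherever the similarity between a paragraph and the
    one before it falls below `threshold` (an assumed, untuned value).
    """
    vectors = [concept_vector(p) for p in paragraphs]
    blocks = [[0]] if vectors else []
    for i in range(1, len(vectors)):
        if cosine(vectors[i - 1], vectors[i]) < threshold:
            blocks.append([i])
        else:
            blocks[-1].append(i)
    return blocks


# Each paragraph is a list of already-segmented tokens.
paragraphs = [
    ["car", "engine"],
    ["automobile", "motor"],
    ["soup", "salt"],
    ["broth", "salt"],
]
blocks = partition(paragraphs)
print(blocks)
```

In this toy input the first two paragraphs share no surface words, so a plain word-frequency vector would give them zero similarity; counting concepts makes them comparable, which is the motivation the abstract gives for moving beyond word frequency. A summarizer could then draw sentences from every block so that no subtopic is left out.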
Keywords/Search Tags: automatic summarization, statistical summarization, vector space model, lexical analysis, concept counting, text structure partition, word sense disambiguation, evaluation