Research On Keyphrase Extraction Based Automatic Summarization Method For Chinese Webpage

Posted on: 2011-08-13  Degree: Doctor  Type: Dissertation
Country: China  Candidate: C J Jiang  Full Text: PDF
GTID: 1118360308463887  Subject: Computer application technology
Abstract/Summary:
Automatic text summarization is the process by which a computer distills the most important information, or the information of greatest interest to users, from one or more source texts and produces an abridged version for a particular user or task. The generation of an automatic summary can be divided into three major steps: text analysis and understanding, thematic information selection, and summary generation. The history of automatic text summarization runs roughly as follows: early statistics-based summarization (such as word frequency statistics and the sentence position method); text-understanding summarization based on knowledge representation (such as scripts, frames, templates, or first-order predicates) in the 1970s and 1980s; information-retrieval-based summarization in the 1990s; and, in recent years, integrated summarization that combines natural language understanding technology with artificial intelligence methods. From this history we can draw a conclusion: a good automatic summary cannot be obtained with a single technique alone. We therefore use shallow text information, semantic text information, and a knowledge base, and integrate natural language understanding methods with artificial intelligence techniques, to generate summaries. In this dissertation, we first analyze the essential characteristics of the thematic words and phrases in a text. Based on these characteristics, we design a formula for computing the weight of each word in a text, so that thematic words receive higher weights than non-thematic words. In Chapter III, we propose a method for question parsing in a question answering system. In Chapter IV, we propose a method that finds the sentences most important for describing the theme of a text by analyzing its thematic and semantic information, and we use this method for summary generation.
In the last chapter, we propose an adjacent paragraph clustering method to improve the quality of automatic summaries. This dissertation makes the following innovations:
(1) To increase the accuracy of extracting thematic words/phrases from a text or document, we propose a weight computation algorithm for thematic words/phrases, designed after analyzing the effects of Chinese word segmentation systems, synonymy, and polysemy. The algorithm first recognizes the combined words and unlisted words in a text using a combined-word recognition algorithm, then merges the frequencies of synonyms and avoids the co-occurrence of synonyms in the final set of thematic words/phrases. Lastly, the algorithm computes each word's or phrase's weight from its frequency, part of speech, length, and position. The experimental results indicate that the algorithm successfully recognizes and extracts the thematic words/phrases of a text, and that it performs better than the traditional TF-IDF method.
(2) To overcome the difficulty of identifying named entities in the queries of a question answering system, we construct a high-coverage domain dictionary and put forward a Chinese query sentence parsing algorithm based on the domain dictionary and a query type mapping list. The implementation proceeds as follows. First, the query parsing module of a Chinese question answering system is studied and analyzed in depth, and the construction of the domain dictionary used for Chinese named entity recognition is described. To identify the type of a query, a query type list is constructed. After a query is parsed and segmented, its nouns, verbs, and adjectives are extracted to form an initial query vector. This initial vector is then expanded into an extended query vector by adding synonyms of its nouns and verbs and words associated with the query type.
The experimental results show that queries using the extended query vector achieve higher precision than queries using the original query sentence.
(3) Existing automatic summarization systems always suppose that the title of a webpage expresses its theme, although this is not necessarily true. To address this drawback, we propose an algorithm that classifies webpage titles into two types. We also design a formula for computing the weight of each sentence in a webpage. The weight of a sentence is determined by its content, its position, the cue words/phrases it contains, and the user's preference. The formula is therefore a fitting function of the sentence's content-based weight, position-based weight, cue-word/phrase-based weight, and user-query-based weight. To avoid redundancy, only one of any two similar sentences in the candidate set is selected for the final summary. The experimental results show that the proposed algorithm performs better than traditional automatic summarization algorithms based on the TF-ISF method.
(4) The existing, popular approach to evaluating automatic summaries, which is based on precision and recall, works only at the sentence level. To overcome this drawback, we propose a word-level evaluation method. We first define the intersection and union operations for generalized multisets, and then use generalized multisets to represent the manual summary and the automatic summary. Lastly, we revise the formulae for computing precision, recall, and F-measure. Experiments show that the modified formulae make the evaluation of automatic summaries more reasonable.
(5) A further aim is to improve the quality of automatic summaries and to overcome the shortcomings of existing text clustering algorithms.
This dissertation puts forward an adjacent paragraph clustering algorithm for subtopic detection. The algorithm first places the first paragraph in the first cluster. For each subsequent paragraph, it computes the similarity between that paragraph and the paragraph immediately before it. If the similarity is equal to or greater than the threshold, the paragraph is put into the same cluster as the preceding paragraph. Otherwise, a new cluster is created and the paragraph is added to it. The algorithm finishes once all paragraphs have been processed. Experimental results show that the proposed adjacent paragraph clustering algorithm has lower complexity than the well-known K-Means clustering algorithm while achieving equal clustering quality.
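The adjacent paragraph clustering steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not say which similarity measure or threshold the dissertation uses, so cosine similarity over bag-of-words counts and a default threshold of 0.3 are placeholders, and the names `cosine_similarity` and `cluster_adjacent_paragraphs` are hypothetical. Paragraphs are assumed to be already segmented into word tokens.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors (assumed measure)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_adjacent_paragraphs(paragraphs, threshold=0.3):
    """Group consecutive paragraphs into clusters of paragraph indices.

    `paragraphs` is a list of token lists; `threshold` is an assumed value.
    """
    if not paragraphs:
        return []
    vectors = [Counter(tokens) for tokens in paragraphs]
    clusters = [[0]]  # the first paragraph forms the first cluster
    for i in range(1, len(vectors)):
        # Compare each paragraph only with the paragraph immediately before it.
        if cosine_similarity(vectors[i - 1], vectors[i]) >= threshold:
            clusters[-1].append(i)  # join the preceding paragraph's cluster
        else:
            clusters.append([i])  # otherwise start a new cluster
    return clusters
```

In this sketch each paragraph is compared with exactly one neighbour, so the procedure makes a single pass over the text and needs no preset number of clusters, unlike K-Means.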
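The word-level evaluation in innovation (4) can be illustrated in the same spirit. The abstract does not give the dissertation's definition of generalized multisets or its revised formulae, so the sketch below substitutes ordinary multisets (Python's `collections.Counter`, whose `&` operator takes the minimum count per word and whose `|` operator takes the maximum) and the standard precision, recall, and F-measure definitions. The function name `word_level_scores` is hypothetical.

```python
from collections import Counter


def word_level_scores(auto_tokens, manual_tokens):
    """Word-level precision, recall and F-measure of an automatic summary.

    Both summaries are treated as multisets of words (an assumed stand-in
    for the dissertation's generalized multisets).
    """
    auto = Counter(auto_tokens)
    manual = Counter(manual_tokens)
    # Multiset intersection keeps the minimum count of each shared word.
    overlap = sum((auto & manual).values())
    auto_size = sum(auto.values())
    manual_size = sum(manual.values())
    precision = overlap / auto_size if auto_size else 0.0
    recall = overlap / manual_size if manual_size else 0.0
    if precision + recall == 0:
        f_measure = 0.0
    else:
        f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

Because the overlap is counted per word rather than per sentence, an automatic summary that picks a different sentence with largely the same words as the manual summary still receives partial credit, which is the motivation the abstract gives for moving below the sentence level.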
Keywords/Search Tags: automatic summarization, weight computation, paragraph clustering, Chinese Web page, combined word