
Research On Document Summarization Based On The Cluster Of Graph

Posted on: 2016-12-02
Degree: Master
Type: Thesis
Country: China
Candidate: Z D Wu
Full Text: PDF
GTID: 2308330482467326
Subject: Computer Science and Technology
Abstract/Summary:
The rapid growth of the Internet has produced a massive increase in the amount of available information, especially text documents (e.g. blogs, news articles, scientific papers, electronic books). By liberal estimates, the web contained around 4.7 billion pages in 2014. Faced with this volume of information, it has become impossible for humans to sift through it efficiently, so we must rely on computers to process Internet data and extract the useful information from it.

To address this problem, we study document summarization based on graph clustering. Our research has three main goals: (i) identify the relevant content in texts; (ii) eliminate redundant information, since a good summary should avoid repetition and redundancy can be seen as a kind of "noise" that degrades the quality of the final summary; (iii) maintain high content coverage and information diversity, since summaries should encapsulate as much information from the texts as possible, making it possible to understand the main ideas of the original documents.

Against this background, we begin by building a joint model of sentences. We compute term frequency (TF) and inverse document frequency (IDF), and use the TF-IDF weights of each sentence's terms as its score. The next step is sentence clustering: given the sentence scores, we use four different relations among sentences to represent the documents: (i) statistical similarity; (ii) semantic similarity; (iii) co-reference; (iv) discourse relations. In this way we convert the text into a graph model; then, guided by a configuration file that specifies the graph type, edge selection, language, domain selection, and threshold, we cluster the sentences, and the final summary consists of the sentences closest to the centers of the clusters.
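The TF-IDF sentence-scoring step described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: each sentence is treated as a "document" for the IDF computation, and function and variable names are my own.

```python
import math
from collections import Counter

def tfidf_sentence_scores(sentences):
    """Score each sentence by the sum of the TF-IDF weights of its terms.

    TF is the term's frequency within the sentence; IDF is computed over
    the collection of sentences (a simplifying assumption for this sketch).
    """
    docs = [s.lower().split() for s in sentences]
    n = len(docs)
    # Document frequency: in how many sentences each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = sum(
            (tf[t] / len(doc)) * math.log(n / df[t])
            for t in tf
        )
        scores.append(score)
    return scores

sentences = [
    "graph clustering groups similar sentences",
    "tf idf weights rank sentence terms",
    "graph clustering groups similar sentences together",
]
scores = tfidf_sentence_scores(sentences)
```

Sentences dominated by rare terms receive higher scores; terms that appear in every sentence contribute nothing, since their IDF is zero.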
The clustering algorithm we propose performs well at removing redundancy and produces a more coherent summary, as confirmed by our evaluation. The main work and results are summarized as follows:

(1) Establishing the text graph model. To make the relationships within the document text more intuitive, we apply the classic TF-IDF method to the preprocessed document words, building a sentence-scoring module that computes the similarity between sentences. In addition, we refine this model along several dimensions, namely semantic similarity, co-reference, and discourse relations, yielding a more precise and reasonable model.

(2) Handling redundancy and diversity. We address information redundancy and diversity by clustering over the model we built, in a way that is completely unsupervised and generic. The proposed system treats the input documents as a single file, and its central assumption is that jointly modeling sentences and their connections yields a better model for identifying diversity among them. With the help of the model we built, we can minimize co-references in most cases.

(3) Evaluation. To verify the effectiveness of the graph-clustering-based method, we use the most representative summarization benchmark, the DUC dataset, which provides different datasets for specific tasks. To validate the proposed system, we also designed two baseline systems: a summarization system based on statistics and a summarization system based on the k-means clustering algorithm. Horizontal comparison against these baselines shows that our method produces better output.
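The overall pipeline (similarity graph, clustering, one representative sentence per cluster) can be sketched as a toy example. This sketch uses only cosine similarity for edges and connected components as the clustering step; the thesis's actual system uses four relation types and a configurable graph, so all names and thresholds here are illustrative assumptions.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def summarize(sentences, threshold=0.3):
    vecs = [Counter(s.lower().split()) for s in sentences]
    n = len(sentences)
    # Build the sentence graph: an edge when similarity exceeds the threshold.
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(vecs[i], vecs[j]) >= threshold:
                adj[i].add(j)
                adj[j].add(i)
    # Clusters = connected components (a simple stand-in for the clustering step).
    seen, clusters = set(), []
    for i in range(n):
        if i in seen:
            continue
        stack, comp = [i], []
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            comp.append(v)
            stack.extend(adj[v])
        clusters.append(comp)
    # One representative per cluster: the sentence closest to its cluster mates.
    summary = []
    for comp in clusters:
        best = max(comp, key=lambda v: sum(cosine(vecs[v], vecs[u]) for u in comp))
        summary.append(sentences[best])
    return summary
```

Because near-duplicate sentences fall into the same cluster and only one representative is emitted per cluster, redundancy is removed while each distinct topic keeps one sentence, which is the diversity property the abstract describes.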
Keywords/Search Tags: text summarization, text statistics, sentence clustering, linguistic treatment, graph