Font Size: a A A

Single Document Automatic Summarization Based On TextRank Algorithm

Posted on:2017-02-22Degree:MasterType:Thesis
Country:ChinaCandidate:Y CaoFull Text:PDF
GTID:2308330485971031Subject:Information Science
Abstract/Summary:PDF Full Text Request
The advent of the information age led to exponential growth literature, and information users transfer from poor information to information overload quickly. This makes the traditional manual abstract speed far behind the needs of the user. Researching on automatic summarization from 1958 has been a hot field of automatic processing of information. Automatic summarization based on graph mainly uses the lexical or semantic information in the text to construct the topology map. TextRank is one of the representative algorithms. The basic idea of TextRank algorithm is derived from Google’s PageRank algorithm by dividing a document into nodes that is composed of a number of text units (word or sentence), the edge of the nodes that is composed of the similarity between the text elements, and build a graph model. By using PageRank algorithm, the graph model is iterated until convergence, and then all nodes are sorted. Finally, output key words or summarization. Using only single document information itself can realize keyword extraction and summarization. TextRank algorithm does not need to learn a number of documents prior to the training. It is unsupervised learning and widely used. Based on the TextRank algorithm, this paper improves the similarity of the sentence and the weight of the sentence in the process of automatic summarization. In the end, a single document automatic summarization method is proposed. Main research work are as follows:(1) Research questions. The main steps of automatic summarization based on TextRank algorithm are analyzed and sorted out. This research points out that the preprocessing part of the text and the iterative calculation of the TextRank algorithm have been more mature, improved space are limited. Improving the calculation of the sentence similarity and the weight of the sentence is the point.(2)Sentence similarity. This paper mainly compares method based on edit distance, WordNet, BM25, TextRank. It is found that the effect of the automatic summarization based on the BM25 similarity calculation method is optimal. This paper focuses on IDF part of BM25 formula. When n(si) is larger than N/2, IDF(si) takes a negative value, and need to improve the weight of a negative value. This paper puts forward two improved methods of BM25. The first one is to replace the original IDF(si) formula with the classic IDF formula, and the denominator of the classical IDF formula by adding 1 smoothing. The second is when the n(si) is less than or equal to N/2, IDF(si) formula unchanged, IDF(si) takes positive; when the n(si) is greater than N/2, with a ·avgIDF replaced the original formula, a is the parameters(0≤α≤1), avgIDF is the average IDF value of all word items.(3) Sentence weight. The classical TextRank method takes into account the global information of the sentence, but ignores the characteristics of the sentence itself. This paper introduces the weight of the sentence to integrate sentence position, clue words and classical TextRank.(4) Summarization experiment. This paper selects Due 2002 corpus, the main work includes:corpus pretreatment (sentence segmentation, word segmentation, POS tagging, word filtration); sentence similarity calculation; sentence weight calculation; summarization generation.(5) Summarization evaluation. The ROUGE evaluation method is used to examine the performance of different summarization tasks (100 words, compression 10%, and compression 20%). Experiments show that on all indicators of ROUGE, the paper proposed the sentence similarity calculation method and the sentence weight calculation method are improved compared with the classical TextRank method. At the same time, this paper presents a value strategy of improved BM25 when facing different summarization tasks.The experiments show that the improved TextRank algorithm based on single document automatic summarization method has certain innovation and applicability.
Keywords/Search Tags:automatic summarization, TextRank, BM25, single document automatic summarization
PDF Full Text Request
Related items