
Automatic Summary Extraction Based On TF-IDF And TextRank

Posted on: 2020-07-03
Degree: Master
Type: Thesis
Country: China
Candidate: Z Y Zhang
Full Text: PDF
GTID: 2428330623961140
Subject: Computer technology
Abstract/Summary:
The Internet contains a vast amount of information, and retrieving the information a user actually needs from it remains an open problem. Search engines only retrieve by keyword and cannot readily summarize the content of a text. Automatic summarization technology is poised to become the next generation of search technology. It compresses the redundant content of a text without affecting its central content, so that users can quickly understand what the text contains. Automatic text summarization has great application value in real life, for example in news headlines that summarize news content, text sentiment analysis, automatic question answering, and other fields.

The process of automatic summary extraction is divided into two parts: the extraction of topic words and the extraction of sentences. The TF-IDF (term frequency-inverse document frequency) algorithm measures the importance of a word mainly by how frequently it appears in the text. It has defects such as ignoring the semantic and position information of words, which reduces the accuracy of topic word extraction and in turn affects the later sentence extraction. The TextRank algorithm uses co-occurring words as the edges of its graph model to measure the similarity between sentences, again without considering semantic, position, and other information. This affects the redundancy and readability of the extracted sentences. The main work of this paper is as follows:

(1) An improved topic word extraction algorithm based on TF-IDF is proposed, which integrates position information and semantic information. First, the Jieba word segmentation tool is used to segment each sentence into words, the segmented words are tagged with their part of speech, and stop words and non-nouns are removed. Second, the original TF-IDF algorithm is used to calculate the TF value, the IDF value, and the TF * IDF value of each word in the text. Third, word position information is integrated: a weight Pi is assigned according to the position of the word, giving the new value TF * IDF * Pi. Finally, the CBOW model in Word2Vec is used to transform words into word vectors, the function wordsim is used to measure the similarity between words, and synonyms with high semantic similarity are merged to determine the final topic words of the text. A comparison experiment between the TF-IDF algorithm and the improved TF-IDF algorithm is carried out on the text "2018 China artificial intelligence white paper". The results show that the value distribution of the topic words extracted by the improved algorithm is more reasonable, that is, the extraction of topic words is more accurate.

(2) Sentence processing based on the TextRank algorithm is improved. The TextRank algorithm builds its graph model with sentences as vertices and co-occurring words as edges. Because co-occurring words cannot objectively reflect the similarity of two sentences, this paper uses a Siamese neural network model to measure sentence similarity instead. First, the graph model of the algorithm is reconstructed so that each edge is built from the word vectors of the preprocessed sentences it connects. This information is fed into a convolutional neural network, the Siamese network is formed by comparing different sentences, and the sentence similarity weights are calculated. The vertex sentences in the graph model are also reconstructed to integrate position information and topic word information. The resulting sentence values are used to identify similar sentences in the text and remove redundant ones. Then, the sentences containing topic words are sorted according to the order of the topic words, sentences that share the same topic word are ordered by time and other information, and the remaining sentences are sorted by sentence value. Finally, the text summary is formed by removing redundancy from the sorted sentences. In the experiment, ROUGE, recall, and precision were used as evaluation metrics for the comparison. The results show that the improved algorithm is effective.

(3) Finally, a prototype system is implemented in Python and JavaScript. The function and performance of the system are tested, and good results are achieved.
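
The TF * IDF * Pi weighting in contribution (1) can be illustrated with a minimal Python sketch. This is not the thesis implementation: the position labels ("title", "body"), the weight values, the smoothed IDF formula, and the choice to give each word its highest-weighted position are all assumptions made for illustration, and the input is assumed to be already segmented and filtered (for example by Jieba).

```python
import math
from collections import Counter


def position_weighted_tfidf(docs, target_index, position_weight):
    """Score each word of docs[target_index] as TF * IDF * Pi.

    docs: list of documents, each a list of (word, position_label) pairs.
    position_weight: hypothetical mapping from position label to weight Pi.
    """
    target = docs[target_index]
    counts = Counter(word for word, _ in target)
    total_words = len(target)
    n_docs = len(docs)

    # Document frequency: number of documents that contain each word.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update({word for word, _ in doc})

    # Assumption: a word takes the largest weight among the positions
    # where it occurs; unknown labels fall back to a weight of 1.0.
    pos = {}
    for word, label in target:
        weight = position_weight.get(label, 1.0)
        pos[word] = max(pos.get(word, 0.0), weight)

    scores = {}
    for word, count in counts.items():
        tf = count / total_words
        # Smoothed IDF (an assumption), which is always positive.
        idf = math.log((1 + n_docs) / (1 + doc_freq[word])) + 1
        scores[word] = tf * idf * pos[word]
    return scores


# Toy corpus with hypothetical position labels.
docs = [
    [("summary", "title"), ("text", "body"), ("summary", "body"), ("graph", "body")],
    [("text", "body"), ("graph", "body")],
    [("text", "body"), ("image", "body")],
]
weights = {"title": 2.0, "body": 1.0}  # illustrative values only

scores = position_weighted_tfidf(docs, 0, weights)
for word, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{word}: {score:.4f}")
```

In this toy corpus, "summary" ranks first because it is frequent in the target document, rare elsewhere, and appears in the title. The synonym-merging step based on CBOW word vectors is not shown here.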
Keywords/Search Tags: TF-IDF, TextRank, CBOW, semantic information, position information