
Research On Chinese Automatic Summary Based On Doc2Vec Algorithm And Graph Model

Posted on: 2021-05-10; Degree: Master; Type: Thesis
Country: China; Candidate: N Zhao; Full Text: PDF
GTID: 2428330611989720; Subject: Mathematics
Abstract/Summary:
In the rapidly developing era of big data, knowledge resources keep expanding. To let users quickly obtain accurate information from the mass of information on the internet, automatic summarization technology is needed to condense text. Automatic summarization uses computer technology to compress a document and extract its main content, so that readers can quickly grasp the main idea of the article. This thesis focuses on extractive automatic summarization of single documents.

The TextRank algorithm is one of the classical graph-based algorithms in automatic summarization. It is an unsupervised method that requires no training corpus and is easy to apply. Its main idea is to build a graph model in which sentences are the nodes and the similarities between sentences are the edges. The final weight of each sentence is computed with the iterative TextRank formula, and a certain number of the highest-weighted sentences are extracted as the summary. This method has three problems in constructing the graph model: the edge-weight similarity measure is inaccurate, the node-weight calculation is incomplete, and extracting multiple sentences as summary sentences introduces redundancy. This thesis optimizes the TextRank algorithm according to the characteristics of Chinese text:

(1) When constructing the edges of the graph model, TextRank summarization measures the similarity between sentences only by their vocabulary overlap and ignores the semantic information of the sentences. This thesis therefore uses the Doc2Vec model to convert text into numerical vectors of a specified dimension that carry semantic and contextual information. Measuring the similarity between sentences with the cosine similarity of these vectors reflects the relationship between sentences more accurately.

(2) This thesis improves the sentence-weight part of the traditional TextRank algorithm. Starting from the TextRank sentence weight, the method also considers characteristics of summary sentences, such as the similarity between the sentence and the title, the position of the sentence, and cue words, and modifies the sentence weight accordingly. This highlights the structural features of special sentences and lays a foundation for extracting summary sentences.

(3) Because sentence extraction is strongly affected by sentence similarity, several sentences expressing the same meaning may be extracted, which results in redundancy. The maximal marginal relevance (MMR) algorithm is used to remove redundancy from the group of candidate sentences.

The experimental results show that the summary sentences extracted by "Doc2Vec combined with the improved TextRank algorithm" raise the summary F-value by 12.73% and 17.19% at compression ratios of 10% and 20%, respectively. This shows that the algorithm can effectively improve the accuracy of automatic summarization.
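The pipeline described in the abstract (cosine similarity over sentence vectors, the iterative TextRank update, then redundancy removal with maximal marginal relevance) can be illustrated with a minimal Python sketch. This is an illustration under stated assumptions and not the thesis's implementation: the sentence vectors are hand-written placeholders standing in for Doc2Vec output, the function names are hypothetical, and the damping factor `d = 0.85` and the MMR trade-off `lam` are assumed values. The weight adjustment for title similarity, sentence position, and cue words is omitted because the abstract does not give its formula.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_matrix(vectors):
    """Edge weights of the sentence graph; no self-loops.

    Negative cosine values are clipped to 0 so that edge weights stay
    non-negative (an assumption of this sketch).
    """
    n = len(vectors)
    return [[max(0.0, cosine(vectors[i], vectors[j])) if i != j else 0.0
             for j in range(n)] for i in range(n)]


def textrank(sim, d=0.85, tol=1e-6, max_iter=200):
    """Iterate WS(i) = (1 - d) + d * sum_j sim[j][i] / sum_k sim[j][k] * WS(j)."""
    n = len(sim)
    out_sum = [sum(row) for row in sim]
    scores = [1.0] * n
    for _ in range(max_iter):
        new_scores = []
        for i in range(n):
            rank = sum(sim[j][i] / out_sum[j] * scores[j]
                       for j in range(n) if out_sum[j] > 0)
            new_scores.append((1 - d) + d * rank)
        delta = max((abs(a - b) for a, b in zip(new_scores, scores)), default=0.0)
        scores = new_scores
        if delta < tol:
            break
    return scores


def mmr_select(scores, sim, k, lam=0.7):
    """Greedily pick k sentences, trading relevance against redundancy.

    Each step picks the candidate maximising
    lam * relevance - (1 - lam) * (max similarity to already selected sentences).
    Returns the chosen indices in document order.
    """
    if not scores:
        return []
    top = max(scores) or 1.0
    relevance = [s / top for s in scores]  # scale scores so the best is 1.0
    selected = []
    candidates = set(range(len(scores)))
    while candidates and len(selected) < k:
        def mmr(i):
            redundancy = max((sim[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(sorted(candidates), key=mmr)  # ties go to the earlier sentence
        selected.append(best)
        candidates.remove(best)
    return sorted(selected)


if __name__ == "__main__":
    # Placeholder vectors; in the thesis these would come from a Doc2Vec model.
    vectors = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.1, 0.9, 0.3], [0.0, 0.2, 0.9]]
    sim = similarity_matrix(vectors)
    scores = textrank(sim)
    print(mmr_select(scores, sim, k=2))
```

Setting `lam` closer to 1 favours the highest-scoring sentences regardless of overlap, while lower values penalise candidates that resemble sentences already chosen. The number of sentences `k` would follow from the compression ratio, for example 10% or 20% of the document's sentence count.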
Keywords/Search Tags:Doc2Vec model, TextRank algorithm, maximal marginal relevance algorithm, automatic summarization