Research And Implementation Of Abstract Automatic Generation Algorithm Based On Gensim

Posted on: 2020-09-30  Degree: Master  Type: Thesis
Country: China  Candidate: Y J Xiao  Full Text: PDF
GTID: 2428330596998337  Subject: Computer technology
Abstract/Summary:
In recent years, with the development of information technology, the way people get news has gradually shifted from newspapers to the Internet. Because news websites publish such a large volume of content, it is impossible to write summaries of tens of thousands of articles by hand. Studying the automatic generation of abstracts by computer therefore has both theoretical and practical value. After analyzing the shortcomings of existing automatic summarization algorithms, this thesis proposes an automatic abstract generation algorithm for Chinese text built on the Gensim natural language processing framework.

At present, automatic summarization methods fall mainly into two categories: generation and extraction. Because news articles are long, it is difficult to replace a long text with another short word sequence through deep learning, so the generation method is not applicable. The extraction method forms an abstract by extracting the key sentences of the article. It is not affected by article length, but the resulting abstract tends to have low fluency. The key to generating high-quality abstracts therefore lies in two aspects: accurately capturing the key information of the article, and ensuring coherence between sentences. Starting from these two aspects, the thesis first analyzes the traditional TextRank key sentence extraction algorithm and improves the low accuracy of the key sentences it extracts. It then designs an abstract generation framework that addresses the low fluency of abstracts formed by extraction and verifies how critical the information in the generated abstract is.

The algorithm is divided into two stages. (1) The key sentence generation stage: a text corpus of 300,000 Chinese articles is preprocessed, a Word2vec word vector model is trained to vectorize the text, and the TextRank algorithm is modified to accept word vectors as input, so that the cosine similarity between sentences can be calculated to extract key sentences. (2) The construction stage of the abstract generation framework, whose purpose is to assign weights to the key sentences extracted in the first stage and to further optimize and verify them. The first step assigns a weight for article structure. It borrows an idea from extraction-based summarization and assigns the weight according to the position of the key sentence in the original text, which improves fluency between sentences. The second step assigns a weight for how critical each sentence is, which verifies the accuracy of the key sentences. The thesis uses the LDA topic model to extract the article's keywords; the more keywords a key sentence contains, the higher its weight. The results of the two steps are added to obtain a total score, the sentences are sorted by score in descending order, and the top sentences are then combined in sequence to generate the abstract of the article.

Finally, using the ROUGE abstract evaluation method, the following experiments are carried out on the proposed algorithm: (1) testing the impact of different weights in the abstract generation framework on abstract quality; (2) comparative analysis with other automatic abstract algorithms; (3) analyzing the impact of article length on the algorithm. The experimental results show that, compared with the other algorithms, the abstracts generated by the proposed algorithm improve in both fluency and coverage of key information.
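
The two stages described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation, and several details are assumptions made only for illustration: a tiny hand-written `word_vectors` table stands in for the trained Word2vec model, English tokens stand in for segmented Chinese text, sentence vectors are formed by averaging word vectors, a fixed `keywords` set stands in for the LDA output, the position weight is taken as `1 - i/n`, and the selected sentences are emitted in their original order. All function names and parameters (`sentence_vector`, `textrank`, `summarize`, `n_key`, `n_out`) are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def sentence_vector(tokens, word_vectors, dim):
    """Average the vectors of known tokens (an assumed pooling choice)."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]

def textrank(vectors, damping=0.85, iterations=50):
    """Power iteration on a graph whose edge weights are cosine similarities."""
    n = len(vectors)
    # Negative similarities are clipped to zero so edge weights stay non-negative.
    sim = [[max(cosine(vectors[i], vectors[j]), 0.0) if i != j else 0.0
            for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n + damping * sum(
                sim[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    return scores

def summarize(sentences, word_vectors, keywords, dim, n_key=3, n_out=2):
    """Return the indices of the sentences chosen for the abstract."""
    n = len(sentences)
    # Stage 1: rank sentences with TextRank and keep the n_key best as key sentences.
    vectors = [sentence_vector(s, word_vectors, dim) for s in sentences]
    ranks = textrank(vectors)
    key_ids = sorted(range(n), key=lambda i: ranks[i], reverse=True)[:n_key]

    # Stage 2: total score = position weight + keyword weight (both assumed forms).
    def total(i):
        position = 1.0 - i / n  # earlier sentences score higher
        keyword = len(set(sentences[i]) & keywords) / len(keywords) if keywords else 0.0
        return position + keyword

    best = sorted(key_ids, key=total, reverse=True)[:n_out]
    return sorted(best)  # emit in original article order (an assumption)

# Toy stand-ins for the trained Word2vec model and the LDA keywords.
word_vectors = {
    "news": [1.0, 0.0], "summary": [0.9, 0.1], "algorithm": [0.8, 0.2],
    "weather": [0.0, 1.0], "rain": [0.1, 0.9],
}
sentences = [
    ["news", "summary", "algorithm"],
    ["weather", "rain"],
    ["summary", "algorithm"],
    ["news", "algorithm"],
]
keywords = {"summary", "algorithm"}

selected = summarize(sentences, word_vectors, keywords, dim=2)
print(selected)  # [0, 2]
```

In this toy input the off-topic second sentence receives the lowest TextRank score and is dropped in stage 1. Stage 2 then prefers the sentences that appear early and contain both keywords. With a real corpus, the vector table would presumably come from a trained Word2vec model and the keyword set from an LDA topic model, as the abstract describes.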
Keywords/Search Tags: Gensim natural language processing framework, Word2vec model, TextRank algorithm, abstract generation framework, ROUGE abstract evaluation method