
Chinese Multi-document Automatic Summarization Extraction Based On The Combination Of LDA And TextRank

Posted on: 2019-01-18
Degree: Master
Type: Thesis
Country: China
Candidate: B F Zhang
GTID: 2348330566459846
Subject: Computer application technology
Abstract/Summary:
The Internet has become an effective way for users to get news in everyday life, and users mainly obtain news through search engines. While the network provides people with abundant information resources, it also brings a large amount of redundant information, so obtaining information inevitably wastes a lot of time. Multi-document automatic summarization technology addresses this problem well. It uses machine learning, neural networks and other techniques to obtain the main information and summarize the documents, producing short summaries that convey the main content of the documents and thereby extracting the useful information accurately. This technology helps users obtain useful information in a timely and effective manner and extracts the key parts of the news, greatly improving efficiency.

At present, the commonly used summarization technology is extractive: it extracts key sentences from the original documents to form the summary. Based on this approach, and targeting the key issue of how to choose an accurate sentence scoring method during summary extraction, this paper proposes combining the Latent Dirichlet Allocation (LDA) topic model with TextRank (a graph model). First, a sentence model is established by building an LDA topic model for the pre-processed news document set. Second, the pre-processed sentences are used as TextRank input to construct the TextRank graph model of the documents; when the final weight of each graph node is computed, the topic probability obtained from the LDA topic model is used as the basis, so that sentences with a high probability have their node weights calculated first and their scores are further raised. Finally, the top-ranked sentences are extracted as summary sentences at compression ratios of 10% and 20%.

Using the above method, this paper summarizes a news corpus under the same topic and obtains a summary for that topic. The results are evaluated with five indicators: ROUGE-1, ROUGE-2, precision (P), recall (R) and F-measure (F). The experiments show that, compared with a single algorithm, the generated summaries are better and the accuracy of the results improves significantly. The summaries also have clear themes and prominent keywords.
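The scoring idea described in the abstract can be illustrated with a minimal sketch. It is not the thesis's actual formula, which the abstract does not give. It assumes that each sentence already has a topic probability (for example from a trained LDA model, passed in here as plain numbers) and that this probability replaces the uniform restart term of the standard TextRank update. The function names, the bag-of-words cosine similarity and the whitespace-style tokenizer are all hypothetical choices made for illustration; a Chinese corpus would need a proper word segmenter.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase word tokens (assumption: a Chinese pipeline would use a word segmenter)."""
    return re.findall(r"\w+", sentence.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def topic_biased_textrank(sentences, topic_prob, damping=0.85, iterations=50):
    """Score sentences with a PageRank-style update over a similarity graph.

    topic_prob[i] is an assumed per-sentence topic probability (e.g. from LDA).
    It is normalized and used in place of the uniform restart term, so sentences
    that fit the topic well start with, and keep, a higher weight.
    """
    n = len(sentences)
    bags = [Counter(tokenize(s)) for s in sentences]
    # Edge weights: similarity between every pair of distinct sentences.
    weights = [[cosine(bags[i], bags[j]) if i != j else 0.0 for j in range(n)]
               for i in range(n)]
    out_sum = [sum(row) for row in weights]
    total = sum(topic_prob)
    prior = [p / total for p in topic_prob]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            incoming = sum(weights[j][i] / out_sum[j] * scores[j]
                           for j in range(n) if out_sum[j] > 0)
            new_scores.append((1 - damping) * prior[i] + damping * incoming)
        scores = new_scores
    return scores


def summarize(sentences, topic_prob, ratio=0.2):
    """Return the top-scoring sentences, in original order, at the given compression ratio."""
    scores = topic_biased_textrank(sentences, topic_prob)
    k = max(1, round(len(sentences) * ratio))
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Calling `summarize(sentences, topic_prob, ratio=0.1)` or `ratio=0.2` corresponds to the two compression ratios mentioned above. Passing equal values in `topic_prob` reduces the update to plain TextRank, which makes it easy to compare the combined scoring against the single algorithm.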
Keywords/Search Tags:multi-document automatic summarization, LDA topic model, TextRank algorithm, abstract evaluation