
Research And Application Of Multi-document Extractive Summarization

Posted on: 2022-09-06    Degree: Master    Type: Thesis
Country: China    Candidate: C J Yang    Full Text: PDF
GTID: 2518306752997539    Subject: Software engineering
Abstract/Summary:
Multi-document summarization is one of the most active research problems in natural language processing. Whereas single-document summarization extracts a summary from one document, multi-document summarization extracts a summary from a collection of documents; the resulting summary is a high-level condensation of their content, and multi-document summarization technology helps readers grasp the main topics of many documents in a short time. In recent years both single-document and multi-document summarization have been widely adopted. Compared with single-document summarization, however, multi-document summarization suffers from higher algorithmic complexity, poorer summary readability, and greater summary redundancy. High-quality automatic extraction algorithms for multi-document summarization are therefore an important research topic in automatic summarization. To improve summary quality, this thesis studies and applies a multi-document extractive summarization algorithm based on keyword density, building on research into keyword extraction and related techniques. The main contributions are as follows:

1. To address the excessive number of weight parameters in the scoring functions of existing text-graph-based keyword extraction algorithms and their insufficient use of semantic features, a keyword extraction algorithm based on k-truss graph decomposition is proposed. The algorithm first obtains the hierarchical structure of the text graph through k-truss decomposition and, on this basis, extracts semantic, positional, complex-network, and other features of the text. It then computes an importance score for each node (word) in the text graph with a parameter-free scoring function and extracts keywords according to the resulting ranking. Compared with other representative keyword extraction algorithms, the proposed algorithm improves the F1 score by an average of 0.7% on four benchmark data sets, which verifies its effectiveness.

2. Most current graph-based multi-document extractive summarization algorithms use word co-occurrence as the edge relation when constructing sentence graphs, ignoring richer semantic relations, and their sentence-level scoring functions produce highly redundant summaries. To address this, a multi-document extractive summarization and de-redundancy algorithm based on a pre-trained language model and keyword density is proposed. The algorithm first obtains semantic vectors for all sentences in the documents from the proposed pre-trained language model, uses cosine similarity to build the edge relations between sentences, and scores sentences by keyword density to extract candidate summary sentences; a de-redundancy framework based on text similarity then yields a high-quality, non-redundant summary. Compared with other representative multi-document extractive summarization algorithms, the proposed algorithm improves Rouge-1, Rouge-2, and Rouge-L by 2.14%, 0.73%, and 0.52% respectively on four benchmark data sets, which verifies its effectiveness.
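To make the first contribution more concrete, the Python sketch below ranks keyword candidates with a k-truss decomposition of a word co-occurrence graph (via networkx). It is a minimal illustration under stated assumptions: the window size, the graph construction, and the parameter-free combination of trussness with a first-position feature are choices made for this sketch, not the thesis's exact feature set or scoring function.

    # Minimal k-truss keyword-scoring sketch; the co-occurrence window and the
    # trussness-times-earliness score are illustrative assumptions only.
    import networkx as nx

    def build_cooccurrence_graph(tokens, window=3):
        """Connect words that co-occur within a sliding window."""
        g = nx.Graph()
        for i, w in enumerate(tokens):
            for v in tokens[i + 1:i + window]:
                if w != v:
                    g.add_edge(w, v)
        return g

    def trussness(g):
        """Largest k for which each node still survives in the k-truss subgraph."""
        level = {n: 2 for n in g}
        k = 3
        sub = nx.k_truss(g, k)
        while sub.number_of_nodes() > 0:
            for n in sub:
                level[n] = k
            k += 1
            sub = nx.k_truss(g, k)
        return level

    def extract_keywords(tokens, top_n=10):
        g = build_cooccurrence_graph(tokens)
        truss = trussness(g)
        first_pos = {}
        for i, w in enumerate(tokens):
            first_pos.setdefault(w, i)
        # Parameter-free combination: trussness weighted by an earliness factor
        # (words that appear earlier in the text score slightly higher).
        scores = {w: truss[w] * (1.0 / (1 + first_pos[w] / len(tokens))) for w in g}
        return sorted(scores, key=scores.get, reverse=True)[:top_n]

    tokens = "graph based keyword extraction builds a word graph and ranks words".split()
    print(extract_keywords(tokens, top_n=5))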
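The second contribution can be sketched similarly. The sketch below assumes a generic sentence encoder (all-MiniLM-L6-v2 from the sentence-transformers package, used only as a stand-in for the pre-trained language model proposed in the thesis); the 0.3 edge threshold, the 0.1 connectivity weight, and the 0.7 redundancy threshold are illustrative values, not taken from the thesis.

    # Keyword-density summarization sketch with cosine-similarity sentence graph
    # and a greedy de-redundancy step; thresholds and weights are assumptions.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    def keyword_density(sentence, keywords):
        tokens = sentence.lower().split()
        return sum(t in keywords for t in tokens) / max(len(tokens), 1)

    def summarize(sentences, keywords, n_select=3,
                  edge_thresh=0.3, redundancy_thresh=0.7):
        model = SentenceTransformer("all-MiniLM-L6-v2")   # stand-in encoder
        emb = model.encode(sentences)
        emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
        sim = emb @ emb.T                                  # cosine similarities
        # Sentence graph: connect pairs whose similarity exceeds the edge
        # threshold; use each node's normalized degree as a connectivity feature.
        adj = (sim > edge_thresh) & ~np.eye(len(sentences), dtype=bool)
        degree = adj.sum(axis=1) / max(len(sentences) - 1, 1)
        # Score = keyword density, lightly boosted by graph connectivity.
        density = np.array([keyword_density(s, keywords) for s in sentences])
        scores = density + 0.1 * degree
        summary_idx = []
        for i in np.argsort(-scores):
            # De-redundancy: skip candidates too similar to already-picked ones.
            if all(sim[i, j] < redundancy_thresh for j in summary_idx):
                summary_idx.append(i)
            if len(summary_idx) == n_select:
                break
        return [sentences[i] for i in sorted(summary_idx)]

    docs_sentences = [
        "The first article describes keyword extraction from text graphs.",
        "A second article surveys multi-document summarization methods.",
        "The second article also surveys multi-document summarization methods.",
        "Weather was pleasant in the city yesterday.",
    ]
    print(summarize(docs_sentences,
                    keywords={"keyword", "summarization", "graphs"}, n_select=2))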
3. The proposed multi-document extractive summarization algorithm is applied to web page classification. In the feature extraction stage of existing web page classification algorithms, too much useless, redundant information on the page is considered, so the extracted feature vectors have an excessively high dimensionality. A web page classification algorithm based on multi-document summarization technology is therefore proposed. The algorithm first extracts the main content of the web page and segments the page text into topic paragraphs; multi-document summarization is then applied to these topic paragraphs to obtain the core textual information of the whole page. The summary sentences are semantically encoded with a pre-trained language model, and a classifier based on a convolutional neural network performs the final classification. Compared with four other advanced web page classification algorithms on a collected news web page classification data set, the proposed algorithm improves the F-measure by an average of 0.9%, which verifies its effectiveness.
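As an illustration of the classification stage in the third contribution, the PyTorch sketch below runs a small convolutional network over the encoded summary sentences of a page. The input is assumed to be a (batch, sentences, dimension) embedding matrix such as the encoder above would produce; the kernel widths, channel count, and number of classes are illustrative assumptions rather than the thesis configuration.

    # Small CNN over summary-sentence embeddings; all layer sizes are assumptions.
    import torch
    import torch.nn as nn

    class SummaryCNNClassifier(nn.Module):
        def __init__(self, emb_dim=384, num_classes=5, kernel_sizes=(2, 3), channels=64):
            super().__init__()
            # One 1-D convolution per kernel width over the sentence sequence.
            self.convs = nn.ModuleList(
                nn.Conv1d(emb_dim, channels, k) for k in kernel_sizes)
            self.fc = nn.Linear(channels * len(kernel_sizes), num_classes)

        def forward(self, sent_embeddings):          # (batch, sentences, emb_dim)
            x = sent_embeddings.transpose(1, 2)      # (batch, emb_dim, sentences)
            pooled = [conv(x).relu().amax(dim=2) for conv in self.convs]  # max-pool
            return self.fc(torch.cat(pooled, dim=1)) # class logits

    # Example: classify a page whose summary has 6 sentences of 384-d embeddings.
    model = SummaryCNNClassifier()
    logits = model(torch.randn(1, 6, 384))
    print(logits.shape)   # torch.Size([1, 5])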
Keywords/Search Tags: multi-document summary, keyword extraction, text graph, web page classification