In the Internet era, massive volumes of data have created the problem of information overload. How to quickly identify useful texts and extract their key information has become a focus of research in natural language processing. Automatic text summarization uses a computer to comprehend and compress an original text, producing a short, non-redundant summary that is consistent with the information in the original. As an important sub-task of natural language processing, automatic text summarization is widely used in services such as search engines, news summaries, and scientific literature summaries, where it improves the efficiency with which people browse and process information. However, the summaries generated by current summarization systems often fail to capture the central idea of the original text and suffer from low content relevance and poor accuracy. These problems severely limit the wider adoption of summarization techniques. A summarization system usually consists of evaluation metrics, data, and algorithmic models, and a complete system can be built only when all three work together robustly. To address the problem of low content relevance in text summarization, and drawing on related research at home and abroad, this thesis proposes three complementary optimization methods from the three perspectives of evaluation metric, model, and data, so as to improve the content relevance of text summarization systems comprehensively and systematically. The main work of this thesis is as follows:

(1) At the evaluation metric level, SummScore, a multi-dimensional evaluation metric based on text matching, is proposed to measure summary quality. Current mainstream summarization metrics correlate poorly with human ratings and cannot accurately measure the relevance of summary content. SummScore is a comprehensive metric for summary quality built on the Cross-Encoder text matching model. It measures not only the content relevance of summaries but also their consistency, coherence, and fluency. Experiments on the SummEval dataset show that SummScore provides a comprehensive evaluation of summarization systems and correlates with human scores better than existing evaluation metrics.

(2) At the model optimization level, a joint summarization model with semantic guidance and keyword coverage, based on the Pointer Network, is proposed. The model first exploits the strength of an extractive summarization model in capturing the key sentences of an article, and uses a semantic guidance module to fuse the semantic vectors of multiple abstractive summaries into a global semantic guidance vector that helps capture the central idea of the full text. A keyword coverage mechanism then steers attention toward keywords, which encourages the generation of sentence fragments that are more logically related to the original content. Experimental analysis shows that the joint summarization model performs far better than the baseline model and effectively alleviates the baseline's problem of low content relevance.

(3) At the data strategy level, a relevance optimization training strategy based on data augmentation of hard summary samples is proposed. First, the trained summarization model is used to decode the training set, and the decoded samples are filtered by the proposed SummScore metric and by statistical features to screen out a hard sample dataset. Second, the hard sample data is extended with a data augmentation algorithm that applies equivalent greedy compression to the original text. Finally, the summarization model is trained with two-stage augmentation on the augmented dataset. Under the combined effect of hard sample data augmentation and Enhance Loss adjustment, the summarization model gains the ability to handle hard samples. Experiments with the data augmentation training strategy are carried out on multiple baseline models, and they show that the method improves the content relevance of the summaries generated by the summarization model.
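The hard sample screening step described in (3) can be illustrated with a minimal sketch. It is an illustration under stated assumptions, not the thesis's implementation: `token_overlap` is a hypothetical stand-in for SummScore (the real metric is based on a Cross-Encoder), the length ratio is an assumed example of a "statistical feature", and the names `DecodedSample`, `screen_hard_samples`, `score_threshold`, and `min_length_ratio` are invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DecodedSample:
    source: str      # original document
    reference: str   # gold summary from the training set
    generated: str   # summary decoded by the trained model


def token_overlap(source: str, summary: str) -> float:
    """Hypothetical stand-in for SummScore: the fraction of summary
    tokens that also occur in the source (NOT the real metric)."""
    summary_tokens = summary.lower().split()
    if not summary_tokens:
        return 0.0
    source_tokens = set(source.lower().split())
    hits = sum(tok in source_tokens for tok in summary_tokens)
    return hits / len(summary_tokens)


def screen_hard_samples(
    samples: List[DecodedSample],
    score_fn: Callable[[str, str], float] = token_overlap,
    score_threshold: float = 0.5,    # assumed cut-off on the relevance score
    min_length_ratio: float = 0.02,  # assumed statistical feature: summary/source length
) -> List[DecodedSample]:
    """Keep the samples whose decoded summary scores low on relevance
    or is implausibly short relative to its source."""
    hard = []
    for sample in samples:
        low_score = score_fn(sample.source, sample.generated) < score_threshold
        source_len = max(len(sample.source.split()), 1)
        length_ratio = len(sample.generated.split()) / source_len
        if low_score or length_ratio < min_length_ratio:
            hard.append(sample)
    return hard
```

Passing a different `score_fn` would let a learned relevance scorer replace the token-overlap placeholder without changing the screening logic; the thresholds here are arbitrary and would need tuning on real data.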