
Research On Key Technologies Of English-to-Chinese Cross-lingual Summarization Based On Deep Learning

Posted on: 2024-08-12    Degree: Master    Type: Thesis
Country: China    Candidate: H Y Pan    Full Text: PDF
GTID: 2568307100473154    Subject: Software engineering
Abstract/Summary:
The rapid development of the Internet has made it easier than ever for people to obtain information. With the arrival of the big data era, text resources are growing exponentially. However, massive texts contain more and more redundant and invalid information, which makes it difficult for people to obtain useful information in a timely and effective manner. Cross-lingual summarization (CLS) condenses texts in an unfamiliar language into summaries in a familiar language, making it easier for people to accurately and efficiently grasp the main information of texts in an unfamiliar language. Given that English is the most widely used language in the world, this dissertation focuses on the key technologies of English-to-Chinese CLS.

At present, the mainstream methods of CLS are based on deep learning. Research on English-to-Chinese CLS based on deep learning faces at least three problems: (1) CLS datasets are of low quality and inadequate size. (2) CLS models are weak in three types of abilities: semantic understanding, cross-lingual alignment, and text generation. (3) The traditional training objective of CLS relies only on the statistical information of characters and ignores the guiding value of semantic information during training. This dissertation carries out targeted research and improvement on these three problems, and mainly completes the following work:

1. To address the low quality and inadequate size of CLS datasets, this dissertation proposes a CLS dataset construction method based on filtering and text augmentation. First, a multi-strategy filtering algorithm removes low-quality samples from the monolingual summarization dataset from the perspectives of characters and semantics, improving the quality of the data source. Then, a source-side translation method converts the monolingual summarization dataset into a CLS dataset with a reliable machine translation system, forming an initial CLS dataset. Finally, a text augmentation algorithm based on a pre-trained model generates CLS samples through the self-attention mechanism and the masked language model, expanding the scale of the CLS dataset while preserving quality. A comprehensive evaluation of the construction process and its result indicates that this method can produce high-quality, large-scale English-to-Chinese CLS datasets at low cost. In addition, in experiments on two benchmark datasets, the best of three baseline models is optimized by the proposed text augmentation algorithm; compared with its raw performance, ROUGE-1 increases by 2.06% and 2.04%, ROUGE-2 by 4.23% and 2.18%, and ROUGE-L by 4.65% and 2.05%, respectively, showing that the algorithm can effectively improve the performance of CLS models.

2. To address the weakness of CLS models in the above three types of abilities, this dissertation proposes an English-to-Chinese CLS model based on multi-stage training. First, the model is trained on a multilingual denoising pre-training task, learning general language knowledge of Chinese and English and obtaining good initialization parameters. Then, it is trained on a multilingual machine translation task, simultaneously learning semantic understanding of English, cross-lingual alignment from English to Chinese, and text generation in Chinese. Finally, it is trained on the CLS task, further adapting these abilities to CLS. Experimental results indicate that the proposed model achieves good CLS performance, and that both the multilingual denoising pre-training task and the multilingual machine translation task improve it. In experiments on two English-to-Chinese CLS datasets, compared with the best of the three baseline models, this model increases ROUGE-1 by 79.65% and 78.05%, ROUGE-2 by 108.54% and 147.37%, and ROUGE-L by 84.05% and 81.44%, respectively.

3. To address the fact that the traditional CLS objective relies only on the statistical information of characters and ignores the guiding value of semantic information during training, this dissertation proposes a semantic-fused objective and trains CLS models with it. First, a monolingual semantic similarity called fast BERTScore is designed based on the BERT model and used as the monolingual semantic objective. Then, a cross-lingual semantic similarity called XLM-RoBERTa Score is designed based on the XLM-RoBERTa model and used as the cross-lingual semantic objective. Finally, the two semantic objectives are fused with the traditional CLS objective to obtain the semantic-fused objective, which jointly guides training through the concrete supervision information of characters and the abstract supervision information of semantics. Experimental results indicate that the proposed objective effectively improves the performance of CLS models. In experiments on two English-to-Chinese CLS datasets, the best of five baseline models is trained with the proposed objective; compared with training on the traditional objective, ROUGE-1 increases by 1.81% and 2.12%, ROUGE-2 by 2.23% and 8.14%, and ROUGE-L by 2.49% and 3.02%, respectively.
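The "statistical information of characters" that the traditional objective and the ROUGE metrics rely on is surface n-gram overlap. A minimal character-level ROUGE-N recall sketch (simplified: real ROUGE implementations also report precision and F1, and ROUGE-L uses longest common subsequence rather than n-grams):

```python
from collections import Counter

def char_ngrams(text, n):
    """Return the multiset of character n-grams in `text`."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def rouge_n_recall(reference, candidate, n=1):
    """Character-level ROUGE-N recall: the fraction of reference
    n-grams that also appear in the candidate, with clipped counts."""
    ref, cand = char_ngrams(reference, n), char_ngrams(candidate, n)
    if not ref:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())
```

Because this score depends only on which characters overlap, two summaries with the same meaning but different wording can receive very different scores, which is the limitation the semantic-fused objective targets.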
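The exact formulations of fast BERTScore and XLM-RoBERTa Score are given in the dissertation body; as a generic sketch, a BERTScore-style similarity greedily matches each token to its most similar token on the other side using cosine similarity over contextual embeddings. Here the embedding matrices are stand-ins for vectors produced by BERT or XLM-RoBERTa:

```python
import numpy as np

def bertscore_f1(ref_emb, cand_emb):
    """BERTScore-style greedy-matching F1 over token embeddings.
    ref_emb: (num_ref_tokens, dim); cand_emb: (num_cand_tokens, dim)."""
    # Normalize rows so that dot products are cosine similarities.
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    sim = ref @ cand.T                   # (ref x cand) cosine matrix
    recall = sim.max(axis=1).mean()      # best match for each reference token
    precision = sim.max(axis=0).mean()   # best match for each candidate token
    return 2 * precision * recall / (precision + recall)
```

For the cross-lingual case, using a multilingual encoder such as XLM-RoBERTa lets the English source and Chinese summary be embedded into one shared space before matching.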
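The fusion step can be illustrated as a weighted combination of the traditional character-level loss with penalties derived from the two semantic similarities. The weights and the exact fusion form below are illustrative assumptions, not the dissertation's formula:

```python
def semantic_fused_loss(ce_loss, mono_sim, cross_sim, w_mono=0.1, w_cross=0.1):
    """Fuse the traditional character-level loss (e.g. cross-entropy) with
    monolingual and cross-lingual semantic objectives. Similarities lie in
    [0, 1]; higher semantic similarity lowers the fused loss."""
    return ce_loss + w_mono * (1.0 - mono_sim) + w_cross * (1.0 - cross_sim)
```

With this shape, the character-level term provides concrete token supervision while the semantic terms reward summaries that are close in meaning even when their wording differs.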
Keywords/Search Tags: cross-lingual summarization, deep learning, dataset construction, multi-stage training, semantic-fused objective