
Improvement And Research Of Sequence-to-Sequence Model For Chinese Text Summarization

Posted on: 2021-05-09
Degree: Master
Type: Thesis
Country: China
Candidate: Y L Yin
Full Text: PDF
GTID: 2428330611465563
Subject: Engineering
Abstract/Summary:
The advent of the era of big data has caused online news, social commentary, and other text data to grow explosively. To attract more readers, a group of "clickbaiters" has emerged: they deliberately embellish and exaggerate headlines that are seriously inconsistent with the actual content, misleading and deceiving readers and thereby increasing the time it costs readers to obtain key information. Condensing a long article into a concise summary is therefore a challenging research task. With the development of deep learning technology, research on text summary generation has made breakthrough progress with the help of deep neural network models. Most mainstream work is based on the sequence-to-sequence model, but this model still has the following problems: (1) it needs a large amount of labeled data, and the cost of data acquisition is high; (2) the model is simple, and the encoder has difficulty acquiring global semantic information; (3) the text is converted into word vectors before being fed to the model, and the relative isolation between words makes it difficult to extract local features of the text.

This thesis first conducts basic experiments on the Chinese text summarization task with the sequence-to-sequence model, including exploring the impact of word segmentation and of several different word vectors on the model, and then improves on the above problems accordingly.

For the first problem, this thesis studies several data augmentation algorithms for natural language processing tasks, proposes a non-core word easy data augmentation method (NCW-EDA), and compares it experimentally with the back-translation method. When data samples are insufficient, augmenting the data with NCW-EDA greatly improves summary generation, approaching the performance of a model trained on the complete data set and effectively alleviating the impact of insufficient data.

For the second problem, to strengthen the encoder's acquisition of global semantic information, this thesis tries two different methods. The first is a pre-encoder structure: before the encoder encodes, the original text is pre-encoded to obtain its general semantic information, enhancing the encoder's command of the global semantics. The second is a self-encoder structure: during training, the reference summary is encoded by an added self-encoder, and the similarity between the summary's encoding vector and the original text's is added to the final loss function, so that the encoder learns to capture key information by approximating the true summary.

For the last problem, this thesis adds region convolutional layers over the word vectors and the encoder's hidden state vectors to establish local connections between words or characters, so that each vector represents not only the word at its own position but also information about neighboring words, which better extracts the semantic features of the text.

Finally, the optimal combination of the above improved modules is compared, on the LCSTS (Large Scale Chinese Short Text Summarization) data set, with the comparison model that achieves the highest ROUGE-1, ROUGE-2, and ROUGE-L scores. The three evaluation metrics improve by 7.59%, 12.02%, and 7.54%, respectively.
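To make the NCW-EDA idea concrete, the following is a minimal sketch, not the thesis's actual algorithm: EDA-style perturbations (random deletion and random swap) are applied only to tokens outside a protected set of core words, so the key content of the sentence survives augmentation. The function name, the perturbation probability `p`, and how core words are identified are all assumptions for illustration.

```python
import random

def ncw_eda(tokens, core_words, p=0.1, seed=None):
    """Sketch of non-core-word easy data augmentation (NCW-EDA):
    perturb only tokens NOT in core_words, preserving key content.
    """
    rng = random.Random(seed)
    # Random deletion: each non-core token is dropped with probability p.
    kept = [t for t in tokens if t in core_words or rng.random() > p]
    # Random swap: exchange two non-core-word positions, if possible.
    idx = [i for i, t in enumerate(kept) if t not in core_words]
    if len(idx) >= 2:
        i, j = rng.sample(idx, 2)
        kept[i], kept[j] = kept[j], kept[i]
    return kept
```

In this sketch, core words would be those carrying the summary-relevant content; however the thesis selects them, restricting perturbations to the remaining words is what distinguishes this from plain EDA.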
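The self-encoder objective described above can be sketched as follows. This is an assumed formulation, not the thesis's exact loss: the standard cross-entropy loss is combined with a penalty of (1 − cosine similarity) between the encoder's source vector and the self-encoder's summary vector, pushing the encoder's representation toward that of the reference summary. The weighting hyperparameter `alpha` and the use of cosine similarity are assumptions.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def combined_loss(ce_loss, src_vec, sum_vec, alpha=0.5):
    """Sketch of the training objective with a self-encoder term:
    cross-entropy plus alpha * (1 - similarity), so minimizing the
    loss also pulls the source encoding toward the summary encoding.
    """
    return ce_loss + alpha * (1.0 - cosine_sim(src_vec, sum_vec))
```

When the two encodings coincide, the similarity term contributes nothing; when they are orthogonal, it adds the full `alpha` penalty, which is what drives the encoder toward the key information in the reference summary.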
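The effect of a region convolutional layer can be illustrated with a toy stand-in. This sketch replaces the thesis's learned convolution kernels with a plain unweighted average over a small window, purely to show the mechanism: each output vector mixes the vectors of neighboring positions, so it carries information about adjacent words rather than the current word alone.

```python
def region_mix(vectors, window=3):
    """Toy stand-in for a region convolutional layer: average each
    position's vector with its neighbors inside `window` (a learned
    layer would use trainable kernels instead of a plain average).
    """
    half = window // 2
    n = len(vectors)
    dim = len(vectors[0])
    out = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        region = vectors[lo:hi]
        out.append([sum(v[d] for v in region) / len(region) for d in range(dim)])
    return out
```

Applying such a layer both to the input word vectors and to the encoder's hidden states, as the abstract describes, gives every position a view of its local context before or after recurrent encoding.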
Keywords/Search Tags:Chinese Text Summarization, Sequence-to-sequence Model, Text Data Augmentation, Pre-encoder, Self-encoder, Region Convolutional Layer, Combined Model