
Research on Chinese Text Summarization Generation Methods Based on an Improved Sequence-to-Sequence Model

Posted on: 2022-04-18    Degree: Master    Type: Thesis
Country: China    Candidate: L Peng    Full Text: PDF
GTID: 2518306575466674    Subject: Computer technology
Abstract/Summary:
In the era of information technology, the scale of data is growing exponentially. How to quickly and effectively obtain the required information from massive amounts of text has become an urgent problem, and the task of text summarization has emerged in response. Text summarization aims to compress a long text into a short text that expresses the meaning of the original accurately while retaining its key information. Compared with English text summarization, the development of Chinese text summarization has been slow. Traditional extractive methods tend to produce redundant information and semantic incoherence between sentences. Although abstractive methods can express the original information concisely by understanding the key content of the text, they are prone to generating large numbers of out-of-vocabulary words and repeated words, which harms the readability of the generated summary and lowers its quality. Therefore, this thesis studies an improved sequence-to-sequence model, and for long texts in the legal field it also proposes a hybrid model combining extractive and abstractive methods according to their characteristics. The main research work of this thesis is summarized in the following three aspects:

1. A text summarization generation method based on global gated dual encoders. Firstly, LSTM is introduced to compensate for the tendency of RNNs to forget semantic information, and a Transformer structure is used to capture the global semantics of the text. Secondly, to avoid the redundancy introduced when the Transformer extracts semantics, a global gating unit is designed to filter key information (see the sketch after this abstract). Thirdly, combined with a coverage mechanism, a pointer-generator network is used to copy out-of-vocabulary words directly from the source representation. As a result, the summaries generated by the model are semantically more complete and accurate.

2. A news text summarization method based on improved local encoding. Because the LSTM local encoder preserves a continuous sequence of information while extracting semantics, its ability to capture keyword information is limited. TextCNN, however, can strengthen the representation of n-gram information by controlling the size of its convolution kernels. A local-convolution structure is therefore used to make up for the shortcomings of the local encoder in extracting word-level information, which effectively improves the model's understanding of n-gram information.

3. A study of an extractive-abstractive hybrid method on legal data. There is currently a serious lack of summarization data in the legal field. Since the pleading materials of both parties in a lawsuit are usually long texts, they would be truncated by the first model in this thesis, causing a loss of information. The thesis therefore presents a framework that combines the extractive and abstractive methods. Firstly, a BERT model is used to extract key sentences from the data, and the extracted sentences are used as the input of the Transformer. Then, based on the attention mechanism, the sequence-to-sequence summary generation method is used to produce the legal text summary. Experimental results show that the ROUGE scores obtained by the hybrid method are significantly improved compared with the single abstractive method.
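The abstract does not include implementation details, so the following is only a minimal PyTorch sketch of the kind of gated fusion described in the first contribution: a global gating unit that blends a local LSTM encoder state with a global Transformer encoder state. All module names, layer sizes, and the exact gating formula are illustrative assumptions, not the author's published code.

```python
# Hypothetical sketch of a global gating unit fusing a local (LSTM) encoder
# with a global (Transformer) encoder, in the spirit of the dual-encoder
# design described above. Dimensions and the gating formula are assumptions.
import torch
import torch.nn as nn


class GlobalGatedDualEncoder(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Local encoder: BiLSTM whose concatenated directions equal d_model.
        self.lstm = nn.LSTM(d_model, d_model // 2, batch_first=True,
                            bidirectional=True)
        # Global encoder: a shallow Transformer encoder.
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        # Global gate: decides, per position, how much global semantics to
        # keep relative to the local LSTM representation.
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(token_ids)              # (B, T, d_model)
        local, _ = self.lstm(x)                # (B, T, d_model)
        global_ = self.transformer(x)          # (B, T, d_model)
        g = torch.sigmoid(self.gate(torch.cat([local, global_], dim=-1)))
        return g * global_ + (1.0 - g) * local  # gated fusion


# Tiny usage example with random token ids.
if __name__ == "__main__":
    enc = GlobalGatedDualEncoder(vocab_size=1000)
    ids = torch.randint(0, 1000, (2, 16))
    print(enc(ids).shape)  # torch.Size([2, 16, 256])
```

In the full model described in the thesis, this fused representation would feed a pointer-generator decoder with a coverage mechanism; that part is outside the scope of this sketch.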
Keywords/Search Tags: sequence-to-sequence framework, global semantics, global gating unit, pointer-generator network