
Research On Chinese Grammatical Error Correction Based On Sequence-to-Sequence Model

Posted on: 2022-08-01    Degree: Master    Type: Thesis
Country: China    Candidate: Z Q Qiu    Full Text: PDF
GTID: 2518306563479044    Subject: Computer Science and Technology
Abstract/Summary:
Grammatical error correction (GEC) is an important task in natural language processing that aims to detect and correct grammatical errors in text. With the development of deep learning and the explosive growth of data, the machine-translation paradigm has become the primary choice for the GEC task, and neural sequence-to-sequence (seq2seq) models have been widely used for it. Compared with alphabetic languages such as English, Chinese has many distinct characteristics. Moreover, there are fewer data sets for Chinese grammatical error correction, which limits the learning ability of seq2seq models. To address these problems, this paper further studies the Chinese GEC task on the basis of existing research. The main work of this paper is as follows:

(1) A two-stage model for Chinese GEC (TS-GEC) is proposed. The model consists of two independent sub-modules: a spelling-check sub-module based on a language model and a GEC sub-module based on a seq2seq model. The spelling-check sub-module corrects spelling errors in the given text, mainly non-word errors, while the seq2seq GEC sub-module corrects the remaining errors in the text, including grammatical errors and residual spelling errors. Exploiting the fact that the source sentence and the target sentence in the GEC task are in the same language, a recycle inference method based on the language model is proposed on top of the seq2seq model, which corrects multiple grammatical errors in a text through multiple rounds of inference. In addition, different initialization strategies are adopted for the embedding layers of the seq2seq model: pretrained word embeddings are used to initialize the embedding layer of the decoder, while the encoder's embedding layer is initialized randomly. This ensures that the word vectors learned by the encoder better reflect the characteristics of ungrammatical sentences and have stronger representation ability.

(2) A Chinese GEC model based on dynamic masking words (DMasking GEC) is proposed. The model is based on the Transformer, and a dynamic word-masking algorithm is introduced at the model's input stage, comprising four basic masking methods: random masking, random substitution, unk substitution, and reorder. During training, a group of masking methods is randomly selected from the four and used to add noise to the source sequence; the data set is thus modified in a small range to obtain more diverse training samples containing grammatical errors. To a certain extent, the dynamic masking algorithm alleviates the scarcity of training samples and error categories in the Chinese GEC task.

(3) Experiments are conducted on the NLPCC 2018 GEC public data set. The proposed TS-GEC and DMasking GEC models reach 31.01 and 33.71 on the F0.5 score, respectively, exceeding the best result of the NLPCC 2018 GEC task (F0.5 = 29.91) by 1.1 and 3.8 points. The experimental results demonstrate the effectiveness of the proposed models for the Chinese GEC task.
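The recycle inference idea of the TS-GEC model can be sketched as a loop that feeds the model's output back in as input until the sentence stops changing. This is a minimal illustration, not the thesis implementation: `correct_once` is a hypothetical stand-in for a single decoding pass of the trained seq2seq model, and the stopping criterion (convergence or a round limit) is an assumption.

```python
def recycle_inference(sentence, correct_once, max_rounds=3):
    """Re-correct a sentence over multiple rounds of inference.

    correct_once: a function performing one decoding pass of a GEC
    model (hypothetical interface; the thesis does not specify an API).
    Stops early once a round produces no further edits.
    """
    for _ in range(max_rounds):
        corrected = correct_once(sentence)
        if corrected == sentence:  # converged: no further edits
            break
        sentence = corrected
    return sentence


# Toy single-pass corrector that fixes only one error per call,
# mimicking a model that misses some errors in a single decode.
fixes = {
    "他去学校昨天了": "他昨天去学校了了",   # fixes word order, misses extra 了
    "他昨天去学校了了": "他昨天去学校了",   # removes the duplicated 了
}

def toy_correct_once(s):
    return fixes.get(s, s)

result = recycle_inference("他去学校昨天了", toy_correct_once)
# result == "他昨天去学校了": both errors fixed across two rounds
```

A single decoding pass of the toy model leaves one error behind; the multi-round loop is what lets all errors be repaired.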
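The dynamic word-masking step of DMasking GEC can be sketched as follows. The four operation names follow the thesis; the per-token noise probability, the subset-sampling scheme, and the `[MASK]`/`<unk>` symbols are illustrative assumptions, not details taken from the thesis.

```python
import random

def dynamic_mask(tokens, vocab, p=0.1, seed=None):
    """Noise a source token sequence for GEC training.

    A group of operations is randomly selected from the four basic
    masking methods (random masking, random substitution, unk
    substitution, reorder); each token is then perturbed with
    probability p by one operation from that group. The probability p
    and the special symbols are illustrative assumptions.
    """
    rng = random.Random(seed)
    tokens = list(tokens)
    methods = ["mask", "substitute", "unk", "reorder"]
    group = rng.sample(methods, k=rng.randint(1, len(methods)))
    for i in range(len(tokens)):
        if rng.random() >= p:
            continue  # leave this token untouched
        op = rng.choice(group)
        if op == "mask":
            tokens[i] = "[MASK]"          # random masking
        elif op == "substitute":
            tokens[i] = rng.choice(vocab)  # random substitution
        elif op == "unk":
            tokens[i] = "<unk>"           # unk substitution
        elif op == "reorder" and i + 1 < len(tokens):
            tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    return tokens
```

Applied anew at every training step, the same clean source sentence yields different noised variants, which is what produces the more diverse error-containing training samples described above.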
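The F0.5 score used for evaluation above weights precision twice as heavily as recall, which suits GEC since wrong "corrections" are more harmful than missed ones. A minimal computation of the general F-beta formula (the precision/recall values in the demo are illustrative, not from the thesis):

```python
def f_beta(precision, recall, beta=0.5):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R).

    beta=0.5 favours precision, the standard setting for GEC
    evaluations such as the NLPCC 2018 shared task.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative numbers only: precision 0.45, recall 0.18
score = f_beta(0.45, 0.18)
# score ≈ 0.3462 — close to precision despite the much lower recall
```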
Keywords/Search Tags: Chinese grammatical error correction, sequence-to-sequence model, recycle inference, dynamic masking words