| Chinese Grammatical Error Correction (CGEC) is an important task in natural language processing that aims to automatically detect and correct grammatical errors in a given text. Early CGEC research applied rule-based and statistical machine learning methods. After NLPCC held the CGEC evaluation task in 2018 and published the first and largest CGEC dataset, data-driven neural networks, such as sequence-to-sequence models, have been widely used for the task. However, although neural methods have achieved remarkable performance on CGEC, several problems remain. First, Chinese texts with grammatical errors can suffer from sample quality problems such as incoherent semantics and semantic loss. Second, grammatical errors usually place words in incorrect positions, which lowers the quality of the sample location information. Third, the current CGEC dataset contains only millions of samples, which cannot match the number of parameters in the neural network; this mismatch causes a serious sample sparsity problem. To address these problems, this thesis focuses on applying sample enhancement to the CGEC task. Specifically, sample enhancement is approached from three perspectives: enhanced sample semantic extraction, sample augmentation, and sample location information modeling. The contributions of our work can be summarized as follows: (1) To solve the problem of sample semantic quality in CGEC, we propose a novel architecture that enhances the semantic modeling of the input samples. The proposed architecture is an encoder-decoder model with a grammatical error weakening module, which learns to reduce the representation weight of grammatically erroneous words and to strengthen the contextual information of words in the input text. As such, the model can better
extract the semantic information of the input samples. Experimental results on the NLPCC-2018 test dataset show that weakening the influence of grammatical errors effectively improves the performance of CGEC models. (2) To address the low quality of sample location information in CGEC, we propose a novel sample location information enhancing model that dynamically models word location information in the input samples. In CGEC, correcting grammatically erroneous words usually changes word positions. The proposed model captures this position change information while the encoder of the sequence-to-sequence model extracts information from the erroneous input text. With the help of the position change information, the CGEC model can accurately capture how word positions change during error correction. Experimental results on the NLPCC-2018 test dataset show that the proposed approach helps the CGEC model accurately locate and correct grammatically erroneous words in the input sentences and brings substantial improvements in accuracy and F0.5 scores. (3) To solve the sample sparsity problem in CGEC, we propose a novel generator-discriminator framework, called Adversarial CVAE-CGEC, to generate pseudo training samples. The discriminator is a CGEC model that assigns high rewards to useful generated grammatical error sentences, guiding the learning of the generator. The generator is a conditional variational auto-encoder (CVAE) based sequence-to-sequence model, which is used to fit the distribution of the CGEC dataset and generate diverse pseudo grammatical error sentences. Experimental results show that the performance of CGEC models can be effectively improved by training with the pseudo training samples generated by Adversarial CVAE-CGEC. |
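
The F0.5 score reported above is the F-beta measure with beta = 0.5, which weights precision more heavily than recall. The following is a minimal sketch of that formula, assuming edit-level counts of true positives, false positives, and false negatives are already available. The function name `f_beta` and the example counts are hypothetical and are not taken from the thesis or from the NLPCC-2018 scorer.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """F-beta from edit-level counts; beta < 1 favors precision over recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts, for illustration only (not results from the thesis):
# 30 correct edits, 20 spurious edits, 70 missed edits.
score = f_beta(tp=30, fp=20, fn=70)
print(f"F0.5 = {score:.4f}")
```

With these illustrative counts, precision is 0.6 and recall is 0.3, so the score sits closer to precision than the balanced F1 would.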