
Incorporating Confusion Set Knowledge In Chinese Grammar Error Correction

Posted on: 2024-09-23    Degree: Master    Type: Thesis
Country: China    Candidate: J C Li    Full Text: PDF
GTID: 2568306941463654    Subject: Computer science and technology
Abstract/Summary:
The Chinese Grammatical Error Correction (CGEC) task aims to automatically correct Chinese sentences containing grammatical errors using natural language processing technology. Chinese grammatical errors fall into four types: substitution, redundancy, deletion, and word order. According to the Chinese data of the NLPCC 2018 GEC task, substitution errors account for the largest share of all error types, about 50%, and more than 90% of substitution errors are caused by the misuse of phonologically or visually similar characters. Phonological and visual similarity knowledge is therefore critical for resolving substitution errors. However, no prior work has incorporated this knowledge into neural network-based grammatical error correction models or into back-translation data augmentation for grammatical error correction. This thesis incorporates phonological and visual similarity knowledge, in the form of confusion sets, from two directions: the model and data augmentation. Finally, this thesis uses a deep learning compiler to improve the inference speed of the neural network-based GEC model in practical use. The main contents of this thesis are as follows:

(1) Incorporating Confusion Set Knowledge with a Pointer Network for Chinese Grammatical Error Correction. Although substitution errors account for the largest proportion of all errors in CGEC datasets, no researcher had tried to incorporate phonological and visual similarity knowledge into neural network-based GEC models. To tackle this problem, this thesis makes two attempts. First, it proposes a GEC model that incorporates confusion set knowledge through a pointer network. Specifically, the model is a Seq2Edit-based GEC model that uses the pointer network to inject phonological and visual similarity knowledge. Second, during training data preprocessing, i.e., when extracting edit sequences from wrong-correct sentence pairs, this thesis proposes a confusion set guided edit distance algorithm that better extracts substitution edits between phonologically or visually similar characters. Experimental results show that both proposed methods improve model performance and make complementary contributions, and the proposed model achieves state-of-the-art results on the NLPCC 2018 evaluation dataset. Analysis shows that, compared with the baseline Seq2Edit GEC model, the overall performance gain of the proposed model mostly comes from the correction of substitution errors.

(2) Back-Translation Data Augmentation Incorporating Confusion Set Knowledge. Rule-based data augmentation with confusion sets and neural back-translation data augmentation are two common data augmentation methods in grammatical error correction, and the artificial data they generate have complementary advantages. Previous back-translation approaches mostly used a sequence-to-sequence (Seq2Seq) model, which cannot control the error rate of the generated erroneous sentences and may produce data inconsistent with the true error distribution. To address this problem, this thesis proposes a back-translation data augmentation method based on a sequence-to-edit (Seq2Edit) model to generate artificial data more controllably. Further, to combine the advantages of the two common data augmentation methods, this thesis also proposes a Seq2Edit-based back-translation method that incorporates confusion set knowledge. Experimental results show that both proposed data augmentation methods improve model performance, and the back-translation method incorporating confusion set knowledge outperforms ordinary back-translation. Analysis shows that the additional improvement from incorporating confusion set knowledge mainly comes from the correction of substitution errors.

(3) Grammatical Error Correction Model Deployment Optimization Based on a Deep Learning Compiler. To meet the inference latency requirements of neural network-based grammatical error correction models in practical applications, this thesis uses the deep learning compiler TVM to automatically optimize the inference speed of the deployed model. TVM is an end-to-end deep learning compiler designed to let machine learning engineers efficiently optimize and run deep learning models on any hardware backend. Experiments were conducted on both CPU and GPU. The results show that, compared with the baseline deployment using the PyTorch framework, deploying the grammatical error correction model compiled by TVM effectively reduces inference time and lowers GPU memory usage.
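To make the confusion set guided edit distance in (1) concrete, the following is a minimal sketch, not the thesis's implementation: substituting a character with one of its phonologically or visually confusable counterparts costs less than an ordinary substitution, so the alignment used for edit-sequence extraction prefers substitution edits between confusable pairs. The function name, cost values, and confusion-set format are illustrative assumptions.

```python
def confusion_guided_edit_distance(src, tgt, confusion, confused_cost=0.5):
    """Levenshtein-style DP in which substituting a character by a member of
    its confusion set is cheaper than an ordinary substitution, so the
    alignment favors substitution edits between confusable pairs.

    `confusion` maps a character to the set of its phonologically or
    visually similar characters (illustrative format)."""
    n, m = len(src), len(tgt)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = float(i)          # delete all of src[:i]
    for j in range(1, m + 1):
        dp[0][j] = float(j)          # insert all of tgt[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if src[i - 1] == tgt[j - 1]:
                sub = dp[i - 1][j - 1]                    # characters match
            elif tgt[j - 1] in confusion.get(src[i - 1], ()):
                sub = dp[i - 1][j - 1] + confused_cost    # confusable pair
            else:
                sub = dp[i - 1][j - 1] + 1.0              # ordinary substitution
            dp[i][j] = min(sub, dp[i - 1][j] + 1.0, dp[i][j - 1] + 1.0)
    return dp[n][m]


# Example: 理/里 are near-homophones, so 理 -> 里 counts as a cheap substitution.
confusion = {"理": {"里", "李"}}
print(confusion_guided_edit_distance("他在哪理", "他在哪里", confusion))  # 0.5
```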
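The rule-based, confusion-set-driven side of the data augmentation in (2) can be illustrated with the sketch below, under the assumption that pseudo-erroneous sentences are produced by replacing characters with members of their confusion sets at a controlled error rate. The thesis's proposed Seq2Edit back-translation generator is a learned model and is not shown here; the function name and confusion-set format are illustrative.

```python
import random

def add_confusion_noise(sentence, confusion, error_rate=0.1, seed=None):
    """Generate a pseudo-erroneous sentence from a correct one by replacing
    each character, with probability `error_rate`, by a random member of its
    confusion set. The error rate of the generated data is directly
    controllable, unlike Seq2Seq back-translation."""
    rng = random.Random(seed)
    chars = []
    for ch in sentence:
        candidates = confusion.get(ch)
        if candidates and rng.random() < error_rate:
            chars.append(rng.choice(sorted(candidates)))  # inject a confusable character
        else:
            chars.append(ch)                              # keep the original character
    return "".join(chars)


# Example: each occurrence of 里 may be corrupted into 理 or 李.
confusion = {"里": {"理", "李"}}
print(add_confusion_noise("他在哪里", confusion, error_rate=1.0, seed=0))
```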
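For the TVM-based deployment in (3), a typical compilation flow for a PyTorch model looks roughly like the sketch below. The model, input name, and shape are placeholders; the actual GEC model, tuning configuration, and measurement setup from the thesis are not shown.

```python
import torch
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# Placeholder model and input; a real GEC model and its input shape would go here.
model = torch.nn.Linear(128, 128).eval()
example_input = torch.randn(1, 128)

# Trace the PyTorch model and import it into TVM's Relay IR.
scripted = torch.jit.trace(model, example_input)
mod, params = relay.frontend.from_pytorch(scripted, [("input", example_input.shape)])

# Compile for the target hardware ("llvm" for CPU; "cuda" for GPU).
target = tvm.target.Target("llvm")
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

# Run the compiled module with the graph executor and fetch the output.
dev = tvm.cpu(0)
runtime = graph_executor.GraphModule(lib["default"](dev))
runtime.set_input("input", tvm.nd.array(example_input.numpy()))
runtime.run()
output = runtime.get_output(0).numpy()
```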
Keywords/Search Tags:Grammatical Error Correction, Confusion Set, Back-Translation, Data Augmentation, Deep Learning Compiler