
Research On Error Correction Method Of Chinese Short Text Based On BERT

Posted on: 2023-09-24
Degree: Master
Type: Thesis
Country: China
Candidate: K W Wu
GTID: 2568307118995689
Subject: Information and Communication Engineering
Abstract/Summary:
Chinese text error correction is widely used in natural language processing applications such as document editing, machine translation, and speech recognition. However, research on Chinese text error correction started late, and Chinese grammatical structure is more complex than that of English, so the performance of current Chinese text error correction models is still unsatisfactory. At present, Chinese text error correction is divided into three main directions: Chinese grammatical error detection, Chinese grammatical error correction, and Chinese spelling check. This paper addresses the problems of the mainstream models in each of these three directions to further improve model performance. The main research contents of this paper are as follows.

(1) The Chinese grammatical error detection model BERT-CRF has difficulty identifying nested errors and requires long training and inference times. To address these problems, we propose a Chinese grammatical error detection method based on Global Pointer. Unlike existing grammatical error detection methods, which use conditional random fields for sequence tagging, Global Pointer tags each error as a whole span, which solves the difficulty BERT-CRF has with nested errors. Because Global Pointer tends to merge adjacent errors into a single error, rotary position embedding is introduced to strengthen the model's focus on the location and span of each error. In addition, the error tagging task is treated as a multi-label classification task and trained with a multi-label classification loss function. This allows parallel computation during training and inference and thereby addresses the long training and inference times of BERT-CRF. The effectiveness of the proposed Chinese grammatical error detection method is verified by experiments on the CGED2018 public dataset.

(2) The Chinese grammatical error correction model GECToR pays insufficient attention to the output sequence as a whole and suffers from exposure bias. To address these problems, we propose a Chinese grammatical error correction method based on iterative training and a conditional random field. Because GECToR labels with frame-by-frame softmax, it has difficulty learning the associations between output labels. This paper therefore combines a conditional random field with an approximate parameter matrix with frame-by-frame softmax during training, which increases the model's focus on the output labels as a whole. A focus penalty strategy is also applied to the loss function to alleviate the difference in classification difficulty caused by the small number of erroneous characters. For the exposure bias caused by iterative inference in GECToR, a dynamic iterative training method is proposed so that the model can learn from results that have been corrected over multiple rounds of prediction. The effectiveness of the proposed Chinese grammatical error correction method is verified by experiments on the NLPCC2018 dataset.

(3) Conventional Chinese spelling correction models do not sufficiently consider the similarity between input and output, and they have difficulty handling consecutive character errors. To address these problems, we propose a Chinese spelling correction model based on a similar character graph and candidate recall. Because the features extracted by BERT do not sufficiently capture relationships between similar characters, a similar character graph is constructed and a graph attention network is used to introduce similar character features, so that the model can take into account the similarity between the corrected characters and the original characters. To handle consecutive character errors, a gated attention unit is used to learn the associations between adjacent characters in the candidate sentences and to select the candidate sentence with the best overall coherence as the final output, which improves the overall coherence of the output sentences. The effectiveness of the proposed Chinese spelling correction method is verified by experiments on the open-source dataset SIGHAN2015.
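
The span-level tagging described in point (1) can be illustrated with a small sketch. The abstract does not give the thesis's formulas, so this assumes the multi-label loss commonly paired with Global Pointer, log(1 + Σ exp(s_neg)) + log(1 + Σ exp(-s_pos)), and a decoding threshold of zero. The function names, the example scores, and the span positions are all hypothetical, and the BERT encoder and rotary position embedding that would produce the scores are omitted. The labels R, M, S, and W are the CGED error types (redundant, missing, selection, word order).

```python
import math


def log1p_sum_exp(values):
    """Return log(1 + sum(exp(v) for v in values)), computed stably."""
    m = max([0.0] + list(values))
    return m + math.log(math.exp(-m) + sum(math.exp(v - m) for v in values))


def multilabel_span_loss(scores, gold):
    """Assumed multi-label loss over candidate spans.

    scores: dict mapping (label, start, end) -> real-valued score
    gold:   set of (label, start, end) spans that are true errors
    Gold spans are pushed above zero and all other spans below zero.
    Every span is scored independently, so there is no sequential
    decoding step as in a CRF.
    """
    pos = [-s for span, s in scores.items() if span in gold]
    neg = [s for span, s in scores.items() if span not in gold]
    return log1p_sum_exp(pos) + log1p_sum_exp(neg)


def decode_spans(scores, threshold=0.0):
    """Keep every span whose score exceeds the threshold.

    Nested spans are separate entries, so both can be returned.
    """
    return sorted(span for span, s in scores.items() if s > threshold)


# Hypothetical scores for a six-character sentence (positions 0..5).
# The S span at position 2 is nested inside the W span covering 1..4.
scores = {
    ("W", 1, 4): 3.2,
    ("S", 2, 2): 1.5,
    ("R", 0, 0): -2.0,
    ("M", 5, 5): -4.1,
}
gold = {("W", 1, 4), ("S", 2, 2)}

print(decode_spans(scores))
print(round(multilabel_span_loss(scores, gold), 3))
```

Because each (label, start, end) triple is scored on its own, a nested error such as the S span inside the W span is simply a second entry above the threshold, which a single tag per character cannot express directly.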
Keywords/Search Tags: Chinese grammatical error detection, Chinese grammatical error correction, Chinese spelling check, Pre-trained language model