| With the deepening digitization of information and the rapid growth of self-media, the volume of Chinese electronic text has grown exponentially. This growth has brought large numbers of spelling and grammatical errors, which greatly reduce the quality of text on the Internet and hamper its use and dissemination. The problem is especially acute in news production, where proofreading first drafts is a heavy workload and relying on manual correction alone is costly and inefficient. Chinese text error correction is a classic task in natural language processing, and using a masking mechanism to locate and correct errors in text sequences is a common approach in this research. Using deep learning models to improve the accuracy of automatic Chinese error correction can help journalists complete correction work more efficiently. This paper makes the following contributions to the Chinese text error correction task: (1) Public news manuscript data are collected and, after processing, converted into a small dataset that can be used for alignment, with a focus on how well the model corrects errors on a small-sample dataset in a specific domain. (2) Because the native BERT model lacks a global error detection capability, this paper uses a bidirectional GRU network to build BGSM-BERT, a composite network consisting of an error detection network and an error correction network. The composite design improves the accuracy of predicting error locations and the robustness of the language model in text error detection. Compared with the native BERT on the SIGHAN and HIT News Set datasets, error detection accuracy increases by 2.8 and 8.2 percentage points, respectively. (3) Because the fixed mask in the native BERT makes masking inflexible, this paper proposes an improved network, KMCS-BERT, which replaces the original single-character mask with a confusion set in the pre-training stage. The confusion set is customized to the characteristics of the dataset, the proportion of phonetic-error text and structural-error text in it is controlled, and the pronunciation and strokes of the corpus are modeled, which further improves the flexibility and accuracy of the model's error correction. Compared with the native BERT on the SIGHAN and HIT News Set datasets, the method improves error detection accuracy by 13.6 and 18.9 percentage points, respectively. |
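
The confusion-set masking idea in contribution (3) can be illustrated with a minimal sketch. It is not the KMCS-BERT implementation: the toy confusion sets, the function name `confusion_mask`, and the parameters `mask_prob` and `phonetic_ratio` are all assumptions made for illustration. The sketch replaces a selected character with a confusable candidate instead of a single fixed mask token, and uses a ratio to control how often a phonetic candidate is preferred over a structural one.

```python
import random

# Hypothetical toy confusion sets (assumption); a real one would be
# derived from pronunciation and stroke data for the target corpus.
PHONETIC = {"在": ["再"], "的": ["地", "得"]}   # similar pronunciation
SHAPE = {"己": ["已", "巳"], "末": ["未"]}      # similar strokes

def confusion_mask(chars, mask_prob=0.15, phonetic_ratio=0.7, rng=None):
    """Corrupt a character sequence using confusion-set candidates.

    Returns (corrupted_chars, labels), where labels[i] is 1 if
    position i was replaced and 0 otherwise.
    """
    rng = rng or random.Random()
    corrupted, labels = [], []
    for ch in chars:
        phonetic = PHONETIC.get(ch, [])
        shape = SHAPE.get(ch, [])
        # Only characters that have at least one candidate can be corrupted.
        if (phonetic or shape) and rng.random() < mask_prob:
            # Prefer a phonetic candidate with probability phonetic_ratio,
            # falling back to whichever set is non-empty.
            use_phonetic = phonetic and (not shape or rng.random() < phonetic_ratio)
            pool = phonetic if use_phonetic else shape
            corrupted.append(rng.choice(pool))
            labels.append(1)
        else:
            corrupted.append(ch)
            labels.append(0)
    return corrupted, labels
```

Under these assumptions, the returned labels could serve as supervision for an error detection component, while the original characters serve as correction targets.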