Research On Error Correction Of News Text Based On Masked Language Model

Posted on:2023-05-19

Degree:Master

Type:Thesis

Country:China

Candidate:L H Wu

GTID:2568306803476704

Subject:Computer application technology

Abstract/Summary:

With the ongoing digitization of information and the rapid growth of self-media, the volume of Chinese electronic text has grown exponentially. This growth has been accompanied by large numbers of spelling and grammatical errors, which greatly reduce the quality of text on the Internet and hinder its use and dissemination. In the news field in particular, proofreading first drafts is a heavy workload, and relying on manual correction alone is costly and inefficient. Chinese text error correction is a classic task in natural language processing, and using a MASKED mechanism to locate and correct errors in text sequences is a common approach in this research. Using deep learning models to improve the accuracy of automatic Chinese error correction can help journalists complete text correction work more efficiently. This thesis makes the following contributions to the Chinese text error correction task:

(1) Public news manuscript data is collected and processed into a small dataset suitable for alignment, with a focus on how well the model corrects errors on a small-sample dataset in a specific domain.

(2) To address the native BERT model's lack of global error detection capability, a bidirectional GRU network is used to build BGSM-BERT, a composite network consisting of an error detection network and an error correction network. This improves the accuracy of predicting error locations and the robustness of the language model in error detection. Compared with native BERT, error detection accuracy increases by 2.8 and 8.2 percentage points on the SIGHAN and HIT News Set datasets, respectively.

(3) To address the inflexibility caused by the fixed mask in native BERT, an improved network, KMCS-BERT, is proposed. In the pre-training stage, a confusion set replaces the original single-character mask. The confusion set is customized to the characteristics of the dataset, the proportion of phonetic-error text and structural-error text in it is controlled, and the pronunciation and strokes of the corpus are modeled, further improving the flexibility and accuracy of the model's error correction. Compared with native BERT, the method improves error detection accuracy by 13.6 and 18.9 percentage points on the SIGHAN and HIT News Set datasets, respectively.
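
The confusion-set masking idea in (3) can be illustrated with a minimal Python sketch. This is not the KMCS-BERT implementation, which the abstract does not specify. The toy `CONFUSION_SET`, the function name `confusion_mask`, and the parameters `mask_prob` and `phonetic_ratio` are all assumptions introduced here for illustration. The sketch replaces a selected character with a similar-sounding or similar-looking one instead of a fixed mask token, and `phonetic_ratio` stands in for the controlled proportion of phonetic to structural errors.

```python
import random

# Hypothetical toy confusion set (not taken from the thesis): each character
# maps to similar-sounding ("phonetic") and similar-looking ("shape") candidates.
CONFUSION_SET = {
    "在": {"phonetic": ["再"], "shape": ["存"]},
    "已": {"phonetic": ["以"], "shape": ["己"]},
    "做": {"phonetic": ["作"], "shape": ["故"]},
}


def confusion_mask(chars, mask_prob=0.15, phonetic_ratio=0.7, rng=None):
    """Corrupt a character sequence using the confusion set.

    Returns (corrupted, labels), where labels[i] is 1 if position i was
    replaced and 0 otherwise. Characters absent from the confusion set
    are always left unchanged.
    """
    rng = rng or random.Random()
    corrupted, labels = [], []
    for ch in chars:
        entry = CONFUSION_SET.get(ch)
        if entry is not None and rng.random() < mask_prob:
            # Choose the error type according to the assumed target proportion.
            kind = "phonetic" if rng.random() < phonetic_ratio else "shape"
            corrupted.append(rng.choice(entry[kind]))
            labels.append(1)
        else:
            corrupted.append(ch)
            labels.append(0)
    return corrupted, labels


if __name__ == "__main__":
    noisy, flags = confusion_mask(list("我已在做"), mask_prob=0.5,
                                  rng=random.Random(0))
    print("".join(noisy), flags)
```

The returned labels mark which positions were corrupted, so pairs produced this way could in principle supervise both an error detector and an error corrector. How the thesis actually builds its training pairs is not stated in the abstract.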

Keywords/Search Tags:

News field, Chinese text error correction, MASKED mechanism, Small sample, Bi-GRU, Confusion set

Related items

1	Research On Word Error Correction Methods Of Chinese Text
2	Incorporating Confusion Set Knowledge In Chinese Grammar Error Correction
3	Research And Application Of Chinese Text Error Correction Methods For Various Error Type
4	Research On Chinese Grammatical Error Correction Based Sample Enhancement
5	Research And Application Of Chinese Text Error Correction Method
6	Research On Chinese Text Error Correction For Different Error Types
7	Research On Chinese Text Error Correction
8	Research On Chinese Text Real-Word Error Automatic Detection And Correction Algorithm
9	Chinese Picture Text Extraction And Error Correction Based On Deep Learning
10	Research On Error Correction Method Of Chinese Short Text Based On BERT