
Research On Automatic Generation Technology Of Chinese Text Proofreading Corpora

Posted on: 2022-06-15
Degree: Master
Type: Thesis
Country: China
Candidate: L J Pan
GTID: 2518306494471284
Subject: Computer Science and Technology
Abstract/Summary:
Automatic proofreading of Chinese text is an important application of natural language processing. At present, deep learning methods, used alone or combined with traditional methods, have become the mainstream approach to Chinese text proofreading. However, one of the most important challenges in using deep learning is that there is not enough labeled data for model training. To address this shortage of proofreading data, this thesis proposes a series of methods for automatically generating Chinese text proofreading corpora. Because current Chinese proofreading data falls into two categories, Chinese spelling proofreading and Chinese grammar proofreading, the work has two parts: automatically generating a Chinese spelling proofreading corpus, and automatically generating a Chinese grammar proofreading corpus.

For the spelling proofreading corpus, different input methods produce different forms of spelling errors, so this thesis builds corpora for proofreading text produced by three mainstream input methods: the Pinyin input method, the OCR input method, and the ASR input method. The corresponding corpora are referred to as the Pinyin corpus, the OCR corpus, and the ASR corpus. The Pinyin corpus is generated by converting between pinyin and Chinese characters, the OCR corpus is generated with optical character recognition (OCR) technology, and the ASR corpus is generated with automatic speech recognition (ASR) technology.

For the grammar proofreading corpus, two types of methods are mainly used. The first reuses the methods for constructing the spelling proofreading corpus, including Chinese character-to-pinyin conversion, pinyin-to-Chinese character conversion, and OCR; these are mainly used to generate wrong-character errors. The second is a translation-based method, which translates correct sentences into wrong sentences; it is mainly used to generate errors such as redundant characters, missing characters, and inverted character order.

The quality of the corpora is evaluated with error detection-based methods, statistics-based methods, and manual evaluation. Experiments show that the generated corpora simulate real erroneous sentences well and are of great help to the task of Chinese text proofreading. On the spelling proofreading data SIGHAN 2015, using the corpus constructed in this thesis gives better results than not using it, with the F1 score increasing by about 4.8%. On the published grammar proofreading data NLPTEA 2018, using the corpus constructed in this thesis increases the F1 score by up to about 5.7%.
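The idea of turning correct sentences into (wrong, correct) training pairs can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the thesis generates errors through pinyin conversion, OCR, ASR, and a translation model, whereas this sketch substitutes a tiny hand-written homophone table and simple rule-based edits. The names `HOMOPHONES`, `corrupt_spelling`, and `corrupt_grammar`, and the `rate` parameter, are all hypothetical.

```python
import random

# Hypothetical toy confusion sets: characters that share a pinyin reading.
# A real system would derive these from a pinyin dictionary, OCR, or ASR output.
HOMOPHONES = {
    "在": ["再"],
    "再": ["在"],
    "的": ["地", "得"],
    "地": ["的", "得"],
    "得": ["的", "地"],
    "他": ["她", "它"],
}


def corrupt_spelling(sentence, rng, rate=0.3):
    """Replace some characters with homophones.

    Returns the corrupted sentence and the list of changed positions,
    so each output can be paired with the original as a labeled example.
    """
    chars = list(sentence)
    positions = []
    for i, ch in enumerate(chars):
        candidates = HOMOPHONES.get(ch)
        if candidates and rng.random() < rate:
            chars[i] = rng.choice(candidates)
            positions.append(i)
    return "".join(chars), positions


def corrupt_grammar(sentence, rng):
    """Apply one rule-based edit as a stand-in for the translation-based method.

    The edit is a redundant character, a missing character, or a swap of
    two adjacent characters. Returns the new sentence and the edit type.
    """
    chars = list(sentence)
    if len(chars) < 2:
        return sentence, "none"
    op = rng.choice(["redundant", "missing", "disorder"])
    i = rng.randrange(len(chars) - 1)
    if op == "redundant":
        chars.insert(i, chars[i])      # duplicate one character
    elif op == "missing":
        del chars[i]                   # drop one character
    else:
        chars[i], chars[i + 1] = chars[i + 1], chars[i]  # swap neighbours
    return "".join(chars), op


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    correct = "他在家认真地读书"
    print(corrupt_spelling(correct, rng, rate=0.5), "<-", correct)
    print(corrupt_grammar(correct, rng), "<-", correct)
```

Each call yields a wrong sentence that can be stored next to its correct source as one corpus entry; the returned positions or edit type serve as the error label.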
Keywords/Search Tags:Chinese text proofreading, automatic generation, spelling proofreading, grammar proofreading