
Research And Implementation Of The Chinese Spell Check Model Based On Deep Learning

Posted on: 2022-01-26 | Degree: Master | Type: Thesis
Country: China | Candidate: P P Song | Full Text: PDF
GTID: 2518306605468594 | Subject: Master of Engineering
Abstract/Summary:
With the rapid development of the Internet, the volume of electronic publications has increased dramatically. Spelling errors make text inaccurate, which in turn distorts the information it conveys. To reduce the burden of manual work and improve efficiency, the media and publishing industry urgently needs automatic proofreading systems that detect and correct spelling errors efficiently. The core algorithms of mainstream Chinese proofreading systems fall into three categories: rule-based methods, statistical machine learning methods, and deep learning methods. Rule-based and statistical methods rely on a thesaurus or a statistical language model. Such methods cannot capture long-range dependencies in text, which leads to a large number of low-level false positives. Deep learning methods have stronger correction ability, but because labeled training data is scarce and the data in application scenarios is complex, they also struggle to reach ideal performance in industrial settings.

To address these problems, this paper proposes an integrated correction method based on pre-trained language models. To address the lack of training data, it combines data augmentation with rule processing to enhance the generalization ability of the model in industrial scenarios. To address the high complexity of application data, it integrates a confusable-word correction module, which strengthens the model's ability to correct confusable words. The research covers the following aspects:

(1) To achieve rapid correction and enhance the model's generalization capability, the paper constructs an end-to-end spelling correction model with an integrated threshold module. First, it implements end-to-end generative correction based on BERT. Then, to address insufficient training data, it proposes a data augmentation method for deep neural network models that incorporates feedback from test results. This forms a closed loop over the training data, so that the model can be optimized continuously. Finally, to reduce false positive predictions, it designs a threshold processing module that enhances the generalization capability of the model in industrial scenarios.

(2) To address the misuse of high-frequency confusable words in application scenarios, the paper proposes a confusable-word correction module based on a masked language model. In application scenarios, a large share of errors involve confusable words, which are difficult to distinguish with the end-to-end correction model alone. The paper collects 55 pairs of confusable words and designs a correction method for them based on the masked language model. Experimental results show that the proposed method performs better than other correction methods. Integrating the confusable-word correction module further enhances the system's overall correction ability.

(3) To address slow model prediction, the paper applies service optimization based on model solidification. The Tornado framework is used to deploy the model as an HTTP service that supports thousands of connections per second. After optimization, the model's prediction speed increased 35-fold, and the service processing speed reached 2 seconds per 10,000 words.

The end-to-end spelling correction model with the integrated threshold module obtains a character-level correction F1 score of 0.5560 on the business-scenario test set. After the confusable-word correction module is integrated, the character-level correction precision of the whole model improves by 11.2 percentage points. This research has been applied in the product "Founder Intelligent Checking V2.0", which serves more than 1,000 publishing and media organizations and hundreds of thousands of individual users. In the future, we will conduct more in-depth research on Chinese grammatical error correction in application scenarios.
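
The abstract does not specify how the threshold processing module in item (1) works. As a rough illustration only, the sketch below assumes the correction model emits one candidate character and one confidence score per input character, and that a correction is accepted only when the confidence reaches a cutoff. The function name `apply_threshold`, the `(character, probability)` prediction format, and the default cutoff of 0.9 are all hypothetical and are not taken from the thesis.

```python
from typing import List, Tuple

Edit = Tuple[int, str, str]  # (position, original character, corrected character)


def apply_threshold(
    source: str,
    predictions: List[Tuple[str, float]],
    threshold: float = 0.9,  # hypothetical cutoff, not a value from the thesis
) -> Tuple[str, List[Edit]]:
    """Keep a proposed correction only if the model is confident enough.

    `predictions` holds one (candidate character, probability) pair per
    character of `source`; this format is an assumption for illustration.
    """
    if len(source) != len(predictions):
        raise ValueError("expected one prediction per source character")

    corrected_chars: List[str] = []
    edits: List[Edit] = []
    for position, (original, (candidate, prob)) in enumerate(zip(source, predictions)):
        if candidate != original and prob >= threshold:
            # Confident change: accept it and record the edit.
            corrected_chars.append(candidate)
            edits.append((position, original, candidate))
        else:
            # Unchanged, or a low-confidence change that is treated as a
            # likely false positive: keep the original character.
            corrected_chars.append(original)
    return "".join(corrected_chars), edits


# Made-up example scores: two confident fixes and one low-confidence proposal.
example_source = "我门去公圆"
example_predictions = [("我", 0.99), ("们", 0.97), ("去", 0.99), ("工", 0.55), ("园", 0.95)]
example_text, example_edits = apply_threshold(example_source, example_predictions)
```

In this made-up example the two confident changes are applied, while the low-confidence proposal for the fourth character is discarded. Raising the cutoff trades recall for precision, which matches the stated goal of reducing false positives.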
Keywords/Search Tags: Chinese spell check, masked language model, confusable words, pre-trained language model, data augmentation