With the rapid development of the Internet, the amount of information generated online each day is growing at an exponential rate. At the same time, the large volume of erroneous text on the Internet hinders the dissemination of information, which makes text error correction a pressing problem. Chinese, unlike English, has a very large character set and flexible grammar, making Chinese text error correction more challenging. Chinese error correction comprises two tasks: Chinese Spelling Correction and Chinese Grammatical Error Diagnosis. Chinese Spelling Correction focuses on correcting the spelling of individual words in a document; Chinese Grammatical Error Diagnosis focuses on structural issues, including redundant words, missing words, wrong words, and disordered words. In Chinese Spelling Check, current methods rely primarily on general-purpose language models, which ignore the phonological and visual features of the incorrect words. Although neural machine translation models have been applied to Chinese Grammatical Error Diagnosis, it is difficult to train such models to human-level semantic understanding when training data are scarce, and when many different types of errors are present, existing models struggle to correct them all in a single pass.

The main contributions of this thesis are as follows:

(1) For spelling errors, a Statistical Language model integrating Phonological and Characteristic features (SLPC) and a Pre-training Model integrating Multiple Features (PMMF) are proposed. The SLPC model uses a statistical language model to score the candidates in a confusion set that integrates Chinese Phonological and Characteristic features, and takes the candidate word with the highest score as the final output. The PMMF model takes a variety of features as input at the error detection level, uses different pre-training models to generate multiple candidate sets at the error correction level, and compares them with the
Phonological and Characteristic confusion sets, taking the intersection as the final output. The Phonological confusion sets are designed with each province in mind: differences among regional dialects yield additional confusion sets. Experiments demonstrate that PMMF corrects spelling errors effectively, outperforming the traditional statistical model by 10% in F1 score and surpassing the latest pre-training model on some evaluation metrics.

(2) For grammatical errors, end-to-end neural Grammatical Error Diagnosis models (Seq2Seq-attention, Transformer) are improved with a data enhancement method based on Boost Learning and Inference. When training data are scarce, a neural machine translation model struggles to reach human-level semantic understanding; furthermore, when multiple errors occur together, it is difficult to correct them all in a single pass. This thesis therefore employs BERT pre-trained word vectors, together with data enhancement based on boost learning and inference, to improve the representation and error correction ability of the model. The Trans&E&B model achieves state-of-the-art performance on some tasks and evaluation metrics.

(3) Finally, based on the characteristics of Chinese and the shortcomings of existing technology, a Chinese spelling error correction system integrating Multi-Features and Pre-training models (MFP) is proposed. The system consists of four modules: an error detection module, an error correction module, a verification module, and a visualization module. In the error detection module, the Radical feature, the word Boundary feature, and BERT pre-trained word vectors serve as the encoder, whose output is fed into a BiLSTM-CRF network that is trained to predict error locations. The error correction module corrects incorrect words based on the context provided by the integrated pre-training model. The verification module then checks the correction results against the confusion set, and scores and sorts the
candidate sets using n-gram and perplexity-based language models. The visualization module displays the sorted results. Experiments show that the MFP system completes the spelling error correction task effectively and surpasses the best pre-training model on some evaluation metrics.
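
The verification step described for the MFP system (filter the proposed corrections against a confusion set, then score and sort the survivors with an n-gram language model) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than the thesis's implementation: the function names (`train_bigram`, `perplexity`, `verify_and_rank`), the character-level bigram model with add-one smoothing, the toy corpus, and the toy confusion set are all hypothetical.

```python
import math
from collections import Counter


def train_bigram(corpus):
    """Count character unigrams and bigrams, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        chars = ["<s>"] + list(sent) + ["</s>"]
        unigrams.update(chars)
        bigrams.update(zip(chars, chars[1:]))
    return unigrams, bigrams


def perplexity(sentence, unigrams, bigrams):
    """Perplexity of a sentence under an add-one smoothed bigram model (assumed smoothing)."""
    chars = ["<s>"] + list(sentence) + ["</s>"]
    vocab = len(unigrams)
    log_prob = 0.0
    for prev, cur in zip(chars, chars[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(chars) - 1))


def verify_and_rank(sentence, pos, candidates, confusion_set, unigrams, bigrams):
    """Keep candidates found in the confusion set of the flagged character,
    then sort them by the perplexity of the corrected sentence (lowest first)."""
    allowed = confusion_set.get(sentence[pos], set())
    kept = [c for c in candidates if c in allowed]
    scored = [
        (perplexity(sentence[:pos] + c + sentence[pos + 1:], unigrams, bigrams), c)
        for c in kept
    ]
    return [c for _, c in sorted(scored)]


# Toy data (hypothetical): a tiny corpus and a one-entry confusion set.
corpus = ["我在公园里散步", "他在家里看书", "我们在学校里学习"]
unigrams, bigrams = train_bigram(corpus)
confusion_set = {"再": {"在", "载"}}

# Assume the detector flagged position 1 and the corrector proposed three characters.
ranked = verify_and_rank("我再公园里散步", 1, ["在", "去", "载"],
                         confusion_set, unigrams, bigrams)
print(ranked)
```

In this toy run, the candidate "去" is discarded because it is not in the confusion set of "再", and the remaining candidates are ordered by how fluent the corrected sentence is under the bigram model. A real system would replace the toy corpus with a large training corpus and could use a higher-order n-gram model.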