| Chinese spelling correction (CSC) is an important and challenging task that aims to locate and correct misspelled characters in Chinese text. In Chinese natural language processing, CSC is an important post-processing step for Chinese input, such as the output of automatic speech recognition, and a key pre-processing step for downstream applications, including text classification and query parsing. Because Chinese lacks clear word boundaries and has a very large character set, simple and effective spelling correction algorithms used for alphabetic languages, such as edit distance, cannot be applied directly. CSC therefore requires the ability to model the language and capture the implicit patterns by which Chinese errors arise. Most Chinese spelling correction algorithms do not directly consider the phonological and glyph information of Chinese characters. This paper proposes a deep-learning-based Chinese spelling correction model, Phonology and Glyph Enhanced Pre-training (PGEP). The main work and innovations of the method are as follows: (1) For phonology, PGEP uses a Bi-GRU to encode the pinyin sequence of each Chinese character as a phonology embedding; (2) For glyphs, PGEP introduces the Ideographic Description Sequence (IDS) to decompose each Chinese character into a binary tree of basic strokes, and then uses an encoder based on gated units to encode the glyph tree structure recursively; (3) At each layer of the original model, PGEP adds extra channels for phonology and glyph encoding, then applies a multi-channel fusion function and a residual connection to yield an output for each channel; (4) PGEP artificially generates a large number of annotated sentences with spelling errors and trains the model to recover the correct utterances; this task allows the newly initialized PGEP to be adapted to the language model. Using a stand-alone spelling correction model, while general, increases system overhead. For deep-learning-based Chinese text 
classification models, this thesis designs ECRTC, a framework that enhances their robustness. Even if the input text contains spelling errors, the model can still understand the intended spelling, so the classification results are almost unaffected. The main work and innovations of the method are: (1) ECRTC designs a Chinese spelling correction pre-training task, which is performed before text classification training; (2) For a specific classification task, ECRTC designs a set of adversarial sample generation algorithms that attack the classification model more effectively by locating keywords in the text and modifying them to introduce errors; (3) ECRTC proposes a dynamic data augmentation algorithm that generates adversarial samples on the fly during training, so that the same training sample contains different Chinese spelling errors in each epoch; (4) Experiments verify that Chinese spelling correction pre-training improves both the original classification training and training with data augmentation. |
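
The idea behind synthetic error generation and dynamic data augmentation, as described for PGEP item (4) and ECRTC item (3), can be illustrated with a minimal sketch. This is not the thesis implementation: the `CONFUSION` table, the `corrupt` and `epoch_samples` helpers, the substitution `rate`, and the per-epoch seeding scheme are all assumptions made for illustration, and the sketch substitutes characters at random rather than locating keywords as ECRTC does.

```python
import random

# Hypothetical confusion set (an assumption, not from the thesis): each
# character maps to phonetically or visually similar characters. A real
# system would load a much larger resource.
CONFUSION = {
    "他": ["她", "它"],
    "在": ["再"],
    "的": ["地", "得"],
    "做": ["作"],
}


def corrupt(sentence, rate, rng):
    """Return (noisy_sentence, error_positions).

    Each character that has an entry in CONFUSION is replaced with
    probability `rate` by one of its confusable characters.
    """
    chars = list(sentence)
    positions = []
    for i, ch in enumerate(chars):
        candidates = CONFUSION.get(ch)
        if candidates and rng.random() < rate:
            chars[i] = rng.choice(candidates)
            positions.append(i)
    return "".join(chars), positions


def epoch_samples(sentences, epoch, rate=0.5, base_seed=0):
    """Regenerate (noisy, clean) pairs for one epoch.

    Seeding the generator with the epoch number makes each epoch
    reproducible while letting the injected errors vary across epochs.
    """
    rng = random.Random(base_seed + epoch)
    return [(corrupt(s, rate, rng)[0], s) for s in sentences]


if __name__ == "__main__":
    clean = ["他在家做作业的时候很认真"]
    for epoch in range(3):
        print(epoch, epoch_samples(clean, epoch))
```

Because the noisy input is regenerated each epoch while the clean sentence stays fixed as the target, the same training sample can carry different spelling errors from one epoch to the next, which is the property the abstract attributes to dynamic augmentation.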