Research And Application Of Chinese Spelling Correction Technology Incorporating Phonology And Glyph Features

Posted on:2023-05-13

Degree:Master

Type:Thesis

Country:China

Candidate:L J Bao

Full Text:PDF

GTID:2568306914956479

Subject:Cyberspace security

Abstract/Summary:

Chinese spelling correction (CSC) is an important and challenging task that aims to locate and correct misspelled characters in Chinese text. In Chinese natural language processing, CSC is an important post-processing step for Chinese input, such as automatic speech recognition output, and a key pre-processing step for downstream applications, including text classification and query parsing. Because Chinese has no clear boundaries between words and a very large number of characters, simple and effective spelling correction algorithms for alphabetic languages, such as edit distance, cannot be applied directly. CSC therefore requires the ability to model the language and to capture the implicit patterns by which Chinese errors are generated. Most Chinese spelling correction algorithms do not directly consider the phonological and glyph information of Chinese characters. This thesis proposes a deep-learning-based Chinese spelling correction model, Phonology and Glyph Enhanced Pre-training (PGEP). The main work and innovations of the method are as follows: (1) For phonology, PGEP uses a Bi-GRU to encode the pinyin sequence of each Chinese character as a phonology embedding; (2) For glyphs, PGEP introduces the Ideographic Description Sequence (IDS) to decompose each Chinese character into a binary tree of basic strokes, and then uses an encoder based on gated units to encode the glyph tree structure recursively; (3) At each layer of the original model, PGEP adds extra channels for phonology and glyph encoding, then applies a multi-channel fusion function and a residual connection to yield an output for each channel; (4) PGEP artificially generates a large number of annotated sentences with spelling errors and trains the model to recover the correct utterances; this task allows the newly initialized PGEP to be adapted to the language model.

Using a stand-alone spelling correction model, while general, increases the overhead of the system. For deep-learning-based Chinese text classification models, this thesis designs ECRTC, a framework that enhances their robustness. Even if the input text contains spelling errors, the model can still understand the correct spelling, so the classification results are almost unaffected. The main work and innovations of the method are: (1) ECRTC designs a Chinese spelling correction pre-training task that is added before text classification training; (2) For a specific classification task, ECRTC designs a set of adversarial sample generation algorithms that attack the classification model more effectively by locating keywords in the text and modifying them to produce text errors; (3) ECRTC proposes a dynamic data augmentation algorithm that generates adversarial samples during training, so that the same training sample contains different Chinese spelling errors in each epoch; (4) Experiments verify that Chinese spelling correction pre-training improves both original classification learning and data augmentation learning.
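The idea of synthesizing spelling errors, and of re-sampling them in every epoch, can be illustrated with a minimal sketch. This is not the thesis's algorithm: it substitutes characters uniformly at random rather than locating keywords adversarially, and the confusion set, the function names `corrupt` and `dynamic_epochs`, and the `error_rate` parameter are all assumptions made for illustration.

```python
import random

# Hypothetical toy confusion set: each character maps to characters that
# sound or look similar. A real system would load a far larger resource.
CONFUSION_SET = {
    "天": ["田", "添", "夫"],
    "气": ["汽", "器"],
    "很": ["狠", "恨"],
    "好": ["号"],
}


def corrupt(sentence, error_rate, rng):
    """Return (noisy_sentence, labels).

    labels[i] is the correct character for position i, so a correction
    model can be trained to map the noisy sentence back to the original.
    """
    noisy = []
    for ch in sentence:
        candidates = CONFUSION_SET.get(ch)
        # Only characters with known confusable alternatives can be replaced.
        if candidates and rng.random() < error_rate:
            noisy.append(rng.choice(candidates))
        else:
            noisy.append(ch)
    return "".join(noisy), list(sentence)


def dynamic_epochs(sentences, num_epochs, error_rate=0.3, seed=0):
    """Yield (epoch, noisy_sentence, labels), re-sampling errors each epoch.

    A fresh random generator per epoch means the same clean sentence can
    receive different synthetic errors from one epoch to the next.
    """
    for epoch in range(num_epochs):
        rng = random.Random(seed + epoch)
        for sentence in sentences:
            noisy, labels = corrupt(sentence, error_rate, rng)
            yield epoch, noisy, labels


if __name__ == "__main__":
    for epoch, noisy, labels in dynamic_epochs(["天气很好"], num_epochs=3):
        print(epoch, noisy, "".join(labels))
```

Because substitution is one-to-one, the noisy sentence and the label sequence always have the same length, which matches the character-level framing of Chinese spelling correction. A keyword-targeted variant, as the abstract describes for ECRTC, would replace the uniform sampling with a scoring step over positions.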

Keywords/Search Tags:

Chinese natural language processing, Chinese spelling correction, pre-trained language model, multi-modal

Related items

1	Research On Chinese Spelling Error Correction Model Based On Deep Learning
2	A Technology Of Generating SQL Through Chinese Natural Language Queries Based On Deep Learning
3	Research Of Chinese Spelling Correction System Based On Multimodal Language Model
4	OCR Error Post-correction Based On Chinese Character-level Features And Language Model
5	Optimization And Implementation Of Chinese Spelling Error Detection And Correction Algorithm
6	Research On Error Correction Method Of Chinese Short Text Based On BERT
7	Research On Error Detection Of Chinese Students' English Compositions
8	Research On The Traditional Chinese Spelling Error Detection
9	Research On Chinese Spelling Check Technology Based On Machine Learning
10	Research On Automatic Scoring Of L2 Chinese Composition Based On Fusion Strategy