Research On Chinese Error Correction Based On Pronunciation And Glyph

Posted on: 2023-07-26
Degree: Master
Type: Thesis
Country: China
Candidate: X Wang
Full Text: PDF
GTID: 2558307169481364
Subject: Engineering
Abstract/Summary:
With the rapid development of the Internet, manual proofreading can no longer meet the demand for error correction of massive electronic texts. Automatic Chinese Error Correction technology has been widely used in writing assistance, search engines, speech recognition, and other fields. Research on Chinese Error Correction started relatively late, and the differences between Chinese and other languages make the task more difficult, so it still has considerable room for development. Based on the error characteristics of actual text, this paper studies Chinese Spelling Correction (CSC), a subtask of Chinese Error Correction. The main contents are as follows:

1. The SIGHAN dataset and the Automatic Corpus Generation (ACG) dataset are preprocessed and converted into sentence pairs, each consisting of a sentence to be corrected and its correct counterpart, to construct the training data for this paper. Existing Chinese Spelling Correction confusion sets are of low quality and contain a large number of confusions that rarely occur in actual texts. To address this, this paper uses the Jaccard coefficient to calculate the similarity between Chinese characters from their pinyin and their components, and applies dynamic screening rules to construct a Chinese Spelling Correction confusion set.

2. Existing pre-training MASK methods tend to revise correct sentences, rely on [MASK] tokens that do not exist in real Chinese text, and fail to consider the proportions of error types in real text. To address these shortcomings, a pre-training MASK strategy based on pronunciation and glyph is proposed. In addition, adversarial attack algorithms use non-random replacement rules and produce excessively repetitive adversarial samples, and existing CSC models cannot correct unknown errors. To address these problems, an Error-Prone Sentence Generation algorithm (EPSG) is proposed, which continuously replaces error-prone sentences in the training data so that the model can explore unknown spelling errors. Finally, because the BERT model cannot learn the pinyin connection between erroneous characters and the intended characters, the Segment Embeddings in BERT are replaced with Pinyin Embeddings, and the EPSC model is constructed by combining the pre-training MASK strategy and the Error-Prone Sentence Generation algorithm proposed in this paper.

3. The dataset, the pre-training MASK strategy, the Error-Prone Sentence Generation algorithm, and the EPSC model are evaluated through 8 research questions. Compared with the baseline method, the correction method in this paper shows significant performance improvement on many evaluation metrics. EPSC's F1 score is 7% higher than BERT's. In a comparison of EPSC with 4 baseline models on three benchmark datasets, EPSC scores higher than the baselines on all evaluation metrics, and the correction-level F1 score increases by 3.1%, 2.6%, and 3.8%, respectively. The EPSC model is also used to correct errors in three popular online novels, and part of the correction results are extracted for manual checking and analysis. Correct corrections and optional corrections together account for 83.1% of the results, which demonstrates the effectiveness of EPSC in practical application.
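The confusion-set construction in item 1 can be illustrated with a minimal sketch. Everything concrete here is an assumption rather than the thesis's method: the toy `CHAR_INFO` table is hand-written, pinyin is compared as a set of letters, components use a coarse decomposition, and a fixed threshold stands in for the dynamic screening rules, which the abstract does not specify.

```python
from itertools import combinations

# Hypothetical toy data, hand-written for illustration only:
# character -> (toneless pinyin, coarse set of glyph components).
CHAR_INFO = {
    "情": ("qing", {"忄", "青"}),
    "清": ("qing", {"氵", "青"}),
    "晴": ("qing", {"日", "青"}),
    "猜": ("cai", {"犭", "青"}),
}


def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B|, treating both inputs as sets."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def similarity(c1, c2):
    """Return (pronunciation similarity, glyph similarity) for two characters."""
    pinyin1, parts1 = CHAR_INFO[c1]
    pinyin2, parts2 = CHAR_INFO[c2]
    # Assumption: pinyin is compared as a set of letters, a crude proxy.
    return jaccard(pinyin1, pinyin2), jaccard(parts1, parts2)


def build_confusion_set(threshold=0.5):
    """Pair characters whose pronunciation or glyph similarity reaches the threshold.

    The fixed threshold is a stand-in for the thesis's dynamic screening rules.
    """
    confusion = {c: set() for c in CHAR_INFO}
    for c1, c2 in combinations(CHAR_INFO, 2):
        pron, glyph = similarity(c1, c2)
        if max(pron, glyph) >= threshold:
            confusion[c1].add(c2)
            confusion[c2].add(c1)
    return confusion


if __name__ == "__main__":
    for char, candidates in build_confusion_set().items():
        print(char, sorted(candidates))
```

In this toy table the three characters pronounced "qing" end up in each other's confusion sets through identical pinyin, while 猜 shares only one component and one letter with them and is screened out at the 0.5 threshold.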
Keywords/Search Tags: Chinese Spelling Correction, Error-Prone Sentence Generation, Pronunciation and Glyph, Pinyin Embedding, Confusion Set, Pre-Training