Research On Chinese Spelling Check Technology Based On Machine Learning

Posted on: 2022-12-12
Degree: Master
Type: Thesis
Country: China
Candidate: Q B Zhao
GTID: 2518306605497444
Subject: Computer technology
Abstract/Summary:
With advances in technology, electronic documents have gradually replaced paper documents as an important storage medium for Chinese text. As the storage medium has changed, text input has also shifted from traditional handwriting to more efficient methods such as keyboard input, image recognition, and voice recognition. However, these technologies inevitably introduce input errors. In the Internet era, information is transmitted more easily and the volume of stored data keeps growing, so correcting erroneous data manually costs a great deal of time and labor. Research on Chinese spelling check can be applied to pinyin input methods, document editing tools, search engines, chatbots, voice assistants, and so on, and can also help Chinese learners at home and abroad study more efficiently and with less pressure. However, existing Chinese spelling check technology suffers from problems such as insufficient context information in character vectors and candidate characters that are limited by the confusion set, so new Chinese spelling check methods are urgently needed to detect and correct errors automatically and improve the efficiency of spelling check. Building on previous work, this thesis studies both traditional and deep learning methods for Chinese spelling check. Its main contributions are as follows:

(1) To address the poor performance of Chinese spelling check based on statistical language models, this thesis designs and implements NCM-Spell, a Chinese spelling check model that combines Chinese word segmentation, a pinyin-to-character conversion dictionary, and the noisy channel model. The spelling detection module of NCM-Spell uses Chinese word segmentation together with the pinyin-to-character conversion dictionary to detect spelling errors, which avoids exhaustively enumerating candidate sentences and improves both the efficiency and the performance of spelling detection. The spelling correction module of NCM-Spell uses the noisy channel model and the beam search algorithm to screen candidate sentences, which improves spelling correction performance. Experiments show that NCM-Spell outperforms the benchmark model LMC. On the SIGHAN 2013 dataset, the spelling detection and spelling correction F1 values of NCM-Spell are 1.1% and 3.3% higher than those of LMC, respectively; on the SIGHAN 2015 dataset, they are 2% and 2.9% higher, respectively.

(2) Two problems motivate the second model: the character vectors used by sequence tagging models carry insufficient context information, and the spelling correction performance of both statistical language models and deep learning models is limited by the confusion set. To address them, this thesis designs and implements MLSL-Spell, a Chinese spelling check model based on pre-trained context vectors and multi-label sequence labeling. The spelling detection module of MLSL-Spell takes context vectors generated by a Transformer encoder, pre-trained on a massive corpus with multiple pre-training tasks, so that each input character vector carries context information. The context vector is then fused with a pinyin vector and fed into a sequence labeling model composed of a bidirectional GRU network and a CRF model, which explicitly labels the error type of each wrong character. The spelling correction module of MLSL-Spell uses a masked language model (MLM) to infer candidate characters at the erroneous positions and then screens the candidates according to the error type. Because the MLM generates candidates from the full range of Chinese characters, the spelling correction performance of the model is no longer limited by the confusion set. After obtaining the final candidate characters, MLSL-Spell extracts the candidates and their features in the sentence and uses an XGBoost classifier to select the correct characters. Experiments show that MLSL-Spell clearly outperforms the benchmark model PN. On the SIGHAN 2013 dataset, the spelling detection and spelling correction F1 values of MLSL-Spell are 18.3% and 10.9% higher than those of PN, respectively; on the SIGHAN 2015 dataset, they are 15.7% and 6.8% higher, respectively.
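The abstract does not give implementation details for NCM-Spell, so the following is only a minimal sketch of how a noisy channel model and beam search can be combined for correction: each candidate sentence is scored by log P(candidate) + log P(observed | candidate), and only the best partial sentences are kept at each step. Everything here is an assumption made for illustration: the tiny pinyin dictionary, the bigram probabilities, the `KEEP_PROB` channel parameter, and the function names are hypothetical and are not taken from the thesis, and the word segmentation step used by the detection module is omitted.

```python
import math

# Hypothetical pinyin-to-character dictionary: characters sharing a pinyin
# are treated as each other's correction candidates.
PINYIN_TO_CHARS = {
    "zai": ["在", "再"],
    "jian": ["见", "件"],
}
CHAR_TO_PINYIN = {c: p for p, chars in PINYIN_TO_CHARS.items() for c in chars}

# Toy bigram language model (assumed values, not from the thesis).
BIGRAM_LOGP = {
    ("<s>", "再"): math.log(0.4),
    ("<s>", "在"): math.log(0.4),
    ("再", "见"): math.log(0.5),
    ("在", "见"): math.log(0.01),
    ("再", "件"): math.log(0.01),
    ("在", "件"): math.log(0.01),
}
UNSEEN_LOGP = math.log(1e-4)  # fallback for bigrams missing from the table
KEEP_PROB = 0.9               # assumed probability a character was typed correctly


def candidates(ch):
    """Return the same-pinyin characters for ch (including ch itself)."""
    pinyin = CHAR_TO_PINYIN.get(ch)
    if pinyin is None:
        return [ch]
    return PINYIN_TO_CHARS[pinyin]


def channel_logp(observed, intended):
    """log P(observed | intended) under a simple uniform-confusion channel."""
    n_alt = len(candidates(intended)) - 1
    if observed == intended:
        return math.log(KEEP_PROB) if n_alt else 0.0
    return math.log((1 - KEEP_PROB) / n_alt)


def correct(sentence, beam_width=3):
    """Beam search over candidate sentences, scored by LM + channel model."""
    beams = [(0.0, ["<s>"])]
    for observed in sentence:
        expanded = []
        for score, prefix in beams:
            for cand in candidates(observed):
                lm = BIGRAM_LOGP.get((prefix[-1], cand), UNSEEN_LOGP)
                new_score = score + lm + channel_logp(observed, cand)
                expanded.append((new_score, prefix + [cand]))
        # Keep only the highest-scoring partial sentences.
        expanded.sort(key=lambda item: item[0], reverse=True)
        beams = expanded[:beam_width]
    return "".join(beams[0][1][1:])


print(correct("在见"))
```

With these toy probabilities, the input 在见 is rewritten to 再见 because the language model strongly prefers the bigram 再见, which outweighs the channel penalty for changing a character. A real system would replace the toy tables with a language model trained on a corpus and a channel model estimated from error data.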
Keywords/Search Tags: Chinese Spelling Check (CSC), Noisy Channel Model, Sequence Labeling, Masked Language Model, XGBoost