
Research on a Chinese Spelling Correction System Based on a Multimodal Language Model

Posted on: 2024-08-05
Degree: Master
Type: Thesis
Country: China
Candidate: F Cui
GTID: 2568306917954139
Subject: Master of Electronic Information (Professional Degree)
Abstract/Summary:
The Chinese Spelling Correction (CSC) task aims to automatically detect and correct erroneous characters in Chinese text. Although many strong models and methods have been proposed in this field, several problems remain open. Most Chinese spelling errors arise from the misuse of characters with similar pronunciations or shapes. Many studies therefore combine the semantic features of the pre-trained language model BERT with external phonetic and character shape features. However, the three feature types are learned from different training corpora, and the resulting mismatch limits model performance; these frameworks also usually perform poorly on consecutive errors. Furthermore, as digitization advances, an increasing amount of text is typed through input methods, especially pinyin input methods, which makes similar pronunciation the primary source of spelling errors in such text. In practice, spelling correctors often work in specific domains, and the many uncommon domain-specific terms make it hard for models to handle out-of-domain text effectively. This thesis focuses on the following:

1) To address feature mismatch and consecutive errors, two Chinese spelling correction methods based on the Chinese multimodal pre-trained language model ChineseBERT are proposed. These methods introduce a multimodal Chinese pre-trained language model into the CSC task for the first time, which resolves the feature mismatch problem and simplifies the network structure. The SepSpell method splits the CSC task into two steps: a detection network first identifies erroneous characters; for those characters, the pinyin and character shape features are preserved and only the semantic features are masked before correction. This improves the model’s performance on consecutive errors. The proposed methods achieve good results on three official test sets.

2) To address errors in modern text that are caused primarily by the misuse of characters with similar pinyin, a Chinese spelling correction method based on Pinyin-Enhanced BERT is proposed. A single pinyin syllable is difficult to convert into an exact Chinese character, but several pinyin syllables taken together can be resolved much more easily. Leveraging this property, the method uses a convolutional neural network to extract features from consecutive pinyin and combines them with BERT semantic features for correction. Additionally, introducing a user dictionary partially alleviates the model’s limited adaptability to specific text domains. The method achieves satisfactory results on three official test sets and a constructed medical dataset.

3) A Chinese Spelling Correction Web system is designed and implemented using the Django framework. It automatically corrects erroneous characters in Chinese text, documents, or images submitted by users. The system consists of three main functional modules: the system environment configuration module, the text preprocessing module, and the Chinese spelling correction module. The first module handles system environment configuration, including configuration of the Chinese correction model and the server. The second module handles preprocessing, which includes document reading, text segmentation, and image recognition. The third module performs the correction itself, automatically correcting erroneous characters in Chinese text.
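The observation in 2), that one pinyin syllable is ambiguous while several together are easier to resolve, can be illustrated with a toy lookup. This is only a sketch under stated assumptions: the lexicon `TOY_LEXICON` and the helpers `build_index` and `candidates` are hypothetical names written for this example, and the thesis itself uses a convolutional neural network over pinyin features combined with BERT, not a dictionary lookup.

```python
# Toy illustration only: the lexicon is hand-written for this example and is
# not taken from the thesis, its datasets, or any real input method.
from collections import defaultdict

# Hypothetical mini-lexicon: word -> toneless pinyin syllables.
TOY_LEXICON = {
    "是": ("shi",),
    "市": ("shi",),
    "事": ("shi",),
    "时": ("shi",),
    "间": ("jian",),
    "见": ("jian",),
    "时间": ("shi", "jian"),
    "世界": ("shi", "jie"),
}


def build_index(lexicon):
    """Invert the lexicon: pinyin syllable sequence -> candidate words."""
    index = defaultdict(list)
    for word, syllables in lexicon.items():
        index[syllables].append(word)
    return index


def candidates(index, syllables):
    """Return every word whose pinyin equals the given syllable sequence."""
    return sorted(index.get(tuple(syllables), []))


INDEX = build_index(TOY_LEXICON)

# One syllable alone matches several characters in the toy lexicon ...
single = candidates(INDEX, ["shi"])
# ... while two consecutive syllables narrow the match to one word here.
pair = candidates(INDEX, ["shi", "jian"])

print(len(single), "candidates for 'shi':", single)
print(len(pair), "candidate for 'shi jian':", pair)
```

Real vocabularies are far larger and still contain homophonous multi-character words, so the example shows only the direction of the effect: longer pinyin context shrinks the candidate set, and semantic context is still needed to pick among what remains.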
Keywords/Search Tags:Chinese Spelling Correction, Multimodal, Pinyin, ChineseBERT