Research On Vietnamese Text Grammatical Error Correction Method Integrating Multi-granularity Feature

Posted on: 2023-09-26
Degree: Master
Type: Thesis
Country: China
Candidate: Z Zhang
Full Text: PDF
GTID: 2555306797473304
Subject: Computer application technology
Abstract/Summary:
Research on Vietnamese grammatical error correction has significant academic value for natural language processing applications in Southeast Asian languages. There is little prior work on Vietnamese grammatical error correction, and few manually annotated corpora exist for it, so usable data resources are scarce. When a pretrained language model and a sequence-to-sequence generation model are used to correct Vietnamese grammatical errors, the following problems arise: insufficient training data leads to poor model performance; the word embeddings produced by the encoder lack semantic information at different granularities, such as syllable tone, part of speech and phrase components, which degrades both error detection and error correction; and the sequence generation model is uncontrollable when generating sentences, which results in low correction precision. To address these problems, the thesis proposes a grammatical error correction method for Vietnamese texts that incorporates multi-granularity features at the character, syllable and sentence levels. The detailed research work of the thesis is as follows:

(1) Construction of a Vietnamese grammatical error correction corpus: To expand the scale of grammatical error correction data, a data augmentation algorithm is proposed that takes correct Vietnamese texts and generates corresponding erroneous texts, yielding "error-correction" parallel sentence pairs. A manually built syllable confusion set dictionary and an open-source part-of-speech tagging tool are used to generate the errors, and an algorithm for automatically tagging the erroneous sentences is then designed. With the proposed method, 208,000 annotated "error-correction" parallel sentence pairs were constructed, which provide the basic data for the subsequent research.

(2) Grammatical error detection incorporating Vietnamese character and syllable features: The multilingual BERT model lacks syllable and tonal information in the word embeddings produced by its encoder. To address this, a method that incorporates Vietnamese character and syllable features is proposed to improve error detection performance. Additional character and syllable feature embeddings are added to the multilingual BERT encoder so that the detection model learns more semantic knowledge. Experimental results show that the proposed method achieves the highest F0.5 and F1 scores on the test set, 71.36% and 72.91% respectively.

(3) Vietnamese grammatical error correction incorporating features of different granularities: The sequence generation model is uncontrollable when generating sentences and fails to use Vietnamese sentence features effectively, which lowers correction precision. To address this, an “error detection-error correction” pipeline model is proposed. Additional Vietnamese sentence feature embeddings are added to the encoder of the pipeline model, and the BERT masked language model then re-predicts the syllables at the positions that the detection step flagged as erroneous. The correction step uses a language model to score the corrected candidate sentences and selects the candidate with the highest sentence score as the final output. Experiments on the Vietnamese corpus show that the proposed method achieves an F0.5 score of 42.59% and an F1 score of 42.67% on the test set, which are 16.69% and 17.84% higher than the F0.5 and F1 scores of the best baseline model, respectively.

(4) Vietnamese text error correction prototype system: On the basis of the above research, a web system based on the “error detection-error correction” pipeline model is implemented. The system includes a text input module, a text detection module, a text error correction module and other modules. The detection and correction modules are implemented with a deep neural network and an N-gram language model. The system detects spelling and grammatical errors in the input Vietnamese text and corrects any errors it finds.
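
The data augmentation step in (1) can be illustrated with a minimal Python sketch. It is not the thesis's algorithm: the confusion set entries, the function name `corrupt`, the `error_rate` parameter and the binary tag scheme are all assumptions made for illustration, and the part-of-speech-based error generation mentioned in the abstract is left out. The sketch only shows the general idea of replacing syllables of a correct sentence with entries from a confusion set while recording which positions were changed.

```python
import random

# Hypothetical confusion set (illustrative entries only, not taken from the thesis):
# each syllable maps to syllables it might plausibly be confused with,
# for example the same letters carrying a different tone mark.
CONFUSION_SET = {
    "tôi": ["tối", "tới"],
    "học": ["hóc", "học"[:-1] + "k"],
    "tiếng": ["tiến", "tiếc"],
}


def corrupt(sentence, confusion_set, error_rate=0.3, rng=None):
    """Return (noisy_sentence, tags) for a whitespace-separated sentence.

    tags[i] is 1 if the i-th syllable was replaced and 0 otherwise, so the
    pair (noisy_sentence, sentence) forms one "error-correction" example
    and tags gives the detection labels. This tag scheme is an assumption.
    """
    if rng is None:
        rng = random.Random()
    noisy, tags = [], []
    for syllable in sentence.split():
        candidates = confusion_set.get(syllable.lower())
        # Only syllables listed in the confusion set can be corrupted.
        if candidates and rng.random() < error_rate:
            noisy.append(rng.choice(candidates))
            tags.append(1)
        else:
            noisy.append(syllable)
            tags.append(0)
    return " ".join(noisy), tags


if __name__ == "__main__":
    correct = "tôi học tiếng việt"
    noisy, tags = corrupt(correct, CONFUSION_SET, error_rate=0.5,
                          rng=random.Random(0))
    print(noisy, tags, "->", correct)
```

Passing a seeded `random.Random` keeps the generated corpus reproducible, and keeping the tags alongside each noisy sentence means the same pairs can serve both a detection model and a correction model.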
Keywords/Search Tags:Vietnamese Grammatical Error Correction, Grammatical Error Detection, Data Augmentation, Multi-Granularity Feature Fusion