Research On Automatic Error Correction Of Tibetan Grammar

Posted on: 2024-06-15
Degree: Doctor
Type: Dissertation
Country: China
Candidate: G C R Hua
Full Text: PDF
GTID: 1528307361982659
Subject: Computer Science and Technology
Abstract/Summary:
Automatic grammar correction is a task that integrates natural language understanding and generation. Its purpose is to correct grammatical errors in text automatically through computational and modeling approaches. It is challenging because a system must handle different error types, adapt to different semantic contexts, and deal with grammatical nuances. In recent years, driven by innovation and development in natural language processing in general, automatic grammar correction has gained increasing attention in Tibetan natural language processing. Demand for Tibetan grammar correction technology is growing and urgent, yet no application currently targets this domain. The technology has a wide range of uses: it can help Tibetan writers identify errors in their texts and assist in text editing and in the post-processing of machine translation, and it can also be applied to keyboard input, information retrieval, speech transcription, information extraction, data preprocessing, and other Tibetan information processing tasks and application scenarios. Tibetan automatic grammar correction is therefore a research subject with great value and broad application prospects.

Written Tibetan is a phonetic writing system with rich inflectional and agglutinative features. It mainly expresses subject-object events through case relationships, with verbs and function words serving as the main clues for judgment conditions and semantic connections. Processing verbs and function words computationally has therefore become an important research topic in many Tibetan natural language understanding and generation tasks.

Currently, the mainstream approach in automatic grammar correction research is sequence-to-sequence modeling. However, this approach is suited only to a limited set of correction sentence pairs, and its performance is severely constrained by the scale and quality of the labelled training data. For low-resource tasks such as Tibetan grammar correction, limited annotated data cannot effectively optimize models with millions or even tens of millions of parameters through supervised training. Based on the characteristics of Tibetan text, this thesis therefore analyzes the grammatical error types in Tibetan and proposes four key solutions: (1) the construction of a Tibetan automatic grammar correction dataset and its evaluation metrics; (2) an end-to-end approach to automatic Tibetan verb correction; (3) an automatic Tibetan function word correction method for post-processing of Chinese-Tibetan machine translation; (4) a Tibetan automatic grammar correction method that integrates pointer networks with confusion sets. The main contributions and research achievements of this thesis are summarized as follows.

First, the thesis addresses the problem of limited training data for the Tibetan automatic grammar correction task. Four data augmentation strategies are proposed: machine-translation-based augmentation, confusion-set-based augmentation, automatic generation, and collection of compositions by primary and secondary school students. After pre-processing and integration, a Tibetan automatic grammar correction dataset of 4.54 million sentence pairs is constructed. This dataset effectively addresses the shortage of Tibetan automatic grammar correction data and provides reliable data resources for further improving the correction performance of the models.

Second, the thesis focuses on verbs, which play an important role in the semantic structure of Tibetan sentences. Two end-to-end Tibetan verb correction methods are proposed. The first is a neural network model that combines a bidirectional long short-term memory (BiLSTM) network with an attention mechanism: the BiLSTM captures contextual information, and the attention mechanism improves model performance. The second is a Transformer-based model that uses the self-attention mechanism to capture long-distance dependencies in sentences and positional encoding to retain sequence order information. Experimental results show that both methods perform well on Tibetan verb correction, with high correction accuracy and efficiency, clearly outperforming baseline models composed of multiple modules.

Third, the thesis addresses the large number of Tibetan function word errors in Chinese-Tibetan machine translation and proposes a novel Tibetan function word correction model that fuses a pre-trained language model with a BiLSTM network, intended for the post-processing of Chinese-Tibetan machine translation. The model is first pre-trained on a large-scale Tibetan text corpus to learn prior knowledge of the language, which is then transferred to the function word correction task. This avoids training the entire model from scratch and improves training speed and performance. A BiLSTM layer and a fully connected layer are then stacked on top of the pre-trained language model to correct function word errors in Tibetan text. The BiLSTM layer captures the contextual information of the sentence, enabling better modeling of semantic relationships and grammatical structures; the fully connected layer performs classification to predict the correct form of each Tibetan function word. Experimental results show that this method effectively corrects function word errors in Tibetan text and significantly improves the accuracy and precision of correction.

Finally, the thesis addresses the large number of homophone and homograph errors in Tibetan text and proposes a new Tibetan automatic grammar correction method that integrates pointer networks with confusion sets. The method uses a Seq2Edit model to perform editing operations on input sentences, generating edit sequences, and uses pointer networks to select relevant Tibetan characters from the confusion set as replacement options, improving the effectiveness of correction. The confusion set contains common grammatical error patterns and is also added to the training data, enabling the model to learn a wider range of error patterns and corrections. Experimental results show that this method achieves significant improvement on Tibetan automatic grammar correction tasks and outperforms other baseline models, such as Tibetan pre-trained models, in correction effectiveness and robustness.
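
The edit-based method described last pairs per-token edit operations with replacement candidates drawn from a confusion set. The following is a minimal sketch of that idea only, not the thesis's model. The edit labels (`KEEP`, `DELETE`, `REPLACE`, `APPEND`), the function names, and the toy confusion set of romanized particles are all assumptions made for illustration, and `pick_replacement` is a plain highest-score lookup standing in for a trained pointer network.

```python
from typing import Dict, List, Optional, Set, Tuple

Edit = Tuple[str, Optional[str]]  # (operation, argument or None)

# Hypothetical toy confusion set in romanized form; not the thesis's data.
CONFUSION_SET: Dict[str, Set[str]] = {
    "gi": {"kyi", "gyi"},
    "kyi": {"gi", "gyi"},
    "gyi": {"gi", "kyi"},
}


def candidates(token: str) -> List[str]:
    """Replacement options for a token: itself first, then its confusion-set entries."""
    return [token] + sorted(CONFUSION_SET.get(token, set()))


def pick_replacement(token: str, scores: Dict[str, float]) -> str:
    """Stand-in for the pointer step: return the best-scoring candidate.

    `scores` plays the role of model output. Candidates without a score count
    as 0.0, tokens outside the candidate list are ignored, and ties keep the
    original token because it is listed first.
    """
    return max(candidates(token), key=lambda c: scores.get(c, 0.0))


def apply_edits(tokens: List[str], edits: List[Edit]) -> List[str]:
    """Apply exactly one edit per source token and return the corrected tokens."""
    if len(tokens) != len(edits):
        raise ValueError("need exactly one edit per source token")
    out: List[str] = []
    for token, (op, arg) in zip(tokens, edits):
        if op == "KEEP":
            out.append(token)
        elif op == "DELETE":
            continue
        elif op in ("REPLACE", "APPEND"):
            if arg is None:
                raise ValueError(f"{op} needs an argument")
            # REPLACE swaps the token; APPEND keeps it and adds a token after it.
            out.extend([arg] if op == "REPLACE" else [token, arg])
        else:
            raise ValueError(f"unknown edit operation: {op}")
    return out


# Usage: the scores below are made up, as if a model preferred "kyi" here.
source = ["bod", "gi", "skad"]
fixed = pick_replacement("gi", {"gi": 0.05, "kyi": 0.9})
print(apply_edits(source, [("KEEP", None), ("REPLACE", fixed), ("KEEP", None)]))
# prints ['bod', 'kyi', 'skad']
```

Restricting replacements to confusion-set candidates keeps the output space small, which is one plausible reason such a design suits homophone and homograph errors in a low-resource setting.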
Keywords/Search Tags: grammatical error correction, Tibetan grammar, Tibetan data, Tibetan grammatical error correction, Tibetan pre-training language model