
Research On Tibetan Word Segmentation And Part-of-speech Tagging Based On Pre-trained Language Models

Posted on: 2024-07-31  Degree: Master  Type: Thesis
Country: China  Candidate: L C R Suo  Full Text: PDF
GTID: 2555307085470694  Subject: Chinese Ethnic Language and Literature
Abstract/Summary:
The Tibetan script has a long history and a well-established grammatical system. The party and government have long attached importance to protecting and passing on the language and script of Tibetan regions, and research on Tibetan information technology has achieved significant results. Within Tibetan information processing, word segmentation and part-of-speech tagging are fundamental tasks of Tibetan natural language processing and are essential for syntactic analysis, text classification, machine translation, speech recognition, and many other downstream tasks. Many scholars and research institutions have investigated these tasks in depth, proposing rule-based, statistical, and deep-learning methods. Although good results have been achieved, the unique characteristics of the Tibetan language still pose challenges, leaving room for further research in Tibetan lexical analysis.

To improve the accuracy of Tibetan word segmentation and part-of-speech tagging, and to further advance fundamental research in Tibetan information processing, this study compares different neural network models on these tasks and selects the best-performing BERT model as its foundation. Combining this with the characteristics of Tibetan text, it proposes a masked pre-training method based on a Tibetan subword dictionary and obtains a pre-trained Tibetan language model. Experiments show that this method not only solves some problems in Tibetan word segmentation and part-of-speech tagging but also lays a foundation for other research in Tibetan information processing.

This study accomplishes the following research tasks:

1. Construction of a Tibetan word segmentation and part-of-speech tagging dataset. To meet the needs of machine learning, Tibetan corpora are pre-processed, including sentence segmentation, and training and testing corpora are created both for pre-training the Tibetan language model and for the word segmentation and part-of-speech tagging tasks.

2. Construction of a Tibetan subword dictionary. Because current Tibetan corpora are of low quality, small in quantity, and mixed with Chinese, English, and other text, this study proposes a method at a granularity between Tibetan components and characters for learning Tibetan subwords. Tibetan characters or words are broken into one or more component subwords according to character- and word-formation rules, yielding a Tibetan subword dictionary whose effectiveness is demonstrated through testing.

3. A pre-trained language model based on Tibetan subword masking. To address the shortcomings of static Tibetan word-vector models, this study proposes a subword-level Tibetan masked pre-training method. Tibetan subword-level word-vector representations and subword-level position embeddings serve as inputs to the BERT model, producing a pre-trained language model suited to Tibetan; its effectiveness is verified through word-prediction experiments and comparative experiments.

4. Tibetan word segmentation and part-of-speech tagging based on the pre-trained language model. To address the difficulty of identifying function words and out-of-vocabulary words in current Tibetan word segmentation and part-of-speech tagging, the pre-trained language model is fine-tuned with a small amount of annotated data. Through transfer learning, the syntactic and semantic features learned from high-quality data are mapped to lower-quality Tibetan data. Experiments demonstrate the effectiveness of the pre-trained language model on downstream tasks such as word segmentation and part-of-speech tagging.

This thesis achieves the following research results:

1. Obtained training and testing corpora of 178 M for pre-training the Tibetan language model, as well as training and testing corpora of about 31 M and 75 M for the word segmentation and part-of-speech tagging fine-tuning stages. The corpora cover fields such as news, entertainment, poetry, culture, and religion.

2. Validated the effectiveness of the proposed subword dictionary method for Tibetan through experiments. The subword-based method for constructing a Tibetan subword dictionary can accurately split standard Tibetan characters and serialize them, meets the input requirements of pre-trained language models, improves network performance, and also helps recognize out-of-vocabulary words and polysemous words.

3. Demonstrated the effectiveness of the pre-trained language model through experiments. Using the masked pre-training approach for Tibetan subwords, the model was trained on 1,062,180 Tibetan sentences, yielding a pre-trained language model suited to Tibetan. During pre-training, the word vectors and position vectors of Tibetan syllables are superimposed as the model's input, and masking is performed at the subword level so that an entire Tibetan syllable (word) is masked, thereby improving the BERT model's Tibetan word-vector representation and prediction ability.

4. Demonstrated the effectiveness of the pre-trained language model for Tibetan word segmentation and part-of-speech tagging through experiments. By adjusting the network parameters of the Tibetan pre-trained language model and fine-tuning it with a small amount of annotated data, recognition performance was significantly improved, and a subword-based method for Tibetan word segmentation and part-of-speech tagging was proposed. a) In the Tibetan word segmentation task, experiments on the publicly available Tibetan word segmentation evaluation corpus showed that the F1 score of pre-trained-model-based segmentation was 92.97%, 2.57% higher than segmentation without pre-training. Among word-position tagging schemes, the 2-tag scheme reached an F1 of 95.8%, 2.83% higher than the 4-tag scheme. b) In the Tibetan part-of-speech tagging task, building on the segmentation results, the dataset was trained and tested according to the standard specifications, and accuracy reached 97.04%, significantly better than traditional part-of-speech tagging methods. c) In Tibetan named entity recognition, compared with four mainstream neural network models (BiGRU-CRF, BiLSTM-CRF, IDCNN-BiLSTM-CRF, and IDCNN-BiGRU-CRF), the model in this thesis achieved an F1 of 98.85%, the best of the group.
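The abstract describes masking at the level of whole Tibetan syllables rather than individual subword pieces. The thesis's exact procedure is not given here; as a minimal sketch, assuming syllables are delimited by the tsheg mark (U+0F0B) and using hypothetical function names, whole-syllable masking might look like this:

```python
# Hypothetical sketch of syllable-level whole-word masking for Tibetan.
# Assumption (not stated in the abstract): syllables are split at the
# tsheg delimiter (U+0F0B); function names are illustrative only.
import random

TSHEG = "\u0f0b"
MASK = "[MASK]"

def split_syllables(sentence: str) -> list[str]:
    """Split a Tibetan sentence into syllables at the tsheg delimiter."""
    return [s for s in sentence.split(TSHEG) if s]

def mask_whole_syllables(syllables, mask_prob=0.15, rng=random):
    """Mask entire syllables (never a partial syllable), mirroring the
    whole-syllable masking strategy the abstract describes."""
    tokens, labels = [], []
    for syl in syllables:
        if rng.random() < mask_prob:
            tokens.append(MASK)
            labels.append(syl)   # the model must predict the full syllable
        else:
            tokens.append(syl)
            labels.append(None)  # not a prediction target
    return tokens, labels
```

Masking the whole syllable forces the model to reconstruct a complete lexical unit from context, which is the stated motivation for moving from character-level to subword/syllable-level masking.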
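The segmentation experiments compare a 2-tag and a 4-tag word-position scheme. The abstract does not name the exact tag inventories; assuming the common BMES convention for the 4-tag scheme and BI for the 2-tag scheme (an assumption, not the thesis's stated labels), converting segmented words to position tags can be sketched as:

```python
# Hypothetical sketch of the two word-position tagging schemes compared
# in the abstract. Assumption: 4-tag = BMES, 2-tag = BI (common choices,
# not confirmed by the abstract). Each word is a list of its syllables.

def tags_4(words: list[list[str]]) -> list[str]:
    """4-tag scheme: B=begin, M=middle, E=end, S=single-syllable word."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags

def tags_2(words: list[list[str]]) -> list[str]:
    """2-tag scheme: B marks a word-initial syllable, I all others."""
    tags = []
    for w in words:
        tags.extend(["B"] + ["I"] * (len(w) - 1))
    return tags
```

Under either scheme, segmentation becomes per-syllable sequence labeling, which is exactly the form a fine-tuned BERT token classifier expects; the 2-tag scheme has fewer classes to confuse, consistent with its higher reported F1.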
Keywords/Search Tags: Tibetan, subword dictionary, pre-training, word segmentation, part-of-speech tagging, named entity recognition