A Research On Lao Language Part-of-speech Tagging With Multi-feature Fusion

Posted on: 2022-06-10
Degree: Master
Type: Thesis
Country: China
Candidate: W Tang
Full Text: PDF
GTID: 2518306521990739
Subject: Computer application technology
Abstract/Summary:
Laos is a Southeast Asian country in the northern part of the Indochinese Peninsula and borders China. As one of the countries along the "Belt and Road" initiative, its development is closely linked with China's. The language barrier between the two countries, together with the lack of research on the Lao language at home and abroad, severely restricts exchange and development between them. The research in this thesis provides a foundation for the study of the Lao language and for follow-up Lao language processing tasks, and can promote further research on natural language processing for Lao. Based on an analysis of Lao language components, Lao sentence features, and Lao word structure features, this thesis studies part-of-speech tagging methods that fuse multiple features of the Lao language. The work consists of the following three parts.

(1) Lao sentences are long, so information is easily lost as it is passed along the sequence, which severely restricts research on Lao part-of-speech tagging. After comparing and analyzing candidate models, this thesis uses Transformer+CRF as the base model for the first time. First, Lao word vectors are used as input; second, a Transformer extracts long-range contextual information from Lao sentences, addressing the loss of important information; finally, a CRF captures the constraints between adjacent part-of-speech tags to obtain the optimal tag sequence. Experimental results on the same Lao corpus show that the base model reaches a precision, recall, and F1 score of 93.73%, 92.68%, and 93.20%, respectively.

(2) Current popular part-of-speech tagging methods rely heavily on the size of the corpus and the quality of manually extracted features. The scarcity of Lao resources and the complex morphology of the language make corpus construction and feature selection difficult, and also leave a large number of low-frequency words and out-of-vocabulary (unregistered) words whose parts of speech are hard to recognize. This thesis therefore proposes a Lao part-of-speech tagging method that fuses multi-granularity features. By adding Lao character, syllable, and word features to the Transformer+CRF model, the method captures richer information from the Lao corpus and improves tagging accuracy. First, Lao character vectors and syllable vectors are fed into a CNN, which automatically produces a character-level word feature vector and a syllable-level word feature vector rich in word-internal information; second, these two vectors are concatenated (linearly spliced) with the pre-trained word vector to obtain a Lao word feature vector that incorporates multi-granularity features; then, this word feature vector is fed into the Transformer layer to obtain the semantic features of the Lao sentence; finally, the CRF captures the constraints between adjacent part-of-speech tags to obtain the optimal tag sequence. Experimental results on the same Lao corpus show that fusing multi-granularity features effectively improves the base model's tagging performance for Lao, with an accuracy of 94.64%.

(3) The Lao corpus is scarce, which leads to unstable model performance and a risk of overfitting. This thesis proposes a multi-task Lao part-of-speech tagging method that fuses multi-granularity features, and builds a multi-task tagging model that integrates Lao character, syllable, and word features. Named entity recognition and part-of-speech tagging are both basic natural language processing tasks and are handled in very similar ways. Following the idea of multi-task learning, Lao part-of-speech tagging is therefore treated as the main task and Lao named entity recognition as the auxiliary task, and the two are trained jointly. To further demonstrate the effectiveness of multi-task learning, an Att-Bilstm-CRF model is added as a comparison model in the experiments. On the same Lao data set, the multi-task setting (part-of-speech tagging as the main task, named entity recognition as the auxiliary task) is compared with a single-task model trained on part-of-speech tagging alone. The experimental results show that, with a limited corpus, parameter sharing between the main and auxiliary tasks further improves model performance, strengthens generalization, reduces the risk of overfitting, and ultimately yields better part-of-speech tags, with the network reaching an accuracy of 94.64%.
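
The role of the CRF layer described in parts (1) and (2), choosing the tag sequence that best satisfies the constraints between adjacent part-of-speech tags, can be illustrated with a small Viterbi decoder. This is only a minimal sketch and not the thesis's implementation: the function name `viterbi_decode`, the two-tag set, and all scores are hypothetical, and in the actual model the emission scores would come from the Transformer layer and the transition scores would be learned.

```python
from typing import Dict, List


def viterbi_decode(emissions: List[Dict[str, float]],
                   transitions: Dict[str, Dict[str, float]],
                   tags: List[str]) -> List[str]:
    """Return the tag sequence with the highest emission + transition score."""
    if not emissions:
        return []
    # best[t][tag]: score of the best path that ends in `tag` at position t
    best = [{tag: emissions[0][tag] for tag in tags}]
    # back[t - 1][tag]: previous tag on that best path
    back = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions[p][cur])
            scores[cur] = best[-1][prev] + transitions[prev][cur] + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag to the start.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical toy scores (assumptions for illustration only).
TAGS = ["N", "V"]
EMISSIONS = [
    {"N": 2.0, "V": 0.5},
    {"N": 1.0, "V": 0.9},
    {"N": 1.5, "V": 0.2},
]
TRANSITIONS = {
    "N": {"N": -1.0, "V": 1.0},
    "V": {"N": 1.0, "V": -2.0},
}

# Picking the best tag per word independently ignores adjacent-tag constraints.
greedy = [max(TAGS, key=lambda tag: e[tag]) for e in EMISSIONS]
decoded = viterbi_decode(EMISSIONS, TRANSITIONS, TAGS)
print(greedy)   # ['N', 'N', 'N']
print(decoded)  # ['N', 'V', 'N']
```

In this toy case the per-word choice and the jointly decoded sequence differ at the second position, because the transition scores favor alternating tags. That is the kind of adjacent-tag constraint the abstract attributes to the CRF layer.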
Keywords/Search Tags: Part-of-Speech Tagging, Lao, Transformer