Research On Lao Language Part-of-speech Tagging With Multiple Features

Posted on: 2021-02-17
Degree: Master
Type: Thesis
Country: China
Candidate: X J Wang
GTID: 2438330620480346
Subject: Computer technology
Abstract/Summary:
Lao online texts contain a large amount of information related to public opinion. How to extract valuable information from these data has become a research focus in natural language processing, but little work on Lao natural language processing has been done in China or abroad. As one of China's neighbors, Laos is an important partner in the Belt and Road Initiative, yet the needs of language exchange with Laos have not been met. Part-of-speech (POS) tagging is an important basic task in information extraction research. This thesis proposes a Lao POS tagging method that combines multiple features to address these difficulties. It consists of three parts:

(1) Because Lao expresses grammatical meaning through word order and is characterized by long sentences, a BiLSTM-Attention-CRF model is established as the basic framework of the POS tagging model, integrating word order features and long-range context features. First, the model uses a BiLSTM network with an attention mechanism to process the vector of each Lao word. Then a CRF layer takes the correlation between adjacent part-of-speech tags into account when computing the tag sequence. In the experiments, HMM, CRF, CNN-CRF and BiLSTM-CRF models were used for comparison. The results show that the BiLSTM-Attention-CRF model outperforms them, reaching an accuracy of 92.67%.

(2) To address the main challenge of recognizing low-frequency Lao words, this thesis proposes a "phoneme-level" word vector method that fuses phoneme features into the BiLSTM-Attention-CRF model. Phoneme features help express the morphological and structural information of words. First, the model takes phonemes as atomic units and uses a convolutional neural network with multiple filter widths to extract the feature relationships between phoneme vectors, forming "phoneme-level" word vectors. These "phoneme-level" word vectors are then concatenated with word vectors pre-trained with FastText to build a word feature vector that further enriches the morphological features of each word. According to the experimental results, the accuracy of the BiLSTM-Attention-CRF model reaches 93.11% after fusing phoneme features. The experiments also measured the absolute F1 improvement of the phoneme-augmented BiLSTM-Attention-CRF model on the main part-of-speech tags. The consistent F1 gains across these tags support the soundness of the proposed method.

(3) To further strengthen the model's recognition of low-frequency words, this thesis proposes a multi-task learning method that combines a TF-ISF auxiliary loss and a main-consonant auxiliary loss, which help the model fuse sentence topic features and main-consonant distribution features. The TF-ISF algorithm applies the topic extraction algorithm TF-IDF at the sentence level, and the main consonant is an important part of the Lao syllable. With these features fused, the accuracy of the model reaches 93.41%, which compares favorably with the language-assisted model. Moreover, to make the experimental comparison sound, this thesis also used BiLSTM-CNNs-CRF as a comparison model and tested some of the model's ideas on public Danish and Spanish corpora. The results show that the proposed method is effective at recognizing low-frequency words.
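
The abstract describes TF-ISF only as TF-IDF applied at the sentence level, so the following is a minimal sketch under that reading, not the thesis's implementation. It treats each sentence as a "document", uses a length-normalized term frequency and an unsmoothed logarithmic inverse sentence frequency; both choices, the function name `tf_isf`, and the toy corpus are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_isf(sentences):
    """Return one {word: score} dict per tokenized sentence.

    Assumed weighting (not taken from the thesis):
      tf(w, s) = count of w in s / length of s
      isf(w)   = log(N / number of sentences containing w)
    """
    n = len(sentences)
    # Sentence frequency: in how many sentences each word occurs at least once.
    sf = Counter(word for sent in sentences for word in set(sent))
    scores = []
    for sent in sentences:
        counts = Counter(sent)
        scores.append({
            w: (c / len(sent)) * math.log(n / sf[w])
            for w, c in counts.items()
        })
    return scores


# Toy English tokens stand in for segmented Lao words (hypothetical data).
corpus = [
    ["the", "river", "floods"],
    ["the", "market", "opens"],
    ["the", "river", "market"],
]
weights = tf_isf(corpus)
```

Under this weighting, a word that appears in every sentence scores zero, while a word confined to a single sentence scores highest in it. That is the sense in which such a score can act as a sentence topic signal; how the thesis turns it into an auxiliary loss is not specified in the abstract.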
Keywords/Search Tags: Part-of-speech tagging, Lao, phoneme-level, multi-task learning