Unsupervised And Low-Resource Part-of-Speech Tagging Based On CRF Auto Encoder

Posted on:2024-06-18

Degree:Master

Type:Thesis

Country:China

Candidate:Y Li

GTID:2568306941464494

Subject:Computer Science and Technology

Abstract/Summary:

Part-of-speech (POS) tagging determines the grammatical category of each word in a sentence and assigns the corresponding POS tag, such as noun, verb, or adjective. POS tags express the grammatical role of a word in its sentence, so they can assist numerous natural language processing (NLP) tasks, such as named entity recognition, dependency parsing, and information extraction. With the development of deep learning, the performance of supervised POS tagging models has reached a high level. However, those models rely heavily on high-quality annotated data for training, and their performance on low-resource or even no-resource languages remains unsatisfactory. For most languages with scarce annotation resources, it is therefore valuable to study POS tagging methods that require no annotated data or only a small amount of it. This thesis first attempts to improve the Conditional Random Field Autoencoder (CRF-AE) for unsupervised POS tagging from two perspectives. It then provides the unsupervised POS tagging model with a limited amount of additional resources, as would be available in realistic settings, to explore its performance in various low-resource scenarios. Specifically, the main contents of this thesis are as follows:

(1) Unsupervised POS tagging based on Gaussian reconstructed CRF

CRF-AE assumes that the sequential latent structure of a sentence, e.g., its POS tags, should permit reconstruction of the original sentence with high probability. The model therefore first generates the latent structure with a CRF encoder and then (re)generates the observations conditioned only on the predicted structure. First, we propose to regenerate the pre-trained word embeddings given the POS tag, rather than the words themselves, since regenerating words directly may lead to a data sparsity problem. We assume that the word vectors of different POS tags follow different multivariate Gaussian distributions, and we estimate the probability of reconstructing each word with the corresponding probability density function. Second, for the encoder, we introduce a pre-trained language model (PLM) to encode context information, but the POS-irrelevant information it contains is prone to introduce noise into model learning. To solve this problem, we propose to compress the information in the PLM through local information representation and an information bottleneck perceptron, which removes redundant information. Experimental results show that our method achieves performance similar to the current best method on the Penn Treebank (PTB) dataset and surpasses existing methods on the Universal Dependencies Treebank 2.0 (UD) dataset.

(2) Unsupervised POS tagging based on artificial knowledge reconstructed CRF

In the previous chapter, we used CRF-AE as the base model for unsupervised POS tagging and proposed reconstructing pre-trained word embeddings given POS tags as the model's learning target. However, word embeddings still contain a lot of POS-irrelevant information, which may introduce noise into model learning. In this chapter, we therefore introduce POS-related artificial knowledge features directly through feature templates and take the reconstruction of these features from POS tags as the model's new learning target. Compared with word embeddings, the association between artificial knowledge features and POS information is simpler, more direct, and more explicit. Experimental results show that our model performs significantly better than existing methods on both datasets.

(3) Low-resource POS tagging based on unsupervised models

Because existing unsupervised POS tagging models still struggle to meet practical requirements, low-resource POS tagging has greater practical research value. Existing low-resource POS tagging methods focus primarily on improving the data rather than the model. In this chapter, we apply unsupervised POS tagging models to low-resource scenarios and improve performance there by exploiting their ability to learn from unlabeled data. Based on a summary of previous work, we set up two low-resource scenarios: a few-sample scenario and a dictionary-labeling scenario. We select a variety of representative unsupervised POS tagging models for the experiments, analyze the results in detail, and provide a reasonable explanation for them.

In summary, we first conducted an in-depth study of unsupervised POS tagging; through improvements to the CRF-AE model, the performance of unsupervised POS tagging improved significantly. We also investigated the performance of unsupervised POS tagging methods in two low-resource scenarios and analyzed the experimental results in detail. We hope that this work will be helpful for future related research.
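
The Gaussian reconstruction objective described in part (1) can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the array shapes, and the use of diagonal covariances are assumptions made here for brevity (the abstract only states that each POS tag has its own multivariate Gaussian). The sketch computes log sum_y p(y | x) * prod_i N(e_i | mu_{y_i}, var_{y_i}), where p(y | x) is a linear-chain CRF over tag sequences and e_i is the pre-trained embedding of word i, using the forward algorithm.

```python
import numpy as np

def gaussian_log_density(embeddings, means, variances):
    """Log density of each embedding under each tag's Gaussian.

    Assumption: diagonal covariance per tag.
    embeddings: (n_words, dim); means, variances: (n_tags, dim).
    Returns an (n_words, n_tags) array.
    """
    dim = embeddings.shape[1]
    diff = embeddings[:, None, :] - means[None, :, :]            # (n_words, n_tags, dim)
    log_det = np.sum(np.log(variances), axis=1)                  # (n_tags,)
    mahalanobis = np.sum(diff ** 2 / variances[None], axis=2)    # (n_words, n_tags)
    return -0.5 * (dim * np.log(2 * np.pi) + log_det[None, :] + mahalanobis)

def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def crf_log_partition(unary, transitions):
    """Forward algorithm: log of the summed exp-scores of all tag sequences.

    unary: (n_words, n_tags) per-position tag scores.
    transitions: (n_tags, n_tags), indexed [previous_tag, next_tag].
    """
    alpha = unary[0]
    for t in range(1, unary.shape[0]):
        alpha = logsumexp(alpha[:, None] + transitions, axis=0) + unary[t]
    return logsumexp(alpha, axis=0)

def reconstruction_log_likelihood(unary, transitions, embeddings, means, variances):
    """Marginal log-likelihood of reconstructing the embeddings from the tags."""
    recon = gaussian_log_density(embeddings, means, variances)
    # Adding the reconstruction log-densities to the unary scores and
    # renormalizing by the encoder's partition function marginalizes over tags.
    joint = crf_log_partition(unary + recon, transitions)
    return joint - crf_log_partition(unary, transitions)
```

In training, the negative of this quantity would serve as the loss, with `unary` produced by the encoder (for example, from PLM features) and `means` and `variances` learned per tag; those choices are outside the scope of this sketch.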

Keywords/Search Tags:

Part-of-speech tagging, Unsupervised learning, Pre-trained language model, Low-resource learning

Related items

1	Research On Parallel Corpora-based Unsupervised Part-of-speech Tagging For Chinese
2	Research On Lao Language Part-of-speech Tagging With Multiple Features
3	Research On Part-of-Speech Tagging Algorithms Of Mathematical Corpus Based On Deep Learning
4	Improved Harmony Search Algorithm And Its Application In Part-of-speech Tagging Model Of Miao Language In Western Hunan
5	Research On Kirghiz Basic Part-of-Speech Tagging Based On HMM
6	Research On Laodian Participle And Part-of-speech Tagging Method
7	Research On Part-of-Speech Tagging With Transformation-Based Learning
8	Study Of Kazak Part-of-Speech Tagging Based Upon HMM
9	A Research On Lao Language Part-of-speech Tagging With Multi-feature Fusion
10	Chinese Word Found Its Part Of Speech Tagging