
Research On Key Techniques In Zero-shot Cross-lingual Sequence Labeling

Posted on: 2022-12-20
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S N Liang
Full Text: PDF
GTID: 1488306758479254
Subject: Computer application technology
Abstract/Summary:
With the development from economic globalization to cultural globalization, people hope to obtain information from all over the world, anytime and anywhere, to facilitate production and life. In global communication, text is the primary form in which human beings record and express information, and the process of information fusion is inevitably influenced by the language of the text. Therefore, how to effectively process multilingual textual information is an important challenge currently faced by academia and industry. In recent years, deep learning has profoundly shaped the development and research paradigms of natural language processing. Deep-learning-based methods typically require large-scale parameters and data to obtain good performance, which is a limitation in multilingual scenarios. For multilingual natural language processing tasks, first, training data in low-resource languages is usually scarce, and second, maintaining a separate model for each language significantly increases the complexity of research and application. Cross-lingual natural language processing has therefore arisen, with the core idea of transferring knowledge from the source language (high-resource) to target languages (low-resource), so as to address problems including data scarcity, system complexity, and target-language model performance. As a fundamental task and an important component of natural language processing, sequence labeling aims to extract specific information from unstructured text data; it is an essential prerequisite for many downstream tasks and is equally valuable in cross-lingual natural language processing. In this thesis, we take cross-lingual sequence labeling as the research topic, survey current methods centered on zero-shot learning, and propose solutions to the existing problems. The detailed work is as follows:

1. Cross-lingual named entity recognition based on reinforcement learning and knowledge distillation. Deep neural networks have been widely adopted for sequence labeling tasks; however, most approaches are only suitable for a few high-resource languages. In fact, most languages in the world have limited labeled data, or even limited unlabeled data, for sequence labeling. Faced with this challenge, early cross-lingual sequence labeling relied on source-language training data and translation data. To leverage unlabeled data in the target languages, which is relatively easy to collect in industrial applications, existing research proposes a semi-supervised knowledge distillation method that transfers knowledge from a teacher model to a student model. Its drawback is that the student model imitates all of the teacher model's predictive behaviors. To address this drawback, from the data perspective, we propose a knowledge distillation method based on reinforcement learning and semi-supervised learning for cross-lingual named entity recognition, which makes good use of unlabeled data in the target languages. The knowledge distillation process iterates for several rounds and adaptively selects, via reinforcement learning, the unlabeled samples used in distillation. The experimental results show that the proposed method dynamically selects distillation examples for the model across iterations and languages, achieving efficient use of unlabeled data in the target languages. Its performance is significantly better than the baselines and even exceeds that of the multi-source-language knowledge distillation method.
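To make the selection-then-distillation loop concrete, the following Python sketch illustrates one plausible round under stated assumptions; it is not the thesis implementation. The names teacher, student, selector (the reinforcement learning policy, assumed here to return a scalar selection probability per batch), and evaluate (e.g., span F1 on a small development set) are hypothetical stand-ins, and the selector is updated with REINFORCE using the student's improvement over the round as the reward.

    # Illustrative sketch only: one round of reinforced sample selection
    # for teacher-student distillation. All components are hypothetical
    # stand-ins, not the thesis code.
    import torch
    import torch.nn.functional as F

    def distillation_round(teacher, student, selector, unlabeled_batches,
                           evaluate, student_opt, selector_opt):
        baseline = evaluate(student)                # reward baseline (dev F1)
        log_probs = []
        for tokens in unlabeled_batches:
            with torch.no_grad():
                t_logits = teacher(tokens)          # teacher's soft labels
            p = selector(tokens).clamp(1e-6, 1 - 1e-6)  # P(select this batch)
            action = torch.bernoulli(p)             # sample select / skip
            log_probs.append(action * p.log() + (1 - action) * (1 - p).log())
            if action.item() == 1:                  # distill on kept batches only
                loss = F.kl_div(F.log_softmax(student(tokens), dim=-1),
                                F.softmax(t_logits, dim=-1),
                                reduction="batchmean")
                student_opt.zero_grad()
                loss.backward()
                student_opt.step()
        # REINFORCE: reward = the student's F1 gain after this round.
        reward = evaluate(student) - baseline
        policy_loss = -reward * torch.stack(log_probs).sum()
        selector_opt.zero_grad()
        policy_loss.backward()
        selector_opt.step()

Iterating such a round several times matches the adaptive behavior described above: unlabeled batches that help the student in the current iteration become more likely to be selected in later ones.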
2. Cross-lingual spoken language understanding based on label semantics and contrastive learning. Although pre-trained multilingual language models show good performance on zero-shot cross-lingual tasks, the representation alignment between languages is not perfect, resulting in sub-optimal transfer performance in some target languages. Considering that fine-tuning the model on translation data is affected by the errors and availability of machine translation, existing research proposes a method that relies only on bilingual dictionaries to perform multilingual code-switching on the source training data, thereby constructing mixed-language data for the downstream tasks to align cross-lingual representations. This method is limited in that it aligns representations only implicitly, via data augmentation, and ignores semantic information. To address this problem, from the model perspective, we propose a label-aware self-supervised multi-level contrastive learning method for cross-lingual spoken language understanding in this thesis. First, we leverage the slot type set as the anchor for cross-lingual knowledge transfer in a label-aware joint model. Second, to exploit the inherent semantic structure of spoken language understanding, i.e., utterance-slot-word, we propose a multi-level contrastive learning framework. At these three levels, the contrastive learning paradigms are formed by "source utterance / code-switched utterance", "source slot value / code-switched slot value", and "slot label / slot word" pairs. The experimental results show that, by using label semantic information to construct contrastive learning at different levels, the proposed method optimizes the cross-lingual representations from a semantic perspective in zero-shot cross-lingual spoken language understanding and significantly improves performance compared with the multilingual dynamic code-switching method.
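As an illustration of one level of this objective, here is a minimal InfoNCE-style sketch, assuming paired batches of source and code-switched representations; the multi-level framework would apply an analogous term at the utterance, slot, and word levels. The names and dimensions are assumptions, not the thesis code.

    # Illustrative sketch: an InfoNCE-style contrastive loss that pulls each
    # source representation toward its code-switched counterpart and pushes
    # it away from the other examples in the batch.
    import torch
    import torch.nn.functional as F

    def info_nce(src, cs, temperature=0.1):
        """src, cs: (batch, dim) representations of paired views."""
        src = F.normalize(src, dim=-1)
        cs = F.normalize(cs, dim=-1)
        logits = src @ cs.t() / temperature      # (batch, batch) similarities
        targets = torch.arange(src.size(0))      # positives on the diagonal
        return F.cross_entropy(logits, targets)

    # Example: the utterance-level term of the multi-level objective.
    src_repr = torch.randn(8, 768)               # e.g., encodings of source utterances
    cs_repr = torch.randn(8, 768)                # encodings of code-switched versions
    loss = info_nce(src_repr, cs_repr)

Because the positives sit on the diagonal of the similarity matrix, each source representation is contrasted against every other example in the batch, which is what aligns the two views in representation space.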
3. Cross-lingual sequence labeling based on pre-training and calibration networks. In cross-lingual classification tasks, the gap between zero-shot transfer results in the target languages and supervised results in the source or target languages reported in existing work is acceptable. In cross-lingual sequence labeling tasks, however, the gap is substantial, making it difficult for models to meet practical requirements. Therefore, in this thesis we analyze the outputs of zero-shot cross-lingual sequence labeling models from the task perspective and conclude that a major performance obstacle is boundary errors in the model predictions: although multilingual sequence labeling models can effectively locate the local context of targets, they often fail to give the precise boundaries of the target spans in the target languages. To address this bottleneck, we propose a two-stage cross-lingual sequence labeling framework. In the first stage, a base module adopts a sequence labeling model to generate an initial answer. In the second stage, a calibration module refines the boundaries of the initial answer, in a machine-reading-comprehension manner, based on the input and output of the base module. To tackle the lack of training data in low-resource languages, we develop an unsupervised and weakly supervised phrase-boundary-recovery pre-training task that enhances the multilingual boundary detection capability of the calibration module. The experimental results show that the proposed method obtains significant improvements over multiple baselines on multiple cross-lingual sequence labeling tasks, even in target languages not covered by the pre-training task.
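The second-stage refinement can be sketched as follows, under assumed interfaces: start_scores and end_scores stand in for the calibration module's per-token boundary logits, computed from the sentence with the base module's initial span marked, as in machine reading comprehension, and the function then re-decodes the best boundary pair near the initial one. This is an illustrative assumption, not the thesis implementation.

    # Illustrative sketch: boundary calibration of an initial span using
    # per-token start/end scores from a hypothetical calibration module.
    from typing import List, Tuple

    def calibrate(tokens: List[str],
                  initial_span: Tuple[int, int, str],
                  start_scores: List[float],
                  end_scores: List[float],
                  window: int = 3) -> Tuple[int, int, str]:
        """Refine (start, end, label) by searching near the initial boundaries."""
        s0, e0, label = initial_span
        starts = range(max(0, s0 - window), min(len(tokens), s0 + window + 1))
        ends = range(max(0, e0 - window), min(len(tokens), e0 + window + 1))
        # Pick the in-window start/end pair with the highest combined score.
        best = max(((s, e) for s in starts for e in ends if s <= e),
                   key=lambda se: start_scores[se[0]] + end_scores[se[1]])
        return best[0], best[1], label

    # Usage: the base module tagged only "new" as an entity; the calibration
    # scores extend the boundary to cover the full span "new york times".
    tokens = "the new york times reported".split()
    refined = calibrate(tokens, (1, 1, "ORG"),
                        start_scores=[0.1, 0.9, 0.2, 0.1, 0.0],
                        end_scores=[0.0, 0.1, 0.3, 0.8, 0.1])
    # refined == (1, 3, "ORG")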
Keywords/Search Tags: Cross-lingual, Sequence labeling, Knowledge distillation, Contrastive learning, Pre-training