The emergence of pre-trained models has brought milestone changes to Natural Language Processing (NLP). Pre-trained models not only unify the input format and modeling approach of NLP tasks, but also combine easy-to-understand concepts with strong empirical results, which has made them popular in NLP very quickly. Question Answering (QA) is an important NLP task. This dissertation studies extractive question answering, also called span-extraction machine reading comprehension (MRC): given a <question, passage> pair, it is assumed that if the question is answerable, the answer is a contiguous text span in the passage. For extractive question answering based on pre-trained models, this dissertation studies the following three key technologies.

(1) The first key question is how to better apply pre-trained models to improve the Exact Match (EM) metric of the multilingual MRC task (EM measures answer boundary detection performance). The pre-training task of a pre-trained model (upstream) is a sub-word-level masked language model, whereas the extractive MRC task (downstream) predicts a word or a phrase, so there is a clear disparity between the upstream and downstream tasks. Meanwhile, multilingual question answering faces two challenges: insufficient non-English data, and a lack of non-English-related knowledge. Motivated by these observations, this dissertation proposes two auxiliary tasks to enhance boundary detection for multilingual MRC, especially for low-resource languages. First, we propose a knowledge phrase masking task together with a language-agnostic method for mining per-language knowledge phrases from the web; the method is lightweight and easy to scale to any language. Through this phrase-level masked language model, knowledge is injected and the disparity between the upstream and downstream tasks is compensated. Second, to address the shortage of non-English reading comprehension corpora, a data augmentation method is proposed that improves data quality and better aligns the language representations. Finally, experiments are carried out on two multilingual MRC datasets. The results show that the proposed improvements bring considerable gains for all languages, including English, and a fine-grained answer-type analysis confirms that both auxiliary tasks enhance answer boundary detection for extractive multilingual MRC. The proposed method therefore improves the effectiveness of pre-trained models for the multilingual MRC task.
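The knowledge phrase masking task can be illustrated with a minimal sketch. The function name, the toy phrase list, and the WordPiece-style token format below are illustrative assumptions, not the dissertation's implementation; the point is only that all sub-words of a mined knowledge phrase are masked together, so the masked language model must predict a multi-token span, which is closer to the span prediction required by extractive MRC.

```python
import random

MASK = "[MASK]"

def knowledge_phrase_mask(tokens, knowledge_phrases, mask_prob=0.15):
    """Mask whole knowledge phrases as units instead of independent sub-words.

    tokens:             sub-word tokens of one training sequence
    knowledge_phrases:  per-language phrases mined from the web, given as
                        sub-word token lists (hypothetical format)
    Returns (masked_tokens, labels); labels is None where no prediction is made.
    """
    masked = list(tokens)
    labels = [None] * len(tokens)
    i = 0
    while i < len(tokens):
        phrase = next((p for p in knowledge_phrases
                       if tokens[i:i + len(p)] == p), None)
        if phrase is not None and random.random() < mask_prob:
            for j in range(i, i + len(phrase)):   # mask the whole phrase at once
                labels[j] = tokens[j]
                masked[j] = MASK
            i += len(phrase)
        else:
            i += 1
    return masked, labels

# Example: the entire phrase "great wall of china" is masked as one unit.
masked, labels = knowledge_phrase_mask(
    ["the", "great", "wall", "of", "china", "is", "long"],
    [["great", "wall", "of", "china"]], mask_prob=1.0)
```

During pre-training, the masked sequence is fed to the encoder and the masked-language-model loss is computed only at the phrase positions, which is how a phrase-level objective narrows the gap between sub-word masking and span extraction.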
(2) The evaluation metrics of extractive MRC are EM and F1, but the training objective during fine-tuning is the cross-entropy loss, which directly leads to a training-test disparity. This disparity is common in deep learning: the optimization objective during training is cross-entropy, while the actual evaluation metric is F1, Accuracy, or BLEU. To compensate for it, existing work has proposed replacing cross-entropy with new training objectives such as Dice loss and sentence-level BLEU. However, these new objectives bring a new challenge of their own: a systematic bias that drives training into local optima. This dissertation defines this systematic bias as Simpson's bias, by analogy with Simpson's paradox in statistics. To analyze Simpson's bias, this dissertation proposes a theoretical classification of machine learning evaluation metrics, revealing how the bias affects metrics ranging from ones as simple as Accuracy to ones as sophisticated as F1 and BLEU. It is also proved theoretically that, despite the existence of Simpson's bias, for some metrics such as precision and recall a special smoothing term can eliminate it. Finally, experiments show that Simpson's bias arises in the training of various NLP tasks such as classification, question answering, and translation, and that its impact is non-negligible: it prolongs training and increases the risk of converging to a local optimum. Theoretical analysis and experiments show that, under the same settings, a model trained with the F1-like Dice loss is worse than one trained with cross-entropy, so the new objective fails to close the gap between training and testing. This dissertation thus provides a theoretical analysis framework, gives solutions for some of the inconsistencies, and demonstrates that Simpson's bias exists in the training of natural-language tasks and cannot be ignored.
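The training-test mismatch discussed above can be made concrete with a small PyTorch-style sketch of a smoothed, F1-like Dice loss for binary decisions. The smoothing constant gamma and the per-example formulation are illustrative assumptions; the comment marks where the Simpson-like aggregation effect enters, namely that averaging per-example scores over a batch is not the same as computing the metric over the whole corpus.

```python
import torch

def soft_dice_loss(probs, targets, gamma=1.0):
    """Per-example soft Dice (F1-like) loss with a smoothing term gamma.

    probs:   (batch,) predicted probabilities for the positive class
    targets: (batch,) gold labels in {0, 1}
    """
    tp = probs * targets              # "soft" true positives
    fp = probs * (1 - targets)        # "soft" false positives
    fn = (1 - probs) * targets        # "soft" false negatives
    dice = (2 * tp + gamma) / (2 * tp + fp + fn + gamma)
    # Averaging per-example Dice scores is NOT the same as the corpus-level
    # F1 computed over all examples at once; this aggregation mismatch is
    # the Simpson-like effect discussed in the text.
    return 1 - dice.mean()

loss = soft_dice_loss(torch.tensor([0.9, 0.2, 0.7]),
                      torch.tensor([1.0, 0.0, 1.0]))
```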
(3) After compensating for the upstream-downstream disparity and resolving the training-test disparity during fine-tuning, we can obtain multiple well-performing pre-trained models, but these models cannot be applied directly in production because of their very large parameter counts. Take BERT as an example: BERT-base contains roughly 110 million parameters, and a single GPU with 12 GB of memory can process at most 6 input sequences of length 512 at a time; BERT-large contains roughly 340 million parameters, and the same GPU cannot process an input sequence of length 512 at all, handling at most one sequence of length 320. This greatly limits the use of pre-trained models in practice, so balancing model size against model performance becomes a key issue in model deployment. Knowledge distillation, a method for compressing a large model into a small one, is widely used for this purpose. It is based on a teacher-student framework: larger models such as BERT serve as teachers, the small model used in practice serves as the student, and knowledge extracted from the teachers is transferred to the student to improve its quality. Most existing methods, however, rely on a single teacher, and even the work that uses multiple teachers only performs a simple weighted fusion of their knowledge. This dissertation therefore focuses on how to coordinate multiple teachers and the student in knowledge distillation when several strong teachers are available, so that pre-trained models can be deployed in practice. It first reviews the basic knowledge distillation framework and the common ways of handling multiple teacher models, and then proposes a new method that casts the guidance from multiple teachers as a reinforcement learning problem; how different training samples are assigned under this distillation framework is described in detail. Experiments are conducted on several important NLP tasks, and the results show that the multi-teacher selection strategy under the reinforcement learning framework effectively improves the quality of the student model, makes pre-trained models usable in practice, and achieves the goal of balancing model performance and model size across multiple pre-trained models.
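The idea of casting multi-teacher guidance as a reinforcement learning problem can be sketched in a few lines. The tiny MLP stand-ins, the REINFORCE-style teacher-selection policy, and the use of the negative distillation loss as a reward are illustrative assumptions for this sketch, not the dissertation's exact algorithm; in practice the teachers would be fine-tuned pre-trained models, the student a smaller network, and the reward would typically come from held-out performance.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Tiny stand-ins: in practice the teachers are fine-tuned pre-trained models
# and the student is a smaller transformer. All names here are hypothetical.
def mlp(d_in, d_out):
    return nn.Sequential(nn.Linear(d_in, 64), nn.ReLU(), nn.Linear(64, d_out))

teachers = [mlp(32, 3) for _ in range(3)]        # several strong teachers
student = mlp(32, 3)                             # the deployable student model
policy_logits = torch.zeros(len(teachers), requires_grad=True)  # selection policy
opt_student = torch.optim.Adam(student.parameters(), lr=1e-3)
opt_policy = torch.optim.Adam([policy_logits], lr=1e-2)

def distill_step(x, temperature=2.0):
    # Sample which teacher guides the student on this batch.
    dist = torch.distributions.Categorical(logits=policy_logits)
    k = dist.sample()
    with torch.no_grad():
        soft_targets = F.softmax(teachers[k](x) / temperature, dim=-1)
    # Distillation loss: KL between teacher and student soft distributions.
    log_p_student = F.log_softmax(student(x) / temperature, dim=-1)
    loss = F.kl_div(log_p_student, soft_targets, reduction="batchmean")
    opt_student.zero_grad()
    loss.backward()
    opt_student.step()
    # REINFORCE update: reward the chosen teacher when the loss is low
    # (a crude proxy for improvement on a held-out set).
    reward = -loss.detach()
    policy_loss = -dist.log_prob(k) * reward
    opt_policy.zero_grad()
    policy_loss.backward()
    opt_policy.step()
    return loss.item()

for _ in range(10):                              # toy training loop on random data
    distill_step(torch.randn(8, 32))
```

The design point the sketch illustrates is that a learned policy, rather than a fixed weighted average, decides which teacher guides the student on each batch, and the selection is refined over training by the reward signal.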