For a long time, Natural Language Processing (NLP) has been both a hotspot and a difficulty in the field of artificial intelligence. Its research directions include text classification, machine translation, intelligent question answering, and so on. It deals with complex and changeable unstructured data, namely natural language, and aims to understand the meaning of sentences through carefully designed mathematical models so as to enable interaction between humans and machines. Early researchers used grammatical rules to make computers parse sentences and understand their meanings. Later, statistical methods were used to describe the probability of sentence generation, transforming one sequence into another (seq2seq). These two approaches established the important position of the language model in natural language processing. Recently, the rise of machine learning has provided a new direction for researchers: how to use models to represent words or sequences as real-valued vectors with highly abstract features has become a new research hotspot in this field.

Text representation can be roughly divided into word representation and sequence representation. In the past few years, word embedding, represented by Word2Vec and GloVe, has been a core representation technique in natural language processing. These methods represent a single word as a dense real-valued vector that encodes word meaning. How to use word vectors to represent variable-length sequences is also a research focus of text representation models. Commonly used models once included the simple bag-of-words model and the N-gram model, which considers partial co-occurrence relations. Later, with the rise of deep learning, better sequence representation models were proposed, such as Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs). These models have been widely used in natural language processing and have achieved many results. However, their dominance is being shaken by a series of larger and more systematic text representation models such as ULMFiT, ELMo, OpenAI GPT, and BERT. These models adopt a pre-training and fine-tuning architecture: a deep network first captures complex semantic information through unsupervised learning (pre-training) and is then deployed on downstream tasks through transfer learning (fine-tuning). Benefiting from effective unsupervised pre-training, the model can better understand the meaning of a sentence, converges faster and performs better during fine-tuning, and generalizes more strongly. These four models were proposed one after another within a single year, each representing the state of the art at the time. BERT, as the latest and most powerful model in natural language processing, deserves careful study.

In this paper, the BERT model is studied in depth and implemented. This paper argues that although BERT is very powerful, its huge number of parameters and slow pre-training seriously limit its application scenarios. After carefully studying the principle and function of the Multihead Self-Attention mechanism and position embedding in the model, and verifying through experiments how pre-training helps the model, four improvements to BERT are proposed in order to obtain better results with fewer parameters and faster pre-training. The work of this paper is summarized below; minimal sketches of the proposed mechanisms follow the list.

1. Relative Position Embedding. This paper holds that the relative position information of words is far more important than their absolute position information for understanding natural language, and that a model needs many additional parameters and much time to learn absolute position information.
The BERT model considers both absolute and relative positions; that is, it theoretically needs to capture the effect of every word at every position on every other word at every other position, a total of N²L² distribution patterns (N is the total number of words and L is the length of the sequence). If only relative positions are considered, the model only needs to learn about 2N²L distribution patterns, which reduces the parameters and training time of the model to a certain extent and improves its generalization.

2. Independent Bidirectional Multihead Self-Attention mechanism. The Multihead Self-Attention layer is the core structure of the BERT model and plays a decisive role in extracting sentence meaning accurately. This paper holds that word order should be taken into account when computing the interaction between words; that is, when word A appears before versus after word B, its contribution to word B should be different. In the BERT model, however, this effect is not realized at the word level but only indirectly through absolute position embedding. This paper argues that this design forces the model to learn a redundant relationship between absolute positions and words, resulting in a large number of parameters and a long training time. To capture this key information directly, this paper proposes an Independent Bidirectional Multihead Self-Attention mechanism, in which two groups of attention heads ("multiheads") process the preceding and following words independently, so that word A is represented as a different real-valued vector when it appears before versus after word B.

3. Hierarchical Dense Connection Network. Dense connection refers to establishing dense additional data paths between different network layers. This lets gradients propagate better to deep layers, improves the utilization of the features extracted by each layer, and makes the total loss function "smoother", so that the network converges faster and better during optimization. In the BERT model there are only residual connections within each attention layer, with no additional data paths between layers. This paper argues that BERT's large depth and difficult gradient propagation lead to long training times; moreover, the features extracted by each attention layer are used only by the next layer, which is inefficient. Therefore, this paper establishes a hierarchical dense connection network between attention layers in order to reduce training time and improve the utilization of the features extracted by each attention layer, thereby improving the performance of the model.

4. A new pre-training task, Disorder Judgment. BERT is not sufficiently interpretable in how it captures word-order information, and its position embedding mechanism is not convincing. This paper aims to make the model more sensitive to word order through pre-training, so it proposes a disorder-judgment pre-training task: the word order within part of a sentence is randomly disrupted or kept unchanged, and the model is then asked to judge whether the word order of the sentence is reasonable.
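To illustrate contributions 1 and 2, the following is a minimal sketch, assuming PyTorch, of an attention layer that combines a learned relative-position bias with two independent groups of heads for the preceding and the following context. The class name, dimensions, masking, and the exact parameterization of the bias are illustrative assumptions, not the thesis's exact formulation.

```python
import torch
import torch.nn as nn

class BidirectionalRelativeSelfAttention(nn.Module):
    """Toy sketch: two independent multi-head attentions, one restricted to
    preceding tokens and one to following tokens, with a learned relative-
    position bias added to the attention scores (dimensions are illustrative)."""
    def __init__(self, d_model=256, n_heads=4, max_len=512):
        super().__init__()
        assert d_model % (2 * n_heads) == 0
        self.n_heads, self.d_head = n_heads, d_model // (2 * n_heads)
        # separate projections for the "forward" (left-context) and
        # "backward" (right-context) attention branches
        self.qkv_fwd = nn.Linear(d_model, 3 * n_heads * self.d_head)
        self.qkv_bwd = nn.Linear(d_model, 3 * n_heads * self.d_head)
        self.out = nn.Linear(2 * n_heads * self.d_head, d_model)
        # one learned bias per relative offset in [-(max_len-1), max_len-1]
        self.rel_bias = nn.Parameter(torch.zeros(2 * max_len - 1, n_heads))

    def _attend(self, qkv, mask, rel_idx):
        B, L, _ = qkv.shape
        q, k, v = qkv.view(B, L, 3, self.n_heads, self.d_head).unbind(dim=2)
        scores = torch.einsum('blhd,bmhd->bhlm', q, k) / self.d_head ** 0.5
        scores = scores + self.rel_bias[rel_idx].permute(2, 0, 1)  # (h, L, L)
        scores = scores.masked_fill(mask, float('-inf'))
        probs = torch.softmax(scores, dim=-1)
        # fully masked rows (first/last token) yield NaN; zero them out
        probs = torch.nan_to_num(probs)
        ctx = torch.einsum('bhlm,bmhd->blhd', probs, v)
        return ctx.reshape(B, L, -1)

    def forward(self, x):
        B, L, _ = x.shape
        pos = torch.arange(L, device=x.device)
        rel_idx = pos[None, :] - pos[:, None] + self.rel_bias.shape[0] // 2
        before = pos[None, :] >= pos[:, None]  # mask self and following tokens
        after = pos[None, :] <= pos[:, None]   # mask self and preceding tokens
        ctx_fwd = self._attend(self.qkv_fwd(x), before, rel_idx)
        ctx_bwd = self._attend(self.qkv_bwd(x), after, rel_idx)
        return self.out(torch.cat([ctx_fwd, ctx_bwd], dim=-1))
```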
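For contribution 3, the following sketch shows one possible way to densely connect a stack of attention layers, using PyTorch's standard TransformerEncoderLayer as the per-layer block. The hierarchical grouping described above is omitted for brevity, and fusing earlier outputs by concatenation plus a linear projection is an assumption rather than the thesis's exact design.

```python
import torch
import torch.nn as nn

class DenselyConnectedEncoder(nn.Module):
    """Toy sketch of dense connections across attention layers: every layer
    receives a projection of the concatenated outputs of all earlier layers,
    giving gradients and features short paths to every depth."""
    def __init__(self, layer_factory, n_layers=6, d_model=256):
        super().__init__()
        self.layers = nn.ModuleList([layer_factory() for _ in range(n_layers)])
        # one projection per layer, mapping the growing concatenation back to d_model
        self.fuse = nn.ModuleList(
            [nn.Linear((i + 1) * d_model, d_model) for i in range(n_layers)])

    def forward(self, x):
        feats = [x]
        for layer, fuse in zip(self.layers, self.fuse):
            h = fuse(torch.cat(feats, dim=-1))  # dense path from all earlier layers
            feats.append(layer(h))
        return feats[-1]

# usage: a standard Transformer encoder layer as the per-layer block
enc = DenselyConnectedEncoder(
    lambda: nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    n_layers=6, d_model=256)
out = enc(torch.randn(2, 10, 256))  # (batch, seq_len, d_model)
```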
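For contribution 4, the following sketch shows how a disorder-judgment training pair might be constructed; the function name, shuffle probability, and span size are illustrative assumptions, not the thesis's exact recipe.

```python
import random

def make_disorder_example(tokens, shuffle_prob=0.5, span_ratio=0.3):
    """Build one training pair for the disorder-judgment task: with probability
    `shuffle_prob`, shuffle a random contiguous span of the token sequence; the
    label records whether the word order is still the original one."""
    tokens = list(tokens)
    label = 0                               # 0 = original order
    if random.random() < shuffle_prob and len(tokens) > 2:
        span = max(2, int(len(tokens) * span_ratio))
        start = random.randrange(0, len(tokens) - span + 1)
        piece = tokens[start:start + span]
        random.shuffle(piece)
        tokens[start:start + span] = piece
        label = 1                           # 1 = order disrupted
    return tokens, label

print(make_disorder_example("the model judges whether word order is reasonable".split()))
```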
The experimental results show that the text representation model proposed in this paper outperforms a BERT model with the same number of parameters on the pre-training tasks: Relative Position Embedding improves the accuracy of the Next-Sentence task by 0.96%; the Independent Bidirectional Multihead Self-Attention mechanism improves the accuracy of the MASK-LM task by 0.57%; and the Hierarchical Dense Connection network improves the accuracy of the three pre-training tasks by an average of 0.21%. In addition, the model performs excellently on 8 authoritative Chinese datasets covering 4 different downstream tasks, surpassing most widely used models and outperforming a BERT model with the same number of parameters. Specifically, accuracy on the text classification task increases by 0.47% on average, on the semantic similarity task by 0.80%, on the reading comprehension task by 1.57%, and on the word segmentation task by 0.93%. This indicates that the improvements proposed in this paper can effectively enhance the performance of the BERT model and, at the same level of performance, can effectively reduce the number of parameters and speed up pre-training.