In recent years, the Transformer has been widely used in a variety of natural language processing tasks, because its core component, the self-attention mechanism, can capture dependencies between arbitrary word pairs well. However, this mechanism's quadratic complexity with respect to the sequence length limits its application to long-sequence tasks. To obtain a more efficient attention mechanism, this paper proposes two efficient models based on a neural clustering algorithm. We first design a clustering algorithm based on a neural network, and then propose two kinds of sparse attention built on this algorithm, namely the Block Attention Model and the Approximate Attention Model. The Block Attention Model reduces the complexity of the Transformer from O(N²d) to O(N·N^(1/2)·d). The Approximate Attention Model reduces the Transformer to linear complexity O(Nkd), where N is the sequence length, d is the dimensionality of the word embeddings in the attention, and k is the number of clusters. Both models reduce the complexity of the standard Transformer and greatly improve its time and memory efficiency. We validated the Block Attention Model on machine translation, text classification, natural language inference, text matching, and pre-training tasks. In terms of effectiveness, compared with the baseline models (Transformer, Reformer, and Routing Transformer), it achieves comparable or even better results on each task, and it has clear efficiency advantages (in time and memory) on long-sequence tasks. In addition, we also validated the Approximate Attention Model on machine translation, text classification, natural language inference, and text matching tasks. Experiments confirm that the Approximate Attention Model has great advantages over the baseline models in both effectiveness and efficiency. In particular, on the IMDB text classification dataset, the Approximate Attention Model saves at least 33.7% of memory and reduces training time by at least 32.4%.
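The complexity reduction claimed for the Block Attention Model can be illustrated with a small sketch. Assuming, purely for illustration (this is not the paper's neural clustering network), that tokens have already been assigned to roughly N^(1/2) equal-size clusters, restricting each token to attend only within its own cluster yields about N^(1/2) blocks, each costing O(N·d), for O(N·N^(1/2)·d) overall. All names below (block_attention, labels, etc.) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def block_attention(Q, K, V, labels):
    """Attention restricted to clusters: each token attends only to tokens
    with the same cluster label. With ~sqrt(N) clusters of size ~sqrt(N),
    each block costs O(N*d), giving O(N*sqrt(N)*d) in total instead of O(N^2*d)."""
    N, d = Q.shape
    out = np.zeros_like(V)
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        scores = Q[idx] @ K[idx].T / np.sqrt(d)      # (m, m) block of scores
        out[idx] = softmax(scores, axis=-1) @ V[idx]  # block-local weighted sum
    return out

# Toy usage: N tokens, k ~ sqrt(N) clusters; random labels stand in for
# the learned neural clustering described in the paper.
N, d = 64, 16
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
k = int(np.sqrt(N))
labels = rng.integers(0, k, size=N)
print(block_attention(Q, K, V, labels).shape)  # (64, 16)
```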