
Code Completion Research Based On Open Vocabulary And Self-attention Mechanism

Posted on: 2022-04-07  Degree: Master  Type: Thesis
Country: China  Candidate: B H Wang  Full Text: PDF
GTID: 2518306479993379  Subject: Software engineering
Abstract/Summary:
With the rapid development and popularization of the Internet, more and more people are learning to program. Code completion, an important service for improving coding efficiency, has become increasingly popular, and as technology evolves, users expect it to become more intelligent. Code completion is one branch of source code modeling tasks; implementing it with deep learning has opened up the possibility of modeling source code with statistical language models. As a universal feature extractor in the Natural Language Processing (NLP) field, the Recurrent Neural Network (RNN) has been widely used in source code modeling and has achieved promising results in code completion, with some RNN-based deep learning models outperforming traditional N-gram models.

However, current RNN-based code completion models still have room for improvement, with three main shortcomings: (1) The out-of-vocabulary (OOV) problem. Programs contain many user-defined tokens that do not appear in the vocabulary, and it is difficult for ordinary deep learning language models to predict such tokens. (2) Weak long-range context dependency. Most language models cannot refer to distant tokens when making predictions, which further reduces prediction accuracy. (3) Slow training. Language models for source code usually need a large corpus to perform well, and slow training leads to long training times and heavy consumption of computational resources.

This paper proposes a novel code completion framework, SABCCOV, to remedy these shortcomings. The framework adopts the idea of an open vocabulary and splits code tokens into sub-tokens. On this basis, we propose a sub-token-level language model based on the self-attention mechanism: both the model's vocabulary and its predictions consist of sub-tokens, and the predicted sub-tokens are finally recombined into tokens, which partially solves the OOV problem. As a new feature extractor, the self-attention mechanism in the language model improves long-range context dependency and speeds up training, effectively addressing the latter two problems. In addition, following an advanced completion strategy, this paper provides three different code completion modes to meet users' differing needs for privacy and performance.

Experiments show that the framework achieves good code completion performance while requiring less training time. More importantly, the sub-token language model proposed in this study opens up a wide range of possibilities for other branch tasks in the area of source code modeling.
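To make the two core ideas concrete, the minimal sketch below illustrates (a) splitting an out-of-vocabulary identifier into common sub-tokens, and (b) single-head scaled dot-product self-attention, the operation that lets every position attend to every other and thus supplies long-range context. This is not the thesis's actual implementation: the splitting rule (camelCase/snake_case) is an assumption, and the learned query/key/value projections and multiple heads of a real Transformer-style model are omitted for brevity.

```python
# Illustrative sketch only: sub-token splitting scheme and names are assumptions,
# not SABCCOV's actual code.
import re
import numpy as np

def split_token(token: str) -> list[str]:
    """Split an identifier into sub-tokens on underscores and camelCase boundaries,
    so an OOV token can be composed from in-vocabulary pieces."""
    sub_tokens = []
    for part in token.split("_"):
        sub_tokens += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part)
    return [s.lower() for s in sub_tokens if s]

def self_attention(x: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product self-attention over a (seq_len, dim) sequence.
    Learned Q/K/V projections are omitted; x serves as query, key, and value."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                    # pairwise similarity, (seq, seq)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ x                               # context-aware representations

# An out-of-vocabulary identifier becomes a sequence of common sub-tokens:
print(split_token("parseHttpResponse"))  # ['parse', 'http', 'response']
```

Because the attention weights connect every position to every other in a single step, and all positions are computed in parallel, a self-attention model avoids both the vanishing long-range signal and the strictly sequential computation of an RNN, which is the basis for the training-speed claim in the abstract.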
Keywords/Search Tags: code completion, self-attention, source code modeling, open vocabulary