
Code Completion Research Based On Open Vocabulary And Self-attention Mechanism

Posted on: 2022-04-07  Degree: Master  Type: Thesis
Country: China  Candidate: B H Wang  Full Text: PDF
GTID: 2518306479993379  Subject: Software engineering
Abstract/Summary:
With the rapid development and popularization of the Internet, more and more people are learning to program. Code completion, an important service for improving coding efficiency, has become increasingly popular, and as technology evolves, users expect it to become more intelligent. Code completion is one branch of source code modeling tasks; implementing it with deep learning has opened up the possibility of modeling source code with statistical language models. As a universal feature extractor in the Natural Language Processing (NLP) field, the Recurrent Neural Network (RNN) has been widely used in source code modeling and has achieved promising results in code completion, with some RNN-based deep learning models outperforming traditional N-gram models.

However, current RNN-based code completion models still have room for improvement, with three main shortcomings: (1) The out-of-vocabulary (OOV) problem. Programs contain many user-defined tokens that do not appear in the vocabulary, and it is difficult for ordinary deep learning language models to predict such tokens. (2) Weak long-range context dependency. Most language models cannot refer to distant tokens when making predictions, which further reduces prediction accuracy. (3) Slow training. Language models for source code usually need a large corpus to perform well, and slow training leads to long training times and heavy consumption of computational resources.

This paper proposes a novel code completion framework, SABCCOV, to remedy these shortcomings. The framework adopts the idea of an open vocabulary and splits code tokens into sub-tokens. On this basis, we propose a sub-token-level language model based on the self-attention mechanism: both the model's vocabulary and its predictions consist of sub-tokens, and the predicted sub-tokens are finally recombined into tokens, which partially solves the OOV problem. As a new feature extractor, the self-attention mechanism in the language model improves long-range context dependency and speeds up training, effectively addressing the latter two problems. In addition, following an advanced completion strategy, this paper provides three different code completion modes to meet users' differing needs for privacy and performance.

Experiments show that the framework achieves good code completion performance while requiring less training time. More importantly, the sub-token language model proposed in this study opens up a wide range of possibilities for other branch tasks in the area of source code modeling.
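To make the two core ideas concrete, the minimal sketch below illustrates (a) splitting an out-of-vocabulary identifier into common sub-tokens, and (b) single-head scaled dot-product self-attention, the operation that lets every position attend to every other and thus supplies long-range context. This is not the thesis's actual implementation: the splitting rule (camelCase/snake_case) is an assumption, and the learned query/key/value projections and multiple heads of a real Transformer-style model are omitted for brevity.

```python
# Illustrative sketch only: sub-token splitting scheme and names are assumptions,
# not SABCCOV's actual code.
import re
import numpy as np

def split_token(token: str) -> list[str]:
    """Split an identifier into sub-tokens on underscores and camelCase boundaries,
    so an OOV token can be composed from in-vocabulary pieces."""
    sub_tokens = []
    for part in token.split("_"):
        sub_tokens += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part)
    return [s.lower() for s in sub_tokens if s]

def self_attention(x: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product self-attention over a (seq_len, dim) sequence.
    Learned Q/K/V projections are omitted; x serves as query, key, and value."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                    # pairwise similarity, (seq, seq)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ x                               # context-aware representations

# An out-of-vocabulary identifier becomes a sequence of common sub-tokens:
print(split_token("parseHttpResponse"))  # ['parse', 'http', 'response']
```

Because the attention weights connect every position to every other in a single step, and all positions are computed in parallel, a self-attention model avoids both the vanishing long-range signal and the strictly sequential computation of an RNN, which is the basis for the training-speed claim in the abstract.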
Keywords/Search Tags: code completion, self-attention, source code modeling, open vocabulary