Deep Code Search Based On Evolving Information

Posted on: 2022-10-24
Degree: Master
Type: Thesis
Country: China
Candidate: F Q Cai
GTID: 2518306497952129
Subject: Computer Science and Technology
Abstract/Summary:
Code search has become an essential part of modern software development and has received increasing attention from researchers in recent years. It sits at the intersection of software engineering and information retrieval and aims to use information retrieval and related techniques to improve search performance. When a developer searches for code, they enter a query into a code search model, which finds the source code related to the query in the codebase and returns it ranked by relevance. Although existing code search methods achieve good results when searching for relevant code, their performance drops dramatically when searching for source code that is compatible with the local programming language. This is because code evolves over time, and the code tokens deleted and added during evolution have a significant impact on code search performance.

To address this issue, this thesis focuses on the evolving information generated during code evolution. Specifically, it extracts evolved code tokens and evolution descriptions from the code evolution process and uses them as an additional feature of the source code and of the code descriptions, respectively. Based on this richer code representation, two DCSE (Deep Code Search based on Evolving Information) models are proposed: DCSELSTM and DCSESBERT.

The DCSELSTM model uses LSTM (Long Short-Term Memory) to embed source code and its corresponding code description in the same high-dimensional vector space, and trains the network to minimize the cosine distance between them. When a user inputs a query, DCSELSTM first embeds the query into the vector space to obtain the query vector, treats this vector as a code description vector, calculates its cosine similarity with each source code vector in the codebase, and returns the search results in descending order of similarity. Experiments verified the validity of the method: it is 4-11% higher than CODEnn in terms of Precision@k and 56.9-60.9% higher than CODEnn in terms of RFVersion.

Although DCSELSTM achieves better results when searching for compatible source code, it still has the following drawbacks: (1) the LSTM network does not completely solve the vanishing gradient problem; (2) the computation is time-consuming; (3) the LSTM network cannot handle phrases with a length over 100. To overcome these drawbacks, this thesis proposes the DCSESBERT model, built on DCSELSTM. DCSESBERT replaces LSTM with BERT (Bidirectional Encoder Representations from Transformers), which solves the vanishing gradient problem and the loss of information in long sequences. In addition, because BERT supports parallel computation, the training of DCSESBERT can be greatly accelerated. Finally, to further reduce the number of network parameters, DCSESBERT uses a siamese (twin) network in which the source code sub-network and the code description sub-network share parameters.

In summary, this thesis proposes deep code search models based on evolving information, explores the research topics and problems in this field, proposes solutions to those problems, and improves the performance of code search.
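The query-time ranking step described in the abstract (embed the query, compare it with every source code vector by cosine similarity, return results in descending order) can be sketched as follows. This is only an illustrative sketch, not the thesis's implementation: the function names `cosine_similarity` and `rank_snippets`, the snippet names, and the three-dimensional toy vectors are all hypothetical, and the vectors stand in for embeddings that a trained model would produce.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_snippets(
    query_vec: Sequence[float],
    code_vecs: Dict[str, Sequence[float]],
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Score every code vector against the query and keep the top_k most similar."""
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in code_vecs.items()
    ]
    # Highest similarity first, matching "descending order" in the abstract.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy embeddings; a real system would obtain these from the
# trained code encoder and query encoder.
code_vecs = {
    "read_file": [0.9, 0.1, 0.0],
    "sort_list": [0.0, 1.0, 0.2],
    "open_socket": [0.5, 0.5, 0.5],
}
query_vec = [1.0, 0.0, 0.0]

results = rank_snippets(query_vec, code_vecs, top_k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

In practice the code vectors would be precomputed once for the whole codebase, so that only the query has to be embedded at search time.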
Keywords/Search Tags: code search, code evolution, LSTM, BERT, siamese network