
Research On Source Code Searching By Multi-modal Representation Learning

Posted on: 2022-06-19    Degree: Master    Type: Thesis
Country: China    Candidate: Z Q Yin    Full Text: PDF
GTID: 2518306524490294    Subject: Master of Engineering
Abstract/Summary:
Source code search aims to retrieve the code fragment that implements a desired function from a natural language query. It mainly applies natural language processing (NLP) techniques to source code text, and it requires aligning two modalities, natural language and source code, to enable semantic search. Existing methods, built on traditional sequence representation models, encode natural language queries and source code text into vector representations and perform search by similarity comparison. However, traditional sequence representation models such as the bag-of-words model and recurrent neural networks lack semantic feature-extraction ability, and the information in source code is sparser than in natural language text, which demands stronger feature extraction. To address these problems, this thesis introduces a self-attention model and designs and implements a self-attention-based source code search model. The model uses a Transformer encoder as the feature extractor to represent sequence data as vectors, and it feeds the natural language and source code vectors jointly into a multilayer perceptron to compute the final similarity score. On this basis, an autoencoding language model with a BERT-style structure is pre-trained to obtain a source code language model that understands code better, allowing better extraction of code semantic features. Finally, the original Transformer encoder is replaced with the pre-trained language model to complete code feature extraction, yielding the source code search model targeted in this thesis. Experiments are conducted on six programming languages, and the search model based on the pre-trained language model achieves the best results; for example, its MRR score on Python is 0.74, a 17% improvement over the attention-based model. The effects of the pre-trained language model on code infilling and code completion tasks are also explored, together with a comparison of several source code search models with different structures, which demonstrates the feasibility of the source code language model and the effectiveness of the search model proposed in this thesis.
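To make the described architecture concrete, the following is a minimal sketch, not the thesis's actual implementation: it assumes PyTorch, and the class name, layer sizes, vocabulary size, and mean-pooling choice are all illustrative assumptions. As in the abstract, a Transformer encoder embeds both the natural language query and the code snippet, and a multilayer perceptron scores the concatenated pair.

    import torch
    import torch.nn as nn

    class CodeSearchModel(nn.Module):
        """Hypothetical sketch of a self-attention code-search model:
        a Transformer encoder embeds query and code token sequences,
        and an MLP maps the joint representation to a similarity score."""

        def __init__(self, vocab_size=30000, d_model=256, n_heads=8, n_layers=3):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
            # MLP over the concatenated query/code vectors -> one score.
            self.scorer = nn.Sequential(
                nn.Linear(2 * d_model, d_model),
                nn.ReLU(),
                nn.Linear(d_model, 1),
            )

        def encode(self, token_ids):
            # Mean-pool the encoder outputs into a single sequence vector.
            return self.encoder(self.embed(token_ids)).mean(dim=1)

        def forward(self, query_ids, code_ids):
            q = self.encode(query_ids)   # (batch, d_model)
            c = self.encode(code_ids)    # (batch, d_model)
            return self.scorer(torch.cat([q, c], dim=-1)).squeeze(-1)

In the pre-trained variant described above, the randomly initialized encoder would be swapped for a BERT-style model pre-trained on source code; the scoring head stays the same.

The reported results use mean reciprocal rank (MRR), the average over all queries of one divided by the rank of the correct code snippet. A minimal computation of the metric, with hypothetical ranks for illustration:

    def mean_reciprocal_rank(ranks):
        # ranks: 1-based rank of the correct snippet for each query.
        return sum(1.0 / r for r in ranks) / len(ranks)

    # Example: correct snippets ranked 1st, 2nd, and 4th for three queries.
    print(mean_reciprocal_rank([1, 2, 4]))  # ~0.583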
Keywords/Search Tags:Multimodality, Representation Learning, Semantic Searching, Source Code Searching