
Research On Source Code Searching By Multi-modal Representation Learning

Posted on: 2022-06-19    Degree: Master    Type: Thesis
Country: China    Candidate: Z Q Yin    Full Text: PDF
GTID: 2518306524490294    Subject: Master of Engineering
Abstract/Summary:
Source code search aims to retrieve the code fragment that implements a desired function from a natural language query. It mainly applies natural language processing (NLP) techniques to source code text, and it requires aligning two modalities, natural language and source code, to enable semantic search. Existing methods, built on traditional sequence representation models, encode natural language queries and source code text into vector representations and perform search by similarity comparison. However, traditional sequence representation models such as the bag-of-words model and recurrent neural networks lack semantic feature-extraction ability, and the information in source code is sparser than in natural language text, which demands stronger feature extraction. To address these problems, this thesis introduces a self-attention model and designs and implements a self-attention-based source code search model. The model uses a Transformer encoder as the feature extractor to represent sequence data as vectors, and it feeds the natural language and source code vectors jointly into a multilayer perceptron to compute the final similarity score. On this basis, an autoencoding language model with a BERT-style structure is pre-trained to obtain a source code language model that understands code better, allowing better extraction of code semantic features. Finally, the original Transformer encoder is replaced with the pre-trained language model to complete code feature extraction, yielding the source code search model targeted in this thesis. Experiments are conducted on six programming languages, and the search model based on the pre-trained language model achieves the best results; for example, its MRR score on Python is 0.74, a 17% improvement over the attention-based model. The effects of the pre-trained language model on code infilling and code completion tasks are also explored, together with a comparison of several source code search models with different structures, which demonstrates the feasibility of the source code language model and the effectiveness of the search model proposed in this thesis.
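To make the described architecture concrete, the following is a minimal sketch, not the thesis's actual implementation: it assumes PyTorch, and the class name, layer sizes, vocabulary size, and mean-pooling choice are all illustrative assumptions. As in the abstract, a Transformer encoder embeds both the natural language query and the code snippet, and a multilayer perceptron scores the concatenated pair.

    import torch
    import torch.nn as nn

    class CodeSearchModel(nn.Module):
        """Hypothetical sketch of a self-attention code-search model:
        a Transformer encoder embeds query and code token sequences,
        and an MLP maps the joint representation to a similarity score."""

        def __init__(self, vocab_size=30000, d_model=256, n_heads=8, n_layers=3):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
            # MLP over the concatenated query/code vectors -> one score.
            self.scorer = nn.Sequential(
                nn.Linear(2 * d_model, d_model),
                nn.ReLU(),
                nn.Linear(d_model, 1),
            )

        def encode(self, token_ids):
            # Mean-pool the encoder outputs into a single sequence vector.
            return self.encoder(self.embed(token_ids)).mean(dim=1)

        def forward(self, query_ids, code_ids):
            q = self.encode(query_ids)   # (batch, d_model)
            c = self.encode(code_ids)    # (batch, d_model)
            return self.scorer(torch.cat([q, c], dim=-1)).squeeze(-1)

In the pre-trained variant described above, the randomly initialized encoder would be swapped for a BERT-style model pre-trained on source code; the scoring head stays the same.

The reported results use mean reciprocal rank (MRR), the average over all queries of one divided by the rank of the correct code snippet. A minimal computation of the metric, with hypothetical ranks for illustration:

    def mean_reciprocal_rank(ranks):
        # ranks: 1-based rank of the correct snippet for each query.
        return sum(1.0 / r for r in ranks) / len(ranks)

    # Example: correct snippets ranked 1st, 2nd, and 4th for three queries.
    print(mean_reciprocal_rank([1, 2, 4]))  # ~0.583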
Keywords/Search Tags:Multimodality, Representation Learning, Semantic Searching, Source Code Searching