
Research on Hierarchical Contrastive Learning-Based Source Code Representation

Posted on: 2023-12-10    Degree: Master    Type: Thesis
Country: China    Candidate: X Wang    Full Text: PDF
GTID: 2568306614493634    Subject: Engineering
Abstract/Summary:
With the rapid development of Internet technology and the growing scale and complexity of modern software, programmers face an ever-increasing burden of software maintenance and development. Program comprehension techniques have therefore emerged to improve the efficiency of software development. In the current program comprehension landscape, however, extracting relevant information by analyzing program features at multiple levels and from multiple perspectives remains a challenging task. Deep learning algorithms are increasingly used to build deep neural networks on existing data and uncover the features hidden within it. Program comprehension requires extracting the feature information from a program that is relevant to the comprehension task at hand, so it is particularly important to explore deep learning techniques that automatically learn the features embedded in program data.

Because program code has a complex structure, the biggest challenge for program comprehension is how to represent source code so that its syntactic and semantic information is captured more effectively. Source code representation is a critical step in a program comprehension model, and its quality plays a decisive role in the model's performance. Most existing program comprehension models parse the source code into different modalities and then extract the features required from each modality. They concentrate mainly on extracting the syntactic structure or semantic information of the source code, while the deep relationships within the code's syntactic structure are ignored. In particular, abstract syntax trees (ASTs) play a crucial role in source code representation. Beyond the syntactic structure of the AST itself, the hierarchical relationships between AST nodes are also crucial to enhancing the code representation. However, learning the hierarchical structure of ASTs effectively remains a challenge, because ASTs typically contain a large number of nodes and are usually deep.

To address these problems in existing program comprehension techniques, this paper proposes a source code representation model based on hierarchical contrastive learning. The method uses contrastive learning to let the network predict the hierarchy level of AST nodes and learn the hierarchical relationships between them, so that the representation vectors of nodes at significantly different AST levels are far apart in the embedding space. With such vector representations, the structural similarity between code snippets can be measured more accurately, which benefits many similarity-detection-related downstream tasks on source code. The main work of this paper is summarized in the following three aspects:

(1) To address the underutilization of label information available in the source code itself, a hierarchical contrastive learning method is used to automatically mine node label information on top of the original dataset. The method assigns labels to nodes according to their depth in the AST, so that the vector representations of nodes with the same label become more similar in the embedding space while those of nodes with different labels become less similar, improving the performance of downstream tasks that rely on similarity computation.
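To make the idea concrete, the following is a minimal sketch (not the thesis implementation) of a depth-based supervised contrastive loss over AST node embeddings, written in PyTorch. The function name `depth_contrastive_loss`, the temperature value, and the toy inputs are illustrative assumptions; here a node's "label" is simply its depth in the AST.

```python
# Hypothetical sketch: depth-labelled contrastive loss for AST node embeddings.
# Nodes at the same AST depth are treated as positives, other nodes as negatives.
import torch
import torch.nn.functional as F

def depth_contrastive_loss(node_embeddings, node_depths, temperature=0.1):
    """node_embeddings: (N, D) float tensor; node_depths: (N,) integer tensor."""
    z = F.normalize(node_embeddings, dim=1)            # unit-length node vectors
    sim = z @ z.t() / temperature                       # pairwise similarities
    n = z.size(0)
    diag = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(diag, -1e9)                    # exclude self-pairs
    # Positives: distinct nodes that share the same depth label.
    pos_mask = (node_depths.unsqueeze(0) == node_depths.unsqueeze(1)) & ~diag
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_counts = pos_mask.sum(dim=1)
    valid = pos_counts > 0                               # skip nodes with no positive
    loss = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(dim=1)[valid] / pos_counts[valid]
    return loss.mean()

# Toy usage: 6 node embeddings at depths 0, 1, 1, 2, 2, 2.
emb = torch.randn(6, 16, requires_grad=True)
depths = torch.tensor([0, 1, 1, 2, 2, 2])
print(depth_contrastive_loss(emb, depths))
```

Minimizing such a loss pulls same-depth node vectors together and pushes different-depth node vectors apart, which is the geometric effect described above.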
(2) To address the common problems of incomplete feature extraction and gradient dispersion in single graph networks, a novel GNN, the residual self-attention graph neural network (RSGNN), is designed (see the layer sketch after this list). A new self-attention mechanism based on internal and external residuals is introduced into the graph convolutional network, enabling it to pay more attention to global information and enrich its features while still capturing the local features of the AST. In particular, internal residual connections are added alongside the external residual connections around the attention block, which strengthens the expressiveness of the network and significantly reduces the difficulty of training it at greater depth.

(3) To address the incomplete extraction of AST structure from source code, an all-round AST embedding technique is proposed. A novel AST embedding method analyzes the AST comprehensively from two perspectives. The hierarchical contrastive learning technique focuses on learning the hierarchical relationships of the AST in the horizontal direction (horizontal hierarchy representation). To preserve the integrity of the AST representation, the embedding of longitudinal paths (vertical path representation) is realized from the vertical perspective of the AST structure, and the horizontal and vertical representations of the AST are then combined organically. In this way, the approach learns the sophisticated structure of the AST and better serves downstream program comprehension tasks.
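As a rough illustration of contribution (2), the sketch below shows one possible RSGNN-style layer that combines GCN-style neighborhood aggregation with global multi-head self-attention, using an internal residual inside the attention block and an external residual around the whole layer. The class name `RSGNNLayer`, the layer sizes, and the normalization choices are assumptions for illustration, not the thesis architecture.

```python
# Hypothetical RSGNN-style layer: local aggregation over the AST adjacency,
# then global self-attention, with internal and external residual connections.
import torch
import torch.nn as nn

class RSGNNLayer(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.local = nn.Linear(dim, dim)                          # GCN-style transform
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x, adj):
        # x: (N, dim) node features; adj: (N, N) adjacency matrix (self-looped).
        local = torch.relu(self.local(adj @ x))                   # local AST structure
        h = self.norm1(local)
        # Global self-attention over all nodes, wrapped by an internal residual.
        attn_out, _ = self.attn(h.unsqueeze(0), h.unsqueeze(0), h.unsqueeze(0))
        h = self.norm2(h + attn_out.squeeze(0))                   # internal residual
        return x + h                                              # external residual

# Toy usage: 5 AST nodes with 32-dim features and a small adjacency matrix.
x = torch.randn(5, 32)
adj = torch.eye(5)
adj[0, 1] = adj[1, 0] = 1.0
layer = RSGNNLayer(32)
print(layer(x, adj).shape)   # torch.Size([5, 32])
```

The external residual lets each layer act as a refinement of its input, which is one common way to ease training as such networks are stacked deeper.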
Keywords/Search Tags:Program Understanding, Self-supervised Learning, Contrastive Learning, Abstract Syntax Trees, Attention Mechanisms