
Code Pre-Training Model Based On Code Property Graph

Posted on: 2022-05-22
Degree: Master
Type: Thesis
Country: China
Candidate: J Lin
Full Text: PDF
GTID: 2480306722971929
Subject: Master of Engineering
Abstract/Summary:
With the development and popularization of computer technology, the number of software systems continues to grow, and their scale and complexity are increasing greatly, making software development, maintenance, and reuse ever more difficult. Program comprehension is the process of analyzing, abstracting, and reasoning about programs, and it plays an important role in software development. Through program comprehension, the hidden features of a program can be fully explored and the code abstracted into feature vectors that serve downstream tasks such as code completion, code defect correction, and code clone detection, benefiting software engineering, computer education, and other fields.

Deep learning is a data-driven, end-to-end method that builds deep neural networks on large-scale data to discover hidden features in the data. In recent years deep learning has made considerable progress, and the growth of open-source software and communities has provided a large amount of code, making it feasible to apply deep learning to code-related tasks. However, much program comprehension research relies on labeled datasets for specific tasks and is therefore difficult to generalize to other tasks. A code pre-training model can exploit the semantic information of code on unlabeled data, generate a general code representation, and be applied to different downstream tasks after fine-tuning. Current code pre-training models are mostly adapted from pre-training models for natural language processing and rarely consider the structural information specific to code. To this end, this thesis proposes CPGCode, a code pre-training model based on the code property graph, and designs three pre-training tasks that jointly learn a feature representation of the code property graph. The main contributions of this thesis are as follows.

(1) We design a code pre-training framework based on graph neural networks. After abstracting the code into a code property graph, the framework uses a gated graph neural network as an encoder to propagate vertex information and learn vertex representations, aggregates vertex information through a self-attention mechanism to generate a graph representation, and combines three pre-training tasks through multi-task learning to mine the semantic information of code in a self-supervised manner and generate a general code representation.

(2) We propose a subgraph division algorithm for the code property graph. The code property graph captures richer structural information of the code, so we choose it as the intermediate representation. Based on the code property graph, we design an algorithm that, combining edge attributes with topological sorting, divides the code property graph into several subgraphs. We further implement an algorithm that aggregates subgraphs to generate new graphs. Both algorithms are applied in the subsequent pre-training tasks.

(3) We design three graph pre-training tasks that account for code characteristics. The first is attribute masking, which predicts concealed edge and vertex attributes from the neighborhood structure; the second is subgraph prediction, which predicts whether a subgraph appears in the code property graph; the third is edge reconstruction, which predicts the connectivity between subgraphs. These tasks cover the vertex, edge, and subgraph levels of the code property graph, combining vertex and edge attributes with the structural information of the graph to learn a more robust representation of a code snippet.

In summary, we propose a code pre-training model that encodes the code property graph with a gated graph neural network and designs three pre-training tasks: attribute masking, subgraph prediction, and edge reconstruction. The code representation is learned in a self-supervised manner through multi-task learning and is finally applied to four different downstream tasks.
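The abstract does not give implementation details for the encoder, but the described pipeline (gated graph neural network propagation followed by self-attention pooling into a graph vector) can be sketched in plain NumPy. Everything below is illustrative: the dimensions, the simplified GRU-style update (reset gate omitted), and all parameter names are assumptions, not the thesis's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ggnn_step(H, A, Wmsg, Wz, Uz, Wh, Uh):
    """One gated (GRU-style) propagation step over a graph with adjacency A."""
    M = A @ H @ Wmsg                 # aggregate transformed neighbor messages
    z = sigmoid(M @ Wz + H @ Uz)     # update gate
    Htil = np.tanh(M @ Wh + H @ Uh)  # candidate vertex state
    return (1 - z) * H + z * Htil    # gated interpolation, as in a GRU cell

def attention_readout(H, w):
    """Self-attention pooling: score each vertex, softmax, weighted sum."""
    scores = H @ w
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    return alpha @ H                 # single graph-level representation

n, d = 5, 8                          # toy graph: 5 vertices, 8-dim states
A = rng.integers(0, 2, size=(n, n)).astype(float)
H = rng.normal(size=(n, d))
params = [rng.normal(scale=0.1, size=(d, d)) for _ in range(5)]
for _ in range(3):                   # a few propagation steps
    H = ggnn_step(H, A, *params)
g = attention_readout(H, rng.normal(size=d))
print(g.shape)  # (8,)
```

In a real system the propagation would use separate weights per edge type of the code property graph; this sketch uses a single untyped adjacency matrix for brevity.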
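The subgraph division algorithm is only summarized in the abstract ("combined with the edge attributes and topological sorting"). One plausible reading, sketched below purely as an assumption: keep edges of one attribute type (e.g. control-flow edges), compute a topological order with Kahn's algorithm, and chunk the ordered vertices into fixed-size subgraphs. The vertex names, edge labels, and chunk size are all hypothetical.

```python
from collections import defaultdict, deque

def topo_sort(vertices, edges):
    """Kahn's algorithm over the given directed edges."""
    indeg = {v: 0 for v in vertices}
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        indeg[v] += 1
    queue = deque(v for v in vertices if indeg[v] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for w in adj[u]:
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)
    return order

def split_subgraphs(vertices, typed_edges, keep_type, size):
    """Filter edges by attribute, topologically sort, chunk into subgraphs."""
    edges = [(u, v) for u, v, t in typed_edges if t == keep_type]
    order = topo_sort(vertices, edges)
    return [order[i:i + size] for i in range(0, len(order), size)]

# Toy code property graph: CFG edges drive the ordering, a DDG edge is ignored.
verts = ["decl", "assign", "cond", "then", "else", "ret"]
typed = [("decl", "assign", "CFG"), ("assign", "cond", "CFG"),
         ("cond", "then", "CFG"), ("cond", "else", "CFG"),
         ("then", "ret", "CFG"), ("else", "ret", "CFG"),
         ("decl", "ret", "DDG")]
print(split_subgraphs(verts, typed, "CFG", 3))
# → [['decl', 'assign', 'cond'], ['then', 'else', 'ret']]
```

The companion aggregation step mentioned in the abstract (merging subgraphs back into new graphs) would then operate on these chunks; its details are not specified in the source.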
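The three pre-training tasks can be read as three self-supervised losses combined by multi-task learning. The sketch below is a loose interpretation under stated assumptions: each task is reduced to a binary classification on dot products of hypothetical encoder outputs, and the losses are summed without weights. None of these modeling choices come from the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce(p, y):
    """Binary cross-entropy between predictions p and labels y."""
    eps = 1e-9
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Hypothetical embeddings produced by the encoder.
vertex_emb = rng.normal(size=(6, 8))    # per-vertex representations
subgraph_emb = rng.normal(size=(3, 8))  # per-subgraph representations
graph_emb = rng.normal(size=8)          # whole-graph representation

# Task 1: attribute masking — recover a masked (binary) vertex attribute.
mask_logits = vertex_emb @ rng.normal(size=8)
mask_labels = rng.integers(0, 2, size=6)
loss_mask = bce(sigmoid(mask_logits), mask_labels)

# Task 2: subgraph prediction — does each subgraph belong to this graph?
sub_logits = subgraph_emb @ graph_emb
sub_labels = np.array([1, 1, 0])        # last subgraph drawn as a negative
loss_sub = bce(sigmoid(sub_logits), sub_labels)

# Task 3: edge reconstruction — connectivity between subgraph pairs.
pair_logits = subgraph_emb @ subgraph_emb.T
pair_labels = np.eye(3)
loss_edge = bce(sigmoid(pair_logits), pair_labels)

# Multi-task objective: a simple unweighted sum of the three losses.
total = loss_mask + loss_sub + loss_edge
print(round(float(total), 4))
```

In practice the three losses would likely be weighted and the negatives sampled from other programs in the corpus; the abstract leaves both choices open.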
Keywords/Search Tags:Code Embedding, Code Pre-trained Model, Graph Neural Network, Deep Learning, Program Comprehension, Code Property Graph