
Code Pre-Training Model Based On Code Property Graph

Posted on: 2022-05-22
Degree: Master
Type: Thesis
Country: China
Candidate: J Lin
Full Text: PDF
GTID: 2480306722971929
Subject: Master of Engineering
Abstract/Summary:
With the development and popularization of computer technology, the number of software systems continues to grow, and their scale and complexity are increasing greatly, making software development, maintenance, and reuse ever more difficult. Program comprehension is the process of analyzing, abstracting, and reasoning about programs, and it plays an important role in software development. Through program comprehension, the hidden features of a program can be fully explored and the code abstracted into feature vectors that serve downstream tasks such as code completion, code defect correction, and code clone detection, benefiting software engineering, computer education, and other fields.

Deep learning is a data-driven, end-to-end method that builds deep neural networks on large-scale data to discover hidden features in the data. In recent years deep learning has made considerable progress, and the growth of open-source software and communities has provided a large amount of code, making it feasible to apply deep learning to code-related tasks. However, much program comprehension research relies on labeled datasets for specific tasks and is therefore difficult to generalize to other tasks. A code pre-training model can exploit the semantic information of code on unlabeled data, generate a general code representation, and be applied to different downstream tasks after fine-tuning. Current code pre-training models are mostly adapted from pre-training models for natural language processing and rarely consider the structural information specific to code. To this end, this thesis proposes CPGCode, a code pre-training model based on the code property graph, and designs three pre-training tasks that jointly learn a feature representation of the code property graph. The main contributions of this thesis are as follows.

(1) We design a code pre-training framework based on graph neural networks. After abstracting the code into a code property graph, the framework uses a gated graph neural network as an encoder to propagate vertex information and learn vertex representations, aggregates vertex information through a self-attention mechanism to generate a graph representation, and combines three pre-training tasks through multi-task learning to mine the semantic information of code in a self-supervised manner and generate a general code representation.

(2) We propose a subgraph division algorithm for the code property graph. The code property graph captures richer structural information of the code, so we choose it as the intermediate representation. Based on the code property graph, we design an algorithm that, combining edge attributes with topological sorting, divides the code property graph into several subgraphs. We further implement an algorithm that aggregates subgraphs to generate new graphs. Both algorithms are applied in the subsequent pre-training tasks.

(3) We design three graph pre-training tasks that account for code characteristics. The first is attribute masking, which predicts concealed edge and vertex attributes from the neighborhood structure; the second is subgraph prediction, which predicts whether a subgraph appears in the code property graph; the third is edge reconstruction, which predicts the connectivity between subgraphs. These tasks cover the vertex, edge, and subgraph levels of the code property graph, combining vertex and edge attributes with the structural information of the graph to learn a more robust representation of a code snippet.

In summary, we propose a code pre-training model that encodes the code property graph with a gated graph neural network and designs three pre-training tasks: attribute masking, subgraph prediction, and edge reconstruction. The code representation is learned in a self-supervised manner through multi-task learning and is finally applied to four different downstream tasks.
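The abstract does not give implementation details for the encoder, but the described pipeline (gated graph neural network propagation followed by self-attention pooling into a graph vector) can be sketched in plain NumPy. Everything below is illustrative: the dimensions, the simplified GRU-style update (reset gate omitted), and all parameter names are assumptions, not the thesis's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ggnn_step(H, A, Wmsg, Wz, Uz, Wh, Uh):
    """One gated (GRU-style) propagation step over a graph with adjacency A."""
    M = A @ H @ Wmsg                 # aggregate transformed neighbor messages
    z = sigmoid(M @ Wz + H @ Uz)     # update gate
    Htil = np.tanh(M @ Wh + H @ Uh)  # candidate vertex state
    return (1 - z) * H + z * Htil    # gated interpolation, as in a GRU cell

def attention_readout(H, w):
    """Self-attention pooling: score each vertex, softmax, weighted sum."""
    scores = H @ w
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    return alpha @ H                 # single graph-level representation

n, d = 5, 8                          # toy graph: 5 vertices, 8-dim states
A = rng.integers(0, 2, size=(n, n)).astype(float)
H = rng.normal(size=(n, d))
params = [rng.normal(scale=0.1, size=(d, d)) for _ in range(5)]
for _ in range(3):                   # a few propagation steps
    H = ggnn_step(H, A, *params)
g = attention_readout(H, rng.normal(size=d))
print(g.shape)  # (8,)
```

In a real system the propagation would use separate weights per edge type of the code property graph; this sketch uses a single untyped adjacency matrix for brevity.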
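The subgraph division algorithm is only summarized in the abstract ("combined with the edge attributes and topological sorting"). One plausible reading, sketched below purely as an assumption: keep edges of one attribute type (e.g. control-flow edges), compute a topological order with Kahn's algorithm, and chunk the ordered vertices into fixed-size subgraphs. The vertex names, edge labels, and chunk size are all hypothetical.

```python
from collections import defaultdict, deque

def topo_sort(vertices, edges):
    """Kahn's algorithm over the given directed edges."""
    indeg = {v: 0 for v in vertices}
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        indeg[v] += 1
    queue = deque(v for v in vertices if indeg[v] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for w in adj[u]:
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)
    return order

def split_subgraphs(vertices, typed_edges, keep_type, size):
    """Filter edges by attribute, topologically sort, chunk into subgraphs."""
    edges = [(u, v) for u, v, t in typed_edges if t == keep_type]
    order = topo_sort(vertices, edges)
    return [order[i:i + size] for i in range(0, len(order), size)]

# Toy code property graph: CFG edges drive the ordering, a DDG edge is ignored.
verts = ["decl", "assign", "cond", "then", "else", "ret"]
typed = [("decl", "assign", "CFG"), ("assign", "cond", "CFG"),
         ("cond", "then", "CFG"), ("cond", "else", "CFG"),
         ("then", "ret", "CFG"), ("else", "ret", "CFG"),
         ("decl", "ret", "DDG")]
print(split_subgraphs(verts, typed, "CFG", 3))
# → [['decl', 'assign', 'cond'], ['then', 'else', 'ret']]
```

The companion aggregation step mentioned in the abstract (merging subgraphs back into new graphs) would then operate on these chunks; its details are not specified in the source.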
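The three pre-training tasks can be read as three self-supervised losses combined by multi-task learning. The sketch below is a loose interpretation under stated assumptions: each task is reduced to a binary classification on dot products of hypothetical encoder outputs, and the losses are summed without weights. None of these modeling choices come from the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce(p, y):
    """Binary cross-entropy between predictions p and labels y."""
    eps = 1e-9
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Hypothetical embeddings produced by the encoder.
vertex_emb = rng.normal(size=(6, 8))    # per-vertex representations
subgraph_emb = rng.normal(size=(3, 8))  # per-subgraph representations
graph_emb = rng.normal(size=8)          # whole-graph representation

# Task 1: attribute masking — recover a masked (binary) vertex attribute.
mask_logits = vertex_emb @ rng.normal(size=8)
mask_labels = rng.integers(0, 2, size=6)
loss_mask = bce(sigmoid(mask_logits), mask_labels)

# Task 2: subgraph prediction — does each subgraph belong to this graph?
sub_logits = subgraph_emb @ graph_emb
sub_labels = np.array([1, 1, 0])        # last subgraph drawn as a negative
loss_sub = bce(sigmoid(sub_logits), sub_labels)

# Task 3: edge reconstruction — connectivity between subgraph pairs.
pair_logits = subgraph_emb @ subgraph_emb.T
pair_labels = np.eye(3)
loss_edge = bce(sigmoid(pair_logits), pair_labels)

# Multi-task objective: a simple unweighted sum of the three losses.
total = loss_mask + loss_sub + loss_edge
print(round(float(total), 4))
```

In practice the three losses would likely be weighted and the negatives sampled from other programs in the corpus; the abstract leaves both choices open.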
Keywords/Search Tags:Code Embedding, Code Pre-trained Model, Graph Neural Network, Deep Learning, Program Comprehension, Code Property Graph