| A code clone is a code fragment that is similar to another fragment in the same code base. As code clone detection has matured, it has been applied to software development, maintenance, and optimization. Source code clone detection relies on different intermediate code representations, such as identifiers, abstract syntax trees (ASTs), control flow graphs, and metrics, which provide different levels of abstraction for creating explicit or implicit relationships between source code elements. Early clone detection techniques treat code as text or token sequences. These methods can detect textually similar code but cannot detect syntactically or semantically similar code. Some studies use deep learning to detect code clones by learning latent syntactic and semantic features from data. Most of these methods use the AST as the intermediate representation of the source code, which greatly improves detection performance, but AST-based methods still cannot make full use of the semantic information in a code fragment, in particular control flow and data flow. Recent studies have shown that semantically extending the AST and treating it as a graph represents code better and benefits downstream tasks such as code completion and code clone detection. Therefore, this paper focuses on the semantic extension of the AST and on graph-structured representation to improve code clone detection. The first method proposed in this paper is clone detection based on abstract syntax tree extension, which introduces a new way of generating the intermediate representation. In this method, control flow and data flow information from the source code is added to the AST according to control dependency and data dependency rules, forming an extended semantic graph. A graph matching network and a gated graph neural network are then used to 
characterize the semantic graph and compute the similarity between two code fragments. Experimental results show that the proposed extended semantic graph improves, to a certain extent, the model's ability to capture semantic information and improves clone classification performance. Semantically extending the AST fuses syntactic and semantic information to a certain extent. However, the extended semantic graph is large, and existing graph neural networks cannot capture information shared between nodes that are far apart in the graph. This paper therefore proposes a second method: code clone detection based on a heterogeneous graph neural network. This method uses deep learning to design and implement a heterogeneous graph neural network whose main modules fall into four functional parts: an in-graph self-attention module, a cross-graph attention module, a propagation layer module, and an aggregation output layer module. Through these four modules, the extended semantic graph extracted from the code is mapped into a vector space, where the similarity between code fragments is computed. Experimental results show that the proposed heterogeneous graph neural network model can effectively capture the semantic information in code. |
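
To make the idea of an extended semantic graph concrete, the following is a minimal sketch and not the implementation used in this work. It assumes Python source and the standard `ast` module purely for illustration. The function name `build_extended_graph`, the edge kinds `"ast"`, `"cfg"`, and `"dfg"`, and the two simplified rules (consecutive statements in a block for control flow, last write to next read for data flow) are hypothetical stand-ins for the control dependency and data dependency rules mentioned above.

```python
import ast


def build_extended_graph(source):
    """Return (labels, edges) for a simplified extended semantic graph.

    labels[i] is the AST node type of node i; edges is a set of
    (src, dst, kind) triples with kind in {"ast", "cfg", "dfg"}.
    All rules here are illustrative assumptions, not the paper's rules.
    """
    tree = ast.parse(source)
    labels = []    # node index -> AST node type name
    index_of = {}  # id(ast node) -> node index (used for statements and names)
    edges = set()

    def visit(node):
        # 1. Syntactic edges: parent -> child, one graph node per AST node.
        idx = len(labels)
        labels.append(type(node).__name__)
        index_of[id(node)] = idx
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.expr_context):
                continue  # skip Load/Store markers, they carry no structure
            edges.add((idx, visit(child), "ast"))
        return idx

    visit(tree)

    # 2. Control-flow edges (simplified): consecutive statements in a block.
    for node in ast.walk(tree):
        for field in ("body", "orelse", "finalbody"):
            block = getattr(node, field, None)
            if isinstance(block, list):
                for prev, nxt in zip(block, block[1:]):
                    edges.add((index_of[id(prev)], index_of[id(nxt)], "cfg"))

    # 3. Data-flow edges (simplified): last write of a name -> later read.
    #    Reads on a line are handled before writes on that line, so that
    #    `x = x + 1` links the read of x to the previous definition.
    #    Branches and loops are ignored in this sketch.
    names = [n for n in ast.walk(tree) if isinstance(n, ast.Name)]
    names.sort(key=lambda n: (n.lineno, isinstance(n.ctx, ast.Store), n.col_offset))
    last_def = {}
    for name in names:
        if isinstance(name.ctx, ast.Store):
            last_def[name.id] = index_of[id(name)]
        elif name.id in last_def:
            edges.add((last_def[name.id], index_of[id(name)], "dfg"))

    return labels, edges


if __name__ == "__main__":
    labels, edges = build_extended_graph("x = 1\ny = x + 2\n")
    for src, dst, kind in sorted(edges, key=lambda e: (e[2], e[0], e[1])):
        print(f"{kind}: {labels[src]}({src}) -> {labels[dst]}({dst})")
```

A graph of this form, with typed edges over AST nodes, is the kind of input that a gated graph neural network or a graph matching network could consume. The encoding of nodes and the network itself are outside the scope of this sketch.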