In software development, it is common practice to copy code snippets and reuse them, either verbatim or with minor modifications, in order to increase productivity. As a result, similar code fragments, called software clones or code clones, are often found in software systems. While code cloning may improve initial productivity, it can propagate faults and increase the maintenance cost of software systems. In recent years, code clone detection has become an increasingly important research topic in software engineering. Recent research on code clone detection has shown that neural network models based on abstract syntax trees (ASTs) can represent source code better than other methods. Researchers have used the Tree-Based Convolutional Neural Network (TBCNN) or the Tree-Based Long Short-Term Memory network (Tree-LSTM) to encode an abstract syntax tree into a vector representation that captures the syntactic structure of the program. The existing TBCNN and Tree-LSTM models are effective, but they have limitations. Recent studies have shown that, as with long texts in NLP, these tree-based neural network models are susceptible to the vanishing gradient problem when the abstract syntax tree is deep, i.e., the gradients become smaller and smaller during training. To address the problem that existing tree-based neural network models cannot handle very large abstract syntax trees well, this paper explores two methods for decomposing abstract syntax trees. The first method decomposes the abstract syntax tree into a set of AST paths and uses the compare-aggregate model to aggregate the set of AST paths into a vector that represents the whole abstract syntax tree. Experimental results show that this method is more effective than learning the whole abstract syntax tree directly with Tree-LSTM or TBCNN. The second method decomposes the abstract syntax tree into a sequence of path-augmented statements and then proposes a corresponding neural network structure
PCAN (Path Context Augmented Network) to learn a vector representation of the sequence of path context augmented statements as the representation of the whole abstract syntax tree. Experimental results show that this method does improve performance.
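
To make the idea of decomposing an abstract syntax tree into a set of paths more concrete, the following is a minimal sketch rather than the method used in this paper. It assumes that a path is a root-to-leaf sequence of node type names, which may differ from the paper's own definition of an AST path, and it uses Python's standard `ast` module on Python source purely for illustration. The function name `root_to_leaf_paths` is hypothetical.

```python
import ast


def root_to_leaf_paths(source):
    """Return all root-to-leaf paths of the AST as tuples of node type names.

    Assumption: a "path" here is a root-to-leaf chain of node types; this is
    an illustrative definition, not necessarily the one used in the paper.
    """
    tree = ast.parse(source)
    paths = []
    # An explicit stack avoids Python's recursion limit on very deep trees.
    stack = [(tree, ())]
    while stack:
        node, prefix = stack.pop()
        path = prefix + (type(node).__name__,)
        children = list(ast.iter_child_nodes(node))
        if not children:
            # A node without AST children is a leaf, so the path is complete.
            paths.append(path)
        # Push children in reverse so they are visited left to right.
        stack.extend((child, path) for child in reversed(children))
    return paths


if __name__ == "__main__":
    for path in root_to_leaf_paths("x = 1"):
        print(" > ".join(path))
```

Each path is much shorter than the tree is large, so a downstream model would operate on many short sequences instead of one deep structure; how those paths are then aggregated into a single vector is outside the scope of this sketch.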