Code clone detection originates from the copy-paste-modify pattern that is common when writing code, and its purpose is to identify similar code fragments. It is an important topic in the software field, with significance in three areas: management of software development and system maintenance, protection of software intellectual property, and software security. It also has practical engineering value for any professional field that involves software programming. This thesis first surveys a variety of code clone detection algorithms and their latest research progress, both domestic and international, then studies and improves existing algorithms, and finally implements corresponding demonstration software. The main research points of this thesis are as follows.

At present, code clone detection based on deep learning achieves good results. In practical applications, however, data sets in industrial production environments are frequently expanded or updated, and distributed or mobile devices are limited by computing power and power consumption, which makes it difficult to run highly complex detection models. For this scenario, this thesis proposes a low-complexity code clone detection algorithm based on a graph neural network, which significantly reduces the computational complexity of the model while keeping its performance metrics close to those of the baseline model. In the data preprocessing step, the algorithm prunes the edges between non-important nodes in the abstract syntax tree (AST); this reduces the preprocessing computation and also reduces the interference of redundant information during message propagation in the subsequent model. In the model, the two graphs of a code pair are merged into a single graph, and a gated recurrent unit (GRU) is used instead of a graph matching network (GMN) to implicitly map between the nodes of the two graphs, which reduces the computation required for model training. Test results on BigCloneBench (BCB) show that the proposed model has 25% fewer parameters and 16% fewer multiply-accumulate operations (MACs) while achieving performance metrics similar to those of the baseline model; experimental results on GoogleCodeJam show that the algorithm is suitable for fine-grained code clone detection scenarios.

This thesis also studies code clone detection from the perspective of natural language processing (NLP) tasks. With the goal of jointly encoding code text and code structure, a pre-trained model from the field of code generation is extended to code clone detection, and a fine-tuning network based on the TreeBERT pre-trained model is proposed. Building on the encoder component of TreeBERT, this thesis proposes an intra encoder that obtains a vector representation of each of the two code fragments in a code pair, and constructs an inter encoder that relates the two vectors to each other, thereby fine-tuning TreeBERT for the code clone detection task. Experiments on BCB show that the proposed fine-tuned model reaches 70% precision and 60% recall, exceeding the baseline fine-tuned model.

In addition, based on the low-complexity graph-neural-network algorithm, this thesis presents the software architecture of a code clone detection system, implements the system module by module, and completes a code clone detection system that works at function-level granularity. Manual testing confirms that the system provides the expected detection functions.
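
The preprocessing idea behind the graph-neural-network algorithm (pruning AST edges between non-important nodes, then merging the two graphs of a code pair) can be illustrated with a minimal sketch. It is not the thesis implementation: it uses Python's standard `ast` module rather than the Java code found in BCB, the set of "important" node types in `IMPORTANT` is an assumption made only for illustration, and the function names `ast_edges` and `merge_pair` are hypothetical. The merge shown here is a plain disjoint union of the two node sets; the GRU-based model itself is not sketched.

```python
import ast

# Assumption: which node types count as "important" is not defined in the
# text above; this tuple is a hypothetical choice for illustration only.
IMPORTANT = (ast.FunctionDef, ast.If, ast.For, ast.While,
             ast.Call, ast.Return, ast.Assign)


def ast_edges(source, offset=0):
    """Parse `source` and return (node_count, edges).

    Node ids start at `offset`. A parent-child edge is dropped when
    neither endpoint is an important node, i.e. edges between
    non-important nodes are pruned.
    """
    tree = ast.parse(source)
    ids = {}
    for node in ast.walk(tree):
        # setdefault keeps ids contiguous even if a node object is shared.
        ids.setdefault(id(node), offset + len(ids))
    edges = []
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            if isinstance(parent, IMPORTANT) or isinstance(child, IMPORTANT):
                edges.append((ids[id(parent)], ids[id(child)]))
    return len(ids), edges


def merge_pair(code_a, code_b):
    """Merge the pruned graphs of two fragments into one graph by
    shifting the node ids of the second fragment past the first."""
    n_a, edges_a = ast_edges(code_a)
    n_b, edges_b = ast_edges(code_b, offset=n_a)
    return n_a + n_b, edges_a + edges_b
```

Calling `merge_pair(code_a, code_b)` on two source strings returns the total node count and a single edge list, which is the kind of input a graph neural network library would expect for one code pair.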