
Research Of Code Clone Detection Method Based On Hierarchical Features Of Token Representation

Posted on: 2022-05-25
Degree: Master
Type: Thesis
Country: China
Candidate: D M Zhang
GTID: 2518306542474194
Subject: Computer Science and Technology
Abstract/Summary:
In recent years, with the rapid spread of the Internet and the growth of application software, the harm caused by code cloning has become increasingly apparent. Code clone detection has therefore become very important for software maintenance, code vulnerability detection and patching. Compared with other existing detection methods in terms of the time cost of code representation, the clone types detected and the ease of deployment, token-based detection methods have certain advantages. However, most current token-based clone detection techniques can only detect syntactic clones and have difficulty detecting semantic clones. How to use token-based code clone detection to achieve more comprehensive and accurate detection is therefore a major challenge in the field. In addition, most token-based techniques currently target only Java or C and do not cover other languages. Among the languages not covered, Python is more likely to produce code clones because of its open-source nature. At the same time, the existing open-source Python clone data sets have small sample sizes, which is not conducive to model training and testing. To address these problems, we propose a method for detecting Python code clones that combines token representation and machine learning. The main content and contributions are as follows:

(1) We propose a code clone detection method based on hierarchical features. First, we use a two-layer Bi-LSTM network to extract semantic relationships at the two structural levels of the code. On this basis, we introduce an attention mechanism to adjust the weights of important tokens and code lines. Finally, we use a softmax classifier to classify the target code pair. Experimental results on the data set used in CCIS show that this model improves the recall rate of detection by 4% compared with the CCIS method.

(2) Following the way existing public data sets are organized, we collected the semantic clone types of the Python language and propose an improved clone detection method based on hierarchical features. This method establishes a code information extraction and detection model based on the two-layer features specific to code, combined with cosine similarity and classifiers, without the need for large-scale raw data or manual annotation. Experiments show that the accuracy of this method is about 4% higher than that of the original model and that it is better suited to small-sample clone detection problems, which improves the robustness and detection performance of the model.
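The two-level view of code described above (tokens within a line, lines within a fragment) and the use of cosine similarity can be illustrated with a minimal Python sketch. It is not the thesis model: the Bi-LSTM encoders, the attention mechanism and the classifiers are replaced here by plain token-count vectors, and the function names `hierarchical_tokens`, `cosine_similarity` and `line_level_similarity` are hypothetical names introduced only for this illustration. The abstraction of identifiers and literals to `ID`, `NUM` and `STR` is likewise an assumption, not something stated in the abstract.

```python
import io
import keyword
import math
import tokenize
from collections import Counter

# Token types that carry no lexical content for this illustration.
_SKIP = {
    tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
    tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER,
}


def hierarchical_tokens(source):
    """Split Python source into lines, each a list of normalized tokens.

    Assumption for this sketch: identifiers and literals are abstracted
    to ID / NUM / STR so that renamed variables still match.
    """
    lines = {}
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIP:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            text = "ID"
        elif tok.type == tokenize.NUMBER:
            text = "NUM"
        elif tok.type == tokenize.STRING:
            text = "STR"
        else:
            text = tok.string
        lines.setdefault(tok.start[0], []).append(text)
    return [lines[n] for n in sorted(lines)]


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def line_level_similarity(src_a, src_b):
    """Average, over the lines of src_a, of the best-matching line in src_b.

    Token level: each line becomes a bag of tokens.
    Line level: line scores are averaged into one fragment score.
    """
    bags_a = [Counter(line) for line in hierarchical_tokens(src_a)]
    bags_b = [Counter(line) for line in hierarchical_tokens(src_b)]
    if not bags_a or not bags_b:
        return 0.0
    best = [max(cosine_similarity(a, b) for b in bags_b) for a in bags_a]
    return sum(best) / len(best)
```

In this sketch, two fragments that differ only in identifier names, such as `def add(x, y): return x + y` and `def plus(p, q): return p + q` written on two lines each, receive a score of approximately 1.0. A count-based score of this kind can only reflect lexical overlap; detecting semantic clones is what the learned hierarchical features in the thesis are intended for.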
Keywords/Search Tags: token representation, hierarchical features, bi-directional long short-term memory, attention mechanism, enhanced classification