
Python Code Similarity Detection Based On Token And Control Flow Graph

Posted on: 2022-08-29  Degree: Master  Type: Thesis
Country: China  Candidate: Z Li  Full Text: PDF
GTID: 2518306314968199  Subject: Computer technology
Abstract/Summary:
With the continuous development of Internet technology and the open source spirit, code reuse and source code plagiarism have become increasingly easy. People can copy or modify the source code of others through open source platforms, blogs, and the wider Internet. While this brings convenience, it also creates a series of problems. Research on source code similarity detection has deepened against this background, and new techniques are continually being applied in the field. At present, there are four main approaches to code similarity detection: attribute counting, token-based, syntax-tree-based, and graph-based methods. Although many methods exist, each has certain limitations, and no single method handles all of the problems that arise in similarity detection. This thesis focuses on similarity detection for Python source code and proposes a detection method that combines tokens with control flow graphs. First, the source code is preprocessed into a token sequence; the SimHash algorithm then produces a document fingerprint, and the degree of similarity is determined by comparing the Hamming distance between fingerprints. However, this token-based method is not very effective against variable renaming and redundant code insertion. When the token-based stage is inconclusive, a subsequent stage computes similarity from control flow graphs. Python's abstract syntax tree module is used to parse the source code and construct a control flow graph. A graph embedding vector is formed by encoding, propagating, and aggregating over the graph, and a graph matching network with an attention mechanism computes the similarity of the graph embedding vectors, which finally gives the similarity of the two source files. In the experiments, a comparison between the graph matching network and traditional graph similarity calculation shows that the method in this thesis can effectively calculate the similarity of Python source code, and that it has practical significance and application value in software engineering and code similarity applications.
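The token stage described in the abstract (token sequence, SimHash fingerprint, Hamming distance) can be illustrated with a minimal sketch. This is not the thesis's implementation: the choice of which tokens to drop, the 3-token shingles, the MD5-based 64-bit hash, and the mapping from Hamming distance to a score in [0, 1] are all assumptions made for illustration, and the function names are hypothetical.

```python
import hashlib
import io
import tokenize

# Assumption: comments and layout tokens are discarded during preprocessing.
_SKIP = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
         tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}


def token_sequence(source):
    """Return the token strings of Python source, minus comments and layout."""
    readline = io.StringIO(source).readline
    return [tok.string for tok in tokenize.generate_tokens(readline)
            if tok.type not in _SKIP]


def simhash(tokens, bits=64, ngram=3):
    """Compute a SimHash fingerprint over token n-grams (assumed shingle size)."""
    weights = [0] * bits
    shingles = [" ".join(tokens[i:i + ngram])
                for i in range(max(1, len(tokens) - ngram + 1))]
    for shingle in shingles:
        digest = hashlib.md5(shingle.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        # Each shingle votes +1 or -1 on every bit position.
        for b in range(bits):
            weights[b] += 1 if (h >> b) & 1 else -1
    # A bit is set in the fingerprint when its votes are net positive.
    return sum(1 << b for b in range(bits) if weights[b] > 0)


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def token_similarity(src_a, src_b, bits=64):
    """Assumed scoring: 1 minus the fraction of differing fingerprint bits."""
    fa = simhash(token_sequence(src_a), bits)
    fb = simhash(token_sequence(src_b), bits)
    return 1.0 - hamming_distance(fa, fb) / bits
```

Because the fingerprint is built from the literal token strings, renaming a variable changes every shingle that contains it. This is consistent with the abstract's remark that the token stage struggles with variable renaming, and with its use of control flow graphs as a second stage.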
Keywords/Search Tags: Code similarity detection, Python language, control flow graph, graph neural network