
Research On C Code Similarity Detection Based On AST And Graph Attention Network

Posted on: 2022-11-06
Degree: Master
Type: Thesis
Country: China
Candidate: H Y Liang
GTID: 2518306614958839
Subject: Automation Technology
Abstract/Summary:
With the rapid development of Internet technology and the interactive sharing of information, code plagiarism appears in many forms. Because it is very difficult to cover the whole field of plagiarism research, this thesis focuses on detecting code plagiarism in academic settings. Taking C source code as the research object, it studies code similarity detection, with the detection results intended to assist later manual review.

First, this thesis proposes a code similarity detection method that combines the abstract syntax tree (AST) with tokens. The source code is preprocessed to remove redundant information, the code string is converted into an AST by the Eclipse CDT tool through lexical and syntax analysis, and the AST is serialized by breadth-first traversal. Feature vectors are then extracted from the token tag sequence. Finally, the SimHash algorithm and the term frequency–inverse document frequency (TF-IDF) similarity calculation are combined to detect similar code: within the SimHash computation, TF-IDF counts the frequency of each keyword in the syntax tree sequence and assigns weights to AST nodes so that similar code can be distinguished. Experiments show that, unlike earlier detection methods that attend only to the similarity calculation, this method improves detection performance to some extent and also offers a degree of scalability.

Second, exploiting the fact that an AST is also a graph, this thesis designs a code similarity detection method based on the AST and a graph attention network. AST sequences are converted into graph data structures over the corresponding nodes with the PyG library in the PyTorch framework and fed into a graph neural network for feature vector extraction; a graph attention network and a Siamese (twin) neural network structure are introduced to calculate similarity. The method is compared with deep learning methods commonly used in code similarity detection. Experiments show that it can effectively detect the similarity of C code and assist subsequent manual judgment of plagiarism, and that it has guiding significance and application value for online education in academic settings.
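The first method's pipeline (breadth-first AST serialization, TF-IDF weighting, SimHash fingerprinting) can be illustrated with a minimal Python sketch. This is not the thesis implementation: the abstract does not specify how the two algorithms are wired together, so the sketch assumes one plausible combination in which TF-IDF weights drive the per-bit SimHash vote. The toy tuple-based AST, the node type names, the smoothed IDF formula, the MD5-based token hash, and all function names are assumptions made for illustration; a real pipeline would obtain the AST from a C parser such as Eclipse CDT.

```python
import hashlib
import math
from collections import Counter, deque


def bfs_serialize(ast):
    """Return the node types of a (node_type, children) tree in breadth-first order."""
    order, queue = [], deque([ast])
    while queue:
        node_type, children = queue.popleft()
        order.append(node_type)
        queue.extend(children)
    return order


def tf_idf_weights(doc, corpus):
    """TF-IDF weight for each distinct token in `doc` (smoothed IDF, an assumption)."""
    tf = Counter(doc)
    n_docs = len(corpus)
    weights = {}
    for tok, count in tf.items():
        df = sum(1 for d in corpus if tok in d)          # documents containing tok
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0    # always positive
        weights[tok] = (count / len(doc)) * idf
    return weights


def simhash(weights, bits=64):
    """Weighted SimHash: each token votes +w or -w on every fingerprint bit."""
    acc = [0.0] * bits
    for tok, w in weights.items():
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")            # 64-bit token hash
        for i in range(bits):
            acc[i] += w if (h >> i) & 1 else -w
    fingerprint = 0
    for i, vote in enumerate(acc):
        if vote > 0:
            fingerprint |= 1 << i
    return fingerprint


def similarity(fp_a, fp_b, bits=64):
    """1 minus the normalized Hamming distance between two fingerprints."""
    return 1.0 - bin(fp_a ^ fp_b).count("1") / bits


def leaf(node_type):
    return (node_type, [])


# Hypothetical ASTs: `original` and `renamed` differ only in identifier names,
# so their node-type trees are identical; `other` has a different structure.
original = ("FunctionDef", [("CompoundStmt", [
    ("DeclStmt", [leaf("Literal")]),
    ("ForStmt", [leaf("BinaryOp"), ("CompoundStmt", [leaf("AssignOp")])]),
    ("ReturnStmt", [leaf("Name")]),
])])
renamed = original
other = ("FunctionDef", [("CompoundStmt", [
    ("IfStmt", [leaf("BinaryOp"), ("ReturnStmt", [leaf("Literal")])]),
    ("ReturnStmt", [leaf("Name")]),
])])

corpus = [bfs_serialize(t) for t in (original, renamed, other)]
fingerprints = [simhash(tf_idf_weights(doc, corpus)) for doc in corpus]
print(similarity(fingerprints[0], fingerprints[1]))
print(similarity(fingerprints[0], fingerprints[2]))
```

Because only node types enter the serialized sequence, renaming identifiers leaves the fingerprint unchanged, which is the kind of disguise a structure-based detector is meant to see through. A deployed system would still need a similarity threshold tuned on labeled data.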
Keywords/Search Tags: code similarity detection, abstract syntax tree, graph attention network, deep learning, online education
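The graph-construction step of the second method in the abstract, turning an AST into node and edge data that a graph neural network can consume, can be sketched in plain Python as follows. The abstract does not give the thesis's node features or edge scheme, so several things here are assumptions: node features are integer ids drawn from a growing vocabulary of node types, each parent-child link is stored in both directions, and the helper name `ast_to_graph` is hypothetical. The output follows the `[sources, targets]` edge layout that PyG's `torch_geometric.data.Data` expects for `edge_index` once the lists are converted to tensors.

```python
from collections import deque


def ast_to_graph(ast, vocab):
    """Flatten a (node_type, children) tree into node ids and an edge list.

    Returns (x, edge_index), where x[i] is the vocabulary id of node i
    (nodes are numbered in breadth-first order) and edge_index is
    [sources, targets] with every parent-child link stored in both directions.
    `vocab` is updated in place so that several ASTs can share one vocabulary.
    """
    x, src, dst = [], [], []
    queue = deque([(ast, -1)])                 # (subtree, parent index)
    while queue:
        (node_type, children), parent = queue.popleft()
        idx = len(x)
        x.append(vocab.setdefault(node_type, len(vocab)))
        if parent >= 0:
            src += [parent, idx]               # parent -> child, child -> parent
            dst += [idx, parent]
        queue.extend((child, idx) for child in children)
    return x, [src, dst]


# Hypothetical AST for an expression such as `a + b`.
vocab = {}
tree = ("BinaryOp", [("Name", []), ("Name", [])])
x, edge_index = ast_to_graph(tree, vocab)
print(x)           # [0, 1, 1]
print(edge_index)  # [[0, 1, 0, 2], [1, 0, 2, 0]]
```

In a Siamese setup, the two programs being compared would each be converted this way, passed through the same graph attention encoder with shared weights (for example PyG's `GATConv` layers followed by a pooling step), and the resulting graph embeddings compared with a similarity measure.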