Research On Code Similarity Detection Based On Siamese Network

Posted on: 2022-06-21
Degree: Master
Type: Thesis
Country: China
Candidate: Y Wu
GTID: 2518306335956809
Subject: Master of Engineering
Abstract/Summary:
The rapid development of the software industry has brought problems such as software plagiarism and disputes over software intellectual property. Measuring code similarity underpins plagiarism detection, intellectual property protection, feature location, and software reuse. Existing code similarity methods have several shortcomings: methods based on attribute metrics cannot accurately calculate the similarity between pieces of code; some methods based on structural metrics do not capture the semantic information of the code well; and for software systems with little data or no historical data, there is no good way to measure similarity. To address these problems, this thesis builds a code similarity detection model based on a Siamese network, with the class as the detection granularity. First, the source code goes through three preprocessing steps: word splitting, stemming, and stop-word removal. The doc2vec method then converts the preprocessed text into vectors. doc2vec uses a neural network language model that is trained on the context of words, which addresses the inability of the structure-based methods above to capture semantics. A Siamese network is then trained to extract code features, and cosine distance is used to calculate similarity. The Siamese network uses two identical sub-networks for training and feature extraction. Parameter sharing between the sub-networks reduces training time and mitigates the over-fitting caused by too many parameters, which improves accuracy. To address the lack of training data, this thesis improves the MMD (Maximum Mean Discrepancy) method, uses it to screen code from other software projects, and adds the selected code to the training data set. To demonstrate the effectiveness of the method, two open-source projects, Eclipse and JabRef, are used as experimental subjects. The method without data expansion and methods using other vectorization models serve as baselines. The experimental results show that the proposed method improves on existing methods across the evaluation metrics and achieves higher accuracy. The experiments also show that doc2vec outperforms the "bag-of-words" methods and confirm the effectiveness of the Siamese network for feature extraction.
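The core idea of the abstract, two sub-networks that share parameters and are compared by cosine similarity, can be illustrated with a minimal Python sketch. This is not the architecture used in the thesis, which the abstract does not describe. The single dense layer, the random untrained weights, the vector sizes, and the names `make_encoder`, `cosine_similarity`, and `siamese_similarity` are all assumptions made for illustration. The input vectors stand in for doc2vec outputs of three classes.

```python
import math
import random


def make_encoder(in_dim, out_dim, seed=0):
    """Return an encode(vector) function backed by one fixed weight matrix.

    Assumption: a single dense layer with tanh, randomly initialised and
    untrained. It only illustrates weight sharing, not the thesis model.
    """
    rng = random.Random(seed)
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)]
               for _ in range(out_dim)]

    def encode(vector):
        # Each output unit is tanh of a weighted sum of the inputs.
        return [math.tanh(sum(w * x for w, x in zip(row, vector)))
                for row in weights]

    return encode


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def siamese_similarity(encode, vec_a, vec_b):
    """Encode both inputs with the SAME encoder, then compare the features."""
    # Calling one encoder for both branches is what parameter sharing means:
    # there is a single set of weights, not one set per branch.
    return cosine_similarity(encode(vec_a), encode(vec_b))


# Hypothetical 4-dimensional document vectors for three classes.
encode = make_encoder(in_dim=4, out_dim=3)
class_a = [0.9, 0.1, 0.4, 0.0]
class_b = [0.9, 0.1, 0.4, 0.0]    # identical to class_a
class_c = [-0.2, 0.8, -0.5, 0.3]  # a different class

print(siamese_similarity(encode, class_a, class_b))
print(siamese_similarity(encode, class_a, class_c))
```

Because both branches use one set of weights, identical inputs always map to identical features and score a similarity of 1 (up to floating-point rounding). In a trained model the shared weights would be fitted on labelled similar and dissimilar pairs; that training step is omitted here.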
Keywords/Search Tags:Code similarity, Siamese network, Deep learning, Cosine distance, Data extension