Source Code Plagiarism Detection Based On Information Retrieval And Stacking Integrated Learning

Posted on: 2021-05-21
Degree: Master
Type: Thesis
Country: China
Candidate: M Zhang
Full Text: PDF
GTID: 2518306467976499
Subject: Computer technology
Abstract/Summary:
Programming is an indispensable skill for students of computer science and related majors in colleges and universities, but the growth of the Internet has made code plagiarism increasingly common, in everything from students' programming assignments to commercial software products. Plagiarism harms students' development and does nothing to improve their abilities; for software companies, it may raise infringement issues. Most existing source code plagiarism detection methods are evaluated on a small number of code files and typically detect plagiarized code pairs by one-to-one (pairwise) matching. As the volume of source code grows, pairwise matching becomes slower and less accurate. To address these problems, this thesis proposes a code plagiarism detection method that combines information retrieval (IR) with ensemble (integrated) learning classification, with the aim of improving both the efficiency and the accuracy of detection. The main contributions are as follows:

(1) A method for retrieving potential plagiarized code pairs based on the code's abstract syntax tree (AST) and a division of the code into domains is proposed, together with a mechanism for filtering out code pairs with low matching scores. First, the code is preprocessed to remove noise and parsed into an AST. Then, following the notion of a domain in information retrieval, the information for each domain is extracted by traversing the AST, and the proposed score function is used to calculate a score for the words in each domain. When searching for potential plagiarized code, the matching score of each code pair is calculated from the domain matches. Finally, code pairs whose matching score falls below a threshold are filtered out, yielding the final set of potential plagiarized code pairs.

(2) A hybrid similarity calculation method for computing the similarity feature values of code pairs and a Stacking-based method for classifying potential plagiarized code pairs are proposed. Features of each code pair are extracted from three aspects: lexical features, structural features, and code style features. The proposed hybrid similarity calculation method is used to compute the structural feature similarity, and the remaining features are mapped to similarity values using a similarity calculation formula, giving the feature set of the code pair. The feature sets of a training set with known labels are used to train a Stacking-based ensemble classifier, and the feature sets of the potential plagiarized code pairs are then fed to the trained classifier to obtain the final classification result.

The experimental results of the two stages are summarized separately. In the retrieval stage, comparison with the JPlag baseline, a text-based IR experiment, and an AST-based IR experiment shows that IR based on domain division is effective: it achieves a precision of 0.9203, a recall of 0.9391, and a MAP of 0.5360, all higher than in the comparative experiments. In the classification stage, comparison with JPlag, IR+RF, and IR+GDBC shows that the IR+Stacking method used in this thesis achieves a precision of 0.9266, a recall of 0.9012, and an F-score of 0.9137, again better than the comparative experiments.

Finally, the overall results are analyzed. In terms of time, retrieving and classifying the potential plagiarized code takes about 11 hours in total with the proposed method, compared with about 3.5 hours for JPlag (the fastest) and about 72 hours for the deep learning method (the slowest). In terms of accuracy, measured by F-score, the proposed method reaches 0.9137, higher than JPlag's 0.4469 and the deep learning method's 0.8933. These results indicate that the proposed source code plagiarism detection method is effective.
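The retrieval stage in (1) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not give the score function, the domain definitions, or the threshold, so the three domains (`names`, `calls`, `nodes`), the weights in `DOMAIN_WEIGHTS`, the multiset-Jaccard similarity, and the 0.6 threshold are all assumptions chosen for illustration. Python's standard `ast` module stands in for whatever parser the thesis uses.

```python
import ast
from collections import Counter

# Hypothetical per-domain weights; the abstract does not give the score function.
DOMAIN_WEIGHTS = {"names": 1.0, "calls": 1.0, "nodes": 2.0}


def extract_domains(source):
    """Parse source into an AST and bucket what the traversal finds by domain.

    Parsing already discards comments and layout, which stands in for the
    noise-removal step here.
    """
    domains = {name: Counter() for name in DOMAIN_WEIGHTS}
    for node in ast.walk(ast.parse(source)):
        domains["nodes"][type(node).__name__] += 1
        if isinstance(node, ast.Name):
            domains["names"][node.id] += 1
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            domains["calls"][node.func.id] += 1
    return domains


def domain_similarity(a, b):
    """Multiset Jaccard: shared occurrences over combined occurrences."""
    combined = sum((a | b).values())
    return sum((a & b).values()) / combined if combined else 0.0


def match_score(query, candidate):
    """Weighted mean of per-domain similarities, skipping domains empty in both."""
    total = weight_sum = 0.0
    for name, weight in DOMAIN_WEIGHTS.items():
        if not query[name] and not candidate[name]:
            continue  # no evidence either way for this domain
        total += weight * domain_similarity(query[name], candidate[name])
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


def retrieve(query_source, corpus, threshold=0.6):
    """Return (file name, score) pairs at or above the threshold, best first."""
    query = extract_domains(query_source)
    scored = [(name, match_score(query, extract_domains(src)))
              for name, src in corpus.items()]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy corpus, invented for illustration only.
submission = "def total(xs):\n    acc = 0\n    for x in xs:\n        acc += x\n    return acc\n"
corpus = {
    "renamed.py": "def add_all(items):\n    s = 0\n    for i in items:\n        s += i\n    return s\n",
    "verbatim.py": "# copied\n" + submission,
    "unrelated.py": "import os\nprint(os.getcwd())\n",
}
hits = retrieve(submission, corpus)
```

With these assumed weights, the verbatim copy scores 1.0, the copy with renamed identifiers scores about 0.67 because its node-type domain still matches exactly, and the unrelated file falls below the threshold and is dropped. Weighting structural information above identifier names is one plausible way to make retrieval robust to renaming; the thesis may weight its domains differently.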
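The Stacking classifier in (2) can likewise be sketched in plain Python. The abstract does not name the base learners or the meta-learner, so the ones below are assumed stand-ins: one threshold stump per similarity feature at level 0 and a small logistic regression at level 1. The training rows are invented toy values for the three feature groups (lexical, structural, style similarity), not data from the thesis. What the sketch does show is the defining step of Stacking: the meta-learner is trained on out-of-fold predictions of the base learners.

```python
import math


class Stump:
    """Level-0 learner (assumed): thresholds one similarity feature."""

    def __init__(self, col):
        self.col = col
        self.cut = 0.5

    def fit(self, X, y):
        values = sorted({row[self.col] for row in X})
        # Candidate cuts are midpoints between neighbouring feature values.
        cuts = [(a + b) / 2 for a, b in zip(values, values[1:])] or [0.5]

        def errors(cut):
            return sum((row[self.col] >= cut) != (label == 1)
                       for row, label in zip(X, y))

        self.cut = min(cuts, key=errors)
        return self

    def predict(self, X):
        return [1.0 if row[self.col] >= self.cut else 0.0 for row in X]


class Logistic:
    """Level-1 learner (assumed): logistic regression fitted by SGD."""

    def fit(self, X, y, lr=0.5, epochs=500):
        self.w = [0.0] * len(X[0])
        self.b = 0.0
        for _ in range(epochs):
            for row, label in zip(X, y):
                err = self._prob(row) - label
                self.w = [w - lr * err * v for w, v in zip(self.w, row)]
                self.b -= lr * err
        return self

    def _prob(self, row):
        z = self.b + sum(w * v for w, v in zip(self.w, row))
        return 1.0 / (1.0 + math.exp(-z))

    def predict(self, X):
        return [int(self._prob(row) >= 0.5) for row in X]


def fit_stacking(X, y, k=3):
    """Train the meta-learner on out-of-fold base predictions, then refit bases."""
    n, n_features = len(X), len(X[0])
    meta_X = [[0.0] * n_features for _ in range(n)]
    for fold in range(k):
        held = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        for col in range(n_features):
            stump = Stump(col).fit([X[i] for i in train], [y[i] for i in train])
            for i, pred in zip(held, stump.predict([X[i] for i in held])):
                meta_X[i][col] = pred
    meta = Logistic().fit(meta_X, y)
    bases = [Stump(col).fit(X, y) for col in range(n_features)]
    return bases, meta


def predict_stacking(model, X):
    bases, meta = model
    columns = [base.predict(X) for base in bases]
    return meta.predict([list(row) for row in zip(*columns)])


# Invented rows: [lexical, structural, style] similarity; 1 = plagiarized pair.
X = [[0.95, 0.90, 0.80], [0.90, 0.85, 0.70], [0.85, 0.95, 0.90],
     [0.80, 0.88, 0.75], [0.92, 0.80, 0.85], [0.88, 0.92, 0.95],
     [0.20, 0.30, 0.40], [0.35, 0.25, 0.50], [0.10, 0.40, 0.30],
     [0.30, 0.20, 0.45], [0.25, 0.35, 0.20], [0.15, 0.10, 0.35]]
y = [1] * 6 + [0] * 6
model = fit_stacking(X, y)
predictions = predict_stacking(model, [[0.91, 0.87, 0.78], [0.22, 0.31, 0.41]])
```

Training the meta-learner only on predictions for rows the base learners did not see keeps it from simply trusting learners that have memorized the training set. A production setup would more likely use an existing library implementation with stronger base learners than these stumps.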
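The classification-stage figures can be checked against each other. Taking 0.9266 as the precision and 0.9012 as the recall, and assuming the reported F-score is the usual balanced F1 (the abstract does not state the beta value), the harmonic mean reproduces the reported 0.9137:

```python
def f_score(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Classification-stage values reported for IR+Stacking.
ir_stacking_f1 = f_score(0.9266, 0.9012)
print(round(ir_stacking_f1, 4))  # 0.9137
```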
Keywords/Search Tags: Source code plagiarism detection, Information retrieval, Abstract syntax tree, Stacking integrated learning