Research And Implementation Of Code Plagiarism Detection Based On Subtree Tracking

Posted on: 2019-10-11
Degree: Master
Type: Thesis
Country: China
Candidate: Z X Zhang
Full Text: PDF
GTID: 2428330566968735
Subject: Computer technology
Abstract/Summary:
As Internet technology has developed, communication has become more and more convenient, which in turn has made code plagiarism easier. Code plagiarism is complicated and difficult to define, and manual detection is inefficient, ineffective, and subjective. For lack of a credible inspection system, the checking of programming exercises in most domestic universities is still done manually. The purpose of this thesis is to address that problem and improve the current state of code plagiarism detection. Based on an analysis of existing code plagiarism detection techniques, we propose a code plagiarism detection method based on subtree tracking. In addition, existing research only measures the similarity between two samples and rarely considers that plagiarized samples may form groups. We therefore further propose a grouping method, based on an improved k-means algorithm, that can effectively identify plagiarism groups during detection. The specific research content of this thesis includes:

(1) A code plagiarism detection method based on subtree tracking is proposed for detecting high-level code disguise. The main steps of the method are: transform the code into an abstract syntax tree; extract the features of the abstract syntax tree and track the feature vector of each subtree; calculate the distance between each pair of feature vectors to obtain the feature similarity matrix; finally, quantify code similarity using the code distance and a distance threshold. The code distance is a weighted distance, computed from the nearest distance of each vector in the feature similarity matrix and the information contained in the feature vector. Experimental results show that this method can handle a variety of plagiarism types, especially the "code reordering" type, and that its detection efficiency is better than that of existing systems.

(2) The k-means clustering algorithm requires an initial k value (the number of clusters) to be specified, which makes it unsuitable for clustering code plagiarism samples. We therefore propose an improved k-means clustering method that automatically searches for k by comparing the cluster diameter with the distance between clusters, and that determines whether a cluster has fully formed. All clusters are found progressively. The experimental results demonstrate the efficiency of the method.

(3) An online plagiarism detection system is designed and implemented based on the two methods proposed in this thesis. The system mainly serves teachers and students. To make it more convenient to use, in addition to the code plagiarism detection and grouping functions, a question-and-answer platform was developed to give teachers and students an online space for communication and learning. To improve the user experience, we use MySQL master-slave replication and Nginx+keepalived to make the online plagiarism detection system highly available at both the data and application levels.
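The steps listed under (1) can be sketched roughly as follows. This is only an illustration under several assumptions that the abstract does not confirm: Python's standard `ast` module stands in for the thesis's parser, each subtree's feature vector is taken to be a count of node types, the vector distance is a normalized L1 distance, the "information contained in the feature vector" is approximated by the subtree's node count, and the threshold value `0.2` is arbitrary. The names `subtree_vectors`, `vector_distance`, `code_distance`, and `is_suspicious` are hypothetical.

```python
import ast
from collections import Counter


def subtree_vectors(source, min_nodes=3):
    """Parse source and return one node-type count vector per subtree.

    Assumption: a subtree's feature vector is a Counter of AST node type
    names. Identifier names are ignored, so renaming changes nothing.
    """
    vectors = []

    def visit(node):
        counts = Counter([type(node).__name__])
        for child in ast.iter_child_nodes(node):
            counts += visit(child)
        # Skip tiny subtrees (e.g. a bare variable reference).
        if sum(counts.values()) >= min_nodes:
            vectors.append(counts)
        return counts

    visit(ast.parse(source))
    return vectors


def vector_distance(a, b):
    """Normalized L1 distance between two count vectors, in [0, 1]."""
    diff = sum(abs(a[k] - b[k]) for k in set(a) | set(b))
    return diff / (sum(a.values()) + sum(b.values()))


def code_distance(src_a, src_b):
    """Weighted average of each subtree's nearest distance to the other file.

    Assumption: the weight of a vector is its node count, used here as a
    stand-in for the information it carries.
    """
    va, vb = subtree_vectors(src_a), subtree_vectors(src_b)
    if not va or not vb:
        return 1.0

    def one_way(xs, ys):
        total = sum(sum(x.values()) for x in xs)
        weighted = sum(
            sum(x.values()) * min(vector_distance(x, y) for y in ys)
            for x in xs
        )
        return weighted / total

    # Average both directions so the measure is symmetric.
    return (one_way(va, vb) + one_way(vb, va)) / 2


def is_suspicious(src_a, src_b, threshold=0.2):
    """Flag a pair whose code distance falls below a (hypothetical) threshold."""
    return code_distance(src_a, src_b) < threshold


ORIGINAL = """
def f(a, b):
    x = a + b
    y = a * b
    return x - y
"""

# Same logic with statements reordered and every identifier renamed.
REORDERED = """
def g(p, q):
    m = p * q
    n = p + q
    return n - m
"""

UNRELATED = """
for i in range(10):
    print(i)
"""

if __name__ == "__main__":
    print("reordered:", code_distance(ORIGINAL, REORDERED))
    print("unrelated:", code_distance(ORIGINAL, UNRELATED))
```

Because every subtree of `REORDERED` has an identical counterpart in `ORIGINAL`, each nearest distance is zero and the pair is flagged, which is the intuition behind matching subtrees individually rather than comparing whole files in order. A real system would need richer features and a threshold tuned on labeled data.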
Keywords/Search Tags: Code plagiarism, Similarity, Clustering, k-means, Abstract syntax tree