Research On Code Plagiarism Detection Model Based On Random Forest And Gradient Boosting Decision Tree

Posted on: 2020-05-13
Degree: Master
Type: Thesis
Country: China
Candidate: J D Tang
GTID: 2428330596498352
Subject: Software engineering
Abstract/Summary:
With the development of information technology, computers are becoming more and more important, and the programming ability of computer science majors is receiving more attention. To strengthen programming training, the OJ system (Online Judge System) is widely used. Students submit assignments to the OJ, and the system automatically determines whether each solution is correct, which greatly reduces the workload of teachers. As the workload of students increases, plagiarism becomes more serious, so a mechanism is needed to monitor plagiarism and, as far as possible, eliminate it. Plagiarism checking involves many factors. To improve accuracy as much as possible, the main work of this thesis is as follows:

(1) Calculate code similarity. The similarity of a student's newly submitted code is calculated using digital fingerprint technology. The code is processed in three steps: digitization, fingerprinting, and similarity calculation.

(2) Feature extraction and calculation. To judge plagiarism with machine learning, features are defined and extracted. The features include whether the code similarity exceeds a threshold, a category value for the percentage by which the similarity exceeds or falls below the threshold, the difficulty of the question, the code style similarity, a category value for the historical copy rate, the similarity concentration, and others.

(3) Machine learning model improvement and effect analysis. The existing code plagiarism detection method in the OJ system is improved. The improved algorithm uses Random Forest and Gradient Boosting Decision Tree to make up for the deficiencies of a single algorithm. The results of the two algorithms are compared and tested to improve the accuracy and rigor of the OJ system.

(4) Dynamic adjustment of the similarity threshold. The similarity threshold is set subjectively and may not be reasonable. If it is set too high, plagiarism goes undetected; if it is set too low, it causes false alarms. This thesis analyzes the test results and combines them with the teacher's manual confirmation to adjust the threshold dynamically.

The plagiarism detection model built from the above algorithms achieves an accuracy of 95.6%, which reduces the teacher's workload and helps deter plagiarism.
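
The abstract does not specify the fingerprinting algorithm or the threshold update rule, so the following is only an illustrative sketch of steps (1) and (4) under stated assumptions. It assumes C-style source code, a winnowing-like scheme (hash every k-gram, keep the minimum hash per sliding window), Jaccard similarity between fingerprint sets, and a simple hypothetical rule that nudges the threshold based on how many flagged pairs the teacher confirms. The function names, the parameters `k`, `window`, and `step`, and the 0.5 and 0.9 cut-offs are all invented for illustration and are not taken from the thesis.

```python
import hashlib
import re


def normalize(code):
    """Digitization (assumed): strip C-style comments and whitespace, lowercase."""
    code = re.sub(r"//[^\n]*|/\*.*?\*/", "", code, flags=re.S)
    return re.sub(r"\s+", "", code).lower()


def fingerprints(code, k=5, window=4):
    """Fingerprinting (assumed winnowing-like): min k-gram hash per window."""
    text = normalize(code)
    hashes = [
        int(hashlib.md5(text[i:i + k].encode()).hexdigest()[:8], 16)
        for i in range(len(text) - k + 1)
    ]
    if not hashes:
        return set()
    if len(hashes) <= window:
        return {min(hashes)}
    return {min(hashes[i:i + window]) for i in range(len(hashes) - window + 1)}


def similarity(a, b):
    """Similarity calculation (assumed): Jaccard index of fingerprint sets."""
    fa, fb = fingerprints(a), fingerprints(b)
    if not fa and not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)


def adjust_threshold(threshold, confirmations, step=0.02, low=0.5, high=0.95):
    """Hypothetical update rule driven by teacher review of flagged pairs.

    confirmations holds one bool per flagged pair: True if the teacher
    confirmed plagiarism, False if it was a false alarm.
    """
    if not confirmations:
        return threshold
    confirmed_rate = sum(confirmations) / len(confirmations)
    if confirmed_rate < 0.5:
        threshold += step   # many false alarms: be stricter
    elif confirmed_rate > 0.9:
        threshold -= step   # nearly all confirmed: cases may be slipping through
    return min(high, max(low, threshold))
```

In this sketch, two submissions that differ only in whitespace and comments normalize to the same text and therefore score 1.0. In the thesis, the similarity score is only one input: it feeds features such as "similarity exceeds threshold" into the Random Forest and Gradient Boosting Decision Tree classifiers rather than deciding plagiarism on its own.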
Keywords/Search Tags: OJ System, Digital Fingerprint, Random Forest, Gradient Boosting Decision Tree, Dynamic Adjustment Threshold