Research on a Source Code Plagiarism Detection Method Based on N-gram

Posted on: 2013-08-07
Degree: Master
Type: Thesis
Country: China
Candidate: F Wu
GTID: 2248330371471103
Subject: Computer software and theory
Abstract/Summary:
The rapid development of information networks and the widespread use of electronic documents have had an enormous influence on daily life. Some of that influence is beneficial, but some of it harms both daily life and the development of technology itself. Compared with traditional paper files, electronic documents are far easier to copy illegally and to plagiarize. Source code plagiarism has become a serious problem in both computer science education and software development. Research on source code plagiarism detection technology and its applications therefore has real practical value: it helps maintain normal teaching order, protect software copyrights, and curb the spread of plagiarism. Current research shows that source code plagiarism detection receives a great deal of attention and that the best-known detection systems are JPlag, YAP3, and MOSS. However, the accuracy of these systems still needs to be improved, and they have difficulty handling large-scale datasets.
In 1995, M. Damashek used N-grams to represent texts and used the generated N-grams to measure the similarity between texts. His experiments showed that representing text as N-grams helps improve the accuracy of text similarity measurement. Source code plagiarism detection is, in essence, a similarity calculation between source code texts. Compared with natural language text, source code text carries more structural characteristics, so we propose a source code plagiarism detection method based on N-grams: we convert each source code text into a set of N-grams, and then use that set together with the N-gram frequencies to measure the similarity between source code texts. We use 5136 VB source code files as the dataset and run both our method and MOSS on it to detect plagiarism. The experimental results show that the accuracy of our method is higher than that of MOSS.
The Fork/Join parallel computing framework has a good thread control mechanism. It handles starvation, contention, and deadlock between threads well, and it is suited to problems that are composed of a large number of small tasks. Our N-gram-based source code plagiarism detection method needs about n*(n+m-1)/2 similarity calculations, and each calculation is independent of the others, so the Fork/Join framework is a good tool for improving the efficiency of our method. We therefore split the n*(n+m-1)/2 similarity calculations into n sub-tasks, use the Fork/Join framework to run them in parallel, and combine the results of all the sub-tasks. The experimental results show that the efficiency of our method is improved significantly and that the method can handle a large-scale dataset.
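The two ideas in the abstract, N-gram frequency similarity and splitting the pairwise comparisons into n Fork/Join sub-tasks, can be illustrated with a minimal Java sketch. It is not the thesis implementation. The abstract does not give the similarity formula, the N-gram unit, or the meaning of m, so the sketch assumes character N-grams, cosine similarity over frequency profiles, and a plain all-pairs comparison of n files (n*(n-1)/2 calculations). All class and method names (`NGramPlagiarismSketch`, `profile`, `cosine`, `RowTask`, `AllPairsTask`, `compareAll`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

class NGramPlagiarismSketch {

    // Build a frequency profile of character N-grams (assumed unit) of length n.
    static Map<String, Integer> profile(String text, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= text.length(); i++) {
            counts.merge(text.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    // Cosine similarity of two profiles (assumed measure, not taken from the thesis).
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += (double) e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) dot += (double) e.getValue() * other;
        }
        for (int v : b.values()) normB += (double) v * v;
        if (normA == 0 || normB == 0) return 0.0; // text shorter than n
        return dot / Math.sqrt(normA * normB);
    }

    // Sub-task i: compare file i with every later file j > i.
    static class RowTask extends RecursiveTask<double[]> {
        private final List<Map<String, Integer>> profiles;
        private final int i;

        RowTask(List<Map<String, Integer>> profiles, int i) {
            this.profiles = profiles;
            this.i = i;
        }

        @Override
        protected double[] compute() {
            double[] row = new double[profiles.size() - i - 1];
            for (int j = i + 1; j < profiles.size(); j++) {
                row[j - i - 1] = cosine(profiles.get(i), profiles.get(j));
            }
            return row;
        }
    }

    // Coordinator: forks the n sub-tasks, waits for them, and collects the rows.
    static class AllPairsTask extends RecursiveTask<double[][]> {
        private final List<Map<String, Integer>> profiles;

        AllPairsTask(List<Map<String, Integer>> profiles) {
            this.profiles = profiles;
        }

        @Override
        protected double[][] compute() {
            List<RowTask> tasks = new ArrayList<>();
            for (int i = 0; i < profiles.size(); i++) {
                tasks.add(new RowTask(profiles, i));
            }
            invokeAll(tasks); // runs all sub-tasks in parallel and waits
            double[][] result = new double[profiles.size()][];
            for (int i = 0; i < tasks.size(); i++) {
                result[i] = tasks.get(i).join();
            }
            return result;
        }
    }

    // result[i][k] is the similarity between file i and file i + 1 + k.
    static double[][] compareAll(List<String> sources, int n) {
        List<Map<String, Integer>> profiles = new ArrayList<>();
        for (String s : sources) profiles.add(profile(s, n));
        return ForkJoinPool.commonPool().invoke(new AllPairsTask(profiles));
    }

    public static void main(String[] args) {
        List<String> sources = Arrays.asList(
                "Dim x As Integer", "Dim x As Integer", "Print 42");
        System.out.println(Arrays.deepToString(compareAll(sources, 3)));
    }
}
```

Giving each file its own sub-task mirrors the abstract's split into n sub-tasks, but the rows shrink as i grows, so the load per task is uneven. The work-stealing scheduler of the Fork/Join pool is what would be expected to even this out across threads.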
Keywords/Search Tags: Source Code Plagiarism Detection, Similarity Measure, N-gram, Parallel Computing, Fork/Join Framework