Digital Fingerprint-Based Clone Detection Technology for C Programs

Posted on: 2012-04-03
Degree: Master
Type: Thesis
Country: China
Candidate: L L Huang
GTID: 2208330335486322
Subject: Computer system architecture
Abstract/Summary:
With the development of science and technology and the increasing automation of education, programming has become a routine activity. Colleges and universities, as the cradle of software developers, offer a series of programming courses. Because electronic documents are easy to copy, reducing or preventing the copying or cloning of documents has long been one of teachers' main concerns, and it has consequently become an active research topic. To assess students' attitudes toward a programming course and what they have actually learned in class, a code similarity detection tool is urgently needed to judge whether plagiarism exists among student programming assignments.

This thesis first surveys and analyzes the currently available code similarity detection techniques. Because the features of source code are sparse, it is difficult to select features effectively during code similarity detection. To overcome this drawback, the thesis proposes a digital-fingerprint-based method for detecting plagiarism in C code. The method has six major steps. First, both source files are preprocessed by deleting comments, macro directives, and other content irrelevant to the code's semantics. Second, the preprocessed code is tokenized, that is, a space is inserted between adjacent words of different types. Third, the tokenized code is normalized: keywords denoting data types and user-defined identifiers are each replaced with normalized words, and the spaces between adjacent words are deleted, producing the normalized code string. Fourth, digital fingerprint technology transforms the normalized code string into a sequence of numerical values. Fifth, invalid values are eliminated from the sequence, and some of the remaining valid values are selected, using the Lowest Hash Value strategy, as the fingerprint that represents the source code. Sixth, the similarity between the two fingerprint sequences is calculated to express the degree of similarity between the two source files.

The method was developed on an experimental system for digital-fingerprint-based C code similarity detection, which was used to study the key aspects of digital fingerprinting, including fingerprint granularity and fingerprint selection strategies. On this basis, a C code plagiarism detection system was implemented. Experiments show that the method is simple and easy to understand and that it effectively improves overall computation speed. Because fingerprints are selected only after invalid values have been eliminated, the detection results are more reliable and the probability of misjudgment is lower. The method can identify many means of disguising code plagiarism, including modifying comments, changing the layout, renaming identifiers, and replacing data types.
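
The six steps described above can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the thesis's implementation. The abstract does not give the keyword lists, the placeholder words, the fingerprint granularity, the hash function, the definition of an "invalid" value, or the similarity measure. The sketch therefore assumes a small keyword set, the placeholders `T` and `V`, 8-character windows hashed with truncated MD5, duplicate hashes standing in for "invalid" values, the `n` smallest hashes as one reading of the Lowest Hash Value strategy, and Jaccard similarity. All function and parameter names are hypothetical.

```python
import hashlib
import re

# Assumed, simplified keyword sets; the thesis's actual lists are not given.
TYPE_KEYWORDS = {"int", "long", "short", "char", "float", "double", "unsigned", "signed"}
OTHER_KEYWORDS = {"if", "else", "for", "while", "do", "return", "break", "continue",
                  "switch", "case", "default", "struct", "void", "sizeof"}

def preprocess(src):
    """Step 1: drop comments and preprocessor lines (string literals are not special-cased)."""
    src = re.sub(r"/\*.*?\*/", " ", src, flags=re.S)  # block comments
    src = re.sub(r"//[^\n]*", " ", src)               # line comments
    return "\n".join(l for l in src.splitlines() if not l.lstrip().startswith("#"))

def tokenize(src):
    """Step 2: split into identifiers/keywords, numbers, and single punctuation characters."""
    return re.findall(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\S", src)

def normalize(tokens):
    """Step 3: replace type keywords with 'T' and other identifiers with 'V', then join."""
    out = []
    for tok in tokens:
        if tok in TYPE_KEYWORDS:
            out.append("T")
        elif re.fullmatch(r"[A-Za-z_]\w*", tok) and tok not in OTHER_KEYWORDS:
            out.append("V")
        else:
            out.append(tok)
    return "".join(out)

def kgram_hashes(text, k=8):
    """Step 4: hash every k-character window to a 32-bit integer (assumed granularity k)."""
    return [int.from_bytes(hashlib.md5(text[i:i + k].encode()).digest()[:4], "big")
            for i in range(max(len(text) - k + 1, 1))]

def select_fingerprint(hashes, n=32):
    """Step 5: drop duplicates (assumed stand-in for 'invalid' values), keep the n smallest."""
    return set(sorted(set(hashes))[:n])

def similarity(fp_a, fp_b):
    """Step 6: Jaccard similarity of two fingerprint sets (assumed measure)."""
    union = fp_a | fp_b
    return len(fp_a & fp_b) / len(union) if union else 1.0

def fingerprint(src, k=8, n=32):
    """Run steps 1 to 5 on one source file."""
    return select_fingerprint(kgram_hashes(normalize(tokenize(preprocess(src))), k), n)

original = """#include <stdio.h>
/* sum of 1..n */
int sum(int n) {
    int s = 0;
    for (int i = 1; i <= n; i++) s += i;
    return s;
}
"""

disguised = """// totally my own work
long total(long limit)
{
    long acc = 0;
    for (long k = 1; k <= limit; k++)
        acc += k;
    return acc;
}
"""

score = similarity(fingerprint(original), fingerprint(disguised))
```

In this example `disguised` differs from `original` only in its comments, layout, identifier names, and data types, which are the disguises the abstract lists. Both files normalize to the same string, so `score` is 1.0. Real submissions would need a fuller keyword list and proper handling of string literals.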
Keywords/Search Tags: Digital Fingerprint, Program Plagiarism, Plagiarism Detection, Similarity Detection, Program Detection