| Copying, pasting, and editing code is a common practice in software development. This reuse mechanism usually leaves many duplicated or similar code fragments, known as code clones, in the code base. Code reuse is convenient for developers, but it consumes additional resources and makes software harder to maintain. Since the 1990s, both academia and industry have analyzed code clones and studied methods for detecting them. As software products grow in size, the original methods hit resource bottlenecks and may not run on a single machine at all. For example, similarity comparison may have to be carried out across large-scale software systems developed by different companies and organizations. If a detection method cannot report code clones in time, the effectiveness of the code clone detection approach is significantly reduced. This paper discusses the basic principles and key technologies of traditional code clone detection methods and presents an index-based approach and a sequence-matching approach to code clone detection. The author confirms that the proposed methods are effective for code clone detection in large-scale software systems. The main work includes: (1) An index-based code clone detection method. It normalizes source code into lexeme sequences and statement segments, and then uses the hash values of the segments to find cloned code. Because the lexeme sequences are persisted during the detection process, the method avoids repeatedly regenerating intermediate sequences and segments, which significantly improves detection speed. (2) An improved Smith-Waterman algorithm that detects code clones on a lexeme sequence. By adjusting the score matrix, the method effectively solves the mosaic problem that occurs in the conventional Smith-Waterman algorithm, and it improves the backtracking method to obtain the best locally similar code sequence. (3) An experiment on four kinds of software written in Java. The results show that, compared with the traditional approach, the index-based approach improves detection efficiency without sacrificing precision or recall. Meanwhile, compared with the traditional Smith-Waterman algorithm, the improved one increases precision by about 1% and recall by about 2%. |
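
The index-based idea in contribution (1) can be illustrated with a minimal sketch. It is not the paper's implementation: the keyword set, the tokenizer regex, the `ID`/`NUM` placeholders, the window size, and the function names `normalize`, `build_index`, and `clone_groups` are all assumptions chosen for illustration. Each statement is normalized to a lexeme string, every window of consecutive statements is hashed, and windows that share a hash are reported as clone candidates.

```python
import hashlib
import re
from collections import defaultdict

# Hypothetical, deliberately tiny keyword set; a real lexer would cover the full language.
KEYWORDS = {"if", "else", "for", "while", "return", "int", "new", "class"}
# Identifiers/keywords, integer literals, or any single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def normalize(statement):
    """Map a statement to a lexeme string, abstracting identifiers and numbers."""
    lexemes = []
    for tok in TOKEN_RE.findall(statement):
        if tok in KEYWORDS:
            lexemes.append(tok)
        elif tok[0].isalpha() or tok[0] == "_":
            lexemes.append("ID")
        elif tok[0].isdigit():
            lexemes.append("NUM")
        else:
            lexemes.append(tok)
    return " ".join(lexemes)


def build_index(files, window=3):
    """Index every window of `window` normalized statements by its hash."""
    index = defaultdict(list)
    for name, statements in files.items():
        norm = [normalize(s) for s in statements]
        for i in range(len(norm) - window + 1):
            segment = "\n".join(norm[i:i + window])
            digest = hashlib.sha256(segment.encode("utf-8")).hexdigest()
            index[digest].append((name, i))
    return index


def clone_groups(index):
    """Return the location lists of segments that occur more than once."""
    return [locations for locations in index.values() if len(locations) > 1]


if __name__ == "__main__":
    files = {
        "A.java": ["int a = 1;", "a = a + b;", "return a;"],
        "B.java": ["int x = 2;", "x = x + y;", "return x;", "foo();"],
    }
    for group in clone_groups(build_index(files)):
        print(group)
```

Because the index maps hashes to locations, it could in principle be stored and extended as new files arrive, so already processed files would not need to be normalized again; that persistence step is left out of this sketch.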
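
For reference, the conventional Smith-Waterman local alignment that contribution (2) starts from can be sketched over lexeme sequences as follows. This is the textbook baseline only; the adjusted score matrix and the improved backtracking described in the abstract are not reproduced here, and the scoring values (`match=2`, `mismatch=-1`, `gap=-1`) and the function name `smith_waterman` are assumptions.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Conventional Smith-Waterman local alignment of two token lists.

    Returns (best_score, aligned_a, aligned_b); None marks a gap.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the score matrix; scores never drop below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub,  # match or mismatch
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Backtrack from the best cell until a zero score is reached.
    i, j = best_pos
    aligned_a, aligned_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            aligned_a.append(a[i - 1])
            aligned_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1])
            aligned_b.append(None)
            i -= 1
        else:
            aligned_a.append(None)
            aligned_b.append(b[j - 1])
            j -= 1
    return best, aligned_a[::-1], aligned_b[::-1]


if __name__ == "__main__":
    left = ["ID", "=", "ID", "+", "NUM", ";"]
    right = ["return", "ID", "=", "ID", "+", "NUM", ";", "}"]
    print(smith_waterman(left, right))
```

In this baseline, the aligned region returned by the backtracking step is the candidate clone; the quadratic score matrix is also what makes the approach costly on long lexeme sequences.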