Research And Application Of Duplicated Code Detecting

Posted on: 2008-04-22
Degree: Master
Type: Thesis
Country: China
Candidate: X Liu
Full Text: PDF
GTID: 2178360245997870
Subject: Computer Science and Technology

Abstract/Summary:
Recent studies have shown that large software suites contain significant amounts of replicated code, most of which is due to copy-and-paste activity. Duplicated code not only reduces the maintainability of the software but is also prone to introducing a significant proportion of bugs into systems. However, existing tools for detecting copy-pasted code are neither scalable to large software suites nor robust enough to detect replicated code that has been modified with insertions and deletions.

After comparing the advantages and disadvantages of various techniques for detecting copy-pasted code, this paper adopts token-based analysis and introduces data-mining techniques to implement a copy-pasted code detection model. The model first builds a sequence database by parsing the source code, which converts the copy-paste detection problem into a frequent subsequence mining problem. It then uses an enhanced version of the CloSpan algorithm to find frequent subsequences with a support value of at least 2, which correspond to code segments that appear in the program at least twice. Finally, it improves the detection result by pruning false positives that are unlikely to be real copy-pasted code and by composing larger copy-pasted segments.

Compared with other methods, token matching and CloSpan have lower computational complexity, so the model consumes less memory and time when analyzing large-scale software code. The model maps all identifiers of a similar type to the same value, regardless of their real names. By doing this, it can detect renamed copy-pasted segments.
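The identifier mapping described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the keyword set, the regular expression, and the placeholder values `ID` and `NUM` are assumptions chosen only for illustration, and a real tokenizer for C would be considerably more complete.

```python
import re

# Hypothetical, deliberately tiny keyword set (assumption for this sketch).
KEYWORDS = {"if", "else", "for", "while", "return", "int", "char", "void"}

# Identifier, integer literal, or any single non-whitespace character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def normalize(statement):
    """Return the statement as a token tuple with names and literals abstracted."""
    out = []
    for tok in TOKEN_RE.findall(statement):
        if tok in KEYWORDS:
            out.append(tok)            # keywords keep their identity
        elif re.fullmatch(r"[A-Za-z_]\w*", tok):
            out.append("ID")           # every identifier maps to one value
        elif tok.isdigit():
            out.append("NUM")          # every integer literal maps to one value
        else:
            out.append(tok)            # operators and punctuation are kept
    return tuple(out)


original = "for (i = 0; i < n; i++) total = total + buf[i];"
renamed = "for (k = 0; k < len; k++) sum = sum + data[k];"

# The two statements differ only in names, so their normalized forms match.
print(normalize(original) == normalize(renamed))  # True
```

Because both statements reduce to the same token tuple, a hash of that tuple could serve as the statement's element in the sequence database, which is how a renamed copy would come to share elements with its original.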
Because a frequent subsequence need not be contiguous in its supporting sequences, the model can, once a gap threshold is set, correctly identify copy-pasted code that has been modified with insertions and deletions.

The experimental results indicate that the model takes less than 40 minutes to identify 3,000 copy-pasted segments in Httpd 2.2.2 (280K lines), including segments that have been modified by renaming, insertions, and deletions.
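The effect of the gap threshold can be sketched in Python as follows. This is only the support-counting check for a single candidate pattern, not CloSpan or the enhanced algorithm of the thesis; the function names, the integer statement codes, and the example database are hypothetical.

```python
def occurs_with_gap(pattern, seq, max_gap):
    """True if pattern is a subsequence of seq with at most max_gap
    skipped items between any two consecutive matched items."""
    if not pattern:
        return True
    # Indices in seq where the prefix matched so far can end.
    ends = {i for i, item in enumerate(seq) if item == pattern[0]}
    for want in pattern[1:]:
        ends = {
            j
            for i in ends
            for j in range(i + 1, min(i + 2 + max_gap, len(seq)))
            if seq[j] == want
        }
        if not ends:
            return False
    return bool(ends)


def support(pattern, database, max_gap):
    """Number of sequences in the database that contain the pattern."""
    return sum(occurs_with_gap(pattern, seq, max_gap) for seq in database)


# Each integer stands for one normalized statement (hypothetical codes).
database = [
    [11, 22, 33, 44],      # original segment
    [11, 22, 99, 33, 44],  # pasted copy with one inserted statement (99)
    [55, 66, 77],          # unrelated code
]
pattern = [11, 22, 33, 44]

print(support(pattern, database, max_gap=0))  # 1
print(support(pattern, database, max_gap=1))  # 2
```

With a gap of 0 only the original segment supports the pattern, so it falls below the minimum support of 2. Allowing one skipped statement lets the modified copy count as well, and the pattern is then frequent enough to be considered a copy-paste candidate.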
Keywords/Search Tags: duplicated code, copy-pasted, token-based, data-mining