Research And Application Of Duplicated Code Detecting

Posted on:2008-04-22

Degree:Master

Type:Thesis

Country:China

Candidate:X Liu

GTID:2178360245997870

Subject:Computer Science and Technology

Abstract/Summary:

Recent studies have shown that large software suites contain significant amounts of replicated code, most of it produced by copy-and-paste. Duplicated code not only reduces the maintainability of the software but also tends to introduce a significant proportion of bugs. However, existing copy-paste detection tools are neither scalable to large software suites nor robust enough to detect replicated code that has been modified with insertions and deletions. After comparing the advantages and disadvantages of various techniques for detecting copy-pasted code, this thesis chooses a token-based analysis method and introduces data-mining techniques to implement a copy-pasted code detection model. The model first builds a sequence database by parsing the source code, which converts the copy-paste detection problem into a frequent subsequence mining problem. It then uses an enhanced version of the CloSpan algorithm to find frequent subsequences with a support value of at least 2, which correspond to code segments that appear in the program at least twice. Finally, it improves the detection result by pruning false positives that are unlikely to be real copy-pasted code and by composing larger copy-pasted segments. Compared with other methods, token matching and CloSpan have lower computational complexity, so the model consumes less memory and time when analyzing large-scale software. The model maps all identifiers of the same type to the same value, regardless of their actual names, which allows it to detect copy-pasted segments whose identifiers have been renamed.
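The identifier mapping and the sequence database described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the regular-expression tokenizer, the small keyword set, the placeholder values ID and NUM, and the helper names normalize and build_database are all assumptions made for illustration, and a real tool would use a proper lexer for the target language.

```python
import re

# Assumed, simplified tokenizer: identifiers/keywords, integer literals,
# or any single non-space character (operators and punctuation).
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")
IDENT_RE = re.compile(r"[A-Za-z_]\w*")

# Hypothetical keyword subset; a real tool would use the full keyword list.
KEYWORDS = {"if", "else", "for", "while", "return", "int", "char", "void"}


def normalize(line):
    """Turn one statement into a token tuple with names and literals abstracted."""
    out = []
    for tok in TOKEN_RE.findall(line):
        if tok in KEYWORDS:
            out.append(tok)      # keywords keep their identity
        elif IDENT_RE.fullmatch(tok):
            out.append("ID")     # every identifier maps to the same value
        elif tok.isdigit():
            out.append("NUM")    # every integer literal maps to the same value
        else:
            out.append(tok)      # operators and punctuation are kept as-is
    return tuple(out)


def build_database(blocks):
    """Map each block of statements to a sequence of integer statement ids."""
    ids = {}
    database = []
    for block in blocks:
        seq = []
        for line in block:
            key = normalize(line)
            # Statements with the same normalized form share one id.
            seq.append(ids.setdefault(key, len(ids)))
        database.append(seq)
    return database


blocks = [
    ["int total = count + 1;", "return total;"],
    ["int sum = n + 2;", "return sum;"],
]
database = build_database(blocks)
```

In this sketch the two blocks differ only in identifier names and literal values, so they produce identical integer sequences, which is what lets a subsequence miner treat renamed copies as the same pattern.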
A frequent subsequence may be interleaved with other items in its supporting sequences, so once a gap threshold is set the model can correctly identify copy-pasted code that has been modified with insertions and deletions. Experimental results indicate that the model takes less than 40 minutes to identify 3,000 copy-pasted segments in Httpd 2.2.2 (280K lines), including segments modified by renaming, insertions, and deletions.
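The effect of the gap threshold can be shown with a small Python sketch. It covers only the support check for a single candidate pattern and is not CloSpan or the enhanced algorithm used in the thesis. The function names occurs_with_gap and support, the max_gap parameter, and the choice to measure a gap as the number of items skipped between two consecutive matched items are assumptions made for illustration.

```python
def occurs_with_gap(pattern, seq, max_gap):
    """Return True if pattern is a subsequence of seq in which at most
    max_gap items are skipped between consecutive matched items."""
    if not pattern:
        return True
    # Positions in seq where the matched prefix of the pattern can end.
    ends = [i for i, item in enumerate(seq) if item == pattern[0]]
    for want in pattern[1:]:
        nxt = set()
        for e in ends:
            # The next match must fall within max_gap skipped items of e.
            for j in range(e + 1, min(e + 2 + max_gap, len(seq))):
                if seq[j] == want:
                    nxt.add(j)
        if not nxt:
            return False
        ends = sorted(nxt)
    return bool(ends)


def support(pattern, database, max_gap):
    """Count the sequences in the database that contain the pattern."""
    return sum(occurs_with_gap(pattern, seq, max_gap) for seq in database)


# Statement-id sequences: the second and third contain inserted statements (9).
database = [[1, 2, 3], [1, 9, 2, 3], [1, 9, 9, 2, 3]]
pattern = [1, 2, 3]
strict = support(pattern, database, max_gap=0)
relaxed = support(pattern, database, max_gap=1)
```

With max_gap set to 0 only the exact copy supports the pattern, so it falls below a minimum support of 2. Allowing one skipped item brings in the copy with a single inserted statement, and the pattern becomes frequent.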

Keywords/Search Tags:

duplicated code, copy-pasted, token-based, data-mining

Related items

1	Research On Duplicated Code Detection And Automatic Refactoring
2	Research On NLP-Based Duplicated Web Pages Deletion Algorithm
3	Research Of Large-scale Text Collection Duplicated Deletion
4	Research On Detection Technology Of Duplicated Code
5	The Research And Application Of Duplicated Records And Incomplete Data's Cleaning Approach
6	An Integrated Framework For Constraint-Based Mining Of Source Code
7	Research On Background Model And Score Issues For Speaker Recognition
8	Research On Removing Duplicated WebPages Algorithm Of Search Engine Based On Content
9	Source Code Based Suspicious Code And Bad Programming Practice Detecting
10	AUPLearner:A Code-context-sensitive Self-updating API Recommendation Approach