Font Size: a A A

Research On Program Recognition Approach Based On Structural Semantic Similarity

Posted on:2010-06-09Degree:DoctorType:Dissertation
Country:ChinaCandidate:T T WangFull Text:PDF
GTID:1118360302465527Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Program recognition recognizes the intention of a given program, by analyzing this program with a known pattern. It has wide application such as program comprehension, software system analysis, compiler optimizing, duplicated code detection, and bug detection.This dissertation proposes a structural semantic similar program recognition approach. The major original works in the research are listed in details as follows.Code variations are widely believed to impede program analysis. To solve this problem, a program normalization approach based on system dependence graph is proposed. The traditional representation of system dependence graph is augmented to sufficiently represent the syntactical structure and the semantic of programs. Existing pointer analysis algorithms usually adopt a low-level intermediate representation which can not sufficiently represent the syntactical structure and the semantic of programs. This makes them difficult to be applied to program normalization. To solve this problem, a flow-sensitive and context-sensitive pointer analysis algorithm based on control dependence tree is presented. The control dependence tree is used as the intermediate representation for the source program, and an improved point-to representation is proposed to represent alias information. Based on this, data flow equations are defined to compute the point-to information by traversing the control dependence tree. The point-to information can be directly used in program normalization to improve the accuracy of data flow analysis. Finally, the augmented system dependence graph, the pointer analysis, and the program normalization process are combined to form the model of program normalization. Semantic-preserving transformations are performed on the system dependence graphs of programs according to a set of predefined rules. As a result, code variations are removed. Experimental results show that the accuracy of our pointer analysis approach is higher than that of Emami's approach, and our pointer analysis approach can greatly improve the variation removal rate of the program normalization. The variation remove rate of our normalization approach is higher than that of the normalization approach proposed by Hattori and Ishi. Our normalization approach establishes a good framework for testing the semantic equivalence of source codes, and it can facilitate program analysis.The approach of extracting sub-programs based on graphs is high in space and time complexity. To solve this problem, a candidate sub-program extraction approach based on structural metrics similarity is proposed. Based on the existing theorems and corollaries of tree similarity and editing distance, new corollaries are proposed to apply the vector clustering to the candidate sub-programs extraction. First, source codes are represented as control dependence trees, and basic code normalization is performed to eliminate basic code variations that affect the calculation of metrics. Then, vectors are computed to describe the structural information of source codes, and the difficult graph similarity problem is reduced to a simpler vector clustering problem. Candidate sub-programs are quickly extracted. Experimental results show that most of the dissimilar sub-programs can be filtered out, and sub-programs with code variations can be recognized. This approach lays a good foundation for large-sized program recognition on the semantic level.Most of the traditional program recognition approach can not analyze code on the semantic level and can not evaluate the semantic similarity quantitatively. To solve this problem, a program recognition approach based on system dependence graph on the semantic level is proposed. Based on this approach, an automatic grading approach based on program transformation and semantic analysis is proposed. It can conquer the problem of the traditional automatic grading approaches which do not take into account how a student program answers a given programming task. Experimental results show that the program recognition approach based on system dependence graph can well solve the problem of the representation and accuracy of the recognition results.The graph-based program recognition approach can well recognize structurally and semantically similar programs, but it has a high computation complexity. On the contrary, the metrics-based program recogniton approach has a lower computation complexity but a lower precision. To take advantages of these two approaches, a metrics-based and graph-based combined approach for program recognition is proposed based on the above researches. First, source codes are represented as augmented system dependence graphs. Then, the candidate sub-program extraction based on structural metrics similarity according to a granularity of module is performed to filter out most of the dissimilar code pairs so as to lower the computational complexity. After that, advanced code normalization is performed on the candidate sub-programs to remove code variations so as to recognize similar sub-programs on the semantic level. Finally, program matching is performed on the normalized control dependence trees to output semantically similar sub-programs. A semantically similar code detection approach is proposed based on this approach, to provide a new insight into similar code detection. Experimental results show that this approach can be applied to large softwares, and it can recognize not only identical sub-programs but also sub-programs with code variations. The precision of this approach is higher than those of GPlag and integrated logic-domain model.In conclusion, this dissertation has proposed a structural semantic similar program recognition approach, and solved the following key problems: code variation recognition, the representation and precision of the results, the scalability.
Keywords/Search Tags:program recognition, semantic similar, program normalization, program matching, system dependence graph
PDF Full Text Request
Related items