Font Size: a A A

Research On Software Debugging Techniques Based On The Combination Of Machine Learning And Program Analysis

Posted on:2014-01-04Degree:DoctorType:Dissertation
Country:ChinaCandidate:C ZhangFull Text:PDF
GTID:1228330392960338Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
A crucial problem in software engineering is how to improve thecorrectness of programs. Due to the intrinsic complexity of software systemsand their bugs, currently there does not exist a universal approach that canensure the absolute correctness of programs. When bugs cause runtimefailures, they usually need to be localized and fixed via program debugging.Real-world program debugging requires considerable human effort and isconducted in an interactive and iterative style. Over50%of the softwaredevelopment and maintenance cost is taken by program debugging. Therefore,improvement of the effectiveness of program debugging techniques is helpfulto boost the program correctness and reduce the cost of software developmentand maintenance.Regarding the importance of program debugging and the relatively loweffectiveness of existing debugging techniques, this thesis has proposed aseries of novel techniques based on the combination of program analysis andmachine learning, in order to improve the effectiveness and usability ofdebugging techniques. The proposed techniques aim to support the wholeprocess of realistic interactive and iterative debugging scenarios. Taking intoaccount the characteristics of different debugging phases, this thesis focuseson the research of static bug detection, static and dynamic program slicing,interactive debugging based on logs and breakpoints and automatic coderecommendation.Aiming at the problem of high false positive rate in static bug detection,the thesis proposes the approach of refining and extending the control flow graph, making it more suitable for screening and prioritizing static warningsso that debugging developers can get more accurate bug warnings at thebeginning of their debugging activities. In the refinement of control flowgraph, the thesis is the first to invent an infeasible path detection techniquevia mining branch correlations using machine learning. The technique gathersdynamic branch data via program instrumentation, mines potential branchcorrelations using association rule learning, and checks the feasibility of eachprogram path by evaluating its consistency with the potential correlations.The technique does not need to analyze complex path conditions and thus issuitable for large-scale programs which contain complex branch predicatesand able to detect infeasible paths that can be hard to detect for existingapproaches. In the extension of control flow graph, the thesis focuses onprobabilistic control flow graphs. Aiming at the inability of existingapproaches to handle virtual function calls in object-oriented programs, thethesis is the first to propose the technique, named Festival, to estimatedynamic call frequencies of virtual functions based on static program features.Using a set of training programs, Festival extracts feature values thatrepresenting design intensions, obtains call frequencies of virtual functions,and builds an artificial neural network model based on the two kinds of data,in order to describe the relations between static features and dynamicfrequencies. For a test program whose virtual function frequencies are to beestimated, Festival only needs to extract its static features and feed them tothe model as input. Then the output of the model is used to represent thefrequency estimation. Besides remedying the short-coming of existingapproaches on handling object-oriented programs, Festival does not requireexecuting test programs and thus is independent of the quality of input data,making it free of the common limitation of dynamic techniques.To aid interactive debugging, this thesis proposes automatic techniquesfor analyzing, refining, and generating logs and breakpoints, in order toalleviate the problem of these commonly used debugging tools in costing toomuch human effort. In log analysis, this thesis has identified two major problems in the use of logs, namely1) too much redundant logs and2)missing critical logs, and is the first to propose the log slicing technique basedon the combination of dynamic information extracted from logs andtraditional static program slicing. In addition, the thesis proposes logrefinement based on log slicing. Log slicing is able to prune redundant logs aswell as compacting program slices. Log refinement automatically choosesprogram locations to insert new logging statements which can remedy thecritical information missed in the existing logs. Moreover, aiming at theproblem of high computational complexity of program slicing which makes ithardly usable in interactive debugging, this thesis proposes to split the processof program slicing into two phases: offline phase and online phase. In theoffline phase various static analyses are performed and their results are stored,while in the online phase program slices are computed in real time based onthe results of offline analyses. As a result, the wait time of debuggingdevelopers can be shortened significantly, enabling the application of logslicing in interactive debugging. Meanwhile, the technique uses incrementaland demand-driven data flow analysis algorithms to ensure the consistencybetween offline analysis results and the latest programs. Besides log analysis,this thesis proposes an automatic breakpoint generation technique based onthe synergy of nearest neighbor queries, dynamic program slicing andmemory graph analysis techniques, which can save the manual effort to setbreakpoints. In the proposed technique, the nearest neighbor query methodand dynamic program slicing are combined to select breakpoint locationsautomatically. In addition, the idea of generating breakpoint conditions usingmemory graph analysis at particular locations and opportunities is firstlyinvented by this thesis. The automatically generated breakpoints can shedlight on statements and program states relevant to bugs, providing debuggingdevelopers with comprehensive auxiliary information in their familiarrepresentation.In bug fixing, this thesis proposes to solve the problems of loweffectiveness of bug fixing caused by programmers’ unfamiliarity with program code, using real-time code recommendation. The thesis is the first topoint out the importance of recommending API parameters and propose asolution, named Precise, based on the combination of program analysis anddata mining. Based on data analysis on large-scale real-world programs, thethesis has proposed a group of empirical rules which effectively restricts thesearch space of candidate parameters, making Precise infeasible and practical.By analyzing, abstracting, and then transforming the code of trainingprograms, Precise builds a parameter usage instance database. For eachrequest of parameter recommendation, Precise uses the context of request toperform nearest neighbor queries to retrieve, from the database, the abstractparameter usage patterns in similar contexts, concretizes them, and providesadaptive parameter recommendations in real time. Because API parameterselection is an essential part of using APIs and the existing coderecommendation techniques only focus on recommending API methods,Precise has filled a gap in the field of code recommendation. Moreover, theempirical results obtained in the research on Precise provide valuablereference information for subsequent related works.
Keywords/Search Tags:program debugging, fault localization, bug fixing, programanalysis, control flow analysis, data flow analysis, program slicing, machinelearning, data mining, nearest neighbor query, log analysis, recommendationsystem, code completion, API parameter
PDF Full Text Request
Related items