
Research On Machine Learning-Based Software Defect Identification

Posted on: 2022-05-20
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y W Zhang
GTID: 1488306326979879
Subject: Computer Science and Technology
Abstract/Summary:
Static analysis tools examine program code without running it and can therefore consider all possible program executions. Their widespread use is solid evidence that static program analysis is of great value to developers. Unfortunately, the problem of finding all relevant errors in an arbitrary program is undecidable. Static analysis tools that aim to be practical are therefore forced to approximate, trading off precision against performance. As a result, developers must manually sift through a plethora of alarms reported by these tools to separate true defects from false positives.

Recently, much research has applied machine learning techniques in this area, using classification algorithms to identify static analysis alarms automatically. By learning patterns of false positives that are difficult to observe with traditional static analysis approaches, machine learning techniques can greatly reduce the cost of manual inspection and improve the usability of static analysis tools in practice. Existing approaches usually rely on a set of hand-engineered features based on static code metrics. These features focus mainly on the statistical characteristics of the source code under analysis and presume that actionable and unactionable alarms have distinguishable statistical characteristics. However, empirical results indicate that these features lack precision in representing the deep syntactic structure of alarms and cannot distinguish alarms with different semantics. In addition, raising the accuracy of defect identification in new software projects is a major challenge, since it is difficult to build a machine learning classifier without labeled training instances. To address these limitations, this dissertation adopts machine learning techniques to improve the performance of software defect identification. The contributions are summarized as follows.

1. To bridge the gap between the syntactic structure of reported alarms and the features used for defect identification, we propose a set of fine-grained features for model building. Specifically, we first use a target-oriented path generation algorithm to generate paths from the defect-related source code and exclude irrelevant path nodes by computing a backward path slice. We then extract the def-use information of the related variables to produce feature vectors for building defect identification models. Experimental results show that the proposed fine-grained features better represent the structural information of the defect-related source code and can thus yield significant improvement in defect identification.

2. Deep learning-based methods have been widely applied to automated feature generation. Researchers have used neural networks to generate semantic representations of source code from the token vectors extracted from its abstract syntax tree (AST). However, existing methods simply convert the code snippets into vectors of token sequences with structural and contextual information preserved. Moreover, ASTs are usually large, and the neural network models are prone to the long-term dependency problem. To address these limitations, this dissertation first generates path sequences from the control-flow graph (CFG) of the defect-related source code, instead of working on entire ASTs, and maps the path sequences to token vectors that capture the lexical and syntactic knowledge derived from the relevant CFG and AST nodes. A word embedding algorithm then encodes the token vectors as meaningful real-valued vectors. Finally, a neural network architecture based on a self-attention mechanism automatically learns semantic features from the vectorized token sequences for model building. Experimental results show that the proposed self-attention-based architecture is effective in semantic feature generation and improves on the performance of the traditional neural network architecture in defect identification scenarios. In addition, the comparison results show that the model using the path-based semantic vector representation outperforms the traditional AST-based semantic vector representation approach.

3. One major problem in cross-project defect identification is the difference in feature distribution between source and target projects. To solve this problem, this dissertation proposes a feature-based transfer learning method with three parts: 1) a two-stage transfer learning framework based on feature ranking and matching; 2) the use of the path-based semantic feature representations for cross-project defect identification; 3) a transferable mixed-feature method that concatenates path-based hand-engineered features and semantic features. The proposed method alleviates the data discrepancy between source and target projects: it migrates the domain knowledge of the source project to target projects through feature transformation, improving the performance of cross-project defect identification. Extensive experiments on open-source project datasets show that all three proposed frameworks improve the performance of cross-project defect identification compared with the relevant baseline methods.

The proposed methods have been evaluated through extensive experiments on real-world open-source projects, and the experimental results verify their effectiveness. We therefore recommend these methods for automated software defect identification.
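
The backward path slicing and def-use extraction described in the first contribution can be illustrated with a minimal sketch. It is not the dissertation's algorithm: the `Stmt` type, the function names, the toy path, and the three-element feature vector are all hypothetical, and the slice follows data dependences only (control dependences are ignored). The last statement on the path is assumed to be the alarm site.

```python
from dataclasses import dataclass


@dataclass
class Stmt:
    """One node on a program path (hypothetical representation)."""
    text: str
    defs: frozenset = frozenset()  # variables written by the statement
    uses: frozenset = frozenset()  # variables read by the statement


def backward_path_slice(path):
    """Keep only statements whose definitions reach the alarm site.

    Assumption: the last statement of `path` is the alarm site.
    Only data dependences are followed.
    """
    alarm = path[-1]
    relevant = set(alarm.uses)
    kept = [alarm]
    for stmt in reversed(path[:-1]):
        if stmt.defs & relevant:
            # This definition feeds the alarm: its own inputs become relevant.
            relevant -= stmt.defs
            relevant |= stmt.uses
            kept.append(stmt)
    kept.reverse()
    return kept


def def_use_pairs(sliced):
    """Return (def_index, use_index, variable) triples within the slice."""
    last_def = {}
    pairs = []
    for idx, stmt in enumerate(sliced):
        for var in sorted(stmt.uses):
            if var in last_def:
                pairs.append((last_def[var], idx, var))
        for var in stmt.defs:
            last_def[var] = idx
    return pairs


def feature_vector(sliced):
    """Toy features: slice length, def-use pair count, distinct variables."""
    variables = set()
    for stmt in sliced:
        variables |= stmt.defs | stmt.uses
    return [len(sliced), len(def_use_pairs(sliced)), len(variables)]


# Hypothetical path ending at an array-index alarm on `buf[i]`.
example_path = [
    Stmt("n = read()", defs=frozenset({"n"})),
    Stmt("log = 0", defs=frozenset({"log"})),
    Stmt("i = n + 1", defs=frozenset({"i"}), uses=frozenset({"n"})),
    Stmt("log = log + 1", defs=frozenset({"log"}), uses=frozenset({"log"})),
    Stmt("buf[i] = 0", uses=frozenset({"buf", "i"})),
]

sliced = backward_path_slice(example_path)
print([s.text for s in sliced])
print(def_use_pairs(sliced))
print(feature_vector(sliced))
```

In this toy path the two statements that touch only `log` do not influence the alarm site, so the slice drops them before any features are computed. A real implementation would operate on CFG paths and would need to decide how to treat control dependences and aliasing.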
Keywords/Search Tags:software defect identification, machine learning, static program analysis, path analysis, source code representation