The Study Of Generalized Semi-supervised Learning Based Software Quality Estimation

Posted on: 2011-11-23
Degree: Doctor
Type: Dissertation
Country: China
Candidate: P Huang
GTID: 1118360305456798
Subject: Electronic Science and Technology
Abstract/Summary:
Software quality prediction is a method of controlling and predicting software quality in the early phase of software development. Its main objective is to estimate potential defects in the software using machine learning or statistical analysis methods. Correctly predicting fault-prone software modules during development and testing helps software designers allocate resources, reduce cost and shorten the development cycle. An effective software quality prediction system is therefore valuable for improving product quality and the reputation of the organization.

Traditional software quality estimation systems are mostly based on supervised learning, which requires a label for every training instance. Reliable labels for software modules come from thorough testing and accurate location of defects. Hence, obtaining these labels is costly and time-consuming, and their credibility may also be weakened by many practical issues, which limits the wide adoption of software quality models. To overcome these drawbacks, this dissertation studies several novel semi-supervised learning methods, which use both labeled and unlabeled data for training, and discusses how to construct effective software quality models using relatively few labels. The methods investigated include multi-instance learning, structured kernel methods and active learning. Studies on semi-supervised learning for software quality estimation are few and, to the best of our knowledge, no other research applying these three learning methods to software quality estimation has been reported in the literature.

Firstly, before introducing the new semi-supervised learning methods for software quality estimation, this dissertation surveys traditional software quality prediction systems. Quality prediction systems can generally be divided into four main components, i.e. 
dataset construction, model training, model evaluation and algorithm comparison. The details of the four components and related work are addressed. In dataset construction, the system preprocesses the raw data and then splits the whole dataset into training and testing subsets. The system then chooses a concrete algorithm, trains prediction models on the training set and evaluates them on the testing set. The resulting models can then be compared using some criterion. This survey covers the essence of software quality, software metrics, various statistical and learning algorithms, and evaluation parameters and strategies.

Secondly, the first semi-supervised learning scheme, multi-instance learning, is introduced to the software quality estimation domain. Multi-instance learning (MIL) views a bag of instances as the elemental learning object, and the instances in each bag share one label. Hence MIL can use many software modules and relatively few labels for training. The author introduces the basic notions and related research of MIL, and compares it both theoretically and experimentally with two supervised learning methods, SL-B and SL-I. In particular, a theoretical analysis of the bag misclassification rate for SL-I is given, and its multivariate normal approximation is discussed in detail. Experiments on industrial datasets indicate that MIL has better prediction accuracy than SL-B and needs far fewer labels than SL-I.

Thirdly, the software modules are regarded as more complicated structured data, and corresponding structured kernel methods (SKM) are introduced for modeling and classification. The author introduces structured kernels, the theory of support vector machines and practical implementations. Then the knowledge representation and training procedure concerning class hierarchy are addressed. 
An original layered kernel, which is especially suitable for object-oriented software, is described and then compared with supervised learning using several major kernels in experiments on both artificial and real-life datasets. The results show that the layered kernel proposed by the author outperforms the others in terms of predictive accuracy. The comparison of MIL and SKM shows that MIL can be applied more widely to structured software models, but its prediction accuracy decreases as the software models become more complex. In contrast, SKM, especially with the layered kernel, performs better in learning more complex structured data and is therefore more powerful in estimating the quality of object-oriented software.

Then active learning is studied. Unlike traditional supervised learning, which learns from all the samples at one time, active learning constructs models by active selection, label query and incremental learning. Its main merit is the capability of building accurate models using a small number of actively queried labels. The theory, related studies and two key issues of active learning are addressed. On datasets from NASA and the telecommunication industry, two pool-based active learning algorithms and one recent stream-based algorithm are investigated. Experiments indicate that active learning can use as few as 10% of all samples to build effective software quality models and achieve good prediction performance. Hence, active learning based software quality estimation has strong potential for guiding agile software testing. Moreover, active learning is compared with both MIL and SKM.

Lastly, the author concludes the dissertation and gives an outlook on future work. It is worth noting that the datasets in the experiments all came from key practical software projects, and some datasets as well as software quality models are based on practical applications at Lucent Technology in Optical Networks. 
Hence, the novel software quality models and methods studied in this dissertation have practical value as well as theoretical originality.
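
As an illustration of the pool-based active learning loop described in the abstract (active selection, label query, incremental learning), here is a minimal sketch in Python. It is not the dissertation's algorithm: the abstract does not name the algorithms used, so the nearest-centroid classifier, the margin-based selection rule, the two synthetic module metrics (a size-like and a complexity-like value) and all function names are assumptions chosen only to keep the example self-contained. The label budget of 10% of the pool mirrors the figure quoted in the abstract.

```python
import random

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fit(labeled):
    """Assumed stand-in model: one centroid per class (0 = clean, 1 = fault-prone)."""
    return {c: centroid([x for x, y in labeled if y == c]) for c in (0, 1)}

def predict(model, x):
    return 0 if dist2(x, model[0]) <= dist2(x, model[1]) else 1

def margin(model, x):
    """Small margin: both centroids are almost equally close, so x is uncertain."""
    return abs(dist2(x, model[0]) - dist2(x, model[1]))

def active_learn(pool, oracle, budget, seed_indices):
    """Query labels one at a time until `budget` labels have been spent."""
    seeds = set(seed_indices)
    labeled = [(pool[i], oracle(i)) for i in seed_indices]
    unlabeled = [i for i in range(len(pool)) if i not in seeds]
    while len(labeled) < budget and unlabeled:
        model = fit(labeled)                                      # incremental retraining
        i = min(unlabeled, key=lambda j: margin(model, pool[j]))  # active selection
        unlabeled.remove(i)
        labeled.append((pool[i], oracle(i)))                      # label query
    return fit(labeled), len(labeled)

def make_modules(n, rng):
    """Synthetic modules with two hypothetical metrics; odd indices are fault-prone."""
    modules, labels = [], []
    for k in range(n):
        faulty = k % 2
        base = (300.0, 20.0) if faulty else (100.0, 5.0)
        modules.append((rng.gauss(base[0], 30.0), rng.gauss(base[1], 3.0)))
        labels.append(faulty)
    return modules, labels

rng = random.Random(0)
pool, truth = make_modules(200, rng)
# Seed with one module of each class; the budget is 10% of the 200-module pool.
model, n_queries = active_learn(pool, lambda i: truth[i], budget=20, seed_indices=[0, 1])
accuracy = sum(predict(model, x) == y for x, y in zip(pool, truth)) / len(pool)
print(n_queries, round(accuracy, 3))
```

The oracle here simply looks up a known label. In a real setting each query would correspond to testing a module to find out whether it is faulty, which is why the number of queries is the cost being minimized.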
Keywords/Search Tags:software quality prediction, multi-instance learning, structured kernel methods, support vector machine, active learning