
Research On The Improvement Of Software Vulnerability Prediction Based On Dimensionality Reduction Techniques

Posted on: 2020-03-28
Degree: Doctor
Type: Dissertation
Country: China
Candidate: PATRICK KWAKU KUDJO
GTID: 1368330623961211
Subject: Computer application technology
Abstract/Summary:
Software vulnerability constitutes a major and increasing threat to our healthcare, energy, defense, financial, and other critical infrastructure systems. There is evidence that the system downtime caused by software vulnerabilities will increase dramatically. Hence, new and vital information on the potential risk of vulnerabilities is essential to security experts. Additionally, billions of dollars are lost every year to the successful exploitation of vulnerabilities. Given that vulnerabilities are the primary cause of such attacks, it is important to detect and resolve them. One of the early approaches is to develop application patches and harden software systems after vulnerabilities have been discovered. Similarly, building predictive classification models to determine whether a software component is vulnerable or neutral is essential to researchers and practitioners in the software engineering domain. As one of the classical problems in vulnerability analysis, the severity prediction of vulnerabilities is an important activity that has received a great deal of attention from researchers and practitioners. Most prior work relies on historical vulnerability data and the Common Vulnerability Scoring System (CVSS) for assessing the impact and severity levels of vulnerabilities. Furthermore, machine learning techniques such as random forest (RF), k-nearest neighbor (KNN), and decision tree (DT) have been applied to predict software vulnerabilities.

However, one major challenge in vulnerability prediction is the vague, sparse, and complex semantic content that results in a high-dimensional feature set. That is, there are many irrelevant and redundant features that degrade predictive and classification performance, particularly when the process involves N-gram analysis. Strictly speaking, this problem falls under the umbrella of “dimensionality reduction.” Thus, this dissertation seeks to tackle the dimensionality problem in software vulnerability prediction and classification. In particular, the thesis adds to the scientific knowledge base in this domain by conducting a theoretical and empirical investigation using Bellwether analysis. The study also investigates the impact of different feature selection techniques, namely term frequency-inverse gravity moment (TF-IGM), normalized difference measure, and firefly algorithm-based feature selection, on vulnerability prediction and classification models.

The main contributions of this dissertation are as follows:

(1) This study presents an approach to predict the severity levels of software vulnerabilities using Bellwether analysis (i.e., exemplary data). In this approach, we developed a novel Bellwether algorithm that searches for and identifies an exemplary subset of data (referred to as the Bellwether) to be used as the training set, with the aim of improving prediction accuracy over the benchmark techniques and within-project prediction cases. The experimental results show that the Bellwether approach achieves F-measure values (i.e., the harmonic mean of precision and recall) ranging from 14.3% to 97.8%, which is an improvement over the benchmark techniques. Besides the severity prediction of vulnerabilities, Bellwether analysis is applied to software vulnerability prediction to predict the vulnerable modules in software systems. More importantly, we combine N-gram analysis and Bellwether analysis to predict vulnerable software modules. The proposed approach was validated using ten Java Android applications extracted from the F-Droid repository. The results indicate that the Bellwether method improves prediction performance, with F-measure values ranging from 18.5% to 94.3%. In summary, we recommend the Bellwether method for vulnerability prediction models built on high-dimensional datasets that include irrelevant and redundant features.

(2) This study proposes a term-weighting metric termed term frequency-inverse gravity moment (TF-IGM) for software vulnerability classification. The metric leverages class labels for term weighting, which is similar to feature selection. Thus, this study demonstrates that the TF-IGM model can be incorporated into our classification scheme to measure the class-distinguishing power of a term in the corpus, so that terms with stronger class-distinguishing power are assigned greater weights. Also, we broaden the set of parameters in our previous study by conducting an empirical study to validate TF-IGM within the context of vulnerability severity classification. Specifically, we extensively compare TF-IGM with information gain (IG) feature selection using five machine learning algorithms on ten vulnerable software products containing a total of 27,248 vulnerabilities. The experimental results show that the TF-IGM model is a promising metric for vulnerability classification compared to the classical term-weighting metric. Furthermore, the findings show that feature selection improves software vulnerability classification.

(3) In addition to the techniques mentioned above, this study theoretically and empirically investigates the impact of different feature selection techniques, namely the normalized difference measure and firefly algorithm-based feature selection, on the performance of vulnerability classification. These techniques exclude a large number of irrelevant and less important features and retain a small number of features to improve classification accuracy. Again, this study addresses the high-dimensional feature set problem using the normalized difference measure. The primary objective is to eliminate irrelevant features that have no significant effect on the text mining process. On average, the models trained with the introduced feature selection techniques achieved higher prediction accuracy than those trained with the benchmark feature selection methods.

(4) Finally, this study investigates whether optimizing the hyperparameters of machine learning algorithms can improve vulnerability prediction accuracy. To do this, we conducted an empirical study using eight prediction models with different parameter settings on twelve open-source applications. Surprisingly, our models yielded improved classification accuracy for all of the applications studied.

In summary, this thesis provides major contributions to the theoretical background of software vulnerability analysis and introduces four key techniques to improve the performance of vulnerability prediction and classification models.
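
The term-weighting idea in contribution (2) can be illustrated with a minimal Python sketch. It assumes the commonly cited definition of the inverse gravity moment, in which a term's per-class frequencies are sorted in descending order and the largest frequency is divided by the rank-weighted sum of all of them; the abstract does not state the exact variant used in the dissertation. The function names `igm` and `tf_igm_weights`, the coefficient `lam` (set to 7.0 here as an assumed default), the use of raw occurrence counts as class frequencies, and the toy documents are all illustrative assumptions, not the author's implementation.

```python
from collections import Counter, defaultdict


def igm(class_freqs):
    """Inverse gravity moment of one term, given its frequency in each class.

    Frequencies are ranked in descending order; the top frequency is divided
    by the sum of frequency * rank. A term concentrated in one class scores 1.0,
    a term spread evenly over m classes scores 2 / (m * (m + 1)).
    """
    ranked = sorted(class_freqs, reverse=True)
    if not ranked or ranked[0] == 0:
        return 0.0
    moment = sum(freq * rank for rank, freq in enumerate(ranked, start=1))
    return ranked[0] / moment


def tf_igm_weights(docs, labels, lam=7.0):
    """Return one {term: weight} dict per document, weight = tf * (1 + lam * igm).

    docs   -- list of token lists (hypothetical, already tokenized descriptions)
    labels -- class label of each document (e.g. a severity level)
    lam    -- assumed scaling coefficient for the class-distinguishing factor
    """
    classes = sorted(set(labels))
    per_class = defaultdict(Counter)  # term -> {class label: occurrence count}
    for tokens, label in zip(docs, labels):
        for term in tokens:
            per_class[term][label] += 1

    # Global factor per term: larger when the term is concentrated in few classes.
    factor = {
        term: 1.0 + lam * igm([counts[c] for c in classes])
        for term, counts in per_class.items()
    }
    return [
        {term: tf * factor[term] for term, tf in Counter(tokens).items()}
        for tokens in docs
    ]


if __name__ == "__main__":
    docs = [["buffer", "overflow", "remote"], ["remote", "xss"]]
    labels = ["high", "low"]
    for weights in tf_igm_weights(docs, labels):
        print(weights)
```

In this toy input, "buffer" appears only in the "high" class and so receives a larger weight than "remote", which appears equally in both classes. That is the behavior the abstract describes: terms with stronger class-distinguishing power are assigned greater weights.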
Keywords/Search Tags:Software Vulnerability, Bellwether Analysis, N-gram Analysis, Firefly Algorithm, Normalized Difference Measure