
Classification Model In Software Engineering Based On Mainstream Static Analysis Reports

Posted on: 2021-01-26; Degree: Master; Type: Thesis
Country: China; Candidate: F Zhao; Full Text: PDF
GTID: 2428330626461133; Subject: Applied statistics
Abstract/Summary:
In software engineering, source code analysis is an essential step before software goes online. Static analysis (SA) tools examine code for flaws without executing it and produce warnings ("alerts") about possible flaws. Today, mainstream static analysis tools such as Coverity, Klocwork and CppTest perform well in source code analysis. However, the non-code errors (alerts that do not correspond to real defects) in the reports these tools generate are at least three times as numerous as the true code errors, which makes code review a large and complicated task. With the rapid development of artificial intelligence, existing classification models have become increasingly mature. This thesis therefore proposes using artificial intelligence algorithms to predict the results of static analysis tools, separating true code errors from non-code errors in the report and thereby greatly reducing the difficulty of manual classification.

First, for the different kinds of data features in the analysis reports generated by static analysis tools, we use corresponding feature engineering methods to extract more information. For natural language features, we use natural language processing methods such as the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm and LSI (Latent Semantic Indexing). For ordered factors, we convert the levels to numeric values. For categorical features, we apply one-hot encoding and then reduce the dimension while preserving the integrity of the information carried by the data. Next, because artificial intelligence algorithms require labeled training data and manual labeling is a huge workload, this thesis proposes using semi-supervised learning algorithms to label the training data, so that the classification accuracy of the final model is as good as possible without losing the validity of the data. Finally, this thesis uses LightGBM (Light Gradient Boosting Machine) to obtain the final classification model.

The software project used in this thesis contains 106372 C++ files, including 65363 code errors. We use Coverity, Klocwork, CppTest and CodeSonar to analyze the source code and generate reports. Experiments show that the classification model based on weak supervision proposed in this thesis achieves good results on all four report sets, demonstrating an application of artificial intelligence in the field of software engineering.
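
The feature engineering step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the alert records, the field names (`message`, `severity`, `checker`), the severity ordering and the helper functions are all hypothetical, and the LSI and dimension-reduction steps are left out. It only shows how a text field, an ordered factor and a categorical field might each be encoded and joined into one numeric feature row per alert.

```python
import math
from collections import Counter

# Hypothetical alert records; field names and values are illustrative only
# and are not taken from any real static analysis report format.
alerts = [
    {"message": "null pointer dereference of variable p",
     "severity": "high", "checker": "NULL_DEREF"},
    {"message": "unused variable p",
     "severity": "low", "checker": "UNUSED_VAR"},
    {"message": "possible null pointer dereference",
     "severity": "medium", "checker": "NULL_DEREF"},
]

# Assumed ordering for the ordered factor (an assumption of this sketch).
SEVERITY_ORDER = {"low": 0, "medium": 1, "high": 2}


def tfidf(docs):
    """Return (vocabulary, rows) using tf * idf with idf = ln(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        length = max(len(toks), 1)  # guard against empty messages
        rows.append([counts[t] / length * math.log(n_docs / df[t])
                     for t in vocab])
    return vocab, rows


def one_hot(values):
    """Return (categories, rows) with one 0/1 column per category."""
    cats = sorted(set(values))
    return cats, [[1 if v == c else 0 for c in cats] for v in values]


def build_features(records):
    """Concatenate text, ordinal and categorical encodings per alert."""
    vocab, text_rows = tfidf([r["message"] for r in records])
    cats, cat_rows = one_hot([r["checker"] for r in records])
    names = ([f"tfidf:{t}" for t in vocab] + ["severity"]
             + [f"checker:{c}" for c in cats])
    rows = [text + [SEVERITY_ORDER[r["severity"]]] + cat
            for r, text, cat in zip(records, text_rows, cat_rows)]
    return names, rows


names, rows = build_features(alerts)
for row in rows:
    print([round(x, 3) for x in row])
```

Rows in this form could then be passed to a classifier. One-hot columns grow with the number of distinct categories, which is presumably why the abstract pairs one-hot encoding with a dimension-reduction step.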
Keywords/Search Tags: Source code analysis, Static scanning tools, Semi-supervised algorithm, LightGBM