| In recent years, software supply chain security incidents have become frequent and are now one of the most important factors affecting software security. The software supply chain is divided into three segments: development, delivery, and use. In the development segment, attackers can contaminate all compiled and released software by attacking the compiler, which is cheaper and has a wider impact than attacks in the other segments. During development, the same source code can be compiled into different executables (binaries) depending on the compiler configuration: compiler family, version, and optimization level. Recovering compiler configuration information from binary files is important for tasks such as compiler attack identification and binary code similarity analysis. Existing studies focus on coarse-grained identification of the compiler family, version, and optimization level: of the four optimization levels (O0, O1, O2, and O3), O2 and O3 are merged into a new level OH, and classification is performed over O0, O1, and OH. This paper takes executable files as the research object and explores compiler feature extraction methods based on binary files, in order to better identify the compiler family, version, and all four optimization levels (with O2 and O3 separated). The main research content of this paper is as follows: (1) Existing studies mainly download C source code from open source projects as datasets, which yields small sample sizes. To enlarge the dataset, features are typically extracted at function granularity, which makes the dataset unbalanced. In addition, C source code from open source projects is not compiler specific, so most functions compiled with different compilers have the same binary code, which affects subsequent feature extraction and, in turn, the compiler recognition results. To address these issues, this paper generates 10,000 C source files using CSmith and compiles them into a dataset of executables (12 types) using four optimization levels for each of three compilers: gcc 8.1.0, gcc 9.2.0, and clang 10.0.0. Compiler features are then extracted from the generated executables and classified by machine learning to identify the compiler family, version, and optimization level. (2) Existing research mainly relies on deep learning for compiler recognition, which captures only simple features of binary files; the compiler's characteristics are therefore not fully reflected, and recognition accuracy for the optimization level is not high. In response, this paper proposes a compiler feature extraction and recognition method based on disassembly. First, the executable file is disassembled, and statistical features (frequencies of commonly used registers and opcodes) and associated features are extracted from the disassembly. Then, chi-squared feature selection is applied to the two extracted feature types, the features with the top 40% of chi-squared scores are retained, and the selected features are fused to form the basis for compiler recognition. Finally, support vector machine, LightGBM, XGBoost, and random forest classifiers are applied to each reduced single feature set and to the fused feature set. The experimental results indicate that LightGBM combined with the fused features performs best: its recognition accuracy for the compiler family, version, and two optimization levels (O0 and O2) can reach 99.9%. For the four optimization levels, the recognition accuracy is 98.4% for gcc (with gcc 8.1.0 and gcc 9.2.0 pooled by optimization level) and 94.0% for clang. (3) Binary code is sometimes packed (shelled), which generally makes it difficult to disassemble, so the disassembly-based method proposed in this paper has some limitations. To address this, this paper proposes a compiler feature extraction and recognition method based on multi-feature fusion of binary bytecode. First, GLCM features, LBP features, byte histogram features, and byte entropy histogram features are extracted from the executable file. Then, PCA is used to reduce the dimensionality of the four extracted feature sets, retaining the components whose cumulative contribution rate exceeds 90%, and the reduced feature sets are fused. Finally, support vector machine, LightGBM, XGBoost, and random forest classifiers are applied to each reduced single feature set and to the fused feature set. The experimental results show that LightGBM combined with the fused features performs best, with recognition accuracy above 99% for the compiler family, version, and two optimization levels (O0 and O2). For the four optimization levels (O0, O1, O2, and O3), the recognition accuracy is 97.1% for gcc and 89.1% for clang. |
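
As a rough illustration of the byte-level steps in (3), the sketch below computes a normalized byte histogram and the Shannon entropy of a byte string, and picks the number of leading principal components whose cumulative contribution rate exceeds 90%. It is a minimal sketch under stated assumptions, not the method used in this work: the function names `byte_histogram`, `shannon_entropy`, and `components_for_contribution` are hypothetical, the entropy is computed over the whole input rather than as a windowed entropy histogram, and the GLCM and LBP features are not shown.

```python
import math
from collections import Counter


def byte_histogram(data: bytes) -> list[float]:
    """Normalized 256-bin histogram of byte values (hypothetical helper)."""
    counts = Counter(data)
    total = len(data) or 1  # avoid division by zero on empty input
    return [counts.get(b, 0) / total for b in range(256)]


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte; 0.0 for empty input."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(data).values())


def components_for_contribution(ratios: list[float],
                                threshold: float = 0.90) -> int:
    """Smallest number of leading components whose cumulative
    contribution rate exceeds the threshold (assumed to be 90%).

    `ratios` is assumed to be sorted in descending order, as PCA
    implementations usually report explained-variance ratios.
    """
    cumulative = 0.0
    for k, ratio in enumerate(ratios, start=1):
        cumulative += ratio
        if cumulative > threshold:
            return k
    return len(ratios)


if __name__ == "__main__":
    sample = b"\x00\x00\xff\xff"
    print(byte_histogram(sample)[0])      # share of 0x00 bytes
    print(shannon_entropy(sample))        # bits per byte
    print(components_for_contribution([0.6, 0.25, 0.1, 0.05]))
```

In practice the input would be the raw bytes of an executable (for example, read with `open(path, "rb").read()`), and the contribution ratios would come from a fitted PCA model rather than a hand-written list.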