| Binary code similarity detection(BCSD)is of significant importance in preventing vulnerability propagation and maintaining software supply chain security,as it enables the detection of potential vulnerabilities in non-open source software such as commercial software.Code embedding technology uses neural network models to convert the code representation of a binary function into a numerical vector,which has gained advantages in applications such as vulnerability search.Binary functions generated by the same source code under different compilation configurations have the same semantics,but their control flow graphs(CFGs)are quite different,resulting in their assembly instruction sequences not being similar,and the code paths in CFGs being difficult to match.As a result,code embedding methods based on the assembly instruction sequence,CFG,or paths typically yield mediocre results in the BCSD task.To reduce the interference of compilation configuration on similarity detection,this dissertation first represents functions as a set of basic block trees(BBTs),where a BBT is composed of a basic block and its directly connected child nodes.Then,a code embedding model is constructed to transform the BBT-based code representations into vectors,which are applied to the search for vulnerabilities.The main research work and contributions of this dissertation are as follows:(1)Binary code embedding method based on basic block treesWe found that despite significant structural differences between the two CFGs,there are more matched BBTs and fewer matched code paths between them.Therefore,we choose BBT as the local feature of CFG and propose a binary code embedding method based on BBTs.To design the BBT-based code representation,we decompose the CFG into a set of BBTs by level-order traversal.At the same time,each BBT is merged into a sequence of assembly instructions using pre-order traversal.To construct the corresponding code embedding model BBTree,we first use the Word2vec model and the LSTM network to convert the assembly instruction sequence of BBT into a vector.Then,all BBT vectors are fed into the Bi-GRU network,and its output is generated as a function vector by maximum pooling.In the binary classification experiment of BCSD,the area under the curve(AUC)of BBTree is 0.942,which is significantly higher than the 0.822 of Gemini and the 0.897 of SAFE.(2)Search application based on binary code similarity detectionThe goal of binary code similarity search is to find the target function T that matches the query function Q in a codebase.This requires computing the similarity between multiple pairs of functions and striving for the highest possible similarity between Q and T.To this end,we propose a binary code similarity search framework based on BBTree.The framework first uses BBTree to convert functions into vectors and calculates the cosine similarity between the vector of Q and the vector of each function in the codebase.The top-k functions with the highest similarity scores are selected as the candidate function set.Then,calculate the Jaccard similarity between the call instruction sequence of Q and the call instruction sequence of each candidate function to further locate T.Experimental results show that the average recall rate of the framework in vulnerability search is 87.9%,significantly better than SAFE’s 59.1%and Asm2Vec’s 78.3%. |