| As software grows in scale, the amount of malicious software keeps increasing, posing major challenges to industry and academia. Binary code analysis is an efficient and safe analysis method: it studies the structure and characteristics of a program without requiring its source code. Researchers have proposed a range of approaches to binary code similarity detection (BCSD), including many models based on graph neural networks. Contrastive learning, meanwhile, has shown strong feature-extraction ability, allowing a model to learn the key features of binary functions. However, existing approaches have three main problems: they extract too little semantic information from the control flow graph, they extract features from the control flow graph only, and their data augmentation methods for contrastive learning are inadequate. To address these problems, this thesis makes the following contributions. 1) We propose a binary function representation learning model based on BERT. The basic blocks extracted from the control flow graph are used to pre-train a BERT language model, and three training tasks are set up so that the model learns a vector representation of each basic block. A graph neural network then extracts the structural information of the control flow graph and the abstract syntax tree to obtain graph-level vector representations. Finally, this information is fused into the final function vector representation. 2) We propose a binary function similarity detection algorithm based on contrastive learning. Unlike previous work, we design a data augmentation method specific to binary functions: positive samples are generated by exploiting the fact that the same source code compiles to similar binary code at different optimization levels. This avoids the incorrect samples that earlier augmentation methods can generate and preserves the key basic blocks of the control flow graph. Comparative experiments show that this method improves detection performance. 3) We design a binary function similarity detection system based on deep learning. Building on the two algorithms above, the system provides malicious function detection and clone function detection. For malicious function detection, known malicious binary functions are collected into a function library, and a new function fed to the model is compared against the functions in this library to judge whether it is safe. Clone function detection determines whether two binary functions are similar. The system also includes user management, function management, model management, and other functions required in production scenarios, and a series of tests was designed to improve its robustness and reliability. In summary, this thesis studies binary function representation learning based on the BERT model and contrastive learning, and demonstrates the effectiveness of the algorithms in application scenarios through experiments. It also designs and implements a binary code similarity detection system that integrates these algorithms, and verifies the system's effectiveness and efficiency through a large number of test cases. |
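
The contrastive objective and the library lookup summarized above can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not name a loss function, so InfoNCE is assumed here as a common choice, and the helper names (`cosine`, `info_nce`, `most_similar`), the temperature value, and the toy 2-dimensional embeddings are all hypothetical. Each anchor is taken to be the embedding of a function compiled at one optimization level, and its positive is the embedding of the same function compiled at another level.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss (assumed loss, not confirmed by the source).

    anchors[i] and positives[i] embed the same function at two
    optimization levels; the other positives in the batch act as negatives.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        top = max(logits)  # subtract the max for numerical stability
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def most_similar(query, library):
    """Return (name, score) of the library embedding closest to the query."""
    scored = ((name, cosine(query, vec)) for name, vec in library.items())
    return max(scored, key=lambda item: item[1])


# Toy embeddings: two functions, each at two optimization levels (hypothetical).
emb_o0 = [[1.0, 0.0], [0.0, 1.0]]
emb_o2 = [[0.9, 0.1], [0.1, 0.9]]

aligned_loss = info_nce(emb_o0, emb_o2)         # correct pairing
shuffled_loss = info_nce(emb_o0, emb_o2[::-1])  # deliberately wrong pairing

# Hypothetical malicious-function library keyed by function name.
malicious_library = {"mal_a": [1.0, 0.0], "mal_b": [0.0, 1.0]}
best_name, best_score = most_similar([0.9, 0.1], malicious_library)
```

With correctly paired samples the loss is much lower than with shuffled pairs, which is the signal that drives the encoder to place differently optimized builds of one function close together. In a detection setting, `best_score` would then be compared against a threshold chosen on validation data to decide whether the query function matches a known malicious one.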