
Research On Key Technologies Of Binary Code Similarity Detection Based On Neural Network

Posted on: 2022-05-31  Degree: Master  Type: Thesis
Country: China  Candidate: L Fang  Full Text: PDF
GTID: 2518306521457544  Subject: Computer technology
Abstract:
The abundant open source code and third-party components available on the Internet help software developers complete development tasks efficiently and increase their productivity and creativity. For this reason, third-party code is widely used in software engineering. In many cases, even when the source code of a piece of software cannot be obtained or its copyright statement is missing, we still want to identify the third-party code reused in it, in order to support important applications such as intellectual property protection and vulnerable code tracing. Binary code similarity detection serves this purpose. A review of existing work shows that neural network-based binary code similarity detection has become a hot research topic because it breaks through the performance bottleneck that traditional methods encounter in large-scale detection, but many problems remain.

We studied the key technologies of cross-architecture binary code similarity detection and proposed a neural network-based binary code similarity detection method. First, we construct a two-level cross-architecture intermediate representation for binary code of different architectures. Second, we use a sentence embedding model to learn the basic block-level intermediate representation, and use the semantic embedding vector of each basic block as its similarity feature. Then, we use a graph embedding model to learn the function-level intermediate representation, and use the embedding vector of the entire function as its similarity feature. Finally, we use the cosine distance between two function embedding vectors to judge the similarity of the two functions, and we take their function call graphs into account to provide a reference for the similarity between pieces of code.

The main research work and innovations of this paper are as follows:

1. Existing surveys of code similarity do not cover the latest results in binary code similarity detection, nor do they highlight the development direction of the field. We proposed a method for classifying binary code similarity detection technologies according to the information they focus on, and specifically studied neural network-based binary code similarity detection.
2. To address the differences between instruction set architectures, we studied the characteristics of two cross-architecture intermediate representations, VEX and CFG, and proposed a method for constructing a two-level intermediate representation of cross-architecture binary code.
3. To address the human bias that manual extraction of code similarity features easily introduces, we studied representation learning for program languages using representation learning models from natural language processing, and proposed a basic block embedding technique based on sentence embedding. Results show that, compared with a code similarity detection system that uses manually defined basic block features, our technique improves the accuracy of the system by at least 3.2%.
4. To address the underuse of information in the function control flow graph by existing graph neural networks, we studied the concept of message passing networks and proposed a function CFG embedding method based on a graph neural network. Results show that the accuracy of the improved detection system increases by at least 1.6%.
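The last two stages of the pipeline described in the abstract (embedding a function from its basic blocks and control flow graph, then comparing two functions by the cosine of their embeddings) can be illustrated with a minimal sketch. It is not the thesis's model: the function names `embed_function` and `cosine_similarity`, the weight-free averaging update, the sum pooling, and the toy two-dimensional block vectors are all assumptions made for illustration. A real system would use learned sentence embeddings for the blocks and a trained graph neural network.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def embed_function(block_vectors, edges, rounds=2):
    """Toy, weight-free message passing over a CFG (illustrative assumption).

    block_vectors: one embedding (list of floats) per basic block.
    edges: (src, dst) pairs of block indices, i.e. CFG edges.
    In each round every block is averaged with its CFG predecessors;
    the function embedding is the sum of the final block states.
    """
    n = len(block_vectors)
    preds = [[] for _ in range(n)]
    for src, dst in edges:
        preds[dst].append(src)

    states = [list(vec) for vec in block_vectors]
    for _ in range(rounds):
        new_states = []
        for v in range(n):
            # A block's new state is the mean of itself and its predecessors.
            group = [states[v]] + [states[u] for u in preds[v]]
            new_states.append([sum(col) / len(group) for col in zip(*group)])
        states = new_states

    # Sum pooling turns per-block states into one function-level vector.
    return [sum(col) for col in zip(*states)]


if __name__ == "__main__":
    # Hypothetical block embeddings for two functions with the same CFG shape.
    f1 = embed_function([[1.0, 0.0], [0.0, 1.0]], edges=[(0, 1)])
    f2 = embed_function([[0.9, 0.1], [0.1, 0.9]], edges=[(0, 1)])
    print(cosine_similarity(f1, f2))
```

Because each function is reduced to a single fixed-length vector, comparing a query function against a large corpus needs only one vector comparison per candidate, which is consistent with the abstract's point about large-scale detection. The sketch computes cosine similarity; the corresponding cosine distance is one minus that value.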
Keywords: Binary Code, Similarity Detection, Intermediate Representation, Sentence Embedding, Graph Embedding