Binary Code Similarity Detection Based On Deep Learning

Posted on: 2022-12-16
Degree: Master
Type: Thesis
Country: China
Candidate: X X Geng
Full Text: PDF
GTID: 2518306614960119
Subject: Automation Technology
Abstract/Summary:
Binary code similarity detection (BCSD) is widely used in the security field, for tasks such as malicious code detection, vulnerability search, and code plagiarism detection. The first prerequisite for BCSD is function identification. The first n or last n instructions of a function generally differ from the instructions in its middle, and most function identification methods rely on this observation. However, when a function lacks a standard beginning or end, such methods fail to identify it. To address this shortcoming, a method based on return instruction identification is proposed, which uses a Two-layer Bidirectional Long Short-term Memory Network (TBLSTM) to improve the accuracy of return instruction recognition. The evaluation is performed on 4680 real-world binaries. The results show that the proposed TBLSTM achieves an accuracy of 99.69%, which is higher than that of the other classifiers in the evaluation, including the state-of-the-art tool IDA Pro.

At present, the most widely used approach to BCSD is graph matching. However, its time complexity is extremely high, and it adapts poorly to cross-architecture and cross-version comparison. In this thesis, a BCSD model based on graph embedding and a convolutional neural network (CNN) is proposed. Structure2Vec is used to generate graph embeddings of control flow graphs, while the CNN processes the sequential structure information between the basic blocks of the control flow graphs, so as to better capture the order relationships between blocks. Finally, the two sets of features are fused to form the final embedding of a function. On a test set constructed from OpenSSL, the AUC/F1-score of this method reaches 98.33%/94.56% and 97.79%/93.97% for big and small graphs, respectively. The experimental results show that the proposed approach improves the efficiency of similarity detection and is well adapted to cross-architecture and cross-version similarity detection.
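
To make the graph embedding step more concrete, the following is a minimal sketch of a Structure2Vec-style embedding of a control flow graph, followed by a cosine similarity comparison. It is an illustration under stated assumptions, not the thesis's model: the scalar weights `w_feat` and `w_nbr` are hypothetical stand-ins for learned weight matrices, the per-block feature vectors are made up, the neighbor lists are treated as given, and the CNN branch and feature fusion described in the abstract are omitted.

```python
import math


def s2v_embed(features, adjacency, w_feat=0.5, w_nbr=0.3, rounds=3):
    """Simplified Structure2Vec-style embedding of one control flow graph.

    features:  list of per-basic-block feature vectors (all the same length)
    adjacency: adjacency[v] lists the neighbor indices of basic block v
    w_feat, w_nbr: hypothetical scalars standing in for learned matrices
    """
    dim = len(features[0])
    # Every block starts with a zero embedding.
    mu = [[0.0] * dim for _ in features]
    for _ in range(rounds):
        new_mu = []
        for v, x in enumerate(features):
            # Aggregate the current embeddings of the neighbors of block v.
            agg = [sum(mu[u][k] for u in adjacency[v]) for k in range(dim)]
            # Combine the block's own features with the aggregated neighbors.
            new_mu.append(
                [math.tanh(w_feat * x[k] + w_nbr * agg[k]) for k in range(dim)]
            )
        mu = new_mu
    # The graph embedding is the sum of all block embeddings.
    return [sum(m[k] for m in mu) for k in range(dim)]


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


# Toy graph: three basic blocks in a chain (0 - 1 - 2), made-up features.
feats_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adj_a = [[1], [0, 2], [1]]

# The same graph with its blocks listed in reverse order.
feats_b = [[1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
adj_b = [[1], [0, 2], [1]]

emb_a = s2v_embed(feats_a, adj_a)
emb_b = s2v_embed(feats_b, adj_b)
similarity = cosine(emb_a, emb_b)
```

Because the graph embedding is a sum over blocks, it does not depend on the order in which the blocks are listed, so the two toy graphs above receive essentially the same embedding. That order-insensitivity is also why a separate component is needed if the sequence of basic blocks is to be taken into account, which is the role the abstract assigns to the CNN.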
Keywords/Search Tags: binary code similarity detection, function identification, return instruction, two-layer bidirectional long short-term memory network, graph embedding