Research On Recognizing Functions In Binary Code Of ARM Platform Based On Machine Learning

Posted on: 2021-01-31
Degree: Master
Type: Thesis
Country: China
Candidate: Y Z Li
GTID: 2428330605972968
Subject: Software engineering
Abstract/Summary:
Binary function recognition underpins many binary code detection and analysis tasks, such as detecting binary code reuse, generating control flow graphs, and performing semantic analysis, and it is a fundamental problem in reverse engineering. The difficulty is that most software is released without compilation or debugging information, so the binary usually contains no function information. This thesis focuses on function recognition in binary code for the ARM platform and proposes two new automatic function recognition algorithms. A review of current function recognition tools and methods shows that most of them can analyze only stripped x86 binary files, not ARM binary files, or apply poorly because function start instructions are so diverse. Return instructions are also used in the literature to identify functions, because a function usually has a return instruction. ret is a common return instruction on x86; unfortunately, because the ARM instruction set differs, ARM has no similar instruction. For data collection, several open source programs are cross-compiled and the resulting binaries are disassembled to obtain the machine code corresponding to each assembly instruction. This machine code is fed into a machine learning model and a neural network, respectively, for preprocessing and final analysis, which classifies each position in the binary code as a function entry point or not. The method automatically learns the key features for recognizing functions. Starting from the first instruction of the disassembled binary code, it takes the 32 bytes around each byte as that byte's feature, builds one classification model with the XGBoost ensemble learning method and another with a Text-CNN network based on Doc2Vec, and then trains the models. Experiments on a number of popular open source programs show that the Text-CNN model based on Doc2Vec performs well, with recognition accuracy and recall both above 90%, which is of practical significance for software reverse engineering and software security analysis.
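
The feature construction described in the abstract (the 32 bytes around a byte, labeled by whether that byte is a function entry point) might be sketched as follows. This is a minimal illustration and not the thesis's implementation. The abstract does not say how the 32 bytes are split around the byte, how section edges are handled, or which offsets are sampled, so the 16/16 split, the zero padding, the 4-byte step (fixed-width ARM instructions, no Thumb), and the names window_features and build_dataset are all assumptions.

```python
from typing import List, Set, Tuple

WINDOW = 32          # total context bytes per sample, as stated in the abstract
HALF = WINDOW // 2   # assumption: 16 bytes before the offset, 16 from it onward
PAD = 0x00           # assumption: zero padding past the ends of the code section


def window_features(code: bytes, offset: int) -> List[int]:
    """Return the 32 byte values surrounding `offset`, padded at the edges."""
    feats = []
    for i in range(offset - HALF, offset + HALF):
        feats.append(code[i] if 0 <= i < len(code) else PAD)
    return feats


def build_dataset(
    code: bytes, entry_offsets: Set[int], step: int = 4
) -> Tuple[List[List[int]], List[int]]:
    """Build (features, labels) for every `step`-aligned offset.

    step=4 assumes fixed-width 32-bit ARM instructions (hypothetical choice).
    Label 1 marks a known function entry point, 0 anything else.
    """
    X, y = [], []
    for off in range(0, len(code), step):
        X.append(window_features(code, off))
        y.append(1 if off in entry_offsets else 0)
    return X, y


if __name__ == "__main__":
    # Hypothetical 12-byte code section made of three little-endian ARM words:
    # e92d4800 push {fp, lr}; e1a00000 mov r0, r0; e12fff1e bx lr
    code = bytes.fromhex("00482de9" "0000a0e1" "1eff2fe1")
    X, y = build_dataset(code, entry_offsets={0})
    print(len(X), len(X[0]), y)
```

The resulting rows of 32 integers and their 0/1 labels are the kind of input a gradient-boosted classifier such as XGBoost could be trained on. In practice the entry-point labels would come from the symbol or debug information of the unstripped builds.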
Keywords/Search Tags:binary code analysis, function recognition, machine learning, reverse engineering, neural network