| Software vulnerabilities are the main cause of many network security incidents and have received continuous, extensive attention from security research institutions, academic groups, and enterprises. During software development, developers often reuse released code modules to implement specific program functions, which improves development efficiency. However, code reuse can also spread the security vulnerabilities of the reused code to a large number of related programs, introducing potential security threats. Compared with open-source programs, binary programs without source code lack rich semantic information such as data structures and data types. In addition, because of different code optimization settings, binaries compiled from the same source code can differ substantially, which makes detecting vulnerabilities introduced by binary code reuse more challenging. This thesis aims to detect code reuse vulnerabilities using similarity comparison, combining machine learning with static and dynamic analysis. The research covers three aspects: a function-level binary similarity comparison method based on a semantic learning model, a component-level binary similarity comparison method based on data flow analysis, and a vulnerability-oriented directed binary fuzzing method. The main contributions are summarized as follows: 1. This thesis proposes a binary similarity comparison method based on a semantic learning model to address the problems of current function-level binary vulnerability detection research, including incomplete representation of structural features, coarse-grained semantic information, and insufficient semantic representation. Features of the binary program are extracted at instruction, basic block, and function granularity. Natural language processing techniques and a graph neural network model are used to represent the contextual and structural semantic features of the code. Then the semantic learning
model is trained to learn binary code similarity, which can be used to detect code reuse vulnerabilities at the function level. The implemented prototype system outperforms the representative tool Gemini on the validation set. In detecting real firmware vulnerabilities, its top-5, top-10, and top-50 accuracy is 113.3%, 60.0%, and 32.7% higher than Gemini's, respectively. 2. To address the low accuracy of current component-level binary similarity comparison methods when comparing code with large structural differences, a component-level binary similarity comparison method based on data flow analysis is studied. According to the comparison granularity, the comparison process is divided into top-down and bottom-up stages based on defined anchor functions and function call relationships. In the top-down stage, candidate functions that may correspond to each other are selected based on the anchor functions and data flow analysis. In the bottom-up stage, the semantic learning model is applied to extract and represent the semantic features of functions and to determine the correspondence between candidate functions. This method can be applied not only to component-level vulnerability detection but also to patch comparison and malicious code detection in software supply chain security analysis. Experiments show that the average recall and accuracy of the proposed method are 69.8% and 60.5% higher, respectively, than those of BinDiff, the best-performing of the representative tools. 3. Given the relatively high false positive rate of static vulnerability detection methods, this thesis studies a binary fuzzing method to verify static vulnerability detection results. The research also optimizes directed greybox fuzzing, addressing problems such as insufficient consideration of the unequal roles that different code plays in steering execution toward the target code area. The semantic learning model is applied to automatically locate the target code
area. Considering that code snippets contribute unequally to reaching a given target, the directed fuzzer assigns different weights to basic blocks and uses these weights as feedback to generate test cases that reach the target code. The directed fuzzing method proposed in this thesis is more scalable and more general, and is not limited to detecting specific types of vulnerabilities. Experimental results show that, compared with related fuzzing tools, this method not only triggers more bugs on the LAVA-M dataset but can also be applied effectively to vulnerability reproduction and exception discovery. |
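
To make the function-level detection step concrete, the following is a minimal sketch of how a known vulnerable function might be matched against the functions of a firmware image once each function has been mapped to an embedding vector. It is an illustration under stated assumptions, not the thesis's implementation: the abstract does not specify the similarity measure, so cosine similarity is assumed here, and the names `cosine_similarity`, `rank_candidates`, the `sub_...` function labels, and the three-dimensional vectors are all hypothetical placeholders for the output of a trained semantic learning model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero embedding as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_candidates(query_vec, candidates, k=5):
    """Return the k candidate functions most similar to the query, best first."""
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in candidates.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical embeddings; a real system would obtain these from the trained model.
vulnerable_fn = [0.9, 0.1, 0.3]
firmware_fns = {
    "sub_401000": [0.88, 0.12, 0.31],
    "sub_402000": [0.1, 0.9, 0.2],
    "sub_403000": [0.5, 0.5, 0.5],
}

top = rank_candidates(vulnerable_fn, firmware_fns, k=2)
for name, score in top:
    print(f"{name}: {score:.3f}")
```

A top-k evaluation of the kind reported above would then check whether the true vulnerable function appears among the first k ranked candidates; functions that rank highly are the ones a directed fuzzer could subsequently target for verification.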