Research on Key Technologies of Binary Code Analysis for Software Security Analysis

Posted on: 2016-03-22    Degree: Doctor    Type: Dissertation
Country: China    Candidate: J Qiu    Full Text: PDF
GTID: 1108330503469631    Subject: Computer Science and Technology
Abstract/Summary:
Binary code analysis is a program analysis technique that operates directly on binary code. It is critical in situations where source code is unavailable, such as malware detection and analysis. Because malware authors do not release their source code, binary code analysis is essentially the only way to analyze malware; likewise, in security audits and plagiarism detection for commercial software, only the binary code can be examined. Binary code analysis can be employed to protect existing software and reduce vulnerabilities, and it can also be used to thwart cracking and piracy, protecting intellectual property. Today, from supercomputers to smart phones and embedded devices, the vast majority of software is released in binary form. Studying binary code analysis therefore has both theoretical significance and practical value for improving software security.

Because of the large gap between source code and binary code, analyzing binary code is much harder than analyzing source code, and obfuscation and compiler optimization increase the difficulty further. In addition, malware authors often employ anti-analysis techniques, such as self-checksumming-based anti-tampering and timing-based anti-monitoring, to hinder analysis; researchers must neutralize these techniques before the malware can be analyzed. This further complicates binary code analysis and challenges its key technologies, including disassembly, function identification, and library function identification. This thesis therefore presents in-depth studies of anti-analysis identification, disassembly, function identification, and library function identification.

Existing research on identifying anti-analysis techniques tends to study each technique in isolation and therefore lacks generality. To address this, the conceptual similarities among different anti-analysis techniques are analyzed, and an information-flow-based framework that encompasses a wide variety of them is proposed. Existing approaches for neutralizing self-checksumming-based anti-tampering require hardware assistance and cannot handle self-modifying code, so a new, purely software approach based on dynamic information flow is proposed. Backward taint propagation first identifies memory locations that are either executed or used to compute the values of locations that are executed; forward taint propagation then identifies checksum computations over that code. The same approach also handles timing-based anti-monitoring: common instructions and system calls that fetch time values are treated as taint sources, and taint analysis identifies the validation procedure. The proposed approach successfully identifies the self-checksumming and timing-based anti-analysis techniques proposed in the research literature, provides insight into their underlying structure, and thereby helps devise techniques for neutralizing them.
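To make the information-flow idea concrete, the following is a minimal, illustrative sketch of forward taint propagation over a toy instruction trace: reads of code bytes and time-fetching instructions are treated as taint sources, and any comparison whose operands are tainted is flagged as a potential self-checksumming or timing check. The trace format, code range, and all names are hypothetical assumptions for illustration, not the thesis's implementation.

# Illustrative sketch only: toy forward taint propagation over a recorded
# trace, flagging comparisons that depend on code bytes or time values.
# The trace format and every name here are hypothetical.

CODE_RANGE = range(0x401000, 0x402000)     # assumed executable code region

# Each trace entry: (mnemonic, destination, sources); memory operands are
# integer addresses and registers are strings.
trace = [
    ("rdtsc", "eax", []),                  # time value: taint source
    ("mov",   "ebx", [0x401234]),          # read of a code byte: taint source
    ("add",   "ecx", ["ecx", "ebx"]),      # checksum accumulation
    ("cmp",   None,  ["ecx", 0x403000]),   # compare against a stored value
]

def is_code_read(operand):
    # A source is a read from the executable code region (self-checksumming).
    return isinstance(operand, int) and operand in CODE_RANGE

def propagate(trace):
    tainted = set()                        # registers currently tainted
    findings = []
    for mnemonic, dst, srcs in trace:
        if mnemonic == "rdtsc":            # timing source taints its result
            tainted.add(dst)
            continue
        src_tainted = any(s in tainted or is_code_read(s) for s in srcs)
        if mnemonic == "cmp":
            if src_tainted:                # tainted data reaches a check
                findings.append(("possible anti-analysis check", srcs))
        elif dst is not None:
            if src_tainted:
                tainted.add(dst)
            else:
                tainted.discard(dst)       # overwritten with untainted data
    return findings

if __name__ == "__main__":
    for label, operands in propagate(trace):
        print(label, operands)

On this toy trace the comparison is flagged because its first operand transitively depends on a byte read from the code region; in the thesis, the backward pass described above first establishes which locations constitute code.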
Current disassembly approaches that combine static and dynamic analysis suffer from low code coverage, so a new disassembly approach based on multiple-path exploration is studied. On x86, data and code can be interleaved within executable sections, and static disassembly cannot reliably distinguish them; dynamic disassembly is precise, but its coverage is low because only executed code is disassembled. The proposed approach uses dynamic binary instrumentation to record a program's instruction trace, then reverses the executed conditional branches to explore multiple paths, discovering additional paths from a single input and thereby improving the coverage of dynamic analysis. All traces are simplified and merged, and static disassembly is finally applied to identify the remaining code. Experimental results show that the approach disassembles binary code with high precision and high recall.

Current function identification approaches are challenged by functions without explicit call sites and by handcrafted assembly that lacks standard prologues and epilogues. A new function representation, the reverse extended control flow graph (RECFG), and a RECFG-based method for identifying functions in binary code are proposed. Every function has at least one return instruction, i.e., an instruction that transfers control out of the function, so return instructions are a more reliable anchor than the prologues and epilogues relied on by traditional methods. RECFGs are first built backward from every value in a code range that can be interpreted as a return instruction. Four filters then remove nodes and paths that could not have been generated by a compiler, and for each independent RECFG a multiple-decision method selects a subgraph as the control flow graph of a function. Experimental results show that the approach can effectively identify all possible functions in a code range.
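The anchoring idea behind the RECFG construction can be illustrated with a deliberately simplified sketch: candidate function ends are located at bytes that decode as return instructions, and a region is grown backward from each one. The linear backward walk below is only a stand-in for building a real reverse control flow graph and applying the four compiler-model filters; the byte patterns and limits are assumptions for illustration.

# Toy illustration of anchoring function discovery on return instructions.
# Assumptions: raw 32-bit x86 bytes; returns recognized only as 0xC3 (ret)
# and 0xC2 (ret imm16); a naive backward linear walk stands in for the
# real RECFG construction and its filters.

RET_OPCODES = {0xC3, 0xC2}

def find_return_candidates(code: bytes):
    # Every offset whose byte can be interpreted as a return instruction.
    return [off for off, byte in enumerate(code) if byte in RET_OPCODES]

def grow_backward(code: bytes, ret_off: int, max_len: int = 16):
    # Walk backward from a candidate return, collecting preceding offsets.
    # A real RECFG would follow control-flow edges instead of raw bytes.
    start = max(0, ret_off - max_len)
    return list(range(start, ret_off + 1))

def candidate_functions(code: bytes):
    return [(grow_backward(code, ret_off)[0], ret_off)
            for ret_off in find_return_candidates(code)]

if __name__ == "__main__":
    # push ebp; mov ebp, esp; xor eax, eax; pop ebp; ret
    sample = bytes([0x55, 0x89, 0xE5, 0x31, 0xC0, 0x5D, 0xC3])
    print(candidate_functions(sample))     # -> [(0, 6)]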
Traditional approaches for identifying library functions cannot identify inlined library functions, so a new library function identification approach is studied. Inlined and optimized library functions can be discontinuous and polymorphic, and traditional approaches that use the first n bytes of a function as a recognition feature cannot identify them. A new representation, the execution flow graph (EFG), is introduced to describe the behavioral characteristics of binary code, and both full and inlined library functions are identified by finding similar EFG subgraphs in target functions (a minimal matching sketch is given after the concluding paragraph). By analyzing execution dependences among instructions, the method identifies inlined library functions whose bytes are discontinuous in the target function; by instruction numbering, it also identifies variants of a library function. Because subgraph isomorphism detection is time-consuming, five lossless filters are developed to rule out impossible subgraph matches before detection, and a reduced representation derived from the EFG, the reduced execution flow graph (REFG), is introduced to speed up isomorphism testing. It is proved that two EFGs are subgraph isomorphic whenever their corresponding REFGs are, and the efficiency of the REFG approach comes from the smaller number of nodes and edges in REFGs together with the lossless filters. Experimental results show that both the EFG and REFG approaches achieve higher precision than the state-of-the-art tool and identify inlined functions that traditional approaches cannot, and that the REFG approach sharply reduces the processing time of the EFG approach while preserving its precision and recall.

In conclusion, the above approaches provide new ideas and methods for improving the code coverage of dynamic disassembly, identifying functions that lack cross-references and prologues/epilogues, quickly identifying library functions, and identifying anti-analysis techniques such as self-checksumming-based anti-tampering.
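As referenced in the library function paragraph above, here is a minimal sketch of matching a small "library function" graph inside a larger "target function" graph using networkx's subgraph isomorphism machinery. The tiny graphs and the "op" node labels are toy stand-ins for the thesis's EFG/REFG structures, and the induced-subgraph test used here only approximates the matching described above.

# Illustrative sketch: locating a small library-function pattern inside a
# larger target-function graph via subgraph isomorphism (networkx).
import networkx as nx
from networkx.algorithms import isomorphism

def make_graph(edges, ops):
    g = nx.DiGraph()
    for node, op in ops.items():
        g.add_node(node, op=op)
    g.add_edges_from(edges)
    return g

# Hypothetical library-function pattern: load -> xor -> store
lib = make_graph([(0, 1), (1, 2)], {0: "load", 1: "xor", 2: "store"})

# Hypothetical target function containing the pattern plus extra instructions
target = make_graph(
    [(10, 11), (11, 12), (12, 13), (11, 14)],
    {10: "load", 11: "xor", 12: "store", 13: "ret", 14: "cmp"},
)

# Nodes may only be matched when their operation labels agree.
node_match = isomorphism.categorical_node_match("op", None)
matcher = isomorphism.DiGraphMatcher(target, lib, node_match=node_match)

# True if some induced subgraph of the target matches the library pattern.
print(matcher.subgraph_is_isomorphic())

The filters and the REFG reduction described in the thesis serve to avoid running such a test on candidate pairs that cannot possibly match.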
Keywords/Search Tags: Binary code analysis, disassembly, function identification, library function identification, anti-analysis identification