Research On The Technology Of Similarity Comparison Between Source Code And Binary Code Based On Intermediate Representation

Posted on: 2022-06-14
Degree: Master
Type: Thesis
Country: China
Candidate: Z T Zhang
GTID: 2518306521957429
Subject: Computer technology
Abstract/Summary:
Open source is everywhere: from underlying chips, drivers, and firmware to operating systems, browsers, and application software, open source components are in use. Component-based development and code reuse greatly improve the efficiency of software development. However, maintainers of open source projects often pay insufficient attention to code security and quality, or lack the technical ability to ensure them. The dependency and reference relationships among open source code are complex, and its security often goes without review and management. As a result, open source software also increases the complexity and security risk of the software supply chain, and many open source vulnerabilities are introduced into closed source binary files. It is therefore of practical significance to detect open source code in closed source binary code and to study similarity comparison between source code and binary code.

Existing research focuses on similarity comparison between source code and source code, or between binary code and binary code, and the methods for these problems are relatively mature. Source-to-source comparison is often used for code clone detection and code search, while binary-to-binary comparison is often used for vulnerability search, patch analysis, and malware detection. The comparison is generally performed at the granularity of a basic block, a function, or a whole program. Because source code and binary code differ so greatly, however, few studies address matching binary code against source code. When source code is compiled into binary code, compilation options, whether chosen at random or fixed, such as compiler version, optimization level, and target architecture, can produce completely different binary code, which greatly increases the difficulty of comparing source code with binary code. In addition, because source code and binary code have completely different syntactic forms, enough features for similarity comparison cannot be extracted directly from the two. In view of these challenges, the main research contents of this thesis are as follows:

1. A conversion method between source code and binary code based on intermediate representation is proposed. Because the syntactic forms of source code and binary code are completely different, the features that can be extracted directly from each and used for comparison are insufficient and do not represent the semantics of the code well. To solve the problem of comparing source code and binary code directly, this thesis proposes transforming both into the same intermediate representation while preserving as much of their semantic information as possible, which gives the method cross-language and cross-platform characteristics.

2. A learning method for code semantic representation based on intermediate representation is proposed. To obtain semantic features from the intermediate representation statements of code, programming languages are treated by analogy with natural language. Following the idea that statements appearing in the same context have similar semantics, data flow, control flow, and other features are extracted as the contextual semantic relationships that arise while the code runs, and the natural language processing model Word2vec is used to train word embeddings on the intermediate representation of the code, yielding its semantic representation.

3. A similarity comparison tool between source code and binary code based on intermediate representation is designed and implemented. The tool takes the source code and binary code to be analyzed as input and produces a code similarity evaluation as output. It consists of four modules, designed and implemented in turn: a data set construction and preprocessing module, a data flow graph and control flow graph construction module based on LLVM IR, a word embedding training model based on LLVM IR, and a code similarity comparison module.
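The pipeline outlined in the abstract (normalize IR statements, learn a context-based representation, compare two functions) can be illustrated with a minimal standard-library sketch. This is not the thesis's implementation: the function names, the normalization scheme, and the hand-written LLVM IR snippets are all hypothetical, a plain co-occurrence count stands in for Word2vec training, and sequential instruction order stands in for the data-flow and control-flow context the abstract describes.

```python
import math
import re
from collections import Counter, defaultdict


def normalize(line):
    """Turn one LLVM IR instruction into a coarse token (hypothetical scheme)."""
    line = line.split(";")[0].strip()          # drop trailing IR comments
    line = re.sub(r"%[\w.]+", "VAR", line)     # local registers/labels -> VAR
    line = re.sub(r"@[\w.]+", "GLOBAL", line)  # globals and callees -> GLOBAL
    # Bare integer literals -> CONST; digits inside types such as i32 are kept.
    line = re.sub(r"(?<![\w.])-?\d+(?![\w.])", "CONST", line)
    return " ".join(line.split())


def tokens(ir_text):
    """Split IR text into one normalized token per non-empty instruction."""
    return [t for t in (normalize(l) for l in ir_text.splitlines()) if t]


def cooccurrence_vectors(corpus, window=2):
    """Count neighbours within a sliding window (stand-in for Word2vec)."""
    vecs = defaultdict(Counter)
    for seq in corpus:
        for i, tok in enumerate(seq):
            lo, hi = max(0, i - window), min(len(seq), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vecs[tok][seq[j]] += 1
    return vecs


def embed(seq, vecs):
    """Represent a function as the sum of its tokens' context vectors."""
    total = Counter()
    for tok in seq:
        total.update(vecs.get(tok, {}))
    return total


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


# Hand-written IR for illustration only: SRC_IR plays the role of IR compiled
# from source, BIN_IR the role of IR lifted from a binary (different register
# numbering), and OTHER_IR an unrelated function.
SRC_IR = """
%3 = load i32, i32* %1, align 4
%4 = add nsw i32 %3, 1
store i32 %4, i32* %1, align 4
ret i32 %4
"""
BIN_IR = """
%10 = load i32, i32* %7, align 4
%11 = add nsw i32 %10, 1
store i32 %11, i32* %7, align 4
ret i32 %11
"""
OTHER_IR = """
%2 = call i32 @puts(i8* %0)
%3 = icmp eq i32 %2, 0
br i1 %3, label %4, label %5
"""

src, lifted, other = tokens(SRC_IR), tokens(BIN_IR), tokens(OTHER_IR)
vectors = cooccurrence_vectors([src, lifted, other])
sim_same = cosine(embed(src, vectors), embed(lifted, vectors))
sim_diff = cosine(embed(src, vectors), embed(other, vectors))
```

In this toy setting the two renumbered copies of the same function score far higher than the unrelated pair, because normalization removes register names before the vectors are built. A real system would need trained embeddings and true graph-derived contexts to cope with differences in optimization level and target architecture.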
Keywords/Search Tags: Source Code, Binary Code, Intermediate Representation, Natural Language Processing, Similarity Comparison