Research On Software Supply Chain Analysis Technology From The Perspective Of Gene

Posted on: 2021-04-06
Degree: Master
Type: Thesis
Country: China
Candidate: W J Sun
GTID: 2428330623982216
Subject: Computer Science and Technology
Abstract/Summary:
With the growth of the open source community, more and more open source code and third-party components have appeared on the Internet. This code and these components provide basic support for application development. Developers often search open source repositories for existing components and import existing functional code into their own projects to improve development efficiency. However, developers usually do not know the full list of components used in their projects, so they are not aware of the security risks those components carry. In recent years, attacks on the software supply chain have been reported more and more often. The wide impact of vulnerabilities such as Heartbleed and Ghost stems from known vulnerabilities in the software supply chain that were left unpatched. In-depth analysis of the software supply chain is therefore critical to software security.

To address these risks, and in particular the supply chain security of binary files, this thesis studies software supply chain analysis from a genetic perspective. We propose two methods, semantic embedding of software genes and a moving distance for gene graphs, to compare the similarity of binaries at the basic-block level and the function level. On this basis, we analyze the multi-level dependencies of a binary file to obtain the list of components in its software supply chain, and then detect possible known vulnerabilities in that supply chain. The main research results of this thesis are as follows:

1. We propose a semantic embedding method for software genes that generates a semantic encoding of each gene and solves the problem of comparing binary semantic similarity across instruction sets at basic-block granularity. We segment assembly instructions along their control flow relationships to obtain the smallest functional unit, which we abstract as the software gene. Inspired by machine translation models, we design a gene semantic embedding model and train its encoder to extract the semantics of software genes across instruction sets, so that genes from different instruction sets are encoded as semantic vectors in the same vector space. Experiments show that the semantic vectors obtained in this way retain as much of the semantic information of the assembly sequence as possible. In the task of semantic similarity matching of assembly sequences, our method is more accurate than current mainstream methods, and the p@10 metric reaches 94.9%.

2. We propose a graph moving distance algorithm and apply it to the comparison of function gene graphs, which achieves similarity matching of binary functions and solves the similarity comparison problem for binary files at function granularity. We use a graph attention neural network to learn the spatial structure of the nodes in the graph and encode each node into an embedding that captures its spatial structure and its neighbor information. On top of these node embeddings, we extend the "Earth Mover's Distance" (EMD) to the graph matching problem and propose the graph moving distance as an indicator of graph similarity for matching two graphs. With this method, the moving distance between similar graphs is smaller and the distance between dissimilar graphs is larger. Experiments show that the method can effectively evaluate the similarity between two function gene graphs. In the task of matching functions compiled with the O2 and O3 optimization options, the p@10 metric reaches 87.8%.

3. We propose a method for detecting known vulnerabilities in the supply chain, which realizes supply chain analysis of binary files and solves the problem of detecting known vulnerabilities in the software supply chain. Based on the similarity of function gene graphs, we scan the third-party components contained in a binary file and analyze their multi-level supply chain relationships. We then collect the vulnerability information of those third-party components and compare the genetic differences of each vulnerable function before and after the vulnerability was repaired, to determine whether the function in the third-party component is a vulnerable version. Finally, we conduct a case analysis on the open source project Safe Board Messenger, which shows that the method can effectively analyze the complete supply chain relationship of the software and detect the known vulnerabilities in it.
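As a rough illustration of the idea behind the graph moving distance in item 2, the sketch below computes an Earth Mover's Distance between two sets of node embeddings. It is not the thesis's algorithm: the abstract gives no details on node weights or the ground distance, so the uniform weight per node, the Euclidean ground distance, the plain float vectors standing in for graph attention embeddings, and the function name `graph_moving_distance` are all assumptions made for this example. The transport problem is solved with a small successive-shortest-path min-cost flow so that the code needs only the standard library.

```python
import math
from math import gcd


def graph_moving_distance(a, b):
    """EMD between two sets of node embeddings (hypothetical sketch).

    Assumes every node carries equal weight 1/len(set) and that the
    ground distance is Euclidean; neither is stated in the abstract.
    """
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both graphs need at least one node")
    units = n * m // gcd(n, m)  # integer mass so supplies and demands divide evenly
    src, sink, size = 0, n + m + 1, n + m + 2
    # Each edge is [target, residual capacity, cost, index of reverse edge].
    graph = [[] for _ in range(size)]

    def add_edge(u, v, cap, cost):
        graph[u].append([v, cap, cost, len(graph[v])])
        graph[v].append([u, 0, -cost, len(graph[u]) - 1])

    for i in range(n):
        add_edge(src, 1 + i, units // n, 0.0)
        for j in range(m):
            add_edge(1 + i, 1 + n + j, units, math.dist(a[i], b[j]))
    for j in range(m):
        add_edge(1 + n + j, sink, units // m, 0.0)

    total_cost, flow = 0.0, 0
    while flow < units:
        # Bellman-Ford shortest path in the residual graph.
        dist = [math.inf] * size
        prev = [None] * size
        dist[src] = 0.0
        for _ in range(size - 1):
            changed = False
            for u in range(size):
                if dist[u] == math.inf:
                    continue
                for k, (v, cap, cost, _rev) in enumerate(graph[u]):
                    if cap > 0 and dist[u] + cost < dist[v] - 1e-12:
                        dist[v] = dist[u] + cost
                        prev[v] = (u, k)
                        changed = True
            if not changed:
                break
        # Find the bottleneck along the path, then push that much flow.
        push, v = units - flow, sink
        while v != src:
            u, k = prev[v]
            push = min(push, graph[u][k][1])
            v = u
        v = sink
        while v != src:
            u, k = prev[v]
            graph[u][k][1] -= push
            graph[v][graph[u][k][3]][1] += push
            total_cost += push * graph[u][k][2]
            v = u
        flow += push
    return total_cost / units  # normalize back to total mass 1


# Toy embeddings: g2 is a slightly perturbed copy of g1, g3 lies far away.
g1 = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
g2 = [(0.1, 0.0), (1.0, 0.1), (0.0, 0.9)]
g3 = [(5.0, 5.0), (6.0, 5.0), (5.0, 6.0)]
similar = graph_moving_distance(g1, g2)
dissimilar = graph_moving_distance(g1, g3)
```

On these toy inputs the perturbed copy ends up closer to `g1` than the distant set does, which mirrors the property the abstract describes: similar graphs get a smaller moving distance than dissimilar ones. A real system would more likely use a dedicated optimal transport solver and learned node weights.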
Keywords/Search Tags: Software supply chain, Software similarity, Machine translation model, Graph attention neural network, Vulnerability detection