| Traditional software vulnerability mining techniques include fuzz testing, symbolic execution, model checking, and taint analysis. In recent years, the rise of big data and machine learning has prompted a new round of research on software vulnerability mining. At present, research on vulnerability mining using machine learning or data mining falls into three categories: vulnerability mining based on software metrics, vulnerability prediction based on anomaly detection, and vulnerability prediction based on vulnerable code pattern recognition. However, most previous research targeted source code, and only a small percentage addressed binary software. Moreover, most of the research is coarse-grained: component-level vulnerability mining is not conducive to pinpointing the specific location of a vulnerability, and function-level vulnerability mining confines the vulnerability pattern to a single function. This paper mainly studies the following: (1) We analyze the causes of current mainstream software vulnerabilities and summarize the assembly-level characteristics of buffer overflow, integer overflow, and use-after-free vulnerabilities. We then introduce common methods of static and dynamic program analysis, the signals received during a crash, and some software protection mechanisms, as well as how to trace and analyze a program automatically. (2) In terms of sample collection, this paper considers the function call sequence insufficient to describe a vulnerability and therefore proposes a sample collection scheme at code-block granularity. For vulnerable programs, we extract code snippets from the data entry point to the program crash. For normal programs, we extract code snippets from the data entry point to the program exit. During extraction, we solve the problem that having all security protections enabled makes it impossible to extract
the PLT table. We also propose an algorithm to reduce the number of repeated iterations of the same code segment in loops. To address the imbalance in the collected samples, this paper proposes an oversampling technique for assembly code blocks. (3) For the assembly code block sequence, we design a machine learning model based on Block2Vec. Previous work fed the Doc2Vec model in units of assembly instructions, whereas this paper feeds it in units of basic blocks, hence the name Block2Vec. To train the Doc2Vec model, we use the deduplicated set of all assembly code blocks from all collected programs as training data. After training, each code block in a sample is represented by a vector, and code blocks of the same pattern have similar vector representations. Finally, the vector representation of a sample is obtained by concatenation and dimension unification. To evaluate the assembly code block sequence samples, this paper constructs an LSTM network and a Text-CNN network; the assembly code block sequence samples and the function call sequence samples undergo the same processing as above and are used to train LSTM and Text-CNN classification models, respectively. The results show that the Text-CNN model using the assembly code block sequence performs well, with an accuracy of 96.3%. |
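
The abstract does not spell out the loop-reduction algorithm or the dimension-unification step, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes a trace is a list of basic-block identifiers, that a loop shows up as a consecutively repeated subsequence, and that dimension unification means truncating or zero-padding the concatenated block vectors. The names `collapse_loops`, `to_fixed_length`, `max_period`, `keep`, and `max_blocks` are hypothetical.

```python
from typing import List, Sequence


def collapse_loops(trace: Sequence[str], max_period: int = 4, keep: int = 1) -> List[str]:
    """Limit consecutive repetitions of any block subsequence (an assumed
    loop body of length 1..max_period) to at most `keep` copies."""
    out = list(trace)
    for period in range(1, max_period + 1):
        result: List[str] = []
        i = 0
        while i < len(out):
            body = out[i:i + period]
            reps = 1
            # Count how many times `body` repeats back to back.
            while (len(body) == period
                   and out[i + reps * period:i + (reps + 1) * period] == body):
                reps += 1
            if reps > 1:
                result.extend(body * min(reps, keep))
                i += reps * period
            else:
                result.append(out[i])
                i += 1
        out = result
    return out


def to_fixed_length(block_vectors: Sequence[Sequence[float]],
                    max_blocks: int, dim: int) -> List[float]:
    """Concatenate per-block vectors, then truncate to `max_blocks` blocks
    or zero-pad so every sample has length max_blocks * dim (an assumption)."""
    flat = [x for vec in block_vectors[:max_blocks] for x in vec]
    flat.extend([0.0] * (max_blocks * dim - len(flat)))
    return flat


if __name__ == "__main__":
    trace = ["A", "B", "C", "B", "C", "B", "C", "D"]
    print(collapse_loops(trace))
    print(to_fixed_length([[1.0, 2.0], [3.0, 4.0]], max_blocks=3, dim=2))
```

In this sketch the per-block vectors would come from the trained Doc2Vec model; plain lists of floats stand in for them here so that the example stays self-contained.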