| The explosive growth of Android malware seriously threatens users' private data and property. Detecting Android malware through code auditing is therefore of great significance. Detected malware can further be classified into families, which makes it easier to study how malware families iterate and mutate across versions and helps in formulating corresponding defensive measures. Because Android malware usually calls specific APIs to access the system when performing dangerous operations, and API call information can be obtained through static or dynamic analysis, many methods extract API call information to detect Android malware. API features mainly take the form of API function call graphs or API call sequences. However, a function call graph is non-Euclidean structured data and cannot be processed directly by traditional machine learning algorithms, while the spectral graph convolutional network (GCN) can only process undirected function call graphs, which loses the directed call relationships. For API call sequence features, existing methods mostly consider only the one-way semantic correlation of API sequences and do not fully mine the bidirectional semantic information of APIs. In addition, malicious API sequence segments account for only a small part of the overall sequence, and the sequence contains many redundant APIs, which makes feature extraction difficult. For malware family classification, existing methods mostly extract the behavioral features common to all samples in a family, but as malicious families continue to evolve, applications of the same family may carry out the same malicious activity in different ways. Moreover, existing deep learning-based classification models mostly rely on a large number of labeled samples for training, so when family samples are scarce the model is prone to overfitting. In response to these problems, a behavioral analysis model is
proposed. The model combines static detection based on the function call graph with dynamic detection based on the API call sequence. The main work and innovations of this thesis are as follows: (1) To solve the problem that the spectral graph convolutional network cannot handle the directed function call graph in static analysis, a directed graph convolutional network is proposed to extract aggregated API node features. Because the function call graph is large and contains much redundant information, the directed function call graph is reconstructed by extracting the API sets related to dangerous permissions. To process the directed function call graph effectively, the traditional spectral graph convolutional network is improved: the asymmetric adjacency matrix of the directed graph is converted into three symmetric graph matrices at different levels. This extracts the aggregated API node features effectively while also expanding the receptive field of the convolutional network and avoiding an excessive number of training parameters. Results show that the method has strong classification capability, with accuracy and F1-Score reaching 0.9504 and 0.9499, respectively. (2) To solve the problems that too many redundant APIs make it difficult to extract malicious sequence segments and that the contextual semantic information of the API call sequence is not fully mined, a dynamic analysis method combining Bi-LSTM and the self-attention mechanism is proposed. To address the many redundant API nodes, N-gram is used to segment the API time series, and the TF-IDF algorithm is used to extract a set of sensitive short API sequences to reduce redundancy. Word2vec is then used to obtain a dense vector for each short sequence. In the feature extraction stage, the API sequence vectors of the application are fed into Bi-LSTM to learn bidirectional semantic information, and the output hidden states are assigned different weights through the self-attention mechanism to effectively extract the API
sequence features of the application and realize malware detection. Tests show that accuracy and F1-Score on the dataset reach 0.9293 and 0.9288, respectively; when the static analysis is combined with the dynamic analysis, they increase to 0.9711 and 0.9698. (3) To solve the problem that existing family classification methods extract only a single behavioral prototype per family and cannot effectively handle small-sample malicious families or unknown new families, a small-sample family classification method based on meta-learning is proposed. To improve classification accuracy by building multiple prototypes per family, multiple prototype features are created for each malicious family through the prototype network, based on the sequence features extracted by the Bi-LSTM model above. In the malware family classification stage, meta-learning over N-way K-shot tasks is used to continuously train the classifier so that it acquires prior knowledge and overcomes overfitting on small samples. A malicious application is matched to a family by measuring the distance between it and each family prototype. If no match succeeds, the application is identified as belonging to an unknown new family, and its family prototype features are extracted and added to the model. Tests show that accuracy and macro F1-Score are 0.9756 and 0.9731, respectively, which are greatly improved compared with other classifiers. In classification of families with few samples, the average accuracy reaches 0.9311. |
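
The conversion of an asymmetric adjacency matrix into symmetric matrices, described in contribution (1), can be illustrated with a minimal sketch. The abstract does not say which three matrices are used, so the choice below (a first-order symmetrization plus two second-order matrices counting shared callees and shared callers) is an assumption, as are the function names and the toy call graph.

```python
def transpose(m):
    """Return the transpose of a square matrix given as a list of lists."""
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def symmetric_views(adj):
    """Derive three symmetric matrices from a directed adjacency matrix.

    adj[i][j] == 1 means function i calls function j.
    ASSUMPTION: the abstract does not define the three matrices; these are
    one plausible choice, not necessarily the thesis's construction.
    """
    at = transpose(adj)
    n = len(adj)
    # First-order view: average of A and its transpose (direction dropped).
    first_order = [[(adj[i][j] + at[i][j]) / 2 for j in range(n)] for i in range(n)]
    # Second-order view: (A A^T)[i][j] counts callees shared by i and j.
    shared_callees = matmul(adj, at)
    # Second-order view: (A^T A)[i][j] counts callers shared by i and j.
    shared_callers = matmul(at, adj)
    return first_order, shared_callees, shared_callers


def is_symmetric(m):
    """Check that a matrix equals its transpose."""
    return m == transpose(m)


# Hypothetical toy call graph: 0 -> 1, 0 -> 2, 1 -> 2.
toy_adj = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]
first_order, shared_callees, shared_callers = symmetric_views(toy_adj)
```

All three outputs are symmetric, so a spectral graph convolution can be applied to each, while the second-order matrices still record which direction the calls ran in.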
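
The N-gram segmentation and TF-IDF weighting in contribution (2) can be sketched as follows. This is only an illustration under stated assumptions: the abstract gives neither the value of N nor the exact TF-IDF variant or selection rule, so the sketch uses bigrams, relative term frequency, and an unsmoothed logarithmic IDF. The API names in the toy traces are hypothetical.

```python
import math
from collections import Counter


def ngrams(seq, n):
    """Split an API call sequence into overlapping short sequences of length n."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def tfidf_scores(traces, n=2):
    """Score every n-gram of every trace with TF-IDF.

    ASSUMPTION: tf = count / number of n-grams in the trace,
    idf = log(number of traces / number of traces containing the n-gram).
    The thesis may use a different variant.
    """
    docs = [Counter(ngrams(trace, n)) for trace in traces]
    doc_freq = Counter(gram for doc in docs for gram in doc)
    total = len(docs)
    scores = []
    for doc in docs:
        length = sum(doc.values())
        scores.append({
            gram: (count / length) * math.log(total / doc_freq[gram])
            for gram, count in doc.items()
        })
    return scores


# Hypothetical API traces from two applications.
toy_traces = [
    ["open", "read", "send"],
    ["open", "read", "close"],
]
toy_scores = tfidf_scores(toy_traces, n=2)
```

In this toy input, the short sequence shared by both traces scores zero, while the sequences unique to one trace score higher. This is how the weighting pushes common, redundant API segments out of the candidate set.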
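
The prototype matching step in contribution (3) can be sketched in the same spirit. The abstract does not state the distance metric or how a failed match is decided, so Euclidean distance and a fixed rejection threshold are assumptions here, as are the function name, the family names, and the two-dimensional toy embeddings.

```python
import math


def nearest_family(embedding, prototypes, threshold):
    """Match an application embedding to the closest family prototype.

    prototypes maps a family name to a list of prototype vectors, so a
    family may have several prototypes. Returns (family, distance), with
    family set to None when no prototype lies within the threshold.
    ASSUMPTION: Euclidean distance and a fixed threshold; the thesis may
    use another metric or rejection rule.
    """
    best_family, best_dist = None, math.inf
    for family, protos in prototypes.items():
        for proto in protos:
            dist = math.dist(embedding, proto)
            if dist < best_dist:
                best_family, best_dist = family, dist
    if best_dist > threshold:
        return None, best_dist  # treated as an unknown new family
    return best_family, best_dist


# Hypothetical prototypes: family_a has two behavioral prototypes.
toy_prototypes = {
    "family_a": [(0.0, 0.0), (1.0, 0.0)],
    "family_b": [(5.0, 5.0)],
}
known = nearest_family((0.9, 0.1), toy_prototypes, threshold=1.0)
unknown = nearest_family((10.0, 10.0), toy_prototypes, threshold=1.0)
```

When `family` comes back as `None`, the caller would derive a prototype from the unmatched sample and add it to `toy_prototypes` under a new family name, mirroring the handling of unknown families described above.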