Research On Android Malware Classification Method Based On Traffic Fingerprint

Posted on: 2022-07-20
Degree: Master
Type: Thesis
Country: China
Candidate: J Deng
GTID: 2518306524489554
Subject: Master of Engineering
Abstract/Summary:
With the spread of mobile devices, smartphones have attracted a very large user base because of their powerful functions. The Android system is popular with developers because it is open source and free, and it holds a large market share. This openness also creates opportunities for Android malware, so an effective malware detection method is needed. Few studies detect malicious code through traffic analysis. The more common Android malware identification and classification methods rely on static program analysis, which identifies and categorizes software by analyzing features such as API calls and permissions. This type of analysis requires reverse engineering and decompiling the software, and it is easily bypassed by malware that uses code-level obfuscation. This paper instead uses traffic analysis, a dynamic analysis method, to capture the traffic generated while Android software is running, and applies machine learning and deep learning to identify and classify Android malware. The method offers high identification and classification accuracy, flexibility and applicability, and resistance to code-level static obfuscation. The main work of this paper is as follows:

1. A machine learning algorithm is selected to build an effective traffic fingerprint detection model, which is also suitable for encrypted traffic. Two scenarios are evaluated: distinguishing benign traffic from malicious traffic, and distinguishing among the types of malicious traffic. The framework mainly comprises capturing application communication traffic, segmenting traffic files by session or flow, preprocessing, feature engineering, and classification based on machine learning algorithms. To better separate Scareware and Adware, the two types of traffic that are most heavily confused with each other, an additional confusion classifier is designed. Together with the first classifier it forms a multilevel classifier that improves classification accuracy.

2. For the deep learning detection framework, a method for removing third-party traffic is designed and introduced to improve the model's operating efficiency and detection accuracy. The original traffic data is then segmented by session and converted into a grayscale image that represents the characteristics of the raw flow data, with a two-dimensional matrix as the data structure of the grayscale image. Because a CNN learns the spatial structure information in a two-dimensional matrix well in classification tasks, a CNN is used as the neural network model to automatically extract spatial features from the traffic grayscale images. In addition, since a session is essentially a linear sequence of data packets ordered by the time of the traffic interaction, a two-layer LSTM (a type of RNN) is used to automatically extract the temporal features of the traffic. Finally, the two trained deep learning models are used to identify and classify the malware under test.

3. The experimental data is the CICAndMal2017 dataset, which was collected in a real environment. The machine learning model is evaluated in two scenarios. The experimental results show 98.8% accuracy for the binary classification of malicious versus benign traffic and 95.2% accuracy for the multi-class classification of specific malicious traffic types. Under the CNN and two-layer LSTM deep learning models, removing third-party traffic greatly improves malware classification and identification on the test set: CNN accuracy increases from 88.2% to 96.8%, and the improvement for the LSTM is more pronounced, from 89.2% to 98.3%. However, deep learning models usually require a large amount of training data to achieve good results. Real-world problems often involve small sample sizes, and in such cases machine learning is more suitable than deep learning.
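The session segmentation and grayscale conversion described in point 2 can be illustrated with a minimal Python sketch. It is not the thesis implementation: the packet record layout, the function names, the direction-independent session key, and the 28x28 image size are all assumptions chosen for illustration, since the abstract does not state them.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical packet record (assumption, not from the thesis):
# (src_ip, src_port, dst_ip, dst_port, protocol, payload)
Packet = Tuple[str, int, str, int, str, bytes]
SessionKey = Tuple[Tuple[str, int], Tuple[str, int], str]


def session_key(pkt: Packet) -> SessionKey:
    """Build a direction-independent key so both directions of a
    conversation fall into the same session."""
    src_ip, src_port, dst_ip, dst_port, proto, _ = pkt
    a, b = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    return (a, b, proto)


def split_by_session(packets: Iterable[Packet]) -> Dict[SessionKey, bytes]:
    """Concatenate payloads per session, preserving packet order."""
    sessions: Dict[SessionKey, bytearray] = defaultdict(bytearray)
    for pkt in packets:
        sessions[session_key(pkt)] += pkt[5]
    return {key: bytes(buf) for key, buf in sessions.items()}


def to_gray_image(data: bytes, side: int = 28) -> List[List[int]]:
    """Truncate or zero-pad to side*side bytes, then reshape into a
    two-dimensional matrix in which each byte is one pixel (0-255).
    The default side of 28 is an assumed value."""
    size = side * side
    fixed = data[:size].ljust(size, b"\x00")
    return [list(fixed[r * side:(r + 1) * side]) for r in range(side)]


if __name__ == "__main__":
    packets = [
        ("10.0.0.2", 40000, "1.2.3.4", 443, "TCP", b"\x01\x02"),
        ("1.2.3.4", 443, "10.0.0.2", 40000, "TCP", b"\x03"),
    ]
    for key, payload in split_by_session(packets).items():
        image = to_gray_image(payload, side=2)
        print(key, image)
```

Each resulting matrix could be fed to a CNN as a single-channel image, while the same per-session byte sequence, kept in packet order, could serve as input to a sequence model such as an LSTM.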
Keywords/Search Tags: Traffic analysis, Android malware, Machine learning, Deep learning