| Since the beginning of the twenty-first century, science and technology have developed rapidly, and Internet technology, one of the major inventions of this period, has been integrated into every aspect of people’s lives. With the growing popularity of smartphones, people enjoy the convenience of the mobile Internet, but some malicious actors seek personal profit by releasing malware and spreading malicious behavior. The Android system, with its convenient operation and high degree of freedom for secondary development, has quickly taken over the majority of the smartphone market. At the same time, Android malware keeps emerging, which has drawn increasing attention to the identification and classification of Android malware. In recent years, using network traffic to identify and classify Android malware has gradually become a mainstream approach because it is fast and accurate. However, previous studies usually relied on detection schemes based on ports, deep packet inspection, and similar techniques. With the frequent use of obfuscation and encryption in modern software, many of these methods tend to fail, and how to identify and classify Android malware from encrypted traffic has become both a difficult and an important research problem. Recently, researchers have mostly used machine learning detection methods, which can make full use of encrypted traffic features but are susceptible to problems such as insufficient preprocessing of traffic features, improper feature engineering, classifiers that are poorly matched to their application scenarios, and weak model generalization. To solve these problems, this thesis proposes a pure feature generation method based on host-level traffic. The method uses multiple data cleaning algorithms to purify the traffic, uses multidimensional feature mining to extract effective features, combines random forest and principal component analysis for feature screening and dimensionality reduction, and finally uses feature aggregation algorithms to obtain pure host-level traffic features. This thesis then proposes a two-layer Android malware classification and detection model based on Stacking and TrAdaBoost. The model uses the Stacking idea to build an ensemble classifier that exploits the strengths of each base classifier in different classification scenarios, and constructs a multi-source weighted TrAdaBoost transfer learning algorithm that learns about recently emerged malware in the target dataset while retaining the large amount of prior knowledge in the source dataset, thereby improving the generalization ability of the model. This thesis achieves 98.9% Android malware classification accuracy on the CICInvesAndMal2019 dataset, about 14.5% higher than when the proposed pure host-level feature generation method is not used. Compared with several traditional machine learning and ensemble learning algorithms, classification accuracy improves by about 7.4%. Subsequent generalization experiments on a dataset constructed from CICAndMal2020 and malware-traffic-analysis.net show that accuracy improves by about 10.9% compared with the model without transfer learning, a significant gain in model robustness. The experiments show that the proposed classification and detection model outperforms other research solutions and generalizes well, making it an effective method for Android malware identification and classification. |
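
The abstract does not spell out the multi-source weighted TrAdaBoost variant, so the following is only a minimal sketch of the instance reweighting step in the standard single-source TrAdaBoost algorithm, which that variant presumably builds on. The function name `tradaboost_reweight`, its parameters, and the clipping of the error rate are illustrative assumptions, not the thesis's implementation. The idea it illustrates: misclassified source samples lose weight (they look unlike the target data), while misclassified target samples gain weight (the next weak learner should focus on them).

```python
import math


def tradaboost_reweight(src_w, tgt_w, src_wrong, tgt_wrong, n_rounds):
    """One round of the standard TrAdaBoost weight update (hypothetical helper).

    src_w, tgt_w: current instance weights for source and target samples.
    src_wrong, tgt_wrong: booleans, True where the weak learner misclassified.
    n_rounds: total number of boosting rounds, which fixes the source decay rate.
    """
    # Weighted error of the weak learner, measured on target samples only.
    eps = sum(w for w, bad in zip(tgt_w, tgt_wrong) if bad) / sum(tgt_w)
    # Assumption: clip eps into (0, 0.5) so beta_t stays well defined.
    eps = min(max(eps, 1e-10), 0.499)
    beta_t = eps / (1.0 - eps)
    # Fixed decay factor for source samples; depends on source size and rounds.
    beta_src = 1.0 / (1.0 + math.sqrt(2.0 * math.log(len(src_w)) / n_rounds))

    # Misclassified source samples are down-weighted by beta_src.
    new_src = [w * beta_src if bad else w for w, bad in zip(src_w, src_wrong)]
    # Misclassified target samples are up-weighted by 1 / beta_t.
    new_tgt = [w / beta_t if bad else w for w, bad in zip(tgt_w, tgt_wrong)]

    # Renormalize so that all weights sum to 1.
    total = sum(new_src) + sum(new_tgt)
    return [w / total for w in new_src], [w / total for w in new_tgt], beta_t


# Toy usage: four source and four target samples, one error in each group.
src, tgt, beta_t = tradaboost_reweight(
    [1.0] * 4, [1.0] * 4,
    [True, False, False, False],
    [True, False, False, False],
    n_rounds=10,
)
```

In this toy run the misclassified source sample ends up with less weight than its correctly classified neighbours, and the misclassified target sample ends up with more, which is how knowledge from the older source dataset is retained without letting it dominate the newer target data.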