| In recent years, cyber-attacks have become increasingly professional, organized, and precise. Further, covert data theft (CDT) has spread to key domains such as communication, aviation, finance, and energy, causing extensive economic damage to citizens, companies, and even entire countries. In addition, the leakage of sensitive information concerning state secrets, military intelligence, and scientific and technological equipment would directly threaten the development of the national economy as well as the stability of homeland security. CDT is defined as the use of a series of covert communication technologies (CCT) to bypass firewall restrictions or evade detection systems in order to steal users' data. By using TLS (Transport Layer Security) encrypted covert communication based on network tunneling technology, malware can carry out CDT with two traits, namely protocol camouflage and an encrypted payload. Meanwhile, malware that uses the TLS protocol can submerge its communication payload in the massive amount of benign network traffic, which makes identifying covert communication behaviors very challenging. In this thesis, we focus on the scenario of identifying encrypted and covert communication behaviors (ECCB) based on the TLS protocol, and carry out research on the key issues in detecting ECCB. The main work and contributions are summarized as follows. 1. With respect to the rapid preprocessing of massive encrypted network traffic, and to alleviate the extreme imbalance between positive and negative samples in TLS communication, this thesis proposes a novel batch preprocessing approach for encrypted network communication flows based on a semi-supervised model. On the one hand, since there are few methods that can extract signatures from encrypted network traffic, traditional detection systems based on DPI (deep packet inspection) technology fail to handle encrypted communication
flows. On the other hand, machine-learning-based approaches usually treat a single flow as the processing unit, which makes batch processing impossible. Moreover, because various kinds of network traffic, including unknown ones, keep emerging, it is difficult for a supervised model to identify untrained samples, while the recognition accuracy of an unsupervised model tends to stagnate. To remedy these issues, this thesis gathers the conversation flows belonging to the same server application, extracts the features of multiple flows, and performs unified calculation and identification, thus achieving batch processing of conversation flows. Meanwhile, a semi-supervised clustering method is presented for mining benign TLS flows, including untrained samples, in batches, with reverse nearest neighbor DBSCAN (RNN-DBSCAN) as the clustering algorithm. To address the high time complexity of this algorithm, an improved scheme that uses the k-nearest-neighbor algorithm to compute an approximate solution is proposed, which forms the Modified RNN-DBSCAN (MRD) algorithm and reduces the time complexity from O(k × n^2) to about O(n^1.5). In the feature selection stage, the spectral density algorithm is used to perform rough feature selection on the unlabeled feature set, and the proposed Semi-supervised Feature Selection (SMFS) algorithm is then applied iteratively to select a better feature subset from the remaining features. The experimental results show that, using the SMFS algorithm, 23 effective features are selected from 500 original features. Tests in a real network environment show that the proposed method can reliably mine the TLS flows generated by benign applications, whose network flows account for 67.36% of the entire network traffic. In addition, compared with traditional methods, the proposed method is 87 times faster than the single-flow-based supervised model, at least 3 times faster than the single-flow-based unsupervised model, and nearly 61
times faster than the single-flow-based semi-supervised model. Meanwhile, the purity of the benign flows mined by the proposed method is higher than that of other existing preprocessing methods, which verifies the reliability and efficacy of the proposed method. 2. Regarding efficient and accurate detection of ECCB, and to optimize the multi-layer architecture for identifying the malware family of TLS flows, this thesis proposes a detection method consisting of two supervised models within a two-layer detection framework. A single-layer detection architecture has difficulty dealing with complex problems, while a multi-layer detection architecture can easily introduce new problems such as decreased detection efficiency and an increased false negative rate. The proposed solution optimizes the feature set and uses different features for different classification tasks, which can greatly improve detection efficiency while maintaining detection accuracy. In the prior model of the first layer, a binary classifier is built to distinguish benign from malicious flows on the basis of TLS handshake features. In the posterior model of the second layer, a malware family classification model is trained by combining some effective statistical features with the TLS features presented in this thesis. Both the prior and posterior models use the random forest algorithm, which is based on the Bagging ensemble learning method. To compare the pros and cons of different feature subsets more efficiently, a modified Wrapper method is designed that can automatically select a locally optimal feature subset. In the experiments, firstly, the prior model achieves 99.78% average accuracy and a 0.09% false positive rate, which outperforms competing methods. The experiments also demonstrate that, by setting a reasonable threshold, most benign flows can be excluded without increasing the false negative rate. Secondly, to better solve the multi-classification problem of
identifying different malware families of TLS flows, experiments on the performance of the “Multi-class classifier” scheme and the “One vs All” scheme are conducted, which verify that the “Multi-class classifier” scheme achieves remarkably higher efficiency with little loss in detection performance. Thirdly, a comparison of the proposed two-layer detection framework with the single-layer detection framework shows that the overall efficiency of the two-layer framework is superior to that of the single-layer one as long as two prerequisites are satisfied, namely that the detection rate of the prior model is higher than that of the posterior model and that a mathematical inequality given in this thesis holds. The detection rate of the two-layer detection framework is 99.45%, with a 188% increase in detection efficiency, which demonstrates the advantages of the proposed framework. Finally, the superiority of the proposed framework is further substantiated through comparison experiments with three other types of frameworks. 3. In terms of identifying deeply disguised ECCB, and to combat malicious, deeply disguised TLS samples, a novel approach based on service channel features and a deep learning model is proposed. To observe the sensitive behavior of TLS conversation flows more effectively, the proposed approach expands the expression level of the training samples in both the time and space dimensions. That is, in the time dimension, the time window for observing TLS conversation flows is lengthened; in the space dimension, the TLS flows are accumulated so that their maliciousness is evaluated jointly. To the best of our knowledge, this is the first work to discover the difference in consistency between multiple TLS flows on the same handshake fields contained in the TLS service channel, and a new TLS feature set is designed based on these differences. In terms of model improvement, the Wide & Deep deep learning framework is modified by adding a clustering layer, and the MWD (Modified Wide & Deep) deep learning framework is
formed. Moreover, to verify the superiority of the MWD model, this thesis also uses a genetic-algorithm-based random forest detection model and selects 304 valid features from 564 candidate features. In the experiments, compared with the random forest detection model, the MWD deep learning framework improves the performance of the detection model, and the comprehensive performance index, the F1-score, increases by 1.95%. Next, to obtain a more effective and stable classifier, an experiment is carried out to determine the minimum number of TLS flows that need to be accumulated in a service channel. On the collected sample set, we find that the detection performance approaches stability when the number of flows is more than 9. Finally, comparison experiments show that our method performs slightly worse than other methods on ordinary ECCB but is more effective on deeply disguised ECCB, with a detection rate of 93.48% and a false positive rate of 2.63%. By contrast, the detection rates of the other three methods are 84.05%, 71.23%, and 85.04%, respectively, and their false positive rates are an order of magnitude higher than the 2.63% achieved by our method, which demonstrates the unique advantage of our method in handling deeply disguised ECCB. 4. To verify the generalization of the proposed method based on service channel features, the method is applied to explore the possible existence of ECCB in other types of protocols. Theoretically, as long as two prerequisites are met, the proposed method could be used to model maliciousness evaluation in any kind of service channel, e.g., HTTP, FTP, and other private-protocol-based service channels. The first is that the main transmission direction of the network traffic differs between benign applications and malware. The second is that a consistency difference exists in the content of the plaintext fields of
aggregated flows in the service channel. To verify the generalization of this method, this thesis takes the HTTP protocol as an example: HTTP service channel features are extracted by imitating the identification method for malicious TLS service channels, and a classification model for HTTP samples is trained. The experiments on identifying malicious HTTP channels yield an average detection rate of 96.74% and an average false positive rate of 2.28%, which substantiates the superiority and generalization of the proposed method compared with two other approaches on the same data set. Finally, by applying the key technologies proposed in this thesis, a solution for detecting ECCB is designed. To solve the problem of multi-model collaborative judgment, a comprehensive judgment model (CJM) for malicious TLS flows based on the logistic regression algorithm is proposed, and the effectiveness of the CJM is verified by experiments. Meanwhile, experimental calculations demonstrate, to some extent, the feasibility of applying this system in a real network environment. |
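
The consistency prerequisite for service channel features, namely that the plaintext fields of flows aggregated in one channel differ in how consistent they are, can be illustrated with a minimal sketch. This is not the feature set of the thesis: the channel key (client, server, port), the field name `cipher_suites`, and the two scores (share of the most common value, number of distinct values) are assumptions chosen for illustration. The minimum of 10 flows per channel follows the abstract's observation that detection performance stabilizes once more than 9 flows are accumulated.

```python
from collections import Counter, defaultdict


def field_consistency(values):
    """Score how consistent one plaintext field is across a channel's flows.

    Returns (dominant_share, distinct_count): the fraction of flows that
    carry the most common value, and the number of distinct values seen.
    """
    counts = Counter(values)
    dominant_share = max(counts.values()) / len(values)
    return dominant_share, len(counts)


def channel_features(flows, fields, min_flows=10):
    """Aggregate flows into channels and score each field per channel.

    Hypothetical layout: each flow is a dict with "client", "server" and
    "port" keys plus the plaintext fields named in `fields`.
    """
    channels = defaultdict(list)
    for flow in flows:
        channels[(flow["client"], flow["server"], flow["port"])].append(flow)

    features = {}
    for key, group in channels.items():
        if len(group) < min_flows:  # too few flows to judge consistency
            continue
        features[key] = {
            name: field_consistency([f[name] for f in group]) for name in fields
        }
    return features


# Toy flows, made up purely for illustration.
steady = [{"client": "10.0.0.5", "server": "203.0.113.7", "port": 443,
           "cipher_suites": "c02f,c030"} for _ in range(10)]
mixed = [{"client": "10.0.0.6", "server": "198.51.100.9", "port": 443,
          "cipher_suites": "c02f" if i % 2 else "009c"} for i in range(10)]
short = [{"client": "10.0.0.7", "server": "198.51.100.9", "port": 443,
          "cipher_suites": "c02f"} for _ in range(3)]

result = channel_features(steady + mixed + short, ["cipher_suites"])
```

Per-channel scores of this kind could then serve as inputs to a classifier; which fields are actually informative is an empirical question that the sketch does not answer.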