Research On Key Identification Method Of P2P Traffic

Posted on: 2012-07-18  Degree: Doctor  Type: Dissertation
Country: China  Candidate: J F Peng  Full Text: PDF
GTID: 1488303356972219  Subject: Signal and Information Processing
Abstract/Summary:
Internet traffic identification is a crucial task in managing large networks and a major component of lawful interception. With the rapid development and wide adoption of network technology, more and more applications based on peer-to-peer (P2P) protocols have appeared. The characteristics of P2P techniques, including high resource utilization and the lack of any need for centralized storage, have accelerated their adoption in file sharing, distributed computing, collaborative systems, and e-commerce. Because large-scale P2P applications occupy an ever-growing share of network bandwidth, more than 70% of all traffic in China, identifying P2P traffic is urgent for guaranteeing QoS in network planning and design. Meanwhile, vulnerabilities in P2P applications leave them open to denial-of-service attacks and can intensify the collapse of the Internet. Indeed, the inherent characteristics of P2P systems, such as the decentralized storage structure, the design for convenient file sharing, and the fast routing mechanism, facilitate the spread of Trojans, viruses, and other destructive programs. Therefore, to ensure normal network operation, it is urgent to identify P2P traffic quickly and accurately.
However, popular P2P applications use dynamic ports and encrypted payloads to evade both port-based and signature-based traffic identification. Current state-of-the-art identification techniques are based on either network behavior or machine learning. In this dissertation, the early and fast P2P traffic identification method belongs to the machine learning category, and the improved fast heuristic-based identification method belongs to the behavior-based category.
The early traffic identification algorithm takes the sizes of the first three packets and the server port number of each TCP flow as features and applies supervised learning to classify the traffic. It achieves high accuracy, which makes it suitable for early P2P traffic identification. The improved fast heuristic-based identification method exploits the differences between P2P and non-P2P flows at the transport layer, so it can quickly identify P2P traffic and the specific popular P2P application that generated it. Finally, the response success rate and self-similarity of TCP traffic on P2P application hosts are analyzed.
The main contributions of this dissertation are as follows:
1. To identify P2P traffic quickly, accurately, and as early as possible, an early TCP traffic identification method based on support vector machines (SVM) is proposed for early warning and control of P2P traffic. The method uses the payload sizes of the first three packets and the server port number of each TCP flow as flow features and trains SVMs with a one-against-all classification strategy to classify the traffic. Both theoretical analysis and experimental results show that, given these extracted features and training samples selected under unbiased conditions, the method efficiently classifies Internet traffic into the WEB, MAIL, BitTorrent, and eMule categories. The extracted features do not depend on packet payload content, so the method is suitable for early identification of encrypted traffic.
2. To reduce modeling time and improve classification accuracy, an early and fast P2P traffic identification method based on the C4.5 decision tree is proposed.
Both theoretical analysis and experimental results show that, compared with two other supervised machine learning algorithms, the C4.5 decision tree offers higher accuracy and lower computation time in traffic identification. Therefore, the method, which uses the payload sizes of the first three packets and the server port number of each TCP flow as flow features, can quickly and effectively identify Internet traffic belonging to WEB, MAIL, BitTorrent, and eMule.
3. To improve the accuracy and efficiency of the transport-layer P2P traffic identification method proposed by Karagiannis et al., an improved fast heuristic-based identification method is proposed. The improvements draw on port 4662, effective counting mechanisms, the fixed payload size of the BitTorrent peer protocol handshake message, and the payload characteristics of Skype. Both theoretical analysis and experimental results show that the improved method is more accurate and more efficient. It can identify P2P traffic and the specific applications behind it, such as BitTorrent, eDonkey, and Skype.
4. To identify P2P hosts, the connection characteristics and self-similarity of host TCP traffic are studied. A P2P host acts as both server and client. Non-P2P systems connect using the traditional client/server model and achieve a high connection success rate. In contrast, because P2P systems are dynamic, a P2P host constantly initiates connections to other online hosts to maintain a stable download speed. The parameters associated with system dynamics and connection success rate are: the number of transmitted SYN packets, the number of transmitted SYN/ACK packets, the number of distinct destination IPs of transmitted SYN packets, the number of distinct source IPs of received SYN/ACK packets, the number of distinct destination ports of transmitted SYN packets, and the number of distinct source ports of received SYN/ACK packets.
Both theoretical analysis and experimental results show that the combination of the last four parameters outperforms the other feature combinations when used to identify P2P host TCP flows. The self-similarity of host TCP flows is analyzed at both the behavior scale and the time scale. We conclude that the received packet payloads of a host's TCP traffic change only slightly after the host has received a certain number of packets.
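As an illustration of the six host-level parameters listed under contribution 4, the following is a minimal Python sketch, not the dissertation's implementation. The TcpPacket record, its field names, the host_connection_features function, and the sample trace are all hypothetical. The sketch assumes packets have already been parsed from a capture, and it treats a packet with SYN set and ACK clear as a SYN and a packet with both flags set as a SYN/ACK.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class TcpPacket:
    """Hypothetical, already-parsed TCP packet header (illustrative only)."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    syn: bool
    ack: bool


def host_connection_features(host_ip: str,
                             packets: Iterable[TcpPacket]) -> Dict[str, int]:
    """Compute the six per-host counters named in the abstract."""
    syn_sent = 0
    synack_sent = 0
    syn_dst_ips = set()       # destinations the host tried to reach
    syn_dst_ports = set()
    synack_src_ips = set()    # peers that answered the host
    synack_src_ports = set()

    for p in packets:
        is_syn = p.syn and not p.ack
        is_synack = p.syn and p.ack
        if p.src_ip == host_ip:
            if is_syn:
                syn_sent += 1
                syn_dst_ips.add(p.dst_ip)
                syn_dst_ports.add(p.dst_port)
            elif is_synack:
                synack_sent += 1
        elif p.dst_ip == host_ip and is_synack:
            synack_src_ips.add(p.src_ip)
            synack_src_ports.add(p.src_port)

    return {
        "syn_sent": syn_sent,
        "synack_sent": synack_sent,
        "distinct_syn_dst_ips": len(syn_dst_ips),
        "distinct_synack_src_ips": len(synack_src_ips),
        "distinct_syn_dst_ports": len(syn_dst_ports),
        "distinct_synack_src_ports": len(synack_src_ports),
    }


# Made-up trace: the host opens three connections, one peer answers,
# and the host itself answers one inbound connection attempt.
HOST = "10.0.0.1"
trace = [
    TcpPacket(HOST, 50001, "1.1.1.1", 6881, syn=True, ack=False),
    TcpPacket(HOST, 50002, "2.2.2.2", 6881, syn=True, ack=False),
    TcpPacket(HOST, 50003, "3.3.3.3", 51413, syn=True, ack=False),
    TcpPacket("1.1.1.1", 6881, HOST, 50001, syn=True, ack=True),
    TcpPacket("4.4.4.4", 40000, HOST, 6881, syn=True, ack=False),
    TcpPacket(HOST, 6881, "4.4.4.4", 40000, syn=True, ack=True),
]
features = host_connection_features(HOST, trace)
```

The last four entries of the returned dictionary correspond to the feature combination that the abstract reports as most effective. How these counters are windowed over time or passed to a classifier is not stated in the abstract, so the sketch leaves that open.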
Keywords/Search Tags: P2P, traffic identification, supervised machine learning, support vector machine, decision tree