Research On Server Role Based P2P Nodes Identification Methods

Posted on: 2011-03-12
Degree: Doctor
Type: Dissertation
Country: China
Candidate: F Liu
Full Text: PDF
GTID: 1118360305992007
Subject: Information security
Abstract/Summary:
Peer-to-peer (P2P) networks are booming on the Internet. They are user-friendly but consume a large share of network bandwidth. To ensure fair resource utilization, P2P traffic must be kept under control. P2P identification, a prerequisite for P2P traffic control, has therefore become an important and open issue. Port-based detection methods cannot identify P2P applications that use dynamic ports. Deep packet inspection methods are well developed, but they cannot identify P2P applications whose content is encrypted, and they cannot keep up in heavy-traffic environments. Behavior-based detection methods, which identify P2P nodes from transport-layer statistical characteristics rather than from port numbers or packet content, are the current research focus. However, previous behavior-based P2P detection methods also have drawbacks. First, they rely mainly on a host's client role and cannot identify P2P nodes that only upload. Second, they can only discriminate between P2P and non-P2P traffic and cannot classify traffic at the application level. Third, they cannot meet real-time identification requirements in heavy-traffic environments. Finally, domestic P2P applications and network environments are not sufficiently considered.

We summarize six types of behavior characteristics of the P2P server role: (1) P2P nodes have listening ports and response connections, and most of the listening ports are high-numbered ports; (2) P2P nodes have many response connections with heavy payloads and long durations; (3) online P2P nodes have high service rates; (4) the number of distinct IP (Internet Protocol) addresses connected to a P2P node is roughly equal to the number of distinct ports used to connect to that node; (5) P2P nodes have both request connections and response connections; (6) a large number of response connections receive and send data simultaneously.
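As a rough illustration of characteristic (4), the Python sketch below counts the distinct remote IP addresses and distinct remote ports seen on each listening endpoint and reports their ratio. The tuple layout, the function names, the sample addresses, and the tolerance of 0.2 are all assumptions made for this example; they are not taken from the dissertation.

```python
from collections import defaultdict

def ip_port_ratio(connections):
    """Map (host, listen_port) to distinct remote IPs / distinct remote ports.

    Each connection is a hypothetical tuple:
    (host, listen_port, remote_ip, remote_port).
    """
    remote_ips = defaultdict(set)
    remote_ports = defaultdict(set)
    for host, listen_port, remote_ip, remote_port in connections:
        key = (host, listen_port)
        remote_ips[key].add(remote_ip)
        remote_ports[key].add(remote_port)
    return {key: len(remote_ips[key]) / len(remote_ports[key])
            for key in remote_ips}

def looks_p2p_like(ratio, tolerance=0.2):
    # A ratio near 1 means roughly one remote port per remote IP.
    # The tolerance is an illustrative value, not one from the source.
    return abs(ratio - 1.0) <= tolerance

# Toy data: one endpoint contacted once by each of three peers, and one
# endpoint contacted by a single client over four parallel connections.
connections = [
    ("192.0.2.5", 51413, "198.51.100.1", 40001),
    ("192.0.2.5", 51413, "198.51.100.2", 40002),
    ("192.0.2.5", 51413, "198.51.100.3", 40003),
    ("192.0.2.9", 80, "203.0.113.7", 50001),
    ("192.0.2.9", 80, "203.0.113.7", 50002),
    ("192.0.2.9", 80, "203.0.113.7", 50003),
    ("192.0.2.9", 80, "203.0.113.7", 50004),
]

ratios = ip_port_ratio(connections)
for endpoint, ratio in sorted(ratios.items()):
    print(endpoint, round(ratio, 2), looks_p2p_like(ratio))
```

In this toy data the first endpoint has a ratio of 1.0 and the second a ratio of 0.25, so only the first passes the check. A single ratio is only one signal and would be combined with the other five characteristics in practice.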
We quantify these characteristics and use statistical methods to estimate their theoretical probability distributions. We use maximum likelihood parameter estimation to determine the shape and location parameters, and choose the best-fitting analytical distribution using the Kolmogorov-Smirnov (K-S) goodness-of-fit test.

We designed a parallel connection processing method, a connection construction and update algorithm, and a P2P traffic sampling method to achieve real-time data processing in heavy-traffic environments. The connection construction and update algorithm balances newly arrived connections against timed-out connections, so that the number of connections held in memory remains stable. Connection handling is divided into two stages, a construction stage and an analysis stage, and processing them in parallel improves computational efficiency. Using the distribution of P2P node online time, the traffic sampling method collects network traffic at time intervals and filters out the traffic of known applications to reduce computation. Experimental results show that with these methods and 30-second sampling intervals (at a rate of 1 Gbit/s), we can identify 92% of P2P nodes. As the sampling interval increases, however, identification accuracy decreases, a consequence of the heavy-tailed distribution of P2P node online time.

A server-role-based P2P node identification method, named PN-Detector, is proposed. First, a host taking the server role has a large number of client nodes and response connections, and its connection pattern differs markedly from that of nodes taking the client role. Hosts taking the server role are chosen as candidates for further analysis.
Second, we apply sequential hypothesis testing (SHT) to identify P2P nodes using each of the following behavior features in turn: connection duration, connection payload, request versus response connections, data uploading versus data downloading in response connections, and service rate. A feature that identifies P2P nodes more accurately is assigned a larger weight. An improved SHT algorithm is then proposed that uses all the weighted features together. Finally, the listening ports of P2P nodes are identified from the numbers of request connections and response connections. The results show that the server-role-based method identifies P2P nodes accurately and in real time, even when they are upload-only nodes.

A multiple support vector machine based P2P connection identification algorithm, named Multi-SVM, is proposed. P2P applications usually divide files into blocks or pieces that peers can download from different sources. This file-dividing mechanism requires many short packets to exchange location and control messages between peers; as a result, the flow of long packets is interrupted by short packets, and there are long time intervals between pieces during transmission. Multi-SVM uses separate vectors to describe data packet length, the number of consecutive long packets, and the time intervals between long packets, and then builds multiple support vector machines to identify P2P connections. The support vector machines are trained both offline and online so that they adapt to different network environments. Furthermore, the connections of different P2P applications have different characteristics, which are called profiles. A profile-based P2P connection classification algorithm, named FCP, is then proposed, which uses a standardized threshold calculation method rather than thresholds set from users' experience.
The results show that connections can be divided into two categories, P2P connections and non-P2P connections, and that the specific application type of each P2P connection can be identified.

A sliding-window automatic signature extraction algorithm for P2P applications, named SWE, is proposed. Each packet of a P2P application is treated as a binary sequence. First, SWE divides every packet into sub-sequences using a window that slides over the packet with a fixed width and a single-byte step. Second, SWE calculates how often each sub-sequence appears at the same offset across the packets. The window width is then changed and the process repeated; a sub-sequence whose frequency and length both exceed their thresholds is chosen as a P2P application signature. The results show that the proposed algorithm extracts the signatures of P2P applications accurately and effectively.

Based on these methods, a prototype system for P2P identification and automatic signature extraction is designed, comprising a data collection and preprocessing module, a P2P node identification module, a signature auto-extraction module, and a feedback module. The prototype system has been put into operation, and its identification accuracy is over 90%.
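The sliding-window signature extraction (SWE) steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the dissertation's implementation: the function name, the parameters `min_length`, `max_length`, and `min_fraction`, and the toy payloads are all hypothetical, and frequency is measured here as the fraction of packets that contain a sub-sequence at a given offset.

```python
from collections import Counter

def extract_signatures(packets, min_length, max_length, min_fraction):
    """Return {(offset, subsequence): fraction of packets containing it}.

    Only sub-sequences at least `min_length` bytes long (length threshold)
    that occur at the same offset in at least `min_fraction` of the packets
    (frequency threshold) are kept. All parameter names are illustrative.
    """
    signatures = {}
    for width in range(min_length, max_length + 1):
        counts = Counter()
        for payload in packets:
            # Slide a window of `width` bytes over the payload, one byte
            # at a time, and count each (offset, sub-sequence) pair.
            for offset in range(len(payload) - width + 1):
                counts[(offset, payload[offset:offset + width])] += 1
        for key, count in counts.items():
            fraction = count / len(packets)
            if fraction >= min_fraction:
                signatures[key] = fraction
    return signatures

# Toy payloads that share a made-up two-byte header at offset 0.
packets = [b"\xaa\xbbABCD", b"\xaa\xbbWXYZ", b"\xaa\xbbQRST"]
signatures = extract_signatures(packets, min_length=2, max_length=3,
                                min_fraction=1.0)
print(signatures)
```

With these toy payloads only the shared two-byte header at offset 0 meets both thresholds. On real traffic the thresholds would need tuning, and overlapping signatures of different widths would likely need to be merged.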
Keywords/Search Tags: server role, behavior characteristics, traffic sample, sequential hypothesis testing, multiple support vector machines, signature automatic extraction, peer-to-peer nodes identification