
Identification And Analysis Of Massive Web Traffic Based On Behavior Characteristics

Posted on: 2017-07-11
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X L Gui
GTID: 1318330518995986
Subject: Signal and Information Processing
Abstract/Summary:
With the wide use of WeChat and other social applications as communication tools, telecom operators are increasingly displaced by over-the-top (OTT) services. Moreover, their continuing investment in network infrastructure has not brought the expected returns: investment rises while income does not. Telecom operators therefore have to find new sources of income in a cross-industry competitive environment. The advent of the big data era has produced a consensus that operators should exploit their huge volume of network traffic to increase income.

Telecom operators are enterprises that provide telecommunications services, so a big data strategy should be reflected in better services for the public, such as on-demand services and precision marketing. Fine-grained network traffic identification is a prerequisite for achieving these goals. Web traffic accounts for the largest share of traffic in current networks, and Web traffic collected on the network side contains more information that helps operators understand users and network status than other kinds of traffic. Therefore, this thesis focuses on the fixed network and the mobile Internet and, centered on fine-grained traffic identification over massive Web traffic, uses the behavioral characteristics of Web traffic to identify and analyze it along the dimensions of websites, applications, operating systems, smartphones, and network security. The main contents and innovations of this thesis are summarized as follows.

(1) Fine-grained Web traffic identification and analysis in the fixed network

Top-k ranking of websites by traffic volume is important for Internet Service Providers (ISPs) to understand network status and optimize network resources.
However, the ranking result always deviates substantially from the actual rank because of unknown Web traffic. Unknown Web traffic should be associated with websites in order to narrow this deviation and reflect the real state of the network. In this work, we describe the relationship between client and server using a temporal bipartite graph (TBG). We then construct a probabilistic model that identifies unknown Web traffic based on the TBG and the Referer characteristic. Finally, we propose three methods to approximate the actual rank of websites based on this probabilistic model. Experimental results show that the proposed techniques greatly reduce the deviation between the ground truth and the ranking results. In addition, we find that websites providing video services have a higher ratio of unknown IP addresses and a higher ratio of unknown traffic than websites providing text Web pages. Specifically, the top-3 video websites have more than 90% unknown Web traffic. We are also the first to apply the theory of probabilistic top-k queries to network traffic analysis.

(2) Fine-grained Web traffic identification and analysis in the mobile Internet

Unlike in the fixed network, apps are the main way to access the Internet in the mobile network. Therefore, the goal of fine-grained Web traffic identification in the mobile Internet is to associate Web traffic with apps. At present, deep packet inspection (DPI) is the main method for app identification. However, DPI yields only characteristic strings of apps rather than app names, because app identifiers are set arbitrarily by app developers without uniform coding rules. As a result, previous work was limited to extracting characteristic strings from the HTTP header.
In contrast, we propose a new three-step method that identifies app names from incomplete and ambiguous user-agents. First, we extract user-agents from HTTP headers; then, we use each user-agent to retrieve related text features from the Internet; finally, we identify apps by analyzing the retrieved text features.

(3) Analyzing the behavior of smartphones in the mobile Internet

Operating system and device model are two important factors when people consider buying a smartphone. Different types of mobile devices are designed to attract different people, and different kinds of people use smartphones in different ways. Therefore, we can reveal the traffic characteristics of different types of smartphones and the behavioral characteristics of different kinds of users along these two dimensions. Based on a dataset spanning 31 days (a billing cycle), we study smartphones along these two dimensions and reach several meaningful conclusions. For example, Android devices consume more than 1.5 times the traffic volume of iOS devices on average; the deviation in traffic volume between different iPhone models is within 5%, while that between different Android models is more than 200% (e.g., Xiaomi 3 and Xiaomi Note); people spend more time on smartphones on weekends than on weekdays; more than 70% of people use fewer than 10 apps in a week; and 10% of apps consume 98% of total traffic volume, while the other 90% of apps generate only 2%. These measurement results provide insights for network operators to design pricing and resource allocation strategies for their cellular data networks.

(4) Identifying and analyzing malware traffic in the mobile Internet

As the mobile Internet penetrates social, economic, cultural, and other areas, network security issues are closer to users than ever before. As technology develops, new malware keeps emerging in the network.
If we can identify the network traffic generated by malware and then restrict this traffic or give users early warning, user losses will be reduced and a better network environment will be achieved. Although XcodeGhost, an iOS malware that emerged in late 2015, led to privacy leakage for a large number of users, only a few studies have examined XcodeGhost based on its source code. In this work, we describe observations from monitoring the network activity of more than 2.59 million iPhone users in a provincial area over 232 days. Our analysis reveals a number of interesting points. For example, we find that the ratio of infected devices is more than 60%, and that many popular applications, such as WeChat, Railway 12306, Didi Taxi, and Youku Video, are also infected. In addition, we propose a decay model for the prevalence rate of XcodeGhost.
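
The idea in part (1), that associating unknown Web traffic with websites changes the top-k ranking, can be illustrated with a small sketch. This is not the thesis's probabilistic TBG model: it is a simplified, deterministic stand-in that attributes a request from an unrecognized host to the site named in its Referer header, if that site is known. The function name `rank_sites`, the record layout `(host, referer, nbytes)`, and all host names are assumptions made for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def rank_sites(records, known_hosts, k=3, use_referer=True):
    """Rank sites by traffic volume (hypothetical helper, not from the thesis).

    records     -- iterable of (host, referer, nbytes) tuples
    known_hosts -- dict mapping a hostname to a site label
    use_referer -- if True, attribute traffic from unrecognized hosts to the
                   site of the Referer host when that host is known
    """
    volume = Counter()
    for host, referer, nbytes in records:
        site = known_hosts.get(host)
        if site is None and use_referer and referer:
            # Fall back to the host named in the Referer header.
            ref_host = urlsplit(referer).hostname
            site = known_hosts.get(ref_host)
        volume[site if site is not None else "unknown"] += nbytes
    return volume.most_common(k)


# Made-up example data: a video site whose content is served from a CDN host
# that is not in the list of known hosts.
records = [
    ("www.video.example", "", 100),
    ("cdn-17.example.net", "http://www.video.example/watch?v=1", 900),
    ("www.news.example", "", 400),
    ("img.other.example", "", 50),
]
known = {"www.video.example": "video", "www.news.example": "news"}

print(rank_sites(records, known, k=2, use_referer=False))
print(rank_sites(records, known, k=2, use_referer=True))
```

In this toy data, the video site ranks below the news site when the CDN traffic stays unknown and moves to the top once that traffic is attributed through the Referer header, which mirrors the abstract's observation that video websites carry a high share of unknown traffic.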
Keywords/Search Tags: network traffic measurement, fine-grained network traffic identification, web traffic, apps identification, XcodeGhost