Research On Real-Time Classification Of URLs In Large Scale Network Traffic

Posted on: 2016-08-14
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H Z Sha
GTID: 1108330482960404
Subject: Computer Science and Technology

Abstract/Summary:
With the rapid development of the Internet and the continual expansion of network services, the number of URLs has grown quickly in recent years. China, for instance, has seen rapid growth in its Internet industry. According to "The Development situation and security report of Chinese Websites", by the end of December 2014 the number of Chinese websites had reached about 3,647,000, an annual increase of about 141,000. The number of Internet Service Providers (ISPs) had reached 1,068, an increase of 86. As the number of websites and web pages grows, Internet services have gradually penetrated into people's daily lives. On the one hand, these abundant services enrich daily life; on the other hand, they provide a broad platform for various cyber-attacks (including phishing, scams, Trojans, etc.). These attacks are often launched around the web, either by setting traps or by exploiting vulnerabilities, and together they constitute a serious threat to the security of the whole Internet. Take URLs as an example: according to Kaspersky statistics for 2012, malicious URLs were encountered more than 139 million times and played an important role in 87.39% of network attacks. With the continued development of the Internet, the number of network attacks keeps growing, and the expanding scale of malicious URLs makes the network security situation increasingly grim.

As one of the key security techniques against cyber-attacks, real-time URL classification can avoid the security threats caused by malicious web pages, protect personal privacy and online transactions, and thus enhance network security. To this end, many schemes and techniques have been proposed. However, as the number of web pages keeps growing, defending against network attacks faces numerous challenges, including the sheer volume of network resources, uneven collection of web data, and the continual upgrading of the evasion techniques used by malicious URLs. Faced with these challenges, traditional schemes have exposed many defects, such as low accuracy and high memory usage. To deal with these challenges effectively, real-time URL classification needs further study. Such techniques should fundamentally prevent the security threats posed by malicious URLs by applying an efficient, reliable, and accurate defensive model.

This dissertation takes the real-time classification of web pages as its main line, surveys the latest research results on real-time URL classification, and extends them according to practical needs. More specifically, it proposes non-human access filtering, malicious web page recognition, and potentially malicious web page discovery, among other techniques. Based on these techniques, it formulates a large-scale real-time URL classification framework. The proposed solutions are verified through detailed experiments and analysis of a rich set of open data sources, and they achieve good practical results.

The main contributions are summarized as follows:

(1) It proposes a large-scale framework for real-time URL classification. By integrating various techniques, the framework classifies and analyzes network traffic at the gateway. It uses an asynchronous collaborative architecture that combines offline analysis with online classification, which improves the efficiency of web traffic classification and supports the practical need for multi-level, fine-grained classification. In addition, the design takes the characteristics of gateway traffic into account and focuses on the potential problems of current classification frameworks. Experimental evaluations show that the framework has much higher operating efficiency and stable classification performance, which indicates that it contributes to theory and also has practical value.

(2) It studies an effective filtering technique named EPLogCleaner, which identifies and filters high-frequency non-human clicks. EPLogCleaner targets high-frequency automated page views in gateway traffic and mainly exploits their periodicity. Combining this with traditional similarity measures, it derives filtering rules that are used to exclude such page views. Experimental results demonstrate that, compared with traditional data-cleaning methods, EPLogCleaner filters 30% more URLs while keeping accuracy above 90%.

(3) It studies lightweight recognition of suspicious URLs. A simple and effective feature selection method is proposed to limit the size of the feature set. For each feature, it first gives an O(1) evaluation method to measure the feature's predictive ability. It then selects effective features based on the measured values, with linear complexity. Experimental results indicate that the approach achieves almost the same predictive ability as prior approaches for malicious URL detection while using only 8.3% of the features. Moreover, the approach is efficient enough for large-scale data: it handles 20 thousand URLs per second on average in the evaluations.

(4) It studies the detection of cloaked phishing web pages. A lightweight phishing detection approach named CPRM (Cloaked Phishing Recognition Model) is proposed. Based on observation of the cloaking process of phishing URLs, it introduces a few new lightweight features into the detection system. Experimental results demonstrate that the approach accurately detects phishing pages that use cloaking techniques. Compared with prior systems, it improves precision by 2.74% and recall by 1.25% while processing almost the same number of URLs per second.

(5) It studies inference techniques for malicious URLs. It is the first to use access relationships to identify new malicious URLs, which addresses the problem of the low concentration of malicious web pages. Experimental results show that, compared with traditional schemes, GuidedTracker effectively raises the concentration of malicious URLs (from 1.06% to 1.94%) and shortens the detection process by 33.89%.
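
The periodicity idea behind contribution (2) can be illustrated with a small sketch. This is not the EPLogCleaner algorithm itself, which the abstract does not specify. It is a minimal example under stated assumptions: the log record layout `(client, url, timestamp)`, the function names, the use of the coefficient of variation of inter-request gaps as the periodicity test, and the thresholds `min_hits` and `max_cv` are all hypothetical. The similarity-measure step mentioned in the abstract is omitted.

```python
from collections import defaultdict
from statistics import mean, pstdev


def find_periodic_keys(log, min_hits=5, max_cv=0.1):
    """Return (client, url) pairs requested at near-constant intervals.

    Assumption: each log record is a (client, url, timestamp_seconds) tuple.
    A pair counts as periodic when the coefficient of variation (population
    standard deviation divided by mean) of its inter-request gaps is at most
    max_cv. This test is an illustrative stand-in, not the thesis's rule.
    """
    times = defaultdict(list)
    for client, url, ts in log:
        times[(client, url)].append(ts)

    periodic = set()
    for key, stamps in times.items():
        if len(stamps) < min_hits:
            continue  # too few requests to judge periodicity
        stamps.sort()
        gaps = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
        avg_gap = mean(gaps)
        if avg_gap > 0 and pstdev(gaps) / avg_gap <= max_cv:
            periodic.add(key)
    return periodic


def clean_log(log, **thresholds):
    """Drop every record whose (client, url) pair looks machine-generated."""
    periodic = find_periodic_keys(log, **thresholds)
    return [record for record in log if (record[0], record[1]) not in periodic]


if __name__ == "__main__":
    # Hypothetical data: one client polls a URL every 30 s, another browses
    # irregularly.
    sample = [("10.0.0.1", "http://a.example/poll", 30.0 * i) for i in range(6)]
    sample += [("10.0.0.2", "http://b.example/", t)
               for t in (3.0, 50.0, 51.0, 400.0, 410.0)]
    print(find_periodic_keys(sample))
    print(len(clean_log(sample)))
```

Grouping by `(client, url)` keeps the test cheap (one pass to bucket timestamps, one pass per bucket), which fits the gateway setting the abstract describes. A production filter would also need to tolerate jitter and missing requests.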
Keywords/Search Tags: real-time classification, malicious URL, detection, non-human access filtering techniques, URL features, inferring techniques for malicious web pages