Illegal Site Identification Using Template Detection

Posted on: 2016-01-21
Degree: Master
Type: Thesis
Country: China
Candidate: H L Zhang
Full Text: PDF
GTID: 2308330476453507
Subject: Software engineering
Abstract/Summary:
In recent years, as new technologies have boomed on the Internet, real-world criminal activity has spread rapidly online. Illegal websites threaten both economic growth and social stability. The Ministry of Public Security devotes considerable manpower to curbing rampant illegal activity on the Internet. Identification of illegal websites still relies mainly on public reporting and manual screening, which is time-consuming and labor-intensive. The mainstream techniques for automatic illegal website identification are blacklists, static analysis, and dynamic detection. Blacklists offer a quick and convenient way to check suspicious URLs, but they are expensive in practice. Static analysis has the advantage of complete theory and mature technology, but its data source is limited to page content and its real-time performance is poor. Dynamic detection focuses on Trojan-injected ("hanging horse") websites and is not effective against phishing or gambling sites. There is therefore an urgent need for a practical automatic identification technology that can quickly and accurately pick out illegal websites from a massive number of sites and so help combat cybercrime.

To address this need, this paper proposes and implements an efficient template detection approach for illegal website identification. By combining HTTP POST analysis, a website similarity model, clustering, template detection, big data processing, visualization, and other technologies, the approach extracts illegal website templates from a massive set of sites and identifies illegal sites accurately and at high speed. The system is designed to meet performance and scalability requirements.

First, this paper analyzes the characteristics of common illegal sites and puts forward three proposals for illegal website identification.
After weighing the research objectives and evaluation measures, the paper chooses to identify illegal sites using template detection. The approach consists of four key technologies:
1) Features for template detection. HTTP behavior is taken as the entry point for identifying website templates. After analyzing several kinds of HTTP behavior, POST requests are selected as the feature for template detection. We analyze the information carried in POST requests and propose a formula for calculating feature values.
2) Website similarity model. To extract illegal website templates and identify illegal sites, this paper proposes a website similarity model. The model builds a feature set for each site and calculates the similarity between two sites using the Jaccard coefficient.
3) Template extraction. To obtain a template from similar illegal sites, websites are clustered according to their pairwise similarity, and the critical POSTs are extracted as the template using TF-IDF.
4) Automatic identification of illegal websites. Using the similarity model and the illegal templates, illegal websites can be identified among unknown websites.

Then, this paper designs a prototype of illegal site identification using template detection, built on Hadoop, MapReduce, Hive, and multi-threading. Three experiments are conducted on gambling sites. The results show that the proposed method identifies illegal sites accurately. By adjusting the threshold, precision can reach 100%. Compared with URL, HTML, and semantic features, POST features yield higher accuracy. The results also confirm that the technical framework meets the proposed performance measures. The framework is scalable, while running time and recall leave room for improvement.

Finally, to improve efficiency and recall, this paper uses graph analysis to discover regularities and anomalies in the clusters of sites. Features that give rise to patterns are selected according to the characteristics of the websites and the graph.
The features are then grouped into chosen pairs that capture patterns of normal behavior, and points that deviate significantly from the discovered patterns are flagged as anomalous. These findings are applied to optimize the similarity model. The experimental results show that the improved similarity model achieves better recall and efficiency.
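
The similarity model, template extraction, and identification steps described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation. The abstract does not give the feature formula, so each POST feature is assumed here to be a hypothetical string of the form "path|parameter names". Clustering is assumed to have been done already, and the TF-IDF variant (fraction of cluster sites containing a feature, times the log inverse site frequency) is one plausible choice among several. All site names, features, and the 0.5 threshold are invented for illustration.

```python
import math
from collections import Counter

def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B|; taken as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def extract_template(cluster, corpus, top_k=2):
    """Return the top_k POST features of a cluster by a TF-IDF-style score.

    cluster: names of sites in corpus that were grouped together (assumed given).
    corpus:  dict mapping site name -> set of POST feature strings.
    """
    n_sites = len(corpus)
    # Document frequency: in how many sites of the whole corpus a feature occurs.
    df = Counter(f for feats in corpus.values() for f in feats)
    # Cluster frequency: in how many sites of this cluster a feature occurs.
    tf = Counter(f for name in cluster for f in corpus[name])
    scores = {
        f: (count / len(cluster)) * math.log(n_sites / df[f])
        for f, count in tf.items()
    }
    # Sort by descending score, breaking ties by feature name for determinism.
    ranked = sorted(scores, key=lambda f: (-scores[f], f))
    return set(ranked[:top_k])

def is_illegal(site_features, templates, threshold=0.5):
    """Flag a site whose feature set is similar enough to any known template."""
    return any(jaccard(site_features, t) >= threshold for t in templates)

# Hypothetical data: two gambling sites sharing a template, three benign sites.
LOGIN = "/login|pwd,user"
BET = "/bet|amount,game"
DEPOSIT = "/deposit|amount,bank"
corpus = {
    "siteA": {LOGIN, BET, DEPOSIT},
    "siteB": {LOGIN, BET, DEPOSIT, "/chat|msg"},
    "news": {LOGIN, "/comment|text"},
    "blog": {LOGIN, "/comment|text"},
    "shop": {LOGIN, "/cart|item,qty"},
}

template = extract_template(["siteA", "siteB"], corpus, top_k=2)
unknown = {LOGIN, BET, DEPOSIT, "/promo|code"}
print(template, is_illegal(unknown, [template]))
```

In this toy corpus the login POST appears on every site, so its inverse site frequency is zero and it drops out of the template, while the betting and deposit POSTs shared by the clustered sites are kept. Raising the threshold trades recall for precision, which is in line with the threshold tuning mentioned in the abstract.
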
Keywords/Search Tags: illegal website identification, template detection, HTTP POST, graph analysis