
Research And Implementation Of Sensitive Information Detection System Based On Incremental Search

Posted on: 2016-12-29
Degree: Master
Type: Thesis
Country: China
Candidate: Y Ye
GTID: 2308330479984747
Subject: Computer application technology

Abstract/Summary:
Today, the Internet has become an essential part of people's lives. Exploiting its broad reach, some criminals spread rumors, pornography, terrorism, reactionary content and other sensitive information, which may violate citizens' legitimate rights and even endanger national security. In the Internet age, finding sensitive information within the huge amounts of data on the Internet as quickly as possible has therefore become an important research topic in the field of information security. The detailed research is described as follows:

Firstly, an algorithm for identifying post links and their structural modes is proposed. Analysis of post links shows that the lengths of the links and of their text descriptions follow certain statistical patterns. Post link parameters can be divided into an explicit mode and an implicit mode: in the explicit mode, the parameters appear directly in the link and include the request page name, the post id, the post page id and so on; in the implicit mode, all parameter information is encoded in the request page name. Based on the statistical analysis of the lengths of link text descriptions, an algorithm is proposed that identifies the structural pattern of post links and extracts the request page name, post id and post page id. Experimental results show that the algorithm can adaptively distinguish post links from navigation links, which provides the basis for quickly extracting post content.

Secondly, a scanning strategy for detecting sensitive information in forums based on auto-incrementing post ids is proposed. When the structure of the post links has not yet been identified, the breadth-first algorithm is used to construct the crawling queue. When the structure mode has been identified but the post id does not increase automatically, only the post pages are scanned for sensitive information. When the post id does increase automatically, post links are constructed by incrementing the post id and the corresponding posts are crawled directly. Experiments indicate that this scanning strategy is faster than the breadth-first algorithm.

Thirdly, a scheduling strategy for sensitive information detection based on incremental searching is proposed. Changed pages are found by comparing the new and old MD5 values of web pages, and only these changed pages are scanned incrementally. According to the scanning results, which show whether a page contains sensitive information, the next scan time of each page is predicted, and the schedule is optimized by dynamically adjusting the scanning frequency. A web page queue is built according to each page's scanning urgency, computed from its sensitivity degree, change frequency and depth, so that important pages are scanned at a high frequency; this queue is used to monitor the pages that require close attention. A second website queue, scanned at a low frequency, is maintained to ensure that all changed pages can still be found while keeping network costs and computing resource consumption as low as possible. While this website queue is being constructed, pages whose sensitive information has changed are detected and their links are added to the page queue. Experimental results indicate that this strategy detects pages containing newly changed sensitive information more quickly.
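To make the first contribution concrete, the sketch below shows one way a post link could be classified as explicit or implicit and its request page name, post id and post page id extracted. The URL patterns and parameter names (viewthread.php, tid, page, thread-...) are illustrative assumptions, not the link formats analyzed in the thesis.

```python
import re
from urllib.parse import urlparse, parse_qs

def identify_post_link(url):
    """Classify a post link as explicit (ids in the query string) or
    implicit (ids embedded in the request page name). Hypothetical sketch."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    page_name = parsed.path.rsplit("/", 1)[-1]
    if query:
        # Explicit mode, e.g. viewthread.php?tid=12345&page=2
        return {
            "mode": "explicit",
            "request_page": page_name,
            "post_id": (query.get("tid") or query.get("id") or [None])[0],
            "post_page_id": (query.get("page") or ["1"])[0],
        }
    # Implicit mode, e.g. thread-12345-2-1.html: ids are part of the page name
    m = re.match(r"(?P<name>[A-Za-z]+)-(?P<post_id>\d+)-(?P<page_id>\d+)", page_name)
    if m:
        return {
            "mode": "implicit",
            "request_page": m.group("name"),
            "post_id": m.group("post_id"),
            "post_page_id": m.group("page_id"),
        }
    return {"mode": "unknown"}

print(identify_post_link("http://bbs.example.com/viewthread.php?tid=12345&page=2"))
print(identify_post_link("http://bbs.example.com/thread-12345-2-1.html"))
```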
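The second contribution, the forum scanning strategy, could be sketched as follows. The helpers fetch_links and contains_sensitive are hypothetical placeholders for the crawler's link extraction and sensitive-content matching, which the abstract does not specify.

```python
from collections import deque

def bfs_scan(seed_url, fetch_links, contains_sensitive, max_pages=1000):
    """Breadth-first fallback used while no post-link structure mode is known."""
    queue, seen, hits = deque([seed_url]), {seed_url}, []
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        if contains_sensitive(url):
            hits.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return hits

def auto_increment_scan(template, start_id, end_id, contains_sensitive):
    """When post ids auto-increment, build post URLs directly from the ids,
    e.g. template = 'http://bbs.example.com/viewthread.php?tid={}'."""
    hits = []
    for post_id in range(start_id, end_id + 1):
        url = template.format(post_id)
        if contains_sensitive(url):
            hits.append(url)
    return hits
```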
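The third contribution, incremental scheduling, might be sketched like this; the urgency weights and the interval adjustment are assumptions chosen only to show how sensitivity degree, change frequency and page depth could be combined into a scan priority.

```python
import hashlib
import heapq
import time

def page_changed(content: bytes, old_md5: str) -> bool:
    """A page is rescanned only when its MD5 digest differs from the stored one."""
    return hashlib.md5(content).hexdigest() != old_md5

def urgency(sensitivity: float, change_freq: float, depth: int) -> float:
    # Illustrative weighting: sensitivity and change frequency raise urgency,
    # deeper pages contribute less.
    return 0.6 * sensitivity + 0.3 * change_freq + 0.1 / (1 + depth)

class ScanScheduler:
    """Priority queue of pages ordered by predicted next scan time."""
    def __init__(self):
        self._heap = []  # entries: (next_scan_time, -urgency, url)

    def schedule(self, url, sensitivity, change_freq, depth, base_interval):
        score = urgency(sensitivity, change_freq, depth)
        # More urgent pages get a shorter interval, i.e. a higher scan frequency.
        next_time = time.time() + base_interval / (1 + score)
        heapq.heappush(self._heap, (next_time, -score, url))

    def next_due(self):
        return heapq.heappop(self._heap) if self._heap else None
```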
With this scheduling strategy, network traffic and computing resource costs are greatly reduced, because pages that contain no sensitive information, or whose sensitive information has not changed, are not rescanned.

Finally, a sensitive information detection system is designed and implemented. The system is divided into three layers: the presentation layer, the business logic layer and the data access layer. Experimental results from scanning and monitoring 41 websites and 4 forums indicate that the system runs stably and detects sensitive information on websites and forums quickly.
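For the implemented system, a minimal sketch of the three-layer separation is given below; the class names and the keyword-matching detail are assumptions, since the abstract only names the layers.

```python
import hashlib

class PageRepository:
    """Data access layer: stores scan results keyed by URL."""
    def __init__(self):
        self._pages = {}

    def save(self, url, md5, sensitive):
        self._pages[url] = {"md5": md5, "sensitive": sensitive}

class DetectionService:
    """Business logic layer: matches keywords and records the result."""
    def __init__(self, repository, keywords):
        self.repository, self.keywords = repository, keywords

    def scan(self, url, content):
        sensitive = any(k in content for k in self.keywords)
        self.repository.save(url, hashlib.md5(content.encode()).hexdigest(), sensitive)
        return sensitive

class ConsoleView:
    """Presentation layer: reports results to the operator."""
    def report(self, url, sensitive):
        print(f"{url}: {'sensitive' if sensitive else 'clean'}")
```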
Keywords/Search Tags: sensitive information detection, post id identification, scanning urgency, characteristic analysis, incremental search