
Study On Scheduling Detection And Eliminating Duplication Of Text-based Sensitive Information

Posted on: 2017-04-03    Degree: Master    Type: Thesis
Country: China    Candidate: D X Zhang    Full Text: PDF
GTID: 2348330503965574    Subject: Engineering
Abstract/Summary:
The development of the Internet has brought great convenience to people's lives and great progress to society. At the same time, criminals exploit the Internet to disseminate sensitive information with undesirable content, such as pornographic, violent-terrorist, and reactionary material, quickly and conveniently, with a severely negative impact on national security, social development, and people's lives. Accessing and monitoring sensitive information in time across the huge Internet has become a hot topic in the field of network security. This thesis studies scheduling strategies for detecting sensitive information and methods for removing duplicate sensitive pages. The main contents are as follows:

1. A scheduling strategy based on classifying web pages by their sensitivity is proposed. The sensitive keywords and their positions in a web page are obtained by matching the keywords against the page, and the sensitivity of each keyword is combined with an impact factor derived from its position to calculate the page's sensitive correlation value (a minimal sketch of such a score follows this abstract). Pages are classified by this value, and different scanning frequencies are assigned to each class, so that highly sensitive pages are monitored intensively and detected in time. Experiments indicate that the strategy effectively improves both the timeliness of finding sensitive information and the proportion of highly sensitive information found.

2. A supplementary strategy for discovering sensitive pages, based on predicting when insensitive pages will next change, is proposed. The next change time of a page is predicted from the number of changes and the intervals between them observed in recent crawls, and only pages that meet the time condition are grabbed (a change-time predictor is sketched below). This increases the crawling frequency of frequently changing pages and reduces that of rarely changing pages, improving the efficiency and timeliness of discovering new sensitive pages. Experimental results indicate that the strategy increases the proportion of sensitive pages found.

3. A strategy for eliminating duplicate web pages based on sensitive abstracts is proposed. The sensitive keywords and their positions in a page are obtained by matching the keywords against the page, and the contexts of all sensitive keywords are merged to produce the page's sensitive abstract. The set of sensitive keywords a page contains is used to narrow the set of pages that need to be compared, and the similarity between the sensitive abstracts of different pages, computed by edit distance, is used to remove duplicates (see the edit-distance sketch below). Experiments indicate that the strategy improves the effectiveness of duplicate-page removal.

4. A sensitive-information detection system based on the strategies above is designed and implemented. By scanning and monitoring the university's websites, the validity and stability of the system are confirmed. The results indicate that the proposed strategies discover sensitive information and its changes in a more timely manner.
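To make the scoring in the first strategy concrete, the following is a minimal Python sketch of one way a sensitive correlation value could be computed. The keyword weights, position impact factors, class thresholds, and scan frequencies are illustrative assumptions, not values from the thesis; the thesis only states that keyword sensitivity and position-derived impact factors are combined by an algorithm.

```python
# Hypothetical sensitivity weights per keyword and impact factors per position.
KEYWORD_SENSITIVITY = {"keywordA": 0.9, "keywordB": 0.6}
POSITION_FACTOR = {"title": 2.0, "heading": 1.5, "body": 1.0}

def sensitive_correlation(hits):
    """Score a page from (keyword, position) pairs found in it.

    The weighted sum below is an assumed combination rule."""
    score = 0.0
    for keyword, position in hits:
        score += KEYWORD_SENSITIVITY.get(keyword, 0.0) * POSITION_FACTOR.get(position, 1.0)
    return score

def classify(score, low=1.0, high=3.0):
    """Map the correlation value to a class and a re-scan interval (in crawl cycles)."""
    if score >= high:
        return "high", 1      # re-scan every cycle
    if score >= low:
        return "medium", 4    # re-scan every 4 cycles
    return "low", 16          # re-scan every 16 cycles
```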
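The change-time prediction in the second strategy could look like the sketch below, which averages the intervals between recently observed changes to estimate the next change; the thesis does not specify the exact predictor, so the averaging rule and the one-day fallback interval are assumptions.

```python
from datetime import datetime, timedelta

def predict_next_change(change_times, now=None):
    """Estimate the next change time from a page's recent change history.

    change_times: datetimes at which the page was observed to have changed,
    oldest first."""
    now = now or datetime.now()
    if len(change_times) < 2:
        # Too little history: fall back to an assumed one-day re-visit interval.
        return now + timedelta(days=1)
    intervals = [(b - a).total_seconds()
                 for a, b in zip(change_times, change_times[1:])]
    avg_interval = sum(intervals) / len(intervals)
    return change_times[-1] + timedelta(seconds=avg_interval)

def should_crawl(change_times, now=None):
    """Grab a page only when its predicted change time has already passed."""
    now = now or datetime.now()
    return predict_next_change(change_times, now) <= now
```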
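The third strategy's duplicate check can be illustrated with a standard Levenshtein edit distance computed over the sensitive abstracts of two pages; the similarity threshold and the length normalization are assumptions for illustration, and in the thesis the candidate pairs would first be narrowed by shared sensitive keywords.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def is_duplicate(abstract_a: str, abstract_b: str, threshold: float = 0.85) -> bool:
    """Treat two pages as duplicates when their sensitive abstracts are
    sufficiently similar (similarity = 1 - normalized edit distance)."""
    if not abstract_a or not abstract_b:
        return False
    dist = edit_distance(abstract_a, abstract_b)
    similarity = 1.0 - dist / max(len(abstract_a), len(abstract_b))
    return similarity >= threshold
```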
Keywords/Search Tags:sensitive information, classified monitoring, supplementary discovery, remove duplicate pages, sensitive abstraction