Design And Implementation Of Sensitive Information Monitoring System For Website

Posted on: 2009-12-22
Degree: Master
Type: Thesis
Country: China
Candidate: Y Wang
Full Text: PDF
GTID: 2178330338485517
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of network technology, the network has become the main platform for academic exchange and information sharing. However, as the network grows in size and becomes more geographically distributed, managing information on the national network is becoming more and more difficult. Discovering sites that contain dangerous information is therefore important to social stability. This thesis presents a Web site sensitive information monitoring system that combines information retrieval and Web information extraction technologies, building on the basic ideas of Web data mining. The system finds sensitive information quickly, is highly flexible, and achieves a higher rate of recall and precision. The work carried out is as follows:
1. Designed and implemented a Web site sensitive information monitoring system based on a three-tier structure, including a Web page collection module, a Web page information extraction module, and an alarm module that is triggered when sensitive information is identified. The system monitors site pages for sensitive information, raises alarms, and detects and tracks the sources of that information.
2. When collecting a Web site and the pages it links to, the system uses an acquisition strategy that treats the links on authority pages and sensitive pages as key acquisition targets. The strategy uses the PageRank algorithm to assess the importance of each page and determines the authority pages from the assessed values. A new search then starts from these authority pages, and the process repeats until a stop condition is satisfied, which ensures the recall of page collection to a certain extent.
3. Based on word segmentation of the pages, the thesis uses an optimized K-clustering algorithm to improve the accuracy of grouping similar pages. By improving the structure of the page tag tree and combining it with an optimized Smith-Waterman algorithm, it achieves accurate division of page data and identifies the main text block of each page. Tests verify the effectiveness of these algorithms.
4. The Wu-Manber algorithm is used for multi-pattern keyword matching on the extracted text, so that sensitive information is identified accurately and quickly and an alarm is raised.
Finally, the thesis summarizes the work and presents the implementation of the system.
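The PageRank-based selection of authority pages described in item 2 can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names `pagerank` and `authority_pages`, the damping factor of 0.85, the convergence tolerance, and the toy link graph in the usage note are all assumptions made for illustration. The sketch uses plain power iteration and spreads the rank of pages without outgoing links evenly over all pages.

```python
def pagerank(links, damping=0.85, tol=1e-9, max_iter=100):
    """Power-iteration PageRank.

    links: dict mapping a page to the list of pages it links to
    (hypothetical input format, assumed for this sketch).
    Returns a dict mapping every page to its rank; ranks sum to 1.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start from a uniform distribution.
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Rank held by pages with no outgoing links is shared by all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {p: base for p in pages}

        # Each page passes an equal share of its rank to every link target.
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share

        # Stop once the total change between iterations is negligible.
        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta < tol:
            break
    return rank


def authority_pages(links, top_k=2):
    """Return the top_k pages by PageRank, highest first."""
    rank = pagerank(links)
    return sorted(rank, key=rank.get, reverse=True)[:top_k]
```

For example, calling `authority_pages({"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]})` ranks the pages of a four-page toy graph and returns the two highest-ranked ones. In a crawler following the strategy above, these pages would serve as the starting points for the next round of collection.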
Keywords/Search Tags:information collection, information extraction, tag tree, comparison of dissimilarity degree, Clustering Algorithm