Real-time Crawler Detection And Interception For Information Disclosure Website

Posted on: 2017-03-30
Degree: Master
Type: Thesis
Country: China
Candidate: N P Lu
Full Text: PDF
GTID: 2348330533968919
Subject: Engineering
Abstract/Summary:
As a principal channel for stock exchange information disclosure, information announcement websites are responsible for publishing information about companies making initial public offerings, funds, bonds, stock issuance, and transactions. Such a website must be accurate, comprehensive, and timely, because the disclosed information is widely read by individual investors, securities institutions, and information vendors. However, some securities information vendors and many search engines gather this information with web crawlers (spiders), which continually consume the website's limited resources and degrade service for other users. In the worst case the website becomes unavailable, so that information cannot be disclosed on schedule and stock trading may even be suspended. Effective management and control of the site is therefore essential to keeping it secure, stable, and reliable and to keeping information disclosure running normally.

This thesis surveys current domestic and international techniques for data synchronization processing and website monitoring, and analyzes characteristics such as website access volume and concurrent access. It then examines the feasibility of acquiring and processing data in real time, covering real-time detection of incremental log entries, real-time data acquisition, and data transmission methods. Combining the behavioral characteristics of general web crawlers with an analysis of the crawlers that target information disclosure websites, and of how their access behavior differs from that of ordinary visitors, the thesis extracts crawler behavior from the website logs and classifies visitor behavior in order to detect crawlers.

The thesis compares HTTP with Netty, and synchronous with asynchronous log transmission, under concurrent load. Netty asynchronous transmission is selected as the log transfer tool, and JStorm as the real-time data processing framework, which handles the extraction of crawler features for the information disclosure website and the processing of the feature data. Several supervised classification algorithms, including decision trees, random forests, and support vector machines, are used to classify crawlers and are compared in terms of real-time performance, accuracy, recall, and F1 measure. The support vector machine (SVM) is selected as the final crawler detection classifier.

To tolerate detection errors, access that is classified as crawler access is intercepted and redirected to a verification code page. A normal visitor who has been misjudged as a crawler can pass the verification and continue browsing, which effectively reduces the impact of false detections. The thesis implements this real-time, SVM-based crawler detection method for information announcement websites on top of the website access log, together with crawler interception and fault tolerance for detection errors, and evaluates them through system experiments and test analysis. The results show that the system provides a more effective classification algorithm for real-time crawler detection and intercepts and controls crawler access more effectively.
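As a rough illustration of the log-based feature extraction and the recall/F1 comparison mentioned in the abstract, the Python sketch below groups access-log hits by visitor, derives a small feature vector for each, and scores a set of predictions. It is not the thesis's implementation, which is built on Netty and JStorm. The `Hit` record, the chosen features (request count, mean gap between requests, whether `/robots.txt` was fetched, error ratio), and the function names are all assumptions made for illustration. Precision is computed here as one plausible reading of the abstract's "accuracy".

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Hit:
    """One parsed access-log entry (hypothetical schema)."""
    ip: str
    ts: float      # request time in seconds
    path: str
    status: int    # HTTP status code


def visitor_features(hits):
    """Group hits by IP and compute a small per-visitor feature dict."""
    by_ip = defaultdict(list)
    for h in hits:
        by_ip[h.ip].append(h)

    features = {}
    for ip, hs in by_ip.items():
        hs.sort(key=lambda h: h.ts)
        # Time gaps between consecutive requests from the same visitor.
        gaps = [b.ts - a.ts for a, b in zip(hs, hs[1:])]
        features[ip] = {
            "requests": len(hs),
            "mean_gap_s": sum(gaps) / len(gaps) if gaps else 0.0,
            "robots_hit": any(h.path == "/robots.txt" for h in hs),
            "error_ratio": sum(1 for h in hs if h.status >= 400) / len(hs),
        }
    return features


def precision_recall_f1(y_true, y_pred):
    """Score binary predictions where True means 'crawler'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Feature dicts of this kind could be turned into numeric vectors and passed to any supervised classifier, such as an SVM, with the scoring function used to compare candidates on held-out labeled visitors.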
Keywords/Search Tags: crawler spider, real-time crawler detection, information disclosure, real-time, intercept