Real-time Crawler Detection And Interception For Information Disclosure Website

Posted on:2017-03-30

Degree:Master

Type:Thesis

Country:China

Candidate:N P Lu

Full Text:PDF

GTID:2348330533968919

Subject:Engineering

Abstract/Summary:

Information announcement websites are a primary channel for stock exchange information disclosure. They publish information about companies making initial public offerings, funds, bonds, stock issuance, and transactions, and they must be accurate, comprehensive, and timely. The disclosed information is widely read by individual investors, securities institutions, and information vendors. However, some securities information vendors and many search engines gather this information with web crawlers, which continually consume the website's limited resources. This degrades service for other users and, in the worst case, can make the website unavailable, so that daily disclosures cannot be published and stock trading may even be suspended. Effective management and control of the information disclosure website is therefore particularly important for ensuring its security, stability, and reliability, and for keeping information disclosure running normally.

This thesis surveys current domestic and international technologies for synchronous data processing and website inspection, and analyzes characteristics such as website access volume and concurrent access. It then examines the feasibility of real-time data acquisition and processing, including real-time detection of incremental log entries, real-time data acquisition, and data transmission methods. Drawing on the behavioral characteristics of general web crawlers, the thesis analyzes the characteristics of crawlers that target information disclosure websites and how their access behavior differs from that of other visitors, extracts crawler behavior from the website logs, and classifies visitor behavior to detect crawlers.

Synchronous and asynchronous log transmission over HTTP and Netty were compared under concurrent transmission. Netty asynchronous transmission was selected as the log transfer tool, and JStorm as the real-time data processing framework for extracting crawler features of the information disclosure website and processing the feature data. Several supervised classification algorithms, including decision trees, random forests, and support vector machines (SVM), were studied for crawler classification and compared on real-time performance, accuracy, recall, and F1 measure. SVM was selected as the classifier for crawler detection. As a fault-tolerance measure, access detected as crawler traffic is intercepted and sent to a verification code page; a normal visitor misjudged as a crawler can pass the verification code check, which effectively reduces the impact of false detections.

The thesis implements a real-time SVM crawler detection method for information announcement websites based on website access logs, together with crawler interception, and evaluates the detection algorithm, the interception, and the fault tolerance for misdetection through system experiments and testing analysis. The results show that the system provides a more effective classification algorithm for real-time crawler detection and intercepts and controls crawler access more effectively.
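The abstract describes extracting crawler features from access logs and comparing classifiers on accuracy, recall, and F1, but does not list the features or the log schema. The following is only a minimal sketch under assumptions: the record layout `(ip, timestamp_seconds, path)`, the four per-visitor features, and the function names `extract_features` and `classification_metrics` are all hypothetical and not taken from the thesis. The metric definitions themselves are the standard ones.

```python
from collections import defaultdict


def extract_features(records):
    """Group log records by visitor IP and derive simple behavior features.

    `records` is an iterable of (ip, timestamp_seconds, path) tuples.
    This schema and the feature set are assumptions for illustration.
    """
    by_ip = defaultdict(list)
    for ip, ts, path in records:
        by_ip[ip].append((ts, path))

    features = {}
    for ip, hits in by_ip.items():
        hits.sort()  # order each visitor's requests by timestamp
        times = [t for t, _ in hits]
        span = max(times[-1] - times[0], 1)  # at least 1 s, avoids division by zero
        features[ip] = {
            "requests": len(hits),
            "rate_per_sec": len(hits) / span,
            "distinct_paths": len({p for _, p in hits}),
            "robots_txt": any(p == "/robots.txt" for _, p in hits),
        }
    return features


def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 for binary labels (1 = crawler)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

In a pipeline like the one the abstract outlines, feature vectors of this kind would be fed to the chosen classifier, and the metrics would be computed on labeled visitors to compare candidate algorithms.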

Keywords/Search Tags:

crawler spider, real-time crawler detection, information disclosure, real-time, Intercept

Related items

1	Design And Implementation Of Large-scale Internet Information Real-time Extraction System
2	Research And Implement Of Improved Real-time Crawler Modeling
3	Research On Topic Focused Web Crawler And Related Technologies
4	Detection And Simple Use Of Time Information In Real-time Search Engine
5	Research And Implementation Of A Public Forum Information In Real-time Retrieval
6	Design And Implementation Of Crawler Based On Real-time Distributed Network
7	The Research Of Real-time Search And Semantic Understanding Of Dynamic Traffic Information Internet Web Page Contains
8	Research On Technologies Of Real-Time And Wideband Field Network
9	Research On Real-time Information Theory And Real-time Information Acquisition In Internet Of Things
10	The Embedded Real-time Process Management And Its Support For Real-time Databases