
Research And Implementation Of A Combined Focused Crawler Based On Protocol-Driven And Event-Driven Crawling Techniques

Posted on: 2010-03-07
Degree: Master
Type: Thesis
Country: China
Candidate: X J Yuan
Full Text: PDF
GTID: 2178360278456748
Subject: Computer Science and Technology
Abstract/Summary:
AJAX is widely used in Web 2.0 applications because it can change page content dynamically. Deferred loading of page content makes the user interface more interactive, but it also makes web pages significantly harder to crawl. Analyzing JavaScript code and crawling page content that is transferred asynchronously have therefore become research topics.

A focused crawler selects which URLs to visit from a page according to a focus description. When the data model the user wants spans several pages, the focused crawler should fetch these pages and construct the data model quickly and accurately. We adopt a combined focused crawling algorithm that joins protocol-driven and event-driven crawling, together with an easily extensible object description, to realize multi-page associated focused crawling. The main contributions of this thesis are as follows.

1. We propose a model of multi-page associated focused crawling. Based on the page layers and the optimal crawling path in the user profile, we add association semantics to the address model, so that the data model can be obtained quickly and accurately and multi-page associated focused crawling is realized.
2. We propose an easily extensible vector model for the focus description, so that target sites can be added and deleted easily and the description matches the multi-page associated focused crawling algorithm.
3. We propose the framework of a combined focused crawler based on protocol-driven and event-driven crawling. The basic functions of the protocol-driven module, the event-driven module, the coroutine module, and the common module are designed in detail. We mainly study the models and definitions involved when the protocol-driven module crawls asynchronously transferred pages.
4. We design and implement a prototype system based on the combination of protocol-driven and event-driven crawling. For Sina news and comment data, we use the easily extensible vector model to implement associated focused crawling of two page layers within the combined focused crawler framework.
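
The abstract does not give the concrete form of the extensible focus description or of the two-layer association, so the following Python sketch is only one way such a structure might look. Every name in it (`LayerRule`, `FocusDescription`, `associate`, the `fetch_mode` field) and the `example.com` URL patterns are assumptions made for illustration and do not come from the thesis. The assignment of "protocol" or "event" to a layer is likewise arbitrary here.

```python
import re
from dataclasses import dataclass, field


@dataclass
class LayerRule:
    """One page layer of a target site (hypothetical format)."""
    layer: int          # 1 = top-level page, 2 = page associated with it
    url_pattern: str    # regular expression that identifies pages of this layer
    fetch_mode: str     # "protocol" or "event"; illustrative label only


@dataclass
class FocusDescription:
    """Per-site rule lists, so a target site can be added or removed on its own."""
    sites: dict = field(default_factory=dict)

    def add_site(self, name, rules):
        self.sites[name] = list(rules)

    def remove_site(self, name):
        self.sites.pop(name, None)

    def classify(self, url):
        """Return (site, rule) for the first matching rule, or None if off-focus."""
        for name, rules in self.sites.items():
            for rule in rules:
                if re.search(rule.url_pattern, url):
                    return name, rule
        return None


def associate(focus, parent_url, candidate_urls):
    """Attach on-focus URLs exactly one layer below the parent to one record."""
    parent = focus.classify(parent_url)
    if parent is None:
        return None
    site, parent_rule = parent
    children = []
    for url in candidate_urls:
        hit = focus.classify(url)
        if hit and hit[0] == site and hit[1].layer == parent_rule.layer + 1:
            children.append({"url": url, "fetch_mode": hit[1].fetch_mode})
    return {"site": site, "url": parent_url,
            "layer": parent_rule.layer, "children": children}


# Hypothetical two-layer site: article pages (layer 1) and comment pages (layer 2).
focus = FocusDescription()
focus.add_site("example-news", [
    LayerRule(1, r"^https://news\.example\.com/article/\d+$", "protocol"),
    LayerRule(2, r"^https://comment\.example\.com/thread/\d+$", "event"),
])

record = associate(
    focus,
    "https://news.example.com/article/42",
    [
        "https://comment.example.com/thread/42",   # layer 2, kept
        "https://news.example.com/article/43",     # layer 1, not a child
        "https://ads.example.org/banner",          # off-focus, dropped
    ],
)
print(record)
```

In this sketch, adding or deleting a target site touches only one entry of `sites`, which is the kind of extensibility the second contribution describes. How the thesis actually encodes layers and crawling paths may differ.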
Keywords/Search Tags: focused crawler, AJAX, protocol-driven, event-driven, JavaScript, page layer, associated crawling