
Design And Implementation Of A Web Crawler Based On Deep Web Deep Data Acquisition

Posted on: 2014-01-17
Degree: Master
Type: Thesis
Country: China
Candidate: W Chen
GTID: 2268330398487912
Subject: Computer software and theory
Abstract/Summary:
In the information age, the amount of information on the Internet is growing rapidly. Storing large amounts of data is easy, but finding useful information is increasingly difficult. General search engines offer one solution to this problem. However, part of the data on the Web is stored in the databases behind Web sites. This data cannot be reached by following the links in a Web page; instead, the user must manually fill in a form on the site and submit a query to access it. Such data is called Deep Web data. Compared with the information in a site's static pages, Deep Web data is more specialized, larger in volume, and more valuable to users. When a general search engine crawls for information, it cannot reach Deep Web data sources, which limits the valuable information available to its users.

The "1912 Revolution in the E Era" search engine is a vertical search engine for researchers who study the historical events of the 1912 Revolution, and its Web crawler subsystem is one of the key components that must be developed. Building on general search engine techniques and on an analysis of the structural characteristics of Deep Web data, this thesis presents a scheme for detecting and accessing Deep Web data sources. It solves two major problems:

1. The node characteristics of Deep Web query interfaces are analyzed to build a node feature library. When the crawler visits a new page, it matches the nodes of the current page against the library to decide whether the page may contain a Deep Web data source. In this way the crawler automatically discovers Deep Web data sources while crawling and records the related information in a file.

2. The crawler reads the Deep Web file, assembles a query request for each Deep Web data source, and retrieves the information the site returns. It computes page similarity to find pages similar to the query results page. By analyzing the structure of the query results page and of these similar pages, it extracts the result links and paging links from the query results page and discards navigation links, advertising links, and the like, so that the query results are retrieved accurately.

Research and experiments show that this Deep Web data source detection and acquisition model finds the query interfaces of Web pages more easily and extracts Deep Web query results more accurately.
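The first of the two problems, matching the nodes of a page against a node feature library, can be illustrated with a minimal sketch. The abstract does not list the features the thesis actually uses, so the `FEATURE_LIBRARY` contents, the function names, and the matching rule below are all hypothetical assumptions chosen only to show the general shape of such a check. The sketch uses only Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

# Hypothetical node feature library. The thesis does not publish its
# features, so these values are illustrative assumptions only.
FEATURE_LIBRARY = {
    "query_hints": ("search", "query", "keyword", "find"),
    "reject_input_types": ("password",),
}


class FormCollector(HTMLParser):
    """Collect each <form> together with the control nodes inside it."""

    def __init__(self):
        super().__init__()
        self.forms = []       # forms that have been fully parsed
        self._current = None  # form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action") or "",
                "method": (attrs.get("method") or "get").lower(),
                "controls": [],
            }
        elif self._current is not None and tag in ("input", "select", "textarea"):
            # An <input> without a type attribute defaults to a text field.
            default_type = "text" if tag == "input" else tag
            self._current["controls"].append({
                "tag": tag,
                "type": (attrs.get("type") or default_type).lower(),
                "name": attrs.get("name") or "",
            })

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def looks_like_query_interface(form, library=FEATURE_LIBRARY):
    """Return True if the form's nodes match the (assumed) feature library."""
    controls = form["controls"]
    # A password field suggests a login or registration form, not a query.
    if any(c["type"] in library["reject_input_types"] for c in controls):
        return False
    # A query interface needs at least one control the user can fill in.
    fillable = [
        c for c in controls
        if c["tag"] in ("select", "textarea") or c["type"] in ("text", "search")
    ]
    if not fillable:
        return False
    # Look for hint words in the form action and in the control names.
    haystack = " ".join([form["action"]] + [c["name"] for c in controls]).lower()
    return any(hint in haystack for hint in library["query_hints"])


def find_query_interfaces(html):
    """Return the forms in an HTML page that look like query interfaces."""
    collector = FormCollector()
    collector.feed(html)
    return [f for f in collector.forms if looks_like_query_interface(f)]
```

Calling `find_query_interfaces(page_html)` on a fetched page returns a list of dictionaries, one per candidate form, holding the `action`, `method`, and `controls` that a crawler could record in its Deep Web file and later use to assemble query requests. A real feature library would presumably be derived from observed query interfaces rather than a fixed list of hint words.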
Keywords/Search Tags: Web crawler, Deep Web, Vertical search engine, Theme correlation degree, The query interface of Deep Web