
The Design And Implementation Of Deep Web Crawler System Based On Template Configuration

Posted on: 2022-04-21
Degree: Master
Type: Thesis
Country: China
Candidate: D J Kong
Full Text: PDF
GTID: 2518306725976959
Subject: software engineering direction
Abstract/Summary:
Web crawlers are now widely used in web services such as search engines and personalized recommendation, because these services depend on data extraction and parsing. A hidden database is a data set that an organization makes available on the web only by letting users query it through a search interface. In other words, data from such a source cannot be reached by following static hyperlinks; it is obtained by submitting queries to the interface and reading the dynamically generated result pages. This, together with other obstacles (for example, the interface may answer a query only partially), prevents hidden databases from being crawled effectively by existing search engines. With the emergence of dynamic web page technology, traditional information extraction methods based on static pages can no longer meet business requirements. On the one hand, dynamic web pages carry far more database-generated data than traditional static pages, and their content usually covers specific topics and contains valuable information. On the other hand, traditional crawling methods, such as seed-queue-based crawling, depth-first traversal and breadth-first traversal, cannot effectively obtain the information behind dynamic pages (also called the deep web). Building a crawler for the deep web is therefore valuable for both business and research.

In this thesis, a deep web crawler system based on template configuration is proposed to address the problems mentioned above. The crawler extracts hidden data by sending keywords to the target database through its web form. The system flow consists of five steps: first, locating the entrance of the deep web database; second, interacting with the search interface automatically; third, evaluating the attributes of the deep web database; fourth, selecting keywords; and finally, obtaining the crawling results. To implement these steps, this thesis studies the design and implementation of a deep web crawler system. The system mainly consists of five modules: parameter configuration, data crawler, data retrieval, data storage and data analysis. The system has been deployed online and runs stably, and it can effectively crawl most of the information in the target databases. Its design and implementation also provide design ideas and implementation guidance for other research and business applications.
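
The keyword-driven flow described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the `CrawlTemplate` fields, the `crawl` function, the in-memory `FAKE_DB` and the `fake_submit` stand-in for a real search form are all hypothetical names chosen for illustration. Keyword selection here is a naive substitute that simply harvests words from returned titles, and the step of evaluating the database's attributes is omitted.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CrawlTemplate:
    """Hypothetical per-site template; the field names are assumptions."""
    search_url: str           # entrance of the deep web database
    keyword_param: str        # form field that carries the query keyword
    seed_keywords: List[str]  # first keywords to submit
    max_queries: int = 10     # upper bound on submitted queries


Record = Dict[str, object]
SubmitFn = Callable[[str, Dict[str, str]], List[Record]]


def crawl(template: CrawlTemplate, submit_query: SubmitFn) -> Dict[object, Record]:
    """Submit keywords, store unseen records, and queue new keywords from results."""
    pending = list(template.seed_keywords)
    tried = set()
    records: Dict[object, Record] = {}
    while pending and len(tried) < template.max_queries:
        keyword = pending.pop(0)
        if keyword in tried:
            continue
        tried.add(keyword)
        form_data = {template.keyword_param: keyword}
        for record in submit_query(template.search_url, form_data):
            if record["id"] in records:
                continue  # already stored from an earlier query
            records[record["id"]] = record
            # Naive keyword selection (an assumption): reuse words from new titles.
            for word in str(record["title"]).lower().split():
                if word not in tried and word not in pending:
                    pending.append(word)
    return records


# In-memory stand-in for a hidden database behind a search form (illustrative data).
FAKE_DB = [
    {"id": 1, "title": "deep web crawler"},
    {"id": 2, "title": "crawler template"},
    {"id": 3, "title": "template configuration"},
    {"id": 4, "title": "unrelated entry"},
]


def fake_submit(url: str, form_data: Dict[str, str]) -> List[Record]:
    """Return the records whose title contains the submitted keyword as a word."""
    keyword = next(iter(form_data.values()))
    return [r for r in FAKE_DB if keyword in str(r["title"]).split()]


template = CrawlTemplate("http://example.invalid/search", "q", ["deep"])
found = crawl(template, fake_submit)
print(sorted(found))
```

In this toy run the record titled "unrelated entry" is never retrieved, because no harvested keyword matches it. This mirrors the abstract's claim that the system crawls most, rather than all, of the information in a database.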
Keywords/Search Tags: Template Configuration, Deep Web, Crawler