Research And Practice On Key Techniques Of Deep Web Crawl

Posted on: 2011-04-05
Degree: Master
Type: Thesis
Country: China
Candidate: M Y Feng
Full Text: PDF
GTID: 2178360302974615
Subject: Computer software and theory
Abstract/Summary:
As the Internet continues to develop, the amount of information on the network has grown rapidly. Based on how information is accessed, the Internet can be divided into the shallow web and the deep web. The shallow web can be crawled by following hyperlinks, which is how a common search engine works. Deep web information, by contrast, is hidden behind web search boxes: the user must submit queries through a web form to obtain it.

Because crawling the deep web is clearly different from crawling the shallow web, a traditional hyperlink-based web crawler cannot crawl or index deep web information. As the amount of useful information in the deep web grows, gaining access to it is of great significance for search engines.

We designed a deep web crawler based on the most efficient queries. Our method addresses two problems in deep web crawling: the low level of automation and the restriction to a particular domain. The crawler contains three core algorithms. First, recognition of deep web entry points: the algorithm is trained on the form controls, text context, and depth in the site of a large number of HTML pages. Second, calculation of the most efficient initial queries: the form page is divided into two spaces, the form space and the text context space, and the K-Means algorithm is then run on the crawled pages to obtain the candidate queries. Third, iterative query submission: the most efficient queries are submitted to the site, and the result pages are analyzed to update the candidate queries iteratively.

Finally, we designed the system and tested all the algorithms. We implemented the system on the basis of the theoretical analysis. The experimental results verified the effectiveness of the algorithms.
Keywords/Search Tags: Deep Web, Deep Web Crawler, Page Cluster, Most Efficient Queries
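
The iterative step described in the abstract (submit a query, analyze the result pages, update the candidate queries) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the in-memory `RECORDS` table stands in for a site's hidden database, `search` stands in for form submission, and "most efficient" is approximated by how often a term appears in newly retrieved records. The thesis's own efficiency measure and its K-Means step for choosing initial queries are not reproduced here.

```python
from collections import Counter

# Hypothetical stand-in for the hidden database behind a search form.
RECORDS = {
    1: "python web crawler tutorial",
    2: "deep web crawler design",
    3: "deep learning tutorial",
    4: "web form parsing",
    5: "learning python basics",
    6: "database index design",
}


def search(term):
    """Simulate submitting `term` through the form; return matching record ids."""
    return {rid for rid, text in RECORDS.items() if term in text.split()}


def crawl(seed_terms, max_queries=10):
    """Greedy loop: submit a query, harvest new records, re-rank candidates.

    Assumption: a candidate's score is the number of newly retrieved
    records it appeared in (a simple proxy for query efficiency).
    """
    retrieved = set()   # record ids harvested so far
    submitted = []      # queries issued, in order
    candidates = Counter({term: 1 for term in seed_terms})

    while candidates and len(submitted) < max_queries:
        # Highest score first; ties broken alphabetically for determinism.
        term = min(candidates, key=lambda t: (-candidates[t], t))
        del candidates[term]
        submitted.append(term)

        new_ids = search(term) - retrieved
        retrieved |= new_ids

        # Update candidate queries from the words of the new result records.
        for rid in new_ids:
            for word in RECORDS[rid].split():
                if word not in submitted:
                    candidates[word] += 1

    return submitted, retrieved


queries, ids = crawl(["web"], max_queries=3)
print(queries, sorted(ids))
```

Starting from the single seed term "web", each round widens the candidate pool with vocabulary found in the result records, so the crawler can reach records that do not contain the seed term at all. A real crawler would also need a stopping rule based on diminishing returns, which this sketch replaces with a fixed query budget.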