Research On Key Technologies Of A High-performance Web Crawler System

Posted on:2018-05-28

Degree:Master

Type:Thesis

Country:China

Candidate:C Yu

Full Text:PDF

GTID:2348330542472250

Subject:Computer Science and Technology

Abstract/Summary:

The rapid development of Internet technology has brought rapid growth in online information. At the same time, many research fields that depend on web crawlers, such as data mining and search engines, have emerged. The efficiency of the web crawler system strongly influences this research, and how accurately and quickly a crawler retrieves information from the network is the key measure of its performance. This thesis improves the accuracy of target link extraction and the efficiency of the business data cache in order to raise the performance of the crawler system.

To address target link extraction, this thesis improves accuracy in two respects: non-duplicate link extraction and relevant link extraction. First, the Bloom filter is improved, and a multi-level dynamic Bloom filter based on link features is proposed. The new filter segments and recombines URLs and matches them multiple times, which lowers the false positive rate of URL de-duplication and so improves the accuracy of non-duplicate link extraction. Second, a relevant link extraction algorithm based on link attributes is proposed. The algorithm sets rules for link attributes such as page structure, semantics, topic, and link-text ratio. Links that do not comply with the rules are treated as noise and filtered out, so the crawler avoids crawling them. Experiments verify that both methods improve the accuracy of target link extraction.

To address business data cache efficiency, this thesis improves two caches: the web page data cache and the DNS cache. First, a web page data cache management model is designed to reduce the time the crawler spends allocating and releasing memory for cached page data. The model adds a global management thread on top of thread-private memory pools to balance the memory nodes available to individual threads, which reduces the number of memory requests made to the operating system and improves crawling efficiency. Second, a DNS pre-resolution caching algorithm based on a three-level hash is designed to reduce the time the crawler spends on domain name resolution. The algorithm resolves domain names speculatively and stores the resulting IP addresses in a three-level hash cache, so that the crawler does not have to query a public domain name server every time. Experiments show that both methods improve the efficiency of the business data cache.
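
The abstract does not give the construction of the multi-level Bloom filter, so the following is only a minimal sketch of the general idea of segmenting a URL and matching it at several levels. Everything in it is an assumption made for illustration: the class names `BloomFilter` and `SegmentedUrlFilter`, the choice of three levels (host, host plus path, full URL), the filter sizes, and the hashing scheme. The "dynamic" resizing mentioned in the abstract is not modelled.

```python
import hashlib
from urllib.parse import urlsplit


class BloomFilter:
    """Plain fixed-size Bloom filter (sizes are illustrative assumptions)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive k bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


class SegmentedUrlFilter:
    """Hypothetical multi-level filter: one Bloom filter per URL prefix."""

    def __init__(self):
        # Level 0: host, level 1: host + path, level 2: host + path + query.
        self.levels = [BloomFilter(), BloomFilter(), BloomFilter()]

    @staticmethod
    def _keys(url):
        parts = urlsplit(url)
        host = parts.netloc.lower()
        host_path = host + parts.path
        full = host_path + "?" + parts.query
        return [host, host_path, full]

    def seen(self, url):
        # A URL counts as a duplicate only if every level reports a match,
        # so a false positive needs a collision at the full-URL level while
        # the host and path prefixes are also (really or falsely) present.
        return all(key in level for key, level in zip(self._keys(url), self.levels))

    def add(self, url):
        for key, level in zip(self._keys(url), self.levels):
            level.add(key)

    def check_and_add(self, url):
        """Return True if the URL is new (and record it), False if seen."""
        if self.seen(url):
            return False
        self.add(url)
        return True


url_filter = SegmentedUrlFilter()
first_visit = url_filter.check_and_add("http://example.com/news?id=1")
second_visit = url_filter.check_and_add("http://example.com/news?id=1")
```

In this sketch a URL that was added is always reported as seen, so there are no false negatives, and the false positive rate is no higher than that of the full-URL filter alone, because the extra levels can only reject a match. Whether the thesis combines its levels in the same way is not stated in the abstract.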

Keywords/Search Tags:

Web Crawler, Bloom Filter, Link Extraction, Data Caching

Related items

1	Research On Technologies Of Distributed Link Extraction And DNS Cache
2	Investigation On Web Crawler Technology Based On Hadoop Platform
3	Research And Application Of Data Deduplication Technology Based On Bloom Filter
4	Researches And Applications On Efficient Bloom Filter For Big Data
5	Research On Information Storage Optimization In Named Data Networking
6	Research And Application Of Efficient Data Acquisition Methods For Domain Data
7	Privacy Preserved Bloom Filter And Key-value Based Bloom Filter
8	Research On Neighbor Object Availability-based Cooperative Caching Policy For ICN
9	Multi-Bloom-Filter Query Algorithms And Their Applications
10	The Implementation And Application Of Removing Duplicated Web Pages Based On Bloom Filter