Research On Key Technologies Of A High-performance Web Crawler System

Posted on: 2018-05-28 | Degree: Master | Type: Thesis
Country: China | Candidate: C Yu | Full Text: PDF
GTID: 2348330542472250 | Subject: Computer Science and Technology
Abstract/Summary:
The rapid development of Internet technology has brought rapid growth in online information. At the same time, many research fields that depend on web crawlers, such as data mining and search engines, have emerged. The efficiency of the web crawler system strongly influences this research, and how accurately and quickly a crawler collects information from the network is a key measure of its performance. This paper improves the accuracy of target link extraction and the efficiency of the business data cache in order to raise the performance of the crawler system.

To address the link extraction problem, this paper improves accuracy in two respects: non-repetitive link extraction and relation link extraction. First, the Bloom filter is improved, and a multi-level dynamic Bloom filter based on link features is proposed. The new filter segments and combines URLs to perform multiple matching, which reduces the false positive rate of URL de-duplication and so improves the accuracy of non-repetitive link extraction. Second, a relation link extraction algorithm based on link attributes is proposed. The algorithm sets rules for different link attributes such as page structure, semantics, themes, and link text ratios. Links are compared against these rules, and noise links that do not comply are filtered out, which improves extraction accuracy and keeps the crawler from crawling noise links. Experiments verify that the two methods improve the accuracy of target link extraction.

To address business data cache efficiency, this paper improves the business data cache in two ways: the web page data cache and the DNS cache. First, a web page data cache management model is designed to reduce the time the crawler system spends allocating and releasing memory in the web page data cache, thereby raising the crawler's web data caching efficiency. Building on thread-private memory pools, the model adds a global management thread that balances the memory nodes available to individual threads, which reduces the number of memory requests made to the operating system and improves crawling efficiency. Second, a DNS pre-resolution caching algorithm based on a three-level hash is designed to reduce the time crawlers spend on domain name resolution. The algorithm performs speculative DNS resolution and saves the resulting IP addresses in a three-level hash cache structure, so that the crawler system does not have to request IP information from a public domain name server every time. By caching DNS information and thus reducing DNS resolution time, the algorithm improves crawler performance. Finally, experiments show that the two methods improve the efficiency of the business data cache.
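
The URL de-duplication idea in the abstract (segmenting a URL and requiring several matches before declaring it a duplicate) can be illustrated with a minimal Python sketch. This is not the thesis's algorithm: the abstract gives no implementation details, so the split into host, path, and full-URL levels, the class names `BloomFilter` and `SegmentedUrlFilter`, the filter sizes, and the SHA-256-based hashing are all assumptions made for illustration, and the "dynamic" resizing aspect is not modelled at all.

```python
import hashlib
from urllib.parse import urlsplit


class BloomFilter:
    """Plain fixed-size Bloom filter over strings (sizes are illustrative)."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive k bit positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


class SegmentedUrlFilter:
    """Hypothetical multi-level filter: one Bloom filter per URL segment.

    A URL is reported as a duplicate only when every level matches, so a
    false positive needs simultaneous false hits on all levels.
    """

    def __init__(self):
        self.levels = {
            "host": BloomFilter(),
            "path": BloomFilter(),
            "full": BloomFilter(),
        }

    @staticmethod
    def _segments(url: str) -> dict:
        parts = urlsplit(url)
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        return {"host": parts.netloc.lower(), "path": path, "full": url}

    def seen(self, url: str) -> bool:
        return all(
            segment in self.levels[name]
            for name, segment in self._segments(url).items()
        )

    def add(self, url: str) -> None:
        for name, segment in self._segments(url).items():
            self.levels[name].add(segment)

    def check_and_add(self, url: str) -> bool:
        """Return True if the URL was probably seen; otherwise record it."""
        if self.seen(url):
            return True
        self.add(url)
        return False
```

A crawler frontier could call `check_and_add(url)` before enqueueing a link and skip the link when it returns `True`. Because a genuinely repeated URL matches on every level, the extra levels add no false negatives under this scheme; they only make false positives less likely than with a single filter of the same size.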
Keywords/Search Tags:Web Crawler, Bloom Filter, Link Extraction, Data Caching