
Design And Implementation Of Configurable Distributed WEB Information Crawling System

Posted on: 2017-06-04
Degree: Master
Type: Thesis
Country: China
Candidate: A Xiao
Full Text: PDF
GTID: 2348330518496572
Subject: Electronic Science and Technology
Abstract/Summary:
With the rapid development of Internet technology, Web information is growing exponentially. This data is characterized by large volume, diverse types, and high practical and commercial value, and people's demand for quick and convenient access to information is also growing strongly. These requirements have driven the rapid development of cloud computing: Google, IBM, Apache, Amazon, and other major companies are competing in this field, and the Hadoop platform led by Apache is an excellent open-source cloud computing framework. This paper studies a self-configuring distributed Web information crawling system designed and implemented on the basis of this framework.

The paper takes e-commerce websites as its research object, with the goal of crawling the abundant and diverse commodity information they publish. In today's era of rapid e-commerce and Internet development, this massive commodity information carries great commercial and practical value, so obtaining product information quickly and accurately is of real significance. The paper therefore analyzes the mainstream e-commerce sites in depth, examining their site structure, page layout, and presentation of product information, and summarizes the technical problems encountered when crawling product information from them, including dynamic page analysis, product information localization, dynamically loaded product data, data completeness, and URL deduplication.

For this series of problems, the paper designs and proposes corresponding solutions. For three typical page structures, it develops matching commodity information crawling strategies, and the self-configuring approach ensures the system's flexibility and scalability to a certain extent. Loading pages in a browser enables crawling of dynamic web pages, and a commodity information feature library together with a product information extraction rule model enables precise location and extraction of product information on a page. To solve the data completeness problem, the paper proposes a price range partitioning algorithm based on an adaptive step, and to handle duplicate data it designs a deduplication strategy based on a Bloom filter. All of these schemes were ultimately implemented and achieved good experimental results.

On the other hand, the paper analyzes the development status of distributed crawlers and studies the theory and technology of the distributed file system HDFS and the Map/Reduce distributed programming framework on the Hadoop platform. On this basis, it designs and implements the self-configuring distributed Web information crawling system to overcome the low efficiency and poor scalability of a single-machine crawler. The self-configuration capability ensures that the system can be applied to crawling tasks on different sites, and the distributed features increase crawling speed and expand the scale of data that can be collected.
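
The abstract's approach to dynamic pages, loading them in a real browser so that JavaScript executes before extraction, can be sketched as follows. The thesis does not name a tool; Selenium with ChromeDriver is assumed here purely for illustration, not as the author's actual implementation.

```java
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

// Sketch of crawling a dynamic page by driving a browser: the browser
// executes the page's JavaScript, so getPageSource() returns the HTML
// after dynamic rendering rather than the bare server response.
public class BrowserFetcher {
    public static String fetchRenderedHtml(String url) {
        WebDriver driver = new ChromeDriver();
        try {
            driver.get(url);               // load page and run its scripts
            return driver.getPageSource(); // HTML after dynamic rendering
        } finally {
            driver.quit();                 // always release the browser
        }
    }
}
```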
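
The abstract names a "price range partitioning algorithm based on an adaptive step" for data completeness but gives no details. The Java sketch below shows one plausible shape of such an algorithm, assuming the motivating problem is that a site caps how many results a single query can return; the `ResultCounter` hook, the cap `maxResults`, and all thresholds are illustrative assumptions, not the thesis's actual parameters.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of price-range partitioning with an adaptive step. Assumes the
// target site only exposes the first maxResults hits of any query, so the
// full price axis must be split into ranges small enough to fit under
// that cap. ResultCounter stands in for a real "search within [low, high)"
// request against the site.
public class PriceRangePartitioner {

    public interface ResultCounter {
        long count(double low, double high);
    }

    public static List<double[]> partition(double minPrice, double maxPrice,
                                           long maxResults, ResultCounter counter) {
        List<double[]> ranges = new ArrayList<>();
        double step = (maxPrice - minPrice) / 10.0; // initial step, a guess
        double low = minPrice;
        while (low < maxPrice) {
            double high = Math.min(low + step, maxPrice);
            long n = counter.count(low, high);
            if (n > maxResults && high - low > 0.01) {
                step /= 2.0;   // too many hits: shrink the step and retry
                continue;
            }
            ranges.add(new double[]{low, high});
            if (n < maxResults / 4) {
                step *= 2.0;   // sparse range: widen the next step
            }
            low = high;
        }
        return ranges;
    }
}
```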
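
The URL deduplication strategy is described only as "based on a Bloom filter"; a minimal Java sketch of the general technique follows, deriving the k probe positions from two base hashes (the Kirsch-Mitzenmacher construction). The hash choices and sizing are illustrative, not those used in the thesis.

```java
import java.util.BitSet;

// Minimal Bloom filter for URL deduplication. A Bloom filter answers
// "possibly seen" or "definitely not seen": false positives are possible,
// false negatives are not, which is acceptable for skipping crawled URLs.
public class UrlBloomFilter {
    private final BitSet bits;
    private final int size;       // number of bits in the filter
    private final int numHashes;  // probe positions per URL

    public UrlBloomFilter(int size, int numHashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.numHashes = numHashes;
    }

    // i-th probe position from two base hashes: g_i(x) = h1(x) + i * h2(x).
    private int position(String url, int i) {
        int h1 = url.hashCode();
        int h2 = new StringBuilder(url).reverse().toString().hashCode();
        return Math.floorMod(h1 + i * h2, size);
    }

    public void add(String url) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(url, i));
        }
    }

    // True means the URL may already have been crawled; false means it
    // certainly has not been seen before.
    public boolean mightContain(String url) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(url, i))) {
                return false;
            }
        }
        return true;
    }
}
```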
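
The abstract states that the distributed side is built on Hadoop's HDFS and Map/Reduce but does not show how crawl work is divided. One common arrangement, sketched below using the real Hadoop Mapper API, is to key seed URLs by host so that each reduce task handles one site; the class and the grouping policy are assumptions for illustration, not the thesis's design.

```java
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of one Map/Reduce round of distributed crawling: the map phase
// reads seed URLs (one per line) and emits (host, url) pairs, so Hadoop's
// shuffle groups all URLs of a site onto the same reduce task, which can
// then crawl them with per-site politeness limits.
public class UrlPartitionMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String url = line.toString().trim();
        if (url.isEmpty()) {
            return;
        }
        try {
            String host = URI.create(url).getHost();
            if (host != null) {
                context.write(new Text(host), new Text(url));
            }
        } catch (IllegalArgumentException e) {
            // Malformed seed URL: skip it rather than fail the task.
        }
    }
}
```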
Keywords/Search Tags: self-configuration, distributed, web information crawling, Map/Reduce