Design And Implementation Of Large-scale Internet Information Real-time Extraction System

Posted on: 2017-09-08
Degree: Master
Type: Thesis
Country: China
Candidate: H Pan
Full Text: PDF
GTID: 2348330518493521
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of Internet technology, the amount of information on the Internet is increasing rapidly. Internet data is large in scale, varied in type, high in value, and strongly time-sensitive. At the same time, people increasingly need to obtain information conveniently, effectively, and quickly. Website resources make up the main part of the information on the Internet, so it is vital to study how to obtain the latest high-quality information from large-scale website resources conveniently and effectively.

The web crawler is the main tool for obtaining this information, but a crawler running on a single machine is too slow, so a crawler that collects large-scale data needs to be built on a distributed platform. Existing distributed web crawlers, however, have poor real-time performance and cannot obtain the newest data. In addition, existing duplicate-URL removal algorithms either waste memory or run slowly. To address these problems, this paper proposes a large-scale Internet information real-time extraction system based on Storm, and improves both the real-time performance and the duplicate-URL removal algorithm of the traditional web crawler. The main research of this paper includes:

(1) To address the performance problem of distributed crawler systems, a web crawler system based on Storm, a distributed real-time stream processing system, is proposed. Based on an analysis of the principles of Storm and of web crawlers, a distributed web crawler system based on Storm is designed and implemented, which brings the distributed computing model more in line with the working process of a web crawler and improves the performance of the distributed crawler system.

(2) To address the duplicate-URL removal problem, a multidimensional Bloom filter is proposed. By mapping each URL through multiple groups of hash functions onto a multidimensional bit vector, the algorithm removes duplicate URLs quickly and efficiently, which improves both the accuracy of duplicate-URL removal and the performance of the distributed crawler system.

(3) To address the real-time problem in distributed crawler systems, a time-based real-time crawling strategy is proposed. The strategy uses the history of a web page's update frequency to predict when the page should next be visited, which makes it possible to crawl the latest pages on the Internet accurately and improves the real-time performance of the distributed crawler system.

In summary, this paper studies how to obtain information from large-scale website resources in response to people's current need for Internet data. It proposes a time-based real-time crawling strategy to improve the crawler's real-time performance and a multidimensional Bloom filter to reduce the false recognition rate. On this basis, the large-scale Internet information real-time extraction system is built on an open-source distributed real-time computing system. The system obtains the latest required data from the Internet effectively and reliably, and the feasibility of the system design is tested and verified through experiments.
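
The duplicate-URL removal idea in item (2) can be illustrated with a small sketch. The abstract does not give the layout of the multidimensional Bloom filter, so the version below rests on an assumption: each "dimension" is a separate bit vector with its own group of hash functions, and a URL counts as already seen only when every group reports it. The class name, parameters, and the SHA-256-based hashing are hypothetical choices for illustration, not the design used in the thesis.

```python
import hashlib


class MultiGroupBloomFilter:
    """Illustrative sketch: one bit vector per hash group (an assumed layout)."""

    def __init__(self, groups=3, bits_per_group=1 << 16, hashes_per_group=2):
        self.bits = bits_per_group
        self.hashes = hashes_per_group
        # One bytearray-backed bit vector per group ("dimension").
        self.vectors = [bytearray(bits_per_group // 8) for _ in range(groups)]

    def _positions(self, url, group):
        # Derive each bit position from SHA-256 salted with the group
        # index and the hash index, so the groups behave independently.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{group}:{i}:{url}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, url):
        for g, vec in enumerate(self.vectors):
            for pos in self._positions(url, g):
                vec[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        # Reported as seen only if every bit in every group is set.
        return all(
            vec[pos // 8] & (1 << (pos % 8))
            for g, vec in enumerate(self.vectors)
            for pos in self._positions(url, g)
        )

    def seen_before(self, url):
        """Return True if url was probably seen; otherwise record it and return False."""
        if url in self:
            return True
        self.add(url)
        return False
```

A crawler would call `seen_before(url)` before enqueueing a link. As with any Bloom filter, a URL that was added is never reported as new, while an unseen URL can occasionally be reported as seen. Requiring agreement across several independent groups is one way to lower that false-positive rate at the cost of extra memory.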
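
The time-based crawling strategy in item (3) predicts the next visit from a page's update history. The abstract does not state the prediction formula, so the sketch below assumes the simplest option: take the mean gap between observed changes and schedule the next visit that long after the last one, clamped to a minimum and maximum interval. The function name, the clamping bounds, and the fallback for pages with little history are all hypothetical.

```python
from datetime import datetime, timedelta


def predict_next_visit(change_times, last_visit,
                       min_interval=timedelta(minutes=5),
                       max_interval=timedelta(days=7)):
    """Estimate when to revisit a page from the times it was seen to change."""
    if len(change_times) < 2:
        # Not enough history to estimate a gap: revisit after the shortest interval.
        return last_visit + min_interval
    ordered = sorted(change_times)
    # Mean gap between consecutive observed changes.
    mean_gap = (ordered[-1] - ordered[0]) / (len(ordered) - 1)
    # Clamp so fast-changing pages are not hammered and static pages are not forgotten.
    interval = min(max(mean_gap, min_interval), max_interval)
    return last_visit + interval


# Example: a page seen to change at 00:00, 01:00 and 03:00 has a mean gap of 1.5 hours.
t0 = datetime(2017, 1, 1)
history = [t0, t0 + timedelta(hours=1), t0 + timedelta(hours=3)]
next_visit = predict_next_visit(history, last_visit=t0 + timedelta(hours=3))
```

In a scheduler, the predicted times could serve as priorities in a queue so that the pages expected to change soonest are fetched first.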
Keywords/Search Tags: distributed web crawler, Storm, duplicated URL removal, real-time crawler