
Research And Optimization Of Dynamic Web Crawler Based On Webmagic

Posted on: 2017-12-01  Degree: Master  Type: Thesis
Country: China  Candidate: J F Cai  Full Text: PDF
GTID: 2348330518493383  Subject: Cryptography
Abstract/Summary:
With the explosive growth in the number of web pages, the traditional centralized crawler can no longer meet practical needs. In addition, the widespread adoption of Ajax in web applications has fundamentally changed traditional web development. Partial page refresh improves the user experience and lets users interact with the remote server without reloading the page. Typical applications include campus BBS sites, blog sites and so on. The large number of resulting dynamic pages poses a major obstacle for web crawlers, affecting both crawling efficiency and the ability to obtain page content. To address these two problems, this thesis proposes Dis-Dyn Crawler, a distributed dynamic page crawler system based on the WebMagic crawler framework. The system follows an SOA design: the time-consuming work done by HtmlUnit, the dynamic page parsing tool, is separated out as an independent service. To improve parsing efficiency, the JavaScript files and other resources needed by HtmlUnit are cached in a Redis database, so that page rendering does not have to download them from the Internet each time, which reduces network requests. An asynchronous page downloader further improves the overall efficiency of the system. Finally, in a purpose-built test environment, a detailed test plan covering both functionality and performance is designed and carried out on the Dis-Dyn Crawler system. A comparison with the crawling ability of an existing distributed web crawler tool verifies the high efficiency of the WebMagic-based dynamic web crawler presented in this thesis. A comparison with an existing dynamic web page parsing tool verifies the high efficiency and feasibility of the proposed Dis-Dyn Crawler system.
Keywords/Search Tags:Distributed Crawler, Dynamic Web Page, HtmlUnit, ZeroMQ, Redis
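
The resource caching idea in the abstract (look in the cache first, download only on a miss) can be illustrated with a minimal cache-aside sketch. This is not the thesis implementation: the abstract does not give the key scheme, expiry policy or client code, so the class name `ScriptCache`, the helper `fake_download` and the use of a plain dictionary in place of Redis are all assumptions made for illustration. In a real deployment the dictionary would be replaced by a Redis client and the download function by an HTTP fetch.

```python
from typing import Callable, MutableMapping, Optional


class ScriptCache:
    """Cache-aside lookup for page resources such as JavaScript files.

    `store` is any mapping; here a dict stands in for Redis (assumption).
    `fetch` is whatever function downloads a resource by URL (assumption).
    """

    def __init__(
        self,
        fetch: Callable[[str], str],
        store: Optional[MutableMapping[str, str]] = None,
    ) -> None:
        self._fetch = fetch
        self._store = {} if store is None else store
        self.hits = 0    # lookups served from the cache
        self.misses = 0  # lookups that needed a download

    def get(self, url: str) -> str:
        cached = self._store.get(url)
        if cached is not None:
            # Cache hit: no network request is made.
            self.hits += 1
            return cached
        # Cache miss: download once, then keep the body for later renders.
        self.misses += 1
        body = self._fetch(url)
        self._store[url] = body
        return body


def fake_download(url: str) -> str:
    """Hypothetical stand-in for a real HTTP download."""
    return "/* script from " + url + " */"


if __name__ == "__main__":
    cache = ScriptCache(fake_download)
    cache.get("http://example.com/app.js")  # miss: triggers a download
    cache.get("http://example.com/app.js")  # hit: served from the cache
    print("misses:", cache.misses, "hits:", cache.hits)
```

When the same script is referenced by many pages, only the first render pays for the download, which is the reduction in network requests the abstract describes.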