Design and Implementation of a Distributed Commodity Information Web Crawler System

Posted on: 2015-01-10
Degree: Master
Type: Thesis
Country: China
Candidate: F G Yao
Full Text: PDF
GTID: 2308330452957221
Subject: Computer technology
Abstract/Summary:
Since the World Wide Web was born in 1989, e-commerce has developed rapidly along with the spread of the Internet. It has not only promoted economic globalization but also greatly changed people’s lifestyles. As e-commerce websites have developed, more and more goods have appeared online and the forms of online shopping have diversified. How to help and guide users faced with so many choices when shopping online is a topic worth studying, especially as mobile terminals diversify. This thesis designs and implements a distributed commodity information crawler system that downloads commodity information from shopping sites and provides the data to a figure search site, with the aim of helping users shop.

First, the technologies related to web crawlers are introduced; the thesis then focuses on the crawling strategy, page parsing, and the stability of the crawler system. For the crawling strategy, an improvement to breadth-first crawling is proposed after a comparative analysis of the merits and demerits of existing strategies: the original URL queue is replaced with a priority queue by assigning weights to URLs, so that crawling becomes more purposeful. For page parsing, the use of JavaScript in web pages means that some commodity information cannot be extracted directly. Two solutions are proposed: one is to crawl the mobile version of the page to bypass JavaScript; the other is to access the page through a simulated browser to obtain all the data and then parse the page. To cope with complex network conditions, the crawler system improves its stability and disaster tolerance by checking and restarting threads, backing up the central node, and crawling incrementally.

Testing shows that more than 99% of the target commodities are crawled correctly, confirms the efficiency of the distributed architecture, and demonstrates the improvement in crawling efficiency.
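
The weighted-URL idea in the abstract can be illustrated with a minimal Python sketch of a priority-queue frontier. This is not the thesis's implementation: the class name `PriorityFrontier`, the helper `score_url`, the URL patterns, and the weight values are all hypothetical, chosen only to show how replacing a FIFO queue with a priority queue changes the crawl order.

```python
import heapq
import itertools


class PriorityFrontier:
    """URL frontier that pops the highest-weight URL first (hypothetical sketch)."""

    def __init__(self):
        self._heap = []                    # entries: (-weight, sequence, url)
        self._seen = set()                 # URLs that were already enqueued
        self._counter = itertools.count()  # tie-breaker: equal weights stay FIFO

    def push(self, url, weight=0.0):
        """Enqueue a URL once; return False if it was already seen."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so the weight is negated to pop the largest first.
        heapq.heappush(self._heap, (-weight, next(self._counter), url))
        return True

    def pop(self):
        """Return the pending URL with the highest weight."""
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)


def score_url(url):
    """Assumed weighting rule: product pages first, then category pages."""
    if "/item/" in url:
        return 2.0
    if "/category/" in url:
        return 1.0
    return 0.0
```

With equal weights the frontier behaves like the plain breadth-first queue, since the sequence counter preserves insertion order; the weights only reorder URLs that the scoring rule considers more likely to lead to commodity pages.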
Keywords/Search Tags: commodity information crawler, distributed, crawling strategy, page parsing