Design And Implementation Of Building Materials Information Oriented Web Crawler System

Posted on: 2016-02-10
Degree: Master
Type: Thesis
Country: China
Candidate: H B Yu
Full Text: PDF
GTID: 2308330467496849
Subject: Software engineering
Abstract/Summary:
With the development of the Internet, e-commerce has grown rapidly, but e-commerce for building materials has developed slowly and remains a blue ocean market. Many companies have spotted this opportunity and are dividing up market share through a variety of e-commerce websites. However, these websites usually cover only a limited range of product categories and regions, so they have had little influence on building materials e-commerce. The market urgently needs a website that covers all kinds of products across all regions of the country, yet it is hard to get those companies to agree to share their resources. Against this background, this thesis designs and implements a web crawler system that downloads key information from the Internet and publishes it on our own e-commerce website. The information services the website provides help construction enterprises and building materials suppliers achieve a win-win outcome.
This thesis introduces the basic working principle of web crawlers and the related theory, then presents the functional requirements analysis, the non-functional requirements analysis and, building on these, the technical feasibility analysis. It puts forward the overall design of the web crawler system and designs several submodules. The system can fetch information not only from static pages but also from dynamic JavaScript pages, with the help of the Rhino parsing engine. With the help of the Tesseract image parsing engine, the crawler can also extract key information embedded in images. To speed up page fetching, the thesis uses a DNS cache. To avoid fetching the same information repeatedly, it uses a Bloom filter to remove duplicate URLs. The thesis also implements a management system that monitors and manages the crawler's work. The resulting system satisfies the basic needs of its users. It has been running at the GuangLianDa software company, where it fetches building materials information successfully. The data obtained is stored in a MongoDB database, and to date the crawler has fetched about 200 million records.
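The Bloom filter step mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the class `BloomFilter`, the helper `filter_new_urls`, the bit-array size, the number of hash functions, and the use of SHA-256 with double hashing are all assumptions chosen for illustration. A Bloom filter can report a URL as already seen when it is not (a false positive), but it never misses a URL that was actually added.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter (hypothetical sketch, sizes are assumptions)."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 7) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive num_hashes bit positions from one SHA-256 digest using
        # double hashing: position_i = (h1 + i * h2) mod num_bits.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force h2 to be odd
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        # Set every bit position derived from the item.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # The item is "possibly present" only if all of its bits are set.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def filter_new_urls(urls, seen: BloomFilter):
    """Return the URLs not yet recorded in `seen`, recording each accepted one."""
    fresh = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh
```

In a crawler, a function like `filter_new_urls` would sit between link extraction and the fetch queue, so that each extracted URL is checked against the filter before being scheduled. The memory cost stays fixed regardless of how many URLs have been seen, at the price of occasionally skipping a URL that was never fetched.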
Keywords/Search Tags: web crawler, management system, regular expression parse template, duplicate URL removal