Design And Implementation Of Building Materials Information Oriented Web Crawler System

Posted on:2016-02-10

Degree:Master

Type:Thesis

Country:China

Candidate:H B Yu

Full Text:PDF

GTID:2308330467496849

Subject:Software engineering

Abstract/Summary:

With the development of the Internet, e-commerce has grown rapidly, but e-commerce for building materials has developed slowly and remains a blue ocean. Many companies have spotted this opportunity and are dividing up the market through a variety of e-commerce websites. However, these websites usually cover a limited range of product categories and regions, so they have had little influence on building-materials e-commerce. The market urgently needs a website that covers all kinds of products across all regions of the country, yet it is hard to get these companies to agree to share their resources. Against this background, this thesis designs and implements a web crawler system that downloads key information from the Internet and publishes it on our own e-commerce website. Through the information services the website provides, construction enterprises and building-materials suppliers can both benefit. This thesis introduces the basic working principles of web crawlers and the related theory, then presents the functional requirements analysis, the non-functional requirements analysis, and a technical feasibility analysis based on those requirements. It puts forward the overall design of the web crawler system and designs several submodules. The crawler can fetch information not only from static pages but also from dynamic JavaScript pages, with the help of the Rhino parsing engine.
With the help of the Tesseract image parsing engine, the crawler can also extract key information embedded in images. To speed up page fetching, the system uses a DNS cache. To avoid fetching the same information repeatedly, it uses a Bloom filter to remove duplicate URLs. This thesis also implements a management system that monitors and manages the crawler's work. The resulting web crawler system satisfies the basic needs of its users. It has been running at the GuangLianDa software company and fetches building-materials information successfully. The data obtained is stored in a Mongo database, and so far the crawler has fetched about 200 million records.
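
The abstract names a Bloom filter as the means of removing duplicate URLs but gives no implementation details. The following is only a minimal sketch of that general technique, not the thesis's actual code: the class name `BloomFilter`, the helper `filter_new_urls`, the bit-array size, the number of hash functions, and the use of SHA-256 with double hashing are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter for URL strings (illustrative sketch only)."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        # Both parameters are assumed values, not taken from the thesis.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, url):
        # Derive k bit positions from two 64-bit slices of one SHA-256 digest
        # (double hashing): position_i = (h1 + i * h2) mod num_bits.
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force h2 to be odd
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, url):
        # Set every bit position associated with this URL.
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        # True only if all associated bits are set (may be a false positive).
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(url)
        )


def filter_new_urls(urls, seen):
    """Return the URLs not yet recorded in `seen`, recording each as it goes."""
    fresh = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh
```

A crawler frontier could call `filter_new_urls` on the links extracted from each page before queueing them. A Bloom filter never reports a URL it has stored as unseen, but it can occasionally report a new URL as already seen, so a small fraction of pages may be skipped; that risk is the price of keeping memory use fixed regardless of how many URLs are recorded.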

Keywords/Search Tags:

web crawler, management system, regular expression parse template, duplicated URL removal

Related items

1	Research On Multi-dimensional Regular Expression Matching Algorithm For Network Security
2	The Design And Implementation Of Regular Expression Engines Based On Deterministic Finite Automata
3	The Research And Implementation For The Matching System Of Regular Expression In BGP Protocol
4	Design And Implementation Of The CUDA-based Regular Expression Matching System
5	Study And Realization Of Template-based Web Crawler And Editing System
6	Gpu Based High Speed Regular Expression Matching Engine
7	GPU Based High Speed Regular Expression Matching Engine
8	The Design Of Regular Expression Matching Engine Based On FPGA
9	Based On Ontology Stock Information Extraction System
10	Designs Of A Compiler For Hardware Circuits Of Regular Expression Matching Engines Based On NFA