Design And Implementation Of A Kind Of Distributed Focused Crawler System

Posted on:2017-06-19

Degree:Master

Type:Thesis

Country:China

Candidate:L B Huang

Full Text:PDF

GTID:2348330569485045

Subject:Software engineering

Abstract/Summary:

In an era of rapid development of Internet technology, a large amount of information is created on the Internet, and the demand for retrieving key information keeps growing. Whether key information can be retrieved quickly from the Internet determines whether an Internet company can build a stable foundation in this wave of growth. Based on this search demand, and with system stability and high yield in mind, this thesis proposes an implementation scheme for a distributed focused crawler. The scheme targets a company's collection of specific information on the Internet and establishes an efficient, feasible crawler system that crawls a large volume of information with limited computing resources.

The design of the distributed focused crawler system starts from enterprise requirements. Through requirements analysis and detailed performance analysis, the modules of the system are discussed in light of existing technology, and innovative designs are proposed for specific modules. Python is the main development language, and distribution is implemented with a Master-Slave architecture. Because the main users are travel service companies, the crawler focuses on data from official websites for hotels, air tickets, train tickets, and bus tickets. Captured page data is parsed with XPath and regular expressions. In the parsing process, logic code is combined with configuration files, which gives the system higher cohesion and lower coupling. By collecting statistics on the status codes returned to the crawler over a period of time, a crawling state diagram of the system is drawn and used to locate specific problems in the crawler system. Solving these problems effectively improves the stability of the system and integrates crawler design with crawler optimization.

In the Master-Slave implementation, a thread pool is used to control the number of threads on each Slave machine, which greatly improves the crawling efficiency of the system. The Slave servers use an automatic reset function to avoid memory leaks, which improves stability at the system level. The distributed crawler system designed in this thesis has been applied in an enterprise. Once stability requirements are met, the system crawls data from the travel websites with the largest output ratio. The resulting structured data is then analyzed with the aim of increasing corporate profit.
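The combination of logic code and configuration files used for parsing can be illustrated with a minimal sketch. It is not the thesis's implementation: the field names, the XPath rules, the `SITE_CONFIG` dictionary, and the sample page are all hypothetical, and the standard library's `xml.etree.ElementTree` (which supports only a subset of XPath and needs well-formed markup) stands in for whatever HTML parser the author used.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical per-site rules: field name -> XPath plus an optional regex.
# In practice these rules would be loaded from a configuration file, so that
# adding a new site means editing configuration rather than parsing logic.
SITE_CONFIG = {
    "hotel_name": {"xpath": ".//div[@class='hotel']/h2"},
    "price": {
        "xpath": ".//div[@class='hotel']/span[@class='price']",
        "regex": r"(\d+(?:\.\d+)?)",
    },
}

# Hypothetical, well-formed sample page used only for illustration.
SAMPLE_PAGE = """
<html><body>
  <div class="hotel">
    <h2>Example Inn</h2>
    <span class="price">CNY 328.50 per night</span>
  </div>
</body></html>
"""


def extract(page_source, config):
    """Apply each configured XPath, then the optional regex, to one page."""
    root = ET.fromstring(page_source)
    record = {}
    for field, rule in config.items():
        node = root.find(rule["xpath"])
        text = (node.text or "").strip() if node is not None else ""
        pattern = rule.get("regex")
        if pattern:
            # Keep only the first captured group, or an empty string on no match.
            match = re.search(pattern, text)
            text = match.group(1) if match else ""
        record[field] = text
    return record


if __name__ == "__main__":
    print(extract(SAMPLE_PAGE, SITE_CONFIG))
```

The generic `extract` function never mentions a specific site or field, so the site-specific knowledge stays in the configuration. This separation is one way to read the abstract's claim about cohesion and coupling.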

Keywords/Search Tags:

Focused crawler, Master-Slave system, Distributed system, System stability

Related items

1	Research On Topic Focused Web Crawler And Related Technologies
2	Resolvent For Parking Problem In Residential Area Based On Solid Parking Equipment And Master-Slave Distributed Control System
3	Development And Maneuverability Of Modular Master-slave Robot Teleoperation System
4	Research On Master-slave Tele-robotics System Based On Virtual Reality
5	Web Crawler System Based On Chrome Extension
6	Research On Distributed And Focused Web Crawler Technology And Algorithms
7	Research And Implement Of Distributed Focused Crawler
8	Research On A Master-slave Interactive Control System Of A Hydraulic Manipulator
9	Research And Implementation Of Focused Crawler Based On Distributed Strategy
10	Design And Realization Of A System For Gathering Web Ontologies Based On Focused Crawler Technique