Design and Implementation of a Distributed Focused Crawler System

Posted on: 2017-06-19
Degree: Master
Type: Thesis
Country: China
Candidate: L B Huang
Full Text: PDF
GTID: 2348330569485045
Subject: Software engineering
Abstract/Summary:
In an era of rapid development of Internet technology, a large amount of information is created on the Internet, and the demand for retrieving key information keeps growing. Whether key information can be retrieved quickly from the Internet determines whether an Internet company can build a stable foundation in this wave of Internet growth. Based on this search demand, and taking system stability and high output into account, this thesis proposes an implementation scheme for a distributed focused crawler. The scheme targets a company's collection of specific information on the Internet and establishes an efficient, feasible crawler system that completes a large volume of crawling work with limited computer resources. The design of the distributed focused crawler system starts from the needs of the enterprise. Through requirements analysis and detailed performance analysis, the modules of the system are discussed in light of existing technology, and innovative designs are put forward for specific modules. Python is the main development language, and the distributed design uses a Master-Slave architecture. Because the main user group of this work is travel service companies, the crawler focuses on data from official websites for hotels, air tickets, train tickets, and bus tickets. Captured page data is parsed with XPath and regular expressions. In the parsing process, logic code is combined with configuration files, which raises the cohesion of the system and lowers its coupling. The status codes returned to the crawler over a period of time are collected and plotted as a crawling state diagram, which is used to locate specific problems in the crawler system; solving these problems effectively improves its stability. In this way the design and the optimization of the crawler are integrated. In the Master-Slave implementation, a thread pool controls the number of threads on each Slave machine, which greatly improves the crawling efficiency of the system. The Slave servers use an automatic reset function to avoid memory leaks, which improves stability at the system level. The distributed crawler system designed in this thesis has been applied in an enterprise. Once the stability requirement is met, the system crawls data from the travel websites with the highest output ratio, and the resulting structured data is analyzed to achieve larger profits for the company.
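The thread pool and status-code statistics described in the abstract can be illustrated with a minimal sketch. This is not the thesis's code: the names `crawl_batch` and `fake_fetch`, the `max_workers` parameter, and the `"error"` key are assumptions made for illustration, and the stand-in fetcher replaces a real HTTP request so the example runs without network access.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def crawl_batch(
    urls: Iterable[str],
    fetch: Callable[[str], int],
    max_workers: int = 4,
) -> Counter:
    """Fetch every URL with a bounded thread pool and tally status codes.

    `fetch` is a hypothetical callable returning the HTTP status code for
    one URL. A fetch that raises is counted under the key "error".
    """
    def safe_fetch(url: str):
        try:
            return fetch(url)
        except Exception:
            return "error"

    # max_workers caps how many threads run concurrently on this Slave.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return Counter(pool.map(safe_fetch, urls))


def fake_fetch(url: str) -> int:
    """Stand-in for a real HTTP request, used only for illustration."""
    if "missing" in url:
        return 404
    if "broken" in url:
        raise ConnectionError(url)
    return 200


if __name__ == "__main__":
    sample_urls = [
        "http://a.example/ok",
        "http://a.example/missing",
        "http://a.example/broken",
    ]
    # The resulting tally is the kind of data a crawling state diagram
    # could be drawn from.
    print(crawl_batch(sample_urls, fake_fetch, max_workers=2))
```

Tallies like this, collected per time window, would show a rising share of error or non-200 responses and so point to the site or module where crawling is failing.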
Keywords/Search Tags:Focused crawler, Master-Slave system, Distributed system, System stability