| Since the implementation of the reform and opening-up policy, the specialized industrial townships of Guangdong Province have developed rapidly. The People’s Government of Guangdong Province proposed building an information network service platform for micro, small and medium-sized enterprises in the province. The platform provides information on innovation resources, market information and enterprise technology demands, and guides these enterprises toward informatization and internationalization. The information service platform needs to collect data, and because the Internet has become an important channel for obtaining information, the data can be collected from the Internet using web crawler technology. The research in this thesis derives primarily from a project named “towns of profession in Guangdong province information service platform”. This information service platform consists of three parts: the front-end system, the background system and the web crawler system. The background system is used mainly for management and implements content management, user management, authentication management and process management. The front-end system is used mainly for presentation: it gets data from the background system and displays it on the corresponding page according to user requirements. The web crawler system collects data from the Internet for the information service platform; after processing, the data is imported into the background system. The work of this paper focuses on the web crawler system and is organized as follows. First, this paper introduces the Heritrix framework used by the system and the related technologies. Then, this paper describes the overall architecture of the platform system, including requirements analysis, network structure, software architecture and processing flow. 
Then this paper describes in detail the design of the web crawler system, including its functions, software structure, workflow and database design. It also describes in detail the implementation of the web crawler system, including the web crawling module, content analysis module, data filtering module and data import module, and the crawling process for some specific data types. This paper studies the Heritrix 3.1.1 source code in depth and customizes Heritrix. We add link text to Heritrix and apply several optimizations: parameter tuning, disabling the robots.txt check, and multi-threaded crawling within the same domain name. According to the data types in the innovation resource database, the market information database and the enterprise technology demand database, we extend the Heritrix processor chains and add our own parsing methods and filtering rules. We use heuristic rules to crawl some data types, together with techniques such as web page text extraction, the IK analyzer, login simulation and remote image storage over the SMB protocol. For the lighting industry, the Professional-town web crawler system has crawled more than 350,000 records and successfully imported them into the background system, completing the task of the project. |