| With the widespread use of the Internet, the volume of online information is growing exponentially, and so is the demand for access to it. How to use existing network resources to provide users with relevant information has become an urgent problem. The focused crawler is one effective way to address this problem, and the development of cloud computing makes it possible to improve crawler efficiency. Hadoop, developed by Apache, is an open, user-friendly cloud computing platform; the main objective of this research is to design and implement a focused crawler system on this framework. The main work is as follows: (1) We review the relevant Hadoop background, including the MapReduce computation model and the HDFS distributed file system, and then discuss the framework, workflow, and characteristics of focused crawlers. To obtain more specialized and accurate topical information, we study the key technologies of focused crawling, such as topic relevance judgment, page text extraction, and hyperlink extraction; building on existing academic work, we make some improvements to the topic relevance judgment technique, so that the system locates and retrieves information more precisely and the extracted data better matches actual needs. (2) We design a Hadoop-based focused crawler system and describe its workflow and overall framework in detail. 
To facilitate subsequent information processing and indexing, we design a content extraction module that filters the crawled pages in batches and extracts the required body text, so that the information is structured. (3) We describe the overall architecture of the system and the implementation of each module, including the data storage structures, the division into functional modules, and each module's Map/Reduce implementation. (4) From an analysis of the experimental results, we conclude that all modules of the focused crawler run well and that the system collects topical information with high accuracy; compared with a stand-alone system, it collects data more efficiently, and its flexibility and extensibility are greatly improved. |
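
The processing steps named above (page text extraction, hyperlink extraction, topic relevance judgment, and a Map/Reduce split of the work) can be illustrated with a minimal sketch. It is not the system's actual implementation: the abstract does not specify the relevance measure, so cosine similarity against a weighted topic term list is assumed here, and the names `LinkAndTextParser`, `relevance`, `map_page`, and `reduce_links` are hypothetical. The map and reduce steps are plain Python functions that only mimic the MapReduce data flow; no Hadoop API is used.

```python
import math
import re
from collections import Counter, defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTextParser(HTMLParser):
    """Collects visible text and absolute link targets from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.text_parts.append(data)


def relevance(text, topic_terms):
    """Cosine similarity between page term counts and topic term weights.

    Assumed measure: the abstract does not say which one the system uses.
    """
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    dot = sum(counts[term] * weight for term, weight in topic_terms.items())
    norm_page = math.sqrt(sum(c * c for c in counts.values()))
    norm_topic = math.sqrt(sum(w * w for w in topic_terms.values()))
    if norm_page == 0 or norm_topic == 0:
        return 0.0
    return dot / (norm_page * norm_topic)


def map_page(url, html, topic_terms, threshold):
    """Map step: score one page and emit (link, score) for on-topic pages."""
    parser = LinkAndTextParser(url)
    parser.feed(html)
    score = relevance(" ".join(parser.text_parts), topic_terms)
    if score >= threshold:
        for link in parser.links:
            yield link, score


def reduce_links(pairs):
    """Reduce step: keep the best parent score per link, highest first."""
    best = defaultdict(float)
    for link, score in pairs:
        best[link] = max(best[link], score)
    return sorted(best.items(), key=lambda item: item[1], reverse=True)
```

In this sketch, links found on pages that fall below the threshold are dropped, and the reduce step yields a prioritized frontier for the next crawl round. In an actual Hadoop job the two functions would correspond to a Mapper and a Reducer operating on records stored in HDFS.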