As the Internet grows in popularity in our country, it has become almost a basic necessity of life, alongside food and clothing, for young college students who are open to new things. At the same time, campus forums have become the main platform where students express and exchange their views. To understand the hot topics on campus, it is therefore valuable to build a campus network information management system. The forum spider designed and implemented in this thesis is a subsystem of that information management system; its main task is to collect forum data for later analysis.

When crawling forums, traditional general-purpose crawlers encounter a large number of duplicate links, which wastes resources and lowers efficiency. On the other hand, most existing forum crawlers are tailored to specific users and therefore work on only a single forum. This thesis analyzes the structural differences among many forums, studies the features and system architectures of several mainstream crawlers, and proposes an implementation of an incremental web crawler system that can be applied to many forums. After analyzing the system requirements, the thesis designs each sub-module and then elaborates on the implementation details of each module. The main work of the thesis covers the following aspects. First, it analyzes the features of many campus forums, extracts their commonalities and differences, and determines a crawling mode for each style of forum. Second, according to the popularity and features of forum sections, it determines an incremental crawling strategy based on the weight of forum sections. Finally, to improve the versatility and flexibility of the crawler, it uses XQuery templates to parse web pages and extract their content.

After the crawler was deployed and run, the test results were analyzed. They show that the spider system runs stably, so the system meets the design requirements and is useful.
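
The idea of a weight-based incremental crawling strategy can be illustrated with a minimal sketch. It is not the algorithm used in the thesis, whose details are not given in this abstract. It assumes that each forum section has a positive weight (for example, derived from recent posting activity), that the revisit interval is inversely proportional to that weight, and that a priority queue decides which section to crawl next. All names, weights, and interval values below are hypothetical.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class ScheduledSection:
    next_visit: float                      # time (seconds) of the next crawl
    name: str = field(compare=False)
    weight: float = field(compare=False)   # hypothetical "heat" score, > 0


def revisit_interval(weight, base_interval=3600.0, min_interval=300.0):
    """Assumed rule: hotter sections (larger weight) are revisited sooner."""
    return max(min_interval, base_interval / weight)


def build_schedule(weights, now=0.0):
    """Create a min-heap of sections ordered by next visit time."""
    heap = [ScheduledSection(now + revisit_interval(w), name, w)
            for name, w in weights.items()]
    heapq.heapify(heap)
    return heap


def next_sections(heap, count):
    """Pop the next `count` crawls, rescheduling each section after its visit."""
    order = []
    for _ in range(count):
        section = heapq.heappop(heap)
        order.append(section.name)
        section.next_visit += revisit_interval(section.weight)
        heapq.heappush(heap, section)
    return order


# Hypothetical section weights, e.g. derived from recent post counts.
weights = {"campus-news": 6.0, "second-hand": 2.5, "archive": 0.5}
schedule = build_schedule(weights)
crawl_order = next_sections(schedule, 6)
```

With these assumed weights, the hottest section is visited several times before the least active one is visited at all, which is the behavior an incremental strategy aims for: crawl effort follows where new content is most likely to appear.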