| With the continuous development of Web 2.0 technology and the rapid spread of web terminals, more and more people take part in online exchanges, and the Internet has become a place where public opinion is produced and spreads. Compared with traditional monitoring methods, an Internet public opinion monitoring system not only reduces the investment in manpower and material resources but also improves monitoring efficiency. Because the Internet is open and virtual, Internet public opinion is freely expressed, fast-spreading, and hidden, which makes it difficult to apply information retrieval techniques to Internet public opinion monitoring directly. After an in-depth study of the technologies related to Internet public opinion monitoring, a monitoring framework for Internet public opinion is proposed. Moreover, focusing on the forum crawler, a key technology of Internet public opinion monitoring, a level model for forums was created on the basis of an extensive analysis of forum structures, and a forum crawler was then implemented to fetch pages from different forum sites. The main results can be summarized as follows: To construct the framework for Internet public opinion monitoring, in-depth research was done on related technologies such as information retrieval (IR), natural language processing (NLP), and text mining. Based on this framework, a prototype Internet public opinion monitoring system was built on the distributed computing framework Hadoop. Because public opinion is spread across many sites of different types, and to balance the breadth and the relevance of public opinion collection, several page-fetching methods were used. Finally, the framework uses an ontology-based text mining method to find hot topics. Based on an analysis of the structures of many forums, this paper describes a forum crawler based on a level model. 
First, a forum site is mapped to a tree; the crawler then crawls each level of the tree with a combined depth-first and breadth-first strategy and extracts text information based on templates. Properties such as "click number" and "last reply time" are then used to determine whether a thread has been updated. The experimental results show that this greatly increases the update speed of the forum crawler. |
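
The level model and the update check described above can be illustrated with a minimal Python sketch. It is not the implementation from this work: the level numbering (0 = site, 1 = board, 2 = thread, 3 = further thread pages), the particular way breadth-first and depth-first traversal are combined, and the names `Node`, `crawl_order`, and `is_updated` are all assumptions made for illustration. No pages are fetched; the tree is built by hand.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    """One page in the forum tree (hypothetical structure)."""
    url: str
    level: int                      # assumed: 0 site, 1 board, 2 thread, 3 thread page
    clicks: int = 0                 # "click number" shown on the board page
    last_reply: str = ""            # "last reply time" as an ISO-style string
    children: list = field(default_factory=list)


def crawl_order(root, thread_level=2):
    """Return URLs in visiting order.

    Index levels (site, boards) are queued breadth-first; each thread
    subtree is walked depth-first so a thread's pages stay together.
    This split is one possible reading of a "combined" strategy.
    """
    order = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        order.append(node.url)
        for child in node.children:
            if child.level >= thread_level:
                stack = [child]
                while stack:
                    current = stack.pop()
                    order.append(current.url)
                    # reversed() keeps the original left-to-right page order
                    stack.extend(reversed(current.children))
            else:
                queue.append(child)
    return order


def is_updated(thread, seen):
    """True if the thread is new or its click count / last reply time changed."""
    current = (thread.clicks, thread.last_reply)
    changed = seen.get(thread.url) != current
    seen[thread.url] = current      # remember the latest observed values
    return changed


if __name__ == "__main__":
    thread = Node("a/1", 2, clicks=10, last_reply="2010-01-01T10:00",
                  children=[Node("a/1?p=2", 3)])
    site = Node("site", 0, children=[
        Node("a", 1, children=[thread, Node("a/2", 2)]),
        Node("b", 1, children=[Node("b/1", 2)]),
    ])
    print(crawl_order(site))
    seen = {}
    print(is_updated(thread, seen), is_updated(thread, seen))
```

In this sketch a thread is re-fetched only when `is_updated` returns `True`, so unchanged threads cost one comparison against values already visible on the board page rather than a full page download.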