Key Technology Research On Web Forums Crawling And Hot Topic Detection

Posted on: 2012-06-23
Degree: Master
Type: Thesis
Country: China
Candidate: H X Li
Full Text: PDF
GTID: 2178330332992853
Subject: Computer application technology
Abstract/Summary:
The Internet has boomed in recent years and has become an indispensable part of people's lives. Because forums are highly interactive, immediate, and open, they have gradually attracted a large number of users and become an important part of the Internet. Forums are an important means of publishing and acquiring information in daily life, work, entertainment, and other areas. Users communicate by posting a topic for others to discuss: someone asks a question, and those who know the answer work together to solve it. Forums are therefore a platform for sharing language and culture, and they contain a wealth of information. This makes them a huge knowledge base and an important data source for search engines. In addition, the volume of comments from active Internet users in China has reached unprecedented levels. These comments continually give rise to online hot topics, and some develop into focal social events whose influence cannot be ignored and which often lead to major public opinion crises. The forum is therefore an important basis for information retrieval, data mining, and public opinion monitoring. However, because of the unique structure of forums, their data is hard to obtain, and most search engines avoid crawling them.

In this paper we study the key technologies of forum crawling, addressing problems such as complex structure, deep link levels, page flipping, and the tendency to fall into crawling traps, and we propose a universal forum crawling method. First, we use an algorithm that combines depth-first and breadth-first traversal to randomly sample a certain number of pages from the forum. Through web page structure identification, web page clustering, dynamic link clustering, and other methods, we then obtain the logical structure of the forum.
Then, we design and implement a fast and efficient distributed forum crawling framework for large-scale crawling, and we analyze and discuss its performance problems. Compared with traditional crawling methods, our method greatly increases the efficiency and coverage of forum crawling.

Building on the forum crawler, we apply it in a hot topic detection prototype system. The system can effectively detect hot forum topics within a given time period and find the posts that belong to each topic. Finally, we successfully applied it to a public opinion monitoring system at ICT, CAS, where it achieved good practical results.
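
The abstract does not say how dynamic links are clustered. As a rough illustration only, one common approach is to group sampled URLs by a template in which numeric path segments and query values are abstracted away, so that all thread pages, board pages, and so on fall into the same group. The sketch below assumes this approach; the function names `url_template` and `cluster_links` and the example URLs are hypothetical and are not taken from the thesis.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl


def url_template(url: str) -> str:
    """Reduce a URL to a template: host, path with digit runs replaced
    by "{n}", and the sorted set of query parameter names (values dropped)."""
    parts = urlsplit(url)
    path = re.sub(r"\d+", "{n}", parts.path)
    keys = sorted({k for k, _ in parse_qsl(parts.query, keep_blank_values=True)})
    return f"{parts.netloc}{path}?{'&'.join(keys)}"


def cluster_links(urls):
    """Group URLs that share the same template (an assumed notion of
    'same page type', not the thesis's actual criterion)."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[url_template(url)].append(url)
    return dict(clusters)


if __name__ == "__main__":
    sample = [
        "http://bbs.example.com/viewthread.php?tid=12&page=1",
        "http://bbs.example.com/viewthread.php?tid=98&page=3",
        "http://bbs.example.com/forumdisplay.php?fid=4",
        "http://bbs.example.com/thread-123-1.html",
    ]
    for template, members in cluster_links(sample).items():
        print(template, len(members))
```

Under this assumption, two thread URLs that differ only in `tid` and `page` values land in one cluster, which a crawler could then treat as a single page type with a page-flipping parameter. A real system would likely combine such URL templates with the page-structure clustering the abstract also mentions.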
Keywords/Search Tags: Web Forums Crawling, Automatic Structure Analysis, Page-flipping Detection, Web Page Clustering, Traversal Strategy, The Framework Design of Crawling, Hot Topic Detection