The Study on Incremental Crawling of Web Forums

Posted on: 2011-03-29
Degree: Master
Type: Thesis
Country: China
Candidate: Y Q Du
GTID: 2178360305951061
Subject: Computer system architecture

Abstract/Summary:
Web forums are Web applications for holding discussions and posting user-created content, and millions of Internet users discuss various topics on them every day. Forum data usually contains a great deal of valuable knowledge and information and has become an important resource on the Web. Some commercial search engines have begun to leverage information from forums to improve the quality of search results, and some recent research efforts have tried to mine forum data for useful information. Whatever the application, however, the fundamental step is to fetch pages from forum sites distributed across the Internet and to maintain a local database of those pages.

Incremental crawling techniques rest on two foundations: a model of how web pages evolve, and an optimal scheduling strategy based on that evolution. Forum sites differ from general websites in several ways: they have a complex structure and many duplicate links; a long discussion thread is usually split across multiple pages; and forum content usually changes more frequently, and usually incrementally. The revisiting strategies of traditional incremental techniques are usually based on individual pages, so they are not well suited to crawling forum sites incrementally.

This thesis studies incremental crawling for web forums. Its main contributions are as follows:

1. In a forum, a topic is usually spread over more than one page. This thesis therefore abandons the traditional approach, which uses a single page as the basic unit of incremental crawling, and instead defines a set of pages carrying the same information as the basic unit. There are two main types of page set: the pages that belong to the same board and the pages that belong to the same thread.

2. Based on a statistical analysis of board evolution in many web forums, a novel board-based incremental crawling strategy is proposed. The strategy consists mainly of two algorithms:

a) A board-based incremental crawling algorithm for forums. In most web forums, the thread list page sorts threads by the time of their last reply, so newly released or newly replied threads appear at the front of the list. The algorithm uses the MDR automatic extraction algorithm to extract each thread's link and last reply time, and then determines whether the thread is newly released or newly replied to.

b) A scheduling algorithm for incremental crawling of forums. The statistical analysis of board evolution shows that change frequency varies widely between boards and is related to the local time of day. The algorithm uses board weights and this local-time pattern to allocate crawl resources and determine crawl times.

Experimental results show that, with this strategy, the bandwidth utilization of the system is 1 and the coverage of newly published and updated discussion threads is close to 100 percent, and that the overall system delay is reduced by up to 42% compared with an even scheduling method.
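As a rough illustration of the two algorithms described above, the following Python sketch shows one way they might look. It is not the thesis's implementation. It assumes the thread links and last reply times have already been extracted from the thread list page (the step the thesis performs with MDR), that the list is strictly sorted by last reply time with no pinned threads, and that crawl resources are split in proportion to board weight times an hourly activity factor. The names `ThreadEntry`, `find_changed_threads`, `allocate_fetches`, and `hourly_activity` are hypothetical, and the proportional rule is an assumption rather than the thesis's actual formula.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ThreadEntry:
    """One row of a thread list page (hypothetical structure)."""
    url: str
    last_reply: datetime


def find_changed_threads(thread_list, seen):
    """Return (new, updated) thread URLs from one board's thread list.

    thread_list: entries sorted by last reply time, newest first.
    seen: dict mapping thread URL to the last reply time recorded
          on the previous crawl; it is updated in place.
    """
    new, updated = [], []
    for entry in thread_list:
        previous = seen.get(entry.url)
        if previous is None:
            new.append(entry.url)
        elif entry.last_reply > previous:
            updated.append(entry.url)
        else:
            # Assumption: the list is sorted by last reply, so every
            # thread below an unchanged one is unchanged as well.
            break
        seen[entry.url] = entry.last_reply
    return new, updated


def allocate_fetches(board_weights, hourly_activity, hour, budget):
    """Split a fetch budget across boards for a given local hour.

    Assumed rule (not from the thesis): each board's share is
    proportional to its weight times its activity factor at `hour`.
    hourly_activity maps each board to a list of 24 factors.
    """
    scores = {
        board: weight * hourly_activity[board][hour]
        for board, weight in board_weights.items()
    }
    total = sum(scores.values())
    if total == 0:
        return {board: 0 for board in scores}
    # Rounding means the shares may not sum exactly to the budget.
    return {board: round(budget * s / total) for board, s in scores.items()}
```

The early exit in `find_changed_threads` is what makes the crawl board-based: a single scan of the thread list page reveals which threads need fetching, without revisiting each thread page individually.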
Keywords/Search Tags: incremental crawling, forum crawler, delay, board evolution