
Research On Large-scale Crawling On Web Forums

Posted on: 2007-03-01
Degree: Master
Type: Thesis
Country: China
Candidate: K Li
Full Text: PDF
GTID: 2178360185454111
Subject: Computer software and theory
Abstract/Summary:
Web forums have become one of the dominant channels for publishing and exchanging information on the Internet. Crawling is the groundwork for searching and mining information from web forums. However, a traditional crawling component, which usually uses a breadth-first strategy, cannot fetch information from web forums effectively and precisely. By exploring the internal structural features of forums, this paper presents a crawling strategy based on the "Topics Index Page--Topic Correlation Judgments" algorithm. Compared with the breadth-first strategy, our solution performs remarkably better in both precision and recall.

Moreover, to crawl web forums as quickly as possible, we chose to design a large-scale crawler capable of collecting hundreds of millions of web pages. After discussing the important performance and reliability considerations for a large-scale crawler, we propose a site-based parallel architecture using non-blocking sockets, which solves several key issues in large-scale crawling.

Finally, the two components are combined to implement a large-scale web forum crawler. In practice, the system is running over 12000 different web forums and has achieved good results.

In summary, our work mainly includes:
1) A study of the principles of dynamic URLs, which are the basis of web forums.
2) A new approach to clustering dynamic URLs.
3) An investigation of the internal logical architecture of web forums.
4) A new strategy aimed at crawling web forums.
5) The design of a site-based parallel crawling architecture using non-blocking sockets.
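
The abstract does not describe how dynamic URLs are clustered, so the following is only an illustrative sketch of one common idea, not the thesis's method: group URLs that share the same host, script path, and set of query parameter names, on the assumption that such URLs are generated by the same forum template. The function names (`url_pattern`, `cluster_urls`) and the sample URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl


def url_pattern(url):
    """Reduce a dynamic URL to (host, path, sorted query parameter names).

    Assumption for illustration: URLs with the same pattern come from the
    same page template (e.g. all thread pages, or all board index pages).
    """
    parts = urlsplit(url)
    # Keep only parameter names; values such as thread ids differ per page.
    names = sorted({name for name, _ in parse_qsl(parts.query, keep_blank_values=True)})
    return (parts.netloc.lower(), parts.path, tuple(names))


def cluster_urls(urls):
    """Group URLs by their pattern; returns {pattern: [urls...]}."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[url_pattern(url)].append(url)
    return dict(clusters)


# Hypothetical forum URLs, used only to exercise the sketch.
SAMPLE_URLS = [
    "http://forum.example.com/viewthread.php?tid=101&page=1",
    "http://forum.example.com/viewthread.php?page=2&tid=101",
    "http://forum.example.com/forumdisplay.php?fid=7",
    "http://forum.example.com/forumdisplay.php?fid=9",
    "http://forum.example.com/member.php?uid=42",
]

if __name__ == "__main__":
    for pattern, members in cluster_urls(SAMPLE_URLS).items():
        print(pattern, len(members))
```

A crawler could use such clusters to treat, for example, all `forumdisplay.php?fid=...` pages as candidate topic index pages, though how the thesis actually distinguishes page types is not stated in this abstract.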
Keywords/Search Tags:Web forums crawling, Large-scale crawling, dynamic web pages, non-blocking socket