Research On Large-scale Crawling On Web Forums

Posted on:2007-03-01

Degree:Master

Type:Thesis

Country:China

Candidate:K Li

Full Text:PDF

GTID:2178360185454111

Subject:Computer software and theory

Abstract/Summary:

Web forums have become one of the dominant channels for publishing and exchanging information on the Internet. Crawling is the groundwork for searching and mining information from web forums. However, traditional crawling components, which usually follow a breadth-first strategy, cannot fetch information from web forums effectively and precisely. By exploring the internal structural features of forums, this paper presents a crawling strategy based on a "Topics Index Page--Topic Correlation Judgments" algorithm. Compared with the breadth-first strategy, our solution performs remarkably better in both precision and recall.

Moreover, to crawl web forums as quickly as possible, we chose to design a large-scale crawler capable of collecting hundreds of millions of web pages. After discussing the important performance and reliability considerations for a large-scale crawler, we propose a site-based parallel architecture using non-blocking sockets, which solves several key issues in large-scale crawling.

Finally, by combining the two components, a large-scale web forum crawler is implemented. In practice, the system is running over 12000 different web forums and has achieved good results. In summary, our work mainly includes:

1) A study of the principles of dynamic URLs, which are the basis of web forums.
2) A new approach to clustering dynamic URLs.
3) An investigation of the internal logical architecture of web forums.
4) A new strategy aimed at web forum crawling.
5) The design of a site-based parallel crawling architecture using non-blocking sockets.
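The abstract does not describe how dynamic URLs are clustered, so the following is only a minimal sketch of one plausible approach, not the method of the thesis. It assumes that two dynamic URLs belong to the same cluster when they share a host, a script path, and the same set of query parameter names; the function names `url_pattern` and `cluster_urls` and the sample URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl


def url_pattern(url):
    """Return a cluster key: (host, path, sorted query parameter names).

    Assumption for illustration: parameter values (thread id, page number)
    vary within a cluster, while parameter names identify the page type.
    """
    parts = urlsplit(url)
    names = sorted({name for name, _ in parse_qsl(parts.query, keep_blank_values=True)})
    return (parts.netloc.lower(), parts.path, tuple(names))


def cluster_urls(urls):
    """Group URLs that share the same pattern key."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[url_pattern(url)].append(url)
    return dict(clusters)


if __name__ == "__main__":
    # Hypothetical forum URLs: two topic pages and two index pages.
    sample = [
        "http://forum.example.com/viewthread.php?tid=101&page=1",
        "http://forum.example.com/viewthread.php?page=3&tid=202",
        "http://forum.example.com/forumdisplay.php?fid=7",
        "http://forum.example.com/forumdisplay.php?fid=9",
    ]
    for key, members in cluster_urls(sample).items():
        print(key, len(members))
```

Under this assumption, topic pages and topic index pages of a forum fall into separate clusters, which is the kind of distinction a forum-aware crawling strategy could build on. A real system would likely need further rules, for example ignoring session-id parameters.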

Keywords/Search Tags:

Web forums crawling, Large-scale crawling, dynamic web pages, non-blocking socket

Related items

1	Key Technology Research On Web Forums Crawling And Hot Topic Detection
2	Crawling the Web: Discovery and maintenance of large-scale Web data
3	Research On Algorithm Of Crawling Ajax Dynamic Web Pages Based On User Interface State Changes
4	Research On Efficient Web Information Crawling Strategy
5	The Research Of Multi-Strategies Methods In The Information Crawling
6	Classification System Based On The Theme Of Information Acquisition In The Pages
7	Research And Implementation Of Web Information Automatically Crawling In Vertical Search
8	The Study And Implementation Of Efficient And Stable Methods For Data Crawling In Vertical Search Engines
9	Research And Implementation Of Information Crawling System Based On Nutch
10	Research On Customized Web Information Crawling And Pushing Techniques