Research On And Design Of A Focused Crawler Algorithm Based On Web Community Identification

Posted on: 2009-07-28
Degree: Master
Type: Thesis
Country: China
Candidate: J M Li
Full Text: PDF
GTID: 2178360242983042
Subject: Software engineering
Abstract/Summary:
In this paper, we propose a new focused crawling algorithm, Adaptive IHEIM, which combines work on Web communities, text classification and focused crawling. We first propose an Improved-HITS-Expansion-Iteration Model (IHEIM), obtained by extending an iterative algorithm built on an improved HITS algorithm. The IHEIM prototype algorithm is based on this model. To suit the online nature of focused crawling, Adaptive IHEIM is then derived from the IHEIM prototype algorithm. Both algorithms introduce the concept of a focusing index.

Applying the Adaptive IHEIM algorithm, this paper describes a focused crawler system consisting of a topic generation module, a base set generation module, a classifier module, a web graph computation module and a fetching-parsing module.

Experiments are conducted on four focused crawling algorithms: the Breadth First strategy, Link Context Prediction, OPIC and Adaptive IHEIM. Comparing the four algorithms by average harvest rate and average target recall shows that Adaptive IHEIM outperforms the other three. Comparing average harvest rate and average target recall under different values of the focusing index shows that the focused crawler's performance rises after each round of iteration and then gradually declines. With all other parameters held equal, the smaller the focusing index, the better the performance. A third comparison, of the total number of fetched pages under different values of the focusing index, shows that the total number of fetched pages grows exponentially with the focusing index.
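The experiments are reported in terms of harvest rate and target recall. The abstract does not give the thesis's exact formulas, so the following is only a minimal sketch assuming the definitions commonly used in the focused-crawling literature: harvest rate as the fraction of fetched pages judged relevant, and target recall as the fraction of a known set of target pages that the crawler fetched. The function names, the `is_relevant` predicate and the sample URLs are hypothetical and are not taken from the thesis.

```python
from typing import Callable, Iterable, List


def harvest_rate(crawled: List[str], is_relevant: Callable[[str], bool]) -> float:
    """Fraction of fetched pages that the relevance judge accepts.

    `is_relevant` stands in for a topic classifier (an assumption here).
    """
    if not crawled:
        return 0.0
    relevant = sum(1 for url in crawled if is_relevant(url))
    return relevant / len(crawled)


def target_recall(crawled: Iterable[str], targets: Iterable[str]) -> float:
    """Fraction of a known target set that appears among the fetched pages."""
    target_set = set(targets)
    if not target_set:
        return 0.0
    return len(set(crawled) & target_set) / len(target_set)


# Hypothetical sample data, for illustration only.
crawled_pages = ["a", "b", "c", "d"]
relevant_pages = {"a", "c"}
target_pages = {"a", "c", "x", "y"}

hr = harvest_rate(crawled_pages, lambda url: url in relevant_pages)
tr = target_recall(crawled_pages, target_pages)
```

Averaging these two values over several topics or crawl checkpoints would give figures comparable in kind to the "average harvest rate" and "average target recall" mentioned above, though the averaging scheme used in the thesis is not stated here.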
Keywords/Search Tags: Web Community, Focused Crawler, HITS algorithm