Research On Search Strategy And Key Techniques Of Focused Crawler

Posted on: 2016-08-15
Degree: Master
Type: Thesis
Country: China
Candidate: N Xu
GTID: 2308330479984858
Subject: Computer software and theory
Abstract/Summary:
Focused crawling is a key technique for topical search engines. A focused crawler acquires on-topic pages and intelligently avoids visiting irrelevant ones. Search engines built on general-purpose web crawlers suffer from low coverage, low search accuracy, and pages that are not updated in a timely manner; focused crawlers can greatly alleviate these problems. The key problem for a focused crawler is predicting the relevance of unvisited links. Existing approaches have several shortcomings: the standard vector space model used by most classic focused crawlers does not consider the semantic information of terms; analyzing the text surrounding an anchor introduces many noisy pages; strategies that combine link analysis with content similarity simply add the two together linearly; and most focused crawlers either ignore the tunneling problem or handle it with a crude method that downloads many irrelevant web pages. To address these problems, the main contributions of this paper are as follows:

(1) The paper proposes a novel term semantic similarity vector space model (TSSVSM), which builds on the traditional vector space model and term semantic similarity. TSSVSM is used to calculate the similarity between page content and the topic.

(2) Based on an analysis of the characteristics of tunneling, the paper proposes an adaptive tunneling method. The method tunnels through pages dynamically according to the relevance of the page content and of the tunneling path, reducing visits to irrelevant web pages while still obtaining more on-topic pages.

(3) The paper analyzes the limitations of link context and therefore replaces the link context factor with a title factor when computing content similarity, so that a page's content similarity is determined jointly by its title, text content, and anchor text. This value is then fed into the OPIC algorithm to bias the cash distribution, favoring on-topic pages and suppressing off-topic pages. The resulting method is called the NOS algorithm.

(4) Topics and seed pages were selected from the Open Directory Project (ODP), and comparative experiments were conducted on five crawling algorithms implemented on Nutch: Best-First, Shark-Search, OTIE, NOS, and NOS-TSSVSM. The experimental results indicate that the proposed method improves focused crawler performance, significantly outperforming the other three algorithms on average target recall while maintaining an acceptable harvest rate.
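
To illustrate the idea behind contribution (1), the sketch below contrasts a plain vector space cosine with a cosine that also credits semantically related terms. The abstract does not give the TSSVSM formula, so this is not the thesis's model: it uses a generic "soft cosine" as a stand-in, and the names `tf_vector`, `soft_cosine`, `toy_term_sim`, and the similarity table `TOY_SIM` are hypothetical, chosen only for this example.

```python
import math


def tf_vector(tokens):
    """Build a raw term-frequency vector (term -> count)."""
    counts = {}
    for term in tokens:
        counts[term] = counts.get(term, 0) + 1
    return counts


def soft_dot(a, b, term_sim):
    """Dot product in which every term pair contributes, weighted by term_sim."""
    return sum(
        wa * wb * term_sim(ta, tb)
        for ta, wa in a.items()
        for tb, wb in b.items()
    )


def soft_cosine(a, b, term_sim):
    """Cosine similarity that credits related (not only identical) terms."""
    denom = math.sqrt(soft_dot(a, a, term_sim)) * math.sqrt(soft_dot(b, b, term_sim))
    return soft_dot(a, b, term_sim) / denom if denom else 0.0


def plain_cosine(a, b):
    """Standard VSM cosine: only identical terms count as similar."""
    return soft_cosine(a, b, lambda x, y: 1.0 if x == y else 0.0)


# Hypothetical term-similarity table (an assumption for this sketch);
# a real system might derive such scores from a lexical resource.
TOY_SIM = {frozenset(("car", "automobile")): 0.9}


def toy_term_sim(x, y):
    """Return 1.0 for identical terms, a table score for known pairs, else 0.0."""
    if x == y:
        return 1.0
    return TOY_SIM.get(frozenset((x, y)), 0.0)


topic = tf_vector(["car", "engine"])
page = tf_vector(["automobile", "engine"])

print(plain_cosine(topic, page))
print(soft_cosine(topic, page, toy_term_sim))
```

With this toy data the plain cosine is about 0.5, because only "engine" matches exactly, while the semantic variant is about 0.95, because "car" and "automobile" are treated as near-synonyms. This is the kind of gap a term-semantics-aware model is meant to close when scoring page content against a topic.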
Keywords/Search Tags: focused crawler, semantic similarity, vector space model, Shark-Search algorithm, tunneling