URL Rule Based Focused Crawl And Its Application

Posted on: 2008-10-08
Degree: Master
Type: Thesis
Country: China
Candidate: Q Y Ye
Full Text: PDF
GTID: 2178360212484962
Subject: Computer application technology
Abstract/Summary:
As the volume of information keeps expanding, people have become increasingly dependent on search engines. General search engines such as Baidu and Google offer great convenience and have become very popular. However, as people search in more specialized fields and expect higher-quality results, general search engines can no longer meet their needs in those fields. This has led to vertical search engines. Although vertical and general search engines have much in common, vertical search engines have their own specific characteristics and raise new issues. Focused crawling is one of the key issues that must be addressed.

In this thesis, we first propose a URL Rule Based Focused Crawl (UBFC), which rests on the observation that pages generated from the same template often belong to the same topic and have very similar URLs. We then implement UBFC on top of the open-source project Nutch, and design and implement a URL regular expression learning algorithm that supports UBFC. Finally, we describe the application of UBFC and report extensive tests and analysis, in particular a comparison of UBFC with both the Breadth-First Search Crawl (BFSC) and the Baseline Focused Crawl (BLFC). The tests show that UBFC achieves a remarkable improvement over BFSC and BLFC in harvest rate, and that its recall is far higher than that of BLFC.
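The abstract does not spell out the URL regular expression learning algorithm, so the following is only a minimal sketch of the general idea, not the thesis's method. It assumes that a rule can be derived from a single example URL by replacing runs of digits in the path with `\d+`, and that a crawler keeps only the links that match a learned rule. The function names (`generalize_segment`, `learn_url_rule`, `filter_links`) and the `example.com` URLs are hypothetical, and Python is used purely for illustration (the thesis implementation is built on Nutch).

```python
import re
from urllib.parse import urlparse


def generalize_segment(segment: str) -> str:
    """Turn one path segment into a regex fragment.

    Assumption for this sketch: digit runs are the template-variable part
    of a URL, so they become \\d+; everything else is matched literally.
    """
    parts = re.split(r"(\d+)", segment)
    # With one capturing group, odd indices hold the captured digit runs.
    return "".join(
        r"\d+" if i % 2 == 1 else re.escape(part)
        for i, part in enumerate(parts)
    )


def learn_url_rule(example_url: str) -> "re.Pattern[str]":
    """Build a URL rule from one on-topic example URL (query string ignored)."""
    parsed = urlparse(example_url)
    path = "/".join(generalize_segment(s) for s in parsed.path.split("/"))
    return re.compile(r"^https?://" + re.escape(parsed.netloc) + path + r"/?$")


def filter_links(links, rules):
    """Keep only the links that match at least one learned URL rule."""
    return [url for url in links if any(rule.match(url) for rule in rules)]


if __name__ == "__main__":
    # Hypothetical example: one known on-topic page and three extracted links.
    rule = learn_url_rule("http://example.com/product/123.html")
    links = [
        "http://example.com/product/456.html",
        "http://example.com/about.html",
        "http://example.com/product/9.html",
    ]
    for url in filter_links(links, [rule]):
        print(url)
```

In this sketch the two `/product/<number>.html` links are kept and the `about.html` link is dropped, which illustrates how a URL rule lets a crawler decide whether a link is worth following before the page is fetched.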
Keywords/Search Tags: vertical search engine, focused crawl, URL regular expression learning, Nutch