
Research On Focused Crawling Technique For Vertical Search Engine

Posted on: 2009-04-04    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Z M Chen    Full Text: PDF
GTID: 1118360272471770    Subject: Computer application technology
Abstract/Summary:
Focused Crawling (also called Topical Crawling) is a key technique for collecting Web pages in a specific domain (topic) from the World Wide Web. Due to limited bandwidth, storage, and computational resources, and the rapid growth of the Web, searching the Web for complete, accurate, and high-quality information has become increasingly difficult. General-purpose search engines present some serious limitations. (1) They may return hundreds or more links for a user's query, yet the pages pointed to by these links may not be closely relevant to the query. (2) They cannot satisfy query requests that differ in background, purpose, and time period. (3) It is impossible for them to index and analyze all pages and maintain comprehensive, up-to-date search indexes. Vertical search engines, regarded as a potential solution to these limitations, have become an active research area in both academic and industrial circles. A vertical search engine exhibits new characteristics such as intelligence, personalization, and domain specificity, and it collects domain-specific Web pages through focused crawling. Obviously, focused crawling is the foundation and core of a vertical search engine, and the performance of a vertical search engine depends heavily on that of its focused crawler. Focused crawling has also been applied in other fields such as dynamic Web retrieval, personalized Web retrieval, and digital libraries. Accordingly, research on focused crawling is of academic significance and has broad application prospects.

An important assumption implicit in focused crawling is that pages on related topics tend to be neighbors of each other, i.e. topic locality on the Web. A focused crawler identifies the most promising URL in the crawling frontier to visit at every step. Thus, the objective of the crawler is to stay focused, that is, to remain within the neighborhood in which topic-specific pages have been identified.
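The best-first selection step described above, choosing the most promising URL from the crawling frontier at each step, can be sketched as a simple priority queue. This is an illustrative sketch, not the implementation used in this thesis; the class name and sample URLs are hypothetical:

```python
import heapq

class CrawlFrontier:
    """A crawling frontier that always yields the most promising URL first."""

    def __init__(self):
        self._heap = []    # entries are (-priority, url) so heapq pops the max
        self._seen = set() # avoid re-queuing already-discovered URLs

    def push(self, url, priority):
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (-priority, url))

    def pop(self):
        neg_priority, url = heapq.heappop(self._heap)
        return url, -neg_priority

frontier = CrawlFrontier()
frontier.push("http://example.com/sports/news", 0.9)
frontier.push("http://example.com/ads/banner", 0.1)
url, score = frontier.pop()  # the higher-priority URL is visited first
```

Because irrelevant links still enter the queue with low scores, the crawler stays within the topical neighborhood without discarding candidates outright.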
This leads to significant savings in hardware and network resources and helps keep the crawl more up-to-date. However, because the Web is a highly open, heterogeneous, distributed information space, frequently refreshed pages are randomly distributed across sites around the world. Compared with the huge and chaotic Web information space, the number of pages on a specific topic is limited. In addition, an ideal focused crawler retrieves the maximal set of relevant pages while traversing the minimal number of irrelevant pages. Focused crawling is therefore a challenging research topic. Its key techniques include topic description, Web page segmentation, prioritizing the URLs to be visited, and the focused crawling algorithm itself. The goal of focused crawling research is to produce general techniques and algorithms for vertical search engines and to lower the cost of constructing one. This study, mainly supported by the Key Science-Technology Project of Shandong Province, focuses on the key techniques of focused crawling mentioned above. The main contributions of this thesis are as follows:

(1) A Contextual Topic Description method based on Taxonomy (CTD-T) is proposed. CTD-T describes topics based on ODP (the Open Directory Project). The Contextual Topic Key Words (CTKW) and Topic Description (TD) are first defined for any given topic node in ODP. However, many noise nodes are introduced if CTKW is extracted directly from ODP; accordingly, a method based on Inverse Path Frequency is given to solve this problem. In addition, a topic and its contextual topics in CTKW are weighted effectively in terms of their relative hierarchies in ODP. The relevance between a page (or an anchor text) and CTKW is used to guide online focused crawling, while the relevance between a page and TD is used to evaluate the performance of different focused crawling algorithms offline.

(2) An Online Page Segmentation method for Focused Crawling (OPS4FC) is presented. It is pointed out that a page contains two kinds of blocks, i.e. text blocks and link blocks, which significantly affect the performance of a focused crawler. Furthermore, link blocks are classified into three groups: relevant link blocks, navigation link blocks, and noisy-advertising link blocks. The goal of OPS4FC is to recognize and extract the topical text and relevant link blocks of a page. First, a given page is parsed into a DOM (Document Object Model) tree. Then, the topical text of the page is concatenated by computing the semantic relevance among the different text blocks. Finally, all relevant link blocks of the page are recognized by computing the semantic relevance between the text of every link block and the topical text. OPS4FC effectively filters out the noise text and links of a page, and can therefore significantly improve the performance of a focused crawler.

(3) An approach to Prioritizing URLs in Multi-Granularities (PUMG) is introduced. The basic idea of a focused crawler is to compute the visit priorities of candidate URLs in the crawling frontier and to identify the most promising URL to crawl at each step; how to prioritize every candidate URL is therefore the key to focused crawling. Based on CTD-T and OPS4FC, PUMG uses six features (site, page content, relevant link block, anchor text, URL address, and link type) to prioritize a URL at four different granularities: site, page, block, and link. Several sub-contributions of PUMG are as follows:

a) An approach to prioritizing candidate URLs at site granularity is presented. If the number (or relevance degree) of relevant pages on one Web site S1 is larger (or higher) than that of another site S2, then the number (or sum of relevance) of relevant pages crawled from S1 may increase more rapidly than that of S2 during the dynamic crawling process.
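The semantic-relevance computation between each link block's text and the topical text (the final step of OPS4FC) can be sketched with a simple bag-of-words cosine similarity. The thesis does not fix a particular similarity measure here, so the functions, threshold, and sample texts below are illustrative assumptions:

```python
import math
from collections import Counter

def cosine_sim(text_a, text_b):
    """Bag-of-words cosine similarity between two texts (illustrative)."""
    va, vb = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(va[t] * vb[t] for t in set(va) & set(vb))
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def relevant_link_blocks(link_blocks, topical_text, threshold=0.2):
    """Keep only the link blocks whose text is close to the topical text;
    navigation and advertising blocks tend to fall below the threshold."""
    return [b for b in link_blocks if cosine_sim(b, topical_text) >= threshold]

blocks = ["basketball playoff scores and games",   # relevant link block
          "login register advertise here"]         # noisy-advertising block
kept = relevant_link_blocks(blocks, "basketball playoff results and games")
```

In practice the block texts would come from the DOM subtrees identified during segmentation; only the filtering idea is shown here.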
Therefore, the increase (i.e. the time derivative) of the number (or sum of relevance) of relevant pages crawled can be used to prioritize these candidate URLs well.

b) The topical text and the texts of all relevant link blocks, instead of the complete content of a page, are used to prioritize the page's candidate URLs. The priorities of these URLs are obviously more accurate because most of the noisy information in the page has been filtered out by page segmentation.

c) Prioritizing URLs at block granularity is based on the assumption that links in a page tend to cluster in blocks, and that pages pointed to by links in the same link block usually share the same topic. This method can find URLs in a relevant link block whose anchor texts are not explicitly relevant to the topic, and it filters out a large number of noisy links effectively.

d) The URLs of most pages carry semantic meanings associated with the page content, so the tokens in a known URL may be used to prioritize it. First, most tokens of Chinese URLs are classified into four groups: full English words, abbreviations of English words, full pinyin, and the first letters of pinyin. Then, a Topic-Token Mapping Table (TTMT) is derived by statistically analyzing the tokens in 3,570,000 URLs and manually associating them with corresponding topics. For a given topic, the corresponding tokens are obtained from the TTMT. A given URL is parsed into tokens on the separators "/" and ".", and its priority is then computed from the matching strength between the topic's tokens and the URL's tokens.

e) A method based on a URL's link type is discussed to prioritize the URL, where a child page inherits the relevance of its parent page to a topic. First, links are classified into five types in terms of the relative locations of the link and its parent in the Web graph. Then, five heuristic rules are presented to infer the topical relation between the page pointed to by a URL and its parent page, producing an inheriting factor for each link type. Finally, the URL's priority is determined by the product of its parent page's relevance and the inheriting factor.

(4) An Adaptive Focused Crawling algorithm based on PUMG (AFC-PUMG) is proposed. Starting from some predefined seed URLs, AFC-PUMG segments each visited page with OPS4FC. Candidate URLs extracted from the page are then prioritized by PUMG and added to the crawling frontier in the order of their priority scores, and the URL with the highest priority score in the frontier is selected for the next visit. A Path Exploring Depth (PED) function is introduced to compute each candidate URL's PED from the relevance of the URL's parent page to the given topic. Thus, AFC-PUMG can control the exploring range and depth more flexibly and collect more relevant pages. Additionally, an ordering of the prioritizing strategies at the different granularities is defined and used to improve the efficiency of AFC-PUMG.

(5) A focused crawling prototype is developed, and many experiments were carried out from various aspects to validate the methods proposed in this thesis. The system architecture and a specific design scheme of the prototype are given. Four focused crawling algorithms, i.e. Breadth-First, Best-First, Shark-Search, and AFC-PUMG, were implemented in the prototype. Experiments were conducted on the Web for different English and Chinese topics and Web sites, and they demonstrate the effectiveness of CTD-T, OPS4FC, PUMG, and AFC-PUMG. In overall performance, AFC-PUMG significantly outperforms the other three algorithms in terms of precision rate and sum of information, without increasing time complexity.
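The adaptive crawl loop of AFC-PUMG, a best-first frontier whose expansion is bounded by a path-exploring-depth budget, can be sketched as follows. The `fetch`, `page_relevance`, and `extract_candidates` callables are stand-ins for the fetching, relevance-computation, and page-segmentation steps described above, and the linear PED formula is an illustrative assumption rather than the function defined in the thesis:

```python
import heapq

def ped(parent_relevance, max_depth=3):
    """Illustrative path-exploring-depth budget: a more relevant parent
    earns a deeper exploration budget; 0 stops expansion after one hop."""
    return round(parent_relevance * max_depth)

def focused_crawl(seeds, fetch, page_relevance, extract_candidates, budget=100):
    """Adaptive best-first crawl sketch: pop the highest-priority URL,
    score the fetched page, and expand its links only while PED allows."""
    frontier = [(-1.0, url, ped(1.0)) for url in seeds]  # (neg priority, url, depth)
    heapq.heapify(frontier)
    seen = set(seeds)
    relevant_pages = []
    while frontier and budget > 0:
        _, url, depth = heapq.heappop(frontier)
        budget -= 1
        page = fetch(url)
        rel = page_relevance(page)
        if rel > 0.5:
            relevant_pages.append(url)
        if depth <= 0:
            continue  # this path's exploring depth is exhausted
        for child_url, priority in extract_candidates(page):
            if child_url not in seen:
                seen.add(child_url)
                # each child's own budget depends on its parent's relevance
                heapq.heappush(frontier, (-priority, child_url, ped(rel)))
    return relevant_pages
```

Links found beneath an irrelevant page receive a near-zero budget, so the crawler probes at most one hop into an off-topic region before abandoning that path, which is the flexible range-and-depth control the algorithm aims for.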
Keywords/Search Tags: Vertical Search, Focused Crawling, Topic Description, Page Segmentation, Relevance Computation, Relevance Prediction, Prioritizing URLs