Focused Web Crawling Strategy Based On Formal Concept Analysis

Posted on: 2008-09-26
Degree: Master
Type: Thesis
Country: China
Candidate: Z B Dong
GTID: 2178360212995654
Subject: Computer application technology
Abstract/Summary:
With the Internet growing exponentially, search engines face unprecedented challenges, and coping with this drastically expanded scale has become a pressing problem. Search engines arose from the need to find useful information among immense web resources, and the web spider, an important component of a search engine, is mainly responsible for collecting web resources for storage. Scalable (general-purpose) web crawling has always played an important role in web information retrieval and is widely adopted by large websites. With the rapid growth of the World Wide Web, however, such crawling has to retrieve very large numbers of pages, which exhausts system and network resources, yet much of the collected data is never used. In practice, users care only about the useful part of the retrieved pages, and those pages usually relate to only a few topics; most retrieved pages are of no value to the user. In addition, refreshing such a large collection takes a long time, and because web resources are continuously updated, much of the stored information becomes outdated. Finally, because of its broad retrieval scope, traditional crawling returns many cluttered pages and cannot concentrate on the topics of interest: it misses some relevant information and returns pages on broad topics that contain little information about any specific topic.
Focused web crawling selectively seeks out pages that are relevant to a predefined set of topics, which are specified not only by keywords but also by example pages. By restricting the pages it visits and analyzing the topics of interest in depth, focused crawling reduces the number of pages retrieved while still obtaining a large number of high-quality pages. It both lowers hardware and network requirements and improves precision and refresh speed.
To crawl efficiently, the strategy used to visit the Web is therefore crucial to focused web crawling. After studying the characteristics of current focused crawling strategies, this thesis proposes applying formal concept analysis (FCA) to focused crawling, raising the matching technique from the level of mechanical, surface keyword matching to the level of concepts. Relevance is predicted from the semantic relations between concepts: by computing concept similarity, relevant hyperlinks can be picked out of a page. Among the crawling strategies surveyed, applying FCA to focused web crawling is a novel approach. The main research content is as follows:
1. As the core of FCA, the concept lattice is a powerful data analysis tool. By studying the relations between concepts and the characteristics of the concept lattice, we represent the user's search topics with a concept lattice structure and establish the user topic model as a basic lattice to match against.
2. We mainly study the successor relation, define core and non-core concepts, and present the computation of concept distance, introducing three methods for computing it.
3. We propose a direct, attribute-based concept matching method and define the virtual concept. The similarity of a virtual concept is obtained by finding its appropriate location in the basic lattice, and this value is used to filter unvisited URLs. This yields the focused crawling strategy based on formal concept analysis.
4. Finally, we build a focused crawling system, collect Web data, and use two evaluation measures, average harvest rate and F-measure, to compare our strategy with the traditional breadth-first strategy, demonstrating its feasibility.
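
The ideas in points 1 to 4 can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not give the three concept-distance methods or the exact virtual-concept placement, so this sketch substitutes a plain Jaccard overlap between a link's terms and each concept intent. The context `CONTEXT`, the page names, the topic terms, and all function names are hypothetical, and the brute-force concept enumeration is only suitable for tiny contexts.

```python
from itertools import combinations

# Hypothetical topic context: objects are example pages, attributes are topic terms.
CONTEXT = {
    "page_a": {"crawler", "search", "index"},
    "page_b": {"crawler", "lattice"},
    "page_c": {"lattice", "concept", "crawler"},
}

def common_attributes(objects, context):
    """Attributes shared by every object in `objects` (FCA derivation operator)."""
    sets = [context[o] for o in objects]
    if not sets:
        # By convention, the empty object set maps to the full attribute set.
        return set().union(*context.values())
    return set.intersection(*sets)

def common_objects(attributes, context):
    """Objects that carry every attribute in `attributes`."""
    return {o for o, attrs in context.items() if attributes <= attrs}

def formal_concepts(context):
    """Enumerate all (extent, intent) pairs by closing every object subset.

    Exponential in the number of objects; for illustration only.
    """
    concepts = set()
    objects = sorted(context)
    for size in range(len(objects) + 1):
        for subset in combinations(objects, size):
            intent = common_attributes(subset, context)
            extent = common_objects(intent, context)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts

def jaccard(a, b):
    """Stand-in similarity (an assumption, not one of the thesis's distance methods)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def link_score(link_terms, concepts):
    """Best overlap between a link's terms and any concept intent in the lattice."""
    return max(jaccard(set(link_terms), intent) for _, intent in concepts)

def harvest_rate(relevant_flags):
    """Fraction of crawled pages judged relevant."""
    return sum(relevant_flags) / len(relevant_flags) if relevant_flags else 0.0

def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

if __name__ == "__main__":
    lattice = formal_concepts(CONTEXT)
    for terms in ({"crawler", "lattice"}, {"recipe", "cooking"}):
        print(sorted(terms), link_score(terms, lattice))
```

A crawler built along these lines would follow only those unvisited URLs whose score exceeds a chosen threshold, then report `harvest_rate` and `f_measure` for the crawl alongside the same figures for a breadth-first baseline.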
Keywords/Search Tags: search engine, focused web crawling, formal concept analysis, concept lattice, web spider