
Research And Implementation Of Focused Crawling Based On Web Connectivity Information

Posted on: 2008-09-28    Degree: Master    Type: Thesis
Country: China    Candidate: X Jiang    Full Text: PDF
GTID: 2178360212495960    Subject: Computer software and theory
Abstract/Summary:
Focused crawling aims to search, collect, store, update, and maintain web pages on a specific topic with high efficiency. Such search services meet the requirements of specific users and can even satisfy a user's demand for information in a professional field. Research on focused crawling involves artificial intelligence, natural language understanding, web visualization, the semantic web, and related areas; an improvement in any of these technologies can enhance the performance of a focused crawler.

The tunnel problem is a major factor constraining a focused crawler's efficiency, and crawling through tunnels while collecting pages is a key technology of focused crawling. The so-called tunnel problem refers to the situation in which irrelevant pages lie between the current page and a target page. General focused crawling strategies try to collect pages along paths built from relevant pages, so it is very hard for these crawlers to pass through the irrelevant pages and reach the relevant pages on the other side of the tunnel; as a result, the reachable area of related pages is narrowed. The situation resembles a vehicle passing through a tunnel to reach its destination: the tunnel is the chain of irrelevant pages linking the current page to the target page, and the destination is the set of target pages on the other side.

Tunneling technology is used to resolve the tunnel problem, that is, to let the focused crawler cross low-relevance regions of the web into high-relevance regions. Existing approaches include reinforcement learning, topic generalization, expression generalization, and weight adjustment, each with its own characteristics and focus.
The purpose of this paper is to improve the ability to resolve the tunnel problem and thereby enhance the focused crawler's overall performance; the aim is the same as that of the solutions above, but the method differs. The main efforts and features of this method are: (1) obtain the distribution and connectivity structure of each topic using an appropriate method; (2) identify potential tunnel pages using this information; (3) improve the focused crawler's framework so that tunnel pages are handled with relative priority during crawling.

First, this paper uses the concept of web connectivity information to describe the real web's topic distribution. Web connectivity information is defined as follows: after classifying pages by topic, the probability that a page of any topic A links to a page of any topic B is called the connectivity information from topic A to topic B. This is written as the web connectivity rule A → B (P), where A and B stand for topics (they may be the same category) and P stands for the probability of going from A to B; that is, following a link on a page belonging to topic A reaches a page belonging to topic B with probability P.

Because of the huge size of the Internet, it is impossible to take all web pages into account when computing web connectivity information. Instead, topic-related categories should be chosen according to the focused crawler's target topic, and the connectivity information built from them. The basic web connectivity information reflects the probability that following one link on a page leads to a page of another topic, and it can be obtained by the following steps. First, use an established topic hierarchy to build web sample set 1, and from sample set 1 obtain another sample set 2. Then use a web classifier to classify sample set 2, which yields two classified page sets.
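The connectivity rule A → B (P) and its estimation from the classified sample sets can be sketched as a small data structure. The following is a minimal illustration only; the topic names, link counts, and the exact estimation formula are hypothetical assumptions, since the thesis does not give concrete values:

```python
from collections import defaultdict

def estimate_basic_connectivity(link_counts):
    """Estimate basic web connectivity rules A -> B (P) from link counts.

    link_counts maps (topic_A, topic_B) to the number of observed links
    from pages classified under topic A to pages classified under topic B.
    P is estimated here as the fraction of A's outgoing links reaching B
    (an assumed formula, not the thesis's exact one).
    """
    out_totals = defaultdict(int)
    for (a, _b), n in link_counts.items():
        out_totals[a] += n
    rules = {}
    for (a, b), n in link_counts.items():
        rules[(a, b)] = n / out_totals[a]  # P for the rule A -> B (P)
    return rules

# Hypothetical counts gathered from the two classified sample sets:
counts = {("Sports", "Sports"): 80, ("Sports", "News"): 20,
          ("News", "Sports"): 30, ("News", "News"): 70}
rules = estimate_basic_connectivity(counts)
print(rules[("Sports", "News")])  # 0.2
```

A rule such as ("Sports", "News") → 0.2 then reads: a link on a Sports page leads to a News page with probability 0.2.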
Counting the pages of each topic and the links between them yields the basic web connectivity information via a specific formula; this is the simplest form of web connectivity information. Because it is not yet complete, more comprehensive connectivity information must also be constructed. This paper presents three methods for building more complete web connectivity information: iteratively summing the values, repeatedly building sample sets, and iteratively taking the maximum. It is shown that the iterative-sum method outperforms the other two and most effectively reflects the topic distribution of the web.

After the web connectivity information has been constructed, a way must be found to let it fully play its role during focused crawling. For this, the concept of a standard probability is introduced: a value derived from the web connectivity information that describes the possibility of reaching a target-topic page. To increase the system's flexibility, a weighting parameter is used to adjust the standard probability. According to the standard probability, pages are divided into tunnel pages and normal pages, which are handled with different strategies while the focused crawler collects pages.

In the focused crawler's architecture, the design of the priority queue is changed: the original single priority queue becomes two queues of different types, one called the ordinary priority queue and the other the tunnel-link priority queue.
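One plausible reading of the iterative-sum construction and of the standard-probability threshold can be sketched as follows. The matrix values, the number of iteration steps, the weighting parameter, and the tunnel threshold below are all hypothetical assumptions used only to illustrate the mechanism:

```python
import numpy as np

def iterative_sum_connectivity(M, steps=3):
    """A sketch of the 'iterative sum' method: accumulate multi-step
    transition probabilities M + M^2 + ... + M^steps, clipped to [0, 1]
    so entries stay interpretable as probabilities. This is an assumed
    formulation, not the thesis's exact formula."""
    C = np.zeros_like(M)
    Mk = np.eye(M.shape[0])
    for _ in range(steps):
        Mk = Mk @ M           # k-step transition probabilities
        C = C + Mk            # accumulate the sum iteratively
    return np.clip(C, 0.0, 1.0)

# Hypothetical basic connectivity matrix over topics [target, other]:
M = np.array([[0.6, 0.4],
              [0.1, 0.9]])
C = iterative_sum_connectivity(M)

# Standard probability: chance of eventually reaching a target-topic page,
# adjusted by a weighting parameter w; pages falling below a threshold
# are treated as tunnel pages (both values are assumptions).
w = 0.5
standard_prob = w * C[:, 0]       # column 0 = reaching the target topic
is_tunnel = standard_prob < 0.3   # True -> handle as a tunnel page
```

With these numbers, pages of the "other" topic fall below the threshold and would be routed through the tunnel-handling strategy, while target-topic pages would not.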
During focused crawling, whenever a new page is obtained, if it is of the general type, the links on it are inserted into the ordinary link priority queue using the soft focused crawling strategy; otherwise, the links are inserted into the tunnel-link priority queue according to the probability of the rule describing the possibility that a page of this type reaches a target page. When the focused crawler collects pages, it selects links from the two priority queues in turn. This guarantees that tunnel pages get more opportunity to be crawled than under other methods, which in turn increases the capacity to resolve tunnel problems and improves the focused crawler's efficiency.

Comparative experiments illustrate that the new method improves crawler performance, mitigates the tunnel problem, and achieves the goal of increasing harvest rate and coverage. Its shortcoming is that the magnitude of the harvest rate varies with the target topic, and the degree of improvement is influenced by the standard probability of connectivity; the approach therefore still needs improvement in performance and stability. If the construction of the web connectivity information can be perfected, or the system gains the ability to adjust the standard probability of connectivity automatically, the method will have broader room for development. This can be pursued in later work.
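The two-queue frontier described above can be sketched as follows. The class name, the priority values, and the URLs are hypothetical; the point is the alternation between an ordinary queue and a tunnel-link queue so that tunnel links are not starved:

```python
import heapq
from itertools import count

class TwoQueueFrontier:
    """Sketch of the modified crawler frontier: an ordinary priority
    queue plus a tunnel-link priority queue, polled in turn."""

    def __init__(self):
        self.ordinary = []   # heap of (-priority, seq, url)
        self.tunnel = []
        self._seq = count()  # tie-breaker for equal priorities
        self._turn = 0       # alternates which queue is polled first

    def push(self, url, priority, is_tunnel_page):
        queue = self.tunnel if is_tunnel_page else self.ordinary
        heapq.heappush(queue, (-priority, next(self._seq), url))

    def pop(self):
        # Take from each queue in turn; fall back to the non-empty one.
        if self._turn:
            first, second = self.tunnel, self.ordinary
        else:
            first, second = self.ordinary, self.tunnel
        self._turn ^= 1
        for q in (first, second):
            if q:
                return heapq.heappop(q)[2]
        return None

frontier = TwoQueueFrontier()
frontier.push("http://example.com/relevant", 0.9, False)
frontier.push("http://example.com/maybe-tunnel", 0.4, True)
first = frontier.pop()   # ordinary link is served first
second = frontier.pop()  # then a tunnel link gets its turn
print(first, second)
```

Strict alternation is one simple scheduling choice; a real crawler might instead poll the tunnel queue at a configurable ratio, which corresponds to the weighting parameter mentioned above.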
Keywords/Search Tags: Implementation