
Customizable Focused Crawler

Posted on: 2010-07-10
Degree: Master
Type: Thesis
Country: China
Candidate: H L Zou
GTID: 2178360275954781
Subject: Computer software and theory
Abstract/Summary:
Users looking for information on the Internet usually want material on a particular field or a specific topic, and traditional general-purpose search engines cannot deliver satisfactory recall and precision in every such field. A vertical search engine is topic-oriented: it aims to provide a search service with precise classification, comprehensive data, and timely updates, which gives it a distinct advantage in meeting individual needs. Behind every powerful search engine there is a powerful crawler, whose performance determines how well the search engine satisfies users in terms of recall and precision. Building on the traditional crawler, a focused crawler evaluates the topic relevance of page content and URLs. Focused crawling is a current research focus, but many problems block further progress, for example the ambiguity and polysemy of human language and the semi-structured nature of information resources on the web; difficulties remain in topic judgement and evaluation, natural language understanding, and tunneling. This thesis presents a Customizable Focused Crawler (CFC). The main work is as follows. Study and implementation of customization algorithms: through interaction between the user and the computer, a topic model is built with the vector space model, which expresses the user's interest more explicitly and lets the computer understand it better. Implementation of an Ajax interpreter: Web 2.0 has become a mainstream technology and more and more pages use Ajax; for such pages, much of the information seen in the browser cannot be found in the HTML source file, so an Ajax interpreter is bound to improve the recall ratio. This thesis handles the page-load functions in Ajax operations.
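The vector-space topic model mentioned above can be illustrated with a minimal sketch. The abstract does not give the thesis's actual term-weighting or customization procedure, so everything here is an assumption made for illustration: the function names (`term_vector`, `build_topic_vector`, `cosine_similarity`), raw term frequency as the page weighting, and user-supplied term weights as the topic description.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Turn text into a sparse bag-of-words term-frequency vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def build_topic_vector(weighted_terms):
    """Build a topic vector from user-chosen terms and weights.

    The {term: weight} input format is a hypothetical stand-in for
    whatever the user/computer interaction in the thesis produces.
    """
    vec = Counter()
    for term, weight in weighted_terms.items():
        vec[term.lower()] += weight
    return vec


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term vectors (0.0 if either is empty)."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    topic = build_topic_vector({"crawler": 3.0, "search": 2.0, "engine": 1.0})
    on_topic = "A focused crawler feeds a vertical search engine."
    off_topic = "Recipes for baking sourdough bread at home."
    print("on-topic score: ", cosine_similarity(topic, term_vector(on_topic)))
    print("off-topic score:", cosine_similarity(topic, term_vector(off_topic)))
```

A page whose score exceeds some chosen cut-off would be treated as relevant to the topic; the cut-off itself is a tuning parameter that the abstract does not specify.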
For tunneling, this thesis presents a simple and effective algorithm called tolerance. The algorithm imitates human browsing behavior: a page or link that is not related to the topic is not abandoned immediately; depending on a threshold, it may still be handled as if it were relevant. Finally, a search strategy based on link value is implemented. This method combines link-structure and content-based evaluation, considering both the topic relevance and the authority of links so that more valuable links are given priority.
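One possible reading of the tolerance idea and the link-value strategy is sketched below. The abstract gives no formulas, so this is not the thesis's algorithm: the linear blend in `link_value`, the `alpha` weight, the `min_relevance` cut-off, and the interpretation of `tolerance` as the number of consecutive off-topic pages whose links are still followed are all assumptions. The in-memory `graph` dictionary stands in for real page fetching, and the per-page relevance and authority scores are assumed to be known in advance, whereas a real crawler would have to estimate them before fetching.

```python
import heapq


def link_value(relevance, authority, alpha=0.7):
    """Blend topic relevance and authority into one score (assumed linear form)."""
    return alpha * relevance + (1 - alpha) * authority


def crawl(graph, relevance, authority, seeds, min_relevance=0.5, tolerance=1):
    """Best-first crawl over an in-memory link graph with tolerance-based tunneling.

    graph:     dict mapping page -> list of outgoing links
    relevance: dict mapping page -> topic relevance in [0, 1]
    authority: dict mapping page -> authority score in [0, 1]
    tolerance: number of consecutive off-topic pages whose links are still followed
    """
    # heapq is a min-heap, so store negated values to pop the most valuable link first.
    frontier = [(-link_value(relevance[s], authority[s]), s, 0) for s in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []

    while frontier:
        _, page, misses = heapq.heappop(frontier)
        visited.append(page)

        # A relevant page resets the off-topic run; an irrelevant one extends it.
        misses = 0 if relevance[page] >= min_relevance else misses + 1
        if misses > tolerance:
            continue  # tolerance exhausted: do not follow links from this page

        for link in graph.get(page, []):
            if link not in seen:
                seen.add(link)
                value = link_value(relevance[link], authority[link])
                heapq.heappush(frontier, (-value, link, misses))
    return visited


if __name__ == "__main__":
    graph = {
        "seed": ["bridge", "junk1"],
        "bridge": ["gem"],      # off-topic page that leads to an on-topic one
        "junk1": ["junk2"],
        "junk2": ["junk3"],
    }
    relevance = {"seed": 0.9, "bridge": 0.1, "gem": 0.8,
                 "junk1": 0.1, "junk2": 0.1, "junk3": 0.1}
    authority = {page: 0.5 for page in relevance}
    print(crawl(graph, relevance, authority, ["seed"], tolerance=1))
    print(crawl(graph, relevance, authority, ["seed"], tolerance=0))
```

In this toy graph, a tolerance of 1 lets the crawler pass through the off-topic `bridge` page and reach `gem`, while a tolerance of 0 stops at `bridge`. The long off-topic chain starting at `junk1` is cut off in both cases, which is the trade-off a tolerance threshold is meant to control.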
Keywords/Search Tags: Focused Crawler, Vertical Search Engine, Topic Customization, Ajax parser