
Focused Crawler Based On Domain Ontology And Similarity Concept Context Graph

Posted on: 2013-09-19
Degree: Master
Type: Thesis
Country: China
Candidate: Z J Liu
Full Text: PDF
GTID: 2248330377453803
Subject: Computer software and theory
Abstract/Summary:
In recent years, with the exponential growth of information on the World Wide Web, it has become increasingly difficult for users to find useful information, so efficient and effective search engines that organize and retrieve this information are in great demand. The crawler is an important component of a search engine; it is mainly used to collect document information from the Internet. However, the crawlers used by general-purpose search engines require a huge amount of disk space and network bandwidth, and the precision of their search results is low. Consequently, vertical search engines, characterized by intelligence, personalization, domain focus, and specialization, have become an active research area in both academic and industrial circles.

A focused crawler aims to selectively seek out web pages relevant to a predefined set of crawling topics instead of searching the whole Web exhaustively, and it relies on the fact that pages about a topic tend to link to other pages on the same topic. One of the major problems of focused crawling is how to assign a proper order to unvisited web pages and maintain a high harvest rate during the crawling process. To address this issue, we propose an effective focused web crawling method based on domain ontology and Formal Concept Analysis (FCA). The method first constructs a core similarity graph based on WordNet and concept relatedness; a similarity concept context graph (SCCG) is then built with concept lattice knowledge. Finally, the relatedness between the crawling topic and the anchor text corresponding to each URL is combined with link analysis techniques to calculate the URLs' priority scores and determine which URL should be crawled first.

The main contents of this paper are summarized as follows:

1. This paper presents a measure of semantic relatedness.
Semantic relatedness measures the semantic relevance between documents or terms and reflects the degree of correlation between two objects. Drawing on the rich semantic content of WordNet and a survey of existing relatedness measures, we derive the measure of semantic relevance used in this paper.

2. This paper presents a method for building a similarity concept context graph. In this stage, the gathered basic pages, which describe the crawling topic, are processed into a basic lattice, and the current pages, which are linked from the basic pages, are processed into a current lattice; meanwhile, feature terms describing the crawling topic are extracted from the basic pages. The feature terms are then expanded with WordNet synonyms, and a core similarity graph is constructed using the semantic relatedness measure. Finally, a similarity concept context graph is built by the proposed algorithm from the core similarity graph, the basic lattice, and the current lattice.

3. This paper presents a measure for predicting the priority scores of URLs based on semantic link analysis and the similarity concept context graph. Generally, anchor text is a brief summary of a web page written by whoever references that page from elsewhere, so it can best reflect the page's topic. This paper therefore presents a method for calculating the relatedness between anchor text and the crawling topic; combined with the similarity concept context graph built in the previous step, a measure for calculating the URLs' priority scores is proposed.
The URLs' crawling order is then determined according to these priority scores.

Finally, this paper uses three metrics (recall, precision, and F-measure) to analyze the experimental results. Under the same conditions, our approach achieves higher precision and F-measure than the standard breadth-first crawling approach (BF) and the approaches based on the Context Graph (CG), the Relevancy Context Graph (RCG), and the Concept Context Graph (CCG), which demonstrates the effectiveness and feasibility of the method.
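The WordNet-based relatedness step can be sketched with a path-based measure in the Wu-Palmer style. This is a minimal illustration, not the thesis's actual measure: the tiny hand-written hypernym taxonomy below is a hypothetical stand-in for WordNet.

```python
# Toy single-inheritance taxonomy (child -> parent); a stand-in for WordNet.
HYPERNYM = {
    "crawler": "program",
    "browser": "program",
    "program": "software",
    "software": "artifact",
    "artifact": "entity",
}

def ancestors(term):
    """Return the hypernym chain [term, parent, ..., root]."""
    chain = [term]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain

def depth(term):
    """Length of the chain from term up to the root."""
    return len(ancestors(term))

def wu_palmer(a, b):
    """Wu-Palmer relatedness: 2*depth(lcs) / (depth(a) + depth(b))."""
    anc_a = ancestors(a)
    anc_b = set(ancestors(b))
    lcs = next(t for t in anc_a if t in anc_b)  # least common subsumer
    return 2.0 * depth(lcs) / (depth(a) + depth(b))
```

For example, "crawler" and "browser" share the subsumer "program", so their relatedness is high, while identical terms score 1.0.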
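The priority-scoring step (item 3) amounts to a best-first crawl frontier ordered by each URL's score. The sketch below assumes a simple linear combination of anchor-text relatedness and the parent page's relevance; `anchor_relatedness` is a hypothetical term-overlap stand-in for the SCCG-based measure, and the weight `alpha` is an illustrative parameter, not one from the thesis.

```python
import heapq

def anchor_relatedness(anchor_text, topic_terms):
    """Toy relatedness: fraction of topic terms that appear in the anchor text."""
    words = set(anchor_text.lower().split())
    return sum(1 for t in topic_terms if t in words) / len(topic_terms)

class CrawlFrontier:
    """Best-first frontier: pop always returns the highest-priority URL."""

    def __init__(self):
        self._heap = []   # max-heap emulated with negated scores
        self._seen = set()

    def push(self, url, anchor_text, parent_score, topic_terms, alpha=0.7):
        if url in self._seen:
            return
        self._seen.add(url)
        score = (alpha * anchor_relatedness(anchor_text, topic_terms)
                 + (1 - alpha) * parent_score)
        heapq.heappush(self._heap, (-score, url))

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score
```

Pages whose inbound anchor text matches the crawling topic are dequeued before off-topic ones, which is what lets a focused crawler sustain a high harvest rate.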
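The evaluation metrics named in the abstract are standard set-based measures; a minimal sketch, assuming the set of relevant pages is known for the test topic:

```python
def precision(crawled, relevant):
    """Fraction of crawled pages that are relevant (the harvest rate)."""
    return len(crawled & relevant) / len(crawled)

def recall(crawled, relevant):
    """Fraction of relevant pages that were crawled."""
    return len(crawled & relevant) / len(relevant)

def f_measure(crawled, relevant):
    """Harmonic mean of precision and recall."""
    p, r = precision(crawled, relevant), recall(crawled, relevant)
    return 2 * p * r / (p + r) if p + r else 0.0
```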
Keywords/Search Tags: Focused crawler, Formal Concept Analysis, concept relatedness, WordNet