The Research Of Specific-topic Crawling Strategy Based On Hierarchical Optimized Dynamic Concept Context Graph

Posted on: 2015-01-23; Degree: Master; Type: Thesis
Country: China; Candidate: X L Li; Full Text: PDF
GTID: 2298330431497446; Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet, people have become used to obtaining information online. A topic-specific search engine collects web pages associated with a particular theme, and the focused crawler is its resource collector. As it traverses the Web, the focused crawler judges the relevance of each web page to the topic and assigns a priority value to each unvisited URL according to that relevance. This ensures that the search engine downloads pages related to the topic and discards pages that deviate from it, so that the collected page set better satisfies users. The quality of a topic-specific search engine therefore depends on the quality of its focused crawler, and a good focused crawler needs good models to support it.

This thesis studies an optimized concept context graph (CCG) with an optimal number of layers. First, keywords are submitted to well-known search engines such as Google and Yahoo!, and some of the relevant returned web pages are selected as the initial seeds. The formal context is then built from the information in the selected pages. Finally, a dynamic CCG is proposed.

The main research content is as follows:

1. An approach for stratifying the traditional CCG is proposed. A full CCG is cut into several sub-CCGs (SCCGs), and the performance of a focused crawler guided by each SCCG is studied.
2. An optimal CCG is proposed. The traditional CCG usually contains all the concepts in the concept lattice, so the performance of the focused crawler may be degraded by concepts with a low relevance degree.
3. A dynamic CCG is proposed. The CCG is constructed from the seed URLs and the terms that describe the seed pages. As crawling proceeds, however, many web pages that are closer to the topic may be found. To keep the CCG up to date, highly relevant web pages are inserted into the CCG and low-quality concepts are removed. The replacement process uses an elimination mechanism.

Precision, recall, and F-measure are used to compare the optimal CCG with the traditional CCG. Several optimized CCGs are also compared, and the results show that the proposed strategy has certain advantages and is feasible.
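
The three evaluation measures named above have standard set-based definitions, which can be sketched as follows. This is only an illustrative sketch and not the evaluation code used in the thesis: the function name `precision_recall_f_measure` and the sample URL labels are hypothetical, and the F-measure is assumed to be the balanced form (harmonic mean of precision and recall).

```python
def precision_recall_f_measure(crawled, relevant):
    """Return (precision, recall, F-measure) for two collections of page IDs.

    crawled  -- pages the focused crawler downloaded
    relevant -- pages known to be on topic (the reference set)
    """
    crawled, relevant = set(crawled), set(relevant)
    hits = len(crawled & relevant)  # on-topic pages that were actually crawled

    # Guard against empty sets to avoid division by zero.
    precision = hits / len(crawled) if crawled else 0.0
    recall = hits / len(relevant) if relevant else 0.0

    # Balanced F-measure: harmonic mean of precision and recall (assumed form).
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical data: four crawled pages, three on-topic pages, two in common.
crawled_pages = ["a", "b", "c", "d"]
on_topic_pages = ["a", "b", "e"]
p, r, f = precision_recall_f_measure(crawled_pages, on_topic_pages)
print(f"precision={p:.3f} recall={r:.3f} F-measure={f:.3f}")
```

Comparing two crawling strategies then amounts to computing these three values for each strategy's crawled set against the same reference set.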
Keywords/Search Tags: Focused Crawler, Formal Concept Analysis, Concept Lattice, Search Engine