The Research Of Topic Crawler Search Strategy Based On Genetic Algorithm

Posted on: 2011-06-08    Degree: Master    Type: Thesis
Country: China    Candidate: Y J Liang    Full Text: PDF
GTID: 2178360305988637    Subject: Computer application technology
Abstract/Summary:
Traditional search engines must collect, analyze, and process information from across the Internet. As the Internet expands rapidly, they have to handle more and more information, and they inevitably return some results that are irrelevant to the user. This thesis applies a Genetic Algorithm (GA) to topic crawler search, using the algorithm's efficient, parallel, global optimization to improve the crawler's search strategy and search efficiency. The study covers two aspects: improving the traditional Genetic Algorithm according to the characteristics of the web, and testing the improved algorithm by experiment.

The GA-based topic search strategy works as follows. First, the query is submitted to a general search engine, the returned result set is processed, and a certain number of URLs are selected as the initial population. Next, all hyperlinks are extracted from the pages corresponding to the URLs in the initial population, which produces a large number of new individuals; the similarity of each hyperlink to the topic is predicted, and the highly relevant ones are selected as the result of crossover. The mutation operation then introduces directory-type pages to expand the search range. From the results of these genetic operations, the individuals with high fitness are selected as the new generation of seeds for the next round. Finally, the search ends when the crawler's termination condition is met.

To construct the initial population, the query is submitted to the general search engine Google. The top n URLs in the returned result set are expanded, deduplicated, and scored with Authority and Hub values. The Alexa ranking is also taken into account, and the initial seed group is then selected according to the combined rank values.
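The loop described above can be illustrated with a small sketch. It is not the thesis's implementation: the toy in-memory link graph `WEB`, the word-overlap `fitness` function, the `mutation_rate` parameter, and all function names are assumptions chosen only to show one generation of "crossover" (link expansion scored by anchor text), "mutation" (injecting links from a directory-type page), and selection.

```python
import random

# Hypothetical toy web: page URL -> list of (anchor text, target URL).
WEB = {
    "seed-a": [("genetic algorithm tutorial", "ga-intro"),
               ("cooking recipes", "recipes")],
    "seed-b": [("genetic search", "ga-search"),
               ("sports news", "sports")],
    "directory": [("algorithm index", "algo-index"),
                  ("travel", "travel")],
}
TOPIC = {"genetic", "algorithm"}


def fitness(anchor_text, topic_terms):
    """Assumed fitness: fraction of anchor-text words that are topic terms."""
    words = anchor_text.lower().split()
    if not words:
        return 0.0
    return sum(w in topic_terms for w in words) / len(words)


def add_links(candidates, page, web, topic_terms):
    """Score every link on `page` and keep the best score per target URL."""
    for anchor, target in web.get(page, []):
        score = fitness(anchor, topic_terms)
        candidates[target] = max(score, candidates.get(target, 0.0))


def next_generation(population, web, directory_pages, topic_terms,
                    size, mutation_rate, rng):
    candidates = {}
    # "Crossover" in the abstract's sense: expand each individual's links.
    for url in population:
        add_links(candidates, url, web, topic_terms)
    # "Mutation": occasionally widen the search via a directory-type page.
    if directory_pages and rng.random() < mutation_rate:
        add_links(candidates, rng.choice(directory_pages), web, topic_terms)
    # Selection: keep the fittest individuals (ties broken by URL).
    ranked = sorted(candidates, key=lambda u: (-candidates[u], u))
    return ranked[:size]


if __name__ == "__main__":
    rng = random.Random(0)
    seeds = ["seed-a", "seed-b"]
    print(next_generation(seeds, WEB, ["directory"], TOPIC, 3, 1.0, rng))
```

A real crawler would fetch pages over the network and stop on a termination condition such as a page budget; here the graph is fixed so that the selection step is easy to follow.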
In the crossover phase, the relevance of a linked page to the topic is predicted from the anchor text of the hyperlink. In the mutation phase, related pages are found through the large number of links and the detailed classification contained in directory-type pages.

An experiment was designed to verify the feasibility of the Genetic Algorithm in crawler search and the effect of the improved algorithm. In the experiment, three algorithms were used to search the given topics, the similarity of each retrieved page to the topic was calculated with the vector space model, and the relevant pages found by each of the three algorithms were counted. The results show that the GA-based strategy is more efficient than both the Best-First (BF) and HITS algorithms.
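The vector space model comparison used in the evaluation can be sketched as a cosine similarity between term vectors. This is a minimal illustration under stated assumptions: raw term counts are used as weights (the thesis may use a different weighting such as TF-IDF), whitespace tokenization stands in for real text processing, and the relevance `threshold` and all function names are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Assumed representation: raw term counts over whitespace tokens."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def count_relevant(pages, topic, threshold):
    """Count pages whose similarity to the topic meets the threshold."""
    topic_vec = term_vector(topic)
    return sum(
        1 for page in pages
        if cosine_similarity(term_vector(page), topic_vec) >= threshold
    )


if __name__ == "__main__":
    pages = ["genetic algorithm crawler", "football scores"]
    print(count_relevant(pages, "genetic algorithm", 0.5))
```

Running the same count over the pages retrieved by each search strategy gives a per-algorithm tally of relevant pages that can be compared directly.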
Keywords/Search Tags: Topic Crawler, Genetic Algorithm, Best-First, HITS