Research on the Sorting Strategy of an Agricultural Search Engine Based on Nutch

Posted on: 2011-02-04
Degree: Master
Type: Thesis
Country: China
Candidate: C H Wang
Full Text: PDF
GTID: 2178360305474329
Subject: Computer software and theory
Abstract/Summary:
A search engine is a technology for locating information on the Internet quickly and effectively. The part of it that matters most to users is the sorting of search results, because the sorted list is what is presented directly to the user; to some extent, good sorting makes a good search engine. With the spread of computers in rural areas and the growth of agricultural information, agricultural search engines have become an active research topic. The aim of this research is to analyze search-engine sorting strategies in depth, to improve the traditional PageRank algorithm, and to apply the improved algorithm to an agricultural search engine based on Nutch. The main work is to analyze the workflow of a search engine and to study the factors affecting sorting that arise in web crawling, indexing, retrieval, and other stages. The sorting process itself is also analyzed in order to identify the critical factors and basic principles that affect sorting. By analyzing Nutch, an open-source search engine, and its implementation, the thesis studies a classic sorting algorithm and improves it in two respects: authority based on hyperlink analysis, and content relevance. Finally, an agricultural search engine is built on Nutch by controlling the addresses of the web pages to be crawled, and it is enhanced with the improved sorting algorithm. The experiments present the specific process of building the Nutch-based agricultural search engine, and the improved algorithm is evaluated with two general measures: P@n and the home page duplication rate.
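As an illustration of the P@n measure mentioned above, a minimal sketch in Python follows. It assumes the common definition of P@n (the fraction of the top n returned results that are judged relevant); the abstract does not state the exact definition used in the thesis, and the function name, parameter names, and sample data are hypothetical.

```python
def precision_at_n(ranked_results, relevant, n):
    """Return the fraction of the top-n ranked results that are relevant.

    ranked_results: list of document ids in the order the engine returned them
    relevant: set of document ids judged relevant for the query
    n: cutoff rank (must be positive)
    """
    if n <= 0:
        raise ValueError("n must be positive")
    top_n = ranked_results[:n]
    hits = sum(1 for doc in top_n if doc in relevant)
    # Divide by n, not len(top_n): a short result list is penalized.
    return hits / n


# Hypothetical query: five returned pages, three of them judged relevant.
ranked = ["a", "b", "c", "d", "e"]
judged_relevant = {"a", "c", "e"}
p_at_3 = precision_at_n(ranked, judged_relevant, 3)
p_at_5 = precision_at_n(ranked, judged_relevant, 5)
```

Comparing such values for the same queries before and after a change to the sorting algorithm is one way to quantify whether the top of the result list improved.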
The experiments examine the effectiveness of the algorithm quantitatively. The results show that, relative to the original algorithm, user satisfaction and the page repetition rate are improved by about 7%. The main contribution of this thesis is an improvement to PageRank-based hyperlink analysis of page authority, in two respects: a hyperlink analysis to a depth of 2 degrees, in which a parent page passes its weight to its child pages unequally rather than in equal shares, and a compensation strategy for new or isolated resources. The thesis explains the basic idea behind each of these two improvements, gives the specific formulas, and briefly analyzes them. For content relevance analysis, it introduces the concept of agricultural theme vectors, together with methods for constructing and calculating them, and gives a formula for a document's degree of relevance to agriculture. Finally, it presents the combined algorithm, which integrates content analysis with the non-uniform parent-to-child weight transfer.
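The idea of a parent page passing its weight to child pages unequally, guided by content relevance, can be sketched as follows. This is not the thesis's formula, which the abstract does not give. It is an illustrative variant of standard PageRank under two stated assumptions: relevance is measured as cosine similarity to a theme vector, and each parent splits its score among its children in proportion to their relevance. The 2-degree depth analysis and the compensation for new or isolated resources are not modeled, and all names and data are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def weighted_pagerank(links, relevance, damping=0.85, iterations=50):
    """PageRank in which a parent's score is shared among its children
    in proportion to each child's relevance (assumption), not equally.

    links: dict mapping a page to the list of pages it links to
    relevance: dict mapping a page to a non-negative relevance score
    """
    pages = set(links) | {c for children in links.values() for c in children}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for parent in pages:
            children = links.get(parent, [])
            total = sum(relevance.get(c, 0.0) for c in children)
            if not children or total == 0.0:
                # No usable out-links: spread the score over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[parent] / n
                continue
            for c in children:
                share = relevance.get(c, 0.0) / total
                new_rank[c] += damping * rank[parent] * share
        rank = new_rank
    return rank


# Hypothetical theme vector and page term vectors.
theme = {"crop": 1.0, "soil": 1.0}
page_terms = {
    "hub": {"crop": 1.0, "news": 1.0},
    "farm": {"crop": 2.0, "soil": 1.0},
    "news": {"sport": 3.0, "soil": 0.2},
}
relevance = {page: cosine(terms, theme) for page, terms in page_terms.items()}
links = {"hub": ["farm", "news"], "farm": ["hub"], "news": ["hub"]}
ranks = weighted_pagerank(links, relevance)
```

In this toy graph, "farm" and "news" have the same single in-link from "hub", so equal-share PageRank would score them identically; the relevance-proportional split is what separates them.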
Keywords/Search Tags: 2-degree depth, compensation for new resources, relevance of documents to agriculture, scale unification, evaluation method based on comparison curves