Research On The Web Structure Mining Algorithm Based On Nutch

Posted on: 2012-11-08
Degree: Master
Type: Thesis
Country: China
Candidate: J J Wen
Full Text: PDF
GTID: 2178330335475497
Subject: Management Science and Engineering
Abstract/Summary:
With the exponential growth of information on the internet, it is becoming increasingly difficult for ordinary users to find the information they want. A good search engine, one that turns miscellaneous content into accessible information, is therefore urgently needed, and search engine technology contributes greatly to solving this problem. Two factors are key to judging the quality of a search engine: the quality of the web pages in the result set it returns, and whether high-quality pages are ranked well within that result set. Evaluating the importance of a web page and ranking it accordingly remains a difficult problem for search ranking algorithms.
This thesis introduces the concept and classification of Web data mining, with emphasis on Web structure mining and the ways of representing its data. We then analyze the classical PageRank and HITS algorithms in the field of Web structure mining, and study the shortcomings of the PageRank algorithm and the directions in which it can be improved, given the current state of the internet. Next, we cover the relevant background on search engines and on Nutch. Finally, we describe the metrics used to evaluate search engines and the workflow of Nutch.
Based on an in-depth study of the PageRank algorithm, a strategy for improving it is proposed that addresses three shortcomings: its bias toward older pages, topic drift, and the equal distribution of a page's weight across its outgoing links. To address the fact that the PageRank algorithm neglects the user's interests, this thesis further proposes classifying web pages, computing the degree of similarity between linked pages, and storing the results. On this basis, the improved PageRank algorithm is proposed.
This thesis then designs and implements both the traditional PageRank algorithm and the improved PageRank algorithm. After crawling a large number of web pages with the Nutch search engine, we test and compare the two algorithms on the relevant evaluation metrics. The results indicate that the improved algorithm achieves better accuracy and better satisfies users' needs.
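The abstract does not give the formulas for either algorithm, so the following is only a minimal sketch of the general idea. It shows the classical PageRank power iteration, in which a page's rank is split equally among its outgoing links, next to one possible variant in which the split is proportional to a per-link similarity score. The function name `pagerank`, the `weights` parameter, the damping factor of 0.85, and the example similarity values are all assumptions made for illustration; they are not taken from the thesis and do not reproduce its improved algorithm.

```python
from typing import Dict, List, Optional, Tuple

Graph = Dict[str, List[str]]            # page -> pages it links to
Weights = Dict[Tuple[str, str], float]  # (source, target) -> similarity score


def pagerank(
    graph: Graph,
    damping: float = 0.85,              # assumed value, commonly used
    weights: Optional[Weights] = None,  # hypothetical similarity scores
    tol: float = 1e-10,
    max_iter: int = 200,
) -> Dict[str, float]:
    """Power-iteration PageRank.

    With weights=None each outgoing link receives an equal share of the
    source page's rank (the classical scheme). With weights given, each
    link's share is proportional to its similarity score instead.
    """
    pages = sorted(set(graph) | {t for ts in graph.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(max_iter):
        # Every page starts with the random-jump share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for src in pages:
            targets = graph.get(src, [])
            if weights is None:
                link_w = {t: 1.0 for t in targets}
            else:
                link_w = {t: weights.get((src, t), 0.0) for t in targets}
            total = sum(link_w.values())
            if total == 0:
                # Dangling page (no usable links): spread its rank evenly.
                for p in pages:
                    new_rank[p] += damping * rank[src] / n
            else:
                for t, w in link_w.items():
                    new_rank[t] += damping * rank[src] * w / total
        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Toy three-page graph; the similarity scores are made up for illustration.
example_graph: Graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
example_similarity: Weights = {
    ("A", "B"): 0.9, ("A", "C"): 0.1, ("B", "C"): 1.0, ("C", "A"): 1.0,
}
classic_ranks = pagerank(example_graph)
weighted_ranks = pagerank(example_graph, weights=example_similarity)
```

In this toy graph, page A links to both B and C. Under the equal split, B receives half of A's distributed rank; under the weighted split, B receives most of it because the assumed similarity between A and B is higher, so B's score rises relative to the classical result.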
Keywords/Search Tags: PageRank algorithm, Nutch, Web structure mining