A Vector Space Projection HITS Algorithm Based On Similarity Value

Posted on: 2011-02-24
Degree: Master
Type: Thesis
Country: China
Candidate: D H Liu
GTID: 2178360305980380
Subject: Computer application technology
Abstract/Summary:
The Internet has become the main channel through which people exchange information and share resources in modern society. As an important platform for information search, the Internet is characterized by massive data volume, heterogeneity, semi-structured content, dynamic change, and user diversity. These properties make mining web resources considerably difficult, and traditional data mining techniques are no longer adequate. Web data mining technology has developed in recent years in response.

Search engines are an important tool for finding information. By applying web data mining technology, a search engine can return the required information quickly and effectively. Several kinds of search engine algorithms are available, but most of them mine the text content of web pages and generally return a large result set, so users can hardly find the information they need quickly. This has become an obstacle to the efficient use of information resources.

In recent years, hyperlink analysis has been applied as a new approach to this problem, because the link structure of the Internet contains a large number of latent regularities from which information not included in the web documents themselves can be inferred. Web structure mining has thus become an important research direction in web data mining. The HITS (Hyperlink-Induced Topic Search) algorithm mines web data by analyzing the hyperlink structure of the results generated by a traditional search engine.

This paper focuses on the HITS algorithm, a web structure mining algorithm mainly used to rank the web pages in a search engine's result set. First, web data mining and its classification are introduced, and the typical web structure mining algorithms are summarized. Second, the HITS algorithm is studied in depth and its principle is elaborated.
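For orientation, the basic HITS iteration can be sketched as follows. This is a minimal illustration of the standard algorithm, not code from the thesis: the function names `hits` and `_normalize`, the dictionary-based graph representation, and the fixed iteration count are all assumptions made for the example.

```python
from math import sqrt


def _normalize(scores):
    """Scale scores to unit Euclidean length (all zeros stay zeros)."""
    norm = sqrt(sum(v * v for v in scores.values()))
    return {p: (v / norm if norm else 0.0) for p, v in scores.items()}


def hits(links, iterations=50):
    """Return (authority, hub) scores for a link graph.

    links: dict mapping each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    auth = dict.fromkeys(pages, 1.0)
    hub = dict.fromkeys(pages, 1.0)

    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        new_auth = dict.fromkeys(pages, 0.0)
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]

        # Hub score: sum of (updated) authority scores of the pages linked to.
        new_hub = dict.fromkeys(pages, 0.0)
        for src, targets in links.items():
            for dst in targets:
                new_hub[src] += new_auth[dst]

        auth, hub = _normalize(new_auth), _normalize(new_hub)

    return auth, hub


if __name__ == "__main__":
    graph = {"a": ["c", "d"], "b": ["c"]}
    authority, hub_scores = hits(graph)
    print(sorted(authority, key=authority.get, reverse=True))
    print(sorted(hub_scores, key=hub_scores.get, reverse=True))
```

In this toy graph, page `c` receives links from both `a` and `b` and so ends up with the highest authority score, while `a`, which links to both `c` and `d`, ends up as the strongest hub.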
Finally, building on the study of the HITS algorithm and related improved algorithms, a vector space projection HITS algorithm based on similarity value is proposed. It improves on the original in three respects:

1) Reduce the base set. When the traditional HITS algorithm extends the root set to the base set, many pages are introduced, most of them pages from the same domain and advertising links. The hyperlinks among them generally exist for navigation and have no reference value. By identifying and deleting unrelated pages and same-domain pages to reduce the base set, the improved HITS algorithm noticeably lowers computing cost.

2) Obtain the similarity values returned by the search engine. After crawling web pages, a traditional search engine converts the text content and the query topic into term feature vectors, computes similarity values as vector dot products, and returns them to the user along with the result set. The improved algorithm uses these readily available similarity values to compute the relevance between hyperlinks and the user's query topic. This strengthens the ability to distinguish the importance of links, avoids repeated analysis of page text content, and saves system cost.

3) Vector space projection based on similarity values. Each eigenvector is projected onto the high-authority subspace based on similarity value, so that the returned page result set has the closest hyperlinks to the set of high-similarity pages. This effectively mitigates the topic drift of the HITS algorithm without extra computing cost.

An experimental system is also designed to evaluate the feasibility and validity of the proposed algorithm.
The results indicate that the improved HITS algorithm outperforms the original algorithm in computing cost, the topic relevance of authority pages, the topic relevance of hub (central) pages, and other respects. In addition, it restrains topic drift and significantly improves query quality for users.
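The first two improvements described in the abstract (dropping same-domain links and weighting by search-engine similarity values) might look roughly like the sketch below. The abstract does not give the thesis's formulas, so everything here is an assumption: the helper names `prune_same_domain` and `weighted_hits`, comparing hosts by exact hostname, the `default_sim` value for pages the search engine did not score, and the choice to multiply each contribution by the similarity of the page it comes from. The projection step is not covered.

```python
from math import sqrt
from urllib.parse import urlparse


def _normalize(scores):
    """Scale scores to unit Euclidean length (all zeros stay zeros)."""
    norm = sqrt(sum(v * v for v in scores.values()))
    return {p: (v / norm if norm else 0.0) for p, v in scores.items()}


def prune_same_domain(links):
    """Drop links whose source and target share a hostname.

    Assumption: such links are navigational and carry no endorsement.
    """
    pruned = {}
    for src, targets in links.items():
        src_host = urlparse(src).hostname
        pruned[src] = [t for t in targets if urlparse(t).hostname != src_host]
    return pruned


def weighted_hits(links, similarity, iterations=50, default_sim=0.1):
    """HITS variant with similarity-weighted contributions (hypothetical).

    similarity: dict mapping page URL to the query similarity value
    reported by the search engine; unscored pages get default_sim.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    sim = {p: similarity.get(p, default_sim) for p in pages}

    auth = dict.fromkeys(pages, 1.0)
    hub = dict.fromkeys(pages, 1.0)
    for _ in range(iterations):
        # Authority: hub scores of in-linking pages, weighted by how
        # similar each linking page is to the query.
        new_auth = dict.fromkeys(pages, 0.0)
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += sim[src] * hub[src]

        # Hub: authority scores of linked pages, weighted by how
        # similar each linked page is to the query.
        new_hub = dict.fromkeys(pages, 0.0)
        for src, targets in links.items():
            for dst in targets:
                new_hub[src] += sim[dst] * new_auth[dst]

        auth, hub = _normalize(new_auth), _normalize(new_hub)
    return auth, hub


if __name__ == "__main__":
    raw_links = {
        "http://a.com/1": ["http://a.com/2", "http://b.com/x"],
        "http://c.com/1": ["http://b.com/x", "http://d.com/y"],
        "http://e.com/1": ["http://d.com/y"],
    }
    scores = {
        "http://a.com/1": 0.9, "http://c.com/1": 0.8, "http://e.com/1": 0.1,
        "http://b.com/x": 0.9, "http://d.com/y": 0.2,
    }
    authority, _ = weighted_hits(prune_same_domain(raw_links), scores)
    print(max(authority, key=authority.get))
```

Because the similarity values are taken as given from the search engine, the weighting adds only a multiplication per link and requires no further analysis of page text, which is consistent with the cost argument made in the abstract.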
Keywords/Search Tags: web data mining, HITS algorithm, link structure, similarity value, vector projection