Optimization And Implementation Of HITS In Web Structure Mining

Posted on: 2008-01-03
Degree: Master
Type: Thesis
Country: China
Candidate: B Xia
GTID: 2178360215972247
Subject: Computer application technology
Abstract/Summary:
The Internet is a huge, widely distributed, global information service center that provides many kinds of information services. How to obtain the required information or useful knowledge from the vast amount of information on the Internet has therefore become an urgent problem. Search engines are the most commonly used tools for Web information retrieval, but the quality of the documents they return is often poor and cannot satisfy users' requirements for high-quality documents. Web data mining, which combines traditional data mining technology with the Web, is an important way to address this. Web structure mining is an important dimension of Web data mining: researchers have found that rich and important information is contained in the link structure of Web pages, and hyperlink analysis has been used successfully to analyze the hyperlink data of Web pages and extract authoritative information sources. Among the various hyperlink analysis methods, the HITS (Hyperlink-Induced Topic Search) algorithm is the most typical. By studying the classical Web structure mining algorithm HITS in depth, we find that it has several shortcomings. First, HITS selects too many invalid links in the stage of expanding the root set into the base set, which directly affects the quality of the final authoritative information sources. Second, it gives unequal influence weights to different Web site authors, which leads to unreasonable mutually reinforcing relationships among links. Third, the self-organizing nature of the Web link structure often causes the iterative analysis to converge on a Tightly-Knit Community (TKC) in the link graph that is not closely related to the query topic, which causes topic drift. To address these shortcomings, this thesis proposes an improved algorithm, W-HITS, which combines content analysis with link analysis, and develops an experimental system to validate the algorithm.
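For readers unfamiliar with the baseline, the following is a minimal Python sketch of the classical HITS iteration on a small link graph. It is an illustration only and not the thesis's experimental system; the function names `hits` and `normalize`, the fixed iteration count, and the toy page names are all assumptions made for this example.

```python
import math


def normalize(scores):
    """Scale a score dictionary to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def hits(edges, iterations=50):
    """Classical HITS on a list of (source, target) hyperlinks.

    Returns (authority, hub) dictionaries keyed by page name.
    """
    nodes = sorted({page for edge in edges for page in edge})
    auth = dict.fromkeys(nodes, 1.0)
    hub = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        # I operation: authority = sum of hub scores of pages linking in.
        new_auth = dict.fromkeys(nodes, 0.0)
        for src, dst in edges:
            new_auth[dst] += hub[src]
        new_auth = normalize(new_auth)
        # O operation: hub = sum of authority scores of pages linked to.
        new_hub = dict.fromkeys(nodes, 0.0)
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth, hub = new_auth, normalize(new_hub)
    return auth, hub


# Hypothetical toy graph: two hub pages pointing at two candidate authorities.
toy_edges = [("h1", "a1"), ("h2", "a1"), ("h1", "a2")]
toy_auth, toy_hub = hits(toy_edges)
```

In this toy graph `a1` is linked from both hubs, so it ends up with a higher authority score than `a2`. Because the scores depend only on link structure, a densely interlinked off-topic cluster can dominate in the same way, which is the TKC effect described above.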
Analysis of the experimental results shows that the improved algorithm is more reasonable and effective than the original. The main contributions are as follows:
(1) A more effective method of obtaining the base set is proposed, and document authors are given equal influence weights, so that the authority and hub pages are more objective and reasonable.
(2) Content analysis assigns each information source a relevance weight with respect to the given topic, and weighted I/O operations are applied in the iteration, so that sources more relevant to the topic receive higher scores.
(3) Nodes with low relevance are pruned to remove their interference with the ranking score computation, further ensuring that the results returned for a query are genuine authority/hub sources for the topic.
(4) An experimental plan is proposed to validate the algorithm, an experimental system is developed to carry it out, and the experimental results are analyzed and discussed.
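The abstract does not give the W-HITS formulas, so the following Python sketch only illustrates the general idea behind contributions (2) and (3): pruning low-relevance nodes and weighting the I/O operations by a content relevance score. The weighting rule (each contribution is multiplied by the relevance of the contributing page), the `threshold` parameter, the function name `weighted_hits`, and the toy relevance values are all assumptions, not the thesis's actual method.

```python
import math


def _normalize(scores):
    """Scale a score dictionary to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def weighted_hits(edges, relevance, threshold=0.2, iterations=50):
    """Relevance-weighted HITS sketch (assumed weighting, see lead-in).

    edges: list of (source, target) hyperlinks.
    relevance: page -> topic relevance in [0, 1], from content analysis.
    """
    # Pruning: drop pages whose relevance is below the threshold.
    keep = {p for e in edges for p in e if relevance.get(p, 0.0) >= threshold}
    edges = [(s, t) for s, t in edges if s in keep and t in keep]
    nodes = sorted(keep)
    auth = dict.fromkeys(nodes, 1.0)
    hub = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        # Weighted I operation: relevant hubs contribute more authority.
        new_auth = dict.fromkeys(nodes, 0.0)
        for s, t in edges:
            new_auth[t] += relevance.get(s, 0.0) * hub[s]
        new_auth = _normalize(new_auth)
        # Weighted O operation: relevant authorities contribute more hub score.
        new_hub = dict.fromkeys(nodes, 0.0)
        for s, t in edges:
            new_hub[s] += relevance.get(t, 0.0) * new_auth[t]
        auth, hub = new_auth, _normalize(new_hub)
    return auth, hub


# Hypothetical data: an on-topic group (h1, h2, a, b) and an off-topic
# tightly knit cluster (x1, x2, x3 all pointing at y).
demo_edges = [("h1", "a"), ("h2", "a"), ("h1", "b"),
              ("x1", "y"), ("x2", "y"), ("x3", "y")]
demo_relevance = {"h1": 0.9, "h2": 0.8, "a": 0.9, "b": 0.4,
                  "x1": 0.1, "x2": 0.1, "x3": 0.1, "y": 0.1}
demo_auth, demo_hub = weighted_hits(demo_edges, demo_relevance)
```

With these made-up values the off-topic cluster falls below the threshold and is removed before iteration, so `y` receives no authority score at all even though it has the most in-links.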
Keywords/Search Tags:web data mining, link structure, topic distillation, HITS, content analysis