
Research And Improvement Of Web Structure Mining Algorithm

Posted on: 2011-07-18
Degree: Master
Type: Thesis
Country: China
Candidate: P Ren
Full Text: PDF
GTID: 2178360308457139
Subject: Computer software and theory
Abstract/Summary:
Over the past ten years of rapid development, the Web has become the world's largest publicly accessible source of data. Faced with such a huge source of information, finding useful information quickly and efficiently has become increasingly important. Unlike traditional documents, the Web has many unique features, so obtaining valuable information requires combining data mining techniques with the Web, that is, Web data mining. When users search the Web, they want not only to find the information they need quickly but also to find authoritative content, that is, authoritative websites. The Web can be modeled as a huge graph whose edges are hyperlinks, and the links between pages provide a new way to find authoritative pages. The HITS algorithm is one of the most representative algorithms of this kind and therefore has high research value.

This thesis studies the problems of the HITS algorithm in depth. HITS considers only the links between pages and ignores the link text. In the process of expanding the root set, the resulting page set may come to contain a large number of pages unrelated to the topic. This easily leads to topic drift, so that the returned results do not meet users' needs. To solve these problems, we propose an improved algorithm, T-HITS. First, a trust model is established. Then the network structure graph is used to map the set of spam links to their corresponding websites, and spam links are rejected by analyzing their link text. Finally, the results are adjusted using the trust model.

We implemented a simple prototype system and used it to compare the HITS, BHITS, and T-HITS algorithms experimentally. Comparing the top 20 results returned by each algorithm, the experimental data show that T-HITS, with its trust model, improves the relevance of query results and reduces the occurrence of topic drift, which in turn improves user satisfaction.
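For orientation, the baseline hub and authority iteration that the thesis sets out to improve can be sketched as below. This is a minimal illustration of the standard HITS update only, not the T-HITS algorithm, whose trust model and link-text analysis are not specified in this abstract. The function names `hits` and `normalize`, the toy graph, and the fixed iteration count are assumptions made for the example.

```python
import math


def normalize(scores):
    """Scale scores to unit Euclidean length (left unchanged if all zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)
    return {page: v / norm for page, v in scores.items()}


def hits(graph, iterations=50):
    """Basic HITS on a link graph.

    `graph` maps each page to the list of pages it links to.
    Returns (hub, authority) dictionaries keyed by page.
    """
    # Collect every page that appears as a source or a target.
    pages = set(graph)
    for targets in graph.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in graph.items():
            for dst in targets:
                new_auth[dst] += hub[src]

        # Hub score: sum of authority scores of the pages linked to.
        new_hub = {
            p: sum(new_auth[dst] for dst in graph.get(p, []))
            for p in pages
        }

        auth = normalize(new_auth)
        hub = normalize(new_hub)

    return hub, auth


# Hypothetical toy graph: pages "a" and "b" both link to "c"; "c" links to "d".
toy_graph = {"a": ["c"], "b": ["c"], "c": ["d"]}
hub, auth = hits(toy_graph)
for page in sorted(auth, key=auth.get, reverse=True):
    print(page, round(auth[page], 3), round(hub[page], 3))
```

Because the update uses only link structure, any off-topic page that is densely linked within the expanded set can accumulate a high score. That is the topic drift the abstract describes.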
Keywords/Search Tags: Data Mining, Web Data Mining, HITS, Search Engine, Information Retrieval