Web Structure Mining Based On The Maximum Flow And Page Similarity

Posted on: 2012-02-29
Degree: Master
Type: Thesis
Country: China
Candidate: Y Li
Full Text: PDF
GTID: 2208330335971187
Subject: Computer application technology
Abstract/Summary:
Web information is growing at an alarming rate, and extracting, filtering, and finding useful information in Web data has become an urgent need. Web mining, and hyperlink analysis in particular, offers a new approach to making effective use of this large volume of information. Web structure mining therefore has important theoretical and practical value for providing personalized services, improving Web performance and structure, and supporting business decisions.

This thesis studies hyperlink analysis in Web structure mining, focusing on the principles of hyperlink analysis algorithms and their existing problems. It applies vector space projection and Web community discovery methods to Web hyperlink analysis in order to address the "topic drift" problem in Web structure mining. The main work and contributions fall into the following three aspects:

(1) Using Web community discovery to optimize the base set. On the one hand, documents with hyperlink structure, such as Web pages, are analyzed in depth so that nodes around the root set that are relevant to the topic can be selected and added to the base set. On the other hand, nodes that are not relevant to the topic are removed from the base set. The root set is then extended by two layers, and the nodes in the root set serve as seed nodes for finding the Web community with a max-flow algorithm. This ensures the quality of the base set and also reduces the computational cost.

(2) Introducing a page similarity value. Current search engines compute a similarity value when they crawl Web documents and analyze each document's content, and then return it to users. The improved algorithm uses this readily available similarity value to measure how relevant a Web page is to the user's query topic. This strengthens the algorithm's ability to distinguish the status of link structures within the Web community, avoids repeated analysis of page text content, and lowers the system cost.

(3) Using the core idea of the space projection method to construct a vector space based on the page similarity values. Once the page similarity values have been obtained, a high-authority subspace is built from them on the basis of the vector space model (VSM). The characteristic vector most closely linked with this space is obtained for projection; the vector with the largest absolute value is then taken from the projection space and iterated. This can effectively address topic drift.

Finally, a system was designed to demonstrate the validity and feasibility of the improved algorithm. The experimental results show that the improved HITS algorithm is superior to the original algorithm in the topic relevance of authority pages, the topic relevance of hub pages, and computational cost. It also effectively reduces topic drift and further improves the quality of user query results.
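The abstract does not give the update formulas of the improved HITS algorithm, so the following is only a minimal sketch of one plausible way a per-page similarity value could enter the hub/authority iteration. It assumes each link is weighted by the product of the similarity values of its two endpoints; that weighting, the function names `weighted_hits` and `normalize`, and the toy graph are all illustrative assumptions, not the thesis's actual method.

```python
from math import sqrt


def normalize(scores):
    """Scale a score dict to unit Euclidean length (unchanged if all zero)."""
    norm = sqrt(sum(v * v for v in scores.values()))
    if norm == 0.0:
        return dict(scores)
    return {page: v / norm for page, v in scores.items()}


def weighted_hits(links, similarity, iterations=50):
    """HITS with each link u -> v weighted by similarity[u] * similarity[v].

    links: dict mapping a page to the list of pages it links to.
    similarity: dict mapping a page to its query relevance in [0, 1]
        (assumed to be supplied by the search engine; missing pages count as 0).
    Returns (authority, hub) dicts.
    """
    pages = set(links) | {dst for targets in links.values() for dst in targets}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority update: weighted sum of the hub scores of in-linking pages.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                weight = similarity.get(src, 0.0) * similarity.get(dst, 0.0)
                new_auth[dst] += weight * hub[src]
        auth = normalize(new_auth)

        # Hub update: weighted sum of the authority scores of linked pages.
        new_hub = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                weight = similarity.get(src, 0.0) * similarity.get(dst, 0.0)
                new_hub[src] += weight * auth[dst]
        hub = normalize(new_hub)

    return auth, hub


# Toy graph (hypothetical): "off" has more in-links than "on" but low similarity.
links = {"h1": ["on", "off"], "h2": ["on", "off"], "h3": ["off"]}
similarity = {"h1": 0.9, "h2": 0.8, "h3": 0.1, "on": 0.9, "off": 0.1}

auth, hub = weighted_hits(links, similarity)
plain_auth, plain_hub = weighted_hits(links, {p: 1.0 for p in similarity})
```

Setting every similarity value to 1.0 reduces the sketch to unweighted HITS, where the page with more in-links ("off") ranks as the stronger authority. With the similarity weights, the on-topic page ranks higher, which is the kind of topic-drift reduction the abstract describes.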
Keywords/Search Tags: web data mining, HITS algorithm, page similarity value, space vector projection