On Similarity Search Among Web Pages

Posted on:2014-04-05

Degree:Master

Type:Thesis

Country:China

Candidate:D L Jin

Full Text:PDF

GTID:2268330401977056

Subject:Software engineering

Abstract/Summary:

With the popularity and rapid development of Web technology, people can easily access all kinds of information through a browser. However, because Web data is growing so quickly, users are often confronted with an overwhelming amount of information, and an effective Web page similarity search method is needed to obtain useful information from the Web. The similarity search problem among Web pages is: given a Web page, return a number of other pages that are similar to it. Hyperlinks between pages reflect the possible direction of information transfer between pages and provide an important basis for discovering similar pages. Solving the problem requires a similarity algorithm that is both efficient and accurate and that can respond quickly to user requests.

Web page networks are usually large and grow rapidly, which makes similarity computation more challenging. Traditional content-based similarity algorithms may touch on private user content and are not accurate enough. By contrast, similarity algorithms based on link structure are considerably more accurate. SimRank is a classical structural similarity algorithm that converges rapidly, but it does not scale well to large networks because its space and time costs are very high. To address this, we propose WSR, an efficient similarity search method among Web pages based on SimRank, which significantly reduces the space cost and the pre-computation time. We further apply a static pruning technique to optimize the Web network and propose the WSR-pruning algorithm, which improves the efficiency of both pre-computation and online query processing.

The work mainly consists of the following parts:

1. We explain the background and significance of the research and survey the development and current state of similarity search among Web pages.
2. We introduce the classic link analysis algorithms and several similarity computation methods.
3. We present similarity search among Web pages based on SimRank, elaborate the basic idea and recursive iterative process of SimRank, and analyze its strengths and weaknesses.
4. To address the large time and space overhead of SimRank, we propose WSR, an efficient similarity search method based on SimRank and the link relationships among Web pages. WSR computes only 2-hop similarities: for a given query, it computes the 2-hop similarity between the query page and each page in the network from a pre-computed 1-hop similarity matrix. We give an online query processing algorithm and theoretically analyze its time complexity and error bound. Some link relationships in the network involve unimportant pages and contribute little to the similarity computation. Since the pages themselves cannot be deleted, we instead remove these unimportant links to improve computational efficiency. We use a static pruning technique to optimize the Web network and propose the WSR-pruning algorithm.
5. We compare the methods experimentally. The results show that, compared with traditional SimRank, WSR and WSR-pruning reduce storage overhead and computation time while offering higher accuracy and fast query response times.
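
For readers unfamiliar with the baseline, the recursive iterative process of classical SimRank mentioned above can be sketched as follows. This is a minimal illustration of the standard SimRank recurrence, not the WSR or WSR-pruning method of the thesis, whose details are not given in this abstract. The function names, the decay value 0.8, the iteration count and the toy link graph are all assumptions made for the example.

```python
from itertools import product


def simrank(in_links, decay=0.8, iterations=10):
    """Naive SimRank on a graph given as {page: [pages that link to it]}.

    Every page must appear as a key, even if nothing links to it.
    The decay factor and iteration count are illustrative assumptions.
    """
    pages = sorted(in_links)
    # Initial scores: a page is fully similar to itself, and to nothing else.
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(pages, pages)}
    for _ in range(iterations):
        new_sim = {}
        for a, b in product(pages, pages):
            if a == b:
                new_sim[(a, b)] = 1.0
                continue
            ia, ib = in_links[a], in_links[b]
            if not ia or not ib:
                # Pages without in-links have similarity 0 to other pages.
                new_sim[(a, b)] = 0.0
                continue
            # Average similarity over all pairs of in-neighbours, damped by decay.
            total = sum(sim[(i, j)] for i in ia for j in ib)
            new_sim[(a, b)] = decay * total / (len(ia) * len(ib))
        sim = new_sim
    return sim


def most_similar(sim, query, k=3):
    """Return the k pages most similar to `query`, best first."""
    scores = [(b, s) for (a, b), s in sim.items() if a == query and b != query]
    scores.sort(key=lambda t: (-t[1], t[0]))
    return scores[:k]


# Hypothetical toy graph: "hub" links to "a" and "b"; "a" links to "c"; "b" links to "d".
toy_in_links = {"hub": [], "a": ["hub"], "b": ["hub"], "c": ["a"], "d": ["b"]}
toy_sim = simrank(toy_in_links, decay=0.8, iterations=5)
```

The sketch keeps a score for every pair of pages and revisits every pair in each iteration, which illustrates why the space and time costs of plain SimRank become prohibitive on large Web page networks.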

Keywords/Search Tags:

web page network, similarity search, SimRank, static pruning

Related items

1	Study On Index Pruning For Web Search Engine
2	SimRank Computation On Large Graphs Based On Spark
3	Page Ranking Algorithm Based On Link Similarity Study
4	Top-k SimRank Algorithm Optimization And Its Application In Scientific Literature Retrieval
5	Research On Webpage Recognition Technology Based On Vision And Semantics
6	Web Search Based On Social Tagging
7	Similarity-Based Approach To Neural Network Pruning
8	Research On Search Engine Based On Web Page Mining
9	Research On Continuous Learning Based On Task Similarity And Network Pruning
10	The Study And Implementation On The Key Problems Of Intelligent Search Engine Technology