Research On Ranking Algorithm And Spam Detection Techniques Of Search Engine

Posted on:2011-12-05

Degree:Master

Type:Thesis

Country:China

Candidate:X Y Wang

Full Text:PDF

GTID:2178360305951059

Subject:Computer system architecture

Abstract/Summary:

With the rapid development of information technology, the World Wide Web has become an important source of knowledge. Search engines, the gateway to the Web, play a decisive role in Web information retrieval. A search engine's ranking algorithm presents the most valuable web pages to the user first and thereby improves the quality of the search service. Link-analysis-based ranking algorithms such as PageRank have achieved great success in today's search engines, and the same ranking principle can be used in many other information retrieval systems.

However, search engine ranking algorithms face a serious problem called web spam. Web spam refers to deliberate actions that deceive a search engine's ranking algorithm with the goal of giving some pages a higher ranking position than they deserve. Web spam not only increases the running cost and decreases the performance of search engines but also degrades the experience of web searchers. Detecting web spam and eliminating spam pages from the Web is therefore of great significance for both search engines and web searchers.

Our work covers two aspects:

1. We apply link analysis to a literature retrieval system and propose a PageRank-based algorithm for scientific literature quality evaluation.
2. We propose a web spam detection method based on an expansion strategy and link-based similarity measures.

The main contents of this thesis are as follows:

1. We summarize and analyze the ranking algorithms of search engines. On the basis of investigating the theory and technology of ranking algorithms, we propose a PageRank-based algorithm for scientific literature quality evaluation, which applies search engine ranking to a literature retrieval system. The algorithm introduces the idea of PageRank into citation analysis and takes several factors into account when evaluating the quality of a publication, such as the authority of its publisher, the authority of its authors, and its publication date. Experimental results indicate that the new algorithm gives more accurate results, in line with expectations.

2. On the basis of studying web spam techniques and existing anti-spam methods, we propose a detection approach based on an expansion strategy and link-based similarity measures. The approach starts from a small spam seed set and expands it iteratively. Each expansion step finds further spam nodes according to the link-based similarity between candidate nodes and the seed nodes, and then adds these nodes to the seed set. Three link-based similarity measures currently exist: Cocitation, Bibcoupling, and Amsler. All three were first proposed in citation analysis and later introduced into link analysis. Because the differences between citation analysis and link analysis are neglected, applying these measures directly to link analysis may cause problems. To address this, we propose three new link-based similarity measures and apply them in our detection approach.

3. We test our approach experimentally on the public dataset WEBSPAM-UK2006, analyze the results, and compare our approach with two other state-of-the-art detection methods. First, we compare our newly proposed link-based similarity measures with the existing measures in terms of their performance in detecting web spam. We find that the new measures achieve higher precision but return fewer nodes than the existing measures. Second, we compare the performance of the three new measures with one another and find that each has its own strengths. Finally, we compare our approach with ATR and BRW, two state-of-the-art detection methods. The results indicate that our approach is better than both methods in terms of precision and number of result nodes.
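
The seed-expansion idea described in the abstract can be illustrated with a minimal sketch. It uses the three classic measures (Cocitation, Bibcoupling, Amsler) in a Jaccard-normalized form, which is an assumption made here for illustration; the three new measures proposed in the thesis are not specified in the abstract and are not reproduced. The function names, the `threshold` and `max_rounds` parameters, and the rule "add a node if it is similar enough to any seed" are likewise hypothetical choices, not the author's implementation.

```python
from typing import Callable, Dict, Set

Graph = Dict[str, Set[str]]  # node -> set of nodes it links to


def in_links(graph: Graph) -> Graph:
    """Invert the graph: node -> set of nodes that link to it."""
    parents: Graph = {node: set() for node in graph}
    for src, targets in graph.items():
        for dst in targets:
            parents.setdefault(dst, set()).add(src)
    return parents


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection over size of the union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cocitation(u: str, v: str, out: Graph, inc: Graph) -> float:
    """Overlap of the pages that link TO u and to v (assumed normalization)."""
    return jaccard(inc.get(u, set()), inc.get(v, set()))


def bibcoupling(u: str, v: str, out: Graph, inc: Graph) -> float:
    """Overlap of the pages that u and v link to (assumed normalization)."""
    return jaccard(out.get(u, set()), out.get(v, set()))


def amsler(u: str, v: str, out: Graph, inc: Graph) -> float:
    """Overlap of the combined in- and out-neighborhoods of u and v."""
    nu = out.get(u, set()) | inc.get(u, set())
    nv = out.get(v, set()) | inc.get(v, set())
    return jaccard(nu, nv)


Measure = Callable[[str, str, Graph, Graph], float]


def expand_seed_set(
    graph: Graph,
    seeds: Set[str],
    measure: Measure,
    threshold: float = 0.5,  # hypothetical cut-off
    max_rounds: int = 3,     # hypothetical iteration limit
) -> Set[str]:
    """Iteratively add nodes whose similarity to any current seed
    reaches the threshold; stop when a round adds nothing."""
    inc = in_links(graph)
    nodes = set(graph) | set(inc)
    spam = set(seeds)
    for _ in range(max_rounds):
        added = {
            n for n in nodes - spam
            if any(measure(n, s, graph, inc) >= threshold for s in spam)
        }
        if not added:
            break
        spam |= added
    return spam
```

In this sketch a lower `threshold` admits more nodes per round and a higher one admits fewer, which mirrors the trade-off between precision and number of result nodes that the abstract reports.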

Keywords/Search Tags:

search engine, ranking algorithm, web spam, link-based similarity, expansion strategy

Related items

1	Research On Automatic Seed Set Expansion Algorithm In Anti Search Engine Spam
2	Research On The Approach To Detecting Spam Page Ranking Based On Link Analysis
3	Optimizing Page Ranking Based On Link Analysis
4	Research On The Scheduling Strategy Of Meta Search Engine And Results Ranking Algorithm
5	Page Ranking Algorithm Based On Link Similarity Study
6	Research On Web Spam Combating Algorithm Based On TrustRank
7	A BBS Search Ranking Strategy Based On C4.5 Algorithm
8	Research On Search Engine Ranking Algorithm Based On Link Analysis
9	Research On Search Engine's Anti-spam Technologies Based On Link Analysis
10	A Novel Page Ranking Method Based On Analyzing The Diversity Of Network Structure