Research On Web Spam Combating Algorithm Based On TrustRank

Posted on: 2017-08-03

Degree: Master

Type: Thesis

Country: China

Candidate: J Zhou

Full Text: PDF

GTID: 2348330512977431

Subject: Computer Science and Technology

Abstract/Summary:

With the rapid development of the Internet, the volume of online information keeps growing, and so does the number of spam pages, which greatly reduces the accuracy and efficiency of search engines. Finding high-quality results in this mass of information to meet users' needs is becoming more and more important, and identifying spam pages has become one of the most serious challenges facing the Internet and search engines. Search engine cheating falls into two types: content cheating and link cheating. On the one hand, spam pages link to pages with high trust values to improve their own ranking; on the other hand, they cheat on content by stuffing pages with keywords. This thesis recasts spam page detection as a web page ranking problem. Based on the characteristics of search engine cheating, the TrustRank algorithm, which ranks pages by quality, is optimized with respect to both links and content. The main work of this thesis is as follows:

(1) The problems of current algorithms are described. Existing link-based detection methods rely on the existing link topology and ignore the possibility that the links themselves are spam. To address this, we first extract features from web content to form feature vectors, compute the similarity of adjacent pages, identify spam links, and update link weights according to link scores and hits.

(2) The TrustRank algorithm is optimized. The traditional TrustRank algorithm, which is based on the random walk model, propagates information in one direction only: if page A links to page B, whether A is spam affects the score of B. This thesis proposes a new algorithm, TDRank, based on a two-way random walk model, in which the scores of pages A and B affect each other; this prevents spam pages that link to many high-quality pages from obtaining high scores. The thesis also studies other simple and fast algorithms for selecting the seed set, in order to provide a suitable input feature vector for TDRank and to make the experimental results accurate and effective.

(3) Experiments based on WEBSPAM-UK2007 are designed to test and verify the algorithms above. Experimental results on challenging real-world datasets show that the proposed algorithm is effective.
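The two building blocks named in the abstract, content-similarity link weights and seed-based trust propagation, can be illustrated with a minimal sketch. This is not the thesis's TDRank: the abstract does not give its two-way update rule, so the code below only shows classic one-directional TrustRank run over similarity-weighted links. The function names, the use of cosine similarity, the damping factor of 0.85, and the toy graph are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def weighted_links(links, features):
    """Assumed weighting: each link (src, dst) gets the content similarity of its endpoints."""
    return {(s, d): cosine(features[s], features[d]) for s, d in links}


def trust_rank(nodes, weights, seeds, alpha=0.85, iters=50):
    """Classic TrustRank as biased PageRank: trust starts at the seed pages
    and flows forward along out-links, split in proportion to link weight.
    Trust reaching pages without weighted out-links is simply dropped (a simplification)."""
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    out_sum = {n: 0.0 for n in nodes}
    for (src, _), w in weights.items():
        out_sum[src] += w
    trust = dict(start)
    for _ in range(iters):
        nxt = {n: (1 - alpha) * start[n] for n in nodes}
        for (src, dst), w in weights.items():
            if out_sum[src] > 0:
                nxt[dst] += alpha * trust[src] * w / out_sum[src]
        trust = nxt
    return trust


# Toy graph (hypothetical): a trusted seed links to a similar page and to a
# dissimilar page; a fourth page only links out to the trusted pages.
features = {
    "seed": [1.0, 0.0, 1.0],
    "good": [1.0, 0.0, 1.0],
    "hijacked": [0.1, 1.0, 0.0],
    "linker": [0.0, 1.0, 0.0],
}
links = [("seed", "good"), ("seed", "hijacked"), ("linker", "seed"), ("linker", "good")]
scores = trust_rank(list(features), weighted_links(links, features), seeds={"seed"})
for page, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.4f}")
```

In this toy graph the dissimilar target of the seed's link receives far less trust than the similar one, which is the effect the link reweighting in step (1) aims for. The page that only links out to trusted pages receives no trust under forward-only propagation; how a two-way walk such as TDRank should treat such pages is the question step (2) addresses.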

Keywords/Search Tags:

Spam Link, Ranking Algorithm, Seed Selection, Web Similarity

Related items

1	Research On Ranking Algorithm And Spam Detection Techniques Of Search Engine
2	Research On Web Spam Detection Algorithm Based Link And Topic Relevance
3	Research On Web Spam Detection Algorithm Based Link Weight
4	Link Analysis Based Page Ranking Improvement And Related Link Spam Algorithm
5	Research On The Approach To Detecting Spam Page Ranking Based On Link Analysis
6	Optimizing Page Ranking Based On Link Analysis
7	Page Ranking Algorithm Based On Link Similarity Study
8	Research On Automatic Seed Set Expansion Algorithm In Anti Search Engine Spam
9	Research On Web Spam Combating Algorithm Based On K-means
10	The Study On Ranking And Similarity Calculation In Information Retrieval