Research On Web Spam Combating Algorithm Based On TrustRank

Posted on: 2017-08-03
Degree: Master
Type: Thesis
Country: China
Candidate: J Zhou
Full Text: PDF
GTID: 2348330512977431
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Internet, the volume of online information keeps growing, and so does the number of spam pages, which seriously degrades the accuracy and efficiency of search engines. Finding high-quality results in this mass of information to meet users' needs is becoming more and more important, and identifying spam pages has become one of the most serious challenges facing the Internet and search engines. Search engine cheating falls into two types: content cheating and link cheating. On the one hand, spam pages link to pages with high trust values to improve their own ranking; on the other hand, they stuff their content with keywords. The thesis recasts spam page detection as a web page ranking problem. Based on the characteristics of search engine cheating, the TrustRank algorithm, which is based on web page quality, is optimized from two aspects: links and content. The main work of this thesis is as follows:

(1) The problems of current algorithms are described. Existing link-based detection methods rely on the existing link topology and ignore the possibility that the links themselves are spam. To address this, we first extract features from web content to form feature vectors, calculate the similarity of adjacent pages, identify spam links, and update the link weights according to link scores and hits.

(2) The TrustRank algorithm is optimized. The traditional TrustRank algorithm, which is based on the random walk model, propagates information in one direction only: if page A links to page B, whether A is a spam page affects the score of B. This thesis proposes a new algorithm, TDRank, based on a two-way random walk model, in which the scores of pages A and B affect each other; this prevents spam pages that link to many high-quality pages from obtaining high scores. In addition, the thesis studies other simple and fast algorithms as methods for selecting the seed set, in order to provide a suitable input feature vector for the TDRank algorithm and to make the experimental results accurate and effective.

(3) Based on WEBSPAM-UK2007, experiments are designed to test and verify the algorithms above. Experimental results on challenging real-world datasets show that the proposed algorithm is effective.
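
For orientation, the one-directional propagation of the traditional TrustRank baseline mentioned in the abstract can be sketched as follows. This is a minimal illustration of classic TrustRank as commonly described (a PageRank-style random walk whose restart step jumps only to trusted seed pages), not the thesis's TDRank algorithm, whose details the abstract does not give. The function name, the toy graph, the damping value 0.85, and the iteration count are all assumptions made for this example.

```python
def trustrank(out_links, seeds, alpha=0.85, iterations=50):
    """Sketch of classic TrustRank (assumed baseline, not the thesis's TDRank).

    out_links: dict mapping each page to the list of pages it links to.
    seeds:     set of pages assumed to be trusted (hypothetical seed set).
    """
    pages = sorted(set(out_links) | {q for ts in out_links.values() for q in ts})
    # Static seed distribution: uniform over the trusted seeds, zero elsewhere.
    d = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
    trust = dict(d)
    for _ in range(iterations):
        # Restart step: a (1 - alpha) share of the seed distribution.
        nxt = {p: (1.0 - alpha) * d[p] for p in pages}
        for p in pages:
            targets = out_links.get(p, [])
            if not targets:
                continue  # dangling page: its trust is not passed on in this sketch
            # Trust flows forward only, split evenly over the page's out-links.
            share = alpha * trust[p] / len(targets)
            for q in targets:
                nxt[q] += share
        trust = nxt
    return trust


# Toy graph (hypothetical): "spam" links to good pages, but nothing trusted links to it.
graph = {
    "trusted": ["news"],
    "news": ["trusted"],
    "spam": ["trusted", "news"],
}
scores = trustrank(graph, seeds={"trusted"})
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 4))
```

In this toy graph the score of "news" depends on "trusted" because "trusted" links to it, while "spam" receives no trust because no page links to it. Scores move along links in one direction only, which is the behaviour the two-way model in item (2) sets out to change.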
Keywords/Search Tags:Spam Link, Ranking Algorithm, Seed Selection, Web Similarity