
Research On Web Spam Combating Algorithm Based On K-means

Posted on: 2018-08-25  Degree: Master  Type: Thesis
Country: China  Candidate: Y Meng  Full Text: PDF
GTID: 2348330542984998  Subject: Software engineering
Abstract/Summary:
With the rapid development of modern science and technology, the Internet has attracted more and more attention and has become an important way for people to obtain information. To gain more traffic, web spammers attempt to deceive search engines with web spam, which not only degrades the Internet environment but also reduces the accuracy and efficiency of search engines and gives users a poor experience. Web spam detection has therefore become one of the most serious challenges facing the Internet and search engines. Search engine cheating falls into two types: link cheating and content cheating. Link-based spam pages usually link to pages with high trust values to improve their own rankings, which decreases ranking accuracy. Content-based spam pages usually inflate their content similarity through keyword stuffing to deceive search engines. Based on these characteristics of spam pages, this thesis optimizes the PageRank algorithm and proposes the IPR (Individuation Page Ranking) algorithm. Furthermore, the IPK-Means (Individuation Page-Based K-Means) algorithm, which combines the IPR algorithm with the K-Means algorithm, is proposed to better detect spam pages. The main work of this thesis is as follows:
(1) The optimization of the PageRank algorithm. Current spam page detection algorithms are mainly based on either content or links. Because PageRank has the shortcoming of assigning equal weight to pages, this thesis optimizes it to assign edge weights according to the authority of pages.
(2) The optimization of the K-Means algorithm. Because the initial cluster centroids of K-Means are chosen randomly, improper centroids can lead to poor clustering. To address this, the thesis proposes the IPK-Means algorithm, which combines the IPR algorithm with the K-Means algorithm. Since the IPR value represents authority in the proposed algorithm, the page with the maximum IPR value is taken as the cluster centroid of non-spam pages, and the page with the minimum IPR value is taken as the cluster centroid of spam pages, in order to better detect spam pages.
(3) Based on WEBSPAM-UK2007, the thesis designs experiments to test and verify the algorithms above. Experimental results on challenging real-world datasets show that the proposed algorithm is effective.
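The two ideas in the abstract, an edge-weighted PageRank and a K-Means run whose two initial centroids are picked by rank, can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not give the IPR weighting formula or the page features used for clustering, so the edge weights, the one-dimensional feature, and the function names `weighted_pagerank` and `seeded_two_means` are all hypothetical.

```python
def weighted_pagerank(links, damping=0.85, iters=100):
    """Power-iteration PageRank over weighted edges.

    links: {source: {target: weight}}. The weights are supplied by the
    caller (an assumption; the thesis derives them from page authority
    in a way the abstract does not specify).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - damping) / n for p in pages}
        for src in pages:
            out = links.get(src, {})
            total = sum(out.values())
            if total == 0:
                # Dangling page: spread its rank evenly over all pages.
                for p in pages:
                    new[p] += damping * rank[src] / n
            else:
                # Each out-link receives a share proportional to its weight.
                for dst, w in out.items():
                    new[dst] += damping * rank[src] * w / total
        rank = new
    return rank


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def seeded_two_means(features, scores, iters=20):
    """Two-cluster K-Means seeded by rank instead of random centroids.

    features: {page: [float, ...]} (hypothetical page features).
    scores:   {page: float}, e.g. the output of weighted_pagerank.
    The highest-scoring page seeds the non-spam centroid and the
    lowest-scoring page seeds the spam centroid.
    """
    best = max(scores, key=scores.get)
    worst = min(scores, key=scores.get)
    centroids = {"non-spam": list(features[best]),
                 "spam": list(features[worst])}
    labels = {}
    for _ in range(iters):
        # Assignment step: nearest centroid for every page.
        new_labels = {
            p: min(centroids, key=lambda c: sq_dist(v, centroids[c]))
            for p, v in features.items()
        }
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in centroids:
            members = [features[p] for p, l in labels.items() if l == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels


# Toy data, invented for illustration only.
toy_links = {
    "a": {"b": 1.0, "c": 3.0},
    "b": {"c": 1.0},
    "c": {"a": 1.0},
    "s": {"c": 1.0},  # page "s" links out but nobody links to it
}
toy_features = {"a": [0.10], "b": [0.20], "c": [0.15], "s": [0.90]}

toy_rank = weighted_pagerank(toy_links)
toy_labels = seeded_two_means(toy_features, toy_rank)
print(toy_rank)
print(toy_labels)
```

Seeding by rank makes the result deterministic for a given graph, which is the property the abstract contrasts with randomly chosen centroids. Whether the seeds are good ones depends on how well the ranking separates the two classes, which this toy example does not test.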
Keywords/Search Tags: Spam Link, Ranking Algorithm, PageRank, K-Means