
Web Spam Detection With Learning To Rank

Posted on: 2015-03-20  Degree: Master  Type: Thesis
Country: China  Candidate: Y Liu  Full Text: PDF
GTID: 2268330425495911  Subject: Computer software and theory
Abstract/Summary:
Research shows that eighty percent of search engine users look at no more than the first three pages of results. In a result list, therefore, the higher a page ranks, the greater the profit it brings. Driven by this profit, many web pages obtain higher rankings by deceiving search engines; such pages are called web spam. Web spam interferes with users' access to information, damages the reputation of search engines, and weakens users' trust in them, so web spam detection is one of the major challenges search engines face. To detect spam pages effectively, we analyzed how content features and link features are distributed over normal pages and spam pages, and we combined content features and link features with machine learning methods and ranking algorithms. Details are as follows:

1. TrustRank is a ranking algorithm based on link structure. The traditional TrustRank algorithm detects spam pages using link information only, and this is not effective for all spam pages. For example, a set of web pages may provide useful resources to attract links from other pages while also containing many links, possibly hidden, to a target page in order to cheat the search engine. In this way, the TrustRank value of the target page can become very high. Web spam whose topology is very similar to that of normal pages is difficult to detect from links alone, and for such pages detection based on content features is effective. We therefore extracted web content features, analyzed their distribution, and used the difference between the distributions of normal-page and spam-page content features, combined with the pages' link features, to detect spam pages.

2. Spam detection methods based on content features consider only the content of a page and have difficulty adapting to evolving spamming techniques, while detection methods based on link structure ignore the content of pages; as noted above, spam whose topology resembles that of normal pages is hard to detect from topology alone. We analyzed the distributions of web content features and link features and found that the features of normal pages are distributed regularly, whereas the features of spam pages are scattered. We therefore fit a function to the distribution of normal-page features and then compute the difference between the observed page proportion and the fitted distribution function. This difference is small for normal pages and large for spam pages. Finally, we use decision trees to detect spam pages, with the difference serving as the threshold.

3. Many researchers treat web spam detection as a classification problem and train machine learning classifiers, such as SVMs and decision trees, to detect spam pages. In our opinion, web spam detection can also be regarded as a ranking problem. The basic requirement of the ranking model is that normal pages rank higher and spam pages rank lower, so that users are not disturbed by spam pages when using search engines. We first obtained content feature vectors by analyzing the distribution of the content features, then trained a model on these content feature vectors, with the values to be learned determined by link information. Finally, we used the model to rank pages.

Web spam not only prevents users from finding useful information through search engines but also wastes a great deal of search engine resources, because a search engine must process many spam pages when indexing pages and answering user requests. Research on spam page detection is therefore of practical significance.
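The link-based TrustRank scoring that the abstract takes as its starting point can be illustrated with a short sketch. This is a minimal version of the generally known TrustRank iteration (trust flows from hand-picked seed pages along outgoing links, damped at each step), not the thesis's own implementation. The function name `trustrank`, the damping value 0.85, the iteration count, and the toy graph are all assumptions chosen for illustration.

```python
def trustrank(out_links, trusted_seeds, damping=0.85, iterations=50):
    """Propagate trust from seed pages along outgoing links (illustrative sketch)."""
    # Collect every page that appears as a source or as a link target.
    pages = sorted(set(out_links) | {t for targets in out_links.values() for t in targets})
    seeds = [p for p in pages if p in trusted_seeds]
    if not seeds:
        raise ValueError("at least one trusted seed must appear in the graph")

    # Static trust distribution: uniform over the seeds, zero elsewhere.
    seed_score = {p: (1.0 / len(seeds) if p in trusted_seeds else 0.0) for p in pages}
    trust = dict(seed_score)

    for _ in range(iterations):
        # Each round, every page restarts from its share of the seed trust...
        nxt = {p: (1.0 - damping) * seed_score[p] for p in pages}
        # ...and receives a damped, evenly split share from each page linking to it.
        for page in pages:
            targets = out_links.get(page, [])
            if not targets:
                continue  # dangling page: its trust is not passed on
            share = damping * trust[page] / len(targets)
            for target in targets:
                nxt[target] += share
        trust = nxt
    return trust


# Hypothetical toy graph: "bait" attracts a link from a good page and
# forwards the trust it receives to "target".
graph = {
    "seed": ["good"],
    "good": ["seed", "bait"],
    "bait": ["target"],
    "target": [],
    "isolated_spam": ["target"],
}
scores = trustrank(graph, trusted_seeds={"seed"})
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy graph, `target` ends up with a nonzero trust score only because `bait` earned a link from a trusted page, which is the weakness of purely link-based detection that the abstract describes. A page that no trusted page reaches, such as `isolated_spam`, receives no trust at all.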
Keywords/Search Tags: Spam pages, search engines, page ranking, trust value, content features, link features, ranking model