Research On Web Spam Feature Analysis And Detection

Posted on: 2016-02-28
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H Ji
GTID: 1228330470450087
Subject: Management of engineering and industrial engineering
Abstract/Summary:
In the global information era, people acquire more and more of their information from the Internet. Over the past several decades the Internet has developed rapidly, and people all over the world have benefited from it. Over the same period, however, web spamming has gradually appeared, and the presence of web spam strongly degrades the quality of search results. In recent years web spam has become a very serious problem for the search engine industry: it can mislead search engines so that spam pages rank higher than reputable pages in the search results. Because most people now obtain a wealth of information online, the consequences of web spamming are increasingly severe.

In general, a page with a higher rank has a greater chance of being visited, since people usually read only the few top-ranked pages returned by a search engine. Driven by commercial motivations, some website or page owners attempt to deceive search engines into ranking their websites or pages higher than they deserve. This practice is called web spamming, and a page produced in this way is called web spam.

Web spam is a serious problem for search engines because it strongly deteriorates the quality of the search results and weakens the trust between users and search engine providers. On the one hand, it causes many people to waste a great deal of time on their online queries; on the other hand, it wastes a significant amount of the search engine's computational resources.
It has therefore become one of the main challenges that web search engines need to address, and methods for tackling spamming are needed.

Nowadays, there are three kinds of web spam: link spam, content spam, and cloaking. Because web spamming attempts to deceive search engines and causes much trouble for users' queries, many methods have been proposed to detect or demote web spam. At present, one of the most successful techniques for web spam detection relies on page features that take different values for spam and non-spam pages, since spam and non-spam pages have been shown to exhibit different statistical properties. In this thesis, we emphasize that there are indeed many differences between non-spam and spam pages in their content features and link features, which can be used as auxiliary information to demote and detect web spam. Furthermore, based on the TrustRank algorithm, we establish a link-based algorithm with bi-directional information feedback for spam detection. Extensive experimental results indicate that our method achieves satisfactory performance, better than that of the PageRank and TrustRank algorithms.

The main contents of this thesis include the following aspects.

1. Analysis on the content features and their correlation of web pages for spam detection

We analyze every individual feature and feature group of non-spam pages in contrast to those of spam pages in the content features dataset, and find that there are substantial differences between spam and non-spam.

We also explore all the content features for all pages. For non-spam pages, the distributions of the individual content features, including the number of words in the page, the number of words in the title, the average word length, the fraction of anchor text, the fraction of visible text, the compression rate of the page, the entropy, and the independent n-gram likelihood (LH), are all approximately consistent with common probability distributions within some range. For spam pages, by contrast, the individual content features show no such regularity. This suggests that spammers try to boost their search engine rankings in many different, essentially random ways, which directly causes the irregularity of the content features of spam pages.

We also propose formulae for calculating the correlations within content feature groups, such as the page entropy and independent LH group, as well as within the corpus precision, corpus recall, query precision, and query recall groups respectively. The experimental results indicate that the corpus precision and corpus recall of spam are higher than those of non-spam, which reflects the high usage rate of popular terms in spam pages compared with non-spam pages. The query precision of spam, however, is clearly lower than that of non-spam, and its query recall is also lower. The evidently lower query precision of spam indicates that web spam strongly deteriorates the quality of the search results and wastes a great deal of users' query time; the lower query recall reflects the same behavior.

In this thesis, we explore comprehensively all the content features for all pages.
The experimental results indicate that, both for the individual content features and for the content feature groups, there are obvious distinctions between non-spam pages and spam pages. The analysis of the content features of web pages and their groups aims to provide supplementary help for spam demotion and detection.

2. Exploration on the individual link features and feature groups

We also explore all the individual link features and link feature groups for all pages, and find significant differences between non-spam and spam pages.

In general, the number of spam websites whose eq_hp_mp is assigned 1 is greater than that of non-spam websites. For non-spam pages, the distributions of individual link features such as assortativity_hp(/mp), avgin_of_out_hp(/mp), avgout_of_in_hp(/mp), indegree_hp(/mp), outdegree_hp(/mp), pagerank_hp(/mp), prsigma_hp(/mp), and trustrank_hp(/mp) are all approximately consistent with common probability distributions within some range; the exceptions are the remaining link features reciprocity_hp and reciprocity_mp, which are not consistent with common probability distributions. The individual link features of spam reflect the many methods that spam makers adopt to rank their pages higher in search engines by adding in-links to spam pages, which changes the link structure; all these methods result in the link irregularity of spam.

The statistical results for link feature groups such as the neighbors group, the siteneighbors group, and the truncatedpagerank group also indicate that the sub-features of non-spam are very different from those of spam at the same level. The link feature groups also reflect the fact that spam subgraphs are, to some extent, independent of the whole network, and that the PageRank boost originating from predecessor pages is much higher for spam than for non-spam.
All these facts indicate that the link structure of spam is very different from that of non-spam.

Using several practical web spam data sets, we have found that, for most link features, there are many potential regularities in non-spam pages and very few in spam pages, which demonstrates that spam pages are made in random ways by spammers. The analysis of the link features of web pages and their groups proposed in this thesis aims to provide supplementary help for spam demotion and detection.

3. Web spam detection aided by the bi-directional trend information feedback

We propose a novel web spam detection method aided by bi-directional information feedback. In our view, each web page has two sides: a positive side that tends toward a reputable page, and a negative side that tends toward a spam page. Each web page therefore has a positive score and a negative score. We employ link and anti-link information feedback to demote and detect web spam. Here the bi-directional trend information results from two tendency functions calculated from the positive and negative scores of pages on the web. The bi-directional trend information flows through all the pages of the entire web in every iteration, and when the algorithm converges, the final positive and negative scores of each page are obtained. We thus obtain the scores of all pages and can demote and detect spam pages across the entire web. Extensive experimental results show that our method achieves satisfactory results and performs better than the PageRank and TrustRank algorithms.

In summary, web spam is one of the serious problems for search engines, and many methods have been proposed for spam detection and demotion. We explore all the content features and link features of non-spam in contrast to those of spam.
We summarize the differences in content features and link features between non-spam and spam pages, and propose that the notable content and link features may be used as auxiliary information for spam detection. Furthermore, we propose a link-based algorithm with bi-directional trend information feedback flowing on the web link graph for spam detection, which achieves better scores than the PageRank and TrustRank algorithms. We believe that techniques for spam detection and demotion will ensure the high query precision of search engines for the information diffusion of human society.
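To make a few of the content features named in the abstract concrete, here is a minimal Python sketch of how the number of words, the number of title words, the average word length, the compression rate, and the entropy of a page might be computed. The abstract does not give the thesis's exact definitions, so everything here is an assumption for illustration: the function name `content_features`, the regex tokenizer, the use of zlib for the compression rate (uncompressed size divided by compressed size), and word-level Shannon entropy.

```python
import math
import re
import zlib
from collections import Counter


def content_features(title: str, body: str) -> dict:
    """Compute a few per-page content features (illustrative definitions only)."""
    # Hypothetical tokenizer: lowercase runs of word characters.
    words = re.findall(r"\w+", body.lower())
    title_words = re.findall(r"\w+", title.lower())
    total = len(words)

    # Assumed definition of compression rate: uncompressed bytes divided by
    # zlib-compressed bytes. Highly repetitive text yields a high ratio.
    raw = body.encode("utf-8")
    compression_rate = len(raw) / len(zlib.compress(raw)) if raw else 0.0

    # Assumed definition of entropy: Shannon entropy, in bits per word,
    # of the page's word frequency distribution.
    counts = Counter(words)
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)

    avg_word_length = sum(len(w) for w in words) / total if total else 0.0

    return {
        "num_words": total,
        "num_title_words": len(title_words),
        "avg_word_length": avg_word_length,
        "compression_rate": compression_rate,
        "entropy": entropy,
    }


if __name__ == "__main__":
    # A keyword-stuffed page versus a short ordinary sentence.
    stuffed = content_features("Cheap pills", "buy cheap pills " * 200)
    plain = content_features(
        "River walk",
        "The quick brown fox jumps over the lazy dog near the quiet river bank.",
    )
    for name, feats in (("stuffed", stuffed), ("plain", plain)):
        print(name, feats)
```

In this toy comparison the keyword-stuffed page compresses far better than the ordinary sentence, which is the kind of per-feature difference between spam and non-spam pages that the abstract describes. A real study would compare the distributions of such features over a labeled data set, not two hand-written strings.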
Keywords/Search Tags: detection and demotion for web spam, content features, link features, the bi-directional trend information feedback