The Research Of Technique On Anti-spamming Of Web Page

Posted on: 2021-04-24
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X Zhuang
Full Text: PDF
GTID: 1488306473472024
Subject: Computer Science and Technology
Abstract/Summary:
Web spam refers to Web pages that use deceptive techniques to trick search engine algorithms into ranking them higher than they deserve in search results. Web spam seriously degrades the user experience of search engines, causes economic losses for search engine companies, and obstructs the normal and orderly development of the Web. Web spam demotion and Web spam detection are two ways of combating Web spam. A Web spam demotion algorithm works on the link structure: it computes the probability of each node being spam or normal by propagating scores through links. These probabilities are then used to rank Web pages, so that normal pages are expected to rank higher than spam pages. Web spam detection uses machine learning algorithms to build binary classification models over a set of Web page features in order to detect spam.

Most Score Propagation Based Web Spam Demotion Algorithms (SPB-WSDAs) are based on the PageRank model and propagate "trust" or "distrust" from a labeled seed set to the other pages in the Web graph. SPB-WSDAs differ from one another in their propagation rules. Conventional SPB-WSDAs have three defects: 1) they lack a unified computing framework and theory; 2) they are poor at identifying content-spam pages, since only link information is used; 3) model improvements are largely empirical and lack a data-driven analysis method. To address these issues, the dissertation presents the following three pieces of research.

Firstly, we proposed the Unified Score Propagation Model (USPM) for Web spam demotion algorithms. USPM defines a unified computing framework at a higher and more abstract level, and sums up common design strategies for the different algorithm modules. In USPM, an SPB-WSDA consists of a Forward Score Propagation Function (FSPF) and a Backward Score Propagation Function (BSPF), each of which is composed of three sub-functions: a splitting function, an accepting function and a combination function. Thus, the differences among SPB-WSDAs reduce to the differences among their sub-functions. On the basis of USPM, we proposed the Supervised Forward and Backward Ranking (SFBR) algorithm. SFBR makes two important improvements: 1) it adopts an asymmetric design for FSPF and BSPF; 2) it uses a score normalization method to keep the utility of the static distribution vector from being either enhanced or diminished. Experiments conducted on three public datasets demonstrated the superiority of SFBR.

Secondly, we proposed the Deep Learning to Rank based Web Spam Demotion Algorithm (DLR-WSDA). DLR-WSDA uses a deep belief network to construct a preference function that evaluates the priority of a pair of data samples. Based on the estimated priorities of all pairs of data samples, we proposed a new aggregation algorithm, TRPA (Top-Ranking Probability based Algorithm), to obtain a total ranking of the data samples. DLR-WSDA benefits from the content features of Web pages and from the local property of TRPA, which improves computational efficiency. The experimental results show that DLR-WSDA outperforms other SPB-WSDAs.

Thirdly, we proposed a supervised PageRank algorithm, Learning Rank. Learning Rank uses a parametric model (e.g. a deep belief network) to learn the "propagation strategies" of an SPB-WSDA in an end-to-end way, instead of designing the strategies manually as in most conventional SPB-WSDAs. To this end, we designed a specific objective function and training algorithm for Learning Rank. Experiments on two real-world applications, Web spam demotion and recommendation, show the effectiveness of Learning Rank.

In Web spam detection, conventional decision tree algorithms cannot take advantage of the group structure of Web page features. To address this issue, we proposed the Dynamic Feature Bundling Decision Tree (DFBDT). DFBDT extends the definitions of information gain and information gain ratio in C4.5 from a single feature to a set of features. Three splitting algorithms are designed for DFBDT to find split points for a set of features: abstract optimal bundling, abstract greedy bundling and local greedy bundling. Based on DFBDT, we further proposed the Dynamic Feature Bundling Random Forest (DFBRF) algorithm. On the Web spam detection task, the experimental results show that: 1) DFBDT obtained a substantial performance improvement over C4.5; 2) DFBRF outperformed other popular Web spam detection algorithms.
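
As an illustration of the score propagation idea described above, the following is a minimal sketch of a TrustRank-style forward propagation pass, written so that the splitting, accepting and combination steps are separate functions. It is not the dissertation's USPM or SFBR: the function names, the uniform split, the damping value of 0.85 and the toy graph are all assumptions made for illustration, since the abstract does not give the actual definitions.

```python
from collections import defaultdict


def split_uniform(score, out_degree):
    """Splitting step (assumed rule): share a node's score evenly over its out-links."""
    return score / out_degree if out_degree else 0.0


def accept_all(share):
    """Accepting step (assumed rule): the target keeps each incoming share unchanged."""
    return share


def combine_sum(shares):
    """Combination step (assumed rule): add up the accepted shares."""
    return sum(shares)


def propagate_trust(graph, seeds, damping=0.85, iterations=50):
    """Propagate trust forward from a labeled seed set of normal pages.

    graph: dict mapping each node to the list of nodes it links to.
    seeds: set of nodes labeled as normal (trusted).
    """
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    # Static distribution vector: uniform over the seed set, zero elsewhere.
    static = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    scores = dict(static)
    for _ in range(iterations):
        inbox = defaultdict(list)
        for u in nodes:
            targets = graph.get(u, [])
            share = split_uniform(scores[u], len(targets))
            for v in targets:
                inbox[v].append(accept_all(share))
        # PageRank-style update biased toward the seed set.
        scores = {
            n: damping * combine_sum(inbox[n]) + (1 - damping) * static[n]
            for n in nodes
        }
    return scores


# Hypothetical toy graph: three normal pages and a two-page spam farm.
graph = {
    "good1": ["good2"],
    "good2": ["good1", "good3"],
    "good3": ["good1"],
    "spam1": ["spam2", "good1"],
    "spam2": ["spam1"],
}
scores = propagate_trust(graph, seeds={"good1"})
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy graph no normal page links to the spam farm, so the spam pages receive no trust and end up below every normal page in `ranking`. Replacing any of the three small functions with a different rule yields a different propagation algorithm, which is the kind of variation the unified model is meant to describe.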
Keywords/Search Tags: Combating Web spamming, PageRank, score propagation model, deep belief network, learning to rank, decision tree