
Research On Multi-view Learning For Web Spam Detection

Posted on: 2015-03-13  Degree: Master  Type: Thesis
Country: China  Candidate: S Gao  Full Text: PDF
GTID: 2268330425496248  Subject: Computer software and theory
Abstract/Summary:
The Internet has greatly changed the way people express themselves and interact with others, and it has become one of the most important means of retrieving information. As adding information to HTML pages and other network files becomes easier, it becomes harder for users to distinguish accurate from inaccurate information and reliable from unreliable information. Building an effective method for detecting spam pages is therefore a major current challenge. Web spam detection usually focuses on two types of spam pages: content-based spam pages and link-based spam pages. Existing web spam detection methods typically rely on web features from a single view, whereas multi-view learning methods, which use both kinds of features, can address the problem more comprehensively.

This thesis focuses on multi-view learning for web spam detection. It covers multi-view feature extraction methods, classification methods, and the use of web link structure. The main contributions can be summarized as follows:

(1) We treat the content-based and link-based features of the web spam detection problem as two different views. We first apply canonical correlation analysis (CCA) and its extensions for feature extraction, which generates two new feature sets for each web page. We then combine the two new feature sets in different ways to produce a single view of each web page, which is used to train different classifiers. Our experimental results show that treating web page data as two-view data and applying multi-view canonical correlation analysis techniques effectively improves the recognition accuracy of web spam.

(2) Only a small number of labeled pages are available in web spam detection, so we use semi-supervised co-training to detect web spam pages. We divide the page features into two views, the content view and the link view. We first extract the independent components of each view using independent component analysis (ICA) and then use co-training to predict the label of each web page. Experimental results show that this method effectively improves the recognition accuracy of web spam. The results also verify that applying ICA to each view separately is more effective than the other methods.

(3) We modify the SVM classifier to exploit web link structure. We first construct a link-structure-preserving within-class scatter matrix from the direct link matrix and the indirect link matrix. We then incorporate the web link structure into the SVM classifier, which reformulates its optimization problem. The proposed method thus takes advantage of the link information on the web. Experimental results on a web spam dataset show that this combination of web link structure and SVM classifier significantly outperforms related methods, and they show how classification accuracy changes with the number of rounds of indirect links.

(4) We tackle web spam detection by taking the different formulations and statistical properties of the content and link views into full consideration. We reformulate principal component analysis (PCA) for the content features and locality preserving projections (LPP) for the link features, and then incorporate them into our method to derive a consensus pattern from multiple embeddings of multiple representations. An iterative algorithm computes the embedding for each view and simultaneously constructs the transformation from the consensus pattern to the representation of each view. We also provide a method for computing the consensus pattern for out-of-sample data points. Our experimental results on the WEBSPAM-UK2006 and WEBSPAM-UK2007 datasets demonstrate that using the consensus pattern for web spam detection outperforms related dimensionality reduction approaches.
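The co-training step described in contribution (2) can be illustrated with a minimal sketch. It is not the thesis's implementation: the nearest-centroid classifier, the toy feature values, and the names `centroids`, `predict`, and `co_train` are all assumptions made for illustration, and the ICA preprocessing is omitted. Each view trains on the shared labeled pool in turn and adds its single most confident prediction to that pool.

```python
from math import dist

def centroids(X, y):
    """Mean feature vector per class label (hypothetical stand-in classifier)."""
    out = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        out[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return out

def predict(cents, x):
    """Return (label, margin): nearest centroid and the gap to the runner-up."""
    ranked = sorted((dist(x, c), label) for label, c in cents.items())
    margin = ranked[1][0] - ranked[0][0] if len(ranked) > 1 else 0.0
    return ranked[0][1], margin

def co_train(content_view, link_view, seed_labels, rounds=10, per_round=1):
    """Grow the labeled pool by alternating between the two views."""
    labels = dict(seed_labels)          # page index -> 0 (normal) or 1 (spam)
    n = len(content_view)
    for _ in range(rounds):
        added = False
        for view in (content_view, link_view):
            labeled = sorted(labels)
            cents = centroids([view[i] for i in labeled],
                              [labels[i] for i in labeled])
            candidates = []
            for i in range(n):
                if i not in labels:
                    label, margin = predict(cents, view[i])
                    candidates.append((margin, i, label))
            # The most confident predictions of this view join the shared pool.
            candidates.sort(reverse=True)
            for margin, i, label in candidates[:per_round]:
                labels[i] = label
                added = True
        if not added:                   # nothing left to label
            break
    return labels

# Toy data (made up): six pages, two features per view, two seed labels.
content = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.1],
           [0.85, 0.85], [0.5, 0.45], [0.55, 0.6]]
link = [[0.0, 0.1], [1.0, 0.9], [0.4, 0.5],
        [0.6, 0.5], [0.1, 0.0], [0.9, 1.0]]
result = co_train(content, link, {0: 0, 1: 1})
print(result)
```

In this toy data, page 4 is ambiguous in the content view and page 2 is ambiguous in the link view. Each is labeled by the view in which it is clearly separated, which is the intuition behind splitting the features into two views.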
Keywords/Search Tags:Multi-view learning, Web spam detection, Canonical correlation analysis, Co-training, Support vector machine, Link structure, Feature extraction