
Study On Under-sampling And Unbalanced Ensemble Classification For Web Spam Detection

Posted on: 2019-07-21
Degree: Doctor
Type: Dissertation
Country: China
Candidate: M S Chen
GTID: 1318330542477683
Subject: Information management and information systems
Abstract/Summary:
Web spam refers to websites and web pages that rank well in search engine results but offer little real value. It exists because search engine users tend to click on the top-ranked links, so site owners try every possible means to push their sites to the top of the rankings. Legitimate ways of improving a site's ranking are extremely costly, so all kinds of web cheating techniques are used extensively. Web spam weakens the authority of search engines, wastes large amounts of computing and storage resources, harms the legitimate interests of honest websites, and reduces the quality of search results. Web spam detection has therefore become one of the most important tasks for search engines. Drawing on the features of the WEBSPAM-UK2006 and WEBSPAM-UK2007 data sets, including their content-based, link-based, link transaction based and web graph based features, this paper uses under-sampling techniques and ensembles of C4.5 decision trees to detect web spam. The main work and achievements can be summarized in the following four aspects.

(1) Three random under-sampling algorithms (C4.5+RUS-once, C4.5+RUS-multiple, C4.5+RUS-replacement) are put forward for web spam detection. These methods improve classification performance on web spam detection samples through the diversity and class balance that under-sampling provides. The latter two methods in particular build a large number of diverse C4.5 decision tree classifiers and integrate them into an ensemble classifier. The ensemble classifiers improve the performance of web spam detection, and their detection results reach the state of the art. In addition, this paper proposes an ensemble decision tree classifier based on feature set partition and random under-sampling (C4.5+FP+RUS), whose classification performance also reaches the state of the art and greatly improves web spam detection.

(2) An immune clonal selection algorithm for feature selection (ICFSUS-ERC4.5) is proposed to select optimal feature subsets for web spam detection and then to build an ensemble C4.5 decision tree classifier based on under-sampling and the optimal feature subsets. The ensemble classifier ICFSUS-EC4.5 further improves classification performance for web spam detection, and its results exceed the state-of-the-art results.

(3) An optimal immune network algorithm (opt-aiNet) is improved to select an optimal feature set partition for web spam detection; the improved algorithm is called INFPUS-EC4.5. Since an ensemble classifier based on feature set partition and under-sampling can improve classification performance for web spam detection, a natural question arises: is there an optimal feature set partition that, combined with under-sampling, yields an optimal ensemble classifier? Under this assumption, the opt-aiNet algorithm is improved to select the optimal feature set partition. Although the experimental results show that the improved opt-aiNet algorithm is indeed a good optimization algorithm, it leads to overfitting on the classification task and cannot improve the final classification performance.

(4) An improved Co-Forest algorithm is proposed to improve the performance of web spam detection by using unlabeled data. Under the assumption that the features for web spam detection are sufficient and redundant, the algorithm builds an ensemble C4.5 decision tree classifier by improving the Co-Forest algorithm with under-sampling and feature subset selection techniques. The ensemble classifier is semi-supervised and is expected to improve classification performance for web spam detection by using the massive amount of unlabeled data. The experimental results show that the improved Co-Forest algorithm improves the final classification performance by using unlabeled data.
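
The general idea behind combining random under-sampling with an ensemble can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the abstract does not define the three RUS variants in detail, so the sketch simply draws a fresh balanced subset for each ensemble member and combines members by majority vote. A single-threshold decision stump stands in for C4.5, and the names `undersample`, `train_stump`, `train_ensemble`, and `predict`, the binary 0/1 labels, and the toy data are all assumptions made for illustration.

```python
import random
from collections import Counter


def undersample(X, y, rng):
    """Keep every minority sample plus an equal-size random draw
    (without replacement) from the majority class."""
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    min_idx = [i for i, label in enumerate(y) if label == minority]
    maj_idx = [i for i, label in enumerate(y) if label != minority]
    picked = rng.sample(maj_idx, len(min_idx))
    idx = min_idx + picked
    return [X[i] for i in idx], [y[i] for i in idx]


def train_stump(X, y):
    """Stand-in base learner (NOT C4.5): the single feature/threshold
    split with the highest training accuracy, for labels 0 and 1."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for hi_label in (0, 1):
                pred = [hi_label if row[f] > t else 1 - hi_label for row in X]
                acc = sum(p == label for p, label in zip(pred, y))
                if best is None or acc > best[0]:
                    best = (acc, f, t, hi_label)
    _, f, t, hi_label = best
    return lambda row: hi_label if row[f] > t else 1 - hi_label


def train_ensemble(X, y, n_members=11, seed=0):
    """Train each member on its own balanced under-sample."""
    rng = random.Random(seed)
    return [train_stump(*undersample(X, y, rng)) for _ in range(n_members)]


def predict(ensemble, row):
    """Majority vote over the ensemble members."""
    votes = Counter(clf(row) for clf in ensemble)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: 40 normal pages (label 0), 5 spam pages (label 1).
X = [[i / 10, float(i % 3)] for i in range(40)]
X += [[8.0 + j, float(j % 3)] for j in range(5)]
y = [0] * 40 + [1] * 5

ensemble = train_ensemble(X, y)
print(predict(ensemble, [10.0, 0.0]), predict(ensemble, [0.0, 0.0]))
```

Because each member sees a different random slice of the majority class, the members differ from one another while each is trained on balanced data, which is the source of the diversity and balance the abstract refers to.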
Keywords/Search Tags:web spam detection, decision tree, ensemble classification, feature selection, feature set partition, under-sampling, immune clonal algorithm, immune net algorithm, Co-Forest