Font Size: a A A

The Research On Key Technologies In Web Spam Detection Based On Genetic Programming And Ensemble Learning

Posted on:2013-08-20Degree:DoctorType:Dissertation
Country:ChinaCandidate:X F NiuFull Text:PDF
GTID:1228330395470276Subject:Computer system architecture
Abstract/Summary:PDF Full Text Request
With the explosive growth of information on the web, search engine has become an important tool to help people find their desired information in daily lives. Given a certain query, search engines can generally return thousands of pages, but most users read only the first few ones. Therefore, the page ranking is highly important in search engines. So many people employ some means to deceive the ranking algorithm of search engines to enable some web pages to achieve undeserved high ranking values, which can attract the attention of users and help obtain some benefits. All the deceptive actions that try to increase the ranking of a page in search engines are generally referred to as Web spam. Web spam seriously deteriorates search engine ranking results, leads to great obstacle in users’information acquisition process and brings the poor user experience. From the point of view of a search engine, even if spam pages are not ranked sufficiently high to annoy users, there is a cost to crawl, index and store spam pages. Detecting web spam has become one of the top challenges in the research of web search engines.According to the characteristic of Web Spam dataset, this thesis focused on constructing classifiers based on the features of web pages in order to improve the Web Spam detection performance. It contains the following three parts:(1) Developed to learn a discriminating function to detect Web Spam by Genetic ProgrammingAn individual is defined as a discriminating function to detect Web Spam. Genetic Programming could find the optimized discriminating function to improve the Web Spam detection performance after genetic operators. However, one problem occurs when Genetic Programming is employed to generate discriminating functions. It is difficult to know the proper length of an individual because we have no prior knowledge about optimal solutions. If the length of an individual is too short, the individual contains few features and its discrimination is poor. The classification performance of the corresponding functional expression is not good. If we want to make full use of the features in Web Spam dataset, such as content features, link features and so on, the length of the discriminating function need to be longer and the scale of the corresponding individual is larger. For the population composed of some large-scale individuals, construction and search require more time. Based on the principle that a long discriminating function is composed of some short ones, this paper proposes a new method to learn a discriminating function to detect web spam by Genetic Programming. This method first constructs multi-populations composed of some small-scale individuals and every population can generate one best individual belonging to the population by genetic operators. Then the best individuals in every population are combined by Genetic Programming to gain a possible best discriminating function. This method can generate a better discriminating function to detect Web Spam within less time. We also study the effect of the depth of the binary trees representing the individuals in the Genetic Programming evolution process and the efficiency of the combination.We perform experiments on WEBSPAM-UK2006. The experimental results show that:(1) the multi-population Genetic Programming by two combinations can improve spam classification recall performance by5.6%, F-measure performance by2.25%and accuracy performance by2.83%compared with one population Genetic Programming;(2) the approach can improve spam classification recall performance by26%, F-measure performance by11%and accuracy performance by4%compared with SVM.(2) Developed to detect web spam by ensemble learning algorithm based on Genetic Programming.At present, most Web Spam detection methods based on classification only employ one classification algorithm to create base classifiers, and ignore the imbalance between spam and normal samples, i.e. normal samples are much more than spam ones. Since there are many types of Web Spam techniques and new types of spam are being developed continually, it is impossible to expect that we are able to find an omnipotent classifier to detect any kinds of Web Spam. Integrating the detection results of multi-classifiers is a way to find an enhanced classifier for Web Spam detection, and ensemble learning is also one of effective methods for the classification problem on the imbalanced dataset. Two key issues in ensemble learning are how to generate diverse base classifiers and how to integrate their results. This paper proposes to detect Web Spam by ensemble learning algorithm based on Genetic Programming. This new method first generates multiple diverse base classifiers, which use different classification algorithms and are trained on different instances and features. Then Genetic Programming is utilized to learn a novel classifier, which gives the final detection result based on the detection results of base classifiers.This method generates diverse base classifiers with different data sets and classification algorithms according to the characteristic of Web Spam Dataset. Ensemble on the results of base classifiers by Genetic Programming can not only be easy to integrate their classification results of heterogeneous base classifiers to improve classification performance, but also to select part of base classifiers for integration to reduce prediction time. This approach also combines the under-sampling technology with ensemble learning to improve the classification performance on imbalanced datasets. In order to verify the effectiveness of the Genetic Programming-based ensemble learning, we perform experiments on balanced and imbalanced data sets respectively. The experiments on the balanced dataset first analyze the effect of classification algorithms and feature sets on the ensemble. Then the experimental results are compared with those of some known ensemble learning algorithms and the results show that the new approach performs better than some known ensemble learning algorithms in terms of precision, recall, F-measure, accuracy, Error Rate and AUC. The experiments on the imbalanced dataset show that this method can improve the classification performance whether the base classifiers belong to the same type or not, and in most cases the heterogeneous classifier ensembles work better than the homogeneous ones. The F-measure of this new assemble method is higher than those of AdaBoost, Bagging, RandomForest, Vote, EDKC algorithm and the method based on Prediction Spamicity.(3) Developed to generate new features by Genetic Programming to detect Web Spam.For classification problem, features play an important role. In publicly available WEBSPAM-UK2006dataset, there are96content-based features,41link-based features andl38transformed link-based features. The transformed link-based features are the simple combination or logarithm operation of the link-based features manually which needs to be accomplished by experts and is labor-intensive. In addition, it is not easy to combine different kinds of features, such as content features and link features. This method proposed to derive new discriminating features using GP from existing features and use these newly generated features as the inputs to a SVM classifier and GP classifiers for web spam detection.Experiments on WEBSPAM-UK2006show that the classification results of the classifiers that use10new features are much better than those of the classifiers that use original41link-based features and are equivalent to those of the classifiers that use138transformed link-based features.
Keywords/Search Tags:Web Spam Detection, Genetic Programming, Ensemble Learning, Classification on the Imbalanced Dataset
PDF Full Text Request
Related items