
Research On Web Spam Detection Method Based On Xgboost Algorithm

Posted on: 2021-11-18
Degree: Master
Type: Thesis
Country: China
Candidate: H Li
GTID: 2518306473980819
Subject: Computer technology
Abstract/Summary:
With the development and application of Internet technology over the past 20 years, people's lives have become ever more closely tied to the network. At the same time, many lawbreakers try to use Internet technology to harm the physical and mental health of network users and the security of their property. Web spam is a common means of fraud: by disguising the link behavior and content of web pages, spammers deceive search engines and Internet users in order to spread pornographic, gambling, or drug-related information and to steal users' private data. Accurately identifying spam pages is therefore an urgent problem. Because spam pages are far fewer than normal web pages, traditional classification algorithms struggle to learn the characteristics of spam pages fully and, as a result, to identify them correctly.

This thesis chooses the Xgboost algorithm as the base detection algorithm, focuses on how imbalanced spam page data affects it, and proposes an Xgboost optimization based on a gradient distribution adjustment strategy (LCGHA-Xgboost). The method defines the Loss Contribution Density (LCD) to measure how easily a sample is classified correctly by Xgboost, and adjusts the first-order gradient distribution of individual samples according to the LCD so as to increase the share of the loss contributed by spam pages, with the goal of improving Web spam detection accuracy. Comparison experiments show that LCGHA-Xgboost recognizes difficult samples such as spam pages more effectively than the comparison algorithms.

This thesis also investigates the costs of misclassifying spam and normal web pages, and constructs a cost-sensitive spam detection mechanism. The mechanism introduces cost-sensitive learning into Xgboost to obtain a cost-sensitive Xgboost algorithm (CS-Xgboost), and uses an improved bat algorithm to find the misclassification cost parameters that give the best spam detection accuracy. To address problems of the Bat Algorithm (BA), such as inadequate search accuracy and a tendency to fall into local optima, a dynamic weighted bat algorithm based on Cauchy mutation and bit mutation (CBDW-BA) is proposed. CS-Xgboost is encapsulated in the fitness function of the improved bat algorithm: the misclassification cost parameters of cost-sensitive learning are the variables the evolutionary algorithm optimizes, and the AUC of the resulting classifier is the fitness value. Together these form the cost-sensitive spam detection mechanism (CSSDM). On the one hand, the mechanism preserves the classification performance of the cost-sensitive algorithm; on the other hand, it avoids the influence of manually chosen misclassification costs.

To verify the effectiveness of the mechanism, this thesis first runs performance tests on the improved bat algorithm; the results show that its optimization performance and convergence are greatly improved. Next, several traditional ensemble learning algorithms and notable research results from recent years are selected as comparison algorithms. The experimental results show that the proposed CSSDM outperforms the comparison algorithms, improving the detection of spam pages and reducing the loss caused by classification errors.
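The cost-sensitive idea described above can be illustrated with a small sketch. It is not the thesis's formulation: it assumes a plain cost-weighted logistic loss, and the names `cost_sensitive_grad_hess`, `cost_spam`, `cost_normal`, and `auc` are hypothetical. The first function computes the per-sample first- and second-order gradients that a gradient-boosting learner such as Xgboost consumes; the second computes the AUC that a search routine could use as a fitness value.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def cost_sensitive_grad_hess(margins, labels, cost_spam=5.0, cost_normal=1.0):
    """Gradients of an assumed cost-weighted logistic loss.

    loss_i = -w_i * (y_i * log(p_i) + (1 - y_i) * log(1 - p_i)),
    where w_i = cost_spam for spam (y_i == 1) and cost_normal otherwise.
    The cost values are illustrative, not taken from the thesis.
    """
    grads, hess = [], []
    for z, y in zip(margins, labels):
        p = sigmoid(z)
        w = cost_spam if y == 1 else cost_normal
        grads.append(w * (p - y))        # first-order gradient w.r.t. the margin
        hess.append(w * p * (1.0 - p))   # second-order gradient w.r.t. the margin
    return grads, hess


def auc(scores, labels):
    """AUC as the fraction of (spam, normal) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

Under these assumptions, a parameter search (the improved bat algorithm in the thesis, or any other optimizer) would propose a value for `cost_spam`, train a model with the resulting gradients, and score the proposal with `auc` on held-out data.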
Keywords: Imbalanced Data, Spam Page Detection, Xgboost, Gradient Distribution, Loss Contribution Density, Cost Sensitive, Bat Algorithm