Research On Classification Detection Method For Highly Imbalanced Data Of Web Spam

Posted on: 2019-07-10
Degree: Master
Type: Thesis
Country: China
Candidate: Z Lin
Full Text: PDF
GTID: 2348330569988912
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of Internet technologies, network applications have penetrated every aspect of people's lives. While these applications make life more convenient, malicious actors also exploit their vulnerabilities for personal gain. As a result, a large number of fraudulent web sites have been created to deceive visitors and to spread unhealthy and illegal information across the Internet. Because these web spam pages spread harmful information, reduce the quality of search engine services, and threaten network security, detecting them efficiently has become a hotspot of web security research.

From the perspective of classification algorithms, an imbalanced training dataset produces biased classification results: the classification accuracy on the majority class in the training and testing datasets is very high, while the accuracy on the minority class is very low. Based on a comparative analysis of common classification algorithms, this thesis selects the RandomForest algorithm as the basis and analyzes its performance on imbalanced training datasets. An ensemble learning algorithm is then proposed. Each sub-classifier of the ensemble uses the number of training samples to improve the detection accuracy of the minority class, and the similarity of training samples to maintain the detection accuracy of the majority class. Experimental results show that the proposed algorithm improves the detection accuracy of the minority class and achieves better detection accuracy across both classes.

From the perspective of dataset balance, the most straightforward solution to the imbalanced classification problem is to balance the training dataset. However, using only oversampling to increase the number of minority-class samples has a limited effect on classification performance. This thesis therefore proposes an intelligent hybrid data balancing algorithm. SMOTE, an oversampling algorithm improved here by reducing noise, is used to increase the number of minority-class samples; in addition, a cascaded undersampling algorithm based on outlier removal and density reduction is used to reduce the number of majority-class samples. At the same time, a simulated annealing algorithm is used to optimize the sampling parameters. Experimental results show that the proposed algorithm effectively enhances the performance of all the traditional classifiers used in the experiments and performs well with the RandomForest and C4.5 algorithms.
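The hybrid balancing idea described above can be illustrated with a small sketch. This is not the thesis's algorithm: the abstract does not specify the noise-reduction step, the density-reduction step, or the simulated annealing search, so the code below uses plain SMOTE-style interpolation and a simple centroid-distance outlier filter as stand-ins. The function names, the toy 2-D data, and the parameters `n_new`, `k`, and `keep` are all assumptions made for illustration; in the thesis, sampling parameters of this kind are what the simulated annealing step would tune.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority points by interpolating between a
    randomly chosen minority sample and one of its k nearest minority
    neighbours (the basic SMOTE idea, without any noise reduction)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + t * (o - b) for b, o in zip(base, other)))
    return synthetic


def undersample_by_outlier_removal(majority, keep):
    """Keep the `keep` majority samples closest to the majority centroid,
    dropping the farthest ones (a simple stand-in for outlier removal)."""
    dim = len(majority[0])
    centroid = tuple(
        sum(p[i] for p in majority) / len(majority) for i in range(dim)
    )
    return sorted(majority, key=lambda p: math.dist(p, centroid))[:keep]


# Toy imbalanced dataset (hypothetical): 95 majority vs. 5 minority points.
data_rng = random.Random(42)
majority = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(95)]
minority = [(data_rng.gauss(3, 0.5), data_rng.gauss(3, 0.5)) for _ in range(5)]

# Hybrid balancing: grow the minority class and shrink the majority class.
synthetic = smote_like_oversample(minority, n_new=25)
new_minority = minority + synthetic
new_majority = undersample_by_outlier_removal(majority, keep=60)

print(len(new_majority), len(new_minority))
```

Combining both directions moves the class ratio from 95:5 toward 60:30 without relying on oversampling alone, which the abstract notes has a limited effect by itself.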
Keywords/Search Tags: Imbalanced Data, Web Spam Detection, RandomForest Classification Algorithm, Tree-Structure Ensemble Framework, Outlier Detection Algorithm, Intelligent Balancing Algorithm