
Research on Web Page Filtering Based on the Random Forest Algorithm

Posted on: 2017-03-03
Degree: Master
Type: Thesis
Country: China
Candidate: Y Ma
Full Text: PDF
GTID: 2348330485498422
Subject: Systems analysis and integration
Abstract/Summary:
Web spam refers to web pages that achieve undeservedly high rankings in search engines through improper means. Such pages are designed around search engine ranking rules, with deceptive content coded into the page background, so as to maximize the creators' illicit gains. For users, this degrades the user experience (UE) and sharply erodes trust in search engines; for search engine companies, it wastes large amounts of computing and storage resources. Web spam is widely regarded as a major challenge for web search, so research on anti-cheating methods is significant.

This paper first surveys and analyzes detection techniques for web spam, and on that foundation studies how to optimize data pre-processing in the web spam detection pipeline. The main contributions of this paper are as follows:

An improved algorithm for Random Forest is proposed. To address the class imbalance in web spam data sets, this paper proposes an improved SMOTE named BKM_SMOTE. When SMOTE synthesizes a new sample, the data distribution changes and the boundary between the positive and negative classes can become blurred. This paper incorporates a clustering algorithm to correct these problems in SMOTE: Bisecting K-Means is used to cluster the minority-class samples, the center of each cluster is computed, and new samples are interpolated on the line between the cluster center and each sample point. This sampling strategy better preserves the distribution of the data, offsets the imbalance of web spam data sets, and improves the classification performance of Random Forest on such problems to some extent.

To address the poor classification performance on imbalanced data sets, this paper balances the data set with BKM_SMOTE before applying the random forest algorithm, then trains and classifies on the balanced data.
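The oversampling step described above can be sketched as follows. This is a minimal illustration, not the thesis's actual implementation: the helper names (`bisecting_kmeans`, `bkm_smote`), the choice of `k`, and the random interpolation factor are all assumptions; only the overall scheme (bisecting k-means clustering of the minority class, then interpolating between each cluster's center and its member samples) comes from the text.

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, random_state=0):
    """Cluster rows of X into k groups by repeatedly bisecting the largest cluster."""
    clusters = [np.arange(len(X))]
    while len(clusters) < k:
        # Pick the largest splittable cluster and split it in two with plain k-means.
        largest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        idx = clusters.pop(largest)
        labels = KMeans(n_clusters=2, n_init=10,
                        random_state=random_state).fit_predict(X[idx])
        clusters.append(idx[labels == 0])
        clusters.append(idx[labels == 1])
    return clusters

def bkm_smote(X_min, n_new, k=3, random_state=0):
    """Generate n_new synthetic minority samples: each lies on the line between
    a minority sample and the center of its bisecting-k-means cluster."""
    rng = np.random.default_rng(random_state)
    clusters = bisecting_kmeans(X_min, k, random_state)
    synthetic = []
    for _ in range(n_new):
        idx = clusters[rng.integers(len(clusters))]   # random cluster
        center = X_min[idx].mean(axis=0)              # its center
        sample = X_min[rng.choice(idx)]               # random member
        gap = rng.random()                            # interpolation factor in [0, 1)
        synthetic.append(sample + gap * (center - sample))
    return np.asarray(synthetic)
```

Because each synthetic point is a convex combination of a real sample and its cluster center, the new samples stay inside the region occupied by the minority class, which is the property the thesis relies on to avoid blurring the class boundary.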
Experimental analysis shows that the detection system identifies spam pages with a precision of (84±0.75)%, and processes a single web page in about 702 ms on average, a clear improvement in filtering effect over the filters currently in use.
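The overall pipeline (balance the training set, then train and evaluate a Random Forest) can be sketched on toy data. This is an illustrative stand-in, not the thesis's experiment: the synthetic data set, the single-centroid interpolation used here in place of the full BKM_SMOTE, and all parameter values are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

# Imbalanced toy data standing in for web spam features (spam = class 1, ~10%).
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Oversample the minority class by interpolating each picked sample toward
# the minority centroid -- a simplified stand-in for BKM_SMOTE pre-processing.
rng = np.random.default_rng(0)
minority = X_tr[y_tr == 1]
center = minority.mean(axis=0)
n_new = int((y_tr == 0).sum() - (y_tr == 1).sum())
picks = minority[rng.integers(len(minority), size=n_new)]
synthetic = picks + rng.random((n_new, 1)) * (center - picks)

# Train and classify on the balanced data set.
X_bal = np.vstack([X_tr, synthetic])
y_bal = np.concatenate([y_tr, np.ones(n_new, dtype=int)])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_bal, y_bal)

precision = precision_score(y_te, clf.predict(X_te))
print(f"spam-class precision: {precision:.3f}")
```

Precision on the spam class is the same headline metric the thesis reports; balancing before training is what lets the forest see enough minority examples to learn their boundary.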
Keywords/Search Tags:Web Spam, Bisecting K-Means, SMOTE, Random Forests, imbalanced data sets