
Research on Web Page Filtering Based on the Random Forest Algorithm

Posted on: 2017-03-03
Degree: Master
Type: Thesis
Country: China
Candidate: Y Ma
Full Text: PDF
GTID: 2348330485498422
Subject: Systems analysis and integration
Abstract/Summary:
Web spam refers to web pages that achieve undeservedly high rankings in search engines through improper means. Such pages are designed around search engine ranking rules, with deceptive content coded into the page background, so as to maximize the creators' illicit gains. For users, this degrades the user experience (UE) and sharply erodes trust in search engines; for search engine companies, it wastes large amounts of computing and storage resources. Web spam is widely regarded as a major challenge for web search, so research on anti-cheating methods is significant.

This paper first surveys and analyzes detection techniques for web spam, and on that foundation studies how to optimize data pre-processing in the web spam detection pipeline. The main contributions of this paper are as follows:

An improved algorithm for Random Forest is proposed. To address the class imbalance in web spam data sets, this paper proposes an improved SMOTE named BKM_SMOTE. When SMOTE synthesizes a new sample, the data distribution changes and the boundary between the positive and negative classes can become blurred. This paper incorporates a clustering algorithm to correct these problems in SMOTE: Bisecting K-Means is used to cluster the minority-class samples, the center of each cluster is computed, and new samples are interpolated on the line between the cluster center and each sample point. This sampling strategy better preserves the distribution of the data, offsets the imbalance of web spam data sets, and improves the classification performance of Random Forest on such problems to some extent.

To address the poor classification performance on imbalanced data sets, this paper balances the data set with BKM_SMOTE before applying the random forest algorithm, then trains and classifies on the balanced data.
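The oversampling step described above can be sketched as follows. This is a minimal illustration, not the thesis's actual implementation: the helper names (`bisecting_kmeans`, `bkm_smote`), the choice of `k`, and the random interpolation factor are all assumptions; only the overall scheme (bisecting k-means clustering of the minority class, then interpolating between each cluster's center and its member samples) comes from the text.

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, random_state=0):
    """Cluster rows of X into k groups by repeatedly bisecting the largest cluster."""
    clusters = [np.arange(len(X))]
    while len(clusters) < k:
        # Pick the largest splittable cluster and split it in two with plain k-means.
        largest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        idx = clusters.pop(largest)
        labels = KMeans(n_clusters=2, n_init=10,
                        random_state=random_state).fit_predict(X[idx])
        clusters.append(idx[labels == 0])
        clusters.append(idx[labels == 1])
    return clusters

def bkm_smote(X_min, n_new, k=3, random_state=0):
    """Generate n_new synthetic minority samples: each lies on the line between
    a minority sample and the center of its bisecting-k-means cluster."""
    rng = np.random.default_rng(random_state)
    clusters = bisecting_kmeans(X_min, k, random_state)
    synthetic = []
    for _ in range(n_new):
        idx = clusters[rng.integers(len(clusters))]   # random cluster
        center = X_min[idx].mean(axis=0)              # its center
        sample = X_min[rng.choice(idx)]               # random member
        gap = rng.random()                            # interpolation factor in [0, 1)
        synthetic.append(sample + gap * (center - sample))
    return np.asarray(synthetic)
```

Because each synthetic point is a convex combination of a real sample and its cluster center, the new samples stay inside the region occupied by the minority class, which is the property the thesis relies on to avoid blurring the class boundary.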
Experimental analysis shows that the detection system identifies spam pages with a precision of (84±0.75)%, and processes a single web page in about 702 ms on average, a clear improvement in filtering effect over the filters currently in use.
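The overall pipeline (balance the training set, then train and evaluate a Random Forest) can be sketched on toy data. This is an illustrative stand-in, not the thesis's experiment: the synthetic data set, the single-centroid interpolation used here in place of the full BKM_SMOTE, and all parameter values are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

# Imbalanced toy data standing in for web spam features (spam = class 1, ~10%).
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Oversample the minority class by interpolating each picked sample toward
# the minority centroid -- a simplified stand-in for BKM_SMOTE pre-processing.
rng = np.random.default_rng(0)
minority = X_tr[y_tr == 1]
center = minority.mean(axis=0)
n_new = int((y_tr == 0).sum() - (y_tr == 1).sum())
picks = minority[rng.integers(len(minority), size=n_new)]
synthetic = picks + rng.random((n_new, 1)) * (center - picks)

# Train and classify on the balanced data set.
X_bal = np.vstack([X_tr, synthetic])
y_bal = np.concatenate([y_tr, np.ones(n_new, dtype=int)])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_bal, y_bal)

precision = precision_score(y_te, clf.predict(X_te))
print(f"spam-class precision: {precision:.3f}")
```

Precision on the spam class is the same headline metric the thesis reports; balancing before training is what lets the forest see enough minority examples to learn their boundary.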
Keywords/Search Tags:Web Spam, Bisecting K-Means, SMOTE, Random Forests, imbalanced data sets