The Design And Implementation Of Web Spam Detection System

Posted on: 2011-10-02
Degree: Master
Type: Thesis
Country: China
Candidate: X J Yang
GTID: 2178330332971270
Subject: Software engineering
Abstract/Summary:
Web spam refers to actions intended to mislead search engines into ranking some pages higher than they deserve. The amount of web spam has increased dramatically in recent years, making it one of the major challenges for web search engines: it degrades the quality of search results, damages the reputation of search engines, and weakens the trust of their users. Detecting web spam can therefore improve search engine ranking results.

This thesis reviews the current state of domestic and international research on web spam detection technology, focusing on the basic principles and features of the detection techniques, and summarizes the limitations of current approaches. It also describes the requirements, design and implementation of a web spam detection system in detail.

A machine learning framework is designed for the system. Three kinds of features are extracted: content-based features, hyperlink-related features and host-level link analysis features. Machine learning based web spam detection is then performed on all of these features. The system consists of two parts, feature extraction and a classifier. In the feature extraction part, the effectiveness of the various features is analyzed in detail. In the classifier part, a random forests classifier is designed. Random forests are first used to calculate the importance of each individual feature, and the importance scores are then used in modeling to obtain the optimal random forests model. The resulting classifier is then used to detect web spam at the host level. Machine learning based detection methods have the advantage of adapting easily to newly developed spam techniques, and random forests, the detection algorithm used in the system, have been proven effective for spam detection.
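The importance-driven modeling step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the small pure-Python forest, the function names (`fit_forest`, `select_and_refit`), the four feature names and the synthetic data are all invented for the example, and importance is measured here as accumulated Gini impurity decrease, which is one common choice among several.

```python
import random
from collections import Counter

# Hypothetical feature names; the abstract does not list the actual features.
FEATURES = ["compression_ratio", "avg_word_length", "out_degree", "host_pagerank"]

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(rows, labels, candidates):
    """Best (gain, feature, threshold) over the candidate features, or None."""
    parent, n, best = gini(labels), len(rows), None
    for f in candidates:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            gain = parent - (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or gain > best[0]:
                best = (gain, f, t)
    return best

def grow(rows, labels, depth, n_try, rng, importance):
    """Grow one tree; leaves are class labels, inner nodes are tuples."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == 0 or len(set(labels)) == 1:
        return majority
    split = best_split(rows, labels, rng.sample(range(len(rows[0])), n_try))
    if split is None or split[0] <= 0.0:
        return majority
    gain, f, t = split
    importance[f] += gain * len(rows)  # impurity decrease weighted by node size
    parts = {True: ([], []), False: ([], [])}
    for r, y in zip(rows, labels):
        parts[r[f] <= t][0].append(r)
        parts[r[f] <= t][1].append(y)
    return (f, t,
            grow(*parts[True], depth - 1, n_try, rng, importance),
            grow(*parts[False], depth - 1, n_try, rng, importance))

def fit_forest(rows, labels, n_trees=15, depth=3, seed=0):
    """Return (trees, normalized importance per feature)."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    n_try = max(1, int(n_features ** 0.5))  # features tried per split
    importance = [0.0] * n_features
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        trees.append(grow([rows[i] for i in idx], [labels[i] for i in idx],
                          depth, n_try, rng, importance))
    total = sum(importance) or 1.0
    return trees, [v / total for v in importance]

def predict(trees, row):
    """Majority vote over all trees."""
    votes = []
    for node in trees:
        while isinstance(node, tuple):
            f, t, left, right = node
            node = left if row[f] <= t else right
        votes.append(node)
    return Counter(votes).most_common(1)[0][0]

def select_and_refit(rows, labels, top_k):
    """Rank features by importance, keep the top_k, and refit on them."""
    _, importance = fit_forest(rows, labels)
    ranked = sorted(range(len(importance)), key=lambda f: -importance[f])
    kept = sorted(ranked[:top_k])
    reduced = [[r[f] for f in kept] for r in rows]
    trees, _ = fit_forest(reduced, labels)
    return ranked, kept, trees

# Synthetic hosts: the label depends only on feature 0 (an assumption for the demo).
data_rng = random.Random(42)
X = [[data_rng.random() for _ in FEATURES] for _ in range(120)]
y = [int(row[0] > 0.5) for row in X]

ranked, kept, forest = select_and_refit(X, y, top_k=2)
accuracy = sum(predict(forest, [r[f] for f in kept]) == t for r, t in zip(X, y)) / len(y)
print("importance ranking:", [FEATURES[f] for f in ranked])
print("training accuracy on kept features:", accuracy)
```

In practice a library implementation would normally replace the hand-written forest, and the number of retained features would be chosen by validation rather than fixed in advance; the sketch only shows the rank, select and refit sequence.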
Experiments show that the random forests classifier achieves good classification performance on the public benchmark data set WEBSPAM-UK2007.
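The abstract does not say which metrics underlie this result. As a hedged illustration only, classification performance for the spam class is often summarized with precision, recall and F1; the helper name `precision_recall_f1` and the eight host labels below are made up for the example and are not thesis data.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the positive (spam) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels for eight hosts (1 = spam, 0 = normal); not thesis data.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)  # 0.75 0.75 0.75
```

Because spam hosts are typically a minority class, per-class measures like these tend to be more informative than plain accuracy.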
Keywords/Search Tags: web spam, feature extraction, random forests