Research On Web Spam Detection Based On Semantic Analysis

Posted on: 2014-02-11
Degree: Master
Type: Thesis
Country: China
Candidate: W J Li
Full Text: PDF
GTID: 2248330395998889
Subject: Software engineering
Abstract/Summary:
Web spam is designed for search engines rather than for users. Spam pages try to deceive the search engine so that they receive a higher ranking in the search results than they deserve. Such cheating lowers the quality of search engine results and seriously degrades the user's search experience, and it is recognized as one of the biggest challenges facing Internet search. Building an effective spam page detection method is difficult, which makes the study of web spam detection a meaningful research topic.

In this paper, we describe the current state of domestic and international research on web spam detection and summarize in detail the basic ideas and features of existing detection methods. We sum up the shortcomings of existing detection technology and discuss the design of a web spam detection system in detail.

After studying the characteristics of web spam, we focus on feature selection and classifier design. We treat the detection of spam pages as a classification problem and design the detection framework around a machine learning algorithm: the framework extracts content features from web pages and combines them to detect web spam. The C4.5 classification algorithm is adopted to build a decision tree model that classifies web pages as normal pages or spam pages. Bagging and Boosting methods are added to further improve classification accuracy. We ran experiments on the standard test data set WEBSPAM-UK2007. The experimental results show that our classification model based on web content can detect spam pages effectively.
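
The framework described above (extract content features, then let a C4.5 decision tree split on them) can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not list the features used, so the four features in `content_features` are hypothetical examples, and `best_split` shows only a simplified form of the gain-ratio criterion that C4.5 uses to pick a threshold on a numeric feature. Tree growing, pruning, Bagging and Boosting are left out.

```python
import math
import zlib
from collections import Counter


def content_features(text):
    """Compute a few hypothetical content features from a page's visible text.

    These feature names are assumptions for illustration; the abstract
    does not say which content features the thesis actually uses.
    """
    words = text.split()
    n = len(words)
    raw = text.encode("utf-8")
    return {
        "word_count": n,
        "avg_word_len": sum(len(w) for w in words) / n if n else 0.0,
        # Highly repetitive text compresses well, so this ratio grows.
        "compress_ratio": len(raw) / len(zlib.compress(raw)) if raw else 0.0,
        # Share of the text taken up by its single most frequent word.
        "top_word_frac": Counter(words).most_common(1)[0][1] / n if n else 0.0,
    }


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # a split that sends everything one way is useless
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    gain = entropy(labels) - remainder
    # Split information: entropy of the partition sizes themselves.
    split_info = entropy(["L"] * len(left) + ["R"] * len(right))
    return gain / split_info


def best_split(rows, labels):
    """Return (gain_ratio, feature_name, threshold) of the best single split.

    Candidate thresholds are midpoints between consecutive distinct values.
    """
    best = (0.0, None, None)
    for name in rows[0]:
        values = [row[name] for row in rows]
        distinct = sorted(set(values))
        for lo, hi in zip(distinct, distinct[1:]):
            threshold = (lo + hi) / 2
            score = gain_ratio(values, labels, threshold)
            if score > best[0]:
                best = (score, name, threshold)
    return best
```

In use, each labeled page would be turned into a feature dictionary with `content_features`, and `best_split` would be applied recursively to the resulting rows to grow the tree. An ensemble in the spirit of Bagging would train several such trees on bootstrap samples of the rows and take a majority vote.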
Keywords/Search Tags: Search engine, Web spam detection, Decision tree, C4.5 classification algorithm