
Research on Hidden Web Spam Detection Technology

Posted on: 2014-02-06
Degree: Master
Type: Thesis
Country: China
Candidate: L L Wang
GTID: 2248330398976036
Subject: Computer application technology
Abstract/Summary:
Web spam refers to techniques that spammers use to confuse or mislead search engines so that web pages rank higher in search results than they deserve. Web spam not only reduces the accuracy and efficiency of search engine retrieval but also seriously degrades the user's search experience, and it is regarded as one of the biggest challenges in Internet retrieval. Among web spam techniques, hidden spam is concealed, deceptive, and difficult to detect, so detecting it has become an urgent problem. This thesis surveys current techniques for detecting hidden spam pages in China and abroad and introduces the types and characteristics of hidden spam. It summarizes a variety of cloaking phenomena, analyzes their characteristics and causes in detail, and describes several typical hidden spam detection techniques proposed by Chinese and foreign researchers.
Building on this summary of cloaking phenomena, the thesis proposes a cloaking detection algorithm based on the type of cloaking and designs a framework for a cloaking detection system. The framework consists of four modules: data collection, web page feature extraction, cloaking detection, and file management. The data collection module simulates a search engine crawler and a user's browser to obtain the search results each of them receives. The feature extraction module analyzes in detail the effectiveness of specific tag, content, and link features. The cloaking detection module implements the proposed algorithm; the naive Bayes algorithm is selected to detect complex cloaking, and its results are compared with those of several common classification algorithms. The file management module manages the system's files. The thesis also builds a lexicon of Chinese spam terms and a Chinese sample data set for cloaking detection.
Experiments demonstrate the effectiveness of the cloaking detection algorithm, and the experimental results are analyzed in detail.
Keywords/Search Tags: Hidden spam, Cloaking, Feature extraction, Naive Bayes algorithm
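The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea it describes: compare the copy of a page served to a crawler with the copy served to a browser, derive content and link features from the difference, and classify the feature vector with naive Bayes. Every name here (`cloaking_features`, `BernoulliNaiveBayes`), the 0.5 thresholds, the regular expressions, and the choice of a Bernoulli model with Laplace smoothing are assumptions made for illustration and are not taken from the thesis. Tokenizing with `\w+` is also a simplification; Chinese text would need a word segmenter.

```python
import math
import re

TAG_RE = re.compile(r"<[^>]+>")
LINK_RE = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)
WORD_RE = re.compile(r"\w+")


def terms(html):
    """Lowercased terms of the page text, with tags stripped naively."""
    return set(WORD_RE.findall(TAG_RE.sub(" ", html).lower()))


def links(html):
    """Set of href targets found in the page."""
    return set(LINK_RE.findall(html))


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|, defined as 0.0 when both sets are empty."""
    union = a | b
    return 0.0 if not union else 1.0 - len(a & b) / len(union)


def cloaking_features(crawler_html, browser_html, spam_words):
    """Three binary features comparing the crawler copy with the browser copy."""
    t_crawler, t_browser = terms(crawler_html), terms(browser_html)
    return (
        # Content differs strongly (0.5 is an assumed threshold).
        jaccard_distance(t_crawler, t_browser) > 0.5,
        # Link sets differ strongly (0.5 is an assumed threshold).
        jaccard_distance(links(crawler_html), links(browser_html)) > 0.5,
        # Spam-lexicon terms appear only in the crawler copy.
        bool((t_crawler - t_browser) & spam_words),
    )


class BernoulliNaiveBayes:
    """Naive Bayes over binary features with Laplace (add-one) smoothing."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        n_features = len(X[0])
        self.log_prior = {}
        self.prob = {}
        for c in self.classes:
            rows = [x for x, label in zip(X, y) if label == c]
            self.log_prior[c] = math.log(len(rows) / len(X))
            # P(feature j is true | class c), smoothed so it is never 0 or 1.
            self.prob[c] = [
                (sum(bool(r[j]) for r in rows) + 1) / (len(rows) + 2)
                for j in range(n_features)
            ]
        return self

    def predict(self, x):
        def log_score(c):
            return self.log_prior[c] + sum(
                math.log(p if value else 1.0 - p)
                for value, p in zip(x, self.prob[c])
            )
        return max(self.classes, key=log_score)
```

In use, each labeled page pair would be turned into a feature tuple with `cloaking_features`, the tuples passed to `fit`, and new pairs labeled with `predict`. A real system would fetch the two copies with different request headers and would use a richer feature set than the three shown here.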