
On Detecting the Cloaked Web Spam

Posted on: 2015-03-06
Degree: Master
Type: Thesis
Country: China
Candidate: X W Jiang
Full Text: PDF
GTID: 2268330428978812
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, online services have brought convenience to nearly every aspect of daily life. Search engines are a convenient and reliable tool for finding the information a user wants in the vast sea of Internet content. However, some people use improper means to raise the ranking of their pages in search engine results for malicious purposes or illegal profit. Scholars collectively call such pages web spam. This thesis focuses on cloaking, a form of web spam that deceives through a multi-copy approach and is strongly concealed. It combines similarity measurement with classification to detect this type of web spam.

This thesis summarizes existing techniques for detecting cloaking web spam and analyzes in detail the causes and cheating modes of cloaking. It also summarizes existing techniques for measuring the text similarity of web pages and introduces the methods for measuring text similarity in detail. The purpose is to provide a basis for detecting cloaking web spam.

Based on the cheating modes of cloaking and the methods of text similarity measurement, this thesis proposes a detection scheme that combines similarity measurement and classification. In the similarity measurement module, we focus on the design and implementation of a method based on the LDA (Latent Dirichlet Allocation) topic model; we then use a random forest classifier to detect cloaking web spam and obtain valid results.

This thesis constructs a data set of Chinese cloaking web page samples and uses experiments to comparatively validate the detection of cloaking web spam. Finally, the experimental results are analyzed in more detail.
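The similarity measurement step described in the abstract can be illustrated with a minimal sketch. It assumes that cloaking is checked by comparing two copies of the same page, one as served to a crawler and one as served to a browser; that framing, and the function names `tokenize`, `cosine_similarity`, `jaccard_similarity`, and `similarity_features`, are hypothetical and not taken from the thesis. The thesis measures similarity with an LDA topic model; this sketch substitutes plain term-frequency cosine similarity and Jaccard similarity to stay dependency-free, and it leaves the random forest classifier out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-word characters.

    Hypothetical tokenizer for illustration only; Chinese pages would
    need a proper word segmenter instead.
    """
    return [t for t in re.split(r"\W+", text.lower()) if t]


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between the term-frequency vectors of two token lists."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    # Undefined when either copy is empty; report 0.0 in that case.
    return dot / norm if norm else 0.0


def jaccard_similarity(tokens_a, tokens_b):
    """Jaccard similarity between the vocabularies of two token lists."""
    a, b = set(tokens_a), set(tokens_b)
    union = a | b
    # Undefined when both copies are empty; report 0.0 in that case.
    return len(a & b) / len(union) if union else 0.0


def similarity_features(crawler_copy, browser_copy):
    """Build a small feature vector comparing two copies of one page.

    Low similarity between the copies is a hint (not proof) of cloaking.
    """
    tokens_a = tokenize(crawler_copy)
    tokens_b = tokenize(browser_copy)
    return [
        cosine_similarity(tokens_a, tokens_b),
        jaccard_similarity(tokens_a, tokens_b),
    ]
```

Feature vectors of this kind, labeled as cloaked or not cloaked, could then be passed to a classifier such as a random forest, which is the role classification plays in the scheme the abstract describes.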
Keywords/Search Tags: Cloaking Web Spam, Similarity Measurement, Classification, LDA Topic Model, Random Forests