Research And Implementation Of Web Spam Detection Technology

Posted on: 2015-07-24
Degree: Master
Type: Thesis
Country: China
Candidate: Y Lv
Full Text: PDF
GTID: 2298330452950748
Subject: Computer application technology
Abstract/Summary:
Search quality is the core of a search service and determines the quality of the search engine. A search engine must not only return the results that are most relevant to the keywords and satisfy the user's requirements, but also identify and handle low-quality pages, including cheating pages that are artificially made to look "high quality". This thesis mainly studies content-based techniques and methods for detecting cheating pages, and applies naive text classification and KNN classification to the binary detection task.

The thesis treated page detection as a binary classification process that labels each page as normal or cheating. First, it constructed a vector space from the similarities between four page elements (title, keyword, description and content), and then applied naive text classification to predict the page class. To obtain the best prediction accuracy, the thesis iterated over threshold values and computed the similarity between page vectors by cosine value or Euclidean distance. However, the experiments showed that this approach could not reach the expected positive and negative recall rates. To address the accuracy problem, the thesis analyzed two-dimensional scatter plots of selected features of the corpus and found that the page class could not be predicted by a single threshold, because the distributions of positive and negative samples were interleaved. Therefore, a new supervised classification method, KNN, was applied, and new features were added and standardized to reduce the impact of the different measurement units of the page features. The new experiments indicated that KNN achieved better positive and negative recall rates than naive text classification.

Based on the classification above, the thesis developed a prototype news vertical search system that detects cheating pages, and briefly introduced the design and implementation of each system module, such as the page crawling module, the feature extraction module and the binary classification module. Tests and analysis of the percentage of spam pages were also carried out, with and without the classification filter and under different boosts of page elements such as title, keyword and description. The results showed that search quality was better with the classification filter than without it.

Finally, the thesis gives a brief overview of the work and discusses the issues that were not fully taken into account, together with prospects for future work.
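
The similarity features and the KNN step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the whitespace tokenization, the term-frequency weighting, the pairwise feature layout and the default k are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two token lists, using raw term frequencies (assumed weighting)."""
    ta, tb = Counter(a), Counter(b)
    dot = sum(ta[t] * tb[t] for t in ta.keys() & tb.keys())
    norm = (math.sqrt(sum(v * v for v in ta.values()))
            * math.sqrt(sum(v * v for v in tb.values())))
    return dot / norm if norm else 0.0


def page_features(page):
    """Pairwise similarities between the four page elements (hypothetical feature layout)."""
    fields = ["title", "keyword", "description", "content"]
    tokens = [page[f].lower().split() for f in fields]
    return [cosine_similarity(tokens[i], tokens[j])
            for i in range(len(fields)) for j in range(i + 1, len(fields))]


def standardize(rows):
    """Z-score every feature column so that different measurement units become comparable."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # A constant column has zero deviation; fall back to 1.0 to avoid dividing by zero.
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    scaled = [[(x - m) / s for x, m, s in zip(r, means, stds)] for r in rows]
    return scaled, means, stds


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training rows closest to the query (Euclidean distance)."""
    nearest = sorted(zip(train_x, train_y),
                     key=lambda pair: math.dist(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

In use, each labeled page would be turned into a feature row with `page_features`, the rows scaled with `standardize`, and a new page scaled with the same training means and deviations before calling `knn_predict`. Voting among neighbours, rather than cutting on one similarity threshold, is what lets the classifier cope with interleaved positive and negative samples.
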
Keywords/Search Tags: Cheat detection, Web Spam, Text similarity, KNN, Lucene