Research And Implementation Of Web Spam Detection Technology

Posted on:2015-07-24

Degree:Master

Type:Thesis

Country:China

Candidate:Y Lv

Full Text:PDF

GTID:2298330452950748

Subject:Computer application technology

Abstract/Summary:

Search quality is the core of a search service and determines the quality of the search engine. A search engine must not only return the results that are most relevant to the keywords and satisfy the user's requirements, but also identify and handle low-quality pages, including cheating pages that are artificially made to look "high quality". This thesis mainly studies content-based techniques and methods for detecting page cheating, and applies naive text classification and KNN classification to the binary classification detection task.

The thesis treated page detection as a binary classification process that labels each page as normal or cheating. First, it constructed a vector space consisting of the similarities between four page elements (title, keyword, description and content), and then applied naive text classification to page prediction. To obtain the best prediction accuracy, the thesis iterated over the threshold value and computed the similarity between page vectors by cosine value or Euclidean distance. However, the experiments showed that this approach could not achieve the expected positive and negative recall rates. To solve the accuracy problem, the thesis analyzed two-dimensional scatter plots of some features of the corpus set and found that the class of a page could not be predicted by a single threshold, because the distributions of positive and negative samples were interleaved. Therefore, a supervised classification method, KNN, was applied, and some new features were added and standardized to reduce the impact of the different measurement units of the page features. Finally, the new experiment indicated that KNN was better than naive text classification in positive and negative recall rates.

Based on the classification above, the thesis developed a news vertical search prototype system to detect cheating, and briefly introduced the design and implementation of each system module, such as the page crawling module, the feature extraction module and the binary classification module. Tests and analysis of the spam page percentage were also carried out with and without the classification filter, and with different boosts of page elements such as title, keyword and description. The results showed that search quality was better with the classification filter than without it.

Finally, the thesis gives a brief overview of the work and discusses the issues that were not taken fully into account, along with prospects for future work.
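
The pipeline described in the abstract (similarity features between page elements, standardization, then KNN) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the bag-of-words cosine similarity, the pairwise feature layout, the function names and the "normal"/"cheating" labels are all assumptions made for illustration, and the thesis's actual feature set is not given in the abstract.

```python
import math
from collections import Counter

FIELDS = ["title", "keyword", "description", "content"]  # the four page elements


def cosine_similarity(text_a, text_b):
    """Cosine similarity between whitespace-tokenized term-frequency vectors."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def page_features(page):
    """Assumed feature layout: similarity of every pair of page elements (6 values)."""
    return [cosine_similarity(page[FIELDS[i]], page[FIELDS[j]])
            for i in range(len(FIELDS)) for j in range(i + 1, len(FIELDS))]


def standardize(rows):
    """Z-score each feature column so that measurement units do not dominate distances."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # A constant column has zero deviation; fall back to 1.0 to avoid division by zero.
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    scaled = [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]
    return scaled, means, stds


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training points nearest to query (Euclidean distance)."""
    nearest = sorted(zip(train_x, train_y),
                     key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical toy data: [title/content similarity, page length in words].
    train_x = [[0.90, 100], [0.80, 120], [0.85, 110],
               [0.10, 900], [0.20, 950], [0.15, 1000]]
    train_y = ["normal"] * 3 + ["cheating"] * 3
    scaled, means, stds = standardize(train_x)
    query = [(v - m) / s for v, m, s in zip([0.12, 920], means, stds)]
    print(knn_predict(scaled, train_y, query, k=3))
```

Standardizing before the distance computation matters here because the second toy feature is several orders of magnitude larger than the first; without it, Euclidean distance would be driven almost entirely by page length.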

Keywords/Search Tags:

Cheat detection, Web Spam, Text similarity, KNN, Lucene

Related items

1	Research On Spam Recognition Based On Microblog
2	Text Clustering With Noise And Application In Anti-spam
3	Research On Web Spam Detection Technology Based On Immune Clonal Selection
4	Image Spam Filtering Technology Research
5	A Mixture Of Spam Filtering Technology Research
6	The Research And Implementation Of Full-Text System Based On Lucene And Textual Image
7	Research On Chinese Spam Filtering Based On Semantic Body And Text Clustering
8	Research On Spam Review Detection Of Logistics Front-end Trading Platform
9	An Automatic Similarity Detection Engine Between Sacred Texts Using Text Mining and Similarity Measure
10	Research On The Cheating Problem For MMOG Based On P2P