
Research And Implementation Of Spam Pages Filtering Based On Bayesian And Decision Tree Algorithms

Posted on: 2013-04-07
Degree: Master
Type: Thesis
Country: China
Candidate: Q H Qiu
GTID: 2248330362968659
Subject: Software engineering
Abstract/Summary:
In the Internet age, search engines face tremendous pressure, not only because tens of thousands of new pages appear every day, but also because many website operators use a variety of illegitimate means to deceive search engines into granting them high rankings. How to obtain accurate information from the vast sea of web content while filtering out unhealthy, illegal, and useless information has become another hot topic in current Internet research. Existing research focuses primarily on filtering useless information and does not account for the many unhealthy and illegal pages mixed in with it. An intelligent algorithm is therefore urgently needed that combines the advantages of decision trees and Bayesian text classification, so that it can both exclude pages that simply cheat their way to a high ranking and filter pages that spread unhealthy or illegal information.

Based on these considerations, the thesis first defines two kinds of spam pages. The first kind uses cheating techniques to raise its index weight and its ranking in search results, which degrades the accuracy of search engine indexing and seriously affects the normal use of search engines; such pages are called search engine spam pages. The second kind carries textual content that violates ethical, legal, or cultural norms and may have a serious negative impact on society; such pages are called information spam pages. Whether viewed from the search engine's own standpoint or from that of society, detecting and filtering both kinds of spam pages is an important task for search engines at this stage. After analyzing the current state of spam page detection algorithms, the thesis combines the ID3 decision tree algorithm with a Bayesian algorithm to filter both kinds of spam pages.

The two algorithms are combined on the basis of experimental analysis. ID3 detects search engine spam pages with very high accuracy, but it has difficulty catching information spam pages whose features are the same as those of normal pages. The Bayesian algorithm compensates for this weakness, mainly because the Naive Bayes classifier achieves high accuracy on content-based text classification. As a classification algorithm based on information gain, ID3 has a number of inherent deficiencies. Taking the characteristics of spam pages into account, the thesis proposes an improved ID3 algorithm. Experimental results show that the improved algorithm not only raises classification accuracy (to over 93%) but also effectively reduces the dimension of the feature space, pruning many unnecessary branches so that the algorithm runs more efficiently. The basic strategy of the Naive Bayes classifier is also improved in many details; experiments on spam page detection show that it classifies well, with the miss rate kept below 1%.

To verify the feasibility of combining the two algorithms, a detection system was built. Its detection accuracy reaches (92±1.5)% for a single class of spam pages and (95±0.85)% when both classes are detected simultaneously, a marked improvement over filters that have already been proposed or are currently in use.
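As a rough illustration of the two building blocks named in the abstract, the sketch below computes the information gain that standard ID3 uses to choose a split, and trains a textbook multinomial Naive Bayes classifier with add-one smoothing. It shows only the unmodified baseline algorithms; the improvements made in the thesis are not described in the abstract and are not reproduced here. The feature names, toy pages, word lists, and function names are all hypothetical and chosen only for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy reduction from splitting on one categorical feature (ID3's criterion)."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def train_naive_bayes(docs, labels):
    """Collect per-class word counts, per-class document counts, and the vocabulary."""
    word_counts, vocab = {}, set()
    for words, label in zip(docs, labels):
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return word_counts, Counter(labels), vocab


def classify(words, model):
    """Return the class with the highest log-posterior, using add-one smoothing."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for w in words:
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical structural features for the decision-tree side (not from the thesis).
pages = [
    {"keyword_stuffing": "yes", "hidden_text": "yes"},
    {"keyword_stuffing": "yes", "hidden_text": "no"},
    {"keyword_stuffing": "no", "hidden_text": "yes"},
    {"keyword_stuffing": "no", "hidden_text": "no"},
]
page_labels = ["spam", "spam", "normal", "normal"]
gains = {f: information_gain(pages, page_labels, f) for f in pages[0]}
best_feature = max(gains, key=gains.get)

# Hypothetical page texts for the content-based side (not from the thesis).
docs = [
    ["casino", "bonus", "win"],
    ["win", "casino", "cash"],
    ["weather", "report", "today"],
    ["today", "news", "report"],
]
doc_labels = ["spam", "spam", "normal", "normal"]
model = train_naive_bayes(docs, doc_labels)

print(best_feature, gains)
print(classify(["casino", "cash"], model), classify(["weather", "news"], model))
```

In this toy data the structural feature that separates the classes perfectly receives the full one bit of gain, while the content classifier decides from word frequencies alone. That division of labour mirrors the abstract's claim that the decision tree handles ranking-oriented cheating and Naive Bayes handles content that looks structurally normal.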
Keywords/Search Tags: Web Spam, Spam Web Filtering, ID3 algorithm, Naive Bayesian classifier