
Double Filtering Of Objectionable Web Pages Based On Content Identification

Posted on: 2013-02-04    Degree: Master    Type: Thesis
Country: China    Candidate: L Wang    Full Text: PDF
GTID: 2218330371485299    Subject: Computer application technology

Abstract/Summary:
It is no exaggeration to say that the network now covers almost every aspect of our lives. It has been embraced by many people for its convenience, efficiency, ease of use, and low cost, and the Web page has become one of the most important means of information dissemination. For the same reason, the network is also filled with a great deal of harmful information: reactionary, pornographic, and violent websites, as well as a variety of false or deceptive advertising. Such content is not welcomed by users: it not only endangers social stability but also seriously interferes with normal use of the Internet and creates an unnecessary waste of resources.

At present, a great deal of "green Internet" research has been carried out at home and abroad, and many mature products exist whose purpose is to filter out harmful content and purify the Internet environment. This paper surveys several Web text filtering methods. Based on an analysis of the pros and cons of these methods and a summary of the content of illegal pages, and building on existing technology, the paper carries out a feasibility analysis. Taking overall efficiency, system overhead, false positive rate, and stability into consideration, the paper ultimately adopts a dual filtering method with the Bayesian and SVM (Support Vector Machine) algorithms at its core.

Bayesian classification and SVM are the mainstream algorithms for identifying text-based content. This paper finds that, for content-based text classification, the Bayesian and SVM algorithms complement each other in terms of accuracy and efficiency. After introducing the principles of Bayesian classification and SVM in detail, the paper establishes a Bayesian-SVM double filtering model based on content identification and describes the structure, principle, and workflow of this method. The two-tier structure compensates for the limited accuracy of Bayesian classification and effectively reduces the computational complexity of the SVM. At the same time, the lower-level filter supports a two-way choice: the second tier can separately re-filter either of the two outcomes of the top-level classification (a code sketch of this two-tier pipeline follows the abstract). Besides the upper- and lower-level classification filter modules, the system's third module is a feature lexicon self-learning module, whose principle is to learn classifications for features not yet contained in the corpus. This solves the self-updating problem and allows the system to adapt to changes in the external environment.

Finally, the paper compares the Bayesian, SVM, and double filtering methods experimentally on a test set. The results show that, when both accuracy and overhead are considered, the Bayesian/SVM double filtering method for Web text content performs well in terms of accuracy, recall, precision, and false positive rate; although it incurs some additional overhead, the impact on the overall system is small.
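The sketch below illustrates the general two-tier idea described above: a fast Bayesian classifier makes the first pass, and pages it cannot decide with enough confidence are handed to an SVM for a second pass. It assumes scikit-learn; the class names, confidence threshold, and training interface are illustrative assumptions, not the thesis's actual implementation or its feature lexicon self-learning module.

    # Minimal sketch of a two-tier Bayesian/SVM text filter (assumed scikit-learn API).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC

    def train(pages, labels):
        """Fit a shared feature space plus both classifiers on labeled page texts."""
        vec = TfidfVectorizer()
        X = vec.fit_transform(pages)
        nb = MultinomialNB().fit(X, labels)   # tier 1: fast, cheaper filter
        svm = LinearSVC().fit(X, labels)      # tier 2: slower, more accurate filter
        return vec, nb, svm

    def classify(page, vec, nb, svm, threshold=0.8):
        """Tier 1 decides when confident; uncertain pages fall through to the SVM."""
        x = vec.transform([page])
        proba = nb.predict_proba(x)[0]
        if proba.max() >= threshold:          # Bayesian decision is confident enough
            return nb.classes_[proba.argmax()]
        return svm.predict(x)[0]              # second-tier re-check of the page

In this arrangement the threshold controls how much traffic reaches the SVM, which is one way to trade a small accuracy gain against the extra computational overhead the abstract mentions.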
Keywords/Search Tags: bad information filtering, Bayesian methods, Support Vector Machine, double filtering, feature self-learning