
Double Filtering Of Objectionable Web Pages Based On Content Identification

Posted on: 2013-02-04    Degree: Master    Type: Thesis
Country: China    Candidate: L Wang    Full Text: PDF
GTID: 2218330371485299    Subject: Computer application technology

Abstract/Summary:
It is no exaggeration to say that the network now covers almost every aspect of our lives. It has been embraced by many people for its convenience, efficiency, ease of use, and low cost, and the Web page has become one of the most important means of information dissemination. For the same reason, the network is also filled with a great deal of harmful information: reactionary, pornographic, and violent websites, as well as a variety of false or deceptive advertising. Such content is not welcomed by users: it not only endangers social stability but also seriously interferes with normal use of the Internet and creates an unnecessary waste of resources.

At present, a great deal of "green Internet" research has been carried out at home and abroad, and many mature products exist whose purpose is to filter out harmful content and purify the Internet environment. This paper surveys several Web text filtering methods. Based on an analysis of the pros and cons of these methods and a summary of the content of illegal pages, and building on existing technology, the paper carries out a feasibility analysis. Taking overall efficiency, system overhead, false positive rate, and stability into consideration, the paper ultimately adopts a dual filtering method with the Bayesian and SVM (Support Vector Machine) algorithms at its core.

Bayesian classification and SVM are the mainstream algorithms for identifying text-based content. This paper finds that, for content-based text classification, the Bayesian and SVM algorithms complement each other in terms of accuracy and efficiency. After introducing the principles of Bayesian classification and SVM in detail, the paper establishes a Bayesian-SVM double filtering model based on content identification and describes the structure, principle, and workflow of this method. The two-tier structure compensates for the limited accuracy of Bayesian classification and effectively reduces the computational complexity of the SVM. At the same time, the lower-level filter supports a two-way choice: the second tier can separately re-filter either of the two outcomes of the top-level classification (a code sketch of this two-tier pipeline follows the abstract). Besides the upper- and lower-level classification filter modules, the system's third module is a feature lexicon self-learning module, whose principle is to learn classifications for features not yet contained in the corpus. This solves the self-updating problem and allows the system to adapt to changes in the external environment.

Finally, the paper compares the Bayesian, SVM, and double filtering methods experimentally on a test set. The results show that, when both accuracy and overhead are considered, the Bayesian/SVM double filtering method for Web text content performs well in terms of accuracy, recall, precision, and false positive rate; although it incurs some additional overhead, the impact on the overall system is small.
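The sketch below illustrates the general two-tier idea described above: a fast Bayesian classifier makes the first pass, and pages it cannot decide with enough confidence are handed to an SVM for a second pass. It assumes scikit-learn; the class names, confidence threshold, and training interface are illustrative assumptions, not the thesis's actual implementation or its feature lexicon self-learning module.

    # Minimal sketch of a two-tier Bayesian/SVM text filter (assumed scikit-learn API).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC

    def train(pages, labels):
        """Fit a shared feature space plus both classifiers on labeled page texts."""
        vec = TfidfVectorizer()
        X = vec.fit_transform(pages)
        nb = MultinomialNB().fit(X, labels)   # tier 1: fast, cheaper filter
        svm = LinearSVC().fit(X, labels)      # tier 2: slower, more accurate filter
        return vec, nb, svm

    def classify(page, vec, nb, svm, threshold=0.8):
        """Tier 1 decides when confident; uncertain pages fall through to the SVM."""
        x = vec.transform([page])
        proba = nb.predict_proba(x)[0]
        if proba.max() >= threshold:          # Bayesian decision is confident enough
            return nb.classes_[proba.argmax()]
        return svm.predict(x)[0]              # second-tier re-check of the page

In this arrangement the threshold controls how much traffic reaches the SVM, which is one way to trade a small accuracy gain against the extra computational overhead the abstract mentions.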
Keywords/Search Tags: bad information filtering, Bayesian methods, Support Vector Machine, double filtering, feature self-learning