Research On The Classification Of Objectionable Web Based On Content Identification

Posted on:2014-02-22

Degree:Master

Type:Thesis

Country:China

Candidate:W Liu

Full Text:PDF

GTID:2248330395997685

Subject:Network and information security

Abstract/Summary:

With the rapid development of the information age, the network brings us convenient access to information but also a large amount of junk and harmful information. This information not only interferes with normal use of the network; some inappropriate content also seriously harms young users. How to filter web information efficiently in the face of a huge number of web pages is therefore receiving growing attention. This paper describes the web filtering process and current filtering methods in detail, following the main research directions of web filtering.

In text classification, mainstream classification algorithms first convert documents into the vector space model and then apply a classifier. Each dimension of the vector space model usually represents a feature word of the text. These features must be independent of each other: the position and order in which feature words appear are assumed not to help distinguish the document category, and the feature words are assumed to be semantically independent of one another. For any given article and selected feature set, current methods cannot make the feature words completely independent of each other. Most methods focus on selecting feature words that are highly correlated with the document category, and easily overlook independence among the features. To improve the relative independence of the features in the vector space model, this paper selects the feature words that are highly relevant to the document categories using the chi-square test, then combines these feature words into clusters using a word co-occurrence matrix and a simplified DBSCAN. The TF-IDF algorithm is used to calculate the weights of the feature clusters, and the resulting feature cluster model is used to represent web documents.

This paper introduces the major classification algorithms, then analyses and compares their advantages and disadvantages.
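
The feature steps described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names `chi_square` and `cluster_tfidf`, the smoothed IDF formula, and the choice to treat a cluster as one feature by summing its members' counts are all assumptions made for illustration. The clustering step itself (co-occurrence matrix plus simplified DBSCAN) is not reproduced; clusters are taken as given input.

```python
import math
from collections import Counter


def chi_square(term, category, docs):
    """Chi-square score of `term` for `category`.

    docs: list of (set_of_words, label) pairs.
    Uses the standard 2x2 contingency table:
      a = docs in category containing term
      b = docs outside category containing term
      c = docs in category without term
      d = docs outside category without term
    """
    a = b = c = d = 0
    for words, label in docs:
        has_term = term in words
        in_cat = label == category
        if has_term and in_cat:
            a += 1
        elif has_term:
            b += 1
        elif in_cat:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    # A term present in every document (or in none) carries no signal.
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom


def cluster_tfidf(tokens, clusters, corpus):
    """TF-IDF weight per feature cluster for one document (assumed scheme).

    tokens:   list of words in the document
    clusters: list of sets of feature words (hypothetical clustering output)
    corpus:   list of token lists, used for document frequencies
    """
    counts = Counter(tokens)
    n_docs = len(corpus)
    weights = []
    for cluster in clusters:
        # Cluster term frequency: total occurrences of any member word.
        tf = sum(counts[w] for w in cluster) / max(len(tokens), 1)
        # Cluster document frequency: docs containing at least one member.
        df = sum(1 for doc in corpus if cluster & set(doc))
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF (assumption)
        weights.append(tf * idf)
    return weights
```

Words with the highest `chi_square` scores per category would be kept as candidate features, and each document would then be represented by the vector returned from `cluster_tfidf` rather than by one dimension per word.
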
To make better use of the advantages of the different classification algorithms and further improve classification accuracy, we first classify the easily distinguished documents using Naive Bayes. These documents are output directly and do not take part in the second filter; the remaining documents are filtered again. In the second stage we use an SVM to classify the remaining documents. This method balances the number of positive and negative samples in the second filtering stage, which helps improve the accuracy of the support vector machine. Although the additional filtering stage increases the time and space complexity of the algorithm, the added cost is acceptable. Experiments compare the performance of the double filter method with that of the single filter methods, and show the value of the double filter method: it improves the correct rate, recall, and F1. Depending on the amount of data, a single-layer Naive Bayes filter or support vector machine filter can also be chosen flexibly.
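
The double filter idea can be sketched as a two-stage cascade. The following is a simplified illustration under stated assumptions, not the thesis's code: the confidence `threshold`, the `NaiveBayes` class, and the `double_filter` function are hypothetical, and the second stage is passed in as a plain callable standing in for a trained SVM.

```python
import math
from collections import Counter


class NaiveBayes:
    """Multinomial Naive Bayes with Laplace smoothing over token lists."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = {c: math.log(labels.count(c) / len(labels))
                      for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for tokens, label in zip(docs, labels):
            self.counts[label].update(tokens)
        self.vocab = {w for c in self.classes for w in self.counts[c]}
        self.totals = {c: sum(self.counts[c].values()) for c in self.classes}
        return self

    def posterior(self, tokens):
        """Return {class: P(class | tokens)}; unseen words are ignored."""
        v = len(self.vocab)
        log_p = {}
        for c in self.classes:
            score = self.prior[c]
            for w in tokens:
                if w in self.vocab:
                    score += math.log((self.counts[c][w] + 1)
                                      / (self.totals[c] + v))
            log_p[c] = score
        # Normalise in a numerically stable way.
        top = max(log_p.values())
        exp = {c: math.exp(lp - top) for c, lp in log_p.items()}
        z = sum(exp.values())
        return {c: e / z for c, e in exp.items()}


def double_filter(docs, nb, second_stage, threshold=0.8):
    """First pass with Naive Bayes; low-confidence docs go to `second_stage`.

    second_stage: callable(tokens) -> label, e.g. a trained SVM's predict
                  wrapped in a function (assumption: supplied by the caller).
    Returns (labels, indices_sent_to_second_stage).
    """
    results = [None] * len(docs)
    hard = []
    for i, tokens in enumerate(docs):
        label, p = max(nb.posterior(tokens).items(), key=lambda kv: kv[1])
        if p >= threshold:
            results[i] = label      # easy document: output directly
        else:
            hard.append(i)          # ambiguous document: filter again
    for i in hard:
        results[i] = second_stage(docs[i])
    return results, hard
```

Only the documents whose best Naive Bayes posterior falls below the threshold reach the second classifier, so the second stage sees a smaller and harder subset. How the threshold is chosen is left open here; the abstract does not state it.
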

Keywords/Search Tags:

web filtering, double filter, feature cluster model, Naive Bayes, SVM

Related items

1	Research On Chinese Spam SMS Filtering Method Based On Rough Set And Naive Bayes
2	Research On Improving Naive Bayes Classification Model
3	The Research Of Multi-layer Hidden Naive Bayes Algorithm Based On Mutual Information
4	Research And Implementation Of Spam Filtering System Based On Improved Naive Bayes Algorithm
5	Application Of Improved Naive Bayes Algorithm In Spam Filtering
6	The Research And Application Of Text Categorization Arithmetic In Spam Filtering
7	Research Of Intrusion Dynamic Forensics Model Based On Classification Analysis
8	Spam Filtering Technology Research Based On Statistical Model
9	Text Classification Algorithm Research Based On Naive Bayes
10	Research And Application On Naive Bayes Classification Algorithm