
Research On Text-Content-based Web Filtering Technology

Posted on: 2009-07-19
Degree: Master
Type: Thesis
Country: China
Candidate: D R Si
Full Text: PDF
GTID: 2178360245481605
Subject: Computer applications
Abstract/Summary:
The rapid development of the Internet has made it the world's most extensive, largest, and richest resource base. On the one hand, people fully enjoy the convenience of information sharing; on the other, they suffer from a flood of "spam". For juvenile students in particular, a great deal of "harmful information" threatens their physical and mental health. In addition, employees' Internet access at work often needs to be regulated. Web filtering thus came into being.

At present, many commercial Web filtering products use text-content-based filtering technology. It begins by analyzing web content to obtain the effective text, then applies text classification algorithms to learn a website classifier from a training pool. When a user accesses the Internet, the system makes an "allow" or "prohibit" judgment according to the page's category, computed either in advance or in real time. The core of content-based Web filtering therefore lies in the accuracy of page classification.

This thesis studies text-content-based Web filtering technology; the effectiveness of the filter depends on how accurately pages are classified. Page classification takes two steps. The first step analyzes web content and extracts from the page the text that best represents its semantics. Existing techniques include methods that exploit document structure, page summarization methods, and link-based algorithms; each has inherent weaknesses that hurt the accuracy of the subsequent classification. This thesis proposes an algorithm that finds similar pages within the same site and uses them to identify the main content, overcoming the weaknesses of other Web content extraction methods.
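The idea behind the proposed extraction step, comparing similar pages from the same site so that shared template blocks can be discarded as boilerplate, can be sketched roughly as follows. The thesis does not give its exact procedure; the crude tag-stripping, block-splitting rule, and function names below are illustrative assumptions only:

```python
import re

# Sketch of same-site page comparison for content extraction.
# Assumption: text blocks that appear verbatim in a structurally
# similar sibling page are site template/boilerplate, while blocks
# unique to the page carry its actual content.

def text_blocks(html: str) -> list[str]:
    """Crude block split: drop tags, then split on line breaks."""
    text = re.sub(r"<[^>]+>", "\n", html)
    return [b.strip() for b in text.split("\n") if b.strip()]

def extract_content(page: str, sibling: str) -> str:
    """Keep only the blocks unique to `page`; blocks shared with a
    similar page from the same site are treated as template noise."""
    shared = set(text_blocks(sibling))
    return "\n".join(b for b in text_blocks(page) if b not in shared)

nav = "<div>Home | News | Sports</div>"       # shared navigation bar
page_a = nav + "<p>Story about football results.</p>"
page_b = nav + "<p>Story about a chess tournament.</p>"
print(extract_content(page_a, page_b))  # only the football paragraph survives
```

A real system would match blocks more tolerantly (e.g. by DOM path or near-duplicate similarity) rather than by exact string equality, but the comparison principle is the same.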
The second step classifies the text extracted from the page. The thesis surveys a number of mature text classification techniques, including the Bayesian algorithm, SVM, kNN, and decision trees. Bayesian classification carries the smallest risk of error, and in tests naive Bayes shows remarkable speed and accuracy on large data sets. This thesis therefore chooses the Bayesian algorithm as the page classifier, takes the SurfControl page classifier as a reference, and uses its two classified page sets as the training set and the test set. The experimental results show that the proposed method achieves good classification results on the overwhelming majority of categories.
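The naive Bayes page classifier chosen above can be illustrated with a minimal multinomial sketch: each category gets a log prior plus Laplace-smoothed word likelihoods, and a page is assigned to the highest-scoring category. The tiny training corpus and category names below are illustrative, not the thesis's actual data or implementation:

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (tokens, label). Returns per-label log prior,
    Laplace-smoothed log word likelihoods, and an unseen-word fallback."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    n = sum(class_docs.values())
    model = {}
    for label in class_docs:
        total = sum(word_counts[label].values())
        model[label] = (
            math.log(class_docs[label] / n),
            {w: math.log((word_counts[label][w] + 1) / (total + len(vocab)))
             for w in vocab},
            math.log(1 / (total + len(vocab))),  # smoothed unseen-word prob
        )
    return model

def classify(model, tokens):
    """Pick the label maximizing log P(label) + sum of log P(word|label)."""
    def score(label):
        prior, likes, unseen = model[label]
        return prior + sum(likes.get(w, unseen) for w in tokens)
    return max(model, key=score)

docs = [
    ("bet casino poker win".split(), "gambling"),
    ("casino jackpot odds".split(), "gambling"),
    ("homework school teacher".split(), "education"),
    ("exam school study".split(), "education"),
]
model = train(docs)
print(classify(model, "casino odds".split()))  # → gambling
```

Working in log space avoids floating-point underflow when a page contains thousands of words, which matters for the large page sets the filter must handle.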
Keywords/Search Tags: Web Filtering, Text Extraction, Text Classification, Bayes