
Research On Text-Content-based Web Filtering Technology

Posted on: 2009-07-19
Degree: Master
Type: Thesis
Country: China
Candidate: D R Si
Full Text: PDF
GTID: 2178360245481605
Subject: Computer applications
Abstract/Summary:
The rapid development of the Internet has made it the world's most extensive, largest, and richest resource base. On the one hand, people fully enjoy the convenience of information sharing; on the other, they suffer from a flood of "spam". For juvenile students in particular, a great deal of "harmful information" threatens their physical and mental health. In addition, employees' Internet access at work often needs to be regulated. Web filtering thus came into being.

At present, many commercial Web filtering products use text-content-based filtering technology. It begins by analyzing web content to obtain the effective text, then applies text classification algorithms to learn a website classifier from a training pool. When a user accesses the Internet, the system makes an "allow" or "prohibit" judgment according to the page's category, computed either in advance or in real time. The core of content-based Web filtering therefore lies in the accuracy of page classification.

This thesis studies text-content-based Web filtering technology; the effectiveness of the filter depends on how accurately pages are classified. Page classification takes two steps. The first step analyzes web content and extracts from the page the text that best represents its semantics. Existing techniques include methods that exploit document structure, page summarization methods, and link-based algorithms; each has inherent weaknesses that hurt the accuracy of the subsequent classification. This thesis proposes an algorithm that finds similar pages within the same site and uses them to identify the main content, overcoming the weaknesses of other Web content extraction methods.
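The idea behind the proposed extraction step, comparing similar pages from the same site so that shared template blocks can be discarded as boilerplate, can be sketched roughly as follows. The thesis does not give its exact procedure; the crude tag-stripping, block-splitting rule, and function names below are illustrative assumptions only:

```python
import re

# Sketch of same-site page comparison for content extraction.
# Assumption: text blocks that appear verbatim in a structurally
# similar sibling page are site template/boilerplate, while blocks
# unique to the page carry its actual content.

def text_blocks(html: str) -> list[str]:
    """Crude block split: drop tags, then split on line breaks."""
    text = re.sub(r"<[^>]+>", "\n", html)
    return [b.strip() for b in text.split("\n") if b.strip()]

def extract_content(page: str, sibling: str) -> str:
    """Keep only the blocks unique to `page`; blocks shared with a
    similar page from the same site are treated as template noise."""
    shared = set(text_blocks(sibling))
    return "\n".join(b for b in text_blocks(page) if b not in shared)

nav = "<div>Home | News | Sports</div>"       # shared navigation bar
page_a = nav + "<p>Story about football results.</p>"
page_b = nav + "<p>Story about a chess tournament.</p>"
print(extract_content(page_a, page_b))  # only the football paragraph survives
```

A real system would match blocks more tolerantly (e.g. by DOM path or near-duplicate similarity) rather than by exact string equality, but the comparison principle is the same.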
The second step classifies the text extracted from the page. The thesis surveys a number of mature text classification techniques, including the Bayesian algorithm, SVM, kNN, and decision trees. Bayesian classification carries the smallest risk of error, and in tests naive Bayes shows remarkable speed and accuracy on large data sets. This thesis therefore chooses the Bayesian algorithm as the page classifier, takes the SurfControl page classifier as a reference, and uses its two classified page sets as the training set and the test set. The experimental results show that the proposed method achieves good classification results on the overwhelming majority of categories.
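The naive Bayes page classifier chosen above can be illustrated with a minimal multinomial sketch: each category gets a log prior plus Laplace-smoothed word likelihoods, and a page is assigned to the highest-scoring category. The tiny training corpus and category names below are illustrative, not the thesis's actual data or implementation:

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (tokens, label). Returns per-label log prior,
    Laplace-smoothed log word likelihoods, and an unseen-word fallback."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    n = sum(class_docs.values())
    model = {}
    for label in class_docs:
        total = sum(word_counts[label].values())
        model[label] = (
            math.log(class_docs[label] / n),
            {w: math.log((word_counts[label][w] + 1) / (total + len(vocab)))
             for w in vocab},
            math.log(1 / (total + len(vocab))),  # smoothed unseen-word prob
        )
    return model

def classify(model, tokens):
    """Pick the label maximizing log P(label) + sum of log P(word|label)."""
    def score(label):
        prior, likes, unseen = model[label]
        return prior + sum(likes.get(w, unseen) for w in tokens)
    return max(model, key=score)

docs = [
    ("bet casino poker win".split(), "gambling"),
    ("casino jackpot odds".split(), "gambling"),
    ("homework school teacher".split(), "education"),
    ("exam school study".split(), "education"),
]
model = train(docs)
print(classify(model, "casino odds".split()))  # → gambling
```

Working in log space avoids floating-point underflow when a page contains thousands of words, which matters for the large page sets the filter must handle.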
Keywords/Search Tags: Web Filtering, Text Extraction, Text Classification, Bayes