Key Technologies of Web Content Filtering and Their Implementation

Posted on: 2006-01-19
Degree: Master
Type: Thesis
Country: China
Candidate: G Q Bai
Full Text: PDF
GTID: 2208360155966106
Subject: Computer software and theory
Abstract/Summary:
The Internet has brought an explosion of information and rapid economic development, but it has also brought harmful information, so filtering information from the Web has become an active research field. Current web filtering systems, however, mostly rely on URL blocking and keyword filtering, and these techniques fall short in both accuracy and efficiency. To solve this problem, the content of Web pages must be analyzed. Building on a discussion of current web filtering technologies, we propose using artificial neural networks (ANNs) to classify Web pages during content filtering, and we implement a system that blocks pornography.

Web pages are documents described in HTML, and they have a particular structure. The DOM is an API defined for accessing HTML and XML documents. After studying the structure of web pages, we propose a method that parses the content of HTML documents according to their structure, i.e. it extracts the content of the elements of a web page using the facilities the DOM provides.

When documents are represented as vectors over the real numbers, pattern recognition and mature computational methods from other fields can be applied, which markedly improves the computability and tractability of natural-language documents. We discuss several information filtering models, analyze their advantages and disadvantages, and select the Vector Space Model (VSM) as the representation for HTML documents.

Chinese automatic word segmentation is crucial when transforming documents into vectors. We introduce the techniques currently in use, the difficulties, and the achievements in Chinese automatic word segmentation. Based on the characteristics of pornographic web pages, we build a special keyword dictionary. Using this dictionary together with a segmentation algorithm provided by a third party, we improve the accuracy of Chinese automatic word segmentation.

The essence of filtering web pages is to categorize HTML documents according to their content.
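
The structure-based extraction and vector representation described above can be illustrated with a minimal sketch. It is not the thesis implementation: Python's event-based `html.parser` stands in for a DOM API, whitespace tokenization stands in for Chinese word segmentation, and the names `TextExtractor`, `extract_text`, `to_vector` and the sample vocabulary are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def to_vector(text, vocabulary):
    """Map text to a term-frequency vector over a fixed keyword list.

    Assumption: tokens are separated by whitespace. Chinese text would
    first need a word segmenter, which this sketch does not include.
    """
    counts = Counter(text.lower().split())
    return [float(counts[term]) for term in vocabulary]
```

A page would be converted with `to_vector(extract_text(html), vocabulary)`, where `vocabulary` plays the role of the special keyword dictionary. Raw term frequency is only one possible weighting; the abstract does not say which weighting the thesis uses.
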
We discuss the general approach to text categorization and apply text categorization methods to filtering web pages. The distributed, parallel nature of ANNs allows processing units of ordinary speed to achieve high-speed computation, and the learning ability and non-linearity of ANNs let them complete some kinds of tasks that traditional methods cannot. ANNs are advantageous when a large amount of data must be categorized. We use a self-organizing map (SOM) neural network to categorize web pages.

Finally, we compare the performance of our system with that of existing systems. The comparison indicates that the method proposed in this thesis is both efficient and accurate.
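
To make the SOM step concrete, here is a minimal sketch of a one-dimensional self-organizing map trained on document vectors. It is an illustration under stated assumptions, not the network used in the thesis: the number of units, the linear decay schedule, the Gaussian neighbourhood, the sample-based initialization, and the names `train_som` and `best_matching_unit` are all hypothetical choices.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_matching_unit(weights, vector):
    """Index of the unit whose weight vector is closest to `vector`."""
    return min(range(len(weights)),
               key=lambda i: squared_distance(weights[i], vector))


def train_som(samples, n_units=2, epochs=50, lr0=0.5, sigma0=1.0):
    """Train a one-dimensional SOM; all defaults are illustrative."""
    # Assumption: seed the units with evenly spaced training samples.
    weights = [list(samples[i * len(samples) // n_units])
               for i in range(n_units)]
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs       # shrinks linearly from 1.0
        lr = lr0 * decay                   # learning rate
        sigma = max(sigma0 * decay, 0.01)  # neighbourhood radius
        for vector in samples:
            winner = best_matching_unit(weights, vector)
            for i, w in enumerate(weights):
                # Gaussian neighbourhood: the winner moves most,
                # units further away on the map move less.
                influence = math.exp(-((i - winner) ** 2) / (2 * sigma ** 2))
                for d in range(len(w)):
                    w[d] += lr * influence * (vector[d] - w[d])
    return weights
```

After training, `best_matching_unit(weights, vector)` assigns a page vector to a map unit. A SOM is unsupervised, so deciding which units correspond to pages that should be blocked needs a separate labelling step that is not shown here.
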
Keywords/Search Tags: DOM, SOM, ANN, Chinese automatic word segmentation, text categorization