Design And Implementation Of Web Content Filtering System

Posted on: 2015-08-30
Degree: Master
Type: Thesis
Country: China
Candidate: X J Wang
Full Text: PDF
GTID: 2308330473953066
Subject: Software engineering
Abstract/Summary:
The online world is filled with unhealthy and useless information, which poses a serious challenge to campus network management. A campus network brings convenience to students and teachers, but it can also expose them to harm. Web content filtering is an effective approach that can automatically filter out harmful information.

This thesis analyzes the development status, open problems, and common filtering methods in the field of network filtering, both domestically and internationally. It then studies all the components needed to implement a web content filtering system. The system has two main functions: filtering of specific URLs, and filtering of web page text content. It is built from two functional modules: one captures and reassembles network packets, and the other processes web text data.

The network data capture module focuses on network protocol parsing, in particular the analysis of Ethernet frames, IP packets, TCP segments, and HTTP messages. Based on this protocol analysis, the module captures and analyzes network packets using the WinPcap packet capture library on Windows. It implements URL filtering and reassembles HTML pages into text data for the web text data processing module. To suit the characteristics of a campus network, the URL filtering function can define multiple rule bases and apply different rule bases during different time periods.

The web text data processing module focuses on web text classification. Web text is semi-structured, so the module first extracts the text data from a web page. It then concentrates on text classification, including text preprocessing and training of the text classifier.
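The idea of several URL rule bases that apply in different time periods can be illustrated with a minimal Python sketch. This is not the thesis's implementation (which works on packets captured with WinPcap); the names `RuleBase` and `is_url_blocked`, the host-based matching, and the example hosts are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import time
from urllib.parse import urlsplit


@dataclass
class RuleBase:
    """Hypothetical rule base: a set of blocked hosts active in one daily window."""
    name: str
    start: time
    end: time
    blocked_hosts: set = field(default_factory=set)

    def is_active(self, now: time) -> bool:
        # A window such as 23:00-06:00 wraps past midnight and needs its own branch.
        if self.start <= self.end:
            return self.start <= now < self.end
        return now >= self.start or now < self.end

    def blocks(self, host: str) -> bool:
        # Block a listed host and any of its subdomains.
        host = host.lower()
        return any(host == h or host.endswith("." + h) for h in self.blocked_hosts)


def is_url_blocked(url: str, now: time, rule_bases: list) -> bool:
    """Return True if any rule base active at `now` blocks the URL's host."""
    # URLs without a scheme have no parsed hostname and are treated as not blocked.
    host = urlsplit(url).hostname or ""
    return any(rb.is_active(now) and rb.blocks(host) for rb in rule_bases)


# Example: one rule base for class hours, one for the night.
rule_bases = [
    RuleBase("class_hours", time(8, 0), time(18, 0), {"games.example.com"}),
    RuleBase("night", time(23, 0), time(6, 0), {"video.example.org"}),
]
```

With these example rule bases, a request to `games.example.com` would be blocked at 10:30 but allowed at 20:00, because the only rule base listing that host is inactive in the evening.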
Text preprocessing involves several detailed techniques: Chinese word segmentation, feature selection, and feature weighting. Implementing the text classifier involves analyzing and comparing a variety of text classification algorithms. Based on the characteristics of the campus network, I selected the category center vector method (Rocchio) as the text classifier. The classifier was trained on the training set, and its classification results were evaluated by cross-validation, with satisfactory results.

Finally, the web content filtering system is summarized and discussed. I hope that the next step will be a more comprehensive web content filtering system that covers not only text data but also multimedia information such as pictures, sound, and video.
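The category center vector (Rocchio) approach mentioned above can be sketched as follows: each class is represented by the average of its training document vectors, and a new document is assigned to the class whose center is most similar under cosine similarity. This is a simplified illustration, not the thesis's classifier. It assumes documents are already tokenized (the thesis uses Chinese word segmentation), uses plain normalized term frequency in place of the thesis's feature selection and weighting, and all function names and the toy training data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tf_vector(tokens):
    """Turn a token list into an L2-normalized term-frequency vector (a dict)."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}


def train_centroids(labelled_docs):
    """Average the document vectors of each class into one center vector per class."""
    sums = defaultdict(lambda: defaultdict(float))
    sizes = Counter()
    for tokens, label in labelled_docs:
        sizes[label] += 1
        for term, weight in tf_vector(tokens).items():
            sums[label][term] += weight
    return {
        label: {term: w / sizes[label] for term, w in vec.items()}
        for label, vec in sums.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(tokens, centroids):
    """Assign the document to the class with the most similar center vector."""
    vec = tf_vector(tokens)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Toy training data (assumed, for illustration only).
training = [
    (["game", "play", "win"], "blocked"),
    (["game", "bet", "win"], "blocked"),
    (["course", "exam", "library"], "allowed"),
    (["lecture", "exam", "course"], "allowed"),
]
centroids = train_centroids(training)
```

Training only requires one pass over the labelled documents and classification only one similarity computation per class, which keeps the per-page cost low.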
Keywords/Search Tags: campus network, content filtering, data capture, text classification