
Web Information Filtering Technology Research And Application Based On Mutual Information

Posted on: 2013-12-14
Degree: Master
Type: Thesis
Country: China
Candidate: W H Wang
Full Text: PDF
GTID: 2248330362466498
Subject: Computer application technology
Abstract/Summary:
In today’s highly developed information society, people enjoy the convenience of information technology, such as online shopping, online banking and telecommuting. Meanwhile, a variety of illegal information, such as pornography, violence, reactionary content and feudal superstition, reaches people through the network. This illegal information not only harms people mentally and physically, but also deepens the negative effects of the network culture industry on society. Filtering it out has become a top priority, and how to filter illegal information effectively has become a research focus for many experts and scholars.

In information theory, mutual information is a measure of the statistical correlation between two random variables. In text classification, the mutual information between a feature word and a category is largest when the word belongs to that category, so mutual information is used to measure the correlation between feature words and categories: the greater the mutual information, the stronger the correlation, and the smaller the mutual information, the weaker the correlation. Because mutual information requires no prior assumptions or restrictions, it is widely used in word segmentation, image processing, text classification and other areas. This paper therefore takes mutual information as the measure of correlation between a tested text and a theme, and presents the research and application of web filtering technology based on the mutual information model. The paper combines the construction of a training corpus, text vector representation, feature extraction and Resnik’s selectional preference model to build a filtering system for illegal web information. The details are as follows.

First, building a training corpus is the most basic premise of applying a statistical model. In this paper, we use two regular expressions to extract the title and body text from web pages, and then manually check and process the extracted information to obtain a corpus of moderate size and even distribution.

Secondly, another fundamental problem of the statistical model is how to represent text content effectively. This paper performs word segmentation and filters out stop words. Each remaining word is treated as one item of a vector, so that a text can be represented by its feature vector. To improve computing speed and filtering efficiency, this paper designs an algorithm that simplifies the vector space according to the CHI (chi-square) model, yielding a feature vector of suitable dimension.

Following that, this paper improves the average mutual information and uses it to calculate the average mutual information between the tested vector and each topic feature vector. These values are then compared with a preset threshold. If any value is greater than the threshold, the tested text is filtered; if all values are less than the threshold, the tested text is considered legal and is presented to the user.

Finally, dynamically updating the feature items of the feature vector is also an important part of the illegal web filtering system. Based on Resnik’s selectional preference model, this paper designs an algorithm that dynamically updates the feature items of the feature vector to solve this problem.

On this theoretical basis, the paper designs and builds a filtering system for illegal web information and carries out a series of experiments with it. Experimental results show that the system executes quickly and achieves a good filtering effect.
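The threshold step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's method: the abstract does not give the improved average mutual information formula, so the code below assumes plain pointwise mutual information estimated from document frequencies. The function names, the toy corpus and the threshold value are all hypothetical, and terms never seen in the category are scored 0 as a simplification.

```python
import math
from collections import Counter

def term_category_mi(docs, labels, category):
    """Pointwise MI between each term and `category`, from document frequencies.

    Assumption: standard PMI, log(P(t, c) / (P(t) * P(c))), not the
    improved measure the thesis describes.
    """
    n = len(docs)
    n_c = sum(1 for y in labels if y == category)
    df = Counter()    # number of documents containing the term
    df_c = Counter()  # number of documents in `category` containing the term
    for tokens, y in zip(docs, labels):
        for t in set(tokens):
            df[t] += 1
            if y == category:
                df_c[t] += 1
    # P(t, c) / (P(t) * P(c)) simplifies to n_tc * n / (df[t] * n_c).
    return {t: math.log(n_tc * n / (df[t] * n_c)) for t, n_tc in df_c.items()}

def average_mi(tokens, mi):
    """Mean MI over the distinct tokens of a text.

    Simplification: terms never seen in the category contribute 0.
    """
    terms = set(tokens)
    if not terms:
        return 0.0
    return sum(mi.get(t, 0.0) for t in terms) / len(terms)

def should_filter(tokens, mi, threshold):
    """Filter the text when its average MI with the topic exceeds the threshold."""
    return average_mi(tokens, mi) > threshold

# Hypothetical, already segmented toy corpus.
docs = [["casino", "bet", "win"], ["bet", "odds"],
        ["weather", "sunny"], ["news", "weather", "win"]]
labels = ["illegal", "illegal", "legal", "legal"]
mi = term_category_mi(docs, labels, "illegal")
```

With this toy corpus, `should_filter(["bet", "casino"], mi, 0.3)` flags the text, while `should_filter(["weather", "sunny"], mi, 0.3)` lets it through. A real system would compute one score per topic vector and filter when any score exceeds the threshold, as the abstract describes.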
Keywords/Search Tags:Mutual Information, Web Filtering, Corpus, Feature Extraction