
Research On Web Content Filtering Based On Concepts Of Collection

Posted on: 2011-05-01
Degree: Master
Type: Thesis
Country: China
Candidate: A T Wang
Full Text: PDF
GTID: 2178330332475371
Subject: Information networks and security
Abstract/Summary:
With the spread and development of the Internet, human society is becoming an information society, and the Internet has become an important part of people's daily information exchange. Web content is richer than before, covers a wider range of areas, and takes more diverse forms, such as text, images, video, and audio. Text, the most common form, is the main carrier of content. As computers and the Internet are more widely adopted, and as processing moves from data through information to knowledge, the demands on the depth and breadth of language processing keep growing. Because text is so important, sensitive information is more likely to be embedded in it and to harm people's lives, and even society as a whole. This thesis analyzes sensitive information and studies filtering methods, with the aim of securing network information through filtering.

Most earlier web filtering algorithms are based on statistical filtering or keyword filtering. Such algorithms are relatively simple and fast, but they have shortcomings: they process content only mechanically, ignore the semantic constraints within the text, and cannot effectively identify information that carries a semantic bias, so the filtering results are not ideal. Improving filtering accuracy therefore requires judging semantic tendency and understanding what the writer actually means.

Using HowNet and classification algorithms, this thesis proposes a web filtering algorithm based on concept collections (sets of concepts). Because Internet resources are rich and open, the texts collected from websites are first preprocessed, including word segmentation and part-of-speech tagging, to prepare for the later steps. The proposed concept collection algorithm is then applied to compute and match the similarity of each set.
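The set-matching step described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `TOY_CONCEPTS` lexicon is a hypothetical stand-in for HowNet, the template sets are invented, and Jaccard similarity is swapped in as a simple set measure. The abstract does not state which similarity measure the thesis actually uses.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical toy lexicon standing in for HowNet: word -> concept labels.
TOY_CONCEPTS: Dict[str, Set[str]] = {
    "army": {"military", "organization"},
    "missile": {"military", "weapon"},
    "election": {"politics", "event"},
    "singer": {"entertainment", "human"},
    "concert": {"entertainment", "event"},
}


def concept_set(words: Iterable[str]) -> Set[str]:
    """Union of the concepts of all words; unknown words contribute nothing."""
    concepts: Set[str] = set()
    for word in words:
        concepts |= TOY_CONCEPTS.get(word, set())
    return concepts


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; defined as 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_template(
    words: Iterable[str], templates: Dict[str, Set[str]]
) -> Tuple[str, Dict[str, float]]:
    """Score the document's concept set against every template set."""
    doc = concept_set(words)
    scores = {name: jaccard(doc, tpl) for name, tpl in templates.items()}
    return max(scores, key=scores.get), scores


# Usage with invented template sets (assumed already segmented input words).
templates = {
    "military": {"military", "weapon", "organization"},
    "entertainment": {"entertainment", "human", "event"},
}
label, scores = best_template(["army", "missile", "the"], templates)
print(label, scores)
```

A real system would take its words from the segmentation and part-of-speech tagging step and its concepts from HowNet. The toy dictionary only shows the shape of the comparison.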
The ideas or intent of a piece of information are often best expressed by a verb or an adjective, while negation words and adverbs are also particularly important. These words are therefore matched against the constructed emotion dictionary, then classified and compared, to determine whether the text contains sensitive information that should be filtered.

Finally, simulation experiments verify the feasibility of the improved method. Information from three areas (politics, military, and entertainment) was collected to compute and match templates. The results demonstrate that the improved algorithm is feasible and that it filters web content better than the conventional method. The new model detected some sensitive information, and the test results were then analyzed: precision and recall differ across types of information.
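For reference, precision and recall for a filter can be computed as in the sketch below. It uses the standard definitions: precision is the share of filtered items that are truly sensitive, and recall is the share of truly sensitive items that were filtered. The function name and the sample labels are hypothetical and do not come from the thesis experiments.

```python
from typing import Sequence, Tuple


def precision_recall(
    predicted: Sequence[bool], actual: Sequence[bool]
) -> Tuple[float, float]:
    """Return (precision, recall) for boolean 'sensitive' labels.

    predicted[i] is True when the filter flagged document i;
    actual[i] is True when document i is truly sensitive.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)      # true positives
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)  # false positives
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Invented labels for four documents, for illustration only.
p, r = precision_recall([True, True, False, False], [True, False, True, False])
print(p, r)
```

Computing the two figures separately for each category (politics, military, entertainment) makes the per-category differences noted above directly visible.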
Keywords/Search Tags:Web filtering, HowNet, semantic similarity, emotional tendency, KNN classification, the concept of a collection