Technology Of Sensitive Information's Automatic Extraction In Blog Texts

Posted on: 2009-08-26
Degree: Master
Type: Thesis
Country: China
Candidate: W X Zhu
GTID: 2178360242976821
Subject: Communication and Information System
Abstract/Summary:
With the rapid development of information technology and the information industry in recent years, the number of Internet applications has grown day by day. Blogs appeared in Western countries in the 1990s and had become fashionable on the Internet by 2001. In 2002, blogging was introduced to China, and within five years it attracted nearly 50 million people; one in four Chinese netizens is a blogger. The blog has become the fourth-largest medium worldwide. As online information crime has become rampant, a large amount of research has been devoted to network and system security, but the content security of Internet media has received attention only in recent years. On a huge open information source such as the blogosphere, once sensitive information spreads out of control, Internet users will be strongly affected and society will suffer great losses. To protect the stability of the country and to shield network users from harmful messages, necessary measures must be taken to monitor and control this kind of information in blog text. At the same time, Web service organizations should be provided with techniques and services for controlling access to this information. Research on advanced text information control technology is therefore an urgent and important task.

This thesis draws on knowledge of natural language understanding, Chinese information processing, and related fields, and combines it with the research on text information processing carried out in our laboratory. We propose building a decision tree based on the attributes of blog text, and use it to automatically extract unknown sensitive information from blog text.

The thesis first introduces the development of blogs and presents several examples of sensitive information in blog text, in order to analyze the significance of text information filtering. The state of research in China and abroad is also reviewed. It then covers the technology of Chinese text preprocessing, representation, and classification: automatic Chinese word segmentation, vector representation of text, feature extraction, feature dimension reduction, and feature weight calculation. Several classic text classification methods are introduced, along with common algorithms for new word discovery. Next, we introduce methods for extracting text and useful attributes from web pages, as well as a technique that uses the components of Chinese characters to handle the character-splitting problem. However, the speed at which monitoring and control technology is applied gives rise to a new problem, so we turn to a new technique that builds a decision tree based on the attributes of blog text in order to discover unknown sensitive texts. We explain the concept of a decision tree and some useful methods for constructing one; here we adopt the ID3 algorithm and present several improved versions of it. Finally, we show the flow chart of the whole system and explain the work of each part. The system is implemented with an improved ID3 algorithm and compared with existing technology. The results are encouraging. The thesis closes with conclusions on the above research and proposes measures for problems that may arise in later work.
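As an illustration of the decision-tree idea described above, the following is a minimal sketch of the basic ID3 procedure: at each node it picks the attribute with the highest information gain and recurses on the resulting subsets. It is not the improved ID3 variant used in the thesis, whose details the abstract does not give. The attribute names (`split_chars`, `many_links`, `length`), the labels, and the toy data are all hypothetical and chosen only to make the example run.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Reduction in entropy obtained by splitting on one attribute."""
    total = len(rows)
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def id3(rows, labels, attrs):
    """Build a tree as nested dicts; leaves are class labels."""
    if len(set(labels)) == 1:          # pure node: stop
        return labels[0]
    if not attrs:                      # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    rest = [a for a in attrs if a != best]
    tree = {best: {}}
    for value in set(row[best] for row in rows):
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        tree[best][value] = id3([r for r, _ in pairs],
                                [l for _, l in pairs], rest)
    return tree


def classify(tree, row, default=None):
    """Walk the tree for one sample; unseen values fall back to default."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr].get(row[attr], default)
    return tree


# Hypothetical toy data: each row describes attributes of one blog post.
ROWS = [
    {"split_chars": "yes", "many_links": "no",  "length": "short"},
    {"split_chars": "yes", "many_links": "yes", "length": "long"},
    {"split_chars": "no",  "many_links": "yes", "length": "short"},
    {"split_chars": "no",  "many_links": "no",  "length": "long"},
    {"split_chars": "no",  "many_links": "yes", "length": "long"},
    {"split_chars": "yes", "many_links": "no",  "length": "long"},
]
LABELS = ["sensitive", "sensitive", "normal", "normal", "normal", "sensitive"]

tree = id3(ROWS, LABELS, ["split_chars", "many_links", "length"])
print(tree)
print(classify(tree, {"split_chars": "yes", "many_links": "yes",
                      "length": "short"}))
```

In this toy set the hypothetical `split_chars` attribute separates the two classes perfectly, so ID3 selects it at the root and stops after one split. On real blog data the tree would be deeper, and the choice of attributes would matter far more than the algorithm itself.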
Keywords/Search Tags: Blog, Unknown sensitive information, Decision tree, ID3 algorithm, Bayesian