
Study on a HowNet Ontology Based Text Categorization Algorithm and Its Application

Posted on: 2010-03-01
Degree: Master
Type: Thesis
Country: China
Candidate: X C Wang
GTID: 2178360275451088
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, text resources have become increasingly rich. The Internet is now the world's largest information store and has gradually become an important new source of public information for public security organs. Faced with such a broad array of text data, however, public security organs cannot rely on manual analysis alone. Text classification is an automatic technique that organizes document information in an orderly way and can greatly improve the efficiency of public security officers. Traditional text classification algorithms, however, suffer from high dimensionality, sparseness, and polysemous (multi-sense) words, and they ignore the semantic links between words, so they cannot meet the present needs of the Public Security Department.

In this context, this thesis studies text classification algorithms and applies the results to the public security intelligence classification system of a province. First, the definition, general process, and common algorithms of text categorization are introduced, together with a summary of the current state of research. Second, to obtain more accurate concept features for a text, a word sense disambiguation algorithm based on HowNet and weighted context (HCWSD) is proposed. Third, to resolve the problems of traditional text classification algorithms, a HowNet ontology based text classification algorithm (HOTC) is proposed. Finally, the HCWSD and HOTC algorithms are applied to the classification system of the Public Security Intelligence System.

The main work of the thesis is as follows:

(1) A word sense disambiguation algorithm based on HowNet and weighted context (HCWSD) is proposed. It selects the concept of a polysemous word by computing the weighted semantic relevance between its candidate senses and the senses of its context words, which removes the ambiguity in real time without corpus training. This overcomes shortcomings of traditional algorithms, such as ignoring the effect of a context word's distance when computing the semantic relation between words, and unreasonable calculation schemes.

(2) A concept similarity calculation method based on HowNet and statistics is proposed. The method makes full use of both HowNet and corpus statistics, so it accounts for the fact that word similarity differs from corpus to corpus. A revised formula for text semantic similarity is also proposed to overcome the disadvantages of the traditional method.

(3) Considering the shortcomings of traditional classification algorithms, a HowNet ontology based text classification algorithm (HOTC) is proposed. The algorithm first uses HCWSD to disambiguate polysemous words, which addresses the polysemy problem. It then represents the text by the disambiguated concepts, which addresses high dimensionality and sparseness. Finally, it classifies the text by text semantic similarity, which takes the semantic links between words into account.

(4) The HCWSD and HOTC algorithms proposed in this thesis are applied to the public security intelligence classification subsystem. Practical application shows that the system obtains better text classification results.
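The distance-weighting idea in item (1) can be illustrated with a minimal Python sketch. It is not the thesis's HCWSD formula, which the abstract does not give. The `1 / distance` weight, the `window` parameter, the function names, and the toy relevance table (a stand-in for a HowNet-derived relevance measure) are all assumptions made for illustration.

```python
from typing import Callable, Dict, Sequence

# Toy relevance scores standing in for a HowNet-based semantic relevance
# measure. The values and the sense labels are invented for illustration.
TOY_RELEVANCE: Dict[tuple, float] = {
    ("bank/finance", "money"): 0.9,
    ("bank/finance", "river"): 0.1,
    ("bank/river", "money"): 0.1,
    ("bank/river", "river"): 0.9,
}


def toy_relevance(sense: str, context_word: str) -> float:
    """Look up the toy relevance; unknown pairs count as unrelated."""
    return TOY_RELEVANCE.get((sense, context_word), 0.0)


def disambiguate(
    tokens: Sequence[str],
    target_index: int,
    senses: Sequence[str],
    relevance: Callable[[str, str], float] = toy_relevance,
    window: int = 3,
) -> str:
    """Pick the sense with the highest distance-weighted context relevance.

    Each context word inside the window contributes relevance / distance,
    so nearer words count more (an assumed weighting, not the thesis's).
    """
    start = max(0, target_index - window)
    end = min(len(tokens), target_index + window + 1)
    scores: Dict[str, float] = {}
    for sense in senses:
        total = 0.0
        for i in range(start, end):
            if i == target_index:
                continue  # the target word is not its own context
            distance = abs(i - target_index)
            total += relevance(sense, tokens[i]) / distance
        scores[sense] = total
    return max(scores, key=scores.get)


senses = ["bank/finance", "bank/river"]
# "money" is adjacent to "bank", "river" is two positions away.
near_money = disambiguate(["money", "bank", "far", "river"], 1, senses)
# Here "river" is adjacent and "money" is two positions away.
near_river = disambiguate(["river", "bank", "far", "money"], 1, senses)
```

In both toy sentences the unweighted relevance sums for the two senses are equal, so only the distance weight separates them: the sense related to the adjacent context word is chosen.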
Keywords/Search Tags:text categorization, HowNet, word sense disambiguation, concept features, semantic similarity