
Research Of Hierarchy Document Classification

Posted on: 2008-06-24
Degree: Master
Type: Thesis
Country: China
Candidate: S Li
Full Text: PDF
GTID: 2178360212993742
Subject: Computer system architecture
Abstract/Summary:
With the rapid growth of information on the Internet, information processing has become an increasingly necessary tool for accessing useful information. Text classification, which assigns texts to classes according to their content under a given class system, is one of its most important research areas. Since the 1990s the Internet has grown so dramatically that it now contains a huge amount of raw information, including text, sound, and images. How to extract the most valuable information from this huge, disordered body of text is one of the goals of information processing. Recently, automatic text classification, combined with search engines and with information pushing, delivery, and filtering, has effectively improved information services.

Automatic text classification is the problem of automatically assigning predefined categories to free-text documents. Over its history it has moved from rule-based methods to statistics-based methods, and it has now entered a phase that combines the two.

This thesis covers the following:

First, we give a general introduction to the concepts, methods, categories, and applications of document classification, and we design and implement a simple document classification system.

Second, we propose a hierarchical text classification model to address the shortcomings of the traditional methods. In this approach, the predefined topic categories are organized as a tree according to given hierarchical relations, and the classification task is divided into sub-tasks that correspond to the hierarchy structure. Each internal node of the hierarchy has a classifier trained on the samples. Through these hierarchical classifiers, a new document is classified into one leaf node of the hierarchy, starting from the root. In other words, the classes form a tree, and all the training documents in a class are combined into a single class-document. To construct the class models, only the class-documents attached to the same node in the same layer need to be compared. To classify a document, a matching process is performed hierarchically from the root node toward the leaf nodes until a corresponding subclass is found.

Finally, the experiments show that the classification precision of this method is close to that of the traditional methods and that it greatly improves the efficiency of document classification.
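The top-down procedure described above can be illustrated with a minimal sketch. The abstract does not state which features or similarity measure the thesis uses, so everything here is an assumption chosen for illustration: the toy hierarchy `TREE`, the training texts in `TRAIN`, raw term counts as features, and cosine similarity as the matching function. It should be read as one possible instance of the idea, not as the author's implementation.

```python
import math
from collections import Counter

# Hypothetical class tree: each internal node maps to its child classes.
TREE = {
    "root": ["science", "sports"],
    "science": ["physics", "biology"],
    "sports": ["football", "tennis"],
}

# Hypothetical training documents, attached to the leaf classes only.
TRAIN = {
    "physics": ["quantum energy particle field", "particle energy wave"],
    "biology": ["cell gene protein", "gene species cell evolution"],
    "football": ["goal team league match", "team goal striker"],
    "tennis": ["serve racket match set", "racket serve court"],
}


def tokens(text):
    """Split text into lowercase whitespace-separated terms."""
    return text.lower().split()


def class_document(node):
    """Merge every training document under `node` into one bag of words."""
    bag = Counter()
    if node in TRAIN:  # leaf class: merge its own documents
        for doc in TRAIN[node]:
            bag.update(tokens(doc))
    else:  # internal class: merge the class-documents of its children
        for child in TREE[node]:
            bag.update(class_document(child))
    return bag


def cosine(a, b):
    """Cosine similarity between two term-count vectors (assumed measure)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, root="root"):
    """Route a document from the root to a leaf, comparing siblings only.

    Returns the chosen leaf, the path taken, and the number of
    class-document comparisons performed. Ties go to the first child.
    """
    doc = Counter(tokens(text))
    node, path, comparisons = root, [], 0
    while node in TREE:  # stop once a leaf class is reached
        children = TREE[node]
        comparisons += len(children)
        node = max(children, key=lambda c: cosine(doc, class_document(c)))
        path.append(node)
    return node, path, comparisons


if __name__ == "__main__":
    print(classify("the team scored a goal in the match"))
```

At each level the document is compared only with the class-documents of the current node's children. In this toy tree that is 4 comparisons, the same as comparing against all 4 leaves directly. The saving appears on larger trees: with branching factor 10 and depth 3, the top-down route needs about 30 comparisons, whereas a flat comparison against every leaf needs 1000.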
Keywords/Search Tags:text categorization, hierarchy, accuracy, efficiency