Research On A Concept Vector Model Of Documents Based On Ontology

Posted on: 2008-01-27
Degree: Master
Type: Thesis
Country: China
Candidate: S Deng
Full Text: PDF
GTID: 2178360212995657
Subject: Computer application technology
Abstract/Summary:
The automated classification of texts into pre-specified categories has progressed rapidly over the last ten years, owing to the increased availability of documents in digital form and the ensuing need to organize them. Machine learning techniques are used in this process to build a classifier automatically by learning the characteristics of the categories from a set of previously classified documents.

The vector space model (VSM) is a conventional text classification model that represents documents as vectors in a multidimensional space. Once key words have been extracted from a document collection, each document is represented as a vector of weighted key-word frequencies. In the traditional VSM, relevance judgment rests on the assumption that documents are related to each other only if they share key words. The difficulty is that most key words have multiple meanings, while some concepts can be described by more than one key word. In addition, traditional text categorization (TC), which uses the key words occurring in a document to determine its class, has two main flaws: key words carry little category information, and the high dimensionality of the feature space causes data sparsity. Phrases can be used to relieve the first problem, but they aggravate the second. The usual remedy for the second is dimensionality reduction (DR), which removes features that have little or no effect and represents the text with the remaining features. According to the nature of the resulting terms, DR is of two types: (1) term selection, in which the resulting terms are a subset of the original terms; and (2) term extraction, in which the resulting terms are not a subset of the original terms. Concept-based TC uses concepts rather than key words as feature items and takes into account the hypernym-hyponym relations between synonym sets.
This approach preserves most of the text information and addresses both problems at the same time.

The main work of this paper is as follows:

1. We established a text categorization model based on ontology.
2. We proposed an ontology-based method for obtaining concepts.
3. The keywords are matched against the attribute terms of the concepts in the given ontology, requiring exact matches. Based on the number of matching terms, a weight can be defined for each concept. We also considered applying the proposed theory to calculating the degree of similarity between documents within a fixed domain. Together, these make up the concept vector model.
4. We introduced KNN and SVM and implemented them for the proposed document classification.

We tested the proposed model empirically on documents to demonstrate the general applicability of the method. The experimental results show that a domain ontology can be incorporated to assist document classification. For some data sets, the concept vector model (CVM) is more effective than the term-based vector space model (VSM). Moreover, a performance comparison of SVM and KNN on the CVM shows that SVM achieves better performance than KNN, and SVM training is therefore performed over the reduced training set.
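The keyword-to-concept matching described in item 3 can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the toy ontology `ONTOLOGY`, the function names, and the weighting rule (matched key words divided by document length) are all assumptions made for illustration, since the abstract only states that a concept's weight depends on the number of exactly matching terms. Cosine similarity is likewise an assumed choice of document similarity measure.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy ontology: each concept maps to its attribute terms.
ONTOLOGY = {
    "vehicle": {"car", "automobile", "truck", "engine"},
    "finance": {"bank", "loan", "interest", "stock"},
}


def keyword_vector(tokens):
    """Traditional VSM-style representation: raw key-word frequencies."""
    return Counter(tokens)


def concept_vector(tokens, ontology=ONTOLOGY):
    """Map key words to concepts by exact match against attribute terms.

    Assumed weighting: number of matching key words / document length.
    """
    counts = Counter(tokens)
    vector = {}
    for concept, terms in ontology.items():
        matched = sum(counts[t] for t in terms)  # exact matches only
        if matched:
            vector[concept] = matched / len(tokens)
    return vector


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    doc_a = ["car", "engine", "repair"]
    doc_b = ["automobile", "truck", "service"]
    # No shared key words, so the key-word vectors are orthogonal.
    print(cosine(keyword_vector(doc_a), keyword_vector(doc_b)))
    # Both documents map to the "vehicle" concept, so similarity is high.
    print(cosine(concept_vector(doc_a), concept_vector(doc_b)))
```

In this toy case the two documents share no key words, so the key-word representation treats them as unrelated, while the concept representation places both on the same "vehicle" dimension. The concept vectors also have at most one dimension per ontology concept, which is how a concept-based representation can reduce dimensionality.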
Keywords/Search Tags:Text classification, Ontology, Concept Hierarchy, Feature selection, Concept vector model (CVM), SVM, KNN