Research On Text Classification Based On Domain Ontology

Posted on: 2013-10-11
Degree: Master
Type: Thesis
Country: China
Candidate: T T Wei
Full Text: PDF
GTID: 2248330371489411
Subject: Computer application technology
Abstract/Summary:
With the rapid growth in the volume and spread of information, the number of documents on the Internet is increasing exponentially. People are surrounded by huge amounts of information, and it is difficult for them to find what they are interested in accurately and quickly. How to organize and accurately classify these massive data has therefore become a significant issue in information technology. Text classification is a key technology for organizing and managing information and helps users locate information of interest quickly, so the demands placed on it keep rising.

Traditional text classification algorithms use keywords as features to build a vector space model in which the keywords are treated as mutually independent, with no semantic association. Such a model loses much of the semantic information and cannot express the main content of the text, which degrades the classification results. With the emergence of the Semantic Web, semantic-based text classification has become an effective way to improve on the traditional method. Because an ontology is well structured and can express more semantic information, ontologies are widely used in semantic text classification. Although semantic text classification algorithms have developed rapidly, they still face several problems: the use of ontologies mostly stays at the dictionary level, and the semantic relationships among terms are not studied in depth; the concept vector space model does not include ontology properties and instances, so it cannot express the semantics of the text well; and most algorithms ignore the most useful capability of an ontology, namely reasoning. After a thorough survey of the state of traditional and ontology-based classification methods, this thesis proposes a method to address these problems. The main work is as follows:

(1) The thesis introduces the relevant background on ontologies, the principles and methods of ontology construction, and the description language OWL 2. It describes in detail the construction of a tourism domain ontology. It also introduces the key technologies of the text classification process, including its definition, text representation, feature extraction and selection, and commonly used classifiers.

(2) The primary problem of text categorization is the text representation model. To address the lack of semantic information in existing text representation methods, a new text representation model is proposed. It is based on concept mapping: terms are mapped not only to ontology concepts but also to ontology properties and instances, so that the semantic relations among terms are fully expressed. Because an ontology concept carries more semantic information than a common term, the traditional statistics-based weight calculation cannot fully express the significance of a concept in the text. The thesis therefore proposes an improved weighting method that assigns more weight to concepts that carry more semantic information.

(3) Traditional machine learning methods are computationally complex and are sensitive to the size of the training set. The thesis puts forward a method that takes the structure of the ontology as the classification standard, realized by combining the semantic correlation degree between concepts and terms with the reasoning abilities of the ontology. Each text is classified under an ontology concept as an individual of that concept. Experiments show that this method achieves higher accuracy than the Bayes and KNN classifiers.

(4) To make full use of the ontology during classification and thereby improve classification efficiency, ontology reasoning rules are combined into the classification method. The ontology reasoning mechanism can provide implicit knowledge and semantic information for classification, so it can reduce the cost of calculation. Experiments show that the classification method combined with reasoning rules achieves higher efficiency.

(5) Taking the tourism domain as the application background, web pages related to travel information are collected with a crawler and classified using the proposed method. The specific process of each module is described, including preprocessing, generation of the concept vector space model, and the classification process. Finally, the experiments are analyzed and summarized.
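The concept-mapping representation and weighting idea in item (2) can be illustrated with a minimal sketch. The abstract does not give the thesis's actual formulas, so everything below is an assumption made for illustration: the toy tourism ontology, the `RICHNESS` scores (a stand-in for "how much semantic information a concept carries"), the boost formula with its `alpha` parameter, and the highest-weight `classify` rule, which stands in for the semantic-correlation and reasoning step of items (3) and (4).

```python
from collections import Counter

# Hypothetical toy ontology for the tourism domain (not from the thesis).
# Each surface term is mapped to the concept it denotes; instances and
# properties are mapped back to the concept they belong to.
TERM_TO_CONCEPT = {
    "hotel": "Accommodation",   # concept label
    "hostel": "Accommodation",  # concept label
    "hilton": "Accommodation",  # instance (assumed)
    "price": "Accommodation",   # property (assumed)
    "museum": "Attraction",     # concept label
    "louvre": "Attraction",     # instance (assumed)
    "train": "Transport",       # concept label
}

# Assumed "semantic richness" per concept, e.g. the number of properties
# and instances attached to it in the ontology.
RICHNESS = {"Accommodation": 4, "Attraction": 2, "Transport": 1}


def concept_vector(tokens, alpha=0.5):
    """Map tokens to ontology concepts and weight each concept.

    weight(c) = tf(c) * (1 + alpha * richness(c) / max_richness)

    The boost formula is an illustrative assumption, not the thesis's.
    """
    counts = Counter(
        TERM_TO_CONCEPT[t] for t in tokens if t in TERM_TO_CONCEPT
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    max_rich = max(RICHNESS.values())
    return {
        c: (n / total) * (1 + alpha * RICHNESS[c] / max_rich)
        for c, n in counts.items()
    }


def classify(tokens):
    """Assign the text to its highest-weighted concept, or None."""
    vec = concept_vector(tokens)
    return max(vec, key=vec.get) if vec else None


doc = "the hilton hotel near the louvre museum lists its price".split()
print(concept_vector(doc))
print(classify(doc))
```

In this toy run, the terms for instances ("hilton", "louvre") and a property ("price") contribute to their owning concepts instead of being discarded, which is the point of mapping beyond concept labels alone. A real system would load the concepts, properties, and instances from the OWL 2 ontology instead of a hard-coded dictionary.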
Keywords/Search Tags: text classification, semantic correlation, domain ontology, ontology reasoning