
Study On Chinese Text Classification Combined With Ontology

Posted on: 2012-01-22
Degree: Master
Type: Thesis
Country: China
Candidate: L L Fu
Full Text: PDF
GTID: 2178330338996679
Subject: Computer system architecture
Abstract/Summary:
With the rapid development of the Internet and computer network technology, the amount of electronic text has increased dramatically, and the way people acquire information has gradually changed. It is difficult for people to find the information they are interested in on the Internet accurately and quickly, so how to organize this massive amount of data has become a significant issue in information technology. Text classification is one of the key technologies for solving this problem. Additionally, as a fundamental technology of information retrieval, information push, information filtering and search engines, text classification has significant academic value and broad application prospects.
Reducing the number of features efficiently is a key problem in text classification, where data has hundreds or thousands of features. The goal of feature reduction is to select the features that provide strong discriminating power. The existing feature selection methods in the VSM (Vector Space Model), such as Information Gain and Mutual Information, are based on statistical information about word frequency, but they ignore a feature's semantic relevance to the class label and do not take feature redundancy into consideration, which limits the number of useful features in the feature subset. These are intrinsic weaknesses of these feature selection methods. Based on a detailed analysis of existing feature selection methods, this thesis studies a new feature selection method combined with the Chinese ontology HowNet.
An ontology provides a shared vocabulary that can be used to model a domain, that is, the types of objects and concepts that exist, and their properties and relations. Introducing ontology into text classification research is a new approach to the problem that existing feature description methods lack semantic information.
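As an illustration of the word-frequency-based selection methods mentioned above, the following is a minimal sketch of Information Gain scoring over a toy corpus. It is not the thesis's implementation: the function names, the toy documents and the representation of each document as a set of tokens are all assumptions made for this example, and real Chinese text would first need word segmentation.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(docs, labels, term):
    """IG(term) = H(C) - P(t) * H(C | t) - P(not t) * H(C | not t)."""
    with_term = Counter(l for d, l in zip(docs, labels) if term in d)
    without_term = Counter(l for d, l in zip(docs, labels) if term not in d)
    n = len(docs)
    n_with = sum(with_term.values())
    n_without = n - n_with
    h_class = entropy(list(Counter(labels).values()))
    h_conditional = (
        (n_with / n) * entropy(list(with_term.values()))
        + (n_without / n) * entropy(list(without_term.values()))
    )
    return h_class - h_conditional

def select_top_k(docs, labels, k):
    """Return the k terms with the highest Information Gain (ties: alphabetical)."""
    vocab = sorted({t for d in docs for t in d})
    ranked = sorted(vocab, key=lambda t: information_gain(docs, labels, t), reverse=True)
    return ranked[:k]

# Toy corpus (hypothetical): each document is a set of tokens.
docs = [
    {"football", "match", "win"},
    {"football", "team", "win"},
    {"stock", "market", "win"},
    {"stock", "bank", "market"},
]
labels = ["sports", "sports", "finance", "finance"]

top_terms = select_top_k(docs, labels, 3)
```

Note that the score depends only on how term occurrence co-varies with the class labels. Two near-synonymous terms can both rank highly, which is the redundancy problem the abstract describes.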
After studying the Chinese ontology HowNet, a new feature selection method is proposed. It is a hybrid feature reduction method based on concept mapping. First, a subset of features is selected by a traditional feature selection method. Second, every feature in the subset is mapped into the semantic dictionary of HowNet and then selected again to form the final feature subset. The approach not only removes redundant features but also preserves the semantic information of the text. Meanwhile, it keeps the strengths of the VSM in text representation and computation, so the selected features describe the text more accurately.
Finally, to validate the effectiveness of the new feature selection method, an experimental text classification system was designed and implemented, and comparison experiments were carried out against existing feature selection methods. The results were evaluated by recall, precision and F1 value. They showed that the new method based on concept mapping performs better than existing methods based on word frequency in text classification.
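The second stage of the two-step method described above (mapping already-selected features to concepts and selecting again) might look roughly like the sketch below. The abstract does not give the actual mapping or reselection rule, so everything here is an assumption: `TOY_CONCEPTS` is a hypothetical stand-in for a HowNet lookup, and merging features that share a concept is just one plausible way to remove redundancy while keeping a vector-space representation.

```python
from collections import Counter

# Hypothetical stand-in for a HowNet lookup: word -> concept label.
# A real system would query the HowNet semantic dictionary instead.
TOY_CONCEPTS = {
    "soccer": "sport",
    "football": "sport",
    "basketball": "sport",
    "stock": "finance",
    "shares": "finance",
}

def map_to_concepts(features, concept_of):
    """Map each stage-one feature to its concept, merging features that share one.

    Words without an entry in the dictionary are kept as their own concept.
    """
    mapped = []
    for word in features:
        concept = concept_of.get(word, word)
        if concept not in mapped:  # drop redundant features with the same concept
            mapped.append(concept)
    return mapped

def vectorize(tokens, concept_features, concept_of):
    """Build a count vector over the concept features for one tokenized document."""
    counts = Counter(concept_of.get(t, t) for t in tokens)
    return [counts[c] for c in concept_features]

# Assumed output of a traditional (stage-one) feature selection step.
stage_one = ["soccer", "football", "stock", "shares", "weather"]

concept_features = map_to_concepts(stage_one, TOY_CONCEPTS)
vector = vectorize(["football", "soccer", "shares", "rain"], concept_features, TOY_CONCEPTS)
```

In this toy setting, five word-level features collapse to three concept-level dimensions, and a document that mentions either "football" or "soccer" contributes to the same dimension.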
Keywords/Search Tags: Text Classification, Ontology, Feature Reduction, HowNet