Research On Text Representation Using The Wikipedia Category

Posted on: 2012-04-01
Degree: Master
Type: Thesis
Country: China
Candidate: J Wang
Full Text: PDF
GTID: 2268330425491621
Subject: Computer software and theory
Abstract/Summary:
With the development of Internet technology, the volume of machine-readable text keeps growing. With so much information available, it is hard to find the resources one actually needs quickly and accurately. How to classify, organize, and manage texts and resources has therefore become an important practical topic. Text categorization aims to classify documents automatically. However, current text categorization systems still have many problems.

Recently, many researchers have studied the text classification task in depth, including text representation, feature selection, weight calculation, and the classifier. This dissertation focuses on how to learn text features for text categorization and presents a text representation method that uses Wikipedia categories as text features.

In traditional text representation, a text is expressed as a feature vector whose features are words. This is the bag-of-words (BOW) model. The method is simple and feasible, and most text classification systems use it. However, with words as features the feature space has too many dimensions and limited expressive power. The method presented here maps each word of a text to one of the Wikipedia categories, which enhances the representation ability of the features and reduces the dimension of the text vector.

To address the limited coverage of Wikipedia categories, an approach based on clustering techniques is presented that maps unknown words into predefined categories. A text categorization system is then developed that uses these learned Wikipedia categories as text features.

The experimental results show that text representation based on Wikipedia categories has an obvious dimension-reduction effect and achieves a 5.14% F1 improvement over the BOW-based method when 700 features are used for text classification.
The global-wiki-based method gathers all words into a small number of Wikipedia categories. It effectively reduces the size of the text feature space without lowering text classification performance.
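
The contrast between word features and category features can be illustrated with a minimal sketch. It is not the thesis's implementation: the word-to-category table, the function names, and the sample tokens are all hypothetical, and unknown words are simply skipped here rather than assigned to categories by clustering as the abstract describes.

```python
from collections import Counter

# Hypothetical word-to-category table for illustration only; a real system
# would derive this mapping from Wikipedia's category structure.
WORD_TO_CATEGORY = {
    "football": "Sports",
    "tennis": "Sports",
    "goal": "Sports",
    "bank": "Finance",
    "loan": "Finance",
    "stock": "Finance",
}


def bow_features(tokens):
    """Bag-of-words representation: one feature per distinct word."""
    return Counter(tokens)


def category_features(tokens, word_to_category=WORD_TO_CATEGORY):
    """Category representation: one feature per mapped category.

    Words without a mapping are skipped in this sketch (an assumption);
    the clustering step for unknown words is not implemented here.
    """
    return Counter(
        word_to_category[t] for t in tokens if t in word_to_category
    )


tokens = "football goal tennis bank loan goal".split()
bow = bow_features(tokens)       # features are the distinct words
cat = category_features(tokens)  # features are the categories they map to
```

In this toy input several distinct words collapse into the same category, so the category vector has fewer dimensions than the BOW vector, which is the dimension-reduction effect the abstract reports.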
Keywords/Search Tags: text classification, Wikipedia category, text representation, Wikipedia