Text Classification Based On Gravitation Field Model

Posted on: 2013-03-17
Degree: Master
Type: Thesis
Country: China
Candidate: J Li
Full Text: PDF
GTID: 2248330362474399
Subject: Computer software and theory
Abstract/Summary:
With the development of the IT industry, and especially the widespread use of the Internet, information processing has become a critical technology for obtaining useful information. Automatic text classification, which assigns a text document to predefined categories based on its content, is an important research topic in information processing.

This paper first describes the framework of a Chinese automatic text classification system, then introduces several techniques associated with Chinese text categorization, then reviews several classic text classification methods together with their advantages and disadvantages, and finally presents a new method for Chinese text classification.

Inspired by the gravitation field, this paper designs a new method, Virtual Kernel (VK for short), for text classification under the gravitation field model. The main idea of the method is as follows. First, in the training stage, the goal is to build a classification model by obtaining a "category virtual kernel" for each category. This is done by computing the field strength (a specified mathematical transform of term frequency) at each term point from the labeled texts in the training set. Second, in the test stage, when an unlabeled text arrives, the method computes, according to a given rule, the attractive force that each category virtual kernel exerts on it. Finally, the method assigns the unlabeled text to the class with the strongest attractive force. By its very nature, this approach assigns an unlabeled text to a category according to the relationships between text features and the predefined categories.

To verify the utility of the proposed approach, this paper carries out a set of experiments in which texts are represented with the vector space model and VK is compared with two classic text classification methods, kNN and Naive Bayes, each under two feature selection methods, DF and IG. The experiments are run on two corpora, and the following conclusions are drawn:

1) VK is superior to kNN and Naive Bayes in terms of both running time and classification performance.
2) VK still gives satisfactory classification results on an unbalanced corpus.
3) VK classification does not depend strongly on the size of the training set.
4) In terms of feature selection method alone, IG is superior to DF.
5) The quality of the corpus directly affects classification results.
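
The training and test stages described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the actual transform of term frequency or the rule for the attractive force, so everything concrete here is an assumption made for illustration: the field strength is taken to be `log(1 + tf)`, each kernel is scaled to unit length, and the attractive force is a dot product between the kernel and the text's own field strengths. The function names (`field_strength`, `train_virtual_kernels`, `attractive_force`, `classify`) and the toy data are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter, defaultdict


def field_strength(tf):
    # Assumed transform of term frequency; the thesis only says
    # "a specified mathematical transform", so log(1 + tf) is a stand-in.
    return math.log(1.0 + tf)


def train_virtual_kernels(labeled_texts):
    """Build one 'category virtual kernel' per category.

    labeled_texts: iterable of (tokens, category) pairs.
    Returns {category: {term: normalized field strength}}.
    """
    term_counts = defaultdict(Counter)
    for tokens, category in labeled_texts:
        term_counts[category].update(tokens)

    kernels = {}
    for category, counts in term_counts.items():
        strengths = {t: field_strength(tf) for t, tf in counts.items()}
        # Assumed normalization: scale each kernel to unit length so that
        # categories with more training text do not win by size alone.
        norm = math.sqrt(sum(s * s for s in strengths.values())) or 1.0
        kernels[category] = {t: s / norm for t, s in strengths.items()}
    return kernels


def attractive_force(kernel, tokens):
    # Assumed force rule: dot product between the kernel and the
    # field strengths of the terms in the unlabeled text.
    counts = Counter(tokens)
    return sum(kernel.get(t, 0.0) * field_strength(tf)
               for t, tf in counts.items())


def classify(kernels, tokens):
    # Assign the text to the category whose kernel attracts it most strongly.
    return max(kernels, key=lambda c: attractive_force(kernels[c], tokens))


# Toy training set (hypothetical data, already tokenized).
training = [
    (["goal", "match", "team", "goal"], "sports"),
    (["team", "coach", "match"], "sports"),
    (["stock", "market", "bank"], "finance"),
    (["bank", "loan", "market", "stock"], "finance"),
]
kernels = train_virtual_kernels(training)
print(classify(kernels, ["match", "goal", "coach"]))
print(classify(kernels, ["loan", "bank"]))
```

Under these assumptions, training makes a single pass over the labeled texts and classification costs one sparse dot product per category. That is consistent with the running-time advantage over kNN claimed in the conclusions, though the sketch does not reproduce the thesis experiments.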
Keywords/Search Tags: text classification, feature selection, vector space model, gravitation field model, virtual kernel