Research On KNN Text Classification

Posted on: 2011-04-27
Degree: Master
Type: Thesis
Country: China
Candidate: C Yan
Full Text: PDF
GTID: 2178360302994928
Subject: Computer application technology
Abstract/Summary:
Text classification originated in text information retrieval systems. It lets people determine whether a text is what they need without reading every document one by one, by assigning each text to the appropriate class from a set defined by the user in advance. The technology can be applied to text mining, intelligent search engines, and personal software assistants.

In this paper, we analyze the classification ideas, text preprocessing methods, text feature vector selection, and feature representation methods of various algorithms, and study K-Nearest Neighbor (KNN) text classification in depth.

First, we study the traditional TFIDF formula in depth, analyze its deficiencies, and on that basis propose separate improvements to the TF function and the IDF function to make the formula better suited to KNN text classification.

Second, to address the boundary problem in KNN text classification, we define class density and class imbalance. Class density is determined by the standard deviation and is used to decide whether a class is imbalanced. A shrink factor is then introduced to shrink the density of each imbalanced class until it is no longer imbalanced. Traditional KNN is then adapted with decision functions based on the shrunken class density. We call this version the self-adaptive K-Nearest Neighbor classifier with weight adjustment.

Finally, we present a density-based method for reducing the amount of training data, which addresses the classification speed problem. Texts in the central area of each class are heavily reduced. This lowers the computational cost of the KNN algorithm and thus improves the classifier's speed in the classification stage.

Experimental results show that the methods proposed in this paper are more effective than traditional ones, with higher precision, recall, and speed.
Keywords/Search Tags: Text classification, Vector space model, Feature selection, Weight, Class imbalance
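
For orientation, the traditional baseline that the abstract sets out to improve (TFIDF weighting in a vector space model, followed by KNN voting) might be sketched as below. This is only an illustrative sketch under stated assumptions, not the thesis's method: the abstract does not give the improved TF and IDF functions, the shrink factor, or the density-based reduction rule, so none of them appear here. The function names, the smoothed IDF variant, the similarity-weighted vote, and the toy corpus are all hypothetical choices made for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Whitespace tokenizer; a real system would add stop-word removal, stemming, etc.
    return text.lower().split()


def fit_idf(token_lists):
    # Smoothed inverse document frequency (an assumed, commonly used variant).
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    # Raw term count times IDF, L2-normalized; terms unseen in training are dropped.
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def cosine(a, b):
    # Both vectors are already unit length, so the dot product is the cosine.
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def knn_classify(query, train_vecs, labels, idf, k=3):
    # Rank training documents by similarity and let the k nearest vote,
    # each vote weighted by its similarity to the query.
    q = tfidf_vector(tokenize(query), idf)
    scored = sorted(
        ((cosine(q, v), y) for v, y in zip(train_vecs, labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    votes = defaultdict(float)
    for sim, label in scored:
        votes[label] += sim
    return max(votes, key=votes.get)


# Toy corpus (hypothetical data, for illustration only).
TRAIN = [
    ("the team won the football match", "sports"),
    ("the striker scored a late goal", "sports"),
    ("the goalkeeper saved the penalty", "sports"),
    ("the bank raised interest rates", "finance"),
    ("stock markets fell on inflation fears", "finance"),
    ("investors bought shares in the bank", "finance"),
]
TOKENS = [tokenize(text) for text, _ in TRAIN]
LABELS = [label for _, label in TRAIN]
IDF = fit_idf(TOKENS)
TRAIN_VECS = [tfidf_vector(tokens, IDF) for tokens in TOKENS]

if __name__ == "__main__":
    print(knn_classify("the striker scored in the football match", TRAIN_VECS, LABELS, IDF))
    print(knn_classify("investors worry about interest rates", TRAIN_VECS, LABELS, IDF))
```

Every query in this baseline is compared against every training vector, which is the classification-stage cost that the abstract's density-based training set reduction aims to lower.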