Research On KNN Text Classification

Posted on:2011-04-27

Degree:Master

Type:Thesis

Country:China

Candidate:C Yan

GTID:2178360302994928

Subject:Computer application technology

Abstract/Summary:

Text classification originated in text information retrieval systems. It lets people determine whether a text is what they need without reading every document one by one, by assigning each text to the appropriate class from a set defined in advance by the user. The technology can be applied to text mining, intelligent search engines, and personal software assistants.

In this thesis, we analyze the classification ideas, text preprocessing methods, feature selection methods, and feature representation methods used by various algorithms, and study K-Nearest Neighbor (KNN) text classification in depth.

First, we examine the traditional TF-IDF formula, analyze its deficiencies, and on that basis propose separate improvements to the TF function and the IDF function that make the weighting better suited to KNN text classification.

Second, to address the boundary problem in KNN text classification, we define class density and class imbalance. Class density is determined from the standard deviation and is used to decide whether a class is imbalanced. A shrink factor is then introduced to shrink the density of each imbalanced class until it is no longer imbalanced. The traditional KNN decision function is then adapted using the shrunken class densities. We call this version the self-adaptive K-Nearest Neighbor classifier with weight adjustment.

Finally, we present a density-based method for reducing the amount of training data, which addresses the classification speed problem. Texts in the central area of each class are heavily reduced, which lowers the amount of computation the KNN algorithm requires and so speeds up the classification stage.

Experimental results show that the methods proposed in this thesis outperform the traditional ones, with higher precision, recall, and classification speed.
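For orientation, the traditional baseline that the abstract starts from (TF-IDF weighting in a vector space model, followed by a KNN vote) can be sketched as below. This is a minimal illustration only, not the thesis's improved method: the function names, the toy training data, the cosine similarity measure, and the similarity-weighted vote are all assumptions chosen for the example, and the improved TF/IDF functions, the shrink factor, and the density-based training set reduction are not reproduced here.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors ({term: weight}) for tokenized documents.

    Uses the textbook form tf * log(N / df); the thesis's modified TF and
    IDF functions are not given in the abstract and are not modeled here.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * idf[t] for t, c in tf.items()})
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_classify(query_tokens, vectors, labels, idf, k=3):
    """Label a query by a similarity-weighted vote of its k nearest neighbors."""
    tf = Counter(query_tokens)
    # Terms never seen in training have no IDF and are ignored.
    query = {t: (c / len(query_tokens)) * idf[t] for t, c in tf.items() if t in idf}
    scored = [(cosine(query, vec), label) for vec, label in zip(vectors, labels)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter()
    for similarity, label in scored[:k]:
        votes[label] += similarity
    return votes.most_common(1)[0][0]


# Hypothetical toy corpus, already tokenized.
train_docs = [
    ["stock", "market", "rises"],
    ["market", "price", "falls"],
    ["team", "wins", "match"],
    ["player", "scores", "match"],
]
train_labels = ["finance", "finance", "sports", "sports"]

train_vectors, train_idf = tfidf_vectors(train_docs)
finance_guess = knn_classify(["market", "price"], train_vectors, train_labels, train_idf)
sports_guess = knn_classify(["team", "match"], train_vectors, train_labels, train_idf)
```

Every query in this sketch is compared against every training vector, which is the cost that the training data reduction described above aims to lower.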

Keywords/Search Tags:

Text classification, Vector space model, Feature selection, Weight, Class imbalance

Related items

1	Extraction Of Chi-square Features In Chinese Text Classification And Improvement Of TF-IDF Weight
2	Research On Text Classification In The Application Of Customer Complaint Prediction Of Operator
3	Research On Classification Module Of Core Competency Assessment System
4	Term Weight-Based Chinese Text Classification Algorithm
5	Sparse Bayesian Model Based On Text Classification
6	On Research For Chinese Automatic Text Categorization Technology Based On VSM Model And Feature Selection
7	Research On Feature Selection Of Text Classification
8	Research On Classification And New Class Recognition Of Complaint Text In Business
9	Research Of Text Categorization Based On Vector Space Model
10	Research On Chinese Text Categorization Algorithms Based On Technology Text