
Research Of Feature Extraction Technology In KNN Text Classification Based On The Genetic Algorithm

Posted on: 2012-04-16
Degree: Master
Type: Thesis
Country: China
Candidate: Y N Liu
GTID: 2178330338493798
Subject: Computer Science and Technology
Abstract/Summary:
In data mining, text classification is an important area of research. The KNN algorithm, one of the best text classification methods in the vector space model (VSM), is a simple, instance-based, non-parametric method. The main steps of text classification are text segmentation, feature extraction (feature weight calculation and feature word selection), building the feature model, and training the classifier. Feature extraction is the core of a text classification system, and the feature extraction method has a major impact on the classification result. Traditional feature extraction methods are statistical; commonly used ones include DF (Document Frequency), ECE (Expected Cross Entropy), OR (Odds Ratio), IG (Information Gain), MI (Mutual Information), and the χ² statistic (CHI). These methods have several deficiencies: when categories and features are very unevenly distributed, they cannot deal effectively with low-frequency words, and improper handling of individual features leads to locally optimal solutions. In addition, whether the KNN classification algorithm selects an appropriate K value also affects the quality of the classification results, and a fixed K value ignores the influence of the categories and of the number of training documents. If K is too large, a text tends to be assigned to the class that contains more texts, and classification performance is poor; if K is too small, the text has too few neighbors, which reduces classification accuracy.

To address these problems in feature extraction, this paper proposes a new feature extraction technique based on the genetic algorithm. In this method, the χ² statistic of each word, which measures the strength of the correlation between the word and a category, is introduced into the feature vector and used as the initial population for the genetic algorithm's heuristic search.
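As a rough illustration of the χ² (CHI) statistic mentioned above, the following Python sketch computes the standard textbook score of a word for a category from a 2x2 table of document counts. The function name `chi_square` and its arguments are illustrative assumptions and do not come from the thesis; the genetic-algorithm stage that the thesis builds on top of these scores is not shown.

```python
def chi_square(a, b, c, d):
    """Textbook chi-square score of a term for a category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that do not contain the term
    d: documents outside the category that do not contain the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A degenerate table (an empty row or column) carries no evidence.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


# A term that appears in every document of the category and nowhere else
# is strongly associated with it; a term spread evenly scores zero.
strong = chi_square(10, 0, 0, 10)
neutral = chi_square(5, 5, 5, 5)
print(strong, neutral)
```

A higher score indicates a stronger dependence between the word and the category, which is why such scores are a natural starting point for a feature search.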
At the same time, this paper presents a new fitness function and new crossover rules suited to the nature of feature extraction. Experiments show that the new feature extraction technique based on the genetic algorithm can select features that accurately characterize a category of text.

To overcome the defect of a fixed K value, this paper proposes a KNN classification algorithm that obtains the K value dynamically; experimental results show that this algorithm achieves high performance.

In summary, this paper proposes a new feature extraction technique based on the genetic algorithm and applies it in the improved KNN text categorization algorithm. Experimental results on data sets show that combining the improved feature extraction algorithm with dynamically obtained K values effectively yields high-quality classification results.
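The weakness of a fixed K value can be seen in a minimal Python sketch of plain KNN over VSM vectors with cosine similarity and majority voting. This is ordinary fixed-K KNN, not the dynamic K method of the thesis, whose details are not given here; the toy vectors, labels, and function names are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_predict(train, query, k):
    """Return the majority label among the k training texts most similar to query."""
    ranked = sorted(train, key=lambda item: cosine(item[0], query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical, unbalanced training set: 2 "sports" texts, 5 "finance" texts.
train = [
    ([1.0, 0.1], "sports"), ([0.9, 0.2], "sports"),
    ([0.1, 1.0], "finance"), ([0.2, 0.9], "finance"), ([0.0, 1.0], "finance"),
    ([0.3, 1.0], "finance"), ([0.1, 0.8], "finance"),
]
query = [1.0, 0.2]  # clearly closest to the "sports" texts

print(knn_predict(train, query, 3))  # the two sports neighbors win the vote
print(knn_predict(train, query, 7))  # every text votes, so the larger class wins
```

With K equal to the size of the training set, the vote depends only on class sizes, which is the bias toward larger classes described in the abstract.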
Keywords/Search Tags: Text classification, Feature selection, Genetic algorithm, KNN classification algorithm, K-valued