Research Of Feature Extraction Technology In KNN Text Classification Based On The Genetic Algorithm

Posted on:2012-04-16

Degree:Master

Type:Thesis

Country:China

Candidate:Y N Liu

GTID:2178330338493798

Subject:Computer Science and Technology

Abstract/Summary:

Text classification is an important research area in data mining. The KNN algorithm, one of the best text classification methods in the vector space model (VSM), is a simple, instance-based, non-parametric method. Its main steps are text segmentation, feature extraction (feature weight calculation and feature word selection), building the feature model, and training the classifier. Feature extraction is the core of a text classification system, and the extraction method has a major impact on the classification results. Traditional feature extraction methods are statistical; commonly used ones include DF (Document Frequency), ECE (Expected Cross Entropy), OR (Odds Ratio), IG (Information Gain), MI (Mutual Information), and the χ² statistic (CHI). These methods have several deficiencies: when categories and features are very unevenly distributed, they cannot handle low-frequency words effectively, and mishandling individual features can lead to a locally optimal solution. In addition, the choice of K in the KNN classification algorithm affects the quality of the results: a fixed K ignores the influence of the categories and of the number of training documents. If K is too large, a text tends to be assigned to the class that contains more texts, and classification performance is poor; if K is too small, the text has too few neighbors, which reduces classification accuracy.

To address these problems in feature extraction, this paper proposes a new feature extraction technique based on the genetic algorithm. In this method, the χ² statistic of each word, which measures the strength of the correlation between the word and a category, is introduced into the feature vector, and the result serves as the initial population for the genetic algorithm's heuristic search.
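
As a rough illustration of the χ² (CHI) score discussed above, the following Python sketch computes the statistic for every term against one target category, using document counts only. The function names `chi_square` and `chi_square_scores`, and the input format (one token list per document), are assumptions made for this example; they are not taken from the thesis.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square statistic for one term and one category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table (e.g. the term occurs in every document).
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def chi_square_scores(docs, labels, target):
    """Return {term: chi-square score} for the category `target`.

    docs is a list of token lists, labels the matching category names.
    """
    n_docs = len(docs)
    n_target = sum(1 for y in labels if y == target)
    df_all = Counter()     # document frequency over all documents
    df_target = Counter()  # document frequency inside the target category
    for tokens, y in zip(docs, labels):
        for term in set(tokens):  # count each term once per document
            df_all[term] += 1
            if y == target:
                df_target[term] += 1
    scores = {}
    for term, df in df_all.items():
        a = df_target[term]
        b = df - a
        c = n_target - a
        d = n_docs - n_target - b
        scores[term] = chi_square(a, b, c, d)
    return scores
```

A term that appears only in the target category receives a high score, while a term spread evenly across categories scores zero. Scores of this kind could then be used to bias the initial population of a genetic search toward strongly category-correlated words.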
This paper also proposes a new fitness function and a new crossover rule suited to the nature of feature extraction. Experiments show that the new genetic-algorithm-based feature extraction technique can select text features that accurately characterize a category.

To overcome the drawback of a fixed K value, this paper proposes a KNN classification algorithm that obtains K dynamically; experimental results show that this dynamic-K KNN algorithm achieves high performance.

Finally, the genetic-algorithm-based feature extraction technique is applied in the improved KNN text categorization algorithm. Experimental results on data sets show that combining the improved feature extraction algorithm with dynamically obtained K values effectively yields high-quality classification results.
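
The abstract does not state how K is obtained dynamically, so the Python sketch below is only a hypothetical illustration of the idea. It assumes a simple rule, the rounded square root of the training-set size capped at the size of the smallest class, so that a large class cannot outvote a small one purely by count. The helper names `cosine`, `dynamic_k`, and `knn_classify`, and the sparse-dictionary vector format, are likewise assumptions rather than the thesis's method.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def dynamic_k(train_labels, k_min=1):
    """Hypothetical rule: sqrt of the training size, capped by the smallest class."""
    smallest_class = min(Counter(train_labels).values())
    k = round(math.sqrt(len(train_labels)))
    return max(k_min, min(k, smallest_class))


def knn_classify(query, train_vecs, train_labels, k=None):
    """Similarity-weighted KNN vote; K is chosen by dynamic_k when not given."""
    if k is None:
        k = dynamic_k(train_labels)
    scored = sorted(
        ((cosine(query, vec), label) for vec, label in zip(train_vecs, train_labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter()
    for sim, label in scored[:k]:
        votes[label] += sim  # closer neighbors contribute more to the vote
    return votes.most_common(1)[0][0]
```

Weighting each vote by similarity is a common way to soften the effect of an imperfect K; other dynamic rules, for example a separate K per category, would fit the same interface.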

Keywords/Search Tags:

Text classification, Feature selection, Genetic algorithm, KNN classification algorithm, K value

Related items

1	Text Classification Feature Down-dimensional Method Of Research
2	Research Of Feature Selection And Weighting Algorithm In Text Classification System Based On SVM
3	Research On Feature Selection Algorithm And Classification Algorithm In Chinese Text Categorization
4	Improvement On Feature Selection And Classification Algorithm For Text Classification
5	Genetic Algorithm Based Model Parameter Selection And Its Application In Text Classification
6	Research On Text Feature Selection And Classification Algorithm Based On CHI And KNN
7	Research And Application Of Text Classification Based On Heuristic Algorithm
8	Research On Text Classification Method Based On Improved Feature Selection Algorithm
9	Research And Improvement Of Feature Selection Algorithm In Chinese Text Classification
10	Research And Improvement Of Automatic Classification Technology For Chinese Text