Research on Feature Dimensionality Reduction Methods for Text Classification

Posted on: 2011-06-21
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Wang
GTID: 2208330332973078
Subject: Computer software and theory
Abstract/Summary:
Text classification automatically assigns documents to one or more categories. It plays an important role in text mining and document management: it can largely resolve information disorder, so that users can find useful information conveniently. In text classification, a high-dimensional feature space degrades classification performance. How to reduce the dimension of the feature space while improving efficiency and accuracy has become a serious problem. Feature selection is therefore a very important step in text classification. It finds useful (relevant) features to describe an application domain, removes noise words (such as function words, adjectives, etc.), and reduces the number of features in the feature set. A good feature selection method can find the smallest feature set that still represents a given dataset and thereby improve the efficiency and accuracy of classification.

To meet the requirements of accuracy and efficiency in text feature selection, we made an in-depth study of feature selection techniques and propose two feature selection methods. Finally, we compare several existing feature selection methods in terms of efficiency and time consumption. This thesis includes the following contents:

1) A text feature reduction method based on combining similar clusters. In the ant colony algorithm, several types of ant colonies with different velocities search independently and in parallel. The method analyzes the correlation between the clusters obtained by the different searches, computes the intersection of the corresponding clusters across these search results, and then passes the feature items through a secondary selection using an improved mutual information method. This completes the text feature reduction effectively while keeping information loss to a minimum. Experiments show that this method achieves good dimensionality reduction and improves the efficiency of clustering.

2) A text feature selection method combining a genetic algorithm with the K-means algorithm. High-dimensional features in text categorization reduce the accuracy and efficiency of classification, and traditional feature reduction methods cannot find the best feature set. The genetic algorithm offers global optimization and high search efficiency, but its strong randomness affects its convergence rate. The K-means algorithm is highly efficient. This thesis presents a novel method that combines the genetic algorithm with K-means for optimal feature reduction. Through the operations of selection, crossover, and mutation, the optimal feature set can be obtained rapidly. Experimental results show that this method can effectively improve the accuracy of feature selection.
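
As an illustration of the mutual information step mentioned in the first method, the following is a minimal sketch of ranking terms by standard pointwise mutual information, MI(t, c) = log(P(t, c) / (P(t) P(c))), estimated from document counts. It is not the improved variant proposed in the thesis, whose details are not given in this abstract. The function names `mutual_information` and `select_features`, and the choice to score each term by its maximum over categories, are assumptions made for this example.

```python
import math
from collections import Counter


def mutual_information(docs, labels):
    """Estimate pointwise MI for every (term, category) pair that co-occurs.

    docs   -- list of token lists, one per document
    labels -- list of category names, aligned with docs
    Probabilities are estimated from document frequencies.
    """
    n = len(docs)
    term_df = Counter()       # number of documents containing the term
    cat_df = Counter(labels)  # number of documents in the category
    joint_df = Counter()      # documents in the category containing the term

    for tokens, label in zip(docs, labels):
        for term in set(tokens):  # count each term once per document
            term_df[term] += 1
            joint_df[(term, label)] += 1

    scores = {}
    for (term, cat), joint in joint_df.items():
        # MI(t, c) = log( P(t, c) / (P(t) * P(c)) )
        scores[(term, cat)] = math.log(joint * n / (term_df[term] * cat_df[cat]))
    return scores


def select_features(docs, labels, k):
    """Return the k terms with the highest MI over any category.

    Taking the maximum over categories is one common convention,
    assumed here for illustration. Ties are broken alphabetically.
    """
    best = {}
    for (term, _cat), score in mutual_information(docs, labels).items():
        best[term] = max(best.get(term, float("-inf")), score)
    ranked = sorted(best, key=lambda t: (-best[t], t))
    return ranked[:k]
```

A term that appears equally often in every category, such as a function word, scores 0 and falls to the bottom of the ranking. Plain pointwise mutual information is also known to favor rare terms, which is one reason modified variants are often used in practice.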
Keywords/Search Tags: text classification, feature reduction, feature selection, ant colony algorithm, genetic algorithm, k-means algorithm