
Research On Feature Selection And Classification Based On Intelligent Optimization Algorithms

Posted on: 2015-03-14    Degree: Doctor    Type: Dissertation
Country: China    Candidate: J Li    Full Text: PDF
GTID: 1318330467482951    Subject: Computer software and theory

Abstract/Summary:
With the rapid development of information technology, data are being generated and accumulated at an unprecedented rate, and society has entered the era of big data. Data in this era increasingly exhibit huge volume and high dimensionality. Traditional classification methods often perform poorly on such highly complex data, making it hard to uncover the hidden information and rules the data contain. How to effectively select features from highly complex data and classify the data has therefore become one of the basic scientific problems of data processing in the big-data era.

For a computer, the difficulty of processing and recognizing high-dimensional, complex data lies mainly in feature selection and classifier design. Feature selection is the process of choosing the best feature subset from a set of features, or of reducing the dimensionality of the feature space by generating new features through a transformation. Classification constructs a model from a known data set and uses that model to predict the classes of unknown data.

Feature subset selection involves choosing both a subset-evaluation criterion and a search method. Subset evaluation follows one of two approaches, filtering or wrapping, and search methods divide into suboptimal and optimal methods. Feature extraction, in contrast, reduces or eliminates redundant information by choosing an appropriate transformation; transformations are either linear or nonlinear. Nonlinear feature extraction rests mainly on dimensionality-reduction theory and techniques dominated by manifold learning. Previous manifold-learning research has focused on analyzing data distributions and describing data better for dimensionality reduction and visualization; it is not closely tied to classification.

The Bayesian model grounded in statistics and the support vector machine (SVM) proposed by Vapnik are the two dominant classification models. Naive Bayes assumes that, given the class, all attributes of an instance are independent. With independent attributes, the parameters can be estimated separately for each attribute, which makes the classifier particularly suitable for problems with very many attributes; in real classification problems, however, this assumption usually does not hold. For the SVM, the penalty parameter C and the RBF kernel parameter sigma are the key parameters governing classification performance (illustrated in the short sketch at the end of this overview). Intelligent optimization algorithms, which emerged in the 1950s, simulate the behavior of organisms in nature to solve optimization problems and have been widely used in pattern recognition practice. They mainly include the genetic algorithm, particle swarm optimization, the differential evolution algorithm, the clonal selection algorithm, and so on.

To apply manifold learning to classification more effectively, this paper proposes a feature-extraction method called nonparametric discriminant multi-manifold learning (NDML) and applies it to the recognition of different manifolds. Intelligent optimization algorithms are then applied to classification with naive Bayes and the SVM.
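To make the influence of C and sigma concrete before the methods are described, here is a minimal Python sketch (my illustration, not code from the dissertation) that scores a small grid of (C, sigma) pairs by cross-validation. It assumes scikit-learn and a standard benchmark dataset, and converts sigma to SVC's gamma via gamma = 1 / (2 * sigma^2).

    # Illustrative only: how the RBF-SVM's C and sigma drive accuracy.
    # scikit-learn's SVC takes gamma, where gamma = 1 / (2 * sigma**2).
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)

    for C in (0.1, 1.0, 100.0):
        for sigma in (0.5, 5.0, 50.0):
            gamma = 1.0 / (2.0 * sigma ** 2)  # translate sigma into sklearn's gamma
            clf = make_pipeline(StandardScaler(), SVC(C=C, kernel="rbf", gamma=gamma))
            acc = cross_val_score(clf, X, y, cv=5).mean()
            print(f"C={C:<6} sigma={sigma:<5} mean CV accuracy={acc:.3f}")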
For naive Bayes classification, an optimization algorithm first searches the original attribute space for an optimal reduced attribute subset, and a naive Bayes classifier is then constructed on the resulting subset. For SVM classification, the parameters C and sigma are encoded into each individual, classification accuracy is taken as the optimization objective, and the optimal parameter combination is obtained. In addition, by designing an appropriate fitness function, the feature subset and the parameters C and sigma can be optimized at the same time, which both reduces the dimensionality of the feature subset and improves classification accuracy.

The specific contributions of this paper are:

1. The two kinds of feature selection, feature subset selection and feature extraction, are systematically surveyed. Feature subset selection involves choosing a subset-evaluation criterion (filtering or wrapping) and a search method (suboptimal or optimal), while transformations in feature extraction are either linear or nonlinear. The two stages of a classifier are introduced, various classifiers are compared, and two of them, naive Bayes and the support vector machine, are described in detail. The principles of the genetic algorithm, particle swarm optimization, differential evolution, and clonal selection are summarized, and their working processes are analyzed.

2. Since traditional manifold learning is not suited to multi-manifold recognition, a dimensionality-reduction method called nonparametric discriminant multi-manifold learning (NDML) is proposed and applied to the recognition of different manifolds. A novel nonparametric manifold-to-manifold distance is defined to characterize the separability between manifolds, and an objective function is then built to project the original data into a low-dimensional space in which the manifold-to-manifold distances are maximized and the locality of each manifold is preserved. These properties serve multi-manifold identification well (one simple stand-in for such a distance is sketched after this list).

3. To overcome the limitations of the naive Bayes independence assumption, intelligent optimization algorithms are used for feature selection (that is, finding the best subset), and improved naive Bayes classifiers are built on that basis. Three improved classifiers, based respectively on the genetic algorithm, particle swarm optimization, and differential evolution, are proposed and compared with the decision-tree algorithm and other classic algorithms (a wrapper-style sketch follows the list).

4. Optimization methods for the SVM penalty parameter C and the RBF kernel parameter sigma are proposed, based on particle swarm optimization and differential evolution (see the differential-evolution sketch after the list).

5. Since differential evolution converges slowly and its local search ability is not strong, two hybrid models are put forward in this paper.
The first hybrid model integrates opposition-based learning into the difference (mutation) phase to increase population diversity, and adopts a mixed competition between two adjacent generations in the selection phase to enhance convergence (the opposition step is sketched after this list). The second hybrid model combines differential evolution with the clonal selection algorithm, improving the overall fitness of the population while maintaining individual diversity. Both hybrid models are applied successfully to SVM parameter optimization.

6. A method that simultaneously optimizes the SVM parameters and selects features is put forward. By designing an appropriate encoding and fitness function, the method removes redundant features while improving classification accuracy (the joint encoding is sketched after this list).
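Contribution 2 hinges on a nonparametric manifold-to-manifold distance. The dissertation's exact definition is not reproduced here; the Python sketch below shows one simple nonparametric distance with the same flavor, the symmetrized mean nearest-neighbor distance between two sampled point clouds, purely as an illustration.

    # Illustrative only: a simple nonparametric manifold-to-manifold distance,
    # not the NDML definition from the dissertation.
    import numpy as np

    def manifold_to_manifold_distance(A, B):
        # A: (n, d) samples from one manifold; B: (m, d) samples from another.
        pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
        # Symmetrized mean nearest-neighbor distance between the two clouds.
        return 0.5 * (pairwise.min(axis=1).mean() + pairwise.min(axis=0).mean())

    # Toy usage: two concentric circles, i.e. two one-dimensional manifolds.
    rng = np.random.default_rng(0)
    t = rng.uniform(0.0, 2.0 * np.pi, 100)
    inner = np.c_[np.cos(t), np.sin(t)]
    outer = 3.0 * np.c_[np.cos(t), np.sin(t)]
    print(round(manifold_to_manifold_distance(inner, outer), 3))  # roughly 2.0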
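For contribution 3, the following hypothetical sketch illustrates the wrapper scheme in miniature: a binary genetic algorithm evolves attribute masks, and the fitness of a mask is the cross-validated accuracy of a naive Bayes classifier trained on the selected attributes. The dataset, population size, and mutation rate are arbitrary choices of mine, and scikit-learn's GaussianNB stands in for the dissertation's classifier.

    # Hypothetical sketch: GA-based attribute selection wrapped around naive Bayes.
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB

    rng = np.random.default_rng(0)
    X, y = load_breast_cancer(return_X_y=True)
    n_features = X.shape[1]

    def fitness(mask):
        # Fitness = mean CV accuracy of naive Bayes on the selected attributes.
        if not mask.any():
            return 0.0
        return cross_val_score(GaussianNB(), X[:, mask], y, cv=3).mean()

    pop = rng.integers(0, 2, size=(20, n_features)).astype(bool)  # random bit strings
    for generation in range(30):
        scores = np.array([fitness(ind) for ind in pop])
        parents = pop[np.argsort(scores)[::-1][:10]]              # truncation selection
        children = []
        for _ in range(10):
            a, b = parents[rng.integers(10)], parents[rng.integers(10)]
            cut = rng.integers(1, n_features)                     # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(n_features) < 0.02                  # bit-flip mutation
            children.append(child ^ flip)
        pop = np.vstack([parents, children])

    best = pop[np.argmax([fitness(ind) for ind in pop])]
    print("selected attributes:", np.flatnonzero(best), "accuracy:", fitness(best))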
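Contribution 4 tunes C and sigma with swarm and evolutionary search. The sketch below uses SciPy's off-the-shelf differential_evolution as a stand-in for the dissertation's own algorithm: each individual encodes (C, sigma), and the fitness is cross-validated accuracy (negated, since SciPy minimizes). The bounds and evaluation budget are assumed values.

    # Sketch: differential evolution over (C, sigma), accuracy as fitness.
    from scipy.optimize import differential_evolution
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)

    def negative_accuracy(params):
        C, sigma = params
        gamma = 1.0 / (2.0 * sigma ** 2)
        # SciPy minimizes, so return the negated mean CV accuracy.
        return -cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=3).mean()

    result = differential_evolution(negative_accuracy,
                                    bounds=[(0.01, 100.0),   # assumed range for C
                                            (0.01, 50.0)],   # assumed range for sigma
                                    maxiter=20, popsize=10, seed=0,
                                    polish=False)            # skip gradient polishing
    print("best C=%.3f  sigma=%.3f  accuracy=%.3f" % (*result.x, -result.fun))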
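The opposition-based learning step in the first hybrid model of contribution 5 can be illustrated as follows. This is a generic sketch of the standard opposition idea, not the dissertation's implementation: the opposite of a candidate x in [low, high] is low + high - x, and merging candidates with their opposites injects diversity before selection.

    # Generic sketch of opposition-based learning for a DE population.
    import numpy as np

    def opposition_step(pop, low, high, objective):
        opposite = low + high - pop                  # element-wise opposite population
        merged = np.vstack([pop, opposite])
        scores = np.array([objective(ind) for ind in merged])
        # Keep the len(pop) lowest-objective candidates (minimization).
        return merged[np.argsort(scores)[: len(pop)]]

    # Toy usage: minimize the sphere function over [-5, 5]^3.
    rng = np.random.default_rng(0)
    low, high = -5.0, 5.0
    pop = rng.uniform(low, high, size=(8, 3))
    pop = opposition_step(pop, low, high, lambda x: float(np.sum(x ** 2)))
    print(pop.round(2))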
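Finally, for contribution 6, the sketch below shows one plausible joint encoding: a single real-valued individual carries C, sigma, and a soft feature mask that is thresholded at 0.5, and the fitness trades classification accuracy against subset size. The penalty weight and the threshold are assumptions of mine, not the dissertation's fitness function.

    # Hypothetical joint encoding: individual = [C, sigma, mask genes...].
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)
    n_features = X.shape[1]

    def fitness(individual, feature_weight=0.01):
        C, sigma = individual[0], individual[1]
        mask = individual[2:] > 0.5                  # threshold soft genes into a mask
        if not mask.any():
            return 0.0
        gamma = 1.0 / (2.0 * sigma ** 2)
        acc = cross_val_score(SVC(C=C, gamma=gamma), X[:, mask], y, cv=3).mean()
        # Trade accuracy against subset size, so redundant features are dropped.
        return acc - feature_weight * mask.sum() / n_features

    rng = np.random.default_rng(0)
    individual = np.concatenate([[1.0, 5.0], rng.random(n_features)])
    print("fitness:", round(fitness(individual), 3))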
Keywords/Search Tags: feature extraction, classification, manifold learning, Naive Bayes, support vector machine, parameter optimization, differential evolution