
A Study On Feature Selection Algorithms Using Information Entropy

Posted on: 2011-05-27    Degree: Doctor    Type: Dissertation
Country: China    Candidate: H W Liu    Full Text: PDF
GTID: 1118360305453663    Subject: Computer software and theory
Abstract/Summary:
With the emergence and rapid development of new technologies, large volumes of data, which play an increasingly important role in everyday life, have been accumulated in various fields. Nowadays, how to extract important and useful information from this mass of data, and then take full advantage of it to achieve practical goals, has become a pressing problem. In this context, the concept of data mining was introduced: it refers to the process of discovering hidden and potentially useful information or knowledge from large volumes of noisy data. Data classification is one of the most extensively studied issues in data mining. Its main purpose is to learn patterns or behavior from known data so as to assist users in predicting or determining the patterns of unknown data. Since the features of these patterns carry a great deal of important information or knowledge, they form one of the theoretical bases for predicting the behavior patterns of unknown data.

However, many features in databases are redundant or irrelevant. They pose great challenges to traditional classification algorithms: they reduce the efficiency of learning algorithms, mislead the learning process, and ultimately yield models of poor quality that over-fit. Moreover, the more features a database has, the more noise it tends to contain, which places learning algorithms in an even more adverse situation. To tackle this problem, the concept of feature selection was introduced. It refers to the process of identifying, according to given measures, an optimal feature subset from the original feature space that retains most of the information in the data. Feature selection has become a hot topic in fields such as data mining, statistical pattern recognition, and machine learning; it attracts considerable attention from scholars and is also widely used in practice.

This study focuses on removing redundant or useless features from databases. In this thesis, the degree of relevance between features is measured with information measures whose bases are information entropy and its related concepts. The thesis proposes five feature selection algorithms based on such information measures to deal with prediction tasks in different classification settings. Specifically, the research work of this thesis is carried out in the following aspects.

Firstly, a new filter feature selection algorithm, called ISFS, is proposed. The idea of this method originates from clustering analysis; however, unlike traditional clustering algorithms, a data item in ISFS is a single feature rather than a sample of the dataset. The class feature, i.e., the class labels of the dataset, is treated as a special group called the label group, while the other features are organized into candidate groups and a selected group, which hold a single candidate feature and the features selected so far, respectively. Feature selection is thereby transformed into an agglomerative hierarchical clustering problem. During each clustering (selection) step, the selected group merges with the candidate group that has the greatest "distance" to the label group and the smallest "distance" to the selected group (a minimal sketch of one such greedy step is given below).
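The following sketch illustrates a greedy, cluster-style selection step of this general kind; it is not the thesis's exact ISFS procedure. It assumes discretized features stored in a NumPy matrix, uses scikit-learn's mutual_info_score as the relevance to the label group, and uses mean absolute correlation to the already selected features as the redundancy term; the scoring rule relevance minus redundancy is an illustrative convention only.

```python
# Illustrative sketch (not the thesis's exact ISFS): greedy selection of
# discretized features that balances relevance to the class labels against
# redundancy with the features already selected.
import numpy as np
from sklearn.metrics import mutual_info_score

def greedy_cluster_style_select(X, y, k):
    """Return k feature indices from a discretized feature matrix X (n x d)."""
    X, y = np.asarray(X), np.asarray(y)
    n, d = X.shape
    # Relevance to the "label group": mutual information between feature and labels.
    relevance = [mutual_info_score(X[:, j], y) for j in range(d)]
    selected = [int(np.argmax(relevance))]            # start from the most relevant feature
    candidates = [j for j in range(d) if j != selected[0]]
    while len(selected) < k and candidates:
        best_j, best_score = None, -np.inf
        for j in candidates:
            # Redundancy to the "selected group": mean absolute correlation.
            redundancy = np.mean([abs(np.corrcoef(X[:, j], X[:, s])[0, 1])
                                  for s in selected])
            score = relevance[j] - redundancy         # relevant to labels, non-redundant
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)                       # "merge" the winning candidate group
        candidates.remove(best_j)
    return selected
```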
This clustering procedure is repeated until the number of features in the selected group exceeds a specified threshold. To measure the "distance" between data items, ISFS adopts two information criteria: mutual information for the distance to the label group (the inter-class distance) and the correlation coefficient for the distance to the selected group (the intra-class distance). The aim of the clustering is to keep the intra-class distance of the selected group minimal while keeping its inter-class distance to the label group maximal, so that features with greater discriminative power are selected with higher priority throughout the clustering process.

Secondly, a generalized representation of information metrics is introduced. This generalized representation brings most of the information metrics used in feature selection algorithms into a unified framework. The relationships between this measure and existing ones are discussed in detail, and a unified framework for feature selection based on information measures is then given.

Because information entropy and mutual information can effectively measure the degree of uncertainty of a feature, they have been studied extensively in feature selection. However, the entropy or mutual information values used in existing selection algorithms may contain "false" information and therefore cannot exactly represent the degree of relevance between features. To this end, dynamic mutual information is introduced; it represents the relevance between features accurately and changes dynamically during the selection process. Unlike traditional mutual information, the dynamic value is estimated not on the whole sample space but on the as-yet unrecognized samples. Based on this concept, two feature selection algorithms are proposed: one using dynamic mutual information (DMIFS) and one using its conditional counterpart (CDMI). These algorithms work much like decision tree induction in classification: the samples that can be identified by the newly selected feature are removed from the sample space, and the mutual information of the candidate features is then re-estimated on the remaining samples.

Thirdly, based on data sampling techniques, a new framework for feature selection using a boosting technique is presented, and several related issues of this framework, such as updating strategies and error functions, are discussed. The purpose of using boosting is to overcome the negative effects caused by removing recognized samples when estimating the dynamic mutual information of features, and to alleviate the impact of noise on the whole selection process. During selection, the proposed method estimates mutual information by dynamically adjusting the weights of samples, so that the estimated values accurately measure the correlation between features as it changes along with the selection process; the feature selected at each step therefore reflects the current characteristics of the classification task. Unlike other boosting-based selection algorithms, the result of this framework is a feature subset rather than one or several classifiers.
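The sketch below illustrates the shared idea behind the dynamic and boosting-style criteria: mutual information is re-estimated only on samples that still carry weight, and samples already "recognized" by a selected feature are either removed (weight zero, DMIFS-like) or down-weighted (boosting-like). The function names, the purity-based notion of a "recognized" sample, and the down-weighting factor are assumptions made for illustration, not the thesis's exact formulas.

```python
# Illustrative sketch of dynamically re-estimated, sample-weighted mutual information.
import numpy as np

def weighted_mutual_info(x, y, w):
    """Mutual information between two discrete vectors, counting each sample by weight w."""
    x, y, w = np.asarray(x), np.asarray(y), np.asarray(w, dtype=float)
    total = w.sum()
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = w[(x == xv) & (y == yv)].sum() / total
            p_x = w[x == xv].sum() / total
            p_y = w[y == yv].sum() / total
            if p_xy > 0:
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi

def dynamic_select(X, y, k, downweight=0.0):
    """Greedy selection; downweight=0.0 removes recognized samples (DMIFS-like),
    0 < downweight < 1 keeps them with reduced influence (boosting-like)."""
    X, y = np.asarray(X), np.asarray(y)
    n, d = X.shape
    w = np.ones(n)
    selected, candidates = [], list(range(d))
    while len(selected) < k and candidates and w.sum() > 0:
        scores = {j: weighted_mutual_info(X[:, j], y, w) for j in candidates}
        best = max(scores, key=scores.get)
        selected.append(best)
        candidates.remove(best)
        # Treat samples whose value of the new feature maps to a single (pure)
        # class as "recognized", and reduce their weight for later estimates.
        for v in np.unique(X[:, best]):
            idx = X[:, best] == v
            if len(np.unique(y[idx])) == 1:
                w[idx] *= downweight
    return selected
```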
Moreover, its updating strategy does not rely on the error of a base classifier; that is, the information measure used in this method is independent of any base classifier.

Finally, for the small-sample, high-dimensional gene expression datasets found in bioinformatics, an ensemble feature selection algorithm named EGSG is proposed. This algorithm uses the Information Correlation Coefficient to measure the relevance between genes, and an approximate Markov blanket technique to divide the genes into several groups representing different biological functions. After this partition stage, a feature subset is formed by picking one representative gene from each group. Since the genes in the subset come from different groups, the subset has prediction capability similar to that of the original gene set.

Because a single gene subset has limited capability on high-dimensional gene expression datasets, EGSG exploits multiple gene subsets through an ensemble technique to further improve prediction capability in disease diagnosis. That is, EGSG obtains several gene subsets in the same way and then aggregates the base classifiers trained on these subsets into an overall classifier by majority voting (a minimal sketch of this grouping-and-voting scheme is given at the end of this summary). Compared with other ensemble methods, the diversity of EGSG comes from combining different gene subsets, where the subsets cover similar biological functions yet each has high discriminative capability.

Simulation experiments on public datasets demonstrate the performance and effectiveness of the selection algorithms proposed in this thesis: in most cases they effectively improve the efficiency and performance of learning algorithms for classification. Nevertheless, several problems remain. For example, ISFS is relatively inefficient, CDMI and DMIFS are sensitive to noise, the performance of the boosting algorithm depends on its parameter settings, and the results of EGSG are difficult to interpret. Future work will therefore address these issues in order to further improve the algorithms' performance and efficiency.
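The following sketch outlines an EGSG-like grouping-and-voting scheme, not the thesis's exact method: correlation-based grouping stands in for the approximate Markov blanket partition, one representative gene is drawn at random from each group to form each subset, and Gaussian naive Bayes base classifiers vote by majority. The threshold, the random choice of representatives, the base classifier, and the assumption of integer-coded class labels are all illustrative assumptions.

```python
# Illustrative sketch of an EGSG-like ensemble over gene groups.
import numpy as np
from sklearn.naive_bayes import GaussianNB

def group_genes(X, corr_threshold=0.8):
    """Greedily place highly correlated genes into the same group
    (a stand-in for the approximate Markov blanket partition)."""
    d = X.shape[1]
    corr = np.abs(np.corrcoef(X, rowvar=False))
    groups, assigned = [], np.zeros(d, dtype=bool)
    for j in range(d):
        if assigned[j]:
            continue
        members = [j] + [m for m in range(j + 1, d)
                         if not assigned[m] and corr[j, m] >= corr_threshold]
        for m in members:
            assigned[m] = True
        groups.append(members)
    return groups

def egsg_like_predict(X_train, y_train, X_test, n_subsets=5, seed=0):
    """Train base classifiers on several gene subsets and combine them by majority vote.
    Assumes class labels are integers 0..C-1."""
    rng = np.random.default_rng(seed)
    groups = group_genes(X_train)
    votes = []
    for _ in range(n_subsets):
        # One representative gene per group forms a subset with similar coverage.
        subset = [int(rng.choice(g)) for g in groups]
        clf = GaussianNB().fit(X_train[:, subset], y_train)
        votes.append(clf.predict(X_test[:, subset]))
    votes = np.array(votes)
    # Majority voting across the base classifiers trained on different subsets.
    return np.array([np.bincount(col).argmax() for col in votes.T])
```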
Keywords/Search Tags:Feature selection, Classification, Data mining, Machine learning, Information entropy, Mutual information, Learning algorithm, Ensemble learning, Clustering analysis, Boosting learning