
Research On Feature Selection And Classification Algorithms Based On Information Theory

Posted on: 2006-02-19
Degree: Master
Type: Thesis
Country: China
Candidate: L Zhang
Full Text: PDF
GTID: 2168360152994976
Subject: Agricultural mechanization project
Abstract/Summary:
With the development of computer science and technology, people have come to realize the importance of information. In the present era of knowledge explosion, there is an urgent need for methods to locate useful information in the vast amounts of data that computers provide, and data mining has emerged to meet this need. Over the past ten years, great progress has been made in data mining research; the adoption of data mining software has greatly improved people's ability to grasp and use computerized information, bringing substantial benefits.

Data mining is an analytic process designed to explore data (usually large amounts of data, typically business- or market-related) in search of consistent patterns and systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data. Feature selection and classification are two important aspects of data mining. Feature selection chooses a best subset of features from the primitive set; this subset should retain all, or most of, the class information present in the primitive set. The task of classification is to find a concept description for each category that represents the whole of the information in the data.

In this paper we study algorithms for feature selection and classification. For feature selection, we first study how mutual information can weight the relations among features and between a feature and the class. Based on mutual information, we then carry out concrete, in-depth research on eliminating redundant features during feature selection.
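As background for the mutual-information weighting of feature–class relations described above, here is a minimal sketch of estimating entropy and mutual information from discrete samples (plain NumPy; the variable names are illustrative, not from the thesis):

```python
import numpy as np
from collections import Counter

def entropy(xs):
    """Empirical Shannon entropy, in bits, of a discrete sample."""
    counts = np.array(list(Counter(xs).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

feature = [0, 0, 1, 1]
label = [0, 0, 1, 1]
print(mutual_information(feature, label))       # 1.0 bit: feature determines the label
print(mutual_information([0, 1, 0, 1], label))  # 0.0: feature independent of the label
```

A high value means the feature carries much class information; a value near zero marks the feature as irrelevant.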
For classification, we investigate how conditional mutual information can evaluate the importance of features during classification, build a mutual-information network from conditional mutual information, and use it to keep the classification rules concise. Finally, we evaluate both lines of work through empirical studies.

In chapter 4 of this paper, we analyze and demonstrate the importance of removing redundant features during feature selection, and then propose an algorithm for doing so called the Approximate Markov-blanket Filter. Building on this filter, we put forward an information-theoretic feature selection algorithm called ECBF. By using information-theoretic measures and combining the redundancy analysis of subset evaluation with the computational advantages of individual evaluation, the ECBF algorithm achieves efficient feature selection on high-dimensional data.

We adopt an information-theoretic correlation measure (RMI) to gauge the correlation among features and between a feature and the class. RMI is constructed from information entropy and mutual information. By computing the correlation between each feature and the class, we can remove irrelevant features from the primitive set and form a relevant feature set. This relevant set usually still contains redundant features, and eliminating them leads to a more efficient process of building the classification model. Feature redundancy is usually identified by the correlation measure among features: intuitively, if the values of two features are completely correlated, the two features are redundant with each other.
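The abstract does not give RMI's exact formula. The sketch below assumes a symmetric-uncertainty-style normalization, which matches the description of a measure "constructed from information entropy and mutual information"; the function names and threshold are illustrative:

```python
import numpy as np
from collections import Counter

def entropy(xs):
    """Empirical Shannon entropy, in bits."""
    counts = np.array(list(Counter(xs).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def rmi(xs, ys):
    """Assumed form of the RMI measure: 2*I(X;Y) / (H(X) + H(Y)), in [0, 1]."""
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0.0:
        return 0.0                  # both variables are constant
    i_xy = hx + hy - entropy(list(zip(xs, ys)))
    return 2.0 * i_xy / (hx + hy)

def relevant_features(columns, label, threshold=0.1):
    """C-correlation filter: keep feature indices whose RMI with the
    class label exceeds the threshold; irrelevant features are dropped."""
    return [j for j, col in enumerate(columns) if rmi(col, label) > threshold]

cols = [[0, 0, 1, 1],    # identical to the label -> RMI = 1
        [0, 1, 0, 1]]    # independent of the label -> RMI = 0
print(relevant_features(cols, [0, 0, 1, 1]))   # [0]
```

The normalization keeps the measure comparable across features with different numbers of values, which a raw mutual-information score would not be.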
In fact, when a feature is only partly correlated with a set of features, we cannot directly conclude that the feature is redundant. The Markov blanket is a powerful tool for analyzing redundant features: it captures the essential behavior of a redundant feature, and we can identify redundancy from a statistical measure between a feature and a subset of features. This yields the Markov-blanket Filter. However, because the Markov-blanket Filter requires a large amount of computation, applying it to eliminate redundant features in high-dimensional data sets would inevitably reduce the efficiency of the feature selection algorithm, so the method is unsuitable for such data.

By extracting the basic properties of the Markov-blanket Filter, analyzing its structure, and combining it with the correlation measure, we derive and construct an Approximate Markov-blanket Filter and use it to analyze redundant features within the relevant feature subset. In this filter we define the correlation between a feature and the class as C-correlation, the correlation between two features as F-correlation, and a relevant feature with no approximate Markov blanket as a predominant feature.

The primary principle of the approximate Markov blanket is: if the C-correlation of feature X is larger than the C-correlation of feature Y, and the F-correlation of the two features is also larger than the C-correlation of Y, then Y is redundant. From this principle we can conclude that the feature with the largest C-correlation has no approximate Markov blanket and is therefore a predominant feature. Using this conclusion, after sorting the features by C-correlation we can eliminate all redundant features and retain the predominant ones. Our method approximates relevance and redundancy analysis by selecting all predominant features and removing the rest.
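The elimination rule above can be sketched as follows. This is an illustrative sketch, not the thesis's exact ECBF code: it reuses the assumed RMI measure for both C- and F-correlation, and it tests with >= rather than strict > so that exactly duplicated features are also eliminated.

```python
import numpy as np
from collections import Counter

def entropy(xs):
    counts = np.array(list(Counter(xs).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def rmi(xs, ys):
    """Assumed correlation measure: 2*I(X;Y) / (H(X) + H(Y))."""
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0.0:
        return 0.0
    return 2.0 * (hx + hy - entropy(list(zip(xs, ys)))) / (hx + hy)

def approximate_mb_filter(columns, label, threshold=0.0):
    """Keep only predominant features: drop the irrelevant ones, sort the
    rest by decreasing C-correlation, then remove any feature Y for which
    an already-kept feature X satisfies F(X, Y) >= C(Y).  Since kept
    features come earlier in the sorted order, C(X) >= C(Y) holds too,
    so X forms an approximate Markov blanket for Y."""
    c = {j: rmi(col, label) for j, col in enumerate(columns)}
    order = sorted((j for j in c if c[j] > threshold), key=lambda j: -c[j])
    kept = []
    for j in order:
        blanketed = any(rmi(columns[k], columns[j]) >= c[j] for k in kept)
        if not blanketed:
            kept.append(j)      # j has no approximate Markov blanket
    return kept

cols = [[0, 0, 1, 1],   # predominant: equals the label
        [0, 0, 1, 1],   # duplicate of feature 0 -> redundant
        [0, 1, 0, 1]]   # irrelevant: C-correlation 0, filtered out first
print(approximate_mb_filter(cols, [0, 0, 1, 1]))   # [0]
```

Because each feature is compared only against the features kept so far, the full pair-wise F-correlation matrix is never computed, which is the source of the efficiency gain on high-dimensional data.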
It uses both C- and F-correlations to determine feature redundancy, and it combines sequential forward selection with elimination so that it not only avoids a full pair-wise F-correlation analysis but also achieves higher efficiency than pure sequential forward selection or backward elimination.

In chapter 5 of this paper, we present a classification algorithm called MTTN, which is built on the relevant feature subset produced by the ECBF algorithm. Using information-theoretic measures in the process...
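The abstract is cut off here, but MTTN is said to rest on conditional mutual information. The quantity itself, I(X;Y|Z), can be estimated from the same entropy counts; the sketch below shows the standard identity, not MTTN itself:

```python
import numpy as np
from collections import Counter

def entropy(xs):
    """Empirical Shannon entropy, in bits."""
    counts = np.array(list(Counter(xs).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def conditional_mi(xs, ys, zs):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(Z) - H(X,Y,Z)."""
    return (entropy(list(zip(xs, zs))) + entropy(list(zip(ys, zs)))
            - entropy(zs) - entropy(list(zip(xs, ys, zs))))

x = [0, 0, 1, 1]
y = [0, 0, 1, 1]
print(conditional_mi(x, y, [0, 0, 0, 0]))  # 1.0: constant Z tells nothing, so I(X;Y|Z) = I(X;Y)
print(conditional_mi(x, y, x))             # 0.0: given Z = X, X adds nothing about Y
```

A feature whose conditional mutual information with the class drops to zero given the features already chosen contributes no new class information, which is what lets the resulting classification rules stay concise.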
Keywords/Search Tags:Data mining, Feature selection, Classification, Information theory, Markov blanket