
Research On Measures And Models In Feature Selection

Posted on: 2018-11-18
Degree: Master
Type: Thesis
Country: China
Candidate: Z C Song
Full Text: PDF
GTID: 2428330575496190
Subject: Computer Science and Technology

Abstract/Summary:
The large amount of irrelevant and redundant information in high-dimensional data greatly limits the performance of machine learning algorithms and places higher demands on their time and space complexity. Feature selection is one of the most important parts of machine learning and pattern recognition: it can efficiently mine the effective information in data, improve the generalization ability of a learning model, and greatly reduce data size. This thesis studies feature selection from two perspectives: the relationships between variables in the data, and the feature selection model. We first study the effects of different measures, then focus on information measures and present several feature selection methods in the later chapters. In particular, we study semi-supervised feature selection based on information measures in order to make full use of both labeled and unlabeled data. The work is divided into the following aspects:

Firstly, when different measures are applied to the same feature selection method, the selected feature subsets often differ substantially, and the classification accuracy of methods built on the same measure may also vary across datasets. We choose a variety of representative linear and nonlinear measures and, by combining the fast correlation-based feature selection (FCBF) algorithm with the selected measures, study these differences on two different kinds of datasets.

Secondly, to enhance the ability of feature selection methods to measure the relationships among variables, we introduce the maximal information coefficient (MIC), which can capture linear, nonlinear, and non-functional relationships at the same time. Based on this measure, we propose a new evaluation function that considers both feature relevance and feature redundancy. To speed up feature selection, a novel search strategy based on MIC is also proposed. On these foundations, a new supervised filter feature selection method is presented.

Finally, starting from basic concepts such as information entropy and mutual information, we propose a normalized mutual correlation measure, and then a semi-supervised relevance-redundancy measure built on it. Two feature selection methods based on this measure are proposed, one using a relevance-redundancy criterion and the other hierarchical clustering; both can make use of labeled and unlabeled data.
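To make the relevance-redundancy idea concrete, the following is a minimal sketch, not the thesis' actual methods: it assumes discrete features, uses symmetrical uncertainty (mutual information normalized by the two entropies) as the measure, and applies a simple greedy forward search that rewards relevance to the label and penalizes redundancy with already selected features. The function names and the toy dataset are illustrative only.

```python
# Sketch of relevance-redundancy feature selection with an
# information-theoretic measure (symmetrical uncertainty).
import numpy as np
from sklearn.metrics import mutual_info_score


def entropy(x):
    """Shannon entropy (in nats) of a discrete variable."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))


def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)), normalized to [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    return 2.0 * mutual_info_score(x, y) / (hx + hy)


def greedy_relevance_redundancy(X, y, k):
    """Greedily pick k columns of X, maximizing relevance to y
    minus mean redundancy with the features selected so far."""
    n_features = X.shape[1]
    relevance = np.array(
        [symmetrical_uncertainty(X[:, j], y) for j in range(n_features)]
    )
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best_j, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            redundancy = np.mean(
                [symmetrical_uncertainty(X[:, j], X[:, s]) for s in selected]
            )
            score = relevance[j] - redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected


# Toy example: labels depend only on features 0 and 3.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 10))
y = (X[:, 0] + X[:, 3]) % 3
print(greedy_relevance_redundancy(X, y, k=3))
```

Replacing `symmetrical_uncertainty` with another estimator, for instance an MIC implementation from an external library, would give the MIC-based variant of the same relevance-redundancy scheme described above.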
Keywords/Search Tags: feature selection, measure, feature relevance, feature redundancy, information theory