Research And Application Of Max-Correlation And Min-Redundancy Unsupervised Feature Selection

Posted on: 2011-11-13
Degree: Master
Type: Thesis
Country: China
Candidate: R Y Liu
GTID: 2178330332464799
Subject: Computer software and theory
Abstract/Summary:
With the growing number of unlabeled datasets, unsupervised feature selection has attracted increasing attention and plays an important role in processing unlabeled data. This thesis first surveys unsupervised feature selection and then investigates filter-model methods in more depth. The central challenge for filter-model unsupervised feature selection is how to define redundant features and irrelevant features [1]. Measured against these two challenges and the state of the art, current filter-model methods have two disadvantages: (1) Redundant features are defined through feature extraction or feature clustering. Feature extraction, however, yields only a transformation of the features and cannot recover a subset of the original features, and when features are clustered with k-means, the uncertain value of k makes redundant features difficult to remove. (2) The purpose of filter-based unsupervised feature selection is to remove both redundant and irrelevant features, but existing methods address only one of the two. To address these two disadvantages, the thesis proposes two ways of removing redundant features, one based on statistics and one based on ensemble clustering, and proposes two unsupervised feature selection algorithms based on the Laplacian Score: LS-CORR (Laplacian Score and Correlation) and LS-EC (Laplacian Score and Ensemble Clustering). Experiments on standard UCI datasets and a manually constructed dataset show that LS-CORR and LS-EC handle redundant and irrelevant features well, yield smaller feature subsets, and can improve clustering accuracy.
Finally, the thesis applies the Laplacian Score and LS-CORR algorithms to the analysis of tobacco aroma features, selecting key aroma features according to the intrinsic properties and distribution of the data. Comparative experiments against other methods demonstrate the effectiveness and practicality of the application.
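The abstract does not spell out how LS-CORR combines its two criteria, so the following is only a rough sketch under stated assumptions, not the thesis's algorithm. It assumes that features are first ranked by the standard Laplacian Score (k-nearest-neighbour graph with heat-kernel weights, smaller score ranked first) and that a feature is then dropped as redundant when its absolute Pearson correlation with an already kept feature reaches a threshold. The function names `laplacian_score` and `ls_corr_select`, the parameters `k`, `t`, and `threshold`, and the default bandwidth are all hypothetical choices made for illustration.

```python
import numpy as np


def laplacian_score(X, k=5, t=None):
    """Return one Laplacian Score per column of X (smaller = ranked better)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Pairwise squared Euclidean distances between samples.
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    # Symmetric k-nearest-neighbour graph; the diagonal is masked so a
    # sample is never selected as its own neighbour.
    masked = d2.copy()
    np.fill_diagonal(masked, np.inf)
    nbrs = np.argsort(masked, axis=1)[:, :k]
    A = np.zeros((n, n), dtype=bool)
    A[np.repeat(np.arange(n), k), nbrs.ravel()] = True
    A = A | A.T
    # Heat-kernel edge weights. Assumption: when t is not given, use the
    # mean squared distance over graph edges as the bandwidth.
    if t is None:
        t = max(d2[A].mean(), 1e-12)
    S = np.where(A, np.exp(-d2 / t), 0.0)
    deg = S.sum(axis=1)
    L = np.diag(deg) - S
    # Centre each feature by its degree-weighted mean.
    Xc = X - (deg @ X) / deg.sum()
    num = np.sum(Xc * (L @ Xc), axis=0)           # f^T L f
    den = np.sum(Xc * deg[:, None] * Xc, axis=0)  # f^T D f
    scores = np.full(X.shape[1], np.inf)          # constant features rank last
    np.divide(num, den, out=scores, where=den > 0)
    return scores


def ls_corr_select(X, threshold=0.9, k=5):
    """Greedy sketch: walk features in Laplacian Score order and keep a
    feature only if |Pearson r| with every kept feature is below threshold."""
    X = np.asarray(X, dtype=float)
    order = np.argsort(laplacian_score(X, k=k))
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in order:
        if all(corr[j, i] < threshold for i in kept):
            kept.append(int(j))
    return kept
```

Calling `ls_corr_select(X)` on an array with samples in rows and non-constant features in columns returns the indices of the kept features in rank order. The greedy pass is one simple way to couple a relevance ranking with a redundancy test; the threshold value would need tuning per dataset.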
Keywords/Search Tags: Unsupervised feature selection, Irrelevance and Redundancy, Pearson Correlation Coefficient, Ensemble Clustering