Font Size: a A A

A Novel Feature Selection Algorithm For High-dimensional Data

Posted on:2013-11-25Degree:MasterType:Thesis
Country:ChinaCandidate:K TianFull Text:PDF
GTID:2248330371477966Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
With the developing of Internet, more and more information emerges in the WWW. Among them, more than80percentages of the data are unstructured, such as Web page, email, image and so on. In order to handle such unstructured data, it is better to represent them via structured model such as vector space model (VSM) for text data and bag of words (BOW) for image data. However, the number of components in the vector is often millions and more, which will result in Curse of dimensionality. This situation will further lead to more challenges for text classification, information retrieval, genetic engineering, computer vision and etc. Therefore, feature selection is proposed to remove irrelevant and redundant features, and then effectively and efficiently improve the learning performance.This dissertation focuses on studying feature selection methods and ensemble learning based on feature selection for high-dimensional data. The main contribution of this paper includes:(1) Sparse representation-based feature selection method is designed for high-dimensional data. In this method, the sparse representation is proposed to represent the relevant features which are selected via the existing evaluation methods (IG, EVSC) from high-dimensional data, and then the redundant features are removed according to the sparse representation coefficient. Experimental results on five real-world high-dimensional datasets have shown that the proposed method has comprehensive performance with respects to accuracy, size of feature subsets, and efficiency;(2) Ensemble learning is one good choice to deal with large volume and high-dimensional data. This dissertation studies the effect of feature selection on ensemble learning, presents a new stratified sampling method to build ensemble components. Experimental results indicate that the new method can obtain higher quality ensemble components than the existing feature selection methods (random sampling and random projection) for high-dimensional data, and finally it can effectively increase the ensemble clustering accuracy.
Keywords/Search Tags:Feature selection, supervised learning, unsupervised learning, ensembleclustering, stratified sampling
PDF Full Text Request
Related items