
Semi-supervised Clustering And Feature Selection For Symbolic Data

Posted on: 2014-12-10
Degree: Master
Type: Thesis
Country: China
Candidate: W T Wang
Full Text: PDF
GTID: 2268330395489212
Subject: Computer application technology
Abstract/Summary:
In the field of machine learning, clustering and feature selection provide effective and efficient methods for data analysis. Clustering is a fundamental technique of unsupervised learning, whose task is to find the inherent structure in unlabeled data. A good clustering divides the data into several clusters so that intra-cluster similarity is maximized while inter-cluster similarity is minimized. Feature selection, on the other hand, is applied to reduce the number of features in applications where datasets have hundreds or thousands of features. It has proven, in both theory and practice, effective in enhancing learning efficiency, increasing predictive accuracy, and reducing the complexity of learned results, especially for high-dimensional datasets.

Recently, semi-supervised learning has become an attractive methodology for improving classification models and is often viewed as using unlabeled data to aid supervised learning. As in supervised learning, semi-supervised clustering and semi-supervised feature selection have been active research areas. However, most existing semi-supervised methods are designed for continuous-valued data, and few are suitable for clustering and feature selection on symbolic data. This thesis proposes two semi-supervised clustering algorithms and two semi-supervised feature selection approaches for symbolic data.

The first clustering method is based on a clustering ensemble whose base clusterers are generated by k-Modes; four voting strategies are provided to obtain the final clustering result. The second clustering method is based on a split-merge model: using equivalence relations built from both unsupervised and supervised information, we obtain small partitions whose objects are similar to each other, and the final clustering is produced by merging these partitions under four different distance measures between clusterings.

Inspired by the feature selection method mRMR in supervised learning, the first feature selection method redefines relevance and redundancy measures in the semi-supervised setting and proposes a novel stopping criterion to control the number of selected features. The second method extends the classical dependence degree of rough set theory to the semi-supervised framework, yielding a dual dependence degree that measures not only the dependence with respect to the decision attribute but also the redundancy between conditional attributes. For this semi-supervised feature selection, two different feature subset search strategies are presented.

Experiments show that the proposed semi-supervised clustering and feature selection methods for symbolic data are effective and efficient, providing an alternative solution for semi-supervised learning.
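Since the abstract does not give the thesis's exact definitions, the following Python sketch only illustrates the general mRMR-style idea described above for symbolic data: relevance is estimated on the few labeled objects, redundancy on all objects, and the thesis's stopping criterion is replaced by a fixed number k of features. The names (mutual_information, mrmr_select) and the relevance/redundancy formulas are illustrative assumptions, not the methods actually proposed in the thesis.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Empirical mutual information between two categorical sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    mi = 0.0
    for (a, b), c in pxy.items():
        # p(a,b) * log( p(a,b) / (p(a) p(b)) ), with counts c, px[a], py[b]
        mi += (c / n) * math.log(c * n / (px[a] * py[b]))
    return mi


def mrmr_select(features, labels, labeled_idx, k):
    """Greedy mRMR-style selection on symbolic (categorical) features.

    features:    list of columns, each a list of categorical values.
    labels:      class labels; only positions in labeled_idx are used.
    labeled_idx: indices of the labeled objects (the supervision).
    k:           number of features to select (stands in for a stop criterion).
    """
    selected = []
    remaining = list(range(len(features)))
    y = [labels[i] for i in labeled_idx]
    while remaining and len(selected) < k:
        best, best_score = None, -float("inf")
        for j in remaining:
            col = features[j]
            # Relevance: MI with the labels, computed on labeled objects only.
            relevance = mutual_information([col[i] for i in labeled_idx], y)
            # Redundancy: average MI with already selected features, on all objects.
            redundancy = (
                sum(mutual_information(col, features[s]) for s in selected) / len(selected)
                if selected else 0.0
            )
            score = relevance - redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy usage: three symbolic features over six objects, two of them labeled.
# Feature 2 duplicates feature 0, so mRMR-style scoring skips it as redundant.
cols = [list("aabbcc"), list("xyxyxy"), list("aabbcc")]
labels = ["p", None, None, "q", None, None]
print(mrmr_select(cols, labels, labeled_idx=[0, 3], k=2))  # -> [0, 1]
```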
Keywords/Search Tags: Machine Learning, Semi-supervised Learning, Clustering, Feature Selection, Symbolic Data, Clustering Ensemble