
Research of Semi-Supervised Learning in High Dimensional Data

Posted on: 2014-02-25
Degree: Doctor
Type: Dissertation
Country: China
Candidate: G X Yu
GTID: 1228330401460238
Subject: Computer application technology
Abstract/Summary:
With the rapid development of information technology, the data collected in every area are accumulating sharply, and progress in these areas depends heavily on efficiently discovering knowledge from such data. Machine learning is one of the foundations of data mining and knowledge discovery, and it remains one of the most active research areas in computer science. Traditional machine learning methods mainly focus on supervised learning: they require that all samples are labeled and that the samples do not have a large number of features. However, with the wide application and advance of data collection techniques, the collected samples not only have a large number of correlated features, but also few of them are labeled. As a result, traditional machine learning methods cannot learn efficiently from these samples, and it is necessary to develop new machine learning paradigms that can learn from few labeled and many unlabeled data.

Semi-supervised learning can leverage both labeled and unlabeled data to obtain a learner with good generalization ability, and it has quickly become one of the hot subfields of machine learning. Current semi-supervised learning methods, especially graph-based ones, often concentrate on how to utilize the labeled and unlabeled data, but they ignore a more fundamental and important problem: how to construct a graph that precisely reflects the similarity among samples. As the number of features increases, the number of noisy and redundant features also rises, and widely used metrics can no longer properly measure the similarity between samples, so it is difficult to construct a well-structured graph on such samples. Yet graph-based semi-supervised learning methods depend on the graph to leverage the labeled and unlabeled data; a well-structured graph on high-dimensional samples therefore determines the final effectiveness of graph-based semi-supervised learning, as well as that of other graph-based learning methods.

To address these problems, we start from graph construction on high-dimensional data with the aim of improving learning accuracy, and take metric learning and ensemble learning as basic tools. We conduct extensive research on graph-based semi-supervised dimensionality reduction, semi-supervised classification, and semi-supervised multi-label classification, propose several graph construction schemes, and incorporate them into graph-based semi-supervised learning. In summary, the key contributions of the thesis are:

(1) We propose an Enhanced Locality Preserving Projections (ELPP) method and its semi-supervised version (SELPP). ELPP addresses the ineffectiveness and parameter sensitivity of the original LPP: it capitalizes on a robust path-based similarity to construct a graph and incorporates this graph into the objective function of LPP (a minimal sketch of path-based graph construction is given after the contribution list). Experimental analysis shows that ELPP achieves higher classification accuracy than the original LPP and is robust to various input parameters; these results also corroborate the importance of graph construction in graph-based embedding. SELPP inherits all the advantages of ELPP, can use must-link constraints to boost the learning results, and performs better than other related methods.
(2) We introduce a mixture graph construction scheme, apply it to semi-supervised dimensionality reduction based on side information, and propose a method called Mixture Graph-based Semi-Supervised Dimensionality Reduction (MGSSDR). MGSSDR has lower time complexity than ELPP and performs better than other related methods. In addition, MGSSDR is robust to noisy features and to the choice of neighborhood size. The proposed mixture graph can also be used in other graph-based methods.

(3) We propose a method coined Semi-Supervised Classification based on Random Subspace Dimensionality Reduction (SSC-RSDR). SSC-RSDR first constructs multiple k-nearest-neighbor (kNN) graphs in multiple randomly generated subspaces and performs semi-supervised dimensionality reduction in these subspaces. Next, it re-constructs kNN graphs in the dimensionality-reduced subspaces and trains semi-supervised nonlinear classifiers on these graphs. Finally, it fuses these classifiers into an ensemble classifier. Experimental analysis demonstrates that SSC-RSDR achieves higher accuracy than other related methods, balances the accuracy and diversity of the base classifiers, and is robust to various input parameters. In addition, SSC-RSDR overcomes the drawback of the mixture graph, which depends on the choice of subspace size. The graph construction scheme of SSC-RSDR can be used in other graph-based semi-supervised learning methods.

(4) We study a method named Semi-Supervised Ensemble Classification (SSEC) in subspaces. SSEC first divides the original feature space into several subspaces of equal size and constructs kNN graphs in these subspaces. Next, it trains Semi-Supervised Linear Classifiers (SSLC) in these subspaces. Finally, SSEC combines these classifiers into an ensemble classifier by majority voting (see the second sketch after the contribution list). Theoretical analysis shows that SSEC has lower time complexity than SSC-RSDR. SSEC avoids the risk of discarding important features; although it uses only simple kNN graphs, it performs better than other semi-supervised classification methods that rely on various graph optimization techniques, and it is robust to various input parameter values. We observe that an SSLC trained in a subspace achieves higher accuracy than an SSLC trained in the original space, which confirms that high-dimensional data contain many redundant features and that it is rational to ensemble classifiers in subspaces.

(5) We introduce a directed bi-relational graph, which avoids the risk of label overwriting that arises in the undirected bi-relational graph. Based on the directed bi-relational graph, we propose Transductive Multi-label Classification (TMC) and Transductive Multi-label Ensemble Classification (TMEC), and apply them to protein function prediction using multiple heterogeneous data sources. Experimental results show that the directed bi-relational graph is better than the undirected one, and that classifier-ensemble-based methods are more suitable for protein function prediction tasks than kernel-integration-based methods.
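To make the graph-construction idea behind contribution (1) concrete, the following is a minimal sketch of a robust path-based (minimax) similarity graph: the effective distance between two samples is the smallest possible largest edge over all paths connecting them, which is then turned into a similarity with a Gaussian kernel. This is an illustration under our own assumptions (function name, Euclidean base distance, bandwidth sigma), not the thesis implementation of ELPP.

```python
# Minimal sketch (not the thesis code): robust path-based (minimax)
# similarity graph of the kind ELPP builds on. The function name, the
# Euclidean base distance, and the bandwidth `sigma` are illustrative.
import numpy as np
from scipy.spatial.distance import cdist

def path_based_similarity(X, sigma=1.0):
    """X: (n_samples, n_features) array. Returns an n x n similarity matrix."""
    D = cdist(X, X)           # direct pairwise Euclidean distances
    P = D.copy()              # minimax path distances, initialised to direct distances
    n = X.shape[0]
    # Floyd-Warshall-style update: the path-based distance between i and j is
    # the minimum over all paths of the largest edge along that path.
    for k in range(n):
        P = np.minimum(P, np.maximum(P[:, [k]], P[[k], :]))
    # Convert path distances to similarities with a Gaussian kernel.
    W = np.exp(-(P ** 2) / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # no self-loops in the graph
    return W
```

Similarly, the subspace-ensemble idea behind contribution (4) can be sketched as follows: split the feature space into equal-sized blocks, train a graph-based semi-supervised learner on a kNN graph in each block, and combine the transductive predictions by majority voting. Here scikit-learn's LabelSpreading is only a generic stand-in for the SSLC base learner, and the random split and all parameter values are illustrative assumptions.

```python
# Minimal sketch (not the thesis code): semi-supervised ensemble
# classification in equal-sized feature subspaces with majority voting,
# in the spirit of SSEC. LabelSpreading stands in for the SSLC base
# learner; the random split and all parameters are illustrative.
import numpy as np
from sklearn.semi_supervised import LabelSpreading

def subspace_ensemble_predict(X, y, n_subspaces=5, n_neighbors=7, seed=0):
    """X: (n_samples, n_features); y: integer labels with -1 marking
    unlabeled samples. Returns majority-vote predictions for all samples."""
    rng = np.random.default_rng(seed)
    feature_blocks = np.array_split(rng.permutation(X.shape[1]), n_subspaces)
    votes = []
    for block in feature_blocks:
        # One graph-based semi-supervised base classifier per feature subspace,
        # built on a kNN graph restricted to that subspace.
        clf = LabelSpreading(kernel="knn", n_neighbors=n_neighbors)
        clf.fit(X[:, block], y)
        votes.append(clf.transduction_)
    votes = np.vstack(votes)  # shape: (n_subspaces, n_samples)
    # Majority voting over the base classifiers' transductive predictions
    # (assumes class labels are non-negative integers).
    return np.array([np.bincount(col).argmax() for col in votes.T])
```

Both sketches are only meant to make the graph-construction and subspace-ensemble ideas concrete; the ELPP, SELPP, MGSSDR, SSC-RSDR, and SSEC algorithms studied in the thesis use their own objective functions and implementation details.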
Keywords/Search Tags:Semi-Supervised Learning, High Dimensional Data, Graph Construction, Ensemble Learning, Multi-label Learning