Font Size: a A A

Research On Classification Problems And Algorithms For High Dimensional Data

Posted on:2014-09-24Degree:DoctorType:Dissertation
Country:ChinaCandidate:Q Y WuFull Text:PDF
GTID:1108330422990350Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
With the rapid development of internet technologies, the data in modern applications usually are large scale and high dimensional. Because of the common used of high di-mensional data and their potential applications, high dimensional data mining has caught broad attention in the research community over the past few years. Currently, researcher-s mainly concentrate on high dimensional data classification. However, due to curse of dimensionality, traditional learning methods may suffer a reduction in their classification performance. As a consequence, high dimensional data classification is a new challenge for traditional supervised learning.High dimensional data classification is very meaningful, which may be able to help people automatically classify the classes of data. For instance, news text categorization, gene expression function prediction, user interest identification and public opinion anal-ysis in social networks. In the literature, however, there are rarely research technologies and results regarding to high dimensional data classification. Therefore, this dissertation focuses on the research problem of high dimensional data classification. We mainly con-sider multi-noise high dimensional data classification, multi-transfer learning, multi-label learning, multi-instance multi-label learning. A series of algorithms are proposed to solve these problems. The main contributions of this dissertation are as follows:1. A stratified sampling method is proposed for features subspace selection in ran-dom forest (RF) for high dimensional data. The key idea is to stratify features into groups, and random select features from each group for feature subspace selection. For high dimensional data classification, we proposed the SRF algorithm. Test on gene and image data consistently demonstrates its effectiveness. For the imbal-anced text data classification, we propose the ForesTexter algorithm. ForesTexter is able to deliver better performance against RF, especially on minority classes. For the Genome-wide association analysis problem, we propose the GWA-SRF algorithm. It is able to avoid a very high computational cost of exhaustive search of an optimal subspace size, and identify interesting features that may be associ-ated with complex disease for further biological investigation.2. A new machine learning strategy called multi-transfer learning is studied. The main challenge is how to use labeled data of different feature spaces to enhance the classification of different learning spaces simultaneously. Our idea is to model the problem as a coupled Markov chain with restart. The transition probabilities in the coupled Markov chain can be constructed by using the intra-relationships based on affinity metric among instances in the same space, and the inter-relationships based on co-occurrence information among instances from different spaces. We imagine a random walker iteratively propagating the labels via the transition prob-ability graph. The steady state probabilities give ranking of labels to indicate the importance of a set of labels to a test instance. Theoretically, we proof the con-vergence of the algorithm, and there is a unique solution for the algorithm. Ex-perimental results on benchmark data (image-text classification) have shown that the learning algorithm is computational efficient and effective in learning across different spaces.3. A hierarchical tree model is proposed for multi-label learning (ML-Tree). It is composed by one-against-all SVM classifiers at each node to recursively partition the data from parent node to child nodes. Each node is labeled with a set of relevant labels which are used for multi-label predictions as well as automatically discovering the label relationships. The amount of predictive label co-occurrence in the leaf nodes provides an estimation of the label relationships. The ML-Tree is examined on11read data sets and employ Friedman and Nemenyi tests to assess the statistical significance of differences in performance. Experimental results have demonstrated the effectiveness of this new method.4. An efficient and novel Markov chain based multi-instance multi-label (Markov-MIML) learning algorithm is proposed. The algorithm computes ranking of labels to indicate the importance of a set of labels to an object. The rank is depends on (i) the affinity between the bag of instances of this object and the bag of instances of the other objects;(ii) the rank of a class label of similar objects. An object similar to the other objects with a high rank of a particular class label receives a high rank of this class label. Theoretically, we proof the convergence of the algorithm, and there is a unique solution for the Markov-MIML algorithm. Experimental results on benchmark data have shown that the proposed algorithm is computationally efficient and effective in label ranking for MIML data.In summary, this dissertation concentrates on the problem of multi-noise high di-mensional data classification, multi-transfer learning, multi-label learning, multi-instance multi-label learning. We proposed SRF, MT-Learn, ML-Tree and Markov-MIML algo-rithms for these problems. The research in this dissertation may improve the high dimen-sional data classification techniques further, and at the same time it is promising to bring new research areas in the fields of high dimensional data classification.
Keywords/Search Tags:High dimensional data classification, Transfer learning, Multi-transfer learn-ing, Multi-label learning, Multi-instance multi-label learning
PDF Full Text Request
Related items