
Research On The Data Selection And Learning Algorithms Under Big Data

Posted on: 2016-06-15
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L Xiong
Full Text: PDF
GTID: 1108330464462882
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
The age of information explosion brings us unprecedented amounts of information, in both variety and quantity. With the rapid development and widespread use of computer communication, Internet technology and the Internet of Things built on all kinds of sensors, collecting large amounts of data has become much easier and cheaper. This provides the data support needed for the rapid development of machine learning, pattern recognition and computer vision, which are urgent demands of the artificial intelligence domain. However, how to select data effectively and how to learn useful information from the data have become important issues for researchers. This thesis presents systematic research on data selection and on learning the intrinsic subspace and manifold information of data, in terms of modeling, algorithm design and analysis. The related algorithms are also applied to collaborative filtering, image inpainting and background modeling of video. The author's major contributions are outlined as follows:

1. Because annotating huge amounts of data is expensive and time-consuming, active learning, as a way to minimize annotation effort, has drawn the attention of a growing number of researchers. Existing active learning algorithms have several limitations: one class of methods exploits the structure information of unlabeled data but needs additional computation, such as hierarchical clustering, to select representative points; another, from the viewpoint of ensemble learning, must train multiple classifiers in advance at each iteration before selecting the data to annotate; a third considers only the single data point closest to the current optimal decision boundary at each iteration. To overcome these limitations, a novel active learning approach, coupled K-nearest-neighbor pseudo pruning, is proposed. Inspired by K-nearest-neighbor pruning preprocessing, it trains only one classifier and considers the data points near or on the optimal classification hyperplane. The computational complexity and the parameters of the method are analyzed. Experimental results on a variety of UCI datasets, aircraft image datasets and a radar high-resolution range profile dataset indicate that the method obtains superior performance compared with state-of-the-art active learning algorithms.
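To make the flavor of this selection strategy concrete, the following is a minimal sketch of a pool-based query step, assuming a binary problem, an SVM margin as the uncertainty measure and a simple KNN-based pruning rule. It illustrates the general idea of training a single classifier and pruning redundant boundary points; it is not the exact coupled K-nearest-neighbor pseudo-pruning algorithm of the thesis.

```python
# Illustrative sketch only: (i) train a single SVM, (ii) rank unlabeled points by
# distance to the decision boundary, (iii) prune near-duplicate candidates with a
# K-nearest-neighbor check so one query batch does not pick several almost
# identical points. Binary classification is assumed.
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import NearestNeighbors


def select_queries(X_labeled, y_labeled, X_pool, batch_size=5, k=5):
    """Return indices into X_pool of points to send to the annotator."""
    clf = SVC(kernel="rbf", gamma="scale").fit(X_labeled, y_labeled)

    # Smaller |decision value| = closer to the separating hyperplane = more uncertain.
    margins = np.abs(clf.decision_function(X_pool))
    candidates = np.argsort(margins)          # most uncertain first

    # KNN-style pruning: skip a candidate if it lies within the k-neighborhood
    # of a point already chosen in this batch.
    nn = NearestNeighbors(n_neighbors=k).fit(X_pool)
    chosen = []
    for idx in candidates:
        if len(chosen) == batch_size:
            break
        if chosen:
            _, neigh = nn.kneighbors(X_pool[idx:idx + 1])
            if set(neigh[0]) & set(chosen):
                continue                       # too close to an already selected point
        chosen.append(int(idx))
    return chosen
```

In use, the returned indices would be labeled by the supervisor, moved from the pool into the training set, and the loop repeated until the annotation budget is exhausted.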
2. Low-rank matrix completion and recovery are classical practical problems in which the intrinsic structure of the data must be learned from the known entries. In recent years, these problems have been solved effectively by trace-norm (nuclear-norm) minimization and other variants of the singular value decomposition (SVD) in the data-pool setting, where the size of the data, i.e., the number of samples or video frames, is known in advance. Such methods solve the problem iteratively with an SVD over all of the data at every iteration and therefore suffer from the high time complexity of numerous SVDs, which makes them unsuitable for real-time environments. This thesis proposes an online gradient descent algorithm on the Grassmannian manifold under an L1-L2 norm framework (OGDAGML1L2) to handle low-rank matrix completion and recovery in the data-stream setting. By introducing optimization on Riemannian manifolds, an optimal subspace is found along geodesics on the Grassmannian manifold, and only one data sample is involved in each iteration, in the manner of incremental learning. The L1-L2 norm framework is designed to approximately recover the original data from data corrupted by sparse outliers and Gaussian noise. An iterative algorithm based on the alternating direction method of multipliers (ADMM) and Grassmannian manifold optimization is also presented to solve robust low-rank matrix completion, robust low-rank matrix recovery and background modeling in video surveillance in an online mode. Furthermore, a new adaptive step-size update strategy characterized by exponential multiple energy levels (EMEL) is proposed to track the subspace efficiently. Experimental results on a wide variety of artificial and real-world datasets demonstrate that the OGDAGML1L2 method is more robust, effective and efficient than state-of-the-art online algorithms.
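For orientation, the sketch below shows one online subspace update on the Grassmannian in the plain least-squares (L2) case, in the spirit of incremental algorithms such as GROUSE. It is a simplified stand-in: the thesis's OGDAGML1L2 additionally handles sparse outliers through the L1-L2 framework and ADMM, and adapts the step size with the EMEL schedule, none of which appears here.

```python
# Minimal sketch (assumptions stated above): one geodesic update of an orthonormal
# basis U of a rank-r subspace from a single, partially observed column.
import numpy as np


def grassmann_step(U, omega, v_omega, eta=0.1):
    """One rank-one geodesic update of U (n x r, orthonormal columns).

    omega   : indices of the observed entries of the incoming column
    v_omega : observed values at those indices
    eta     : step size
    """
    # Best representation of the observed entries in the current subspace.
    w, *_ = np.linalg.lstsq(U[omega], v_omega, rcond=None)
    p = U @ w                          # prediction of the full column
    r = np.zeros(U.shape[0])
    r[omega] = v_omega - U[omega] @ w  # residual on the observed entries

    rn, pn = np.linalg.norm(r), np.linalg.norm(p)
    if rn < 1e-12 or pn < 1e-12:
        return U                       # column already explained; no rotation needed

    sigma = rn * pn
    # Move along the geodesic of the Grassmannian in the gradient direction.
    step = (np.cos(sigma * eta) - 1.0) * p / pn + np.sin(sigma * eta) * r / rn
    return U + np.outer(step, w / np.linalg.norm(w))
```

In a background-modeling setting, one such update per incoming video frame (with missing or masked pixels treated as unobserved entries) keeps the low-rank background subspace current as the stream evolves.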
3. Learning the intrinsic subspace information from known data can be extended to learning the Riemannian quotient-manifold structure behind fixed-rank matrix factorization, where the low-rank constraint is expressed through a factorization into full-rank matrices. In order to solve more general matrix completion problems, covering both the conditioning and the scale of the matrix, this thesis constructs a novel Riemannian metric for fixed-rank matrix completion, characterized by a linear combination of the Riemannian geometry and the scaling information on the horizontal space of the quotient manifold. All the components required for optimization on the Riemannian quotient manifold are reconsidered and analyzed, and a non-linear conjugate gradient method is implemented on the quotient manifold to verify the effectiveness of the proposed metric. Numerical experiments comparing the convergence of the algorithm indicate that the proposed metric outperforms several existing metrics, and the conjugate gradient algorithm equipped with this metric is competitive with state-of-the-art algorithms for low-rank matrix completion.

4. Combining multiple individual classifiers to improve on the performance of a single classifier has attracted increasing attention in recent years. A natural question is whether every individual classifier actually helps to decrease the generalization error of the ensemble. Balancing the diversity among individuals against the accuracy of individuals is both the starting point and the difficulty in designing ensemble learning algorithms. A new selective classifier ensemble based on an integer-matrix linear transformation is proposed, which takes this balance into account. To enhance diversity, the individual classifiers are treated as the original targets of the linear transformation, and the true labels, rather than the sample means, are used to construct an integer matrix; by projecting the individual classifiers onto the lines through the true labels, a set of new classifiers is obtained from this projection. To ensure the accuracy of individuals, the new classifiers with better performance, according to two measures of classifier performance, accuracy rate and the RPF-measure, are selected into the ensemble to increase classification accuracy. Experimental results on radar range profiles indicate that the proposed method effectively balances the diversity and the accuracy of the individuals, and obtains better performance for radar target recognition than single-classifier algorithms and other methods.

5. In supervised learning tasks, if labeled samples are scarce in the target domain (TD), the learning ability and generalization of the learner in the TD inevitably suffer. Besides active learning algorithms, which query the most informative samples, have them annotated by a supervisor and add them to the TD training set, there often exist many labeled samples that are easier to obtain than the TD training set; in real environments these samples follow a different data distribution from the TD training set and constitute the source domain (SD) in transfer learning. Transfer learning is therefore introduced to deal with classification when training samples are lacking, and two new methods are proposed. The first, classification with transferred samples based on random-forest (RF) spaces, effectively selects useful samples from the SD to improve the classifier in the TD: the SD samples and the TD training samples are each transformed by their corresponding RF-spaces, the similarity between the RF-spaces of all subsets of the SD and of the TD is computed, and the most similar SD subset is added to the TD. Promising experimental results on text datasets indicate that this method achieves higher classification performance than other methods. The second, transfer ensemble based on data-driven linear space mapping (DDLSM), projects source-domain samples onto specific target-domain samples to obtain new samples, selects the useful ones by computing the similarity between the new source-domain samples and the specific target-domain samples, and combines the resulting transfer-learning predictions by ensemble learning. Experimental classification results on UCI datasets and the MSTAR image dataset indicate that the proposed method improves performance compared with using the target domain alone and with other transfer learning methods, and that the ensemble of transfers effectively avoids the instability of sample selection in transfer learning.
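As an illustration of the idea behind the first transfer method, the sketch below selects source-domain samples whose random-forest representation is closest to the target-domain data. It is a minimal sketch assuming a per-sample leaf-co-occurrence similarity on a forest trained on the target domain; the function name and the similarity rule are illustrative assumptions, and the thesis's RF-spaces construction with subset-level similarity is richer than this.

```python
# Illustrative sketch only: pick source-domain samples that look most like the
# target-domain training data, using random-forest leaf indices as a stand-in
# representation for RF-spaces.
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def select_source_samples(X_target, y_target, X_source, y_source, n_select=100):
    """Return the n_select source samples most similar to the target data."""
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    forest.fit(X_target, y_target)

    # apply() maps every sample to the leaf it reaches in each tree.
    leaves_tgt = forest.apply(X_target)     # shape (n_target, n_trees)
    leaves_src = forest.apply(X_source)     # shape (n_source, n_trees)

    # Similarity of a source sample = fraction of trees in which it shares a leaf
    # with at least one target sample.
    sims = np.empty(len(X_source))
    for i, src in enumerate(leaves_src):
        shared = (leaves_tgt == src).any(axis=0)
        sims[i] = shared.mean()

    top = np.argsort(sims)[::-1][:n_select]
    return X_source[top], y_source[top]
```

The selected source samples would then be appended to the target training set before fitting the final classifier, which is the role the transferred SD subset plays in the method described above.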
Keywords/Search Tags: Active learning, Transfer learning, L1-L2 norm, Riemannian quotient manifold, Non-linear conjugate gradient, Fixed-rank matrix completion