
Study Of Data Completion, Labeling And Retrieval Based On Machine Learning

Posted on: 2016-08-01
Degree: Doctor
Type: Dissertation
Country: China
Candidate: D B Zhang
GTID: 1108330470967836
Subject: Computer Science and Technology

Abstract:
With the fast development of the Internet, we have entered the era of Big Data. Many researchers focus on the problems of data understanding and retrieval, which have many applications in recommender systems, face recognition, and image retrieval. However, even in the Big Data era, we still face many challenges concerning the data itself. First, although we can collect large amounts of data easily, data completion and recovery remain important problems, since the collected data is very likely to be missing, incomplete, or corrupted. Second, large amounts of labeled data are needed for classification, recognition, and data understanding, whereas most data is unlabeled. Selecting the most important and informative data to label is crucial, as labeling all data is costly and laborious. Third, understanding the exact intent of users and retrieving the related information as quickly as possible is also challenging. Based on these three aspects, the main contributions of this thesis are as follows:

1. We propose a new matrix completion method based on the truncated nuclear norm to better recover data information. Inspired by traditional nuclear-norm-based matrix completion, we modify the nuclear norm to approximate the matrix rank more closely by discarding the largest r singular values in its definition. In this way, we can achieve better low-rank solutions. We also propose two efficient optimization algorithms to solve the objective function based on the truncated nuclear norm. This work illustrates, in general terms, the advantage of the truncated nuclear norm over the traditional nuclear norm, so the truncated nuclear norm can be applied in many other settings where the nuclear norm has been adopted.

2. The procedure of automatically selecting the most informative data points to label is known as active learning.
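As a concrete illustration of the truncated nuclear norm in contribution 1 above, the sketch below computes the norm (the sum of all singular values except the largest r) and runs a simple completion loop that soft-thresholds only the tail singular values while re-imposing the observed entries. This is a hedged, minimal stand-in assuming a fully observed mask matrix; the function names and the shrinkage loop are illustrative and are not the thesis's actual two optimization algorithms.

```python
import numpy as np

def truncated_nuclear_norm(X, r):
    """Sum of the singular values of X, excluding the largest r."""
    s = np.linalg.svd(X, compute_uv=False)
    return s[r:].sum()

def complete_tnn(M, mask, r, tau=0.5, n_iter=300):
    """Illustrative completion loop: shrink only the singular values
    beyond the top r (so the dominant low-rank structure is untouched),
    then re-impose the observed entries of M where mask is True."""
    X = np.where(mask, M, 0.0)          # start from the zero-filled matrix
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        s_shrunk = s.copy()
        s_shrunk[r:] = np.maximum(s[r:] - tau, 0.0)  # leave top-r singular values alone
        X = (U * s_shrunk) @ Vt
        X[mask] = M[mask]               # keep observed entries fixed
    return X
```

For a rank-r ground-truth matrix with enough observed entries, the loop drives the unobserved entries toward the low-rank completion, whereas plain nuclear-norm shrinkage would also bias the large singular values that carry the signal.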
Based on traditional active learning methods and kernel space theory, we extend the existing distance-sensitive reconstruction based active learning method to its nonlinear form. The original method represents data points by simple linear reconstruction. However, in many real applications the data distribution is very complex. Kernel theory tells us that we can map the original data into an infinite-dimensional Reproducing Kernel Hilbert Space (RKHS) by selecting an appropriate kernel function, and in a sufficiently high-dimensional space the complex nonlinear structure of the data is more likely to be unfolded. We therefore derive the formulation of distance-sensitive reconstruction in kernel space and propose a new optimization method to solve the resulting problem effectively. Experiments show that we obtain better results by treating the data structure more carefully in kernel space.

3. We propose a unified approximate nearest neighbor search scheme that combines data structures and hashing, in order to improve both the speed and accuracy of data retrieval. Traditionally, data-structure-based retrieval and hashing-based retrieval have been independent approaches. The proposed scheme can combine many data structures (such as the k-means tree and the k-nearest-neighbor graph) with any hashing algorithm, making full use of the advantages of both. On one hand, we accelerate each step of data-structure-based search by using Hamming distance to approximate Euclidean distance. On the other hand, we use the data structure to prune the hashing-based search, thus avoiding an exhaustive linear scan and reducing the search complexity from linear to log-linear. Moreover, traditional hashing approaches can only use very short hashing codes (such as 32 or 64 bits), which are inevitably inaccurate.
In our proposed scheme, much longer hashing codes (such as 512 or 1024 bits) are supported, which guarantees much higher accuracy.
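The combination in contribution 3 can be sketched in miniature: prune candidates with a one-level k-means-style partition (standing in for a k-means tree), rank the survivors cheaply by Hamming distance on random-hyperplane hash codes, and re-rank only a few finalists with exact Euclidean distance. All function names, the one-level partition, and the parameter choices below are illustrative assumptions, not the thesis's actual scheme.

```python
import numpy as np

def hash_codes(X, planes):
    """Random-hyperplane hashing: one bit per hyperplane (sign of projection)."""
    return X @ planes.T > 0

def hamming(a, B):
    """Hamming distances from code a to each row of code matrix B."""
    return (a != B).sum(axis=1)

def build(X, n_parts=4, n_bits=64, seed=0):
    """Index build: random hyperplanes for hashing, plus a crude one-level
    partition (centers sampled from the data, points assigned to nearest)."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((n_bits, X.shape[1]))
    centers = X[rng.choice(len(X), n_parts, replace=False)]
    assign = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
    return planes, centers, assign, hash_codes(X, planes)

def search(q, X, codes, planes, centers, assign, n_probe=2, n_cand=20):
    # 1) tree-style pruning: visit only the n_probe closest partitions
    probe = np.argsort(np.linalg.norm(centers - q, axis=1))[:n_probe]
    cand = np.flatnonzero(np.isin(assign, probe))
    # 2) cheap filtering: Hamming distance approximates the true distance
    qc = hash_codes(q[None, :], planes)[0]
    order = cand[np.argsort(hamming(qc, codes[cand]))[:n_cand]]
    # 3) exact re-ranking on the small surviving candidate set
    return order[np.argmin(np.linalg.norm(X[order] - q, axis=1))]
```

Step 2 is where long codes pay off: Hamming distance on hundreds of bits is still a cheap bit-wise operation, yet ranks candidates far more faithfully than 32- or 64-bit codes, and step 1 keeps the scan away from the full dataset.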
Keywords/Search Tags: Big Data, Matrix Completion, Active Learning, Nearest Neighbor Search