
Study Of Data Completion, Labeling And Retrieval Based On Machine Learning

Posted on: 2016-08-01
Degree: Doctor
Type: Dissertation
Country: China
Candidate: D B Zhang
GTID: 1108330470967836
Subject: Computer Science and Technology

Abstract:
With the fast development of the Internet, we have entered the era of Big Data. Many researchers focus on the problems of data understanding and retrieval, which have many applications in recommender systems, face recognition, and image retrieval. However, even in the Big Data era, we still face many challenges concerning the data itself. First, although we can collect large amounts of data easily, data completion and recovery remain important problems, since the collected data is very likely to be missing, incomplete, or corrupted. Second, large amounts of labeled data are needed for classification, recognition, and data understanding, whereas most data is unlabeled. Selecting the most important and informative data to label is crucial, as labeling all data is costly and laborious. Third, understanding the exact intent of users and retrieving the related information as quickly as possible is also challenging. Based on these three aspects, the main contributions of this thesis are as follows:

1. We propose a new matrix completion method based on the truncated nuclear norm to better recover data information. Inspired by traditional nuclear-norm-based matrix completion, we modify the nuclear norm to approximate the matrix rank more closely by discarding the largest r singular values in its definition. In this way, we can achieve better low-rank solutions. We also propose two efficient optimization algorithms to solve the objective function based on the truncated nuclear norm. This work illustrates, in general terms, the advantage of the truncated nuclear norm over the traditional nuclear norm, so the truncated nuclear norm can be applied in many other settings where the nuclear norm has been adopted.

2. The procedure of automatically selecting the most informative data points to label is known as active learning.
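As a concrete illustration of the truncated nuclear norm in contribution 1 above, the sketch below computes the norm (the sum of all singular values except the largest r) and runs a simple completion loop that soft-thresholds only the tail singular values while re-imposing the observed entries. This is a hedged, minimal stand-in assuming a fully observed mask matrix; the function names and the shrinkage loop are illustrative and are not the thesis's actual two optimization algorithms.

```python
import numpy as np

def truncated_nuclear_norm(X, r):
    """Sum of the singular values of X, excluding the largest r."""
    s = np.linalg.svd(X, compute_uv=False)
    return s[r:].sum()

def complete_tnn(M, mask, r, tau=0.5, n_iter=300):
    """Illustrative completion loop: shrink only the singular values
    beyond the top r (so the dominant low-rank structure is untouched),
    then re-impose the observed entries of M where mask is True."""
    X = np.where(mask, M, 0.0)          # start from the zero-filled matrix
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        s_shrunk = s.copy()
        s_shrunk[r:] = np.maximum(s[r:] - tau, 0.0)  # leave top-r singular values alone
        X = (U * s_shrunk) @ Vt
        X[mask] = M[mask]               # keep observed entries fixed
    return X
```

For a rank-r ground-truth matrix with enough observed entries, the loop drives the unobserved entries toward the low-rank completion, whereas plain nuclear-norm shrinkage would also bias the large singular values that carry the signal.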
Based on traditional active learning methods and kernel space theory, we extend the existing distance-sensitive reconstruction based active learning method to its nonlinear form. The original method represents data points by simple linear reconstruction. However, in many real applications the data distribution is very complex. Kernel theory tells us that we can map the original data into an infinite-dimensional Reproducing Kernel Hilbert Space (RKHS) by selecting an appropriate kernel function, and in a sufficiently high-dimensional space the complex nonlinear structure of the data is more likely to be unfolded. We therefore derive the formulation of distance-sensitive reconstruction in kernel space and propose a new optimization method to solve the resulting problem effectively. Experiments show that we obtain better results by treating the data structure more carefully in kernel space.

3. We propose a unified approximate nearest neighbor search scheme that combines data structures and hashing, in order to improve both the speed and accuracy of data retrieval. Traditionally, data-structure-based retrieval and hashing-based retrieval have been independent approaches. The proposed scheme can combine many data structures (such as the k-means tree and the k-nearest-neighbor graph) with any hashing algorithm, making full use of the advantages of both. On one hand, we accelerate each step of data-structure-based search by using Hamming distance to approximate Euclidean distance. On the other hand, we use the data structure to prune the hashing-based search, thus avoiding an exhaustive linear scan and reducing the search complexity from linear to log-linear. Moreover, traditional hashing approaches can only use very short hashing codes (such as 32 or 64 bits), which are inevitably inaccurate.
In our proposed scheme, much longer hashing codes (such as 512 or 1024 bits) are supported, which guarantees much higher accuracy.
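The combination in contribution 3 can be sketched in miniature: prune candidates with a one-level k-means-style partition (standing in for a k-means tree), rank the survivors cheaply by Hamming distance on random-hyperplane hash codes, and re-rank only a few finalists with exact Euclidean distance. All function names, the one-level partition, and the parameter choices below are illustrative assumptions, not the thesis's actual scheme.

```python
import numpy as np

def hash_codes(X, planes):
    """Random-hyperplane hashing: one bit per hyperplane (sign of projection)."""
    return X @ planes.T > 0

def hamming(a, B):
    """Hamming distances from code a to each row of code matrix B."""
    return (a != B).sum(axis=1)

def build(X, n_parts=4, n_bits=64, seed=0):
    """Index build: random hyperplanes for hashing, plus a crude one-level
    partition (centers sampled from the data, points assigned to nearest)."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((n_bits, X.shape[1]))
    centers = X[rng.choice(len(X), n_parts, replace=False)]
    assign = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
    return planes, centers, assign, hash_codes(X, planes)

def search(q, X, codes, planes, centers, assign, n_probe=2, n_cand=20):
    # 1) tree-style pruning: visit only the n_probe closest partitions
    probe = np.argsort(np.linalg.norm(centers - q, axis=1))[:n_probe]
    cand = np.flatnonzero(np.isin(assign, probe))
    # 2) cheap filtering: Hamming distance approximates the true distance
    qc = hash_codes(q[None, :], planes)[0]
    order = cand[np.argsort(hamming(qc, codes[cand]))[:n_cand]]
    # 3) exact re-ranking on the small surviving candidate set
    return order[np.argmin(np.linalg.norm(X[order] - q, axis=1))]
```

Step 2 is where long codes pay off: Hamming distance on hundreds of bits is still a cheap bit-wise operation, yet ranks candidates far more faithfully than 32- or 64-bit codes, and step 1 keeps the scan away from the full dataset.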
Keywords/Search Tags: Big Data, Matrix Completion, Active Learning, Nearest Neighbor Search