
Hashing Coding And Clustering Analysis On Large-Scale Data

Posted on: 2018-07-13    Degree: Master    Type: Thesis
Country: China    Candidate: Y Li    Full Text: PDF
GTID: 2428330623450780    Subject: Computer Science and Technology
Abstract/Summary:
Clustering is an unsupervised learning method: it does not rely on prior knowledge such as data labels or known similarities between samples, and it has become an important data analysis technique in bioinformatics and computer vision. In recent years, data in these fields have been characterized by large scale, high dimensionality and rapid growth, and conventional clustering methods can no longer meet the requirements of existing applications in either efficiency or accuracy. Exploring new clustering algorithms therefore has practical significance and broad application prospects.

Density Peaks Clustering (DPC) and Sparse Subspace Clustering (SSC) are two methods that emerged in recent years and have been widely applied in data mining and computer vision. However, their high time and space complexity makes them unsuitable for large-scale data, and their complex data dependencies also make them difficult to parallelize. This thesis introduces hash coding to improve the DPC and SSC algorithms, reducing their time complexity and making them easier to parallelize. On the one hand, DPC and SSC are already used in many fields, and the improved algorithms further extend their scope of application; on the other hand, the new algorithms are applied to large-scale mass spectrometry data, motion segmentation problems and face recognition problems, all of which are of great significance in bioinformatics and machine vision. The thesis makes two main contributions, described below.

To address the difficulty of parallelizing DPC, this thesis proposes LSH-DPC, an algorithm based on Locality-Sensitive Hashing (LSH), and implements it in parallel. By designing suitable hash functions, the hashing step coarsely partitions the data at very low time cost, which significantly reduces the time complexity of DPC; the introduction of hashing also lowers the difficulty of parallelization. Experiments on a large-scale mass spectrometry dataset showed that, compared with the original algorithm, LSH-DPC significantly reduces the running time while keeping the clustering accuracy almost unchanged.

To improve the accuracy of SSC and to reduce its high time complexity, this thesis proposes two algorithms: smooth sparse subspace clustering (SM-SSC) and highly efficient sparse subspace clustering (LSH-SSC). Introducing smoothing yields a more accurate affinity matrix and thus improves clustering accuracy, while clustering on the hashed representation significantly reduces the running time of the algorithm. Experiments on motion segmentation and face recognition data showed that SM-SSC achieves higher accuracy than SSC, and experiments on motion segmentation data showed that LSH-SSC runs significantly faster than the original SSC algorithm.
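The following is a minimal, illustrative sketch of the coarse-partitioning idea behind LSH-DPC: a single random-projection hash table groups nearby points into buckets, and DPC's density and distance quantities are then computed only within each bucket. The hash family, the cut-off kernel, and all function and parameter names (lsh_buckets, dpc_quantities, n_planes, d_c) are assumptions made for illustration, not the thesis author's implementation.

```python
import numpy as np

def lsh_buckets(X, n_planes=8, seed=0):
    """Coarsely partition the rows of X with one random-projection hash table."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((X.shape[1], n_planes))
    bits = (X @ planes > 0).astype(int)          # one sign bit per hyperplane
    keys = bits @ (2 ** np.arange(n_planes))     # pack the bits into an integer bucket key
    buckets = {}
    for i, k in enumerate(keys):
        buckets.setdefault(int(k), []).append(i)
    return buckets

def dpc_quantities(X, buckets, d_c=1.0):
    """DPC's local density (rho) and distance to the nearest denser point (delta),
    with pairwise distances computed only inside each hash bucket."""
    n = X.shape[0]
    rho = np.zeros(n)
    delta = np.full(n, np.inf)                   # stays inf if no denser point shares the bucket
    for idx in buckets.values():
        idx = np.asarray(idx)
        P = X[idx]
        D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
        rho[idx] = (D < d_c).sum(axis=1) - 1     # cut-off kernel density (self excluded)
        for a, i in enumerate(idx):
            denser = np.where(rho[idx] > rho[i])[0]
            if denser.size:
                delta[i] = D[a, denser].min()
    return rho, delta

if __name__ == "__main__":
    X = np.random.default_rng(1).standard_normal((1000, 16))
    rho, delta = dpc_quantities(X, lsh_buckets(X))
```

As in standard DPC, cluster centers would then be chosen as points with both large rho and large delta, and the remaining points assigned by following their nearest denser neighbour; restricting the pairwise comparisons to buckets is what trades a small amount of accuracy for the large reduction in cost.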
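For context, the sketch below shows the standard SSC pipeline that SM-SSC and LSH-SSC build on: each point is expressed as a sparse linear combination of the other points, the coefficients are symmetrized into an affinity matrix, and spectral clustering is applied to that matrix. The abstract does not detail the smoothing or hashing steps, so they are not reproduced here; the solver choice (Lasso), the alpha value and the name ssc_baseline are assumptions.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.linear_model import Lasso

def ssc_baseline(X, n_clusters, alpha=0.01):
    """Plain SSC: sparse self-representation -> affinity matrix -> spectral clustering.
    X holds one data point per row."""
    n = X.shape[0]
    C = np.zeros((n, n))
    for i in range(n):
        mask = np.arange(n) != i                         # exclude the point itself
        lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000)
        lasso.fit(X[mask].T, X[i])                       # x_i ~ sum_j c_j x_j with sparse c
        C[i, mask] = lasso.coef_
    W = np.abs(C) + np.abs(C).T                          # symmetric affinity matrix
    return SpectralClustering(n_clusters=n_clusters,
                              affinity="precomputed").fit_predict(W)
```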
Keywords/Search Tags:Hashing, DPC, SSC, MS Analysis