
Hashing Coding And Clustering Analysis On Large-Scale Data

Posted on: 2018-07-13    Degree: Master    Type: Thesis
Country: China    Candidate: Y Li    Full Text: PDF
GTID: 2428330623450780    Subject: Computer Science and Technology
Abstract/Summary:
Clustering is an unsupervised learning method: it does not rely on prior knowledge such as data labels or known similarities between samples, and it has become an important data analysis technique in bioinformatics and computer vision. In recent years, data in these fields have been characterized by large scale, high dimensionality and rapid growth, and conventional clustering methods can no longer meet the requirements of existing applications in either efficiency or accuracy. Exploring new clustering algorithms therefore has practical significance and broad application prospects.

Density Peaks Clustering (DPC) and Sparse Subspace Clustering (SSC) are two methods that emerged in recent years and have been widely applied in data mining and computer vision. However, their high time and space complexity makes them unsuitable for large-scale data, and their complex data dependencies also make them difficult to parallelize. This thesis introduces hash coding to improve the DPC and SSC algorithms, reducing their time complexity and making them easier to parallelize. On the one hand, DPC and SSC are already used in many fields, and the improved algorithms further extend their scope of application; on the other hand, the new algorithms are applied to large-scale mass spectrometry data, motion segmentation problems and face recognition problems, all of which are of great significance in bioinformatics and machine vision. The thesis makes two main contributions, described below.

To address the difficulty of parallelizing DPC, this thesis proposes LSH-DPC, an algorithm based on Locality-Sensitive Hashing (LSH), and implements it in parallel. By designing suitable hash functions, the hashing step coarsely partitions the data at very low time cost, which significantly reduces the time complexity of DPC; the introduction of hashing also lowers the difficulty of parallelization. Experiments on a large-scale mass spectrometry dataset showed that, compared with the original algorithm, LSH-DPC significantly reduces the running time while keeping the clustering accuracy almost unchanged.

To improve the accuracy of SSC and to reduce its high time complexity, this thesis proposes two algorithms: smooth sparse subspace clustering (SM-SSC) and highly efficient sparse subspace clustering (LSH-SSC). Introducing smoothing yields a more accurate affinity matrix and thus improves clustering accuracy, while clustering on the hashed representation significantly reduces the running time of the algorithm. Experiments on motion segmentation and face recognition data showed that SM-SSC achieves higher accuracy than SSC, and experiments on motion segmentation data showed that LSH-SSC runs significantly faster than the original SSC algorithm.
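The following is a minimal, illustrative sketch of the coarse-partitioning idea behind LSH-DPC: a single random-projection hash table groups nearby points into buckets, and DPC's density and distance quantities are then computed only within each bucket. The hash family, the cut-off kernel, and all function and parameter names (lsh_buckets, dpc_quantities, n_planes, d_c) are assumptions made for illustration, not the thesis author's implementation.

```python
import numpy as np

def lsh_buckets(X, n_planes=8, seed=0):
    """Coarsely partition the rows of X with one random-projection hash table."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((X.shape[1], n_planes))
    bits = (X @ planes > 0).astype(int)          # one sign bit per hyperplane
    keys = bits @ (2 ** np.arange(n_planes))     # pack the bits into an integer bucket key
    buckets = {}
    for i, k in enumerate(keys):
        buckets.setdefault(int(k), []).append(i)
    return buckets

def dpc_quantities(X, buckets, d_c=1.0):
    """DPC's local density (rho) and distance to the nearest denser point (delta),
    with pairwise distances computed only inside each hash bucket."""
    n = X.shape[0]
    rho = np.zeros(n)
    delta = np.full(n, np.inf)                   # stays inf if no denser point shares the bucket
    for idx in buckets.values():
        idx = np.asarray(idx)
        P = X[idx]
        D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
        rho[idx] = (D < d_c).sum(axis=1) - 1     # cut-off kernel density (self excluded)
        for a, i in enumerate(idx):
            denser = np.where(rho[idx] > rho[i])[0]
            if denser.size:
                delta[i] = D[a, denser].min()
    return rho, delta

if __name__ == "__main__":
    X = np.random.default_rng(1).standard_normal((1000, 16))
    rho, delta = dpc_quantities(X, lsh_buckets(X))
```

As in standard DPC, cluster centers would then be chosen as points with both large rho and large delta, and the remaining points assigned by following their nearest denser neighbour; restricting the pairwise comparisons to buckets is what trades a small amount of accuracy for the large reduction in cost.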
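For context, the sketch below shows the standard SSC pipeline that SM-SSC and LSH-SSC build on: each point is expressed as a sparse linear combination of the other points, the coefficients are symmetrized into an affinity matrix, and spectral clustering is applied to that matrix. The abstract does not detail the smoothing or hashing steps, so they are not reproduced here; the solver choice (Lasso), the alpha value and the name ssc_baseline are assumptions.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.linear_model import Lasso

def ssc_baseline(X, n_clusters, alpha=0.01):
    """Plain SSC: sparse self-representation -> affinity matrix -> spectral clustering.
    X holds one data point per row."""
    n = X.shape[0]
    C = np.zeros((n, n))
    for i in range(n):
        mask = np.arange(n) != i                         # exclude the point itself
        lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000)
        lasso.fit(X[mask].T, X[i])                       # x_i ~ sum_j c_j x_j with sparse c
        C[i, mask] = lasso.coef_
    W = np.abs(C) + np.abs(C).T                          # symmetric affinity matrix
    return SpectralClustering(n_clusters=n_clusters,
                              affinity="precomputed").fit_predict(W)
```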
Keywords/Search Tags:Hashing, DPC, SSC, MS Analysis