
Improvement Of Hash Dimension Reduction And K-means Clustering Model

Posted on: 2019-03-24, Degree: Master, Type: Thesis
Country: China, Candidate: H Yu
GTID: 2428330566975923, Subject: Software engineering
Abstract/Summary:
Machine learning is widely applied to high-dimensional data, but such data raises many problems: reducing its dimensionality and storing, analyzing, and managing it are common challenges in machine learning tasks. This thesis studies and extends existing hash-based dimensionality reduction and K-means clustering models. It proposes new algorithms that address two issues: hash function construction usually ignores the similarity structure of the data, and clustering algorithms for high-dimensional data have several known weaknesses. The main contributions are as follows.

(1) A Principal Component Analysis Rotation Hashing algorithm (PCAR) is proposed. Hashing is widely used to reduce the dimensionality of high-dimensional data because it encodes such data as binary strings. However, existing hashing algorithms still have the following problems: (i) traditional hashing algorithms construct the hash function from a fixed mathematical formula that cannot adapt to the data, so a good hashing result cannot be obtained; (ii) existing hashing algorithms usually learn the hash function and the binarization threshold separately, which complicates the process and easily introduces errors; (iii) some improved hashing algorithms do not consider the global and local structural information of the data at the same time. The PCAR algorithm, presented in Chapter 3, therefore combines principal component analysis with manifold learning to overcome the limitation that traditional hashing algorithms usually consider only one kind of structure. Specifically, PCAR uses principal component analysis to preserve the global similarity of the data and manifold learning to preserve its local similarity. In experiments on real data, PCAR outperforms the common AGH, DSH, KLSH, LSH, MDSH, SGH, and PCAH algorithms. Because the proposed objective function, built on PCA and manifold learning, accounts for both the local and the global structure of the data, it improves the original hashing framework to some extent, enriches the existing framework, broadens its scope of application, and improves the performance of hashing algorithms in high-dimensional data retrieval and analysis.

(2) A Self-paced Learning for K-means Clustering algorithm (SPKC) is proposed. Clustering is a key class of machine learning algorithms, but existing K-means clustering has several deficiencies: (i) clustering results are very sensitive to noisy samples and outliers, which can cause large errors; (ii) the choice of k strongly influences the clustering result, so robustness is poor; (iii) because traditional K-means clustering is a non-convex optimization problem, it easily falls into a local optimum. This thesis therefore proposes a clustering algorithm that introduces a self-paced regularization term to address these problems. First, K-means clustering is reformulated as a matrix factorization problem. A self-paced regularization term is then added, and samples are ranked by their self-paced regularization factors so that they enter training from easy to difficult, simulating the human learning process. In experiments on real data, SPKC outperforms K-means, K-means++, ISODATA, FCM, and other compared methods. The proposed self-paced K-means clustering algorithm enriches the existing clustering framework to some extent and applies self-paced learning to high-dimensional data clustering.

In summary, this thesis studies how to improve hash-based dimensionality reduction so that it preserves both local and global similarity structure, and how to improve the K-means clustering model for high-dimensional data. Using principal component analysis, manifold learning, and self-paced learning to address problems in existing hashing and clustering algorithms, it proposes two improved machine learning algorithms. Each algorithm is evaluated on real public data sets against comparison algorithms, and both outperform common existing algorithms under multiple evaluation metrics.
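
The abstract describes hashing as encoding high-dimensional data into binary strings, with principal component analysis used to preserve global similarity. The sketch below illustrates only that basic PCA-and-threshold idea (essentially the PCAH baseline named in the abstract). It is not the thesis's PCAR objective: the manifold-learning term and any learned rotation are omitted, and the function names and the toy data are assumptions made for illustration.

```python
import numpy as np

def pca_hash_fit(X, n_bits):
    """Return the data mean and the top n_bits principal directions."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_bits].T  # projection matrix of shape (d, n_bits)

def pca_hash_encode(X, mean, W):
    """Project centered data onto W and threshold at zero to get bits."""
    return ((X - mean) @ W > 0).astype(np.uint8)

def hamming_distance(a, b):
    """Number of differing bits between two binary codes."""
    return int(np.count_nonzero(a != b))

# Toy data (assumed): two well-separated clusters in 16 dimensions.
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=-5.0, scale=0.5, size=(50, 16))
cluster_b = rng.normal(loc=5.0, scale=0.5, size=(50, 16))
X = np.vstack([cluster_a, cluster_b])

mean, W = pca_hash_fit(X, n_bits=4)
codes = pca_hash_encode(X, mean, W)  # one 4-bit code per sample
```

On this toy data the first bit follows the direction of largest variance, so it separates the two clusters, and `hamming_distance(codes[0], codes[50])` is at least 1. The remaining bits come from low-variance directions and carry little structure here, which is one motivation for methods that balance or rotate the projected dimensions.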
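
The easy-to-difficult sample selection described for SPKC can be illustrated with a small self-paced variant of K-means. This is a sketch under stated assumptions, not the thesis's method: it uses ordinary Lloyd-style updates rather than the matrix factorization form, a hard-threshold self-paced weight (a sample is used only if its loss is at most `lam`), and hypothetical parameters `start_quantile` and `growth` to control how quickly harder samples are admitted.

```python
import numpy as np

def self_paced_kmeans(X, init_centers, n_rounds=10, start_quantile=0.5, growth=1.3):
    """K-means in which only low-loss ("easy") samples update the centers.

    The threshold lam grows each round, so harder samples join later.
    """
    centers = np.array(init_centers, dtype=float)
    k = len(centers)
    lam = None
    selected = np.zeros(len(X), dtype=bool)
    for _ in range(n_rounds):
        # Squared distance from every sample to every center, shape (n, k).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        loss = d2[np.arange(len(X)), labels]
        if lam is None:
            # Assumed schedule: start from a quantile of the initial losses.
            lam = np.quantile(loss, start_quantile)
        # Hard self-paced weights: 1 for easy samples, 0 otherwise.
        selected = loss <= lam
        for j in range(k):
            members = selected & (labels == j)
            if members.any():
                centers[j] = X[members].mean(axis=0)
        lam *= growth  # admit harder samples in the next round
    # Final assignment of every sample to its nearest center.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d2.argmin(axis=1), selected

# Toy data (assumed): two clusters plus one distant outlier.
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
cluster_b = rng.normal(loc=10.0, scale=0.5, size=(50, 2))
outlier = np.array([[100.0, 100.0]])
X = np.vstack([cluster_a, cluster_b, outlier])

centers, labels, selected = self_paced_kmeans(X, init_centers=X[[0, 50]])
```

Because the outlier's loss stays far above the threshold in every round, it never contributes to a center update, which is how this kind of scheme reduces sensitivity to noise and outliers. Plain K-means with the same initialization would pull the second center toward the outlier.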
Keywords/Search Tags: Machine Learning, Principal Component Analysis, Manifold Learning, Hashing, Self-paced Learning, Cluster Analysis