
Variational Auto-Encoder Combined With T-Distributed Stochastic Neighbor Embedding For Dimensionality Reduction And Cluster Analysis

Posted on: 2020-09-10
Degree: Master
Type: Thesis
Country: China
Candidate: Y Y Guo
Full Text: PDF
GTID: 2428330596982764
Subject: Applied statistics
Abstract/Summary:
Today's Internet is booming, and with the advance of technology our access to information has grown rapidly; the development of big data has entered an intense phase. However, the data encountered in many fields are high-dimensional, and thousands of dimensions pose great challenges for subsequent analysis and computation: many commonly used algorithms fail on high-dimensional data sets. To mine and analyze the latent information in high-dimensional data, a family of dimensionality-reduction algorithms has emerged. The core idea of dimensionality reduction is to apply some mapping to data in a high-dimensional space so as to obtain a representation in a low-dimensional space, where existing low-dimensional algorithms can then be applied.

In this paper, a VAE built on an MLP neural network is combined with t-distributed stochastic neighbor embedding (t-SNE) to reduce the dimension of high-dimensional data in an unsupervised manner. We design a three-layer encoder and decoder: the encoder extracts features, and the decoder reconstructs an approximation of the original sample. The network is trained with mini-batch gradient descent. The encoder first reduces the high-dimensional data to an intermediate dimension, t-SNE then performs a further reduction, and finally K-means is applied to the resulting low-dimensional data.

Experiments show that black-box variational inference improves the flexibility and versatility of the model for large sample sizes and high dimensions, which yields a better dimensionality-reduction effect. Moreover, t-SNE is the method best suited to keeping the neighborhood probability distribution of the low-dimensional data consistent with that of the high-dimensional data: after the data are reduced to the intermediate dimension, t-SNE maps distant points farther apart in the low-dimensional space, which avoids the crowding of data points and maximizes the consistency between the final result and the intermediate-dimension space. Compared with the traditional PCA method, the proposed method extracts data features more effectively, improves the between-class dispersion in cluster analysis, and achieves better clustering. Finally, numerical examples illustrate the effectiveness of the proposed algorithm.
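The three-stage pipeline described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: a plain linear autoencoder trained by mini-batch gradient descent stands in for the MLP-based VAE (the variational/black-box-inference part is omitted for brevity), and the toy data, dimensions, and hyperparameters are all assumptions made for the example.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy "high-dimensional" data: three Gaussian blobs in 50 dimensions,
# standardized per coordinate for stable gradient descent.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 50))
               for c in (0.0, 3.0, 6.0)])
X = (X - X.mean(axis=0)) / X.std(axis=0)

# --- Step 1: autoencoder trained by mini-batch gradient descent ----------
d_in, d_mid = X.shape[1], 10                     # intermediate dimension (assumed)
We = rng.normal(scale=0.1, size=(d_in, d_mid))   # encoder weights
Wd = rng.normal(scale=0.1, size=(d_mid, d_in))   # decoder weights
lr, batch = 1e-4, 16

for epoch in range(200):
    order = rng.permutation(len(X))
    for i in range(0, len(X), batch):
        B = X[order[i:i + batch]]
        Zb = B @ We                  # encode
        Err = Zb @ Wd - B            # reconstruction error
        # Gradients of the mean squared reconstruction loss.
        gWd = Zb.T @ Err * (2.0 / len(B))
        gWe = B.T @ (Err @ Wd.T) * (2.0 / len(B))
        We -= lr * gWe
        Wd -= lr * gWd

Z = X @ We                           # codes in the intermediate dimension

# --- Step 2: t-SNE reduces the intermediate codes to 2-D -----------------
Y2 = TSNE(n_components=2, perplexity=15, init="pca",
          random_state=0).fit_transform(Z)

# --- Step 3: K-means clusters the 2-D embedding --------------------------
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Y2)
print(Y2.shape, len(set(labels)))    # -> (90, 2) 3
```

Replacing the linear encoder with the thesis's three-layer MLP VAE changes step 1 only; steps 2 and 3 operate on whatever intermediate codes the encoder produces.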
Keywords/Search Tags:Variational auto-encoder, t-distributed stochastic neighbor embedding, Mini-batch gradient descent, K-means, Dimensionality reduction