
Application of the t-SNE Algorithm in Dimensionality Reduction of High-Dimensional Data

Posted on: 2022-02-09
Degree: Master
Type: Thesis
Country: China
Candidate: Z Y Xu
GTID: 2568306326974649
Subject: Applied Statistics
Abstract/Summary:
Processing high-dimensional data and extracting relevant information from it is an active direction in statistical data analysis. Analyzing such data poses many challenges, among which the "curse of dimensionality" is a common one. A standard remedy is dimensionality reduction, that is, mapping the high-dimensional data into a low-dimensional space. Dimensionality reduction methods can be divided into linear and nonlinear methods. Principal component analysis (PCA) is the most commonly used linear method, while kernel principal component analysis (KPCA), Isomap, LLE, and t-SNE are commonly used nonlinear methods.

Among these, t-SNE has advantages over other traditional nonlinear methods on some high-dimensional, multi-manifold data sets, but its limitations are also prominent. When a high-dimensional data set contains outliers, the similarity measure is affected by them and the quality of the reduction suffers. In addition, the loss function used in model iteration has some defects. To address these issues, this thesis proposes an improved t-SNE dimensionality reduction algorithm based on the Hsim distance and the JS divergence, called HJ-t-SNE. Its advantage is that the Hsim distance is not easily affected by a few abnormal dimensions, and the JS divergence satisfies the nonnegativity, boundedness, and symmetry required of a measure of distribution similarity, with the triangle inequality holding for its square root.

In the simulation study, three distributions of outliers are simulated in data sets with similar sample size and dimension, and data sets with large differences in sample size and dimension are also considered. K-means clustering is applied to the low-dimensional data obtained after reduction in order to compare the different dimensionality reduction methods. The results show that, other conditions being equal, HJ-t-SNE yields better clustering than t-SNE on high-dimensional data sets with a small number of outliers and with scattered outliers.

In the empirical part, HJ-t-SNE, t-SNE, and several other commonly used dimensionality reduction methods are applied to an urban construction land data set with scattered outliers, and the resulting low-dimensional data are clustered. Judged by the silhouette coefficient and the adjusted Rand index, HJ-t-SNE gives the best dimensionality reduction among the methods compared.
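
The two ingredients named in the abstract, the Hsim distance and the JS divergence, can be illustrated with a minimal sketch. It assumes the commonly cited form of the Hsim similarity (the mean over dimensions of 1 / (1 + |x_i - y_i|)) and the standard Jensen-Shannon divergence with natural logarithms. The function names and the conversion `1 - similarity` used to obtain a distance are illustrative assumptions, not taken from the thesis, which may define these quantities differently.

```python
import math


def hsim_similarity(x, y):
    """Hsim similarity: mean over dimensions of 1 / (1 + |x_i - y_i|).

    Each dimension contributes a value in (0, 1], so one extreme
    coordinate can lower the similarity by at most 1 / d.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    return sum(1.0 / (1.0 + abs(a - b)) for a, b in zip(x, y)) / len(x)


def hsim_distance(x, y):
    # Assumption for illustration: convert the similarity to a distance
    # as 1 - similarity; the thesis may use a different conversion.
    return 1.0 - hsim_similarity(x, y)


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, nonnegative, bounded by ln 2."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


if __name__ == "__main__":
    # Two points that differ only in one abnormal dimension.
    a = [0.0, 0.0, 0.0, 0.0]
    b = [0.0, 0.0, 0.0, 1000.0]
    print("Hsim distance:", hsim_distance(a, b))
    print("Euclidean distance:", math.dist(a, b))

    p = [0.7, 0.2, 0.1]
    q = [0.1, 0.3, 0.6]
    print("JS(p, q):", js_divergence(p, q), "JS(q, p):", js_divergence(q, p))
```

In this sketch a single abnormal coordinate changes the Hsim-based distance by less than 1 / d, whereas it dominates the Euclidean distance, and swapping the arguments of `js_divergence` leaves the value unchanged, unlike the KL divergence.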
Keywords/Search Tags: Dimension Reduction, t-SNE, Hsim Distance, JS Divergence