Embedding And Visualization For High Dimensional Unit Data

Posted on: 2017-02-26  Degree: Master  Type: Thesis
Country: China  Candidate: M Wang  Full Text: PDF
GTID: 2348330518493445  Subject: Electronics and Communications Engineering
Abstract/Summary:
This thesis studies dimension-reduction algorithms for unit-normalized high-dimensional data, with the goal of embedding such data into a low-dimensional space where it can be visualized directly. In industry and research, visualization is an effective way to analyze and present the distribution and clustering of high-dimensional data. The usual tool is a scatter plot, in which each point corresponds to one high-dimensional data point, so the plot directly shows how the data are distributed and whether they form clusters. Data can only be plotted this way if its dimension is at most three, so dimension reduction is needed to visualize high-dimensional data.

The essence of dimension reduction is to make the structure of the data in the low-dimensional embedding space match the structure of the original high-dimensional data as closely as possible. A dimension-reduction algorithm must therefore respect the structure or distribution of the data. Common unit data have a planar or spherical structure, so achieving good results on such data requires adapting the standard dimension-reduction algorithms.

Many methods for data embedding and visualization already exist. One of them, t-SNE, rests on the assumption that the data follow an unconstrained Gaussian distribution in Euclidean space. In many situations, however, the data lie in a constrained space where the Gaussian assumption no longer holds. Spherical data, which are L2-norm normalized, are better described by the von Mises-Fisher (vMF) distribution than by a Gaussian. Likewise, plane data, which are L1-norm normalized, are better modeled by the Dirichlet distribution. This thesis therefore presents two new embedding algorithms, one based on the vMF distribution and one based on the Dirichlet distribution.

Once the dimension of the data is no more than three, the data can be visualized as described above, and the drawing technique itself has little research value. The algorithm that reduces the dimension for visualization has higher research value, so this thesis focuses on that algorithm and does not cover drawing techniques.

The main work of this thesis is as follows:
1. Analyze traditional algorithms for high-dimensional data embedding, in particular the more effective t-SNE method, and describe the advantages of t-SNE over other methods as well as its disadvantages when embedding data that lie in a constrained space.
2. For spherical data, present a new embedding method based on the vMF distribution, vMF-SNE; analyze the algorithm and compare it with t-SNE experimentally.
3. For plane data, present another embedding method based on the Dirichlet distribution, dirichlet-SNE; analyze the algorithm and compare it with t-SNE experimentally.

These two embedding algorithms for constrained high-dimensional data, such as spherical and plane data, are useful for both research and engineering.
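
As a rough illustration of the two constraints the abstract names, the sketch below L2-normalizes a vector onto the unit sphere, L1-normalizes a non-negative vector onto the simplex, and evaluates the corresponding distributions. It is not the thesis's vMF-SNE or dirichlet-SNE algorithm, which the abstract does not specify. The function names, the `kappa` value, and the sample points are all assumptions chosen for illustration; the only facts relied on are that the vMF density is proportional to exp(kappa * mu·x) on the unit sphere and the standard Dirichlet log-density formula.

```python
import math


def l2_normalize(x):
    """Scale a vector to unit L2 norm, so it lies on the unit sphere."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]


def l1_normalize(x):
    """Scale a non-negative vector so its entries sum to 1 (the simplex)."""
    total = sum(x)
    return [v / total for v in x]


def vmf_affinities(points, i, kappa=1.0):
    """Neighbor probabilities p(j|i) from a vMF-style kernel.

    Assumption for illustration: point i plays the role of the mean
    direction, and each other unit vector j gets weight
    exp(kappa * cos(angle between i and j)), normalized to sum to 1.
    """
    weights = []
    for j, p in enumerate(points):
        if j == i:
            weights.append(0.0)  # a point is not its own neighbor
        else:
            cosine = sum(a * b for a, b in zip(points[i], p))
            weights.append(math.exp(kappa * cosine))
    z = sum(weights)
    return [w / z for w in weights]


def dirichlet_log_density(x, alpha):
    """Log-density of the Dirichlet distribution at a point x on the simplex."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(v) for a, v in zip(alpha, x))


if __name__ == "__main__":
    # Hypothetical sample data: three 3-dimensional points.
    sphere = [l2_normalize(p) for p in ([1.0, 0.0, 0.0],
                                        [0.9, 0.1, 0.0],
                                        [0.0, 0.0, 1.0])]
    print(vmf_affinities(sphere, 0, kappa=2.0))

    simplex_point = l1_normalize([2.0, 3.0, 5.0])
    print(dirichlet_log_density(simplex_point, [1.0, 1.0, 1.0]))
```

In this sketch a larger `kappa` concentrates the neighbor probabilities on the points with the smallest angle to point `i`, which is loosely analogous to a smaller Gaussian bandwidth in t-SNE.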
Keywords/Search Tags: data embedding, data visualization, t-SNE, von Mises-Fisher distribution, Dirichlet distribution