
RESEARCH ON DIMENSIONS AND DATA LAYOUT METHODS IN HIGH-DIMENSIONAL DATA VISUAL ANALYSIS

Posted on: 2019-05-02
Degree: Master
Type: Thesis
Country: China
Candidate: X R Feng
GTID: 2428330542496911
Subject: Software engineering
Abstract/Summary:
Advances in science and technology have made it easy for every field to collect large amounts of dynamic high-dimensional data. How to process, analyze, and visualize such data effectively has become an active research topic. In high-dimensional data processing, regression analysis uses mathematical statistics to reveal the interdependence (correlation) between two or more dimensions. If such a relationship exists, the samples follow a visible trend when visualized; if it does not, the visualization shows a scattered group of points. Cluster analysis divides the samples of a dataset into several groups according to some relationship, so that similarity is high within a group and low between groups.

Parallel coordinates is a mature high-dimensional data visualization method that can accurately show the distribution of samples in each dimension. The radar chart is a variant of parallel coordinates and is often used to visualize multidimensional data in areas such as finance, weather, and multi-index analysis. RadViz (Radial Coordinate Visualization) is an improved form of the radar chart, a visualization method based on circular parallel coordinates. The dimensions of the high-dimensional data are projected as points spaced evenly on the circumference of a circle in the two-dimensional plane, and the samples are projected as points inside the circle, so that the distribution of the samples can be observed clearly.

Because RadViz spaces the dimension points evenly on the circumference, the layout cannot show the correlation between dimensions. We therefore propose two layout methods, one based on the Travelling Salesman Problem (TSP) algorithm and one based on the multidimensional scaling (MDS) algorithm, to improve how the dimensions in a RadViz graph are projected onto the plane. In the TSP-based method, we first build a dimension correlation matrix using the Pearson correlation coefficient (each element of the matrix is the correlation coefficient between the corresponding pair of data dimensions). We then use a power function as the transformation function to convert the correlation matrix into a Euclidean distance matrix between the dimension points on the plane (each element of the matrix is the Euclidean distance between a pair of dimension points). Next, the TSP algorithm projects the dimensions onto a fixed-length line segment as dimension points, and the line segment is mapped onto the unit circle of the plane, which gives the projection of the data dimensions to planar points. Finally, the CM algorithm adjusts the positions of the dimension points on the circumference to minimize the stress error, so that the RadViz method displays the correlation between dimensions.

When the MDS algorithm is used to adjust the RadViz dimension point layout, the dimensions are likewise first projected onto a one-dimensional line segment and mapped onto the unit circle to obtain the projection of the data dimensions to planar points. The CM algorithm then adjusts the positions of the dimension points on the circumference to minimize the stress error, again allowing the RadViz method to display the correlation between dimensions. Once the positions of the dimension points on the plane are determined, we use the Generalized Barycentric Coordinates method to project the sample points into the circle and complete the RadViz visualization.

We also discuss in detail how the global stress error of the MDS algorithm depends on the PTI of the distance matrix, on the method used to generate the initial positions of the points, and on the displacement strategy of the points, and we propose methods to further reduce the global stress error.

1) We define the PTI index of a distance matrix, randomly generate a number of distance matrices with different PTI values, and perform one-dimensional and two-dimensional dimensionality reduction with the MDS algorithm to obtain the corresponding global stress errors. We find that the PTI index and the corresponding global stress error are inversely related. We therefore apply a power function to the values in the distance matrix, subject to preserving the monotonic ordering of the distances, to raise the PTI index and reduce the global stress error of the MDS dimensionality reduction.

2) Generating the initial positions of the points with the Random method makes the final result of the MDS algorithm impossible to reproduce. We propose two replacements for the Random method, an initial position generation method based on the TSP algorithm and the DRGT (Delineated Range and Generated in The Turn) algorithm, which make the experimental results reproducible.

3) To address the problem that the force steering algorithm does not converge as a displacement strategy, we design the SEFM (Systematic Error in the same direction First Move) algorithm to replace it. By weighting the force steering algorithm and the SEFM algorithm together, we further reduce the global stress error.

Finally, we compare how the global stress error varies when different initial position generation methods are combined with different displacement strategies for MDS dimensionality reduction, in order to obtain a better MDS algorithm.
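
The correlation-to-distance step, the RadViz sample projection, and the stress measure described in the abstract can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the thesis: the transform `d = (1 - |r|) ** power` is only one plausible power-function mapping, the projection uses the classic RadViz weighted average rather than the Generalized Barycentric Coordinates method, the stress formula is a common normalized form, and all function names are hypothetical. The TSP, MDS, CM, DRGT, and SEFM steps are not implemented here.

```python
import numpy as np


def correlation_to_distance(data, power=1.0):
    """Pearson correlation between columns, mapped to a dissimilarity matrix.

    ASSUMPTION: d = (1 - |r|) ** power is an illustrative transform, not
    necessarily the power function used in the thesis.
    """
    corr = np.corrcoef(data, rowvar=False)
    # Clip guards against tiny negative values caused by rounding.
    dist = np.clip(1.0 - np.abs(corr), 0.0, None) ** power
    np.fill_diagonal(dist, 0.0)
    return dist


def circle_anchors(angles):
    """Place one dimension point on the unit circle per angle (radians)."""
    return np.column_stack([np.cos(angles), np.sin(angles)])


def radviz_project(data, anchors):
    """Classic RadViz: each sample is the weighted mean of the anchors.

    ASSUMPTION: weights are min-max normalized column values; this is the
    textbook RadViz mapping, not generalized barycentric coordinates.
    """
    low = data.min(axis=0)
    span = data.max(axis=0) - low
    span[span == 0] = 1.0              # constant columns get zero weight
    weights = (data - low) / span
    totals = weights.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0          # all-zero rows map to the origin
    return (weights / totals) @ anchors


def normalized_stress(target, points):
    """Normalized stress between target distances and embedded distances."""
    diff = points[:, None, :] - points[None, :, :]
    embedded = np.sqrt((diff ** 2).sum(axis=-1))
    upper = np.triu_indices_from(target, k=1)
    residual = ((target[upper] - embedded[upper]) ** 2).sum()
    return np.sqrt(residual / (target[upper] ** 2).sum())


# Example with synthetic data: 50 samples, 4 dimensions.
rng = np.random.default_rng(0)
data = rng.random((50, 4))

dist = correlation_to_distance(data, power=0.5)
angles = np.linspace(0.0, 2.0 * np.pi, data.shape[1], endpoint=False)
anchors = circle_anchors(angles)       # evenly spaced, as in plain RadViz
points = radviz_project(data, anchors)
stress = normalized_stress(dist, anchors)
```

In this sketch the anchors are spaced evenly, which is the baseline layout the abstract sets out to improve. A correlation-aware layout would choose `angles` so that `normalized_stress(dist, anchors)` becomes smaller; the target and embedded distances here are on different scales, so the stress value is only meaningful for comparing layouts against each other.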
Keywords/Search Tags: high-dimensional data, RadViz, TSP, MDS, CM algorithm, global stress error