
Learning Structure Features In High Dimensional Data Based On Natural Nearest Neighbor

Posted on: 2012-01-15    Degree: Doctor    Type: Dissertation
Country: China    Candidate: X L Zou    Full Text: PDF
GTID: 1118330362454439    Subject: Computer application technology
Abstract/Summary:
Analyzing and investigating the structure of large-scale, high-dimensional data sets has become a very important task in data mining, machine learning and pattern recognition, and it remains a challenging problem under modern scientific and technological conditions. The structural features contained in high-dimensional data typically appear as clusters or manifolds, and the methods used to reveal them draw on many branches of mathematics, such as multivariate analysis, functional analysis and statistics; nevertheless, simple and efficient approaches have received particular attention from researchers.

This dissertation proposes a novel nearest-neighbor concept, the natural nearest neighbor (3N neighbor), which, in contrast to k-nn and ε-nn, is scale-free. The key idea is that the farthest outliers in a data set should receive only one nearest neighbor, while points in dense regions should receive more, and that these neighbors can be computed for all points automatically; this automatic computation is a highlight of the dissertation. An adaptive nearest neighborhood graph (ANNG), or data network, induced by connecting each data point to its 3N neighbors, can likewise be constructed automatically for any data set. Borrowing concepts from complex networks, the ANNG behaves as a scale-free network closely tied to the data as a whole: it presents the natural relationships among data points, and density information is clearly displayed and easily obtained from it.
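The 3N search described above can be sketched in a few lines. The following Python sketch illustrates the general idea rather than the dissertation's exact algorithm: the neighborhood rank grows round by round, each point adopts its next-nearest neighbor, and the search stops once the set of never-adopted points stabilizes; natural neighbors are then taken as the mutual pairs. All names and the stopping rule are illustrative assumptions.

```python
import math

def natural_neighbor_search(points):
    """Sketch of a natural nearest neighbor (3N) search.

    Round by round, every point adopts its next-nearest neighbor; the
    search stops when the number of never-adopted points stops
    shrinking (those that remain behave as outliers). Natural
    neighbors are then the mutual pairs within the final r-NN sets.
    """
    n = len(points)
    # For each point, the other indices sorted by Euclidean distance.
    order = [
        sorted((j for j in range(n) if j != i),
               key=lambda j, i=i: math.dist(points[i], points[j]))
        for i in range(n)
    ]
    knn = [set() for _ in range(n)]
    prev_unadopted = None
    for r in range(n - 1):
        for i in range(n):
            knn[i].add(order[i][r])        # adopt the (r+1)-th neighbor
        in_degree = [0] * n
        for i in range(n):
            for j in knn[i]:
                in_degree[j] += 1
        unadopted = sum(1 for d in in_degree if d == 0)
        if unadopted == 0 or unadopted == prev_unadopted:
            break  # everyone adopted, or the outlier count has stabilized
        prev_unadopted = unadopted
    # Natural neighbors: mutual members of each other's r-NN sets.
    nan = [{j for j in knn[i] if i in knn[j]} for i in range(n)]
    return nan, r + 1  # r + 1 plays the role of the adaptive scale

# Toy data: a tight 2-D cluster plus one distant outlier.
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
nan, scale = natural_neighbor_search(pts)
```

In this toy run the cluster points end up as mutual neighbors of one another, while the isolated point at (10, 10) acquires none, mirroring the claim that sparse outliers receive the fewest natural neighbors (the dissertation assigns at least one neighbor even to the farthest outliers; this sketch simply leaves them empty).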
The mechanism by which 3N neighbors form greatly reduces the risk of crossing class boundaries when searching for neighbors, so the agglomerative structure of the original data set is preserved in the corresponding ANNG or network, where an infrastructure with a primary form of clustering is maintained. This provides a sparse and more efficient graph model for the subsequent steps of learning cluster or manifold structure. On the other hand, applying the constructed ANNG to classic manifold learning algorithms such as Isomap, LLE, HLLE and LTSA yields extended, adaptive versions of these algorithms in a straightforward way; more importantly, the long-standing problem of selecting the optimal neighborhood size in manifold learning is avoided in practice. This offers a new point of view on dimensionality reduction, one of the two hotspot issues in machine learning.

Additionally, two subspace structures, the outlying-features subspace and the clustering-features subspace, are discussed from a unified cluster-based viewpoint, and a new outlier index is defined in order to characterize outlier behavior. The main innovations and contributions are as follows.

(1) Natural nearest neighbor (3N), a novel concept, is proposed and implemented in a very simple way. The rationality of the 3N neighbor is validated on data sets with standard distributions (uniform and normal) and on regular data generated by special mechanisms, such as grid data. The distribution status of the data can be observed from the histogram of the 3N neighbor counts of all points, which is independent of the dimension.
3N neighbors contain more meaningful information than the traditional k-nn or ε-nn.

(2) Applying the 3N concept to the global manifold learning algorithm Isomap and to the local manifold learning algorithms LLE, HLLE and LTSA yields the extended algorithms 3N-ISOMAP, 3N-LLE and 3N-LTSA, which require no free parameter. These algorithms can therefore be used anywhere, at any time, without considering what value the free parameter should take, for tasks such as analyzing the complexity of homogeneous high-dimensional data subsets, automatically learning multiple manifolds or clusterings, and large-scale manifold learning.

(3) An improved spectral graph clustering algorithm, 3N-MNcut, is obtained by applying the ANNG to the original MNcut algorithm, since the constructed ANNG induces a well-formed similarity matrix that is sparse and preserves the local manifold structure.

(4) The structural-subspace features of outliers and clusters are closely interrelated and correspond to the largest eigenvalues. A new outlier index is proposed for observing the behavior of outliers from neighbors at different distances.
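To make contribution (2) concrete, here is a minimal sketch of the first stage of an adaptive Isomap-style pipeline: given a neighborhood graph such as the ANNG (represented here simply as per-point neighbor sets), it symmetrizes the graph, weights edges by Euclidean distance, and computes all-pairs geodesic distances with Floyd-Warshall, the quantity Isomap feeds into classical MDS. The function name and graph representation are illustrative assumptions, not the dissertation's code.

```python
import math

def geodesic_distances(points, neighbors):
    """Geodesic (shortest-path) distances over a neighborhood graph,
    as used in the first stage of Isomap. `neighbors[i]` is the set of
    neighbor indices of point i (e.g. its 3N neighbors); the graph is
    symmetrized and edges are weighted by Euclidean distance."""
    n = len(points)
    INF = float("inf")
    d = [[INF] * n for _ in range(n)]
    for i in range(n):
        d[i][i] = 0.0
        for j in neighbors[i]:
            w = math.dist(points[i], points[j])
            d[i][j] = min(d[i][j], w)
            d[j][i] = min(d[j][i], w)  # symmetrize the graph
    # Floyd-Warshall: fine for small n; a large-scale version would run
    # Dijkstra from each source over the sparse graph instead.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

# Three collinear points connected in a chain: the geodesic distance
# between the endpoints is the path length through the middle point.
pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
nbrs = [{1}, {2}, set()]
d = geodesic_distances(pts, nbrs)
```

Replacing the fixed-k graph with the ANNG at this step is, conceptually, all the 3N-ISOMAP extension needs: the neighborhood scale is decided by the data rather than by a user-chosen k.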
Keywords/Search Tags: Natural Nearest Neighbor (3N), Adaptive Manifold Learning, Spectral Graph Clustering, Universal Random Sampling, Outlying Behavior Analysis