
The Research Of Data Reduction Methods Based On Manifold Learning And Application

Posted on: 2013-12-13
Degree: Master
Type: Thesis
Country: China
Candidate: Z M Yan
GTID: 2248330371969923
Subject: Computer software and theory

Abstract/Summary:
With the continuous development of information technology, many scientific studies encounter data sets with high-dimensional characteristics. High dimensionality makes it difficult to uncover the inner laws and structures of the data, so appropriate data reduction methods are needed to process such data sets. Data reduction is also called dimension reduction or data dimension reduction; existing dimension-reduction methods produce different results for different data sets. From the structure of the data, reduction methods based on manifold learning fall into two categories: linear methods and nonlinear methods. Linear dimension reduction methods process linear and Gaussian data sets effectively, while nonlinear methods project data embedded in a high-dimensional space onto low-dimensional coordinates, so that the inherent geometric structure of the data can be explored further. Manifold learning reveals this inherent geometric structure through data analysis techniques: a concise low-dimensional structure can represent complex high-dimensional data. The main purpose of manifold learning is to find the internal distribution of data embedded in a high-dimensional space. In recent years, manifold learning has become a hot research topic in machine learning and related fields.

This thesis studies manifold-based data reduction methods and discusses manifold learning algorithms from two aspects: neighborhood parameter selection and the processing of new data points. The improved methods are applied to text clustering, and experimental results verify their feasibility and effectiveness. The main work is summarized as follows:
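As a concrete illustration of the linear case mentioned above, the following minimal sketch (numpy only; an illustration, not part of the thesis) projects centered data onto its top principal components via the SVD:

```python
# Minimal sketch: linear dimension reduction by PCA via SVD (illustrative only).
import numpy as np

def pca_project(X, d):
    Xc = X - X.mean(axis=0)                        # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:d].T                           # coordinates in the top-d subspace

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))                      # 50 points in 10 dimensions
Y = pca_project(X, 2)
print(Y.shape)                                     # (50, 2)
```

Nonlinear manifold methods replace the single global linear projection with locally fitted coordinates, which is what the rest of the thesis develops.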
1. We propose a method for judging the suitability of the neighborhood parameter. Kernel principal component analysis (kernel PCA) is used to compute the reconstruction errors of the data; the errors are gathered together, and the suitability of the neighborhood choice is judged from the number of clusters they form. Kernel PCA is a nonlinear method derived from principal component analysis: it replaces the inner products of data vectors with a kernel function while retaining the characteristics of PCA. Mapping the original data into a high-dimensional feature space with a nonlinear function requires inner-product computations in that space; replacing them with kernel evaluations on the original data greatly reduces the computation. We use the AIC information criterion to determine the number of clusters when evaluating the clustering. When the reconstruction errors gather into a single cluster, the chosen neighborhood parameter does not change the error structure, and we judge the neighborhood value to be suitable; when the errors form more than one cluster, the chosen parameter has seriously changed the error structure, and the neighborhood value is judged unsuitable.

2. We discuss a new dimension reduction method. Current studies show that the local tangent space alignment (LTSA) method has defects in some cases and is therefore rarely used: for example, the inner structure may be distorted or incomplete when large data sets are processed, and LTSA is not ideal for handling new sample points. The optimized linear discriminant analysis method is a linear dimension reduction method; it optimizes the Fisher criterion of the original method and makes the method more convenient to use.
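The neighborhood-suitability test described in point 1 can be sketched as follows. This is a hypothetical illustration, not the thesis implementation: it assumes scikit-learn's `KernelPCA` (with `fit_inverse_transform=True` for approximate reconstruction) and `GaussianMixture`, whose `aic` method scores the number of error clusters; the function name `neighborhood_is_suitable` is invented for the sketch.

```python
# Hypothetical sketch: judge neighborhood-parameter suitability by clustering
# kernel-PCA reconstruction errors and choosing the cluster count by AIC.
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.mixture import GaussianMixture

def neighborhood_is_suitable(X, n_components=2, max_clusters=3):
    # Kernel PCA with an approximate pre-image so points can be reconstructed.
    kpca = KernelPCA(n_components=n_components, kernel="rbf",
                     fit_inverse_transform=True)
    Z = kpca.fit_transform(X)
    X_rec = kpca.inverse_transform(Z)
    errors = np.linalg.norm(X - X_rec, axis=1).reshape(-1, 1)

    # Score 1..max_clusters Gaussian components on the errors; pick minimum AIC.
    aics = [GaussianMixture(n_components=k, random_state=0)
            .fit(errors).aic(errors) for k in range(1, max_clusters + 1)]
    best_k = int(np.argmin(aics)) + 1

    # A single error cluster means the neighborhood size kept the error
    # structure intact, so the parameter is judged suitable.
    return best_k == 1

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
print(neighborhood_is_suitable(X))
```

In the thesis the reconstruction errors come from the manifold embedding under a candidate neighborhood size; here random data merely exercises the decision rule.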
In this thesis, we combine the optimized linear discriminant analysis method with the local tangent space alignment (LTSA) method, using the optimized Fisher criterion to solve for the within-class and between-class projection matrices and finally obtain the optimal projection matrix of the data. By combining the two methods, new data points can be processed effectively.

3. We discuss the application of manifold-learning-based dimension reduction to text clustering. Text information is generally represented by a term-frequency matrix, and such matrices are also high-dimensional, so a proper dimension reduction method is needed to explore the inner rules of text data further. In recent years, data reduction techniques have gradually been applied to text clustering. In this thesis, we use the LTSA method based on optimal linear discriminant analysis to process high-dimensional text data: we obtain the local neighborhoods of the data and their local tangent-space coordinates, then align the local coordinates in the low-dimensional space by minimizing the local alignment error, yielding global coordinates. To obtain a good visual effect, we use the k-means method to cluster the data and use an entropy value to evaluate cluster quality.
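The entropy-based evaluation in point 3 can be sketched as follows; this is a minimal illustration (the function name `cluster_entropy` is invented here), computing the class entropy within each cluster weighted by cluster size, where lower values indicate purer clusters:

```python
# Minimal sketch: evaluate clustering quality by weighted within-cluster entropy.
import numpy as np
from collections import Counter

def cluster_entropy(true_labels, cluster_labels):
    n = len(true_labels)
    total = 0.0
    for c in set(cluster_labels):
        # Indices of the documents assigned to cluster c.
        idx = [i for i, cl in enumerate(cluster_labels) if cl == c]
        # Distribution of true classes inside this cluster.
        counts = Counter(true_labels[i] for i in idx)
        probs = np.array(list(counts.values())) / len(idx)
        h = -np.sum(probs * np.log2(probs))
        total += len(idx) / n * h                  # weight by cluster size
    return total

# A perfect clustering separates the classes, giving entropy 0.
print(cluster_entropy(["a", "a", "b", "b"], [0, 0, 1, 1]))  # 0.0
```

In the thesis pipeline, `cluster_labels` would come from k-means run on the low-dimensional coordinates produced by the reduction method.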
Keywords/Search Tags: Manifold learning, Data reduction, AIC information criterion, Text clustering, Entropy value