Research Of Chinese Text Classification Based On Manifold Learning Method

Posted on: 2015-12-22
Degree: Master
Type: Thesis
Country: China
Candidate: H C Zhai
Full Text: PDF
GTID: 2298330452953445
Subject: Computer Science and Technology
Abstract/Summary:
Information science is developing rapidly, and large volumes of data are generated around us all the time. These data are often high-dimensional, and high-dimensional data are difficult to interpret. Selecting the most representative features from a high-dimensional data set and mining the most useful information has therefore become an active research topic in text processing.

In this thesis, manifold learning methods are introduced into Chinese text classification to handle the nonlinear, high-dimensional data sets that arise in text processing. After text preprocessing and Chinese word segmentation, we obtain a sequence of words. A feature selection function is used to calculate the weight of each word, and noisy and redundant terms are removed from the corpus. The remaining lexical items are then used to represent the documents as a feature item-feature space matrix. On the basis of this text feature space matrix, manifold learning algorithms are applied for nonlinear dimensionality reduction of the high-dimensional data. For this task, manifold learning algorithms have inherent advantages over conventional nonlinear dimensionality reduction techniques such as the self-organizing map, principal curves, generative topographic mapping, and kernel principal component analysis.

This thesis studies two manifold learning methods: isometric mapping (Isomap) and locally linear embedding (LLE). Isomap operates on the entire data set and seeks to preserve the geodesic distances between data points, and is thus able to preserve the topological structure of the manifold over the whole data set. LLE, when mapping the data from the high-dimensional space to the low-dimensional space, keeps the linear relationship between each data point and the points in its local neighborhood unchanged. Both methods start by preserving certain properties of local neighborhoods in order to preserve the overall geometric and topological properties of the data set. As manifold learning algorithms, however, the two share a common problem: there is no comprehensive method for estimating the intrinsic dimension of the data space.

Furthermore, this thesis improves the feature weight calculation method. A feature item should both express the information of the document itself and carry the category information needed for text classification. General feature weight calculation methods carry no category information and are therefore ill-suited to classification. This thesis improves the traditional feature item weight calculation method by integrating the feature selection function into it, blending in category information to improve the final classification result.
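
The idea of folding a feature selection function into the feature weight can be illustrated with a minimal sketch. The abstract does not say which feature selection function or which base weighting scheme the thesis uses, so everything concrete below is an assumption: the chi-square statistic stands in for the feature selection function, a smoothed TF-IDF stands in for the traditional weight, and the names chi_square and category_aware_weights are hypothetical. The documents are assumed to be already segmented into words.

```python
import math
from collections import Counter


def chi_square(n_tc, n_t, n_c, n):
    """Chi-square statistic of a term against a category, from document counts.

    n_tc: documents in the category that contain the term
    n_t:  documents that contain the term
    n_c:  documents in the category
    n:    all documents
    """
    a = n_tc                      # in category, contains term
    b = n_t - n_tc                # outside category, contains term
    c = n_c - n_tc                # in category, lacks term
    d = n - n_t - n_c + n_tc      # outside category, lacks term
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0                # term or category covers all or no documents
    return n * (a * d - b * c) ** 2 / denom


def category_aware_weights(docs, labels):
    """Return one {term: weight} dict per document.

    Assumed form: weight = tf * idf * max over categories of chi_square.
    """
    n = len(docs)
    doc_freq = Counter()          # term -> number of documents containing it
    class_doc_freq = Counter()    # (term, label) -> number of documents
    class_size = Counter(labels)  # label -> number of documents
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            doc_freq[term] += 1
            class_doc_freq[(term, label)] += 1

    # Category score of a term: its strongest association with any category.
    score = {
        term: max(chi_square(class_doc_freq[(term, c)], df, class_size[c], n)
                  for c in class_size)
        for term, df in doc_freq.items()
    }

    weighted = []
    for tokens in docs:
        tf = Counter(tokens)
        weighted.append({
            term: (count / len(tokens))
                  * math.log(n / doc_freq[term] + 1.0)   # smoothed idf (one variant)
                  * score[term]
            for term, count in tf.items()
        })
    return weighted


# Toy corpus of pre-segmented documents (illustrative data, not from the thesis).
docs = [
    ["足球", "比赛", "报道"],
    ["足球", "球员", "报道"],
    ["股票", "市场", "报道"],
    ["股票", "基金", "报道"],
]
labels = ["sports", "sports", "finance", "finance"]
weights = category_aware_weights(docs, labels)
```

In this toy corpus the word "报道" occurs in every document of both categories, so its category score and final weight are zero, while "足球" occurs only in the sports documents and receives the largest weight in the first document. A plain TF-IDF weight would not make that distinction on category grounds, which is the gap the abstract describes.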
Keywords/Search Tags: manifold learning, text classification, isometric mapping, locally linear embedding, feature item