Research On Text Classification Method Based On Manifold Learning

Posted on: 2013-05-27
Degree: Master
Type: Thesis
Country: China
Candidate: G Wang
GTID: 2268330392465640
Subject: Computer application technology
Abstract/Summary:
Text classification has a wide range of applications, including information retrieval, text filtering, and document organization. Text data is usually represented as item word (term) vectors. Because the number of characteristic words is large, these vectors are often very high-dimensional. Applying existing text classification algorithms directly to such data runs into the "curse of dimensionality" and yields poor efficiency and accuracy. This thesis therefore uses manifold learning algorithms to reduce the dimension of the text data and then classifies the resulting low-dimensional data. The main work is as follows.

First, the thesis proposes a text classification method based on manifold learning. A manifold learning algorithm reduces the dimension of the text data to be classified, and a text classification algorithm then classifies the resulting low-dimensional data. Efficiency improves because the dimension of the text data is reduced.

Second, when a manifold learning algorithm searches for nearest neighbors in high-dimensional space, Euclidean distance fails to represent the degree of similarity between two samples. A new similarity measure, Item Words Intersection Distance, is therefore proposed. It is based on the item words that two samples have in common and expresses the degree of similarity between samples well. Experimental results show that classification improves greatly when the dimension of the text data is reduced by a manifold learning algorithm based on Item Words Intersection Distance.

Finally, Angle Cosine Distance and Item Words Intersection Distance are combined into a new distance measure. The Euclidean distance in the manifold learning algorithm is replaced in turn by Angle Cosine Distance, Item Words Intersection Distance, and the new combined measure. The improved manifold learning algorithm reduces the dimension of the text data, and the resulting low-dimensional data is used for text classification. In the experiments, the manifold learning algorithms are ISOMAP, LLE, and LE, and the classification algorithms are SVM, NB, and KNN. Experimental results show that the improved manifold learning algorithm greatly improves classification accuracy and efficiency.
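
The abstract does not give formulas for its distance measures. As a rough illustration only, the sketch below assumes that Item Words Intersection Distance is a Jaccard-style measure over the sets of nonzero terms in two samples, and that the combined measure is a weighted sum with a hypothetical weight `alpha`. The function names and the brute-force neighbor search are likewise assumptions made for this example, not the thesis's implementation.

```python
import math


def term_set(vec):
    """Indices of the terms (item words) with nonzero weight."""
    return {i for i, w in enumerate(vec) if w != 0}


def intersection_distance(a, b):
    """ASSUMED form: 1 - |shared terms| / |all terms in either sample|."""
    sa, sb = term_set(a), term_set(b)
    union = sa | sb
    if not union:
        return 0.0  # two empty documents are treated as identical
    return 1.0 - len(sa & sb) / len(union)


def cosine_distance(a, b):
    """1 - cosine of the angle between two term vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0  # no direction to compare; treat as maximally distant
    return 1.0 - dot / (na * nb)


def combined_distance(a, b, alpha=0.5):
    """ASSUMED combination: weighted sum of the two measures."""
    return alpha * cosine_distance(a, b) + (1 - alpha) * intersection_distance(a, b)


def nearest_neighbors(docs, k, dist=combined_distance):
    """Brute-force k-nearest-neighbor lists under a chosen distance.

    This is the neighbor-search step where a manifold learning
    algorithm would otherwise use Euclidean distance.
    """
    result = []
    for i, a in enumerate(docs):
        others = [j for j in range(len(docs)) if j != i]
        others.sort(key=lambda j: dist(a, docs[j]))
        result.append(others[:k])
    return result


# Toy term vectors: the first two share all their terms, the third shares none.
docs = [[1, 1, 0, 0], [2, 1, 0, 0], [0, 0, 1, 1]]
neighbors = nearest_neighbors(docs, k=1)
```

In this toy example the first two documents are each other's nearest neighbor under the combined measure, since they use the same terms and point in nearly the same direction. The neighbor lists could then be fed to a neighbor-graph method such as ISOMAP, LLE, or LE in place of Euclidean neighbors.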
Keywords/Search Tags: text classification, manifold learning, angle cosine distance, item words intersection distance