
Centroid-based dimension reduction methods for classification of high dimensional text data

Posted on: 2002-07-01
Degree: Ph.D.
Type: Dissertation
University: University of Minnesota
Candidate: Jeon, Moon-Gu
Full Text: PDF
GTID: 1468390011991678
Subject: Computer Science
Abstract/Summary:
In today's information retrieval systems using the vector space model, dimension reduction is essential for efficiently manipulating massive quantities of data. To be useful, the reduced dimensional data must be a good approximation of the full dimensional data while providing reduced computational complexity and improved accuracy. Three dimension reduction methods that fulfill these requirements, named Centroid, CentroidQR, and DiscGsvd, are developed in this research.

Our Centroid method projects the full dimensional data onto the centroid space of its classes, which gives a tremendous reduction in dimension, from the number of terms down to the number of classes, while improving the original class structure. One of its interesting properties is that, for two different similarity measures, centroid-based classification gives identical results in the full dimensional space and in the reduced space formed by the Centroid method. The second method, CentroidQR, is a variant of the Centroid method that uses as its projection space the k columns of the orthogonal matrix Q from the QR decomposition of the centroid matrix. It gives the same reduction in dimension as the Centroid method, but it has an additional interesting property which we call the "order preserving property". Experimental results reveal this property, and a mathematical proof is provided.

The third method, DiscGsvd, is an improved version of linear discriminant analysis based on the generalized singular value decomposition (GSVD). A limitation of the original linear discriminant analysis is that the within-class scatter matrix must be nonsingular to obtain the optimal transformation matrix. This prevents its application to high dimensional data such as electronic documents, in which the number of features may exceed the number of data points. Applying the GSVD resolves the singularity problem and allows us to avoid forming the scatter matrices explicitly, working instead directly with the training data matrix, which requires less memory. Finally, we present experimental results that compare classification in the full and the reduced dimensional spaces using centroid-based and kNN classifiers. The results confirm the effectiveness of our methods.
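For concreteness, the following is a minimal sketch, not taken from the dissertation itself, of how the Centroid and CentroidQR projections described above might be implemented with NumPy. The function names, the random example data, and the term-by-document layout (documents as columns) are illustrative assumptions.

```python
import numpy as np

def centroid_projection(A, labels):
    """Build the centroid matrix C (one column per class) and project the
    term-document matrix A onto the space of class centroids.

    A      : (m, n) array, columns are documents in the full m-dimensional space
    labels : length-n array of class labels

    Returns C (m, k) and the reduced data C^T A (k, n), where k is the
    number of classes, as in the Centroid method described in the abstract.
    """
    classes = np.unique(labels)
    C = np.column_stack([A[:, labels == c].mean(axis=1) for c in classes])
    return C, C.T @ A

def centroid_qr_projection(A, labels):
    """CentroidQR variant: project onto the k orthonormal columns of Q
    from the thin QR decomposition of the centroid matrix."""
    C, _ = centroid_projection(A, labels)
    Q, _ = np.linalg.qr(C)          # Q has orthonormal columns, shape (m, k)
    return Q, Q.T @ A

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.random((1000, 60))           # 1000 terms, 60 documents (synthetic)
    labels = np.repeat([0, 1, 2], 20)    # 3 classes -> reduced dimension 3
    _, A_centroid = centroid_projection(A, labels)
    _, A_qr = centroid_qr_projection(A, labels)
    print(A_centroid.shape, A_qr.shape)  # (3, 60) (3, 60)
```

In this sketch the reduced dimension equals the number of classes, matching the reduction claimed for both methods; the QR variant differs only in that its projection basis is orthonormal.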
Keywords/Search Tags: Dimensional, Centroid, Method, Classification, Data, Using, Space, Linear discriminant analysis