Text Classification Based On Latent Semantic Indexing

Posted on: 2006-06-25; Degree: Master; Type: Thesis
Country: China; Candidate: D Quan
GTID: 2168360155458059; Subject: Computer software and theory
Abstract:
Automated classification of texts into predefined categories has progressed rapidly over the last ten years, driven by the growing availability of documents in digital form and the ensuing need to organize them. Machine learning techniques are used to build a classifier automatically by learning the characteristics of each category from a set of previously classified documents.
Traditional text classification (TC) methods are based on the bag-of-words representation, which has two main flaws: it carries little category information, and its high dimensionality leads to data sparseness. Phrases can relieve the first problem but aggravate the second. The usual remedy for the second is dimensionality reduction (DR), which removes features that have little or no effect and represents the text with the remaining ones. According to the nature of the resulting terms, DR methods fall into two types: (1) term selection, in which the resulting terms are a subset of the original terms; and (2) term extraction, in which the resulting terms are not a subset of the original terms. Latent Semantic Indexing (LSI) is a term extraction method that projects terms from the word space into a latent semantic space and addresses both problems at the same time.
Singular Value Decomposition (SVD) is the traditional LSI method and achieves very good performance; its main drawbacks are speed and memory consumption. Semi-discrete decomposition (SDD) is another LSI method that is faster and needs less memory, at the cost of a small reduction in performance.
In this thesis, we study text classification based on LSI. We examine the factors that may affect performance, mainly the choice of term selection method and of term weighting method. We also propose an improvement to the LSI model that markedly improves the performance of SDD.
We designed a series of experiments on two corpora, one Chinese and one English, using k-nearest neighbors (KNN) as the classifier. The results show that the choice of feature selection method and term weighting method has a large effect on LSI, but that no single method performs well under all conditions. The results also show that our improvement has a very good effect on SDD.
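To make the pipeline described in the abstract concrete, here is a minimal sketch of SVD-based LSI followed by KNN classification. It is an illustration under stated assumptions, not the thesis's implementation: the toy term-document matrix, the raw-count weighting, the cosine-weighted vote, and the function names `lsi_fit`, `lsi_fold_in`, and `knn_predict` are all hypothetical. Only the SVD variant is shown; SDD and the proposed improvement are not reproduced here.

```python
import numpy as np

def lsi_fit(X, k):
    """Truncated SVD of a terms-by-documents matrix X, keeping k dimensions."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U_k, s_k = U[:, :k], s[:k]
    # Row j holds document j in the latent space: S_k V_k^T[:, j] = U_k^T X[:, j].
    docs_k = Vt[:k, :].T * s_k
    return U_k, s_k, docs_k

def lsi_fold_in(U_k, q):
    """Project a new term vector q into the latent space."""
    return U_k.T @ q

def knn_predict(docs_k, labels, q_k, n_neighbors=3):
    """Cosine-similarity KNN with a similarity-weighted vote (an assumed choice)."""
    norms = np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k) + 1e-12
    sims = (docs_k @ q_k) / norms
    votes = {}
    for i in np.argsort(-sims)[:n_neighbors]:
        votes[labels[i]] = votes.get(labels[i], 0.0) + sims[i]
    return max(votes, key=votes.get)

# Toy corpus (hypothetical): rows are terms, columns are documents, raw counts.
# Terms: ball, team, goal, stock, market, price
X = np.array([
    [2, 1, 1, 0, 0, 0],
    [1, 2, 0, 0, 0, 0],
    [0, 1, 2, 0, 0, 0],
    [0, 0, 0, 3, 1, 1],
    [0, 0, 0, 1, 2, 0],
    [0, 0, 0, 0, 1, 2],
], dtype=float)
labels = ["sports", "sports", "sports", "finance", "finance", "finance"]

U_k, s_k, docs_k = lsi_fit(X, k=2)

# A new document containing "ball" and "team" once each.
query = np.array([1, 1, 0, 0, 0, 0], dtype=float)
prediction = knn_predict(docs_k, labels, lsi_fold_in(U_k, query))
print(prediction)
```

In this sketch a new document is folded into the latent space with the same projection that places the training documents there, so query and training vectors are directly comparable. In practice the raw counts would be replaced by whichever term weighting scheme is under study, since the abstract reports that this choice strongly affects LSI.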
Keywords: Text Classification, Latent Semantic Indexing, Singular Value Decomposition, Semi-Discrete Matrix Decomposition