Text Classification Based On Latent Semantic Indexing

Posted on:2006-06-25

Degree:Master

Type:Thesis

Country:China

Candidate:D Quan

Full Text:PDF

GTID:2168360155458059

Subject:Computer software and theory

Abstract/Summary:


The automated classification of texts into pre-specified categories has progressed rapidly over the last ten years, driven by the increased availability of documents in digital form and the ensuing need to organize them. Machine learning techniques are used in this process to build a classifier automatically by learning the characteristics of the categories from a set of previously classified documents.

Traditional text classification (TC) methods are based on the bag-of-words representation, which has two main flaws: it carries little category information, and its high dimensionality causes data sparsity. Phrases can be used to relieve the first problem, but they aggravate the second. The usual remedy for the second problem is dimensionality reduction (DR), which removes features with little or no effect and uses the remaining features to represent the text. According to the nature of the resulting terms, DR methods can be divided into two types: (1) term selection, where the resulting terms are a subset of the original terms; and (2) term extraction, where the resulting terms are not a subset of the original terms. Latent Semantic Indexing (LSI) is a term extraction method that projects terms from the word space into a latent semantic space and addresses both problems at the same time.

Singular Value Decomposition (SVD) is the traditional LSI method and achieves very good performance. Its main flaws are speed and memory consumption. Semi-discrete decomposition (SDD) is another LSI method; it is faster and needs less memory, at the cost of a small reduction in performance.

In this thesis, we study text classification based on LSI. We examine the factors that may affect performance, mainly the choice of term selection method and of term weighting method. We also propose an improvement to the LSI model that remarkably improves the performance of SDD. We design a series of experiments on two corpora, one Chinese and one English, using KNN as the classifier.
The experimental results show that the choice of feature selection method and of term weighting method has a large effect on LSI, but no single method performs well under all conditions. The results also show that our improvement has a very good effect on SDD.
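
The pipeline described in the abstract (LSI followed by a KNN classifier) can be illustrated with a minimal sketch. It is not the thesis's implementation: it covers only the SVD variant (not SDD or the proposed improvement), uses raw term counts instead of any of the weighting schemes studied, and relies on a tiny made-up term-document matrix. The function names `lsi_fit`, `lsi_project`, and `knn_predict` are hypothetical.

```python
from collections import Counter

import numpy as np


def lsi_fit(term_doc, k):
    """Truncated SVD of a (terms x documents) matrix, keeping k latent dimensions."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    U_k = U[:, :k]
    # Document coordinates in latent space: V_k * Sigma_k, one row per document.
    doc_vecs = Vt[:k, :].T * s[:k]
    return U_k, doc_vecs


def lsi_project(U_k, term_vec):
    """Map a new term-count vector into the same latent space (U_k^T q)."""
    return U_k.T @ term_vec


def knn_predict(doc_vecs, labels, query, n_neighbors=3):
    """Majority vote among the n_neighbors most cosine-similar training documents."""
    denom = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query)
    sims = (doc_vecs @ query) / np.maximum(denom, 1e-12)
    nearest = np.argsort(-sims)[:n_neighbors]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]


# Toy data (assumed for illustration only).
# Rows are terms: ball, team, goal, stock, market, price.
# Columns are four training documents.
term_doc = np.array([
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
], dtype=float)
labels = ["sports", "sports", "finance", "finance"]

U_k, doc_vecs = lsi_fit(term_doc, k=2)

# New documents containing "ball goal" and "stock price" respectively.
sports_query = lsi_project(U_k, np.array([1, 0, 1, 0, 0, 0], dtype=float))
finance_query = lsi_project(U_k, np.array([0, 0, 0, 1, 0, 1], dtype=float))

sports_pred = knn_predict(doc_vecs, labels, sports_query)
finance_pred = knn_predict(doc_vecs, labels, finance_query)
print(sports_pred, finance_pred)
```

Training documents and new documents are both expressed in the basis `U_k`, so their cosine similarities are comparable. In this toy example the query "ball goal" shares only one term with each sports document, yet it lines up with both of them in the two-dimensional latent space.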

Keywords/Search Tags:

Text Classification, Latent Semantic Indexing, Singular Value Decomposition, Semi-Discrete Matrix Decomposition


Related items

1	Based On Latent Semantic Indexing, Text Classification And Research In Science And Technology Information Retrieval
2	Research On Text Clustering Algorithm Based On Latent Semantic Indexing
3	Research On Some Field Text Information Processing Based On Latent Semantic Analysis
4	Chinese Text Clustering Based On Latent Semantic And Its Applications
5	Folding-up: A hybrid method for updating the partial singular value decomposition in latent semantic indexing
6	Detection Of Sensitive Data From Big Data Using Classification Algorithms
7	Research On Text Classification Based On Ontology And Latent Semantic Indexing Algorithm
8	The Research On Latent Semantic Classification Model
9	Research On Web Text Categorization Based On Latent Semantic Analysis
10	The Research Of Optimization Technology In Latent Semantic Indexing Based On Pseudo Text