Latent Semantic Analysis Based On Multi-system Combination

Posted on: 2014-02-13
Degree: Master
Type: Thesis
Country: China
Candidate: L W Chang
Full Text: PDF
GTID: 2248330395487186
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
Latent Semantic Analysis is a statistical data mining technique that is widely applied to text categorization in information retrieval. Its core model is an optimization of the Vector Space Model. By extracting latent semantic structure, it constructs a new space whose dimensions are abstract semantic dimensions formed from that information. As a result, the assumption of feature independence is dropped, and a text can be described with fewer dimensions.

Many studies show that the performance of a statistical data mining system can be improved effectively by increasing the scale of the training data. However, once the added data carries only saturated information, this strategy no longer works well. At the same time, a great deal of noise and redundant information is brought into the model. On the one hand, the performance of the model built on the training data is restrained; on the other hand, the resulting model becomes too large. The focus of this thesis is therefore the optimization and combination of multiple models.

By studying the features and applications of two techniques, subspace optimization and system combination, and drawing on extensive experiments and error analysis, a new concept named the Augmented Space Model is proposed. The Augmented Space Model combines the most important dimensions of two Latent Semantic Analysis models, so that the two models can learn from each other and Latent Semantic Analysis is optimized effectively. In addition, data segmentation is studied, yielding a new method named the Data Segmentation Strategy based on document lengths and DF distribution. The strategy ensures that the information distribution is damaged as little as possible. At the same time, because the subspace is much smaller than the original space, the noise brought in by redundant information is reduced.

Building on this work on optimizing Latent Semantic Analysis, multi-strategy and multi-system combination is applied to construct a text classification system. Experimental results show that, after multi-strategy and multilevel system combination, the final classification precision was about 3 percent higher than that of the best baseline model.
Keywords/Search Tags: Latent Semantic Analysis, Subspace Optimization, System Combination, Augmented Space Model, Text Categorization
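
The abstract does not say how the Augmented Space Model merges the dimensions of two Latent Semantic Analysis models, so the following is only a minimal sketch under stated assumptions. It assumes each model is a truncated SVD of a term-document matrix over a shared vocabulary, that "most important dimensions" means the leading left singular vectors, and that the two sets of dimensions are simply stacked side by side. The function names (fit_lsa, project, augmented_basis), the toy count data, and the arbitrary two-way split are all hypothetical; in particular, the split does not reproduce the thesis's segmentation by document length and DF distribution.

```python
import numpy as np

def fit_lsa(term_doc, k):
    """Return the top-k left singular vectors (terms x k) of a term-document matrix."""
    u, _s, _vt = np.linalg.svd(term_doc, full_matrices=False)
    return u[:, :k]

def project(basis, docs):
    """Map document columns (terms x n_docs) to latent coordinates (dims x n_docs)."""
    return basis.T @ docs

def augmented_basis(basis_a, basis_b):
    """Assumed combination rule: stack the leading dimensions of both models."""
    return np.hstack([basis_a, basis_b])

# Toy data (hypothetical): 6 terms x 8 documents, split into two segments of 4.
rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(6, 8)).astype(float)
segment_a, segment_b = counts[:, :4], counts[:, 4:]

# One LSA model per segment, each keeping its 2 leading dimensions.
basis_a = fit_lsa(segment_a, k=2)
basis_b = fit_lsa(segment_b, k=2)
basis_aug = augmented_basis(basis_a, basis_b)

# A new document is described by 4 augmented latent coordinates.
new_doc = rng.poisson(1.0, size=(6, 1)).astype(float)
features = project(basis_aug, new_doc)
```

In this sketch the first two augmented coordinates are exactly what the first model alone would produce, and the last two come from the second model. The stacked columns are generally not orthogonal to one another, so a downstream classifier sees some overlapping information; whether the thesis orthogonalizes or reweights the combined dimensions is not stated in the abstract.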