Research Of Latent Semantic Analysis Based On Paragraph

Posted on: 2015-05-28
Degree: Master
Type: Thesis
Country: China
Candidate: C Bi
Full Text: PDF
GTID: 2298330467468633
Subject: Computer software and theory
Abstract/Summary:
As a statistics-based data mining technique, Latent Semantic Analysis (LSA) is widely applied in fields such as information retrieval and text categorization. By optimizing the Vector Space Model, it is effective at extracting the latent semantic structure among features, which is derived from context and co-occurrence. The technique reduces the dimensionality of the original vector space model, filters noise from texts, and highlights the latent semantic relationships among features by mapping features and documents into a lower-dimensional latent semantic space. It drops the assumption that features are independent and describes texts better.

Current research on Latent Semantic Analysis concentrates mainly on analyzing and optimizing the underlying mathematical model and feature weighting; research on optimizing the latent semantic space itself is comparatively scarce. Meanwhile, when the technique is applied to text classification, most work focuses on filtering features from classified documents for subsequent processing. Few studies examine how features affect the construction of the latent semantic space and thereby the classification performance of the system. To address these problems, this thesis focuses on how feature co-occurrence affects the latent semantic space when latent semantic analysis uses text paragraphs instead of the original documents.

By studying the principle of feature co-occurrence, analyzing the distribution of features within their context and across the whole document, and studying data from a large number of experiments, this thesis introduces the concepts and construction methods of sub-documents and fake-documents. By combining these two optimization methods, it finds a way to effectively optimize the latent semantic analysis technique, strengthening reasonable co-occurrence of similar features and weakening unreasonable co-occurrence of dissimilar features.

Building on this research into document combination optimization for latent semantic analysis, the optimized technique based on document paragraph combination is applied to an LSA patent classification system. Experimental results show that, through the combination of multiple methods, the final classification precision is about 3.2 percent higher than that of the best baseline model.
Keywords/Search Tags: Latent Semantic Analysis, Document Parts, Sub Title, Feature Extraction, Text Categorization
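
The dimensionality reduction described in the abstract can be illustrated with a minimal sketch. It assumes that each paragraph is treated as its own column of a raw term-count matrix and that the latent space comes from a plain truncated SVD. The abstract does not specify how the thesis builds sub-documents or fake-documents, nor its feature weighting, so the function names (`split_paragraphs`, `term_document_matrix`, `lsa`, `cosine`) and the toy documents are hypothetical and do not represent the author's method.

```python
import numpy as np


def split_paragraphs(document):
    """Split a document into non-empty paragraphs (blank-line separated)."""
    return [p.strip() for p in document.split("\n\n") if p.strip()]


def term_document_matrix(texts):
    """Raw term counts: rows are terms, columns are texts (no weighting)."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(texts)))
    for j, text in enumerate(texts):
        for w in text.lower().split():
            matrix[index[w], j] += 1.0
    return vocab, matrix


def lsa(matrix, k):
    """Truncated SVD: k-dimensional coordinates for terms and for texts."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    k = min(k, len(s))
    term_vectors = u[:, :k] * s[:k]
    doc_vectors = vt[:k, :].T * s[:k]
    return term_vectors, doc_vectors


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


# Toy corpus (invented for illustration): two documents, two paragraphs each.
documents = [
    "engine patent claims\n\nengine motor design",
    "protein enzyme assay\n\nenzyme protein binding",
]
paragraphs = [p for d in documents for p in split_paragraphs(d)]
vocab, matrix = term_document_matrix(paragraphs)
term_vectors, doc_vectors = lsa(matrix, k=2)

row = {w: i for i, w in enumerate(vocab)}
# "motor" and "patent" never share a paragraph, but both co-occur with "engine".
raw_sim = cosine(matrix[row["motor"]], matrix[row["patent"]])
latent_sim = cosine(term_vectors[row["motor"]], term_vectors[row["patent"]])
# "motor" and "enzyme" share no context at all.
cross_sim = cosine(term_vectors[row["motor"]], term_vectors[row["enzyme"]])
print(raw_sim, latent_sim, cross_sim)
```

In this toy corpus, "motor" and "patent" are orthogonal in the raw count space yet end up close together in the two-dimensional latent space, because both co-occur with "engine". Terms from the unrelated document stay apart. Choosing paragraphs rather than whole documents as columns changes which co-occurrences the SVD sees, which is the effect the abstract sets out to study.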