
Triple Co-occurrence Latent Semantic Vector Space Model And Dimension Reduction

Posted on: 2020-09-23; Degree: Master; Type: Thesis
Country: China; Candidate: S C Wang; Full Text: PDF
GTID: 2370330578469132; Subject: Statistics
Abstract/Summary:
Vector representation of text is of great significance to research on text topic aggregation, clustering, information retrieval and recommendation systems. Among traditional text representation models, the vector space model (VSM) is relatively simple and widely used. However, VSM assumes that terms are independent of each other, which degrades clustering performance to some extent. Building on VSM, the co-occurrence latent semantic vector space model (CLSVSM) uses co-occurrence analysis to mine the latent semantic relationships between feature words in a text. Each relationship is estimated by the relative intensity of co-occurrence, from which the similarity between a document and the feature items is then estimated. Experiments show that CLSVSM achieves higher clustering accuracy than VSM.

Starting from the co-occurrence latent semantic vector space model built on Boolean weights, this paper makes the following improvements. First, to address the limitations of Boolean weights, the model is reconstructed using word frequency information; the result is called the frequency co-occurrence latent semantic vector space model (FCLSVSM). Second, to extract latent semantic information more fully, triple co-occurrence information is introduced. By studying the representation of the triple co-occurrence matrix and the calculation of the frequency and relative strength of triple co-occurrence, we establish the triple co-occurrence latent semantic vector space model (T-CLSVSM). Third, as the number of texts grows, the dimension and computational cost of the representation increase accordingly, so the benefit of applying the model diminishes at the margin. Therefore, after the model is built, the penalized matrix decomposition (PMD) method is used to reduce the dimension. The specific steps are calculating the K-rank approximation and extracting the core feature words.

In the experiments, the extended data set is used to validate FCLSVSM, the basic data sets are used to validate the model, and common data sets are used to test its scope of application. The following conclusions are drawn. Estimating the model from word frequency statistics significantly improves the clustering effect. Under the selected evaluation indices (purity, entropy, F1 value), the clustering accuracy of T-CLSVSM is better than that of VSM and CLSVSM. The PMD algorithm reduces the dimension effectively by finding the K-rank approximation and extracting the core feature words; compared with CLSVSM_K, its clustering accuracy is higher and its dimension reduction effect is better. In summary, this paper improves the co-occurrence latent semantic vector space model through model reconstruction based on word frequency information, construction of the triple co-occurrence latent semantic vector space model, and dimension reduction with the PMD algorithm. The experiments show that the new model improves clustering accuracy, reduces computational complexity and saves cost. The improved model offers a new option for text representation and a reference for research on similarity measurement, document retrieval and classification in document aggregation.
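As a rough illustration of the pairwise and triple co-occurrence counts described in the abstract, the following minimal Python sketch counts, over a toy corpus, how many documents contain each group of two or three feature words. The abstract does not give the formula for relative strength, so the normalization used here (dividing by the smallest document frequency in the group) is only an assumption for illustration, and all function names and the toy data are hypothetical rather than taken from the thesis.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs, order):
    """Count the documents that contain each unordered group of `order` distinct terms."""
    counts = Counter()
    for doc in docs:
        terms = sorted(set(doc))  # Boolean view: a term is either present or absent
        for group in combinations(terms, order):
            counts[group] += 1
    return counts


def document_frequency(docs):
    """Count the documents that contain each term."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return df


def relative_strength(group, group_counts, df):
    """Assumed normalization (not from the thesis): co-occurrence count
    divided by the smallest document frequency among the group's terms."""
    key = tuple(sorted(group))
    return group_counts[key] / min(df[term] for term in key)


# Hypothetical toy corpus: each document is a list of feature words.
docs = [
    ["text", "cluster", "vector"],
    ["text", "vector", "model"],
    ["text", "cluster", "vector", "model"],
    ["topic", "model"],
]

pairs = cooccurrence_counts(docs, 2)    # pairwise co-occurrence
triples = cooccurrence_counts(docs, 3)  # triple co-occurrence
df = document_frequency(docs)

print(pairs[("text", "vector")])
print(triples[("cluster", "text", "vector")])
print(relative_strength(("cluster", "text", "vector"), triples, df))
```

The number of candidate groups grows quickly with vocabulary size (cubically for triples), which is consistent with the abstract's motivation for a dimension reduction step after the model is built.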
Keywords/Search Tags: VSM, CLSVSM, T-CLSVSM, PMD, Text clustering