Triple Co-occurrence Latent Semantic Vector Space Model And Dimension Reduction

Posted on:2020-09-23

Degree:Master

Type:Thesis

Country:China

Candidate:S C Wang

Full Text:PDF

GTID:2370330578469132

Subject:Statistics

Abstract/Summary:

Vector representation of text is of great significance to research on text topic aggregation, clustering, information retrieval and recommendation systems. Among traditional text representation models, the vector space model (VSM) is relatively simple and widely used. However, the model assumes that terms are independent of each other, which affects the clustering effect to a certain extent. On the basis of VSM, the co-occurrence latent semantic vector space model (CLSVSM) uses co-occurrence analysis to mine the latent semantic relationships between feature words in text. The relationship is estimated by the relative intensity of co-occurrence, and the similarity between the document and the feature items is then estimated. Experiments show that the clustering accuracy of CLSVSM is higher than that of VSM.

Starting from the co-occurrence latent semantic vector space model constructed with Boolean weights, this paper makes the following improvements. First, to address the limitation of Boolean weights, the co-occurrence latent semantic vector space model is reconstructed using word frequency information; this model is called the frequency co-occurrence latent semantic vector space model (FCLSVSM). Second, in order to extract the latent semantic information adequately, triple co-occurrence information is introduced. By studying the representation of the triple co-occurrence matrix and the calculation of the frequency and relative strength of triple co-occurrence, we establish the triple co-occurrence latent semantic vector space model (T-CLSVSM). Third, as the number of texts increases, the dimension and the computational cost of the representation model increase accordingly, which ultimately leads to diminishing marginal returns when the model is applied. Therefore, after building the model, the penalized matrix decomposition (PMD) method is used to reduce the dimension. The specific methods include calculating the K-rank approximation and extracting the core feature words.

In the experiments, the extended data set is used to validate FCLSVSM. The model is also validated on basic data sets, and its scope of application is tested on common data sets. The following conclusions are drawn. Estimating the model with word frequency statistics can significantly improve the clustering effect. Under the selected evaluation indices (purity, entropy, F1 value), the clustering accuracy of T-CLSVSM is better than that of VSM and CLSVSM. The PMD algorithm achieves dimension reduction effectively by finding the K-rank approximation and extracting the core feature words. Compared with CLSVSM_K, its clustering accuracy is higher and its dimension reduction effect is better.

In summary, this paper improves the co-occurrence latent semantic vector space model in three ways: model reconstruction based on word frequency information, construction of the triple co-occurrence latent semantic vector space model, and dimension reduction using the PMD algorithm. The experiments show that the new model can improve clustering accuracy, reduce computational complexity and save costs. The improved model provides a new choice for text representation, and a reference for research on similarity measurement, document retrieval and classification in document aggregation.
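
As an illustration of the quantities the abstract refers to, the following is a minimal sketch of how pairwise and triple co-occurrence counts could be obtained from a Boolean document-term matrix, together with the purity index used for evaluation. It is not the thesis's implementation: the toy matrix `B`, the function names, and the choice to count co-occurrence at document level are all assumptions made for this example, and the thesis's own definition of relative intensity is not reproduced here.

```python
import numpy as np

# Hypothetical toy corpus (assumption): rows are documents, columns are
# feature words; an entry is 1 if the word occurs in the document.
B = np.array([
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [1, 1, 1, 1],
])


def pair_cooccurrence(b):
    """Entry (i, j) counts the documents containing both word i and word j."""
    return b.T @ b


def triple_cooccurrence(b):
    """Entry (i, j, k) counts the documents containing words i, j and k."""
    return np.einsum("di,dj,dk->ijk", b, b, b)


def purity(labels_true, labels_pred):
    """Fraction of documents that belong to the majority class of their cluster."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    total = 0
    for cluster in np.unique(labels_pred):
        members = labels_true[labels_pred == cluster]
        total += np.bincount(members).max()
    return total / labels_true.size


C = pair_cooccurrence(B)      # shape (4, 4)
T = triple_cooccurrence(B)    # shape (4, 4, 4)
print(C[0, 1], T[0, 2, 3])
print(purity([0, 0, 1, 1], [0, 0, 0, 1]))
```

The triple tensor grows with the cube of the vocabulary size, which is consistent with the abstract's motivation for applying dimension reduction after the model is built.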

Keywords/Search Tags:

VSM, CLSVSM, T-CLSVSM, PMD, Text clustering

Related items

1	Research And Application Of Recommendation Algorithm Based On Latent Co-occurrence
2	Short Text Clustering Based On Frequent Word Co-occurrence Network
3	Text Clustering Based On Frequent Word Sets And Complex Networks
4	Fuzzy Clustering And Its Applied Research In The Chinese Text Clustering
5	The Research Of Clustering Analysis Based On Coupled DNA-GA-P Systems
6	Literature Clustering And Evolution Of Topic Innovation Based On Weighted Network
7	Research And Application Of Fuzzy Clustering Based On Tissue Like P-system
8	Research On Lightning Disaster Text Clustering And Prediction Method Based On Enhanced Gray Wolf Optimization Algorithm
9	Deep Learning-Based Methods For Biomedical Text Filtering And Information Extraction
10	Density Peak Clustering Algorithm Based On Three-Way Decision And Its Application