A reasonable representation of text information plays an important role in information retrieval. Building on the traditional vector space model (VSM), the co-occurrence latent semantic vector space model (CLSVSM) uses co-occurrence analysis to mine the latent semantic information between feature words and thereby improve text clustering. However, as the number of text resources grows, the representation model becomes higher-dimensional, which makes it sparse and computationally expensive. Although the semantic kernel function based on CLSVSM incorporates semantic information between words, it still fails to sufficiently extract the semantic information between core feature words, and its dimension remains high. To address this problem, this paper further studies the text representation model using penalized matrix decomposition (PMD) in order to improve text topic clustering. First, the paper applies PMD to the text vector space models VSM and CLSVSM, using the decomposition to impose sparsity constraints on the vectors, extract the core feature words, and then reconstruct the original data to improve its interpretability. Second, drawing on co-occurrence analysis theory, a semantic kernel function (PMD_K) is constructed on the basis of PMD to mine the semantic information among core feature words more deeply. Finally, because source data in the era of big data have a complex structure and may contain a large number of outliers, a PMD algorithm based on the L2,1 norm is proposed, which introduces the L2,1 norm for denoising so as to process data more effectively and avoid the shortcomings of classical algorithms in this field. In the experiments, self-collected data and public data were selected to represent category-balanced and category-imbalanced data, respectively, and text topic clustering experiments were conducted on them using the above methods. The experimental results show
that the proposed method outperforms the traditional method in clustering effect, and the PMD_K method is 21.9% higher than the 95% CLSVSM_K method on the English dataset. Compared with the PMD algorithm based on the L2,1 norm, the clustering purity and entropy of the PMD algorithm on the text dataset are improved by 4.1% and 5.4%, respectively, and the entropy is reduced by 9.3%, which shows the effectiveness of this algorithm. Applying PMD to the text representation model improves the efficiency and precision of text topic clustering and avoids complicated operations on high-dimensional matrices.
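
The sparsity-constrained decomposition mentioned above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a rank-1 penalized decomposition in a simplified form that uses fixed soft-thresholding levels (the hypothetical parameters `delta_u` and `delta_v`) instead of tuning them to meet an L1-norm bound, and it runs on a toy document-term matrix invented for illustration. The function names `soft_threshold`, `rank1_pmd`, and `l21_norm` are likewise assumptions. The last helper only evaluates the L2,1 norm (the sum of the Euclidean norms of the rows); it does not reproduce the proposed denoising algorithm.

```python
import numpy as np


def soft_threshold(a, delta):
    """Elementwise soft-thresholding: sign(a) * max(|a| - delta, 0)."""
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)


def unit(a):
    """Scale a vector to unit L2 norm; an all-zero vector is returned unchanged."""
    n = np.linalg.norm(a)
    return a / n if n > 0 else a


def rank1_pmd(X, delta_u=0.0, delta_v=0.1, n_iter=100):
    """Simplified rank-1 sparse factorization X ~ d * u v^T.

    Assumption: fixed thresholds stand in for the L1 constraints; larger
    values of delta_v drive more entries of v (term weights) to zero.
    """
    # Start from the leading right singular vector of X.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    v = vt[0]
    for _ in range(n_iter):
        u = unit(soft_threshold(X @ v, delta_u))    # update document factor
        v = unit(soft_threshold(X.T @ u, delta_v))  # update sparse term factor
    d = float(u @ X @ v)
    return u, d, v


def l21_norm(M):
    """L2,1 norm: sum over rows of each row's Euclidean norm."""
    return float(np.sum(np.linalg.norm(M, axis=1)))


# Toy document-term matrix (rows: documents, columns: terms); invented data.
X = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [3.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 2.0, 3.0],
])

u, d, v = rank1_pmd(X, delta_v=0.1)
core_terms = np.flatnonzero(v)   # indices of terms kept in the sparse factor
X_hat = d * np.outer(u, v)       # rank-1 reconstruction from the sparse factors
residual_l21 = l21_norm(X - X_hat)
```

In this sketch the nonzero entries of `v` play the role of core feature words, and `X_hat` is the reconstruction built from them. With both thresholds set to zero, the iteration reduces to the leading singular triplet of `X`.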