Penalized Matrix Decomposition And Its Application In Text Topic Clustering

Posted on: 2022-01-22

Degree: Master

Type: Thesis

Country: China

Candidate: S J Feng

Full Text: PDF

GTID: 2507306509469744

Subject: Statistics

Abstract/Summary:

A sound representation of text plays an important role in information retrieval. Building on the traditional vector space model (VSM), the co-occurrence latent semantic vector space model (CLSVSM) uses co-occurrence analysis to uncover latent semantic information between feature words and thereby improve text clustering. However, as the number of text resources grows, the representation becomes higher-dimensional, which makes the model sparse and the computation expensive. Although the semantic kernel function based on CLSVSM incorporates semantic information between words, it still fails to extract the semantic information between core feature words sufficiently, and its dimension remains high. To address this problem, this thesis studies text representation models using Penalized Matrix Decomposition (PMD) in order to improve text topic clustering. First, PMD is applied to the text vector space models VSM and CLSVSM: the decomposition imposes sparsity constraints on the vectors, extracts the core feature words, and then reconstructs the original data to make it more interpretable. Second, drawing on co-occurrence analysis theory, a semantic kernel function (PMD_K) is constructed on the basis of PMD to mine the semantic information among core feature words more deeply. Finally, because source data in the era of big data have complex structure and may contain a large number of outliers, a PMD algorithm based on the L_{2,1} norm is proposed, which introduces the L_{2,1} norm for denoising so as to process data more effectively and avoid the shortcomings of classical algorithms in this field.

In the experiments, self-collected data and public data were chosen to represent class-balanced and class-imbalanced data, respectively, and text topic clustering was run on them using the methods above. The results show that the proposed methods cluster better than the traditional methods, and that PMD_K scores 21.9% higher than the 95% CLSVSM_K method on the English dataset. Compared with the standard PMD algorithm, the L_{2,1}-norm-based PMD algorithm improves clustering purity and entropy on the text dataset by 4.1% and 5.4%, respectively, and reduces entropy by 9.3%, which demonstrates the effectiveness of the algorithm. Applying PMD to text representation models improves the efficiency and precision of text topic clustering and avoids complicated operations on high-dimensional matrices.
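
The abstract does not spell out the decomposition itself, so the following is only a minimal sketch of one common form of PMD: a rank-one factor with L1-bounded, unit-length vectors, computed by alternating soft-thresholding. The function names (soft_threshold, l1_constrained_unit_vector, pmd_rank1, l21_norm), the bounds c_u and c_v, and the row-wise convention for the L_{2,1} norm are all assumptions made for illustration, not details taken from the thesis.

```python
import numpy as np


def soft_threshold(a, delta):
    """Elementwise soft-thresholding: shrink each entry toward zero by delta."""
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)


def l1_constrained_unit_vector(a, c, n_bisect=50):
    """Unit-L2-norm vector derived from `a` with L1 norm at most c.

    Assumes 1 <= c <= sqrt(len(a)) and that the largest |a_i| is not tied.
    The threshold is found by bisection.
    """
    def unit(x):
        norm = np.linalg.norm(x)
        return x / norm if norm > 0 else x

    u = unit(a)
    if np.abs(u).sum() <= c:
        return u  # already sparse enough, no shrinkage needed
    lo, hi = 0.0, np.abs(a).max()
    for _ in range(n_bisect):
        mid = (lo + hi) / 2.0
        if np.abs(unit(soft_threshold(a, mid))).sum() > c:
            lo = mid  # still violates the L1 bound: threshold harder
        else:
            hi = mid  # bound satisfied: try a smaller threshold
    return unit(soft_threshold(a, hi))


def pmd_rank1(X, c_u, c_v, n_iter=100):
    """Rank-one sparse factor (d, u, v) of X by alternating updates.

    Approximately maximizes u^T X v subject to ||u||_2 = ||v||_2 = 1,
    ||u||_1 <= c_u and ||v||_1 <= c_v.
    """
    # Start from the leading right singular vector of X.
    v = np.linalg.svd(X, full_matrices=False)[2][0]
    for _ in range(n_iter):
        u = l1_constrained_unit_vector(X @ v, c_u)
        v = l1_constrained_unit_vector(X.T @ u, c_v)
    d = float(u @ X @ v)
    return d, u, v


def l21_norm(E):
    """L_{2,1} norm under the row-wise convention (an assumption here):
    the sum of the Euclidean norms of the rows of E."""
    return float(np.linalg.norm(E, axis=1).sum())
```

If the rows of X are documents and the columns are feature words (again an assumption), the nonzero entries of v would mark the words retained by the sparse factor, and smaller values of c_v would retain fewer of them. Further factors could be obtained by repeating the procedure on the residual X - d * np.outer(u, v).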

Keywords/Search Tags:

CLSVSM, PMD, Semantic kernel function, Text topic clustering

Related items

1	Research Of SVM Kernel Functions In Text Classification
2	Research Of Text Representation Method Based On Co-occurrence Analysis
3	Research And Application Of Text Clustering Based On Topic Model
4	Research On Scientific Document Clustering And Topic Evolution Based On Citation Networks
5	Research On Topic Extraction In Online Public Opinion Based On Multi-label Classification
6	Research On Emotional Evolution Of Network Public Opinions Based On Topic-emotion Joint Model
7	12345 Mayor Public Telephone Text Clustering Based On K-means
8	Keyword Extraction And Topic Clustering For Education Big Data
9	KNN Algorithm Based On Gaussian Kernel And Its Applications
10	A Study On Hierarchical Clustering Of Micro-learning Units Based On Topic Feature Centers