With the growth of the internet, text clustering has become a widely used unsupervised data analysis technique. In recent years, deep learning has demonstrated a powerful ability to learn features from high-dimensional data, and applying deep neural networks to clustering, known as deep clustering, has become an active research direction. Although existing deep clustering algorithms achieve good results on text datasets, they still face challenges arising from two characteristics of text data: semantic sparsity and semantic ambiguity. To address these two problems, this paper proposes a Deep Document Clustering method via Key Semantic Information Complementation (DCKSC) and the Semantic to Structural deep document Clustering algorithm (Sq2St).

To address the semantic sparsity that classic deep clustering methods face in text clustering, DCKSC first augments the original text data by extracting keywords, and introduces a key semantic information completion module that improves the traditional autoencoder by compensating for the key semantic information lost in the mapping process. Second, by combining the clustering loss with the reconstruction loss of the keyword semantic autoencoder, the model becomes better suited to the clustering task. Experiments on five real-world datasets show that the proposed algorithm outperforms current advanced clustering methods. The results demonstrate the importance of key semantic information completion and text data augmentation for deep text clustering.

To address semantic ambiguity in text clustering, we propose a novel and lightweight model called Sq2St. Specifically, we design a semantic-to-structural autoencoder that maps semantic information to structural information for more comprehensive representation learning. With this autoencoder, a structure-enhanced semantic representation that combines semantic and structural information can be learned. We then use a self-training clustering objective to iteratively improve the clustering results. By integrating self-training and the reconstruction of the semantic-to-structural autoencoder into a unified framework, our model jointly optimizes the cluster label assignments and learns embeddings suitable for clustering. Experiments on several datasets validate the effectiveness of our model.
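
The abstract does not state the exact form of the clustering loss or of the self-training objective. As an illustration only, the sketch below assumes the formulation common in deep clustering (in the style of Deep Embedded Clustering): Student's t soft assignments, a sharpened target distribution, and a KL divergence term added to the autoencoder reconstruction loss. The function names, the weight `gamma`, and the toy data are hypothetical and are not taken from DCKSC or Sq2St.

```python
import numpy as np

def soft_assign(z, centers, alpha=1.0):
    """Student's t soft assignment q[i, j] of embedding i to cluster j."""
    # Squared Euclidean distance between every embedding and every center.
    d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Self-training targets: square q, divide by soft cluster size, renormalize rows."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def joint_loss(x, x_rec, q, p, gamma=0.1, eps=1e-12):
    """Reconstruction error plus gamma * KL(P || Q), averaged over samples."""
    recon = np.mean((x - x_rec) ** 2)
    kl = np.sum(p * np.log((p + eps) / (q + eps))) / q.shape[0]
    return recon + gamma * kl

# Toy usage: random arrays stand in for documents, embeddings, and reconstructions.
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))                   # hypothetical input features
x_rec = x + 0.01 * rng.normal(size=x.shape)   # hypothetical decoder output
z = rng.normal(size=(6, 2))                   # hypothetical latent embeddings
centers = rng.normal(size=(3, 2))             # hypothetical cluster centers

q = soft_assign(z, centers)
p = target_distribution(q)   # held fixed while the network is updated
labels = q.argmax(axis=1)    # current hard cluster assignments
loss = joint_loss(x, x_rec, q, p)
```

In a full training loop under these assumptions, the targets `p` would be recomputed periodically from the current `q`, which is what makes the procedure iterative self-training; the reconstruction term keeps the embeddings from collapsing toward the cluster centers.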