| Text clustering is one of the most fundamental challenges in unsupervised learning. Its purpose is to group semantically similar text segments without relying on human annotations. As information technology develops, data volumes keep growing and the relationships among data features and samples become increasingly complex. Text clustering has become correspondingly harder, and traditional clustering methods can no longer handle high-dimensional, complex data well. The main reason is that text feature representation and the clustering process are separated from each other, so the two cannot reinforce each other through positive feedback; the complex relationships between samples are therefore captured poorly, which limits clustering performance. With the rapid development of deep learning, deep clustering has shown significant advantages over traditional clustering methods. Although good results have been achieved, most existing deep text clustering methods rely on representations pre-trained on general-domain data, which may not be the most appropriate choice for clustering in a specific target domain. In addition, most existing deep text clustering methods require a clustering scheme designed for each specific task, so they may not generalize well to other tasks. To address these issues, this paper proposes a self-supervised learning framework for text clustering, CEIL, which iteratively improves the feature representation by introducing a classification objective and thereby improves overall clustering performance. In each iteration, we first use the language model to obtain initial text representations, and then apply our proposed classification-separation contrastive clustering algorithm (CDCC) to obtain clustering results from them. Then, through a strict data filtering and aggregation process, we retrieve samples
with clean classification labels, which serve as supervision information, and update the language model with the classification objective through prompt learning. Finally, the updated language model, with its improved representation ability, is used to enhance clustering in the next iteration. In addition, this paper proposes CDCC, a deep text clustering method based on contrastive learning, which is a component of the CEIL framework. The basic idea of CDCC is to improve the feature representation through contrastive learning and to promote better separation between categories through a category-specific loss, so as to achieve better clustering results. Extensive experiments show that the proposed framework significantly improves clustering performance over the iterations and is applicable to both traditional and deep clustering algorithms. Moreover, by incorporating the proposed deep clustering method CDCC into the CEIL framework, our model achieves advanced clustering performance on a wide range of text clustering benchmarks. |
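
The iterative procedure summarized above (encode, cluster, filter, update, repeat) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The callables `encode`, `cluster`, and `fine_tune` are hypothetical placeholders for the language model, the clustering algorithm, and the classification-based update. The filtering rule used here, which keeps the fraction `keep_ratio` of samples closest to each centroid, is only an assumption, since the abstract does not state the actual filtering criterion.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Encoder = Callable[[str], Vector]


def nearest_centroid(vec: Vector, centroids: Sequence[Vector]) -> Tuple[int, float]:
    """Return (cluster index, distance) of the centroid closest to vec."""
    dists = [math.dist(vec, c) for c in centroids]
    label = min(range(len(centroids)), key=dists.__getitem__)
    return label, dists[label]


def select_confident(
    vectors: Sequence[Vector],
    centroids: Sequence[Vector],
    keep_ratio: float = 0.5,
) -> List[Tuple[int, int]]:
    """Assumed filter: keep the samples nearest to each centroid.

    Returns sorted (sample_index, pseudo_label) pairs.
    """
    by_cluster = {}
    for idx, vec in enumerate(vectors):
        label, dist = nearest_centroid(vec, centroids)
        by_cluster.setdefault(label, []).append((dist, idx))
    selected = []
    for label, members in by_cluster.items():
        members.sort()  # closest samples first
        keep = max(1, int(len(members) * keep_ratio))
        selected.extend((idx, label) for _, idx in members[:keep])
    return sorted(selected)


def iterative_clustering(
    texts: Sequence[str],
    encode: Encoder,
    cluster: Callable[[Sequence[Vector]], Sequence[Vector]],
    fine_tune: Callable[[Encoder, List[Tuple[str, int]]], Encoder],
    rounds: int = 3,
    keep_ratio: float = 0.5,
) -> List[int]:
    """Alternate clustering and classification-style updates of the encoder."""
    labels: List[int] = []
    for step in range(rounds):
        vectors = [encode(t) for t in texts]   # current text representations
        centroids = cluster(vectors)           # hypothetical: returns centroids
        labels = [nearest_centroid(v, centroids)[0] for v in vectors]
        if step < rounds - 1:
            # Filter pseudo-labels, then update the encoder for the next round.
            pseudo = select_confident(vectors, centroids, keep_ratio)
            encode = fine_tune(encode, [(texts[i], y) for i, y in pseudo])
    return labels
```

In this sketch the clustering step is deliberately abstract: any routine that returns centroids can be passed as `cluster`, which mirrors the claim that the framework can wrap both traditional and deep clustering algorithms.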