With the development of information technology, a large number of documents have emerged on the Internet. Mining valuable topic information from these documents is an important research problem in text mining, and text clustering is the most commonly used approach for mining hot topics from large-scale data. Traditional text clustering is mostly oriented to documents from a single source. However, as information platforms multiply, cluster analysis of single-source datasets no longer meets current needs, and text clustering has begun to focus on multi-source datasets. The structure of a multi-source dataset is complicated. According to how topic information differs across sources, this paper summarizes these structures as heterogeneous heterogeneity, heterogeneous differences, and heterogeneous similarities. Traditional clustering methods cannot be applied directly to multi-source datasets, because these complex structures make it difficult for users to set an accurate number of topics before clustering and also cause topic confusion during clustering. Solving the problems caused by this complex structure is therefore one research difficulty. In addition, multi-source datasets contain a large number of short texts, so solving the feature sparsity problem is a second research difficulty. To address the problems caused by the complex text structure, this paper proposes the Hierarchical Dirichlet Multinomial Allocation topic clustering model (HDMA), based on Dirichlet Multinomial Allocation (DMA). On the one hand, HDMA inherits the advantage of DMA of reducing dependence on a topic number specified in advance, and can automatically estimate the number of topics in each source during clustering. On the other hand, HDMA provides an independent parameter space for the topic information of each source, which prevents topic information from different sources from being mixed. Experiments show that the HDMA model clusters multi-source datasets well. To address the problem of sparse text features, this paper proposes the Hierarchical Dirichlet Multinomial Allocation topic clustering model with Latent Features (LFHDMA), based on the HDMA model. The model adds two semantic latent feature matrices and, with the help of word vectors trained on a large-scale corpus, provides additional semantic information for each dataset, reducing the impact of feature sparsity on clustering. Experiments show that the LFHDMA model effectively improves clustering performance on multi-source text datasets.
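
The abstract does not spell out the inference procedure, so the following is only a rough illustration of the DMA idea that HDMA builds on, not the authors' implementation. It runs collapsed Gibbs sampling for a single-source Dirichlet multinomial mixture with a symmetric Dirichlet(alpha/K) prior over K candidate topics. Here K acts as an upper bound: topics that attract no documents simply stay empty, which is how the number of topics can be estimated during clustering. The function name `dma_gibbs`, the hyperparameter defaults, and the toy documents are all assumptions made for this sketch. The per-source parameter spaces of HDMA and the latent feature matrices of LFHDMA are not modeled.

```python
import math
import random
from collections import Counter


def dma_gibbs(docs, vocab_size, K=10, alpha=1.0, beta=0.1, iters=30, seed=0):
    """Hypothetical sketch: collapsed Gibbs sampling for a finite
    Dirichlet multinomial mixture with a Dirichlet(alpha/K) prior.

    docs is a list of documents, each a list of integer word ids.
    Returns one topic label per document; K is only an upper bound.
    """
    rng = random.Random(seed)
    counts = [Counter(doc) for doc in docs]   # word counts per document
    z = [rng.randrange(K) for _ in docs]      # random initial assignment
    m = [0] * K                               # documents per topic
    n = [0] * K                               # tokens per topic
    nw = [Counter() for _ in range(K)]        # word counts per topic
    for d, k in enumerate(z):
        m[k] += 1
        n[k] += len(docs[d])
        nw[k].update(counts[d])

    for _ in range(iters):
        for d, doc in enumerate(docs):
            # Remove document d from its current topic.
            k = z[d]
            m[k] -= 1
            n[k] -= len(doc)
            nw[k].subtract(counts[d])

            # Log of the unnormalized conditional p(z_d = k | rest).
            logp = []
            for k in range(K):
                lp = math.log(m[k] + alpha / K)
                for w, c in counts[d].items():
                    for j in range(c):
                        lp += math.log(nw[k][w] + beta + j)
                for i in range(len(doc)):
                    lp -= math.log(n[k] + vocab_size * beta + i)
                logp.append(lp)

            # Sample a new topic, subtracting the max for numerical stability.
            mx = max(logp)
            weights = [math.exp(lp - mx) for lp in logp]
            k = rng.choices(range(K), weights=weights)[0]
            z[d] = k
            m[k] += 1
            n[k] += len(doc)
            nw[k].update(counts[d])
    return z


if __name__ == "__main__":
    # Toy short texts over a 4-word vocabulary (assumed data).
    toy_docs = [[0, 1, 0], [1, 0], [2, 3, 3], [3, 2]]
    labels = dma_gibbs(toy_docs, vocab_size=4, K=5)
    print("labels:", labels, "non-empty topics:", len(set(labels)))
```

Because each short text is assigned to exactly one topic, the sampler works with whole-document word counts rather than per-token assignments, which is a common choice for short texts. A multi-source variant along the lines described above would presumably keep separate topic statistics per source, but that design is not specified here.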