| Text clustering is an important text mining technique. With the rapid development of Internet technology in the information era, it has become a research hotspot in the field of data mining. The number of online documents is increasing rapidly, and real-world document datasets are often complex in their organization. Users directly face two problems. 1) Text datasets are usually imbalanced: the number of samples is unevenly distributed across clusters. Some clusters contain only a few samples (minority clusters), while others contain the majority of the samples (majority clusters). 2) The number of clusters in a text dataset has to be determined. Text clustering models based on non-parametric Bayesian methods, which incorporate prior knowledge of the text, have become popular for automatically learning the number of clusters from a dataset. Compared with other models, non-parametric Bayesian models relax the assumption that the number of clusters is fixed in advance and provide an adaptive approach to model selection. Non-parametric Bayesian models therefore have important research value in text clustering. However, because of the “rich-get-richer” clustering property of non-parametric Bayesian models, samples in minority clusters tend to be attracted to majority clusters. It would therefore be useful to develop a clustering model that can not only automatically learn the number of clusters from this kind of dataset but also effectively handle its imbalance. To address these problems, this paper proposes a text clustering algorithm based on the Pitman-Yor process, named the DAPYP (discount adaptation Pitman-Yor process) model. During text clustering, the model automatically adjusts the discount parameters of the PYP (Pitman-Yor process) according to the number of texts in each text category, so that minority clusters and majority clusters in an imbalanced dataset are identified at the same time. Experiments on artificial datasets and real news text datasets show that the proposed model can effectively handle the data imbalance problem in the analysis of real text datasets and learn the number of clusters automatically. |
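
The “rich-get-richer” property and the role of the discount can be illustrated with the standard Chinese-restaurant form of the Pitman-Yor process. The sketch below is only that textbook predictive rule, not the DAPYP adaptation scheme, which the abstract does not specify; the function name `pyp_seating_probs` and the parameter names `discount` and `concentration` are assumptions made for illustration.

```python
def pyp_seating_probs(counts, discount, concentration):
    """Predictive assignment probabilities for the next sample under a
    Pitman-Yor process (Chinese-restaurant representation).

    counts        -- current sizes of the existing clusters (all >= 1)
    discount      -- PYP discount d, with 0 <= d < 1 (assumed name)
    concentration -- PYP concentration theta, with theta > -d (assumed name)

    Returns (probabilities of joining each existing cluster,
             probability of opening a new cluster).
    """
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must satisfy 0 <= d < 1")
    if concentration <= -discount:
        raise ValueError("concentration must satisfy theta > -d")

    n = sum(counts)   # samples assigned so far
    k = len(counts)   # clusters opened so far
    if n == 0:
        # The first sample always opens the first cluster.
        return [], 1.0

    denom = n + concentration
    # Larger clusters attract new samples in proportion to (size - d):
    # this is the "rich-get-richer" effect.
    existing = [(c - discount) / denom for c in counts]
    # A larger discount shifts probability mass toward a new cluster.
    new = (concentration + discount * k) / denom
    return existing, new


if __name__ == "__main__":
    # Hypothetical imbalanced state: one majority and two minority clusters.
    sizes = [90, 8, 2]
    for d in (0.0, 0.5):
        existing, new = pyp_seating_probs(sizes, discount=d, concentration=1.0)
        print(d, [round(p, 4) for p in existing], round(new, 4))
```

With a discount of 0 the rule reduces to the Dirichlet-process case. Raising the discount takes a fixed amount from every existing cluster, which proportionally penalizes small clusters the most and raises the chance of opening a new one. This is why a model that tunes the discount, as the abstract describes, has a lever for controlling how strongly majority clusters absorb samples.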