Research And Implementation Of Text Clustering Algorithm Based On Non-parametric Bayesian Model

Posted on:2018-09-07

Degree:Master

Type:Thesis

Country:China

Candidate:W L Zhong

GTID:2358330536488536

Subject:Computer application technology

Abstract/Summary:

Text clustering is an important text mining technique. With the rapid development of Internet technology in the information era, it has become a research hotspot in data mining. The number of online documents is increasing rapidly, and real-world document collections are often organized in complex ways. Users face two problems directly. 1) Text datasets are usually imbalanced: the number of samples is unevenly distributed across clusters. Some clusters contain only a few samples (minority clusters), while others account for the majority of samples (majority clusters). 2) The number of clusters in a text dataset must be determined. Text clustering models based on non-parametric Bayesian methods, combined with prior knowledge about the text, have become a popular way to learn the number of clusters automatically from the dataset. Compared with other models, a non-parametric Bayesian model relaxes the assumption that the number of clusters is fixed in advance and provides an adaptive approach to model selection. Non-parametric Bayesian models therefore have important research value for text clustering. However, because of the “rich-get-richer” clustering property of non-parametric Bayesian models, samples in minority clusters tend to be absorbed by majority clusters. It would be useful to develop a clustering model that can both learn the number of clusters automatically from such datasets and handle the imbalance effectively. To address these problems, this thesis proposes a text clustering algorithm based on the Pitman-Yor process (PYP), named the DAPYP (discount adaptation Pitman-Yor process) model. During text clustering, the model automatically adjusts the discount parameters of the PYP according to the number of texts in each text category, so that minority and majority clusters in an imbalanced dataset are identified at the same time. Experiments on artificial datasets and real news text datasets show that the proposed model can effectively address the data imbalance problem in real text data and learn the number of clusters automatically.
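The “rich-get-richer” behaviour mentioned in the abstract can be illustrated with the standard Chinese-restaurant view of the Pitman-Yor process. The sketch below is not the DAPYP model: the abstract does not give its discount adaptation rule, so the code uses a single fixed discount. The function names, parameter names, and example values are illustrative assumptions. Under the standard process, an existing cluster of size n_k is chosen with probability (n_k - d) / (n + theta), and a new cluster is opened with probability (theta + d * K) / (n + theta), where n is the number of items seen so far and K is the current number of clusters.

```python
import random


def pyp_seating_probs(counts, discount, concentration):
    """Predictive probabilities for the next item under a standard Pitman-Yor process.

    counts: sizes of the existing clusters.
    Returns len(counts) + 1 probabilities; the last entry is the
    probability of opening a new cluster.
    """
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must lie in [0, 1)")
    if concentration <= -discount:
        raise ValueError("concentration must be greater than -discount")
    n = sum(counts)
    if n == 0:
        # The first item always opens the first cluster.
        return [1.0]
    k = len(counts)
    denom = n + concentration
    # Larger clusters attract more mass: the rich-get-richer effect.
    probs = [(c - discount) / denom for c in counts]
    probs.append((concentration + discount * k) / denom)
    return probs


def sample_partition(num_items, discount, concentration, seed=0):
    """Draw cluster sizes by seating items one at a time (fixed discount)."""
    rng = random.Random(seed)
    counts = []
    for _ in range(num_items):
        probs = pyp_seating_probs(counts, discount, concentration)
        choice = rng.choices(range(len(probs)), weights=probs)[0]
        if choice == len(counts):
            counts.append(1)      # open a new cluster
        else:
            counts[choice] += 1   # join an existing cluster
    return counts


if __name__ == "__main__":
    # Hypothetical example: one majority cluster (8 items), one minority cluster (2 items).
    print(pyp_seating_probs([8, 2], discount=0.5, concentration=1.0))
    print(sorted(sample_partition(200, discount=0.5, concentration=1.0), reverse=True))
```

With the discount set to 0 the formulas reduce to the Dirichlet process. A larger discount shifts probability mass from existing clusters toward new ones, which is the parameter the abstract says DAPYP adjusts according to cluster size.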

Keywords/Search Tags:

Data Mining, Text clustering, Pitman-Yor Process, Imbalanced dataset

Related items

1	The Research Of The Clustering Ensembles Based On SEAM Algorithm And Its Application On Text
2	Data Mining And Its Application In The Chinese Text To Speech
3	Implementation Of Distributed Hierarchical Clustering Algorithm Faced To Huge Commodity Dataset
4	The Study And Application Of Web Text Data Mining Technology Based On The Approximate Pages Clustering Algorithm
5	Research Of Clustering Algorithm Based On Web Text Mining
6	Research On Text Clustering Methods And Their Applications
7	Research On The Application Of Feature Screening And Clustering Algorithm In Text Mining
8	Study On Outlier Mining Algorithms Based On Clustering
9	Research Of Clustering Mining Algorithm Oriented Big Data
10	Research On Clustering Process Model About The Text Of The Web Based On Concept Lattices