Font Size: a A A

Model-based Algorithms For Text Clustering

Posted on:2018-11-24Degree:DoctorType:Dissertation
Country:ChinaCandidate:J H YinFull Text:PDF
GTID:1368330566487978Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Text clustering is an important technique in data mining and machine learning,and widely used in applications such as topic detection and tracking,document summarization,and search results clustering.Although many studies have been done on text clustering,there are still many challenging problems to be solved:(1)How to set the number of clusters?Can we learn it from the dataset?(2)How to deal with the high-dimensional problem of text clustering?(3)How to deal with the sparse problem of short text?(4)How to obtain good representation of the clusters?(5)How to detect the outlier documents?(6)How to cluster huge amount of documents efficiently?This paper proposed several model-based text clustering algorithms which can cope with the above challenges:1.We proposed a collapsed Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture model called GSDMM algorithm.When the initial number of clusters is larger than the true number of clusters,GSDMM can detect the number of clusters automatically.GSDMM is fast to converge and can balance the completeness and homogeneity of the clustering results.Meanwhile,GSDMM can cope with the sparse and high-dimensional problem of short texts and can also obtain the representative words of each cluster.The experimental results show that GSDMM can achieve better performance than the other three clustering methods.2.We proposed GSDPMM algorithm which is based on the Dirichlet Process Multi-nomial Mixture model as an improvement of GSDMM algorithm.GSDMM needs to assume the maximum number of clusters.When the assumed maximum number of clusters is smaller than the true number of cluster,the performance of GSDMM will be effected.When the assumed maximum number of clusters is much larger than the true number of clusters,the efficiency of GSDMM will be effected.GS-DPMM does not need to assume the maximum number of clusters and can learn the number of clusters from the dataset.Meanwhile,GSDPMM can detect the outlier documents automatically.The experimental results show that GSDPMM can achieve better performance than GSDMM and other clustering methods.3.We proposed FGSDMM algorithm whose time complexity is linear with the num-ber of non-empty clusters.FGSDMM solves the efficiency problem of GSDMM in another way and can control the number of clusters.We further proposed FGS-DMM+ algorithm based on the idea of online clustering which has a better method for initialization and can improve efficiency and performance of FGSDMM.The experimental results show that GSDPMM can achieve better performance than other clustering methods when the assumed maximum number of clusters is large enough.
Keywords/Search Tags:Text Clustering, Gibbs Sampling, Dirichlet Multinomial Mixture, Dirichlet Process Multinomial Mixture
PDF Full Text Request
Related items