Font Size: a A A

Probabilistic Model-based Document Clustering

Posted on:2006-06-15Degree:MasterType:Thesis
Country:ChinaCandidate:L YuanFull Text:PDF
GTID:2168360155952977Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Document clustering has become an increasingly important technique for unsupervised document organization, automatic topic extraction, and fast information retrieval or filtering. Till the mid-nineties, hierarchical agglomerative clustering using a suitable similarity measure such as cosine or Euclidean distance formed the dominant paradigm for clustering documents. However, hierarchical clustering is not suitable to work on larger document set. The increasing interest in processing larger collections of documents has led to explore diverse approaches to the document clustering problem. Current document clustering methods can be mainly divided into two categories: similarity-based approaches and model-based approaches. In similarity-based approaches, they assume that the data points belong to a metric space and the metric is known. In other words, it is assumed that a well defined distance or similarity measure exists between any pair of objects. The clustering process is essentially to optimize an objective function involving the pairwise document similarities, aiming to maximize the average similarities within clusters and minimize the average similarities between clusters. In model-based approaches, on the other hand, they assume that the data is generated form a series of underlying parametric, probabilistic models. Values for the parameters are estimated from the input data, and the properties of the clusters are the inferred from these parameters. So the clustering process is essentially to learn a series of probabilistic models whose parameters were fixed on from the documents, and each model representing one particular document cluster. In this paper we focus on model-based approaches since they provide several advantages. First, model-based partitional clustering algorithms often have a complexity of O(n), where n is the number of data samples. In similarity-based approaches, just calculating the pairwise similarities requires O(n2) time. Second, each cluster is described by a representative model, which provides a richer interpretation of the cluster. Third, online algorithms can be easily constructed for model-based clustering using competitive learning techniques. Online algorithms are useful for clustering a stream of documents such as news feeds, as well as for incremental learning situations. Following is the essential framework of the probability model-based clustering approach: First, one should chose appropriate probability model for the data under consideration. Second, employ the EM (Expectation Maximization) algorithm to estimate the parameter in the model. Finally, use the probabilistic model with certain parameters to partition the input data. In this paper, we present an algorithm for document clustering-mvMF-Clustering algorithm. The algorithm represents the document clusters by von Mises-Fisher (vMF) distributions, and employs the mixture of vMF distribution to model the document set. Research has show that a natural model for multivariate directional data is provided by the vMF distribution on hypershphere that is analogous to the multi-variate Gaussian distribution in R d. We represent the document set by mixture model, just because it can capture the data from local by each component density of the model, and combine all the components to represent the global dataset. This approach can provide an effective description of the dataset under considering the local data. The core process of mvMF-Clustering is fixing the value of parameters in mixture model by EM approach. The EM algorithm can be used to solve problems which have some data or value missing. In clustering processing, the latent class label of dataset is invisible, is the hide variable,one can not know it in advance. The process of EM algorithm can be divided into two steps: E-step, supposing all parameters in the probabilistic model has been...
Keywords/Search Tags:Probabilistic
PDF Full Text Request
Related items