Exploring Data Clustering with Non-negative Matrix Factorization Models

Posted on: 2016-08-21
Degree: Ph.D.
Type: Thesis
University: Drexel University
Candidate: Xiong, Zunyan
Full Text: PDF
GTID: 2478390017477758
Subject: Computer Science
Abstract/Summary:
The clustering problem has been widely studied in data mining and machine learning. It has numerous applications in areas such as pattern recognition, information retrieval, image analysis, and bioinformatics. In general, clustering is a fundamental unsupervised machine learning technique that aims to partition a data set based on the similarity of its elements. Recently there has been significant development in the use of non-negative matrix factorization (NMF) methods for various clustering tasks. The method finds two non-negative matrices whose product approximates the original matrix. The non-negativity of the factor matrices gives NMF an advantage over other matrix factorization methods because it makes the factors much easier to interpret. Moreover, NMF has attracted much attention due to its recently demonstrated ability to solve challenging data mining and machine learning problems. Studies have shown that NMF is equivalent to kernel k-means and probabilistic latent semantic indexing under certain conditions. NMF has also been shown to achieve clustering results that are comparable to or better than those of most other clustering methods.

In this thesis, our primary goal is to study the clustering problem by establishing NMF models that reflect the features of the given data. First, for the case in which the similarity of the data is available, we propose two modified NMF models, one with a constraint (CNMF) and the other with a regularization term (RNMF). We take this situation as an example to show how to model the data information. We also compare these two commonly employed approaches in this simple case. Next, we propose a novel model named augmented non-negative matrix factorization (ANMF). The novelty of the model is that it incorporates the geometric closeness of the data on both dimensions of the data matrix. In addition to the experiments conducted on benchmark data sets, the model is also applied to a real-world application, the CiteULike data set. Finally, for data sets with sparse features, we propose a new model named sparse regularized non-negative matrix factorization (SpaNMF). This type of data is ubiquitous in applications and has remained a hot topic for many years. Our novelty here is to combine the geometric structure and the sparseness of the data. For all four models, we develop numerical algorithms and conduct experiments. The experimental results demonstrate the effectiveness of our proposed models compared with state-of-the-art clustering algorithms.
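
As a rough illustration of the basic idea behind NMF-based clustering, the sketch below factors a small non-negative matrix X into two non-negative factors W and H using the standard multiplicative update rules, then assigns each row to the cluster with the largest entry in its row of W. This is plain NMF only, not the CNMF, RNMF, ANMF, or SpaNMF models proposed in the thesis. The function name `nmf`, the toy matrix, the rows-as-data-points convention, and the argmax assignment rule are all assumptions made for this example.

```python
import numpy as np

def nmf(X, k, n_iter=500, eps=1e-10, seed=0):
    """Plain NMF via multiplicative updates (illustrative sketch only).

    Approximates a non-negative matrix X (n x m) by W @ H, where
    W is n x k and H is k x m, both non-negative.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    # Random strictly positive initialization (an assumption of this sketch).
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical toy data: rows are data points, columns are features.
# Rows 0-1 use only the first two features, rows 2-3 only the last two.
X = np.array([
    [5.0, 5.0, 0.0, 0.0],
    [4.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 3.0],
    [0.0, 0.0, 4.0, 4.0],
])

W, H = nmf(X, k=2)

# Assumed assignment rule: each row joins the cluster whose
# coefficient in W is largest.
labels = W.argmax(axis=1)
rel_error = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
print("cluster labels:", labels)
print("relative reconstruction error:", rel_error)
```

Because the factors are non-negative, each row of W can be read directly as a set of cluster membership weights, which is the interpretability advantage mentioned in the abstract.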
Keywords/Search Tags:Clustering, Data, Matrix factorization, Model, Machine learning, NMF