Font Size: a A A

Research On Clustering Problems And Algorithms For High-dimensional Data

Posted on:2016-05-22Degree:DoctorType:Dissertation
Country:ChinaCandidate:X H HuangFull Text:PDF
GTID:1108330479478707Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
With the rapid development of social media and mobile Internet technology, many new and interesting applications, such as Microblog, Weixin and Digg etc., emerge con-tinuously. These new applications have changed user’s behavior in information propaga-tion from passively accepting information in tradition to actively creating and propagating information nowadays. The process of using these applications by users generates large-scale high-value data. How to mine useful knowledge from these large-scale data is an international researching hotspot. High-dimensional data clustering is an unsupervised learning technique of data mining. Many researchers in China and abroad mined the useful knowledge hidden in social media data by using high-dimensional data clustering techniques and achieved many interesting results.However, in comparison to traditional high-dimensional data, the high-dimensional data generated by these new applications have new features. At first, due to the diversity of users and their randomness in process of using these applications, new high-dimensional data are usually very sparse and noisy. Secondly, with increasing complexities of the applications, the data generated by these new applications may include multiple type-s of features. Moreover, since many interesting applications have attracted large-scale users, the scale of data is becoming larger and larger. These characteristics of the high-dimensional data may bring to new challenges for traditional clustering algorithms. Re-cently, for the new characteristics of these large-scale complex high-dimensional data, researchers have proposed some new algorithms, such as subspace clustering algorithms, enriching sparse data clustering with external knowledge database, matrix factorization methods, etc. However, these studies also have some problems. For examples, these methods cannot effectively utilize the knowledge hidden in the data, need to import ex-ternal knowledge database, and cannot fuse heterogeneous features. This dissertation focuses on the problems:noise, sparsity, multi-view and high computational complexity in high-dimensional data. Based on the existing subspace clustering algorithms and ma-trix factorization methods, a series of new high-dimensional data clustering algorithms is proposed in this dissertation. The main contributions of this dissertation are as follows:First, a new subspace clustering framework by integrating intra-cluster compact-ness and inter-cluster separation is proposed. Based on this framework, three extend- ing algorithms:extending basic kmeans algorithm, extending variable weighting kmeans and extending attribute weighting algorithms are presented. Theoretical analysis about convergence of three extending algorithms have been given. As opposed to traditional kmeans-type algorithms, the extending weighting kmeans algorithms are able to obtain better effect of feature selection and reduce the weights of noisy features through utilizing intra-cluster compactness and inter-cluster separation simultaneously. Consequently, the three extending algorithms are able to gain better clustering results. Experimental results on synthetic datasets and real datasets have validated that the extending algorithms are superior to the traditional algorithms.Second, the STCEK (Short Text Clustering with Extending Keywords) algorithm is proposed for solving the problem of high-dimensional short texts clustering. It firstly ex-tracts concepts from a given dataset according to the cooccurrences of keywords. Then, a concept graph is constructed by using semantic relationship between concepts, and con-cept groups are extracted by partitioning the concept graph. At last, the concepts groups are used to enrich the semantics of short texts, as a result, the sparsity of short text is reduced to some extent. Since the STCEK algorithm employs only internal information of a dataset and does not require to import external knowledge database, it does not intro-duce noises. Experimental results show that the STECK algorithm performs better than the short text clustering algorithms of importing Wikipedia knowledge and not importing external knowledge.Third, The MNMF (Multiple Nonnegative Matrices Factorization) algorithm is pro-posed for solving the clustering problem of time-stamped multi-feature data. At first, each type of feature is represented by a nonnegative matrix and the whole dataset is represented by multiple matrices. Then, multiple nonnegative matrices factorization is used as tool of clustering analysis. Moreover, to obtain the smooth evolutionary trends of clusters, smoothness constraint is used to the time stamp dimension in the process of matrix fac-torization to relieve oscillation of the evolutionary trend produced by noises. According to the theoretical analysis about convergence, the algorithm is able to obtain local opti-mal solution. Experimental results on datasets TDT5 and NIPS show that the MNMF algorithm outperforms the existing time-stamped multi-view clustering algorithms.Fourth, The PMNMF (Parallel Multiple Nonnegative Matrices Factorization) algo-rithm is proposed for solving the high computational cost of multi-view data clustering problem. It transforms multiple nonnegative matrices factorization to the combinations of matrix multiplication, matrix addition and matrix element-wise operations. Then, the PMNMF algorithm is able to effectively reduce the time cost of matrix factorization by unitizing speed advantage of GPU (Graphic Processing Unit) on these operations. Exper-imental results on synthetic datasets and real datasets show that the PMNMF algorithm is superior to sequential execution MNMF algorithm with respect to the running time.In summary, this dissertation concentrates on the problems of noises, sparsity, multi-view and high computational cost in high-dimensional data, and proposes several algo-rithms:the extending kmeans algorithms, the STCEK algorithm, the MNMF algorithm and the PMNMF algorithm. Therein, the MNMF algorithm is the basis of the PMNMF al-gorithm. The research in this dissertation may improve performance of high-dimensional data clustering techniques further. The proposed algorithms can bring to alternatives to topic analysis, precision marketing, social searching, etc.
Keywords/Search Tags:high-dimensional data, clustering, nonnegative matrix factorization, multi- view data, graphics processing unit
PDF Full Text Request
Related items