Nowadays,we have entered the big data era.The communications between people benefit much from the developed science and technology and efficient communication tools,and produce a large amount of data.It has become a critical practical problem to dig valuable knowledge from these data.Non-negative matrix factorization(NMF)is a popular data mining method which decomposes a high-dimensional non-negative data matrix into the product of two low-dimensional non-negative matrices.After such dimension reduction,the computational complexity of the subsequent processing method can be reduced,and the data size can be compressed without losing the quantity of information.Due to the aforementioned advantages,the NMF method has been successfully and widely used in various fields.However,NMF lacks nonlinear information,and cannot preserve the geometrical structure of the original data during the decomposition procedure.To overcome these deficiencies,someone introduced the graph regularization technique into NMF and proposed graph regularized NMF(GNMF)method.Towards graph cutting problem in network,some introduced the symmetric NMF(SymNMF)method.The GNMF method incorporates the geometrical information of the data into the decomposition procedure of NMF,and SymNMF utilizes the adjacent matrix of a graph as the sample matrix.Both of them improve the performance of the result of decomposition by incorporating the nonlinear information embedded in the dataset.However,similar to GNMF,SymNMF cannot preserve the geometrical structure of the original data.We therefore propose a graph regularized symmetric NMF(GrSymNMF)in this thesis.Since the graph regularization terms make the entry in the factor matrix closely correlated,the GNMFs cannot handle large-scale data on distributed systems.To solve the above problems,this thesis proposes a framework of distributed GNMF,which can satisfy the requirement of dimension reduction on any distributed clustering systems and get good performance.The main contributions are four-folds:(1)We propose a distributed GNMF(DGNMF)method which reformulates the graph regularization term to avoid multiplying the graph Laplacian matrix with the factor matrix through introducing an auxiliary variable and incorporating an auxiliary equality constraint.This strategy reduces the amount of communication overheads and obtains a near linear speedup rate.(2)We propose a graph regularized SymNMF which enhances the effectiveness of SymNMF through incorporating the graph regularization term.We introduced a distributing strategy which can be successfully applied to large-scale dataset and achieves high speedup rate.(3)We optimized both models by using the greedy coordinate descent method in the frame of augmented Lagrange method and implemented these algorithms on a distributed system.(4)We introduced a distributing strategy for each of two graph construction methods,i.e.,k-NN graph and sparse representation graph,which make perfect the framework of GNMF on large-scale datasets. |