Online social network services are undergoing great changes with the arrival of Web 2.0 and the rapid development of information technology. Social networking platforms have sprung up and flourished, social network analysis has become a hot topic in recent years, and community detection in particular has attracted extensive interest among academics. Community mining helps to understand the structure of complex networks and to uncover hidden features behind the data, which can improve user-oriented services such as the identification of opinion leaders and personalized recommendation.

With its low entry threshold, openness and multi-terminal access, microblogging has attracted a large number of users, and as an online virtual platform it deeply affects real life. However, the huge number of users, the multi-dimensional data, the tremendous amount of user-generated content and the complex relational network turn a microblog platform into a sea of data and cause data overload, so it is difficult for users to find the information or the other users they may be interested in. This work aims to discover the community structure of the microblog network. The dataset consists of personal information, user-generated blogs and relational data crawled from Sina Weibo.

Generally speaking, microblog community detection faces two crucial challenges. First, how to combine user-generated content with the relational network effectively. Second, how to model large-scale, sparse and high-dimensional data with low time complexity. This work develops a novel approach to community detection based on the factorization of a high-order tensor that integrates the relational network with blogs. The major work and results are summarized as follows.

On the one hand, user interest is modeled by preprocessing user-generated content, extracting keywords, expanding the keywords with word embeddings and reducing the dimensionality with non-negative matrix factorization (NMF). On the other hand, the concept of user influence in the relational network is proposed. First, an NMF model, C-NMF, is developed based on the key features. Second, a weighted NMF model, WNMF, is developed that incorporates the user influence defined on the relational network. C-NMF and WNMF are solved with alternating least squares (ALS) and stochastic gradient descent (SGD), and their results serve as the baseline of this research.

This work then presents a unified framework based on non-negative tensor factorization (NTF), which is solved with ALS and SGD in turn. Initializing the core tensor and the factor matrices with SVD and HOSVD accelerates convergence. To avoid overfitting, a novel model, RNTF, is formulated by adding a regularization term to NTF. RNTF is optimized with an iterative algorithm called improved stochastic gradient descent (ISGD), which exploits the great sparsity of the dataset; this is also one of the innovations of this thesis.

The efficiency of the algorithms is also analyzed. For the same cost function, ALS is generally more efficient than SGD, and ISGD is far more efficient than both ALS and SGD because the dataset is extremely sparse. The choice of learning rate and regularization parameters directly influences the efficiency of these algorithms. Community quality is evaluated with overlapping community modularity and community topic similarity. RNTF preserves the essential structure of the data and effectively uncovers the latent features of the raw data. Compared with the matrix models, the community structure obtained from the tensor models is clearer and more distinct, with higher topic similarity and better performance.
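
The abstract does not give the RNTF objective or the ISGD update rule, so the following is only a minimal sketch of the general idea it describes: a regularized non-negative tensor factorization fitted by stochastic gradient steps that visit only the non-zero entries of a sparse tensor. Several things here are assumptions made for illustration and are not taken from this work: the CP-style model without a core tensor, the L2 penalty, the projection onto non-negative values, and the hypothetical names `sparse_ntf_sgd` and `reconstruct`.

```python
import numpy as np

def sparse_ntf_sgd(coords, values, shape, rank=2, lr=0.05, reg=0.01,
                   epochs=200, seed=0):
    """Regularized non-negative CP factorization of a sparse 3-way tensor.

    Illustrative assumption: only the stored (non-zero) entries are visited,
    so one epoch costs O(nnz * rank) instead of O(prod(shape) * rank).
    """
    rng = np.random.default_rng(seed)
    # One non-negative factor matrix per tensor mode.
    factors = [rng.uniform(0.1, 1.0, size=(n, rank)) for n in shape]
    for _ in range(epochs):
        for idx in rng.permutation(len(values)):
            i, j, k = coords[idx]
            a = factors[0][i].copy()
            b = factors[1][j].copy()
            c = factors[2][k].copy()
            err = values[idx] - np.sum(a * b * c)
            # Gradient step on squared error plus L2 penalty,
            # then projection onto the non-negative orthant.
            factors[0][i] = np.maximum(a + lr * (err * b * c - reg * a), 0.0)
            factors[1][j] = np.maximum(b + lr * (err * a * c - reg * b), 0.0)
            factors[2][k] = np.maximum(c + lr * (err * a * b - reg * c), 0.0)
    return factors

def reconstruct(factors, coords):
    """Model predictions at the given (i, j, k) coordinates."""
    A, B, C = factors
    return np.array([np.sum(A[i] * B[j] * C[k]) for i, j, k in coords])

if __name__ == "__main__":
    # Toy tensor with made-up entries, purely to show the call pattern.
    coords = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (2, 1, 0), (2, 2, 2)]
    values = [1.0, 2.0, 1.5, 0.5, 1.0]
    fitted = sparse_ntf_sgd(coords, values, shape=(3, 3, 3))
    print(reconstruct(fitted, coords))
```

In a community detection setting, the rows of a user-mode factor matrix could be read as soft, possibly overlapping community memberships. The learning rate `lr` and the regularization weight `reg` correspond to the parameters whose choice the abstract says affects efficiency.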