
Weibo Group Division and Feature Extraction Based on Community Detection

Posted on: 2016-04-21
Degree: Master
Type: Thesis
Country: China
Candidate: T T Wang
Full Text: PDF
GTID: 2298330467993155
Subject: Computer technology
Abstract/Summary:
With the rapid development of internet technology, web users are no longer only readers of data but also creators of data on platforms such as Weibo. The key to personalized service in an asymmetric information network like Weibo is clustering users into communities. To address this problem, this thesis presents a two-layer model of Weibo users and a user similarity model. From these models we extract valuable features, and with those features we group users into communities using a clustering algorithm.

First, the data we collected show that Weibo is a power-law graph and also an information network. Because information flows in a direction, we build a two-layer model, which differs from a relation network. Since Weibo is a power-law graph, studying users who have large numbers of followers brings more benefit. We choose the top 10% of users by PageRank value as the upper-layer users.

Second, this thesis proposes that Weibo user similarity can be calculated from static user information similarity, structural similarity, and Weibo post similarity. Using these features, we train the similarity model with logistic regression and use L1-regularized logistic regression to extract the features that contribute most to the model. These steps yield the features that most affect user similarity.

Finally, we group users into communities using the K-means clustering algorithm, for which we define an effective method of calculating the distance between two points.

We evaluate our models on real-world Weibo datasets of 100K users, implemented on Spark. Results show that, using the proposed user similarity model, we can judge whether two users are in the same community with an accuracy of 82.98% on the validation datasets and 77.27% on the test datasets. As a side result, our experiment also shows that the users in our Weibo datasets can be approximately categorized into 460 communities, with an RI value of 0.69.
Keywords/Search Tags: Social information network, Logistic regression, Feature selection, Community detection