Research on Sina Weibo User Information Based on Two Improved Clustering Algorithms

Posted on: 2015-03-14 | Degree: Master | Type: Thesis
Country: China | Candidate: Z Zhao | Full Text: PDF
GTID: 2267330428460390 | Subject: Applied statistics
Abstract/Summary:
In recent years, Weibo (a micro-blog service released by Sina) has grown rapidly and has become a necessary part of people's daily lives. As a platform for disseminating information, Weibo helps people obtain first-hand information in a timely manner. As a social platform, it helps people make friends with each other in a new way. Because users play the core role on the Weibo platform, partitioning and refining the Weibo user base is an extremely important step for advertising and marketing, public opinion monitoring, and other work that relies on Weibo.

The paper takes Weibo user information data as its research object and partitions Weibo users into different groups based on five attributes of each user: the number of fans, the number of weibos (posts), the number of followers, the number of friends, and the Weibo age of the account. First, the paper visualizes the data to gain an overall understanding of its distribution, and standardizes the data as a preprocessing step. The data set is large (21,481 records) and has more than three dimensions, which makes it impossible to observe the clustering tendency directly. The paper therefore applies two improved clustering algorithms. The first is an improved K-Means algorithm that adds the C-H (Calinski-Harabasz) index to the traditional K-Means algorithm so that the number of clusters can be selected automatically. The second is the TwoStep algorithm, which combines hierarchical clustering with the BIRCH algorithm and can handle very large data sets. The paper then names the clusters produced by each of the two algorithms.

Finally, the paper measures the quality of the clusters with three different validity indexes. The results show that the improved K-Means algorithm performs better. Two factors may explain this result: the first is the loss of information caused by the pre-clustering stage of the TwoStep algorithm, and the second is an unsuitable choice of the threshold T.
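The idea of letting the C-H index pick the number of clusters can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names (`standardize`, `farthest_first`, `kmeans`, `calinski_harabasz`, `choose_k`), the farthest-first initialization, the candidate range of k, and the synthetic two-feature data are all assumptions made for illustration. The sketch z-scores each feature, runs plain K-Means for each candidate k, and keeps the k with the highest C-H score, computed as [B / (k - 1)] / [W / (n - k)], where B is the between-cluster and W the within-cluster sum of squares.

```python
import random

def standardize(rows):
    """Z-score every column (zero mean, unit variance)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [(sum((v - m) ** 2 for v in c) / len(c)) ** 0.5 or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    return [sum(c) / len(c) for c in zip(*points)]

def farthest_first(data, k):
    """Deterministic seeding (an assumption of this sketch, not the thesis):
    start at the first point, then repeatedly add the point farthest from
    the centers chosen so far."""
    centers = [data[0]]
    while len(centers) < k:
        centers.append(max(data, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return [list(c) for c in centers]

def kmeans(data, k, iters=100):
    """Plain Lloyd iterations: assign to nearest center, then recompute means."""
    centers = farthest_first(data, k)
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in data]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(data, labels) if l == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = mean_point(members)
    return labels, centers

def calinski_harabasz(data, labels, centers):
    """C-H index: [B / (k - 1)] / [W / (n - k)]; larger is better."""
    n, k = len(data), len(centers)
    overall = mean_point(data)
    within = sum(sq_dist(p, centers[l]) for p, l in zip(data, labels))
    between = sum(labels.count(j) * sq_dist(centers[j], overall) for j in range(k))
    if within == 0:
        return float("inf")
    return (between / (k - 1)) / (within / (n - k))

def choose_k(data, k_values):
    """Run K-Means for each candidate k and return (score, k, labels) of the best."""
    best = None
    for k in k_values:
        labels, centers = kmeans(data, k)
        score = calinski_harabasz(data, labels, centers)
        if best is None or score > best[0]:
            best = (score, k, labels)
    return best

# Synthetic stand-in data: three well-separated groups with two features each.
# The real study uses five features per user; these values are made up.
rng = random.Random(42)
raw = [[cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5)]
       for cx, cy in [(0, 0), (10, 0), (0, 10)] for _ in range(40)]
score, best_k, labels = choose_k(standardize(raw), range(2, 7))
print("selected k:", best_k, "C-H score:", round(score, 1))
```

On clearly separated synthetic groups like these, the C-H score should peak at the true number of groups. Real user data with heavily skewed counts is unlikely to separate this cleanly, so the selected k there would deserve a closer look.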
Keywords/Search Tags: Sina Weibo, Information of users, Clustering, K-Means algorithm, TwoStep algorithm, Cluster validation