| With the rapid development of Web 2.0 applications, social media, as a platform for people to record their lives, share information, and make friends online, has drawn great attention from business, politics, and academia. Social media can help reveal a user’s social relationships, online behavior, and preferences, and thus contribute to recommender systems for friends, products, and services. Furthermore, social media can be used to predict online collective behavior by studying historical information diffusion patterns, which helps maintain a harmonious society. The collection of social media data and the monitoring of online collective behavior are therefore among the most critical and urgent research topics. Traditional sampling techniques cannot be applied directly to social media data because the data are strongly relationship-dependent. In addition, the task is challenging because of the volume, velocity, and variety of social media data. Microblogging services, among the most typical social media platforms, exhibit most of the characteristics of social media. This thesis focuses on data collection and information diffusion pattern discovery. Our contributions are as follows. · A structure-based data acquisition method for social media is proposed and implemented. Based on Weakly Connected Component (WCC) theory, the distributed crawler starts from selected seed users and then expands through the followee network in breadth-first order. The collected data set is published and used in the subsequent analyses. · A formal definition of the popularity of microblogs is provided. The definition considers both the retweet number #retweet and the possible view number #pv. Moreover, we observe that tweets with a larger #retweet tend to have a larger #pv. · The life cycles and tipping points of tweets are studied. 
The results indicate that most tweets with a larger #retweet have life cycles shorter than 48 hours. In addition, a tweet may have a tipping point, that is, a burst during the diffusion process. In real data, the retweet volume over time follows a sigmoid function, so a sigmoid function is used to fit the trend. A parameter estimation method for the model is provided, and the experimental results show that the model and the parameter estimation method achieve high precision. · A resource library for analyzing online collective behavior is developed. It can describe an event in terms of time, location, sentiment, and diffusion network. We also provide a demo system that visualizes the evolution of an event, including its participants, people’s attitudes, and influential users. This thesis explores the feasibility of collecting data from microblogging systems. Based on the collected data and the proposed definition of popularity, we model information diffusion and study the life cycles and tipping points of tweets. Finally, an open resource library for collective behavior analysis is established. The visualization demo system illustrates the role of social media in studying users’ collective behavior from multiple aspects. |
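The breadth-first expansion from seed users described in the first contribution can be sketched as follows. This is a minimal single-process illustration, not the distributed crawler of the thesis. The names `crawl_followees`, `get_followees`, and `max_users` are hypothetical, and `get_followees` stands in for whatever platform call returns the accounts a user follows.

```python
from collections import deque


def crawl_followees(seeds, get_followees, max_users=1000):
    """Breadth-first traversal of the followee network from seed users.

    get_followees is a hypothetical callable: user id -> iterable of followee ids.
    Returns the visit order and the (follower, followee) edges seen so far.
    """
    queue = deque(dict.fromkeys(seeds))  # de-duplicated seeds, order preserved
    discovered = set(queue)
    order, edges = [], []
    while queue and len(order) < max_users:
        user = queue.popleft()
        order.append(user)
        for followee in get_followees(user):
            edges.append((user, followee))
            if followee not in discovered:
                discovered.add(followee)
                queue.append(followee)
    return order, edges


# Toy followee graph used only for illustration.
toy_graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
order, edges = crawl_followees(["a"], lambda u: toy_graph.get(u, []))
```

Using a queue rather than a stack is what makes the expansion breadth-first: all followees of the seeds are visited before any user two hops away.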
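To illustrate the sigmoid fit of retweet volume mentioned above, the sketch below fits a logistic curve to cumulative retweet counts. It rests on several assumptions that the abstract does not state: the curve has the form K / (1 + exp(-k (t - t0))), the counts are cumulative, the plateau K is known or guessed, and the tipping point is read off as the inflection time t0. The thesis's own parameter estimation method is not given here, so this linearised least-squares fit is a stand-in.

```python
import math


def sigmoid(t, K, k, t0):
    """Assumed logistic form for the cumulative retweet count at time t."""
    return K / (1.0 + math.exp(-k * (t - t0)))


def fit_sigmoid(times, counts, K=None):
    """Estimate (K, k, t0) by linearising: log(y / (K - y)) = k * t - k * t0."""
    if K is None:
        K = 1.05 * max(counts)  # assumption: plateau slightly above the last count
    xs, ys = [], []
    for t, y in zip(times, counts):
        if 0 < y < K:  # the logit is only defined strictly inside (0, K)
            xs.append(t)
            ys.append(math.log(y / (K - y)))
    if len(xs) < 2:
        raise ValueError("need at least two usable observations")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or sxy == 0:
        raise ValueError("observations do not determine a growth rate")
    k = sxy / sxx                    # least-squares slope
    t0 = mean_x - mean_y / k         # time at which the fitted line crosses zero
    return K, k, t0


# Synthetic data (hours since posting), generated from the assumed model itself.
times = list(range(25))
counts = [sigmoid(t, 1000.0, 0.5, 10.0) for t in times]
K_hat, k_hat, t0_hat = fit_sigmoid(times, counts, K=1000.0)
```

On real data the counts are noisy and K is unknown, so a nonlinear least-squares routine would likely be a better choice; the linearisation is used here only to keep the example dependency-free.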