| As a rising Internet platform, the microblog has greatly influenced how people use media and how information propagates. It has become a very important media platform, with the most timely news and the most active users of all social media. By the end of December 2012, the number of microblog users in China had reached 309 million, or 54.7 percent of all Internet users in China. Research on microblogs is valuable to both society and academia because it helps in understanding trends in public opinion, tracking hot topics, and identifying social groups in social networking services. All of these studies require large amounts of microblog data. Although many organizations already focus on microblog data collection, there is still no collection method as mature as those for traditional Internet applications. Research on microblog data collection is therefore of great significance. This research designs and implements a distributed microblog crawling system, comprising the following parts: 1) Designing and implementing a method of microblog data collection through the application programming interface of the open platform, focusing mainly on the study and use of the open platform's authorization mechanism and programming interface. 2) Designing and implementing a method of microblog data collection through login simulation and webpage parsing, focusing mainly on the study and use of single sign-on and webpage structure. 3) Combining the two methods above, designing the general framework, modules, and database, and implementing an efficient and extensible microblog data collection system using a distributed strategy. With this system, the user simply enters the microblog accounts to be collected and selects the type of data to collect, and the results are returned quickly. 
The collection rate can also be adjusted conveniently by changing the number of crawlers. Functional testing and data acquisition rate testing show that the system is stable and efficient at collecting microblog data and supports dynamic extension. It lays a solid foundation for research carried out on microblog data. |