
Research On The Microblogging Crawler Related Technologies

Posted on: 2014-08-16
Degree: Master
Type: Thesis
Country: China
Candidate: Y S Luo
Full Text: PDF
GTID: 2268330422452101
Subject: Computer Science and Technology
Abstract/Summary:
As a symbol of the Web 2.0 era, social media provides users with various patterns and means of communication. On social media, users post and spread messages and follow the people they are interested in. Hundreds of millions of users are linked to one another by follow relationships, forming a huge social network through which information is transmitted. Most social media sites provide APIs as a convenient way to access data for research, but because of API restrictions it is still impossible to obtain enough data for some studies. Research on social media crawlers is therefore of great significance.

This thesis takes domestic microblog sites, especially Sina Microblog, as its research object. We studied microblog crawler technologies, including research on and implementation of crawling strategies, research on and implementation of microblog data acquisition, and the design of microblog data deduplication.

First, we analyzed the technologies behind microblog crawlers. In this part, we described a Hadoop-based distributed framework and an HBase-based distributed data store. We introduced a Hadoop-based microblog crawler framework that uses UID and MID as primary keys in the NoSQL database, discussed the background and process of the two main methods for obtaining microblog data, and compared various deduplication strategies.

Second, we studied technologies for acquiring microblog topic data and personal data. In this part, we focused on a topic positioning method based on Sina Meta-search and topic-related keywords, and then analyzed the efficiency of the topic data crawler.

Finally, we studied a crawler for obtaining whole-network data on microblogs. In this part, we analyzed the crawler for character data and its crawling strategy, and implemented the breadth-first, depth-first, and UID traversal algorithms. We also analyzed the efficiency of the whole-network character crawler and performed experiments using different strategies in the crawler system.
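
The breadth-first and depth-first crawling strategies named in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the function `crawl_users`, the `get_followees` callback, and the toy follow graph are all hypothetical, and a real crawler would fetch followee lists over the network and keep the visited set in a store such as HBase rather than in memory. The sketch only shows how the choice of frontier (queue or stack) changes the visiting order, with each UID processed at most once.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List


def crawl_users(
    seed_uid: str,
    get_followees: Callable[[str], Iterable[str]],
    strategy: str = "bfs",
    max_users: int = 100,
) -> List[str]:
    """Return UIDs in the order visited, starting from seed_uid.

    get_followees is a hypothetical callback standing in for whatever
    fetches the accounts a user follows; it is an assumption of this sketch.
    """
    if strategy not in ("bfs", "dfs"):
        raise ValueError("strategy must be 'bfs' or 'dfs'")

    frontier = deque([seed_uid])
    seen = set()          # UID-level deduplication
    order: List[str] = []

    while frontier and len(order) < max_users:
        # Queue (FIFO) gives breadth-first order; stack (LIFO) gives depth-first.
        uid = frontier.popleft() if strategy == "bfs" else frontier.pop()
        if uid in seen:
            continue      # already crawled via another follow link
        seen.add(uid)
        order.append(uid)
        for followee in get_followees(uid):
            if followee not in seen:
                frontier.append(followee)
    return order


# Toy follow graph (made-up data for illustration only).
TOY_GRAPH: Dict[str, List[str]] = {
    "u1": ["u2", "u3"],
    "u2": ["u4"],
    "u3": ["u4", "u1"],
    "u4": [],
}


def toy_followees(uid: str) -> List[str]:
    return TOY_GRAPH.get(uid, [])


if __name__ == "__main__":
    print(crawl_users("u1", toy_followees, "bfs"))
    print(crawl_users("u1", toy_followees, "dfs"))
```

On the toy graph, the breadth-first run visits all direct followees of the seed before moving outward, while the depth-first run follows one chain of follow links as far as it goes before backtracking. The `max_users` cap is a stand-in for the request budget a rate-limited crawler would have.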
Keywords/Search Tags: Social Media, Sina Microblog, Crawler, Crawling Strategy, HBase