Design And Implementation Of Distributed Weibo Information Collection System

Posted on: 2015-04-11, Degree: Master, Type: Thesis
Country: China, Candidate: X X Fan, Full Text: PDF
GTID: 2308330452955586, Subject: Communication and Information System
Abstract/Summary:
With the rapid development of internet technology, social networking and mobile networks, the collection, analysis and forecasting of massive data has become a hot topic in various fields. Currently, massive data is mainly crawled across the entire network, as some search engines do, in stand-alone mode. This design is poorly targeted and performs poorly. We therefore design a distributed system for crawling data that aims at three goals: scalability, high performance and availability.

This thesis designs and implements a distributed system for crawling data. The data collection module is designed to crawl and parse pages after a simulated login to the Weibo system. For scalability, the system uses a distributed Master/Slave model. It consists of two major modules: the control node and the work nodes. 1) The control node is responsible for node management, task scheduling, task status detection and data storage. The task scheduling module uses a priority-based FIFO algorithm. Task status is detected by a periodic heartbeat mechanism. Data storage is implemented by a bulk storage mechanism based on a message queue. 2) The work nodes are responsible for task execution, task status reporting, and task requests. The task execution module crawls the data using HTTPClient and then uses an XQuery template to extract the targeted data. So that the controller has global information, the work nodes periodically send it their task status. The task request module uses a thread-pool saturation strategy.

The system runs stably after functional testing; however, its performance can still be improved. Many adjustments and improvements to some aspects of the design are still needed to make it more stable and efficient.
Keywords/Search Tags: Distributed System, Usability, Data Crawling, Data Parsing, XQuery Template
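
The abstract names a priority-based FIFO scheduling algorithm and a periodic heartbeat for status detection but gives no implementation details. The following is a minimal Python sketch of how a control node might combine the two, not the thesis's actual code. The class names `PriorityFifoScheduler` and `HeartbeatMonitor`, the convention that a lower number means higher priority, the FIFO tie-breaking within one priority level, and the timeout value are all assumptions made for illustration.

```python
import heapq
import itertools


class PriorityFifoScheduler:
    """Hypothetical scheduler: lower priority number first, FIFO within a level."""

    def __init__(self):
        self._heap = []
        # Monotonic arrival counter: breaks ties so equal priorities stay FIFO.
        self._arrival = itertools.count()

    def submit(self, task, priority=0):
        heapq.heappush(self._heap, (priority, next(self._arrival), task))

    def next_task(self):
        """Return the next task, or None when no task is waiting."""
        if not self._heap:
            return None
        _, _, task = heapq.heappop(self._heap)
        return task


class HeartbeatMonitor:
    """Hypothetical tracker of the last status report from each work node."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self._last_seen = {}

    def report(self, node_id, now):
        # Called whenever a work node sends its periodic status message.
        self._last_seen[node_id] = now

    def stale_nodes(self, now):
        """Return the nodes whose last report is older than the timeout."""
        return sorted(
            node for node, seen in self._last_seen.items()
            if now - seen > self.timeout_seconds
        )


if __name__ == "__main__":
    scheduler = PriorityFifoScheduler()
    scheduler.submit("crawl profile page", priority=2)
    scheduler.submit("crawl hot topic", priority=1)
    print(scheduler.next_task())

    monitor = HeartbeatMonitor(timeout_seconds=30)
    monitor.report("worker-1", now=0)
    print(monitor.stale_nodes(now=45))
```

In this sketch the controller would hand out `next_task()` results when a work node requests work, and could resubmit the tasks of any node that `stale_nodes()` reports. Timestamps are passed in explicitly so the logic can be exercised without a real clock.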