
Distributed Data Processing System Configuration And Task Management Module Design And Implementation

Posted on: 2013-12-20
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Peng
GTID: 2248330374486400
Subject: Software engineering
Abstract/Summary:
The Internet is developing at a remarkable pace and has become an integral part of daily life, capturing market share from most traditional industries. With this rapid development and the growing number of users, data volume is expanding exponentially. Under the pressure of data volumes that routinely reach the TB or even PB level in the Internet industry, the traditional single-node data processing strategy is difficult to sustain. Under these circumstances, distributed data processing was developed and quickly became the mainstream data processing solution.

We designed and implemented a module that stores and distributes global configuration data and manages unsplit tasks. All configuration data in the system is stored and maintained by the administrator through this module. All tasks are also triggered and reclaimed here.

While the entire system is initializing, the module distributes configuration data to the other modules so that they can initialize successfully. If the configuration data is later modified, the new version is pushed to the modules that depend on it, so that every module always holds the latest configuration. Unified, centralized management of configuration data guarantees that data with the same content but stored in separate places comes from a single source, which avoids runtime or initialization exceptions caused by inconsistent configuration data.

All tasks are generated and triggered by the module. For tasks such as offline data analysis, structuring, reorganization, and backup, the administrator draws up a corresponding execution plan, and the module either executes the plan using timers or monitors the system's running status and triggers the tasks that should run under specific conditions. For real-time query and reorganization tasks, the administrator can set parameters here and trigger the task directly. After a task finishes, a task log is saved to record how the execution went, and the resources requested by the task are released. If it is a query task, the query result data is cached to avoid putting unnecessary load on the system. To avoid data incompleteness caused by missed tasks in extreme cases, we scan the task log regularly to find which tasks were missed and re-trigger them, guaranteeing the data completeness of the system.

To protect the system against node failure, we back up task execution state to a remote database using dual redundancy and cold backup, so that tasks are not re-executed and resources are reliably released.
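
The regular task-log scan described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the names `PlannedRun`, `find_missing_runs`, and `rescan`, the five-minute grace period, and the in-memory representation of the task log as a set of `(task_id, due_at)` pairs are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class PlannedRun:
    """One scheduled execution taken from an execution plan (hypothetical type)."""
    task_id: str       # hypothetical task identifier
    due_at: datetime   # when the plan says the task should have run


def find_missing_runs(planned, logged, now, grace=timedelta(minutes=5)):
    """Return planned runs that are overdue and have no task-log entry.

    planned: iterable of PlannedRun taken from the execution plans
    logged:  set of (task_id, due_at) pairs found in the task log
    grace:   assumed slack before an unlogged run counts as missing
    """
    return [
        run for run in planned
        if run.due_at + grace <= now
        and (run.task_id, run.due_at) not in logged
    ]


def rescan(planned, logged, now, trigger):
    """Re-trigger every missing run once and return the runs that were fired.

    Simplification: the run is marked as logged as soon as it is triggered,
    so that a second scan does not fire it again. A real system would more
    likely track in-flight tasks separately from finished ones.
    """
    retriggered = []
    for run in find_missing_runs(planned, logged, now):
        trigger(run)
        logged.add((run.task_id, run.due_at))
        retriggered.append(run)
    return retriggered
```

In this sketch, calling `rescan` from a periodic timer with the current time and a callback that submits the task reproduces the scan-and-re-trigger loop. A run that is not yet due, or that already appears in the log, is left alone.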
Keywords/Search Tags:MapReduce, distributed system, system configuration, task management