
Research On Wide-area Data-intensive Computing Systems For Spatial Data Processing

Posted on: 2014-03-10
Degree: Master
Type: Thesis
Country: China
Candidate: D Zhao
Full Text: PDF
GTID: 2268330422460510
Subject: Computer Science and Technology
Abstract/Summary:
The rapid growth of scientific data has given rise to the fourth paradigm of scientific research. Compared with traditional compute-intensive scientific computing, data-intensive computing requires more consideration of data storage, throughput, delay, load scheduling, and so on. Its implementations and platforms therefore differ from the technologies used previously. Data-intensive computing attracts considerable attention from both industry and academia. Its main applications come from the Internet, scientific computing, business intelligence, data mining, and so on. In this paper, we use remote sensing image processing as the case study for science-oriented, big-data data-intensive computing. We also study a parallel processing framework and discuss several key issues.

1. Implementation and optimization of parallel processing framework

Based on the Robinia platform, we designed and implemented a parallel processing framework for spatial data over wide-area networks. The data model design and the parallel processing logic design are explained. Performance evaluation, bottleneck analysis, and code optimization are also discussed. We focus on data replication and load balancing: rules are used to create data replicas automatically and to assign them to different data nodes, so that the data distribution suits the jobs being run. High parallel processing performance relies on a good data distribution. Experiments confirm that the Robinia parallel processing framework performs well in scalability, robustness, and flexibility, with low overhead.

2. Study of data-intensive computing scheduling algorithm

On top of the parallel processing framework we implemented, various tests and scheduling algorithms can be run. We studied scheduling strategies for remote data fetching, data replica assignment, and data importing. A multi-queue scheduling algorithm is proposed for the scenario in which the number of data nodes stays the same while the number of computing nodes increases. The test case for the experiments is the drought detection algorithm (including NDWI) provided by the Institute of Remote Sensing and Digital Earth, Chinese Academy of Sciences. We use the Master-Worker model for parallel processing and ran the experiments on heterogeneous Linux/Windows nodes. The results show that the scheduling algorithm is very important to the performance of a distributed system, and that data locality can significantly reduce processing time. The multi-queue scheduling algorithm achieves better performance than the random scheduling algorithm.
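The abstract does not spell out how the multi-queue scheduling algorithm works, so the following is only an illustrative sketch of one locality-aware multi-queue scheduler, not the thesis's implementation. It assumes one task queue per data node, tasks placed on the queue of a node that holds a replica of their input, and workers that prefer their nearest data node's queue before taking work from the longest remaining queue. All names (`build_queues`, `next_task`, `run_multi_queue`, `run_random`, the node and task labels) are hypothetical. The number of remote fetches is counted as a rough proxy for the data-locality effect described above.

```python
from collections import deque
import random


def build_queues(tasks, replicas, data_nodes):
    """Create one FIFO queue per data node (assumed design).

    Each task is appended to the currently shortest queue among the
    data nodes that hold a replica of its input.
    """
    queues = {node: deque() for node in data_nodes}
    for task in tasks:
        target = min(replicas[task], key=lambda n: len(queues[n]))
        queues[target].append(task)
    return queues


def next_task(queues, local_node):
    """Pick the next task for a worker whose nearest data node is local_node.

    Prefer the local queue; otherwise take from the longest queue.
    Returns None when every queue is empty.
    """
    if queues[local_node]:
        return queues[local_node].popleft()
    busiest = max(queues, key=lambda n: len(queues[n]))
    if queues[busiest]:
        return queues[busiest].popleft()
    return None


def run_multi_queue(tasks, replicas, data_nodes, workers):
    """Simulate round-robin worker requests; return the remote-fetch count.

    workers maps a worker name to its nearest data node.
    """
    if not workers:
        raise ValueError("at least one worker is required")
    queues = build_queues(tasks, replicas, data_nodes)
    remote = 0
    done = 0
    while done < len(tasks):
        for local_node in workers.values():
            task = next_task(queues, local_node)
            if task is None:
                continue
            done += 1
            if local_node not in replicas[task]:
                remote += 1  # input must be fetched from another node
    return remote


def run_random(tasks, replicas, workers, seed=0):
    """Baseline: assign each task to a uniformly random worker."""
    rng = random.Random(seed)
    worker_nodes = list(workers.values())
    remote = 0
    for task in tasks:
        if rng.choice(worker_nodes) not in replicas[task]:
            remote += 1
    return remote
```

In this sketch a worker only fetches remotely once its local queue is empty, which is one simple way to keep added computing nodes busy when the set of data nodes is fixed. Real systems would also weigh transfer cost and node speed, which are not modeled here.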
Keywords/Search Tags: Data-intensive computing, Parallel processing framework, Schedule algorithm, Robinia platform