Research On Several Key Technologies For Data-intensive Heterogeneous Environments

Posted on:2016-05-30

Degree:Doctor

Type:Dissertation

Country:China

Candidate:X Huang

Full Text:PDF

GTID:1368330473467146

Subject:Computer application technology

Abstract/Summary:

Computer applications are becoming increasingly data-intensive, which poses a great challenge to computer architecture. Data-intensive heterogeneous systems are a promising practical solution under existing technology. They include heterogeneous CPU-GPU architectures that improve computing capability, heterogeneous storage systems that improve I/O performance, and heterogeneous nodes and networks that offer great scalability. Data-intensive heterogeneous systems fall into two styles: distributed computing systems and parallel computing systems. The former are represented by MapReduce clusters and cyber-physical systems, and the latter by supercomputers and high-performance clusters. Among the many open research problems, this dissertation focuses on fault tolerance and storage systems. Distributed systems are heterogeneous mainly in their nodes and networks, and fault tolerance is a key technique for guaranteeing system performance, so this dissertation studies their speculative execution strategies. Parallel systems are heterogeneous mainly in their processors and storage, and storage performance is the bottleneck, so this dissertation studies data placement strategies for their heterogeneous storage systems. The contributions of this dissertation are as follows.

(1) Estimate Remain time with System Load (ERSL). For fault tolerance in data-intensive heterogeneous distributed systems, this dissertation proposes a speculative execution strategy called ERSL. Existing work does not consider the sharp changes in system load that occur in heterogeneous environments, which reduces the efficiency of those strategies. The key idea of ERSL is to use the linear relation between execution time and system load to estimate the remaining execution time of tasks and thereby improve straggler selection. ERSL improves on existing work in four aspects: the estimation of execution time, the judgment of task priority, the method of discovering stragglers, and the selection of the backup node. Experiments suggest that ERSL has a smaller estimation error and reduces job completion time by 10-15% compared with the LATE strategy. Under data skew the gain over LATE is larger: completion time is reduced by 21%. Owing to the efficiency of ERSL, cluster throughput increases by 10% and 17% compared with NAÏVE and LATE, respectively.

(2) Heterogeneity-aware Maximum Cost Performance (HMCP). For fault tolerance in data-intensive heterogeneous distributed systems, this dissertation proposes a speculative execution strategy called HMCP. Existing work tries to guarantee the overall benefit of the system without taking the heterogeneity of tasks and resources into consideration. The key idea of HMCP is to account for the differences in tasks, resources, and slot value. HMCP improves on existing work in three aspects: distinguishing task types, selecting a fast backup node, and modeling with slot value. Experiments suggest that HMCP reduces completion time by 12%-26% and 6%-13%, and improves system throughput by 18% and 11%, compared with LATE and MCP, respectively.

(3) Region Level Data Placement (RLDP). For the storage problem in data-intensive heterogeneous parallel systems, this dissertation proposes a data placement strategy called RLDP. To place data efficiently in a heterogeneous storage system, RLDP divides a large file into several small regions and selectively places the regions with the highest value onto the underlying file servers. RLDP has four steps: I/O tracing; cost modeling; region gain analysis; and placement and region redirection. Experiments suggest that under an inhomogeneous access pattern, RLDP improves read performance by 86.98% and write performance by 82.23% on average.

(4) Strip Level Data Placement (SLDP). For the storage problem in data-intensive heterogeneous parallel systems, this dissertation proposes a data placement strategy called SLDP. The default layout with a fixed-size stripe causes unbalanced I/O time between HDD nodes and SSD nodes and cannot fully exploit the performance advantage of SSDs. The key idea of SLDP is to vary the stripe size according to the access pattern and to deploy selected key regions with the optimal stripe configuration, which preserves the parallelism of the file system and requires no extra storage space. SLDP models the node cost, determines the optimal stripe configuration, selects key regions, and places and redirects the divided files in order to improve the overall performance of the storage system. Experiments suggest that even under a uniform access pattern, SLDP improves read and write performance by 51.3% and 44.6%, respectively, relative to the baseline. Under an inhomogeneous access pattern it achieves an improvement similar to that of RLDP.
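
The load-aware estimate behind ERSL can be illustrated with a minimal Python sketch. This is not the dissertation's implementation: the function names, the sample load and time history, and the use of an ordinary least-squares fit are all assumptions made for illustration. The abstract states only that execution time is treated as linear in system load. The LATE-style estimate, (1 - progress) / progress rate, is included for contrast.

```python
from statistics import mean


def fit_line(loads, times):
    """Least-squares fit of times ~ a + b * loads (assumed model form)."""
    mx, my = mean(loads), mean(times)
    sxx = sum((x - mx) ** 2 for x in loads)
    if sxx == 0:
        # All observations share one load value: fall back to a constant.
        return my, 0.0
    b = sum((x - mx) * (y - my) for x, y in zip(loads, times)) / sxx
    return my - b * mx, b


def late_remaining(progress, elapsed):
    """LATE-style estimate: (1 - progress) / (progress / elapsed)."""
    return (1.0 - progress) * elapsed / progress


def load_aware_remaining(progress, current_load, a, b):
    """Remaining time if a full task takes a + b * load at the current load."""
    return (1.0 - progress) * (a + b * current_load)


def pick_straggler(estimates):
    """Return the task id with the largest estimated remaining time."""
    return max(estimates, key=estimates.get)


# Hypothetical history for one node: full-task time observed at each load.
history_loads = [1.0, 2.0, 3.0, 4.0]
history_times = [100.0, 140.0, 180.0, 220.0]
a, b = fit_line(history_loads, history_times)

# A task reached 50% progress in 50 s at load 1.0, then load jumped to 4.0.
late_estimate = late_remaining(progress=0.5, elapsed=50.0)
ersl_like_estimate = load_aware_remaining(0.5, current_load=4.0, a=a, b=b)

# Hypothetical second task on a lightly loaded node, for straggler selection.
estimates = {
    "task_a": ersl_like_estimate,
    "task_b": load_aware_remaining(0.2, current_load=1.0, a=a, b=b),
}
straggler = pick_straggler(estimates)
```

In this made-up scenario the progress-rate estimate still reflects the earlier light load, while the load-aware estimate rises with the current load, so the two can rank candidate stragglers differently.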

Keywords/Search Tags:

Data-intensive computing, Heterogeneous computing, Distributed systems, Parallel systems, Speculative execution, Data Placement

Related items

1	Research On Wide-area Data-intensive Computing Systems For Spatial Data Processing
2	Research On Data Placement Strategy For Data-intensive Applications In Cloud
3	Parallel Optimization Of Data Intensive Computing On Sunway TaihuLight
4	Research On Optimization Of Map Reduce For Interactive Analysis On Big Data
5	Joint Scheduling Of Data And Computation In Geo-distributed Cloud Systems
6	Data Placement Strategy Towards Efficient Execution Of Scientific Workflows In Cloud Computing Platform
7	Research On Data Placement And Fault-Tolerant Scheduling For Applications Of Data Stream In Geo-distributed Clouds
8	Research On Data Placement For Distributed Storage Systems With Heterogeneous Resources
9	Design And Development Of Data-intensive Computing Oriented Ship Emergency Response System
10	Job Scheduling Technologies In Data Intensive Supercomputing Systems