
Research On Construction Of ETL Cluster Model Based On Task Scheduling

Posted on: 2013-08-13  Degree: Master  Type: Thesis
Country: China  Candidate: C Y Wang  Full Text: PDF
GTID: 2248330395453776  Subject: Computer application technology
Abstract/Summary:
As enterprises and their businesses expand, the data environments they face become more complex. ETL (extraction, transformation, and loading) is an important part of building a data warehouse and accounts for a large share of the workload. How to improve ETL processing capacity has gradually become a research focus, and some results have been achieved. However, existing ETL model theories consider only a few simple factors of the data environment, so they do not handle multiple processors in a distributed environment well.

This thesis starts from the design of the ETL tool model and its task scheduling strategies, and addresses some ETL data operation and efficiency problems in distributed data environments. The main work and innovations are as follows:

1. From the perspective of the entire execution process of distributed ETL, this thesis proposes and implements an improved distributed ETL model: an ETL cluster model based on task scheduling. The model consists of a workflow generation module and a task scheduling module. Introducing the task scheduling module addresses a limitation of previous theory, which pays attention only to workflow generation and ignores workflow execution. Cluster management of the processors improves the autonomy of the system and reduces the differences between the data sources and the network.

2. The ETL cluster model based on task scheduling adds a management function for ETL processors. This function enhances the stability and reliability of the system, which previous ETL model theory neglected.

3. Processor heterogeneity is addressed from two aspects: heterogeneity of hardware configuration and heterogeneity of software. For hardware heterogeneity, a load balancing algorithm for heterogeneous ETL clusters balances task execution across the processors in the cluster, which reduces the impact of cluster heterogeneity on ETL execution and makes full use of processor resources. For software heterogeneity, Web services technology is used to make the ETL implementation platform-independent.

4. A heuristic algorithm is used to study the task scheduling optimization problem of distributed ETL. Considering the characteristics of distributed ETL tasks, a discrete particle swarm algorithm is applied to the distributed ETL tool. Experimental results show that applying this algorithm is feasible and advantageous.
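The abstract does not give the details of the thesis's load balancing algorithm for heterogeneous ETL clusters. As a rough illustration of the general idea only, the sketch below uses a common greedy heuristic (largest task first, earliest projected finish time), which is a stand-in and not the author's method. The function name `assign_tasks`, the abstract task cost units, and the relative speed values are all assumptions made for the example.

```python
def assign_tasks(task_costs, speeds):
    """Greedy sketch: give each task to the processor that would finish it earliest.

    task_costs: hypothetical work units per ETL task (assumed, not from the thesis).
    speeds: hypothetical relative processing speed of each processor.
    """
    # finish[i] is the projected time at which processor i becomes idle.
    finish = [0.0] * len(speeds)
    assignment = {i: [] for i in range(len(speeds))}

    # Place the most expensive tasks first so small tasks can fill the gaps.
    ordered = sorted(enumerate(task_costs), key=lambda t: -t[1])
    for task_id, cost in ordered:
        # A faster processor needs less time for the same amount of work.
        best = min(range(len(speeds)),
                   key=lambda i: finish[i] + cost / speeds[i])
        finish[best] += cost / speeds[best]
        assignment[best].append(task_id)

    # The makespan is the time at which the slowest-finishing processor is done.
    return assignment, max(finish)


if __name__ == "__main__":
    plan, makespan = assign_tasks([8, 4, 4, 2, 2], [2.0, 1.0])
    print(plan, makespan)
```

In this sketch the processor with twice the speed ends up with more of the work, so both processors finish at roughly the same time. A real ETL cluster would also need to account for data location and network cost, which this example ignores.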
Keywords/Search Tags: distributed ETL tool, cluster system, task scheduling, load balancing algorithm of heterogeneous ETL cluster, discrete particle swarm optimization