Research and Implementation of a Distributed Integration Tool Combining ETL with Data Cleaning

Posted on: 2011-10-20
Degree: Master
Type: Thesis
Country: China
Candidate: X F Chen
GTID: 2298330452961306
Subject: Computer software and theory
Abstract/Summary:
The Data Warehouse has been put forward as a solution for integrating an enterprise's information silos, and it supports enterprise decision-making. ETL (Extract, Transform, Load) plays an important role in building a Data Warehouse, where it handles data migration and integration. ETL tools extract data from a number of heterogeneous systems and then transform the daily operational data into decision-oriented data suitable for the Data Warehouse. However, if the information flowing into the Data Warehouse is inaccurate and data quality cannot be guaranteed, the Data Warehouse cannot produce accurate analysis and will mislead the decision maker. Data cleansing therefore becomes another important step of data integration.
Most traditional ETL tools on the market are expensive and depend heavily on high-performance servers. To address these defects, we propose a distributed ETL tool that eases the hardware, price, and related issues.
This paper presents a distributed integration tool that combines ETL with data cleaning. First, to remove the master-server bottleneck of the previous model, we put forward the concept of circular distributed servers, which uses Agent technology to turn the server machines into distributed computing nodes. The servers form a ring at the logical layer, which improves the performance of the ETL tool and also simplifies distributed load balancing. Second, the system adopts a memory buffer and a multi-threaded design. Furthermore, the rule analysis engine, which is based on a memory database, enhances the rule analysis capability. Third, because traditional ETL tools lack data quality control, a data cleansing module is introduced.
In this paper, data cleansing principles, processes, and algorithms are analyzed and summarized, with the main focus on attribute cleansing and on similarity-based algorithms for eliminating duplicate records. The tool also provides users with a visual operating environment based on graphical models.
By combining ETL with data cleaning, this distributed integration tool makes full use of the advantages of both. It is not only an efficient integration tool but also ensures that accurate data is loaded into the data warehouse.
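The abstract does not name the specific duplicate-elimination algorithm used in the thesis. As a rough illustration of the general idea only, the following sketch compares records field by field and discards any record whose weighted similarity to an already kept record reaches a threshold. The function names, the field weights, the 0.85 threshold, and the sample records are all hypothetical, and the similarity measure (Python's standard `difflib.SequenceMatcher`) is a stand-in rather than the thesis's method.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two field values, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def record_similarity(r1: dict, r2: dict, weights: dict) -> float:
    """Weighted average of per-field similarities over the fields named in `weights`."""
    total = sum(weights.values())
    score = sum(
        w * field_similarity(str(r1.get(field, "")), str(r2.get(field, "")))
        for field, w in weights.items()
    )
    return score / total


def deduplicate(records, weights, threshold=0.85):
    """Keep the first record of each group of near-duplicates.

    Uses a simple pairwise comparison (O(n^2)); the threshold is an
    illustrative assumption, not a value taken from the thesis.
    """
    kept = []
    for rec in records:
        if all(record_similarity(rec, k, weights) < threshold for k in kept):
            kept.append(rec)
    return kept


# Hypothetical sample data: the second record differs from the first
# only in letter case and trailing whitespace.
records = [
    {"name": "Zhang Wei", "city": "Shanghai"},
    {"name": "zhang wei ", "city": "Shanghai"},
    {"name": "Li Na", "city": "Beijing"},
]
weights = {"name": 0.7, "city": 0.3}
unique = deduplicate(records, weights)
```

A pairwise scan like this is only practical for small inputs. In a distributed setting such as the one described, records would presumably be partitioned or sorted first so that each node compares only nearby candidates.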
Keywords/Search Tags: ETL, Data Cleaning, Data Integration, Circular Distributed Computing, Agent