Font Size: a A A

Design And Implementation Of ETL Tools Based On Assembly Line Work

Posted on:2015-03-24Degree:MasterType:Thesis
Country:ChinaCandidate:S K LiangFull Text:PDF
GTID:2268330428996044Subject:Software engineering
Abstract/Summary:PDF Full Text Request
In computing, extract, transform, and load (ETL) refers to a processin database usage and especially in data warehousing. Firstly, extracts data from outsidesources, then transforms it to fit operational needs, which can include cleaning data,finally loads it into the end target database(more specifically, operational data store, datamart, or data warehouse). ETL first extracts dispersive or heterogeneous data, such asrelational data, flat data files, to the temporary middle-tier for cleaning, transforming andintegrating and then loads data from middle-tier into the data warehouse or data marteventually. This will be the basis of On-Line Analytical Processing (OLAP) and datamining. In general, ETL consists of data extracting, data cleaning, data transforming andtask scheduling etc. The business data from different sources are extracted, transformedand loaded into a target database by ETL tools.With the rapid development of Internet, a lot of enterprise internal and externalinformation needs to be integrated, Internal information contains the information ofvarious departments, external information includes information fromforeign cooperation between enterprises, such as sales and purchase subsystem.Enterprise use these information to make the correct decision will undoubtedly enhancethe competitiveness of effectively for the enterprise, there are many characteristicproblems in the source data from enterprise. For instance, the databases of the source datacan be ordinary databases or non-databases. There is a variety of the types of database, thedatabase can be run on different platforms. Generally speaking, the source data hascharacteristic problems, such as large volume, differed format and dis-unified platform,which cannot be fully addressed by ETL available on the market. This article, through thestudy of the key concerns of ETL tool design, raises solutions about the existing sourcedata problems. ODBC standard is proposed by Microsoft to access any data with different interfacesand diverse management methods. The management method of ODBC unifies the datasource configuration, and the uniformed access interface makes the programming easier.ODBC also supports Linux OS. Most databases have provided drivers for ODBC, so thatthe developers only need to load the corresponding driver path. In this thesis, we use ETLtools to unify the accessing interfaces of different databases with ODBC, and solve theuniform problem of the source data from different platform. The parallel work mode ofassembly line is helpful to improve the work efficiency. From industrial production toCPU instruction stream, both of them are using the assembly line parallel work mode. Thebackground service of ETL in this thesis adopts this working mode to solve the problem oflarge amount of data and low efficiency. The data from source databases should beupdated in real time in the specific applications of ETL; therefore ETL tools are requiredto extract source data at certain times. The data extraction includes full extraction andincremental extraction. Full extraction extracts all items from source databases, andincremental extraction only extracts items that have been changed in source databases. Wechoose different methods for different needs; we use incremental extraction to reduce thetransforming and loading operations of ETL, while increase the data extraction work.Incremental extraction methods consists timestamps, triggers, log tables and CDC, etc. Inthis thesis, we choose the ETL tools of the incremental extraction with timestamps. Wedesign the system with layers. The system is divided into an interface layer, aconfiguration layer and a service layer. The interface layer generates configuration filesaccording to the demands of users. The service layer realizes the requirements of users byanalyzing the profiles. The interface layer is developed based on MFC, so it can only berun on Windows OS. Instead, the service layer can run on different OS. Using MVCframework in designing the interface layer, we can improve the efficiency of the systemdevelopment and design. The essence of the ETL tool processing is the transformation ofthe data operation. The transformation includes attribute evaluation, attribute merge,attribute split, and conversion etc. The attribute evaluates integers and decimals etc. withthe operations of addition, subtraction, multiplication, and division, and the attribute split and the merge process the character types.In this thesis, the ETL is designed for integrating the enterprises information frominternal and external. It can access databases on SQL Server, Oracle, DB2, and can run onboth Windows and Linux OS. The method realizes the functions of incremental extraction,attribute evaluation, attribute merge, attributes split, and conversion. This thesis discussesthe ETL tools in the aspects of research background, key knowledge of the ETL, thesystem overall design, system detailed design and system implementation.
Keywords/Search Tags:ETL, Incremental extraction, MVC, Assembly line, ODBC, Data transformation
PDF Full Text Request
Related items