
Research And Implementation Of High Reliability Data Integration System Based On Cluster

Posted on: 2017-04-12
Degree: Master
Type: Thesis
Country: China
Candidate: H F Li
Full Text: PDF
GTID: 2308330485988109
Subject: Computer application technology
Abstract/Summary:
With the increasingly widespread application of IT in enterprises, IT systems have been built and updated continuously as enterprises develop. In the informatization of government agencies and large corporations, IT systems are typically constructed in a distributed and phased manner, which gives rise to information silos. As a solution, data integration (DI) technology, also known as ETL technology, logically and physically integrates data of differing sources, formats, and characteristics, and then provides comprehensive data sharing for the enterprise. After years of development, traditional DI technology has been widely applied in data warehousing. In recent years, with the rise of big data and cloud computing, companies have become increasingly dependent on data, and information sources have grown more diverse (e.g., mobile devices and the Internet), drawing growing attention to the problem of integrating heterogeneous data.

Existing DI architectures largely satisfy functional and usability requirements, but they do not meet the demands for high efficiency, high reliability, and scalability in big-data environments. We therefore focus on the concurrent execution of ETL workflows and on transaction-based ETL data processing, redesign the architecture accordingly, and propose a high-reliability DI architecture based on cluster computing.

This thesis is an extension of the Intelligent Big Data Analysis System. Based on actual requirements, we designed a high-reliability cluster-based DI architecture that uses the open-source stream-processing framework Apache Storm to process ETL data. We then proposed an ETL workflow parallelization method and solved the key problem of data caching. To meet the reliability requirements, we proposed a transaction-based ETL data processing method, designed a concurrency control protocol called process-commit, and solved key problems including transaction coordination, transaction triggering, and transaction state management. We also studied and designed a mapping technique that allows logical ETL workflows to run as ETL tasks on the data processing engine. Finally, we conducted a series of experiments to verify the correctness of the system.
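The abstract does not include code; as a rough sketch of how an ETL workflow maps onto a Storm topology (an extract spout feeding parallel transform bolts and a load bolt), assuming Storm 2.x APIs, the wiring might look like the following. All class, stream, and field names here (EtlTopologySketch, ExtractSpout, TransformBolt, LoadBolt, "record", "cleaned") are hypothetical illustrations, not taken from the thesis.

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

import java.util.Map;

public class EtlTopologySketch {

    // Extract: a stand-in source that emits raw records; a real system
    // would read from databases, logs, or message queues instead.
    public static class ExtractSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final String[] samples = {" Alice,42 ", "BOB,17", " carol,99 "};
        private int i = 0;

        @Override
        public void open(Map<String, Object> conf, TopologyContext ctx,
                         SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            Utils.sleep(100);
            collector.emit(new Values(samples[i++ % samples.length]));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("record"));
        }
    }

    // Transform: cleans each record; this is the stage the thesis's
    // workflow parallelization method would run concurrently.
    public static class TransformBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            String raw = input.getStringByField("record");
            collector.emit(new Values(raw.trim().toLowerCase()));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("cleaned"));
        }
    }

    // Load: writes cleaned records to the target store; printed here.
    public static class LoadBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            System.out.println("load: " + input.getStringByField("cleaned"));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("extract", new ExtractSpout(), 1);
        // Parallelism hint 4: several transform executors run concurrently.
        builder.setBolt("transform", new TransformBolt(), 4)
               .shuffleGrouping("extract");
        builder.setBolt("load", new LoadBolt(), 1)
               .shuffleGrouping("transform");

        try (LocalCluster cluster = new LocalCluster()) {
            cluster.submitTopology("etl-sketch", new Config(),
                                   builder.createTopology());
            Utils.sleep(5000);
        }
    }
}
```

Note that this sketch only shows the mapping of ETL stages onto a topology; the reliability guarantees the thesis targets would sit on top of it, for example via Storm's tuple acking or the transaction-based processing and process-commit protocol described above, whose details are specific to the thesis.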
Keywords/Search Tags: Data Integration, ETL workflow, Storm cluster, Transaction, concurrency control