The Research and Design of a Distributed ETL System

Posted on: 2015-05-01
Degree: Master
Type: Thesis
Country: China
Candidate: B Liu
GTID: 2298330467462199
Subject: Computer Science and Technology

Abstract/Summary:
With the rapid development of the Internet and the Internet of Things, an age of information explosion has arrived. First, an enterprise runs many kinds of reporting systems, covering areas such as marketing, finance, human resources, and production. Second, with the rise of Weibo, social networking, and e-commerce, a flood of data is generated almost every minute. Finally, as the Internet of Things develops, large numbers of wireless devices of different types come into use, so data sources are becoming more complex, data formats more diverse, and data locations more dispersed. For an enterprise, how it handles this data and extracts useful information from it bears directly on its survival.

Data aggregation is the key technology for solving this problem, and the ETL tool is a typical data aggregation technology. ETL stands for Extract, Transform, and Load. Traditional ETL tools typically have the following disadvantages: their centralized execution prevents them from handling massive amounts of distributed data effectively, and they usually require expensive, high-performance machines.

To address these shortcomings, this thesis proposes a distributed ETL system based on Hadoop. In the data extraction phase, the system provides two extraction strategies, full extraction and incremental extraction, and supports extraction from structured, semi-structured, and unstructured data sources, whereas almost all traditional ETL tools can extract only from structured data sources. In the data transformation phase, the system uses Hadoop as the transformation engine, so it can handle massive distributed data. The system also introduces Hive and provides a data transform inputer, which lets users write their conversion rules quickly.
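The difference between the two extraction strategies named in the abstract can be illustrated with a minimal sketch. The abstract does not say how the thesis detects changed data, so the timestamp watermark used here is an assumption, and `Record`, `full_extract`, `incremental_extract`, and `next_watermark` are hypothetical names chosen for illustration, not part of the described system.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Record:
    key: str
    value: str
    updated_at: int  # hypothetical modification timestamp (assumption)


def full_extract(source: Iterable[Record]) -> List[Record]:
    """Full extraction: copy every record, regardless of when it changed."""
    return list(source)


def incremental_extract(source: Iterable[Record],
                        watermark: Optional[int]) -> List[Record]:
    """Incremental extraction: keep only records changed after the watermark.

    With no watermark yet (first run), fall back to a full extraction.
    """
    if watermark is None:
        return full_extract(source)
    return [r for r in source if r.updated_at > watermark]


def next_watermark(extracted: List[Record],
                   previous: Optional[int]) -> Optional[int]:
    """Advance the watermark to the newest timestamp seen so far."""
    if not extracted:
        return previous
    newest = max(r.updated_at for r in extracted)
    return newest if previous is None else max(previous, newest)
```

In this sketch the first run performs a full extraction, and each later run passes the stored watermark so that only newer records are read. Other change-detection approaches, such as log-based capture or snapshot comparison, would fit the same interface.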
Keywords/Search Tags: Data Aggregation, ETL, Hadoop, Distributed System