
Design and Implementation of an ETL System for Unstructured Text Data Based on Hadoop

Posted on: 2017-06-11
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Xi
Full Text: PDF
GTID: 2348330512952061
Subject: Software engineering
Abstract/Summary:
Over the past several decades, computer hardware and software technology has developed continuously, informatization has advanced rapidly across industries, and organizations have built their own internal information systems. At the same time, with the rapid growth of the mobile Internet, the Internet of Things (IoT) and social media, data sources have multiplied and data volumes have grown. Integrating the scattered, disorderly and non-standard data from sources inside and outside an organization, and providing data preparation and data sharing for business analysis and decision-making, is a major challenge in the informatization process. ETL stands for extract, transform, load. It is the key to obtaining high-quality data in a data warehouse, and it covers the process of extracting, cleaning, transforming and loading data scattered across various business systems. This thesis designs and implements an ETL system based on the Hadoop ecosystem that extracts, cleans, transforms and loads unstructured text data in a big data environment.

The primary work of this thesis is as follows:
(1) A survey and analysis of the theory of ETL systems and the current state of ETL.
(2) A survey and analysis of the Hadoop ecosystem technologies on which the ETL system is based.
(3) Requirements analysis, design of the overall architecture, and detailed design and implementation of the four core modules of the system: service interface, workflow scheduling, workflow executor, and data flow executor.
(4) Testing and analysis of the system, which demonstrated that the ETL system built on the Hadoop ecosystem integrates unstructured text data reliably and efficiently.

The ETL system designed and implemented in this thesis passed testing in the simulation environment of a domestic telecom carrier. Using distributed technology from the open source Hadoop ecosystem, the system extracts, cleans, transforms and loads large volumes of user communication data in unstructured text format, meeting the need for fast, efficient and correct data integration in data analysis and data mining. The system was stable and reliable in system testing and achieves its design goals.
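
The four ETL stages named in the abstract can be illustrated with a minimal single-process sketch. This is not the thesis's implementation: the pipe-delimited record layout (`caller|callee|duration`), the `CallRecord` type and every function name below are hypothetical, chosen only to show how extract, clean, transform and load chain together on text lines. In a Hadoop-based system each stage would presumably run as a distributed job rather than as local generators.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional


@dataclass
class CallRecord:
    """Hypothetical typed record for one communication event."""
    caller: str
    callee: str
    duration_s: int


def extract(lines: Iterable[str]) -> Iterator[str]:
    """Extract: yield raw lines from a text source, skipping blank ones."""
    for line in lines:
        if line.strip():
            yield line


def clean(raw: str) -> Optional[List[str]]:
    """Clean: split on '|' and trim; reject rows with missing or extra fields."""
    fields = [f.strip() for f in raw.split("|")]
    if len(fields) != 3 or not all(fields):
        return None
    return fields


def transform(fields: List[str]) -> Optional[CallRecord]:
    """Transform: convert cleaned fields to a typed record, or None if invalid."""
    caller, callee, duration = fields
    try:
        seconds = int(duration)
    except ValueError:
        return None
    if seconds < 0:
        return None
    return CallRecord(caller, callee, seconds)


def load(records: Iterable[CallRecord], target: List[CallRecord]) -> int:
    """Load: append records to the target store and return how many were loaded."""
    count = 0
    for rec in records:
        target.append(rec)
        count += 1
    return count


def run_etl(lines: Iterable[str], target: List[CallRecord]) -> int:
    """Chain the four stages; rows rejected by clean or transform are dropped."""
    cleaned = (clean(raw) for raw in extract(lines))
    typed = (transform(f) for f in cleaned if f is not None)
    return load((r for r in typed if r is not None), target)
```

Keeping each stage as a separate function mirrors the separation the abstract describes, so a rejected row is dropped at the stage that detects the problem rather than failing the whole run.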
Keywords/Search Tags: ETL, Hadoop, ETL system, ETL scheduling, Data Integration