
The Design And Implementation Of Teradata Data Warehouse Log Analysis System Based On Hadoop

Posted on: 2015-06-02
Degree: Master
Type: Thesis
Country: China
Candidate: Y Y Dong
Full Text: PDF
GTID: 2308330461455045
Subject: Software engineering

Abstract/Summary:
eBay Inc. uses the Teradata enterprise data warehouse solution to manage business data and provide comprehensive analytical information for decision-making. As business analysis requirements grow more complex, the number of items to be analyzed in each subject area increases and the number of tables created in the data warehouse grows dramatically. Meanwhile, the volume of data in each table grows over time. This massive data occupies petabytes of system space, and complex queries consume extremely high CPU resources. Therefore, tracking the relationships between tables and estimating table usage on the basis of data flow can help administrators clean up system space in a timely manner, reduce data redundancy, and lower the cost of data processing.

This thesis introduces the design and implementation of a Teradata data warehouse log analysis system based on Hadoop, referred to below as DBQL Parser. DBQL Parser analyzes the system tables that record user query logs in the Teradata data warehouse. To meet the requirements on data volume and processing efficiency, the system extracts data from the Teradata data warehouse, loads it into HDFS, and then uses the Hadoop computing platform to process the massive data in a distributed, parallel fashion. To match the characteristics of the tables and the required processing methods, the system uses the Cascading framework on top of Hadoop to organize the data processing flow as a set of pipes (a minimal sketch of such a pipe assembly is given below). Relying on the Teradata SQL Parser API to tokenize the query text, the system extracts the target table that a user has created, deleted, inserted into, or updated, and traces the source tables for each target table on the basis of the other log data, thereby tracing the upstream and downstream of every table in the data warehouse. The system also extracts script execution plan information for management convenience.

This thesis describes the project background of the DBQL Parser log analysis system, the current state of big data processing technologies, and the technologies used in system development, including the Teradata data warehouse, the Hadoop computing platform, the Cascading framework, and Maven. It then focuses on the requirement analysis and overall design of the system. Starting from an analysis of the target system tables used to derive the relationships between tables, the thesis lists the functional requirements, such as query text preprocessing, target table and source table extraction, alias processing, and Query Band analysis, as well as the non-functional requirements, such as task scheduling, data volume, and execution time. Detailed key use case descriptions are then given, with use case diagrams that define the different system actions. In the overall design, the thesis presents the layered structure and modular design philosophy of the system through an architecture diagram and a function module diagram, and describes the functions and dependencies of each layer and module in detail.

The thesis also covers the detailed design and implementation of the task scheduling module, the Cascading assembly module, and the query log analyzer module. For the task scheduling module, a scheduling flow diagram explains the task division and automatic execution. For the Cascading assembly module, a detailed class diagram shows the dependencies between classes, and a data processing flow chart shows the logical sequence of pipeline links.
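The thesis presents the real pipe assembly in the Cascading assembly module; purely as an illustration of what a pipe-formed flow of this shape can look like in Cascading 2.x, the sketch below reads exported query-log lines from HDFS and pulls out a candidate target table name. The paths, field names, and the simplistic regex are assumptions made for the example; the actual system tokenizes query text with the Teradata SQL Parser API rather than regular expressions.

```java
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.regex.RegexFilter;
import cascading.operation.regex.RegexParser;
import cascading.pipe.Each;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class TargetTableFlow {
  // Regex standing in for real SQL tokenization: capture the table name
  // following INSERT INTO / UPDATE / CREATE TABLE / DELETE FROM.
  private static final String TARGET =
      "(?i)(?:insert\\s+into|update|create\\s+table|delete\\s+from)\\s+([\\w.]+)";

  public static void main(String[] args) {
    // Source: raw DBQL query text exported from Teradata onto HDFS
    // (illustrative path).
    Tap source = new Hfs(new TextLine(new Fields("offset", "line")),
        "/data/dbql/querytext");
    // Sink: one extracted target-table name per line.
    Tap sink = new Hfs(new TextLine(), "/data/dbql/target-tables",
        SinkMode.REPLACE);

    Pipe pipe = new Pipe("extract-target-tables");
    // Keep only lines that contain a DML/DDL statement of interest...
    pipe = new Each(pipe, new Fields("line"), new RegexFilter(TARGET));
    // ...then parse out capture group 1, the target table name.
    pipe = new Each(pipe, new Fields("line"),
        new RegexParser(new Fields("targetTable"), TARGET, new int[] { 1 }));

    FlowDef flowDef = FlowDef.flowDef()
        .setName("dbql-target-tables")
        .addSource(pipe, source)
        .addTailSink(pipe, sink);

    Flow flow = new HadoopFlowConnector(new Properties()).connect(flowDef);
    flow.complete();
  }
}
```

Organizing the computation as pipes in this way lets Cascading plan the assembly into Hadoop MapReduce jobs, which is what allows the same flow definition to scale out across the cluster.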
For the query log analyzer module, the thesis describes the alias processing algorithms in particular and gives the implementation code for target table acquisition and Query Band analysis (a minimal query band parsing sketch is given below).

The log analysis system has been put into effective production use. Deployed on eBay's Ares Hadoop cluster, it spends about half an hour per day analyzing the roughly 30 GB of incremental data generated daily by the Teradata query log system tables, a workload that would take a single machine about 40 hours, which greatly improves the feasibility and efficiency of the project. The analysis results capture the relationships between tables and their schedule information, which effectively helps system administrators understand table usage and manage system storage space.
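For context on the Query Band analysis: a Teradata query band is a semicolon-delimited list of name=value pairs attached to a session or transaction, for example 'JobName=daily_load;Owner=etl;'. The thesis contains the system's actual implementation; the sketch below is only a minimal illustration of parsing that layout, and the class and method names are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: splits a Teradata query band string such as
// "JobName=daily_load;Owner=etl;" into ordered key/value pairs.
public final class QueryBandParser {
  public static Map<String, String> parse(String queryBand) {
    Map<String, String> pairs = new LinkedHashMap<>();
    if (queryBand == null || queryBand.isEmpty()) {
      return pairs;
    }
    for (String entry : queryBand.split(";")) {
      int eq = entry.indexOf('=');
      if (eq > 0) {
        pairs.put(entry.substring(0, eq).trim(),
                  entry.substring(eq + 1).trim());
      }
    }
    return pairs;
  }
}
```

Pairs like these are what make query bands useful for tracing schedule information: they can associate a logged query with the ETL job and step that issued it.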
Keywords: DBQL Parser System, Big Data Processing, Hadoop, Cascading