Research On Design And Optimization Technologies Of Distributed Data Warehouse In Enterprise Environments

Posted on: 2017-07-27
Degree: Master
Type: Thesis
Country: China
Candidate: X L Gao
GTID: 2348330518496588
Subject: Electronic Science and Technology
Abstract/Summary:
Since the beginning of the new century, driven by the development of Internet and IoT technology, the amount of data available to companies has grown steadily. Data is no longer needed only for daily business processing, and many companies have begun to build large data warehouses to store and analyze the massive volumes of data they face. A data warehouse collects user data of different structures from different sources and classifies it by subject. Analyzing data that belongs to the same subject makes the results more accurate and reliable and gives managers better reference data. Traditional integrated data warehouses can no longer bear the pressure of massive data processing because of their limits in scalability and performance. The emergence of Hadoop has demonstrated the computing power of distributed technology, and the distributed architecture is becoming the direction of development for data warehouses.

In view of this situation, this paper carries out analysis and design in three areas: the distributed architecture of the data warehouse, the unified management of metadata, and the combination of ETL (Extract-Transform-Load) work with Hadoop. A complete system architecture is designed that combines Hadoop, MySQL, distributed storage technology, and Impala as the parallel query technology. ETL tasks are carried out as MapReduce jobs. For metadata management, the paper studies metadata management mechanisms and the metadata implementation scheme of the Impala query engine, and designs and implements a centralized metadata management module based on MySQL.

In this system, source data is first processed by MapReduce tasks for extraction and transformation, and the intermediate results are then divided according to a specified data segmentation method. A MySQL database stores and manages the system metadata in the form of a library (lib). An efficient single storage engine provides efficient storage and scanning of the data, and queries are executed by the Impala parallel query engine. The query module and the storage module share a common metadata scheme, which achieves unified management of metadata. With this system, enterprise users can manage massive data efficiently, and data processing and analysis become more convenient. The results of this work can provide data support for formulating and adjusting enterprise policies. Finally, the performance of the distributed system is tested experimentally, and the results show that the system is effective in handling enterprise data.
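
The data flow described in the abstract (MapReduce-style extraction and transformation, segmentation of the intermediate results, and a shared metadata catalog) can be illustrated with a minimal single-process sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the record layout `user_id,subject,amount`, the hash-based segmentation on the subject, the `partition_meta` table, and all function names are hypothetical. The standard-library `sqlite3` module stands in for MySQL, and plain Python functions stand in for Hadoop MapReduce jobs.

```python
import sqlite3
import zlib
from collections import defaultdict


def map_extract(line):
    """Map step: parse one raw line (assumed layout: user_id,subject,amount)."""
    user_id, subject, amount = line.strip().split(",")
    yield (subject, user_id), float(amount)


def reduce_transform(key, values):
    """Reduce step: aggregate all amounts that share one (subject, user_id) key."""
    return key, sum(values)


def run_etl(lines):
    """Run map, group by key (the shuffle), then reduce, all in one process."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_extract(line):
            groups[key].append(value)
    return [reduce_transform(k, v) for k, v in sorted(groups.items())]


def partition_index(subject, num_partitions):
    """Assumed segmentation method: a stable hash of the subject."""
    return zlib.crc32(subject.encode("utf-8")) % num_partitions


def segment(rows, num_partitions):
    """Divide the intermediate results so that one subject stays in one partition."""
    partitions = [[] for _ in range(num_partitions)]
    for (subject, user_id), total in rows:
        idx = partition_index(subject, num_partitions)
        partitions[idx].append(((subject, user_id), total))
    return partitions


def register_partitions(conn, table_name, partitions):
    """Record partition metadata in a central catalog (sqlite3 stands in for MySQL)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS partition_meta "
        "(table_name TEXT, partition_id INTEGER, row_count INTEGER)"
    )
    conn.executemany(
        "INSERT INTO partition_meta VALUES (?, ?, ?)",
        [(table_name, i, len(p)) for i, p in enumerate(partitions)],
    )
    conn.commit()


def non_empty_partitions(conn, table_name):
    """A query layer could read the same catalog to skip empty partitions."""
    cur = conn.execute(
        "SELECT partition_id FROM partition_meta "
        "WHERE table_name = ? AND row_count > 0 ORDER BY partition_id",
        (table_name,),
    )
    return [row[0] for row in cur.fetchall()]


if __name__ == "__main__":
    raw = ["u1,sales,10.0", "u1,sales,5.5", "u2,sales,1.0", "u1,hr,2.0"]
    rows = run_etl(raw)
    parts = segment(rows, num_partitions=4)
    conn = sqlite3.connect(":memory:")
    register_partitions(conn, "user_totals", parts)
    print(rows)
    print(non_empty_partitions(conn, "user_totals"))
```

The point of the sketch is the shared catalog: the storage side writes partition metadata once, and the query side reads the same table instead of keeping its own copy, which mirrors the unified metadata management the abstract describes.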
Keywords/Search Tags:distributed systems, data warehouse, ETL, parallel query, OLAP