Study on Key Techniques of the Real-Time Data Warehouse Based on MapReduce Architecture

Posted on: 2012-02-29
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J G Shi
GTID: 1228330467982751
Subject: Computer software and theory
Abstract/Summary:
With the development of digital technology and the spread of computer applications, many enterprises and organizations use computers and related information technologies to manage their data. Computers are well suited to collecting, storing, and processing data. The data accumulated year by year from operational systems, such as production monitoring data, medical data, vital statistics, financial and economic data, and marine data, are assets of the enterprise. As market competition intensifies and information needs grow, mining patterns from these data to support production and marketing decisions becomes more and more important. Data warehouse technology emerged to meet this need, and data warehouses and the applications built on them are active topics in both academia and industry.

As enterprises become more familiar with the data warehouse, its decision-support capabilities are increasingly used to drive better business decisions. However, the traditional data warehouse is updated periodically, so the data used for decision making do not include the latest production data. Strategic decisions based on the traditional data warehouse therefore cannot meet real-time requirements. As the pace of business accelerates, the volume of real-time data to be analyzed grows faster and faster. The data warehouse must support rapid business analysis and deliver the latest information to decision makers as soon as possible, so that they can respond quickly to rapidly changing business conditions. Real-time data warehouse technology arose in response. There are many differences between the real-time data warehouse and the traditional data warehouse.
The real-time data warehouse not only provides real-time data for business decisions but also provides faster query analysis to users. We therefore study key issues in the real-time data warehouse, including real-time data architecture modeling, scheduling of updates and queries, parallel data warehouse query technology, and parallel data cube construction. Our work covers the following major aspects.

Firstly, this dissertation designs a general architecture for the real-time data warehouse. We study in more depth the structure and design of the real-time data storage area, which is flexible, variable, and very important, including the ODS partition, the alternating two-mirror partition, the copy partition of the data warehouse, and the real-time partition with multi-level caches. We then compare these design methods for the real-time data storage area and analyze the application environment to which each is best suited.

Secondly, the dissertation proposes a priority-based balance scheduling algorithm for updates and queries (PBBS) in the real-time data warehouse and describes its framework and idea in detail. The algorithm considers the priorities of update and query tasks, the real-time status of the task queues, and feedback on system resources to schedule tasks in parallel. As a result, PBBS can adjust the allocation of resources between updates and queries according to user requirements, make rational use of system resources, and ensure that high-priority tasks are scheduled first. It thus reduces the response time of important queries and improves the freshness of important data.

Thirdly, the dissertation proposes a QoS-based scheduling algorithm for updates and queries in the real-time data warehouse.
First, the algorithm defines QoS parameters related to user queries, including the expected query response time and the acceptable real-time data delay. Then, according to the specific QoS requirements of the queries, the algorithm schedules updates and queries in real time. In this way it can adjust the running order of tasks and use system resources effectively to give users faster query responses and fresher data, in line with their QoS requirements.

Fourthly, the dissertation designs and implements parallel operations on traditional relational data based on the MapReduce framework, such as selection, projection, join, division, and aggregation. We then propose ChunkDB, a distributed relational database based on a chunk structure, and design its architecture, data chunk mode, data storage structure, sub-block allocation strategy, metadata, fault tolerance, and scalability. Finally, we design MapReduce computation on top of ChunkDB and extend the MapReduce framework so that it is compatible with ChunkDB, allowing data in ChunkDB to be accessed easily and efficiently.

Fifthly, the dissertation proposes an efficient parallel Dwarf data cube construction algorithm using the MapReduce framework. The algorithm first divides the traditional Dwarf cube into several independent sub-Dwarf cubes and then builds, queries, and updates the Dwarf cube in parallel using MapReduce. The parallel Dwarf algorithm provides rapid cube construction and a high compression ratio for data storage. It also compensates for the lack of indexes in MapReduce: the self-indexing mechanism of the Dwarf cube structure enables fast queries on the data cube.
The algorithm also overcomes the low incremental update performance of the traditional Dwarf cube. In addition, the parallel Dwarf algorithm scales well: as the data grow, new nodes can be added to the cluster dynamically to improve the performance of the Dwarf cube.

Finally, we design and implement a prototype system, MR-RTDWH, which applies the theories and approaches developed in this dissertation: the architecture of real-time data storage, scheduling of updates and queries, parallel relational operations based on MapReduce, the integration of the MapReduce framework with a relational database, and the parallel construction of the real-time data cube. The system demonstrates the validity and efficiency of these theories and approaches.

In conclusion, this dissertation studies the real-time data warehouse based on MapReduce and its key techniques, and proposes several novel solutions to the research issues involved. Theoretical analysis and experiments show that these approaches are efficient and effective. The algorithms and models provide a sound foundation for future research on the real-time data warehouse, and the approaches and techniques may contribute to the construction and development of data-intensive computing and cloud computing systems.
Keywords: real-time data warehouse, change data capture, ODS, real-time scheduling, QoS, data cube, MapReduce, parallel Dwarf
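
Illustrative sketch: the abstract lists selection, projection, join, and aggregation among the relational operations implemented on MapReduce. The general idea of expressing such operators as map and reduce functions can be sketched as follows. This is a minimal in-memory illustration, not the dissertation's implementation: `run_mapreduce`, the `sales` and `regions` relations, and all function names are hypothetical, and the shuffle phase is simulated with a dictionary rather than a real MapReduce cluster or ChunkDB.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Hypothetical in-memory stand-in for a MapReduce job: map, shuffle, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)      # shuffle: group map output by key
    results = []
    for key in sorted(groups):             # process keys in a deterministic order
        results.extend(reducer(key, groups[key]))
    return results


# Assumed sample relations, invented for illustration only.
sales = [
    {"region": "north", "product": "a", "amount": 10},
    {"region": "south", "product": "b", "amount": 5},
    {"region": "north", "product": "b", "amount": 7},
    {"region": "south", "product": "a", "amount": 0},
]
regions = [
    {"region": "north", "manager": "li"},
    {"region": "south", "manager": "wang"},
]


def select_project_mapper(row):
    # Selection (amount > 0) and projection (region, amount) in the map phase.
    if row["amount"] > 0:
        yield row["region"], row["amount"]


def sum_reducer(key, values):
    # Aggregation: SUM(amount) GROUP BY region.
    yield key, sum(values)


def join_mapper(tagged_row):
    # Tag each row with its source relation and emit it under the join key.
    source, row = tagged_row
    yield row["region"], (source, row)


def join_reducer(key, values):
    # Reduce-side equi-join: pair every sales row with every regions row per key.
    left = [row for source, row in values if source == "sales"]
    right = [row for source, row in values if source == "regions"]
    for l_row in left:
        for r_row in right:
            yield {**l_row, **r_row}


totals = run_mapreduce(sales, select_project_mapper, sum_reducer)
tagged = [("sales", r) for r in sales] + [("regions", r) for r in regions]
joined = run_mapreduce(tagged, join_mapper, join_reducer)
```

Here selection and projection need only the map phase, while aggregation and join rely on the shuffle to bring rows with the same key to one reducer. A distributed implementation would apply the same decomposition to partitioned data.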