Research And Application Of Data Stream Update Algorithm In Real-time Data Warehouse

Posted on: 2015-01-22
Degree: Master
Type: Thesis
Country: China
Candidate: Z B Pan
GTID: 2428330488499704
Subject: Software engineering
Abstract/Summary:
Real-time data integration is the key step in implementing a real-time data warehouse within an ETL architecture. The join operation directly determines the performance of real-time data integration; in a real-time data warehouse environment, the choice of join operation and data update method depends mainly on the data sources and the arrival rate of the data. An efficient algorithm for joining data streams is therefore key to guaranteeing real-time data integration. To carry out data integration efficiently and in real time under the skewed data distributions that are common in practice, a hybrid join algorithm is developed to perform the data stream join. The main contents of this paper are as follows. First, against the background of the wide use of data warehouses in modern enterprises, the paper analyzes in depth the differences and connections between the traditional data warehouse and the real-time data warehouse. It also introduces the details of the ETL architecture of a real-time data warehouse and examines its important components and key technologies. Then, the paper analyzes several common real-time data stream update algorithms, evaluates them on several key aspects such as multi-input join efficiency, probability of duplicate tuples, and I/O complexity, and sets out the advantages and disadvantages of each. For the skewed data distribution common in practice, the paper presents an extended hybrid join algorithm (EH-JOIN) to perform the data stream join: it modifies the traditional hash join method so that it can use an index, and stores frequently used master data in memory to address the disk I/O problem caused by high-speed streams. The paper also presents a cost model for the EH-JOIN algorithm. Experiments comparing the improved algorithm with other common join algorithms show that the proposed algorithm outperforms them. The proposed algorithm is also compared with a variant that lacks the non-swap part, which demonstrates the positive contribution of the non-swap part, and the accuracy of the cost model is validated experimentally. Finally, in a concrete project, the proposed EH-JOIN algorithm is applied to a golf company's existing data warehouse, where it supports real-time data updating well and assists decision makers in real-time data analysis. In summary, the data stream update algorithm proposed in this paper is valuable in terms of both theory and application.
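The idea of joining a stream against indexed master data while keeping frequently used master rows in memory can be illustrated with a minimal Python sketch. This is not the EH-JOIN algorithm from the thesis, whose details the abstract does not give. The class name `SkewAwareStreamJoin`, the parameters `hot_capacity` and `promote_after`, and the promotion rule are all hypothetical, and a plain dictionary stands in for an indexed, disk-resident master table.

```python
from collections import Counter


class SkewAwareStreamJoin:
    """Illustrative only: joins stream tuples to master data by key,
    keeping frequently requested master rows in an in-memory part."""

    def __init__(self, master_index, hot_capacity=2, promote_after=2):
        # Assumption: master_index is a dict standing in for an indexed
        # master table on disk; each miss below counts as one disk lookup.
        self.master_index = master_index
        self.hot_capacity = hot_capacity    # max rows kept in memory
        self.promote_after = promote_after  # accesses needed before caching
        self.hot = {}                       # in-memory part, never evicted here
        self.seen = Counter()               # per-key access counts
        self.disk_lookups = 0               # simulated disk I/O counter

    def _lookup(self, key):
        # Serve frequent keys from memory without touching the index.
        if key in self.hot:
            return self.hot[key]
        # Otherwise probe the index (simulated disk access).
        self.disk_lookups += 1
        row = self.master_index.get(key)
        # Promote a key once it has been requested often enough.
        if (row is not None
                and self.seen[key] >= self.promote_after
                and len(self.hot) < self.hot_capacity):
            self.hot[key] = row
        return row

    def join(self, stream):
        """Return (key, stream_payload, master_row) for each matching tuple."""
        out = []
        for key, payload in stream:
            self.seen[key] += 1
            row = self._lookup(key)
            if row is not None:
                out.append((key, payload, row))
        return out
```

With a skewed stream, most tuples hit the in-memory part after the first few accesses to a frequent key, so the `disk_lookups` counter grows much more slowly than the number of stream tuples. That is the effect the abstract attributes to keeping frequently used master data in memory.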
Keywords/Search Tags: Real-time Data Warehouse, ETL, Data Stream Update, Hash Index, Skewed Distribution