| With the arrival of the big data era, competition among enterprises is no longer limited to the operational level. Especially in emerging fields such as Internet commerce, how to make full use of a data warehouse to support scientific corporate strategic decisions has become a current research focus in the industry. As the business develops, an enterprise inevitably has to update the data in its data warehouse. The main difficulty in updating table data is the slowly changing dimension problem, which is one of the major problems in building and operating a data warehouse. Against this background, this paper makes the following contributions. (1) It analyzes two key problems in updating changed data in the data warehouse: the extraction mode and slowly changing dimensions. For each approach, it identifies the business scenarios the approach suits, which makes the discussion applicable and flexible for the problem of updating changed data. (2) It analyzes the traditional ETL methods for updating changed data in a data warehouse and finds that they have serious drawbacks: they use data inefficiently, the data cannot be regenerated, and it is difficult to retain the history of data changes. The paper then proposes an optimization for associating different data. (3) Combining the two results above, it proposes a new algorithm that uses a Hive external table and a Hive internal table to filter the log and then uses the zipper table algorithm to update the changed data. The algorithm first uses the Hive external table and the Hive internal table to filter the Binlog data stored in the ODS layer of the data warehouse, which is partitioned by target table name and time. This yields a Binlog snapshot table of the changed data, from which a snapshot of the changed data table is built. 
The algorithm then uses the Hive zipper table algorithm to update the historical data in the target table with this snapshot table. The zipper table algorithm gives each record a lifecycle and adds an extra field used to judge the record's status, so it records the historical changes of the data and makes queries for the most recent data more efficient. It thereby addresses the deficiencies of the traditional algorithm. (4) Taking e-commerce data as experimental data, the paper tests three main properties of the new algorithm: data efficiency, safe data rollback, and retention of data history. Based on the test results, the paper analyzes in detail the advantages and disadvantages of the old and new algorithms. |
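
The zipper table update described above can be illustrated with a minimal sketch. This is not the paper's Hive implementation: it is a plain Python stand-in, and the names `Version`, `latest_changes`, `zipper_update`, `as_of`, the sentinel date `9999-12-31`, and the event fields `key`, `value`, `ts` are all assumptions made for illustration. It assumes one load per day and ignores deletions.

```python
from dataclasses import dataclass, replace
from datetime import date, timedelta
from typing import Dict, List

OPEN_END = "9999-12-31"  # assumed sentinel: marks the currently valid version


@dataclass(frozen=True)
class Version:
    key: str         # business key of the record
    value: str       # tracked attributes, collapsed into one field here
    start_date: str  # first day this version is valid (ISO date)
    end_date: str    # last day this version is valid, or OPEN_END


def latest_changes(events: List[dict]) -> Dict[str, str]:
    """Reduce filtered change-log events to the last value seen per key."""
    snapshot: Dict[str, str] = {}
    for event in sorted(events, key=lambda e: e["ts"]):
        snapshot[event["key"]] = event["value"]
    return snapshot


def zipper_update(history: List[Version], snapshot: Dict[str, str],
                  load_date: str) -> List[Version]:
    """Close the open version of every changed key and open a new one."""
    prev_day = (date.fromisoformat(load_date) - timedelta(days=1)).isoformat()
    open_values = {v.key: v.value for v in history if v.end_date == OPEN_END}
    # A key counts as changed if it is new or its current value differs.
    changed = {k for k, val in snapshot.items() if open_values.get(k) != val}

    result: List[Version] = []
    for v in history:
        if v.end_date == OPEN_END and v.key in changed:
            result.append(replace(v, end_date=prev_day))  # end the old lifecycle
        else:
            result.append(v)  # closed or unchanged versions are kept as-is
    for k in sorted(changed):
        result.append(Version(k, snapshot[k], load_date, OPEN_END))
    return result


def as_of(history: List[Version], day: str) -> Dict[str, str]:
    """Return the value of every key as it was valid on the given day."""
    return {v.key: v.value for v in history
            if v.start_date <= day <= v.end_date}


# Hypothetical data: one existing record and one day's change events.
history = [Version("u1", "addr_old", "2024-01-01", OPEN_END)]
events = [
    {"key": "u1", "value": "addr_tmp", "ts": 1},
    {"key": "u1", "value": "addr_new", "ts": 2},
    {"key": "u2", "value": "addr_first", "ts": 3},
]
updated = zipper_update(history, latest_changes(events), "2024-01-10")
```

Because each version carries a start and end date, the latest state is a filter on the sentinel end date, and any past state can be rebuilt with `as_of(updated, day)`, which is the property that lets the approach retain change history.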