Research And Implementation Of Data Reusing Strategy In Column-store Data Warehouse

Posted on: 2015-01-31
Degree: Master
Type: Thesis
Country: China
Candidate: J L Zhou
Full Text: PDF
GTID: 2268330425482029
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, we face a large and growing volume of information and data, and integrating the existing information has become urgent. At the same time, organizing data with scientific methods and accurately analyzing business information from different perspectives is more pressing than ever. As one of the data integration frameworks, the data warehouse has brought great changes to data-based analysis and is an effective way to solve analytic application problems in the big data environment. During data integration, however, there is substantial redundancy between different data sets, which requires more storage and slows query response. A data warehouse therefore needs solutions that keep the cost of analysis reasonable, and data reusing is one of them.

In a traditional relational data warehouse, data are stored row by row; that is, the attribute values of the same record are stored sequentially on the physical disk. Unfortunately, because the schemas of different data sets in the memory hierarchy are often not the same, very little redundancy exists at the row level, so data reusing is hard to realize in row stores. The column-store data warehouse removes this disadvantage. In a column-store data warehouse, the object of operation is the independent column. Different entities in the real world tend to have similar attributes, and redundancy between attributes is the prerequisite and key to implementing a data reusing strategy. This paper therefore focuses on how to reuse data effectively in a column-store data warehouse.

Firstly, this paper describes the significance of data reusing in the massive data environment. It analyzes the research background of the issue and the related technologies, highlighting the need for in-depth study of data reusing in column-store data warehouses.

Then, the paper briefly introduces the core elements of data reusing in the column-store data warehouse, including the characteristics of column-store systems, an overview of the data reusing strategy, the definition of reusable data, and the principles of query result equivalence.

Subsequently, the paper analyzes the structural design of the data reusing strategy based on column stores. The strategy consists of four parts: mining candidate reusable data, filtering reusable data, implementing data reusing in the storage layer, and executing SQL queries based on reusable data. The candidate reusable data mining module uses the CM mapping algorithm to quickly find candidate mappings in massive data, which greatly reduces the complexity of reusable data detection. The reusable data filtering module matches the attribute values of each candidate mapping one by one to identify the data that are actually reusable, which is the necessary guarantee for carrying out data reusing. The data reusing implementation module provides a unified interface to queries. The query execution module modifies the traditional query execution process so that SQL queries run directly on the reusable data.

Finally, this paper uses the column-store data warehouse management system DWMS as the platform, with real data sets and benchmark data sets as test data, to demonstrate these key technologies in detail. The experimental results on large-scale data sets indicate that the presented strategy reduces storage space and saves both data loading time and query execution time.
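
The pipeline described above (mine candidates, filter by value match, store one copy, read through a unified interface) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not specify the CM mapping algorithm, so a cheap column signature (row count and distinct-value count) stands in for the candidate-mining step, and all names here (`signature`, `find_candidates`, `verify`, `build_store`, `read_column`) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Column = List[object]


def signature(col: Column) -> Tuple[int, int]:
    # Assumed stand-in for candidate mining: a cheap fingerprint.
    # Equal columns always share it; unequal columns may share it too.
    return (len(col), len(set(col)))


def find_candidates(columns: Dict[str, Column]) -> List[Tuple[str, str]]:
    # Step 1: group columns by signature and pair up columns in each group.
    buckets = defaultdict(list)
    for name, col in columns.items():
        buckets[signature(col)].append(name)
    pairs = []
    for names in buckets.values():
        for i in range(len(names)):
            for j in range(i + 1, len(names)):
                pairs.append((names[i], names[j]))
    return pairs


def verify(columns: Dict[str, Column],
           pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    # Step 2: filter candidates by comparing every attribute value.
    return [(a, b) for a, b in pairs if columns[a] == columns[b]]


def build_store(columns: Dict[str, Column]):
    # Step 3: keep one physical copy per group of identical columns.
    alias: Dict[str, str] = {}  # logical column name -> physical column name
    for a, b in verify(columns, find_candidates(columns)):
        alias[b] = alias.get(a, a)
    physical = {n: c for n, c in columns.items() if n not in alias}
    return physical, alias


def read_column(physical: Dict[str, Column], alias: Dict[str, str],
                name: str) -> Column:
    # Step 4: unified read interface; queries never see the aliasing.
    return physical[alias.get(name, name)]
```

In this sketch the signature only narrows the search; the full value comparison in `verify` is what guarantees that a shared column returns the same query results as the original.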
Keywords/Search Tags: data integration, data warehouse, column-stores, data reusing, reusable data