
Research On Key Techniques Of Data Warehousing And ETL For Multi-type Data Sources

Posted on: 2009-10-08
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Song
GTID: 1118360308978802
Subject: Computer software and theory
Abstract/Summary:
Building and applying data warehouses is a necessary path for enterprises to achieve advanced informatization. Over the past decade, data warehouse systems of many different scales have appeared to address the integration and management of historical data and the needs of decision support. The data sources feeding these warehouses have grown increasingly diverse. In particular, new real-time sources such as Web and textual data pose new challenges for data warehousing and ETL. Data warehouse technology now faces several serious problems: how to build a data warehouse architecture that adapts to varied data sources; how to implement an efficient ETL process at each layer of the data warehouse system; how to guarantee real-time or near-real-time ETL; and how to improve the access control model of the data warehouse.

Focusing on the characteristics of multi-type data sources, this dissertation first analyzes the existing requirements of data warehouses and the categories of data sources, and divides the overall ETL process into two stages: local ETL and global ETL. Taking a national data warehouse system as an example, it proposes a data warehouse architecture oriented to varied data sources, comprising an extraction layer, an archive layer, a summary layer, a warehouse layer and an application layer, and describes the design and functions of each layer in detail. On this basis, the key techniques of each layer are studied.

The main functions of the extraction and archive layers are extracting and archiving data. The ETL software of these layers extracts data from the various data sources into the archive database, and is therefore called local ETL. This dissertation studies local ETL for three kinds of data sources: unstructured Web pages, semi-structured text, and structured relational databases. First, for local ETL over unstructured Web pages, a more effective approach to collecting and storing Web pages is proposed.
The approach divides a Web page into blocks based on its layout, and treats these blocks as the units of version comparison, incremental storage and later processing.

Second, for local ETL over semi-structured textual data sources, the dissertation studies non-self-describing, semi-structured scientific data and proposes an approach to the relationalization of textual data, which converts the text model to an object model and then to a relational model. The efficiency and security of the model are also addressed.

Third, for local ETL over structured relational database sources, the factors affecting ETL performance are summarized, and a new ETL approach based on a distributed database system is proposed. Furthermore, a metadata-driven ETL approach is proposed to improve the flexibility, extensibility and operability of the ETL tool. Based on these approaches, a SQL-based, metadata-driven ETL tool is implemented and tested, demonstrating improved efficiency.

The summary layer and warehouse layer integrate data from the various data sources, moving it from the archive layer to the warehouse layer; this ETL process is called global ETL. Under real-time requirements, global ETL faces not only data integration issues but also the scheduling of real-time or near-real-time ETL. To address the timing of global ETL runs and their competition with other applications for the resources of the data warehouse environment, a new scheduling approach for real-time ETL is proposed, which triggers the ETL process and assigns resources according to the integration rules. Because real-time ETL uses all resources exclusively while it executes, running applications temporarily lose their connections to the data warehouse.
To keep end users unaware of this intermittent connectivity, a client framework supporting occasional connectivity is designed. This offline client framework is an environment-aware smart client framework with a degree of generality.

The application layer of the data warehouse includes query, search, OLAP and data mining applications, and it should also include a well-organized access control mechanism. Both the applications and the data warehouse itself need a sound access control mechanism. Two access control models are proposed in this dissertation. The first, a role and context based access control model, extends the classical role based access control (RBAC) model; it suits the access control of data warehouse applications and of user-oriented applications in general. The second is a purpose based access control model, which suits databases, data warehouse systems and other application-oriented systems. Furthermore, for the latter model, an algorithm for mining hierarchical relationships among purposes is also studied.

In conclusion, this dissertation first proposes a data warehouse architecture oriented to varied data sources, together with its layers. Based on this architecture, the key techniques of each layer are analyzed and studied. All the proposed approaches and models have been implemented and applied in practical projects, and their feasibility and effectiveness have been supported by both theoretical analysis and experiments. The research as a whole focuses on the design and performance of data warehousing and its ETL processes, aiming to ensure timely, flexible and efficient data flow and data access in the data warehouse system. This work offers guidance for building data warehouses and implementing ETL systems.
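
As a rough illustration of the role and context based access control idea summarized above, the following minimal Python sketch extends a plain RBAC check with a per-permission context constraint. It is not the dissertation's model: the class names (`Context`, `Policy`), the context attributes (`hour`, `network`), and the example role, user and resource names are all hypothetical, chosen only to show how a context predicate can narrow a role's permission.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set, Tuple


@dataclass(frozen=True)
class Context:
    """Hypothetical request context; the attributes are assumptions."""
    hour: int       # local hour of the request, 0-23
    network: str    # e.g. "intranet" or "internet"


Permission = Tuple[str, str]            # (action, resource)
Constraint = Callable[[Context], bool]  # True if the context is acceptable


@dataclass
class Policy:
    # user -> set of role names (classical RBAC user assignment)
    user_roles: Dict[str, Set[str]] = field(default_factory=dict)
    # role -> {permission: context constraint} (the context extension)
    role_perms: Dict[str, Dict[Permission, Constraint]] = field(default_factory=dict)

    def assign(self, user: str, role: str) -> None:
        self.user_roles.setdefault(user, set()).add(role)

    def grant(self, role: str, action: str, resource: str,
              constraint: Constraint = lambda ctx: True) -> None:
        # Without an explicit constraint this degrades to plain RBAC.
        self.role_perms.setdefault(role, {})[(action, resource)] = constraint

    def is_allowed(self, user: str, action: str, resource: str,
                   ctx: Context) -> bool:
        # Allowed if any of the user's roles holds the permission
        # and that permission's constraint accepts the current context.
        for role in self.user_roles.get(user, ()):
            constraint = self.role_perms.get(role, {}).get((action, resource))
            if constraint is not None and constraint(ctx):
                return True
        return False


# Hypothetical example: analysts may query a cube only from the intranet
# during working hours.
policy = Policy()
policy.assign("alice", "analyst")
policy.grant("analyst", "query", "sales_cube",
             lambda ctx: ctx.network == "intranet" and 8 <= ctx.hour < 18)

print(policy.is_allowed("alice", "query", "sales_cube", Context(10, "intranet")))
print(policy.is_allowed("alice", "query", "sales_cube", Context(22, "intranet")))
```

In this sketch the same role yields different decisions depending on the request context, which is the general behaviour a context extension adds to RBAC; how the dissertation actually defines and evaluates context is not stated in the abstract.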
Keywords/Search Tags: Data Warehouse, ETL, Metadata, Real Time Data Warehouse, Real Time ETL, Access Control