Font Size: a A A

Research And Implementation On A Number Of Key Issues Of Data Warehouse ETL Technology Internationalization

Posted on:2008-06-02Degree:DoctorType:Dissertation
Country:ChinaCandidate:W XuFull Text:PDF
GTID:1118360242479145Subject:Control theory and control engineering
Abstract/Summary:PDF Full Text Request
Chinese Companies today have an unprecedented opportunity to optimize and expand their business exponentially. With the advent of the worldwide web internet and intranet our data warehouse applications have the possibility of global exposure. But how to effectively manage the domestic and foreign data is a crucial problem for those fast growing Chinese enterprise's IT construction. This paper will focus on a series of key technology research of data warehousing globalization, such as multi-language character set compatibility, robust ETL data system and unified control and audit mechanism derived from multiple autonomous data sources. On the other hand, we have chosen some popular business intelligence products such as Informatica, Oracle and Teradata to build up a prototype platform. We hope our practice will be helpful to data warehouse's theoretical research and application, expect our deliverables to satisfy enterprise's need of meeting the global market requirements and have been looking forward to providing a technical practice for those units / organizations who are interested in deploying and implementing a similar system.In this paper, based upon our research on data warehouse globalization/internationalization, the following achievements have been obtained:First, we will have a comprehensive view at the MEETL (Multilingual Enabled Extract, Transform and Load) technology. The problem begins with a source Oracle database for which the code page is 8859-1 (Western European) while we implemented a global data warehouse project, but the actual data is a mixture of code pages, including shift-JIS, 8859-x, GB2312 and Big5. The goal is to move this data into a target Oracle database for which the code page is UTF-8. It is possible to manipulate this data within Oracle code page 8859-1 and Informatica ASCII data movement mode. It is even possible to view the data using an Oracle client with the appropriate code page. However, it has not been possible to move this data from 8859-1 to UTF-8 using Informatica. The essence of the problem is that both Oracle and Informatica attempt to convert the data to the correct code page, but they each defeat the other's attempt. The key contribution of this paper is to deliver the MEETL solution/methodology to resolve such character set conversion problems.Second, this paper discusses the character set conversion problem occurring among the source DB, target DB and ETL tool. Furthermore, based on the MEETL technology, we have put forwarded and proved the CSCCT (Character Set Conversion Compatible Theorem) and CPCT (Code Page Compatible Theorem) theorems.Third, we will bring forward the UTF8toUnicode algorithm: UTF8DecodeUnicode. Its purpose is to validate the characters stored in the UTF-8 format and converts them to the right Unicode values and input character strings are expected in UTF8 format. Originally within Teradata DB, when characters are not translatable between client and server, they will be replaced with an error character. Meanwhile an error message is occurred: 6705 an illegally formed character string was encountered during translation. If we embed our UDF (user defined function) into regular ETL process, it will function properly to replace Teradata character set mechanism and increase the error tolerant capability.Fourthly, we will introduce a practical production instance on how to accomplish an efficient, scalable, controllable and maintainable ETL (extract, transform, loading) architecture. Within this infrastructure, we have adopted three new control techniques: Process Synchronization Control (PSC), Time Range Control (TRC) and Common Audit Control (CAC). These techniques will contribute to optimize the ETL architecture, improve the efficiency of data loading, and generate the audit rules for log tracking and analysis.Fifthly, we will propose a new way of setting up environment variables and separate DB login passwords for the DW ETL process. It will be more efficient and less resource consuming. It will allow for segregation of DB user logins based on subject area and / or process unit name, replacing most of situations where all the DW ETL units of work are logging to the database using the same generic id. Otherwise spool space is shared amongst all processes using the same user id, so a badly skewed process can hog the entire spool space and cause all the others that are running at the same time to error out with "No more spool space" error. In addition, there is no way for the DBA to discriminate between different processes and allocate resources differently (based on criteria like: critical / non-critical path job, length of queries and current status of the system). Our purpose is to build up a robust data warehouse ETL platform, along with high availability and expansibility in the need of enterprise's future development.
Keywords/Search Tags:ETL, Internationalization, Unicode
PDF Full Text Request
Related items